SAF Policy on Ethical Data Acquisition and Post-Release Responsibility for Publicly Released Models
SkyTeam Aerospace Foundation maintains the following policy for any model it publicly releases, whether through public access, commercial deployment, external testing, or any other form of distribution outside internal development.
No publicly released SAF model will be trained, aligned, evaluated, or refined using data unless SAF has a lawful, documented, and ethically supportable basis for its use.
This policy applies to all data categories, including creative works, literary works, visual media, audio materials, software and technical documentation, educational resources, manuals, research materials, structured datasets, archives, internally produced materials, commissioned works, and contributor-supplied content. It applies regardless of format, medium, or subject matter.
Authorized Data Sources
For any publicly released SAF model, data may be used only where at least one of the following conditions is satisfied:
the data was created internally by SAF
the data was lawfully purchased and the intended use is permitted by the applicable terms
the data was lawfully licensed for the intended use
the data was provided with express permission or informed consent for the intended use
the data was obtained from a source whose governing terms expressly permit the intended use
the data is otherwise clearly authorized for the intended use under applicable law
Public accessibility alone does not constitute authorization for model training or related development use. Availability on the internet, in a database, or through public access channels will not be treated by SAF as automatic permission.
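As an illustration only, the "at least one condition" rule above could be expressed as a simple record check. The sketch below is not SAF's actual tooling; the names `Basis`, `SourceRecord`, and `is_authorized` are hypothetical and chosen for this example. It assumes each source carries a set of documented bases, and it deliberately ignores whether the data is publicly accessible.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import FrozenSet


class Basis(Enum):
    """Hypothetical labels for the six authorization conditions listed above."""
    CREATED_INTERNALLY = auto()
    PURCHASED_WITH_PERMITTED_USE = auto()
    LICENSED_FOR_INTENDED_USE = auto()
    EXPRESS_PERMISSION_OR_CONSENT = auto()
    SOURCE_TERMS_PERMIT_USE = auto()
    AUTHORIZED_UNDER_LAW = auto()


@dataclass(frozen=True)
class SourceRecord:
    """Illustrative record for one data source (field names are assumptions)."""
    source_id: str
    bases: FrozenSet[Basis] = frozenset()
    publicly_accessible: bool = False


def is_authorized(record: SourceRecord) -> bool:
    """Return True only if at least one documented basis is recorded.

    The publicly_accessible flag is intentionally not consulted:
    public availability alone is not treated as authorization.
    """
    return any(isinstance(basis, Basis) for basis in record.bases)


licensed = SourceRecord("manual-001", frozenset({Basis.LICENSED_FOR_INTENDED_USE}))
scraped = SourceRecord("webpage-042", publicly_accessible=True)
```

In this sketch, `is_authorized(licensed)` is true because a basis is recorded, while `is_authorized(scraped)` is false even though the source is publicly accessible.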
No Inherited Training Corpora
For publicly released SAF models, SAF will not rely on preassembled third-party training corpora, mixed-source model corpora, or legacy aggregate datasets where source-level authorization, permission status, or consent records cannot be independently verified under SAF’s documented internal verification standard.
This includes previously assembled training corpora associated with other model developers or public AI research collections where SAF cannot trace the underlying materials to clear records of provenance, applicable terms of use, licensing status, consent status, or other lawful authorization.
Datasets that raise this concern include mixed-source aggregate corpora such as The Pile, as well as similar large-scale compiled datasets assembled from numerous third-party sources without source-by-source permission records that SAF can independently review, verify, and document.
SAF’s position is straightforward: ethical sourcing cannot be outsourced. If provenance, authorization, and record integrity cannot be established by SAF itself, the corpus will not be used for a publicly released SAF model.
Instead, SAF will build its corpus from the ground up using sources for which SAF maintains its own records regarding provenance, authorization, licensing, purchase, consent, contribution status, or other lawful basis for use.
Scope of Use
This policy governs data used for:
pretraining
post-training
fine-tuning
alignment
retrieval-supported model behavior where underlying source data is incorporated into the public system
benchmarking and evaluation where retained datasets materially influence public model development
any other developmental process that materially contributes to a publicly released model
SAF applies the same acquisition standard across all covered data types. No category of material is exempt solely because it is educational, technical, factual, instructional, or publicly available.
Attribution and Source Integrity
Where a contributor, licensor, publisher, institution, or other source has provided data through purchase, license, consent, commission, or direct contribution, SAF will maintain records sufficient to identify the source and the basis for use.
Where public acknowledgment is permitted and appropriate, SAF will provide citation or credit. Where confidentiality, anonymity, contractual restriction, privacy obligation, or similar limitation applies, SAF will honor that limitation.
SAF may maintain a non-public corpus or non-public sourcing records for reasons including proprietary protection, information security, contractual obligations, privacy protection, or system integrity. A non-public corpus does not exempt SAF from maintaining internal documentation regarding source provenance and authorization.
Prohibited Practices for Publicly Released Models
For publicly released SAF models, SAF will not rely on the following as a basis for data use:
unauthorized scraping as a substitute for license, permission, or lawful authorization
ambiguous access conditions treated as implied permission
undocumented third-party datasets where provenance and use rights cannot be established under SAF’s documented internal verification standard
inherited model corpora or aggregate training datasets lacking verifiable source-level records
data acquisition practices that materially depend on avoiding attribution, consent, payment, or licensing where such obligations apply
Internal Verification Standard
Before data is incorporated into a publicly released SAF model, its source, provenance, and basis for use must be identifiable, reviewable, and defensible under SAF’s documented internal verification standard. If lawful authorization, applicable terms, provenance, or record integrity cannot be established under that standard, the data will not be used in a publicly released model.
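One way to picture this gate is as a check that refuses any source with an unestablished element. The following is a minimal sketch under stated assumptions, not a description of SAF's internal standard: the class `VerificationRecord`, its fields, and the functions `missing_elements` and `may_use_in_public_model` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VerificationRecord:
    """Illustrative record of what has been established for one source."""
    source_id: str
    provenance: Optional[str] = None          # where the data came from
    applicable_terms: Optional[str] = None    # terms governing the intended use
    authorization: Optional[str] = None       # documented lawful basis for use
    record_integrity_checked: bool = False    # records reviewed and found intact


def missing_elements(record: VerificationRecord) -> List[str]:
    """List every element that has not been established for this source."""
    missing = []
    if not record.provenance:
        missing.append("provenance")
    if not record.applicable_terms:
        missing.append("applicable terms")
    if not record.authorization:
        missing.append("lawful authorization")
    if not record.record_integrity_checked:
        missing.append("record integrity")
    return missing


def may_use_in_public_model(record: VerificationRecord) -> bool:
    """Admit the data only when nothing is missing (fail closed)."""
    return not missing_elements(record)


complete = VerificationRecord(
    source_id="commissioned-007",
    provenance="commissioned work, contract on file",
    applicable_terms="commission agreement",
    authorization="express permission",
    record_integrity_checked=True,
)
partial = VerificationRecord(source_id="archive-113", provenance="donated archive")
```

The check fails closed: a record such as `partial`, with provenance but no documented terms or authorization, is excluded, and `missing_elements(partial)` names what would still need to be established.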
Post-Release Responsibility for Adaptive Learning Systems
Some publicly released SAF systems may include adaptive learning capabilities that allow the system to learn, update, retain knowledge, or otherwise change over time through user inputs, user-provided data, environmental interaction, connected tools, or deployment-specific context.
Where a publicly released SAF system continues to learn or adapt after release, SAF’s responsibility is limited to the released base system, its intended architecture, and the functionality provided by SAF at the time of release.
SAF is not responsible for post-release changes in behavior, knowledge, outputs, retained information, decision patterns, or other adaptive developments arising from:
user-provided inputs or data
user-enabled integrations, tools, or external connections
user-directed configuration, modification, or retraining
deployment-specific environments outside SAF’s control
continued learning processes occurring after public release
The user, operator, deployer, or controlling entity of such a system assumes responsibility for post-release oversight, configuration, retained knowledge, data handling, operational use, and consequences arising from adaptive development outside SAF’s direct control.
For any publicly released adaptive learning system, the user is responsible for monitoring how that system evolves in use and for ensuring that such use remains lawful, appropriate, and consistent with the user’s own operational, legal, and ethical obligations.
Policy Position
SAF’s position is that publicly released models must be built on data acquired through lawful, documented, and ethically supportable means. This standard applies to all covered data categories, not only creative works.
If SAF releases a model to the public, the data materially used to develop that model must be internally traceable to a legitimate basis for use, including creation, purchase, license, consent, express permission, or other clearly authorized grounds.
SAF will not substitute another party’s aggregate corpus for its own due diligence. SAF will build and maintain its own corpus records from the ground up.
Where a publicly released SAF system continues to adapt after release, responsibility for that post-release adaptation rests with the user or operator to the extent such changes occur outside SAF’s direct control.