The application of copyright law to AI training is complex.
Around the world, there are varying exceptions and limitations to copyright that permit AI training. Jurisdictions also often consider the specific uses of the copyrighted work within AI training and outputs.
*Reminder* CC licenses are copyright licenses.
That means that applying CC licenses to AI training is just as complex.
In an effort to untangle some of this complexity, we have prepared the following primer that tackles the issues from two distinct angles.
- Practical guidance for how to follow the CC licenses for training data, even in situations where copyright may not require it.
- A legalistic approach, analyzing when compliance with the CC licenses is legally required.
This primer may also be a useful tool for creators applying CC licenses to their work to better understand how their works may be used in AI training.
How to Follow the CC Licenses for Training Data*
*Even in situations where copyright may not require it.
Keep in mind: following this guidance is likely to lead to overcompliance with both copyright law and CC license terms. It assumes the most restrictive legal interpretation for those who wish to take a conservative approach and minimize risk.
What follows is a quick look at CC-licensed content and AI training. Please review the full analysis for more details.
Attribution (BY)
All CC licenses require attribution to the creator(s) of the licensed material. For AI model training, attribution could be a simple link to the source of the dataset used to train the model. Where retrieval-augmented generation (RAG) or other methods are available, providing attribution to the CC-licensed work tied to the particular model output with a link to the source is ideal.
ShareAlike (SA)
CC BY-SA and CC BY-NC-SA require that adaptations be shared under the same license. If AI models or outputs are based on ShareAlike content and they will be shared publicly, following the ShareAlike condition would require AI developers to use the same CC license as the original works.
NonCommercial (NC)
CC BY-NC and CC BY-NC-SA give permission for NonCommercial uses only. If AI training data includes the NonCommercial restriction, then following the NC restriction would require that all stages, from copying the data during training to sharing the trained model, must not be for commercial gain.
NoDerivatives (ND)
CC BY-ND and CC BY-NC-ND prohibit creating derivative works. Following the NoDerivatives restriction would require that ND-licensed content not be used as training data.
An Analysis of When Compliance with the CC Licenses is Legally Required
I’m an AI developer. Do I Have to Comply With the CC License?
It depends. CC licenses apply only when copyright permission is required. If exceptions or limitations apply, then the CC license terms don’t apply.
Step 1 – Does copyright apply?
There are two main aspects to the analysis of whether copyright applies when using CC-licensed works as inputs to training.
- Dataset Acquisition and Preparation for Training
The process of training AI models usually requires copying the works that are used as training data. When those copied works are copyrighted, the training step can have copyright implications. Around the world, copyright law varies regarding whether and how such copying can be permissible.
- Memorization in Training
Sometimes, AI models reproduce and store certain copyrightable expressions of the works on which they were trained (as opposed to simply copying data about them, for example). This is commonly referred to as memorization. Currently, the only way to know whether a model has memorized content is by identifying model outputs that are substantially similar to the original.
This means that the extent of memorization is impossible to quantify in the abstract. It is also important to note that not all instances of memorization are infringing. To varying degrees, model developers make efforts to adapt their methods to try to avoid memorization, and this can mitigate, but may not eliminate, the risk of copyright infringement.
Step 2 – When Do CC License Conditions Apply?
- The attribution (BY) and ShareAlike (SA) conditions, and NoDerivatives (ND) restriction are triggered only when works or adaptations of works are publicly shared.
- The NonCommercial (NC) restriction applies to all uses requiring permission under copyright.
For a deeper dive, see our detailed legal analysis and flow chart.