The rapid advancement of Artificial Intelligence (AI) has sparked a fascinating debate at the intersection of technology and law. As AI systems, particularly generative AI, become more sophisticated, they challenge our traditional notions of creativity and copyright.
Understanding AI: Beyond Simple Computation
At its core, AI refers to computer systems designed to perform tasks that typically require human intelligence. These are not just advanced calculators or search engines; we are talking about systems that can recognize faces, understand natural language, and make decisions based on complex data.
Generative AI takes this a step further. These systems can create content – images, text, music, and more. It is as if we have given computers the ability to produce novel outputs, raising intriguing questions about the nature of creativity itself.
The Training Process: A Digital Digestion of Data
Step 1: The Data Feast
First, the large language models (LLMs) that power generative AI systems are fed massive amounts of data – websites, books, images, and more – which often include copyrighted works.
Step 2: Chopping It Up
Whole books or images cannot be “digested” by the AI system in one go; instead, they are broken down into a stream of pieces called “tokens.” For text, these tokens vary in size, from parts of words to multiple words, depending on the AI model. Many popular models use subword tokenization, which breaks words into meaningful parts. For instance, “understanding” might become “under” and “standing.” For simplicity in our discussion, we’ll pretend that tokens are whole words, but remember that they can be smaller or larger units. For images, the stream is typically a sequence of pixel groups representing patterns of colors and shapes. This stream of tokens allows the AI to efficiently “digest” vast amounts of information, processing the data as a continuous flow.
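To make the idea concrete, here is a toy sketch of subword tokenization. The vocabulary below is invented for illustration – real tokenizers (for example, byte-pair encoding) learn their vocabularies from data rather than using a hand-picked list:

```python
# Toy subword tokenizer: greedily match the longest known piece.
# VOCAB is invented for illustration; real tokenizers learn theirs.
VOCAB = {"under", "standing", "copy", "right", "the", "law"}

def tokenize(word):
    """Split a word into the longest known pieces, left to right,
    falling back to single characters when nothing matches."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            piece = word[i:j]
            if piece in VOCAB or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("understanding"))  # ['under', 'standing']
print(tokenize("copyright"))      # ['copy', 'right']
```

This greedy longest-match strategy is only one simple approach; production tokenizers use learned merge rules, but the end result is the same kind of token stream described above.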
Step 3: Encoding and Embedding – The AI’s Digital Preservation Process
This step is where encoding and embedding happen. Encoding is like translating text into a list of numbers for the AI system to process. Embedding goes a step further, turning these numbers into vector representations that capture the various contexts in which words occur. A vector, in this case, is simply a list of numbers representing a word’s relationships to other words, which indirectly captures aspects of its usage and context. It is important to note that the AI does not truly understand the meaning of words as humans do. Imagine taking a book, converting each word into a unique code, and then transforming that code into a set of coordinates in a vast, multidimensional space. This process allows AI systems to work with words based on their usage patterns and contextual relationships. For example, the vectors for “bank” in “river bank” and “bank account” would be positioned differently in this space, reflecting their distinct contexts.
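A small sketch shows how vectors can capture the two senses of “bank.” The three-dimensional vectors and their values here are invented for illustration – real models use hundreds or thousands of learned dimensions:

```python
import math

# Invented 3-D vectors for illustration only; real embeddings
# are learned from data and far higher-dimensional.
embeddings = {
    "river":  [0.9, 0.1, 0.0],
    "money":  [0.0, 0.1, 0.9],
    "bank_1": [0.8, 0.3, 0.1],  # "bank" as in "river bank"
    "bank_2": [0.1, 0.3, 0.8],  # "bank" as in "bank account"
}

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The "river bank" sense sits nearer "river" than "money":
print(cosine_similarity(embeddings["bank_1"], embeddings["river"]))
print(cosine_similarity(embeddings["bank_1"], embeddings["money"]))
```

The point is not the particular numbers but the geometry: the two senses of “bank” occupy different positions, so each one’s neighbors reflect its context.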
Step 4: Representation Learning – Building the AI Knowledge Map
As AI processes increasing amounts of data, it is not just collecting these encoded pieces—it is learning how to represent them in the most useful way. This is called representation learning.
Think of it like creating a massive, multidimensional map. Each piece of data gets a spot on this map, and similar pieces are placed close together. The AI learns to navigate this map, understanding the relationships between different pieces of information.
For example, in this map, the encoded versions of “cat” and “kitten” might be close together, while “dog” is a bit further away, and “automobile” is in a completely different region.
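The cat/kitten/dog/automobile example can be sketched directly. The two-dimensional coordinates below are invented for illustration – on a real learned map, the positions emerge from training rather than being hand-placed:

```python
import math

# Invented 2-D coordinates for illustration; real representation
# spaces are learned and have many more dimensions.
positions = {
    "cat":        (1.0, 1.0),
    "kitten":     (1.2, 0.9),
    "dog":        (2.5, 1.5),
    "automobile": (9.0, 8.0),
}

def distance(a, b):
    """Straight-line distance between two words on the map."""
    return math.dist(positions[a], positions[b])

# Neighbors of "cat", nearest first:
others = [w for w in positions if w != "cat"]
print(sorted(others, key=lambda w: distance("cat", w)))
# ['kitten', 'dog', 'automobile']
```

Navigating the map is then just a matter of measuring distances: nearby points are related, distant points are not.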
Step 5: Constructing the Neural Network – The AI’s Brain
As the AI learns these representations, it builds and adjusts its neural network—a complex web of interconnected nodes, similar to neurons in a brain.
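A vastly simplified sketch of one such layer of nodes follows. The weights here are invented; in a real model, billions of such weights are adjusted during training so that the network’s outputs fit its training data:

```python
import math

# One layer of a toy "neural network": each output node computes a
# weighted sum of the inputs, then squashes it with a sigmoid.
# Weights and biases are invented for illustration; training would
# adjust them automatically.
weights = [[0.5, -0.2],
           [0.1,  0.8]]
biases = [0.0, 0.1]

def forward(inputs):
    """Pass the inputs through one layer of weighted connections."""
    outputs = []
    for w_row, b in zip(weights, biases):
        total = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(1 / (1 + math.exp(-total)))  # sigmoid nonlinearity
    return outputs

print(forward([1.0, 0.5]))
```

Everything the network “knows” lives in those weights: stacking many such layers, with billions of trained weights, is what encodes the training data into the model’s structure.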
This network is a highly sophisticated, encoded library of all the works the AI was trained on. While the original data is not there in its human-readable form, a transformation of it—one preserving its essential features and content (in copyright terms, we call this a substantial reproduction)—has been encoded into the very structure of the AI model.
The Result:
What we end up with is not just a set of abstract patterns or relationships. It is a complex system that has preserved transformed versions (in copyright terms, we call these derivative works) of all its training data. When the AI generates something new, it accesses and recombines these encoded representations.
This is why AI outputs can mimic the style or content of works in their training data, and just one of the many reasons why the use of copyrighted works in AI training raises significant legal and ethical questions. The works may not be stored in their original form, but the essence remains, transformed and encoded, within the AI’s neural network.
The Path Forward:
By embracing the fact that responsible AI starts with licensing, society can have its AI cake and eat it too. Innovators will continue innovating, rightsholders will receive fair remuneration, and creators will be incentivized to continue creating. It is not about putting the brakes on progress; it is about making sure the ride is fair, safe, and sustainable.