Copyright Laws Could Prevent AI Systems From Proper Training, Leading to a Rapid Decline in Development

Lon Harris
Lon Harris is a contributor to dot.LA. His work has also appeared on ScreenJunkies, RottenTomatoes and Inside Streaming.
Evan Xie

One perhaps under-scrutinized aspect of the AI revolution concerns how these various applications are trained. “ChatGPT learned about language by reading the internet” is the casual, tossed-off, shorthand version of the explanation. But the specific content that’s fed into these apps goes a long way in determining what kinds of outputs they ultimately generate. So if an AI application is only as good as the content that’s fed into it, would it then be fair to say that the creators and owners of that content deserve compensation?


This isn’t a purely academic or rhetorical question, but a pressing real-world issue that will soon require an answer. On Tuesday, news aggregation website Reddit announced plans to begin charging companies for access to its API, an early indication that it hopes to earn money in exchange for providing training materials to companies like OpenAI.

Reddit’s data is particularly appealing to OpenAI and other designers of so-called Large Language Models (LLMs). Unlike Google search results, Wikipedia pages, or other vast collections of writing and information, Reddit threads are already made up of real human beings engaged in conversations. They’re naturally going to be helpful in designing a chatbot that mimics real human speech and strives to create authentic interactions.

Additionally, Reddit content is constantly refreshed by its own users. They add headlines the moment news stories are posted and conversation threads get consistently refreshed with new commentary and real-time updates. This also helps LLMs and other AI systems to produce better, more accurate results. Both Google’s Bard and OpenAI’s ChatGPT were partly trained on Reddit data. In fact, ChatGPT cites Reddit as one of the primary sources for its training information.

It’s fair and accurate to say that Stable Diffusion and Midjourney produce original artwork, sure, and ChatGPT and Google’s Bard compose their own prose from scratch. But they’re only able to complete these tasks after scanning thousands of original drawings and millions of original sentences written by humans. In some ways, this is less “Artificial Intelligence” and more “Extensively Mashed-Up and Remixed Human Intelligence.” But that makes for a less appealing acronym.

From the perspective of an independent site like Reddit, or an image hosting service like Shutterstock, AI applications represent more than a way to squeeze additional financial value out of their pre-existing content libraries. Over time, these applications could also emerge as rivals. ChatGPT could certainly one day power its own version of Reddit, scraping the web for fascinating news stories, writing attention-grabbing headlines, and posting them in forums to encourage reactions and discourse. OpenAI’s DALL-E is already used as an illustrator for web content. Obviously, a sufficiently advanced image generation tool could replace a stock photo library. So by charging up-front for its data, Reddit is also preparing early for a future in which its human users square off against automated rivals.

That said, Reddit is not alone in its concern about being scraped, or in its efforts to do something about it. Elon Musk has threatened to sue Microsoft over the use of his Twitter data. In January, Getty Images filed a lawsuit against Stable Diffusion creators Stability AI, alleging that the company “unlawfully copied and processed millions of images protected by copyright” in order to train its AI systems. Researchers from the University of Chicago have introduced Glaze, a beta application that adds imperceptible “perturbations” to artwork, thus preventing AI applications from scraping them and learning to copy that artist’s style and aesthetic.

According to emails obtained by the Financial Times, Universal Music Group – one of the industry’s largest labels – has asked streaming services like Spotify and Apple Music to limit AI access to their content, in an effort to prevent apps from scraping their songs and artists. This of course is no longer purely academic either. That “Heart On My Sleeve” AI-generated song that appears to feature both Drake and The Weeknd was only possible to create because the app had trained on Drake and Weeknd songs. As well, a number of tools and tutorials to stop ChatGPT from scraping your content have been released on the web.

From a legal perspective, issues around AI remain largely unresolved, including who can copyright the results of a human working with a generative AI application. Training adds yet another level of complexity to this question. Even if we one day establish that a person can copyright a piece of art that they created with help from an AI application… what if that AI application trained on work produced by a different human? Does the original artist whose creation was used to develop the software also own a piece of the final result? How are they compensated, if at all?

A Forbes editorial from February, written by a legal expert in AI, warns that generative AI is “rife with potential ethical and legal conundrums,” particularly when it comes to plagiarism and copyright infringement. Perhaps, over time, these issues will simply be worked out by judges and juries to everyone’s mutual benefit. But it’s also possible this could present a genuine roadblock for either artists and creators or the future of AI development. If AI companies are allowed to scrape whatever they please without compensation, and build the next generation of internet applications without input from humans, that leaves a lot of individual artists, writers, and creators out in the cold.

Conversely, if AI companies are not allowed to use anyone’s work to train their system without payment, we could be looking at the end of the lightning-fast development we’ve come to expect from the entire field. As we’ve seen repeatedly, AI applications are only as good as the library of content on which they’re trained. This is why China, with its internet pockmarked by banned content and censorship, has yet to create a true ChatGPT rival.

