Could Lack of Access to Training Data Block Lightning-Fast AI Development?
One perhaps under-scrutinized aspect of the AI revolution is how these applications are trained. “ChatGPT learned about language by reading the internet” is the casual, tossed-off shorthand version of the explanation. But the specific content fed into these apps goes a long way toward determining what kinds of outputs they ultimately generate. So if an AI application is only as good as the content it’s trained on, would it be fair to say that the creators and owners of that content deserve compensation?
This isn’t a purely academic or rhetorical question, but a pressing real-world issue that will soon require an answer. On Tuesday, the social news aggregation and discussion site Reddit announced plans to begin charging companies for access to its API, an early indication that it hopes to earn money in exchange for providing training materials to companies like OpenAI.
Reddit’s data is particularly appealing to OpenAI and other designers of so-called Large Language Models (LLMs). Unlike Google search results, Wikipedia pages, or other vast collections of writing and information, Reddit threads already consist of real human beings engaged in conversation. That makes them naturally useful for designing a chatbot that mimics real human speech and strives to create authentic interactions.
Additionally, Reddit content is constantly refreshed by its own users. They post headlines the moment news stories break, and conversation threads fill up with new commentary and real-time updates. This, too, helps LLMs and other AI systems produce better, more accurate results. Both Google’s Bard and OpenAI’s ChatGPT were partly trained on Reddit data. In fact, OpenAI has cited Reddit as one of the primary sources of ChatGPT’s training data.
It’s fair and accurate to say that Stable Diffusion and Midjourney produce original artwork, and that ChatGPT and Google’s Bard compose their own prose from scratch. But they can only complete these tasks after scanning millions of original images and billions of original sentences written by humans. In some ways, this is less “Artificial Intelligence” and more “Extensively Mashed-Up and Remixed Human Intelligence.” But that makes for a less appealing acronym.
From the perspective of an independent site like Reddit, or an image hosting service like Shutterstock, AI applications represent more than just a way to squeeze additional financial value out of pre-existing content libraries. Over time, these applications could emerge as rivals. ChatGPT could conceivably one day power its own version of Reddit, scraping the web for fascinating news stories, writing attention-grabbing headlines, and posting them in forums to encourage reactions and discourse. OpenAI’s DALL-E is already used as an illustrator for web content, and a sufficiently advanced image generation tool could obviously replace a stock photo library. So by charging up-front for its data, Reddit is also preparing early for a future in which its human users square off against automated rivals.
That said, Reddit isn’t alone in its concern about being scraped, or in its efforts to do something about it. Elon Musk has threatened to sue Microsoft over the use of Twitter’s data. In January, Getty Images filed a lawsuit against Stable Diffusion creator Stability AI, alleging that the company “unlawfully copied and processed millions of images protected by copyright” in order to train its AI systems. Researchers from the University of Chicago have introduced Glaze, a beta application that adds imperceptible “perturbations” to artwork, so that AI systems trained on scraped copies can’t learn to mimic the artist’s style and aesthetic.
According to emails obtained by the Financial Times, Universal Music Group, one of the industry’s largest labels, has asked streaming services like Spotify and Apple Music to limit AI access to their catalogs, in an effort to prevent apps from scraping songs by its artists. This, of course, is no longer purely academic either. The AI-generated track “Heart On My Sleeve,” which appears to feature both Drake and The Weeknd, was only possible because the model behind it had been trained on the two artists’ music. Meanwhile, a number of tools and tutorials for stopping ChatGPT and other AI systems from scraping your content have been released across the web.
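Most of those tutorials lean on the web’s long-standing, voluntary opt-out mechanism rather than any ChatGPT-specific switch: a robots.txt file at the site’s root that asks named crawlers to stay away. The sketch below is illustrative rather than a guaranteed defense; it assumes the crawler identifies itself and chooses to honor the file, and it uses Common Crawl’s CCBot, whose web archives supply much of the text used to train large language models, as the example to block. The site name and paths are placeholders.

```python
# A minimal sketch of the robots.txt opt-out approach those tutorials describe.
# Compliance is voluntary; this only shows what a site owner publishes and how
# a well-behaved crawler would read it. CCBot is Common Crawl's real crawler;
# example.com and the URLs below are placeholders.
from urllib.robotparser import RobotFileParser

# What a site owner might serve at https://example.com/robots.txt
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant AI-focused crawler checks the rules before fetching a page...
print(parser.can_fetch("CCBot", "https://example.com/r/news"))      # False
# ...while ordinary crawlers, like a search engine's, remain unrestricted.
print(parser.can_fetch("Googlebot", "https://example.com/r/news"))  # True
```

The obvious limitation, and one reason sites like Reddit are reaching for paid APIs and lawsuits instead, is that nothing technically forces a scraper to respect the file.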
From a legal perspective, issues around AI remain largely unresolved, including who can copyright the results of a human working with a generative AI application. Training adds yet another level of complexity to this question. Even if we one day establish that a person can copyright a piece of art that they created with help from an AI application… what if that AI application trained on work produced by a different human? Does the original artist whose creation was used to develop the software also own a piece of the final result? How are they compensated, if at all?
A Forbes editorial from February, written by a legal expert in AI, warns that generative AI is “rife with potential ethical and legal conundrums,” particularly when it comes to plagiarism and copyright infringement. Perhaps, over time, these issues will simply be worked out by judges and juries to everyone’s mutual benefit. But it’s also possible this could present a genuine roadblock, either for artists and creators or for the future of AI development. If AI companies are allowed to scrape whatever they please and build the next generation of internet applications without compensating the humans whose work made them possible, that leaves a lot of individual artists, writers, and creators out in the cold.
Conversely, if AI companies are not allowed to use anyone’s work to train their systems without payment, we could be looking at the end of the lightning-fast development we’ve come to expect from the entire field. As we’ve seen repeatedly, AI applications are only as good as the library of content on which they’re trained. This is why China, with its internet pockmarked by banned content and censorship, has yet to create a true ChatGPT rival.