Vocal Deepfakes Question Who You're Really Listening To

This is the web version of dot.LA’s daily newsletter.

Last week, a team from Microsoft posted a new voice synthesis machine learning model called VALL-E to GitHub. That’s a lot of words, but in simpler terms, the idea is basically vocal deepfakes, or software capable of accurately imitating the real sound of a human voice, even a specific human voice.

This is not exactly new technology, but the latest iteration in the ongoing development of true synthetic speech. The UK’s Papercup has been providing natural human-sounding AI dubs in multiple languages for major media brands for several years, including Sky News, Discovery, Cinedigm, and Business Insider. Their tech was recently used to translate 30 seasons of Bob Ross’ instructional series “The Joy of Painting” for streaming platforms around the globe.

Back in September, Disney announced plans to use machine learning models from a Ukrainian company called Respeecher to reproduce James Earl Jones’ distinctive vocal performance as Darth Vader. Meanwhile, Seattle’s WellSaid Labs can quickly generate hours of high-quality audio content in up to 15 different voices and a number of different languages in 50% of real time. (So a minute of speech only takes around 30 seconds to generate.) Berkeley’s synthetic speech startup LOVO closed a $4.5 million pre-series A funding round in 2021 to help them create better-quality “voice skins” for digital assistants like Alexa and Siri. And London’s Sonantic generates eerily lifelike results by incorporating non-speech sounds into its audio simulations, like tiny scoffs, small intakes of breath, or chuckles.

Even AI software that can accurately impersonate humans has been around for a little while already. In 2017, Montreal-based Lyrebird introduced a “voice imitation algorithm” that could mimic a real person’s speech based on only about a minute of initial audio.

Microsoft’s new VALL-E software, however, adds a new “neural codec language model” to the process that takes an innovative approach to rendering voices. That means VALL-E can create highly accurate “personalized speech” based on just three seconds of clear audio of a speaker, while paying attention to smaller nuances like tone, timbre, accents, and the original audio’s “acoustic environment” with a level of sophistication that older AI models can’t really touch. (So if the audio was recorded on a cell phone in a busy restaurant, for example, that effect can be reproduced in the synthesized speech.)

Obviously, there’s major potential for bad actors to use software like VALL-E nefariously. (Say, by posting fake audio clips of notable public figures admitting to crimes or making controversial statements.) While this has been possible for more than five years now using other variations of synthetic and mimicking speech software, it only gets increasingly likely as these products become higher-quality and more sophisticated. And as machine-learning AI systems continue to proliferate, they will only improve with use and time. Tweaking and fine-tuning the entire complicated system isn’t even required to improve results at this point; audio editors can make specific adjustments on the fly, which the system then remembers for next time.

The technology obviously also poses a threat to up-and-coming actors, particularly if more celebrities like James Earl Jones get on board and agree to license their familiar voices. Why hire a relative unknown to come into the studio and spend hours recording a fresh voice performance when software that can perfectly imitate Eddie Murphy could simply be purchased from a vendor? (No, I’m seriously asking. Why would you?)

But of course, it’s not all bad news. The potential upsides of truly accurate AI-generated synthetic speech are numerous, even beyond Darth Vader sounding like himself forever and native Telugu speakers receiving painting lessons from the great Bob Ross. AI voice modeling is a massive potential time- and money-saver in the right context. (For example, an author mimicking their own voice to record their entire library of audiobooks.) Audio engineers working on important public health messages or vital safety notifications could utilize a variety of different voices and speaking styles until they find the one that’s the most effective and widely heard.

Actor Val Kilmer – who permanently lost his voice after undergoing treatments for throat cancer in 2014 – has partnered with Sonantic to create an AI-powered speaking voice for himself in everyday life. (A different process was used for his recent performance in the film “Top Gun: Maverick.”) There’s just no way to take the good without the bad.

It’s a very similar debate, in many ways, to those that raged a few years back about visual Deepfake technology. Only this time we know that ethical or legal concerns aren’t going to hold synthetic speech developments back in any considerable way. After capturing the public’s attention in the late ’10s, and setting off wave after wave of concerned editorials, Deepfakes are more popular than ever, whether they’re being used to impersonate Tom Cruise on social media or powering new films and TV projects from the “South Park” guys. -Lon Harris

How the LA Public Library Is Leveraging TikTok To Bring People Back

BookTok—the subsect of TikTok users dedicated to literature—has been credited with bringing young readers back into bookstores.

LA Venture: Dangerous Ventures’ Gaby Darbyshire On ‘Shining a Bright Light’ on Difficult Problems

Dangerous Ventures founder and General Partner Gaby Darbyshire explains how her background as the co-founder of a pioneering digital publisher set the stage for her interest in climate technology.

What We’re Reading...

- TikTok added portals specifically for talent managers to its Creator Marketplace, making it easier for brands to connect with the platform’s “megastars.”

- DirecTV plans to lay off around 10% of its management staff next week, in its latest response to the ongoing cord-cutting trend.

- LA’s EVgo announced a new maintenance program to guarantee quality standards across its network of electric vehicle fast-charging stations.
