
AI Can Clone Your Favorite Podcast Host’s Voice

One day this year, you’ll start listening to a podcast and realize something is a little off. The host, whose voice is familiar to you, will sound different. Sentences may sound stilted, or some words may carry an odd tone. And so you’ll ask: Is this actually the host talking, or their AI voice clone?

Just as artificial intelligence has proven adept at generating lifelike images, realistic videos, and cogent text, similar technologies can convincingly mimic the voices of podcast hosts, content creators, and other media professionals. A new set of tools from a growing list of startups is expected to hasten AI’s conquest of our audio feeds.

Our ears are already familiar with computer-generated speech. Artificial voices are playing DJ and answering your phone calls. Technologists have cloned the voices of celebrities alive and dead and reconstructed the voices of those who have lost their ability to speak due to illness. Someday soon, AI-powered speech tools will be able to bring back the voices of our dead relatives.

When it comes to producing podcasts, machines have proven able to lend a hand in the editing room. Editing services like Descript offer machine learning features that clean up an audio recording of human speech by removing awkward pauses and filler words such as “um” and “like.”
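
Descript hasn’t published exactly how that cleanup works under the hood, but the basic move is easy to sketch: take word-level timestamps from a speech-to-text model, flag the filler words, and cut those spans out of the recording. Below is a minimal illustration in Python using the pydub library; the word list, timestamps, and filler vocabulary are invented for the example.

from pydub import AudioSegment

# Words treated as filler; a real editing tool would use a larger,
# context-aware list rather than a fixed set like this.
FILLERS = {"um", "uh", "like", "you know"}

# (word, start_ms, end_ms) as a speech-to-text model with word-level
# timestamps might return them. Values here are made up.
words = [
    ("so", 0, 180), ("um", 200, 540), ("the", 560, 700),
    ("episode", 720, 1300), ("like", 1500, 1800), ("starts", 1850, 2300),
]

audio = AudioSegment.from_file("raw_take.wav")
cleaned = AudioSegment.empty()
cursor = 0  # how far into the original audio we have copied so far

for word, start, end in words:
    if word.lower() in FILLERS:
        cleaned += audio[cursor:start]  # keep everything up to the filler
        cursor = end                    # then skip past it
cleaned += audio[cursor:]               # keep whatever follows the last filler

cleaned.export("clean_take.wav", format="wav")

Awkward pauses can be trimmed the same way, by shortening any gap between consecutive word timestamps that exceeds some threshold.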

Lately, even more options are emerging to take care of the really messy part of making a podcast: the talking. Descript offers a feature called Overdub, which creates a virtual voice that can be used in production editing. If a host mispronounces somebody’s name or gets a date wrong, a producer can task the robot with saying it correctly, then paste in the correction.
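
The splice itself is ordinary audio editing; the hard part is producing the correction in the host’s voice. Assuming that corrected clip already exists (here, "correction.wav" is a placeholder standing in for a voice model’s output, and the timestamps are invented), the paste-in step might look like this:

from pydub import AudioSegment

episode = AudioSegment.from_file("episode_raw.wav")
correction = AudioSegment.from_file("correction.wav")  # synthesized line

# Placeholder span covering the mispronounced name in the original take.
flub_start_ms, flub_end_ms = 83_250, 84_100

patched = episode[:flub_start_ms] + correction + episode[flub_end_ms:]
patched.export("episode_patched.wav", format="wav")

In practice a producer would also match loudness and room tone so the patched-in line doesn’t stand out from the surrounding audio.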

Newer tools go even further. In January, Podcastle, a startup that offers a suite of podcasting software, released an AI-powered voice cloning tool called Revoice that can create a digital simulacrum of a human host. The company is positioning Revoice as a way for producers to create any aspect of an audio production—from ad reads to voiceovers to audiobooks—just by typing in the words they want the virtual version of the host to say.

Creating a digital copy of your voice takes a bit of work. While some AI services can emulate voices by studying audio clips of the person talking, Podcastle requires users to read off a script of around 70 phrases, selected to capture a variety of mouth movements and phonemes. The process takes 30 to 45 minutes, depending on how particular you are about getting the intonations right.
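
Podcastle hasn’t said how those phrases are chosen, but the stated goal of covering a wide spread of phonemes in relatively few sentences maps onto a simple greedy selection. The sketch below uses invented phrases and rough, hand-written phoneme sets rather than real transcriptions.

# Greedily pick enrollment phrases until every target phoneme is covered.
candidate_phrases = {
    "She sells seashells by the seashore":
        {"SH", "IY", "S", "EH", "L", "Z", "B", "AY", "DH", "AO", "R"},
    "How quickly the view changes at dusk":
        {"HH", "AW", "K", "W", "IH", "L", "IY", "DH", "AH", "V", "Y", "UW", "CH", "EY", "N", "JH", "T", "D", "S"},
    "Good dogs jump over bright fences":
        {"G", "UH", "D", "AO", "Z", "JH", "AH", "M", "P", "OW", "V", "ER", "B", "R", "AY", "T", "F", "EH", "N", "S"},
}

target_phonemes = set().union(*candidate_phrases.values())
covered, script = set(), []

while covered < target_phonemes:
    # Pick the phrase that adds the most phonemes we haven't captured yet.
    best = max(candidate_phrases, key=lambda p: len(candidate_phrases[p] - covered))
    if not candidate_phrases[best] - covered:
        break  # nothing left to gain from the remaining phrases
    script.append(best)
    covered |= candidate_phrases[best]

print(f"{len(script)} phrases cover {len(covered)} of {len(target_phonemes)} phonemes")

Each round picks whichever phrase adds the most uncovered phonemes, which is how a recording script can stay short while still exercising a wide range of sounds.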

“The idea was always that it should be very close to your original voice,” Podcastle CEO Artavazd Yeritsyan says of the resulting voice clone. “Not a beautification or making your voice even better than it is, but very accurate in how you pronounce the words.”

It’s a lofty goal, but voice AI doesn’t always sound quite as melodious as an actual human voice would. The tone (at least in my experimentation) comes across as monotonous and robotic, with weird stutters and synthetic artifacts throughout.

I'll show you an example, starting with my actual speaking voice.

Next, my simulation.

Those imperfections in rhythm and inflection are inevitable, says Vijay Balasubramaniyan. He’s CEO of the company Pindrop, which analyzes voices in audio and phone calls to prevent fraud. “Your voice is something that’s developed over 10,000 years of evolution,” he says. “So you’ve developed certain things that are very hard for machines to replicate.”

AI voice companies are working to make their clones sound more human. Mati Staniszewski, the CEO of ElevenLabs, says its models are trained to interpret the context of the text you want the voice to speak. Depending on how a sentence is written, the model can adjust the tone and pacing of the resulting audio to approximate a more human inflection. That can give the audio a much more realistic feel, but it can also come out a lot more chaotic.

The above clip was made with ElevenLabs’ speech synthesis model. The pacing and inflection in the first half sound on par with the original audio, but the latter half is louder and more frantic than any of the voice clips used to build the model.
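
ElevenLabs exposes that trade-off between expressiveness and consistency as settings in its public API. The request below is a rough sketch: the endpoint, header, and parameter names follow ElevenLabs’ documentation as I understand it, while the voice ID, API key, and settings values are placeholders.

import requests

VOICE_ID = "your-cloned-voice-id"  # placeholder for a cloned voice's ID
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

payload = {
    "text": "Correction: the interview was recorded on March 4, not March 14.",
    "model_id": "eleven_monolingual_v1",
    "voice_settings": {
        "stability": 0.5,          # lower values sound more expressive, but more erratic
        "similarity_boost": 0.75,  # how closely to hew to the enrolled voice
    },
}
headers = {"xi-api-key": "YOUR_API_KEY", "Content-Type": "application/json"}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()

with open("clip.mp3", "wb") as f:
    f.write(response.content)  # the API returns raw audio bytes

Turning the stability setting down trades monotone delivery for more variation, which is roughly the balance between realism and chaos described above.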

You Talk Like Me

Audio AI may feel only slightly more realistic than AI video at the moment, but the results from the current set of tools are good enough to make security experts nervous. There are very good reasons to guard your voice for the sake of security and privacy: it can be used to authenticate your identity, and machines can infer identifying factors like your age, ethnicity, gender, and economic status just by listening to you speak.

Balasubramaniyan says voice AI services need to offer security on par with that of other companies that store personal data, like financial or medical information.

“You have to ask the company, ‘How is my AI voice going to be stored? Are you actually storing my recordings? Are you storing it encrypted? Who has access to it?’” Balasubramaniyan says. “It is a part of me. It is my intimate self. I need to protect it just as well.”

Podcastle says the voice models are end-to-end encrypted and that the company doesn’t keep any recordings after creating the model. Only the account holder who recorded the voice clips can access them. Podcastle also doesn’t allow other audio to be uploaded or analyzed on Revoice. In fact, the person creating a copy of their voice has to record the lines of prewritten text directly into Revoice’s app. They can’t just upload a prerecorded file.

“You are the one giving permission and creating the content,” Podcastle’s Yeritsyan says. “Whether it’s artificial or original, if this is not a deepfaked voice, it’s this person’s voice and he put it out there. I don’t see issues.”

Podcastle is hoping that letting people render audio only in their own, consensually cloned voice will disincentivize them from making themselves say anything too horrible. Currently, the service doesn’t have any content moderation or restrictions on specific words or phrases. Yeritsyan says it is up to whatever service or outlet publishes the audio—like Spotify, Apple Podcasts, or YouTube—to police the content that gets pushed onto their platforms.

“There are huge moderation teams on any social platforms or any streaming platform,” Yeritsyan says. “So that’s their job to not let anyone else use the fake voice and create something stupid or something not ethical and publish it there.”

Even if the very thorny issue of voice deepfakes and nonconsensual AI clones is addressed, it’s still unclear whether listeners will accept a computerized clone as a stand-in for a human.

At the end of March, the comedian Drew Carey used ElevenLabs' tool to release a whole episode of a radio show that was read by his voice clone. For the most part, people hated it. Podcasting is an intimate medium, and the distinct human connection you feel when listening to people have a conversation or tell stories is easily lost when the robots step to the microphone.

But what happens when the technology advances to the point that you can’t tell the difference? Does it matter that it’s not really your favorite podcaster in your ear? Cloned AI speech has a ways to go before it’s indistinguishable from human speech, but it’s surely catching up quickly. Just a year ago, AI-generated images looked cartoonish, and now they’re realistic enough to fool millions into thinking the Pope had some kick-ass new outerwear. It’s easy to imagine AI-generated audio will have a similar trajectory.

There’s also another very human trait driving interest in these AI-powered tools: laziness. AI voice tech—assuming it gets to the point where it can accurately mimic real voices—will make it easy to do quick edits or retakes without having to get the host back into a studio.

“Ultimately, the creator economy is going to win,” Balasubramaniyan says. “No matter how much we think about the ethical implications, it’s going to win out because you’ve just made people’s lives simple.”

Update, April 12 at 3:30 pm EDT: Shortly after this story published, we were granted access to ElevenLabs' voice AI tool, which we used to generate a third voice clip. The story was updated to include the results.
