There’s a video that pops up periodically on my YouTube feed. It’s a conversation between rappers Snoop Dogg and 50 Cent bemoaning the fact that, compared to their generation, all modern hip-hop artists apparently sound the same. “When a person decides to be themselves, they offer something no-one else can be,” says 50 Cent. “Yeah, ‘cos once you be you — who can be you but you?” Snoop responds.
When the video was uploaded in October 2014, that may have broadly been true. But just a few years later it certainly isn’t. In a world of audio deepfakes, it’s possible to train an A.I. to sound eerily similar to another person by feeding it an audio corpus consisting of hours of their spoken data. The results are unnervingly accurate.
Public figures like the rapper Jay-Z and the psychologist Jordan Peterson have already complained about people misappropriating their voices by creating audio deepfakes and then making them say silly things on the internet. “Wake up,” wrote Peterson. “The sanctity of your voice, and your image, is at serious risk.” Those are just the mischievous cases. In others, the results can tip over into outright criminality. In one 2019 incident, criminals used an audio deepfake to impersonate the voice of the CEO of an energy company and persuade an underling over the phone to urgently transfer $243,000 to a bank account.
Veritone, an A.I. company that creates smart tools for labeling media for the entertainment industry, is putting the audio deepfake power back in the hands (or, err, the throats) of those to whom it rightly belongs. This month, the company announced Marvel.ai, which company president Ryan Steelberg described to Digital Trends as a “complete voice-as-a-service solution.” For a fee, Veritone will build an A.I. model that sounds just like you (or, more likely, a famous person with an immediately recognizable voice), which can then be licensed out on loan like a high-tech version of Ariel’s voice-as-collateral bargain from The Little Mermaid.
“Your voice is just as valuable as any other content or brand attribute that you have,” said Steelberg. “[It’s on a level with] your name and likeness, your face, your signature, or a song you’ve written or piece of content you’ve created.”
“We can repurpose a lot”
Certain individuals have, of course, long sold their voices in the form of recording commercials or voiceovers, singing songs, and countless other forms of monetization. But these endeavors all required the person to actually say the words. What Veritone’s solution promises to do is to make this individually scalable.
What if, for instance, it were possible for Kevin Hart to license his voice to a luxury brand, which could then use it to create personalized ads featuring the name of the viewer, the location of their nearest brick-and-mortar store, and the product they’re most likely to buy? Rather than his spending days in the recording booth, A.I. could make this possible with little more (on Hart’s part, at least) than a signature on the dotted line agreeing to let a third party harness his voice likeness. While he was off shooting a movie, doing a comedy tour, taking a vacation, or even sleeping, his digital voice could be raking in the cash.
“We can repurpose a lot,” Steelberg explained, regarding the training process. “People who are already speaking a ton, if they’re producing a podcast or in the media, there’s a lot of data out there. We probably have a ton of it already if they happen to be a customer of ours.”
“What we find so fascinating about this new category of A.I. is the extensibility and the variability.”
Steelberg said that the voice-as-a-service idea occurred to Veritone several years ago. At the time, however, he was unconvinced that machine learning models could create the hyper-realistic synthetic voices he was looking for. That matters most for voices we know intimately, even if we’ve never actually met the speaker in question. The results could have been a kind of audible uncanny valley, with every wrong sound alerting listeners to the fact that they were hearing a fake. Here in 2021, though, he is convinced that things have advanced to the point where this is now possible. Hence Marvel.ai.
Steelberg speaks in excited buzzwords about the massive potential of the technology, talking up its possible plethora of “modalities of execution.” Veritone can create models for text-to-speech. It can also build models for speech-to-speech, whereby a voice actor can “drive” a vocal performance by reading the words with suitable inflection and then having the finished voice overlaid at the end like a Snapchat filter. The company can also fingerprint each voice so it can tell if a piece of apparently real audio that pops up someplace was created using its technology.
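Veritone hasn’t disclosed how its fingerprinting works, but the general idea behind audio fingerprinting is well established: reduce a recording to a compact signature that can later be matched against a database. Purely as an illustration (this is a toy sketch, not Veritone’s proprietary method), one simple approach hashes the dominant frequency of each short frame of the signal:

```python
# Illustrative only: a toy spectral fingerprint, NOT Veritone's actual technology.
import hashlib
import numpy as np

def fingerprint(signal: np.ndarray, frame_size: int = 1024) -> str:
    """Reduce an audio signal to a short hex ID by hashing the loudest
    frequency bin of each frame. Identical audio yields identical IDs."""
    n_frames = len(signal) // frame_size
    peaks = []
    for i in range(n_frames):
        frame = signal[i * frame_size:(i + 1) * frame_size]
        # Window the frame, then find the strongest frequency component
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_size)))
        peaks.append(int(np.argmax(spectrum)))
    peak_bytes = np.array(peaks, dtype=np.uint16).tobytes()
    return hashlib.sha256(peak_bytes).hexdigest()[:16]

# Two different tones produce different fingerprints
t = np.linspace(0, 1, 8000, endpoint=False)
tone_a = np.sin(2 * np.pi * 440 * t)   # A4
tone_b = np.sin(2 * np.pi * 523 * t)   # C5
print(fingerprint(tone_a) == fingerprint(tone_a))  # same audio matches
print(fingerprint(tone_a) != fingerprint(tone_b))  # different audio differs
```

A production system would use far more robust features (and, for provenance rather than recognition, an embedded watermark), but the principle is the same: a deterministic signature lets a provider check whether a clip matches something it generated.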
“The more you think about it … you’ll literally come up with 50 more [possible use-cases],” he said. “What we find so fascinating about this new category of A.I. is the extensibility and the variability.”
Consider some others. A famous athlete might be a god on the basketball court, but a devil when it comes to reading lines in a script in a way that sounds natural. Using Veritone’s technology, their part in video game cutscenes, or the audiobook of their memoir (which they may also not have written), could be performed by a voice actor whose delivery is then digitally tweaked to sound like the athlete. As another possibility, a movie could be localized for other countries, with the same actor’s voice now reading the lines in French, Mandarin, or any number of other languages, even if the actor doesn’t actually speak them.
How will the public react?
A big question hanging over all of this, of course, is how members of the public are going to respond to it all. This is the tricky, unpredictable bit. Celebrities today must play a complex role: Both larger-than-life figures worthy of having their face plastered on billboards, and also relatable individuals who have relationship problems, tweet about watching TV in their pajamas, and make silly faces when they eat hot sauce.
What happens, then, when ads appear featuring a celebrity’s voice reading lines we know the performer never actually said, their voice instead having been programmatically harnessed to deliver a targeted ad? Steelberg said it’s little different from a celebrity handing over control of their social media to a third-party account manager. If we see Taylor Swift tweet, we know it’s quite possibly not Taylor herself tapping out the message, especially if it’s an endorsement or piece of promotional content.
But voice is, in a very real way, different, precisely because it’s more personal. That’s especially true when it’s accompanied by a degree of personalization, which is one of the use-cases that makes the most sense. The truth is that, to quote the screenwriter William Goldman, nobody knows what the public response will be, precisely because nobody has done exactly this before.
“It’s going to run the spectrum, right?” Steelberg said. “[Some] people are going to say, ‘I’m going to use this tool a little bit to augment my day to help me save time.’ Others are going to say, full-blown, ‘I want my voice everywhere to extend my brand, and I’m going to license it out.’”
His best guess is that acceptance will be on a case-by-case basis. “You need to be in tune with the reaction of your audience, and if you see things are working or not working,” he said. “They may love it. They may say, ‘You know what? I love the fact that you’re putting out 10 times more content or more personal content to me, even though I know you used synthetic content to augment it. Thank you. Thank you.’”
Think about the future
As for the future? “We want to work with all the major talent agencies,” Steelberg said. “We think anybody who is in the business of making money around a scarce brand should be thinking about their voice strategy.”
And don’t expect it to remain purely about audio, either. “We’ve always been fascinated by the potential of using synthetic content to either extend, augment, or potentially completely replace some of the legacy forms of content production,” he continued. “Be that in an audio sense or, ultimately in the future, a video sense.”
That’s right: Once it has cornered the market in the world of audio deepfakes, Veritone plans to go one step further and enter the world of fully realized virtual avatars that both sound and look indistinguishable from their source.
Suddenly those personalized ads from Minority Report sound a whole lot less like science fiction.