Back in 2019, OpenAI refused to publish its full research into the development of GPT-2 over fears that it was “too dangerous” to release publicly. On Thursday, OpenAI’s biggest financial backer, Microsoft, made a similar pronouncement about its new VALL-E 2 voice synthesizer AI.
The VALL-E 2 system is a zero-shot text-to-speech synthesis (TTS) AI, meaning that it can recreate hyper-realistic speech based on just a few seconds of sample audio. Per the research team, VALL-E 2 “surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks.”
The system reportedly can even handle sentences that are difficult to pronounce because of their structural complexity or repetitive phrasing, such as tongue twisters.
There are a host of potential beneficial uses for such a system, like enabling people suffering from aphasia or amyotrophic lateral sclerosis (commonly known as ALS or Lou Gehrig’s disease) to speak again, albeit through a computer, as well as use in education, entertainment, journalism, chatbots, and translation, or as accessibility features and “interactive voice response systems,” like Siri. However, the team also recognizes numerous opportunities for the public to misuse its technology, “such as spoofing voice identification or impersonating a specific speaker.”
As such, the AI will only be available for research purposes. “Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public,” the team wrote. “If you suspect that VALL-E 2 is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.”
Microsoft is hardly alone in its efforts to train computers to speak as humans do. Google’s Chirp, ElevenLabs’ Iconic Voices, and Voicebox from Meta all aim to perform similar functions.
However, such systems have come under ethical scrutiny as they have repeatedly been used to scam unsuspecting victims by emulating the voice of a loved one or a well-known celebrity. And unlike generated images, there’s currently no way to effectively “watermark” AI-generated audio.