While subtle, one of the biggest advances portrayed in movies like Her or Ex Machina was that the AI really sounded like a fellow human. In the realm of real-life tech, Google’s AI focus in recent years has similarly been to make computers sound more like us, and it’s getting much better at it.
The latest development to come out of Google’s DeepMind AI lab is called WaveNet: it studies recordings of real human speech and models its own waveforms on the way they sound. It’s not perfect yet, but we’re definitely getting closer to voices that sound like they come from a person’s mouth, rather than from a computer’s speaker.
While it still sounds a little strange, the new AI speech certainly flows better than the kinds of responses you’ll get from Siri or Cortana, which chop up recorded human speech and paste it back together in a way that gets individual pronunciations right but leaves the flow of the speech completely off. (That technique is known as concatenative text-to-speech, just so you know.)
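To make that chop-and-paste idea concrete, here’s a minimal sketch in Python. Everything in it is illustrative rather than how Siri or Cortana actually work: the unit names, the 100 ms fragments and the 10 ms crossfade are all stand-ins, but the stitch-and-smooth structure is the essence of concatenative synthesis.

```python
import numpy as np

SAMPLE_RATE = 16_000  # 16 kHz, a common rate for speech audio

# Hypothetical bank of pre-recorded speech units (e.g. diphones).
# A real system holds thousands of recorded fragments; random noise
# stands in for them here so the sketch runs on its own.
rng = np.random.default_rng(0)
UNIT_BANK = {
    name: rng.uniform(-1.0, 1.0, SAMPLE_RATE // 10)  # 100 ms each
    for name in ("h-e", "e-l", "l-o")
}

def crossfade(a, b, overlap=160):
    """Blend the tail of `a` into the head of `b` over `overlap`
    samples (160 samples = 10 ms at 16 kHz) to hide the seam."""
    fade = np.linspace(0.0, 1.0, overlap)
    mixed = a[-overlap:] * (1.0 - fade) + b[:overlap] * fade
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

def synthesize(unit_names):
    """Chop-and-paste synthesis: look up each recorded unit and
    crossfade them together. Each pronunciation can be correct in
    isolation, but every unit keeps the pitch and pace it was
    recorded with, which is why the overall flow sounds off."""
    out = UNIT_BANK[unit_names[0]]
    for name in unit_names[1:]:
        out = crossfade(out, UNIT_BANK[name])
    return out

waveform = synthesize(["h-e", "e-l", "l-o"])
```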
WaveNet flows much better because it uses something called parametric text-to-speech, which generates the audio from scratch instead. Where it differs from traditional uses of that technique, though, is that Google’s AI models its audio directly on the waveforms of real human voices.
That’s difficult, because raw audio typically contains around 16,000 samples for every second of speech, and generating each one takes a lot of processing power. To cut back on that, WaveNet uses a prediction engine to estimate which sample should come next in natural speech, using everything that has come before as a guide.
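Here’s a rough sketch of that generation loop, again in Python. The predict_next_sample function is a placeholder for WaveNet’s actual neural network (DeepMind’s papers describe stacks of dilated convolutions), so the uniform distribution it returns is purely an assumption for illustration; what matters is the shape of the loop, where every single sample is drawn from a distribution conditioned on all the samples that came before it.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second of audio

def predict_next_sample(history):
    """Stand-in for the trained network. Given every sample generated
    so far, return a probability distribution over the 256 possible
    values of the next sample (WaveNet quantizes audio to 8 bits).
    A uniform distribution is used here as a placeholder."""
    return np.full(256, 1.0 / 256)

def generate(seconds, rng=np.random.default_rng()):
    """Autoregressive generation: draw one sample at a time, feeding
    everything generated so far back in as context for the next
    prediction."""
    history = np.zeros(0, dtype=np.int64)
    for _ in range(int(seconds * SAMPLE_RATE)):
        probs = predict_next_sample(history)
        nxt = rng.choice(256, p=probs)
        history = np.append(history, nxt)
    return history

audio = generate(0.01)  # even 10 ms means 160 sequential predictions
```

The sequential dependence is exactly why this approach is so demanding: generating one second of audio means 16,000 network evaluations, one after another.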
The results are impressive. To give you a comparison, here’s a classic concatenative text-to-speech system:
That sounds like the sort of digital assistant voice we’ve become used to in recent years. But here’s the new WaveNet system that Google has developed:
The cadence of the speech is much more realistic, and though there’s a general fuzziness to the audio, it’s not hard to imagine that being cleaned up as development continues.
The process can even be used to simulate different kinds of voices, for example, male and female:
The only problem now is that even though its predictions reduce the amount of processing this technique requires, it still demands more than standard smartphone hardware can handle in real time. At least for now.
For more information on these techniques, Google’s blog post offers a lot more detail and samples, and DeepMind has even posted a couple of papers on it here.