This Realistic Synthesized Speech Could Be The Future of Audiobooks

Synthesized voices like those used by Siri and Alexa are fine for telling us the day’s weather forecast or how many minutes remain on a cooking timer, but would you really want their flat, monotonous tones reading you audiobooks? Probably not, which is why most of us turn to human-voiced services like Audible to get our audiobook fix. Human voice actors might not get the nod for too much longer, however, due to the pioneering work of a London-based startup called DeepZen.

Using artificial intelligence algorithms, augmented by the technological firepower of IBM’s Power A.I. and Watson technologies, DeepZen has developed text-to-speech tools that not only sound human at first listen, but can also pick up on the emotional cues needed for reading text in a compelling manner. In doing so, the company claims that it could reduce the time and cost to produce audiobooks by up to 90%.

“Our system is truly revolutionary,” Taylan Kamis, CEO and co-founder of DeepZen, told Digital Trends. “It works using deep learning and neural networks to understand how a human talks and reads. We then train the system so it can recognize where to apply the right emotions and intonation when reading a piece of text. The result is humanlike speech very closely resembling the real thing.”

Inevitably, work like this can be cast as yet another example of cutting-edge A.I. tools threatening a human profession. In this case, that profession involves actors who, despite what a few high-profile figures are able to achieve, don’t have the most steady, stable careers as it is. It would be naive to think that software such as this won’t have an impact on the future of voice actors, but, as Kamis points out, there are plenty of scenarios in which tools such as DeepZen’s could be a net positive for humanity.

For example, it could make possible the creation of audiobooks based on works by new and emerging writers, or from publishers who don’t have the luxury of big budgets. It could also be used to help develop superior text-to-speech tools for people who have dyslexia or otherwise have trouble reading.

“As for the future, we are also looking at producing voice-overs for the video production industry, as well as gaming, where there is a need for real-time text-to-speech to enhance the player experience,” Kamis said. “We are also looking at other languages.”

You can check out a sample of the system here.
