
Google’s AI just got ears

The Google Gemini AI logo.

AI chatbots are already capable of “seeing” the world through images and video. Now, Google has announced audio understanding as part of its latest update to Gemini Pro. In Gemini 1.5 Pro, the chatbot can “hear” audio files uploaded into its system and then extract text information from them.

The company has made this version of the LLM available as a public preview on its Vertex AI development platform, allowing more enterprise-focused users to experiment with the feature and expand its user base. When the model was first announced in February, it was offered only to a limited group of developers and enterprise customers.
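For anyone curious what experimenting with the preview looks like in practice, a request might resemble the short Python sketch below. This is purely illustrative: the project, region, model ID, file path, and prompt are assumptions, not details from Google’s announcement.

```python
# A rough sketch of sending an audio file to the Gemini 1.5 Pro preview on
# Vertex AI; project, region, model ID, and file path are placeholder assumptions.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-project", location="us-central1")  # your GCP settings

# The preview model name is an assumption; check Vertex AI for the current ID.
model = GenerativeModel("gemini-1.5-pro-preview-0409")

# Point the model at an audio file in Cloud Storage and ask it to extract text.
audio = Part.from_uri("gs://my-bucket/conference-call.mp3", mime_type="audio/mpeg")
response = model.generate_content([audio, "Transcribe this recording."])
print(response.text)
```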


1. Breaking down + understanding a long video

I uploaded the entire NBA dunk contest from last night and asked which dunk had the highest score.

Gemini 1.5 was incredibly able to find the specific perfect 50 dunk and details from just its long context video understanding! pic.twitter.com/01iUfqfiAO

— Rowan Cheung (@rowancheung) February 18, 2024


Google shared the details about the update at its Cloud Next conference, currently taking place in Las Vegas. Having previously called Gemini Ultra, the LLM that powers its Gemini Advanced chatbot, the most powerful model in its Gemini family, Google now calls Gemini 1.5 Pro its most capable generative model. The company added that this version is better at learning new tasks without additional fine-tuning of the model.

Gemini 1.5 Pro is multimodal in that it can turn different types of audio into text, including TV shows, movies, radio broadcasts, and conference call recordings. It’s also multilingual, able to process audio in several different languages. The LLM can generate transcripts from videos as well, though their quality may be unreliable, as TechCrunch notes.

When the model was first announced, Google explained that Gemini 1.5 Pro processes raw data through a token system. A million tokens equate to approximately 700,000 words or 30,000 lines of code; in media terms, that’s about an hour of video or around 11 hours of audio.
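Those ratios make it easy to sanity-check whether a recording fits in a single request. The snippet below is a hypothetical back-of-the-envelope helper built only on the approximate figures above.

```python
# Back-of-the-envelope math for the 1-million-token context window, using the
# rough equivalences Google quoted (approximations, not exact limits).
CONTEXT_TOKENS = 1_000_000
TOKENS_PER_AUDIO_HOUR = CONTEXT_TOKENS / 11  # ~11 hours of audio per window

def estimate_audio_tokens(hours: float) -> int:
    """Roughly estimate the token cost of an audio recording."""
    return round(hours * TOKENS_PER_AUDIO_HOUR)

# A 3-hour conference call would use roughly a quarter of the window.
print(estimate_audio_tokens(3.0))                    # ~272,727 tokens
print(estimate_audio_tokens(3.0) <= CONTEXT_TOKENS)  # True, it fits
```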

Some private preview demos of Gemini 1.5 Pro have shown how the LLM can find specific moments within a long video. For example, AI enthusiast Rowan Cheung got early access and detailed how his demo found an exact action shot in a sports contest and summarized the event, as seen in the tweet embedded above.

However, Google noted that other early adopters, including United Wholesale Mortgage, TBS, and Replit, are opting for more enterprise-focused use cases, such as mortgage underwriting, automating metadata tagging, and generating, explaining, and updating code.

Fionna Agomuoh
Fionna Agomuoh is a Computing Writer at Digital Trends. She covers a range of topics in the computing space, including…