Skip to main content

Gemini AI is making robots in the office far more useful

An Everyday Robot navigating through an office.
Everyday Robot

Lost in an unfamiliar office building, big box store, or warehouse? Just ask the nearest robot for directions.

A team of Google researchers combined the powers of natural language processing and computer vision to develop a novel means of robotic navigation as part of a new study published Wednesday.

Essentially, the team set out to teach a robot — in this case an Everyday Robot — how to navigate through an indoor space using natural language prompts and visual inputs. Robotic navigation used to require researchers to not only map out the environment ahead of time but also provide specific physical coordinates within the space to guide the machine. Recent advances in what’s known as Vision Language navigation have enabled users to simply give robots natural language commands, like “go to the workbench.” Google’s researchers are taking that concept a step further by incorporating multimodal capabilities, so that the robot can accept natural language and image instructions at the same time.

For example, a user in a warehouse would be able to show the robot an item and ask, “what shelf does this go on?” Leveraging the power of Gemini 1.5 Pro, the AI interprets both the spoken question and the visual information to formulate not just a response but also a navigation path to lead the user to the correct spot on the warehouse floor. The robots were also tested with commands like, “Take me to the conference room with the double doors,” “Where can I borrow some hand sanitizer,” and “I want to store something out of sight from public eyes. Where should I go?”

Or, in the Instagram Reel above, a researcher activates the system with an “OK robot” before asking to be led somewhere where “he can draw.” The robot responds with “give me a minute. Thinking with Gemini …” before setting off briskly through the 9,000-square-foot DeepMind office in search of a large wall-mounted whiteboard.

To be fair, these trailblazing robots were already familiar with the office space’s layout. The team utilized a technique known as “Multimodal Instruction Navigation with demonstration Tours (MINT).” This involved the team first manually guiding the robot around the office, pointing out specific areas and features using natural language, though the same effect can be achieved by simply recording a video of the space using a smartphone. From there the AI generates a topological graph where it works to match what its cameras are seeing with the “goal frame” from the demonstration video.

Then, the team employs a hierarchical Vision-Language-Action (VLA) navigation policy “combining the environment understanding and common sense reasoning,” to instruct the AI on how to translate user requests into navigational action.

The results were very successful with the robots achieving “86 percent and 90 percent end-to-end success rates on previously infeasible navigation tasks involving complex reasoning and multimodal user instructions in a large real world environment,” the researchers wrote.

However, they recognize that there is still room for improvement, pointing out that the robot cannot (yet) autonomously perform its own demonstration tour and noting that the AI’s ungainly inference time (how long it takes to formulate a response) of 10 to 30 seconds turns interacting with the system a study in patience.

Andrew Tarantola
Andrew has spent more than a decade reporting on emerging technologies ranging from robotics and machine learning to space…
Microsoft Copilot: how to use this powerful AI assistant
Man using Windows Copilot PC to work

In the rapidly evolving landscape of artificial intelligence, Microsoft's Copilot AI assistant is a powerful tool designed to streamline and enhance your professional productivity. Whether you're new to AI or a seasoned pro, this guide will help you through the essentials of Copilot, from understanding what it is and how to sign up, to mastering the art of effective prompts and creating stunning images.

Additionally, you'll learn how to manage your Copilot account to ensure a seamless and efficient user experience. Dive in to unlock the full potential of Microsoft's Copilot and transform the way you work.
What is Microsoft Copilot?
Copilot is Microsoft's flagship AI assistant, an advanced large language model. It's available on the web, through iOS, and Android mobile apps as well as capable of integrating with apps across the company's 365 app suite, including Word, Excel, PowerPoint, and Outlook. The AI launched in February 2023 as a replacement for the retired Cortana, Microsoft's previous digital assistant. It was initially branded as Bing Chat and offered as a built-in feature for Bing and the Edge browser. It was officially rebranded as Copilot in September 2023 and integrated into Windows 11 through a patch in December of that same year.

Read more
Google strikes back with its own lightweight AI model
Google Gemini on smartphone.

Google announced Thursday that it is releasing Gemini 1.5 Flash, it's snack-sized large language model and ChatGPT-4o mini competitor, to all users regardless of their subscription level.

The company promises "across-the-board improvements" in terms of response quality and latency, as well as "especially noticeable improvements in reasoning and image understanding."

Read more
What is Gemini Advanced? Here’s how to use Google’s premium AI
Google Gemini on smartphone.

Google's Gemini is already revolutionizing the way we interact with AI, but there is so much more it can do with a $20/month subscription. In this comprehensive guide, we'll walk you through everything you need to know about Gemini Advanced, from what sets it apart from other AI subscriptions to the simple steps for signing up and getting started.

You'll learn how to craft effective prompts that yield impressive results and stunning images with Gemini's built-in generative capabilities. Whether you're a seasoned AI enthusiast or a curious beginner, this post will equip you with the knowledge and techniques to harness the power of Gemini Advanced and take your AI-generated content to the next level.
What is Google Gemini Advanced?

Read more