Researchers Help Robots Navigate by Turning Visual Data Into Language

MIT and IBM's LangNav approach combines text descriptions with visual data, enhancing robots' navigation understanding and instruction following

Ben Wodecki, Junior Editor - AI Business

June 18, 2024

2 Min Read

Researchers have developed a novel way to help robots better navigate environments using only natural language instructions and descriptions instead of complex visual processing.

Researchers from MIT CSAIL, the MIT-IBM Watson AI Lab and Dartmouth College created the LangNav method, which converts visual information into text captions that are then used to instruct robots on how to navigate environments.

In a recently published paper, the researchers suggested that their language-based approach outperformed traditional vision-based navigation methods, enabling improved task transfer abilities.

“We show that we can learn to navigate in real-world environments by using language as a perceptual representation,” the paper reads. “Language naturally abstracts away low-level perceptual details, which we find to be beneficial for efficient data generation and sim-to-real transfer.”

Training a robot to perform a task such as picking up an object typically requires a considerable amount of visual data to teach it what to do.

The researchers propose that language could serve as a viable alternative to visual information, generating trajectories that guide a robot to its goal.

Instead of directly using raw visual observations, the researchers converted the visual inputs into text descriptions using off-the-shelf computer vision models for image captioning (BLIP) and object detection (Deformable DETR). 
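As a rough illustration of that conversion step (not the authors' exact code), the sketch below uses the Hugging Face implementations of BLIP and Deformable DETR to turn a single camera view into a caption plus a list of detected objects. The checkpoints, thresholds and output format are illustrative assumptions.

```python
# Hypothetical sketch: turn one camera view into text with off-the-shelf models.
# Checkpoints and thresholds are illustrative, not taken from the LangNav paper.
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    AutoImageProcessor, DeformableDetrForObjectDetection,
)

image = Image.open("view_front.jpg").convert("RGB")

# 1) Image captioning with BLIP.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
caption_ids = blip.generate(
    **blip_processor(images=image, return_tensors="pt"), max_new_tokens=30
)
caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)

# 2) Object detection with Deformable DETR.
detr_processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
detr = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")
with torch.no_grad():
    outputs = detr(**detr_processor(images=image, return_tensors="pt"))
detections = detr_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=[image.size[::-1]]
)[0]
objects = [detr.config.id2label[label.item()] for label in detections["labels"]]

# 3) Combine caption and detections into one text description of the view.
view_text = f"{caption}. Visible objects: {', '.join(sorted(set(objects)))}."
print(view_text)
```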

Those text descriptions were then fed into a large pre-trained language model that had been fine-tuned for navigation tasks.

The resulting method generated text-based instructions for a robot, offering detailed guidance on how to navigate a specific path. For example: “Go down the stairs and straight into the living room. In the living room walk out onto the patio. On the patio stop outside the doorway.”
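To make that step concrete, here is a minimal sketch of how per-view text descriptions and a navigation instruction might be assembled into a prompt for such a language model. The prompt template, action format and the unspecified `model` call are assumptions for illustration, not the paper's fine-tuned setup.

```python
# Hypothetical prompt assembly: the template and option set are illustrative,
# not the exact format used to fine-tune the LangNav model.
instruction = (
    "Go down the stairs and straight into the living room. "
    "In the living room walk out onto the patio. "
    "On the patio stop outside the doorway."
)

# One text description per candidate viewpoint, e.g. produced by the
# captioning/detection step sketched above.
view_descriptions = {
    1: "a staircase leading down. Visible objects: stairs, railing.",
    2: "a hallway with a closed door. Visible objects: door, picture frame.",
}

prompt = "You are navigating a house.\n"
prompt += f"Instruction: {instruction}\n"
for view_id, description in view_descriptions.items():
    prompt += f"Option {view_id}: {description}\n"
prompt += "Which option should the agent move toward? Answer with the option number."

# A fine-tuned language model would consume this prompt and return the next
# action, e.g. next_action = model.generate(prompt)  # `model` is assumed here.
print(prompt)
```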

By representing the visual scene through language, the method helps a robot better understand the path it needs to take while reducing the amount of information its hardware has to process.

The paper suggests the LangNav approach outperformed traditional robotic navigation methods that rely solely on visual information.

The language-based approach even worked well in low-data settings, where only a few expert navigation examples were available for training.

“Our approach is found to improve upon baselines that rely on visual features in settings where only a few expert trajectories are available, demonstrating the potential of language as a perceptual representation for navigation,” the paper reads.

While the researchers described the approach as promising, they noted that LangNav has limitations: some visual information is lost when it is converted into text, which can hamper the model's understanding of an entire scene.

About the Author(s)

Ben Wodecki

Junior Editor - AI Business

Ben Wodecki is the junior editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to junior editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others.
