MIT’s Clio works in real time to map task-relevant objects in a robot’s environment, allowing the robot (Boston Dynamics’ Spot quadruped, pictured) to carry out tasks given in natural language (“pick up an orange backpack”). Credit: Massachusetts Institute of Technology
Imagine having to bring order to a messy kitchen, starting with a counter littered with sauce packets. If your goal is to clean the counter, you can sweep up the packets as a group. If, however, you wanted to pick out the mustard packets before throwing away the rest, you would sort more selectively, by type of sauce. And if, among the mustards, you had a craving for Grey Poupon, finding that specific brand would require a more careful search.
MIT engineers have developed a method that allows robots to make equally intuitive and task-relevant decisions.
The team’s new approach, called Clio, allows a robot to identify which parts of a scene matter, given the tasks at hand. With Clio, a robot takes in a list of tasks described in natural language and, based on those tasks, determines the level of granularity required to interpret its environment and “remember” only the parts of the scene that are relevant.
In real-world experiments ranging from a cluttered cubicle to a five-story building on the MIT campus, the team used Clio to automatically segment a scene at different levels of granularity, based on a set of tasks specified in natural-language prompts such as “move rack of magazines” and “get a first aid kit.”
The team also ran Clio in real time on a quadruped robot. As the robot explored an office building, Clio identified and mapped only the parts of the scene related to the robot’s tasks (such as retrieving a dog toy while ignoring piles of office supplies), allowing the robot to grasp the objects of interest.
Clio is named after the Greek muse of history, for her ability to identify and remember only the important elements for a given task. The researchers envision that Clio would be useful in many situations and environments in which a robot would need to quickly study and make sense of its surroundings in the context of its given task.
“Search and rescue is the motivating application of this work, but Clio can also power domestic robots and robots working in a factory alongside humans,” says Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro), principal investigator in the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. “It’s really about helping the robot understand the environment and what it needs to remember to complete its mission.”
The team details their results in a study published today in the journal IEEE Robotics and Automation Letters. Carlone’s co-authors include SPARK Lab members: Dominic Maggio, Yun Chang, Nathan Hughes, and Lukas Schmid; and members of the MIT Lincoln Laboratory: Matthew Trang, Dan Griffith, Carlyn Dougherty and Eric Cristofalo.
Open fields
Huge advances in computer vision and natural language processing have enabled robots to identify objects in their environment. But until recently, robots were only able to do this in “closed” scenarios, in which they were programmed to work in a carefully organized and controlled environment, with a finite number of objects that the robot had been pre-trained to recognize.
In recent years, researchers have taken a more “open” approach to enable robots to recognize objects in more realistic settings. In the field of open-set recognition, researchers have leveraged deep learning tools to build neural networks that can process billions of images from the internet, along with the text associated with each image (like a photo of a dog posted to a friend’s Facebook page, captioned “Meet my new puppy!”).
From millions of image-text pairs, such a neural network learns to identify segments of a scene that are characteristic of certain terms, such as “dog.” A robot can then use the network to spot a dog in a completely new scene.
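To make the idea concrete, here is a minimal sketch of this kind of open-vocabulary matching using the publicly available CLIP model via Hugging Face’s transformers library. The article does not name the specific vision-language network Clio builds on, so the model choice, image file, and labels below are illustrative assumptions:

```python
# Minimal sketch of open-vocabulary recognition with a CLIP-style model.
# Assumption: model, image path, and labels are illustrative; the article
# does not specify the exact network Clio uses.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # any photo of a completely new scene
labels = ["a dog", "a backpack", "a stack of books"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))  # which label best matches the scene
```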
But there is still a challenge: how to analyze a scene in a way that is useful and relevant for a particular task.
“Typical methods select an arbitrary, fixed level of granularity to determine how to merge segments of a scene into what you can think of as a single ‘object,’” says Maggio. “However, the granularity of what you call an ‘object’ is actually related to what the robot needs to do. If that granularity is fixed without taking the tasks into account, then the robot may end up with a map that isn’t useful for its tasks.”
Information bottleneck
With Clio, the MIT team aimed to enable robots to interpret their environment at a level of granularity that could be automatically adapted to the tasks at hand.
For example, for a task of moving a stack of books to a shelf, the robot should be able to determine that the entire stack of books is the relevant object for the task. Similarly, if the task was to move only the green book from the rest of the stack, the robot would have to distinguish the green book as a single target object and ignore the rest of the scene, including the other books in the stack.
The team’s approach combines state-of-the-art computer vision and large language models: neural networks that make connections among millions of open-source images and their associated text. It also incorporates mapping tools that automatically split an image into many small segments, which can be fed into the neural network to determine whether certain segments are semantically similar.
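As a rough illustration of that merging step, the hedged sketch below greedily groups segments whose embedding vectors are close in cosine similarity. The helper names and the 0.9 threshold are hypothetical, not taken from Clio’s implementation:

```python
# Hypothetical sketch: greedily group scene segments whose embedding
# vectors are semantically similar. The 0.9 threshold is an assumption.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_similar_segments(embeddings: list[np.ndarray], threshold: float = 0.9):
    groups: list[list[int]] = []
    reps: list[np.ndarray] = []          # one representative embedding per group
    for i, emb in enumerate(embeddings):
        for g, rep in enumerate(reps):
            if cosine(emb, rep) >= threshold:
                groups[g].append(i)      # similar enough: treat as one "object"
                break
        else:
            groups.append([i])           # otherwise start a new group
            reps.append(emb)
    return groups
```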
The researchers then exploit an idea from classical information theory called the “information bottleneck,” which they use to compress a set of image segments in a way that selects and stores the segments that are semantically most relevant to a given task.
“For example, let’s say there’s a stack of books in the scene and my task is just to get the green book. In that case, we push all this information about the scene through the bottleneck and end up with a cluster of segments that represents the green book,” explains Maggio.
“All the other segments that are not relevant are simply grouped into a cluster that we can just remove. And we end up with an object at the right granularity needed to support my task.”
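The toy sketch below captures the spirit of that compression, though not Clio’s actual formulation: each segment embedding is scored against a task embedding, segments informative about the task are kept, and everything else collapses into a single background cluster that can be discarded. The scoring rule and threshold here are assumptions:

```python
# Toy illustration of task-driven compression in the spirit of the
# information bottleneck (a simplification, not Clio's formulation):
# keep segments whose embeddings are relevant to the task embedding,
# and lump everything else into one discardable background cluster.
import numpy as np

def compress_for_task(segment_embs, task_emb, keep_threshold=0.25):
    keep, background = [], []
    for i, emb in enumerate(segment_embs):
        sim = float(emb @ task_emb /
                    (np.linalg.norm(emb) * np.linalg.norm(task_emb)))
        (keep if sim >= keep_threshold else background).append(i)
    return keep, background  # the background cluster can be dropped wholesale
```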
The researchers demonstrated Clio in different real-world environments.
“What we thought would be a really practical experiment would be to run Clio in my apartment, where I didn’t do any cleaning beforehand,” says Maggio.
The team made a list of tasks in natural language, such as “move a pile of clothes,” then applied Clio to images of Maggio’s cluttered apartment. In these cases, Clio was able to quickly segment the apartment scenes and feed the segments through the information bottleneck algorithm to identify which segments made up the pile of clothes.
They also used Clio on Boston Dynamics’ quadruped robot, Spot. They gave the robot a list of tasks to complete, and as the robot explored and mapped the interior of an office building, Clio ran in real time on an onboard computer mounted on Spot, picking out segments in the mapped scenes that visually related to the given task.
The method generated an overlay map showing only the target objects, which the robot then used to approach the identified objects and physically complete the task.
“Getting Clio running in real time was a big achievement for the team,” says Maggio. “A lot of prior work can take several hours to run.”
In the future, the team plans to adapt Clio to be able to handle higher-level tasks and build on recent advances in photorealistic visual representations of scenes.
“For now, we give Clio somewhat specific tasks, like ‘find a deck of cards,’” says Maggio. “For search and rescue, you need to give it higher-level tasks, like ‘find survivors’ or ‘turn the power back on.’ So we want to reach a more human-level understanding of how to accomplish more complex tasks.”
More information:
Dominic Maggio et al, Clio: Real-Time Task-Driven Open-Set 3D Scene Graphs, IEEE Robotics and Automation Letters (2024). DOI: 10.1109/LRA.2024.3451395. dspace.mit.edu/handle/1721.1/157072
Provided by the Massachusetts Institute of Technology
This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news in MIT research, innovation and education.