Your daily to-do list is probably pretty simple: washing the dishes, buying groceries, and other minutiae. It’s unlikely that you wrote down “pick up the first dirty dish” or “wash that plate with a sponge,” because each of these miniature steps within the chore feels intuitive. While we can routinely complete each step without much thought, a robot requires a complex plan with more detailed outlines.
MIT’s Improbable AI Lab, a group within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has offered these machines a helping hand with a new multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, workable plans with the expertise of three different foundation models. Like OpenAI’s GPT-4, the foundation model on which ChatGPT and Bing Chat were built, these foundation models are trained on massive amounts of data for applications such as image generation, text translation, and robotics.
The work is published on the arXiv preprint server.
Unlike RT-2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models, each trained on a different data modality. Each foundation model captures a different part of the decision-making process, and they work together when it’s time to make decisions. HiP removes the need for access to paired vision, language, and action data, which is difficult to obtain. HiP also makes the reasoning process more transparent.
What is considered an everyday task for a human may be a “long-term goal” for a robot – an overarching goal that involves completing many small steps first – requiring enough data to plan, understand, and execute goals. While computer vision researchers have attempted to build monolithic foundation models for this problem, pairing linguistic, visual, and action data is expensive. Instead, HiP represents a different, multimodal recipe: a trio that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.
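In rough code, the compositional idea looks something like the sketch below. The function and method names are illustrative placeholders, not HiP’s actual interfaces, and the control flow is simplified for clarity:

```python
# Hypothetical sketch of HiP's compositional structure: three separately
# pretrained models are composed at decision time instead of being trained
# jointly on paired vision-language-action data. All names are illustrative.

def plan_with_hip(goal, observation, llm, video_model, action_model):
    """Compose three pretrained models into one hierarchical planner."""
    # 1. Language level: break the goal into symbolic subgoals.
    subgoals = llm.decompose(goal)

    actions = []
    for subgoal in subgoals:
        # 2. Visual level: imagine how the scene should evolve for this subgoal.
        imagined_frames = video_model.predict_trajectory(observation, subgoal)

        # 3. Action level: infer the motor commands that realize that trajectory.
        actions.extend(action_model.infer_actions(imagined_frames))

        # The last imagined frame becomes the start state for the next subgoal.
        observation = imagined_frames[-1]

    return actions
```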
“Foundation models do not have to be monolithic,” says Jim Fan, an AI researcher at NVIDIA who was not involved in the study. “This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more tractable and transparent.”
The team thinks their system could help these machines accomplish household tasks, like putting away a book or putting a bowl in the dishwasher. Additionally, HiP could help complete multi-step construction and manufacturing tasks, such as stacking and placing different materials in specific sequences.
HiP evaluation
The CSAIL team tested HiP’s acuity on three manipulation tasks, outperforming comparable frameworks. The system reasons by developing intelligent plans that adapt to new information.
First, the researchers asked the robot to stack blocks of different colors on top of each other and then place others nearby. The catch: some of the correct colors weren’t present, so the robot had to place white blocks in a colored bowl to paint them. HiP often adapted to these changes accurately, especially compared with state-of-the-art task-planning systems such as Transformer BC and Action Diffuser, adjusting its plans to stack and place each block as needed.
Another test: placing objects such as candy and a hammer in a brown box while ignoring other objects. Some of the items to be moved were dirty, so HiP adjusted its plans to place them in a cleaning box first and then in the brown container. In a third demonstration, the robot was able to ignore unnecessary objects to complete kitchen subgoals, such as opening a microwave, clearing away a kettle, and turning on a light. Some of the prompted steps had already been completed, so the robot adapted by skipping those directions.
A three-level hierarchy
HiP’s three-pronged planning process works like a hierarchy, with the ability to pre-train each of its components on different datasets, including information from outside robotics. At the bottom of this order is a large language model (LLM), which starts the ideation process by capturing all the necessary symbolic information and developing an abstract task plan. Applying the common-sense knowledge it finds on the internet, the model breaks its objective into subgoals. For example, “make a cup of tea” becomes “fill a pot with water,” “boil the pot,” and the subsequent actions required.
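As a rough illustration of this language level, the subgoal decomposition could be prompted along the lines of the sketch below; the prompt wording, model choice, and parsing are assumptions for illustration, not the team’s actual setup:

```python
# Illustrative only: prompting a generic LLM to split a goal into subgoals.
# The prompt, model name, and parsing below are assumptions, not HiP's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def decompose_goal(goal: str) -> list[str]:
    """Ask an LLM for an ordered list of subgoals, one per line."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Break the task '{goal}' into a short, ordered list of "
                "concrete subgoals, one per line, with no numbering."
            ),
        }],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# e.g. decompose_goal("make a cup of tea") might return
# ["fill a pot with water", "boil the pot", "pour the water into a cup", ...]
```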
“All we want to do is take existing pre-trained models and have them successfully interface with each other,” says Anurag Ajay, a Ph.D. student in MIT’s Department of Electrical Engineering and Computer Science (EECS) and a CSAIL affiliate. “Instead of pushing for one model that does everything, we combine several that leverage different modalities of internet data. When used in tandem, they help with robotic decision-making and can potentially make tasks easier in homes, factories, and construction sites.”
These models also need some form of “eyes” to understand the environment they operate in and correctly execute each subgoal. The team used a large video diffusion model to augment the initial planning done by the LLM; the video model collects geometric and physical information about the world from footage on the internet. In turn, the video model generates an observation trajectory plan, refining the LLM’s outline to incorporate this new physical knowledge.
This process, known as iterative refinement, allows HiP to reason about its ideas, taking in feedback at each stage to generate a more practical plan. The flow of feedback is similar to writing an article, where an author may send a draft to an editor, and with those revisions incorporated, the editor reviews the latest changes and finalizes the piece.
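A minimal sketch of what such a refinement loop could look like, assuming hypothetical scoring and planning helpers (the paper’s actual consistency mechanism differs in its details):

```python
# Hypothetical sketch of iterative refinement between the language and video
# levels: the video model scores how physically plausible each candidate
# subgoal plan looks, and that feedback steers the next round of planning.
# Function names and the scoring scheme are assumptions for illustration.

def refine_plan(goal, observation, llm, video_model, rounds=3, candidates=4):
    best_plan, best_score = None, float("-inf")
    feedback = ""
    for _ in range(rounds):
        # Sample several candidate subgoal sequences, conditioned on feedback.
        plans = [llm.decompose(goal, feedback=feedback) for _ in range(candidates)]

        for plan in plans:
            # Ask the video model how plausible the imagined rollout looks.
            frames = video_model.predict_trajectory(observation, plan)
            score = video_model.likelihood(frames)
            if score > best_score:
                best_plan, best_score = plan, score

        # Summarize the outcome to condition the next round of proposals.
        feedback = f"best plan so far scored {best_score:.3f}; propose improvements"
    return best_plan
```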
In this case, the top of the hierarchy is an egocentric action model, which uses a sequence of first-person images to infer which actions should take place given the robot’s surroundings. During this stage, the video model’s observation plan is mapped onto the space visible to the robot, helping the machine decide how to execute each task within the long-term goal. If a robot uses HiP to make tea, this means it will have mapped out exactly where the pot, sink, and other key visual elements are, and will begin completing each subgoal.
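Read this way, the action level resembles an inverse-dynamics-style model: given the current egocentric frame and the next imagined frame, predict the action that connects them. The PyTorch sketch below uses a placeholder architecture under that assumption; it is not HiP’s actual model:

```python
# Illustrative inverse-dynamics-style action model: predict the action that
# takes the robot from one egocentric frame to the next imagined frame.
# The architecture and dimensions are placeholders, not HiP's actual design.
import torch
import torch.nn as nn

class EgocentricActionModel(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(          # tiny CNN over a stacked frame pair
            nn.Conv2d(6, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)  # e.g. a 7-DoF arm command

    def forward(self, frame_t, frame_next):
        # Concatenate the current and imagined next frame along channels.
        x = torch.cat([frame_t, frame_next], dim=1)
        return self.head(self.encoder(x))

def actions_from_plan(model, frames):
    """Turn a sequence of imagined egocentric frames into an action sequence."""
    return [model(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
```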
However, this multimodal work is limited by the lack of high-quality video foundation models. Once available, these could interface with HiP’s smaller video models to further improve visual sequence prediction and robot action generation. A higher-quality version would also reduce the video models’ current data requirements.
That said, the CSAIL team’s approach only used a small amount of data overall. Moreover, HiP was cheap to train and demonstrated the potential of using readily available foundation models to complete long-term tasks.
“What Anurag has demonstrated is a proof of concept of how we can take models trained on separate tasks and data modalities and combine them into models for robotic planning. In the future, HiP could be augmented with pre-trained models that can process touch and sound to make better plans,” says senior author Pulkit Agrawal, MIT assistant professor in EECS and director of the Improbable AI Lab. The group also plans to apply HiP to solving long-term real-world tasks in robotics.
Ajay and Agrawal are the lead and senior authors, respectively, of a paper describing the work. They are joined by MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; CSAIL research affiliate and MIT-IBM Watson AI Lab research manager Akash Srivastava; graduate students Seungwook Han and Yilun Du; former postdoc Abhishek Gupta, now an assistant professor at the University of Washington; and former graduate student Shuang Li, Ph.D.
More information:
Anurag Ajay et al., Compositional Foundation Models for Hierarchical Planning, arXiv (2023). DOI: 10.48550/arXiv.2309.08587
Provided by the Massachusetts Institute of Technology
This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and education.