To teach an AI agent a new task, such as opening a kitchen cabinet, researchers often use reinforcement learning, a trial-and-error process in which the agent is rewarded for taking actions that bring it closer to its goal.
In many cases, a human expert must carefully design a reward function, which is an incentive mechanism that motivates the agent to explore. The human expert must update this reward function iteratively as the agent explores and tries different actions. This can be time-consuming, inefficient, and difficult to scale, especially when the task is complex and involves many steps.
Researchers from MIT, Harvard University and the University of Washington have developed a new approach to reinforcement learning that does not rely on an expert-designed reward function. Instead, it leverages crowdsourced feedback, collected from many non-expert users, to guide the agent in learning to achieve its goal. The work has been published on the preprint server arXiv.
While other methods also attempt to use feedback from non-experts, this new approach allows the AI agent to learn faster, despite the fact that the data collected from users is often full of errors. This noisy data can cause other methods to fail.
Additionally, this new approach allows feedback to be collected asynchronously, so that non-expert users around the world can contribute to the agent’s learning.
“Today, one of the longest and most difficult parts of designing a robotic agent is engineering the reward function. Today, reward functions are designed by expert researchers, a paradigm that is not scalable if we want to teach our robots many different tasks. Our work proposes a way to scale robot learning by crowdsourcing the design of the reward function and allowing non-experts to provide useful feedback,” says Pulkit Agrawal, assistant professor in the MIT Department of Electrical Engineering and Computer Science (EECS), who directs the Improbable AI Lab at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
In the future, this method could help a robot quickly learn to perform specific tasks in a user’s home, without the owner needing to show the robot physical examples of each task. The robot could explore on its own, with crowdsourced non-expert feedback guiding its exploration.
“In our method, the reward function guides the agent toward what it should explore, instead of telling it exactly what it should do to accomplish the task. Thus, even if human supervision is somewhat imprecise and noisy, the agent is still able to explore, which helps it learn better,” explains lead author Marcel Torne, research assistant at the Improbable AI Lab.
Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; as well as others at the University of Washington and MIT. The research will be presented at the Neural Information Processing Systems Conference next month.
Noisy feedback
One way to collect user feedback for reinforcement learning is to show a user two photos of states the agent has reached, then ask that user which one is closer to the goal. For example, perhaps a robot’s goal is to open a kitchen cabinet. One image might show the robot opening the cabinet, while the second might show it opening the microwave. The user would choose the photo showing the “better” state.
Some previous approaches attempt to use this binary, crowdsourced feedback to optimize a reward function that the agent then uses to learn the task. However, because non-experts are likely to make errors, the reward function can become very noisy, so the agent may get stuck and never achieve its goal.
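The article does not give implementation details, but the kind of comparison query and preference-learned reward it describes can be sketched roughly as follows. Everything here (the function names, the linear reward model, the simulated labeler) is an illustrative assumption rather than the authors' code; it only shows how binary “which state is closer to the goal?” answers are typically turned into a learned reward, and how label noise feeds straight into that reward.

```python
# Minimal sketch (assumptions, not the authors' code) of the binary
# comparison queries described above: a labeler sees two achieved states
# and picks the one that looks closer to the goal. Prior approaches
# typically fit a reward model to such comparisons (Bradley-Terry style);
# noisy labels make that learned reward unreliable.
import numpy as np

rng = np.random.default_rng(0)

def ask_labeler(state_a, state_b, goal, error_rate=0.2):
    """Simulate a non-expert label: which state is closer to the goal?
    With probability `error_rate` the answer is flipped (noisy feedback)."""
    correct = 0 if np.linalg.norm(state_a - goal) < np.linalg.norm(state_b - goal) else 1
    return correct if rng.random() > error_rate else 1 - correct

def fit_preference_reward(pairs, labels, dim, lr=0.1, steps=500):
    """Fit a linear reward r(s) = w.s to binary comparisons by logistic
    regression on reward differences (the usual preference-learning recipe)."""
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for (sa, sb), y in zip(pairs, labels):
            diff = sa - sb                       # prefer a over b if w.diff > 0
            p = 1.0 / (1.0 + np.exp(-w @ diff))  # model's P(labeler prefers a)
            grad += (p - (1 - y)) * diff         # y == 0 means "a preferred"
        w -= lr * grad / len(pairs)
    return w

# Toy usage: 2-D states, goal at the origin.
goal = np.zeros(2)
states = rng.normal(size=(50, 2))
pairs = [(states[i], states[j]) for i, j in rng.integers(0, 50, size=(100, 2))]
labels = [ask_labeler(a, b, goal) for a, b in pairs]
w = fit_preference_reward(pairs, labels, dim=2)
print("learned reward weights:", w)
```

With a higher error rate, the fitted weights drift away from the true goal direction, which is the failure mode the researchers describe: an agent that optimizes such a reward directly can get stuck.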
“Basically, the agent would take the reward function too seriously. It would try to match the reward function perfectly. So instead of directly optimizing the reward function, we just use it to tell the robot which areas it should explore,” says Torne.
He and his collaborators decoupled the process into two distinct parts, each driven by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).
On one hand, a goal selection algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, non-expert users leave a trail of breadcrumbs that gradually leads the agent toward its goal.
On the other hand, the agent explores on its own, in a self-supervised manner guided by the goal selector. It collects images or videos of the actions it attempts, which are then sent to humans and used to update the goal selector.
This narrows the area the agent has to explore, steering it toward more promising regions closer to its goal. But if there is no feedback, or if the feedback takes time to arrive, the agent will continue to learn on its own, albeit more slowly. This allows feedback to be collected infrequently and asynchronously.
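A rough sketch of that decoupled structure might look like the following. The GoalSelector class, the placeholder explore_from function, and the scoring rule are all hypothetical simplifications for illustration, not the released HuGE code; the point is only that feedback is drained from a queue whenever it happens to arrive, while the exploration loop never waits for it.

```python
# Minimal sketch (an assumption about structure, not the authors' implementation)
# of the decoupled loop described above: a goal selector is nudged whenever
# crowdsourced comparisons arrive, while the agent keeps exploring in a
# self-supervised way toward the goals it is handed, even with no new feedback.
import random
import queue

class GoalSelector:
    """Scores candidate states; human comparisons only nudge these scores.
    They are guidance for exploration, never a reward to optimize directly."""
    def __init__(self):
        self.scores = {}

    def update(self, preferred, rejected):
        self.scores[preferred] = self.scores.get(preferred, 0.0) + 1.0
        self.scores[rejected] = self.scores.get(rejected, 0.0) - 1.0

    def pick_goal(self, frontier):
        # Steer exploration toward the most promising state visited so far.
        return max(frontier, key=lambda s: self.scores.get(s, 0.0))

def explore_from(goal):
    """Placeholder for self-supervised, goal-conditioned exploration:
    the agent tries to reach `goal` and returns the new states it visited."""
    return [f"{goal}->step{random.randint(0, 9)}"]

feedback_queue = queue.Queue()   # in a real system, a web interface would push
selector = GoalSelector()        # (preferred, rejected) comparisons in here
frontier = ["start"]

for step in range(1000):
    # Drain whatever feedback happens to have arrived (possibly nothing).
    while not feedback_queue.empty():
        preferred, rejected = feedback_queue.get_nowait()
        selector.update(preferred, rejected)
    # Exploration never blocks on humans: pick a goal and keep learning.
    goal = selector.pick_goal(frontier)
    frontier.extend(explore_from(goal))
```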
“The exploration loop can keep going on its own, because it is just going to explore and learn new things. And then when you get a better signal, it will explore in more concrete ways. You can just let them run at their own pace,” adds Torne.
And because feedback only gently guides the agent’s behavior, the agent will eventually learn to complete the task even if users provide incorrect answers.
Faster learning
The researchers tested this method on a number of simulated and real-world tasks. In simulation, they used HuGE to efficiently learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.
In real-world testing, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these tests, they collected data from 109 non-expert users in 13 different countries spanning three continents.
In real and simulated experiments, HuGE helped agents learn to achieve their goal faster than other methods.
The researchers also found that data from non-experts produced better performance than synthetic data, produced and labeled by the researchers. For non-expert users, labeling 30 images or videos took less than two minutes.
“This makes the method very promising in terms of being able to scale up,” adds Torne.
In a related paper, which the researchers presented at the recent Conference on Robot Learning, they improved HuGE so that an AI agent can learn to perform a task and then autonomously reset the environment to continue learning. For example, if the agent learns to open a cabinet, the method also guides it in closing the cabinet.
“We can now have it learn completely autonomously without the need for human resetting,” he says.
The researchers also point out that, in this as in other learning approaches, it is essential to ensure that AI agents are aligned with human values.
In the future, they want to continue perfecting HuGE so that the agent can learn other forms of communication, such as natural language and physical interactions with the robot. They also want to apply this method to train several agents at once.
More information:
Marcel Torne et al, Breadcrumbs to the Goal: Goal-Conditioned Exploration from Human-in-the-Loop Feedback, arXiv (2023). DOI: 10.48550/arXiv.2307.11049
Provided by the Massachusetts Institute of Technology
This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news in MIT research, innovation and education.