The system could make it easier to train different types of robots to complete tasks—machines ranging from mechanical arms to humanoid robots and driverless cars. It could also help make AI web agents, a next generation of AI tools that can carry out complex tasks with little supervision, better at scrolling and clicking, says Mohit Shridhar, a research scientist specializing in robotic manipulation, who worked on the project.
“You can use image-generation systems to do almost all the things that you can do in robotics,” he says. “We wanted to see if we could take all these amazing things that are happening in diffusion and use them for robotics problems.”
To teach a robot to complete a task, researchers normally train a neural network on an image of what’s in front of the robot. The network then spits out an output in a different format—the coordinates required to move forward, for example.
Genima’s approach is different because both its input and output are images, which is easier for the machines to learn from, says Ivan Kapelyukh, a PhD student at Imperial College London, who specializes in robot learning but wasn’t involved in this research.
“It’s also really great for users, because you can see where your robot will move and what it’s going to do. It makes it kind of more interpretable, and means that if you’re actually going to deploy this, you could see before your robot went through a wall or something,” he says.
Genima works by tapping into Stable Diffusion’s ability to recognize patterns (knowing what a mug looks like because it’s been trained on images of mugs, for example) and then turning the model into a kind of agent—a decision-making system.
First, the researchers fine-tuned Stable Diffusion to let them overlay data from robot sensors onto images captured by its cameras.
The system renders the desired action, like opening a box, hanging up a scarf, or picking up a notebook, into a series of colored spheres on top of the image. These spheres tell the robot where its joints should move one second in the future.
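To make the idea of drawing targets onto a camera image concrete, here is a minimal sketch in plain Python. It is not Genima's code: the function name `draw_joint_targets`, the joint names, the colors, and the tiny 16x16 frame are all assumptions for illustration, and flat discs stand in for the rendered spheres.

```python
# Hypothetical illustration only: names, colors, and sizes are assumptions,
# not taken from the Genima project.
def draw_joint_targets(image, targets, radius=2):
    """Return a copy of `image` with a filled disc drawn at each target.

    image:   list of rows, each row a list of (r, g, b) tuples.
    targets: dict mapping a joint name to ((x, y), (r, g, b)).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy rows so the input stays untouched
    for (cx, cy), color in targets.values():
        # Visit only the bounding box of the disc, clipped to the image.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    out[y][x] = color
    return out


# A blank 16x16 "camera frame" and two made-up joint targets.
frame = [[(0, 0, 0)] * 16 for _ in range(16)]
targets = {
    "elbow": ((4, 5), (255, 0, 0)),
    "gripper": ((11, 9), (0, 0, 255)),
}
annotated = draw_joint_targets(frame, targets)
```

Giving each joint its own color is what would let a later step tell the targets apart. In the real system the image model generates this overlay itself rather than being handed the coordinates.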
The second part of the process converts these spheres into actions. The team achieved this by using another neural network, called ACT, which is trained on the same data. Then they used Genima to complete 25 simulated and nine real-world manipulation tasks using a robot arm. The average success rates were 50% and 64%, respectively.
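As a rough illustration of what turning a drawn target back into a number involves, the sketch below finds the center of a colored blob by averaging pixel coordinates. This is a deliberately simple stand-in, not the researchers' method: Genima uses a learned network (ACT) for this step, and the function name `locate_target`, the color, and the 8x8 frame are assumptions.

```python
# Hypothetical stand-in: a hand-written centroid search, NOT the learned
# ACT controller the researchers used.
def locate_target(image, color):
    """Return the (x, y) centroid of all pixels equal to `color`, or None."""
    xs, ys = [], []
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel == color:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # no pixel of that color in the image
    return (sum(xs) / len(xs), sum(ys) / len(ys))


RED = (255, 0, 0)

# An 8x8 black frame with a 3x3 red square covering x = 3..5, y = 2..4.
frame = [[(0, 0, 0)] * 8 for _ in range(8)]
for y in range(2, 5):
    for x in range(3, 6):
        frame[y][x] = RED

center = locate_target(frame, RED)
```

A hand-written rule like this breaks as soon as a target is partly hidden or the generated colors are slightly off, which is one reason a trained controller is a more plausible fit for the job.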