The new model, called RFM-1, was trained on years of data collected from Covariant’s small fleet of item-picking robots that customers like Crate & Barrel and Bonprix use in warehouses around the world, as well as words and videos from the internet. In the coming months, the model will be released to Covariant customers. The company hopes the system will become more capable and efficient as it’s deployed in the real world.
So what can it do? In a demonstration I attended last week, Covariant cofounders Peter Chen and Pieter Abbeel showed me how users can prompt the model using five different types of input: text, images, video, robot instructions, and measurements.
For example, show it an image of a bin filled with sports equipment, and tell it to pick up the pack of tennis balls. The robot can then grab the item, generate an image of what the bin will look like after the tennis balls are gone, or create a video showing a bird’s-eye view of how the robot will look doing the task.
If the model predicts it won’t be able to properly grasp the item, it might even type back, “I can’t get a good grip. Do you have any tips?” A response could advise it to use a specific number of the suction cups on its arms to give it better a grasp—eight versus six, for example.
This represents a leap forward, Chen told me, in robots that can adapt to their environment using training data rather than the complex, task-specific code that powered the previous generation of industrial robots. It’s also a step toward worksites where managers can issue instructions in human language without concern for the limitations of human labor. (“Pack 600 meal-prep kits for red pepper pasta using the following recipe. Take no breaks!”)
Lerrel Pinto, a researcher who runs the general-purpose robotics and AI lab at New York University and has no ties to Covariant, says that even though roboticists have built basic multimodal robots before and used them in lab settings, deploying one at scale that’s able to communicate in this many modes marks an impressive feat for the company.
To outpace its competitors, Covariant will have to get its hands on enough data for the robot to become useful in the wild, Pinto told me. Warehouse floors and loading docks are where it will be put to the test, constantly interacting with new instructions, people, objects, and environments.
“The groups which are going to train good models are going to be the ones that have either access to already large amounts of robot data or capabilities to generate those data,” he says.