Why humanoid robots are learning everyday tasks faster than expected


Last September roboticist Benjie Holson posted the “Humanoid Olympic Games”: a set of increasingly difficult tests for humanoid robots that he demonstrated himself while dressed in a silver bodysuit. The challenges, such as opening a door with a round doorknob, started out easy, at least for a human, and progressed to “gold medal” tasks such as properly buttoning and hanging up a men’s dress shirt and using a key to open a door.

Holson’s point was that the hard tasks aren’t the dazzling ones. While other competitions feature robots playing sports and dancing, Holson argued that the robots we actually want are the ones that can do laundry and cook meals.

He expected the challenges to take years to resolve. Instead, within months, robotics company Physical Intelligence completed 11 of the 15 challenges—from bronze to gold—with a robot that washed windows, spread peanut butter and used a dog poop bag.




Scientific American spoke to Holson about why vision-only, or camera-based, systems are outperforming his expectations and how close we are to a genuinely useful machine. He has since released a new, more difficult set of challenges.

[An edited transcript of the interview follows.]

You designed these challenges to be hard. Were you surprised by how quickly the results came in?

It was so much faster than I was expecting. When I chose the challenges, I was trying to calibrate them so some bronze ones would get done in the first month or two, then silver and gold in the next six months, and the most difficult ones might take a year or a year and a half. To have almost all of them done in the first three months is wild.

What made that possible?

I started with the premise that we have things that look impressive on a fairly narrow set of tasks—vision-only, no touch, simple manipulator, not incredible precision. That limits what you can be good at. I tried to think of tasks that would require us to break forward out of that set. It turns out I wildly underestimated what’s possible with vision-only and simple manipulators.

When I visited Physical Intelligence, I learned they don’t have any force sensing. They’re doing all of that 100 percent vision-based. The key-insertion task, the peanut butter spreading—I thought those would require force inputs. But apparently you just throw more video demonstrations at it, and it works.

How exactly do you train a robot to do that without coding it line by line?

It’s all learning from demonstration. Somebody teleoperates the robot doing the task hundreds of times, they train a model based on that, and then the robot can do the task.
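The idea of learning from demonstration can be sketched very roughly in code. This is not Physical Intelligence's actual system, which trains neural networks on camera video; here a nearest-neighbor lookup over recorded (observation, action) pairs stands in for the learned model, and all names and numbers are illustrative.

```python
# Toy sketch of learning from demonstration (behavior cloning).
# A nearest-neighbor lookup over teleoperation logs stands in for
# the neural network a real system would train.

def collect_demonstrations():
    """Pretend teleoperation log: observation -> operator's action."""
    return [
        ((0.0, 0.0), "reach"),
        ((0.5, 0.1), "grasp"),
        ((0.9, 0.4), "lift"),
    ]

def train(demos):
    """'Training' here is just storing the demos; a real model
    would fit network weights to imitate the operator."""
    return list(demos)

def policy(model, observation):
    """Pick the action whose recorded observation is closest."""
    def dist(obs):
        return sum((a - b) ** 2 for a, b in zip(obs, observation))
    nearest_obs, action = min(model, key=lambda pair: dist(pair[0]))
    return action

model = train(collect_demonstrations())
print(policy(model, (0.52, 0.12)))  # closest demo is "grasp"
```

The key property this toy shares with the real thing: nobody writes task logic; the policy is entirely determined by what the operator demonstrated.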

There is a lot of confusion about whether large language models (LLMs) are useless for robots. Are they?

I used to be fairly dubious of the utility of LLMs in robotics. The problem they were good at solving two or three years ago was high-level planning—“If I want to make tea, what are the steps?” Ordering the steps is the easy part. Picking up the teapot and filling it is the really challenging thing.

On the other hand, we’ve started doing vision-action models using the same transformer architecture [as that used in LLMs]. You can use transformers for text in, text out, images in, text out—but also images in, robot actions out.

The neat thing is they’re starting with models pretrained on text, images, maybe video. Before you even start training your specific task, the AI already understands what a teapot is, what water is, that you might want to fill a teapot with water. So while training your task, it doesn’t have to start from, “Let me figure out what geometry is.” It can start with, “I see, we’re moving teapots around”—which is wild that it works.
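The architectural point above—one pretrained backbone, different output "heads"—can be illustrated with a stub. There is no real transformer here, and every function name is made up; the sketch only shows how text-out and action-out models can share the same sequence-in interface.

```python
# Toy illustration: the same sequence-to-sequence backbone serves
# "images in, text out" and "images in, robot actions out" models.
# The backbone is a deterministic stub, not a real transformer.

def backbone(tokens):
    """Stand-in for a pretrained transformer: maps any token
    sequence (words, image patches) to one feature value."""
    return sum(len(t) for t in tokens)

def text_head(features):
    # An LLM decodes features back into words.
    return f"caption-{features % 7}"

def action_head(features):
    # A vision-action model decodes the same kind of features
    # into motor commands instead.
    return [features % 5 * 0.1, features % 3 * 0.1]

image_patches = ["patch0", "patch1", "patch2"]
features = backbone(image_patches)
print(text_head(features))    # images in, text out
print(action_head(features))  # images in, robot actions out
```

Because the backbone is shared, everything it learned during pretraining—what a teapot is, what water is—comes along for free when you bolt on the action head.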

How did you come up with the “Olympic” tasks?

So part of it was a challenge and part of it was a prediction. I tried to think of the next set of things that we can’t do now that someone’s going to be able to do soon.

Humans rely on touch to do things such as finding keys in a pocket. How do we get around that in robotics?

That’s a very good question we don’t know the answer to yet. Touch technology is way worse, more expensive, delicate and far behind cameras. Cameras, we’ve been working on for a long time.

The big question is: Are cameras enough? Both Physical Intelligence and Sunday Robotics [which completed the bronze-medal task of rolling matched socks] have made the bet that putting a camera on the wrist, very close to the fingers, lets you kind of see forces by seeing how everything smushes. When the robot grabs something, it sees the fingers have some rubber that deflects; the object deflects, and it infers forces from that. When smearing peanut butter on bread, the robot watches the knife deflect down and crush the bread and judges forces from that. It works way better than I expected.
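The deflection-to-force inference Holson describes amounts to treating the rubber fingertip like a spring and applying Hooke's law, F = kx. The stiffness and pixel-to-meter scale below are made-up numbers for illustration, not values from any real robot.

```python
# Sketch of inferring contact force from visually observed
# deflection: model the rubber fingertip as a spring (F = k * x).
# Stiffness and pixel scale are illustrative assumptions.

def force_from_deflection(deflection_px,
                          metres_per_px=0.0005,
                          stiffness_n_per_m=800.0):
    """Convert a deflection measured in camera pixels into an
    estimated contact force in newtons."""
    deflection_m = deflection_px * metres_per_px
    return stiffness_n_per_m * deflection_m

# A fingertip seen compressing by 6 pixels in the wrist camera:
print(round(force_from_deflection(6), 2))  # 6 * 0.0005 * 800 = 2.4 N
```

The hard part in practice is measuring that deflection reliably from video, which is why it was surprising that vision alone works so well.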

What about safety?

The energy needed to stay balanced is often quite high. If a robot is falling, that’s a very fast, hard acceleration to get the leg in front in time. Your system has to inject a lot of energy into the world—and that’s what’s unsafe.

I’m a huge fan of centaur robots—mobile wheeled base with arms and a head. For safety, that’s such an easier way to get there quickly. If a humanoid loses power, it’s going to fall down. The general plan seems to be to make a robot so incredibly valuable that we as a society create a new safety class for it, like bicycles and cars: they’re dangerous but so valuable that we tolerate the risk.

Have these results changed your time line?

I used to think home robots were at least 15 years away. Now I think at least six. The difference is I thought it would be much longer before doing a useful thing in a human space, even as a demo, was plausible.

But roboticists have seen time and again there’s a long road between “it worked in a lab and I got a video” and “I can sell a product.” Waymo was driving on roads in 2009; I couldn’t buy a ride until 2024. It takes a long time to get reliability squared away.

What’s the biggest bottleneck left?

Reliability and safety—the stuff Physical Intelligence shows is incredibly impressive, but if you put it on a different table with different lighting and use a different sock, it might not work. Each step toward generalization seems to take an order of magnitude more data, turning days of data collection into weeks or months.
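The scaling claim above can be made concrete with a little arithmetic. The two-day starting point is an illustrative assumption; the point is just how fast a tenfold-per-step requirement compounds.

```python
# If each step toward generalization needs ~10x more data,
# collection time grows by an order of magnitude per step.
# The 2-day starting point is an illustrative assumption.

base_days = 2
for step in range(4):
    days = base_days * 10 ** step
    print(f"step {step}: ~{days} robot-days of demonstrations")
```

Two days of demonstrations becomes 20 (weeks) after one step and 200 (months) after two, which is exactly the days-to-weeks-to-months pattern described above.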


