Robots that learn from videos of human activities and simulated interactions

Optimistic science fiction typically imagines a future where humans create art and pursue fulfilling pastimes while AI-enabled robots handle dull or dangerous tasks. In contrast, the AI systems of today display increasingly sophisticated generative abilities on ostensibly creative tasks. But where are the robots? This gap is known as Moravec’s paradox, the thesis that the hardest problems in AI involve sensorimotor skills, not abstract thought or reasoning. To put it another way, “The hard problems are easy, and the easy problems are hard.”

Today, we are announcing two major advancements toward general-purpose embodied AI agents capable of performing challenging sensorimotor skills:

  • An artificial visual cortex (called VC-1): a single perception model that, for the first time, supports a diverse range of sensorimotor skills, environments, and embodiments. VC-1 is trained on videos of people performing everyday tasks from the groundbreaking Ego4D dataset created by Meta AI and academic partners, and it matches or outperforms the best known results on 17 different sensorimotor tasks in virtual environments.
  • A new approach called adaptive (sensorimotor) skill coordination (ASC), which achieves near-perfect performance (98 percent success) on the challenging task of robotic mobile manipulation (navigating to an object, picking it up, navigating to another location, placing the object, repeating) in physical environments.
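To make the "artificial visual cortex" idea concrete, here is a minimal, hypothetical sketch: one frozen, pretrained perception model produces an embedding that multiple task-specific policy heads share. The class names, the fixed random projection standing in for learned weights, and the toy decision rule are all illustrative assumptions, not VC-1's actual architecture or API.

```python
# Sketch of a shared, frozen visual encoder feeding multiple skill heads.
# All names and shapes here are hypothetical placeholders.
import random

EMBED_DIM = 8

class FrozenEncoder:
    """Stand-in for a pretrained visual backbone (e.g., trained on egocentric
    video). Its weights stay fixed; downstream skills never update them."""
    def __init__(self, input_dim, seed=0):
        rng = random.Random(seed)
        # Fixed random projection as a placeholder for learned weights.
        self.w = [[rng.uniform(-1, 1) for _ in range(input_dim)]
                  for _ in range(EMBED_DIM)]

    def embed(self, pixels):
        # One embedding per observation, reused by every skill.
        return [sum(wi * p for wi, p in zip(row, pixels)) for row in self.w]

class PolicyHead:
    """Small per-skill head trained on top of the shared embedding."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, embedding):
        # Placeholder decision rule; a real head would be a learned network.
        idx = int(abs(sum(embedding))) % len(self.actions)
        return self.actions[idx]

encoder = FrozenEncoder(input_dim=4)           # shared "visual cortex"
grasp_head = PolicyHead(["open", "close"])     # one head per sensorimotor skill
nav_head = PolicyHead(["left", "right", "forward"])

obs = [0.2, 0.5, 0.1, 0.9]                     # toy image observation
z = encoder.embed(obs)                         # embed once...
print(grasp_head.act(z), nav_head.act(z))      # ...use for multiple skills
```

The design point the sketch captures is reuse: because the encoder is frozen and shared, adding a new skill means training only a small head, not a new perception model.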
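The mobile manipulation loop described above (navigate, pick, navigate, place, repeat) can be sketched as a coordinator that sequences individual skills and reacts when one fails. The skill functions, their return convention, and the retry logic below are illustrative assumptions, not the actual ASC system.

```python
# Hypothetical sketch of coordinating sensorimotor skills for mobile
# pick-and-place. Each skill reports success or failure, and the
# coordinator retries on failure (a real system would also pick a
# recovery behavior based on sensed state).

def navigate_to(target):
    print(f"navigating to {target}")
    return True  # placeholder: a real skill returns whether it arrived

def pick(obj):
    print(f"picking {obj}")
    return True  # placeholder: a real skill returns whether the grasp held

def place(obj, receptacle):
    print(f"placing {obj} in {receptacle}")
    return True  # placeholder: a real skill returns whether placement worked

def pick_and_place(obj, source, destination, max_retries=3):
    """Chain skills in order; retry the whole sequence if any step fails."""
    for _ in range(max_retries):
        if (navigate_to(source) and pick(obj)
                and navigate_to(destination) and place(obj, destination)):
            return True
    return False

done = pick_and_place("mug", "table", "bin")
```

Even this toy version shows why coordination matters: the hard part is not any single skill but deciding, from feedback, when to move on to the next one and what to do when a step fails.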

Data powers both of these breakthroughs. AI needs data to learn from — and, specifically, embodied AI needs data that captures interactions with the environment. Traditionally, this interaction data is obtained either by gathering large numbers of demonstrations or by letting the robot learn from its own interactions from scratch. Both approaches are too resource-intensive to scale toward learning a general embodied AI agent. In both of these works, we are developing new ways for robots to learn, using videos of human interactions with the real world and simulated interactions within photorealistic simulated worlds.

Read more here: