Even without much prior experience, kids can recognize other people’s intentions and come up with plans to help them achieve their goals, even in novel scenarios. By contrast, the most sophisticated AI systems to date still struggle with basic social interactions. That’s why researchers at MIT, Nvidia, and ETH Zurich developed Watch-And-Help (WAH), a challenge in which embodied AI agents need to understand a goal by watching a demonstration of a human performing a task and then coordinate with the human to solve the task as quickly as possible.
The concept of embodied AI draws on embodied cognition, the theory that many features of psychology — human or otherwise — are shaped by aspects of the entire body of an organism. By applying this logic to AI, researchers hope to improve the performance of AI systems like chatbots, robots, autonomous vehicles, and even smart speakers that interact with their environments, people, and other AI. A truly embodied robot could check to see whether a door is locked, for instance, or retrieve a smartphone that’s ringing in an upstairs bedroom.
In the first stage of WAH, which the researchers call the Watch stage, an AI agent observes a humanlike agent perform a task and infers a goal from its actions. In the second stage, the Help stage, the AI agent assists the humanlike agent in achieving the same goal in a completely different environment. The researchers assert that this two-stage framework poses unique challenges for human-AI collaboration because the AI agent has to reason about the humanlike agent’s intention and generalize its knowledge about the goal.
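To make the two stages concrete, here is a minimal Python sketch. It is not the researchers' goal model: the representation of a goal as counted (object, relation, target) predicates, the action tuples, and the function names `infer_goal` and `remaining_subgoals` are all assumptions made for illustration.

```python
from collections import Counter

def infer_goal(demonstration):
    """Watch stage (toy): count each placement seen in the demonstration.

    Assumption: a demonstration is a list of (action, object, target) tuples
    and only "put" actions reveal what the demonstrator wants.
    """
    goal = Counter()
    for action, obj, target in demonstration:
        if action == "put":
            goal[(obj, "on", target)] += 1
    return goal

def remaining_subgoals(goal, state):
    """Help stage (toy): subgoals not yet satisfied in the new environment.

    Counter subtraction keeps only positive counts, so anything already
    satisfied in `state` drops out.
    """
    return goal - state

# Hypothetical demonstration: two plates and one fork end up on the table.
demo = [
    ("grab", "plate", "cabinet"),
    ("put", "plate", "table"),
    ("grab", "fork", "drawer"),
    ("put", "fork", "table"),
    ("grab", "plate", "cabinet"),
    ("put", "plate", "table"),
]
goal = infer_goal(demo)

# Hypothetical new environment in which one plate is already on the table.
new_state = Counter({("plate", "on", "table"): 1})
todo = remaining_subgoals(goal, new_state)
```

Because the inferred goal is stated in terms of objects and targets rather than specific positions, the same goal can be checked against a different home layout, which is the kind of generalization the Help stage asks for.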
To enable the kinds of interactions involved in WAH, the researchers had to extend the open source platform VirtualHome and build a multi-agent environment dubbed VirtualHome-Social. VirtualHome-Social simulates home settings so agents can interact with different objects and agents, for example opening a container or grabbing a utensil from a drawer. VirtualHome-Social also provides built-in agents that emulate human behaviors and an interface for human players. This enables testing with real humans and human activities displayed in semi-realistic environments.
The humanlike agent is a built-in agent in VirtualHome-Social. It plans its actions based on a goal and its observation of the environment. During the Help stage, the AI agent receives observations from the system at each step and sends an action command back to control a virtual avatar. Meanwhile, the humanlike agent (which can also be controlled by a human) updates its plan based on its latest observation to reflect any state change caused by the AI agent.
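The step-by-step exchange described above can be sketched as a simple loop. This is a toy model under stated assumptions, not the VirtualHome-Social API: `plan_next` and `run_help_stage` are hypothetical names, every item takes exactly one step to handle, and the helper is assumed to know which item the humanlike agent is currently targeting.

```python
def plan_next(remaining, claimed=None):
    """Pick the first remaining item that the other agent has not claimed."""
    for item in sorted(remaining):
        if item != claimed:
            return item
    return None

def run_help_stage(items, with_helper=True, max_steps=20):
    """Return the number of steps needed to handle every item."""
    remaining = set(items)
    steps = 0
    while remaining and steps < max_steps:
        # The humanlike agent replans from its latest observation, so it
        # never targets an item the helper already handled.
        human_target = plan_next(remaining)
        # The AI agent receives an observation and sends back one action.
        helper_target = None
        if with_helper:
            helper_target = plan_next(remaining, claimed=human_target)
        for target in (human_target, helper_target):
            if target is not None:
                remaining.discard(target)
        steps += 1
    return steps

items = ["cup", "fork", "plate", "spoon"]
steps_alone = run_help_stage(items, with_helper=False)
steps_together = run_help_stage(items, with_helper=True)
```

Comparing `steps_alone` with `steps_together` gives a rough sense of how a helper could shorten a task. In the real challenge the helper has to infer what the humanlike agent wants rather than being told.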
The researchers designed an evaluation protocol and provided benchmarks for WAH, including a goal model for the Watch stage and multiple planning and machine learning baselines for the Help stage. The team says results indicate that to achieve success in WAH, AI agents must acquire strong social perception and generalizable helping strategies — as hypothesized.
“Our ultimate goal is to build AI agents that can work with real humans. Our platform opens up exciting directions of future work, such as online goal inference and direct communication between agents,” the researchers wrote. “We hope that the proposed challenge and virtual environment can promote future research on building more sophisticated machine social intelligence.”