Grounding Language in Multi-Perspective Referential Communication

Zineng Tang, Lingjun Mao, Alane Suhr

2024-10-08

Summary

This paper presents a new task and dataset that helps robots or AI agents understand and generate language about objects in a shared environment, accounting for the fact that each agent may view the scene from a different perspective.

What's the problem?

In environments where multiple agents (like robots) interact, they need to communicate about objects and their locations. However, each agent may see the scene differently, making it hard for them to refer to objects accurately. Existing models struggle to generate and comprehend such references when paired with a partner.

What's the solution?

The authors created a dataset of 2,970 human-written phrases that describe objects and their spatial relationships in a scene. They evaluated automated models as both speakers (generating descriptions) and listeners (understanding descriptions) when paired with humans. While these models performed better than earlier approaches, they still lagged behind human pairs. To close this gap, they trained an open-weight speaker model using evidence of successful communication, raising the communicative success rate from 58.9% to 69.3%.
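The core training idea here, learning from evidence of communicative success, can be illustrated with a minimal sketch: keep only the speaker utterances that led the listener to pick the correct object, then use those as fine-tuning data. The episode records and field names below are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of "learning from communicative success".
# Each episode records what the speaker said, which object the
# listener picked, and which object was the intended target.
episodes = [
    {"utterance": "the red mug left of the lamp", "listener_choice": "mug_1", "target": "mug_1"},
    {"utterance": "that thing over there", "listener_choice": "vase_2", "target": "mug_1"},
    {"utterance": "the tall blue vase", "listener_choice": "vase_2", "target": "vase_2"},
]

def successful(episode):
    """An episode counts as successful if the listener resolved the reference."""
    return episode["listener_choice"] == episode["target"]

# Keep only the utterances that actually communicated the target;
# these would form the fine-tuning set for the speaker model.
training_set = [e["utterance"] for e in episodes if successful(e)]
print(training_set)
```

The design choice is that the success signal comes from the listener's behavior rather than from similarity to a reference description, so the speaker is optimized directly for being understood.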

Why it matters?

This research is important because it enhances how AI agents can work together in shared environments by improving their ability to communicate about objects. Better communication among agents can lead to more effective collaboration in various applications, such as robotics, gaming, and virtual reality.

Abstract

We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9% to 69.3% in communicative success and even outperforming the strongest proprietary model.