INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

Ji Ha Jang, Hoigi Seo, Se Young Chun

2024-09-11

Summary

This paper introduces INTRA, a new method for teaching AI systems to recognize the potential uses of objects (affordances) without needing detailed labeled data.

What's the problem?

Teaching AI systems to recognize how objects can be interacted with (their affordances) is challenging because traditional methods require detailed annotations for every object. Additionally, existing approaches need paired images of two kinds—exocentric (showing someone else interacting with the object) and egocentric (from the user's own viewpoint)—which are hard to collect at scale. This makes it difficult for AI to learn how to interact with diverse objects in different situations.

What's the solution?

To solve these problems, the authors developed INTRA, which learns from exocentric images alone using contrastive learning. Instead of requiring paired exocentric/egocentric datasets, INTRA compares different images to identify the features unique to each interaction. The authors also leverage vision-language model embeddings to generate affordance maps conditioned on arbitrary text descriptions, and improve robustness by augmenting the text with synonyms. Their approach outperformed previous methods on several benchmark datasets.
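The two core ideas—scoring image locations against a text embedding to get an affordance map, and a contrastive loss that pulls same-interaction examples together—can be sketched in toy form. This is a minimal illustration with made-up feature vectors, not the paper's actual architecture or loss; the function names (`affordance_map`, `info_nce`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def affordance_map(feature_map, text_emb):
    """Toy text-conditioned affordance map: per-location similarity
    between image features and a text embedding (e.g., 'hold')."""
    return [[cosine(f, text_emb) for f in row] for row in feature_map]

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: pull the anchor toward the
    positive (same interaction) and away from the negatives."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / tau) for s in sims]
    return -math.log(exps[0] / sum(exps))

# Toy interaction embeddings: two "hold" images (similar) vs. one "cut" image.
hold_a = [1.0, 0.2, 0.0]
hold_b = [0.9, 0.3, 0.1]
cut    = [0.0, 0.1, 1.0]

# Loss is lower when the positive really is the same interaction.
loss_matched  = info_nce(hold_a, hold_b, [cut])
loss_mismatch = info_nce(hold_a, cut, [hold_b])
print(loss_matched < loss_mismatch)  # matched pairing yields the smaller loss
```

The same similarity scoring extends to a spatial grid: applying `affordance_map` to a grid of per-location features highlights regions whose features align with the query text, which is the intuition behind grounding affordances with free-form language.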

Why it matters?

This research is important because it allows AI systems to learn about object interactions more efficiently without needing extensive labeled data. By improving how AI understands affordances, we can enhance its ability to navigate and interact with the real world, which is crucial for applications in robotics and virtual reality.

Abstract

Affordance denotes the potential interactions inherent in objects. The perception of affordance can enable intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance without costly pixel-level annotations, but with exocentric images. Although recent advances in weakly supervised affordance grounding yielded promising results, there remain challenges including the requirement for paired exocentric and egocentric image dataset, and the complexity in grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior arts, INTRA recasts this problem as representation learning to identify unique features of interactions through contrastive learning with exocentric images only, eliminating the need for paired datasets. Moreover, we leverage vision-language model embeddings for performing affordance grounding flexibly with any text, designing text-conditioned affordance map generation to reflect interaction relationship for contrastive learning and enhancing robustness with our text synonym augmentation. Our method outperformed prior arts on diverse datasets such as AGD20K, IIT-AFF, CAD and UMD. Additionally, experimental results demonstrate that our method has remarkable domain scalability for synthesized images / illustrations and is capable of performing affordance grounding for novel interactions and objects.