
ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki

2024-06-24

Summary

This paper introduces a new method called In-Context Abstraction Learning (ICAL) that helps language and vision-language models (LLMs and VLMs) learn better from less-than-perfect examples. Instead of needing perfect demonstrations, these models can create their own useful examples from sub-optimal ones, improving their decision-making abilities.

What's the problem?

Large language and vision-language models usually need high-quality in-context examples to perform well on decision-making tasks. Obtaining such expert demonstrations is difficult and costly, while the sub-optimal demonstrations that are easy to collect tend to limit performance. The challenge is to find a way for these models to learn effectively from imperfect examples.

What's the solution?

The researchers developed ICAL, which lets models take noisy or imperfect demonstrations and turn them into useful learning examples. A vision-language model rewrites each trajectory, correcting inefficient actions and annotating abstractions such as task relationships, object state changes, and temporal subgoals. These abstractions are then refined through human feedback while the agent attempts the task in a similar environment. Over time, the agent builds a memory of effective strategies that it can retrieve as in-context examples for new tasks.
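
To make this loop concrete, here is a minimal sketch in Python of how such an abstraction-and-feedback cycle could be organized. The interfaces used (vlm.generate, env.execute_and_collect_feedback, memory.add) are illustrative assumptions, not the paper's actual code.

def ical_learn(noisy_demo, env, vlm, memory, max_rounds=3):
    """Turn one sub-optimal demonstration into a reusable in-context exemplar (sketch)."""
    # 1) Abstraction phase: the VLM rewrites the trajectory, fixing inefficient
    #    actions and annotating task relationships, object state changes,
    #    temporal subgoals, and task construals.
    abstraction = vlm.generate(
        "Revise this trajectory and annotate useful abstractions:\n" + noisy_demo
    )

    # 2) Human-in-the-loop phase: execute the abstracted plan in a similar
    #    environment and revise it whenever a human flags a mistake.
    for _ in range(max_rounds):
        feedback = env.execute_and_collect_feedback(abstraction)
        if feedback is None:  # no corrections needed
            break
        abstraction = vlm.generate(
            "Revise the plan given this feedback:\n" + feedback
            + "\n\nPlan:\n" + abstraction
        )

    # 3) Store the refined abstraction so future tasks can retrieve it
    #    as an in-context example.
    memory.add(abstraction)
    return abstraction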

Why it matters?

This research is significant because it shows that models can improve their learning capabilities without relying solely on expert-crafted examples. By enabling LLMs and VLMs to generate their own learning examples, the method enhances their efficiency and effectiveness in real-world applications, making them more adaptable and capable of handling diverse tasks.

Abstract

Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.
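
As one way to picture the retrieval-augmented agent described in the abstract, the sketch below retrieves the stored abstractions most similar to the current task and prepends them to the prompt as exemplars. The embedding and memory interfaces are assumptions for illustration, not the authors' implementation.

import numpy as np

def act(task_instruction, observation, vlm, memory, k=5):
    """Choose the next action, prompting with the k most relevant stored exemplars (sketch).

    `memory` is assumed to be a list of (embedding, exemplar_text) pairs.
    """
    query = np.asarray(vlm.embed(task_instruction))

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Rank stored exemplars by cosine similarity to the current instruction.
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    exemplars = [text for _, text in ranked[:k]]

    # Prepend the retrieved abstractions as in-context examples.
    prompt = "\n\n".join(exemplars) + (
        "\n\nTask: " + task_instruction
        + "\nObservation: " + observation
        + "\nNext action:"
    )
    return vlm.generate(prompt)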