In-Context Imitation Learning via Next-Token Prediction
Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, Ken Goldberg
2024-08-29

Summary
This paper introduces In-Context Imitation Learning via Next-Token Prediction, a method for teaching robots to perform new tasks from a few demonstrations provided at test time, without updating any model weights.
What's the problem?
Teaching robots to do new tasks usually requires task-specific programming or retraining, which is time-consuming and complex. Existing methods often rely on language instructions or reward functions, which can be ambiguous or hard to specify, especially for tasks involving fine-grained physical actions.
What's the solution?
The authors propose the In-Context Robot Transformer (ICRT), a causal transformer that learns new tasks directly from sensorimotor data (sequences of image observations, robot states, and actions) instead of relying on language. At test time, the model is prompted with a few teleoperated demonstration trajectories of the new task and then autoregressively predicts the next actions the robot should take, so no parameter updates are needed; a minimal sketch of this idea appears below. Experiments with a Franka Emika robot showed that ICRT could learn and perform new tasks even in environment configurations different from those it was trained on.
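To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of an ICRT-style model: a causal transformer that autoregressively predicts the next action from a sequence of interleaved image, state, and action tokens. The encoder choices, dimensions, and module names are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class InContextPolicy(nn.Module):
    """Causal transformer over interleaved (image, state, action) tokens."""

    def __init__(self, d_model=512, n_layers=8, n_heads=8, state_dim=7, action_dim=7):
        super().__init__()
        # Assumed encoders: a small CNN for images, linear layers for states/actions.
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        self.state_enc = nn.Linear(state_dim, d_model)
        self.action_enc = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, images, states, actions):
        # images: (B, T, 3, H, W); states: (B, T, state_dim); actions: (B, T, action_dim)
        B, T = states.shape[:2]
        img_tok = self.image_enc(images.flatten(0, 1)).view(B, T, -1)
        # Interleave per timestep as [img_t, state_t, action_t, img_{t+1}, ...].
        tokens = torch.stack(
            [img_tok, self.state_enc(states), self.action_enc(actions)], dim=2
        ).view(B, 3 * T, -1)
        # Causal mask: each token attends only to earlier tokens (next-token prediction).
        mask = torch.triu(
            torch.full((3 * T, 3 * T), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.backbone(tokens, mask=mask)
        # Read the action prediction for step t off the state token at step t,
        # which sees (img_t, state_t) and all earlier steps but not action_t.
        return self.action_head(h[:, 1::3, :])
```

Because the same next-token objective covers both the prompt demonstrations and the live rollout, "learning" a new task reduces to conditioning on a longer context rather than fine-tuning.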
Why it matters?
This research is important because it simplifies how robots can learn new skills, making them more flexible and capable of handling various tasks in real-world environments. By allowing robots to learn through observation rather than complex programming, this approach can lead to more efficient and effective robotic systems in fields like manufacturing, healthcare, and service industries.
Abstract
We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose the In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor trajectories without relying on any linguistic data or reward function. This formulation enables flexible and training-free execution of new tasks at test time, achieved by prompting the model with sensorimotor trajectories of the new task, composed of image observation, action, and state tuples, collected through human teleoperation. Experiments with a Franka Emika robot demonstrate that ICRT can adapt to new tasks specified by prompts, even in environment configurations that differ from both the prompt and the training data. In a multi-task environment setup, ICRT significantly outperforms current state-of-the-art next-token prediction models in robotics at generalizing to unseen tasks. Code, checkpoints, and data are available at https://icrt.dev/.
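As a concrete illustration of the prompting procedure the abstract describes, here is a hypothetical usage sketch building on the InContextPolicy stub above. The tensor shapes, the 7-dimensional state and action spaces, and the zero placeholder for the not-yet-taken action are all assumptions, not the authors' interface.

```python
import torch

# Assumes the InContextPolicy sketch above; all shapes here are illustrative.
model = InContextPolicy()
model.eval()

# Prompt: 64 teleoperated demonstration steps of the new task,
# given as (image, state, action) tuples.
prompt_imgs = torch.randn(1, 64, 3, 96, 96)    # stand-in for camera frames
prompt_states = torch.randn(1, 64, 7)
prompt_actions = torch.randn(1, 64, 7)

# Current live observation. The action slot for this step is a zero
# placeholder; the causal mask guarantees it cannot leak into the prediction.
obs_img = torch.randn(1, 1, 3, 96, 96)
obs_state = torch.randn(1, 1, 7)

with torch.no_grad():
    preds = model(
        torch.cat([prompt_imgs, obs_img], dim=1),
        torch.cat([prompt_states, obs_state], dim=1),
        torch.cat([prompt_actions, torch.zeros(1, 1, 7)], dim=1),
    )
# The last entry is the action predicted for the current timestep; in a real
# rollout it would be executed, and the observed step appended to the context.
next_action = preds[:, -1]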