OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin

2026-02-09

Summary

This paper introduces a new way to test how well AI agents powered by Large Language Models can discover hidden rules on their own and use them to plan over long stretches of interaction in complex environments.

What's the problem?

Currently, we mostly test AI agents by giving them explicit rules and fixed goals, then checking whether they can follow them. That doesn't tell us whether an agent can *learn* how the world works on its own, infer hidden rules from experience, and use that knowledge to make plans that stretch far into the future. It's like handing a student the answers to a test instead of checking whether they understand the material.

What's the solution?

The researchers built a testing environment called OdysseyArena. Instead of handing the agent the rules, it forces the agent to actively explore and discover how things work through trial and error over many steps. They formalized four basic "primitives" that turn abstract transition dynamics into concrete interactive scenarios where agents have to figure out the underlying patterns to succeed. On top of this, they released OdysseyArena-Lite, a standardized set of 120 tasks for comparing different AI models, and OdysseyArena-Challenge, a much harder version that stress-tests agents over extremely long interactions of more than 200 steps.
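To make the idea of "inductive interaction" concrete, here is a minimal Python sketch of the kind of loop involved: an agent is never told the environment's transition rule and has to infer it from trial and error, then exploit it to reach a goal. The class names, actions, and reward structure here are hypothetical illustrations, not the OdysseyArena API.

```python
# Minimal sketch (hypothetical, not the OdysseyArena API): an agent must
# induce a hidden transition rule purely from interaction, then exploit it.
import random


class HiddenRuleEnv:
    """Toy environment with a latent transition law the agent is never told.

    Hidden rule (unknown to the agent): action 'a' adds +2 to the state,
    action 'b' subtracts 1.
    """

    def __init__(self, goal=10, max_steps=200):
        self.goal = goal
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.state = 0
        self.steps = 0
        return self.state

    def step(self, action):
        # Latent dynamics: only observable through the outcomes of actions.
        self.state += 2 if action == "a" else -1
        self.steps += 1
        done = self.state >= self.goal or self.steps >= self.max_steps
        return self.state, done


class InductiveAgent:
    """Explores first, estimates each action's effect, then acts greedily."""

    def __init__(self, actions=("a", "b"), explore_steps=20):
        self.actions = actions
        self.explore_steps = explore_steps
        self.effects = {a: [] for a in actions}  # observed state deltas

    def act(self, state, t):
        if t < self.explore_steps:  # active exploration phase
            return random.choice(self.actions)

        def avg_delta(a):
            # Average observed change in state for action a (0 if never tried).
            return sum(self.effects[a]) / len(self.effects[a]) if self.effects[a] else 0.0

        # Exploit the induced rule: pick the action with the best average effect.
        return max(self.actions, key=avg_delta)

    def observe(self, action, delta):
        self.effects[action].append(delta)


if __name__ == "__main__":
    env, agent = HiddenRuleEnv(), InductiveAgent()
    state, done, t = env.reset(), False, 0
    while not done:
        action = agent.act(state, t)
        next_state, done = env.step(action)
        agent.observe(action, next_state - state)  # learn the transition law
        state, t = next_state, t + 1
    print(f"reached state {state} in {t} steps")
```

In OdysseyArena the hidden dynamics are far richer and the horizon can run past 200 steps, but the basic explore-induce-exploit loop is the same idea: the agent has to build its own model of the rules before it can plan with them.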

Why it matters?

This work is important because it highlights a major weakness in current AI agents: they struggle to learn from experience and plan effectively for the long term. Improving this ability is crucial for building truly autonomous AI that can handle real-world problems, which are rarely simple and require understanding how things change over time.

Abstract

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena.
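The abstract frames results in terms of inductive efficiency and stability over very long horizons. As a rough illustration of how per-task interaction logs might be summarized into such numbers, the sketch below computes a success rate and a simple efficiency score. The metric names, formulas, and log fields are assumptions for illustration only, not the paper's definitions.

```python
# Hypothetical scoring sketch: aggregate per-task logs into a success rate and
# an "inductive efficiency" score (both defined here for illustration only).
from statistics import mean


def score(task_logs):
    """task_logs: list of dicts like {"solved": bool, "steps": int, "budget": int}."""
    success_rate = mean(1.0 if log["solved"] else 0.0 for log in task_logs)

    # Fewer interaction steps relative to the allowed budget -> higher efficiency.
    solved = [log for log in task_logs if log["solved"]]
    efficiency = (
        mean(1.0 - log["steps"] / log["budget"] for log in solved) if solved else 0.0
    )
    return {"success_rate": success_rate, "inductive_efficiency": efficiency}


if __name__ == "__main__":
    logs = [
        {"solved": True, "steps": 80, "budget": 200},
        {"solved": False, "steps": 200, "budget": 200},
        {"solved": True, "steps": 150, "budget": 200},
    ]
    print(score(logs))
```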