Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen

2026-01-16

Summary

This paper introduces ML-Master 2.0, an AI agent designed to carry out complex machine learning engineering tasks autonomously over extended runs of a full day or more.

What's the problem?

Current AI systems, especially Large Language Models, are good at quick thinking and problem-solving, but they struggle with tasks that require planning and consistent effort over long durations. Think of it like writing a novel: an AI might write a good sentence, but keeping the story coherent and making progress over weeks is hard because it gets bogged down in details and loses sight of the bigger picture. These systems also have trouble turning small, infrequent pieces of feedback into guidance for their long-term strategy.

What's the solution?

The researchers tackled this by creating a system called Hierarchical Cognitive Caching, or HCC, which is inspired by how computer systems manage information. It gives the AI a memory organized into multiple tiers. Important, lasting knowledge is kept in a stable form, while temporary details from each step are processed quickly and then distilled into summaries. This lets the AI focus on the overall goal without getting lost in the weeds, and build on past experience, including lessons from other tasks, to improve its strategy over time. Essentially, it separates short-term actions from long-term planning.
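The tiering idea described above can be illustrated with a small Python sketch. This is not the authors' implementation: the class name `TieredMemory`, the three tier names, the capacity rule, and the `truncate_summary` placeholder are all assumptions made for illustration. In a real agent the distillation step would presumably be performed by a language model rather than by string truncation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def truncate_summary(events: List[str], max_chars: int = 80) -> str:
    """Hypothetical stand-in for distillation: join the events and truncate."""
    joined = "; ".join(events)
    if len(joined) <= max_chars:
        return joined
    return joined[: max_chars - 3] + "..."


@dataclass
class TieredMemory:
    """Illustrative three-tier memory (tier names are assumptions, not from the paper)."""

    working_capacity: int = 3
    summarize: Callable[[List[str]], str] = truncate_summary
    working: List[str] = field(default_factory=list)    # transient execution traces
    knowledge: List[str] = field(default_factory=list)  # distilled knowledge for the current task
    wisdom: List[str] = field(default_factory=list)     # lessons carried across tasks

    def record(self, event: str) -> None:
        """Store a raw trace; distill the working tier once it exceeds capacity."""
        self.working.append(event)
        if len(self.working) > self.working_capacity:
            self.knowledge.append(self.summarize(self.working))
            self.working.clear()

    def finish_task(self) -> None:
        """Flush leftover traces, then fold this task's knowledge into one cross-task lesson."""
        if self.working:
            self.knowledge.append(self.summarize(self.working))
            self.working.clear()
        if self.knowledge:
            self.wisdom.append(self.summarize(self.knowledge))
            self.knowledge.clear()

    def context(self) -> str:
        """Build the prompt context: most stable tier first, raw traces last."""
        return "\n".join(self.wisdom + self.knowledge + self.working)


memory = TieredMemory(working_capacity=2)
for step in ["load data", "train baseline", "score on validation set"]:
    memory.record(step)
memory.record("try larger model")
memory.finish_task()
print(memory.context())
```

The point of the sketch is that the context handed to the model stays bounded: raw traces are replaced by summaries as they age, so the number of steps taken does not directly determine how much text the agent must reread.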

Why does it matter?

This work is important because it suggests that AI agents can sustain complex, research-style engineering work on their own over long periods. The authors present it as a step toward AI that can explore problems beyond the complexity humans have handled before, rather than only assisting humans, which opens the door to tackling very challenging problems.

Abstract

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE), a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.
