daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
Mohan Jiang, Dayuan Fu, Junhao Shi, Ji Zeng, Weiye Si, Keyu Li, Xuefeng Li, Yang Xiao, Wenjie Li, Dequan Wang, Pengfei Liu
2026-02-04
Summary
This paper focuses on the difficulty of getting Large Language Models (LLMs) to handle complex, multi-step tasks that require planning and adapting over a long period of time, like a software agent working on a project.
What's the problem?
LLMs are great at quick tasks, but they struggle with tasks that unfold over many steps because there isn't enough real-world data available to train them on how to manage these long-term processes. Creating this data is hard: synthetic data tends to be too simple to reflect real complexity, while authentic data requires a lot of expensive human effort to create and label.
What's the solution?
The researchers looked at how software developers work using 'Pull Requests' (PRs) on platforms like GitHub. PRs break down big projects into smaller, manageable changes, and track how those changes are refined and improved over time. They created daVinci-Agency, a method that automatically mines training data from these PR sequences, using the commits within each PR to capture how tasks are broken down, how consistency is maintained across iterations, and how bugs are fixed. This approach creates realistic training data without needing extensive human labeling.
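To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of how commit history might be grouped into PR-level units: each PR becomes an ordered "submission unit" in the trajectory, and commits whose messages suggest bug fixes are flagged as refinement signals. The field names and fix-detection keywords are illustrative assumptions.

```python
# Hypothetical sketch of the PR-mining idea (not the paper's implementation):
# group a chronological commit history into PR-level "submission units",
# preserving order (task decomposition) and flagging bug-fix commits
# (refinement signals) within each unit.
from collections import OrderedDict

def build_pr_trajectory(commits):
    """commits: chronological list of dicts with 'pr', 'sha', 'message' keys."""
    prs = OrderedDict()  # preserves PR order -> the "chain of PRs"
    for c in commits:
        unit = prs.setdefault(c["pr"], {"steps": [], "refinements": 0})
        # Assumed heuristic: keyword match marks a commit as a refinement step.
        is_fix = any(kw in c["message"].lower() for kw in ("fix", "bug", "revert"))
        unit["steps"].append({"sha": c["sha"], "is_fix": is_fix})
        if is_fix:
            unit["refinements"] += 1
    # One ordered sequence of verifiable units for a single project-level goal.
    return [{"pr": pr, **unit} for pr, unit in prs.items()]

# Toy example with made-up PR numbers and commit hashes:
history = [
    {"pr": 101, "sha": "a1", "message": "add parser skeleton"},
    {"pr": 101, "sha": "b2", "message": "fix off-by-one in tokenizer"},
    {"pr": 102, "sha": "c3", "message": "wire parser into CLI"},
]
trajectory = build_pr_trajectory(history)
```

In a real pipeline the commit records would come from a repository's history (e.g. via the GitHub API), and the resulting units would be turned into supervision trajectories; this sketch only illustrates the grouping structure.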
Why it matters?
This work is important because it provides a way to train LLMs to be more effective at handling complex, real-world tasks. By learning from the natural evolution of software projects, these models can become better at planning, adapting, and achieving long-term goals, ultimately making them more useful as intelligent agents. The results show significant improvements in performance on challenging benchmarks, suggesting this is a promising direction for future research.
Abstract
While Large Language Models (LLMs) excel at short-term tasks, scaling them to long-horizon agentic workflows remains challenging. The core bottleneck lies in the scarcity of training data that captures authentic long-dependency structures and cross-stage evolutionary dynamics--existing synthesis methods are either confined to single-feature scenarios constrained by model distribution, or incur prohibitive human annotation costs, failing to provide scalable, high-quality supervision. We address this by reconceptualizing data synthesis through the lens of real-world software evolution. Our key insight: Pull Request (PR) sequences naturally embody the supervision signals for long-horizon learning. They decompose complex objectives into verifiable submission units, maintain functional coherence across iterations, and encode authentic refinement patterns through bug-fix histories. Building on this, we propose daVinci-Agency, which systematically mines structured supervision from chain-of-PRs through three interlocking mechanisms: (1) progressive task decomposition via continuous commits, (2) long-term consistency enforcement through unified functional objectives, and (3) verifiable refinement from authentic bug-fix trajectories. Unlike synthetic trajectories that treat each step independently, daVinci-Agency's PR-grounded structure inherently preserves the causal dependencies and iterative refinements essential for teaching persistent goal-directed behavior and enables natural alignment with project-level, full-cycle task modeling. The resulting trajectories are substantial--averaging 85k tokens and 116 tool calls--yet remarkably data-efficient: fine-tuning GLM-4.6 on 239 daVinci-Agency samples yields broad improvements across benchmarks, notably achieving a 47% relative gain on Toolathlon. Beyond benchmark performance, our analysis confirms...