LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang

2026-02-25

Summary

This paper introduces a new way to test how well AI agents can handle complex programming tasks that require many steps to complete, going beyond what current tests can do.

What's the problem?

Current tests for AI programming assistants aren't very good at measuring how well they can plan and execute long, complicated projects. They often use tasks that are too short, might include code the AI was already trained on from the internet, and don't give detailed feedback on where the AI goes wrong. This means we don't really know if these AIs can actually build realistic software.

What's the solution?

The researchers created a benchmark called LongCLI-Bench, which includes 20 challenging programming tasks curated from over 1,000 real computer science assignments and workflows. These tasks cover four categories of engineering work: building a project from scratch, adding features to existing code, fixing bugs, and refactoring to improve code quality. The benchmark evaluates an AI with two sets of tests: one that checks whether it correctly fulfills the task's requirements, and one that checks whether it avoids breaking functionality that already worked. On top of that, step-level scoring tracks the AI's progress through the task to pinpoint exactly where it gets stuck.
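The dual-set idea can be sketched in a few lines of code. This is a minimal illustration of the general protocol described above, not the benchmark's actual implementation; all function and field names here are hypothetical.

```python
def evaluate_task(fail_to_pass, pass_to_pass, run_test):
    """Score one task with a dual-set protocol (illustrative sketch).

    fail_to_pass: tests that failed before the agent's change and must
        pass afterwards (requirement fulfillment).
    pass_to_pass: tests that already passed and must still pass
        (regression avoidance).
    run_test: callable returning True if a test passes on the
        agent-modified codebase.
    """
    f2p_passed = sum(run_test(t) for t in fail_to_pass)
    p2p_passed = sum(run_test(t) for t in pass_to_pass)
    return {
        "requirement_rate": f2p_passed / len(fail_to_pass),
        "regression_rate": p2p_passed / len(pass_to_pass),
        # the task counts as solved only if both sets fully pass
        "resolved": (f2p_passed == len(fail_to_pass)
                     and p2p_passed == len(pass_to_pass)),
    }
```

For example, an agent that satisfies only half of the new requirements while preserving all existing behavior would get a requirement rate of 0.5, a regression rate of 1.0, and would not count as having resolved the task.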

Why it matters?

The results show that even the best AI agents struggle with these longer, more complex tasks, often failing to get very far. However, the research suggests that combining human guidance with AI assistance – like giving the AI a starting plan or offering help along the way – can significantly improve performance. This highlights the need to focus on building AI systems that work *with* humans, rather than trying to replace them entirely, to tackle real-world software engineering challenges.

Abstract

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and thus fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.