SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao

2026-03-05

Summary

This paper introduces a new way to test how well AI agents can fix and improve software code over time, moving beyond simply checking if a fix works right away.

What's the problem?

Current methods for evaluating AI code-fixing abilities focus on whether the AI can correct a bug in one attempt. However, real-world software development isn't like that; it involves many changes and improvements made over months or even years. Existing tests don't measure if an AI can *maintain* code quality through these ongoing changes, potentially leading to problems down the line.

What's the solution?

The researchers created a benchmark called SWE-CI. It draws tasks from real software projects with long histories of updates. Rather than being asked to fix a bug once, the agent must repeatedly analyze and modify the code as it evolves across many commits, simulating a real development process under continuous integration. On average, each task spans about 233 days of development and 71 consecutive commits.
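The evaluation loop described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual harness: the names `Commit`, `propose_patch`, and `ci_passes` are hypothetical stand-ins for the agent's coding step and the CI gate.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Commit:
    """One step in a task's evolution history (illustrative)."""
    repo_state: str  # stand-in for the repository checked out at this commit

def evaluate_agent_on_task(
    commits: List[Commit],
    propose_patch: Callable[[str], str],
    ci_passes: Callable[[str], bool],
) -> float:
    """Hypothetical SWE-CI-style loop: at each commit the agent analyzes
    the repository and proposes a patch, and the CI test suite decides
    whether the change is accepted. Returns the fraction of commits
    whose patched state passed CI."""
    passed = 0
    for commit in commits:
        patched = propose_patch(commit.repo_state)  # agent's analysis/coding round
        if ci_passes(patched):                      # continuous-integration gate
            passed += 1
    return passed / len(commits) if commits else 0.0

# Toy usage: an "agent" that appends a fix marker, and a CI check for it.
history = [Commit("v1"), Commit("v2"), Commit("v3")]
score = evaluate_agent_on_task(
    history,
    propose_patch=lambda state: state + "+fix",
    ci_passes=lambda state: state.endswith("+fix"),
)
print(score)  # 1.0
```

The key difference from one-shot benchmarks is that the loop runs over the whole commit history, so an agent is scored on how it sustains quality across dozens of iterations rather than on a single repair.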

Why it matters?

SWE-CI is important because it provides a more realistic test of AI agents' ability to contribute to long-term software projects. It helps us understand whether these agents can not only fix bugs but also keep the code well-maintained so that it does not degrade as it evolves, which is crucial for building reliable software.

Abstract

Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations -- a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.