Agentic Rubrics as Contextual Verifiers for SWE Agents

Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

2026-01-08

Summary

This paper introduces a new method called Agentic Rubrics for checking whether code-writing AI agents produce correct and useful software fixes.

What's the problem?

Currently, verifying whether an AI agent's code fix actually works usually requires executing the code, which is slow and hard to scale because each project needs its own environment setup. Faster alternatives exist, such as patch classifiers and heuristics, but they are less grounded in the codebase's context and rarely explain their judgments, making their assessments hard to trust.

What's the solution?

Agentic Rubrics uses another AI agent to carefully examine the project's code and create a detailed checklist, or 'rubric', of what a good fix should look like. Then, instead of running the code, the system checks the proposed fix against this rubric to see how well it meets the criteria. This avoids the need for code execution and provides a more understandable evaluation.
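The verification loop described above can be sketched in a few lines. This is an illustrative toy only: in the paper, the rubric criteria are generated by an expert agent that explores the repository, and each criterion is judged by a model; here both are stood in for by a hard-coded checklist and a simple substring check, so the names (`score_patch`, the example rubric items) are assumptions, not the authors' implementation.

```python
# Toy sketch of rubric-based, execution-free patch scoring.
# Assumption: substring checks stand in for an LLM judge deciding
# whether each rubric criterion is satisfied by the patch.

def score_patch(patch: str, rubric: list[tuple[str, str]]) -> float:
    """Return the fraction of rubric criteria the patch satisfies.

    Each rubric item is (criterion description, evidence string);
    the evidence check is a placeholder for a model-based judgment.
    """
    if not rubric:
        return 0.0
    satisfied = sum(1 for _desc, evidence in rubric if evidence in patch)
    return satisfied / len(rubric)

# Example checklist a context-gathering agent might emit for a bug fix.
rubric = [
    ("Fix guards against a None input", "is not None"),
    ("Fix modifies the parse() helper", "def parse"),
    ("Fix preserves the existing validation call", "validate("),
]

candidates = [
    "def parse(x):\n    return x.strip()",
    "def parse(x):\n    if x is not None:\n"
    "        validate(x)\n        return x.strip()",
]

# Parallel test-time scaling: generate several candidate patches and
# keep the one with the highest rubric score -- no tests are run.
best = max(candidates, key=lambda p: score_patch(p, rubric))
```

Because scoring never executes the repository's code, many candidates can be verified cheaply in parallel, which is what makes this usable as a test-time-scaling signal.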

Why it matters?

This approach matters because it offers a faster, more scalable, and more interpretable way to verify AI-generated code. On SWE-Bench Verified, it beats the strongest baseline in the paper's comparison set by at least 3.5 percentage points, and it can flag issues that ground-truth tests miss, ultimately leading to more capable and more trustworthy AI agents for software engineering.

Abstract

Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.