SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, Lijie Hu

2026-03-18

Summary

This paper investigates whether pre-made 'skills' actually help AI agents that perform software engineering tasks such as coding and debugging.

What's the problem?

AI agents are increasingly given 'skills' to make them better at specific tasks, but it is unclear whether these skills improve performance in real-world software development. They may help very little or even make things worse, and until now there has been no good way to measure their true impact.

What's the solution?

The researchers created a benchmark called SWE-Skills-Bench. It pairs real software projects from GitHub, pinned at fixed commits, with requirement documents that state explicit acceptance criteria. The AI agents were run on each task both *with* and *without* the added skill, and automated tests checked whether the agent completed the task correctly. The researchers measured whether the skills improved the pass rate and how many extra tokens the skills caused the AI to consume.
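The paired comparison described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: `RunResult`, `pass_rate`, and `compare` are hypothetical names, and the sample runs are made-up data used only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one agent run on one task (hypothetical record layout)."""
    passed: bool  # True if every acceptance test passed
    tokens: int   # total tokens the agent consumed


def pass_rate(runs):
    """Fraction of runs whose acceptance tests all passed."""
    return sum(r.passed for r in runs) / len(runs)


def compare(baseline, with_skill):
    """Return (pass-rate delta in percentage points, token overhead in percent)."""
    delta_pp = 100.0 * (pass_rate(with_skill) - pass_rate(baseline))
    base_tokens = sum(r.tokens for r in baseline)
    skill_tokens = sum(r.tokens for r in with_skill)
    overhead_pct = 100.0 * (skill_tokens - base_tokens) / base_tokens
    return delta_pp, overhead_pct


# Made-up data: the same two tasks run without and with a skill.
baseline = [RunResult(True, 1000), RunResult(False, 1000)]
with_skill = [RunResult(True, 1500), RunResult(False, 1500)]

delta_pp, overhead_pct = compare(baseline, with_skill)
print(f"pass-rate delta: {delta_pp:+.1f} pp, token overhead: {overhead_pct:+.1f}%")
```

In this made-up case the skill costs more tokens while the pass rate stays the same, which is the pattern the paper reports for most skills.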

Why does it matter?

The findings show that these 'skills' are less useful than their rapid adoption suggests. Of the 49 skills tested, 39 did not improve the pass rate at all, three made it worse, and only seven produced meaningful gains. This means we need to be careful about adding skills to AI agents and make sure each skill fits the specific task and project. SWE-Skills-Bench provides a way to rigorously test and evaluate skills in the future, helping developers build better AI tools.

Abstract

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
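As a quick consistency check on the counts reported in the abstract, the three outcome groups should add up to the 49 skills evaluated. The sketch below assumes a skill is classified purely by the sign of its pass-rate delta; `classify` is a hypothetical helper, and the sample deltas are illustrative values, not per-skill results from the paper.

```python
from collections import Counter


def classify(delta_pp):
    """Bucket a skill by the sign of its pass-rate delta (assumed rule)."""
    if delta_pp > 0:
        return "improved"
    if delta_pp < 0:
        return "degraded"
    return "unchanged"


# Counts as stated in the abstract.
reported = {"unchanged": 39, "improved": 7, "degraded": 3}
total = sum(reported.values())
print(total)  # the three groups together cover all evaluated skills

# Illustrative deltas only, to show how the bucketing would work.
sample_deltas = [0.0, 30.0, -10.0, 0.0]
buckets = Counter(classify(d) for d in sample_deltas)
print(dict(buckets))
```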