How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, Shiyu Chang
2026-04-08
Summary
This paper investigates how useful 'skills' are for improving the performance of AI agents powered by large language models (LLMs). These skills are like pre-built tools or pieces of knowledge that agents can use to complete tasks, but the study looks at how well they actually work in realistic situations.
What's the problem?
While adding skills to LLM agents seems like a good idea, it's surprisingly hard to get them to actually *help*. Previous tests gave the LLMs perfect skills tailored to each task, which isn't how things work in the real world. The problem is that agents often have to find skills themselves from a large library, and even the best match might not be perfect for the job. The researchers found that as the situation becomes more realistic – meaning the agent has to do more searching and the skills aren't perfectly suited – the skills become less and less effective, sometimes performing no better than not using skills at all.
What's the solution?
The researchers conducted a thorough study where agents had to choose from a huge collection of 34,000 real-world skills. They then explored ways to 'refine' these skills after they're chosen, making them more relevant to the specific task at hand. They tried two approaches: refining skills based on the specific question being asked, and refining them generally without considering the question. They discovered that refining skills based on the question significantly improved performance when the initial skills were reasonably good to begin with. They also tested this approach on a more complex benchmark, Terminal-Bench 2.0, where it raised the pass rate of a powerful LLM called Claude Opus 4.6 from 57.7% to 65.5%.
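The two-stage pipeline described above can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' actual method: it uses simple token overlap for retrieval and a naive keep-relevant-steps heuristic for query-specific refinement (the paper's agents use an LLM for both), and all skill data, function names, and the `description`/`steps` fields are invented for illustration.

```python
# Toy retrieve-then-refine pipeline for agent skills (hypothetical sketch).

def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())

def retrieve(query, skill_library, k=1):
    """Score each skill by token overlap between the task query and the
    skill's description, and return the top-k skills."""
    scored = sorted(
        skill_library,
        key=lambda s: len(tokenize(query) & tokenize(s["description"])),
        reverse=True,
    )
    return scored[:k]

def refine(skill, query=None):
    """Query-specific refinement (query given) keeps only the skill steps
    that share at least one token with the query; query-agnostic
    refinement (query=None) returns the skill unchanged."""
    if query is None:
        return skill
    kept = [step for step in skill["steps"]
            if tokenize(step) & tokenize(query)]
    # Fall back to the full skill if filtering would remove everything.
    return {**skill, "steps": kept or skill["steps"]}

# A tiny stand-in for the paper's 34k-skill library.
library = [
    {"description": "parse csv files into tables",
     "steps": ["open the csv file", "split rows on commas"]},
    {"description": "plot data as charts",
     "steps": ["choose chart type", "render axes"]},
]

query = "load a csv file of sales rows"
best = retrieve(query, library)[0]
refined = refine(best, query)
print(refined["description"])  # → parse csv files into tables
```

The design mirrors the paper's finding: refinement only operates on whatever the retriever surfaces, so if the best match from the library is a poor fit (low overlap here), refinement has little useful material to work with.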
Why it matters?
This research is important because it shows that simply *having* skills isn't enough to make LLM agents better. It highlights the need to focus on how agents *find* and *adapt* those skills to the task at hand. The findings suggest that skill refinement is a crucial step to unlock the full potential of skills for AI agents, and that there's still work to be done to make these skills truly reliable and effective across different models and situations.
Abstract
Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill usage remain scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly tailored skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and we show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.