SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong

2026-02-18

Summary

This paper investigates whether 'Skills,' which are essentially pre-packaged sets of instructions and procedures, actually improve the performance of AI agents powered by large language models (LLMs). It introduces SkillsBench, a benchmark for testing these Skills, and compares how well agents do with and without them.

What's the problem?

Currently, there's no good way to determine whether adding Skills to LLM agents actually makes them better at completing tasks. Skills are being adopted rapidly, but there's no standard method for measuring their effectiveness, and it's unclear whether agents can even write useful Skills for themselves. It's also possible that Skills *hurt* performance on some tasks.

What's the solution?

The researchers created 'SkillsBench,' a collection of 86 tasks spanning 11 domains, from software development to healthcare, with each task paired with curated Skills and an automatic checker that verifies whether the task was completed correctly. They tested agents under three conditions: without any Skills, with pre-made (curated) Skills, and with Skills the agent generated on its own, running seven different agent-model setups and tracking how often each task was completed successfully. Pre-made Skills generally helped, but the amount of improvement varied a lot by domain, and on some tasks they made things worse. Importantly, agents couldn't reliably create Skills that improved their own performance.
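To make the three-condition protocol concrete, here is a minimal sketch of how per-task pass rates and the curated-vs-no-Skills delta (in percentage points) could be scored. The task names, the run_agent stub, and the assumed success probabilities are hypothetical placeholders, not the paper's actual harness or data.

```python
# Minimal sketch of a SkillsBench-style three-condition comparison.
# Task names, the run_agent stub, and the success probabilities are
# illustrative assumptions, not the paper's harness or results.
import random
from statistics import mean

CONDITIONS = ["no_skills", "curated_skills", "self_generated_skills"]

def run_agent(task: str, condition: str) -> bool:
    """Stand-in for one agent trajectory; in the real benchmark a
    deterministic verifier decides whether the task output passes."""
    assumed_rates = {"no_skills": 0.40,
                     "curated_skills": 0.60,
                     "self_generated_skills": 0.40}
    return random.random() < assumed_rates[condition]

def pass_rate(task: str, condition: str, trials: int = 20) -> float:
    """Fraction of trials the verifier accepts."""
    return mean(run_agent(task, condition) for _ in range(trials))

tasks = ["refactor_module", "icd10_coding", "invoice_reconciliation"]  # hypothetical task IDs
for task in tasks:
    rates = {c: pass_rate(task, c) for c in CONDITIONS}
    delta_pp = 100 * (rates["curated_skills"] - rates["no_skills"])
    print(f"{task}: no={rates['no_skills']:.2f} "
          f"curated={rates['curated_skills']:.2f} delta={delta_pp:+.1f}pp")
```

Tracking the delta per task rather than only the overall average is what surfaces the cases where Skills lower the pass rate instead of raising it.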

Why it matters?

This work is important because it provides a standardized way to evaluate Skills for LLM agents. It shows that while Skills can be helpful, they aren't a guaranteed improvement and that simply giving an agent a lot of information isn't as effective as focused, well-designed instructions. It also suggests that relying on agents to create their own instructions isn't a viable strategy, at least with current technology, and that smaller models can perform as well as larger ones when equipped with good Skills.

Abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
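For illustration, here is a hypothetical sketch of what a 'focused' Skill with two small modules might look like, contrasted with a comprehensive-documentation baseline. The names, fields, and contents are assumptions made for this example; the paper does not publish this exact packaging format.

```python
# Hypothetical representation of a focused Skill package (2-3 small modules)
# versus a comprehensive-documentation baseline. All names and contents are
# illustrative assumptions, not the paper's published schema.
focused_skill = {
    "name": "invoice-reconciliation",
    "description": "Match vendor invoices to ledger entries and flag mismatches.",
    "modules": {
        "matching-rules.md": "Step-by-step procedure for pairing line items with ledger entries.",
        "edge-cases.md": "How to handle refunds, partial payments, and currency conversion.",
    },
}

comprehensive_docs = {
    "name": "accounting-manual",
    "description": "An entire accounting manual inlined into the agent's context.",
}

# Per the abstract, the small procedural package tends to raise pass rates
# more than exhaustive reference material does.
```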