SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Haibo Qiu, Wei Yang, Qingnan Ren, Shun Zou, Wenxuan Huang, Lin Chen, Zehui Chen, Feng Zhao
2026-04-21
Summary
This paper introduces a new way to test how well AI agents can learn and improve over time by discovering, using, and repairing their own 'skills' to solve problems.
What's the problem?
Currently, most tests for AI agents focus on whether they can *use* skills that are already given to them. This doesn't test if they can figure out skills on their own, fix them when they break, or build up a useful collection of skills over time. It's like giving a student a toolbox and seeing if they can build something, but not testing if they can learn to *make* new tools when needed.
What's the solution?
The researchers created a benchmark called SkillFlow, which includes 166 tasks organized into 20 related families. Tasks within a family share a common workflow, so an agent can learn a basic process on one task and then apply it to different situations. The AI agents start with no skills and learn by solving tasks one after another. When they succeed, they 'save' what they learned as a skill they can reuse later. The researchers then tested several powerful AI models, like Claude and Qwen, to see how well they could learn and improve under this protocol.
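The sequential learn-and-save loop described above can be sketched in Python. This is a hypothetical illustration, not SkillFlow's actual API: the names `Skill`, `SkillLibrary`, `run_family`, `solve`, and `distill` are all invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    procedure: str  # lesson distilled from a successful trajectory

@dataclass
class SkillLibrary:
    skills: dict = field(default_factory=dict)

    def save(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

def run_family(tasks, solve, distill):
    """Solve a family's tasks sequentially, carrying the skill library forward.

    solve(task, library)       -> (success: bool, trajectory)
    distill(task, trajectory)  -> Skill
    """
    library = SkillLibrary()  # agents begin with no skills
    results = []
    for task in tasks:
        success, trajectory = solve(task, library)
        if success:
            # externalize what was learned as a reusable skill
            library.save(distill(task, trajectory))
        results.append(success)
    return results, library
```

The key design point the benchmark tests is that `library` persists across tasks: later tasks in a family can succeed only if earlier attempts produced useful, reusable skills.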
Why it matters?
This work is important because it highlights that simply *using* skills isn't enough for AI to become truly intelligent. It shows that AI needs to be able to learn skills independently, adapt them when things go wrong, and build a lasting knowledge base. The SkillFlow benchmark provides a standardized way to measure these abilities and identify areas where AI still needs to improve, pushing the field towards more capable and adaptable AI agents.
Abstract
As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families, where the tasks in each family are constructed from a shared Domain-Agnostic Execution Flow (DAEF), so that every task in a family follows a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
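The abstract's patch-after-failure step can also be sketched. This is a minimal illustration under assumed interfaces, not the paper's implementation: `solve_with_repair`, `solve`, and `patch` are hypothetical names, and the skill library is modeled as a plain dict.

```python
def solve_with_repair(task, library, solve, patch, max_attempts=2):
    """Attempt a task; on failure, patch the skill that was used and retry.

    solve(task, library)      -> (success: bool, trajectory, used_skill_name or None)
    patch(skill, trajectory)  -> updated skill (e.g. a rubric-driven repair)
    """
    for _ in range(max_attempts):
        success, trajectory, used = solve(task, library)
        if success:
            return True
        if used is not None and used in library:
            # repair the skill in place so later tasks see the patched version
            library[used] = patch(library[used], trajectory)
    return False
```

Because the patched skill is written back into the shared library, a repair triggered by one task also benefits every subsequent task in the family, which is the mechanism the benchmark's lifelong protocol is designed to measure.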