Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
2025-06-25
Summary
This paper presents Skywork-SWE, an automated system that collects and organizes a large volume of software engineering data for training large language models, so that they solve coding problems better and can handle long, complex tasks.
What's the problem?
Existing datasets for training AI on software engineering tasks are small and take a lot of manual work to prepare. This limits how well models can learn to solve coding tasks, especially those that require long, multi-step problem-solving.
What's the solution?
The researchers created a pipeline that automatically gathers a large and diverse set of real coding tasks from GitHub, tests and validates them in real programming environments, and generates training examples to improve model learning.
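As a rough illustration of the collect, validate, and generate steps, here is a minimal Python sketch. It is not the authors' implementation. The `CandidateTask` fields, the function names, and the validation rule (keep a task only if its tests fail before the fix and pass after it) are all assumptions made for this example; the summary does not state the actual criteria.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CandidateTask:
    """Hypothetical record for one task gathered from a GitHub repository."""
    repo: str                 # repository name (illustrative)
    issue_text: str           # problem description shown to the model
    failed_before_fix: bool   # assumed result of running tests without the fix
    passed_after_fix: bool    # assumed result of running tests with the fix


def is_valid(task: CandidateTask) -> bool:
    # Assumed rule: the tests must actually detect the bug and confirm the fix.
    return task.failed_before_fix and task.passed_after_fix


def curate(candidates: List[CandidateTask]) -> List[CandidateTask]:
    # Keep only the candidates that pass the validation rule.
    return [task for task in candidates if is_valid(task)]


def to_training_example(task: CandidateTask) -> Dict[str, str]:
    # Turn a validated task into a simple prompt record (format is illustrative).
    return {"repo": task.repo, "prompt": task.issue_text}


candidates = [
    CandidateTask("org/alpha", "Crash on empty input", True, True),
    CandidateTask("org/beta", "Typo in docs", False, True),   # tests never failed
    CandidateTask("org/gamma", "Wrong rounding", True, False),  # fix not confirmed
]
examples = [to_training_example(t) for t in curate(candidates)]
```

In this sketch only the first candidate survives, because it is the only one whose tests both expose the problem and confirm the fix.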
Why does it matter?
This matters because training on more and better data makes AI models considerably better at understanding and solving programming problems, which can make software development more efficient and reliable.
Abstract
An automated data-curation pipeline for software engineering improves large language model performance on SWE tasks, achieving state-of-the-art results with and without test-time scaling techniques.