Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou

2025-06-25


Summary

This paper presents Skywork-SWE, an automated pipeline that collects and curates large-scale software engineering data for training large language models, improving their ability to solve coding problems and handle long, complex tasks.

What's the problem?

Existing datasets for training AI on software engineering tasks are small and require substantial manual preparation. This limits how well models can learn to solve coding tasks, especially those that demand long-horizon problem-solving.

What's the solution?

The researchers built a pipeline that automatically gathers a large, diverse set of real coding tasks from GitHub, validates them by running them in real programming environments, and generates training examples for model learning.
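To make the validation step more concrete, here is a minimal Python sketch of one way such a filter could work. It is not the authors' implementation. The `Candidate` record, the `TestRunner` callable, and the "keep a task only if at least one test goes from failing to passing once the patch is applied" rule are all assumptions made for illustration; the summary only states that tasks are tested and validated in real environments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical record for one candidate task; field names are illustrative,
# not taken from the paper.
@dataclass
class Candidate:
    repo: str
    issue_text: str
    patch: str


# Assumed interface: given a candidate and whether its patch is applied,
# return a mapping of test name -> passed?
TestRunner = Callable[[Candidate, bool], Dict[str, bool]]


def fail_to_pass(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tests that pass with the patch but did not pass (or exist) without it."""
    return sorted(
        name for name, ok in after.items() if ok and not before.get(name, False)
    )


def validate(candidates: List[Candidate], run_tests: TestRunner) -> List[Candidate]:
    """Keep candidates whose patch flips at least one test from fail to pass.

    This acceptance rule is an assumption for the sketch.
    """
    kept = []
    for cand in candidates:
        before = run_tests(cand, False)  # test results on the unpatched repo
        after = run_tests(cand, True)    # test results with the patch applied
        if fail_to_pass(before, after):
            kept.append(cand)
    return kept
```

In a real setting `run_tests` would build the repository in an isolated environment and execute its test suite. Passing the runner in as a parameter keeps the filtering logic separate from that environment setup, so it can be exercised with a stub.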

Why it matters?

With more and higher-quality training data, models become substantially better at understanding and solving programming problems, which can make software development more efficient and reliable.

Abstract

An automated data-curation pipeline for software engineering improves large language model performance on SWE tasks, achieving state-of-the-art results with and without test-time scaling techniques.
