Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak
2025-01-28
Summary
This paper studies how sparsity in Mixture-of-Experts (MoE) language models shapes the trade-off between two dimensions of model capacity: the total number of parameters and the compute (FLOPs) spent per example. By systematically varying the fraction of parameters that stay inactive for each input, the authors derive scaling laws that identify an optimal sparsity level under different parameter and compute budgets.
What's the problem?
Scaling language models usually means growing parameter count and per-example compute together, so it is hard to tell how each dimension separately contributes to capability. Sparse Mixture-of-Experts models can decouple the two, adding parameters without a proportional increase in FLOPs per example, but it has remained unclear how much sparsity is best under a given training budget.
What's the solution?
The researchers pretrain families of sparse Mixture-of-Experts language models while varying the sparsity level, i.e., the fraction of parameters that are inactive for each example. Holding different quantities fixed, such as total parameter count or total training compute, they measure both pretraining performance and downstream few-shot accuracy. Across these constraints they find that there is an optimal sparsity level that improves both training efficiency and model quality.
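The kind of sweep described above can be sketched numerically: at a fixed total-parameter budget, changing how many experts are active per token changes the sparsity level and the per-token compute. All sizes below are illustrative placeholders, not the paper's actual configurations:

```python
# Hypothetical MoE feed-forward layer sizes (not from the paper).
d_model, d_ff = 512, 2048
n_experts = 64
params_per_expert = 2 * d_model * d_ff          # two matrices: d_model->d_ff->d_model
total_params = n_experts * params_per_expert    # fixed parameter budget

# Varying top_k sweeps the sparsity level at constant total parameters,
# trading per-token FLOPs against the fraction of inactive parameters.
for top_k in (1, 2, 4, 8, 16, 64):
    active_params = top_k * params_per_expert
    sparsity = 1 - top_k / n_experts            # fraction of inactive parameters
    flops_per_token = 2 * active_params         # ~2 FLOPs per weight in a matmul
    print(f"top_k={top_k:2d}  sparsity={sparsity:.3f}  FLOPs/token={flops_per_token:,}")
```

Note that `top_k = n_experts` recovers a dense layer (sparsity 0), while small `top_k` keeps per-token FLOPs low no matter how large the total parameter count grows.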
Why it matters?
This research matters because it shows how to get more capability out of language models without simply making them uniformly bigger. By clarifying how sparsity fits into scaling laws for Mixture-of-Experts models, it gives practitioners a principled way to balance parameter count against compute per example when designing architectures. The results complement existing scaling-law work and could guide the design of more efficient models that deliver better performance per unit of training and inference compute.
Abstract
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts the model's performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
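The abstract's central point, that MoEs let parameter count and per-example FLOPs scale independently, can be made concrete with a small back-of-the-envelope calculation. The layer sizes here are hypothetical examples, not the paper's models:

```python
# Minimal sketch of the parameters-vs-FLOPs decoupling in an MoE
# feed-forward layer (all sizes are illustrative assumptions).
d_model, d_ff = 1024, 4096
n_experts, top_k = 16, 2

# Each expert is a two-matrix MLP: d_model -> d_ff -> d_model.
params_per_expert = 2 * d_model * d_ff
total_params = n_experts * params_per_expert   # grows with n_experts
active_params = top_k * params_per_expert      # fixed by top_k, not n_experts

# A vector-matrix product with a (d, f) matrix costs ~2*d*f FLOPs,
# so per-token FLOPs track only the *active* parameters.
flops_per_token = 2 * active_params

# Sparsity as defined in the abstract: the fraction of inactive parameters.
sparsity = 1 - active_params / total_params    # here 1 - 2/16 = 0.875
print(total_params, flops_per_token, sparsity)
```

Doubling `n_experts` doubles `total_params` and raises `sparsity`, while `flops_per_token` stays unchanged; that is the knob whose optimal setting the paper's scaling laws characterize.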