Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions
Siqi Kou, Qingyuan Tian, Hanwen Xu, Zihao Zeng, Zhijie Deng
2025-05-27
Summary
This paper investigates which pieces of training data help large language models get better at solving math and coding problems. The researchers use a tool called influence functions to trace the model's reasoning back to specific examples it saw during training, and they find that data from other domains can also help with math and code tasks.
What's the problem?
The problem is that it's hard to know exactly what kind of training data makes language models good at reasoning about math and programming. Without this understanding, it's tough to improve models in a principled way or to know which data to prioritize during training.
What's the solution?
The authors use influence functions, a technique that identifies which training examples have the biggest impact on a model's answers. In doing so, they discover that certain types of data, even from outside math and coding, can boost performance. They then use these influence scores to reweight the training data, which leads to more accurate models.
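To give a feel for the idea, here is a minimal sketch of gradient-based influence scoring on a toy linear model. It uses a common first-order approximation (a gradient dot product that ignores the Hessian term of full influence functions), and all data, names, and the softmax reweighting are illustrative assumptions, not the paper's actual method or models:

```python
import numpy as np

# Toy sketch: score each training example by how much its gradient
# aligns with the gradient of a query (test) example, then reweight
# the training set toward high-influence examples.
rng = np.random.default_rng(0)

# Linear regression with per-example loss 0.5 * (w @ x - y)^2
true_w = np.array([1.0, -2.0, 0.5])
w = rng.normal(size=3)                 # "trained" parameters (toy)
X_train = rng.normal(size=(5, 3))      # 5 training examples
y_train = X_train @ true_w
x_test = rng.normal(size=3)            # one query example
y_test = x_test @ true_w

def grad(w, x, y):
    # Gradient of the squared-error loss w.r.t. the parameters
    return (w @ x - y) * x

g_test = grad(w, x_test, y_test)

# Influence score: alignment between each training-example gradient
# and the query gradient. A larger score means upweighting that
# example moves the query loss more.
scores = np.array([g_test @ grad(w, x, y)
                   for x, y in zip(X_train, y_train)])

# Reweight the training data toward high-influence examples
# (softmax over scores; one illustrative choice among many).
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights)
```

The real setting replaces the toy gradients with per-example gradients of an LLM's loss (and an approximation of the inverse Hessian), but the core attribution step is this same gradient-alignment computation.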
Why it matters?
This matters because it helps researchers and engineers train smarter and more effective language models. By knowing which data is most helpful, they can build models that are better at math and coding, leading to improved tools for students, programmers, and anyone who relies on AI for solving technical problems.
Abstract
Influence functions are used to attribute LLMs' reasoning in math and coding to individual training elements, revealing cross-domain effects and enabling a reweighting strategy that improves model accuracy.