Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang

2025-06-25

Summary

This paper examines why open-source large language models (LLMs) often struggle with data analysis and how their abilities can be improved through better planning, interaction design, and high-quality training data.

What's the problem?

The problem is that while open-source LLMs have the potential to automate data tasks, they perform poorly on complex reasoning work such as understanding data, generating code, and making strategic plans, due to limitations in their design and the data they learn from.

What's the solution?

The researchers studied these issues by testing models on a range of realistic tasks and found that the quality of strategic planning is the main factor in success. They also found that the way users interact with a model and the complexity of the task both affect its reasoning. Using these insights, they developed a method for generating better training data that significantly boosts model performance on data analysis.

Why it matters?

This matters because improving open-source LLMs in data analysis can make advanced AI tools more accessible and reliable, helping many people and organizations automate complex data tasks without relying on expensive, closed-source software.

Abstract

This work identifies and applies enhancements to open-source large language models' data analysis capabilities through strategic planning, interaction design, and data quality improvements.