RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation
Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu, Leqi Zheng, Yiran Yang, Jianke Zhang, Qingbin Li, Shannan Yan, Zhetong Li, Changguo Jia, Junfei Wu, Zilei Wang, Qiang Liu, Liang Wang
2026-03-30
Summary
This paper investigates how well current Vision-Language Models (VLMs) can generate code that reproduces complex charts from real-world data, with a particular focus on figures composed of multiple sections, or 'panels'.
What's the problem?
Existing benchmarks for testing VLMs' chart-to-code abilities are too simple and don't reflect the challenges of working with authentic, complex data. Current models are good at producing basic charts, but it's unclear whether they can handle real-world visualizations that require analyzing large raw datasets and building multi-panel figures with specific analytical goals.
What's the solution?
The researchers created a new large-scale benchmark called RealChart2Code. It contains over 2,800 examples of charts grounded in real data, and it tests models not only on generating a chart in a single pass but also on refining the code over multiple rounds of conversation and feedback. The researchers then evaluated 14 different VLMs on this benchmark to see how they performed.
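The multi-turn refinement setting described above can be pictured as a generate-execute-feedback loop. The sketch below is purely illustrative and is not the benchmark's actual harness or API: the function names (`evaluate_multi_turn`, `try_execute`, `compare`) and the toy scoring are assumptions, standing in for a real sandboxed code executor, chart renderer, and chart-similarity metric.

```python
from typing import Callable, List, Optional, Tuple

def try_execute(code: str) -> str:
    # Placeholder: a real harness would run the generated plotting code
    # in a sandbox and rasterize the resulting figure for comparison.
    return "rendered:" + code

def compare(rendered: str, target: str) -> Tuple[float, Optional[str]]:
    # Placeholder similarity check: a real metric would compare the
    # rendered chart against the reference image. Returns (score, feedback);
    # feedback is None when the replication is judged correct.
    if rendered == "rendered:" + target:
        return 1.0, None
    return 0.0, "Output does not match the reference chart."

def evaluate_multi_turn(model: Callable[[str, Optional[str]], str],
                        target: str,
                        max_turns: int = 3) -> List[float]:
    """Run up to `max_turns` rounds of generate -> execute -> feedback."""
    feedback: Optional[str] = None
    scores: List[float] = []
    for _ in range(max_turns):
        code = model(target, feedback)      # VLM proposes plotting code
        rendered = try_execute(code)        # execute and render it
        score, feedback = compare(rendered, target)
        scores.append(score)
        if feedback is None:                # successful replication: stop early
            break
    return scores
```

The per-turn score trajectory makes it possible to measure not just one-shot accuracy but how much a model improves when given conversational feedback, which is the behavior this benchmark is the first to evaluate systematically.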
Why it matters?
The results show that VLMs struggle with complex charts and real data, performing significantly worse on RealChart2Code than on simpler benchmarks. This exposes a gap in current model capabilities and points to where future research should focus, such as handling large raw datasets and complex, multi-panel visualization requests. The evaluation also reveals a substantial performance gap between open-weight models and proprietary ones.
Abstract
Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at https://github.com/Speakn0w/RealChart2Code.