Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque

2026-01-14

Summary

This paper introduces a new way to build systems that create charts and graphs from simple text questions about data in tables. These systems, called Text2Vis, aim to automatically understand what you want to see and then generate the correct visualization.

What's the problem?

Current Text2Vis systems, even those using powerful AI models, often struggle to create visualizations that are both correct *and* easy to understand. While some models can generate code that *runs*, the resulting charts might be confusing or fail to accurately reflect the question asked. Existing methods for improving these systems, like showing them lots of examples, mostly focus on getting the code to execute, not on making the final visualization actually good. In short, there has been no way to teach the AI what makes a chart *good* after it has already been created.

What's the solution?

The researchers developed a new framework called RL-Text2Vis that uses a technique called reinforcement learning. Think of it like training a dog with treats: the system gets 'rewarded' for creating visualizations that are accurate to the text question, have code that runs without errors, and are visually clear. They used a specific reinforcement learning method called Group Relative Policy Optimization (GRPO) and trained Qwen2.5 models with it. Because the reward is computed *after* the chart is generated and executed, the model can learn from post-execution feedback what actually makes a visualization effective.
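The two ideas above can be sketched in a few lines of Python: a reward that blends the three post-execution signals (answer accuracy, code executability, chart quality) into one scalar, and GRPO's trick of scoring each sampled response relative to the others in its group. The function names, weights, and score ranges here are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

# Hypothetical composite reward. The weights and the [0, 1] score
# ranges are assumptions for illustration, not the paper's values.
def multi_objective_reward(text_accuracy: float,
                           code_executed: bool,
                           chart_quality: float,
                           weights=(0.4, 0.3, 0.3)) -> float:
    """Combine post-execution feedback into a single scalar reward.

    text_accuracy: [0, 1] score for how well the answer matches the query
    code_executed: whether the generated plotting code ran without errors
    chart_quality: [0, 1] score for visual clarity of the rendered chart
    """
    w_text, w_code, w_vis = weights
    # If the code fails to execute, no chart exists to judge,
    # so the visual-quality term contributes nothing.
    vis_term = chart_quality if code_executed else 0.0
    return w_text * text_accuracy + w_code * float(code_executed) + w_vis * vis_term

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize each sampled response's reward
    against the mean and spread of its own group, so no separate value
    network (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]
```

In training, the model would sample several candidate visualizations per query, score each with the composite reward, and then update toward the candidates whose group-relative advantage is positive.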

Why it matters?

This work is important because it significantly improves the quality of automatically generated charts. Their system outperforms existing models, including the very powerful GPT-4o (a 22% relative improvement in chart quality on the Text2Vis benchmark), and raises code execution success from 78% to 97% over its zero-shot baseline. This means that people can more easily explore and understand data without needing to be experts in data visualization or coding, and it establishes reinforcement learning as an effective approach for building these types of AI systems.

Abstract

Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.