Visual-ERM: Reward Modeling for Visual Equivalence

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

2026-03-16

Summary

This paper focuses on getting computers to automatically create code from images, like turning a picture of a chart into a program that can recreate that chart. It tackles the challenge of teaching these systems through reinforcement learning, which is tricky because it's hard to tell the computer whether its code is 'good' just by checking text rules or making broad visual comparisons.

What's the problem?

When you try to teach a computer to write code from images using reinforcement learning, you need a way to reward it when it does well. Existing methods for giving this reward are flawed. Simply checking if the generated code matches text descriptions isn't precise enough, and comparing the overall look of the generated image to the original doesn't catch small but important errors. This can lead to the computer finding 'shortcuts' to get rewards without actually creating accurate code – a problem called 'reward hacking'.

What's the solution?

The researchers developed a new system called Visual-ERM (Visual Equivalence Reward Model). It acts like a visual judge: instead of matching text or computing a rough embedding similarity, it renders the generated code into an image and compares that rendering directly against the original image. Rather than judging only the overall look, it focuses on fine details and gives specific, interpretable feedback on what's right and wrong. Using this reward signal during reinforcement learning, they improved an existing vision language model, Qwen3-VL-8B-Instruct, and saw significant gains in its ability to generate code from charts, tables, and SVGs. They also created a new benchmark, VC-RewardBench, to specifically evaluate how well different systems can detect these fine-grained visual differences.
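To make the idea concrete, here is a minimal sketch of a reward loop in the spirit described above: render the reference code and the model's generated code, then have a judge compare the two renderings and return both fine-grained feedback and a scalar reward. Everything here is illustrative, not the paper's actual implementation: `render` is a toy rasterizer, and `visual_reward_model` stands in for the learned generative judge (the real Visual-ERM is a multimodal model, not a cell-by-cell comparator).

```python
# Hypothetical sketch of a vision-to-code RL reward signal.
# All names (render, visual_reward_model, score_rollout) are illustrative.

from dataclasses import dataclass
from typing import List


@dataclass
class Critique:
    """Fine-grained, interpretable feedback plus a scalar reward."""
    discrepancies: List[str]   # e.g. "cell (2,0) differs" -> in the real
                               # system, things like "bar 3 height is wrong"
    reward: float              # in [0, 1]; 1.0 means visually equivalent


def render(code: str) -> List[List[int]]:
    """Stand-in renderer: maps code to a small pixel grid.
    A real pipeline would execute the code (matplotlib, SVG, HTML table)
    and rasterize the result."""
    grid = [[0] * 4 for _ in range(4)]
    for i, ch in enumerate(code[:16]):
        grid[i // 4][i % 4] = ord(ch) % 2
    return grid


def visual_reward_model(ref: List[List[int]],
                        gen: List[List[int]]) -> Critique:
    """Toy 'judge': reports each localized discrepancy between the two
    renderings instead of one coarse similarity score. The paper's model
    is generative and learned; this stub only mimics the interface."""
    issues = []
    total = 0
    for r, (row_ref, row_gen) in enumerate(zip(ref, gen)):
        for c, (a, b) in enumerate(zip(row_ref, row_gen)):
            total += 1
            if a != b:
                issues.append(f"cell ({r},{c}) differs")
    return Critique(discrepancies=issues, reward=1.0 - len(issues) / total)


def score_rollout(reference_code: str, generated_code: str) -> Critique:
    """RL reward for one rollout: judge in rendered visual space,
    never on the code text itself."""
    return visual_reward_model(render(reference_code),
                               render(generated_code))


if __name__ == "__main__":
    exact = score_rollout("plot(bars)", "plot(bars)")
    close = score_rollout("plot(bars)", "plot(barz)")
    print(exact.reward, len(exact.discrepancies))
    print(close.reward, close.discrepancies)
```

The key design point this sketch tries to capture is that the reward is computed on renderings, so a program that produces the right image gets full reward regardless of how its code is written, while text-level tricks that break the rendering earn nothing, which is what closes off the reward-hacking shortcuts.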

Why it matters?

This work shows that providing detailed, visual feedback is crucial for successfully teaching computers to generate code from images using reinforcement learning. It suggests that focusing on the visual accuracy of the output is more effective than relying on text-based rules or general visual comparisons, and that this approach works well across different types of visual data, even without needing to be specifically trained for each task.

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.