Think Visually, Reason Textually: Vision-Language Synergy in ARC
Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
2025-11-26
Summary
This paper investigates why even the most advanced AI models, like GPT-5, struggle with a type of problem-solving that humans find relatively easy: figuring out patterns and rules from just a few examples. It focuses on a specific benchmark, ARC-AGI, designed to measure exactly this ability.
What's the problem?
Current AI models, despite being good at processing text, can't reliably learn abstract rules from limited information. They fail at tasks that require understanding the underlying logic of a situation, something humans do naturally. Surprisingly, the paper found that simply showing these models the problems as pictures actually made them perform *worse*: the models grasp the overall pattern from an image but then execute the transformation rules imprecisely. This is counterintuitive because humans often rely on visual thinking for exactly these kinds of puzzles.
What's the solution?
The researchers realized that both vision (seeing patterns) and language (understanding rules) are important, but at different stages. They developed a system that combines these two strengths. First, the AI uses vision to get a general idea of the pattern. Then, it uses language to formulate and execute precise rules. Finally, it uses vision *again* to check whether the rules actually work. The paper introduces two strategies built on this idea: Vision-Language Synergy Reasoning (VLSR), which splits the task into subtasks matched to the best modality, and Modality-Switch Self-Correction (MSSC), which uses the visual check to catch and fix errors in the text-based reasoning. Together they improved accuracy by up to 4.33% over text-only baselines.
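The three-stage loop described above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the three functions below are hypothetical stand-ins for model calls (here the "visual" stage simply hard-codes a mirror pattern so the loop is runnable end to end).

```python
# Hypothetical sketch of the Vision-Language Synergy Reasoning (VLSR) +
# Modality-Switch Self-Correction (MSSC) loop. All three stage functions
# are stand-ins for model calls, not the paper's actual system.

def abstract_pattern_from_image(grids):
    # Stage 1 (vision): form a rough, global description of the transformation.
    # Stub: pretend the visual pass notices that every grid is mirrored.
    return "output is the input grid mirrored left-to-right"

def formulate_rule(pattern_description):
    # Stage 2 (language): turn the loose visual description into a
    # precise, executable rule.
    def rule(grid):
        return [row[::-1] for row in grid]
    return rule

def verify_visually(rule, examples):
    # Stage 3 (vision again): check the rule's outputs against the example
    # outputs; in the paper this check is performed by the model itself.
    return all(rule(inp) == out for inp, out in examples)

def solve(examples, test_input, max_rounds=3):
    pattern = abstract_pattern_from_image([inp for inp, _ in examples])
    for _ in range(max_rounds):
        rule = formulate_rule(pattern)
        if verify_visually(rule, examples):   # switch back to the visual modality
            return rule(test_input)
        pattern += " (revised)"               # self-correct and try again
    return None

examples = [([[1, 0], [0, 2]], [[0, 1], [2, 0]])]
print(solve(examples, [[3, 0], [0, 4]]))  # [[0, 3], [4, 0]]
```

The key design point is the modality switch in `verify_visually`: verification happens in a different modality than rule execution, so errors made during textual reasoning are more likely to be caught.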
Why it matters?
This work is important because it suggests that to create truly intelligent AI, we need to move beyond just focusing on language models. Combining visual understanding with logical reasoning is a crucial step towards building AI that can think and learn more like humans, and generalize to new situations without needing tons of examples.
Abstract
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.