ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao

2025-10-30

Summary

This paper focuses on the challenge of automatically converting math problems written in regular English into a format that computers can understand and verify, a process called autoformalization.

What's the problem?

Large Language Models (LLMs) are good at *looking* like they're translating math problems into computer-readable code, but they often get the meaning wrong. They treat it like a simple word-for-word translation instead of truly understanding the math and ensuring the formal version says the same thing as the original problem. Essentially, LLMs lack a way to check their own work and improve it like a human mathematician would.

What's the solution?

The researchers developed a new method called ReForm, which stands for Reflective Autoformalization. ReForm doesn't just generate a formal statement; it also checks if that statement actually means the same thing as the original English problem. If it finds errors, it corrects them and tries again, repeating this process until it's confident the meaning is preserved. To train this system, they created a special training technique called Prospective Bounded Sequence Optimization that rewards the model for both accurate translation *and* good self-checking.
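As a loose illustration only (not the paper's actual code; all class and function names here are hypothetical), the generate–check–refine loop ReForm performs might be sketched like this, with a toy stand-in for the model:

```python
from dataclasses import dataclass

@dataclass
class Critique:
    """Result of a semantic-consistency check on a formal statement."""
    is_consistent: bool
    note: str = ""

class ToyModel:
    """Toy stand-in for an LLM, used only to make the loop runnable."""
    def formalize(self, problem):
        # First draft drops the doubling, mimicking a semantic error.
        return "theorem t (n : Nat) : n + n = n"
    def check_consistency(self, problem, statement):
        ok = "2 * n" in statement
        return Critique(ok, "" if ok else "right-hand side should double n")
    def refine(self, problem, statement, critique):
        # Apply the fix suggested by the critique.
        return statement.replace("= n", "= 2 * n")

def reflective_autoformalize(model, problem, max_rounds=5):
    """Formalize, self-check, and refine until the meaning is preserved."""
    statement = model.formalize(problem)
    for _ in range(max_rounds):
        critique = model.check_consistency(problem, statement)
        if critique.is_consistent:      # meaning preserved: stop early
            return statement
        statement = model.refine(problem, statement, critique)
    return statement                    # best effort after the round budget

result = reflective_autoformalize(ToyModel(), "n added to itself is twice n")
```

The key design point is that the consistency check and the refinement are part of one iterative process, rather than a single-shot translation.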

Why it matters?

This work is important because it significantly improves the reliability of autoformalization: across four benchmarks, ReForm beats the strongest baselines by an average of 17.2 percentage points. By making sure the computer-readable math actually matches the original problem, we can use computers to solve complex math problems stated in natural language with more confidence. Using their new ConsistencyCheck benchmark, the authors also showed that even human experts introduce semantic errors in up to 38.5% of cases, highlighting how difficult this task truly is and the need for better automated tools.

Abstract

Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises because existing LLM approaches treat autoformalization as a simplistic translation task, lacking the mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validation, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
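The "different rewards at different sequence positions" idea can be illustrated with a deliberately simplified sketch (the actual PBSO objective is more involved; the function name, arguments, and clipping bound below are assumptions for illustration). Here the output sequence is split into a formalization segment and a critique segment, each rewarded for its own correctness so that a lazy or superficial critique cannot free-ride on an accurate formalization:

```python
def position_rewards(seq_len, split, formal_ok, critique_ok, bound=1.0):
    """Assign a per-position reward over a sequence of length seq_len.

    Positions before `split` belong to the formal statement and are
    rewarded for formalization accuracy; positions from `split` on
    belong to the self-critique and are rewarded for honest validation.
    Rewards are kept within [-bound, bound].
    """
    r_formal = bound if formal_ok else -bound
    r_critique = bound if critique_ok else -bound
    return [r_formal if i < split else r_critique for i in range(seq_len)]

# Example: a correct formal statement followed by a sloppy critique
# still gets penalized on the critique segment.
rewards = position_rewards(seq_len=6, split=4, formal_ok=True, critique_ok=False)
```

This position-wise separation is what lets training shape both skills at once instead of collapsing them into a single scalar reward.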