EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu

2025-09-17

Summary

This paper explores how to make powerful AI systems, called Large Language Models, better at automatically proving mathematical theorems without using a ton of computing power.

What's the problem?

Currently, the best way to get these AI models to solve complex proofs involves techniques like 'Chain-of-Thought' reasoning and sampling many candidate solutions. While effective, these methods are expensive in processing time and energy. Worse, existing cost analyses focus only on limiting the number of attempts, ignoring that different problem-solving strategies can have vastly different costs even with the same number of tries.
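
To see why counting only attempts can be misleading, here is a toy back-of-the-envelope calculation in Python. Every number is invented for illustration and does not come from the paper:

```python
# Toy cost comparison (all numbers invented, not from the paper): with the
# same budget of attempts, a reflective CoT strategy can cost an order of
# magnitude more tokens than a concise one.
passes = 32
tokens_per_concise_attempt = 300      # hypothetical: short, direct attempts
tokens_per_reflective_attempt = 4000  # hypothetical: long reflective CoT attempts

concise_cost = passes * tokens_per_concise_attempt        # 9,600 tokens
reflective_cost = passes * tokens_per_reflective_attempt  # 128,000 tokens
print(f"reflective / concise cost: {reflective_cost / concise_cost:.1f}x")  # 13.3x
```

Both strategies make 32 attempts, yet one consumes over thirteen times the tokens, which is exactly the disparity that pass-count-only analyses hide.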

What's the solution?

The researchers developed a system called EconProver that rests on two main ideas. First, it dynamically switches between detailed, step-by-step reasoning and more concise generation, so the model spends extra tokens only when a problem actually needs them. Second, it uses reinforcement learning with trainable prefixes, learned 'starting points' that steer each parallel attempt toward a different proof strategy, raising the chances of finding a proof even with fewer attempts. The two methods are combined into a single pipeline, called EconRL, to maximize efficiency.
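
Here is a minimal Python sketch of what the first idea, dynamic CoT switching, could look like. It assumes access to a text generator and a formal proof verifier; all function names and the escalate-on-failure heuristic are illustrative guesses, not the paper's actual mechanism:

```python
from typing import Callable, Optional

def dynamic_cot_prove(
    generate: Callable[[str, str], str],  # (theorem, mode) -> candidate proof
    verify: Callable[[str, str], bool],   # (candidate, theorem) -> accepted?
    theorem: str,
    max_passes: int = 8,
) -> Optional[str]:
    """Hypothetical dynamic CoT switching: try cheap, concise generations
    first and spend tokens on reflective CoT only when they fail."""
    for _ in range(max_passes):
        candidate = generate(theorem, "concise")  # few tokens, no reflection
        if verify(candidate, theorem):
            return candidate
        candidate = generate(theorem, "cot")      # expensive reflective pass
        if verify(candidate, theorem):
            return candidate
    return None  # no verified proof within the pass budget
```

The design point is that easy theorems never pay the full CoT token cost; the expensive reflective mode is reached only after a cheap attempt fails.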

Why it matters?

This work is important because it shows how to build AI theorem provers that are far more efficient, cutting computational cost to just 12% of the original in the authors' experiments without sacrificing accuracy. This makes these powerful tools more accessible and practical for wider use, especially where computing resources are limited.

Abstract

Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, both strategies introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sampling passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic CoT switching mechanism designed to mitigate unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
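
The abstract only names 'trainable prefixes' for the second component. One plausible reading, sketched below using standard prefix-tuning conventions (the sizes and names are assumptions, not the paper's architecture), is that each parallel sampling pass conditions on its own learned soft prompt, so different passes explore different proof strategies:

```python
import torch

K, prefix_len, d_model = 4, 16, 1024  # hypothetical sizes, not from the paper

# One trainable soft prompt per parallel sampling pass; in an RL setup these
# would be updated so that different passes succeed on different theorems.
prefixes = torch.nn.Parameter(torch.randn(K, prefix_len, d_model) * 0.02)

def prefixed_input(token_embeds: torch.Tensor, pass_id: int) -> torch.Tensor:
    """Prepend the pass-specific prefix to the embedded theorem statement.

    token_embeds: (seq_len, d_model) embeddings of the formal statement.
    Returns a (prefix_len + seq_len, d_model) input for pass `pass_id`.
    """
    return torch.cat([prefixes[pass_id], token_embeds], dim=0)
```

Because each of the K passes sees a different learned prefix, a fixed sampling budget covers more distinct proof attempts, which is one way to raise pass rates without increasing the number of passes.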