
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

Minki Kang, Jongwon Jeong, Jaewoong Cho

2025-04-08


Summary

This paper introduces T1, a method that helps small AI language models check their own work by using external tools like code interpreters, making them more accurate without needing bigger, more expensive models.

What's the problem?

Small AI models struggle to double-check their own answers, especially on math problems or fact-heavy tasks, because they haven't memorized enough details (exact calculations, specific facts) to verify their work reliably.

What's the solution?

T1 lets small AI models use tools like code interpreters to handle number-heavy checks and fact verification, so they can focus on the parts they're good at while relying on tools for the rest.
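The core idea can be sketched in a few lines. In this hypothetical example (the function names and tolerance are illustrative, not from the paper), a verification step that would normally require the model to recompute arithmetic "from memory" is instead delegated to a code interpreter, here plain Python evaluation:

```python
def verify_with_tool(expression: str, claimed_answer: float) -> bool:
    """Delegate a numeric check to a code interpreter (here, Python eval)
    instead of asking the small model to recompute it in its head."""
    try:
        # Execute the candidate's arithmetic exactly.
        computed = eval(expression, {"__builtins__": {}})
    except Exception:
        # A malformed expression fails verification.
        return False
    return abs(computed - claimed_answer) < 1e-9

# A candidate solution claims 17 * 24 = 408; the tool confirms it.
print(verify_with_tool("17 * 24", 408))   # True
print(verify_with_tool("17 * 24", 418))   # False
```

The point is that the tool handles the memorization-heavy part exactly, so even a 1B-parameter model can reject wrong answers it could not have recomputed on its own.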

Why it matters?

This makes small AI models more reliable for real-world tasks like homework help or customer service, letting them work faster and cheaper than big models while still being accurate.

Abstract

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
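The test-time scaling setup the abstract refers to is commonly best-of-N sampling: generate several candidate answers and let a verifier pick the best one. A minimal sketch, with `generate` and `verify_score` as stand-ins for the sLM generator and a T1-style self-verifier (both hypothetical names):

```python
def best_of_n(question, generate, verify_score, n=8):
    """Sample n candidate answers and return the one the verifier
    scores highest -- more compute at test time, same model."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=verify_score)

# Toy usage: candidates come from a fixed pool, and the "verifier"
# prefers values closer to the true answer 42.
pool = iter([35, 44, 41, 50])
gen = lambda q: next(pool)
score = lambda ans: -abs(ans - 42)
print(best_of_n("toy question", gen, score, n=4))   # 41, closest to 42
```

Under this scheme, verification quality directly limits how much extra sampling helps, which is why delegating the hard verification steps to tools pays off.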