Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty

2025-10-21

Summary

This paper focuses on building better automatic evaluators for large language models, which are programs that judge the quality of text generated by AI. Instead of focusing on complicated training methods, the researchers concentrated on creating a massive, high-quality dataset to train these evaluators.

What's the problem?

Evaluating the quality of an AI's responses is a major challenge. Humans can do it, but human judgment doesn't scale to the constant cycle of training and testing new models. Existing automatic evaluators have mostly been built with complex training techniques like reinforcement learning, while comparatively little effort has gone into simply scaling up high-quality training data. This limits both their performance and how easily others can build on them.

What's the solution?

The researchers curated a dataset of 2.5 million examples spanning five evaluation tasks: comparing two responses (pairwise), rating a single response, judging individual reasoning steps (step-level), and verifying correctness both with and without a reference answer. They then trained a family of evaluators called FARE (Foundational Automatic Reasoning Evaluators) using a straightforward iterative rejection-sampling approach: sample candidate judgments from the model, keep only those that match the known correct verdict, and finetune on the kept examples. They built two versions, an 8 billion parameter model and a 20 billion parameter mixture-of-experts model (with 3.6 billion active parameters).
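The data-selection step of that loop can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: `generate` stands in for sampling a judgment from the evaluator being trained, and the gold labels, prompts, and `n_samples` value are all hypothetical.

```python
import random

def rejection_sample_sft(examples, generate, n_samples=4):
    """One rejection-sampling round: for each example, sample up to
    n_samples candidate judgments and keep the first one that matches
    the gold verdict. The kept pairs become the next SFT dataset."""
    kept = []
    for ex in examples:
        for _ in range(n_samples):
            verdict = generate(ex["prompt"])
            if verdict == ex["gold"]:  # reject judgments with the wrong verdict
                kept.append({"prompt": ex["prompt"], "target": verdict})
                break
    return kept

# Toy run with a mock generator that guesses between two verdicts.
random.seed(0)
examples = [{"prompt": f"pairwise-{i}", "gold": "A"} for i in range(10)]
mock_generate = lambda prompt: random.choice(["A", "B"])
sft_data = rejection_sample_sft(examples, mock_generate)
```

In the full method this round would be followed by finetuning on `sft_data` and repeating, so each iteration trains on judgments the current model got right.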

Why it matters?

FARE models perform exceptionally well: FARE-8B challenges larger reinforcement-learning-trained evaluators, and FARE-20B surpasses specialized open-source evaluators with over 70 billion parameters. This shows that a large, well-curated dataset and a simple training recipe can be remarkably effective. The evaluators are also useful in practice: as an inference-time reranker, FARE-20B reaches near-oracle performance on MATH, and as a verifier during reinforcement learning, FARE improves the downstream trained model by up to 14.1% over string-matching verifiers, making AI development faster and more reliable.
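The inference-time reranking use mentioned above is essentially best-of-N selection: generate several candidate answers, score each with the evaluator, and keep the highest-scoring one. Here is a minimal sketch; the `score` function is a hypothetical stand-in for querying a FARE model for a quality rating.

```python
def rerank_best_of_n(candidates, score):
    """Best-of-N reranking: return the candidate the evaluator scores highest."""
    return max(candidates, key=score)

# Toy example: three candidate solutions to a math problem, and a
# stand-in scorer that rewards the correct value and a verification step.
candidates = ["x = 3", "x = 7", "x = 7 (checked)"]
score = lambda c: ("7" in c) + ("checked" in c)
best = rerank_best_of_n(candidates, score)  # -> "x = 7 (checked)"
```

With a strong evaluator as the scorer, this simple selection rule is what lets reranking approach oracle performance: the quality of the final answer is bounded only by the best candidate generated and the evaluator's ability to recognize it.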

Abstract

Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.