Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Xingyue Huang, Rishabh, Gregor Franke, Ziyi Yang, Jiamu Bai, Weijie Bai, Jinhe Bi, Zifeng Ding, Yiqun Duan, Chengyu Fan, Wendong Fan, Xin Gao, Ruohao Guo, Yuan He, Zhuangzhuang He, Xianglong Hu, Neil Johnson, Bowen Li, Fangru Lin, Siyu Lin, Tong Liu, Yunpu Ma

2025-09-05

Summary

This paper introduces the Loong Project, a new system designed to help improve the reasoning abilities of Large Language Models (LLMs) in areas beyond just math and coding, like chemistry and logic.

What's the problem?

While LLMs are getting better at reasoning, especially in domains where answers can be automatically checked for correctness (like math problems), it's hard to improve their reasoning in subjects where verifying answers is difficult or requires a lot of human effort. Creating enough high-quality training data for these areas is a major bottleneck because having experts check the answers is expensive and time-consuming.

What's the solution?

The researchers created LoongBench, a collection of 8,729 carefully vetted questions and answers across 12 different subjects, each paired with executable code that verifies the answer. They also built LoongEnv, a system that automatically generates new, similar question-answer-code triples. Together, these form a loop in which an LLM practices reasoning, gets rewarded whenever its answer matches the one produced by executing the verification code, and learns to improve its performance. The researchers then benchmarked a range of open-source and proprietary LLMs on this system and analyzed the correctness, difficulty, and diversity of the automatically generated data.
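The reward loop described above can be sketched in a few lines. This is a hypothetical illustration, not the Loong codebase: the function names (`extract_final_answer`, `verifier_reward`) and the convention that the final answer appears on the last line of the chain-of-thought are assumptions made here for clarity.

```python
def extract_final_answer(cot: str) -> str:
    """Take the last non-empty line of a chain-of-thought as its final answer."""
    lines = [ln.strip() for ln in cot.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def verifier_reward(cot: str, code_executed_answer: str) -> float:
    """Reward 1.0 if the model's final answer matches the code-executed answer."""
    match = extract_final_answer(cot) == code_executed_answer.strip()
    return 1.0 if match else 0.0

# Example: the chain-of-thought's last line agrees with the verified answer,
# so the agent receives a positive reward.
cot = "2 mol HCl neutralizes 2 mol NaOH.\nFinal answer:\n2"
print(verifier_reward(cot, "2"))  # 1.0
```

The key design point is that the reward signal comes from code execution rather than a human grader, which is what makes the training loop scalable across domains.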

Why it matters?

This work is important because it provides a way to automatically create training data for improving LLMs in a wide range of complex subjects. By reducing the need for expensive human verification, it makes it more feasible to build LLMs that can reason effectively in areas beyond just math and programming, potentially leading to more versatile and helpful AI systems.

Abstract

Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
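To make the abstract's "question-answer-code triple" concrete, here is a minimal sketch of what such a record might look like, assuming the stored code recomputes the ground-truth answer when executed. The field names, the `result` variable convention, and the use of `exec` are illustrative assumptions, not the actual LoongBench schema.

```python
from dataclasses import dataclass

@dataclass
class QACodeTriple:
    question: str
    answer: str  # ground-truth answer, stored as a string
    code: str    # executable code expected to set a variable named `result`

    def verify(self) -> bool:
        """Execute the stored code and compare its result to the stored answer."""
        scope: dict = {}
        exec(self.code, scope)
        return str(scope.get("result")) == self.answer

# Example triple from an advanced-mathematics-style domain.
triple = QACodeTriple(
    question="What is the determinant of the matrix [[2, 0], [0, 3]]?",
    answer="6",
    code="result = 2 * 3 - 0 * 0",
)
print(triple.verify())  # True
```

Because each example carries its own verification code, a synthetic-data environment can generate new triples and check them automatically before they are used for training.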