InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

2026-01-05

Summary

This paper introduces InfoSynth, a new system designed to automatically create challenging and unique tests for large language models, specifically focusing on their ability to reason and write code.

What's the problem?

Currently, evaluating how well large language models can reason and code is difficult because creating good tests is a slow, expensive process that relies on people writing them by hand. Many existing tests have also appeared in the models' training data, so a model may simply be recalling an answer rather than demonstrating its actual abilities. We need fresh, diverse tests to get an accurate picture of how these models are improving.

What's the solution?

The researchers developed InfoSynth, which uses ideas from information theory to automatically generate new tests. It starts with a small set of seed problems and applies a genetic algorithm, a process inspired by evolution, to create variations and entirely new problems. The system also checks its own work, verifying that generated problems are solvable and come with correct solutions and test cases. Finally, it measures how different the new problems are from the seeds, ensuring they are truly novel and diverse, without needing to run costly evaluations on the language models themselves.
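The loop described above can be sketched in miniature. This is only an illustrative stand-in, not the paper's implementation: in InfoSynth the mutation and solution steps are driven by an LLM, whereas here `mutate` and `propose_solution` are hypothetical placeholder functions so that the evolve-then-self-verify control flow is runnable on its own.

```python
import random

def mutate(problem: dict) -> dict:
    """Stand-in for LLM-driven variation: perturb a problem parameter."""
    new = dict(problem)
    new["k"] = problem["k"] + random.choice([-1, 1])
    return new

def propose_solution(problem: dict):
    """Stand-in for an LLM-proposed solution (here: 'add k' problems)."""
    k = problem["k"]
    return lambda x: x + k

def verify(problem: dict, solution) -> bool:
    """Self-check: run the candidate solution against generated test cases."""
    return all(solution(x) == x + problem["k"] for x in range(5))

def synthesize(seeds, generations=3, population=8):
    """Evolve a pool of problems, keeping only self-verified children."""
    pool = list(seeds)
    for _ in range(generations):
        children = [mutate(random.choice(pool)) for _ in range(population)]
        pool.extend(c for c in children if verify(c, propose_solution(c)))
    return pool

benchmark = synthesize([{"k": 3}])
print(len(benchmark))
```

The key design point the sketch preserves is that every generated problem must pass its own tests before joining the pool, which is what makes the pipeline self-verifying rather than dependent on human review.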

Why it matters?

InfoSynth is important because it provides a way to quickly and reliably create a large number of high-quality tests for large language models. This allows researchers to better understand the strengths and weaknesses of these models and track their progress as they improve. By automating the benchmark creation process, it removes a major bottleneck in the development and evaluation of AI systems, and helps ensure we're testing models on problems they haven't already memorized.

Abstract

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks are often contaminated by inclusion in LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
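To make the abstract's KL-divergence and entropy metrics concrete, here is a minimal sketch of how such scores could be computed. The featurization is an assumption for illustration only: we represent each dataset as a word-frequency distribution over its problem statements, then use entropy as a diversity proxy and smoothed KL(synthesized || seed) as a novelty proxy; the paper's actual problem representations may differ.

```python
import math
from collections import Counter

def distribution(problems):
    """Word-frequency distribution over a set of problem statements."""
    counts = Counter(w for p in problems for w in p.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(dist):
    """Diversity proxy: Shannon entropy of the distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values())

def kl_divergence(p, q, eps=1e-9):
    """Novelty proxy: KL(P || Q) with epsilon-smoothing for missing words.

    Smoothing means this is an approximation (the smoothed masses do not
    renormalize), but it keeps the score finite on disjoint supports.
    """
    support = set(p) | set(q)
    return sum(
        p.get(w, eps) * math.log2(p.get(w, eps) / q.get(w, eps))
        for w in support
    )

seed = ["sort a list of integers", "reverse a string"]
synth = ["sort a list of tuples by key", "merge two sorted lists"]
p, q = distribution(synth), distribution(seed)
print(entropy(p), kl_divergence(p, q))
```

A benchmark that merely rephrases its seeds would have low KL-divergence from them, while one that drifts into new problem types scores higher on both measures, which is the intuition behind measuring novelty and diversity without running any model.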