
An Information Theoretic Perspective on Agentic System Design

Shizhe He, Avanika Narayan, Ishan S. Khare, Scott W. Linderman, Christopher Ré, Dan Biderman

2025-12-30

Summary

This paper investigates how to best build systems that use multiple large language models (LLMs) working together, specifically focusing on systems where one LLM summarizes information for another. These systems are becoming popular in applications like advanced search and coding assistants.

What's the problem?

Currently, designing these multi-LLM systems is done through trial and error. It's hard to know whether improvements come from the LLM that does the initial summarizing (the 'compressor') or the LLM that makes the final prediction (the 'predictor'). Testing every combination of LLMs is expensive and time-consuming, and there hasn't been a clear way to predict which combinations will work best.

What's the solution?

The researchers approached this problem using information theory, treating the compressor LLM as a noisy channel that transmits information about the original context. They developed a simple way to estimate how much useful information the compressor LLM actually retains from that context. This measurement, the 'mutual information' between the context and its compression, doesn't depend on any specific task, yet it strongly predicts how well the overall system will perform. They also found that larger compressor LLMs are not just more accurate but also more token-efficient, conveying more bits of information per token.
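To make the central quantity concrete, here is a minimal sketch of mutual information I(X; Y) between a context X and its compression Y, computed from a joint count table over discretized outcomes. This illustrates the definition only; the paper's actual estimator for LM-generated text is more involved and is an assumption not shown here.

```python
import numpy as np

def mutual_information(joint_counts):
    """Estimate I(X; Y) in bits from a joint count/probability table.

    Rows index outcomes of the context X, columns index outcomes of the
    compression Y. I(X; Y) = sum_xy p(x,y) * log2(p(x,y) / (p(x) p(y))).
    """
    p_xy = joint_counts / joint_counts.sum()          # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)             # marginal over X
    p_y = p_xy.sum(axis=0, keepdims=True)             # marginal over Y
    mask = p_xy > 0                                   # avoid log(0) terms
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Toy intuition: a lossless "compression" (Y == X over 4 outcomes) retains
# all 2 bits; a compression independent of the context retains 0 bits.
lossless = np.eye(4) / 4
useless = np.full((4, 4), 1 / 16)
```

A good compressor sits between these extremes: its output keeps the bits of the context that matter while dropping the rest.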

Why it matters?

This work provides a principled way to design these multi-LLM systems. It suggests that focusing on improving the compressor LLM is often more effective than simply using a larger predictor LLM. This is especially important for running parts of these systems locally on devices, as it allows for smaller, more efficient cloud-based predictors, ultimately reducing costs and improving performance.
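The cost argument can be sketched with back-of-the-envelope arithmetic. The prices and token counts below are hypothetical assumptions for illustration, not figures from the paper; the point is simply that a local compressor shrinks what the cloud predictor must read, so predictor API cost scales down with the compression ratio.

```python
# Illustrative cost arithmetic (all numbers are hypothetical assumptions).
price_per_mtok = 3.00         # assumed cloud predictor price, $ per 1M input tokens
raw_context_tokens = 50_000   # assumed raw context size per query
compression_ratio = 5.0       # assumed tokens-in / tokens-out of the local compressor

compressed_tokens = raw_context_tokens / compression_ratio
raw_cost = raw_context_tokens / 1e6 * price_per_mtok          # send raw context
compressed_cost = compressed_tokens / 1e6 * price_per_mtok    # send compression

# The predictor's input cost drops by exactly the compression ratio,
# while the compressor itself runs locally at no API cost.
savings_fraction = 1 - compressed_cost / raw_cost
```

Under these assumptions the cloud bill falls by 80%; the paper's reported 26%-of-API-cost result for its Deep Research system reflects the full pipeline, not this toy calculation.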

Abstract

Agentic language model (LM) systems power modern applications like "Deep Research" and "Claude Code," and leverage multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs (that can even run locally) distill raw context into compact text that is then consumed by larger "predictor" LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to quantify compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors not only are more accurate, but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is 1.6× more accurate, 4.6× more concise, and conveys 5.5× more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover 99% of frontier-LM accuracy at 26% of API costs.