TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification
Adam Rida
2026-04-17
Summary
This paper introduces TRACER, a system that reduces the cost of using large language models (LLMs) by training a smaller, faster 'surrogate' model to handle much of the same classification traffic. It leverages the input-output records already generated by routine LLM use to train this surrogate.
What's the problem?
Using LLMs for tasks like text classification can be expensive because every request consumes significant compute. At the same time, each request leaves behind a record of the input and the LLM's answer. The challenge is twofold: using this existing data to train a cheaper alternative without sacrificing accuracy, and knowing when it is safe to use that alternative instead of the original LLM.
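As a minimal illustration (not code from the paper), the logged request-response pairs described above can be read directly as a supervised training set. The field names `prompt` and `label` are assumptions about the log schema:

```python
def traces_to_dataset(logs):
    """Turn logged LLM calls into (input, label) training pairs."""
    inputs, labels = [], []
    for entry in logs:
        inputs.append(entry["prompt"])   # what the user sent
        labels.append(entry["label"])    # what the LLM answered
    return inputs, labels

# Two illustrative log entries from a hypothetical intent classifier.
logs = [
    {"prompt": "reset my password", "label": "account_recovery"},
    {"prompt": "cancel my order", "label": "order_cancellation"},
]
X, y = traces_to_dataset(logs)
```

Any off-the-shelf classifier trained on `(X, y)` then plays the role of the surrogate.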
What's the solution?
TRACER works by continuously training a smaller model on the traces generated from the LLM's own use. It doesn't deploy this surrogate immediately, though. Instead, it applies a 'parity gate': the surrogate is activated only when its agreement with the LLM's outputs exceeds a threshold α chosen by the user. TRACER also provides explanations of *why* the surrogate handles certain inputs and defers others, making the routing decision transparent.
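The parity-gate idea can be sketched in a few lines. This is a simplified reading, not the paper's implementation: `parity` measures agreement on a held-out sample of traces, and traffic is routed to the surrogate only while that agreement meets the user's target α:

```python
def parity(surrogate_preds, llm_preds):
    """Fraction of a held-out trace sample where surrogate and LLM agree."""
    agree = sum(s == t for s, t in zip(surrogate_preds, llm_preds))
    return agree / len(llm_preds)

def route(x, surrogate, llm, gate_open):
    """Serve from the cheap surrogate only while the parity gate is open."""
    return surrogate(x) if gate_open else llm(x)

alpha = 0.95  # user-specified agreement target
# 3 of 4 held-out predictions agree -> parity 0.75, below alpha: gate stays closed.
gate_open = parity(["a", "b", "b", "a"], ["a", "b", "b", "b"]) >= alpha

cheap = lambda x: "surrogate:" + x   # stand-in surrogate model
teacher = lambda x: "llm:" + x       # stand-in LLM endpoint
answer = route("hello", cheap, teacher, gate_open)  # gate closed -> LLM answers
```

Because the gate is evaluated on agreement rather than raw accuracy, it needs no ground-truth labels beyond the LLM's own outputs.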
Why does it matter?
This research is important because it offers a way to significantly reduce the cost of using powerful LLMs. By intelligently routing requests to a cheaper surrogate model when it's reliable, and keeping the LLM for more complex cases, it makes these technologies more accessible and sustainable. The open-source nature of TRACER allows others to build upon and improve this approach.
Abstract
Every call to an LLM classification endpoint produces a labeled input-output pair already retained in production logs. These pairs constitute a free, growing training set: a lightweight surrogate trained on them can absorb a significant portion of future traffic at near-zero marginal inference cost. The open questions are when the surrogate is reliable enough to deploy, what it handles versus defers, and how that boundary evolves as data accumulates. We introduce TRACER (Trace-based Adaptive Cost-Efficient Routing), an open-source system that trains ML surrogates on an LLM's own production traces and governs deployment through a parity gate: the surrogate is activated only when its agreement with the LLM exceeds a user-specified threshold α. To make the routing boundary transparent, TRACER generates interpretability artifacts describing which input regions the surrogate handles, where it plateaus, and why it defers. On a 77-class intent benchmark with a Sonnet 4.6 teacher, TRACER achieves 83-100% surrogate coverage depending on the quality target α; on a 150-class benchmark, the surrogate fully replaces the teacher. On a natural language inference task, the parity gate correctly refuses deployment because the embedding representation cannot support reliable separation. The system is available as open-source software.