Retrieval-augmented reasoning with lean language models
Ryan Sze-Yin Chan, Federico Nanni, Tomas Lazauskas, Rosie Wood, Penelope Yong, Lionel Tarassenko, Mark Girolami, James Geddes, Andrew Duncan
2025-08-20
Summary
This paper presents a way to build a small, efficient language model that can both reason about complex questions and retrieve the information needed to answer them, without requiring large amounts of computing power or calls to big external services.
What's the problem?
Many current AI systems that answer questions by looking up information rely on very large, powerful models. These are expensive to run and poorly suited to settings where privacy is a major concern or large-scale compute is unavailable, such as on a personal device or inside a secure network.
What's the solution?
The researchers built a smaller language model that combines retrieval (finding relevant information) and reasoning (interpreting and processing it). Drawing on recent work in test-time scaling, they fine-tuned Qwen2.5-Instruct models to answer difficult questions about medical conditions from the NHS website. To teach the smaller model to reason over retrieved documents, they generated synthetic training data, including questions and step-by-step reasoning traces, using more capable frontier models, and they also explored compressing the retrieved documents through summarisation.
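As a rough illustration of the retrieve-then-reason loop described above, here is a minimal Python sketch pairing an off-the-shelf dense retriever with a small Qwen2.5-Instruct checkpoint. The retriever model, the two-snippet corpus, and the prompt are illustrative assumptions; the authors' fine-tuned checkpoints and the full NHS A-to-Z corpus are not reproduced here.

```python
# Minimal retrieve-then-reason sketch (illustrative, not the authors' code).
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

# Dense retriever: embed the query and a toy corpus of condition-page snippets.
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = [
    "Asthma is a common lung condition that causes breathing difficulties...",
    "Migraine is a moderate or severe headache felt as a throbbing pain...",
]
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

query = "What are the symptoms of asthma?"
query_emb = retriever.encode(query, convert_to_tensor=True)
top_hit = util.semantic_search(query_emb, corpus_emb, top_k=1)[0][0]
context = corpus[top_hit["corpus_id"]]

# Lean backbone: a small Qwen2.5-Instruct checkpoint stands in for the
# authors' fine-tuned model, which is not assumed here.
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

messages = [
    {"role": "system", "content": "Answer using only the provided context."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

In the paper's setting, the generic instruct checkpoint would be replaced by the reasoning-aware fine-tuned model, and the toy corpus by embeddings of the (optionally summarised) NHS condition pages.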
Why it matters?
This work shows that AI systems capable of complex question answering and reasoning can be built with smaller, more accessible models. That makes advanced capabilities practical for everyday use, especially where large, cloud-based systems are infeasible or undesirable, offering a more private and efficient alternative.
Abstract
This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
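For the synthetic-data side of the abstract, the sketch below shows one way to elicit a question, a reasoning trace, and an answer from DeepSeek-R1 over a single corpus page via its OpenAI-compatible API. The prompt wording and the record format are assumptions for illustration, not the authors' exact pipeline.

```python
# Hypothetical sketch of synthetic training-data generation with DeepSeek-R1.
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

page = "Asthma is a common lung condition that causes breathing difficulties..."
prompt = (
    "Write one patient-style question answerable from the page below, "
    "then answer it step by step.\n\nPage:\n" + page
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": prompt}],
)

# deepseek-reasoner returns its chain of thought separately from the final
# answer; keeping both yields one example for reasoning-aware fine-tuning.
example = {
    "document": page,
    "reasoning": response.choices[0].message.reasoning_content,
    "answer": response.choices[0].message.content,
}
print(json.dumps(example, indent=2))
```

Records like this, accumulated over the curated corpus, would then form the supervised fine-tuning set for the lean backbone model.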