Retrieval-augmented reasoning with lean language models (2508.11386v1)

Published 15 Aug 2025 in cs.CL, cs.AI, and cs.CY

Abstract: This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean LLM architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.

Summary

  • The paper introduces a framework combining retrieval augmentation and structured reasoning with lean LMs to perform domain-specific QA tasks.
  • The methodology leverages synthetic query generation, document summarisation, and reasoning-aware fine-tuning to achieve performance comparable to frontier LLMs.
  • Empirical evaluations show that even small models (as low as 1.5B parameters) deliver high accuracy in complex tasks while ensuring resource efficiency and data privacy.

Retrieval-Augmented Reasoning with Lean LLMs

Introduction

The paper "Retrieval-augmented reasoning with lean LLMs" (2508.11386) presents a comprehensive framework for integrating retrieval-augmented generation (RAG) and structured reasoning within small-scale, locally deployable LLMs. The work addresses the challenge of achieving high performance in complex, domain-specific question answering tasks under resource and privacy constraints, where reliance on large, cloud-hosted LLMs or external APIs is infeasible. The authors demonstrate that, through careful pipeline design, synthetic data generation, and reasoning-aware fine-tuning, lean models can approach the performance of frontier LLMs in specialized domains.

System Architecture and Pipeline

The proposed system consists of several tightly integrated components: a dense retriever, a lean reasoning model, synthetic query and reasoning trace generation, and a conversational interface. The pipeline is designed to be reproducible and adaptable to other domains (Figure 2).

Figure 2: Overview of the pipeline, covering synthetic data creation, information retrieval, reasoning trace generation, and model fine-tuning.

Retriever

The retriever indexes a corpus of domain-specific documents (e.g., NHS A-to-Z condition pages) using sentence-transformer embeddings and a vector database (Chroma or FAISS). Documents are chunked with overlap to preserve semantic continuity, and retrieval is performed via similarity search. Full-document retrieval is employed to ensure the LLM receives complete context, mitigating the risk of missing critical information due to chunk boundaries.
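
The following is a minimal sketch of such a retriever, using sentence-transformers with a FAISS index over a toy corpus. The embedding model, chunk size, overlap, and example documents are illustrative assumptions, not the paper's exact configuration:

```python
# Dense-retriever sketch: overlapping chunks are embedded and indexed,
# but hits are mapped back to full documents (full-document retrieval).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split a document into overlapping windows so that information
    straddling a boundary appears intact in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Toy stand-in for the NHS A-to-Z corpus.
corpus = {
    "chickenpox": "Chickenpox is a common illness that causes an itchy, spotty rash...",
    "measles": "Measles is an infection that spreads very easily and can cause...",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Index chunks, remembering the source document of each one.
chunks, doc_ids = [], []
for doc_id, text in corpus.items():
    for piece in chunk(text):
        chunks.append(piece)
        doc_ids.append(doc_id)

embeddings = encoder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(query: str, k: int = 5) -> list[str]:
    """Similarity search over chunks, deduplicated to full documents so the
    generator always sees complete pages."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, hits = index.search(np.asarray(q, dtype="float32"), min(k, index.ntotal))
    ordered = dict.fromkeys(doc_ids[i] for i in hits[0])  # preserves rank order
    return [corpus[d] for d in ordered]

print(retrieve("itchy spots on my child's skin", k=2))
```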

Synthetic Data and Reasoning Traces

Synthetic queries are generated using a large LLM (GPT-4o), with prompts designed to elicit realistic, varied, and challenging user requests. For each query, a set of relevant documents is retrieved, and a frontier reasoning model (DeepSeek-R1) is prompted to produce detailed reasoning traces and final answers. This process yields a dataset of (query, retrieved documents, reasoning trace, answer) tuples for supervised fine-tuning.
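
A hedged sketch of this data-generation loop is shown below, reusing the `retrieve` helper from the retriever sketch. The prompts, endpoint configuration, and the `reasoning_content` field (exposed by DeepSeek's OpenAI-compatible API) are assumptions for illustration, not the paper's exact implementation:

```python
# Synthetic data sketch: GPT-4o drafts queries, documents are retrieved
# per query, and DeepSeek-R1 supplies a reasoning trace plus final answer.
import json
from openai import OpenAI

query_gen = OpenAI()                                    # assumes OPENAI_API_KEY is set
reasoner = OpenAI(base_url="https://api.deepseek.com")  # assumed DeepSeek-compatible endpoint

def generate_queries(topic: str, n: int = 5) -> list[str]:
    resp = query_gen.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Write {n} realistic, varied patient questions about {topic}. "
                   'Return a JSON object {"queries": [...]}.'}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["queries"]

def build_example(query: str) -> dict:
    docs = retrieve(query, k=5)  # dense retriever from the sketch above
    context = "\n\n".join(docs)
    resp = reasoner.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    msg = resp.choices[0].message
    # DeepSeek's API returns the chain of thought separately from the answer.
    return {"query": query, "documents": docs,
            "trace": msg.reasoning_content, "answer": msg.content}

dataset = [build_example(q) for q in generate_queries("chickenpox")]
```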

Document Summarisation

To address context window limitations and reduce computational overhead, retrieved documents are summarised using Qwen2.5-32B-Instruct, achieving an 85% reduction in length while retaining essential information. Summarisation is performed offline to avoid latency during inference.
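
As a rough illustration of this offline compression step, the sketch below prompts a Qwen2.5 instruct checkpoint through the Hugging Face transformers pipeline (recent versions accept chat messages directly). The prompt wording and length budget are assumptions, and a smaller Qwen2.5-*-Instruct checkpoint can stand in for the 32B model when testing locally:

```python
# Offline summarisation sketch: each page is compressed once, before
# indexing, so no summarisation latency is paid at inference time.
from transformers import pipeline

summariser = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-32B-Instruct",  # swap in a smaller checkpoint for local testing
    device_map="auto",
)

def summarise(doc: str) -> str:
    budget = max(len(doc) // 7, 200)  # crude proxy for ~85% length reduction
    messages = [{"role": "user", "content":
                 f"Summarise this medical page in about {budget} characters, "
                 f"keeping symptoms, causes, and advice on seeking help:\n\n{doc}"}]
    out = summariser(messages, max_new_tokens=512)
    return out[0]["generated_text"][-1]["content"]  # assistant turn of the returned chat

# Run once offline over the corpus (from the retriever sketch);
# index the summaries in place of the raw pages.
summaries = {doc_id: summarise(text) for doc_id, text in corpus.items()}
```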

Fine-Tuning

Lean Qwen2.5-Instruct models (1.5B–32B parameters) are fine-tuned on the synthetic reasoning trace dataset using next-token prediction. The training setup leverages long context windows (up to 32k tokens), gradient checkpointing, and FSDP sharding to accommodate large input sequences. The fine-tuned models learn to emulate the structured reasoning process of the teacher model, including explicit intermediate steps before producing final answers.
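
A condensed sketch of this supervised stage appears below: each tuple is rendered as a chat whose assistant turn contains the trace followed by the answer, then trained with plain next-token prediction. The `<think>` delimiters, hyperparameters, and single-process setup are illustrative; the paper's 32k-token, FSDP-sharded runs would be launched via accelerate or torchrun:

```python
# Reasoning-aware fine-tuning sketch: the target sequence interleaves an
# explicit reasoning trace and the final answer, so the student learns to
# emit intermediate steps before answering.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.gradient_checkpointing_enable()  # trade compute for memory on long sequences

def render(ex):
    context = "\n\n".join(ex["documents"])
    messages = [
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {ex['query']}"},
        {"role": "assistant",
         "content": f"<think>\n{ex['trace']}\n</think>\n{ex['answer']}"},
    ]
    enc = tok(tok.apply_chat_template(messages, tokenize=False),
              truncation=True, max_length=32768)
    enc["labels"] = enc["input_ids"].copy()  # plain next-token prediction
    return enc

train_set = Dataset.from_list(dataset).map(  # `dataset` from the synthetic-data sketch
    render, remove_columns=["query", "documents", "trace", "answer"])

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="t0-sketch",
        per_device_train_batch_size=1,  # batch of 1 sidesteps padding in this sketch
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        bf16=True,
        # fsdp="full_shard",            # enable under a multi-GPU accelerate/torchrun launch
    ),
    train_dataset=train_set,
).train()
```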

Empirical Evaluation

Reasoning Performance

The authors replicate the s1.1 fine-tuning protocol and evaluate model performance on the AIME24 mathematical reasoning benchmark. Results indicate that reasoning-aware fine-tuning yields substantial gains for models ≥14B parameters, while smaller models may experience degradation due to insufficient capacity (Figure 3).

Figure 3: Performance on AIME24 of Qwen2.5-Instruct models and their post-trained versions fine-tuned on DeepSeek-R1 reasoning traces.

Retrieval and Reasoning Synergy

The system is evaluated on a synthetic medical QA task using the NHS A-to-Z corpus. Retrieval accuracy is maximised by indexing summarised documents, with p@5 reaching 76%. Condition and disposition prediction tasks show that the fine-tuned 32B model (t0-1.1-k5-32B) achieves 56% condition accuracy and 51% disposition accuracy, matching or exceeding several frontier models (GPT-4o, DeepSeek-R1) when provided with retrieved context. Notably, the lean model outperforms general-purpose reasoning models (s1.1-32B, Qwen3-32B) on domain-specific tasks (Figure 5).

Figure 5: Performance on condition prediction of Qwen2.5-Instruct models alone, with RAG, and with post-trained t0 versions combining RAG and reasoning.
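
For reference, the precision@5 reported above is simply the fraction of the top five retrieved documents that are relevant, averaged over queries. A minimal sketch, with hypothetical gold labels standing in for the paper's annotations, is:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved document ids that are relevant."""
    return sum(doc_id in relevant for doc_id in retrieved[:k]) / k

# Hypothetical labelled pairs of (ranked retrieved ids, gold relevant ids).
labelled = [
    (["chickenpox", "measles", "eczema", "scabies", "hives"], {"chickenpox", "eczema"}),
    (["flu", "common-cold", "covid-19", "asthma", "hay-fever"], {"flu", "common-cold", "covid-19"}),
]
mean_p5 = sum(precision_at_k(r, g) for r, g in labelled) / len(labelled)
print(f"p@5 = {mean_p5:.2f}")
```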

Model Size Trade-offs

Distillation enables strong performance even in models as small as 1.5B parameters, with condition prediction accuracy comparable to much larger non-reasoning baselines. The primary performance boost for small models arises from reasoning-aware fine-tuning, which imparts the ability to interpret and synthesize retrieved evidence.

Conversational Interface and Deployment

A Svelte-based web frontend is provided, supporting multi-turn chat interactions and dynamic retrieval. The backend orchestrates retrieval, reasoning, and response generation, with tool-calling and prompt templates tailored to Qwen2.5-Instruct. The interface is designed for extensibility and secure deployment, with all state managed server-side to ensure consistency (Figure 1).

Figure 1: Snapshot of the chat interface, illustrating multi-turn interaction and reasoning trace display.
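
A minimal sketch of such a backend is shown below, keeping all conversation state server-side and calling the fine-tuned model through an OpenAI-compatible local endpoint (e.g. a vLLM server). FastAPI, the endpoint URL, and the route shape are assumptions; the paper's own frontend is Svelte, which would call an endpoint like this:

```python
# Chat-backend sketch: each turn retrieves context, appends it to the
# server-held history, and queries the locally served fine-tuned model.
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # assumed local server
sessions: dict[str, list[dict]] = {}  # all state lives server-side for consistency

class Turn(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
def chat(turn: Turn) -> dict:
    history = sessions.setdefault(turn.session_id, [])
    docs = retrieve(turn.message, k=5)  # dense retriever from the earlier sketch
    context = "\n\n".join(docs)
    history.append({"role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {turn.message}"})
    resp = llm.chat.completions.create(model="t0-1.1-k5-32B", messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return {"answer": answer}
```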

Discussion and Implications

Domain-Specific vs General Reasoning

Fine-tuning on in-domain reasoning traces confers a clear advantage over general-purpose reasoning models, particularly in tasks requiring nuanced interpretation of specialized evidence. The approach is robust to query complexity, with error analysis showing reduced underestimation and improved handling of ambiguous or challenging requests.

Resource Efficiency and Privacy

The framework enables deployment of high-performing QA agents in secure, resource-constrained environments, with model sizes and memory footprints suitable for on-premises or air-gapped infrastructure. This is critical for applications in healthcare, government, and other sectors where data privacy is paramount.

Future Directions

Potential enhancements include query-aware document summarisation, retriever-generator co-training, multi-hop retrieval, and integration of multimodal evidence. The methodology is extensible to other domains, provided suitable synthetic data and reasoning traces can be generated or curated.

Conclusion

This work demonstrates that retrieval-augmented reasoning can be effectively realized in lean LLMs through structured pipeline design, synthetic data generation, and reasoning-aware fine-tuning. The resulting models achieve performance comparable to frontier LLMs in domain-specific QA tasks, with significant implications for privacy-preserving, resource-efficient AI deployment. The open-source codebase and modular architecture facilitate adaptation to diverse applications requiring grounded, interpretable reasoning over private knowledge bases.
