
Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family

Published 25 Apr 2025 in cs.CL (arXiv:2504.18225v1)

Abstract: We introduce a new generation of small reasoning models for RAG, search, and source summarization. Pleias-RAG-350m and Pleias-RAG-1B are mid-trained on a large synthetic dataset emulating the retrieval of a wide variety of multilingual open sources from the Common Corpus. They provide native support for citation and grounding with literal quotes and reintegrate multiple features associated with RAG workflows, such as query routing, query reformulation, and source reranking. Pleias-RAG-350m and Pleias-RAG-1B outperform SLMs below 4 billion parameters on standardized RAG benchmarks (HotPotQA, 2wiki) and are competitive with popular larger models, including Qwen-2.5-7B, Llama-3.1-8B, and Gemma-3-4B. They are the only SLMs to date maintaining consistent RAG performance across leading European languages and ensuring systematic reference grounding for statements. Due to their size and ease of deployment on constrained infrastructure and higher factuality by design, the models unlock a range of new use cases for generative AI.

Summary

  • The paper presents small reasoning models that integrate built-in citation for source grounding, reducing hallucinations in AI responses.
  • It introduces a structured multi-step workflow with synthetic multilingual training, achieving competitive performance against larger models.
  • The models are designed for on-device AI and regulated settings, ensuring data privacy and compliance through explicit external memory use.

The paper introduces the Pleias-RAG model family, specifically Pleias-RAG-350m and Pleias-RAG-1B, as a new generation of Small Reasoning Models (SRMs) designed for Retrieval Augmented Generation (RAG), search, and source summarization. The core motivation is to address the limitations of larger models (high computational requirements, privacy concerns, data friction) while overcoming the typical factuality and reasoning challenges faced by smaller models, especially concerning hallucinations.

These models are mid-trained on a large (approx. 9.5 billion tokens), synthetically generated dataset that emulates RAG workflows over a wide variety of multilingual open sources derived from the Common Corpus. A key feature is their native support for citation and grounding, providing literal quotes from sources using a Wikipedia-inspired <ref> syntax.
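Because the quotes are emitted inline, downstream code can both extract them and check that they are literally grounded in the supplied sources. A minimal Python sketch, assuming a hypothetical `<ref name="...">"quote"</ref>` attribute format (the paper only specifies a Wikipedia-inspired `<ref>` syntax, so the exact attributes may differ):

```python
import re

# Hypothetical citation format: <ref name="source_id">"literal quote"</ref>
CITATION_RE = re.compile(r'<ref name="([^"]+)">"([^"]*)"</ref>')

def extract_citations(text: str) -> list[tuple[str, str]]:
    """Return (source_id, quoted text) pairs found in a model response."""
    return [(m.group(1), m.group(2)) for m in CITATION_RE.finditer(text)]

def is_grounded(quote: str, sources: list[str]) -> bool:
    """A quote is grounded if it appears verbatim in at least one source."""
    return any(quote in source for source in sources)

response = ('The training data is auditable.'
            '<ref name="source_1">"a fully open multilingual corpus"</ref>')
sources = ["Common Corpus is a fully open multilingual corpus."]

for source_id, quote in extract_citations(response):
    print(source_id, is_grounded(quote, sources))
```

Verbatim-substring checking like this is only possible because the model quotes literally; paraphrased citations would need fuzzier matching.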

Key Design Principles and Implementation Details:

  1. Source Reasoning Paradigm: Pleias-RAG models are explicitly designed to function as "source reasoners," interacting primarily with external memory (retrieved documents) rather than relying solely on internal knowledge. This design choice is fundamental for mitigating hallucinations and enabling deployment in environments where data privacy and control are critical. By externalizing knowledge, organizations can manage data access and reduce the risk of sensitive information being memorized by the model.
  2. Built-in Citation and Grounding: Unlike many RAG systems that use post-hoc methods for adding citations, Pleias-RAG models generate citations directly during the inference process. This involves the model outputting the reference markers (<ref>) and the associated quoted text itself, which it has been trained to process. This allows for greater control over the presentation of sources and facilitates features like automated citation shortening. This built-in approach provides enhanced verifiability and traceability, crucial for applications in regulated industries where audit trails and compliance are necessary.
  3. Structured Reasoning Workflow: The models integrate a standardized, multi-step reasoning process inspired by agentic systems. This workflow includes:
    • Query Analysis: Assessing user intent and desired information format.
    • Query Report: Classifying the query (e.g., trivial, standard, reformulated, unclear).
    • Source Analysis: Identifying and ranking relevant sources provided as context.
    • Source Report: Summarizing the adequacy of sources (e.g., extensive, basic, incomplete, infeasible).
    • Drafting: Generating the final answer based on the analysis.
  Formalizing these steps with special tokens and reports is hypothesized to help small models maintain focus and improve their logical capabilities. It enables the model to dynamically determine its generation path (e.g., providing a quick answer for trivial questions or entering a refusal mode when sources are insufficient).
  4. Multilingual Support: The models offer native multilingual support for major European languages. This is achieved through:
    • A new, dedicated tokenizer designed for lower token fertility (fewer tokens per word) and better word fidelity in languages like French, Italian, Spanish, German, and Polish, compared to tokenizers such as Llama's.
    • Specific adversarial training exercises that include query translation and source translation, forcing the model to handle language switching and maintain performance even when query and source languages differ.
  5. Tokenizer Recycling: The tokenizer for the RAG models reuses less useful tokens from the base model's vocabulary, repurposing them as special tokens for the structured reasoning format. This approach avoids pre-allocating tokens for potentially unused features and leverages less trained tokens, which are then retrained during the mid-training phase.
  6. Mid-training Methodology: Instead of traditional fine-tuning, Pleias-RAG models undergo a data- and compute-intensive mid-training process on a large synthetic dataset.
    • Synthetic Dataset Generation: Millions of RAG examples (query + sources) are generated from the auditable Common Corpus. Queries are created using a "back-translation" method, where an LLM (Gemma 3 12B) generates a question based on a random source excerpt.
    • Adversarial Exercises: To increase robustness, the dataset includes adversarial examples: randomly dropping sources, shuffling source order, generating refusal scenarios (by pairing irrelevant queries with sources), and introducing language switching scenarios.
    • Synthetic Reasoning Generation: A pipeline using larger models (Gemma 3 27B initially, then a fine-tuned Gemma 3 12B) generates the structured reasoning traces and final answers with citations for each RAG example. This process involves iterative correction and filtering to ensure quality and format adherence. The rationale is that synthetic data generation at scale is necessary for specialization, especially when real-world specialized datasets are scarce or proprietary.
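The special-token structure makes the reasoning trace machine-parseable. A sketch of a section splitter, using placeholder marker strings (the actual special tokens come from the recycled tokenizer vocabulary and may be spelled differently):

```python
# Placeholder markers for the five workflow steps; the real special tokens
# are defined by the model's tokenizer, not by this sketch.
SECTIONS = ["<|query_analysis|>", "<|query_report|>",
            "<|source_analysis|>", "<|source_report|>", "<|draft|>"]

def parse_reasoning_trace(output: str) -> dict[str, str]:
    """Split one generation into the workflow's named sections, in order."""
    hits = sorted((output.find(tok), tok) for tok in SECTIONS if tok in output)
    parsed = {}
    for i, (start, tok) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(output)
        parsed[tok.strip("<|>")] = output[start + len(tok):end].strip()
    return parsed

trace = ("<|query_report|>standard<|source_report|>extensive"
         "<|draft|>Answer with citations.")
print(parse_reasoning_trace(trace))
```

An application can then surface only the draft to end users while logging the reports for auditing.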

Performance and Evaluation:

Evaluations were conducted on standard multi-hop question answering benchmarks (HotpotQA (Yang et al., 2018), 2WikiMultiHopQA (Ho et al., 2020), MuSiQue (Trivedi et al., 2021)), modified for multilingual testing and scored with an LLM-as-a-judge.

  • Pleias-RAG-350m and Pleias-RAG-1B perform competitively with larger models (4-8B parameters) and set a new state-of-the-art for SLMs below 4 billion parameters on these benchmarks.
  • They are found to be Pareto-optimal among SLMs regarding RAG performance vs. model size.
  • Crucially, they demonstrate negligible performance degradation across tested European languages (French, Italian, German, Spanish), significantly outperforming other SLMs and even some larger models in multilingual RAG scenarios.
  • Qualitative evaluations highlight the models' ability to provide concise, cited answers but also point to challenges like reliably refusing to answer when information is genuinely absent from the provided sources.
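The Pareto-optimality claim can be made concrete: a model sits on the size/performance frontier when no other model is both no larger and no worse, with at least one strict improvement. A sketch with illustrative, made-up numbers (not the paper's reported scores):

```python
def pareto_optimal(models: list[tuple[str, float, float]]) -> list[str]:
    """Return names of models not dominated on (smaller size, higher score).

    models: list of (name, size_in_billions, benchmark_score).
    """
    frontier = []
    for name, size, score in models:
        dominated = any(
            s2 <= size and sc2 >= score and (s2 < size or sc2 > score)
            for n2, s2, sc2 in models if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative numbers only, for demonstrating the frontier computation.
models = [("slm-350m", 0.35, 0.55), ("slm-1b", 1.2, 0.62),
          ("other-1b", 1.1, 0.50), ("big-8b", 8.0, 0.65)]
print(pareto_optimal(models))
```

Here `other-1b` is dominated (the 350m model is smaller and scores higher), while the other three all trade size against score and stay on the frontier.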

Practical Applications and Deployment:

The small size and specialized design of Pleias-RAG models make them suitable for a range of applications previously challenging for large models:

  • On-device and Local AI: Deployment on constrained hardware (e.g., tested on Raspberry Pi 4) for use cases requiring local processing, low latency, or offline access.
  • Secured Infrastructures: Suitable for professional settings with strict data privacy and security requirements, as the external memory paradigm allows sensitive data to remain within controlled environments.
  • Regulated Industries: The built-in citation and auditable training data provide crucial traceability, transparency, and compliance features for sectors like legal, healthcare, and finance.
  • Agentic Search and Workflows: The integrated structured reasoning workflow enables the models to perform multi-step tasks and could be integrated into larger systems for query routing, reformulation, and source reranking.
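Query routing on top of the model's query report can be a simple dispatch. A sketch using the report labels listed above, with hypothetical path names on the right-hand side:

```python
def route(query_report: str) -> str:
    """Map the model's query-report label to a generation path.

    The labels are the paper's query classes; the path names are
    placeholders for whatever the host application implements.
    """
    paths = {
        "trivial": "quick_answer",            # skip the full reasoning trace
        "standard": "full_reasoning",
        "reformulated": "rewrite_query_then_answer",
        "unclear": "ask_for_clarification",
    }
    return paths.get(query_report, "full_reasoning")  # safe default

print(route("trivial"))
```

A default path for unrecognized labels keeps the router robust if the model emits an unexpected report.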

Implementation Considerations:

  • Computational Requirements: While small in size, the mid-training process requires significant computational resources. Inference, however, is highly efficient, enabling deployment on low-resource hardware.
  • Data Preparation: The quality and diversity of the synthetic training data are paramount and require careful design, generation, filtering, and iterative refinement.
  • Context Length: The models are trained with a moderate context length (4096 tokens), which might limit their ability to process very long documents or extensive sets of sources in a single pass.
  • Integration: Implementing the full structured reasoning workflow in an application requires parsing the model's output, which uses special tokens to delimit the different workflow steps.
  • Source Quality: The factuality of the output is dependent on the quality and relevance of the provided sources. The models are trained to reason over provided context, but garbage in will still result in garbage out, even with citations.
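Given the 4096-token context, a deployment has to budget how many reranked sources fit in a single pass. A rough sketch that approximates token counts with whitespace-separated words (a real integration would count with the model's own tokenizer and account for the prompt template):

```python
def pack_sources(query: str, sources: list[str],
                 budget: int = 4096, reserve: int = 512) -> list[str]:
    """Greedily keep ranked sources until the token budget is spent.

    `reserve` leaves room for the reasoning trace and answer.
    Word counts stand in for token counts in this sketch.
    """
    used = len(query.split())
    kept = []
    for source in sources:  # assumed already ranked by relevance
        cost = len(source.split())
        if used + cost > budget - reserve:
            break
        kept.append(source)
        used += cost
    return kept

src1 = "a b c d e f g h i j"   # 10 "tokens"
src2 = "k l m n o p q r s t"   # 10 "tokens"
print(pack_sources("what is x", [src1, src2], budget=20, reserve=5))
```

Because packing stops at the first source that overflows, ranking quality directly determines what the model gets to see.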

Future Directions:

Ongoing research focuses on extending the context length, building in native support for search (generating API calls and processing results), personality tuning for more consistent identity, and incorporating reinforcement learning techniques, particularly leveraging external error feedback and structured critique to improve reasoning chains.
