
RAGentA: Agentic RAG Frameworks

Updated 25 February 2026
  • RAGentA is a suite of multi-agent frameworks for Retrieval-Augmented Generation that unifies single- and multi-hop queries using agentic orchestration and hybrid retrieval strategies.
  • The framework employs an LLM-driven controller with chain-of-thought reasoning to iteratively gather evidence and decide when to finalize answers.
  • Empirical results show improved retrieval metrics and answer faithfulness, with modular design adaptable for diverse applications like QA, literature review, and fintech.

RAGentA denotes a family of multi-agent frameworks for Retrieval-Augmented Generation (RAG), characterized by agentic orchestration, hybrid retrieval strategies, interpretable reasoning steps, and dynamic adaptation to query complexity. The term now encompasses several representative architectures, notably the original open-source Agent-UniRAG system for unified single- and multi-hop RAG and its multi-agent generalizations for attributed QA, scientific literature review, and other verticals. All share a commitment to modularity, explicit evidence attribution, and verifiable answer composition in knowledge-intensive tasks.

1. Agentic RAG Architecture and Unification of Query Types

The foundational architecture for RAGentA is exemplified by Agent-UniRAG, which introduces an LLM-driven agent controller—typically a fine-tuned open-source transformer (e.g., Llama-3-8B)—to orchestrate the RAG workflow in an end-to-end, step-wise manner. There are three primary modules:

  • Retrieval Module: Comprises a sparse retriever (BM25) and optional dense reranker (e.g., Multilingual E5 Large or similar dual-encoder), returning Top-K passages for each search query.
  • Agent Controller (LLM): Receives the user query and maintains a working memory log $L$ of per-step (Thought, Action, Evidence) entries, guiding reasoning via explicit chain-of-thought outputs and dynamically deciding between "Search" and "Final Answer" actions at every timestep.
  • Generation Module: The LLM synthesizes the answer using evidence accumulated within memory.

A critical advance is the unified treatment of single-hop and multi-hop queries: all are processed in the same loop where the agent decides—based on query and working memory—when to stop, ensuring multi-hop questions trigger iterative retrieval/planning cycles until a final answer is justified. The workflow proceeds as follows:

procedure AGENT_UNIRAG(q, max_steps=k):
    L ← [(User Query, q)]
    for t in 1..k:
        rᵗ, decision ← LLM_plan(L)
        if decision == "Final Answer":
            break
        aᵗ ← generate_search_query(rᵗ)
        Dᵗ ← RETRIEVE(aᵗ)
        eᵗ ← EVIDENCE_REFLECTOR(aᵗ, Dᵗ)
        L.append((Thought, rᵗ))
        L.append((Action, aᵗ))
        L.append((Evidence, eᵗ))
    answer ← LLM_generate_final(L)
    return answer
This loop, architecturally mirrored in other agentic RAG systems (e.g., in fintech and literature review), enables interpretable, step-by-step QA workflows adaptable to varying input complexity (Pham et al., 28 May 2025).
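The loop above can be sketched in runnable Python with the LLM planner, query generator, retriever, reflector, and final generator injected as callables. All function and parameter names here are illustrative stand-ins, not identifiers from the papers:

```python
from typing import Callable, List, Tuple

# Working memory: a list of (role, content) entries, mirroring the paper's
# Thought/Action/Evidence log (names are this sketch's own convention).
Memory = List[Tuple[str, str]]

def agent_unirag(query: str,
                 plan: Callable[[Memory], Tuple[str, str]],
                 make_query: Callable[[str], str],
                 retrieve: Callable[[str], List[str]],
                 reflect: Callable[[str, List[str]], str],
                 generate_final: Callable[[Memory], str],
                 max_steps: int = 5) -> str:
    """One unified loop handling both single- and multi-hop queries."""
    memory: Memory = [("User Query", query)]
    for _ in range(max_steps):
        thought, decision = plan(memory)       # chain-of-thought + action choice
        if decision == "Final Answer":
            break
        action = make_query(thought)           # turn the thought into a search query
        docs = retrieve(action)                # Top-K passages (sparse + dense)
        evidence = reflect(action, docs)       # condense docs into a context snippet
        memory += [("Thought", thought), ("Action", action), ("Evidence", evidence)]
    return generate_final(memory)
```

Because the components are injected, the same loop serves single-hop queries (the planner emits "Final Answer" after one retrieval) and multi-hop queries (several Search iterations) without an external query classifier.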

2. Multi-Agent Collaboration and Specialized Agent Roles

The RAGentA paradigm often extends the above controller to explicit multi-agent settings, each agent specializing in a subtask with hand-offs across workflow stages. For instance, in attributed QA (Besrour et al., 20 Jun 2025):

  • Hybrid Retriever: Executes hybrid scoring $S_\mathrm{hybrid}(d;q) = \alpha\, S_\mathrm{sparse}(d;q) + (1-\alpha)\, S_\mathrm{dense}(d;q)$.
  • Predictor (Agent 1): Generates provisional answers per document.
  • Filter/Judge (Agent 2): Scores and filters relevance of provisional answers.
  • Final Synthesizer (Agent 3): Composes a unified answer with fine-grained in-line document citations.
  • Verifier/Reviser (Agent 4): Decomposes queries, checks completeness, and enacts dynamic retrieval for uncovered aspects.
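The four-agent hand-off above can be sketched as a short pipeline with each agent injected as a callable. The function names, the relevance threshold, and the single dynamic-retrieval round are this sketch's assumptions, not details fixed by the papers:

```python
def multi_agent_qa(query, retrieve, predict, judge, synthesize, verify,
                   relevance_threshold=0.5):
    """Sketch of the four-agent attributed-QA hand-off (all agents injected)."""
    docs = retrieve(query)                                 # hybrid retrieval
    candidates = [(d, predict(query, d)) for d in docs]    # Agent 1: per-doc answers
    kept = [(d, a) for d, a in candidates
            if judge(query, a) >= relevance_threshold]     # Agent 2: filter/judge
    answer = synthesize(query, kept)                       # Agent 3: cited synthesis
    missing = verify(query, answer)                        # Agent 4: uncovered aspects
    if missing:                                            # one dynamic-retrieval round
        extra = [(d, predict(query, d)) for d in retrieve(missing)]
        answer = synthesize(query, kept + extra)
    return answer
```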

Similar patterns—modular Planner, Step Definer, Extractor, and QA agents—are observed in systems targeting ambiguous, multi-hop, or domain-specific questions (Nguyen et al., 26 May 2025, Cook et al., 29 Oct 2025).

In the agent-based orchestration employed for scientific literature review, the controller agent dynamically chooses between specialized retrieval sub-pipelines (GraphRAG or VectorRAG), relying on learned policies over query features (Nagori et al., 30 Jul 2025).

3. Retrieval, Hybrid Ranking, and Evidence Processing

At the core of all RAGentA systems is a hybrid retrieval stack. Typical retrieval scoring functions include:

  • BM25 (Sparse):

$$s(q, d) = \sum_{w \in q \cap d} \mathrm{IDF}(w)\, \frac{f(w, d)\,(k_1+1)}{f(w, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$

  • Dense Reranking:

$$s_\mathrm{dense}(q, d) = \phi(q)^\top \phi(d)$$

  • Hybrid Combination:

$$S_\mathrm{hybrid}(d;q) = \alpha\, S_\mathrm{sparse}(d;q) + (1-\alpha)\, S_\mathrm{dense}(d;q)$$

Empirical results show that interpolated hybrid retrieval (with, e.g., α=0.65\alpha = 0.65) can yield Recall@20 improvements of up to 12.5% over single-method baselines (Besrour et al., 20 Jun 2025).
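The three scoring functions above translate directly into code. This is a minimal sketch over toy bag-of-words inputs; real systems compute IDF from corpus statistics and typically normalize sparse and dense scores before interpolating:

```python
def bm25_score(query_terms, doc_terms, idf, avgdl, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query."""
    score = 0.0
    for w in set(query_terms) & set(doc_terms):
        f = doc_terms.count(w)                          # term frequency f(w, d)
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl)  # length normalization
        score += idf.get(w, 0.0) * f * (k1 + 1) / (f + norm)
    return score

def dense_score(q_vec, d_vec):
    """Dual-encoder similarity: dot product of query/document embeddings."""
    return sum(qi * di for qi, di in zip(q_vec, d_vec))

def hybrid_score(s_sparse, s_dense, alpha=0.65):
    """Interpolated hybrid score; alpha = 0.65 follows the reported setting."""
    return alpha * s_sparse + (1 - alpha) * s_dense
```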

To further reduce hallucination and increase faithfulness, agentic workflows execute iterative filtering (via Judge/Filter agents) and dynamic evidence attribution (in-line citations tied to specific retrievals). Reflector or Extractor modules process retrieved documents into concise, context-ready snippets for the generator.

4. Training Procedures, Datasets, and Objectives

Instruction-finetuning is the dominant paradigm for controller/generation LLMs in open-source RAGentA instantiations. The canonical per-turn loss is $\mathcal{L} = -\sum_{j=1}^{|Y|} \log p_\pi(y_j \mid x, y_{<j})$, with user-prompt tokens masked out of the loss. Proposed end-to-end losses incorporate negative log-likelihood for retrieval, cross-entropy for planning, and cross-entropy for final generation, with tunable weights $(\alpha, \beta, \gamma)$ (Pham et al., 28 May 2025).
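The masked per-turn loss and the weighted end-to-end objective can be expressed in a few lines. This sketch operates on precomputed per-token log-probabilities rather than a full training framework; the function names are illustrative:

```python
def masked_nll(token_logprobs, loss_mask):
    """Per-turn loss: negative log-likelihood summed over response tokens only.

    token_logprobs: log p(y_j | x, y_<j) at every target position
    loss_mask:      1 for answer tokens, 0 for user-prompt tokens (masked out)
    """
    return -sum(lp * m for lp, m in zip(token_logprobs, loss_mask))

def combined_loss(l_retrieval, l_planning, l_generation,
                  alpha=1.0, beta=1.0, gamma=1.0):
    """Proposed end-to-end objective: weighted sum of the three components."""
    return alpha * l_retrieval + beta * l_planning + gamma * l_generation
```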

RAGentA research highlights the critical role of curated training resources:

  • SynAgent-RAG: A synthetic dataset of ≈17K train/1.2K test examples, mixing 50% single-hop and 50% multi-hop queries over Wikipedia, with granular chains-of-thought and evidence annotations (Pham et al., 28 May 2025).
  • Other frameworks leverage synthetic, privacy-preserved QA pairs or domain-annotated corpora for their respective evaluation and tuning needs (Driouich et al., 26 Aug 2025, Cook et al., 29 Oct 2025).

Curriculum learning is sometimes applied: early epochs focus on single-hop, later shifting weight to multi-hop examples to stabilize planning accuracy.
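A curriculum of this kind can be implemented as an epoch-dependent sampler. The linear schedule below is purely illustrative; the papers do not pin down a specific weighting formula:

```python
import random

def curriculum_sample(single_hop, multi_hop, epoch, total_epochs, rng=random):
    """Draw one training example, shifting weight from single- to multi-hop.

    Early epochs favor single-hop examples; the multi-hop share grows
    linearly with the epoch index (an assumed schedule, for illustration).
    """
    p_multi = epoch / max(total_epochs - 1, 1)
    pool = multi_hop if rng.random() < p_multi else single_hop
    return rng.choice(pool)
```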

5. Evaluation, Benchmarking, and Quantitative Performance

RAGentA system evaluation employs both standard Information Retrieval/QA metrics and specialized semantic faithfulness indicators.

  • Short-form QA: Exact Match (EM), F1, Accuracy.
  • Long-form QA: ROUGE-L, BLEU, and GPT-based semantic scoring.
  • Faithfulness: The fraction of answer components grounded in retrieved documents.
  • Coverage and Relevance: For attributed QA, $\text{Coverage} = \frac{|A \cap G|}{|G|}$ and $\text{Relevance} = \frac{|A \cap G|}{|A|}$.
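Treating the answer's claims $A$ and the gold claims $G$ as sets, the coverage and relevance formulas compute directly:

```python
def coverage_relevance(answer_claims, gold_claims):
    """Coverage = |A ∩ G| / |G|, Relevance = |A ∩ G| / |A| over claim sets."""
    a, g = set(answer_claims), set(gold_claims)
    hit = len(a & g)
    coverage = hit / len(g) if g else 0.0
    relevance = hit / len(a) if a else 0.0
    return coverage, relevance
```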

Empirical results for various RAGentA implementations include:

  • Agent-UniRAG (Llama-3-8B-Inst): Single-hop EM ≈ 32.8%, F1 ≈ 46.9% (outperforming GPT-3.5 Adaptive-RAG), Multi-hop EM ≈ 30.4–50.2%, F1 ≈ 39.8–59.9% across MuSiQue and HotpotQA, with ablations showing 5–7 pt and 15–20 pt F1 drops from removing Reflector and Planning modules, respectively (Pham et al., 28 May 2025).
  • Multi-Agent Attributed QA: Hybrid retrieval yields Recall@20 of 0.5650 (12.5% over BM25), with correctness 0.8346 and faithfulness 0.7044 (+1.09% and +10.72% over baseline) (Besrour et al., 20 Jun 2025).
  • Fintech Agentic RAG: Hit@5 accuracy of 62.35% vs. 54.12% (baseline); mean semantic accuracy increase Δ ≈ 0.68 (p < 0.05). Latency increases by ×6.4 but gains in robustness and coverage are statistically significant (Cook et al., 29 Oct 2025).

Selection of experimental details and notable gains are summarized below:

Benchmark/System | Key Retrieval/QA Metrics | RAGentA Gains
SynAgent-RAG (long-form) | ROUGE-L 0.36, BLEU 0.15, avg. steps 2.08 | Comparable to GPT-4-Turbo
Attributed QA (LiveRAG-style) | +1.09% correctness, +10.72% faithfulness | +12.5% Recall@20 vs. BM25
Fintech Q&A (Hit@5) | 62.35% (vs. 54.12%) | Significant at p < 0.01
SAP QE (Software Testing) | Accuracy 94.8% (vs. 65.2% basic RAG) | 85% efficiency, 35% cost reduction

6. Practical Implications, Strengths, and Limitations

RAGentA’s strengths include:

  • Unified Query Handling: No external classifier required to distinguish single- from multi-hop, or simple from complex queries.
  • Explicit Reasoning Chains: Thought/Action/Evidence logs at every step augment transparency and interpretability.
  • Evidence Attribution: Direct in-line citations or source tracking for each answer element.
  • Modularity: All system components (retrievers, agents, generators) are swappable and extensible; supports plug-in for larger LLMs, trainable retrievers, or multimodal tools.
  • Competitive Efficiency: Open-source, 8–70B parameter LLMs can approach or match performance of much larger, closed-source systems (Pham et al., 28 May 2025, Besrour et al., 20 Jun 2025, Hariharan et al., 12 Oct 2025).

Noted limitations are inference latency due to multi-agent and multiple LLM calls, restriction to RAG-style tasks (no code synthesis or non-knowledge dialogue), and the fact that most retrieval modules are not yet trained end-to-end.

Potential extensions discussed in the literature include integrating end-to-end retriever learning, one/few-shot planning to reduce steps, multimodal expansion (tables, code execution), continual KB ingestion, and reinforcement learning from agent outcomes (Pham et al., 28 May 2025, Lelong et al., 22 Jul 2025, Hariharan et al., 12 Oct 2025).

7. Deployment Recommendations and Domain-Specific Adaptation

For production- or research-scale deployment of RAGentA-based systems, researchers recommend:

  • Preloading robust sparse and dense retrieval stacks.
  • Domain-specific finetuning via agentic data (e.g., SynAgent-RAG style).
  • Monitoring and limiting agent call steps to manage latency.
  • Logging explicit reasoning chains for audit and continuous error analysis.
  • Adapting plug-and-play modules to new domains and closing the feedback loop from evaluation back into retriever/generator training.
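The step- and latency-limiting recommendation above can be enforced with a small budget wrapper around the agent loop. The names and default budgets here are illustrative choices, not values from the literature:

```python
import time

def run_with_budget(step_fn, is_done, max_steps=6, max_seconds=30.0):
    """Run an agent loop under both a step budget and a wall-clock budget.

    step_fn: executes one plan/retrieve/reflect iteration and returns new state
    is_done: predicate on the state (e.g. the agent chose "Final Answer")
    """
    start, state, steps = time.monotonic(), None, 0
    while steps < max_steps and time.monotonic() - start < max_seconds:
        state = step_fn(state)
        steps += 1
        if is_done(state):
            break
    return state, steps
```

Returning the step count alongside the result also supports the logging recommendation: each run's reasoning depth can be recorded for audit and error analysis.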

This approach has demonstrated viability for general knowledge QA (Pham et al., 28 May 2025), attributed QA (Besrour et al., 20 Jun 2025), enterprise software testing (Hariharan et al., 12 Oct 2025), scientific literature review (Nagori et al., 30 Jul 2025), and other complex, information-rich verticals.


RAGentA reframes retrieval-augmented generation as a modular, agent-driven workflow unifying single/multi-hop reasoning, explicit evidence attribution, and transparent answer synthesis. Empirical evidence across diverse research indicates measurably improved precision, faithfulness, robustness, and deployment flexibility relative to static RAG pipelines, with extensibility to specialized and multi-modal knowledge domains (Pham et al., 28 May 2025, Besrour et al., 20 Jun 2025, Lelong et al., 22 Jul 2025, Hariharan et al., 12 Oct 2025, Nagori et al., 30 Jul 2025, Cook et al., 29 Oct 2025, Nguyen et al., 26 May 2025).
