- The paper presents a modular agentic RAG pipeline for Ukrainian document understanding, emphasizing retrieval quality with metrics of 0.922 document correctness and 0.811 page proximity.
- It integrates multiple open-source LLMs, with Qwen2.5-3B-Instruct achieving a +0.27 accuracy boost when combined with dense retrieval and reranking.
- Agentic retry mechanisms provide modest accuracy improvements but trade off precise page localization in computationally constrained offline settings.
Agentic Retrieval-Augmented Generation for Ukrainian: A Technical Analysis
Context and Motivation
Retrieval-Augmented Generation (RAG) architectures have become foundational for improving LLM factuality and grounding, particularly for knowledge-intensive tasks. However, most RAG advances remain confined to high-resource languages, notably English. This paper presents the first systematic investigation of Agentic RAG applied to Ukrainian, conducted within the UNLP 2026 Shared Task on multi-domain document understanding. The study implements a modular agentic pipeline and examines the practical challenges of adapting advanced RAG techniques to Ukrainian in computationally constrained offline environments.
System Architecture and Methodology
The proposed pipeline consists of three principal layers:
- Dense Retrieval and Reranking: The retrieval module leverages BGE-M3 as a dense retriever and BGE-reranker-v2-m3 as a reranker, outperforming traditional sparse and multilingual dense baselines in both document and page localization metrics.
- Open-Weight LLM Integration: Multiple open-source Ukrainian and multilingual LLMs (including LapaLLM, MamayLM, Gemma, Llama-2-7b, and Qwen2.5-3B-Instruct) were benchmarked. Qwen2.5-3B-Instruct consistently delivered the highest answer accuracy in both LLM-only and LLM+retrieval settings.
- Minimal Agentic Layer: Lightweight agentic behaviors were implemented—query rephrasing and answer retry loops—designed to enhance answer accuracy by iteratively probing retrieval and generation modules based on confidence heuristics.
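The two-stage retrieval layer described above can be illustrated with a minimal sketch. It is not the paper's implementation: `embed`, `cosine`, and `rerank_score` are hypothetical toy stand-ins (bag-of-words overlap) for the roles that BGE-M3 and BGE-reranker-v2-m3 play, and the page records and parameters `k_retrieve` and `k_final` are invented for illustration. Only the control flow is the point: a cheap first pass narrows the candidates, and a second scorer reorders them.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query, passage):
    """Toy stand-in for a cross-encoder: share of query tokens found in the passage."""
    q_tokens = query.lower().split()
    p_tokens = set(passage.lower().split())
    return sum(t in p_tokens for t in q_tokens) / len(q_tokens) if q_tokens else 0.0


def retrieve_then_rerank(query, pages, k_retrieve=3, k_final=1):
    """Stage 1: keep the k_retrieve nearest pages. Stage 2: reorder them with the reranker."""
    q_vec = embed(query)
    candidates = sorted(
        pages, key=lambda pg: cosine(q_vec, embed(pg["text"])), reverse=True
    )[:k_retrieve]
    reranked = sorted(
        candidates, key=lambda pg: rerank_score(query, pg["text"]), reverse=True
    )
    return reranked[:k_final]


# Hypothetical page-level index: each entry carries its document and page number,
# so the same lookup yields both the supporting document and the page.
PAGES = [
    {"doc": "manual.pdf", "page": 1, "text": "installation steps for the device"},
    {"doc": "manual.pdf", "page": 2, "text": "warranty period is two years"},
    {"doc": "policy.pdf", "page": 1, "text": "refund policy and return period"},
]

top = retrieve_then_rerank("what is the warranty period", PAGES)
print(top[0]["doc"], top[0]["page"])
```

Indexing at page granularity is a design assumption of this sketch; it is one straightforward way to return a document and a page from a single retrieval call.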
The task required not only answering multiple-choice questions but also identifying the supporting document and page. The system was evaluated under stringent constraints—a single NVIDIA P100 GPU and a 9-hour inference time limit—which restricted implementation complexity.
Empirical Evaluation
Dense retrieval with reranking (BGE-M3 + BGE reranker) achieved mean document correctness $d_i = 0.922$ and mean page proximity $p_i = 0.811$, a significant improvement over sparse methods and single-model dense approaches.
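The text does not define how document correctness and page proximity are computed, so the following is only a sketch under assumed definitions: document correctness is taken as an exact match on the document identifier, and page proximity as a score that decays linearly with the page distance inside the correct document and is zero when the document is wrong. The function names, the normalization by `n_pages`, and the example records are all hypothetical.

```python
def document_correctness(pred_doc, gold_doc):
    """1.0 if the predicted document matches the gold document, else 0.0."""
    return 1.0 if pred_doc == gold_doc else 0.0


def page_proximity(pred_doc, pred_page, gold_doc, gold_page, n_pages):
    """ASSUMED definition, not taken from the paper:
    no credit for a wrong document, otherwise linear decay with page distance."""
    if pred_doc != gold_doc or n_pages <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(pred_page - gold_page) / n_pages)


def mean(values):
    """Arithmetic mean that tolerates an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical predictions: (pred_doc, pred_page, gold_doc, gold_page, n_pages)
examples = [
    ("a.pdf", 3, "a.pdf", 3, 10),  # exact hit
    ("a.pdf", 5, "a.pdf", 3, 10),  # right document, two pages off
    ("b.pdf", 1, "a.pdf", 3, 10),  # wrong document
]

mean_d = mean(document_correctness(p, g) for p, _, g, _, _ in examples)
mean_p = mean(page_proximity(*ex) for ex in examples)
print(round(mean_d, 3), round(mean_p, 3))
```

Under these assumptions the page score can never exceed the document score, which is consistent with the reported ordering of the two values, though the shared task's actual formula may differ.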
LLM Accuracy
The integration of retrieved context led to a marked increase in answer accuracy for all LLMs. Qwen2.5-3B-Instruct reached 0.69 accuracy when augmented with retrieval—a gain of +0.27 over LLM-only mode—outperforming both Ukrainian-specific and multilingual counterparts.
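As a quick check of the figures above, and assuming the $+0.27$ gain is an absolute difference in accuracy, the implied LLM-only accuracy of Qwen2.5-3B-Instruct and the corresponding relative improvement are:

$$
a_{\text{LLM-only}} = 0.69 - 0.27 = 0.42, \qquad \frac{0.27}{0.42} \approx 0.64
$$

That is, under this reading retrieval raises accuracy by roughly 64% relative to the LLM-only baseline.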
Agentic Mechanism Effects
Agentic retry mechanisms provided modest, consistent improvements to the overall metric: combining query rephrasing and answer retry increased the final score by approximately 1 point, confirming that minimal agentic RAG can provide incremental gains but is limited by retrieval quality.
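A retry loop of this kind might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: `agentic_answer`, the `threshold` and `max_retries` parameters, and the three toy callables are hypothetical, and the confidence value is simply whatever the generator reports. The loop answers once, and if confidence is below the threshold it rephrases the query, retrieves again, and keeps the most confident attempt.

```python
def agentic_answer(question, retrieve, generate, rephrase, threshold=0.5, max_retries=2):
    """Answer a question, retrying with a rephrased query while confidence is low."""
    query = question
    best = None
    attempts = 0
    for attempt in range(max_retries + 1):
        attempts = attempt + 1
        passages = retrieve(query)
        answer, confidence = generate(question, passages)
        # Keep the most confident attempt together with the passages that produced it.
        if best is None or confidence > best["confidence"]:
            best = {"answer": answer, "confidence": confidence, "passages": passages}
        if confidence >= threshold:
            break
        query = rephrase(question, attempt)
    best["attempts"] = attempts
    return best


# --- Hypothetical toy components, for illustration only ---
def toy_retrieve(query):
    """Return (document, page) references whose keyword appears in the query."""
    corpus = {"warranty": ("manual.pdf", 2), "refund": ("policy.pdf", 1)}
    return [ref for key, ref in corpus.items() if key in query.lower()]


def toy_generate(question, passages):
    """Pretend LLM: confident only when some context was retrieved."""
    return ("B", 0.9) if passages else ("A", 0.2)


def toy_rephrase(question, attempt):
    """Pretend rephraser: cycle through fixed alternative queries."""
    variants = ["warranty coverage length", "refund terms"]
    return variants[attempt % len(variants)]


result = agentic_answer(
    "How long is the guarantee?", toy_retrieve, toy_generate, toy_rephrase
)
print(result["answer"], result["attempts"], result["passages"])
```

Note that the passages returned come from whichever query produced the most confident answer, not necessarily the original question. A rephrased query can therefore surface a different page than the first pass would have reported.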
Leaderboard Analysis
Two submissions were compared: a non-agentic pipeline and an agentic pipeline. The agentic variant increased answer accuracy ($a_i = 0.814$ vs. $0.633$) but decreased page precision ($p_i = 0.625$ vs. $0.814$). This trade-off corroborates findings in recent agentic RAG benchmarks (Xi et al., 21 May 2025), indicating that agentic behaviors may retrieve alternative relevant passages at the expense of exact localization.
Implications and Limitations
The numerical results confirm that retrieval quality is the dominant performance driver for Ukrainian agentic RAG. LLM size/pretraining matters less than retrieval efficacy: a robust retriever elevates even relatively small LLMs above LLM-only baselines. The agentic pipeline’s minor improvement over the strong baseline highlights the limited potential of single-step agentic layers in computationally constrained environments.
Practical constraints (single GPU, offline mode) precluded exploration of more advanced agentic architectures (multi-agent collaboration, deep reasoning chains, iterative search), noted as essential for further breakthroughs (Li et al., 13 Jul 2025).
Generalizability is uncertain. The pipeline was not fine-tuned on Ukrainian data and relies on domain-agnostic, multilingual embeddings. There is no evidence yet for transferability to other Ukrainian NLP settings or domains. The system's agentic components are rudimentary and do not encompass full agentic patterns typified by planning, reflection, and tool use (Singh et al., 15 Jan 2025).
Theoretical and Practical Outlook
The study underscores the necessity of domain-adaptive retrievers and Ukrainian-specific rerankers to further improve grounding and localization. Under unconstrained settings, exploration of sophisticated agentic RAG architectures—multi-step planning, chain-of-retrieval, deep iterative search—should be prioritized. Coupling these approaches with large-scale API-accessible LLMs could unlock new state-of-the-art results for Ukrainian, paving the way for more effective information access and question answering in low-resource languages.
Furthermore, the results reinforce the paradigm shift documented in agentic RAG literature: retrieval-centric architectures, rather than sheer model size or language-specific pretraining, dictate system performance for multi-domain document understanding tasks.
Conclusion
This paper provides a technically rigorous first look at Agentic RAG for Ukrainian, evaluated in a competitive multi-domain document understanding challenge. The findings demonstrate that retrieval quality, more than agentic behaviors or LLM choice, is the decisive performance bottleneck. Agentic retry mechanisms offer incremental gains but are constrained by offline pipeline limitations. Future research should focus on robust, Ukrainian-adapted retrieval, unconstrained agentic architectures, and leveraging large-scale LLMs to fully realize the latent potential of Agentic RAG in low-resource settings.