DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Published 5 Mar 2026 in cs.IR, cs.AI, and cs.CL | (2603.04743v1)

Abstract: LLM agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG at 10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.


Authors (8)

Summary

The paper introduces DARE, a method that integrates distributional signals with function metadata to improve retrieval accuracy in the R ecosystem.
It employs a dual-encoder contrastive architecture that achieves state-of-the-art metrics, including a Recall@1 improvement of 33.4% over baselines.
DARE enables efficient LLM-driven statistical analysis by reducing execution failures and enhancing code synthesis in real-time R workflows.

Distribution-Aware Retrieval for LLM-R Ecosystem Alignment: DARE

Motivation and Problem Context

LLM agents are increasingly used to automate data science workflows, but their usefulness is limited by a strong orientation toward Python-centric ecosystems. This bias sidelines R, a language central to statistical computing with a mature, domain-informed package landscape hosted on CRAN. Although R packages encode rigorous statistical methods, LLMs systematically underperform at statistical tool retrieval: they often default to suboptimal Python analogs or generate erroneous R code, because their learned representations capture semantics but not data distribution. Standard retrieval-augmented generation (RAG) frameworks widen this gap by relying solely on function-level semantic similarity and neglecting the data distribution characteristics that determine whether a statistical method is applicable.

DARE Model and RPKB Repository

To address these deficiencies, the authors introduce DARE (Distribution-Aware Retrieval Embedding), an embedding-based retrieval model explicitly conditioned on data distribution signals in addition to function metadata. DARE uses a contrastive dual-encoder architecture initialized from all-MiniLM-L6-v2, which maps both queries and candidate R functions (augmented with inferred distributional constraints) into a shared latent space. The model is trained with an InfoNCE objective that rewards aligning each query-context pair with statistically compatible functions, using hard negatives derived from batch sampling.
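The InfoNCE objective mentioned above can be illustrated with a minimal, dependency-free sketch. It assumes that row i of the function embeddings is the positive for row i of the query embeddings and that every other row in the batch serves as a negative. The temperature value, the cosine normalization, and all function names here are illustrative assumptions, not details taken from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(query_vecs, func_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    func_vecs[i] is treated as the positive for query_vecs[i]; all other
    rows of func_vecs act as in-batch negatives. The temperature of 0.05
    is an assumed value, not one reported in the source.
    """
    queries = [normalize(q) for q in query_vecs]
    funcs = [normalize(f) for f in func_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, f) / temperature for f in funcs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the positive candidate.
        total += log_denominator - logits[i]
    return total / len(queries)
```

The loss is close to zero when each query is most similar to its own function and grows when a query is closer to another function in the batch, which is the pressure that pulls compatible query-function pairs together in the shared space.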

The backbone for DARE is RPKB, a curated repository derived from 8,191 high-quality CRAN packages, filtered for statistical primitives through automated and LLM-assisted extraction of metadata and data profiles. The repository is indexed in ChromaDB, enabling low-latency Maximum Inner Product Search across high-dimensional function embeddings.
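Maximum Inner Product Search can be sketched as an exhaustive scan over a toy index. This is only an illustration of the scoring rule, not the ChromaDB API and not the authors' indexing code. The three-dimensional vectors are invented for the example, and whether these particular R functions appear in RPKB is not stated in the source.

```python
import heapq

def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_by_inner_product(query_vec, index, k=3):
    """Return the k (function_name, score) pairs with the largest inner product.

    `index` maps a function name to its embedding. A production system would
    use an approximate nearest-neighbor index instead of this linear scan.
    """
    scored = ((name, inner_product(query_vec, vec)) for name, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical toy embeddings; real function embeddings are high-dimensional.
toy_index = {
    "stats::t.test": [0.9, 0.1, 0.0],
    "stats::wilcox.test": [0.2, 0.9, 0.1],
    "survival::coxph": [0.0, 0.1, 0.95],
}

# A made-up query vector that lies closest to the second entry.
results = top_k_by_inner_product([0.1, 0.8, 0.1], toy_index, k=2)
```

With these toy vectors the query scores highest against `stats::wilcox.test`, so that entry is returned first.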

Empirical Evaluation and Numerical Results

DARE achieves state-of-the-art retrieval performance across all canonical metrics on RPKB. Its NDCG@10 of 93.47% exceeds prior high-capacity baselines (e.g., Snowflake/arctic-embed-l, with Recall@10 = 92.35% and NDCG@10 = 79.32%) by a relative margin of up to 17.8%. Notably, these gains are achieved with only 23M parameters, a fraction of the parameter counts of leading competitors (> 335M). DARE attains a Recall@1 of 87.39%, a 33.4% improvement over the best baseline, and an MRR@10 of 91.76%, indicating robust ranking quality. Throughput and latency benchmarks further establish DARE as an efficient, deployable module: at 8,512 QPS and 3.7 ms average latency, it outpaces larger models by a factor of three to four, which makes real-time retrieval inside an agent loop feasible.
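For readers unfamiliar with the metrics, the following sketch shows one common way to compute Recall@k, MRR@k, and NDCG@k. It assumes each query has exactly one relevant function, which the source does not confirm; under that assumption the ideal DCG is 1, so NDCG reduces to the discounted gain at the rank of the relevant item. The helper names are illustrative, and `relative_gain` simply reproduces the relative-margin arithmetic from the reported NDCG@10 figures.

```python
import math

def rank_of_relevant(ranked_ids, relevant_id, k):
    """1-based rank of relevant_id within the top k results, or None if absent."""
    for position, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return position
    return None

def evaluate(runs, k=10):
    """Average Recall@k, MRR@k, and NDCG@k over (ranked_ids, relevant_id) pairs.

    Assumes a single relevant item per query (an assumption of this sketch).
    """
    recall = mrr = ndcg = 0.0
    for ranked_ids, relevant_id in runs:
        rank = rank_of_relevant(ranked_ids, relevant_id, k)
        if rank is not None:
            recall += 1.0
            mrr += 1.0 / rank
            ndcg += 1.0 / math.log2(rank + 1)
    n = len(runs)
    return {"recall": recall / n, "mrr": mrr / n, "ndcg": ndcg / n}

def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, as a fraction."""
    return new_score / baseline_score - 1.0

# Reported NDCG@10 values: DARE 93.47 versus Snowflake/arctic-embed-l 79.32.
ndcg_margin = relative_gain(93.47, 79.32)  # roughly 0.178, i.e. about 17.8%
```

The distinction between relative and absolute gains matters when reading the numbers: the same pair of NDCG@10 scores differs by about 14 percentage points in absolute terms.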

RCodingAgent: End-to-End Integration and Downstream Impact

RCodingAgent, an R-specialized LLM agent, is equipped with DARE for iterative statistical tool retrieval, code generation, and execution-based validation. The agent is evaluated on a suite of 16 R-based statistical analysis tasks spanning domains such as hypothesis testing, ICA, survival analysis, and mixed-effects modeling. Integrating DARE yields large improvements: on execution-grounded benchmarks, agent success rates increase by up to 56.25 percentage points, with the most pronounced impact observed for models such as grok-4.1-fast (18.75% → 75.00%), gpt-5.2 (25.00% → 62.50%), and minimax-m2.1 (12.50% → 68.75%). Without DARE, agents frequently fail to select appropriate R functions, hallucinate function names, or revert to heuristic implementations, leading to execution failures and incorrect outputs.
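The reported success rates can be read as whole-task counts on the 16-task suite. The short sketch below does that conversion and computes each gain in percentage points. It assumes every task is scored pass/fail with equal weight, which is consistent with the rates all being multiples of 1/16 but is not stated explicitly in the source.

```python
N_TASKS = 16  # size of the evaluation suite reported in the source

# (success rate without DARE, success rate with DARE), in percent, as reported.
reported_rates = {
    "grok-4.1-fast": (18.75, 75.00),
    "gpt-5.2": (25.00, 62.50),
    "minimax-m2.1": (12.50, 68.75),
}

def summarize(rates, n_tasks=N_TASKS):
    """Convert percent success rates to solved-task counts and point gains."""
    summary = {}
    for model, (before, after) in rates.items():
        summary[model] = {
            "solved_before": round(before / 100 * n_tasks),
            "solved_after": round(after / 100 * n_tasks),
            "gain_points": after - before,  # absolute gain in percentage points
        }
    return summary

summary = summarize(reported_rates)
```

Under this reading, grok-4.1-fast goes from 3 to 12 solved tasks, gpt-5.2 from 4 to 10, and minimax-m2.1 from 2 to 11, so the largest gain of 56.25 points corresponds to nine additional tasks solved.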

Practical and Theoretical Implications

DARE sets a new paradigm for retrieval-augmented LLM agents in data science, particularly for ecosystems with rich, distribution-sensitive tool landscapes like R. The model demonstrates that conditioning on structured data context enables fine-grained discrimination among semantically similar but statistically incompatible functions, substantially reducing failure modes in LLM-driven code synthesis. Its lightweight, plug-and-play design and open-sourced knowledge base (RPKB) position DARE as an enabler for scalable, robust agentic workflows in scientific and industrial computing.
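One way to picture conditioning on structured data context is to attach a compact data profile to the natural-language request before it is embedded. The sketch below is hypothetical: the profile fields, the skewness thresholds of plus or minus 1, and the bracketed query format are all assumptions made for illustration and do not describe the format DARE actually uses.

```python
import statistics

def profile_column(values):
    """Summarize a numeric column with a few distributional features."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    # Population skewness; defined as 0 for a constant column.
    skew = 0.0 if sd == 0 else sum(((x - mean) / sd) ** 3 for x in values) / n
    return {"n": n, "mean": round(mean, 3), "sd": round(sd, 3), "skewness": round(skew, 3)}

def describe_shape(skewness):
    """Map skewness to a coarse label (thresholds are illustrative assumptions)."""
    if skewness > 1:
        return "right-skewed"
    if skewness < -1:
        return "left-skewed"
    return "roughly symmetric"

def build_query(task_text, column_name, values):
    """Append a hypothetical data-profile suffix to a retrieval query."""
    p = profile_column(values)
    shape = describe_shape(p["skewness"])
    return f"{task_text} [data: {column_name}, n={p['n']}, {shape}, skewness={p['skewness']}]"

query = build_query("compare two groups", "response_time", [1, 1, 1, 2, 2, 3, 50])
```

Two requests with identical wording then produce different query strings when the underlying data differ in shape, which gives a retriever a signal for separating, say, a parametric test from a rank-based alternative.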

Practically, DARE unlocks decades of accumulated statistical expertise for automated analysis pipelines, supporting use cases in genomics, econometrics, ecology, and clinical research, where R solutions remain the gold standard. Theoretically, the work highlights the inadequacy of general-purpose embedding models for specialized analytical tasks and demonstrates the value of distribution-dependent representation learning for tool selection.

Future Directions

Key directions include expanding RPKB through community engagement, constructing large-scale R-centric corpora to enhance native LLM statistical proficiency, and advancing structured tool learning strategies (e.g., function-level reasoning graphs, memory-based tool selection). Integration of RCodingAgent into modular mixture-of-experts frameworks is also proposed, to scale analytical coverage across languages and domains.

Conclusion

DARE represents a significant advance in aligning LLM-based agents with the R statistical ecosystem through distribution-aware retrieval. Its empirical superiority, efficiency, and practical effectiveness in downstream R workflows underscore the necessity of embedding domain-specific statistical logic into retrieval and agentic systems. Future efforts are expected to further bridge the automation gap in statistical computing, fostering more reliable, context-sensitive agentic data science solutions (2603.04743).