- The paper introduces DARE, a method that integrates distributional signals with function metadata to improve retrieval accuracy in the R ecosystem.
- It employs a dual-encoder contrastive architecture that achieves state-of-the-art metrics, including a Recall@1 improvement of 33.4% over baselines.
- DARE enables efficient LLM-driven statistical analysis by reducing execution failures and enhancing code synthesis in real-time R workflows.
Distribution-Aware Retrieval for LLM-R Ecosystem Alignment: DARE
Motivation and Problem Context
LLM agents are increasingly used to automate data science workflows, yet their effectiveness is limited by a strong orientation toward Python-centric ecosystems. This bias hinders the use of R, a language central to statistical computing whose mature, domain-informed package landscape is hosted on CRAN. Although R packages encode a rigorous methodological repertoire, LLMs systematically underperform at statistical tool retrieval: they often default to suboptimal Python analogs or generate erroneous R code, because their learned representations capture semantics but not data distributions. Standard retrieval-augmented generation (RAG) frameworks widen this gap by relying solely on function-level semantic similarity and ignoring data distribution characteristics, which are central to whether a statistical method applies.
DARE Model and RPKB Repository
To address these deficiencies, the authors introduce DARE (Distribution-Aware Retrieval Embedding), an embedding-based retrieval model explicitly conditioned on data distribution signals in addition to function metadata. DARE is trained as a contrastive dual encoder, initialized from all-MiniLM-L6-v2, that efficiently maps both queries and candidate R functions (augmented with inferred distributional constraints) into a shared latent space. The model is supervised with an InfoNCE objective, which rewards aligning each query-context pair with statistically compatible functions, using hard negatives derived from in-batch sampling.
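The InfoNCE objective mentioned above can be illustrated with a minimal, dependency-free sketch. It assumes cosine similarity, a temperature of 0.05, and that every other function in the batch serves as a negative; none of these settings are stated in the source, and the two-dimensional vectors are toy values rather than real encoder outputs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(query_embs, func_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    func_embs[i] is treated as the positive for query_embs[i]; all other
    rows of func_embs act as in-batch negatives. The temperature value is
    an assumption for illustration only.
    """
    queries = [l2_normalize(q) for q in query_embs]
    funcs = [l2_normalize(f) for f in func_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, f) / temperature for f in funcs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy batch: each query is close to its own function embedding.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(toy_queries, [[1.0, 0.1], [0.1, 1.0]])
# Same batch with the positives swapped, i.e. a badly aligned encoder.
swapped = info_nce_loss(toy_queries, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

The loss is small when each query sits closest to its paired function and grows when a different function in the batch scores higher, which is the behavior the training objective rewards.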
The backbone for DARE is RPKB, a curated repository derived from 8,191 high-quality CRAN packages, filtered for statistical primitives through automated and LLM-assisted extraction of metadata and data profiles. The repository is indexed in ChromaDB, enabling low-latency Maximum Inner Product Search across high-dimensional function embeddings.
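As a rough illustration of Maximum Inner Product Search, the sketch below does an exhaustive scan over a toy index in plain Python. It is not the ChromaDB API and not the authors' indexing code; the three-dimensional embeddings are invented for the example, and the R function names are used only as labels.

```python
import heapq

def top_k_inner_product(query, index, k=2):
    """Return the k (score, name) pairs with the largest inner product.

    `index` maps a function name to its embedding. A production vector
    store would use an approximate index instead of this full scan.
    """
    scored = (
        (sum(q * x for q, x in zip(query, emb)), name)
        for name, emb in index.items()
    )
    return heapq.nlargest(k, scored)

# Toy embeddings (made-up values, not produced by any real encoder).
toy_index = {
    "stats::t.test": [0.9, 0.1, 0.0],
    "stats::wilcox.test": [0.1, 0.9, 0.2],
    "survival::survfit": [0.0, 0.1, 0.9],
}
toy_query = [0.2, 0.9, 0.1]

results = top_k_inner_product(toy_query, toy_index, k=2)
for score, name in results:
    print(f"{name}: {score:.2f}")
```

The exhaustive scan costs time linear in the number of indexed functions, which is fine for a sketch but is the step a vector database is meant to accelerate.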
Empirical Evaluation and Numerical Results
DARE achieves state-of-the-art retrieval performance across all canonical metrics on RPKB. Its NDCG@10 of 93.47% outperforms prior high-capacity baselines (e.g., Snowflake/arctic-embed-l, with Recall@10 = 92.35% and NDCG@10 = 79.32%) by a relative margin of up to 17.8%. Notably, these gains are achieved with only 23M parameters, a fraction of the parameter counts of leading competitors (> 335M). DARE attains a Recall@1 of 87.39%, a 33.4% improvement over the best baseline, and an MRR@10 of 91.76%, indicating robust ranking quality. Throughput and latency benchmarks further establish DARE as an efficient, deployable module: at 8,512 QPS and 3.7 ms average latency, it outpaces larger models by a factor of three to four, which makes real-time retrieval inside an agent loop feasible.
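For readers unfamiliar with these metrics, the following sketch shows how Recall@k, MRR@k, and NDCG@k can be computed, under the assumption (not stated in the source) that each query has exactly one relevant function. The ranks are toy values; only the final arithmetic uses numbers reported above, and it reproduces the 17.8% relative margin from the two NDCG@10 scores.

```python
import math

def recall_at_k(ranks, k):
    """Fraction of queries whose relevant item appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """NDCG with a single relevant item per query.

    The ideal DCG is then 1, so each query contributes 1 / log2(1 + rank).
    """
    return sum(
        1.0 / math.log2(1 + r) for r in ranks if r is not None and r <= k
    ) / len(ranks)

# Toy data: 1-based rank of the relevant function per query (None = missed).
toy_ranks = [1, 2, None, 1]
print(recall_at_k(toy_ranks, 1), recall_at_k(toy_ranks, 10))
print(mrr_at_k(toy_ranks, 10), round(ndcg_at_k(toy_ranks, 10), 4))

# Relative NDCG@10 margin over Snowflake/arctic-embed-l, from the reported scores.
relative_gain = (93.47 - 79.32) / 79.32 * 100
print(round(relative_gain, 1))
```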
RCodingAgent: End-to-End Integration and Downstream Impact
RCodingAgent, an R-specialized LLM agent, is equipped with DARE for iterative statistical tool retrieval, code generation, and execution-based validation. The agent is evaluated on a suite of 16 R-based statistical analysis tasks spanning domains such as hypothesis testing, ICA, survival analysis, and mixed-effects modeling. Integrating DARE yields large improvements: on execution-grounded benchmarks, agent success rates increase by up to 56.25 percentage points, with the most pronounced impact observed for models such as grok-4.1-fast (18.75% → 75.00%), gpt-5.2 (25.00% → 62.50%), and minimax-m2.1 (12.50% → 68.75%). Without DARE, agents frequently fail to select appropriate R functions, hallucinate function names, or revert to heuristic implementations, leading to execution failures and incorrect outputs.
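Because the benchmark has 16 tasks, each reported success rate corresponds to a whole number of solved tasks, and the gains are most naturally read as absolute percentage-point differences. The short sketch below works through that arithmetic using only the figures quoted above; the variable names are illustrative.

```python
TASKS = 16  # size of the evaluation suite reported in the text

# (success rate without DARE, success rate with DARE), in percent.
reported = {
    "grok-4.1-fast": (18.75, 75.00),
    "gpt-5.2": (25.00, 62.50),
    "minimax-m2.1": (12.50, 68.75),
}

solved = {}
gains = {}
for model, (before, after) in reported.items():
    # Convert percentages back to task counts out of 16.
    solved[model] = (round(before / 100 * TASKS), round(after / 100 * TASKS))
    # Absolute gain in percentage points.
    gains[model] = after - before
    print(model, solved[model], f"+{gains[model]:.2f} points")

print(max(gains.values()))
```

Under this reading, the 56.25-point maximum corresponds to nine additional tasks solved out of 16.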
Practical and Theoretical Implications
DARE offers a new paradigm for retrieval-augmented LLM agents in data science, particularly for ecosystems with rich, distribution-sensitive tool landscapes like R. The model demonstrates that conditioning on structured data context enables fine-grained discrimination among functions that are semantically similar but statistically incompatible, substantially reducing failure modes in LLM-driven code synthesis. Its lightweight, plug-and-play design and open-source knowledge base (RPKB) position DARE as an enabler of scalable, robust agentic workflows in scientific and industrial computing.
Practically, DARE unlocks decades of accumulated statistical expertise for automated analysis pipelines, supporting use cases in genomics, econometrics, ecology, and clinical research, where R solutions remain the gold standard. Theoretically, the work highlights the inadequacy of general-purpose embedding models for specialized analytical tasks and demonstrates the value of distribution-dependent representation learning for tool selection.
Future Directions
Key directions include expanding RPKB through community engagement, constructing large-scale R-centric corpora to enhance native LLM statistical proficiency, and advancing structured tool learning strategies (e.g., function-level reasoning graphs, memory-based tool selection). Integration of RCodingAgent into modular mixture-of-experts frameworks is also proposed, to scale analytical coverage across languages and domains.
Conclusion
DARE represents a significant advance in aligning LLM-based agents with the R statistical ecosystem through distribution-aware retrieval. Its empirical superiority, efficiency, and practical effectiveness in downstream R workflows underscore the necessity of embedding domain-specific statistical logic into retrieval and agentic systems. Future efforts are expected to further bridge the automation gap in statistical computing, fostering more reliable, context-sensitive agentic data science solutions (2603.04743).