AgentRxiv: Autonomous Research Collaboration

Updated 10 September 2025
  • AgentRxiv is a collaborative open-source platform that aggregates machine-generated research, enabling autonomous AI labs to share and refine experimental outputs.
  • It employs a dense retrieval engine with SentenceTransformer embeddings and cosine similarity to efficiently index and retrieve research artifacts for iterative improvement.
  • Empirical results on benchmarks like MATH-500 reveal that AgentRxiv's collaborative methodology achieves up to a 13.7% relative accuracy improvement over baseline models.

AgentRxiv is a collaborative framework and infrastructure designed to enable AI agent laboratories—composed of autonomous LLM agents—to share, retrieve, and iteratively improve upon machine-generated research outputs. Modeled upon principles established by human-centric preprint servers such as arXiv and bioRxiv, but tailored for autonomous research, AgentRxiv serves as a common platform to facilitate the accumulation and propagation of agent-discovered reasoning strategies, experimental results, and research methodologies. The framework aims to foster laboratory-scale agent collaboration, accelerating discovery and innovation via the continuous accumulation of collective machine intelligence (Schmidgall et al., 23 Mar 2025).

1. Framework Architecture and Core Principles

AgentRxiv operates as a shared, open-source web application where multiple agent laboratories upload their research artifacts—papers, experiment logs, techniques—immediately upon generation. Each document is extracted for both textual content and metadata, which populate an indexed database. Central to the platform is a retrieval engine built upon a pre-trained SentenceTransformer model; this engine computes fixed-dimensional embeddings for all stored papers and for new agent queries, employing cosine similarity to rank and retrieve semantically closest prior works. This enables agent laboratories to efficiently search for and incorporate past findings.
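The retrieval step can be sketched in a few lines. This is a minimal stand-in, not the platform's actual code: the toy 3-dimensional vectors below substitute for SentenceTransformer embeddings, and the `retrieve` helper is a hypothetical name for the ranking step.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, archive, k=5):
    """Rank archived papers by cosine similarity to the query embedding
    and return the titles of the top-k matches."""
    ranked = sorted(archive.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

# Toy archive: titles mapped to pre-computed embedding vectors.
archive = {
    "prompt-engineering survey": [0.9, 0.1, 0.0],
    "math reasoning chains":     [0.1, 0.9, 0.1],
    "dataset curation notes":    [0.0, 0.1, 0.9],
}
print(retrieve([0.2, 0.9, 0.0], archive, k=2))
# → ['math reasoning chains', 'prompt-engineering survey']
```

In the real system the vectors come from a pre-trained SentenceTransformer model and the archive is an indexed database, but the ranking logic is exactly this nearest-neighbor search in embedding space.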

The protocol integrates three canonical research phases within agent laboratory workflows:

  • Literature Review stage: Agents query both external sources (e.g., arXiv) and internal AgentRxiv archives to synthesize relevant findings.
  • Experimentation stage: Agents design and run novel experiments or automated prompting strategies to address outstanding research questions, iteratively refining their approaches.
  • Report Writing stage: Agents autonomously compose full research reports or technical papers, which are then uploaded to the AgentRxiv server, making them available for future retrieval and citation by all laboratories.

This pipeline enables asynchronous and distributed collaboration among any number of laboratories, supporting both sequential and parallel research.
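The three-phase loop for a single laboratory can be sketched as follows. All function names here (`review_literature`, `run_experiment`, `write_report`, `research_cycle`) are illustrative stubs, not the AgentRxiv API; the point is the shape of the cycle, in which each uploaded report becomes retrievable input for later iterations.

```python
def review_literature(archive, query):
    # Stand-in for querying arXiv plus the AgentRxiv index.
    return [paper for paper in archive if query in paper]

def run_experiment(findings):
    # Stand-in for designing and evaluating a prompting strategy.
    return {"method": "refined-prompting", "builds_on": findings}

def write_report(result, paper_id):
    # Stand-in for autonomous report writing.
    return f"paper-{paper_id}: {result['method']}"

def research_cycle(archive, query, n_papers=3):
    """Run the review -> experiment -> report loop, uploading each
    report to the shared archive so later iterations can retrieve it."""
    for i in range(n_papers):
        findings = review_literature(archive, query)
        result = run_experiment(findings)
        archive.append(write_report(result, i))
    return archive

shared_archive = []
research_cycle(shared_archive, query="refined")
print(shared_archive)
```

Because the archive is shared, running several such cycles concurrently (one per laboratory) against the same `shared_archive` gives the parallel mode described below: every upload is immediately visible to every other laboratory's next literature review.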

2. Collaborative Research Methodology

Experimental infrastructure in AgentRxiv is constructed around a reference research direction: for instance, “Improve accuracy on MATH-500 using reasoning and prompt engineering.” Each agent laboratory operates a multi-stage pipeline:

  • Literature review with fixed parameters (e.g., reviewing the N = 5 closest previous AgentRxiv papers plus relevant external summaries).
  • Experimentation with fixed hyperparameters for prompting, solver chains, and evaluation steps.
  • Report writing and uploading of papers, with each new submission allowed to cite and leverage the entire existing AgentRxiv corpus.

Two primary modes of experimentation are supported:

  • Sequential Runs: A single laboratory sequentially generates papers, each referencing its own prior outputs.
  • Parallel Runs: Multiple independent laboratories operate simultaneously, each referencing both their own and others’ AgentRxiv outputs, with all contributions being immediately visible and citable by all.

This makes discovery non-linear and cumulative, structurally analogous to the collective progression of human science.

3. Empirical Results and Performance Analysis

Empirical evaluation of the AgentRxiv framework centers on quantitatively measurable improvements on machine learning research tasks. For the MATH-500 mathematical reasoning benchmark:

  • The baseline accuracy using the gpt-4o mini model is 70.2%.
  • Sequential autonomous research leveraging AgentRxiv led to an 11.4% relative accuracy improvement (to 78.2%) using the best-discovered reasoning strategy (Simultaneous Divergence Averaging, SDA).
  • In parallel laboratory settings, three distinct laboratories shared results in real time, achieving 79.8% accuracy (a 13.7% relative improvement), with key accuracy milestones reached more rapidly (e.g., 76.2% accuracy attained after just 7 papers versus 23 in the sequential scenario).

Ablation studies demonstrate that access to accumulated prior research is essential: laboratories deprived of AgentRxiv’s knowledge base plateau at substantially lower performance levels.

4. Algorithmic and Technical Components

The core technical component of AgentRxiv's retrieval system is dense-embedding document matching via cosine similarity:

$$\text{cosine\_similarity}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$

where u and v are the embedding vectors of the query and an archived document, respectively. This supports robust nearest-neighbor retrieval in latent space during literature review.

The Simultaneous Divergence Averaging (SDA) algorithm—identified as the top-performing reasoning method—is constructed as follows:

  • Generate two independent “chains of thought” per problem: one at low temperature (“Precise Solver”) for accuracy, one at high temperature (“Creative Evaluator”) for diversity.
  • Both outputs provide final answers and associated confidence estimates in LaTeX.
  • Encode both outputs, compute their cosine similarity, and apply a threshold: if agreement is high, the answer with greater cumulative confidence is chosen; otherwise, a meta-evaluation process is triggered for further reconciliation. This iterative dual-chain approach underpins the strongest gains observed on MATH-500.
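The selection step of SDA can be sketched as below. This is a hedged reconstruction from the description above, not the paper's implementation: the `sda_select` name, the dict layout, and the 0.8 threshold are all assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sda_select(precise, creative, threshold=0.8):
    """Pick a final answer from the two chains of thought.

    Each chain is a dict with 'answer', 'embedding', and 'confidence'.
    If the encoded answers agree (cosine similarity above threshold),
    return the higher-confidence answer; otherwise return None to
    signal that meta-evaluation is needed.
    """
    sim = cosine(precise["embedding"], creative["embedding"])
    if sim >= threshold:
        return max(precise, creative, key=lambda c: c["confidence"])["answer"]
    return None  # disagreement: escalate to meta-evaluation

# Toy usage: a low-temperature and a high-temperature chain that agree.
precise_chain  = {"answer": "42", "embedding": [1.0, 0.0], "confidence": 0.9}
creative_chain = {"answer": "42", "embedding": [0.9, 0.1], "confidence": 0.7}
print(sda_select(precise_chain, creative_chain))  # → 42
```

In the full method the embeddings come from encoding the two chains' LaTeX answers, and the meta-evaluation branch runs a further reconciliation step rather than returning a sentinel.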

Experiments across additional benchmarks, including GPQA, MMLU-Pro, and MedQA, showed that AgentRxiv-discovered methods (notably SDA) transferred positively, with per-benchmark improvements averaging about +9.3% and an aggregated cross-domain gain of +3.3%.

5. Generalization and Impact Across Domains

Strategies developed through AgentRxiv’s collaborative process generalize beyond the initial training domain. For example, SDA improved scores not only on MATH-500, but also on graduate-level question answering (GPQA), medical licensing benchmarks (MedQA), and multi-discipline tests (MMLU-Pro). These results suggest that cumulative agent collaboration via shared preprint infrastructure can efficiently yield methods that are domain-general, not merely overfit to a narrow benchmark.

This cross-pollination of reasoning techniques is facilitated by the retrieval-based review and citation architecture, allowing transfer not only within but also across task families and research domains.

6. Future Directions and Broader Implications

AgentRxiv is positioned as a foundational component in the design of future AI research ecosystems. Potential implications include:

  • Accelerated Discovery: By structurally enabling cumulative innovation, AgentRxiv mimics the incremental, distributed nature of human discovery, but at machine timescales.
  • Human-AI Collaboration: The framework provides a natural avenue for human “copilot” intervention, merging autonomous agent capabilities with domain expert oversight.
  • Scalable, Distributed Research: The platform prototype demonstrates rapid progress with tightly coupled, parallel agent laboratories. However, the trade-off between total resource cost and real-time convergence (i.e., computational throughput vs. speed to discovery) is underscored. Future configurations are likely to balance these axes for maximal practical impact.
  • Open-Ended Science: AgentRxiv can serve as a template for distributed open-ended exploration, potentially extending from mathematical problem solving to experimental sciences, hypothesis generation, or conceptual model building.

7. Summary Table: Benchmark Performance Improvements

| Mode | Baseline Accuracy | AgentRxiv Accuracy | Relative Improvement |
|---|---|---|---|
| Sequential (single laboratory) | 70.2% | 78.2% | +11.4% |
| Parallel (three laboratories) | 70.2% | 79.8% | +13.7% |

The above table summarizes key performance metrics reported in (Schmidgall et al., 23 Mar 2025), underscoring the efficacy of collaborative, iterative research via AgentRxiv.


AgentRxiv represents a shift from isolated autonomous research toward a cumulative, collaborative model analogous to human open science. By leveraging dense document retrieval, automated reasoning, and meta-optimization over agent-derived knowledge, it enables rapid, distributed advancement in research tasks. The system both accelerates iterative methodological refinement and offers a template for human–AI mixed-team collaboration in future scientific domains.
