AgentRxiv: Towards Collaborative Autonomous Research
(2503.18102v1)
Published 23 Mar 2025 in cs.AI, cs.CL, and cs.LG
Abstract: Progress in scientific discovery is rarely the result of a single "Eureka" moment, but is rather the product of hundreds of scientists incrementally working together toward a common goal. While existing agent workflows are capable of producing research autonomously, they do so in isolation, without the ability to continuously improve upon prior research results. To address these challenges, we introduce AgentRxiv, a framework that lets LLM agent laboratories upload and retrieve reports from a shared preprint server in order to collaborate, share insights, and iteratively build on each other's research. We task agent laboratories to develop new reasoning and prompting techniques and find that agents with access to their prior research achieve higher performance improvements compared to agents operating in isolation (11.4% relative improvement over baseline on MATH-500). We find that the best performing strategy generalizes to benchmarks in other domains (improving on average by 3.3%). Multiple agent laboratories sharing research through AgentRxiv are able to work together towards a common goal, progressing more rapidly than isolated laboratories, achieving higher overall accuracy (13.7% relative improvement over baseline on MATH-500). These findings suggest that autonomous agents may play a role in designing future AI systems alongside humans. We hope that AgentRxiv allows agents to collaborate toward research goals and enables researchers to accelerate discovery.
The paper "AgentRxiv: Towards Collaborative Autonomous Research" (Schmidgall et al., 23 Mar 2025) introduces a framework designed to facilitate collaboration and iterative improvement among autonomous LLM-based research agents. Recognizing that scientific progress typically arises from cumulative, collaborative effort rather than isolated breakthroughs, AgentRxiv addresses the limitation of existing agent workflows that operate independently without mechanisms for building upon prior findings. The core idea is to enable "agent laboratories" to interact with a shared repository, simulating the dynamics of a scientific community sharing preprints.
Framework Architecture and Operation
AgentRxiv establishes an ecosystem where multiple agent laboratories can conduct research autonomously and share their findings. The central component is a shared preprint server, analogous to platforms like arXiv.org, where agent-generated research reports can be uploaded and retrieved.
Key components include:
Agent Laboratories: These are computational environments where LLM agents execute research tasks. The abstract does not detail the laboratory setup (e.g., the types of agents, control structures, available tools), but each laboratory is implied to be capable of conducting research autonomously, potentially involving hypothesis generation, experimentation, result analysis, and report writing.
Shared Preprint Server: A centralized repository acting as a knowledge base. Agent laboratories can programmatically upload structured reports detailing their experiments, methodologies, findings, and insights.
Report Format: Although unspecified in the abstract, reports likely adhere to a predefined schema to ensure machine readability and facilitate effective retrieval and synthesis by other agents. This structure might include sections for methodology, results, discussion, and potentially code or prompts (see the sketch following this list).
Upload/Retrieval Mechanism: Laboratories interface with the preprint server via APIs to submit their reports and query existing ones. The retrieval mechanism allows agents to access prior work relevant to their current task, enabling them to leverage existing knowledge, avoid redundant effort, and build upon previous findings. The abstract does not specify the retrieval algorithm (e.g., keyword-based, semantic similarity).
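To make the interface concrete, below is a minimal sketch of what a report schema and preprint-server interface might look like. Everything here is an assumption for illustration: the paper's abstract specifies neither the report format nor the API, and an in-memory server stands in for the real service.

```python
from dataclasses import dataclass, field

# Hypothetical report schema; the actual format AgentRxiv uses is unspecified.
@dataclass
class ResearchReport:
    title: str
    abstract: str
    methodology: str
    results: str                                   # e.g., benchmark accuracies
    lab_id: str                                    # which laboratory produced it
    artifacts: dict = field(default_factory=dict)  # prompts, code, configs

class AgentRxivServer:
    """In-memory stand-in for the shared preprint server (illustrative only)."""

    def __init__(self):
        self._reports: list[ResearchReport] = []

    def upload(self, report: ResearchReport) -> int:
        """Archive a report and return a server-assigned ID."""
        self._reports.append(report)
        return len(self._reports) - 1

    def search(self, query: str, k: int = 5,
               lab_id: str | None = None) -> list[ResearchReport]:
        """Return the k reports best matching the query, optionally
        restricted to a single laboratory's own output."""
        # Naive keyword-overlap ranking; the paper does not state its
        # actual retrieval algorithm.
        q = set(query.lower().split())
        pool = [r for r in self._reports
                if lab_id is None or r.lab_id == lab_id]
        scored = sorted(pool, key=lambda r: -len(
            q & set((r.title + " " + r.abstract).lower().split())))
        return scored[:k]
```

The `lab_id` filter is included because it makes the experimental conditions described later easy to express: retrieval restricted to a laboratory's own reports versus retrieval across all laboratories.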
The operational loop involves agents within a laboratory retrieving relevant reports from AgentRxiv to inform the current research cycle, conducting the research, documenting the process and results in a report, and uploading it back to AgentRxiv for use in subsequent cycles. This creates a mechanism for knowledge accumulation and dissemination within the agent population.
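Under the same assumptions, one laboratory's loop might be sketched as follows; `run_research_cycle` is a hypothetical placeholder for the laboratory's internal pipeline (hypothesis generation, experimentation, analysis, report writing):

```python
def research_loop(server: AgentRxivServer, lab_id: str,
                  goal: str, n_cycles: int) -> None:
    """Illustrative retrieve -> research -> upload loop for one laboratory."""
    for _ in range(n_cycles):
        prior_work = server.search(goal, k=5)    # pull relevant prior reports
        report = run_research_cycle(goal, prior_work, lab_id)  # hypothetical
        server.upload(report)                    # share findings with all labs
```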
Experimental Methodology
The efficacy of the AgentRxiv framework was evaluated by tasking agent laboratories with a specific research objective: developing novel reasoning and prompting techniques to improve LLM performance. Evaluation focused primarily on mathematical reasoning, measured as accuracy on the MATH-500 benchmark.
The experiments compared three conditions:
Baseline (Isolated): Agent laboratories operating independently without access to AgentRxiv or their own prior research outputs.
Individual Improvement (Access to Prior Work): Agent laboratories capable of uploading reports to AgentRxiv and retrieving their own previous reports, allowing for iterative self-improvement but without access to the work of other laboratories.
Collaborative Improvement (Shared Access): Multiple agent laboratories simultaneously conducting research, uploading reports to the shared AgentRxiv server, and retrieving reports generated by any participating laboratory.
This setup allowed the benefit of iterative refinement on a laboratory's own history to be quantified separately from the benefit of broader collaboration and knowledge sharing across different agent laboratories.
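Using the illustrative client sketched earlier (not the paper's actual configuration), the three conditions can be pictured as different retrieval scopes against the same server:

```python
# Baseline (isolated): no access to prior reports at all.
prior_work = []

# Individual improvement: retrieve only this laboratory's own reports.
prior_work = server.search(goal, k=5, lab_id=my_lab_id)

# Collaborative improvement: retrieve across all participating laboratories.
prior_work = server.search(goal, k=5)
```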
Key Findings and Performance
The paper reports significant performance improvements attributable to the AgentRxiv framework.
Individual Improvement: Agents with access to their own prior research via AgentRxiv demonstrated an 11.4% relative improvement over the isolated baseline on the MATH-500 benchmark. This suggests that the ability to systematically record and retrieve past experiments enables agents to refine their strategies effectively over time.
Collaborative Improvement: When multiple agent laboratories shared research through AgentRxiv, the collective effort resulted in faster progress and higher overall accuracy compared to isolated labs. This collaborative setting achieved a 13.7% relative improvement over the baseline on MATH-500. This finding highlights the compounding benefits of shared knowledge, where insights from one laboratory can accelerate progress in others.
Generalization: The best-performing strategy developed through this process (presumably a reasoning or prompting technique) showed generalization capabilities. When applied to benchmarks in domains other than mathematics, it yielded an average improvement of 3.3%. This indicates that the research outputs generated and refined within the AgentRxiv framework are not merely overfitted to the specific training task (MATH-500) but possess broader applicability.
These quantitative results underscore the framework's potential to enhance autonomous research capabilities, particularly through mechanisms that mirror human scientific collaboration. The difference between the 11.4% individual improvement and the 13.7% collaborative improvement points towards the added value of cross-laboratory knowledge sharing facilitated by the shared preprint server.
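For clarity, the reported gains are relative to the baseline accuracy:

\[
\text{relative improvement} = \frac{\mathrm{acc}_{\text{method}} - \mathrm{acc}_{\text{baseline}}}{\mathrm{acc}_{\text{baseline}}}
\]

So, taking an illustrative baseline of 70.0% on MATH-500 (not a figure reported here), an 11.4% relative gain corresponds to roughly 78.0% absolute accuracy, and a 13.7% relative gain to roughly 79.6%.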
Implications and Considerations
AgentRxiv presents a model for structuring autonomous AI research, moving beyond single-shot task execution towards sustained, collaborative scientific inquiry driven by LLM agents. The findings suggest that equipping agents with mechanisms for persistent memory (via report archiving) and communication (via a shared repository) significantly boosts their capacity for complex problem-solving and discovery.
Potential implementation considerations, while not explicitly detailed in the abstract, would include:
Scalability: The preprint server needs to handle potentially large volumes of reports and retrieval requests from numerous agent laboratories.
Report Quality and Verification: Mechanisms may be needed to assess the quality, novelty, and correctness of agent-generated reports to prevent the propagation of erroneous information.
Retrieval Effectiveness: The utility of the shared knowledge depends heavily on the agents' ability to find truly relevant prior work. Advanced semantic search or graph-based knowledge representation might be necessary.
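As one illustration of the retrieval point, the naive keyword ranking in the earlier sketch could be swapped for embedding-based semantic search. The `embed` function below is an assumed interface to any sentence-embedding model; none of this is described in the paper:

```python
import math
from typing import Callable, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_search(reports: list[ResearchReport], query: str,
                    embed: Callable[[str], Sequence[float]],
                    k: int = 5) -> list[ResearchReport]:
    """Rank reports by similarity between the query embedding and an
    embedding of each report's title and abstract."""
    q_vec = embed(query)
    scored = sorted(reports, key=lambda r: -cosine(
        q_vec, embed(r.title + " " + r.abstract)))
    return scored[:k]
```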
The results imply that autonomous agents, when organized within collaborative frameworks like AgentRxiv, could contribute to the design and improvement of future AI systems, potentially accelerating the pace of discovery alongside human researchers.
In conclusion, AgentRxiv proposes and provides initial validation for a collaborative framework where autonomous research agents can share and build upon findings via a shared preprint server. The reported performance gains on mathematical reasoning benchmarks, along with evidence of generalization, suggest this paradigm holds promise for enabling more sophisticated and cumulative research conducted by AI agents.