
Xolver: Multi-Agent Reasoning Framework

Updated 30 June 2025
  • Xolver is a multi-agent reasoning framework that integrates persistent memory and diverse agent insights to tackle complex math and programming challenges.
  • It combines episodic memory, dynamic teamwork, and external tool retrieval to refine problem-solving strategies iteratively.
  • Benchmark evaluations reveal Xolver’s superior performance with high accuracy on GSM8K, AIME, and LiveCodeBench compared to traditional LLM approaches.

Xolver is a multi-agent reasoning framework designed to endow LLMs with an evolving, persistent memory of holistic experience, enabling collaborative and experience-aware problem solving akin to expert teams in mathematics or programming competitions. Unlike traditional LLMs, which approach each query in isolation, Xolver integrates diverse modalities of episodic learning, agent interaction, and tool augmentation, yielding performance surpassing single-agent or experience-agnostic approaches on complex mathematical and programming tasks.

1. Framework Architecture and Core Components

Xolver operationalizes holistic experience learning via a memory-augmented, multi-agent infrastructure. Its primary architectural elements include:

  • Planner Agent: Dynamically instantiates a team of specialist reasoning agents tailored to the task (e.g., mathematician, programmer).
  • Dynamic Reasoning Agents: Each agent brings a distinct area of expertise; agents reason in parallel or iteratively, guided by exemplar problems and the current shared memory.
  • Judge Agent: Assigns structured scores and qualitative feedback to agent responses, supporting principled selection and further refinement.
  • Verifier/Debugger Agent: Extracts final answers from reasoning traces and, for code-based tasks, executes and validates generated code.
  • Episodic Memory (\mathcal{D}_E): A persistent database of previously encountered problems, solutions, and reasoning trajectories, integrating both external knowledge and self-experience.
  • Intermediate Shared Memory (\mathcal{D}_S): Stores the top-k intermediate reasoning traces, agent responses, and feedback within each inference session, updated iteratively.

Throughout inference, these agents interact with both shared memory and external tools (e.g., code execution engines), in a process modeled on the collaborative learning and problem solving of high-performing human teams.
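To make the division of labor concrete, the components above can be sketched as minimal Python classes. This is an illustrative sketch only: all class names, method names, and the team-selection heuristic are assumptions for exposition, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    """D_E: persistent store of problems, solutions, and reasoning traces."""
    entries: list = field(default_factory=list)

@dataclass
class SharedMemory:
    """D_S: top-k intermediate traces kept within one inference session."""
    k: int = 3
    entries: list = field(default_factory=list)

    def update(self, scored_tuples):
        # retain only the k highest-scored (trace, response, feedback) tuples
        self.entries = sorted(scored_tuples, key=lambda e: e["score"],
                              reverse=True)[: self.k]

@dataclass
class Planner:
    def build_team(self, task: str) -> list:
        # stub heuristic: instantiate specialists according to task type
        roles = ["mathematician"]
        if "code" in task:
            roles.append("programmer")
        return roles
```

The judge and verifier agents would operate on the tuples held in `SharedMemory`, scoring and filtering them each round.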

2. Modalities of Holistic Experience Integration

Xolver unifies several modalities of experience at inference time:

  • External Retrieval: Uses similarity-based search (e.g., BM25) over a large solved corpus to provide each agent with contextually matched exemplars—problems, solutions, and code—relevant to the current query.
  • Self-Retrieval: When external data are insufficient, queries the LLM’s own parametric memory for analogous tasks and responses.
  • Collaborative Interaction: Reasoners share and adapt to each other’s partial solutions, benefiting from mutual expertise and emergent agent dynamics.
  • Tool Use: Interfaces with external computation platforms, such as Python interpreters, enabling precise numerical calculation, symbolic manipulation, or automated code validation.
  • Agent-Driven Judging: An explicit judge agent provides structured, critical evaluation of each reasoning trajectory, selecting the most promising for further refinement and promoting self-correction.
  • Iterative Refinement: Multi-round update cycles allow agents to learn from collective history, accumulating improvements via episodic and shared memory.

This multi-modal experience accrual allows Xolver to bootstrap and incrementally enhance its reasoning capacity within each session, avoiding generation "from scratch" and instead leveraging a compounding memory of successes, errors, and strategic patterns.
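The external-retrieval modality can be illustrated with a self-contained Okapi BM25 ranker; the corpus contents and the parameter values (k1 = 1.5, b = 0.75) below are conventional defaults chosen for the sketch, not values reported for Xolver.

```python
import math
from collections import Counter

def bm25_rank(query, corpus, k1=1.5, b=0.75):
    """Return corpus indices ranked by Okapi BM25 score for the query."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

In Xolver's pipeline, the top-ranked exemplars retrieved this way seed each agent's initial context before any reasoning begins.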

3. Performance and Comparative Evaluation

Empirical evaluation demonstrates Xolver’s superior reasoning capability across several high-difficulty benchmarks:

Benchmark          Xolver (o3-mini-high)   SOTA Baseline (if available)
GSM8K              98.1%                   o3: 96.7%
AIME’24            94.4%                   Qwen3-235B: 85.7%; o4: 93.4%
AIME’25            93.7%                   o4: 92.7%
Math-500           99.8%                   n/a
LiveCodeBench-V5   91.6%                   o4: 69.5%; Qwen3-235B: 70.7%

With backbones of moderate size (e.g., QWQ-32B), Xolver frequently outperforms much larger or more advanced LLMs, including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high, and achieves substantial lifts (+22% or more) over prior agentic systems such as OctoTools and CheatSheet on large code and math benchmarks.

4. Mechanisms of Memory-Based Reasoning and Learning

Xolver implements a procedural pipeline that structures and updates agent memory during inference, embodied by the following steps:

  1. Experience-Driven Context Construction: Each agent j initially builds its context \mathcal{C}_0^j from the query q and top-ranked exemplars:

\mathcal{C}_0^j = \{q\} \cup \mathcal{R}(\mathcal{D}_E)

where \mathcal{R}(\mathcal{D}_E) denotes the retrieval operator over the episodic memory.

  2. Iterated Memory-Augmented Reasoning: At each iteration i, the agent's context is:

\mathcal{C}_i^j = \{q\} \cup \{ T_{i-1}^j, R_{i-1}^j \} \cup \mathcal{D}_S

incorporating the agent’s own previous trace (T_{i-1}^j) and response (R_{i-1}^j) together with the current shared memory \mathcal{D}_S.

  3. Shared Memory Update: After each round, the system retains only the top-m tuples (reasoning trace, response, judge feedback) from the round’s candidate pool \mathcal{M}, selected via the judge’s scalar scoring function s(e):

\mathcal{D}_S \leftarrow \mathrm{TopK}\bigl(\mathcal{M},\, m;\, \mathrm{key}(e) = s(e)\bigr)

  4. Termination: The inference loop continues until either mathematical correctness (score 1.0) or code verification (all test cases pass) is achieved for every response in \mathcal{D}_S.

A notable property is that agents learn "by analogy," incrementally refining and extending previous solutions (both their own and those of teammates), and reusing strategies and code fragments drawn from memory and prior sessions.
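The four steps above can be condensed into a short Python loop. The agent and judge callables are stubbed placeholders (their signatures are assumptions for this sketch); only the context assembly, top-m memory update, and termination check mirror the protocol described above.

```python
import heapq

def xolver_loop(query, exemplars, agents, judge, max_iters=4, m=3):
    """Iterative shared-memory refinement; D_S holds (score, trace, response)."""
    shared = []
    for _ in range(max_iters):
        candidates = list(shared)
        for agent in agents:
            # C_i^j = {q} ∪ retrieved exemplars ∪ responses currently in D_S
            context = [query] + exemplars + [resp for _, _, resp in shared]
            trace, response = agent(context)
            candidates.append((judge(query, trace, response), trace, response))
        # TopK: keep the m highest-scored tuples as the new shared memory
        shared = heapq.nlargest(m, candidates, key=lambda e: e[0])
        if shared and all(score >= 1.0 for score, _, _ in shared):
            break                     # every retained response verified correct
    return shared[0][2] if shared else None   # best-scored response
```

Because `candidates` seeds each round with the surviving tuples from the previous round, refinement is monotone: a strong earlier answer is never displaced by a weaker later one.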

5. Implications for Generalist and Autonomous AI Agents

Xolver represents a system-level advance in generalist AI, demonstrating that reasoning capacity and reliability can be improved not solely via model scaling or retraining, but by orchestrating structured episodic memory, agent interaction, and experience-guided refinement:

  • Model-Agnostic Gains: Performance gains are consistent across both open and closed LLMs, including QWQ-32B and o3-mini-high, without model-specific tuning.
  • Transfer and Robustness: Episodic and intermediate memory support not only intra-session learning, but also continual transfer of strategies across problems and domains.
  • Emergent Behaviors: The collaborative, multi-specialist agent setup engenders emergent skills including error correction, strategic agreement, and improved reasoning alignment.

A plausible implication is that such holistic and collaborative experience mechanisms may become foundational for AI systems aspiring toward expert-level performance across varied symbolic, scientific, and engineering domains.

6. Technical Formulation and Inference Protocol

The system’s computational complexity is dominated by the number of agents (m) and iterations (\mathcal{I}), with memory complexity O(m\mathcal{I}). The core inference protocol is structured as follows:

\begin{algorithmic}
\STATE  Input: Query $q$, Tools $\mathcal{T}$, Episodic Memory $\mathcal{D}_E$, parameters $m$, $\mathcal{I}$
\STATE  Planner constructs agent team $\{a_1, \dots, a_m\}$
\FOR{$i = 1$ to $\mathcal{I}$}
    \STATE Agents build context via $\mathcal{C}_i^j$ using $\mathcal{D}_E$ and $\mathcal{D}_S$
    \STATE Each agent outputs $(T_i^j, R_i^j)$, judged to produce scores $s(e)$
    \STATE Update $\mathcal{D}_S$ with top $m$ tuples
    \IF{converged} break \ENDIF
\ENDFOR
\STATE Final answer extracted via the verifier agent
\STATE Update episodic memory $\mathcal{D}_E$ and output the final answer
\end{algorithmic}

Retrieved knowledge (via BM25 or LLM-sampling for self-retrieval) and structured feedback promote sample efficiency and reduce redundant computation by facilitating focused, memory-driven context aggregation.

7. Prospects and Future Developments

Xolver’s design suggests multiple avenues for further enhancement and extension:

  • Optimization of Computational Resources: Reducing inference token/computation costs via more efficient agent selection, memory management, and retrieval filtering.
  • Domain Adaptation: Applying the framework to domains beyond mathematics/programming, such as scientific discovery, symbolic integrals, or planning.
  • Advanced Tool Integration: Incorporating a broader tool ecosystem, including theorem provers, symbolic APIs, and controlled web access.
  • Continual Learning: Mechanisms for updating episodic memory seamlessly across tasks and sessions, advancing toward persistent, life-long learning.
  • External Verification: Integration of learned or programmatic verifiers to enforce logical and factual consistency across agent responses.

Summary Table: Xolver Key Features and Results

Aspect                 Details/Results
Architecture           Multi-agent, memory-augmented, judge-mediated, tool-augmented
Experience Sources     External/self-retrieval, agent collaboration, feedback, tool use
Learning Mechanism     Incremental refinement via persistent memory and agent feedback
Benchmarks             GSM8K 98.1%, AIME’24 94.4%, AIME’25 93.7%, Math-500 99.8%, LiveCodeBench-V5 91.6%
Technical Protocol     Iterative memory update, context-building, agent evaluation
Agnostic Performance   Robust across open and closed LLMs
Noteworthy Advances    Shift from isolated LLM inference to experience-driven, generalist reasoning

Xolver exemplifies the transition from one-off, static LLM outputs toward architectures that learn, accumulate, and reuse experience in collaborative and tool-rich settings, with implications for the future of robust, expert-level reasoning in AI.