
Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team (2506.14234v1)

Published 17 Jun 2025 in cs.CL and cs.AI

Abstract: Despite impressive progress on complex reasoning, current LLMs typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME'24 (94.4%), AIME'25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at https://kagnlp.github.io/xolver.github.io/.

Summary

  • The paper pioneers a multi-agent framework that integrates persistent experiential memory to enhance LLM inference capabilities.
  • It employs iterative refinement using external retrieval, tool use, and agent evaluation to achieve superior performance on complex benchmarks.
  • The framework’s success on GSM8K, AIME, Math-500, and LiveCodeBench demonstrates its potential for dynamic, adaptable problem-solving in diverse domains.

Overview of "Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team"

The paper introduces Xolver, a training-free multi-agent reasoning framework designed to enhance the problem-solving capabilities of LLMs. The framework emphasizes "holistic experience learning" by equipping a black-box LLM with a persistent, evolving memory akin to the experiential learning mechanisms of expert problem solvers such as Olympiad teams. This approach departs from conventional LLM inference, which treats each problem in isolation, and instead creates a collaborative environment for reasoning.

Xolver's primary contribution is an integrated system that uses experiential knowledge to improve inference. By combining external and self-retrieval, tool use, collaborative interaction, and agent-driven evaluation, Xolver leverages an evolving memory to iteratively refine its reasoning. This memory-driven method marks a shift from isolated inference toward adaptive, collaborative language agents.
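The interplay of these components can be sketched as a simple loop: retrieve relevant past episodes, let several agents propose solutions, score them with a judge agent, refine over multiple rounds, and write the best result back to memory. This is a minimal illustrative sketch, not the paper's implementation; all names (`ExperienceMemory`, `Episode`, `solve`) and the word-overlap retrieval heuristic are assumptions made for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One solved problem stored as reusable experience."""
    problem: str
    solution: str
    score: float

@dataclass
class ExperienceMemory:
    episodes: list = field(default_factory=list)

    def retrieve(self, problem: str, k: int = 3) -> list:
        # Toy relevance measure: shared-word overlap with the query.
        def overlap(ep):
            return len(set(problem.split()) & set(ep.problem.split()))
        return sorted(self.episodes, key=overlap, reverse=True)[:k]

    def store(self, episode: Episode) -> None:
        self.episodes.append(episode)

def solve(problem, agents, judge, memory, rounds=3):
    """Xolver-style loop: retrieval -> proposals -> evaluation -> refinement."""
    context = memory.retrieve(problem)            # external/self-retrieval
    best = None
    for _ in range(rounds):                       # iterative refinement
        candidates = [agent(problem, context, best) for agent in agents]
        scored = [(judge(problem, c), c) for c in candidates]  # agent-driven evaluation
        score, answer = max(scored)
        if best is None or score > best[0]:
            best = (score, answer)
        # Intermediate attempts become in-context experience for the next round.
        context = context + [Episode(problem, answer, score)]
    memory.store(Episode(problem, best[1], best[0]))  # persistent, evolving memory
    return best[1]
```

With stub agents and a stub judge, `solve("what is 6*7", agents, judge, memory)` returns the candidate the judge scores highest and appends one episode to memory, illustrating how experience accumulates across problems.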

Key Findings and Numerical Performance

Xolver demonstrates superior performance over existing models across multiple complex reasoning benchmarks, including GSM8K, AIME '24 and '25, Math-500, and LiveCodeBench-V5. With a stronger backbone such as o3-mini-high, the results are particularly strong: 98.1% on GSM8K, 94.4% on AIME'24, 93.7% on AIME'25, 99.8% on Math-500, and 91.6% on LiveCodeBench-V5. These results indicate that Xolver not only surpasses specialized reasoning agents such as OctoTools and CheatSheet but also competes effectively against leading models such as Qwen3-235B and Gemini 2.5 Pro.

Notably, Xolver remains robust even when instantiated with lightweight backbones such as QWQ-32B, often outperforming larger state-of-the-art LLMs. This suggests its gains stem from the experiential learning framework rather than raw model scale.

Theoretical and Practical Implications

Theoretically, Xolver suggests a shift in the capabilities of AI agents from passive input-output processors to dynamic, context-aware participants in the problem-solving process. It bridges the gap between static, data-driven models and agents that actively integrate experiential learning into reasoning tasks.

Practically, the multi-agent system within Xolver can be extended to domains requiring comprehensive problem-solving capabilities. Its iterative refinement approach suggests applications in complex, dynamic systems where learning from historical data is crucial.

Future Directions

Xolver illustrates a promising pathway toward generalist agents that adapt their reasoning over time, moving beyond static cognition toward more flexible, adaptive solutions. By open-sourcing the code and data, the authors also provide a foundation for future research into deeper integrations of experiential learning and reasoning, as well as more efficient memory mechanisms.

One avenue for future work is enhancing Xolver's interoperability with diverse toolsets, extending its functionality beyond the current Python execution environment. There is also significant room for optimizations that reduce the computational overhead of maintaining and updating its persistent memory structures.
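One natural way to bound that overhead is to cap the experience store and evict the least useful episodes as new ones arrive. The sketch below shows such a capacity-bounded memory using a min-heap keyed on a usefulness score; the `BoundedMemory` class and its scoring scheme are hypothetical illustrations, not part of the paper.

```python
import heapq

class BoundedMemory:
    """Experience store with a fixed capacity and least-useful eviction."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        # Min-heap of (usefulness, insertion_counter, episode); the counter
        # breaks ties so episodes themselves are never compared.
        self._heap = []
        self._counter = 0

    def store(self, episode, usefulness: float) -> None:
        self._counter += 1
        entry = (usefulness, self._counter, episode)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif usefulness > self._heap[0][0]:
            # Replace the current least-useful episode in O(log n).
            heapq.heapreplace(self._heap, entry)

    def episodes(self) -> list:
        return [ep for _, _, ep in self._heap]

    def __len__(self) -> int:
        return len(self._heap)
```

Because retrieval cost grows with memory size, a bound like this keeps per-query latency predictable while retaining the episodes judged most valuable.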

In summary, Xolver represents a significant step forward in LLM capabilities, enabling a more integrated and iterative approach to problem-solving and reasoning. It opens new possibilities for AI systems that mimic the adaptive learning mechanisms of expert human problem solvers.
