Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation (2506.09991v2)

Published 11 Jun 2025 in cs.LG

Abstract: Autoregressive LLMs (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. For data creation, we develop Multiverse Curator, an automated LLM-assisted pipeline that transforms sequential reasoning chains into structured training data, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to support parallel inference. It features a dedicated interpreter that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gains, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, as well as complete data curation prompts and detailed training and evaluation recipes.

Summary

  • The paper introduces a novel MapReduce framework that decomposes tasks for parallel generation in LLMs, overcoming autoregressive limitations.
  • It details a co-designed ecosystem (Multiverse Curator, Multiverse Attention, and Multiverse Engine) that restructures chain-of-thought data, keeps parallel reasoning branches independent during training, and supports parallel inference.
  • Experiments show up to a 2x generation speedup and an average 24.5% gain in reasoning accuracy over the autoregressive base model.

This paper introduces Multiverse, a novel generative modeling framework designed to enable LLMs to perform natively parallel generation, moving beyond the sequential limitations of autoregressive (AR) models. The core idea is to internalize a MapReduce paradigm within the LLM, allowing it to adaptively decompose tasks, process subtasks in parallel, and synthesize results losslessly.

The authors observe that existing AR-LLMs often exhibit implicit parallelism in their sequential Chain-of-Thought (CoT) reasoning, with over 98% of analyzed trajectories showing parallelizable branches (categorized as collective or selective). However, these models cannot explicitly enforce or discern this parallelism.

To address this, Multiverse operates in three stages:

  1. Map Stage: Sequentially generates a task decomposition plan, mapping subtasks to independent branches.
  2. Process Stage: Executes these subtasks (branches) in parallel, independently.
  3. Reduce Stage: Sequentially synthesizes the results from all parallel branches.

This pipeline can be invoked recursively for complex tasks. Control over the flow is managed by specialized tags like <Parallel>, <Goal>, <Outline>, <Path>, and <Conclusion>, which are emitted by the model itself (a minimal control-flow sketch follows below).
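As a rough illustration of how such a tag-driven MapReduce generation could be orchestrated (covering both the three stages above and the sequential/parallel switching the Multiverse Engine performs), here is a minimal Python sketch. The `generate` stub, the stop-tag handling, and the exact tag grammar are assumptions for illustration, not the paper's released implementation.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inference call: continues `prefix` and stops when any string in
# `stop` is generated. Stand-in for a real serving backend.
def generate(prefix, stop):
    raise NotImplementedError

def run_multiverse(prompt):
    # Map stage: the model sequentially writes a decomposition plan, e.g. a
    # <Goal> plus one <Outline> entry per independent subtask, and stops when
    # it is about to open the first <Path>.
    plan = generate(prompt, stop=["<Path>"])
    subtasks = re.findall(r"<Outline>(.*?)</Outline>", plan, flags=re.S)

    # Process stage: one branch per subtask, decoded independently. Branches
    # share the common prefix (prompt + plan) but never see each other, so
    # they can be generated in parallel.
    with ThreadPoolExecutor() as pool:
        branches = list(pool.map(
            lambda task: "<Path>" + task
                         + generate(prompt + plan + "<Path>" + task,
                                    stop=["</Path>"])
                         + "</Path>",
            subtasks))

    # Reduce stage: the branch outputs are merged back into one sequence and
    # the model sequentially synthesizes a <Conclusion> over all of them.
    merged = prompt + plan + "".join(branches)
    return generate(merged + "<Conclusion>", stop=["</Conclusion>"])
```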

To build a practical Multiverse model, the paper presents a co-designed ecosystem:

  • Data Curation (Multiverse Curator): An automated LLM-assisted pipeline that converts sequential CoT data (from the s1K-1.1 dataset) into structured parallel data. This five-step process (a skeleton is sketched after this list) involves:

    1. Parsing the sequential chain into a summary tree.
    2. Identifying parallelizable nodes in the tree.
    3. Reformatting the summary into a parallel structure using control tags.
    4. Refilling original reasoning steps into this structure.
    5. Adding Map and Reduce stages and rewriting the Process stage for clarity and independence.

    The pipeline's output is Multiverse-1K, a dataset of 1,000 structured training samples.
  • Algorithm Design (Multiverse Attention): Modifies standard causal attention by adjusting attention masks and position embeddings. This allows independent reasoning branches to be processed in parallel during the Process stage: each path starts from an identical position and cannot attend across paths. During the Reduce stage, all paths converge. This design maintains training efficiency and allows rapid adaptation from pre-trained AR models (a toy mask construction is sketched after this list).

  • System Implementation (Multiverse Engine): An extension of existing inference engines (specifically SGLang) to support the MapReduce execution flow. It features an interpreter for the control tags, dynamically switching between sequential and parallel generation. It manages:
    • Sequential → Parallel: Mapping subtasks to parallel branches with prefix KV cache sharing.
    • Parallel → Sequential: Merging KV states from all branches back into a single sequence without significant overhead, leveraging features like radix attention.
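Returning to the Multiverse Curator, the skeleton below strings its five steps together around a generic `llm` call; the function, the prompt texts, and the intermediate representations are hypothetical placeholders, not the paper's actual curation prompts or code.

```python
# Hypothetical skeleton of the five-step curation pipeline; `llm` stands in for
# any LLM API call, and the prompts are placeholders, not the released ones.
def llm(prompt: str) -> str:
    raise NotImplementedError

def curate(sequential_cot: str) -> str:
    # 1. Parse the sequential chain into a summary tree.
    tree = llm(f"Summarize this reasoning chain as a tree of steps:\n{sequential_cot}")
    # 2. Identify which sibling nodes are mutually independent (parallelizable).
    annotated = llm(f"Mark independent sibling steps in this tree:\n{tree}")
    # 3. Reformat the summary into a parallel structure with control tags.
    skeleton = llm(f"Rewrite using <Parallel>/<Goal>/<Outline>/<Path>/<Conclusion> tags:\n{annotated}")
    # 4. Refill the original reasoning steps into the tagged structure.
    filled = llm(f"Insert the original steps from:\n{sequential_cot}\ninto:\n{skeleton}")
    # 5. Add Map and Reduce stages and rewrite each Path to be self-contained.
    return llm(f"Add Map and Reduce stages and make each <Path> independent:\n{filled}")
```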
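For the Multiverse Attention design, here is a toy mask construction assuming each token carries a branch label (0 for shared sequential tokens such as the prompt, Map, and Reduce stages; positive integers for parallel paths). The labeling scheme is an assumption made for illustration, and the accompanying change to position embeddings (each path restarting from the same index) is only noted in a comment rather than modeled.

```python
import numpy as np

def multiverse_mask(branch_ids):
    """Toy attention mask (True = may attend) for one flattened sequence.

    branch_ids[i] == 0 labels shared sequential tokens (prompt, Map, Reduce);
    branch_ids[i] == b > 0 labels tokens of parallel path b. The only change
    relative to causal attention is that tokens in different paths are hidden
    from each other; Reduce tokens (shared, appearing later) still see every
    path. Note: the real design also restarts each path's position index from
    the same value, which this sketch does not model.
    """
    ids = np.asarray(branch_ids)
    n = len(ids)
    causal = np.tril(np.ones((n, n), dtype=bool))  # ordinary causal mask
    cross_path = (
        (ids[:, None] != ids[None, :])   # different labels,
        & (ids[:, None] > 0)             # and both query
        & (ids[None, :] > 0)             # and key are path tokens
    )
    return causal & ~cross_path

# Example: two shared tokens, two 2-token parallel paths, then a Reduce token.
print(multiverse_mask([0, 0, 1, 1, 2, 2, 0]).astype(int))
```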

Experiments and Results:

The authors fine-tuned Qwen2.5-32B-Instruct on the Multiverse-1K dataset (combined with the original sequential data using a dynamic mixing ratio) for 3 hours on 8 NVIDIA B200 GPUs to create Multiverse-32B.

  • Reasoning Performance: On benchmarks like AIME24, AIME25, MATH500, and GPQA Diamond, Multiverse-32B significantly outperformed its base model (Qwen2.5-32B-Instruct) by an average of 24.5% and achieved performance on par with or exceeding other 32B AR-LLMs. For example, on AIME24, Multiverse-32B scored 53.8% pass@1 compared to the base model's 15.8% and an Autoregressive-32B (trained on the same data but sequentially) score of 54.6%. The degree of parallelism (tokens generated / sequential steps) was observed to be around 1.15-1.18 for Multiverse-32B on AIME tasks.
  • Scaling Performance: In budget-controlled experiments (fixed context length, and thus comparable generation time), Multiverse-32B demonstrated superior scaling, outperforming AR-LLMs by an average of 1.87% on GPQA-Diamond and MATH500 because it can generate more tokens in parallel within the same budget.
  • Efficiency Analysis: Multiverse Engine achieved up to 2x wall-clock speedup per generated token compared to sequential generation, with speedups increasing with the degree of parallelism. This speedup was shown to be stable across varying batch sizes (1 to 128), indicating good scalability.

The paper concludes that Multiverse offers a viable alternative to purely autoregressive modeling, enabling efficient parallel generation without compromising performance. The authors have open-sourced the entire ecosystem, including data, model weights, engine, and tools. Limitations include the need to explore applications beyond LLM reasoning and to integrate reinforcement learning for potentially discovering more parallelism.
