
Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation (2507.06607v2)

Published 9 Jul 2025 in cs.CL and cs.LG

Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.


Summary

  • The paper introduces SambaY, a novel decoder-hybrid-decoder architecture that uses Gated Memory Units for efficient memory sharing across layers.
  • The paper shows significant efficiency gains and a lower irreducible loss when scaling to 3.4B parameters using a principled μP++ hyperparameter transfer scheme.
  • The paper demonstrates improved long-context retrieval and reasoning capabilities, outperforming baselines on challenging benchmarks and delivering higher throughput.

Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

This paper introduces SambaY, a novel decoder-hybrid-decoder architecture designed to enhance the efficiency of long-context LLMs. SambaY leverages a Gated Memory Unit (GMU) to share memory across layers, achieving significant improvements in decoding efficiency and long-context performance. The architecture combines a Samba-based self-decoder with a cross-decoder incorporating GMUs, demonstrating superior scalability and reasoning capabilities compared to existing approaches.

Gated Memory Unit and Architecture

The core innovation of this work is the Gated Memory Unit (GMU), a mechanism for efficient memory sharing across layers. The GMU operates on the current layer's input and a memory state from a previous layer, producing a gated representation through learnable projections. The GMU can be expressed as:

$$\mathbf{y}_l = \left(\mathbf{m}_{l'} \odot \sigma(W_1 \mathbf{x}_l)\right) W_2$$

where $\mathbf{x}_l$ is the current layer's input, $\mathbf{m}_{l'}$ is the memory state from a previous layer, $\sigma(\cdot)$ is the SiLU activation function, $\odot$ is element-wise multiplication, and $W_1, W_2$ are learnable weight matrices.

Figure 1: The decoder-hybrid-decoder architecture, SambaY, uses Samba as the self-decoder and GMUs interleaved with cross-attention layers in the cross-decoder.
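For concreteness, here is a minimal PyTorch sketch of a GMU that follows the equation above. The module and parameter names (`GatedMemoryUnit`, `w1`, `w2`) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    """Sketch of y_l = (m_{l'} ⊙ SiLU(W1 x_l)) W2 (names are hypothetical)."""

    def __init__(self, d_model: int, d_memory: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_memory, bias=False)  # W1: project input to memory dim
        self.w2 = nn.Linear(d_memory, d_model, bias=False)  # W2: project gated memory back
        self.act = nn.SiLU()

    def forward(self, x_l: torch.Tensor, m_prev: torch.Tensor) -> torch.Tensor:
        # x_l:    (batch, seq, d_model)  current layer's input
        # m_prev: (batch, seq, d_memory) memory readout shared from an earlier SSM layer
        gate = self.act(self.w1(x_l))      # SiLU(W1 x_l)
        return self.w2(m_prev * gate)      # (m ⊙ gate) W2


# Usage sketch
gmu = GatedMemoryUnit(d_model=1024, d_memory=1024)
x = torch.randn(2, 16, 1024)
m = torch.randn(2, 16, 1024)
y = gmu(x, m)  # (2, 16, 1024)
```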

SambaY applies GMUs to the cross-decoder of YOCO, replacing half of the cross-attention layers. This reduces the memory I/O complexity for these layers from $O(d_{kv} N)$ to $O(d_{h})$, where $N$ is the sequence length, $d_{kv}$ is the dimension of the key-value pairs, and $d_{h}$ is the hidden dimension. The authors claim that this leads to significant efficiency gains when $N \gg d_{h} / d_{kv}$.
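A rough back-of-the-envelope comparison illustrates the claim; the dimensions below are assumed for illustration only and are not taken from the paper.

```python
# Per-token memory reads during decoding (elements read per generated token, per layer).
N = 32_768   # sequence length
d_kv = 256   # assumed key-value dimension of a cross-attention layer
d_h = 2_048  # assumed hidden dimension of the shared memory readout

cross_attention_reads = d_kv * N  # cross-attention re-reads the full KV cache each step
gmu_reads = d_h                   # a GMU only reads the current memory state

print(cross_attention_reads / gmu_reads)  # ~4096x fewer reads, since N >> d_h / d_kv
```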

Scaling Experiments and Results

To ensure a fair comparison of scaling capabilities, the authors designed a principled $\mu$P++ hyperparameter transfer scheme that accounts for both depth and width scaling. They conducted extensive experiments up to 3.4B parameters and 600B tokens to verify the scaling behaviors of both their scaling laws and the SambaY architecture.

Figure 2: Validation loss versus FLOPs and training tokens demonstrates that SambaY, when scaled with $\mu$P++, exhibits a lower irreducible loss compared to the Standard Parameterization (SP).

Compared to Samba+YOCO, SambaY exhibits a significantly lower irreducible loss, suggesting better scaling potential with large-scale compute. The scaling trajectories were quantitatively compared by fitting the validation loss $L$ as a function of compute (FLOPs), denoted $D_{\text{FLOPs}}$, to a power law: $L(D_{\text{FLOPs}}) = A \cdot D_{\text{FLOPs}}^{-b} + C$, where $C$ represents the irreducible loss. SambaY demonstrated the lowest irreducible loss ($C = 0.58$) for FLOPs scaling.
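A minimal sketch of such a power-law fit with SciPy's `curve_fit`; the loss values below are hypothetical placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical validation losses at increasing compute budgets (illustrative only).
flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
loss = np.array([2.31, 2.14, 2.01, 1.92, 1.86])

def power_law(d_flops, A, b, C):
    # L(D_FLOPs) = A * D_FLOPs^(-b) + C, where C is the irreducible loss.
    return A * d_flops ** (-b) + C

# Normalize FLOPs to keep the optimizer well-conditioned.
(A, b, C), _ = curve_fit(power_law, flops / 1e19, loss, p0=(1.0, 0.3, 1.5), maxfev=20000)
print(f"fitted irreducible loss C ≈ {C:.2f}")
```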

Long Context Retrieval and Reasoning

The authors evaluated the long-context retrieval capabilities of SambaY using the Phonebook benchmark with a 32K context length. Surprisingly, larger Sliding Window Attention (SWA) sizes did not consistently provide better results. They found that smaller sliding window sizes, particularly in models like SambaY and SambaY+DA, enabled a focus on local patterns and mitigated issues like attention sinks.

Figure 3: Accuracy versus sliding window size on Phonebook indicates that larger SWA sizes do not consistently provide better results.

The model Phi4-mini-Flash-Reasoning (3.8B parameters) was pre-trained on 5T tokens. Downstream evaluation (Table 1) demonstrates that Phi4-mini-Flash outperforms the Phi4-mini baseline across a diverse range of tasks, with notable improvements on knowledge-intensive benchmarks like MMLU and coding tasks such as MBPP. The model also achieves significantly better performance than Phi4-mini-Reasoning on reasoning benchmarks (AIME24/25, Math500, and GPQA Diamond), while delivering up to 10$\times$ higher throughput in long-generation scenarios and a 4.9$\times$ speedup in long-context processing.

Figure 4: Throughput and latency of text generation under the vLLM inference framework show that SambaY achieves the best throughput in both long-context and long-generation settings.
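A rough sketch of how one might measure decode throughput for a long-generation setting under vLLM; this is not the paper's benchmark script, and the model identifier and sampling settings are assumptions.

```python
import time
from vllm import LLM, SamplingParams

# Assumed model id and settings; adjust to your environment.
llm = LLM(model="microsoft/Phi-4-mini-flash-reasoning", trust_remote_code=True)
params = SamplingParams(temperature=0.6, max_tokens=32_768)  # long-generation setting

prompts = ["Prove that the sum of the first n odd numbers is n^2."] * 8
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"decode throughput ≈ {generated / elapsed:.1f} tok/s")
```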

Ablation Studies

Ablation studies systematically investigated the design choices in the decoder-hybrid-decoder architecture. Several architectural variants of SambaY were explored, including variations in the self-decoder (SambaY-2, GDNY) and the application of GMUs to gate different intermediate representations. The results indicate that GMUs are effective with alternative memory sources, but their performance on retrieval tasks depends significantly on the inherent characteristics of the memory source.

Figure 5: Architectural variants, including GDNY with Gated DeltaNet and nGMU in the cross-decoder, were explored to study the design choices in the decoder-hybrid-decoder architecture.

Conclusion

The paper introduces the Gated Memory Unit (GMU) and the SambaY architecture, demonstrating significant improvements in computational efficiency and long-context performance. Extensive scaling experiments indicate superior scaling properties with increasing computational resources. The largest model, Phi4-mini-Flash-Reasoning, outperforms existing models on challenging reasoning benchmarks while delivering substantially higher decoding throughput on long-context generations. The authors note that future work could explore dynamic sparse attention mechanisms to further improve efficiency on extremely long sequence generation.
