Two-Stage Decoder-Only Architecture

Updated 17 June 2026

The Two-Stage Decoder-Only (TDO) architecture is a design paradigm that explicitly divides processing into a contextual representation phase and a specialized decoding phase to optimize resource usage.
It helps bridge heterogeneous modalities (such as language, speech, and vision) by adapting decoder stages to different data types and tasks.
Reported results show improvements in memory efficiency, throughput, and error rates relative to traditional monolithic decoder or encoder-decoder models.

A Two-Stage Decoder-Only (TDO) architecture denotes a design paradigm, found across multiple machine learning fields, in which model computation is explicitly organized into two successive phases, both built from decoder or decoder-style sub-networks. In contrast to canonical encoder-decoder or fully monolithic decoder architectures, a TDO system decomposes its computational pipeline to facilitate modular learning, boost efficiency, enable data-type bridging, or expose internal feature specialization. Key instantiations arise in neural language modeling, sequence transduction, view synthesis, speech recognition, and channel decoding (Luo et al., 13 Oct 2025, Sun et al., 2024, Tsunoo et al., 2023, Sun et al., 28 May 2026, Qu et al., 2024, Montorsi et al., 2017).

1. Foundational Principles and Motivations

The canonical decoder-only Transformer (e.g., GPT) applies a stacked series of masked self-attention layers over input and context tokens, producing outputs one token at a time. Recent analyses show that these stacks can be naturally interpreted as comprising at least two functionally distinct depthwise regions: (i) early/middle layers form deep contextual representations or latent alignments, and (ii) late layers specialize in decoding those representations into the final output sequence (Luo et al., 13 Oct 2025). The TDO philosophy formalizes this implicit specialization, organizing the computational graph or workflow into two explicit phases—each operating as a decoder—potentially with different data visibility, data types, masking, or learned parameters.

This staging offers multiple theoretical and practical benefits:

- Efficient resource utilization, including less redundant computation during autoregressive decoding or large-context handling (Luo et al., 13 Oct 2025, Sun et al., 2024).
- Bridging heterogeneous modalities, e.g., using a first stage to compress audio, images, or source-language text into an intermediate embedding, then a second stage for target token prediction (Tsunoo et al., 2023, Sun et al., 28 May 2026, Qu et al., 2024).
- Potential for enhanced transfer, since TDO can enforce the intermediate alignment (linguistic, semantic, or visual) needed for better zero-shot generalization or representation disentanglement (Qu et al., 2024, Sun et al., 28 May 2026).

2. Architectural Formulations in Key Domains

2.1 Sequence Modeling and LLMs

Modern decoder-only LLMs exhibit depthwise specialization. TDO-based designs partition inference accordingly:

- Stage I: The first $L_{\rm mid}$ (early and middle) layers generate a rich contextual representation $\mathbf{H}_{\rm mid}$ from the previous tokens.
- Stage II: The remaining $L_{\rm late} = L-L_{\rm mid}$ (late) layers decode $\mathbf{H}_{\rm mid}$ into probability distributions over output tokens (Luo et al., 13 Oct 2025).

Such partitioning underlies Direct Multi-Token Decoding (DMTD), where a full forward pass through early/middle layers is conducted only periodically, and the late decoder layers can operate multiple times on cached representations to accelerate generation (Luo et al., 13 Oct 2025).
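The scheduling idea behind this kind of block-wise decoding can be sketched with a toy model. The following is a minimal illustration under stated assumptions, not the DMTD implementation: the layer function, the mean-pooled context, the way Stage II conditions on the last emitted token, and all names (`run_layers`, `generate`, `L_MID`, `L_LATE`) are hypothetical. It only shows how refreshing $\mathbf{H}_{\rm mid}$ once per block changes the number of layer evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB = 16, 50        # hypothetical model width and vocabulary size
L_MID, L_LATE = 6, 2     # hypothetical split of an 8-layer stack

embed = rng.normal(size=(VOCAB, D)) * 0.1
early_w = [rng.normal(size=(D, D)) * 0.1 for _ in range(L_MID)]
late_w = [rng.normal(size=(D, D)) * 0.1 for _ in range(L_LATE)]
lm_head = rng.normal(size=(D, VOCAB)) * 0.1


def run_layers(h, weights, counts, name):
    """Apply toy residual layers and count how many were evaluated."""
    for w in weights:
        h = h + np.tanh(h @ w)   # stand-in for a Transformer block
        counts[name] += 1
    return h


def generate(prompt, n_new, block):
    """Greedy toy decoding that refreshes H_mid once every `block` tokens."""
    tokens, counts = list(prompt), {"early": 0, "late": 0}
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        # Stage I (once per block): mean pooling is a toy stand-in for attention.
        h_mid = run_layers(embed[tokens].mean(axis=0), early_w, counts, "early")
        for _ in range(block):
            if len(tokens) == target_len:
                break
            # Stage II (per token): conditions on the cached h_mid plus the last
            # emitted token's embedding (an assumption of this sketch).
            h = run_layers(h_mid + embed[tokens[-1]], late_w, counts, "late")
            tokens.append(int(np.argmax(h @ lm_head)))
    return tokens, counts
```

In this toy setting, generating 8 tokens with `block=4` evaluates 12 early and 16 late layers (28 in total), whereas `block=1` evaluates 48 early and 16 late layers (64 in total). More generally, a block size of $k$ costs roughly $L_{\rm mid}/k + L_{\rm late}$ layer evaluations per token.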

The YOCO (You Only Cache Once) architecture further realizes this paradigm by:

- Using a self-decoder with constant-memory attention (sliding window or gated retention) for the first half of the layers.
- Generating a global KV cache after this phase, which is reused by every layer of the cross-decoder (the latter half of the layers), so that these layers share one cache instead of each maintaining its own growing cache (Sun et al., 2024).
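A rough count of cached key/value entries shows why caching once helps at long context. This is a back-of-the-envelope sketch, assuming a sliding-window self-decoder and ignoring attention heads, numeric precision, weights, and activations; the function names and the `window` parameter are hypothetical and not taken from the YOCO paper.

```python
def standard_kv_entries(n_tokens, n_layers, d_model):
    """Every layer caches keys and values for every token."""
    return 2 * n_layers * n_tokens * d_model


def yoco_style_kv_entries(n_tokens, n_layers, d_model, window):
    """Half the layers keep a bounded local cache; one global cache is shared."""
    self_layers = n_layers // 2
    # Self-decoder: per-layer cache bounded by the sliding window (assumption).
    local = 2 * self_layers * min(window, n_tokens) * d_model
    # Cross-decoder: a single global KV cache reused by all of its layers.
    shared = 2 * n_tokens * d_model
    return local + shared


if __name__ == "__main__":
    std = standard_kv_entries(1_000_000, 32, 4096)
    two_stage = yoco_style_kv_entries(1_000_000, 32, 4096, window=1024)
    print(f"cache entry ratio: {std / two_stage:.1f}x")
```

Under these simplifications the ratio approaches the layer count as the context grows, since the only term that scales with sequence length is the single shared cache. Measured memory savings are smaller than this idealized ratio because total memory also includes weights and activations.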

2.2 Speech Recognition

In end-to-end ASR, TDO enables bridging audio and text in a pure Transformer stack:

- Stage 1: A Conformer encoder with CTC collapse generates a sparse sequence of non-blank prompt vectors from audio, removing redundant frames (Tsunoo et al., 2023).
- Stage 2: The prompt is prepended to the inputs of a decoder-only Transformer, which performs autoregressive refinement to produce the final transcription. The decoder can be further LM-trained on text-only data without retraining the prompt or encoder mechanisms.

This approach delivers both improved sample efficiency (easily leveraging text data) and competitive error rates versus encoder-decoder baselines (Tsunoo et al., 2023).
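The prompt-extraction step in Stage 1 can be sketched as a frame filter over the encoder output. This is a minimal sketch, assuming greedy (argmax) CTC labels and the standard CTC collapse rule (merge consecutive repeats, then drop blanks); whether the cited system merges repeats or only drops blank frames is not stated here, and the name `ctc_prompts` is hypothetical.

```python
import numpy as np


def ctc_prompts(encoder_out, ctc_logits, blank_id=0):
    """Select encoder frames to use as decoder prompts.

    encoder_out: (T, D) frame-level encoder vectors.
    ctc_logits:  (T, V) frame-level CTC scores.
    Returns the kept vectors (T', D) and their greedy labels (T',).
    """
    labels = ctc_logits.argmax(axis=-1)
    # Label of the previous frame; the first frame is compared against blank.
    prev = np.concatenate(([blank_id], labels[:-1]))
    # Keep a frame if it is non-blank and starts a new run of its label.
    keep = (labels != blank_id) & (labels != prev)
    return encoder_out[keep], labels[keep]


if __name__ == "__main__":
    frame_labels = np.array([0, 3, 3, 0, 3, 5, 5, 0])
    logits = np.eye(6)[frame_labels]              # one-hot scores for the demo
    frames = np.arange(16, dtype=float).reshape(8, 2)
    prompts, kept = ctc_prompts(frames, logits)
    print(kept, prompts.shape)
```

The retained vectors would then be prepended to the decoder's token embeddings, so the decoder input length scales with the number of emitted labels rather than with the number of audio frames.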

2.3 Multilingual Neural Machine Translation

For MNMT, TDO architectures impose explicit separation:

- Stage 1: Runs on source tokens (plus target-language instruction) with no target token visibility. It acts as a "pseudo-encoder" that aligns source representations into target-language feature space via prefix-masked self-attention (Qu et al., 2024).
- Stage 2: Consumes both the output of Stage 1 and previously generated target tokens, predicting next target tokens autoregressively.

Small FFN adapter modules are added to route representations between stages, and an instruction-level contrastive loss further enhances cross-lingual transfer. Experiments show that TDO outperforms or matches encoder-decoder models in zero-shot translation, highlighting the impact of the explicit two-stage split (Qu et al., 2024).
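The visibility rules of the two stages can be written down as attention masks. The sketch below is one plausible reading, assuming that source positions attend to each other bidirectionally in Stage 1 (a common prefix-LM convention) and that Stage 2 target positions see all source positions plus a causal view of the targets; the function name and the exact masking used by Qu et al. are assumptions.

```python
import numpy as np


def two_stage_masks(n_src, n_tgt):
    """Return boolean masks (True = may attend) for the two stages.

    stage1: (n_src, n_src), source/instruction tokens only.
    stage2: (n_tgt, n_src + n_tgt), one row per target position.
    """
    # Stage 1: no target tokens exist in this stage, so none can be seen.
    stage1 = np.ones((n_src, n_src), dtype=bool)
    # Stage 2: every target row sees the full Stage 1 output ...
    stage2 = np.zeros((n_tgt, n_src + n_tgt), dtype=bool)
    stage2[:, :n_src] = True
    # ... and only the current and earlier target positions (causal).
    stage2[:, n_src:] = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))
    return stage1, stage2


if __name__ == "__main__":
    s1, s2 = two_stage_masks(n_src=3, n_tgt=2)
    print(s1.astype(int))
    print(s2.astype(int))
```

Because the Stage 1 mask has no target columns at all, its output depends only on the source and the instruction, which is what lets it behave like a pseudo-encoder.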

2.4 View Synthesis

In high-fidelity view synthesis, a TDO approach decomposes the workflow as:

- Stage 1: A decoder-only module processes observed views and geometric context, building a scene representation as a multi-layer KV cache (Sun et al., 28 May 2026).
- Stage 2: Another decoder-only pass, operating with identical weights, uses only camera geometry of the query to render a novel view by cross-attending to the cached scene representation.

Full parameter sharing and staged patch sizing provide flexibility and efficiency, and the resulting model outperforms encoder-decoder variants in both PSNR and speed (Sun et al., 28 May 2026).
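The cache-then-query pattern of the two stages can be illustrated with a single attention layer. This is a toy sketch, assuming one layer and random weights instead of a multi-layer model; `build_scene_cache`, `render`, and the token shapes are hypothetical and do not reproduce the DVSM architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical token width
# One set of projection weights, shared by both stages.
W_q, W_k, W_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))


def build_scene_cache(context_tokens):
    """Stage 1: project observed-view tokens into keys and values once."""
    return context_tokens @ W_k, context_tokens @ W_v


def render(query_tokens, cache):
    """Stage 2: query tokens attend only to the cached scene representation."""
    keys, values = cache
    scores = (query_tokens @ W_q) @ keys.T / np.sqrt(D)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ values


if __name__ == "__main__":
    cache = build_scene_cache(rng.normal(size=(20, D)))    # built once
    view_a = render(rng.normal(size=(4, D)), cache)        # reused per query
    view_b = render(rng.normal(size=(4, D)), cache)
    print(view_a.shape, view_b.shape)
```

Since each query view reads the cache without modifying it, the cost of processing the observed views is paid once and amortized over every rendered view.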

2.5 Channel Decoding

In two-stage soft/hard decoders for coded modulation:

- Stage 1: A soft-decoding module (turbo/SCCC or ADBP) decodes the subset of most vulnerable bits.
- Stage 2: Hard decoding is applied to the remaining bits, typically via high-rate algebraic codes.

Applying ADBP in the first stage reduces per-symbol computation, with only minimal capacity loss for an appropriate bit partitioning (Montorsi et al., 2017).

3. Mathematical Formalizations

The core mathematical principle underpinning TDO is that layered representations $\mathbf{H}_{\rm mid}$ capture all necessary context for subsequent output decoding. Key equations (representative, domain-specific):

LLMs (DMTD):
- $\mathbf{H}_{\rm mid}(\mathbf{x}_{<t}) = \mathrm{Layer}_{1:L_{\rm mid}}(\mathrm{Embed}(\mathbf{x}_{<t}))$
- $\mathbf{z}_t = \mathrm{LMHead}(\mathrm{Layer}_{L_{\rm mid}+1:L}(\mathbf{H}_{\rm mid}))$
- Multi-token DMTD: updates $\mathbf{H}_{\rm mid}$ only once per block, with late layers run per token (Luo et al., 13 Oct 2025).
ASR TDO:
- Prompt generation: $H^{T'} = \mathrm{Encoder}(X^T)$
- Masking and prompt extraction via CTC: non-blank collapsed sequence used as decoder prompt (Tsunoo et al., 2023).
YOCO TDO:
- Global cache: $\hat{K} = \mathrm{LN}(M)W_K;\; \hat{V} = \mathrm{LN}(M)W_V$
- Cross-decoder uses cross-attention against this cache for all tokens (Sun et al., 2024).
MT TDO:
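- Stage 1 output: $\mathbf{H}^{(1)} = \mathrm{Stage}_1(\mathbf{x}_{\rm src}, \ell_{\rm tgt})$, computed from the source tokens $\mathbf{x}_{\rm src}$ and the target-language instruction $\ell_{\rm tgt}$ with no target tokens visible.
- Prediction: $p(y_t \mid \mathbf{y}_{<t}, \mathbf{H}^{(1)}) = \mathrm{softmax}(\mathrm{Stage}_2(\mathbf{H}^{(1)}, \mathbf{y}_{<t}))$, optimized via cross-entropy and optional contrastive objectives (Qu et al., 2024).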
Channel Decoding:
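- Soft stage LLR: $\lambda_i = \log \frac{P(b_i = 0 \mid \mathbf{y})}{P(b_i = 1 \mid \mathbf{y})}$ for each soft-decoded bit $b_i$ given the received signal $\mathbf{y}$.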
- Complexity reductions via ADBP's D-message parameterization (Montorsi et al., 2017).
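As a concrete instance of the soft information the first channel-decoding stage works with, the sketch below computes log-likelihood ratios for the simplest textbook case. It assumes BPSK (bit 0 mapped to +1, bit 1 mapped to −1), equiprobable bits, and an AWGN channel with known noise variance, for which the LLR reduces to $2y/\sigma^2$. The cited work targets coded modulation, so this is a generic illustration rather than its scheme, and the function names are hypothetical.

```python
import numpy as np


def bpsk_llr(y, noise_var):
    """LLR = log P(b=0 | y) / P(b=1 | y) for BPSK over AWGN (0 -> +1, 1 -> -1)."""
    return 2.0 * np.asarray(y, dtype=float) / noise_var


def hard_decision(llr):
    """A negative LLR favours bit 1; otherwise decide bit 0."""
    return (np.asarray(llr) < 0).astype(int)


if __name__ == "__main__":
    received = [0.9, -1.2, 0.1]
    llr = bpsk_llr(received, noise_var=0.5)
    print(llr, hard_decision(llr))
```

The LLR magnitude expresses reliability: the third sample (0.1) yields a small LLR, which is the kind of weakly determined bit a soft stage is meant to protect, whereas a hard stage keeps only the sign.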

4. Empirical Performance and Resource Advantages

TDO architectures yield concrete empirical benefits:

| Domain | Key TDO performance gains | Reference |
| --- | --- | --- |
| ASR | 1.9% absolute WER reduction on LibriSpeech test-clean vs. CTC; 2× speed improvement | (Tsunoo et al., 2023) |
| View synthesis | +1.5 dB PSNR over encoder-decoder; 5–20× render FPS | (Sun et al., 28 May 2026) |
| LLMs (YOCO) | 9.4× memory and 30× prefill-latency improvement at 1M context | (Sun et al., 2024) |
| MT (zero-shot) | +3.4 BLEU, +6.99 chrF++ vs. encoder-decoder/vanilla decoder-only | (Qu et al., 2024) |
| Channel decoding | 5–10× lower computational cost vs. BICM at similar BER | (Montorsi et al., 2017) |

Resource savings stem from reduced redundant computation (by decoupling context building from output decoding), constant-memory caching (YOCO, view synthesis), and sparse intermediate representations (ASR/CTC collapse). However, some efficiency gains require careful tuning of depth split, block size, or capacity, and there are cases where monolithic architectures may still perform better at extreme scales (Luo et al., 13 Oct 2025, Tsunoo et al., 2023).

5. Practical Design Considerations and Limitations

Designers of TDO systems must address:

- Split-point and stage depth: Optimal partitioning between stages can be data- or task-dependent and is often determined by validation ablations (e.g., Stage 1 depth in MNMT, late layer count in DMTD) (Qu et al., 2024, Luo et al., 13 Oct 2025).
- Stage adaptation: Mitigating representational mismatch between stages via adapters or post-projection FFNs is sometimes necessary for stable learning (Qu et al., 2024).
- Training regimes: TDO architectures can simplify unpaired or semi-supervised training (ASR, MT) by reusing the second stage as a standard LM trained on text-only data (Tsunoo et al., 2023, Qu et al., 2024).
- Caching and memory: Specialized attention mechanisms and cache strategies may complicate deployment, but can provide order-of-magnitude improvements with sufficient engineering (Sun et al., 2024, Luo et al., 13 Oct 2025).

Open issues include scalability to extreme model sizes, latency for streaming or low-latency scenarios, and flexible extension to extremely heterogeneous input/output domains (Tsunoo et al., 2023, Qu et al., 2024).

6. Research Directions and Generalizations

TDO architectures have expanded beyond language and sequence modeling to encompass:

- High-throughput text generation (via DMTD and cache-partitioned decoders) (Luo et al., 13 Oct 2025, Sun et al., 2024).
- Multimodal and cross-modal alignment (ASR, vision, MNMT) (Tsunoo et al., 2023, Sun et al., 28 May 2026, Qu et al., 2024).
- Efficient communications systems (soft/hard channel decoders) (Montorsi et al., 2017).

Future directions include scaling up to LLM-scale multilingual tasks (Qu et al., 2024), refining block-wise or cyclical updates for extremely long context, extending staged parameter sharing and decomposition to more complex tasks, and further theoretical analysis of depthwise functional specialization (Luo et al., 13 Oct 2025, Sun et al., 28 May 2026). Streaming and online scenarios, as well as mixture-of-experts or adaptive-depth decoders, remain largely open for TDO-style architecture exploration.


References (6)

Direct Multi-Token Decoding (2025)

You Only Cache Once: Decoder-Decoder Architectures for Language Models (2024)

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation (2023)

DVSM: Decoder-only View Synthesis Model Done Right (2026)

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation (2024)

Low Complexity Two-Stage Soft/Hard Decoders (2017)
