
TiDAR Neural Network Architecture

Updated 13 November 2025
  • TiDAR is a hybrid neural network architecture that integrates diffusion drafting and autoregressive token verification in a single inference cycle.
  • It employs structured attention masks to enable blockwise parallel drafting and sequential validation, optimizing GPU utilization and KV-cache support.
  • TiDAR achieves a 4.71×–5.91× speedup over traditional AR models while maintaining comparable text quality, paving the way for efficient large-scale serving.

TiDAR (Think in Diffusion, Talk in Autoregression) is a sequence-level hybrid neural network architecture that achieves high-throughput generation with quality competitive with autoregressive (AR) LLMs by combining parallel "diffusion drafting" and sequential "autoregressive sampling" within a single forward pass and model instance. TiDAR departs from existing decoding-acceleration methods by fusing blockwise diffusion-style drafting and AR token verification via structured attention masks, enabling efficient utilization of GPU resources and KV-cache support, and making it serving-friendly without auxiliary drafter modules or multi-stage inference (Liu et al., 12 Nov 2025).

1. Two-Phase Sequence Generation

TiDAR employs a two-phase generation scheme, diffusion drafting followed by autoregressive sampling, within a unified inference cycle. Given a causal prefix $x_{<t}$, $k$ mask tokens are appended and predicted in parallel under a block-bidirectional attention mask, producing a "draft" of the next $k$ tokens:

$$\{\hat x_{t+1}, \dots, \hat x_{t+k}\} \sim \prod_{i=1}^{k} p_\theta\left(x_{t+i} \mid [x_{<t};\ \underbrace{\text{mask}, \dots, \text{mask}}_{k}]\right).$$

Concurrently, the AR logits for these positions are computed:

$$\forall\, i \in \{1, \dots, k\}: \quad \text{logits}^{\text{AR}}_{t+i} = F_\theta([x_{<t};\ \hat x_{t+1:t+i-1}]).$$

Final tokens are obtained by rejection sampling from the chain-factorized AR distribution; accepted tokens are appended to the prefix, and the next block is pre-drafted in parallel. This approach avoids the sequential bottleneck of AR models while maintaining sample quality.
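The two phases can be sketched end to end with a toy stand-in for the model. Everything here is an illustrative assumption rather than the paper's implementation: `diffusion_draft` and `ar_logits` return random but deterministic values, the vocabulary size is tiny, and verification uses plain argmax agreement.

```python
import numpy as np

VOCAB, K = 8, 4  # toy vocabulary size and draft block size (illustrative)

def diffusion_draft(prefix, k):
    """Phase 1: predict k masked positions in parallel from the prefix.
    Toy stand-in: deterministic pseudo-random per-position marginals."""
    g = np.random.default_rng(len(prefix))
    return g.random((k, VOCAB))

def ar_logits(prefix, drafts):
    """Phase 2 (same forward pass in TiDAR): AR logits for each draft
    position, conditioned on the prefix plus earlier drafted tokens."""
    out, ctx = [], list(prefix)
    for d in drafts:
        g = np.random.default_rng(hash(tuple(ctx)) % 2**32)
        out.append(g.random(VOCAB))
        ctx.append(int(d))
    return np.stack(out)

def one_cycle(prefix):
    marg = diffusion_draft(prefix, K)
    drafts = marg.argmax(-1)              # parallel draft of the next K tokens
    ar = ar_logits(prefix, drafts)
    accepted = []
    for i in range(K):
        if drafts[i] == ar[i].argmax():   # AR verification: draft accepted
            accepted.append(int(drafts[i]))
        else:                             # first rejection: AR fallback, stop
            accepted.append(int(ar[i].argmax()))
            break
    return prefix + accepted

seq = one_cycle([1, 2, 3])  # prefix grows by 1..K tokens per cycle
```

Each cycle thus costs one forward pass but can emit up to $k$ tokens, which is the source of the speedup discussed below.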

2. Structured Attention Mask Construction

The TiDAR model utilizes a bespoke attention mask $M \in \{0,1\}^{(t+k)\times(t+k)}$ supporting:

  • Causal AR attention: Lower-triangular for prefix tokens.
  • Block-bidirectional attention: Full within the draft/mask block.
  • Cross-segment attention: Masked (draft) positions attend to the full prefix.

Formally, for a sequence $[x_1, \dots, x_t,\ d_1, \dots, d_k]$ (with $d_i$ the draft/mask positions), mask entries are:

$$M_{i,j} = \begin{cases} 1, & j \le t,\ i \ge j \quad \text{(prefix causal)} \\ 1, & j > t,\ i > t \quad \text{(block bidirectional)} \\ 1, & j \le t < i \quad \text{(drafts see prefix)} \\ 0, & \text{otherwise.} \end{cases}$$

This enables simultaneous drafting (blockwise, parallel) and verification (causal, sequential) in a single self-attention computation, with accepted drafts and remaining mask tokens jointly encoded as an active segment. At inference, segment reordering and pre-initialized masks enable efficient realization of this masking with minimal overhead.
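A minimal construction of this mask, assuming 0-indexed positions and boolean entries (an illustrative sketch, not the paper's code):

```python
import numpy as np

def tidar_mask(t, k):
    """Structured attention mask for a prefix of length t plus k draft slots:
    causal over the prefix, fully bidirectional within the draft block,
    and draft positions attend to the whole prefix."""
    n = t + k
    M = np.zeros((n, n), dtype=bool)
    for i in range(n):
        M[i, : min(i + 1, t)] = True  # prefix causal: attend to prefix cols <= own row
    M[t:, t:] = True                  # block-bidirectional among draft slots
    M[t:, :t] = True                  # drafts see the full prefix
    return M

M = tidar_mask(3, 2)  # 3-token prefix, 2 draft slots -> 5x5 boolean mask
```

Prefix rows stay strictly causal while the trailing $k$ rows are dense, which is what lets one self-attention call serve both drafting and verification.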

3. Diffusion Drafting: Denoising Process and Objective

TiDAR implements a one-step denoising process for parallel token drafting. During training, all suffix tokens in the current window are replaced with $\text{mask}$, and the loss sums AR cross-entropy over the prefix with diffusion cross-entropy over the masked suffix:

$$\mathcal{L} = \frac{1}{1+\alpha}\left[\alpha \sum_{i=1}^{S-1} -\log p_\theta(x_{i+1} \mid x_{\le i}) + \sum_{i=1}^{S-1} -\log p_\theta(x_i \mid [x_{<i};\ \text{mask}])\right], \qquad \alpha = 1.$$

Inference proceeds by maximizing (or sampling from) the marginal probability at each of the $k$ drafted positions, i.e.,

$$\{\hat x_{t+1}, \dots, \hat x_{t+k}\} = \arg\max_{x_{t+1:t+k}} \prod_{i=1}^{k} p_\theta(x_{t+i} \mid [x_{<t};\ \text{mask}^k]).$$

No stochastic masking schedule is used; the “full-mask” strategy maintains implementation simplicity and latency.
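The combined objective reduces to a weighted sum of per-token negative log-likelihoods. A minimal sketch with toy NLL values (the arrays are illustrative placeholders, not real model outputs):

```python
import numpy as np

def combined_loss(ar_nll, diff_nll, alpha=1.0):
    """Weighted sum of AR and diffusion cross-entropies, normalized by
    1 + alpha, matching the training objective above (alpha = 1)."""
    return (alpha * np.sum(ar_nll) + np.sum(diff_nll)) / (1.0 + alpha)

# toy per-token negative log-likelihoods for a window of S - 1 = 4 positions
ar_nll = np.array([0.2, 0.4, 0.1, 0.3])    # AR head, sums to 1.0
diff_nll = np.array([0.5, 0.6, 0.2, 0.7])  # diffusion head, sums to 2.0
loss = combined_loss(ar_nll, diff_nll)     # (1.0 + 2.0) / 2 = 1.5
```

With $\alpha = 1$ the two heads are weighted equally, so neither the AR verifier nor the diffusion drafter dominates training.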

4. Autoregressive Token Verification and Cache Management

Post-drafting, TiDAR maintains an exact key/value (KV) cache for all causally attended tokens, comprising the prefix and any drafts accepted in prior steps. Each drafted token is verified by rejection sampling: a proposal $\hat x_{t+i}$ from the diffusion marginal is accepted if it matches the argmax of the interpolated logits,

$$\hat x_{t+i} = \arg\max_x \left\{ \beta\,\text{logits}^{\text{AR}}_{t+i}(x) + (1 - \beta)\,\text{logits}^{\text{Diff}}_{t+i}(x) \right\},$$

in which case it is appended to the prefix; otherwise the AR top prediction is used as a fallback. KV pairs for accepted drafts are retained; those of rejected drafts are dropped. The effective sampling distribution interpolates the AR and diffusion posteriors:

$$\pi(x_{t+i} \mid x_{<t+i-1}) \propto \exp\left[\beta \log p_\theta^{\text{AR}}(x_{t+i} \mid x_{<t+i-1}) + (1 - \beta) \log p_\theta^{\text{Diff}}(x_{t+i} \mid \tilde x)\right].$$
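The accept/fallback rule can be sketched as follows; the logit values and the choice $\beta = 0.9$ are illustrative assumptions (the source does not specify the value of $\beta$ here):

```python
import numpy as np

def verify(proposal, ar_logits, diff_logits, beta=0.9):
    """Accept a drafted token iff it matches the argmax of the interpolated
    logits beta*AR + (1-beta)*Diff; otherwise fall back to the AR argmax.
    Returns (token, accepted_flag)."""
    mixed = beta * ar_logits + (1 - beta) * diff_logits
    if proposal == int(np.argmax(mixed)):
        return proposal, True
    return int(np.argmax(ar_logits)), False

ar = np.array([0.1, 2.0, 0.3])    # toy AR logits: token 1 dominates
diff = np.array([3.0, 0.2, 0.1])  # toy diffusion logits: token 0 dominates
tok, ok = verify(0, ar, diff)     # at beta=0.9 the AR head wins: reject 0, emit 1
```

A larger $\beta$ pulls the accepted distribution toward the AR posterior (favoring quality), while a smaller $\beta$ trusts the parallel drafts more (favoring acceptance rate).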

5. Unified Forward Pass: Decoding Loop and Efficiency

TiDAR’s principal innovation is a single-pass decoding loop, with no additional drafter model or stages. The transformer simultaneously produces causal (AR) and block-bidirectional (Diff) logits:

prefix = prompt
KV_cache = encode(prefix, causal=True)
drafts = [mask] * k  # initial draft block
while not finished:
    logits_AR, logits_Diff = Transformer(prefix + drafts, mask=M)
    accepted = []
    for i in 1..k:
        proposal = sample_from(logits_Diff[i])
        if proposal == argmax(beta * logits_AR[i] + (1 - beta) * logits_Diff[i]):
            accepted.append(proposal)              # draft verified
            KV_cache.append_kv(proposal)
        else:
            accepted.append(argmax(logits_AR[i]))  # AR fallback
            KV_cache.append_kv(accepted[-1])
            break  # later drafts were conditioned on a rejected token
    prefix += accepted
    drafts = one_step_pre_draft([mask] * k)  # pre-draft the next block

No retracing or recomputation is required, and all verification is done with outputs from the existing forward computation. This structure leverages available “free token slots” on GPUs, increasing utilization.

6. Throughput, Quality, and Comparative Performance

On the NVIDIA H100 (batch=1), TiDAR-1.5B (Qwen2.5 backbone) achieves 7.45 T/NFE and a 4.71× tokens/sec speedup versus the autoregressive baseline (≈1.58 tokens/forward). TiDAR-8B (Qwen3) attains 8.25 T/NFE and a 5.91× speedup (≈1.40 tokens/forward). Comparative results are summarized as follows:
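The reported figures are mutually consistent if the parenthesized numbers are read as the relative cost of one TiDAR forward pass versus one AR step (an interpretation, not stated explicitly above, motivated by a TiDAR forward attending over extra draft positions). A quick arithmetic check under that assumption:

```python
# Hedged consistency check: wall-clock speedup = (tokens per forward)
# / (relative forward cost), assuming the parenthesized figures are
# per-forward cost ratios rather than a second tokens/forward number.
speedup_1p5b = 7.45 / 1.58  # matches the reported 4.71x
speedup_8b = 8.25 / 1.40    # close to the reported 5.91x
```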

| Model & Method       | Speedup vs. AR | Quality       |
|----------------------|----------------|---------------|
| AR-base              | 1× (reference) | Reference     |
| Speculative decoding | 2–3×           | Lower than AR |
| Block Diffusion      | 3–4×           | Lower than AR |
| Dream-7B, Llada-8B   | 2–3×           | Sub-AR        |
| TiDAR-1.5B, 8B       | 4.71–5.91×     | ≈ AR          |

The quality–throughput Pareto plot positions TiDAR near the upper right corner, indicating simultaneous efficiency and quality. Notably, TiDAR is the first architecture documented to close the quality gap with AR models while offering 4.71×–5.91× tokens/sec improvement.

7. Hyperparameters, Implementation, and Limitations

Key hyperparameters include block size $k \in \{4, 8, 16\}$ (higher $k$ gives greater parallelism at a minor quality cost), a one-step diffusion denoising strategy, and loss balancer $\alpha = 1$. The large attention mask is pre-initialized with shape $(\text{max\_seq} + k(1+k)) \times (\text{max\_seq} + k(1+k))$ and sliced on demand. An exact KV cache is maintained only for AR tokens, and no retracing is necessary. The architecture uses bfloat16 precision, a 4096-token maximum length, the Adam optimizer with a cosine learning-rate schedule, and continual pretraining on 50B–150B tokens.

Limitations include the increased training sequence length caused by appending mask tokens, non-adaptive block sizes, and context support limited to 4K tokens. Prospective research directions include optimizing training routines, extending long-context support, custom attention scheduling, and adaptive or multi-step drafting.

8. Significance and Outlook

TiDAR establishes that sequence-level hybrid design enables both parallel generation efficiency and AR-level quality within a standalone model instance, without auxiliary drafters or extra inference stages. By exploiting structured attention, flexible segment reordering, and rejection-based AR verification, TiDAR leverages underlying hardware for maximal performance. This suggests promising directions for future sequence models seeking to reconcile high-throughput decoding and sample fidelity, with practical advantages for large-scale, low-latency serving (Liu et al., 12 Nov 2025).
