Proxy Token Synthesis in Diffusion Transformers
- Proxy token synthesis is a computational strategy that partitions latent feature maps into local windows to compute representative tokens for global semantic modeling.
- It employs global proxy self-attention followed by cross-attention injection and a texture complement module to reduce quadratic complexity while preserving fine details.
- Empirical evaluations demonstrate significant GFLOPs reductions and competitive generative performance for high-resolution images and long-sequence video tasks.
Proxy token synthesis is a computational strategy for transformer-based diffusion models that efficiently captures global semantic information while drastically reducing the cost associated with self-attention operations. Deployed in the Proxy-Tokenized Diffusion Transformer (PT-DiT), this mechanism partitions latent feature maps into local windows and computes representative tokens as window averages, facilitating a compact yet expressive global modeling scheme (Wang et al., 2024).
1. Mathematical Formulation and Workflow
Proxy-token synthesis begins with a high-dimensional latent feature map $z_s \in \mathbb{R}^{f \times h \times w \times D}$, with $f$ frames, spatial dimensions $h$ and $w$, and feature dimension $D$. The map is partitioned into non-overlapping windows of size $(p_t, p_h, p_w)$ in time, height, and width. The $i$-th window encompasses $p_t p_h p_w$ tokens.
The $i$-th proxy token, $p_i$, is defined as the window mean:

$$p_i = \frac{1}{p_t p_h p_w} \sum_{j \in \Omega_i} z_j,$$

where $\Omega_i$ denotes the index set of tokens in the $i$-th window. The aggregated proxies $P_a = \{p_i\}_{i=1}^{M} \in \mathbb{R}^{M \times D}$, with $M = \frac{f}{p_t} \cdot \frac{h}{p_h} \cdot \frac{w}{p_w}$, encode the average features for all windows.
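To make the window-averaging step concrete, here is a minimal PyTorch sketch (our illustration; function and variable names are assumptions, not the paper's code):

```python
import torch

def proxy_tokens(z: torch.Tensor, p_t: int, p_h: int, p_w: int) -> torch.Tensor:
    """Average non-overlapping (p_t, p_h, p_w) windows of z in R^{f×h×w×D},
    returning proxies P_a in R^{M×D} with M = (f/p_t)(h/p_h)(w/p_w)."""
    f, h, w, D = z.shape
    assert f % p_t == 0 and h % p_h == 0 and w % p_w == 0
    # Split each axis into (num_windows, window_size), then average per window.
    z = z.view(f // p_t, p_t, h // p_h, p_h, w // p_w, p_w, D)
    return z.mean(dim=(1, 3, 5)).reshape(-1, D)

z = torch.randn(4, 32, 32, 64)               # toy latent: 4 frames, 32x32, D=64
P_a = proxy_tokens(z, p_t=1, p_h=2, p_w=2)   # M = 4*16*16 = 1024 proxies
```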
Subsequent processing involves:
- Global proxy self-attention: Multi-head self-attention is applied to $P_a$, yielding global features $G = \mathrm{SA}(P_a)$.
- Global injection via cross-attention: The global semantics are broadcast to all latent tokens by cross-attention, $z_s \leftarrow \mathrm{CA}(z_s, G)$, with queries from the latent tokens and keys/values from $G$, injecting aggregated context back into the detailed latent grid.
- Texture Complement Module (TCM): Fine detail is restored using window self-attention (WSA) and shifted-window self-attention (SWSA) following the structure of the Swin Transformer.
The core PT-DiT block executes these components in the following pseudo-code:
```
Input: z_s ∈ R^{N×D}
1. Reshape z_s → R^{f×h×w×D}
2. P_a ← Averaging(z_s; window size = (p_t, p_h, p_w))   # P_a ∈ R^{M×D}
3. G   ← SA(P_a)                      # global proxy self-attention
4. z_s ← CA(z_s, G)                   # cross-attention injection
5. Reshape z_s → R^{(N/(p_t·p_h·p_w)) × (p_t·p_h·p_w) × D}
6. ẑ   ← WSA(z_s) + z_s               # window attention
7. z_w ← SWSA(ẑ) + ẑ                  # shifted-window attention
8. Reshape z_w → R^{N×D} → Text-CrossAttn → MLP → output
```
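For readers who prefer executable code, the following self-contained PyTorch sketch reconstructs steps 1-7 of the block above (the text cross-attention and MLP tail of step 8 is elided). It reuses the `proxy_tokens` helper from Section 1; the module structure and hyperparameters are our assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

def window_partition(z, p_t, p_h, p_w):
    """(f, h, w, D) grid -> (num_windows, p_t*p_h*p_w, D) window batches."""
    f, h, w, D = z.shape
    z = z.view(f // p_t, p_t, h // p_h, p_h, w // p_w, p_w, D)
    return z.permute(0, 2, 4, 1, 3, 5, 6).contiguous().view(-1, p_t * p_h * p_w, D)

def window_unpartition(zw, f, h, w, p_t, p_h, p_w):
    """Inverse of window_partition, back to the (f, h, w, D) grid."""
    D = zw.shape[-1]
    zw = zw.view(f // p_t, h // p_h, w // p_w, p_t, p_h, p_w, D)
    return zw.permute(0, 3, 1, 4, 2, 5, 6).contiguous().view(f, h, w, D)

class PTDiTBlock(nn.Module):
    def __init__(self, dim=64, heads=4, p_t=1, p_h=2, p_w=2):
        super().__init__()
        self.p = (p_t, p_h, p_w)
        make = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proxy_sa, self.inject_ca = make(), make()
        self.wsa, self.swsa = make(), make()

    def forward(self, z):                                # z: (f, h, w, D)
        f, h, w, D = z.shape
        p_t, p_h, p_w = self.p
        # Steps 2-3: proxy synthesis + global proxy self-attention.
        P = proxy_tokens(z, p_t, p_h, p_w).unsqueeze(0)  # (1, M, D)
        G, _ = self.proxy_sa(P, P, P)
        # Step 4: cross-attention injection (queries: all N latents; KV: G).
        zs = z.reshape(1, -1, D)
        zs = zs + self.inject_ca(zs, G, G)[0]
        z = zs.view(f, h, w, D)
        # Steps 5-6: window self-attention with residual connection.
        zw = window_partition(z, p_t, p_h, p_w)
        zw = zw + self.wsa(zw, zw, zw)[0]
        z = window_unpartition(zw, f, h, w, p_t, p_h, p_w)
        # Step 7: shifted-window self-attention (roll, attend, roll back).
        s = [-(p_t // 2), -(p_h // 2), -(p_w // 2)]
        z = torch.roll(z, shifts=s, dims=(0, 1, 2))
        zw = window_partition(z, p_t, p_h, p_w)
        zw = zw + self.swsa(zw, zw, zw)[0]
        z = window_unpartition(zw, f, h, w, p_t, p_h, p_w)
        return torch.roll(z, shifts=[-v for v in s], dims=(0, 1, 2))

out = PTDiTBlock()(torch.randn(4, 32, 32, 64))           # toy forward pass
```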
2. Compression Ratios and Scaling Behavior
The proxy-token mechanism provides a parameterizable reduction in token count:
- The token count shrinks from $N = fhw$ latent tokens to $M = N/(p_t p_h p_w)$ proxies, a compression factor of $p_t p_h p_w$.
- For images, $p_t = 1$, so the compression is purely spatial: $M = N/(p_h p_w)$.
- Higher resolutions (512, 1024, 2048) use progressively larger $(p_h, p_w)$, so the number of proxies grows far more slowly than the number of latent tokens.
Video processing typically fixes the temporal window $p_t$ and proportionally adjusts spatial compression, allowing scale-adaptive resource allocation as resolution increases (Wang et al., 2024).
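As a back-of-envelope check of these ratios, the snippet below prints proxy counts for a toy video latent; the window sizes are illustrative assumptions, not the paper's published settings:

```python
f, h, w = 16, 64, 64                     # toy latent: frames and spatial size
N = f * h * w                            # 65,536 latent tokens
for p_t, p_h, p_w in [(1, 2, 2), (1, 4, 4), (4, 4, 4)]:
    M = N // (p_t * p_h * p_w)           # proxies after window averaging
    print(f"window ({p_t},{p_h},{p_w}): {N} tokens -> {M} proxies "
          f"({N // M}x fewer)")
```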
3. Computational Complexity Analysis
Proxy-token synthesis achieves significant cost savings by reducing the quadratic self-attention dependency:
- Standard global self-attention: $\mathcal{O}(N^2 D)$ over all $N$ latent tokens.
- PT-DiT with proxy tokens and TCM: $\mathcal{O}(M^2 D + N M D + N\, p_t p_h p_w\, D)$, covering proxy self-attention, cross-attention injection, and the (shifted-)window attention, with $M = N/(p_t p_h p_w) \ll N$.
- At image sizes $256/512/1024/2048$, the proxy blocks contribute a progressively smaller relative cost versus full self-attention as resolution grows.
Empirically, GFLOPs are reduced substantially at ImageNet scale (vs. DiT) and against PixArt-family models on $2048$-resolution images. For video generation, PT-DiT/H uses only a fraction of the GFLOPs of CogVideoX-2B or EasyAnimateV4, with lower memory overhead as the frame count grows (Wang et al., 2024). This suggests the mechanism is especially effective for very high-resolution and long-sequence settings.
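The formulas above can be turned into a rough cost comparison; the sketch below counts only the QKᵀ and attention-value multiply-accumulates (projections and MLPs ignored), with illustrative sizes of our choosing:

```python
def full_attn_cost(N, D):
    return 2 * N * N * D                 # O(N^2 D): global self-attention

def pt_dit_cost(N, D, p):                # p = p_t * p_h * p_w
    M = N // p                           # number of proxy tokens
    proxy = 2 * M * M * D                # proxy self-attention
    inject = 2 * N * M * D               # cross-attention injection
    tcm = 2 * 2 * N * p * D              # WSA + SWSA over size-p windows
    return proxy + inject + tcm

N, D, p = 64 * 64, 1152, 16              # toy 64x64 latent grid
print(full_attn_cost(N, D) / pt_dit_cost(N, D, p))   # ~13x cheaper attention
```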
4. Ablation Studies and Empirical Evaluation
Ablation experiments (reported as FID-50K, with the full model at $19.30$) show that each component of proxy-token synthesis contributes measurably to generative performance:
- Without the Global Injection and Interaction Module (GIIM): FID-50K degrades noticeably.
- Without the Texture Complement Module (TCM): $69.07$
- Dropping SWSA: $23.59$
Comparison of proxy extraction methods (sketched in code after the lists below):
- Window averaging (canonical): $19.30$
- Top-Left token only: $20.84$
- Random token: $21.00$
Global injection schemes:
- Cross-attention (canonical): $19.30$
- Linear projection: $20.24$
- Spatial interpolation: $21.82$
Compression ratio at $256$:
- Canonical setting: $19.30$
- Alternative compression settings: $21.24$, $20.43$
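To make the compared proxy-extraction variants concrete, here is a toy sketch of the three strategies applied to a single window (our illustration, not the evaluation code):

```python
import torch

window = torch.randn(8, 64)              # one window: 8 tokens, D = 64
avg_proxy = window.mean(dim=0)           # canonical: window averaging
topleft_proxy = window[0]                # top-left token only
rand_proxy = window[torch.randint(len(window), (1,)).item()]  # random token
```

Averaging is the only variant that aggregates information from every token in the window, which is consistent with its lower FID in the ablations above.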
On zero-shot MS-COCO FID-30K at $256$ resolution, Qihoo-T2I is competitive with state-of-the-art approaches at substantially lower computational expense. Text-to-video benchmarks (UCF-101 / MSR-VTT) confirm parity or superiority over previous DiT-based architectures (Wang et al., 2024).
5. Mechanistic Rationale and Semantic Function
The proxy-token synthesis paradigm is predicated on the empirical observation that attention maps in local spatial windows are highly correlated, and global modeling need not consume all tokens. Averaged window representations minimize redundancy, enabling self-attention among a reduced set of proxies to efficiently encode global semantic context.
Cross-attention subsequently propagates this context to the full latent set, retaining expressivity without incurring quadratic costs. The window and shifted-window refinements (Texture Complement Module) enforce local detail and avoid discretization artifacts, such as grid patterns (Wang et al., 2024).
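A toy experiment illustrates the redundancy premise: for smoothly varying features, adjacent tokens produce nearly identical attention rows (an assumed setup, purely illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(16, 32).cumsum(dim=0)    # smooth 16-token sequence, D = 32
attn = torch.softmax(x @ x.T / 32**0.5, dim=-1)         # (16, 16) attention map
sim = F.cosine_similarity(attn[:-1], attn[1:], dim=-1)  # adjacent row pairs
print(sim.mean())                        # close to 1.0 for smooth inputs
```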
A plausible implication is that proxy-token synthesis generalizes well to other domains where spatial or temporal redundancy is endemic, provided local window structure is preserved.
6. Comparison with Baseline and Related Architectures
Quantitative and qualitative analyses indicate that PT-DiT, leveraging proxy-token synthesis, matches or outperforms conventional global-attention-based DiTs in both image and video generation, while drastically reducing hardware requirements and training/inference complexity.
In direct comparisons:
- DiT: Baseline full global self-attention; quadratic cost.
- PixArt-family models: Competing transformers utilizing global attention; higher GFLOPs.
- PT-DiT: Proxy-tokenized global injection; substantial GFLOPs reductions for both images and video.
These results suggest the proxy-token synthesis strategy is robust and adaptable across a range of generative tasks, including text-to-image (T2I), text-to-video (T2V), and text-to-multi-view (T2MV) settings (Wang et al., 2024).