Proxy Token Synthesis in Diffusion Transformers

Updated 31 December 2025
  • Proxy token synthesis is a computational strategy that partitions latent feature maps into local windows to compute representative tokens for global semantic modeling.
  • It employs global proxy self-attention followed by cross-attention injection and a texture complement module to reduce quadratic complexity while preserving fine details.
  • Empirical evaluations demonstrate significant GFLOPs reductions and competitive generative performance for high-resolution images and long-sequence video tasks.

Proxy token synthesis is a computational strategy for transformer-based diffusion models that efficiently captures global semantic information while drastically reducing the cost associated with self-attention operations. Deployed in the Proxy-Tokenized Diffusion Transformer (PT-DiT), this mechanism partitions latent feature maps into local windows and computes representative tokens as window averages, facilitating a compact yet expressive global modeling scheme (Wang et al., 2024).

1. Mathematical Formulation and Workflow

Proxy-token synthesis begins with a high-dimensional latent feature map $z_s \in \mathbb{R}^{f \times h \times w \times D}$, with $f$ frames, spatial dimensions $h$ and $w$, and feature dimension $D$. The map is partitioned into non-overlapping windows of size $(p_t, p_h, p_w)$ in time, height, and width. The $n$th window $W_n$ encompasses $|W_n| = p_t p_h p_w$ tokens.

The $n$th proxy token, $P_{a,n}$, is defined as:

$$P_{a,n} = \frac{1}{p_t p_h p_w} \sum_{i \in W_n} z_{s,i} \in \mathbb{R}^D, \quad n = 1, \dots, M,$$

where

$$M = \frac{f}{p_t} \cdot \frac{h}{p_h} \cdot \frac{w}{p_w}, \quad N = fhw, \quad \frac{M}{N} = \frac{1}{p_t p_h p_w}.$$

The aggregated proxies $P_a \in \mathbb{R}^{M \times D}$ encode the average features of all windows.
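
The window-averaging step can be sketched directly with a reshape-and-mean, a minimal illustration assuming numpy arrays stand in for the framework tensors (the function name `proxy_tokens` is hypothetical, not from the paper):

```python
import numpy as np

def proxy_tokens(z_s: np.ndarray, p_t: int, p_h: int, p_w: int) -> np.ndarray:
    """Average non-overlapping (p_t, p_h, p_w) windows of a latent map.

    z_s: latent feature map of shape (f, h, w, D); f, h, w must be
    divisible by p_t, p_h, p_w respectively.
    Returns P_a of shape (M, D) with M = (f/p_t) * (h/p_h) * (w/p_w).
    """
    f, h, w, D = z_s.shape
    assert f % p_t == 0 and h % p_h == 0 and w % p_w == 0
    # Split each axis into (num_windows, window_size) and average over the
    # window axes -- exactly the mean over each window W_n.
    z = z_s.reshape(f // p_t, p_t, h // p_h, p_h, w // p_w, p_w, D)
    P_a = z.mean(axis=(1, 3, 5))             # (f/p_t, h/p_h, w/p_w, D)
    return P_a.reshape(-1, D)                # (M, D)

# Example: a 4x8x8 latent with D=16 and windows (1, 2, 2) gives M = N/4.
z_s = np.random.default_rng(0).normal(size=(4, 8, 8, 16))
P_a = proxy_tokens(z_s, 1, 2, 2)
print(P_a.shape)   # (64, 16): M = 4 * 4 * 4 = 64, while N = 256
```

The reshape-based grouping avoids explicit loops over windows and keeps the $M/N = 1/(p_t p_h p_w)$ ratio evident from the shapes alone.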

Subsequent processing involves:

  • Global proxy self-attention: Multi-head self-attention is applied to $P_a$, yielding $\mathrm{SA}(P_a) \in \mathbb{R}^{M \times D}$.
  • Global injection via cross-attention: The global semantics are broadcast to all latent tokens by cross-attention,

$$z_s \leftarrow \mathrm{CS}\big(z_s,\, \mathrm{SA}(P_a)\big),$$

injecting the aggregated context back into the detailed latent grid.

  • Texture Complement Module (TCM): Fine detail is restored using window self-attention (WSA) and shifted-window self-attention (SWSA) following the structure of the Swin Transformer.

The core PT-DiT block executes these components in the following pseudo-code:

Input: z_s ∈ R^{N×D}
1. Reshape z_s → R^{f×h×w×D}
2. P_a ← Averaging(z_s; window size = (p_t, p_h, p_w))   # P_a ∈ R^{M×D}
3. G ← SA(P_a)                                           # global proxy self-attn
4. z_s ← CS(z_s, G)                                      # cross-attn injection
5. Reshape z_s → [N/(p_t p_h p_w), (p_t p_h p_w), D]
6. ẑ ← WSA(z_s) + z_s                                    # window attention
7. z_w ← SWSA(ẑ) + ẑ                                     # shifted-window attention
8. Reshape z_w → R^{N×D}; TextCrossAttn → MLP → output
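
Steps 1-4 above can be sketched in numpy; this is a minimal single-head illustration assuming identity Q/K/V projections (the real PT-DiT uses learned multi-head projections, and the TCM and text cross-attention steps are omitted for brevity):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(q k^T / sqrt(D)) v."""
    D = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(D)) @ v

def pt_dit_global_stage(z_s: np.ndarray, p_t: int, p_h: int, p_w: int) -> np.ndarray:
    """Steps 1-4: proxy averaging, proxy self-attn, cross-attn injection."""
    f, h, w, D = z_s.shape
    # Step 2: window-average into proxies P_a (M x D).
    z = z_s.reshape(f // p_t, p_t, h // p_h, p_h, w // p_w, p_w, D)
    P_a = z.mean(axis=(1, 3, 5)).reshape(-1, D)
    # Step 3: global proxy self-attention G = SA(P_a), an M x M score matrix.
    G = attention(P_a, P_a, P_a)
    # Step 4: cross-attention injection CS(z_s, G): latent tokens query the
    # M proxies, so the score matrix is N x M rather than N x N.
    tokens = z_s.reshape(-1, D)                      # N x D
    injected = attention(tokens, G, G)
    return (tokens + injected).reshape(f, h, w, D)   # residual update

z_s = np.random.default_rng(1).normal(size=(4, 8, 8, 16))
out = pt_dit_global_stage(z_s, 1, 2, 2)
print(out.shape)   # (4, 8, 8, 16)
```

The key point of the sketch is the attention shapes: the only quadratic-in-$N$ matrix ($N \times N$) never appears; the largest score matrix is $N \times M$.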

2. Compression Ratios and Scaling Behavior

The proxy-token mechanism provides a parameterizable reduction in token count:

  • For $256 \times 256$ images: $(p_t, p_h, p_w) = (1, 2, 2)$ yields $M/N = 1/4$
  • For $512 \times 512$: $(1, 4, 4) \Rightarrow 1/16$
  • For $1024 \times 1024$: $(1, 8, 8) \Rightarrow 1/64$
  • For $2048 \times 2048$: $(1, 16, 16) \Rightarrow 1/256$

Video processing typically fixes the temporal window $p_t = 4$ and proportionally adjusts the spatial compression, allowing scale-adaptive resource allocation as resolution increases (Wang et al., 2024).
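
The schedule above can be checked directly; a minimal sketch (the function name `compression_ratio` is illustrative, not from the paper):

```python
def compression_ratio(p_t: int, p_h: int, p_w: int) -> float:
    """M/N = 1 / (p_t * p_h * p_w) for non-overlapping windows."""
    return 1.0 / (p_t * p_h * p_w)

# Window schedule per image resolution, as listed above.
schedule = {256: (1, 2, 2), 512: (1, 4, 4), 1024: (1, 8, 8), 2048: (1, 16, 16)}
for res, p in schedule.items():
    print(res, compression_ratio(*p))   # 0.25, 0.0625, 0.015625, 0.00390625
```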

3. Computational Complexity Analysis

Proxy-token synthesis achieves significant cost savings by reducing the quadratic self-attention dependency:

  • Standard global self-attention: $2N^2D$ FLOPs
  • PT-DiT with proxy tokens and TCM:

$$2\frac{N^2}{(p_t p_h p_w)^2}D + 2\frac{N^2}{p_t p_h p_w}D + 4\frac{N}{p_t p_h p_w}(p_t p_h p_w)^2 D$$

  • At image resolutions 256/512/1024/2048, proxy blocks contribute 34.3%/9.7%/4.7%/2.3% of the cost of full self-attention.

Empirically, GFLOPs are reduced by approximately 49% at ImageNet scale (vs. DiT) and by 34% vs. PixArt-$\alpha$ (2048-resolution images). For video at $512 \times 512 \times 48$, PT-DiT/H uses only 50% of the GFLOPs of CogVideoX-2B or EasyAnimateV4, with lower memory overhead as the frame count grows (Wang et al., 2024). This suggests the mechanism is especially effective in very high-resolution and long-sequence settings.
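
The cost expressions above translate into a short calculator; this is a sketch under the assumption that the TCM window volume equals the compression window volume $p = p_t p_h p_w$ (the exact percentages in the text also depend on model-specific constants, so only the trend is reproduced here):

```python
def global_attn_flops(N: int, D: int) -> float:
    """Full self-attention over N tokens: 2 * N^2 * D."""
    return 2 * N**2 * D

def pt_dit_flops(N: int, D: int, p: int) -> float:
    """Proxy self-attn + cross-attn injection + two window attentions."""
    return (2 * (N**2 / p**2) * D      # SA over M = N/p proxies
            + 2 * (N**2 / p) * D       # cross-attn: N queries x M keys
            + 4 * (N / p) * p**2 * D)  # WSA + SWSA over N/p windows

def relative_cost(N: int, p: int) -> float:
    """PT-DiT cost / full-attention cost = 1/p^2 + 1/p + 2p/N."""
    return pt_dit_flops(N, 1, p) / global_attn_flops(N, 1)

# The relative cost shrinks as the window volume p grows (fixed N):
print(round(relative_cost(1024, 4), 4))   # 0.3203
```

Because the dominant term is $1/p$, quadrupling the window volume at each resolution step keeps the proxy path's share of the budget falling even as $N$ grows.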

4. Ablation Studies and Empirical Evaluation

Ablation experiments reveal that proxy-token synthesis quantitatively and qualitatively preserves generative performance:

  • Without the Global Injection and Interaction Module (GIIM): FID-50K = 23.71
  • Without the Texture Complement Module (TCM): 69.07
  • Dropping SWSA: 23.59

Comparison of proxy extraction methods:

  • Window averaging (canonical): 19.30
  • Top-left token only: 20.84
  • Random token: 21.00

Global injection schemes:

  • Cross-attention (canonical): 19.30
  • Linear projection: 20.24
  • Spatial interpolation: 21.82

Compression ratio at 256 resolution:

  • $(1, 2, 2)$: 19.30
  • $(1, 4, 4)$: 21.24
  • $(1, 8, 8)$: 20.43

Zero-shot MS-COCO FID-30K @256: Qihoo-T2I = 15.70, demonstrating competitive generative quality with state-of-the-art approaches at substantially lower computational expense. Text-to-video benchmarks (UCF-101 / MSR-VTT) confirm parity or superiority relative to previous DiT-based architectures (Wang et al., 2024).

5. Mechanistic Rationale and Semantic Function

The proxy-token synthesis paradigm is predicated on the empirical observation that attention maps within local spatial windows are highly correlated, so global modeling need not operate over all tokens. Averaged window representations minimize redundancy, enabling self-attention among a reduced set of proxies to efficiently encode global semantic context.

Cross-attention subsequently propagates this context to the full latent set, retaining expressivity without incurring quadratic costs. The window and shifted-window refinements (Texture Complement Module) enforce local detail and avoid discretization artifacts, such as grid patterns (Wang et al., 2024).

A plausible implication is that proxy-token synthesis generalizes well to other domains where spatial or temporal redundancy is endemic, provided local window structure is preserved.

Quantitative and qualitative analyses indicate that PT-DiT, leveraging proxy-token synthesis, matches or outperforms conventional global-attention-based DiTs in both image and video generation, while drastically reducing hardware requirements and training/inference complexity.

In direct comparisons:

  • DiT: Baseline full global self-attention; quadratic cost.
  • PixArt-$\alpha$: Competing transformer utilizing global attention; higher GFLOPs.
  • PT-DiT: Proxy-tokenized global injection; 34-49% GFLOPs reduction for images, >50% in video.

These results suggest the proxy-token synthesis strategy is robust and adaptable across a range of generative tasks, including text-to-image (T2I), text-to-video (T2V), and text-to-multi-view (T2MV) settings (Wang et al., 2024).
