Cross-Resolution Phase-Aligned Attention
- The paper introduces CRPA, a modification that reindexes key positions to the query grid, restoring phase coherence and eliminating aliasing in mixed-resolution settings.
- CRPA adjusts rotary positional embeddings to use a consistent query stride, enabling high-fidelity image and video generation without retraining.
- Empirical results demonstrate superior performance on metrics like FID, CLIP-IQA, and VBench, with minimal computational overhead in diffusion transformers.
Cross-Resolution Phase-Aligned Attention (CRPA) is a training-free, drop-in modification for rotary positional embeddings (RoPE) in attention mechanisms, specifically designed to stabilize and enhance mixed-resolution denoising in Diffusion Transformers (DiTs). CRPA resolves structural attention collapse caused by phase misalignment when processing tokens from heterogeneous spatial grids, restoring phase coherence and enabling high-fidelity image and video generation without the need for retraining or architectural changes (Wu et al., 24 Nov 2025).
1. Structural Failure of RoPE under Mixed Resolution
The rotary positional embedding mechanism parameterizes position by applying block-diagonal rotations to token embeddings. For model dimension $d$, query and key vectors are decomposed into $d/2$ rotation pairs; pair $i$ uses a frequency $\omega_i = b^{-2i/d}$, $i = 0, \dots, d/2 - 1$ (base $b = 10000$ in standard RoPE):

$$R(p) = \mathrm{diag}\big(R_2(p\,\omega_0), \dots, R_2(p\,\omega_{d/2-1})\big), \qquad R_2(\phi) = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}.$$

Embedding a token at position $p$ gives $\tilde q = R(p_q)\,q$, $\tilde k = R(p_k)\,k$. The attention dot product thus encodes only the relative offset:

$$\langle R(p_q)\,q,\; R(p_k)\,k \rangle = \langle q,\; R(p_k - p_q)\,k \rangle.$$
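The relative-offset property can be checked numerically. Below is a minimal NumPy sketch of standard RoPE (the helper name `rope_rotate` is my own, not the paper's code): it rotates each 2-D pair of a vector and verifies that the query-key dot product depends only on the offset $p_k - p_q$, not on absolute positions.

```python
import numpy as np

def rope_rotate(v, p, base=10000.0):
    """Rotate each 2-D pair of v by angle p * omega_i (standard RoPE)."""
    d = v.shape[-1]
    omega = base ** (-np.arange(0, d, 2) / d)  # one frequency per rotation pair
    cos, sin = np.cos(p * omega), np.sin(p * omega)
    v2 = v.reshape(-1, 2)
    out = np.empty_like(v2)
    out[:, 0] = cos * v2[:, 0] - sin * v2[:, 1]
    out[:, 1] = sin * v2[:, 0] + cos * v2[:, 1]
    return out.reshape(d)

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative offset (4), different absolute positions: identical dot products.
a = rope_rotate(q, 5.0) @ rope_rotate(k, 9.0)
b = rope_rotate(q, 100.0) @ rope_rotate(k, 104.0)
print(np.allclose(a, b))  # True
```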
In mixed-resolution generation, where low-resolution (LR) and high-resolution (HR) regions have different physical strides $s_r$, naive linear interpolation remaps index coordinates piecewise-affinely:

$$p \mapsto a_r\, p + b_r$$

for each region $r$. This implies that the same physical distance $\Delta x$ induces region-dependent phase increments:

$$\Delta\phi_i^{(r)} = \omega_i\, a_r\, \frac{\Delta x}{s_r}.$$
Phase increments lose correspondence, producing "cross-rate aliasing": the sharp, multi-frequency phase kernels learned by attention heads become destabilized. Small stride discrepancies—especially at fine frequencies—produce severe blur, periodic artifacts, or attention collapse. This is particularly brittle in pretrained DiTs with highly selective heads (Wu et al., 24 Nov 2025).
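The mismatch is easy to see in a toy computation (strides and distance below are hypothetical values of my choosing, not from the paper): the same physical distance accumulates different phases in regions with different strides.

```python
import numpy as np

d = 64
omega = 10000.0 ** (-np.arange(0, d, 2) / d)  # RoPE frequencies, one per pair

delta_x = 8.0          # one fixed physical distance (hypothetical units)
s_lr, s_hr = 2.0, 1.0  # per-token strides of the LR and HR regions

# Index offsets differ per region, so phase increments diverge:
dphi_lr = omega * (delta_x / s_lr)  # phases accumulated in the LR region
dphi_hr = omega * (delta_x / s_hr)  # phases accumulated in the HR region

# At the finest frequency (omega[0] = 1), the same physical distance
# produces phases a full 4 radians apart across the two regions.
print(dphi_hr[0] - dphi_lr[0])  # 4.0
```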
2. The CRPA Mechanism: Phase Realignment
CRPA ensures that within each attention operation, all RoPE phase computations occur at a single, consistent spatial stride: the "query stride." For a query grid with stride $s_q$ and a key/value grid with stride $s_k$, define the scale ratio

$$\alpha = \frac{s_k}{s_q}.$$

Key and value positions are reindexed to the query grid before RoPE:

$$\tilde p_k = \alpha\, p_k.$$

RoPE is then applied at the query's native stride:

$$\tilde q = R(p_q)\, q, \qquad \tilde k = R(\tilde p_k)\, k = R(\alpha\, p_k)\, k.$$

The dot-product attention becomes

$$\langle R(p_q)\, q,\; R(\alpha p_k)\, k \rangle = \langle q,\; R(\alpha p_k - p_q)\, k \rangle.$$
Thus, every relative offset is expressed in the query’s native stride, preserving the learned phase structure and enforcing "one attention, one scale." This simple reindexing realigns phases across heterogeneous resolutions, eliminating aliasing at the source (Wu et al., 24 Nov 2025).
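Assuming $\alpha$ converts key indices into query-stride units (so a key at index $p_k$ sits at physical position $s_k p_k$, i.e. $\alpha p_k$ query-stride units from the origin), the realignment can be verified numerically. `rope_rotate` is the same per-pair rotation helper sketched earlier (my own naming, not the paper's code):

```python
import numpy as np

def rope_rotate(v, p, base=10000.0):
    """Rotate each 2-D pair of v by angle p * omega_i (standard RoPE)."""
    d = v.shape[-1]
    omega = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(p * omega), np.sin(p * omega)
    v2 = v.reshape(-1, 2)
    out = np.empty_like(v2)
    out[:, 0] = cos * v2[:, 0] - sin * v2[:, 1]
    out[:, 1] = sin * v2[:, 0] + cos * v2[:, 1]
    return out.reshape(d)

rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)

s_q, s_k = 1.0, 2.0  # query and key strides (hypothetical values)
p_q, p_k = 10, 7     # grid indices; the key sits at physical x = s_k * p_k = 14

alpha = s_k / s_q    # scale ratio: key index -> query-stride units

# Target: the phase at the true physical offset, measured in query-stride units.
ref = q @ rope_rotate(k, alpha * p_k - p_q)               # offset = 14 - 10 = 4

naive = rope_rotate(q, p_q) @ rope_rotate(k, p_k)          # encodes offset -3
crpa = rope_rotate(q, p_q) @ rope_rotate(k, alpha * p_k)   # encodes offset 4
print(np.allclose(crpa, ref), np.allclose(naive, ref))  # True False
```

Naive indexing encodes a relative offset of $-3$ for a key that is physically $+4$ query-stride units away; the CRPA reindexing recovers the correct offset exactly.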
3. Implementation and Practical Considerations
CRPA can be implemented in existing attention modules using minimal code changes. The key steps are:
- Project queries, keys, and values to per-head representations.
- Compute the scale ratio $\alpha = s_k / s_q$.
- Align key/value indices to the query stride by multiplying by $\alpha$: $\tilde p_k = \alpha\, p_k$.
- Apply RoPE based on the reindexed positions.
- Perform standard scaled dot-product attention.
Pseudocode summary (with the scale ratio expressing key indices in query-stride units):

```python
def CRPA_attention(Q, K, V, p_q, p_k, s_q, s_k):
    alpha = s_k / s_q                      # scale ratio: key stride in query-stride units
    p_k_aligned = alpha * p_k              # reindex key/value positions to the query grid
    Qh_rope = apply_rope(Q @ Wq, p_q)      # RoPE at the query's native stride
    Kh_rope = apply_rope(K @ Wk, p_k_aligned)
    attn = softmax(Qh_rope @ Kh_rope.transpose(-1, -2) / sqrt(d_h), dim=-1)
    output = attn @ (V @ Wv)               # standard scaled dot-product attention
    return output
```
Boundary effects between regions with different strides can be further mitigated with the Boundary Expand-and-Replace technique, which pads overlap regions at LR-HR seams. With boundary overlap enabled, empirical results show substantial improvements in texture and color continuity at these seams (Wu et al., 24 Nov 2025).
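The paper's exact Boundary Expand-and-Replace procedure is not reproduced here, but the general expand-with-overlap-then-replace idea can be sketched in 1-D: each region is processed together with a few tokens of context borrowed from its neighbor, and only the region's own tokens are kept afterward. All names and the overlap value below are hypothetical.

```python
import numpy as np

def expand_and_replace(lr_region, hr_region, denoise, overlap=4):
    """Process each region with `overlap` extra tokens of neighbor context,
    then keep only the region's own tokens (hypothetical 1-D sketch)."""
    lr_ext = np.concatenate([lr_region, hr_region[:overlap]])   # expand right
    hr_ext = np.concatenate([lr_region[-overlap:], hr_region])  # expand left
    lr_out = denoise(lr_ext)[: len(lr_region)]  # replace: discard padded tail
    hr_out = denoise(hr_ext)[overlap:]          # replace: discard padded head
    return lr_out, hr_out

# Toy check with an identity "denoiser": regions come back unchanged.
lr, hr = np.arange(16.0), np.arange(100.0, 132.0)
lr_out, hr_out = expand_and_replace(lr, hr, denoise=lambda x: x)
print(np.array_equal(lr_out, lr) and np.array_equal(hr_out, hr))  # True
```

The padded context lets the denoiser see across the seam, so the retained interior tokens are computed with consistent neighborhood statistics on both sides.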
4. Compatibility and Limitations
CRPA is fully compatible with pretrained Diffusion Transformers and does not require downstream adaptation. By restoring coherent RoPE phase relationships, it reestablishes the exact phase kernels expected by all heads and layers across the architecture. The only assumption is that the model was originally trained with RoPE at a fixed stride; CRPA compensates for mixed-resolution at inference time.
In ultra-high stride disparities (e.g., 4K vs. 1K grids), minor phase inconsistencies can arise, but are substantially reduced compared to linear interpolation. Boundary Expand-and-Replace effectively mitigates residual artifacts.
A plausible implication is that models trained with CRPA applied during training may show additional robustness to extreme stride mismatches, though the reported results cover only inference-time application.
5. Empirical Results and Quantitative Evaluation
CRPA has been benchmarked on mixed-resolution text-to-image and text-to-video generation tasks using state-of-the-art DiT backbones:
| Metric | PI-LR | PI-HR | NTK | YaRN | CRPA |
|---|---|---|---|---|---|
| DOVER Overall (Video) ↑ | 63.39 | 35.04 | 44.52 | 66.38 | 75.34 |
| VBench Total (Video) ↑ | 0.717 | 0.661 | 0.680 | 0.725 | 0.770 |
| FID (Image) ↓ | 41.45 | 49.84 | 42.43 | 33.83 | 32.04 |
| CLIP-IQA (Image) ↑ | 0.425 | 0.295 | 0.341 | 0.434 | 0.563 |
| MUSIQ (Image) ↑ | 55.79 | 65.68 | 65.80 | 65.64 | 68.88 |
CRPA achieves the highest DOVER Overall and VBench Total scores in video, and outperforms prior state-of-the-art in FID, CLIP-IQA, and MUSIQ on image tasks. Latency remains comparable to other methods (e.g., 43s for video, 2.8s for images), with CRPA alone contributing only ≈0.137s (0.3%) additional overhead. In three-stage coarse→mixed→fine acceleration pipelines, CRPA yields both higher fidelity and efficiency versus methods such as RALU, at matched runtime (Wu et al., 24 Nov 2025).
Qualitatively, CRPA preserves or sharpens facial and edge details in HR regions, avoids collapse/blur seen with linear RoPE, and maintains continuity across LR–HR transitions.
6. Ablations and Phase Coherence Restoration
Ablation studies demonstrate the uniform stabilization effect of CRPA:
- With boundary overlap, DOVER rises to 75.34 from 68.43 and VBench to 0.770 from 0.747.
- After CRPA application, all heads’ phase kernel curves revert to their original, sharply selective learned forms, eliminating the erratic periodicity caused by misaligned interpolation.
- This restoration is head- and layer-uniform, and requires no weight updates.
Full pipeline evaluation (CRPA + tiny VAE + saliency + up/downsampler) keeps computational overhead below 3% of total runtime (Wu et al., 24 Nov 2025).
7. Summary and Significance
Cross-Resolution Phase-Aligned Attention provides an analytically principled, computationally efficient solution to the aliasing pathologies that arise with RoPE under mixed-resolution tokenization. By enforcing a single, physically consistent stride for all phase computations within an attention head, it maintains the integrity of pretrained phase kernels, advances the state of the art in high-fidelity, mixed-resolution generation, and offers seamless integration into existing Diffusion Transformer workflows without retraining. This method establishes a new standard for generative modeling across variable-resolution grids (Wu et al., 24 Nov 2025).