Cross-Resolution Phase-Aligned Attention
- The paper introduces CRPA, a modification that reindexes key positions to the query grid, restoring phase coherence and eliminating aliasing in mixed-resolution settings.
- CRPA adjusts rotary positional embeddings to use a consistent query stride, enabling high-fidelity image and video generation without retraining.
- Empirical results demonstrate superior performance on metrics like FID, CLIP-IQA, and VBench, with minimal computational overhead in diffusion transformers.
Cross-Resolution Phase-Aligned Attention (CRPA) is a training-free, drop-in modification for rotary positional embeddings (RoPE) in attention mechanisms, specifically designed to stabilize and enhance mixed-resolution denoising in Diffusion Transformers (DiTs). CRPA resolves structural attention collapse caused by phase misalignment when processing tokens from heterogeneous spatial grids, restoring phase coherence and enabling high-fidelity image and video generation without the need for retraining or architectural changes (Wu et al., 24 Nov 2025).
1. Structural Failure of RoPE under Mixed Resolution
The rotary positional embedding mechanism parameterizes position by applying block-diagonal rotations to token embeddings. For model dimension $d$, query and key vectors are decomposed into $d/2$ rotation pairs; pair $i$ uses a frequency $\omega_i = b^{-2i/d}$, $i = 0, \dots, d/2 - 1$ (base $b = 10000$ in standard RoPE):

$$R(p) = \mathrm{diag}\big(R_2(p\,\omega_0), \dots, R_2(p\,\omega_{d/2-1})\big), \qquad R_2(\phi) = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}.$$

Embedding a token at position $p$ gives $\tilde q = R(p_q)\,q$, $\tilde k = R(p_k)\,k$. The attention dot product thus encodes only the relative offset:

$$\langle R(p_q)\,q,\; R(p_k)\,k \rangle = \langle q,\; R(p_k - p_q)\,k \rangle.$$
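The relative-offset property can be checked numerically. Below is a minimal NumPy sketch of standard RoPE (the helper name `rope_rotate` is my own, not the paper's code): it rotates each 2-D pair of a vector and verifies that the query-key dot product depends only on the offset $p_k - p_q$, not on absolute positions.

```python
import numpy as np

def rope_rotate(v, p, base=10000.0):
    """Rotate each 2-D pair of v by angle p * omega_i (standard RoPE)."""
    d = v.shape[-1]
    omega = base ** (-np.arange(0, d, 2) / d)  # one frequency per rotation pair
    cos, sin = np.cos(p * omega), np.sin(p * omega)
    v2 = v.reshape(-1, 2)
    out = np.empty_like(v2)
    out[:, 0] = cos * v2[:, 0] - sin * v2[:, 1]
    out[:, 1] = sin * v2[:, 0] + cos * v2[:, 1]
    return out.reshape(d)

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative offset (4), different absolute positions: identical dot products.
a = rope_rotate(q, 5.0) @ rope_rotate(k, 9.0)
b = rope_rotate(q, 100.0) @ rope_rotate(k, 104.0)
print(np.allclose(a, b))  # True
```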
In mixed-resolution generation, where low-resolution (LR) and high-resolution (HR) regions have different physical strides $s_r$, naive linear interpolation remaps index coordinates piecewise-affinely:

$$p \mapsto a_r\, p + b_r$$

for each region $r$. This implies that the same physical distance $\Delta x$ induces region-dependent phase increments:

$$\Delta\phi_i^{(r)} = \omega_i\, a_r\, \frac{\Delta x}{s_r}.$$
Phase increments lose correspondence, producing "cross-rate aliasing": the sharp, multi-frequency phase kernels learned by attention heads become destabilized. Small stride discrepancies—especially at fine frequencies—produce severe blur, periodic artifacts, or attention collapse. This is particularly brittle in pretrained DiTs with highly selective heads (Wu et al., 24 Nov 2025).
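The mismatch is easy to see in a toy computation (strides and distance below are hypothetical values of my choosing, not from the paper): the same physical distance accumulates different phases in regions with different strides.

```python
import numpy as np

d = 64
omega = 10000.0 ** (-np.arange(0, d, 2) / d)  # RoPE frequencies, one per pair

delta_x = 8.0          # one fixed physical distance (hypothetical units)
s_lr, s_hr = 2.0, 1.0  # per-token strides of the LR and HR regions

# Index offsets differ per region, so phase increments diverge:
dphi_lr = omega * (delta_x / s_lr)  # phases accumulated in the LR region
dphi_hr = omega * (delta_x / s_hr)  # phases accumulated in the HR region

# At the finest frequency (omega[0] = 1), the same physical distance
# produces phases a full 4 radians apart across the two regions.
print(dphi_hr[0] - dphi_lr[0])  # 4.0
```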
2. The CRPA Mechanism: Phase Realignment
CRPA ensures that within each attention operation, all RoPE phase computations occur at a single, consistent spatial stride: the "query stride." For a query grid with stride $s_q$ and a key/value grid with stride $s_k$, define the scale ratio

$$\alpha = \frac{s_k}{s_q}.$$

Key and value positions are reindexed to the query grid before RoPE:

$$\tilde p_k = \alpha\, p_k.$$

RoPE is then applied at the query's native stride:

$$\tilde q = R(p_q)\, q, \qquad \tilde k = R(\tilde p_k)\, k = R(\alpha\, p_k)\, k.$$

The dot-product attention becomes

$$\langle R(p_q)\, q,\; R(\alpha p_k)\, k \rangle = \langle q,\; R(\alpha p_k - p_q)\, k \rangle.$$
Thus, every relative offset is expressed in the query’s native stride, preserving the learned phase structure and enforcing "one attention, one scale." This simple reindexing realigns phases across heterogeneous resolutions, eliminating aliasing at the source (Wu et al., 24 Nov 2025).
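Assuming $\alpha$ converts key indices into query-stride units (so a key at index $p_k$ sits at physical position $s_k p_k$, i.e. $\alpha p_k$ query-stride units from the origin), the realignment can be verified numerically. `rope_rotate` is the same per-pair rotation helper sketched earlier (my own naming, not the paper's code):

```python
import numpy as np

def rope_rotate(v, p, base=10000.0):
    """Rotate each 2-D pair of v by angle p * omega_i (standard RoPE)."""
    d = v.shape[-1]
    omega = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(p * omega), np.sin(p * omega)
    v2 = v.reshape(-1, 2)
    out = np.empty_like(v2)
    out[:, 0] = cos * v2[:, 0] - sin * v2[:, 1]
    out[:, 1] = sin * v2[:, 0] + cos * v2[:, 1]
    return out.reshape(d)

rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)

s_q, s_k = 1.0, 2.0  # query and key strides (hypothetical values)
p_q, p_k = 10, 7     # grid indices; the key sits at physical x = s_k * p_k = 14

alpha = s_k / s_q    # scale ratio: key index -> query-stride units

# Target: the phase at the true physical offset, measured in query-stride units.
ref = q @ rope_rotate(k, alpha * p_k - p_q)               # offset = 14 - 10 = 4

naive = rope_rotate(q, p_q) @ rope_rotate(k, p_k)          # encodes offset -3
crpa = rope_rotate(q, p_q) @ rope_rotate(k, alpha * p_k)   # encodes offset 4
print(np.allclose(crpa, ref), np.allclose(naive, ref))  # True False
```

Naive indexing encodes a relative offset of $-3$ for a key that is physically $+4$ query-stride units away; the CRPA reindexing recovers the correct offset exactly.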
3. Implementation and Practical Considerations
CRPA can be implemented in existing attention modules using minimal code changes. The key steps are:
- Project queries, keys, and values to per-head representations.
- Compute the scale ratio $\alpha = s_k / s_q$.
- Align key/value indices to the query stride by multiplying by $\alpha$: $\tilde p_k = \alpha\, p_k$.
- Apply RoPE based on the reindexed positions.
- Perform standard scaled dot-product attention.
Pseudocode summary (with the scale ratio expressing key indices in query-stride units):

```python
def CRPA_attention(Q, K, V, p_q, p_k, s_q, s_k):
    alpha = s_k / s_q                      # scale ratio: key stride in query-stride units
    p_k_aligned = alpha * p_k              # reindex key/value positions to the query grid
    Qh_rope = apply_rope(Q @ Wq, p_q)      # RoPE at the query's native stride
    Kh_rope = apply_rope(K @ Wk, p_k_aligned)
    attn = softmax(Qh_rope @ Kh_rope.transpose(-1, -2) / sqrt(d_h), dim=-1)
    output = attn @ (V @ Wv)               # standard scaled dot-product attention
    return output
```
Boundary effects between regions with different strides can be further mitigated with the Boundary Expand-and-Replace technique, which pads overlap regions at LR-HR seams. With boundary overlap enabled, empirical results show substantial improvements in texture and color continuity at these seams (Wu et al., 24 Nov 2025).
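The paper's exact Boundary Expand-and-Replace procedure is not reproduced here, but the general expand-with-overlap-then-replace idea can be sketched in 1-D: each region is processed together with a few tokens of context borrowed from its neighbor, and only the region's own tokens are kept afterward. All names and the overlap value below are hypothetical.

```python
import numpy as np

def expand_and_replace(lr_region, hr_region, denoise, overlap=4):
    """Process each region with `overlap` extra tokens of neighbor context,
    then keep only the region's own tokens (hypothetical 1-D sketch)."""
    lr_ext = np.concatenate([lr_region, hr_region[:overlap]])   # expand right
    hr_ext = np.concatenate([lr_region[-overlap:], hr_region])  # expand left
    lr_out = denoise(lr_ext)[: len(lr_region)]  # replace: discard padded tail
    hr_out = denoise(hr_ext)[overlap:]          # replace: discard padded head
    return lr_out, hr_out

# Toy check with an identity "denoiser": regions come back unchanged.
lr, hr = np.arange(16.0), np.arange(100.0, 132.0)
lr_out, hr_out = expand_and_replace(lr, hr, denoise=lambda x: x)
print(np.array_equal(lr_out, lr) and np.array_equal(hr_out, hr))  # True
```

The padded context lets the denoiser see across the seam, so the retained interior tokens are computed with consistent neighborhood statistics on both sides.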
4. Compatibility and Limitations
CRPA is fully compatible with pretrained Diffusion Transformers and does not require downstream adaptation. By restoring coherent RoPE phase relationships, it reestablishes the exact phase kernels expected by all heads and layers across the architecture. The only assumption is that the model was originally trained with RoPE at a fixed stride; CRPA compensates for mixed-resolution at inference time.
In ultra-high stride disparities (e.g., 4K vs. 1K grids), minor phase inconsistencies can arise, but are substantially reduced compared to linear interpolation. Boundary Expand-and-Replace effectively mitigates residual artifacts.
A plausible implication is that models trained with CRPA applied during training may show additional robustness to extreme stride mismatches, though the reported results cover only inference-time application.
5. Empirical Results and Quantitative Evaluation
CRPA has been benchmarked on mixed-resolution text-to-image and text-to-video generation tasks using state-of-the-art DiT backbones:
| Metric | PI-LR | PI-HR | NTK | YaRN | CRPA |
|---|---|---|---|---|---|
| DOVER Overall (Video) ↑ | 63.39 | 35.04 | 44.52 | 66.38 | 75.34 |
| VBench Total (Video) ↑ | 0.717 | 0.661 | 0.680 | 0.725 | 0.770 |
| FID (Image) ↓ | 41.45 | 49.84 | 42.43 | 33.83 | 32.04 |
| CLIP-IQA (Image) ↑ | 0.425 | 0.295 | 0.341 | 0.434 | 0.563 |
| MUSIQ (Image) ↑ | 55.79 | 65.68 | 65.80 | 65.64 | 68.88 |
CRPA achieves the highest DOVER Overall and VBench Total scores in video, and outperforms prior state-of-the-art in FID, CLIP-IQA, and MUSIQ on image tasks. Latency remains comparable to other methods (e.g., 43s for video, 2.8s for images), with CRPA alone contributing only ≈0.137s (0.3%) additional overhead. In three-stage coarse→mixed→fine acceleration pipelines, CRPA yields both higher fidelity and efficiency versus methods such as RALU, at matched runtime (Wu et al., 24 Nov 2025).
Qualitatively, CRPA preserves or sharpens facial and edge details in HR regions, avoids collapse/blur seen with linear RoPE, and maintains continuity across LR–HR transitions.
6. Ablations and Phase Coherence Restoration
Ablation studies demonstrate the uniform stabilization effect of CRPA:
- With boundary overlap, DOVER rises to 75.34 from 68.43 and VBench to 0.770 from 0.747.
- After CRPA application, all heads’ phase kernel curves revert to their original, sharply selective learned forms, eliminating the erratic periodicity caused by misaligned interpolation.
- This restoration is head- and layer-uniform, and requires no weight updates.
Full pipeline evaluation (CRPA + tiny VAE + saliency + up/downsampler) keeps computational overhead below 3% of total runtime (Wu et al., 24 Nov 2025).
7. Summary and Significance
Cross-Resolution Phase-Aligned Attention provides an analytically principled, computationally efficient solution to the aliasing pathologies that arise with RoPE under mixed-resolution tokenization. By enforcing a single, physically consistent stride for all phase computations within an attention head, it maintains the integrity of pretrained phase kernels, advances the state of the art in high-fidelity, mixed-resolution generation, and offers seamless integration into existing Diffusion Transformer workflows without retraining. This method establishes a new standard for generative modeling across variable-resolution grids (Wu et al., 24 Nov 2025).