
Positional-aware Perceiver Resampler

Updated 17 November 2025
  • PPR is a specialized module that fuses facial and body embeddings to encode spatial and identity cues for multi-character image generation.
  • It employs dual Perceiver resamplers and a fusion MLP to distill high-dimensional features into compact tokens for targeted conditioning in diffusion pipelines.
  • The design enforces feature separation and uses attention-guided loss to maintain coherent character and background representations across image sequences.

The Positional-aware Perceiver Resampler (PPR) is a specialized module designed to extract and fuse both facial identity and full-body cues from reference images, enabling fine-grained, position-aware conditioning in diffusion-based generative models. PPR was introduced in the StoryMaker framework to address the challenge of creating multi-character, visually consistent image sequences, ensuring that individual character features—including face, clothing, hairstyle, and body—remain coherent across multiple generated images and narration steps (Zhou et al., 2024).

1. Functional Role within StoryMaker

PPR serves as the primary mechanism for integrating two types of reference-image features:

  • Facial identity embeddings ($F_{\rm face}$), sourced from a frozen ArcFace network, encapsulate the character’s facial traits.
  • Cropped character image embeddings ($F_{\rm char}$), extracted using a frozen CLIP-ViT model, represent body appearance including attire and pose.

The module takes these two sets of high-dimensional features per character and distills them into a compact sequence of token embeddings $c_i$. Each block of tokens within $c_i$ encodes one character’s holistic appearance; an additional block encodes background information. This token sequence is subsequently injected into the denoising U-Net of a Stable Diffusion–style pipeline via decoupled cross-attention, providing fine-grained, spatially targeted conditioning for multi-entity image synthesis.
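To make this interface concrete, here is a shape-level sketch in PyTorch (random stand-in tensors; dimensions follow Section 6: a $1\times512$ ArcFace embedding and $S_c\times1024$ CLIP patch features per character, $L=16$ tokens of dimension $D=768$ per block, $N=2$ characters plus one background block):

```python
import torch

N, L, D, S_c = 2, 16, 768, 49       # characters, tokens/block, token dim, CLIP patches

F_face = torch.randn(N, 1, 512)     # frozen ArcFace identity embedding per character
F_char = torch.randn(N, S_c, 1024)  # frozen CLIP-ViT patch features (7x7 grid)

# PPR distills both feature sets into a single compact conditioning sequence c_i:
# one background block followed by one fused block per character.
c_i = torch.randn((N + 1) * L, D)   # stand-in for PPR's output
assert c_i.shape == (48, 768)       # (N+1)*L = 48 tokens of dimension 768
```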

2. Architectural Overview

PPR comprises the following main components:

  • Two Perceiver-style Resamplers ($R_1$, $R_2$): Each resampler is independently applied to either $F_{\rm face}$ or $F_{\rm char}$, producing $L$ tokens of dimension $D$ for each feature set.
    • $R_1: F_{\rm face} \rightarrow E_1 \in \mathbb{R}^{L \times D}$
    • $R_2: F_{\rm char} \rightarrow E_2 \in \mathbb{R}^{L \times D}$
  • Feature Fusion Block (MLP): Concatenates $E_1$ and $E_2$, adds a learnable positional embedding ($E_{\rm pos}$), and merges the result with a two-layer MLP of hidden size $4D$.
  • Background Token Block ($E_{\rm bg}$): A learnable block representing the background, isolated from all character tokens.
  • Token Sequence Construction:

    $c_i = [E_{\rm bg};\, E_{\rm fuse}^{(1)};\, \ldots;\, E_{\rm fuse}^{(N)}] \in \mathbb{R}^{(N+1)L \times D},$

    where each $E_{\rm fuse}^{(k)}$ is the fused representation for character $k$.

These tokens serve as keys/values in the image-conditioned cross-attention mechanism for the diffusion model, enabling targeted influence of each character and background region.
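A minimal PyTorch sketch of this layout, assuming single-layer resamplers built from `nn.MultiheadAttention` (the class name `PPRSketch` and the exact wiring are illustrative; the actual module stacks full Perceiver blocks):

```python
import torch
import torch.nn as nn

class PPRSketch(nn.Module):
    """Illustrative PPR layout: two resamplers, a fusion MLP, and a learnable
    background block. Dimensions follow Section 6; wiring is a simplification."""
    def __init__(self, L=16, D=768, face_dim=512, char_dim=1024):
        super().__init__()
        self.Q = nn.Parameter(torch.randn(L, D))          # shared latent queries
        self.R1 = nn.MultiheadAttention(D, 8, kdim=face_dim, vdim=face_dim, batch_first=True)
        self.R2 = nn.MultiheadAttention(D, 8, kdim=char_dim, vdim=char_dim, batch_first=True)
        self.E_pos = nn.Parameter(torch.randn(L, 2 * D))  # positional embedding on concatenated tokens
        self.fuse = nn.Sequential(nn.Linear(2 * D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
        self.E_bg = nn.Parameter(torch.randn(L, D))       # learnable background block

    def forward(self, F_face, F_char):
        # F_face: (N, 1, 512); F_char: (N, S_c, 1024) for N characters.
        N = F_face.shape[0]
        Q = self.Q.unsqueeze(0).expand(N, -1, -1)         # (N, L, D)
        E1, _ = self.R1(Q, F_face, F_face)                # R1: face resampler
        E2, _ = self.R2(Q, F_char, F_char)                # R2: body resampler
        E_fuse = self.fuse(torch.cat([E1, E2], dim=-1) + self.E_pos)  # (N, L, D)
        return torch.cat([self.E_bg, *E_fuse.unbind(0)], dim=0)       # ((N+1)*L, D)
```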

| Component | Input Dimension | Output Dimension |
|---|---|---|
| $R_1$ (face) | $F_{\rm face} \in \mathbb{R}^{1\times512}$ | $E_1 \in \mathbb{R}^{L\times D}$ |
| $R_2$ (body) | $F_{\rm char} \in \mathbb{R}^{S_c\times1024}$ | $E_2 \in \mathbb{R}^{L\times D}$ |
| Fusion MLP | $[E_1; E_2] + E_{\rm pos}$ | $E_{\rm fuse} \in \mathbb{R}^{L\times D}$ |
| Background token | — | $E_{\rm bg} \in \mathbb{R}^{L\times D}$ |
| Final output ($c_i$) | — | $\mathbb{R}^{(N+1)L\times D}$ |

3. Key Mathematical Formulation

The PPR computation is formalized as follows:

  • Perceiver Resamplers (per character):

    $E_1 = R_1(F_{\rm face}) = \mathrm{MultiHead}(Q,\, K_f,\, V_f)\, W_o,$

    where $Q \in \mathbb{R}^{L\times D}$, $K_f = F_{\rm face} W_k$, $V_f = F_{\rm face} W_v$.

    $E_2 = R_2(F_{\rm char}) = \mathrm{MultiHead}(Q,\, K_c,\, V_c)\, W_o,$

    where $K_c = F_{\rm char} W_k'$, $V_c = F_{\rm char} W_v'$.

  • Feature Fusion and Positional Encoding:

    $E_{\rm cat} = [E_1; E_2] + E_{\rm pos} \in \mathbb{R}^{L\times 2D}$

    $E_{\rm fuse} = \mathrm{MLP}(E_{\rm cat}) \in \mathbb{R}^{L\times D},$

    where

    $\mathrm{MLP}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$

  • Concatenation with Background:

    $c_i = [E_{\rm bg};\, E_{\rm fuse}^{(1)};\, \dots;\, E_{\rm fuse}^{(N)}] \in \mathbb{R}^{(N+1)L\times D}$

  • Decoupled Cross-Attention Injection:

    $Z_{\rm new} = \mathrm{Attention}(Q, K_t, V_t) + \gamma\,\mathrm{Attention}(Q, K_i, V_i)$

    with $(K_i, V_i)$ constructed from $c_i$.
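Read end to end, these equations amount to the following walk-through (a sketch with random weights, a single attention head, and one character; $\gamma$, the U-Net query count, and the text-token count are illustrative values, not taken from the paper):

```python
import torch
import torch.nn.functional as F

L, D, S_c = 16, 768, 49
F_face, F_char = torch.randn(1, 512), torch.randn(S_c, 1024)
Q = torch.randn(L, D)                               # learnable latent queries

def resample(X, d_in):
    """MultiHead(Q, K, V) W_o with a single head, per the resampler equations."""
    W_k, W_v, W_o = torch.randn(d_in, D), torch.randn(d_in, D), torch.randn(D, D)
    K, V = X @ W_k, X @ W_v
    A = torch.softmax(Q @ K.T / D ** 0.5, dim=-1)
    return (A @ V) @ W_o                            # (L, D)

E1, E2 = resample(F_face, 512), resample(F_char, 1024)   # R1, R2

# Fusion with positional encoding: E_fuse = MLP([E1; E2] + E_pos).
E_pos = torch.randn(L, 2 * D)
W1, b1 = torch.randn(2 * D, 4 * D), torch.zeros(4 * D)
W2, b2 = torch.randn(4 * D, D), torch.zeros(D)
E_fuse = F.gelu((torch.cat([E1, E2], dim=-1) + E_pos) @ W1 + b1) @ W2 + b2

# Concatenation with the background block (N = 1 here).
E_bg = torch.randn(L, D)
c_i = torch.cat([E_bg, E_fuse], dim=0)              # (2L, D)

# Decoupled cross-attention: text branch plus gamma-scaled image branch.
def attend(Z, K, V):
    return torch.softmax(Z @ K.T / D ** 0.5, dim=-1) @ V

Z = torch.randn(4096, D)                            # U-Net latent tokens (64x64)
K_t, V_t = torch.randn(77, D), torch.randn(77, D)   # from text embeddings
W_ki, W_vi = torch.randn(D, D), torch.randn(D, D)
K_i, V_i = c_i @ W_ki, c_i @ W_vi                   # built from c_i
Z_new = attend(Z, K_t, V_t) + 1.0 * attend(Z, K_i, V_i)   # gamma = 1.0
```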

4. Separation of Features and Positional Consistency

To maintain distinctive per-character and background representations, the following strategies are enforced:

  • Each character block $E_{\rm fuse}^{(k)}$ (for $k = 1, \ldots, N$) is maintained as a separate $L$-token group. The relative sequencing within each group is further disambiguated by the learnable positional embeddings $E_{\rm pos}$.
  • The first token block $E_{\rm bg}$ isolates background information. This architectural constraint ensures character features do not collapse into background features, preserving spatial locality and semantic consistency throughout generation.

This approach prevents the blending of character and background attributes during synthesis, supporting precise narrative and visual storytelling in multi-character scenes.
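This block layout implies a simple indexing convention; a hypothetical helper (not from the paper) makes it explicit:

```python
def token_block(c_i, k, L=16):
    """Return the L-token block for entity k: k = 0 is the background E_bg,
    k = 1..N are the per-character blocks E_fuse^(k). (Illustrative helper.)"""
    return c_i[k * L:(k + 1) * L]   # (L, D) slice
```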

5. Attention-Guided Loss for Spatial Decoupling

To avoid feature bleeding across character and background regions, PPR uses a dedicated attention supervision loss:

  • For each image-conditioned cross-attention layer, compute attention maps:

    $P = \mathrm{Softmax}\bigl(Q K^T / \sqrt{d}\bigr) \in \mathbb{R}^{h\times w\times (N+1)L}$

  • Aggregate each token block’s $L$ attention maps (characters and background alike) into a spatial map $A_k \in \mathbb{R}^{h\times w}$.
  • Given segmentation masks $M_k$, the PPR-aware loss is:

    $\mathcal{L}_{\rm attn} = \frac{1}{N+1} \sum_{k=1}^{N+1} \| A_k - M_k \|_2^2$

  • This loss is averaged across all $M$ cross-attention layers and combined with the primary diffusion loss:

    $\mathcal{L} = \mathcal{L}_{\rm SD} + \frac{\lambda}{M} \sum_{l=1}^{M} \mathcal{L}_{\rm attn}^{(l)}, \quad \lambda = 0.1$

This loss enforces spatial decoupling, compelling each latent token block’s attention impact to correspond to its associated visual region.
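A sketch of this objective for a single layer, assuming block-major token order in $c_i$ and mean-pooling as the aggregation over each block’s $L$ maps (the section does not pin down the aggregation; masks are assumed resized to the layer’s spatial resolution):

```python
import torch

def attn_guided_loss(P, masks, L=16):
    """P: (h*w, (N+1)*L) softmaxed attention map of one image cross-attention
    layer; masks: (N+1, h, w) segmentation masks, background block first."""
    n_blocks, h, w = masks.shape
    A = P.reshape(h, w, n_blocks, L).mean(dim=-1)     # aggregate each L-token block
    A = A.permute(2, 0, 1)                            # A_k maps, shape (N+1, h, w)
    return ((A - masks) ** 2).sum(dim=(1, 2)).mean()  # (1/(N+1)) sum_k ||A_k - M_k||^2

# Combined objective over M layers, lambda = 0.1:
# loss = loss_SD + 0.1 / M * sum(attn_guided_loss(P_l, masks_l) for each layer l)
```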

6. Implementation and Hyper-parameters

Key specifications guiding PPR’s implementation in StoryMaker include:

  • Resampler Dimensions:
    • $F_{\rm face}$: $\mathbb{R}^{1\times512}$ (ArcFace)
    • $F_{\rm char}$: $\mathbb{R}^{S_c\times1024}$ (CLIP-ViT-H/14; $S_c$ is typically 49, from a $7\times7$ grid)
    • Latent dimension $D = 768$; tokens per block $L = 16$; up to $N = 2$ characters
    • Final $c_i$ size for $N = 2$: $3\times16\times768 = 36{,}864$ values (48 tokens of dimension 768)
  • Attention and MLP:
    • 8 attention heads for $R_1$ and $R_2$
    • MLP hidden size $4D$
  • Training Protocol:
    • LoRA rank: 128 (for all injected $W_q$, $W_k$, $W_v$)
    • Optimizer: AdamW; learning rate $1\times10^{-4}$ for the first 4,000 steps, $5\times10^{-5}$ for the last 4,000; 8,000 steps total
    • Batch: 8 images × 8 A100 GPUs
    • Freeze SDXL U-Net and encoders, train PPR and LoRA adapters only
    • Inference: UniPC sampler with 25 steps, classifier-free guidance set to 7.5

This configuration keeps training compute-efficient and confines specialization to the resampler and the cross-attention interface.
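For quick reference, the same settings gathered into a single config sketch (a transcription of the list above, not an official config file):

```python
# Hyper-parameters from Section 6, collected as a plain dict.
ppr_config = dict(
    tokens_per_block=16, token_dim=768, max_characters=2,
    attn_heads=8, mlp_hidden=4 * 768,
    lora_rank=128,                             # applied to W_q, W_k, W_v
    optimizer="AdamW",
    lr_schedule=[(4000, 1e-4), (4000, 5e-5)],  # (steps, learning rate)
    total_steps=8000,
    batch_size_per_gpu=8, num_gpus=8,          # A100s
    frozen=["SDXL U-Net", "encoders"],
    trainable=["PPR", "LoRA adapters"],
    sampler="UniPC", sampling_steps=25, cfg_scale=7.5,
)
```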

7. Significance in Multi-Character Image Generation

PPR’s design is central to StoryMaker’s capacity to render images with holistic character and scene consistency. By jointly leveraging facial and body cues, encoding positional information, and supervising spatial attention, PPR provides a principled solution for multi-entity composition in generative diffusion models. The architectural isolation of token groups and supervision with segmentation masks allows for scalable, tuning-free personalization without entanglement of extraneous features or characters, directly advancing the state of the art in narrative text-to-image tasks (Zhou et al., 2024).

References

  • Zhou, Z., Li, J., Li, H., Chen, N., & Tang, X. (2024). StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation. arXiv:2409.12576.
