
Positional-aware Perceiver Resampler

Updated 17 November 2025
  • PPR is a specialized module that fuses facial and body embeddings to encode spatial and identity cues for multi-character image generation.
  • It employs dual Perceiver resamplers and a fusion MLP to distill high-dimensional features into compact tokens for targeted conditioning in diffusion pipelines.
  • The design enforces feature separation and uses attention-guided loss to maintain coherent character and background representations across image sequences.

The Positional-aware Perceiver Resampler (PPR) is a specialized module designed to extract and fuse both facial identity and full-body cues from reference images, enabling fine-grained, position-aware conditioning in diffusion-based generative models. PPR was introduced in the StoryMaker framework to address the challenge of creating multi-character, visually consistent image sequences, ensuring that individual character features—including face, clothing, hairstyle, and body—remain coherent across multiple generated images and narration steps (Zhou et al., 2024).

1. Functional Role within StoryMaker

PPR serves as the primary mechanism for integrating two types of reference-image features:

  • Facial identity embeddings ($F_{\rm face}$), sourced from a frozen ArcFace network, encapsulate the character’s facial traits.
  • Cropped character image embeddings ($F_{\rm char}$), extracted using a frozen CLIP-ViT model, represent body appearance including attire and pose.

The module takes these two sets of high-dimensional features per character and distills them into a compact sequence of token embeddings $c_i$. Each block of tokens within $c_i$ encodes one character’s holistic appearance; an additional block encodes background information. This token sequence is subsequently injected into the denoising U-Net of a Stable Diffusion–style pipeline via decoupled cross-attention, providing fine-grained, spatially targeted conditioning for multi-entity image synthesis.
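To make this interface concrete, here is a shape-level sketch in PyTorch (random stand-in tensors; dimensions follow Section 6: a $1\times512$ ArcFace embedding and $S_c\times1024$ CLIP patch features per character, $L=16$ tokens of dimension $D=768$ per block, $N=2$ characters plus one background block):

```python
import torch

N, L, D, S_c = 2, 16, 768, 49       # characters, tokens/block, token dim, CLIP patches

F_face = torch.randn(N, 1, 512)     # frozen ArcFace identity embedding per character
F_char = torch.randn(N, S_c, 1024)  # frozen CLIP-ViT patch features (7x7 grid)

# PPR distills both feature sets into a single compact conditioning sequence c_i:
# one background block followed by one fused block per character.
c_i = torch.randn((N + 1) * L, D)   # stand-in for PPR's output
assert c_i.shape == (48, 768)       # (N+1)*L = 48 tokens of dimension 768
```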

2. Architectural Overview

PPR comprises the following main components:

  • Two Perceiver-style Resamplers ($R_1$, $R_2$): Each resampler is independently applied to either $F_{\rm face}$ or $F_{\rm char}$, producing $L$ tokens of dimension $D$ for each feature set.
    • $R_1: F_{\rm face} \rightarrow E_1 \in \mathbb{R}^{L \times D}$
    • $R_2: F_{\rm char} \rightarrow E_2 \in \mathbb{R}^{L \times D}$
  • Feature Fusion Block (MLP): Concatenates $E_1$ and $E_2$, adds a learnable positional embedding ($E_{\rm pos}$), and merges the result with a two-layer MLP of hidden size $4D$.
  • Background Token Block ($E_{\rm bg}$): A learnable block representing the background, isolated from all character tokens.
  • Token Sequence Construction:

    $c_i = [E_{\rm bg};\, E_{\rm fuse}^{(1)};\, \ldots;\, E_{\rm fuse}^{(N)}] \in \mathbb{R}^{(N+1)L \times D},$

    where each $E_{\rm fuse}^{(k)}$ is the fused representation for character $k$.

These tokens serve as keys/values in the image-conditioned cross-attention mechanism for the diffusion model, enabling targeted influence of each character and background region.
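A minimal PyTorch sketch of this layout, assuming single-layer resamplers built from `nn.MultiheadAttention` (the class name `PPRSketch` and the exact wiring are illustrative; the actual module stacks full Perceiver blocks):

```python
import torch
import torch.nn as nn

class PPRSketch(nn.Module):
    """Illustrative PPR layout: two resamplers, a fusion MLP, and a learnable
    background block. Dimensions follow Section 6; wiring is a simplification."""
    def __init__(self, L=16, D=768, face_dim=512, char_dim=1024):
        super().__init__()
        self.Q = nn.Parameter(torch.randn(L, D))          # shared latent queries
        self.R1 = nn.MultiheadAttention(D, 8, kdim=face_dim, vdim=face_dim, batch_first=True)
        self.R2 = nn.MultiheadAttention(D, 8, kdim=char_dim, vdim=char_dim, batch_first=True)
        self.E_pos = nn.Parameter(torch.randn(L, 2 * D))  # positional embedding on concatenated tokens
        self.fuse = nn.Sequential(nn.Linear(2 * D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
        self.E_bg = nn.Parameter(torch.randn(L, D))       # learnable background block

    def forward(self, F_face, F_char):
        # F_face: (N, 1, 512); F_char: (N, S_c, 1024) for N characters.
        N = F_face.shape[0]
        Q = self.Q.unsqueeze(0).expand(N, -1, -1)         # (N, L, D)
        E1, _ = self.R1(Q, F_face, F_face)                # R1: face resampler
        E2, _ = self.R2(Q, F_char, F_char)                # R2: body resampler
        E_fuse = self.fuse(torch.cat([E1, E2], dim=-1) + self.E_pos)  # (N, L, D)
        return torch.cat([self.E_bg, *E_fuse.unbind(0)], dim=0)       # ((N+1)*L, D)
```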

| Component | Input Dimension | Output Dimension |
|---|---|---|
| $R_1$ (face) | $F_{\rm face} \in \mathbb{R}^{1\times512}$ | $E_1 \in \mathbb{R}^{L\times D}$ |
| $R_2$ (body) | $F_{\rm char} \in \mathbb{R}^{S_c\times1024}$ | $E_2 \in \mathbb{R}^{L\times D}$ |
| Fusion MLP | $[E_1; E_2] + E_{\rm pos}$ | $E_{\rm fuse} \in \mathbb{R}^{L\times D}$ |
| Background token | — | $E_{\rm bg} \in \mathbb{R}^{L\times D}$ |
| Final output ($c_i$) | — | $\mathbb{R}^{(N+1)L\times D}$ |

3. Key Mathematical Formulation

The PPR computation is formalized as follows:

  • Perceiver Resamplers (per character):

    $E_1 = R_1(F_{\rm face}) = \mathrm{MultiHead}(Q,\, K_f,\, V_f)\, W_o,$

    where $Q \in \mathbb{R}^{L\times D}$, $K_f = F_{\rm face} W_k$, $V_f = F_{\rm face} W_v$.

    $E_2 = R_2(F_{\rm char}) = \mathrm{MultiHead}(Q,\, K_c,\, V_c)\, W_o,$

    where $K_c = F_{\rm char} W_k'$, $V_c = F_{\rm char} W_v'$.

  • Feature Fusion and Positional Encoding:

    $E_{\rm cat} = [E_1; E_2] + E_{\rm pos} \in \mathbb{R}^{L\times 2D}$

    $E_{\rm fuse} = \mathrm{MLP}(E_{\rm cat}) \in \mathbb{R}^{L\times D},$

    where

    $\mathrm{MLP}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$

  • Concatenation with Background:

    $c_i = [E_{\rm bg};\, E_{\rm fuse}^{(1)};\, \dots;\, E_{\rm fuse}^{(N)}] \in \mathbb{R}^{(N+1)L\times D}$

  • Decoupled Cross-Attention Injection:

    $Z_{\rm new} = \mathrm{Attention}(Q, K_t, V_t) + \gamma\,\mathrm{Attention}(Q, K_i, V_i)$

    with $(K_i, V_i)$ constructed from $c_i$.
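Read end to end, these equations amount to the following walk-through (a sketch with random weights, a single attention head, and one character; $\gamma$, the U-Net query count, and the text-token count are illustrative values, not taken from the paper):

```python
import torch
import torch.nn.functional as F

L, D, S_c = 16, 768, 49
F_face, F_char = torch.randn(1, 512), torch.randn(S_c, 1024)
Q = torch.randn(L, D)                               # learnable latent queries

def resample(X, d_in):
    """MultiHead(Q, K, V) W_o with a single head, per the resampler equations."""
    W_k, W_v, W_o = torch.randn(d_in, D), torch.randn(d_in, D), torch.randn(D, D)
    K, V = X @ W_k, X @ W_v
    A = torch.softmax(Q @ K.T / D ** 0.5, dim=-1)
    return (A @ V) @ W_o                            # (L, D)

E1, E2 = resample(F_face, 512), resample(F_char, 1024)   # R1, R2

# Fusion with positional encoding: E_fuse = MLP([E1; E2] + E_pos).
E_pos = torch.randn(L, 2 * D)
W1, b1 = torch.randn(2 * D, 4 * D), torch.zeros(4 * D)
W2, b2 = torch.randn(4 * D, D), torch.zeros(D)
E_fuse = F.gelu((torch.cat([E1, E2], dim=-1) + E_pos) @ W1 + b1) @ W2 + b2

# Concatenation with the background block (N = 1 here).
E_bg = torch.randn(L, D)
c_i = torch.cat([E_bg, E_fuse], dim=0)              # (2L, D)

# Decoupled cross-attention: text branch plus gamma-scaled image branch.
def attend(Z, K, V):
    return torch.softmax(Z @ K.T / D ** 0.5, dim=-1) @ V

Z = torch.randn(4096, D)                            # U-Net latent tokens (64x64)
K_t, V_t = torch.randn(77, D), torch.randn(77, D)   # from text embeddings
W_ki, W_vi = torch.randn(D, D), torch.randn(D, D)
K_i, V_i = c_i @ W_ki, c_i @ W_vi                   # built from c_i
Z_new = attend(Z, K_t, V_t) + 1.0 * attend(Z, K_i, V_i)   # gamma = 1.0
```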

4. Separation of Features and Positional Consistency

To maintain distinctive per-character and background representations, the following strategies are enforced:

  • Each character block $E_{\rm fuse}^{(k)}$ (for $k = 1, \ldots, N$) is maintained as a separate $L$-token group. The relative sequencing within each group is further disambiguated by the learnable positional embeddings $E_{\rm pos}$.
  • The first token block $E_{\rm bg}$ isolates background information. This architectural constraint ensures character features do not collapse into background features, preserving spatial locality and semantic consistency throughout generation.

This approach prevents the blending of character and background attributes during synthesis, supporting precise narrative and visual storytelling in multi-character scenes.
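This block layout implies a simple indexing convention; a hypothetical helper (not from the paper) makes it explicit:

```python
def token_block(c_i, k, L=16):
    """Return the L-token block for entity k: k = 0 is the background E_bg,
    k = 1..N are the per-character blocks E_fuse^(k). (Illustrative helper.)"""
    return c_i[k * L:(k + 1) * L]   # (L, D) slice
```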

5. Attention-Guided Loss for Spatial Decoupling

To avoid feature bleeding across character and background regions, PPR uses a dedicated attention supervision loss:

  • For each image-conditioned cross-attention layer, compute attention maps:

    $P = \mathrm{Softmax}\bigl(Q K^T / \sqrt{d}\bigr) \in \mathbb{R}^{h\times w\times (N+1)L}$

  • Aggregate each token block’s $L$ attention maps (characters and background alike) into a spatial map $A_k \in \mathbb{R}^{h\times w}$.
  • Given segmentation masks $M_k$, the PPR-aware loss is:

    $\mathcal{L}_{\rm attn} = \frac{1}{N+1} \sum_{k=1}^{N+1} \| A_k - M_k \|_2^2$

  • This loss is averaged across all $M$ cross-attention layers and combined with the primary diffusion loss:

    $\mathcal{L} = \mathcal{L}_{\rm SD} + \frac{\lambda}{M} \sum_{l=1}^{M} \mathcal{L}_{\rm attn}^{(l)}, \quad \lambda = 0.1$

This loss enforces spatial decoupling, compelling each latent token block’s attention impact to correspond to its associated visual region.
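A sketch of this objective for a single layer, assuming block-major token order in $c_i$ and mean-pooling as the aggregation over each block’s $L$ maps (the section does not pin down the aggregation; masks are assumed resized to the layer’s spatial resolution):

```python
import torch

def attn_guided_loss(P, masks, L=16):
    """P: (h*w, (N+1)*L) softmaxed attention map of one image cross-attention
    layer; masks: (N+1, h, w) segmentation masks, background block first."""
    n_blocks, h, w = masks.shape
    A = P.reshape(h, w, n_blocks, L).mean(dim=-1)     # aggregate each L-token block
    A = A.permute(2, 0, 1)                            # A_k maps, shape (N+1, h, w)
    return ((A - masks) ** 2).sum(dim=(1, 2)).mean()  # (1/(N+1)) sum_k ||A_k - M_k||^2

# Combined objective over M layers, lambda = 0.1:
# loss = loss_SD + 0.1 / M * sum(attn_guided_loss(P_l, masks_l) for each layer l)
```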

6. Implementation and Hyper-parameters

Key specifications guiding PPR’s implementation in StoryMaker include:

  • Resampler Dimensions:
    • $F_{\rm face}$: $\mathbb{R}^{1\times512}$ (ArcFace)
    • $F_{\rm char}$: $\mathbb{R}^{S_c\times1024}$ (CLIP-ViT-H/14; $S_c$ is typically 49, from a $7\times7$ grid)
    • Latent dimension $D = 768$; tokens per block $L = 16$; up to $N = 2$ characters
    • Final $c_i$ size for $N = 2$: $3\times16\times768 = 36{,}864$ values (48 tokens of dimension 768)
  • Attention and MLP:
    • 8 attention heads for $R_1$ and $R_2$
    • MLP hidden size $4D$
  • Training Protocol:
    • LoRA rank: 128 (for all injected $W_q$, $W_k$, $W_v$)
    • Optimizer: AdamW; learning rate $1\times10^{-4}$ for the first 4,000 steps, $5\times10^{-5}$ for the last 4,000; 8,000 steps total
    • Batch: 8 images × 8 A100 GPUs
    • Freeze SDXL U-Net and encoders, train PPR and LoRA adapters only
    • Inference: UniPC sampler with 25 steps, classifier-free guidance set to 7.5

This configuration keeps training compute-efficient and confines specialization to the resampler and the cross-attention interface.
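For quick reference, the same settings gathered into a single config sketch (a transcription of the list above, not an official config file):

```python
# Hyper-parameters from Section 6, collected as a plain dict.
ppr_config = dict(
    tokens_per_block=16, token_dim=768, max_characters=2,
    attn_heads=8, mlp_hidden=4 * 768,
    lora_rank=128,                             # applied to W_q, W_k, W_v
    optimizer="AdamW",
    lr_schedule=[(4000, 1e-4), (4000, 5e-5)],  # (steps, learning rate)
    total_steps=8000,
    batch_size_per_gpu=8, num_gpus=8,          # A100s
    frozen=["SDXL U-Net", "encoders"],
    trainable=["PPR", "LoRA adapters"],
    sampler="UniPC", sampling_steps=25, cfg_scale=7.5,
)
```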

7. Significance in Multi-Character Image Generation

PPR’s design is central to StoryMaker’s capacity to render images with holistic character and scene consistency. By jointly leveraging facial and body cues, encoding positional information, and supervising spatial attention, PPR provides a principled solution for multi-entity composition in generative diffusion models. The architectural isolation of token groups and supervision with segmentation masks allows for scalable, tuning-free personalization without entanglement of extraneous features or characters, directly advancing the state of the art in narrative text-to-image tasks (Zhou et al., 2024).

References

  • Zhou, Z., Li, J., Li, H., Chen, N., & Tang, X. (2024). StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation. arXiv:2409.12576.
