Positional-aware Perceiver Resampler
- PPR is a specialized module that fuses facial and body embeddings to encode spatial and identity cues for multi-character image generation.
- It employs dual Perceiver resamplers and a fusion MLP to distill high-dimensional features into compact tokens for targeted conditioning in diffusion pipelines.
- The design enforces feature separation and uses an attention-guided loss to maintain coherent character and background representations across image sequences.
The Positional-aware Perceiver Resampler (PPR) is a specialized module designed to extract and fuse both facial identity and full-body cues from reference images, enabling fine-grained, position-aware conditioning in diffusion-based generative models. PPR was introduced in the StoryMaker framework to address the challenge of creating multi-character, visually consistent image sequences, ensuring that individual character features—including face, clothing, hairstyle, and body—remain coherent across multiple generated images and narration steps (Zhou et al., 2024).
1. Functional Role within StoryMaker
PPR serves as the primary mechanism for integrating two types of reference-image features:
- Facial identity embeddings ($E_f$), sourced from a frozen ArcFace network, encapsulate the character's facial traits.
- Cropped character image embeddings ($E_b$), extracted using a frozen CLIP-ViT model, represent body appearance including attire and pose.
The module takes these two sets of high-dimensional features per character and distills them into a compact sequence of token embeddings $T$. Each block of $N$ tokens within $T$ encodes one character's holistic appearance; an additional block encodes background information. This token sequence is subsequently injected into the denoising U-Net of a Stable Diffusion–style pipeline via decoupled cross-attention, providing fine-grained, spatially targeted conditioning for multi-entity image synthesis.
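As a shape-level illustration, the inputs and outputs can be summarized as follows (a minimal sketch: the random tensors stand in for real ArcFace and CLIP-ViT outputs, and the values of $N$, $D$, and $K$ are placeholders rather than StoryMaker's published settings):

```python
import torch

d_f = 512             # ArcFace identity embedding size (typical value)
n_b, d_b = 49, 1280   # CLIP-ViT patch count and width (illustrative)
N, D, K = 16, 768, 2  # tokens per block, latent size, characters (illustrative)

E_f = torch.randn(K, 1, d_f)     # one identity vector per character face
E_b = torch.randn(K, n_b, d_b)   # patch features per cropped character body

# PPR distills these into one N-token block per character plus one
# background block, yielding T of shape ((K + 1) * N, D) for cross-attention.
```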
2. Architectural Overview
PPR comprises the following main components:
- Two Perceiver-style Resamplers ($\mathrm{R}_f$, $\mathrm{R}_b$): Each resampler is independently applied to either $E_f$ or $E_b$, producing $N$ tokens of dimension $D$ for each feature set.
- Feature Fusion Block (MLP): Concatenates $T_f^i$ and $T_b^i$ along the feature axis, then merges them with a learnable positional embedding ($P_i$) using a two-layer MLP of hidden size $4D$.
- Background Token Block ($T_{\mathrm{bg}}$): A learnable block representing the background, isolated from all character tokens.
- Token Sequence Construction:
$$T = \mathrm{Concat}\big(T_{\mathrm{bg}}, T^1, \dots, T^K\big),$$
where each $T^i$ is the fused representation for character $i$.
These tokens serve as keys/values in the image-conditioned cross-attention mechanism for the diffusion model, enabling targeted influence of each character and background region.
| Component | Input Dimension | Output Dimension |
|---|---|---|
| $\mathrm{R}_f$ (face) | $d_f$ (ArcFace embedding) | $N \times D$ |
| $\mathrm{R}_b$ (body) | $n_b \times d_b$ (CLIP patch features) | $N \times D$ |
| Fusion MLP | $N \times 2D$ | $N \times D$ |
| Background token $T_{\mathrm{bg}}$ | — | $N \times D$ |
| Final output ($T$) | — | $(K+1)N \times D$ |
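The data flow can be sketched in PyTorch as follows. This is a minimal illustration of the components named above, not StoryMaker's released code: each resampler here uses a single cross-attention block (the original Perceiver Resampler stacks several), and the default dimensions are placeholder values.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal one-block Perceiver-style resampler: N learned latent
    queries cross-attend to a variable-length input feature sequence."""
    def __init__(self, in_dim: int, dim: int, num_tokens: int, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.proj_in = nn.Linear(in_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, n_in, in_dim) -> (B, N, dim)
        kv = self.proj_in(x)
        q = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return self.norm(out)

class PPR(nn.Module):
    """Sketch of the PPR data flow: per-character face/body resampling,
    positional fusion, and a learnable background block placed first."""
    def __init__(self, d_face=512, d_body=1280, dim=768, num_tokens=16, max_chars=4):
        super().__init__()
        self.r_face = PerceiverResampler(d_face, dim, num_tokens)
        self.r_body = PerceiverResampler(d_body, dim, num_tokens)
        # Two-layer fusion MLP with hidden size 4*dim, mapping 2*dim -> dim.
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # One learnable positional embedding P_i per character slot.
        self.pos = nn.Parameter(torch.zeros(max_chars, num_tokens, 2 * dim))
        # Learnable background token block T_bg.
        self.bg = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, faces, bodies):
        # faces: list of K tensors (B, 1, d_face); bodies: list of K tensors (B, n_b, d_body)
        B = faces[0].size(0)
        blocks = [self.bg.unsqueeze(0).expand(B, -1, -1)]            # T_bg first
        for i, (f, b) in enumerate(zip(faces, bodies)):
            t = torch.cat([self.r_face(f), self.r_body(b)], dim=-1)  # (B, N, 2*dim)
            blocks.append(self.fuse(t + self.pos[i]))                # (B, N, dim)
        return torch.cat(blocks, dim=1)                              # (B, (K+1)*N, dim)

# Shape check with K = 2 characters:
ppr = PPR()
faces = [torch.randn(1, 1, 512) for _ in range(2)]
bodies = [torch.randn(1, 49, 1280) for _ in range(2)]
print(ppr(faces, bodies).shape)  # torch.Size([1, 48, 768]) = ((K+1)*N, D)
```

Placing $T_{\mathrm{bg}}$ first matches the token layout described in Section 4 below.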
3. Key Mathematical Formulation
The PPR computation is formalized as follows:
- Perceiver Resamplers (per character $i$):
$$T_f^i = \mathrm{R}_f(E_f^i), \qquad T_b^i = \mathrm{R}_b(E_b^i),$$
where $E_f^i \in \mathbb{R}^{d_f}$, $E_b^i \in \mathbb{R}^{n_b \times d_b}$, and $T_f^i, T_b^i \in \mathbb{R}^{N \times D}$.
- Feature Fusion and Positional Encoding:
$$T^i = \mathrm{MLP}\big(\mathrm{Concat}(T_f^i, T_b^i) + P_i\big),$$
where $\mathrm{Concat}$ joins the two token sets along the feature axis and $P_i \in \mathbb{R}^{N \times 2D}$ is the learnable positional embedding for character slot $i$.
- Concatenation with Background:
$$T = \mathrm{Concat}\big(T_{\mathrm{bg}}, T^1, \dots, T^K\big) \in \mathbb{R}^{(K+1)N \times D}.$$
- Decoupled Cross-Attention Injection:
$$Z' = \mathrm{Attention}(Q, K_{\mathrm{txt}}, V_{\mathrm{txt}}) + \lambda\, \mathrm{Attention}(Q, K_{\mathrm{img}}, V_{\mathrm{img}}),$$
with $K_{\mathrm{img}} = T W_k'$ and $V_{\mathrm{img}} = T W_v'$ constructed from $T$, while $K_{\mathrm{txt}}, V_{\mathrm{txt}}$ come from the text-prompt tokens.
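In code, the decoupled injection amounts to two attention calls that share one query, summed with a weight (a single-head sketch: the projection names in `W` and the weight `lam` are this sketch's conventions, multi-head splitting is omitted, and all token streams share one channel width for simplicity):

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(z, text_tokens, ppr_tokens, W, lam=1.0):
    """z: (B, hw, C) U-Net hidden states; text_tokens: (B, L, C);
    ppr_tokens: (B, (K+1)*N, C); W: dict of projection matrices."""
    q = z @ W["q"]
    # Text branch: standard prompt cross-attention (K_txt, V_txt).
    out_txt = F.scaled_dot_product_attention(q, text_tokens @ W["kt"], text_tokens @ W["vt"])
    # Image branch: K_img, V_img built from the PPR token sequence T.
    out_img = F.scaled_dot_product_attention(q, ppr_tokens @ W["ki"], ppr_tokens @ W["vi"])
    return out_txt + lam * out_img

# Shape check with placeholder sizes:
B, hw, C, L = 1, 64, 320, 77
W = {k: torch.randn(C, C) * 0.02 for k in ["q", "kt", "vt", "ki", "vi"]}
out = decoupled_cross_attention(
    torch.randn(B, hw, C), torch.randn(B, L, C), torch.randn(B, 48, C), W)
print(out.shape)  # torch.Size([1, 64, 320])
```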
4. Separation of Features and Positional Consistency
To maintain distinctive per-character and background representations, the following strategies are enforced:
- Each character block $T^i$ (for $i = 1, \dots, K$) is maintained as a separate $N$-token group. The relative sequencing within each group is further disambiguated by the learnable positional embeddings $P_i$.
- The first token block, $T_{\mathrm{bg}}$, isolates background information. This architectural constraint ensures character features do not collapse into background features, preserving spatial locality and semantic consistency throughout generation.
This approach prevents the blending of character and background attributes during synthesis, supporting precise narrative and visual storytelling in multi-character scenes.
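For concreteness, the block layout can be indexed as follows (a toy illustration with placeholder values $K = 2$ and $N = 4$; StoryMaker's actual sizes may differ):

```python
# Toy token layout for K = 2 characters and N = 4 tokens per block:
#   indices 0..3   -> background block T_bg
#   indices 4..7   -> character block T^1
#   indices 8..11  -> character block T^2
N = 4

def block_slice(i: int) -> slice:
    """i = 0 selects the background block; i = 1..K selects character i."""
    return slice(i * N, (i + 1) * N)

print(block_slice(2))  # slice(8, 12): character 2's tokens within T
```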
5. Attention-Guided Loss for Spatial Decoupling
To avoid feature bleeding across character and background regions, PPR uses a dedicated attention supervision loss:
- For each image-conditioned cross-attention layer, compute attention maps over the PPR tokens:
$$A = \mathrm{Softmax}\!\left(\frac{Q K_{\mathrm{img}}^\top}{\sqrt{d}}\right) \in \mathbb{R}^{hw \times (K+1)N}.$$
- Aggregate each block's $N$ token columns into a spatial map $A_i \in \mathbb{R}^{hw}$ (with $A_0$ for the background block).
- Given segmentation masks $M_i$, downsampled to the layer's resolution, the PPR-aware loss is:
$$\mathcal{L}_{\mathrm{attn}} = \frac{1}{K+1} \sum_{i=0}^{K} \big\lVert A_i - M_i \big\rVert_2^2.$$
- This loss is averaged across all cross-attention layers and combined with the primary diffusion loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{attn}}\, \mathcal{L}_{\mathrm{attn}}.$$
This loss enforces spatial decoupling, compelling each token block's attention to concentrate on its associated visual region.
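A sketch of this supervision for one layer is shown below (assuming the per-block aggregation and mask supervision described above; the mean-squared penalty is this sketch's choice of distance, and `attn` is assumed to already be softmax-normalized):

```python
import torch

def ppr_attention_loss(attn: torch.Tensor, masks: torch.Tensor, N: int) -> torch.Tensor:
    """attn: (B, hw, (K+1)*N) image-branch attention probabilities at one
    cross-attention layer; masks: (B, K+1, hw) segmentation masks resized
    to the layer's spatial resolution, with index 0 = background."""
    B, hw, _ = attn.shape
    K1 = masks.size(1)  # K + 1 blocks
    # Average each block's N token columns into one spatial map A_i.
    A = attn.view(B, hw, K1, N).mean(dim=-1).permute(0, 2, 1)  # (B, K+1, hw)
    return ((A - masks) ** 2).mean()
```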
6. Implementation and Hyper-parameters
Key specifications guiding PPR’s implementation in StoryMaker include:
- Resampler Dimensions:
  - $E_f$: $d_f = 512$ (ArcFace embedding)
  - $E_b$: $n_b \times d_b$ patch features from CLIP-ViT-H/14 ($d_b = 1280$; typically $n_b = 49$ for a $7 \times 7$ grid)
  - Latent dimension $D$, with $N$ tokens per block and up to $K$ characters
  - Final dimension of $T$: $(K+1)N \times D$
- Attention and MLP:
  - 8 attention heads for $\mathrm{R}_f$ and $\mathrm{R}_b$
  - MLP hidden size $4D$
- Training Protocol:
  - LoRA rank: 128 (for all injected $W_q$, $W_k$, $W_v$ projections)
  - Optimizer: AdamW, with one learning rate for the first 4,000 steps and another for the last 4,000 steps (8,000 steps total)
  - Batch: 8 images × 8 A100 GPUs
  - Freeze the SDXL U-Net and encoders; train only the PPR and LoRA adapters
- Inference: UniPC sampler with 25 steps; classifier-free guidance scale 7.5
This configuration ensures compute efficiency and controlled specialization of the resampler and cross-attention interface.
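For reference, the inference settings map onto the diffusers API as follows (shown on a plain SDXL pipeline as a stand-in, since StoryMaker's conditioned pipeline is not part of diffusers; only the sampler and guidance settings come from the specification above):

```python
import torch
from diffusers import StableDiffusionXLPipeline, UniPCMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# UniPC sampler, as used at StoryMaker inference time.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "two friends walking through a rainy street, cinematic lighting",
    num_inference_steps=25,  # 25 UniPC steps
    guidance_scale=7.5,      # classifier-free guidance scale
).images[0]
```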
7. Significance in Multi-Character Image Generation
PPR’s design is central to StoryMaker’s capacity to render images with holistic character and scene consistency. By jointly leveraging facial and body cues, encoding positional information, and supervising spatial attention, PPR provides a principled solution for multi-entity composition in generative diffusion models. The architectural isolation of token groups and supervision with segmentation masks allows for scalable, tuning-free personalization without entanglement of extraneous features or characters, directly advancing the state of the art in narrative text-to-image tasks (Zhou et al., 2024).