SPG Module: Structure-Preserving Personalized Generation
- SPG modules fuse identity preservation with stylistic personalization through dynamic gating mechanisms.
- Realizations employ dual adapters, multi-stage denoising with attention swaps, and region-adaptive normalization to maintain structural consistency.
- The framework is applied across modalities including images, music, and pose synthesis, demonstrating improved controllability and prompt fidelity.
A Structure-Preserving Personalized Generation (SPG) module designates a class of architectural and algorithmic techniques that enable the generation of new content (images, music, or other modalities) with strong preservation of user- or instance-specific structure or identity, while supporting controllable personalization that complies with conditioning signals such as prompts, layouts, or parsing maps. SPG modules achieve this by explicit structural alignment, multi-path conditioning, region-specific normalization, and/or dynamic weighting of personalized and structure-preserving features. Contemporary realizations span text-to-image diffusion systems with dual-path adapters, diffusion-based stagewise retouch pipelines, region-adaptive normalization networks for conditional image synthesis, and non-neural nearest-neighbor alignment for symbolic music.
1. Core Architectural Paradigms
SPG modules instantiate different mechanisms to reconcile personalization and structure preservation, contingent on data modality and task. In advanced text-to-image personalization, exemplified by FlexIP (Huang et al., 10 Apr 2025), SPG augments a frozen diffusion backbone (such as Stable Diffusion) with two lightweight adapters: a Preservation Adapter to encode and inject instance identity, and a Personalization Adapter to control prompt-based stylistic transformation. These adapters operate through learned feature resampling and cross-attention blocks, with outputs fused by a dynamic gating mechanism—parameterized by a scalar during both training and inference—that permits interpolation between structure and style.
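As a concrete illustration of the dual-adapter design, the following minimal PyTorch sketch resamples identity and prompt features in two branches and blends their outputs with a scalar gate before they feed a frozen backbone's cross-attention layers. Class names, dimensions, and the resampling block are illustrative assumptions, not FlexIP's actual implementation.

```python
# Minimal PyTorch sketch of dual-adapter fusion with a dynamic gate.
# Class and argument names are illustrative, not FlexIP's actual API.
import torch
import torch.nn as nn

class AdapterBranch(nn.Module):
    """Resamples conditioning tokens and projects them for cross-attention."""
    def __init__(self, cond_dim: int, model_dim: int, num_tokens: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, model_dim))
        self.attn = nn.MultiheadAttention(model_dim, num_heads=8, batch_first=True)
        self.proj_in = nn.Linear(cond_dim, model_dim)
        self.mlp = nn.Sequential(nn.LayerNorm(model_dim),
                                 nn.Linear(model_dim, model_dim))

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        # cond_tokens: (B, N, cond_dim) identity or prompt features
        kv = self.proj_in(cond_tokens)
        q = self.queries.unsqueeze(0).expand(cond_tokens.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return self.mlp(out)                      # (B, num_tokens, model_dim)

def fuse_adapters(f_pres: torch.Tensor, f_pers: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend preservation and personalization features with a scalar gate."""
    return lam * f_pres + (1.0 - lam) * f_pers

# Usage: the fused tokens condition the frozen U-Net's cross-attention.
pres_branch = AdapterBranch(cond_dim=1024, model_dim=768)  # e.g. DINO identity tokens
pers_branch = AdapterBranch(cond_dim=768, model_dim=768)   # e.g. CLIP prompt tokens
fused = fuse_adapters(pres_branch(torch.randn(2, 257, 1024)),
                      pers_branch(torch.randn(2, 77, 768)),
                      lam=0.7)                             # user-chosen trade-off
```

The scalar gate is the only control exposed at inference, which is what makes the structure-style trade-off continuously adjustable.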
In Layout-and-Retouch (Kim et al., 13 Jul 2024), the SPG methodology underlies a two-stage inference framework: Stage 1 generates layout-constrained images with high prompt fidelity, while Stage 2 uses multi-source attention swap and adaptive mask blending to retouch and inject the reference subject’s appearance, achieving both structural and personalized consistency.
For semantic person image generation, the SPG module is realized via per-region normalization (SEAN) blocks that inject local style codes into decoder features conditioned on target parsing maps, ensuring precise pose and appearance transfer (Lv et al., 2021). In the symbolic music domain (Dai et al., 2021), SPG comprises a non-neural, nearest-neighbor alignment procedure that explicitly matches new generative sections to structurally similar parts of a seed song, propagating fine-grained structure through the generative process.
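The region-adaptive normalization idea can be sketched as follows: style codes are average-pooled per semantic region from the source features and then used to modulate instance-normalized decoder features wherever the target parsing map activates that region. Layer names and the einsum-based broadcasting are illustrative assumptions, not the SEAN implementation itself.

```python
# Sketch of region-adaptive (SEAN-style) normalization: per-region style codes
# modulate instance-normalized decoder features under the target parsing map.
import torch
import torch.nn as nn

class RegionAdaptiveNorm(nn.Module):
    def __init__(self, feat_ch: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        # Per-region style codes -> per-channel scale and bias.
        self.to_gamma = nn.Linear(style_dim, feat_ch)
        self.to_beta = nn.Linear(style_dim, feat_ch)

    def forward(self, feat, parsing, style_codes):
        # feat:        (B, C, H, W) decoder activations
        # parsing:     (B, R, H, W) one-hot target parsing map
        # style_codes: (B, R, style_dim) pooled per-region style from the source
        x = self.norm(feat)
        gamma = torch.einsum('brhw,brc->bchw', parsing, self.to_gamma(style_codes))
        beta = torch.einsum('brhw,brc->bchw', parsing, self.to_beta(style_codes))
        return x * (1 + gamma) + beta

def extract_style_codes(src_feat, src_parsing, proj):
    # Average-pool source features inside each semantic region, then project.
    # src_feat: (B, C, H, W); src_parsing: (B, R, H, W) one-hot; proj: nn.Linear(C, style_dim)
    area = src_parsing.sum(dim=(2, 3)).clamp(min=1.0)                           # (B, R)
    pooled = torch.einsum('bchw,brhw->brc', src_feat, src_parsing) / area.unsqueeze(-1)
    return proj(pooled)                                                         # (B, R, style_dim)
```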
2. Mathematical Foundations and Conditioning Mechanisms
SPG techniques depend on composite and dynamic feature conditioning. In FlexIP (Huang et al., 10 Apr 2025), the final adapter feature at each U-Net cross-attention layer is

$$F_{\text{out}} = \lambda\, F_{\text{pres}} + (1-\lambda)\, F_{\text{pers}},$$

where $F_{\text{pres}}$ and $F_{\text{pers}}$ are the outputs of the Preservation and Personalization adapters, and the gate $\lambda \in [0,1]$ is either data- or user-driven. During training, $\lambda$ is set to favor pure preservation or personalization according to the data domain, while at inference $\lambda$ is exposed as an interactive control.
Layout-and-Retouch (Kim et al., 13 Jul 2024) formalizes a two-stage DDIM-based denoising procedure with attention-tensor override and mask-based feature blending (a sketch of the attention swap follows the list):
- Stage 1 alternates between vanilla and personalized model weights, switching at a pre-specified denoising timestep.
- Stage 2 swaps Q/K/V tensors between layout, reference, and target denoising paths, conditioned on the sampling schedule, with spatial masks enforcing region-level feature inheritance.
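A hedged sketch of the Stage-2 mechanics, assuming a standard scaled dot-product attention layout: target-path queries attend to reference-path keys and values, and the resulting features are blended with the layout path under a spatial mask. The swap policy and schedule below are illustrative, not the paper's exact procedure.

```python
# Illustrative attention swap and masked feature blending for a two-path
# (layout/reference) retouch stage; the scheduling heuristic is an assumption.
import torch

def swapped_attention(q_tgt, k_ref, v_ref, scale=None):
    """Target queries attend to reference keys/values (appearance injection)."""
    scale = scale or q_tgt.shape[-1] ** -0.5
    attn = torch.softmax(q_tgt @ k_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref

def masked_blend(feat_layout, feat_subject, mask):
    """Inherit subject features inside the mask, keep layout features outside."""
    # feat_*: (B, C, H, W); mask: (B, 1, H, W) in [0, 1]
    return mask * feat_subject + (1.0 - mask) * feat_layout

def apply_swap(step, total_steps, swap_fraction=0.5):
    # Swaps are typically restricted to part of the sampling schedule, e.g. the
    # early, high-noise steps that determine coarse structure.
    return step < int(total_steps * swap_fraction)
```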
Region-Adaptive Normalization (SEAN) (Lv et al., 2021) computes per-part feature statistics, extracting style codes for each semantic segment, and modulates decoder activations using learned, region-dependent scaling and bias, blended between parsing-driven and style-driven factors as

$$\gamma = \alpha_\gamma\, \gamma_{\text{style}} + (1-\alpha_\gamma)\, \gamma_{\text{parsing}},$$

with similar blending for the bias $\beta$.
In music, alignment distances are defined over normalized parameters (section length, order, variation flag, recurrence) and used to drive hard assignment in generation, preserving seed structure under user-specified or sampled new forms (Dai et al., 2021).
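A minimal sketch of such an alignment, assuming a weighted absolute-difference distance over the normalized section descriptors and hard nearest-neighbor assignment; field names and weights are illustrative:

```python
# Nearest-neighbor section alignment over normalized section descriptors.
from dataclasses import dataclass

@dataclass
class Section:
    length: float      # normalized section length
    order: float       # normalized position within the form
    variation: float   # 0/1 variation flag
    recurrence: float  # normalized recurrence count

WEIGHTS = {"length": 1.0, "order": 1.0, "variation": 0.5, "recurrence": 0.5}

def distance(a: Section, b: Section) -> float:
    return sum(w * abs(getattr(a, k) - getattr(b, k)) for k, w in WEIGHTS.items())

def align(new_sections, seed_sections):
    """Hard-assign each new target section to its closest seed section."""
    return [min(range(len(seed_sections)), key=lambda j: distance(s, seed_sections[j]))
            for s in new_sections]

# Downstream chord/melody/bass generators then inherit structure from the
# aligned seed sections.
```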
3. Algorithmic Implementation and Control
The training and inference workflows of SPG modules exploit modularity and minimal backbone retraining:
- Adapters (FlexIP): Only the adapters are trained (Resampler blocks of 3–6 layers, Perceiver-style cross-attention with MLPs and layer norms); the base U-Net is frozen. Training uses a mixed dataset, alternating the gate $\lambda$ between preservation- and personalization-dominant settings for image and video samples to cover the trade-off continuum (Huang et al., 10 Apr 2025).
- Dual-Stage Inference (Layout-and-Retouch): Both stages run standard diffusion sampling but override input tensors and masking per stage. The mask for feature blending in Stage 2 is constructed from a union of cross-attention-based and segmentation masks, then processed by distance transforms and normalization (see the mask-construction sketch after this list) (Kim et al., 13 Jul 2024).
- Region-Adaptive Normalization: For each semantic region, style codes are extracted by average pooling over active regions, projected to the desired latent dimension, and broadcast over the decoder activation (Lv et al., 2021).
- Nearest-Neighbor Alignment for Music: At inference, for each new target section, the closest matching seed section is found using a weighted metric. Downstream generators (for chord, melody, bass) then inherit structure from the aligned seed section (Dai et al., 2021).
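For the Stage-2 blending mask mentioned above, a plausible construction unions the attention-derived and segmentation masks and softens the boundary with a distance transform; the threshold and falloff constants are assumptions:

```python
# Sketch of blending-mask construction: union of attention and segmentation
# masks, softened with a Euclidean distance transform and normalized to [0, 1].
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_blend_mask(attn_map: np.ndarray, seg_mask: np.ndarray,
                     attn_thresh: float = 0.3) -> np.ndarray:
    # attn_map: (H, W) cross-attention response for the subject token, in [0, 1]
    # seg_mask: (H, W) binary subject segmentation
    hard = np.logical_or(attn_map > attn_thresh, seg_mask > 0.5)
    # Distance from the region boundary gives a soft falloff outside the mask.
    dist_out = distance_transform_edt(~hard)
    soft = np.exp(-5.0 * dist_out / (dist_out.max() + 1e-8))
    soft[hard] = 1.0
    return soft / soft.max()
```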
The following table summarizes core components across modalities:
| Modality | Structure Encoding | Personalization | Structural Injection |
|---|---|---|---|
| Text-to-Image | Identity/CLIP/DINO | CLIP Prompt Adapter | Adapter fusion via gate $\lambda$ (Huang et al., 10 Apr 2025) |
| Diffusion-Image | Layout images | Reference subject | Attention swap & mask (Kim et al., 13 Jul 2024) |
| Person Pose Image | Parsing maps, pose | Per-region style | Region-adaptive normalization (Lv et al., 2021) |
| Symbolic Music | Section sequence | Seed song alignment | NN alignment on features (Dai et al., 2021) |
4. Loss Functions, Training Objectives, and Quantitative Evaluation
SPG module training involves losses that explicitly balance structure and personalization. In FlexIP (Huang et al., 10 Apr 2025), the joint loss is

$$\mathcal{L} = \mathcal{L}_{\text{diff}} + w_{\text{id}}\,\mathcal{L}_{\text{id}} + w_{\text{text}}\,\mathcal{L}_{\text{text}},$$

where $\mathcal{L}_{\text{diff}}$ is standard denoising score matching, $\mathcal{L}_{\text{id}}$ enforces similarity to identity features (e.g., DINO or face-encoder representations), and $\mathcal{L}_{\text{text}}$ promotes prompt fidelity via CLIP embedding alignment.
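A hedged sketch of this composite objective, using cosine-similarity surrogates for the identity and prompt terms and illustrative weights:

```python
# Composite SPG-style training loss: denoising score matching plus identity
# and prompt-alignment terms. Weights and encoder choices are illustrative.
import torch.nn.functional as F

def spg_loss(noise_pred, noise_gt,
             id_feat_gen, id_feat_ref,      # e.g. DINO / face-encoder features
             clip_img_emb, clip_txt_emb,    # CLIP embeddings of output and prompt
             w_id=0.1, w_text=0.1):
    l_diff = F.mse_loss(noise_pred, noise_gt)                                      # L_diff
    l_id = 1.0 - F.cosine_similarity(id_feat_gen, id_feat_ref, dim=-1).mean()      # L_id
    l_text = 1.0 - F.cosine_similarity(clip_img_emb, clip_txt_emb, dim=-1).mean()  # L_text
    return l_diff + w_id * l_id + w_text * l_text
```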
In SEAN-based SPG (Lv et al., 2021), objectives include pixelwise $L_1$, VGG perceptual, adversarial (PatchGAN), and parsing cross-entropy losses, with empirically chosen weightings.
For Layout-and-Retouch (Kim et al., 13 Jul 2024), end-to-end SPG training (not present in the original but outlined as an extension) would similarly blend CLIP prompt loss, DINO identity loss, and layout edge consistency.
Preferred evaluation metrics include the following (a minimal sketch of the embedding-based similarities follows the list):
- CLIP-I, DINO-I for identity (image)
- CLIP-T for prompt alignment
- FID, SSIM, LPIPS, and PCKh for perceptual and localization quality (image/pose)
- Melody-contour and rhythm-onset similarity via DTW, and section-alignment accuracy (music) (Dai et al., 2021)
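The embedding-based metrics reduce to cosine similarities between encoder outputs; a minimal sketch, assuming the caller supplies the CLIP or DINO embeddings:

```python
# CLIP-I / DINO-I: similarity between generated and reference image embeddings.
# CLIP-T: similarity between generated-image and prompt embeddings.
import torch.nn.functional as F

def image_similarity(gen_img_emb, ref_img_emb):        # CLIP-I or DINO-I
    return F.cosine_similarity(gen_img_emb, ref_img_emb, dim=-1).mean().item()

def prompt_similarity(gen_img_emb, prompt_emb):        # CLIP-T
    return F.cosine_similarity(gen_img_emb, prompt_emb, dim=-1).mean().item()
```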
Empirical results in FlexIP show that dynamic gating achieves Pareto-optimal trade-offs, outperforming fixed fusion. Ablation removing either adapter leads to over-rigid generation or identity drift. SEAN-based SPG achieves improved FID and LPIPS vs. SPADE and CC-FPSE baselines in pose/image synthesis. In music, SPG-based imitation scores higher in stylistic log-likelihood and is preferred in listener studies compared to unconstrained generation (Dai et al., 2021).
5. Application Domains and Modality-Specific Adaptations
SPG modules generalize across data types:
- Conditional diffusion text-to-image: Instance-specific subject preservation with parameterized, prompt-driven personalization (Huang et al., 10 Apr 2025).
- Personalized T2I with multi-stage attention: Explicit separation of background/layout from subject identity using staged denoising and attention swapping (Kim et al., 13 Jul 2024).
- Pose-conditional person image synthesis: Region-based style propagation enables per-part appearance transfer and fine structure retention, outperforming global AdaIN and class-based SPADE normalization (Lv et al., 2021).
- Personalized symbolic music generation: Structural template matching and inheritance adapt fine-grained structure to new compositions while preserving musical identity (Dai et al., 2021).
A plausible implication is that SPG frameworks offer a unifying abstraction for controllable, structure-aware personalization, irrespective of the underlying modality.
6. Comparisons and Theoretical Context
SPG modules are empirically superior to methods that conflate or entangle structural and style conditioning. For example, AdaIN transfer lacks spatial or semantic targeting, SPADE does not propagate source-instance style, and naive prompt fusion fails to resolve identity-style trade-offs. Ablation studies in SPGNet (Lv et al., 2021) demonstrate lower FID and LPIPS when employing two-stage region-adaptive normalization. Similarly, FlexIP's dynamic trade-off control surpasses any static approach on both the identity and style axes (Huang et al., 10 Apr 2025). In music, the SPG alignment module produces imitations of statistically indistinguishable quality from original seeds per listener ratings, while non-SPG controls fail to preserve high-level structure (Dai et al., 2021).
7. Significance and Current Limitations
SPG modules establish a rigorous and generalizable recipe for personalized content generation under structure preservation constraints. Significance is evidenced by improved controllability, prompt fidelity, and semantic consistency in both vision and music domains. However, certain limitations are observed:
- Reliance on high-quality reference data for identity encoding (e.g., DINO/CLIP)
- Hyperparameter sensitivity (e.g., the range of the trade-off scalar $\lambda$)
- Dependence on region-level annotations (e.g., parsing maps) for region-wise style extraction in complex datasets

This suggests that continued research is needed into unsupervised or self-supervised region decomposition and into neural alignment for symbolic modalities.
SPG remains an active area for cross-modal research into structure-conditioned personalized generation, with dynamic gating, attention manipulation, and explicit structure alignment as foundational mechanisms (Huang et al., 10 Apr 2025, Kim et al., 13 Jul 2024, Lv et al., 2021, Dai et al., 2021).