TLPO: Adaptive Multi-Expert Preference Optimization
- TLPO is a two-stage framework that decouples conflicting objectives in audio-driven portrait animation by adaptively fusing specialized expert modules across diffusion timesteps and transformer layers.
- It employs lightweight LoRA experts and a gating network to optimize motion naturalness, lip-sync accuracy, and visual quality while minimizing overfitting and interference.
- In the reported evaluation, TLPO outperforms four state-of-the-art baselines on HKC, Sync-C, FID, and FVD, and human raters score it higher on motion naturalness, lip-sync accuracy, and visual quality.
Timestep-Layer Adaptive Multi-Expert Preference Optimization (TLPO) is a two-stage training framework designed for aligning diffusion-based audio-driven portrait animation models with fine-grained, multidimensional human preferences. TLPO enables diffusion models to simultaneously optimize for motion naturalness, lip-sync accuracy, and visual quality by decoupling these potentially conflicting objectives into specialized expert modules that are adaptively fused across both denoising timesteps and transformer layers. This mechanism avoids the overfitting and mutual interference often observed when applying scalarized or undifferentiated reward optimization, leveraging the intrinsic stage-wise functional decomposition present in diffusion transformer (DiT) architectures (Wang et al., 15 Aug 2025).
1. Multidimensional Preference Alignment in Diffusion Models
Audio-driven portrait animation must meet several human-perceived criteria, primarily motion naturalness (MN), lip-sync accuracy (LS), and visual quality (VQ). These objectives compete with one another; for example, maximizing lip-sync alignment may degrade motion fluidity or detail. Prior methods that collapse the objectives into a single scalar reward tend to over-optimize one dimension at the expense of the others. Furthermore, diffusion transformers exhibit stepwise specialization: the initial denoising stages govern global structure (e.g., motion), later steps refine details (e.g., facial texture), and different transformer layers attend to different spatial-frequency content. Uniform preference injection fails to exploit this internal organization, which motivates a more granular, stage-and-layer-aware preference modulation (Wang et al., 15 Aug 2025).
2. TLPO Framework and Architecture
TLPO is built on a frozen DiT-based latent diffusion transformer backbone (the pre-trained Wan2.1 model) paired with a 3D variational autoencoder. Input audio features are extracted with Wav2Vec2 and injected through cross-attention at every DiT block.
Multi-Expert Structure
Three lightweight LoRA (Low-Rank Adaptation) expert modules are implemented in every linear sub-layer of each DiT block, each aligning with a single objective:
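- $E_{\text{MN}}$: Motion Naturalness expert
- $E_{\text{LS}}$: Lip-Sync expert
- $E_{\text{VQ}}$: Visual Quality expert
Each expert LoRA $E_i$, $i \in \{\text{MN}, \text{LS}, \text{VQ}\}$, injects a low-rank delta $\Delta h_i^{l}$ into its respective frozen layer output $h_0^{l}$. For every inference pass, the final activation for layer $l$ at timestep $t$ is:
$$h^{l,t} = h_0^{l} + \sum_{i} w_i^{l,t}\, \Delta h_i^{l}$$
where $w_i^{l,t}$ is the fusion weight of expert $i$ at layer $l$ and timestep $t$.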
Timestep-Layer Adaptive Fusion
A gating network $G^{l}$, parameterized by $\phi^{l}$ for each layer $l$, takes as input the timestep embedding $e_t$ and outputs expert weights:
$$w^{l,t} = \left(w_{\text{MN}}^{l,t},\, w_{\text{LS}}^{l,t},\, w_{\text{VQ}}^{l,t}\right) = G^{l}(e_t;\, \phi^{l})$$
where $w_i^{l,t} \ge 0$, $\sum_i w_i^{l,t} = 1$. This mechanism allows the model to dynamically select the degree of each expert's influence at each diffusion stage and transformer layer, with minimal computational overhead (less than 1% additional parameters) (Wang et al., 15 Aug 2025).
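The fusion described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the gate is assumed to be a single linear map followed by a softmax, the timestep embedding is assumed to be sinusoidal, and the names `fused_layer`, `gate_weights`, `gate_W`, and `gate_b` are hypothetical. The toy layer is 2-dimensional with rank-1 experts.

```python
import math

EXPERTS = ("MN", "LS", "VQ")  # motion naturalness, lip-sync, visual quality


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax; outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def timestep_embedding(t, dim=4):
    """Sinusoidal timestep embedding (assumed; the source does not specify it)."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (k / half)) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def gate_weights(t, gate_W, gate_b):
    """Hypothetical per-layer gate: one linear map on the embedding, then softmax."""
    logits = [s + b for s, b in zip(matvec(gate_W, timestep_embedding(t)), gate_b)]
    return softmax(logits)


def fused_layer(x, t, W0, experts, gate_W, gate_b):
    """Return h = W0 x + sum_i w_i(t) * B_i (A_i x) and the weights used."""
    h = matvec(W0, x)                       # frozen base output
    weights = gate_weights(t, gate_W, gate_b)
    for w, name in zip(weights, EXPERTS):
        A, B = experts[name]                # A: r x d_in, B: d_out x r
        delta = matvec(B, matvec(A, x))     # low-rank expert update
        h = [hi + w * di for hi, di in zip(h, delta)]
    return h, dict(zip(EXPERTS, weights))


# Toy setup: identity base layer, rank-1 experts, all-zero gate (uniform weights).
x = [1.0, 2.0]
W0 = [[1.0, 0.0], [0.0, 1.0]]
experts = {
    "MN": ([[1.0, 0.0]], [[0.3], [0.0]]),
    "LS": ([[0.0, 1.0]], [[0.0], [0.3]]),
    "VQ": ([[1.0, 1.0]], [[0.3], [0.3]]),
}
gate_W = [[0.0] * 4 for _ in EXPERTS]
gate_b = [0.0, 0.0, 0.0]

h, weights = fused_layer(x, t=10, W0=W0, experts=experts, gate_W=gate_W, gate_b=gate_b)
print(h, weights)
```

With an all-zero gate every expert receives weight 1/3, so the output is the base activation plus the mean of the three expert deltas. Raising one bias in `gate_b` shifts the mixture toward that expert, which is how a trained gate could favor different experts at different timesteps.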
3. Training Strategy and Loss Functions
TLPO training proceeds in two distinct stages.
Stage 1: Single-Expert DPO
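Each expert module is independently optimized via Direct Preference Optimization (DPO), using pairs of samples $(x^{+}, x^{-})$ curated for the target preference dimension $i$ and their respective reward function $r_i$. The loss is:
$$\mathcal{L}_{\text{DPO}}^{i} = -\,\mathbb{E}\Big[\log \sigma\Big(-\beta\,\big[(\ell_{\theta}(x^{+}) - \ell_{\text{ref}}(x^{+})) - (\ell_{\theta}(x^{-}) - \ell_{\text{ref}}(x^{-}))\big]\Big)\Big]$$
where $\ell$ is the denoising loss, evaluated under the trained model ($\ell_{\theta}$) or the frozen reference model ($\ell_{\text{ref}}$), and $\beta$ is a temperature hyperparameter. For the lip-sync expert $E_{\text{LS}}$, the loss is further reweighted via a lip-region mask $M$:
$$\ell^{\text{LS}}(x) = \big\| M \odot \big(\epsilon - \epsilon_{\theta}(x_t, t)\big) \big\|^2$$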
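As a rough illustration of the Stage 1 objective, the sketch below computes a preference loss from denoising errors in the commonly used Diffusion-DPO form. That form, the normalization of the masked error by the mask sum, and the names `denoise_error` and `dpo_loss` are assumptions, since the source's exact formula is not preserved here.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def denoise_error(eps_true, eps_pred, mask=None):
    """Weighted mean squared noise-prediction error.

    `mask` plays the role of the lip-region mask: elements with larger
    weight contribute more. Normalizing by the mask sum is an assumption.
    """
    if mask is None:
        mask = [1.0] * len(eps_true)
    weight_sum = sum(mask)
    if weight_sum <= 0:
        raise ValueError("mask must contain positive weight")
    total = sum(m * (a - b) ** 2 for m, a, b in zip(mask, eps_true, eps_pred))
    return total / weight_sum


def dpo_loss(err_pol_pref, err_ref_pref, err_pol_rej, err_ref_rej, beta):
    """Preference loss on denoising errors (policy vs. frozen reference)."""
    margin = (err_pol_pref - err_ref_pref) - (err_pol_rej - err_ref_rej)
    return -log_sigmoid(-beta * margin)


# Masked vs. unmasked error: all of the error sits in the first two elements.
eps_true = [0.0, 0.0, 0.0, 0.0]
eps_pred = [1.0, 1.0, 0.0, 0.0]
lip_mask = [1.0, 1.0, 0.0, 0.0]
plain_err = denoise_error(eps_true, eps_pred)
lip_err = denoise_error(eps_true, eps_pred, lip_mask)

# A policy identical to the reference sits at log(2); improving on the
# preferred sample and worsening on the rejected one lowers the loss.
neutral = dpo_loss(0.2, 0.2, 0.3, 0.3, beta=1.0)
improved = dpo_loss(0.1, 0.2, 0.4, 0.3, beta=1.0)
print(plain_err, lip_err, neutral, improved)
```

The masked error is larger than the plain one in this toy case because the mask concentrates the average on the region where the prediction is wrong, which is the intended effect of emphasizing the lip area for the lip-sync expert.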
Stage 2: Fusion-Gate Optimization
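All expert modules are frozen, and only the fusion gate parameters $\phi$ are updated using "full-dimension" pairs (real vs. degraded synthetic samples) and a combined DPO loss:
$$\mathcal{L}_{\text{fusion}}(\phi) = -\,\mathbb{E}\Big[\log \sigma\Big(-\beta\,\big[(\ell_{\phi}(x^{+}) - \ell_{\text{ref}}(x^{+})) - (\ell_{\phi}(x^{-}) - \ell_{\text{ref}}(x^{-}))\big]\Big)\Big]$$
where $x^{+}$ is the real sample and $x^{-}$ the degraded one, $\ell_{\phi}$ is the denoising loss of the gated multi-expert model, $\ell_{\text{ref}}$ is that of the frozen reference model, and $\beta$ is the temperature.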
This two-stage design ensures that gradients never push experts into direct competition and enables the fusion gate to modulate contributions from each expert without disrupting their optimized directions.
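The following toy sketch illustrates the Stage 2 idea of training only the gate while the experts stay fixed. Everything in it is a simplifying assumption: expert outputs are reduced to frozen scalars, the gate is three logits for a single layer and timestep, the reference model is taken to be the backbone without experts, and gradients are estimated by finite differences rather than backpropagation.

```python
import math

# Frozen Stage 1 expert outputs (toy scalars); a tuple so they cannot be mutated.
EXPERT_DELTAS = (0.5, -0.2, 0.1)


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_sigmoid(z):
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def fusion_loss(gate_logits, pairs, beta=5.0):
    """Preference loss as a function of the gate logits only."""
    weights = softmax(gate_logits)
    shift = sum(w * d for w, d in zip(weights, EXPERT_DELTAS))  # fused expert delta
    total = 0.0
    for base_real, eps_real, base_deg, eps_deg in pairs:
        err_pol_real = (eps_real - (base_real + shift)) ** 2
        err_ref_real = (eps_real - base_real) ** 2
        err_pol_deg = (eps_deg - (base_deg + shift)) ** 2
        err_ref_deg = (eps_deg - base_deg) ** 2
        margin = (err_pol_real - err_ref_real) - (err_pol_deg - err_ref_deg)
        total += -log_sigmoid(-beta * margin)
    return total / len(pairs)


def numeric_grad(f, params, step=1e-5):
    """Central finite-difference gradient of f at params."""
    grad = []
    for k in range(len(params)):
        up = list(params)
        down = list(params)
        up[k] += step
        down[k] -= step
        grad.append((f(up) - f(down)) / (2.0 * step))
    return grad


def train_gate(pairs, steps=200, lr=0.5):
    """Gradient descent on the gate logits; expert deltas are never touched."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = numeric_grad(lambda p: fusion_loss(p, pairs), logits)
        logits = [p - lr * g for p, g in zip(logits, grad)]
    return logits


# One (real, degraded) pair: (backbone prediction, true noise) for each sample.
pairs = [(0.0, 0.4, 0.0, -0.3)]
loss_before = fusion_loss([0.0, 0.0, 0.0], pairs)
trained_logits = train_gate(pairs)
loss_after = fusion_loss(trained_logits, pairs)
trained_weights = softmax(trained_logits)
print(loss_before, loss_after, trained_weights)
```

In this toy problem the first expert's delta moves the prediction toward the real sample and away from the degraded one, so the gate learns to weight it most heavily. Because only the logits are optimized, the experts' learned directions are left exactly as Stage 1 produced them.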
4. Data, Hyperparameters, and Implementation
Training leverages the Talking-NSQ dataset (410,000 auto-scored preference pairs: 180,000 for MN, 100,000 for LS, 130,000 for VQ) for expert adaptation, and 18,000 full-dimension pairs (real vs. degraded) for gate fusion. All LoRA modules use a fixed low rank. Expert training uses the AdamW optimizer, running MN and VQ for 10 epochs and LS for 20 epochs on 16 A100 GPUs, while gate fusion runs for 5 epochs and updates only the gating weights. The model runs for 50 diffusion timesteps, each passing through approximately 24 DiT blocks of 8 linear sub-layers each (Wang et al., 15 Aug 2025).
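The figures above can be collected into a small configuration object to check the arithmetic. This is only a bookkeeping sketch: the class name `TLPOSetup` and its fields are hypothetical, the block count uses the approximate value of 24, and it assumes one set of three gate weights per linear sub-layer per timestep. Values that this text does not state, such as the LoRA rank and learning rates, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TLPOSetup:
    # Preference pairs reported for the Talking-NSQ dataset.
    pairs_mn: int = 180_000
    pairs_ls: int = 100_000
    pairs_vq: int = 130_000
    pairs_full: int = 18_000      # full-dimension pairs for gate fusion
    # Training schedule.
    epochs_mn: int = 10
    epochs_vq: int = 10
    epochs_ls: int = 20
    epochs_gate: int = 5
    # Model layout ("approximately 24" blocks in the text).
    timesteps: int = 50
    dit_blocks: int = 24
    linear_sublayers: int = 8
    num_experts: int = 3

    @property
    def expert_pairs(self) -> int:
        """Total pairs used for Stage 1 expert training."""
        return self.pairs_mn + self.pairs_ls + self.pairs_vq

    @property
    def gated_sublayers(self) -> int:
        """Linear sub-layers that carry the three LoRA experts."""
        return self.dit_blocks * self.linear_sublayers

    @property
    def gate_weights_per_trajectory(self) -> int:
        """Expert weights evaluated over one full denoising trajectory."""
        return self.gated_sublayers * self.timesteps * self.num_experts


setup = TLPOSetup()
print(setup.expert_pairs, setup.gated_sublayers, setup.gate_weights_per_trajectory)
```

The three per-dimension subsets add up to the stated 410,000 pairs, and under the stated assumptions the gate produces 192 x 50 x 3 = 28,800 fusion weights across one 50-step trajectory.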
5. Performance and Empirical Analysis
Empirical evaluation demonstrates that TLPO surpasses four state-of-the-art baselines (FantasyTalking, HunyuanAvatar, OmniAvatar, MultiTalk) across all core metrics. The following table summarizes key results:
| Metric | Baseline | TLPO |
|---|---|---|
| HKC (↑) | 0.838 | 0.895 |
| Sync-C (↑) | 3.154 | 5.704 |
| FID (↓) | 43.137 | 35.438 |
| FVD (↓) | 483.108 | 341.181 |
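To put the table in relative terms, the short script below computes the percentage improvement of TLPO over the baseline column for each metric, flipping the sign for metrics where lower is better. The numbers are copied from the table; the helper name `relative_gain` is hypothetical.

```python
# (baseline, TLPO, higher_is_better) for each metric in the table.
results = {
    "HKC": (0.838, 0.895, True),
    "Sync-C": (3.154, 5.704, True),
    "FID": (43.137, 35.438, False),
    "FVD": (483.108, 341.181, False),
}


def relative_gain(baseline, tlpo, higher_is_better):
    """Fractional improvement over the baseline; positive means TLPO is better."""
    change = (tlpo - baseline) / baseline
    return change if higher_is_better else -change


gains = {name: relative_gain(*values) for name, values in results.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.1%} improvement")
```

The relative improvements come out to roughly 7% on HKC, 81% on Sync-C, 18% on FID, and 29% on FVD, so the largest relative change in this table is on the lip-sync metric.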
Ablation studies indicate that removing timestep gating, fusing at expert- or module-level, or using scalarized preference optimization significantly degrades performance, especially on motion and lip-sync dimensions. User ratings (on a 0–10 scale, 24 raters) show gains of 1.3 (MN), 0.8 (LS), and 1.0 (VQ) over the strongest baseline. Qualitative inspection reveals that TLPO yields more natural head/hand motion, accurate mouth shapes over long sequences, and sharper facial details compared to prior methods.
6. Limitations, Generalizations, and Future Directions
TLPO's two-stage training procedure introduces procedural complexity and necessitates careful curation of full-dimension preference pairs. While the gating mechanism adds minimal parameters, it can marginally increase inference latency. The framework, however, is broadly generalizable. The principal recipe—decoupling conflicting objectives into specialized expert adapters and dynamically reweighting them along network axes—may be extended to other multi-objective generative tasks, including:
- Text-to-image diffusion models balancing style and content
- Video style transfer, mediating temporal coherence and per-frame fidelity
- Any generative process where objectives are spatially, temporally, or functionally separable
Potential future directions noted include automatic discovery of new expert axes (e.g., emotion), meta-learning for adaptable gating policies, and unified, closed-loop optimization jointly training both reward model and generative process (Wang et al., 15 Aug 2025).
7. Summary
Timestep-Layer Adaptive Multi-Expert Preference Optimization (TLPO) enables diffusion-based generative models to resolve conflicts among multidimensional, possibly antagonistic, human preferences. It first optimizes each preference dimension in isolation and then adaptively combines the resulting experts at the level of diffusion timestep and network layer. Empirical results support TLPO's efficacy in aligning portrait animation outputs with human judgments on motion, lip synchronization, and visual quality, and suggest that the approach may carry over to other multi-objective preference alignment scenarios (Wang et al., 15 Aug 2025).