MLP-Based Retargeting Strategy
- MLP-based retargeting is a framework using multilayer perceptrons to compute spatially varying transformations across images, videos, and 3D data.
- It leverages pure MLP deformation fields and hybrid Conv+MLP modules to generate importance maps and control keypoints for content-aware processing.
- The approach integrates supervised and self-supervised training regimes with specialized loss functions to minimize distortion and preserve semantic content.
MLP-based retargeting strategies leverage multilayer perceptrons or MLP-like modules as core function approximators to compute spatially-varying transformations or importance distributions, guiding content-aware manipulation of images, video, and higher-dimensional data. Techniques range from explicit MLP deformation fields to MLP-based importance estimation, control modules for keypoint retargeting, and hybrid models blending convolutional backbones with fully connected fusion heads. These approaches are used to achieve content-preserving resizing, morphing, and controllable motion transfer, with architecture, loss functions, and pipeline integration tailored to specific modalities and application goals.
1. Network Architectures in MLP-Based Retargeting
MLP-based retargeting architectures fall into two main categories: pure MLP deformation fields and MLP-enhanced fusion or control modules.
- MLP Importance Map Fusion: In "Content-aware media retargeting based on deep importance map," the retargeting module consists of a VGG-16 backbone producing multi-scale feature maps (F₃, F₄, F₅), followed by a three-branch fusion head where each branch comprises upsampling, convolutions, and final per-pixel “perceptron” (1×1 conv + sigmoid), yielding an importance map for each input (Le et al., 2021). The fusion head is interpretable as a shallow MLP operating on upsampled, concatenated multi-level features.
- Pure MLP Deformation Fields: In "Retargeting Visual Data with Deformation Fields," a 4-layer MLP D_θ is defined over input coordinates (with sinusoidal positional encoding), outputting local displacements along a user-specified direction for 2D images or for 3D data. Separate auxiliary MLPs are used for energy and cumulative-energy estimation (Elsner et al., 2023).
- MLP-Based Motion and Keypoint Control: In "LivePortrait," fully-connected MLPs take as input flattened implicit keypoints and auxiliary parameters, outputting local per-keypoint displacements for retargeting facial geometry or correcting stitching artifacts. Architectures are compact (4–6 hidden layers, ReLU), and the modules are designed for negligible runtime cost and plug-in integration with downstream warping/generation networks (Guo et al., 3 Jul 2024).
- Hybrid Conv+MLP Policy Modules: In multi-operator RL-based retargeting, convolutional feature extractors are coupled with several fully-connected layers to produce latent representations used by policy and value heads or by LSTMs for sequential decision processes (Kajiura et al., 2020).
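The coordinate-MLP building block above can be sketched generically. The following minimal numpy example (an illustration, not any cited paper's released code; layer widths and frequency counts are arbitrary choices) shows an MLP with sinusoidal positional encoding predicting a per-point displacement:

```python
import numpy as np

def positional_encoding(p, num_freqs=4):
    """Map coordinates in [0,1] to sinusoidal features (NeRF-style encoding)."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = p[..., None] * freqs            # (..., dims, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*p.shape[:-1], -1)  # flatten per-coordinate features

def mlp_displacement(p, weights, biases):
    """4-layer MLP mapping encoded coordinates to a scalar displacement
    along the retargeting axis (ReLU hidden activations)."""
    x = positional_encoding(p)
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ W + b, 0.0)       # ReLU hidden layers
    return x @ weights[-1] + biases[-1]      # linear output: displacement

# Tiny random instance: 2D coords -> 16 encoded features -> scalar offset
rng = np.random.default_rng(0)
dims = [16, 32, 32, 32, 1]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

p = rng.uniform(size=(5, 2))                 # five query points
d = mlp_displacement(p, weights, biases)
print(d.shape)                               # (5, 1): one offset per point
```

In practice such a field is trained with the self-supervised losses discussed later, and queried densely to warp the input.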
2. Formulation of Retargeting Tasks
The retargeting objective is to spatially adapt input data while minimizing content loss, geometric distortion, or perceptual inconsistency.
- Importance-Map-Driven Retargeting:
- For each pixel p, an importance value I(p) quantifies local importance; low-importance regions are preferentially removed or distorted.
- In seam carving, the cost of a vertical seam s is the summed importance of its pixels, E(s) = Σ_{p∈s} I(p); the lowest-cost seam is iteratively removed (Le et al., 2021).
- In warping, images are segmented into patches, each assigned an energy aggregated from its pixels' importance. Mesh-warping penalties are weighted accordingly.
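The seam-removal step over an importance map follows the standard dynamic program; the sketch below is a generic implementation, independent of the cited systems:

```python
import numpy as np

def min_vertical_seam(importance):
    """Return the minimum-cost vertical seam (one column index per row)
    for an H x W importance map, via dynamic programming."""
    H, W = importance.shape
    cost = importance.astype(float).copy()
    for i in range(1, H):
        # cheapest predecessor among the three pixels above (left, center, right)
        left = np.r_[np.inf, cost[i - 1, :-1]]
        right = np.r_[cost[i - 1, 1:], np.inf]
        cost[i] += np.minimum(np.minimum(left, cost[i - 1]), right)
    # backtrack from the cheapest bottom-row pixel
    seam = [int(np.argmin(cost[-1]))]
    for i in range(H - 2, -1, -1):
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, W)
        seam.append(lo + int(np.argmin(cost[i, lo:hi])))
    return seam[::-1]

imp = np.array([[5, 1, 5],
                [5, 1, 5],
                [5, 1, 5]])
print(min_vertical_seam(imp))  # [1, 1, 1]: the low-importance column
```

Removing the returned seam and iterating shrinks the width one pixel at a time while sparing high-importance content.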
- Deformation Field Retargeting:
- The deformation field D_θ maps spatial locations to retargeted coordinates. The MLP is trained so that distortion (stretch/shear) is localized to low-energy regions (e.g., low image gradient or low semantic importance).
- Training losses penalize stretch, shear, boundary misalignment, and non-monotonicity, combined as a λ-weighted sum in the total loss (Elsner et al., 2023).
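For intuition, two of these penalties can be approximated on a 1D deformation with finite differences; the function names and weights below are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def deformation_losses(x, d, energy, lam_stretch=1.0, lam_mono=10.0):
    """Finite-difference sketch of two self-supervised terms:
    - stretch: penalize local scaling where the energy is high
    - monotonicity: penalize fold-backs (decreasing retargeted coords)
    x: sorted sample coordinates; d: MLP displacements; energy: importance at x.
    """
    y = x + d                                  # retargeted coordinates
    dy = np.diff(y) / np.diff(x)               # local stretch factor
    e_mid = 0.5 * (energy[:-1] + energy[1:])   # energy between samples
    loss_stretch = np.mean(e_mid * (dy - 1.0) ** 2)
    loss_mono = np.mean(np.maximum(-dy, 0.0))  # hinge on negative slopes
    return lam_stretch * loss_stretch + lam_mono * loss_mono

x = np.linspace(0, 1, 11)
d = -0.3 * x                                   # uniform 30% shrink
energy = np.ones_like(x)
print(round(deformation_losses(x, d, energy), 3))  # 0.09: pure stretch penalty
```

With spatially varying energy, gradient descent on such terms pushes the shrinkage into low-energy regions rather than spreading it uniformly.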
- MLP-Keypoint Retargeting in Portrait Animation:
- For a source frame and driving frame, retargeting is implemented by adding an MLP-predicted per-keypoint offset Δ to the implicit keypoints. Modularity allows independent control of blending (stitching), eye pose, and lip pose via separate MLPs with task-specific input features and conditioning parameters (Guo et al., 3 Jul 2024).
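Schematically, the offset prediction can be sketched as a small MLP over flattened keypoints plus a conditioning scalar; the feature layout and layer sizes below are assumptions for illustration, not LivePortrait's exact architecture:

```python
import numpy as np

def retarget_keypoints(kp, condition, W1, b1, W2, b2):
    """Predict per-keypoint offsets from flattened keypoints plus a
    conditioning scalar (e.g. a target eye-open ratio), then apply them:
    kp' = kp + Delta, with Delta = MLP([kp.flatten(), condition])."""
    feats = np.concatenate([kp.ravel(), [condition]])
    h = np.maximum(feats @ W1 + b1, 0.0)       # ReLU hidden layer
    delta = (h @ W2 + b2).reshape(kp.shape)    # per-keypoint displacement
    return kp + delta

rng = np.random.default_rng(1)
kp = rng.uniform(size=(21, 3))                 # 21 hypothetical 3D keypoints
in_dim = 21 * 3 + 1
W1, b1 = rng.normal(scale=0.01, size=(in_dim, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.01, size=(64, 21 * 3)), np.zeros(21 * 3)

kp_new = retarget_keypoints(kp, 0.8, W1, b1, W2, b2)
print(kp_new.shape)                            # (21, 3)
```

Because the module only shifts keypoints, it can sit in front of an existing warping/generation network without modifying it.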
3. Training Regimes and Loss Functions
- Supervised Saliency Estimation:
- Cross-entropy loss is optimized between the predicted per-pixel importance ŷ(p) and a human-annotated saliency mask y(p):
  L_CE = −Σ_p [ y(p) log ŷ(p) + (1 − y(p)) log(1 − ŷ(p)) ]
  (Le et al., 2021).
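A per-pixel binary cross-entropy of this form is straightforward to compute; the clipping below is a standard numerical-stability detail, not specific to the cited work:

```python
import numpy as np

def saliency_bce(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between a predicted importance
    map and a binary saliency mask."""
    p = np.clip(pred, eps, 1.0 - eps)          # avoid log(0)
    return -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

pred = np.array([[0.9, 0.1],
                 [0.8, 0.2]])
mask = np.array([[1.0, 0.0],
                 [1.0, 0.0]])
print(round(saliency_bce(pred, mask), 4))  # 0.1643
```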
- Self-supervised Deformation Learning:
- Hand-crafted losses incorporate spatial energy (image gradient, surface curvature) with physics-inspired deformation constraints. The approach is fully self-supervised, relying only on the input data and target retargeting ratio (Elsner et al., 2023).
- Conditional Self-reconstruction and Perceptual Losses:
- In keypoint retargeting, the training objective minimizes the region-masked L1 error between the generated output with retargeted keypoints and a reconstruction target, with auxiliary regularization on the predicted offsets Δ and conditional-scalar consistency (e.g., enforcing a target eye open/close ratio) (Guo et al., 3 Jul 2024).
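The region-masked L1 term can be sketched generically; the binary mask here stands in for region masks such as those around the eyes or lips:

```python
import numpy as np

def masked_l1(generated, reference, mask):
    """Mean absolute error restricted to a binary region mask
    (e.g. around the eyes or lips in a portrait)."""
    m = mask.astype(bool)
    return np.abs(generated[m] - reference[m]).mean()

gen = np.array([[0.5, 0.2], [0.1, 0.9]])
ref = np.array([[0.4, 0.2], [0.3, 0.9]])
mask = np.array([[1, 0], [1, 0]])
print(round(masked_l1(gen, ref, mask), 2))  # 0.15: errors 0.1 and 0.2 averaged
```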
- Actor-Critic Reinforcement Learning:
- Losses are derived from self-play rewards based on BDW (block difference-based) perceptual distance, with per-action frequency balancing to avoid degenerate operator selection. Gradients for both policy and value networks accumulate per-step rewards weighted by frequency-adjusted terms (Kajiura et al., 2020).
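One plausible reading of per-action frequency balancing is to scale each step's reward inversely with how often its operator has been selected; the exact scheme below is an assumption for illustration, not the paper's formula:

```python
import numpy as np

def balanced_advantages(rewards, actions, num_actions):
    """Scale each step's reward inversely with how often its action has
    been chosen so far, discouraging collapse onto a single operator."""
    counts = np.zeros(num_actions)
    out = []
    for r, a in zip(rewards, actions):
        counts[a] += 1
        out.append(r * len(actions) / (num_actions * counts[a]))
    return np.array(out)

rewards = [1.0, 1.0, 1.0, 1.0]
actions = [0, 0, 0, 1]          # operator 0 over-selected
print(balanced_advantages(rewards, actions, num_actions=3).round(2))
```

Repeated picks of operator 0 receive progressively smaller weights, while the rarely chosen operator 1 keeps a larger one.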
4. Pipeline Integration and Applications
| Strategy | Data Type(s) | Core Retargeting Step |
|---|---|---|
| Deep importance map (VGG+MLP) | Image, video | Seam carving/patch warping |
| MLP deformation field | Image, NeRF, mesh | Deformation via D_θ(p) |
| MLP-based keypoint retargeting | Portrait animation | Keypoint offset Δ prediction |
| Conv+MLP RL operator selection | Image | Operator policy π(a\|s) |
- Image resizing uses MLP-produced importance scores (or deformations) to minimize distortion of semantically-significant regions, with retargeting via seam carving, warping, or learned displacement fields (Le et al., 2021, Elsner et al., 2023).
- Video retargeting extends image methods by applying the same importance or deformation mapping per frame, ensuring temporal consistency through shared network weights or consistent field application across frames (Le et al., 2021).
- Portrait animation leverages fully-connected control nets to enable fine-grained manipulation of facial expressions and pose transfer with per-keypoint offsets, modularized for stitching, eyes, and lips (Guo et al., 3 Jul 2024).
5. Quantitative and Qualitative Evaluation
- Objective metrics for image retargeting include Aspect Ratio Similarity (ARS), FID (Fréchet Inception Distance), and BDW perceptual distance.
- The deep importance-map system achieves ARS ≈ 0.92 on RetargetMe, surpassing SC(2007) at 0.80 and warping at 0.89. On challenging images, it yields best ARS in 79% of test cases (Le et al., 2021).
- MLP deformation fields decrease FID from 52.57 → 46.68 (mean) on width shrinkage, outperforming seam carving (no FID reduction) (Elsner et al., 2023).
- RL-based pipelines match MULTIOP image quality (BDW 2.27 vs. 2.29) while accelerating processing by three orders of magnitude (Kajiura et al., 2020).
- LivePortrait’s MLP modules maintain <0.5% overhead and support real-time generation (12.8 ms/frame), with ablation studies confirming critical improvements in alignment and expression controllability (Guo et al., 3 Jul 2024).
- Qualitative performance demonstrates more coherent boundaries, better preservation of geometric features, and improved control for high-frequency regions or multiple salient objects. Ablation studies consistently show that removing energy-based or region-specific loss components results in artifacts (jagged seams, shrinkage, boundary misalignments, or fold-back effects) (Le et al., 2021, Elsner et al., 2023, Guo et al., 3 Jul 2024).
6. Generalization, Limitations, and Domain Extensions
MLP-based retargeting shows significant generalization properties:
- Modality-agnostic deformation: The deformation field approach operates directly on 2D images, 3D NeRFs, or triangle meshes, by adapting the energy definition and domain inputs (Elsner et al., 2023).
- Self-supervised adaptation: Formulations relying only on hand-crafted losses are not constrained to the domain of images with saliency annotations, enabling retargeting for arbitrary visual data given only the input and target size (Elsner et al., 2023).
- Plug-in control modules: Small MLPs for keypoint retargeting can be retrofitted into larger motion transfer or animation pipelines without retraining primary networks, providing direct manipulation interfaces for controlling visual features (Guo et al., 3 Jul 2024).
Limitations arise in purely importance-based methods, which can struggle if importance estimators or energy definitions do not align with high-level semantics, and in policies that collapse to uniform scaling or cropping if not adequately regularized (Le et al., 2021, Kajiura et al., 2020). Integration of multi-scale fusion, explicit geometric priors, or reinforcement learning mitigates some of these issues.
7. Summary and Outlook
MLP-based retargeting strategies represent a unifying framework for content-aware transformation of images, videos, neural fields, and meshes. Core features include lightweight, plug-in field or control estimators, direct parametrization of spatially-varying transformations, and the capacity for both supervised and self-supervised optimization without domain-specific procedural code. These approaches currently match or exceed classical methods on standard benchmarks, improve efficiency by orders of magnitude in hybrid settings, and broaden applicability to new data modalities and interactive control. The ongoing evolution of MLP-based modules—whether in fusion heads, deformation fields, or operator selection policies—suggests enduring relevance for real-time, generalizable, and highly controllable retargeting pipelines (Le et al., 2021, Kajiura et al., 2020, Elsner et al., 2023, Guo et al., 3 Jul 2024).