StyleYourSmile: One-Shot Cross-Domain Retargeting

Updated 4 July 2026

StyleYourSmile is a one-shot cross-domain face retargeting framework that generates images by preserving the source identity and style while transferring target facial geometry, expression, and pose.
It employs a disentangled architecture with separate identity, style, and spatial pathways using ArcFace, CLIP-based encoders, and ControlNet for effective conditioning.
Synthetic multi-style supervision via offline augmentation replaces curated paired data, reducing training complexity and achieving competitive quantitative metrics.

StyleYourSmile is a diffusion-based framework for one-shot cross-domain face retargeting. It takes a source face image, which defines identity and domain-specific appearance, and a target face image, which defines facial geometry, expression, and pose, and generates a new image that preserves the source identity and appearance while transferring the target expression and pose. Its defining claim is that this can be done without curated paired multi-style data, without test-time optimization, and with a substantially lighter setup than recent large video-diffusion systems. The method is centered on a disentanglement problem in which domain-invariant identity, domain-specific style, and expression/pose geometry must be separated and then recombined through conditional latent diffusion (Dey et al., 1 Dec 2025).

1. Problem domain and scope

Cross-domain face retargeting extends conventional face reenactment by introducing a further axis of variation beyond identity, expression, pose, and lighting: domain-specific style. In this setting, the difference between a real photograph, sketch, painting, or vintage portrait is not merely a texture shift. The paper emphasizes that cues such as dimples or wrinkles may appear as high-frequency shading in a photograph but as a few brush strokes in a painting, so a domain shift changes both appearance statistics and the way semantic structures are visually encoded (Dey et al., 1 Dec 2025).

The task formulation is correspondingly strict. The source image $I_{src}$ specifies whose face and what visual domain to preserve, while the target image $I_{tgt}$ specifies the desired facial geometry, expression, and pose. The generated output is expected to keep the source identity and style while reproducing the target geometry. In the paper’s usage, one-shot means that inference requires only a single source image and a single target image; there is no subject-specific fine-tuning, no optimization over a personal album, and no requirement to collect multiple images of the subject at test time (Dey et al., 1 Dec 2025).

This problem setting differs from adjacent portrait stylization tasks. Few-shot portrait stylization based on geometric alignment focuses on portrait-to-style translation from small style datasets and emphasizes TPS-based landmark alignment for stylization quality and mobile deployment, not expression/pose retargeting (Wang et al., 2022). One-shot stylization via disentanglement and recombination separates content and style identifiers in a latent diffusion model, but its target is exemplar-guided face stylization rather than source-style-preserving facial reenactment (Li et al., 2023). Multi-exemplar portrait stylization similarly transfers photographic style from exemplar collections while preserving facial structure, but does not retarget expression or pose (Song et al., 2017).

2. Disentangled architecture and conditional pathways

StyleYourSmile organizes the task around four components: style augmentation pre-processing, an identity encoder branch, a style encoder branch, and a diffusion retargeting network with ControlNet-based spatial conditioning (Dey et al., 1 Dec 2025). The conceptual separation is explicit: the identity path answers who, the style path answers how the source domain looks, and the target spatial path answers what expression and pose to perform.

Branch	Input	Representation and use
Identity	$I_{src}$	ArcFace embedding $f_{id}$ mapped to $C_{id}$, fed directly to the U-Net
Style	$I_{src}$	CLIP image features mapped to $C_{sty}$, fused with spatial condition in ControlNet
Expression/pose	$I_{tgt}$	3DMM landmarks plus masks, used as spatial ControlNet conditioning

Identity is extracted from the source image with an ArcFace-based encoder $E_{id}$, producing a $512$-D embedding $f_{id}$. Because the diffusion model expects CLIP-text-space conditioning, a shallow transformer projector maps this embedding into identity tokens $C_{id}$. The paper explicitly treats this branch as the carrier of domain-invariant identity information (Dey et al., 1 Dec 2025).

Style is extracted from the same source image by a CLIP-based style encoder that uses both patch-level and global tokens. A second shallow transformer projector, with learnable queries, maps those features into CLIP-text-space style tokens $C_{sty}$. This representation is intended to capture domain-specific stylistic variation, including lighting, hair, local texture, color grading, and related perceptual details that the identity branch omits (Dey et al., 1 Dec 2025).

Expression and pose are not represented as a free latent code. Instead, the target image is processed by fitting a 3D morphable model with Deep3DFaceRecon, rendering target facial landmarks, extracting source and target foreground masks, blending those masks to reduce background artifacts, and assembling a composite spatial control image. The paper characterizes this control as “a composite of 3DMM landmarks and foreground masks,” and routes it through ControlNet rather than injecting it as an unconstrained latent (Dey et al., 1 Dec 2025).

A crucial architectural choice is the separation of identity and style injection. The identity tokens $C_{id}$ are fed directly to the denoising U-Net, whereas the style tokens $C_{sty}$ are fused with the spatial condition inside ControlNet. The paper’s ablations report that direct style-token injection into the U-Net distorts facial features and color tone, while routing style through ControlNet better preserves identity and retains style (Dey et al., 1 Dec 2025).
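The routing described above can be summarized as a small lookup structure. This is only an illustrative sketch: the class `ConditionBranch`, its field names, and the site labels `"unet"` and `"controlnet"` are hypothetical names chosen here, not identifiers from the paper or from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionBranch:
    name: str           # role of the branch
    source_image: str   # which input image feeds it: "src" or "tgt"
    signal: str         # what the branch produces
    injected_into: str  # where the signal enters the generator


# Routing as described in the text; all names are illustrative assumptions.
BRANCHES = (
    ConditionBranch("identity", "src", "identity tokens C_id", "unet"),
    ConditionBranch("style", "src", "style tokens C_sty", "controlnet"),
    ConditionBranch("expression_pose", "tgt", "landmark and mask control image", "controlnet"),
)


def group_by_injection_site(branches):
    """Return {site: [branch names]} in declaration order."""
    sites = {}
    for branch in branches:
        sites.setdefault(branch.injected_into, []).append(branch.name)
    return sites


routing = group_by_injection_site(BRANCHES)
print(routing)
```

Grouping by injection site makes the design choice visible: only the identity signal reaches the U-Net directly, while style shares the ControlNet path with the spatial condition.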

The generator uses Stable Diffusion v1-5 as the base U-Net, with the pretrained U-Net largely frozen and adapted through LoRA. The trainable components are the identity and style projectors, the ControlNet branch, and LoRA adapters on top of the frozen denoising U-Net. The optimization objective is the standard noise-prediction loss adapted to the conditioning setup. The method description includes no additional explicit identity-consistency, style-consistency, adversarial, perceptual, cycle, or contrastive losses (Dey et al., 1 Dec 2025).
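For orientation, the standard latent-diffusion noise-prediction objective, extended with the conditions named above, can be written roughly as follows. This is a sketch under assumed notation: $z_t$ (noised latent), $\epsilon_\theta$ (denoiser), and $c_{spatial}$ (the composite control image) are symbols introduced here for illustration, and the paper's own formula may be written differently.

```latex
\mathcal{L} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, C_{id},\, C_{sty},\, c_{spatial}\right) \right\rVert_2^2 \right]
```

In this form the only supervision is the injected noise $\epsilon$, which is consistent with the absence of separate identity or style loss terms.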

3. Synthetic multi-style supervision and training setup

A central feature of StyleYourSmile is the replacement of curated multi-style paired supervision with offline synthetic style augmentation. Training starts from VoxCeleb1 face imagery and creates stylized variants using the training-free diffusion style-transfer method of Chung et al., referred to as Style Injection. This preprocessing is not integrated into the training loop; it is performed offline, which the paper notes lowers training complexity and makes it possible to filter failed stylizations using a face detector (Dey et al., 1 Dec 2025).

The style-transfer process is based on DDIM inversion of a content image and a style image. At each diffusion timestep $t$, queries, keys, and values are extracted, and style is injected through decoder attention.

To avoid undesirable color or structure bias from selecting either content or style latents alone, the method additionally applies AdaIN to the initial latents. A scalar parameter of the style-transfer method controls content preservation, and the paper later reports that its value is selected by ArtFID ablation (Dey et al., 1 Dec 2025).
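As a reference point, the Style Injection method of Chung et al. is usually written along the following lines. The notation is an assumption, not taken from this paper: superscripts $c$, $s$, and $cs$ denote content, style, and stylized features, $\gamma$ is the content-preservation weight, and $\mu$, $\sigma$ are channel-wise mean and standard deviation. The formulation used by Dey et al. may differ in detail.

```latex
% Query preservation: blend the content query with the stylized query
\tilde{Q}^{cs}_t = \gamma\, Q^{c}_t + (1 - \gamma)\, Q^{cs}_t

% Style injection: attend with the style image's keys and values
\phi^{cs}_{out} = \mathrm{Attn}\!\left(\tilde{Q}^{cs}_t,\ K^{s}_t,\ V^{s}_t\right)

% Initial-latent AdaIN: content latent re-normalized to style statistics
z^{cs}_T = \sigma\!\left(z^{s}_T\right) \frac{z^{c}_T - \mu\!\left(z^{c}_T\right)}{\sigma\!\left(z^{c}_T\right)} + \mu\!\left(z^{s}_T\right)
```

Under this reading, a larger $\gamma$ keeps more of the content query and therefore more of the original facial structure.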

The training logic is that the same underlying identity is observed under multiple synthetic domains, encouraging the model to encode identity in a style-stable representation while placing domain-specific appearance in a separate representation. The paper describes this as a substitute for curated multi-style paired datasets. Supervision is therefore best characterized as not relying on curated paired multi-style annotations, while still using pseudo-paired stylized augmentations produced by preprocessing (Dey et al., 1 Dec 2025).

The implementation details reported in the paper are specific but incomplete. The base model is Stable Diffusion v1-5; the identity encoder is ArcFace; 3DMM fitting uses Deep3DFaceRecon; foreground segmentation uses an off-the-shelf segmentation network; the control module is ControlNet; and both projectors are 3-layer transformer modules, following FaceAdapter. Training uses 40,000 epochs, learning rate [value missing], batch size 4, and 4× NVIDIA A5000 GPUs. The figure of 40,000 may refer to iterations rather than epochs, but the paper says epochs (Dey et al., 1 Dec 2025).

For quantitative evaluation, the paper uses 20 subjects from the VoxCeleb1 test split, 3 video sequences per subject, and 10 frames per sequence, totaling 600 frames. These are augmented with 5 domain styles of varying abstraction to produce 3000 augmented frames for quantitative analysis. Additional qualitative comparison uses 3 subjects from GeneFace, in-the-wild faces, and unseen style domains for out-of-domain testing (Dey et al., 1 Dec 2025).
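The frame counts above follow from simple multiplication, shown here as a quick check. The variable names are illustrative; the numbers are the ones stated in the text.

```python
# Evaluation-set sizes as stated in the text.
subjects = 20
sequences_per_subject = 3
frames_per_sequence = 10
styles = 5

base_frames = subjects * sequences_per_subject * frames_per_sequence
augmented_frames = base_frames * styles

print(base_frames, augmented_frames)  # prints: 600 3000
```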

Several implementation details are not reported: sampler type for final inference, number of diffusion inference steps, guidance scale, exact ControlNet injection layers, LoRA rank, latent resolution, and optimizer type. These cannot be recovered from the text (Dey et al., 1 Dec 2025).

4. Evaluation protocol and reported performance

The evaluation is divided into self-retargeting and cross-identity retargeting. In self-retargeting, source and target involve the same identity, so comparison to a ground-truth frame is possible. In cross-identity retargeting, source and target differ, so only non-reconstruction metrics are used (Dey et al., 1 Dec 2025).

For self-retargeting, the paper reports PSNR, LPIPS, CS-ID, expression error, pose error, and ArtFID. ArtFID is defined as

$$\mathrm{ArtFID} = (1 + \mathrm{LPIPS}) \cdot (1 + \mathrm{FID})$$

where LPIPS measures content similarity to the original ground-truth image and FID measures style similarity to the augmented ground-truth image. For cross-identity retargeting, the reported metrics are CS-ID, expression error, and pose error (Dey et al., 1 Dec 2025).
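ArtFID is conventionally defined as the product $(1 + \mathrm{LPIPS}) \cdot (1 + \mathrm{FID})$, so lower is better and both terms contribute multiplicatively. A minimal sketch of that computation follows. The function name `artfid` and the input values are hypothetical; they are not measurements from the paper.

```python
def artfid(lpips: float, fid: float) -> float:
    """Combine a content distance (LPIPS) and a style distance (FID).

    Both inputs are distances, so they must be non-negative.
    Lower output means better content and style fidelity together.
    """
    if lpips < 0 or fid < 0:
        raise ValueError("LPIPS and FID must be non-negative")
    return (1.0 + lpips) * (1.0 + fid)


# Hypothetical inputs chosen for easy arithmetic, not values from the paper.
example = artfid(lpips=0.5, fid=19.0)
print(example)  # prints: 30.0
```

Because the two factors multiply, a method cannot reach a low ArtFID by preserving content while ignoring style, or the reverse.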

The baselines are HyperReenact, ROME, Arc2Face, and DiffusionRig. The paper notes that video-diffusion baselines are omitted due to compute constraints (Dey et al., 1 Dec 2025).

Setting	Reported outcome	Comparison highlighted by the paper
Self-retargeting	StyleYourSmile: PSNR 19.889, LPIPS 0.146, CS-ID 0.615, Exp. 0.241, Pose 6.321, ArtFID 32.377	Arc2Face has a slightly higher CS-ID (0.635), but StyleYourSmile is best on reconstruction (PSNR), perceptual quality (LPIPS), expression, pose, and ArtFID
Cross-identity retargeting	StyleYourSmile: CS-ID 0.553, Exp. 0.333, Pose 8.072	Arc2Face has the best CS-ID (0.606); StyleYourSmile has the lowest expression error and a lower pose error than Arc2Face, while HyperReenact has the lowest pose error (6.344)

In self-retargeting, the full numbers for the baselines and StyleYourSmile are as follows (Dey et al., 1 Dec 2025).

Method	PSNR	LPIPS	CS-ID	Exp.	Pose	ArtFID
HyperReenact	12.225	0.377	0.410	0.368	7.334	35.536
ROME	10.037	0.511	0.189	0.491	8.486	38.002
Arc2Face	9.403	0.455	0.635	0.394	7.198	41.177
DiffusionRig	13.650	0.402	0.324	0.273	7.034	35.392
StyleYourSmile	19.889	0.146	0.615	0.241	6.321	32.377

In cross-identity retargeting, the reported numbers are as follows (Dey et al., 1 Dec 2025).

Method	CS-ID	Exp.	Pose
HyperReenact	0.270	0.387	6.344
ROME	0.091	0.420	8.375
Arc2Face	0.606	0.551	9.071
DiffusionRig	0.221	0.414	9.111
StyleYourSmile	0.553	0.333	8.072
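The per-metric winners can be checked mechanically from the numbers listed above. The sketch below assumes the usual metric directions (higher is better for PSNR and CS-ID, lower is better for LPIPS, expression error, pose error, and ArtFID); the dictionary layout and the helper `best_method` are illustrative names, not part of the paper.

```python
# Reported self-retargeting results, copied from the text.
SELF = {
    "HyperReenact": {"PSNR": 12.225, "LPIPS": 0.377, "CS-ID": 0.410, "Exp": 0.368, "Pose": 7.334, "ArtFID": 35.536},
    "ROME": {"PSNR": 10.037, "LPIPS": 0.511, "CS-ID": 0.189, "Exp": 0.491, "Pose": 8.486, "ArtFID": 38.002},
    "Arc2Face": {"PSNR": 9.403, "LPIPS": 0.455, "CS-ID": 0.635, "Exp": 0.394, "Pose": 7.198, "ArtFID": 41.177},
    "DiffusionRig": {"PSNR": 13.650, "LPIPS": 0.402, "CS-ID": 0.324, "Exp": 0.273, "Pose": 7.034, "ArtFID": 35.392},
    "StyleYourSmile": {"PSNR": 19.889, "LPIPS": 0.146, "CS-ID": 0.615, "Exp": 0.241, "Pose": 6.321, "ArtFID": 32.377},
}

# Reported cross-identity results, copied from the text.
CROSS = {
    "HyperReenact": {"CS-ID": 0.270, "Exp": 0.387, "Pose": 6.344},
    "ROME": {"CS-ID": 0.091, "Exp": 0.420, "Pose": 8.375},
    "Arc2Face": {"CS-ID": 0.606, "Exp": 0.551, "Pose": 9.071},
    "DiffusionRig": {"CS-ID": 0.221, "Exp": 0.414, "Pose": 9.111},
    "StyleYourSmile": {"CS-ID": 0.553, "Exp": 0.333, "Pose": 8.072},
}

# Assumed metric directions: everything not listed here is lower-is-better.
HIGHER_IS_BETTER = {"PSNR", "CS-ID"}


def best_method(table, metric):
    """Return the method with the best value for one metric."""
    pick = max if metric in HIGHER_IS_BETTER else min
    return pick(table, key=lambda method: table[method][metric])


self_best = {m: best_method(SELF, m) for m in ("PSNR", "LPIPS", "CS-ID", "Exp", "Pose", "ArtFID")}
cross_best = {m: best_method(CROSS, m) for m in ("CS-ID", "Exp", "Pose")}
print(self_best)
print(cross_best)
```

Run on these numbers, the check places StyleYourSmile first on every self-retargeting metric except CS-ID, and first on expression error in the cross-identity setting, where HyperReenact has the lowest pose error.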

The qualitative analysis is consistent with these metrics. The paper reports that Arc2Face ignores domain-specific cues and struggles to stay aligned with the control signal; DiffusionRig, without personalized fine-tuning, misses identity, while with fine-tuning on a small image set it can overfit or memorize and ignore target conditioning; HyperReenact provides decent retargeting but misses fine-grained style details and stylized-domain fidelity; and ROME can be competitive when segmentation is accurate but is brittle on in-the-wild images and misses texture details (Dey et al., 1 Dec 2025).

5. Ablations, failure modes, and methodological boundaries

The ablation study is tightly aligned with the method’s core design claims. In the augmentation comparison, the paper evaluates AdaIN, AdaAttn, AdaConv, DiffStyle, and StyleInject using ArtFID and runtime. The reported ArtFID values are 30.93 (AdaIN), 30.35 (AdaAttn), 31.86 (AdaConv), 41.46 (DiffStyle), and 28.08 (StyleInject); because lower ArtFID is better, StyleInject is selected as the best augmentation method among those listed. The same section also identifies the content-preservation setting that gives the best ArtFID (Dey et al., 1 Dec 2025).

The style-conditioning-route ablation compares direct style-token injection into the denoising U-Net against the final ControlNet routing. Direct U-Net injection, called C1, yields ID 0.599 and Style 32.414, while the final model yields ID 0.615 and Style 32.377. The paper’s qualitative discussion states that direct U-Net style injection worsens facial fidelity and color stability and can distort facial features and tone (Dey et al., 1 Dec 2025).

A second ablation, C2, keeps the denoising U-Net fully frozen and removes LoRA adaptation. This yields ID 0.607 and Style 32.381, compared with ID 0.615 and Style 32.377 for the final model. The numerical difference is modest, but the paper describes the qualitative improvement as meaningful for style retention (Dey et al., 1 Dec 2025).
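The size of the two ablation gaps is easier to judge side by side. The sketch below computes final-minus-variant differences from the numbers quoted above. It assumes that "Style" is an ArtFID-type score where lower is better, which is suggested by the final value matching the reported ArtFID of 32.377; the variant labels and the helper `deltas` are illustrative.

```python
# Scores quoted in the text for the final model and the two ablations.
FINAL = {"ID": 0.615, "Style": 32.377}
ABLATIONS = {
    "C1 (style tokens into U-Net)": {"ID": 0.599, "Style": 32.414},
    "C2 (frozen U-Net, no LoRA)": {"ID": 0.607, "Style": 32.381},
}


def deltas(variant):
    """Final minus variant, rounded to the precision of the reported scores."""
    return {key: round(FINAL[key] - variant[key], 3) for key in FINAL}


gaps = {name: deltas(scores) for name, scores in ABLATIONS.items()}
for name, gap in gaps.items():
    # Positive ID gap and negative Style gap both favor the final model.
    print(name, gap)
```

Both variants trail the final model on both scores, with C1 showing the larger gap, which matches the text's emphasis on the style-routing choice.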

These ablations support four specific conclusions stated in the paper: synthetic style diversity helps disentanglement; separate identity and style pathways matter; style should be routed through ControlNet rather than directly through the U-Net; and lightweight U-Net adaptation improves domain-style rendering (Dey et al., 1 Dec 2025).

The supplementary material also lists explicit failure cases. The model can fail on animations/anime, because the underlying face detector may not recognize those faces; on faces with complex CGI effects, which can cause significant identity drift; and under certain occlusions, where source-specific attributes may be ignored, including an example in which an eye patch is not preserved even though expression transfer still works (Dey et al., 1 Dec 2025).

The paper also places ethical risk within scope. It notes that misuse for misinformation is possible and suggests digital watermarks as a mitigation. This does not alter the technical method, but it clarifies that efficient cross-domain face retargeting retains the standard synthetic-media risks associated with face reenactment systems (Dey et al., 1 Dec 2025).

6. Position within adjacent research areas

StyleYourSmile occupies a specific point in the broader literature on face stylization, mouth editing, and smile visualization. Portrait stylization work based on geometric alignment shows that explicit facial geometry can improve transfer from very small style datasets and even support real-time mobile deployment, but its problem is portrait-to-style translation rather than source-style-preserving reenactment (Wang et al., 2022). One-shot face stylization by disentangling content and style identifiers in latent diffusion similarly addresses identity-preserving stylization from a single artistic target, but it does not separate target expression and pose into a dedicated geometry pathway (Li et al., 2023). Diffusion-based data augmentation for few-shot stylization likewise enlarges small style datasets offline and trains a fast stylizer, which is conceptually close to StyleYourSmile’s use of synthetic augmentation, but it remains a stylization method rather than a cross-domain retargeting system (Matiyali et al., 23 Aug 2025).

Image-based retargeting in StyleYourSmile also differs from explicitly 3D formulations. Stylized 3D morphable models maintain fixed mesh connectivity and explicit expression control over shape, texture, and expression parameters, while exemplar-based 3D portrait stylization separates geometric style transfer from texture optimization in canonical UV space (Lee et al., 15 Aug 2025, Han et al., 2021). StyleYourSmile does not produce a stylized 3D mesh or reuse a 3D expression basis at inference; instead, it uses 3DMM-derived landmarks and masks only as spatial control for image generation (Dey et al., 1 Dec 2025). This suggests complementarity rather than replacement: image-space retargeting offers one-shot deployment without mesh reconstruction, whereas 3D stylization offers explicit animatability.

The method is also adjacent to mouth- and smile-centric systems without being reducible to them. Style-based localized lip synchronization preserves lower-face identity details through mask-guided spatial encoding and style-space adaptation, but it is audio-driven and optimized for speech alignment rather than cross-domain style preservation (Guan et al., 2023). Landmark-guided diverse smile generation models multiple smile trajectories from a neutral face using multimodal landmark dynamics, but it is class-conditioned video generation rather than source-style-preserving retargeting (Wang et al., 2018). In dental aesthetics, data-driven smile design and 3D structure-guided tooth alignment address personalized candidate generation, attractiveness filtering, or realistic aligned-teeth previews from ordinary photographs, but their objectives are aesthetic selection or orthodontic visualization rather than cross-domain reenactment (Lin et al., 15 Sep 2025, Dou et al., 2023).

Taken together, these neighboring literatures indicate why StyleYourSmile matters technically. It is not simply a portrait stylizer, not a pure identity-preservation model, not a smile-design recommender, and not a 3D avatar system. Its contribution lies in making domain-invariant identity, domain-specific appearance, and target geometry independently controllable within a one-shot diffusion framework that does not require curated paired multi-style data or test-time personalization (Dey et al., 1 Dec 2025). A plausible implication is that future systems may combine this disentangled image-space retargeting with more explicit mouth-region control, 3D animatability, or dental-structure priors, but those extensions are outside the scope of the method as reported.