- The paper introduces a novel diffusion-based framework that employs symmetric contrastive learning and latent interpolation for continuous, disentangled facial expression editing.
- The paper presents the FFE dataset of 60,000 images and the FFE-Bench benchmark, enabling multi-dimensional evaluation of identity consistency and edit controllability.
- The paper demonstrates superior performance over competing methods, with significantly improved editing accuracy, reduced semantic confusion, and robust identity preservation.
Fine-Grained Facial Expression Editing via PixelSmile
Introduction
PixelSmile addresses fundamental challenges in controllable facial expression editing, notably the underlying semantic entanglement between expression categories and the limitations imposed by discrete supervision. Existing diffusion-based and identity-consistent generation models, while highly capable in visual realism and overall editing, struggle with precise, fine-grained, and linearly-controllable manipulation—particularly when expression categories have intrinsic semantic overlap. PixelSmile introduces (1) the FFE dataset with continuous affective annotation, (2) the FFE-Bench benchmark for rigorous multi-dimensional evaluation, and (3) a diffusion-based framework leveraging symmetric contrastive learning and flow-matching-based latent interpolation for disentangled, identity-consistent expression editing.
Figure 1: Inherent expression overlap causes systematic confusion for annotators, recognition, and generative models; PixelSmile resolves this via continuous supervision and symmetric training, shown schematically alongside the FFE dataset and framework.
A key insight from the analysis is that facial expressions form a continuous, overlapping manifold rather than strictly separable categories. Human raters, classifiers, and generative models all suffer confusion (e.g., fear vs. surprise, anger vs. disgust), exacerbated by rigid one-hot supervision and discretely labeled datasets. To resolve this structured confusion, PixelSmile introduces the FFE dataset: 60,000 curated images (real and anime) that vary expression intensity for the same identity, annotated with 12-dimensional continuous score vectors via VLMs and validated with human oversight. This continuous annotation captures semantic overlap and enables systematic study of fine-grained editing and disentanglement.
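To make the annotation format concrete, here is a minimal sketch of how one FFE-style record might be represented. The 12 expression axis names, field names, and file paths are placeholders, since the summary above does not enumerate the actual label set.

```python
from dataclasses import dataclass, field

# Hypothetical 12-axis affect inventory: the actual FFE categories are not
# listed above, so these names are illustrative placeholders only.
EXPRESSION_AXES = [
    "happy", "sad", "angry", "disgusted", "fearful", "surprised",
    "contemptuous", "confused", "excited", "bored", "embarrassed", "neutral",
]

@dataclass
class FFERecord:
    """One FFE-style record: an image plus a continuous 12-d expression score vector."""
    image_path: str
    identity_id: str            # the same identity appears at multiple intensities
    domain: str                 # "real" or "anime"
    scores: dict = field(default_factory=dict)   # axis name -> score in [0, 1]

    def vector(self) -> list:
        """Scores in a fixed order, so overlap (e.g. fear vs. surprise) shows up
        as two simultaneously high entries instead of a single hard label."""
        return [float(self.scores.get(name, 0.0)) for name in EXPRESSION_AXES]

# A face that raters and classifiers tend to confuse: fear and surprise both high.
sample = FFERecord(
    image_path="ffe/real/0001_03.png",
    identity_id="0001",
    domain="real",
    scores={"fearful": 0.7, "surprised": 0.55},
)
print(sample.vector())
```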
FFE-Bench, built atop FFE, introduces precise metrics:
- Mean Structural Confusion Rate (mSCR): Quantifies confusion between semantically overlapping expression pairs (e.g., fear/surprise).
- Harmonic Editing Score (HES): Fuses expression score with identity similarity, requiring both for a high score.
- Control Linearity Score (CLS): Evaluates whether expression intensity follows a linear response to editing coefficients.
- Editing Accuracy: Measures top-1 expression correctness.
These metrics address the limitations of legacy benchmarks, which could not adequately measure controllability or disentanglement.
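No closed-form definitions are given above, but the metric descriptions suggest simple computations. The sketch below is one plausible reading, assuming HES is a harmonic mean of an expression-edit score and identity similarity, CLS a correlation between the requested coefficient α and the measured expression intensity, and mSCR an average swap rate over confusable pairs; all function names and inputs are illustrative.

```python
import numpy as np

def harmonic_editing_score(expr_score: float, id_similarity: float, eps: float = 1e-8) -> float:
    """Plausible reading of HES: harmonic mean of the expression-edit score and
    identity similarity, so a weak value on either side caps the overall score."""
    return 2.0 * expr_score * id_similarity / (expr_score + id_similarity + eps)

def control_linearity_score(alphas: np.ndarray, measured_intensity: np.ndarray) -> float:
    """Plausible reading of CLS: correlation between the requested editing
    coefficient alpha and the expression intensity measured on the edited images.
    Monotonic, linear control pushes this toward 1; a flat, erratic, or inverted
    response pushes it toward 0 or below."""
    return float(np.corrcoef(alphas, measured_intensity)[0, 1])

def mean_structural_confusion_rate(pred, target, confusable_pairs) -> float:
    """Plausible reading of mSCR: for each confusable pair (a, b), the rate at
    which an edit toward one member is recognized as the other, averaged over pairs."""
    pred, target = np.asarray(pred), np.asarray(target)
    rates = []
    for a, b in confusable_pairs:
        mask = np.isin(target, [a, b])
        swapped = [(t == a and p == b) or (t == b and p == a)
                   for p, t in zip(pred[mask], target[mask])]
        rates.append(np.mean(swapped) if swapped else 0.0)
    return float(np.mean(rates))

# Toy usage with made-up numbers.
print(harmonic_editing_score(expr_score=0.9, id_similarity=0.3))          # low identity similarity caps HES
alphas = np.linspace(0.0, 1.0, 6)                                          # six intensity steps
print(control_linearity_score(alphas, measured_intensity=alphas ** 1.1))   # close to 1
print(mean_structural_confusion_rate(
    pred=["surprised", "fearful", "fearful"],
    target=["fearful", "fearful", "surprised"],
    confusable_pairs=[("fearful", "surprised")],
))  # 2 of 3 edits land on the paired category -> 0.666...
```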
The PixelSmile Framework
PixelSmile is a parameter-efficient framework built on a Multi-Modal Diffusion Transformer (MMDiT) with LoRA adaptation. It combines two crucial mechanisms:
- Latent Interpolation in Text Space: Linear interpolation between the "neutral" and target expression text embeddings, with the coefficient α continuously scaling expression intensity. Supervision (score matching) ensures that the visual intensity aligns with the chosen α, and extrapolation beyond α = 1 remains possible (a sketch covering both mechanisms follows this list).
Figure 2: The method interpolates between neutral and target expression embeddings for continuous control and utilizes symmetric joint training to separate confusing pairs while preserving identity.
- Symmetric Joint Training: For each confusing category pair, symmetric contrastive objectives are employed—alternating which sample is the positive/negative in the triplet loss. This symmetric design reduces directional bias and directly separates entangled categories in the learned representation. An identity loss (ArcFace-based) regularizes the system against identity drift, especially at high intensities.
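A minimal sketch of both mechanisms, under explicit assumptions: the interpolation is taken literally as a convex combination of text embeddings, the symmetric objective is written as an InfoNCE-style loss applied in both directions of a confusable category pair, and the identity term is a cosine penalty on ArcFace embeddings. Shapes, encoders, and loss weights are placeholders rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def interpolate_expression(e_neutral: torch.Tensor, e_target: torch.Tensor, alpha: float) -> torch.Tensor:
    """Continuous intensity control: alpha=0 gives the neutral embedding, alpha=1
    the full target expression; alpha slightly above 1 extrapolates further."""
    return (1.0 - alpha) * e_neutral + alpha * e_target

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of anchors: pull each anchor toward its positive and
    away from the negatives. anchor: (B, D); positive: (B, D); negatives: (B, K, D)."""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / tau                 # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / tau   # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                        # positive sits at index 0
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long, device=logits.device))

def symmetric_contrastive_loss(feat_a: torch.Tensor, feat_b: torch.Tensor,
                               proto_a: torch.Tensor, proto_b: torch.Tensor) -> torch.Tensor:
    """Symmetric joint training for one confusable pair (e.g. fear / surprise):
    each category takes a turn as the anchor with the other as the negative, so
    the separation is not biased toward one direction of the pair."""
    B_a, B_b = feat_a.size(0), feat_b.size(0)
    loss_ab = info_nce(feat_a, proto_a.expand_as(feat_a), proto_b.expand(B_a, 1, -1))
    loss_ba = info_nce(feat_b, proto_b.expand_as(feat_b), proto_a.expand(B_b, 1, -1))
    return 0.5 * (loss_ab + loss_ba)

def identity_loss(id_edit: torch.Tensor, id_source: torch.Tensor) -> torch.Tensor:
    """ArcFace-style regularizer: keep the face-recognition embedding of the edited
    image close to the source embedding, countering identity drift at high alpha."""
    return 1.0 - F.cosine_similarity(id_edit, id_source, dim=-1).mean()

# Toy usage with random stand-in tensors (real ones come from the text and face encoders).
D, B = 768, 4
cond = interpolate_expression(torch.randn(D), torch.randn(D), alpha=0.7)
loss = symmetric_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                  torch.randn(D), torch.randn(D)) \
       + 0.1 * identity_loss(torch.randn(B, 512), torch.randn(B, 512))
print(cond.shape, float(loss))
```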
Quantitative results highlight PixelSmile's strong improvements in editing accuracy, disentanglement, and linear controllability, while maintaining identity consistency.
Figure 3: PixelSmile achieves a superior trade-off: wider range of expression manipulation at high identity fidelity compared to baselines.
Experimental Results
General Editing Comparison
On FFE-Bench, PixelSmile achieves the lowest mSCR (0.0550 vs. 0.1754 for Nano Banana Pro and 0.1107 for GPT-Image) and the highest editing accuracy for basic expressions (0.8627). Competing models either preserve identity at the cost of edit strength (e.g., Seedream, Qwen-Edit) or produce visible change at the cost of identity drift (e.g., GPT-Image). Baselines with negative control linearity scores or rapid identity drop do not achieve meaningful controllability.
Figure 4: PixelSmile yields clearer, more discriminable expression edits than general-purpose systems, which degrade on strong or ambiguous edit instructions.
Linear Control Baselines
With respect to controllability, only PixelSmile achieves strong, monotonic expression editing across the entire intensity spectrum without sacrificing identity—CLS-6 of 0.8078 and HES of 0.4723 are best-in-class. Prior latent-slider methods either exhibit negligible expression modulation (to maintain identity) or cause sharp identity collapse at high intensities.
Figure 5: Smooth, monotonic expression control by PixelSmile, without unstable transitions or identity degradation, for both "happy" and "surprised".
Ablation Studies
The ablation studies demonstrate:
- Removal of identity loss drastically increases identity drift (altered hair, skin, etc.) at strong intensities.
- Removing the contrastive loss or its symmetric structure eliminates category disentanglement, causing edits to revert toward the source identity or to exhibit severe confusion.
- Alternative triplet losses (Hinge, Log-Ratio) trade editing strength against identity preservation; InfoNCE balances both best.
- Replacing FFE with MEAD (discrete, less diverse) reduces all key metrics, establishing FFE’s importance for disentangled, controllable editing.
Figure 6: Without identity regularization, high-intensity edits alter persistent features; PixelSmile (full) preserves these reliably.
Figure 7: Without contrastive or symmetric loss, structural confusion persists—PixelSmile achieves distinct edits for overlapping classes.
Figure 8: Symmetric training—though slightly slower to converge—consistently yields lower confusion rates (lower mSCR) than the asymmetric variant.
User Study and Expression Compositionality
User studies corroborate the automatic metrics: PixelSmile is rated highest for edit continuity and, nearly as strongly, for identity preservation; competitors trade one for the other.
Figure 9: Human annotators rate PixelSmile highest for balancing smooth editability with identity retention.
Beyond single expressions, PixelSmile supports compositional editing—blending multiple basic emotions yields plausible compound expressions, with nine out of fifteen pairwise blends succeeding in perceptual tests.
Figure 10: PixelSmile supports smooth blending of multiple emotive categories, producing coherent compound expressions.
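Compositional editing follows naturally from the interpolation view: single expressions move from the neutral embedding toward one target, so compound expressions can be formed by combining several target directions. The sketch below shows one straightforward blending rule as an illustration; the weights, normalization, and function names are assumptions, not the paper's exact procedure.

```python
import torch

def blend_expressions(e_neutral: torch.Tensor, targets: dict, weights: dict) -> torch.Tensor:
    """Blend several expression directions into one conditioning embedding.

    Each weight acts as a per-category alpha; with a single entry this reduces
    to the plain neutral-to-target interpolation used for single expressions."""
    blended = e_neutral.clone()
    for name, w in weights.items():
        blended = blended + w * (targets[name] - e_neutral)
    return blended

# Toy usage with random stand-in embeddings (real ones would come from the text encoder).
dim = 768
e_neutral = torch.randn(dim)
targets = {"happy": torch.randn(dim), "surprised": torch.randn(dim)}
cond = blend_expressions(e_neutral, targets, {"happy": 0.6, "surprised": 0.5})
print(cond.shape)  # torch.Size([768])
```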
Implications and Future Directions
Practically, PixelSmile establishes a scalable, extensible pipeline for high-quality, linearly controllable, and disentangled facial editing, robust across domains (real, anime) and levels of expression complexity. Theoretically, it validates symmetric contrastive learning and a continuous affect space as foundations for further development, with implications for affective computing, personalized media synthesis, and fine-grained expression transfer.
Potential future developments include:
- Extension to video and 4D avatars (temporal/3D disentanglement)
- Multi-modal editing combining expression, age, and other attributes along continuous manifolds
- Integration with large, multimodal foundation models for joint vision-language understanding and closed-loop human feedback
- Real-time editing and adaptive user-in-the-loop systems for creative industries
Conclusion
PixelSmile, together with its FFE dataset and FFE-Bench protocol, provides a structured, data-centric, and methodologically rigorous platform for fine-grained, compositional facial expression editing. Symmetric supervision, flow-matching latent control, and continuous annotation jointly eliminate the longstanding obstacles of semantic confusion and of nonlinear or identity-destroying edits. This research establishes new trajectories for affective modeling and high-precision visual editing in both research and applied contexts.