SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting

Published 21 Apr 2026 in cs.GR and cs.CV | (2604.19202v1)

Abstract: 3D Gaussian representations have emerged as a powerful paradigm for digital head modeling, achieving photorealistic quality with real-time rendering. However, intuitive and interactive creation or editing of 3D Gaussian head models remains challenging. Although 2D sketches provide an ideal interaction modality for fast, intuitive conceptual design, they are sparse, depth-ambiguous, and lack high-frequency appearance cues, making it difficult to infer dense, geometrically consistent 3D Gaussian structures from strokes - especially under real-time constraints. To address these challenges, we propose SketchFaceGS, the first sketch-driven framework for real-time generation and editing of photorealistic 3D Gaussian head models from 2D sketches. Our method uses a feed-forward, coarse-to-fine architecture. A Transformer-based UV feature-prediction module first reconstructs a coarse but geometrically consistent UV feature map from the input sketch, and then a 3D UV feature enhancement module refines it with high-frequency, photorealistic detail to produce a high-fidelity 3D head. For editing, we introduce a UV Mask Fusion technique combined with a layer-by-layer feature-fusion strategy, enabling precise, real-time, free-viewpoint modifications. Extensive experiments show that SketchFaceGS outperforms existing methods in both generation fidelity and editing flexibility, producing high-quality, editable 3D heads from sketches in a single forward pass.


Authors (7)

Summary

The paper introduces a two-stage, feed-forward pipeline that turns 2D sketches into photorealistic 3D Gaussian head models.
It combines Transformer-based feature extraction with StyleGAN2 modulation to control geometry and appearance and to support real-time editing.
The approach reports better FID/KID scores than S3D, SketchFaceNeRF, and Nano-LAM, with edits applied in about 0.3 s and rendering at 243 FPS.

Real-Time 3D Sketch-Driven Face Editing and Generation with Gaussian Splatting

Introduction and Motivation

SketchFaceGS addresses a key limitation in 3D head synthesis by introducing the first real-time, sketch-driven system capable of both generating and interactively editing photorealistic 3D Gaussian head models. The framework builds on 3D Gaussian Splatting (3DGS), which achieves high-quality real-time rendering but has lacked intuitive, fine-grained control for interactive design, especially from sparse and ambiguous 2D sketches. This work augments 3DGS with a feed-forward, optimization-free pipeline that uses pre-trained generative priors and new feature-fusion strategies, bridging the gap between a designer's 2D input and an editable 3D model.

Figure 1: Overview of the SketchFaceGS framework, illustrating the two pipelines: generation from a sketch and reference image, and real-time, precise, view-consistent editing using UV Mask Fusion and layer-wise feature fusion.

Methodology

Coarse-to-Fine Feed-Forward Generation

The SketchFaceGS framework decomposes generation into two synergistic stages:

Coarse Stage: Independent Transformer branches extract geometry from the input sketch and appearance from a reference image, producing geometry-consistent UV feature maps. AdaIN-based alignment resolves identity and structure conflicts between sketch and reference modalities, followed by feature-level fusion.
Fine Stage: A U-Net-based UV feature enhancement module predicts multi-scale modulation parameters for a pre-trained StyleGAN2-based generator (from GGHead). These parameters, including a global identity vector and multi-resolution spatial features, condition synthesis within the StyleGAN backbone, directly producing a UV map encoding Gaussian attributes for 3DGS rendering with photorealistic detail.

This architecture leverages 3D generative priors to map abstract, ambiguous 2D sketches into dense, physically consistent and realistic 3D models in a feed-forward pass, eliminating the need for instance-level optimization.
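The AdaIN-based alignment mentioned for the coarse stage can be illustrated with a minimal NumPy sketch. It implements standard adaptive instance normalization on (C, H, W) feature maps. The function name `adain`, the random stand-in features, and the choice of the sketch branch as "content" and the reference branch as "style" are assumptions made for illustration; the summary does not specify how the paper wires the two branches together.

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive instance normalization for (C, H, W) feature maps.

    Each channel of `content` is normalized to zero mean and unit std, then
    rescaled to the per-channel mean and std of `style`.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

# Hypothetical stand-ins for the two branches' UV feature maps.
rng = np.random.default_rng(0)
geometry_feat = rng.normal(0.0, 1.0, size=(8, 16, 16))    # sketch branch (assumed "content")
appearance_feat = rng.normal(3.0, 2.0, size=(8, 16, 16))  # reference branch (assumed "style")

aligned = adain(geometry_feat, appearance_feat)
```

After the call, `aligned` keeps the spatial layout of `geometry_feat` while its per-channel statistics match those of `appearance_feat`, which is the sense in which AdaIN aligns two modalities before fusion.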

Figure 2: Photorealistic 3D head synthesis from hand-drawn sketch and reference image (top), with support for detailed local editing (bottom).

Sketch-Driven Real-Time Editing

Interactive editing is achieved by UV Mask Fusion and layer-wise feature fusion:

UV Mask Fusion: Edits (drawn/erased pixels) in 2D are back-projected via ray casting to localize affected 3D Gaussians, which are then mapped to canonical UV space. This process generates precise binary masks corresponding to edited regions, robust to arbitrary sketch modifications.
Layer-wise Feature Fusion: Rather than compositing at the 3DGS level (which introduces boundary artifacts), the method fuses features within each generator layer. At every StyleGAN layer, masks are downsampled to the feature resolution, and unedited regions retain original features while only edited regions are updated. This enables continuous, artifact-free, structurally consistent updates that are stable across iterative or multi-step edits.

Figure 3: Continuous multi-step editing from diverse viewpoints, demonstrating fidelity and stability under sequential modifications.
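
The two editing steps described above can be sketched in NumPy under some simplifying assumptions. The ray-casting step is not reproduced: the `edited` flags are assumed to be given. All function names are hypothetical, the mask is downsampled by max-pooling (the summary only says masks are "downsampled"), and the mask side length is assumed to be an integer multiple of each feature resolution. In a real generator the `updated` features would come from re-running each layer on the edited input; here they are constant placeholders.

```python
import numpy as np

def uv_mask_from_gaussians(uv: np.ndarray, edited: np.ndarray, size: int) -> np.ndarray:
    """Rasterize the UV coordinates of edited Gaussians into a binary (size, size) mask.

    uv:     (N, 2) coordinates in [0, 1], one per Gaussian (u = column, v = row).
    edited: (N,) boolean flags marking Gaussians affected by the sketch edit.
    """
    mask = np.zeros((size, size), dtype=np.float32)
    cols = np.clip((uv[edited, 0] * size).astype(int), 0, size - 1)
    rows = np.clip((uv[edited, 1] * size).astype(int), 0, size - 1)
    mask[rows, cols] = 1.0
    return mask

def downsample_mask(mask: np.ndarray, size: int) -> np.ndarray:
    """Max-pool a square binary mask to (size, size): a cell is edited if any pixel in it is."""
    factor = mask.shape[0] // size
    return mask.reshape(size, factor, size, factor).max(axis=(1, 3))

def fuse_layer(original: np.ndarray, updated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep `original` (C, H, W) features outside the mask and take `updated` inside it."""
    m = downsample_mask(mask, original.shape[-1])[None]  # (1, H, W), broadcast over channels
    return m * updated + (1.0 - m) * original

# Three Gaussians, only the second one is affected by the edit.
uv = np.array([[0.10, 0.10], [0.80, 0.80], [0.55, 0.20]])
edited = np.array([False, True, False])
mask = uv_mask_from_gaussians(uv, edited, size=64)

# Fuse placeholder features at several layer resolutions.
fused_by_res = {}
for res in (8, 16, 32):
    original = np.zeros((4, res, res), dtype=np.float32)
    updated = np.ones((4, res, res), dtype=np.float32)
    fused_by_res[res] = fuse_layer(original, updated, mask)
```

Because the blend happens on features at every layer rather than on the final Gaussians, later layers see a single coherent feature map, which is the intuition behind the claim that this avoids boundary seams.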

Training and Losses

A three-stage training strategy is adopted: coarse-stage multi-view learning, fine-stage single-view learning, and an explicit editing objective. Losses include pixelwise L1, perceptual, LPIPS, color-consistency, and adversarial losses to optimize for geometric consistency, photorealism, and local/global identity preservation.
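
One common way to combine such terms is a weighted sum, written here as an assumption rather than as the paper's stated objective. The weights \lambda are hypothetical symbols, and the summary does not say which terms are active in which training stage.

```latex
\mathcal{L} = \lambda_{1}\,\mathcal{L}_{L1}
            + \lambda_{\text{perc}}\,\mathcal{L}_{\text{perc}}
            + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}}
            + \lambda_{\text{color}}\,\mathcal{L}_{\text{color}}
            + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}
```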

Experimental Evaluation

Quantitative and Qualitative Results

SketchFaceGS reports state-of-the-art results for both generating and editing 3D head models from hand-drawn sketches:

Generation: Outperforms S3D, SketchFaceNeRF, and Nano-LAM on FID and KID (FID 92.65; KID 4.00; lower is better for both), and remains faithful to sketch geometry and reference appearance even when the sketch and reference differ substantially.
Editing: Delivers real-time, view-consistent edits at about 0.3 s per edit with rendering at 243 FPS, substantially faster than prior approaches that rely on per-instance optimization or costly 2D-to-3D pipelines.

Figure 4: Qualitative comparison for sketch-to-3D generation against S3D, Nano-LAM, SketchFaceNeRF, and SketchFaceGS, illustrating superior geometric and appearance fidelity.

Figure 5: Qualitative comparison for sketch-driven 3D editing. Only SketchFaceGS achieves real-time, artifact-free, sketch-faithful edits across multiple viewpoints.

Quantitatively, SketchFaceGS yields the lowest FID and KID and the highest PSNR/SSIM on unedited regions, indicating both strong overall generation quality and minimal identity drift in non-edited areas.
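
To make the "PSNR on unedited regions" idea concrete, here is a minimal sketch of a masked PSNR. The function name `masked_psnr`, its signature, and the toy images are assumptions for illustration; the summary does not describe the paper's exact evaluation protocol. The function assumes at least one pixel lies outside the edit mask.

```python
import numpy as np

def masked_psnr(before: np.ndarray, after: np.ndarray,
                edit_mask: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR over pixels where edit_mask is False, i.e. the region the edit should leave alone.

    before, after: (H, W, 3) images with values in [0, max_val].
    edit_mask:     (H, W) boolean array, True inside the edited region.
    """
    keep = ~edit_mask
    mse = np.mean((before[keep] - after[keep]) ** 2)
    if mse == 0:
        return float("inf")  # unedited region is pixel-identical
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy example: the top half is edited, the bottom half drifts by 0.1.
before = np.full((4, 4, 3), 0.5)
after = before.copy()
edit_mask = np.zeros((4, 4), dtype=bool)
edit_mask[:2] = True
after[:2] = 0.0    # large change inside the edited region, ignored by the metric
after[2:] += 0.1   # uniform drift of 0.1 in the unedited region

score = masked_psnr(before, after, edit_mask)  # about 20 dB (MSE = 0.01)
```

A higher score means the edit left the rest of the head closer to its original appearance, which is how the metric serves as a proxy for identity preservation.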

Ablation Studies

Ablations on the generation pipeline highlight critical components: removal of the UV feature enhancement or AdaIN alignment leads to oversmoothed results and identity artifacts, while dropping the global identity vectors or using a naive CNN for appearance extraction degrades both photorealism and appearance transfer.

Figure 6: Generation pipeline ablation—full model achieves optimal realism, detail, and identity consistency.

Ablations of editing strategies show that direct 3D Gaussian compositing causes seams and loss of identity, and naive full-head regeneration loses context. Only layer-wise feature fusion provides seamless, structurally and chromatically stable edits under repeated or fine-grained manipulations.

Figure 7: Editing module ablation—layer-wise feature fusion (e) yields the most consistent and artifact-free editing results.

Implications and Future Directions

This work substantially extends intuitive, user-driven 3D head modeling and editing, rendering photorealistic, high-fidelity avatars from sketches with real-time feedback. Practically, this establishes a new paradigm for interactive avatar creation, content design, and digital asset generation, lowering the barrier for non-specialists and enabling robust, iterative workflows.

Theoretically, the unified feed-forward architecture indicates the viability of bridging sparse human inputs and high-dimensional generative priors for dense 3D modeling. The novel UV Mask Fusion and feature space fusion strategies may generalize to other conditional generative modeling tasks, and to multi-modal interactive systems beyond head avatars.

Limitations include sensitivity to extreme out-of-distribution sketches or accessories not captured by the generative prior, and slight identity shifts when the sketch and reference conflict significantly in geometry or appearance. Natural next steps are extensions to dynamic editing for facial animation, improved identity-consistency losses, and augmented priors for rare attributes.

Conclusion

SketchFaceGS delivers the first real-time, optimization-free framework for sketch-driven generation and editing of photorealistic 3D Gaussian heads. By combining a coarse-to-fine dual-Transformer architecture, a modulation-based generative prior, and UV Mask Fusion with layer-wise feature fusion, it achieves high-fidelity, interactive, artist-friendly 3D modeling and outperforms the compared methods both quantitatively and qualitatively. The framework advances controllable, efficient, and robust face synthesis and editing in graphics and vision.
