Scene Style Transfer Overview
- Scene style transfer is a technique that applies the stylistic attributes of a reference source to images, videos, or 3D models while retaining the original scene’s structure and semantics.
- It employs deep learning architectures like CNNs and NeRF along with tailored loss functions to balance style fidelity, content preservation, and temporal or view consistency.
- The approach enables both photorealistic and artistic outputs for applications in film, VR, and design, though it faces challenges in computational demand and fine-tuning style strength.
Scene style transfer is the process of modifying the visual appearance of a scene—whether represented as 2D images, 3D models, videos, or point clouds—by imparting the stylistic attributes (such as color distribution, texture, and visual motifs) of a reference style source to the target content, while preserving the content’s structural and semantic integrity. This field intersects computer vision, graphics, and machine learning, encompassing photorealistic relighting and recoloring, painterly and non-photorealistic rendering, and immersive 3D/VR applications. It has evolved from early neural techniques for static images to highly sophisticated, physically-based, semantically aware, and 3D-consistent algorithms demonstrated across a range of visual modalities.
1. Key Principles and Loss Formulations
Central to scene style transfer are loss objectives that balance stylistic fidelity, content preservation, and, especially for dynamic or 3D content, structural and temporal consistency. The canonical neural style transfer framework constructs the stylized output $x$ from a content reference $c$ and a style reference $s$ by minimizing a composite objective over the input:

$$\mathcal{L}_{\text{total}}(x) = \alpha\,\mathcal{L}_{\text{content}}(x, c) + \beta\,\mathcal{L}_{\text{style}}(x, s) + \gamma\,\mathcal{L}_{\text{consistency}}(x) + \delta\,\mathcal{L}_{\text{reg}}(x),$$

where the weights $\alpha, \beta, \gamma, \delta$ trade off the terms and the key terms cover:
- Content Loss ($\mathcal{L}_{\text{content}}$): Penalizes deviations from the high-level structural features of the content, typically measured at intermediate CNN layers.
- Style Loss ($\mathcal{L}_{\text{style}}$): Measures mismatch in feature correlation statistics (Gram matrices) between the output and the style reference, optionally restricted to semantic regions.
- Temporal/Geometric Consistency ($\mathcal{L}_{\text{consistency}}$): Additional losses enforce steady appearance over time for video (Honke et al., 2018) or consistency across views for 3D/mesh-based approaches.
- Photorealism Regularization ($\mathcal{L}_{\text{reg}}$): For photorealistic output, Laplacian/Matting Laplacian losses and locally affine color constraints preserve natural color relationships (Honke et al., 2018).
The optimization is typically performed with gradient-based methods (e.g., Adam, L-BFGS), iteratively refining either the output pixels (2D) or parameters of explicit/implicit 3D representations.
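The following is a minimal sketch of this optimization loop in PyTorch, assuming a pretrained VGG-19 feature extractor and layer choices common in the literature; it illustrates the canonical image-space formulation rather than any specific method cited here, and preprocessing such as ImageNet normalization is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Illustrative sketch: image-space optimization of a composite
# content + style objective (layer indices are assumed choices).
device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYERS = {21}               # conv4_2 (common choice)
STYLE_LAYERS = {0, 5, 10, 19, 28}   # conv1_1 .. conv5_1 (common choice)

def extract(x):
    """Run the image through VGG and collect content/style feature maps."""
    content, style = {}, {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in CONTENT_LAYERS:
            content[i] = x
        if i in STYLE_LAYERS:
            style[i] = x
    return content, style

def gram(feat):
    """Channel-by-channel feature correlations, averaged over spatial positions."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def stylize(content_img, style_img, alpha=1.0, beta=1e4, steps=300):
    with torch.no_grad():
        c_feats, _ = extract(content_img)
        _, s_feats = extract(style_img)
        s_grams = {k: gram(v) for k, v in s_feats.items()}
    x = content_img.clone().requires_grad_(True)   # initialize from content
    opt = torch.optim.Adam([x], lr=0.02)
    for _ in range(steps):
        opt.zero_grad()
        out_c, out_s = extract(x)
        l_content = sum(F.mse_loss(out_c[k], c_feats[k]) for k in c_feats)
        l_style = sum(F.mse_loss(gram(out_s[k]), s_grams[k]) for k in s_grams)
        (alpha * l_content + beta * l_style).backward()
        opt.step()
    return x.detach()
```

Consistency and photorealism regularizers from the composite objective above would be added to the same scalar loss before the backward pass.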
2. Model Architectures and Semantic Guidance
Early style transfer methods focused on global image statistics. Modern scene style transfer leverages architectures sensitive to semantic regions, geometry, and modality:
- Convolutional Neural Networks (CNNs): Widely used for static images and video frames, often built on VGG-16/VGG-19 for feature extraction. Style and content losses are drawn from different layers according to their hierarchical abstraction (Honke et al., 2018, Kashyap et al., 16 Jan 2025).
- Semantic Segmentation and Masking: Automatic region labeling via segmentation models (e.g., ADE20K, ODISE) enables region-aware or object-specific style transfer, preventing improbable style blending across semantically distinct areas (Honke et al., 2018, Schekalev et al., 2019, Zhu et al., 14 Feb 2025). Spatially-varying content weights or region masks preserve central objects or enable compositional transfer; a masked style-loss sketch follows this list.
- Domain-Adaptive Encoders: Domain-awareness modules estimate "artisticness" or "domainness" of styles, adjusting the network’s skip connections and reconstruction fidelity to interpolate between photorealistic and painterly results (Hong et al., 2021).
- 3D-Specific Representations: For 3D scenes, architectures include neural radiance fields (NeRF), point cloud networks (e.g., PointNet (Cao et al., 2019)), explicit mesh texture optimization (Höllein et al., 2021), 3D Gaussian Splatting (Jain et al., 12 Jul 2024, Liu et al., 28 Mar 2025), and combinations with deep feature distillation.
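As referenced in the semantic masking bullet above, the sketch below restricts the Gram-matrix style loss to a single semantic region; the masking and normalization choices are assumptions for illustration, not the exact formulation of any cited method.

```python
import torch
import torch.nn.functional as F

def masked_gram(feat, mask):
    """Gram matrix computed only over pixels inside a semantic region.

    feat: (1, C, H, W) CNN feature map; mask: (1, 1, H0, W0) binary region mask.
    """
    b, c, h, w = feat.shape
    m = F.interpolate(mask.float(), size=(h, w), mode="nearest")
    f = (feat * m).reshape(b, c, h * w)
    n = m.sum().clamp(min=1.0)          # number of feature positions in the region
    return f @ f.transpose(1, 2) / (c * n)

def region_style_loss(output_feats, style_feats, output_masks, style_masks):
    """Match Gram statistics region by region (e.g., wall-to-wall, floor-to-floor)."""
    loss = 0.0
    for om, sm in zip(output_masks, style_masks):       # corresponding semantic regions
        for of, sf in zip(output_feats, style_feats):   # corresponding CNN layers
            loss = loss + F.mse_loss(masked_gram(of, om), masked_gram(sf, sm))
    return loss
```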
3. Scene Structure, Temporal, and Multi-View Consistency
Preserving the structure and consistency of the target scene is imperative in scene-level applications, especially for video and 3D:
- Temporal Consistency for Video: Temporal losses penalize frame-to-frame deviations, using optical flow to warp the previous frame and masking out occluded pixels. Short-term losses handle instantaneous coherence; long-term losses incorporate appearance memory (Honke et al., 2018, Qiu et al., 2022). A warping-based loss sketch follows this list.
- 3D View Consistency: For NeRF, mesh, or point cloud representations, methods enforce that style is applied consistently across different camera angles or views (Höllein et al., 2021, Kim et al., 10 Jan 2024, Jain et al., 12 Jul 2024). Some use flow-based or patch-aligned perceptual losses for warping between views (Meric et al., 24 Aug 2024, Zhu et al., 14 Feb 2025).
- Depth and Angle Awareness: Optimization adapted by depth and surface orientation regularizes stylization scale and pattern across surfaces, preventing stretch and maintaining pattern uniformity (Höllein et al., 2021).
- Global-Local Feature Alignment: For 3D and semantic correspondence, matching local (region/object) and global (whole-scene) distributions prevents style mixing and maintains functional coherence (Gao et al., 2023, Zhu et al., 14 Feb 2025, Liu et al., 28 Mar 2025).
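Below is a minimal sketch of the short-term temporal loss described above, assuming backward optical flow and a precomputed occlusion/validity mask; the exact weighting and the flow estimator vary by method.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp an image with backward optical flow (in pixels).

    img: (B, C, H, W); flow: (B, 2, H, W) with (dx, dy) per pixel.
    """
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device),
        torch.arange(w, device=img.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, grid.permute(0, 2, 3, 1), align_corners=True)

def temporal_loss(stylized_t, stylized_prev, flow, occlusion_mask):
    """Penalize appearance changes in regions visible in both frames.

    occlusion_mask: (B, 1, H, W), 1 where the flow is valid/unoccluded.
    """
    warped_prev = warp(stylized_prev, flow)
    return ((occlusion_mask * (stylized_t - warped_prev)) ** 2).mean()
```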
4. Photorealistic and Artistic Approaches
Scene style transfer is employed in both photorealistic and artistic contexts, with some methods bridging both within a unified model:
- Photorealistic Transfer: Focuses on subtle color, tone, and illumination changes, preserving realism and structure. Techniques emphasize local affine transformations in color space, Matting Laplacian constraints, and explicit matching of low-level features (Honke et al., 2018, Qiu et al., 2022).
- Artistic Transfer: Pursues bolder changes, such as mimicking brushstrokes, non-local textures, or color palettes. CNNs or diffusion-based generators are optimized to match higher-order feature correlations and stylized details (Warkhandkar et al., 2021, Fujiwara et al., 19 Jun 2024).
- Hybrid and Unified Approaches: Networks with domainness indicators or feed-forward AdaIN-based style injection enable seamless transitions between photorealistic and artistic effects, dictated by the style reference (Hong et al., 2021, Kim et al., 10 Jan 2024); an AdaIN sketch follows this list.
- Multiscale and Analytic Techniques: Methods such as GIST (Rojas-Gomez et al., 3 Dec 2024) utilize analytic multiscale (Wavelet, Contourlet) decompositions, aligning content and style subbands with optimal transport. This yields fast, training-free, and photorealistic style transfer, supporting both scene structure fidelity and flexible stylization.
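A minimal sketch of the AdaIN operation used by such feed-forward pipelines is shown below: content features are re-normalized so their per-channel statistics match those of the style features. In the 3D variants cited above, the same statistics matching is applied to features distilled into the 3D representation before decoding.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization over (B, C, H, W) feature maps."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # shift/scale content statistics to the style statistics, channel by channel
    return s_std * (content_feat - c_mean) / c_std + s_mean
```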
5. Modality-Specific Innovations
Scene style transfer encompasses diverse modalities, each with targeted strategies:
- Video: Temporal-aware CNN architectures, self-supervised decoupled normalization (Qiu et al., 2022), or Matting Laplacian regularization for frame coherence (Honke et al., 2018).
- 3D Point Clouds: Order-invariant PointNet-based networks allowing independent transfer of color (from images or point clouds) and geometry (Cao et al., 2019).
- 3D Meshes and Textures: Optimization of mesh textures via differentiable rendering using depth- and angle-aware regularization, with results compatible with real-time graphics engines (Höllein et al., 2021).
- Radiance Fields & Splatting: Large-scale, real-time style transfer on radiance field and 3DGS representations. Innovations include feed-forward AdaIN stylization in 3D feature space with multi-reference, semantically matched local AdaIN (Kim et al., 10 Jan 2024), and object-aware splatting with segmented editing (Jain et al., 12 Jul 2024, Liu et al., 28 Mar 2025); a per-object color-statistics sketch follows this list.
- Language-Guided Transfer: Language-conditioned frameworks align global and local style codes from text to 3D geometry with special divergence losses, increasing expressivity and generalization (Gao et al., 2023).
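As noted in the splatting bullet, the hedged sketch below recolors only the Gaussians belonging to one segmented object by matching their color statistics to a style palette; this is an illustrative simplification with a hypothetical interface, not the procedure of the cited works.

```python
import torch

def stylize_gaussian_colors(colors, object_mask, style_pixels, eps=1e-5):
    """Recolor a segmented subset of Gaussians toward a style palette.

    colors:       (N, 3) per-Gaussian RGB colors
    object_mask:  (N,) boolean mask selecting Gaussians of one object/region
    style_pixels: (M, 3) RGB pixels sampled from the style reference
    """
    sel = colors[object_mask]
    c_mean, c_std = sel.mean(0), sel.std(0) + eps
    s_mean, s_std = style_pixels.mean(0), style_pixels.std(0) + eps
    out = colors.clone()
    # match first- and second-order color statistics inside the object only
    out[object_mask] = (sel - c_mean) / c_std * s_std + s_mean
    return out.clamp(0.0, 1.0)
```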
6. Applications, Evaluation, and Limitations
Scene style transfer methods are applied in:
- Film and Video Production: Adding stylized effects, domain adaptation, recoloring, or creating painterly animations while preserving continuity and realism (Honke et al., 2018, Qiu et al., 2022).
- Virtual and Augmented Reality: Immersive scene relighting, architectural walkthroughs, and creative asset generation with semantic/region control (Höllein et al., 2021, Jain et al., 12 Jul 2024).
- Interior Design and Virtual Staging: Instance-aware or semantic-aware transfer for realistic visualization of furniture and décor (Zhu et al., 14 Feb 2025).
- Interactive Tools: Fast, user-driven tools for artistic exploration, batch stylization, or parameter tuning (Warkhandkar et al., 2021, Rojas-Gomez et al., 3 Dec 2024).
- Generalization and Large-scale Deployment: Methods such as G3DST (Meric et al., 24 Aug 2024) and FPRF (Kim et al., 10 Jan 2024) enable feed-forward, scene-agnostic style transfer across arbitrary, large, and unseen environments.
Evaluation is typically both qualitative (user studies, visualizations of temporal/multi-view consistency) and quantitative (SSIM, LPIPS, perceptual metrics, ArtFID, CHD, DSD, and correspondence with ground truth for pose/structure).
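As a concrete example of the quantitative side, the sketch below computes SSIM and LPIPS between a stylized frame and its source content, assuming the scikit-image and lpips packages are available; metric choices and reference images differ across the cited papers.

```python
import torch
import lpips                                         # pip install lpips (assumed dependency)
from skimage.metrics import structural_similarity as ssim

def evaluate_pair(stylized, content):
    """stylized, content: (H, W, 3) float numpy arrays in [0, 1]."""
    s = ssim(stylized, content, channel_axis=-1, data_range=1.0)
    loss_fn = lpips.LPIPS(net="alex")                # perceptual distance network
    to_t = lambda x: torch.tensor(x).permute(2, 0, 1)[None].float() * 2 - 1
    d = loss_fn(to_t(stylized), to_t(content)).item()
    return {"SSIM": s, "LPIPS": d}
```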
Common limitations include computational demand for iterative/optimization-based methods, dependence on segmentation or depth/geometry estimation quality, and trade-offs between style strength and structural fidelity.
Summary Table: Representative Approaches and Key Features
Method/Modality | Structural Consistency | Semantic/Region Masks | Real-Time/Feed-Forward | Explicit 3D Support | Multi-Reference Control | Language Input | Training-Free |
---|---|---|---|---|---|---|---|
(Honke et al., 2018) (2D/vid) | Yes (temporal loss) | Yes | No | No | No | No | No |
(Höllein et al., 2021) (mesh) | Yes (3D and view) | Partial | Yes | Yes (mesh) | No | No | No |
(Kim et al., 10 Jan 2024) (NeRF) | Yes (view-consistent) | Partial | Yes | Yes (NeRF) | Yes (local AdaIN) | No | No |
(Meric et al., 24 Aug 2024) (G3DST) | Yes (opt. flow loss) | Partial | Yes | Yes (generalizable NeRF) | No | No | Yes |
(Rojas-Gomez et al., 3 Dec 2024) (GIST) | Yes | Yes | Yes | No | Yes | No | Yes |
(Zhu et al., 14 Feb 2025) (ReStyle3D) | Yes (geometry) | Yes (open vocabulary) | Yes | Yes (multi-view, no dense mesh) | No | No | Partial |
(Liu et al., 28 Mar 2025) (ABC-GS) | Yes (3DGS, FAST) | Yes | Yes | Yes (3DGS) | Yes | No | No |
(Gao et al., 2023) (CLIP3Dstyler) | Yes | No | Yes | Yes (point cloud) | Yes | Yes | Yes |
References
- Photorealistic Style Transfer for Videos (Honke et al., 2018)
- Neural Style Transfer for Point Clouds (Cao et al., 2019)
- Style Transfer With Adaptation to the Central Objects of the Scene (Schekalev et al., 2019)
- LiveStyle -- An Application to Transfer Artistic Styles (Warkhandkar et al., 2021)
- Domain-Aware Universal Style Transfer (Hong et al., 2021)
- StyleMesh: Style Transfer for Indoor 3D Scene Reconstructions (Höllein et al., 2021)
- An Overview of Color Transfer and Style Transfer for Images and Videos (Liu, 2022)
- UPST-NeRF: Universal Photorealistic Style Transfer of Neural Radiance Fields for 3D Scene (Chen et al., 2022)
- ColoristaNet for Photorealistic Video Style Transfer (Qiu et al., 2022)
- CLIP3Dstyler: Language Guided 3D Arbitrary Neural Style Transfer (Gao et al., 2023)
- FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields (Kim et al., 10 Jan 2024)
- CoARF: Controllable 3D Artistic Style Transfer for Radiance Fields (Zhang et al., 23 Apr 2024)
- Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images (Fujiwara et al., 19 Jun 2024)
- StyleSplat: 3D Object Style Transfer with Gaussian Splatting (Jain et al., 12 Jul 2024)
- G3DST: Generalizing 3D Style Transfer with Neural Radiance Fields across Scenes and Styles (Meric et al., 24 Aug 2024)
- Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning (Zuo et al., 15 Nov 2024)
- GIST: Towards Photorealistic Style Transfer via Multiscale Geometric Representations (Rojas-Gomez et al., 3 Dec 2024)
- Dynamic Neural Style Transfer for Artistic Image Generation using VGG19 (Kashyap et al., 16 Jan 2025)
- ReStyle3D: Scene-Level Appearance Transfer with Semantic Correspondences (Zhu et al., 14 Feb 2025)
- ABC-GS: Alignment-Based Controllable Style Transfer for 3D Gaussian Splatting (Liu et al., 28 Mar 2025)