Controllable Stylized Distillation (CSD) for 3D Modeling
- CSD is a method that employs score distillation with tunable style weights to balance content semantics and stylistic appearance in 3D reconstructions.
- It leverages mechanisms like self-attention swapping, negative prompt guidance, and multi-stage cascaded diffusion for consistent style integration in neural fields, meshes, and splats.
- Empirical studies show CSD enhances geometric fidelity and stylization quality, outperforming traditional SDS approaches in cross-view consistency and detail preservation.
Controllable Stylized Distillation (CSD) refers to a family of optimization strategies for 3D generative modeling that tightly couple diffusion model distillation with explicit, tunable control of stylistic appearance. CSD enables 3D reconstructions—neural fields, meshes, or gaussian splats—to exhibit both content semantics dictated by a text prompt and visual style faithfully reflecting an arbitrary reference image. Typical CSD frameworks intertwine score distillation gradients from generative diffusion models with architectural interventions or guidance mechanisms that modulate the degree and spatial consistency of stylization. Its effectiveness has been validated in settings such as text-to-3D via neural radiance fields, 3D Gaussian Splatting, and mesh-based local texture editing (Kompanowski et al., 2024, Yang et al., 11 Aug 2025, Decatur et al., 2023).
1. Mathematical Formulations and Core Losses
CSD generalizes standard Score Distillation Sampling (SDS) and related losses to mix competing “score opinions” from “content” and “style” configurations of a diffusion-based denoiser. Given a rendered image $x = g(\theta)$ (from parameters $\theta$ of the 3D representation), noise level $t$, and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, two main variants are prominent:
- Mixture-based Stylized Score Distillation (Kompanowski et al., 2024): For text-to-3D with style control,
  $$\nabla_\theta \mathcal{L}_{\mathrm{SSD}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\left( (1-\lambda)\,\epsilon_\phi(x_t; y, t) + \lambda\,\tilde{\epsilon}_\phi(x_t; y, t) - \epsilon \right) \frac{\partial x}{\partial \theta} \right],$$
  where $\epsilon_\phi$ is the denoiser for content prompt $y$, $\tilde{\epsilon}_\phi$ its style-injected sibling, and $\lambda \in [0,1]$ weighs style strength.
- Reconstruction-Free, Negative Guidance CSD (Yang et al., 11 Aug 2025): For 3DGS style transfer,
  $$\nabla_\theta \mathcal{L}_{\mathrm{CSD}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\, s \left( \epsilon_\phi(x_t; y^{+}, t) - \epsilon_\phi(x_t; y^{-}, t) \right) \frac{\partial x}{\partial \theta} \right],$$
  with guidance scale $s > 0$ and positive/negative prompts $y^{+}$, $y^{-}$ used to isolate style and suppress content from the reference image.
Table: CSD Loss Gradient Structure
| Variant | Core Loss Gradient | Style Modulation |
|---|---|---|
| SSD (text-to-3D NeRF) | $(1-\lambda)\,\epsilon_\phi + \lambda\,\tilde{\epsilon}_\phi$ mixed against injected noise $\epsilon$ | mixture weight $\lambda$ |
| CSD (3DGS, FantasyStyle) | $\epsilon_\phi(y^{+}) - \epsilon_\phi(y^{-})$ via positive/negative CFG, no reconstruction term | negative prompt, guidance scale $s$ |
CSD consistently departs from classic SDS by enabling a direct, tunable mixture of content and stylized branches and, in some settings, omitting the original distribution's reconstruction term to favor crisp stylization (Yang et al., 11 Aug 2025).
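Both gradient variants share a common skeleton: a mixed or differenced score prediction minus the injected noise, weighted by $w(t)$. A minimal NumPy sketch of the mixture-based variant is shown below, with `eps_content` and `eps_style` standing in for the frozen denoiser's content and style-injected predictions; the function and argument names are illustrative, not taken from the papers.

```python
import numpy as np

def ssd_gradient(eps_content, eps_style, eps, lam, w_t):
    """Mixture-based stylized score distillation gradient w.r.t. the
    rendered image: a convex mix of content and style score predictions
    minus the injected noise, scaled by the timestep weight w(t).
    The pullback through the renderer (dx/dtheta) is applied outside."""
    mixed = (1.0 - lam) * eps_content + lam * eps_style
    return w_t * (mixed - eps)
```

Setting `lam = 0` recovers plain SDS on the content branch; `lam = 1` distills only the style-injected branch, which the ablations summarized in Section 4 show tends to degrade geometry.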
2. Mechanisms for Style Incorporation and Control
CSD distinguishes itself by the mechanisms through which style information from the reference image is injected and controlled:
- Self-Attention Key/Value Swapping (Kompanowski et al., 2024): In the Stable Diffusion U-Net, style keys and values ($K^{s}$, $V^{s}$) extracted from a style-augmented prompt are swapped into the self-attention for the content rendering. This mixing is performed per attention block, with all other network weights frozen, producing the style-injected prediction $\tilde{\epsilon}_\phi$ for the stylized branch.
- Negative Prompt Guidance via IP-Adapter (Yang et al., 11 Aug 2025): Style and content embeddings are disentangled, with content suppressed via negative prompts and stylization isolated by positive guidance in classifier-free guidance (CFG). This ensures the style branch imparts color/texture while not leaking content structure from the style image.
- Stage-wise Cascaded Score Distillation (Decatur et al., 2023): Using a multi-stage diffusion backbone, CSD applies score-matching gradients at every resolution, permitting both global (base stage) and local (super-resolution stages) style control. Stage weights $\lambda_i$ modulate the influence of each level, enabling a trade-off between coherence and detail.
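The key/value swap can be illustrated with single-head scaled dot-product attention where queries come from the content branch and keys/values from the style branch. This is a toy NumPy stand-in for the frozen U-Net attention blocks, not the papers' implementation; all names are illustrative.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def stylized_attention(q_content, k_style, v_style):
    """Key/value swap: queries from the content rendering attend into
    keys and values hooked from the style-augmented forward pass, so
    the content layout is filled with style features."""
    return attention(q_content, k_style, v_style)
```

With `k_style`/`v_style` captured by attention hooks during a forward pass on the style-augmented prompt, the content layout attends into style features while all network weights stay frozen.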
3. Training Procedures and Schedules
The optimization process in CSD frameworks is iteratively staged to avoid geometric collapse and balance stylization fidelity:
- Style-Ratio Scheduling: The style mixture weight $\lambda$ is critical. Square-root or quadratic ramps ($\lambda(\tau) \propto \sqrt{\tau}$ or $\lambda(\tau) \propto \tau^{2}$, with $\tau$ the normalized optimization progress) enable gradual introduction of stylization, preserving plausible geometry early and increasing style intensity late in optimization (Kompanowski et al., 2024).
- Multi-Resolution and Multi-View Synthesis: In mesh-based and 3DGS settings, optimization is carried out across multiple resolutions (cascades) and/or views per iteration. For 3DGS, Multi-View Frequency Consistency (MVFC) is also applied, utilizing 3D Fourier filtering to replace low frequencies with shared noise across views, fostering style consistency (Yang et al., 11 Aug 2025).
- Gradient Steps and Backbones: For NeRF, AdamW optimizes density and color MLP parameters; for 3D Gaussian Splatting, only the color parameters are updated. Diffusion model weights are kept frozen. Attention hooks or external encoders (BLIP-2, IP-Adapter) provide necessary embeddings for prompt augmentation and negative guidance.
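The style-ratio schedules above can be sketched as a single ramp function. The functional forms below are one plausible reading of "square-root or quadratic ramps" and are not lifted from the papers.

```python
def style_ratio(step, total_steps, lam_max=1.0, mode="sqrt"):
    """Ramp the style mixture weight lambda from 0 to lam_max over
    optimization. The sqrt ramp rises quickly and then flattens; the
    quadratic ramp stays low early and accelerates late, suited to
    abstract styles where early style pressure harms geometry."""
    tau = min(max(step / total_steps, 0.0), 1.0)
    if mode == "sqrt":
        return lam_max * tau ** 0.5
    if mode == "quadratic":
        return lam_max * tau ** 2
    raise ValueError(f"unknown mode: {mode}")
```

Per the guidance notes in Section 6, the square-root ramp is the common default and the quadratic ramp is preferred for highly abstract styles.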
4. Empirical Results and Ablation Studies
CSD frameworks demonstrate significant improvements over prior state-of-the-art baselines—in both stylization sharpness and cross-view consistency—validated by qualitative and quantitative experiments:
- Dream-in-Style (NeRF; Kompanowski et al., 2024): CSD exceeds baselines (“style-in-prompt,” neural-style loss, textual inversion) in GPTEval3D Elo ratings (overall Elo 1140 vs. the 1000 anchor) across text alignment, geometry, style, detail, and plausibility. Combined content and style branches preserve shape and transfer fine visual patterns; style-only branches degrade geometry.
- FantasyStyle (3DGS; Yang et al., 11 Aug 2025): CSD achieves superior ArtFID (43.52 vs. 45.31 for StyleGaussian and 44.70 for SGSST), lower FID_style (347.61), and improved consistency metrics (LPIPS, both short- and long-range). Ablations confirm that reconstruction-free gradients and negative guidance prevent smoothing and leakage, and that frequency-consistent filters resolve cross-view style conflicts.
- 3D Paintbrush (mesh; Decatur et al., 2023): Cascaded CSD enables a smooth dial between tightly localized but blurry texture (low-stage only) and sharp but semantically drifting detail (high-stage only). Joint optimization of localization, texture, and background outperforms independent or staged approaches; the direct CSD loss across all cascades yields both precise placement and intricate style.
5. Comparative Frameworks and Theoretical Distinctions
CSD is distinct from conventional approaches by virtue of:
- Direct Score Mixture or Negative Prompt Subtraction: Rather than relying on feature-level VGG losses or in-prompt stylization heuristics, CSD fuses (or subtracts) learned denoiser scores, under explicit style and content conditioning.
- Flexible Modality and Representation Support: CSD has been instantiated for neural radiance fields (Kompanowski et al., 2024), 3D Gaussian Splatting (Yang et al., 11 Aug 2025), and differentiable mesh pipelines (Decatur et al., 2023), demonstrating architectural scope beyond a single 3D representation.
- Explicit, Tunable Style Localization: The introduction of per-iteration or per-stage dial parameters ($\lambda$, $\lambda_i$) allows explicit, fine-grained control by the researcher, enabling smooth trade-offs between geometry, semantic content, and stylization.
A plausible implication is that CSD strategies will remain adaptable as new diffusion backbones, prompt encoders, and geometric representations emerge, provided they support the requisite disentangling of style and content signals and expose hooks for injecting style into frozen model weights.
6. Implementation Parameters and Practical Considerations
Implementation varies by setting:
- Backbones:
- Dream-in-Style: NeRF via Threestudio/NerfAcc, Stable Diffusion 1.4/1.5 U-Net.
- FantasyStyle: Open3DGS/Stable Diffusion (with IP-Adapter for style/content separation).
- 3D Paintbrush: Differentiable mesh renderer, cascaded diffusion model.
- Guidance:
- Guidance scale $s$: 7.5 for NFSD/CSD; up to 100 for vanilla SDS.
- Style mixture/negative guidance schedule: square-root optimal for most cases, quadratic for highly abstract styles.
- Hardware:
- RTX 4090 (24GB, ≈1.5–2.5h for stylized object in NeRF).
- For 3DGS, 2× NVIDIA L20 (48GB).
- Hyperparameters:
- Batch size: typically 8–32 views per step for 3DGS.
- Timestepping: uniform sampling in $t \in [t_{\min}, t_{\max}]$ or discrete steps.
- Resolution/stage weights: per-stage weights $\lambda_i$ for each cascade level to trade off global vs. local detail.
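The guidance entries above combine in the standard classifier-free guidance update, with the negative slot carrying the content-suppressing prompt described in Section 2. A small sketch follows; the default values and the truncated timestep range are illustrative assumptions, not settings taken from the papers.

```python
import numpy as np

def cfg_score(eps_pos, eps_neg, s=7.5):
    """Classifier-free guidance: push the score prediction toward the
    positive (style) condition and away from the negative
    (content-suppressing) condition. A moderate s like 7.5 matches
    NFSD/CSD-style losses; vanilla SDS often needs much larger scales."""
    return eps_neg + s * (eps_pos - eps_neg)

def sample_timestep(rng, t_min=0.02, t_max=0.98):
    """Uniform timestep sampling over a truncated range; the exact
    bounds here are common illustrative defaults (assumption)."""
    return rng.uniform(t_min, t_max)
```

At `s = 1` the update reduces to the positive-branch prediction alone; increasing `s` amplifies the separation between the style and content-suppressing conditions.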
7. Impact, Significance, and Limitations
CSD consolidates recent innovations in 3D generative modeling, directly addressing previously unresolved challenges: geometry–stylization trade-off, content leakage, and multi-view consistency. Empirical studies demonstrate that CSD produces objects with faithful geometry dictated by text prompts and stylistic attributes closely tracking arbitrary reference images, surpassing VGG-based and prompt-based transfer schemes in both artifact reduction and user preference metrics (Kompanowski et al., 2024, Yang et al., 11 Aug 2025, Decatur et al., 2023).
A potential limitation, observed especially in abstract or high-complexity styles, is sensitivity to schedule and mixture weight selection; misconfigured ramps or excessive style weight can yield geometry collapse, over-stylization, or diminished semantic coherence. CSD effectiveness is also bounded by the expressive capacity of the underlying diffusion model and encoder’s disentangling proficiency. Future work may extend CSD to more localized or spatially-varying controls, or to hybrid pipelines supporting partial style transfer across object regions.