SplatFont3D: Structure-Aware 3D Fonts

Updated 6 December 2025
  • SplatFont3D is a structure-aware text-to-3D artistic font generation framework that transforms 2D glyphs into immersive 3D fonts while maintaining semantic integrity.
  • It employs a multi-stage pipeline combining Glyph2Cloud, 3D Gaussian splatting with score distillation sampling, and dynamic component assignment for precise part-level stylization.
  • The framework enables robust multi-view rendering and style transfer, making it ideal for applications such as VR/AR, animation, and video game design.

SplatFont3D is a structure-aware text-to-3D artistic font generation framework that leverages 3D Gaussian splatting and enables precise part-level style control for glyphs. Unlike previous artistic font generation (AFG) approaches, which have focused almost exclusively on 2D representations, SplatFont3D generates 3D fonts that capture fine-grained semantic and geometric constraints intrinsic to glyphs. This enables the rendering of fonts from arbitrary viewpoints, making them suitable for immersive and interactive 3D environments such as video games, animation, and AR/VR, while simultaneously addressing the challenges of part-level stylization, semantic preservation, and the absence of large-scale 3D font datasets (Gan et al., 29 Nov 2025).

1. Problem Formulation and Motivation

Most AFG research addresses 2D flat designs, which fail to capture the spatial and multi-view consistency necessary for 3D and immersive applications. Transitioning to 3D-AFG allows for novel-view synthesis (all 2D renderings become special cases), supports integration in 3D environments, and introduces new requirements:

  • Semantic structure constraints: Glyph integrity (e.g., distinguishability of “A”) must be maintained during stylization. Existing text-to-3D methods trained on generic objects or using CLIP-guided priors are inadequate for glyph preservation under strong style transformations.
  • Part-level style control: Design workflows frequently require component-level modifications (e.g., coloring specific strokes), but implicit 3D representations such as NeRF, as well as undifferentiated point clouds, lack an explicit part-level structural decomposition.
  • Data scarcity: No large-scale 3D artistic font dataset currently exists, precluding supervised training and dictating reliance on priors derived from available 2D data sources.

SplatFont3D addresses these core challenges by combining a shape–style tradeoff module (Glyph2Cloud), optimization of explicit 3D Gaussian geometry under a 2D diffusion prior via Score Distillation Sampling, and a robust Dynamic Component Assignment strategy for disentangling and preserving part-level semantics.

2. Pipeline Architecture and Methodological Components

The SplatFont3D framework is organized into three sequential stages:

  • Stage A: Glyph2Cloud (G2C). Inputs include a printed 2D glyph image $x_p$ and a style prompt $y$ (global or per component). The module produces a stylized glyph $x_g$, a segmentation heatmap $H_g$, and an initial 3D point cloud $P_0$. A latent diffusion model $\phi$ (e.g., Stable Diffusion) balances shape reconstruction with style transfer and, via a latent injection strategy, interpolates between shape fidelity and stylistic expressiveness.
  • Stage B: 3D Gaussian Splatting with Score Distillation Sampling (SDS). The initial point cloud $P_0$ is converted into a set of 3D Gaussians $G = \{(\mu_i, \Sigma_i, c_i, \alpha_i)\}$, where $\mu_i$ is position, $\Sigma_i$ is covariance, $c_i$ is color, and $\alpha_i$ is opacity. The Gaussians are rendered differentiably, then optimized through SDS, which uses a pretrained 2D diffusion prior $\phi_x$ to distill gradients that shape the 3D parameters $\theta$ so that multi-view renderings align with the style prompt $y$.
  • Stage C: Dynamic Component Assignment (DCA). DCA prevents component drift and entanglement during optimization. At each iteration, Gaussians are reassigned to glyph components by projecting them to 2D, using per-component segmentation heatmaps, and applying a centroid-weighted criterion to maintain part-boundary coherence.

The overall workflow ensures both global semantic and stylistic consistency, and enables explicit local style controls at the glyph component level.

3. Technical Details

3.1 Glyph2Cloud Module

The G2C module operates on the latent space of a pretrained diffusion model $\phi$ for glyph reconstruction and stylization:

  • For each diffusion step $t$:
    • Generate a shape-guided latent $z_s^t = \phi(z_p^t, y, t)$.
    • Shape loss: $L_{\text{shape}} = \lVert D(z_s^0) - x_p \rVert_1$, where $D$ is the diffusion auto-decoder.
    • For the last $K$ steps, blend $z_s^t$ with $z_p^t$ via $\tilde z^t = \alpha \odot z_s^t + (1-\alpha) \odot z_p^t$; denoise for $t \in [K, 0]$.
    • Final decode: $x_g = D(\tilde z^0)$.

The shape–style tradeoff is continuously adjustable via $\alpha$ and $K$. The stylized glyph $x_g$ undergoes segmentation (e.g., by CLIPSeg), with the thresholded heatmaps $H_g$ yielding a binary mask. Foreground points are sampled and assigned depth, generating $P_0$.
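A minimal sketch of these two steps under stated assumptions: the blend uses a scalar weight in place of the paper's per-element $\alpha$, and depth is assigned as a simple random jitter; all shapes, thresholds, and names below are illustrative, not the authors' implementation.

```python
import torch

def blend_latents(z_s: torch.Tensor, z_p: torch.Tensor, alpha: float) -> torch.Tensor:
    """Latent injection: z~^t = alpha * z_s^t + (1 - alpha) * z_p^t."""
    return alpha * z_s + (1.0 - alpha) * z_p

def lift_to_point_cloud(heatmap: torch.Tensor, n_points: int = 4096,
                        thresh: float = 0.5, depth_jitter: float = 0.05) -> torch.Tensor:
    """Lift a (H, W) heatmap in [0, 1] to an (n_points, 3) initial cloud P_0."""
    ys, xs = torch.nonzero(heatmap > thresh, as_tuple=True)  # foreground pixels
    idx = torch.randint(0, ys.numel(), (n_points,))          # sample with replacement
    H, W = heatmap.shape
    x = xs[idx].float() / W - 0.5                            # normalize to [-0.5, 0.5]
    y = 0.5 - ys[idx].float() / H                            # flip image rows to +y up
    z = depth_jitter * torch.randn(n_points)                 # thin slab of depth
    return torch.stack([x, y, z], dim=-1)

# Toy usage: SD-like 4x64x64 latents and a random stand-in heatmap.
z_s, z_p = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
z_tilde = blend_latents(z_s, z_p, alpha=0.7)    # lean toward style over shape
p0 = lift_to_point_cloud(torch.rand(256, 256))  # stand-in for H_g
```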

3.2 3D Gaussian Splatting and SDS Optimization

Each point in P0P_0 initializes a Gaussian:

  • $\mu_i \in \mathbb{R}^3$ (3D position)
  • $\Sigma_i \in \mathbb{R}^{3 \times 3}$ (small isotropic covariance)
  • $c_i \in \mathbb{R}^3$ (projected color)
  • $\alpha_i \in (0, 1)$ (opacity, set via mask confidence or uniform)

Rendered pixel colors $C(x)$ are computed by compositing the splatted Gaussians front to back in the image plane, $C(x) = \sum_i c_i \alpha_i T_i$, where the transmittance $T_i = \prod_{j<i} (1 - \alpha_j)$ accumulates the opacity of Gaussians closer to the camera.
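As a concrete illustration of this compositing rule, a self-contained sketch for a single pixel, assuming the Gaussians have already been depth-sorted and reduced to a per-pixel color $c_i$ and opacity $\alpha_i$ (a real splatting rasterizer additionally handles projection, tiling, and the Gaussian falloff):

```python
import torch

def composite_pixel(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """C(x) = sum_i c_i * alpha_i * T_i with T_i = prod_{j<i} (1 - alpha_j).

    colors: (N, 3) per-Gaussian RGB at this pixel, sorted front to back.
    alphas: (N,) per-Gaussian opacity at this pixel.
    """
    ones = torch.ones_like(alphas[:1])
    # Exclusive cumulative product: transmittance left over before Gaussian i.
    T = torch.cumprod(torch.cat([ones, 1.0 - alphas[:-1]]), dim=0)
    return (colors * (alphas * T).unsqueeze(-1)).sum(dim=0)

colors = torch.tensor([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # red in front, blue behind
alphas = torch.tensor([0.6, 0.8])
print(composite_pixel(colors, alphas))  # tensor([0.6000, 0.0000, 0.3200])
```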

Parameter optimization uses Score Distillation Sampling:

$$L_{\text{SDS}}(\theta) = \mathbb{E}_{\epsilon, t}\left[ w(t)\, \lVert \epsilon_\phi(z^t; y, t) - \epsilon \rVert^2 \right]$$

where $z^t = \alpha_t x + \sigma_t \epsilon$ is the noisy latent, $\epsilon \sim \mathcal{N}(0, I)$, and $w(t)$ schedules timestep weighting. Gradients backpropagate through the differentiable rendering to update $\theta$.
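In practice the SDS gradient $w(t)\,(\epsilon_\phi(z^t; y, t) - \epsilon)\,\partial z / \partial \theta$ is usually implemented with a stop-gradient on the noise residual rather than by differentiating the loss directly. A minimal runnable sketch, with a dummy module standing in for the frozen prior (which is conditioned on $y$ and $t$ in the real pipeline):

```python
import torch

def sds_step(z0: torch.Tensor, eps_model, alpha_t: float, sigma_t: float, w_t: float):
    """Surrogate loss whose gradient w.r.t. z0 equals w(t) * (eps_pred - eps)."""
    eps = torch.randn_like(z0)
    z_t = alpha_t * z0 + sigma_t * eps    # noisy latent z^t of the rendering
    eps_pred = eps_model(z_t)             # frozen 2D prior's noise estimate
    residual = (eps_pred - eps).detach()  # stop-grad: no backprop through the prior
    return (w_t * residual * z0).sum()

# z0 stands in for the (latent-encoded) differentiable rendering of theta;
# the dummy prior keeps the sketch self-contained.
z0 = torch.randn(1, 4, 64, 64, requires_grad=True)
loss = sds_step(z0, eps_model=lambda z: torch.zeros_like(z),
                alpha_t=0.8, sigma_t=0.6, w_t=1.0)
loss.backward()  # in the full pipeline, gradients continue into mu, Sigma, c, alpha
```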

3.3 Dynamic Component Assignment

To counteract component drift, DCA repeats the following every $N$ steps:

  • Renders a front-view image.
  • For each pixel $p$ and component $m$ with heatmap $H_g^m(p)$, computes a label:

$$M(p) = \arg\max_m \left[ \log(H_g^m(p) + \delta) - \beta\, \lVert p - u_H^m \rVert_2 \right]$$

where $u_H^m$ is the heatmap centroid of component $m$, $\delta \to 0$ is a small stabilizing constant, and $\beta$ penalizes distance from the centroid. Every Gaussian's component assignment is then updated according to its projected position.

This procedure maintains explicit, disentangled, and robust component grouping as the geometry evolves.
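A small sketch of this labeling rule on toy heatmaps, computing the heatmap-weighted centroids $u_H^m$ and the per-pixel $\arg\max$; in the full pipeline, each Gaussian then inherits the label at its projected pixel (constants and shapes here are illustrative):

```python
import torch

def dca_labels(heatmaps: torch.Tensor, beta: float = 0.05, delta: float = 1e-6):
    """heatmaps: (M, H, W) per-component maps -> (H, W) component labels."""
    M, H, W = heatmaps.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1)            # (H, W, 2) pixel grid
    mass = heatmaps.sum(dim=(1, 2)).clamp_min(delta)  # per-component total weight
    # Heatmap-weighted centroid u_H^m of each component.
    cent = (heatmaps.unsqueeze(-1) * coords).sum(dim=(1, 2)) / mass.unsqueeze(-1)
    dist = torch.linalg.norm(coords.unsqueeze(0) - cent.view(M, 1, 1, 2), dim=-1)
    score = torch.log(heatmaps + delta) - beta * dist  # (M, H, W) per-pixel scores
    return score.argmax(dim=0)                         # winning component per pixel

labels = dca_labels(torch.rand(3, 64, 64))  # three toy components -> (64, 64) labels
```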

4. Quantitative Evaluation and Comparative Analysis

The experimental setup utilizes 44 glyphs across three script types, with two global styles and both global and part-level stylizations, comprising 1,760 glyph–style pairs. All data are synthetic; there is no real 3D supervision. Baseline comparisons include DreamFusion, DreamFont3D, Latent-NeRF, MVDream, GaussianDreamer(Pro), GSGEN, and Trellis.

Key outcome metrics:

  • Semantic consistency: CLIP, BLIP-2 + GPT-4 Alignment (scale 1–5)
  • Visual quality: ImageReward (“Quality”), V-LPIPS, V-CLIP

Summary of results:

Model         Global CLIP   Part-Level CLIP   Quality   V-LPIPS   V-CLIP
SplatFont3D   0.80          0.84              53.11     0.18      —
DreamFont3D   0.82          0.81              35.62     0.19      —
  • SplatFont3D demonstrates a higher part-level CLIP score (0.84) relative to DreamFont3D (0.81), and a significantly higher “Quality” metric (53.11 versus 35.62). Multi-view consistency (V-LPIPS) is comparable.
  • Rendering performance: SplatFont3D achieves ~40–60 FPS at $1024 \times 1024$ resolution on an RTX 3090 and uses ~8 GB of GPU memory, versus 15–20 GB and under 5 FPS for NeRF-based approaches.
  • Ablation studies confirm that removal of Glyph2Cloud leads to shape drift and loss of recognizability, while omission of Dynamic Component Assignment results in blurred or entangled part-level renderings. The full model is required for optimal metric values and robust component separation.
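For reference, a minimal sketch of how a CLIP-based semantic-consistency score of the kind reported above can be computed over multi-view renders, using the public openai/clip-vit-base-patch32 checkpoint; this is illustrative, not the paper's exact evaluation protocol:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(views: list[Image.Image], prompt: str) -> float:
    """Mean cosine similarity between each rendered view and the style prompt."""
    inputs = processor(text=[prompt], images=views, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Toy usage with blank placeholder "renders" of the stylized glyph.
views = [Image.new("RGB", (224, 224)) for _ in range(4)]
print(clip_score(views, "letter A carved from weathered bronze"))
```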

5. Practical Implications and Current Limitations

SplatFont3D is the first zero-data pipeline to provide explicit, drift-robust part-level style control for 3D artistic glyphs that can be rendered efficiently for both global and part-specific stylization (Gan et al., 29 Nov 2025). It effectively leverages 2D diffusion priors as a bridge from limited 2D data to high-quality 3D representations, efficiently optimizing explicit geometry suitable for real-world deployment.

However, several limitations persist:

  • Style diversity is constrained by the underlying 2D diffusion prior’s training distribution; generalization to significantly out-of-distribution styles is limited.
  • Extremely fine-grained part decomposition (e.g., modeling more than 6 components per glyph) increases optimization difficulty and can degrade visual fidelity, as shown in qualitative and quantitative ablations.
  • SDS optimization requires several GPU-hours per glyph, representing a computational bottleneck.
  • No explicit 3D shape priors (e.g., SDFs) are yet incorporated to further stabilize the resulting geometry.

6. Future Directions

Possible extensions identified include:

  • Integration of explicit 3D shape priors (e.g., signed distance functions) to enhance geometric robustness.
  • Joint training of a lightweight, part-aware diffusion prior operating directly on 3D Gaussians to further improve fidelity and decouple component stylization.
  • Development of interactive tools for real-time, stroke-level editing and refinement, leveraging the explicit part structure and rapid rendering made possible by Gaussian splatting.

These directions aim to expand SplatFont3D’s applicability in design pipelines and support further research in structure- and semantics-aware 3D font generation.

References (1)

  • Gan et al., "SplatFont3D," 29 Nov 2025.
