SplatFont3D: Structure-Aware 3D Fonts
- SplatFont3D is a structure-aware text-to-3D artistic font generation framework that transforms 2D glyphs into immersive 3D fonts while maintaining semantic integrity.
- It employs a multi-stage pipeline combining Glyph2Cloud, 3D Gaussian splatting with score distillation sampling, and dynamic component assignment for precise part-level stylization.
- The framework enables robust multi-view rendering and style transfer, making it ideal for applications such as VR/AR, animation, and video game design.
SplatFont3D is a structure-aware text-to-3D artistic font generation framework that leverages 3D Gaussian splatting and enables precise part-level style control for glyphs. Unlike previous artistic font generation (AFG) approaches, which have focused almost exclusively on 2D representations, SplatFont3D generates 3D fonts that capture fine-grained semantic and geometric constraints intrinsic to glyphs. This enables the rendering of fonts from arbitrary viewpoints, making them suitable for immersive and interactive 3D environments such as video games, animation, and AR/VR, while simultaneously addressing the challenges of part-level stylization, semantic preservation, and the absence of large-scale 3D font datasets (Gan et al., 29 Nov 2025).
1. Problem Formulation and Motivation
Most AFG research addresses 2D flat designs, which fail to capture the spatial and multi-view consistency necessary for 3D and immersive applications. Transitioning to 3D-AFG allows for novel-view synthesis (all 2D renderings become special cases), supports integration in 3D environments, and introduces new requirements:
- Semantic structure constraints: Glyph integrity (e.g., distinguishability of “A”) must be maintained during stylization. Existing text-to-3D methods trained on generic objects or using CLIP-guided priors are inadequate for glyph preservation under strong style transformations.
- Part-level style control: Design workflows frequently require component-level modifications (e.g., coloring specific strokes), but implicit 3D representations like NeRF, as well as undifferentiated point clouds, lack an explicit part-level structural decomposition.
- Data scarcity: No large-scale 3D artistic font dataset currently exists, precluding supervised training and dictating reliance on priors derived from available 2D data sources.
SplatFont3D addresses these core challenges by combining a shape–style tradeoff module (Glyph2Cloud), optimization of explicit 3D Gaussian geometry under a 2D diffusion prior via Score Distillation Sampling, and a robust Dynamic Component Assignment strategy for disentangling and preserving part-level semantics.
2. Pipeline Architecture and Methodological Components
The SplatFont3D framework is organized into three sequential stages:
- Stage A: Glyph2Cloud (G2C). Inputs are a printed 2D glyph image $I_g$ and a style prompt (global or per component). The module produces a stylized glyph $I_s$, a segmentation heatmap $H$, and an initial 3D point cloud $P_0$. A latent diffusion model (e.g., Stable Diffusion) is used to balance shape reconstruction with style transfer and, via a latent-injection strategy, interpolates between shape fidelity and stylistic expressiveness.
- Stage B: 3D Gaussian Splatting with Score Distillation Sampling (SDS). The initial point cloud $P_0$ is converted into a set of 3D Gaussians $\{(\mu_i, \Sigma_i, c_i, \alpha_i)\}$, where $\mu_i$ is position, $\Sigma_i$ is covariance, $c_i$ is color, and $\alpha_i$ is opacity. The Gaussians are rendered differentiably, then optimized through SDS, which uses a pretrained 2D diffusion prior to distill gradients that shape the 3D parameters so that multi-view renderings align with the style prompt $y$.
- Stage C: Dynamic Component Assignment (DCA). DCA prevents component drift and entanglement during optimization. At each iteration, Gaussians are reassigned to glyph components by projecting them to 2D, using per-component segmentation heatmaps $H_k$, and applying a centroid-weighted criterion to maintain part-boundary coherence.
The overall workflow ensures both global semantic and stylistic consistency, and enables explicit local style controls at the glyph component level.
3. Technical Details
3.1 Glyph2Cloud Module
The G2C module operates on the latent space of a pretrained diffusion model for glyph reconstruction and stylization:
- For each diffusion step $t$:
- Generate a shape-guided latent $z_t^{\mathrm{shape}}$ conditioned on the input glyph $I_g$.
- Shape loss: $\mathcal{L}_{\mathrm{shape}} = \lVert \mathcal{D}(z_t^{\mathrm{shape}}) - I_g \rVert_2^2$, where $\mathcal{D}$ is the diffusion auto-decoder.
- For the last $k$ steps, blend $z_t^{\mathrm{shape}}$ with the style-guided latent $z_t^{\mathrm{style}}$ via $\tilde{z}_t = \lambda\, z_t^{\mathrm{shape}} + (1 - \lambda)\, z_t^{\mathrm{style}}$; denoise $\tilde{z}_t$ for $t = k, \dots, 1$.
- Final decode: $I_s = \mathcal{D}(\tilde{z}_0)$.
Shape–style tradeoff is continuously adjustable via $\lambda$ and $k$. The stylized glyph $I_s$ undergoes segmentation (e.g., by CLIPSeg), with thresholded heatmaps yielding a binary mask. Foreground points are sampled and assigned depth, generating $P_0$.
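The latent-injection blend can be made concrete with a short sketch. This is a minimal illustration under the notation above, not the authors' implementation; the countdown step convention and all names are assumptions.

```python
import torch

def inject_style(z_shape: torch.Tensor, z_style: torch.Tensor,
                 lam: float, t: int, k: int) -> torch.Tensor:
    """Blend shape- and style-guided latents for the last k denoising steps.

    Assumes t counts down toward 0; earlier steps keep the shape latent,
    so lam and k jointly control the shape-style tradeoff.
    """
    if t <= k:
        return lam * z_shape + (1.0 - lam) * z_style  # interpolated latent
    return z_shape  # early steps: pure shape guidance
```

Larger $\lambda$ (or smaller $k$) favors shape fidelity; smaller $\lambda$ (or larger $k$) favors stylistic expressiveness.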
3.2 3D Gaussian Splatting and SDS Optimization
Each point in $P_0$ initializes a Gaussian:
- $\mu_i \in \mathbb{R}^3$ (3D position)
- $\Sigma_i = \sigma^2 I$ (small isotropic covariance)
- $c_i$ (projected color)
- $\alpha_i$ (opacity, set via mask confidence or uniformly)
Rendered pixels are computed by summing the contributions of splatted Gaussians in the image plane, accounting for transmittance: $C(p) = \sum_i c_i\,\alpha_i\,T_i$ with $T_i = \prod_{j<i}(1 - \alpha_j)$.
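A compact sketch of this initialization and per-pixel compositing follows; the defaults for `sigma` and `alpha` are assumptions, and the paper's actual rasterizer tile-sorts and blends all Gaussians, which this simplifies to one depth-sorted pixel.

```python
import numpy as np

def init_gaussians(points: np.ndarray, colors: np.ndarray,
                   sigma: float = 0.01, alpha: float = 0.8) -> dict:
    """One isotropic Gaussian per point of P_0; defaults are assumptions."""
    n = len(points)
    return {
        "mu": points.astype(np.float32),               # (n, 3) positions
        "cov": np.tile(sigma**2 * np.eye(3, dtype=np.float32), (n, 1, 1)),
        "color": colors.astype(np.float32),            # (n, 3) projected colors
        "alpha": np.full(n, alpha, dtype=np.float32),  # uniform opacity
    }

def composite_pixel(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back compositing for depth-sorted splats hitting one pixel:
    C = sum_i c_i * alpha_i * T_i with T_i = prod_{j<i} (1 - alpha_j)."""
    T = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance
    return (colors * (alphas * T)[:, None]).sum(axis=0)
```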
Parameter optimization uses Score Distillation Sampling:

$$\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_{\phi}(z_t;\, y, t) - \epsilon\big)\,\frac{\partial z_t}{\partial \theta} \,\right],$$

where $z_t$ is the noisy latent of the rendered view, $\epsilon \sim \mathcal{N}(0, I)$, and $w(t)$ schedules timestep weighting. Gradients backpropagate through the differentiable rendering to update $\theta = \{\mu_i, \Sigma_i, c_i, \alpha_i\}$.
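The SDS update is usually implemented with a surrogate loss whose gradient matches the expression above while skipping the U-Net Jacobian. The sketch below follows the common DreamFusion-style formulation rather than the paper's exact code; `encode`, `unet`, the timestep range, and the weighting choice are assumptions.

```python
import torch

def sds_loss(image: torch.Tensor, encode, unet, y_emb: torch.Tensor,
             alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Score Distillation Sampling surrogate loss (DreamFusion-style sketch).

    `image` is a differentiable render of the Gaussians; `encode` maps it to
    the diffusion latent; `unet` is the frozen 2D noise predictor.
    """
    z = encode(image)                                  # keeps graph to renderer
    t = torch.randint(20, 980, (1,), device=z.device)  # random timestep
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_t = a.sqrt() * z + (1 - a).sqrt() * eps          # forward diffusion
    with torch.no_grad():
        eps_pred = unet(z_t, t, y_emb)                 # frozen prior, no grad
    grad = (1 - a) * (eps_pred - eps)                  # w(t) = 1 - alpha_bar_t
    # Surrogate whose gradient w.r.t. z equals `grad`; backprop then flows
    # through the differentiable rasterizer to the Gaussian parameters.
    return (grad.detach() * z).sum()
```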
3.3 Dynamic Component Assignment
Addressing component drift, DCA runs repeatedly (every $N$ optimization steps):
- Renders a front-view image.
- For each pixel $p$ and component $k$ with heatmap $H_k$, computes a label
$$\ell(p) = \arg\max_{k}\,\big[\, H_k(p) - \eta\,\lVert p - c_k \rVert_2 \,\big],$$
where $c_k$ is the component centroid and $\eta > 0$ penalizes distance from it. Every Gaussian's component assignment is updated to match the label at its projected position.
This procedure maintains explicit, disentangled, and robust component grouping as the geometry evolves.
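A sketch of the reassignment rule under the formulation above; the projection to pixel coordinates is assumed given, and `eta` is the distance-penalty weight (its default here is hypothetical).

```python
import numpy as np

def assign_components(proj_xy: np.ndarray, heatmaps: np.ndarray,
                      eta: float = 0.1) -> np.ndarray:
    """Centroid-weighted component labels for projected Gaussians (sketch).

    proj_xy:  (G, 2) pixel coordinates of the Gaussians in the front view.
    heatmaps: (K, H, W) per-component segmentation heatmaps.
    Returns a (G,) array of component indices.
    """
    K, H, W = heatmaps.shape
    ys, xs = np.mgrid[0:H, 0:W]
    px = np.clip(proj_xy[:, 0].astype(int), 0, W - 1)
    py = np.clip(proj_xy[:, 1].astype(int), 0, H - 1)
    scores = np.empty((len(proj_xy), K))
    for k in range(K):
        m = heatmaps[k]
        w = m.sum() + 1e-8
        cx, cy = (xs * m).sum() / w, (ys * m).sum() / w    # heatmap centroid
        dist = np.hypot(proj_xy[:, 0] - cx, proj_xy[:, 1] - cy)
        scores[:, k] = m[py, px] - eta * dist              # response - penalty
    return scores.argmax(axis=1)
```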
4. Quantitative Evaluation and Comparative Analysis
The experimental setup uses 44 glyphs across three script types, with two global styles applied as both global and part-level stylizations, comprising 1,760 glyph–style pairs. All data are synthetic; there is no real 3D supervision. Baseline comparisons include DreamFusion, DreamFont3D, Latent-NeRF, MVDream, GaussianDreamer(Pro), GSGEN, and Trellis.
Key outcome metrics:
- Semantic consistency: CLIP, BLIP-2 + GPT-4 Alignment (scale 1–5)
- Visual quality: ImageReward (“Quality”), V-LPIPS, V-CLIP
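As an illustration of the semantic-consistency measurement, a common way to compute a CLIP score between rendered views and the prompt is sketched below. The paper's exact protocol (model variant, view sampling) is not specified here, so treat this as a generic recipe with an assumed checkpoint.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_score(rendered_views, prompt: str) -> float:
    """Mean cosine similarity between rendered views (PIL images) and a
    text prompt, using an off-the-shelf CLIP; a generic metric sketch."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=[prompt], images=rendered_views,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```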
Summary of results:
| Model | Global CLIP | Part-Level CLIP | Quality | V-LPIPS | V-CLIP |
|---|---|---|---|---|---|
| SplatFont3D | 0.80 | 0.84 | 53.11 | 0.18 | — |
| DreamFont3D | 0.82 | 0.81 | 35.62 | 0.19 | — |
- SplatFont3D demonstrates a higher part-level CLIP score (0.84) relative to DreamFont3D (0.81), and a significantly higher “Quality” metric (53.11 versus 35.62). Multi-view consistency (V-LPIPS) is comparable.
- Rendering performance: SplatFont3D renders at 40–60 FPS on an RTX 3090 and uses 8 GB of GPU memory, versus NeRF-based baselines at 15–20 GB and roughly 5 FPS.
- Ablation studies confirm that removal of Glyph2Cloud leads to shape drift and loss of recognizability, while omission of Dynamic Component Assignment results in blurred or entangled part-level renderings. The full model is required for optimal metric values and robust component separation.
5. Practical Implications and Current Limitations
SplatFont3D is the first zero-data pipeline to provide explicit, drift-robust part-level style control for 3D artistic glyphs that can be rendered efficiently under both global and part-specific stylization (Gan et al., 29 Nov 2025). It leverages 2D diffusion priors as a bridge from limited 2D data to high-quality 3D representations, optimizing explicit geometry efficiently enough for real-world deployment.
However, several limitations persist:
- Style diversity is constrained by the underlying 2D diffusion prior’s training distribution; generalization to significantly out-of-distribution styles is limited.
- Extremely fine-grained part decomposition (e.g., modeling 6 components per glyph) increases optimization difficulty and can degrade visual fidelity, as shown in qualitative and quantitative ablations.
- SDS optimization requires several GPU-hours per glyph, representing a computational bottleneck.
- No explicit 3D shape priors (e.g., SDFs) are yet incorporated to further stabilize resulting geometry.
6. Future Directions
Possible extensions identified include:
- Integration of explicit 3D shape priors (e.g., signed distance functions) to enhance geometric robustness.
- Joint training of a lightweight, part-aware diffusion prior operating directly on 3D Gaussians to further improve fidelity and decouple component stylization.
- Development of interactive tools for real-time, stroke-level editing and refinement, leveraging the explicit part structure and rapid rendering made possible by Gaussian splatting.
These directions aim to expand SplatFont3D’s applicability in design pipelines and support further research in structure- and semantics-aware 3D font generation.