
Generating Surface for Text-to-3D using 2D Gaussian Splatting (2510.06967v1)

Published 8 Oct 2025 in cs.CV and cs.AI

Abstract: Recent advancements in Text-to-3D modeling have shown significant potential for the creation of 3D content. However, due to the complex geometric shapes of objects in the natural world, generating 3D content remains a challenging task. Current methods either leverage 2D diffusion priors to recover 3D geometry or train the model directly on specific 3D representations. In this paper, we propose a novel method named DirectGaussian, which focuses on generating the surfaces of 3D objects represented by surfels. In DirectGaussian, we utilize conditional text generation models, and the surface of a 3D object is rendered by 2D Gaussian splatting with multi-view normal and texture priors. To address multi-view geometric consistency, DirectGaussian incorporates curvature constraints on the generated surface during the optimization process. Through extensive experiments, we demonstrate that our framework is capable of diverse and high-fidelity 3D content creation.

Summary

  • The paper presents DirectGaussian, a method that leverages 2D Gaussian Splatting and multi-view priors to produce high-fidelity 3D surfaces from textual descriptions.
  • It optimizes Gaussian surfels with curvature constraints and multi-view consistency to reduce computational overhead compared to traditional mesh-based methods.
  • Evaluation shows that DirectGaussian outperforms existing techniques in CLIP-based text-image consistency, yielding robust and visually coherent 3D models.

Generating Surface for Text-to-3D using 2D Gaussian Splatting

The paper "Generating Surface for Text-to-3D using 2D Gaussian Splatting" (2510.06967) presents DirectGaussian, a method for generating high-fidelity 3D object surfaces from textual descriptions using 2D Gaussian splatting. It sidesteps the limitations of traditional 3D content generation pipelines by combining conditional text generation models with multi-view normal and texture priors, yielding texturally rich and geometrically consistent 3D models from text inputs.

Method Overview

DirectGaussian comprises three core components: constructing a dataset of Gaussian surfels aligned with textual descriptions, generating coarse Gaussian surfels, and optimizing these surfels with multi-view geometric priors. The method incorporates curvature constraints to keep the rendered surfaces consistent, which is essential for high-quality visual output from arbitrary viewpoints.

Figure 1: DirectGaussian generates 3D objects from text captions, employing multi-view diffusion models guided by text for enhanced geometric consistency.

The core idea is to render 3D objects by computing Gaussian surfels that align well with text-conditioned multi-view imagery. This approach circumvents the reliance on traditional point cloud or mesh-based methods, which are often constrained by high computational overheads or predefined topologies.
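The primitive being optimized here is the 2D Gaussian surfel: a flat Gaussian embedded in 3D with a center, a tangent frame, per-axis scales, an opacity, and a color. The sketch below illustrates this representation; the field names and layout are assumptions for exposition, not the authors' actual data structure.

```python
import numpy as np

class GaussianSurfel:
    """Illustrative 2D Gaussian surfel: a flat Gaussian disc in 3D space."""

    def __init__(self, center, tangent_u, tangent_v, scale_u, scale_v,
                 opacity, color):
        self.center = np.asarray(center, dtype=float)   # 3D position
        # Orthonormalize the tangent frame (Gram-Schmidt) so the
        # surfel lies on a well-defined plane.
        u = np.asarray(tangent_u, dtype=float)
        u = u / np.linalg.norm(u)
        v = np.asarray(tangent_v, dtype=float)
        v = v - u * np.dot(u, v)
        v = v / np.linalg.norm(v)
        self.tangent_u, self.tangent_v = u, v
        self.scale = np.array([scale_u, scale_v])       # 2D extent
        self.opacity = float(opacity)
        self.color = np.asarray(color, dtype=float)     # RGB

    @property
    def normal(self):
        # The surfel normal is the cross product of the tangent axes;
        # this is the quantity multi-view normal priors can supervise.
        return np.cross(self.tangent_u, self.tangent_v)
```

Because each surfel carries an explicit normal, normal priors from multi-view diffusion models can constrain geometry directly, which is harder with volumetric 3D Gaussians.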

Gaussian Splatting and Optimization

DirectGaussian builds on the Gaussian splatting technique to render 3D surfaces. Unlike methods that generate exhaustive sets of multi-view images or directly synthesize entire 3D structures, it creates and refines a parametrized surface of Gaussian elements conditioned on multi-view priors.

The paper details the construction of a dataset called TextGaussian, which maps text descriptions to Gaussian parameters. This dataset supplies the geometric and textural priors needed to bootstrap the Gaussian generation process. By optimizing these initial parameters with carefully constructed loss functions, DirectGaussian achieves robust final surface quality.

Figure 2: A gallery of text-to-3D generation results from DirectGaussian, demonstrating high-quality textured outputs.
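The optimization described above compares rendered views against texture and normal priors from the multi-view diffusion models. A minimal sketch of such a composite objective is below; the specific terms and weights are assumptions, not the paper's reported formulation.

```python
import numpy as np

def multiview_loss(rendered_rgb, prior_rgb, rendered_normals, prior_normals,
                   w_rgb=1.0, w_normal=0.5):
    """Hypothetical composite loss for one view.

    rendered_rgb / prior_rgb:        (H, W, 3) images
    rendered_normals / prior_normals: (H, W, 3) unit normal maps
    """
    # Photometric (texture) term: L1 between rendered and prior images.
    rgb_loss = np.mean(np.abs(rendered_rgb - prior_rgb))
    # Normal term: 1 - cosine similarity per pixel, averaged.
    dots = np.sum(rendered_normals * prior_normals, axis=-1)
    normal_loss = np.mean(1.0 - dots)
    return w_rgb * rgb_loss + w_normal * normal_loss
```

In practice the loss would be summed over many sampled viewpoints per iteration, with gradients flowing back to the surfel parameters through a differentiable rasterizer.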

Surface curvature constraints implemented in a 360-degree surround-view configuration ensure that the generated geometry maintains coherence across disparate viewing angles. This method uses normal consistency techniques to further enhance the geometric fidelity of the generated models.
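One simple way to realize such a curvature constraint is to penalize disagreement between the normals of neighboring surfels sampled along the surround-view path: a smooth surface has slowly varying normals, so mean angular deviation acts as a curvature proxy. The function below is a sketch of that idea, not the paper's exact constraint.

```python
import numpy as np

def curvature_penalty(normals):
    """normals: (N, 3) unit normals ordered along a surround-view path.

    Returns the mean (1 - cosine) between adjacent normals:
    0 for a perfectly smooth sequence, larger for sharp normal changes.
    """
    n = np.asarray(normals, dtype=float)
    dots = np.sum(n[:-1] * n[1:], axis=-1)   # adjacent cosine similarities
    dots = np.clip(dots, -1.0, 1.0)          # guard against rounding drift
    return np.mean(1.0 - dots)
```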

Evaluation and Results

The experimental evaluation highlights DirectGaussian's superior performance relative to contemporary text-to-3D methods, especially under the CLIP metric, which measures text-image consistency. Qualitative visualizations confirm the method's ability to produce visually appealing and semantically relevant 3D models, a clear advance over prior strategies.

Figure 3: Comparison with alternative techniques, illustrating DirectGaussian's ability to produce coherent and detailed geometric structures.
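The CLIP-based consistency metric used in the comparison reduces to cosine similarity between a text embedding and a rendered-view embedding. The sketch below assumes the embeddings are precomputed by a pretrained CLIP model (not reimplemented here) and only shows the scoring step.

```python
import numpy as np

def clip_score(text_emb, image_emb):
    """Cosine similarity between text and image embedding vectors.

    Inputs are assumed to come from a pretrained CLIP encoder pair;
    the score lies in [-1, 1], higher meaning better text-image agreement.
    """
    t = np.asarray(text_emb, dtype=float)
    i = np.asarray(image_emb, dtype=float)
    t = t / np.linalg.norm(t)
    i = i / np.linalg.norm(i)
    return float(np.dot(t, i))
```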

A user study further reinforces these findings, where DirectGaussian was preferred over other leading methods, underscoring its effectiveness in aligning with textual inputs.

Future Implications

DirectGaussian represents a substantial progression in the field of text-to-3D generation, offering a potentially scalable framework for applications across various digital content creation domains, including virtual reality and animation. Future research could expand upon this work by exploring richer datasets, enhancing the fidelity of texture mappings, and integrating more complex text-conditioned rendering techniques.

Conclusion

The research encapsulated in this paper provides a valuable contribution to the field of generative AI by introducing an innovative framework for efficiently translating text into high-fidelity 3D models. DirectGaussian's success lies in its ability to blend multi-view geometric consistency with the nuanced rendering of surfaces using Gaussian splatting, paving the way for advanced applications in 3D content creation.
