TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Published 25 May 2026 in cs.CV | (2605.26115v1)

Abstract: Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.


Authors (8)

Summary

The paper presents a novel feed-forward pipeline that outputs triangle primitives directly, eliminating the need for post-hoc mesh conversion.
The paper leverages geometry-anchored orientation and progressive surface sharpening to achieve crisp, accurate meshes efficiently.
The paper reports stronger surface-geometry metrics than Gaussian feed-forward baselines and sub-1.3 s mesh export for up to 24 views, enabling direct use in robotics and simulation.


Introduction

TriSplat introduces a feed-forward paradigm for simulation-ready 3D scene reconstruction from sparse, unposed images. The method targets a critical gap: most recent feed-forward pipelines (e.g., Gaussian splatting networks) produce primitives that require costly and lossy mesh extraction procedures to support downstream physics and embodied simulation tasks. TriSplat shifts the representation to explicitly oriented triangle primitives, enabling direct mesh export suitable for consumption by physics engines, collision detectors, and rendering pipelines without intermediate conversion steps. TriSplat jointly predicts per-pixel triangles, point maps, camera poses, and optionally intrinsics, leveraging a geometry-anchored orientation pipeline and progressive surface sharpening to produce accurate and simulation-ready triangle meshes.

Figure 1: Overview of TriSplat; sparse input images are processed via a DINOv2 backbone with attention decoding and three parallel heads, producing 3D point maps, camera poses, and triangle primitive attributes for direct mesh export.

Methodology

Representation and Prediction Pipeline

TriSplat employs a DINOv2 backbone with transformer-based local-global attention decoding. Three parallel heads follow: a dense head for local 3D point maps, a dense head for per-pixel triangle attributes (density, scale, quaternion, SH appearance, blur), and a camera pose regression head. Each image pixel yields one triangle primitive, whose center, scale, and orientation are derived from the predicted geometry and the refined normal field. Triangles are parameterized using canonical templates, transformed into world coordinates, and rendered with a differentiable triangle rasterizer.
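The template-to-world step can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The equilateral template, the (w, x, y, z) quaternion convention, the two-component in-plane scale, and the names `CANONICAL_TRIANGLE`, `quaternion_to_rotation`, and `triangle_vertices` are all assumptions made for illustration.

```python
import numpy as np

# Assumed canonical template: an equilateral triangle with circumradius 1 in
# the local xy-plane, centered at the origin, wound counter-clockwise so that
# its normal points along +z.
CANONICAL_TRIANGLE = np.array([
    [1.0, 0.0, 0.0],
    [-0.5, np.sqrt(3.0) / 2.0, 0.0],
    [-0.5, -np.sqrt(3.0) / 2.0, 0.0],
])


def quaternion_to_rotation(q):
    """Convert a (w, x, y, z) quaternion to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def triangle_vertices(center, scale, quat, cam_to_world=None):
    """Place one template triangle in world space.

    center: (3,) triangle center in the camera frame (from the point map).
    scale: (2,) in-plane scale factors (hypothetical parameterization).
    quat: (4,) orientation as (w, x, y, z).
    cam_to_world: optional 4x4 camera-to-world transform (identity if None).
    """
    if cam_to_world is None:
        cam_to_world = np.eye(4)
    # Scale the template in its own plane, rotate it, then move it to the center.
    local = CANONICAL_TRIANGLE * np.array([scale[0], scale[1], 1.0])
    rotation = quaternion_to_rotation(quat)
    cam = local @ rotation.T + np.asarray(center, dtype=float)
    # Apply the predicted camera pose in homogeneous coordinates.
    homo = np.concatenate([cam, np.ones((3, 1))], axis=1)
    return (homo @ cam_to_world.T)[:, :3]
```

Applying `triangle_vertices` to every pixel of every view would give an (N, 3, 3) array of world-space triangles, which is the form a triangle rasterizer or mesh exporter typically consumes.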

Geometry-Anchored Orientation

Triangle orientation is anchored to predicted geometry. Surface normals are computed via finite differences on the dense point maps, then smoothed and refined by a learned U-Net. Early training instability is mitigated by a mono-normal bootstrap, which blends geometry-driven and teacher-estimated (Omnidata) normals under a cosine schedule. Validity-aware masking excludes degenerate or boundary pixels, keeping orientation estimates robust. A tangent frame is constructed per triangle, with the normal and tangents determining rotation and face winding.
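A rough NumPy sketch of the geometric part of this pipeline follows. It covers finite-difference normals with a validity mask, a cosine-scheduled blend with teacher normals, and a tangent frame built from a normal. The learned U-Net refinement is omitted. The central-difference stencil, the camera-facing sign convention, the direction of the schedule (teacher-heavy early), and all function names are assumptions, not details confirmed by the paper.

```python
import numpy as np


def normals_from_point_map(points, eps=1e-8):
    """Estimate unit normals from an (H, W, 3) point map by central differences.

    Returns normals (H, W, 3) and a boolean validity mask (H, W). Border
    pixels and pixels with a degenerate cross product are marked invalid.
    """
    dx = np.zeros_like(points)
    dy = np.zeros_like(points)
    dx[:, 1:-1] = points[:, 2:] - points[:, :-2]   # along image columns
    dy[1:-1, :] = points[2:, :] - points[:-2, :]   # along image rows
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1)
    valid = norm > eps
    valid[0, :] = valid[-1, :] = False
    valid[:, 0] = valid[:, -1] = False
    n = np.where(valid[..., None], n / np.maximum(norm, eps)[..., None], 0.0)
    # Assumed convention: normals face the camera at the origin.
    facing_away = np.sum(n * points, axis=-1) > 0
    n = np.where(facing_away[..., None], -n, n)
    return n, valid


def bootstrap_weight(step, total_steps):
    """Cosine schedule from 1 (teacher only) down to 0 (geometry only)."""
    t = np.clip(step / total_steps, 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))


def blend_normals(n_geo, n_teacher, w):
    """Blend two normal fields with teacher weight w and renormalize."""
    n = w * n_teacher + (1.0 - w) * n_geo
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8)


def tangent_frame(n):
    """Build a right-handed orthonormal frame with columns (t, b, n)."""
    # Pick a helper axis that is not nearly parallel to the normal.
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t = np.cross(helper, n)
    t = t / np.linalg.norm(t)
    b = np.cross(n, t)
    return np.stack([t, b, n], axis=1)
```

The matrix returned by `tangent_frame` is a proper rotation, so it can be used directly as a triangle's local frame or converted to a quaternion.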

Progressive Surface Sharpening

Triangles are highly sensitive to orientation and placement errors. To maximize gradient coverage during early optimization, TriSplat schedules two softness parameters: opacity mapping (progressively binarizing primitive densities) and edge blur (decaying spatial softness). This curriculum transitions the representation from forgiving soft primitives to crisp, mesh-ready triangles, ensuring stable convergence and accurate surface definition.
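One way such a curriculum could be written is sketched below. The summary does not give the actual schedule shapes or constants, so the linear sharpness ramp, the geometric blur decay, the default values, and the function names are purely illustrative assumptions.

```python
import numpy as np


def progress(step, total_steps):
    """Training progress clipped to [0, 1]."""
    return float(np.clip(step / total_steps, 0.0, 1.0))


def opacity_sharpness(step, total_steps, k_start=1.0, k_end=20.0):
    """Linearly increase the sigmoid sharpness (hypothetical constants)."""
    return k_start + (k_end - k_start) * progress(step, total_steps)


def map_opacity(density_logit, sharpness):
    """Map a raw density logit to opacity in (0, 1).

    A larger sharpness pushes the output toward 0 or 1, which approximates
    a progressive binarization of primitive densities.
    """
    density_logit = np.asarray(density_logit, dtype=float)
    return 1.0 / (1.0 + np.exp(-sharpness * density_logit))


def edge_blur(step, total_steps, blur_start=1.0, blur_end=0.01):
    """Decay edge softness geometrically from blur_start to blur_end."""
    return blur_start * (blur_end / blur_start) ** progress(step, total_steps)
```

Early in training a logit of 0.5 maps to a soft opacity of about 0.62, while at the end of the schedule the same logit maps to almost 1, so gradients are spread widely at first and primitives become nearly opaque or fully transparent later.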

Supervision and Losses

Training is supervised by a composite loss: a photometric term (pixel-wise MSE and LPIPS), a relative pose loss (translation and rotation), and a normal cosine-similarity term. Scheduled sampling gradually reduces reliance on ground-truth poses. Mesh extraction is trivial: triangle primitives are exported directly, avoiding the post-processing typical of Gaussian pipelines.
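The structure of such a composite objective might look like the NumPy sketch below. The loss weights, the geodesic form of the rotation error, and every function name are assumptions. LPIPS needs a pretrained network, so it appears only as an optional callable `perceptual_fn` that the caller would have to supply.

```python
import numpy as np


def mse_loss(pred, target):
    """Pixel-wise mean squared error."""
    return float(np.mean((pred - target) ** 2))


def rotation_geodesic(r_pred, r_gt):
    """Angle in radians between two 3x3 rotation matrices."""
    cos = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def relative_pose(t_a, t_b):
    """Relative 4x4 transform taking camera b's frame to camera a's frame."""
    return np.linalg.inv(t_a) @ t_b


def pose_loss(rel_pred, rel_gt):
    """Rotation angle plus translation distance between relative poses."""
    rot = rotation_geodesic(rel_pred[:3, :3], rel_gt[:3, :3])
    trans = float(np.linalg.norm(rel_pred[:3, 3] - rel_gt[:3, 3]))
    return rot + trans


def normal_cosine_loss(n_pred, n_ref, valid):
    """Mean of (1 - cosine similarity) over valid pixels."""
    if not np.any(valid):
        return 0.0
    cos = np.sum(n_pred * n_ref, axis=-1)
    return float(np.mean(1.0 - cos[valid]))


def composite_loss(render, image, rel_pred, rel_gt, n_pred, n_ref, valid,
                   perceptual_fn=None, w_lpips=0.05, w_pose=0.1, w_normal=0.1):
    """Weighted sum of the terms; the weights here are hypothetical."""
    total = mse_loss(render, image)
    if perceptual_fn is not None:
        total += w_lpips * float(perceptual_fn(render, image))
    total += w_pose * pose_loss(rel_pred, rel_gt)
    total += w_normal * normal_cosine_loss(n_pred, n_ref, valid)
    return total
```

In an actual training loop these terms would be written in an autodiff framework; the NumPy version only shows how the terms combine.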

Experimental Evaluation

TriSplat is evaluated on RealEstate10K (RE10K), DL3DV, and zero-shot ScanNet. The evaluation protocol emphasizes mesh quality (Chamfer distance, F1, recall, precision), mesh-based novel-view synthesis (PSNR, SSIM, LPIPS), depth, and normal accuracy. Baselines encompass Gaussian feed-forward models (MVSplat, DepthSplat, AnySplat, YoNoSplat) and surface-aware variants (MeshSplat, SurfelSplat).
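For readers unfamiliar with the mesh metrics, the sketch below computes Chamfer distance, precision, recall, and F1 between two point sets sampled from the predicted and reference surfaces. Definitions vary between papers (mean versus sum, squared versus unsquared distances, threshold choice), so the symmetric-mean form and the threshold `tau` used here are assumptions and may not match the paper's protocol.

```python
import numpy as np


def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its nearest point in dst (M, 3).

    Brute force, so only suitable for small point sets.
    """
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    return d.min(axis=1)


def surface_metrics(pred, gt, tau=0.05):
    """Return Chamfer distance, precision, recall, and F1 at threshold tau."""
    d_pred = nearest_distances(pred, gt)   # accuracy: prediction to reference
    d_gt = nearest_distances(gt, pred)     # completeness: reference to prediction
    chamfer = 0.5 * (d_pred.mean() + d_gt.mean())
    precision = float((d_pred < tau).mean())
    recall = float((d_gt < tau).mean())
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return {"chamfer": float(chamfer), "precision": precision,
            "recall": recall, "f1": f1}
```

Precision penalizes predicted geometry that lies far from the reference (floaters), while recall penalizes missing regions, so F1 drops when either failure mode is present.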

Surface and Rendering Quality

TriSplat reports the strongest surface geometry metrics and mesh-rendering PSNR across the evaluated datasets and view counts.

RE10K (6 views): TriSplat attains CD 0.190 and F1 0.622 versus YoNoSplat's CD 0.267, F1 0.443. Mesh-rendering PSNR is 24.69 dB for TriSplat, +2.75 dB over the strongest Gaussian competitor.
DL3DV: TriSplat outperforms all baselines in CD and F1 by significant margins across 6, 12, and 24 views.

Qualitative visualizations demonstrate that TSDF-fused Gaussian meshes systematically under-cover thin structures and blur surfaces, while TriSplat retains sharp triangle detail and more complete geometry.

Figure 2: DL3DV mesh-rendering; TriSplat maintains more complete surfaces versus Gaussian baselines, avoiding missing geometry and inconsistent mesh structure.

Figure 3: DL3DV textured mesh comparison; TriSplat exports coherent textures, while Gaussian-to-TSDF conversion fragments geometry and loses scene extent.

Figure 4: RE10K mesh-rendering; TriSplat preserves sharper silhouettes and triangle detail, while Gaussian meshes introduce floaters and structure loss.

Figure 5: RE10K textured mesh; TriSplat yields cleaner triangle meshes, while TSDF-fused Gaussian meshes show artifacts and fragmented regions.

Depth and Normal Transfer

Zero-shot evaluation on ScanNet supports TriSplat's generalization. Under domain shift, TriSplat produces smoother normals and sharper depth boundaries. It attains the best depth AbsRel (0.188) and mean angular normal error (27.9°) among the compared methods, indicating strong surface alignment.

Figure 6: ScanNet depth and normal comparison; TriSplat achieves smoother normals and sharper, boundary-aligned depth under domain shift.
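The two ScanNet metrics quoted above have standard definitions, sketched here in NumPy. The validity masking (ground-truth depth greater than zero) and the function names are assumptions; the paper's exact evaluation protocol, such as any depth alignment or scaling, is not described in this summary.

```python
import numpy as np


def abs_rel(pred_depth, gt_depth, valid=None):
    """Absolute relative depth error: mean of |pred - gt| / gt over valid pixels."""
    pred_depth = np.asarray(pred_depth, dtype=float)
    gt_depth = np.asarray(gt_depth, dtype=float)
    if valid is None:
        valid = gt_depth > 0   # assumed convention: zero marks missing depth
    return float(np.mean(np.abs(pred_depth - gt_depth)[valid] / gt_depth[valid]))


def mean_angular_error_deg(n_pred, n_gt, valid=None):
    """Mean angle in degrees between predicted and reference normals."""
    n_pred = np.asarray(n_pred, dtype=float)
    n_gt = np.asarray(n_gt, dtype=float)
    # Normalize both fields so the dot product is a cosine.
    n_pred = n_pred / np.maximum(np.linalg.norm(n_pred, axis=-1, keepdims=True), 1e-8)
    n_gt = n_gt / np.maximum(np.linalg.norm(n_gt, axis=-1, keepdims=True), 1e-8)
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    if valid is not None:
        angles = angles[valid]
    return float(np.mean(angles))
```

Both metrics are lower-is-better, so an AbsRel of 0.188 corresponds to an average depth error of roughly 19% of the true depth.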

Efficiency

TriSplat's feed-forward mesh extraction runs in under 1.3 s for up to 24 views, more than an order of magnitude faster than Gaussian baselines, which require TSDF fusion whose cost scales with scene volume. At 6 views, the fastest Gaussian competitor requires 18.7 s while TriSplat delivers results in 0.57 s, a speedup of roughly 33×.

Figure 7: Runtime comparison; TriSplat is structurally faster because it avoids the post-hoc mesh extraction required by Gaussian baselines.

Simulation-Ready Demonstrations

TriSplat meshes are directly imported into Unity and NVIDIA Isaac Sim for navigation, interaction, and robotic locomotion. Collision and physics modules operate without manual cleanup, confirming simulation-readiness for embodied agents. By contrast, Gaussian baselines need mesh conversion and cleanup.

Figure 8: Unity and Isaac Sim demonstration; TriSplat meshes allow direct interaction, locomotion, and collision with no mesh conversion or repair.
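To make "direct export" concrete, the sketch below writes a triangle soup to Wavefront OBJ text, a widely supported mesh interchange format. The summary does not state which file format the authors use, so the choice of OBJ, the optional opacity filter with its 0.5 threshold, and the function name `triangles_to_obj` are assumptions. Color and texture export is omitted.

```python
import numpy as np


def triangles_to_obj(triangles, opacity=None, min_opacity=0.5):
    """Serialize an (N, 3, 3) array of triangle vertices as OBJ text.

    Each triangle gets its own three vertices (a triangle soup with no shared
    vertices). If opacity (N,) is given, triangles below min_opacity are
    dropped; this filter is a hypothetical choice, not the paper's rule.
    """
    triangles = np.asarray(triangles, dtype=float)
    if opacity is not None:
        triangles = triangles[np.asarray(opacity) >= min_opacity]
    lines = []
    for tri in triangles:
        for v in tri:
            lines.append(f"v {v[0]:.6f} {v[1]:.6f} {v[2]:.6f}")
    # OBJ face indices are 1-based and refer to the vertex list above.
    for i in range(len(triangles)):
        lines.append(f"f {3 * i + 1} {3 * i + 2} {3 * i + 3}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a `.obj` file gives a mesh that common tools can load. Because vertices are not shared, the result is a non-manifold soup, consistent with the limitation the paper acknowledges.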

Ablation Study

Key architectural ablations show each component is critical:

Removing geometry anchoring or mono-normal bootstrap significantly degrades F1 and PSNR.
Removing normal refinement or progressive sharpening increases mesh artifacts and rendering loss.
Triangle-native export eliminates primitive-to-mesh degradation seen in Gaussian pipelines.

Implications and Future Directions

TriSplat's triangle-native design treats simulation readiness as an inherent property of the representation rather than a post-processing stage. Immediate mesh export translates to direct applicability in robotics, AR/VR, and physics simulation, setting a new baseline for feed-forward surface reconstruction. However, limitations remain: the exported triangle soups are non-manifold and not watertight, which restricts applications such as finite-element analysis. Triangle density is also tied to input resolution, highlighting the need for adaptive tessellation and topology-aware exports.

Future work could extend triangle-native mesh prediction to watertight surfaces, add topology refinement, and decouple triangle density from input resolution through adaptive tessellation. Integrating semantic cues or affordance modeling could further enhance physics-driven and embodied scene understanding.

Conclusion

TriSplat establishes an explicit triangle-native feed-forward pipeline for simulation-ready 3D reconstruction. It jointly predicts geometry, appearance, and pose from sparse, unposed images and exports meshes usable in physics engines without conversion. The method avoids the quality loss of volumetric fusion, reports superior surface and rendering metrics, generalizes across domains, and supports simulation applications with sub-second to near-second export times, marking a significant step in practical feed-forward scene reconstruction (2605.26115).
