Feedforward Latent Triangle Splatting (FLAT)
- The paper presents a novel feedforward approach that directly decodes explicit, surface-aligned triangle primitives from video diffusion latents, overcoming challenges in primitive orientation and gradient flow.
- It utilizes a ray-centered rotation parameterization and Cholesky-based shape regression to efficiently predict triangle poses and geometries, ensuring non-degenerate and stable outputs.
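- Ablations on RealEstate10K show the full model reaching PSNR 21.45, SSIM 0.710, and LPIPS 0.245, ahead of alternative triangle parameterizations. Against volumetric baselines it improves geometric accuracy at comparable image quality, and its output is render-ready for graphics pipelines.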
Feedforward Latent Triangle Splatting (FLAT) is a methodology for directly decoding explicit surface-aligned triangle primitives from the latent representations of video diffusion models in a single feedforward pass. Unlike prior solutions that synthesize volumetric primitives such as 3D Gaussians, which lack well-defined surfaces and are not readily deployable in graphics pipelines or real-time simulation, FLAT produces triangle splats that can be easily post-processed into standard, opaque mesh assets for downstream applications. The approach addresses major geometric and optimization challenges inherent to flat primitive prediction, particularly the instability arising from primitive orientation sensitivity and poor gradient flow during training. FLAT introduces a ray-centered rotation parameterization and a differentiable triangle rendering window tailored to improve stability and accuracy, enabling high-fidelity, geometrically accurate scene generation from single images (Kupyn et al., 23 Jun 2026).
1. Foundations and Motivation
Existing feedforward latent scene decoders in the context of generative modeling typically output volumetric primitives (e.g., 3D Gaussian splats) that are visually compelling but lack an explicit surface representation. This limits their integration with simulation, physics engines, and real-time graphics pipelines, where surface assets are required. The central problem is whether compressed video diffusion latents encode sufficient structural information to recover surface primitives, specifically triangle splats, in a single decoding step. The FLAT technique is designed to address this by mapping diffusion latents directly to explicit surface-aligned triangles.
Triangle splats present significant regression and optimization challenges compared with their volumetric counterparts. Predicting oriented surface primitives requires precise estimation of both shape and orientation: a flat primitive that is slightly misoriented can lose most of its image coverage, so it receives little gradient and convergence becomes fragile.
2. Ray-Centered Rotation Parameterization
A cornerstone of FLAT is its use of a ray-centered coordinate system for parameterizing each triangle’s pose and shape. Each predicted triangle is anchored to a viewing “anchor ray” characterized by:
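- Ray origin $\mathbf{o}$
- Ray direction $\mathbf{d}$ (unit-length)
An orthonormal local basis is constructed:
- $\mathbf{e}_z = \mathbf{d}$, $\mathbf{e}_x = \dfrac{\mathbf{u} \times \mathbf{d}}{\lVert \mathbf{u} \times \mathbf{d} \rVert}$, $\mathbf{e}_y = \mathbf{e}_z \times \mathbf{e}_x$ (with a fixed up vector $\mathbf{u}$)
The local-to-world rotation matrix is $R_{\text{ray}} = [\,\mathbf{e}_x \;\; \mathbf{e}_y \;\; \mathbf{e}_z\,]$. The center of the triangle, $\mathbf{c}$, is placed along the ray: $\mathbf{c} = \mathbf{o} + t\,\mathbf{d}$, where $t$ is a predicted depth.
To obtain the final triangle orientation, a residual rotation $R_{\text{res}}$ is parameterized by three angles:
- Two tilt angles: $\alpha, \beta$ about the local $x$- and $y$-axes
- One spin angle: $\gamma$ about the local $z$-axis
Rodrigues’ formula is used to construct these elementary rotations, with the resultant residual:
$$R_{\text{res}} = R_z(\gamma)\, R_y(\beta)\, R_x(\alpha)$$
Embedding in world coordinates is achieved via:
$$R = R_{\text{ray}}\, R_{\text{res}}$$
and
$$\mathbf{x}_{\text{world}} = \mathbf{c} + R\,\mathbf{x}_{\text{local}}$$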
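The pose construction in this section can be sketched in NumPy. This is a minimal illustration under assumptions the text does not fix: the up vector is taken as (0, 1, 0), the local z-axis is the ray direction, and the residual is composed as spin, then tilt about y, then tilt about x. The function names `ray_frame`, `rodrigues`, and `triangle_pose` are hypothetical and not taken from the paper.

```python
import numpy as np

def ray_frame(d, up=(0.0, 1.0, 0.0)):
    """Right-handed orthonormal basis whose third column is the unit ray direction."""
    e_z = np.asarray(d, dtype=float)
    e_z = e_z / np.linalg.norm(e_z)
    e_x = np.cross(np.asarray(up, dtype=float), e_z)
    e_x = e_x / np.linalg.norm(e_x)  # assumes d is not parallel to `up`
    e_y = np.cross(e_z, e_x)
    return np.stack([e_x, e_y, e_z], axis=1)  # columns are e_x, e_y, e_z

def rodrigues(axis, angle):
    """Rotation matrix for `angle` radians about the unit vector `axis`."""
    k = np.asarray(axis, dtype=float)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def triangle_pose(o, d, t, alpha, beta, gamma):
    """Return the world rotation R and center c of a triangle anchored to ray (o, d)."""
    R_ray = ray_frame(d)
    x_axis, y_axis, z_axis = np.eye(3)  # local axes of the ray frame
    # Assumed composition order: spin about z applied after the two tilts.
    R_res = rodrigues(z_axis, gamma) @ rodrigues(y_axis, beta) @ rodrigues(x_axis, alpha)
    R = R_ray @ R_res
    c = np.asarray(o, dtype=float) + t * R_ray[:, 2]  # center at depth t along the ray
    return R, c
```

With all three residual angles at zero, `R` reduces to the ray frame, so the third column of `R` (the triangle normal) points along the ray.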
3. Triangle Shape and Normal Regression
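FLAT eschews direct per-vertex regression in favor of a lower-triangular (Cholesky) 2×2 matrix $L$ with positive diagonal, guaranteeing non-degenerate (positive area) triangles:
$$L = \begin{pmatrix} \ell_{11} & 0 \\ \ell_{21} & \ell_{22} \end{pmatrix}, \qquad \ell_{11} > 0,\; \ell_{22} > 0$$
The canonical local triangle $\{\bar{\mathbf{v}}_i\}_{i=1}^{3}$ is an equilateral triangle centered at the origin in the XY plane with unit area. This shape is transformed by $L$ (zero-padded for Z), re-centered, and then further rotated and placed in the world, where $R$ is the triangle's world rotation and $\mathbf{c}$ its center:
$$\tilde{\mathbf{v}}_i = \begin{pmatrix} L\,\bar{\mathbf{v}}_i \\ 0 \end{pmatrix} - \frac{1}{3}\sum_{j=1}^{3} \begin{pmatrix} L\,\bar{\mathbf{v}}_j \\ 0 \end{pmatrix}$$
$$\mathbf{v}_i = \mathbf{c} + R\,\tilde{\mathbf{v}}_i$$
Since the transformed triangle $\{\tilde{\mathbf{v}}_i\}$ lies in the local XY-plane, the face normal $\mathbf{n}_{\text{local}} = (0, 0, 1)^\top$ is mapped to world coordinates via $\mathbf{n} = R\,\mathbf{n}_{\text{local}}$. Orthogonality of $R$ ensures $\mathbf{n}$ remains unit-length.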
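The shape construction can be illustrated with a short NumPy sketch. It assumes the world rotation `R` (3×3, orthonormal) and center `c` are already available, and the names `canonical_triangle` and `triangle_vertices` are hypothetical. Because the canonical triangle has unit area, the area of the output equals the determinant of the Cholesky factor, which is positive whenever both diagonal entries are positive.

```python
import numpy as np

def canonical_triangle():
    """Equilateral triangle in 2D, centered at the origin, with unit area."""
    side = 2.0 / 3.0 ** 0.25            # solves sqrt(3)/4 * side^2 = 1
    radius = side / np.sqrt(3.0)        # circumradius
    ang = np.deg2rad([90.0, 210.0, 330.0])
    return np.stack([radius * np.cos(ang), radius * np.sin(ang)], axis=1)

def triangle_vertices(l11, l21, l22, R, c):
    """World-space vertices (3x3, one per row) and unit normal of the triangle."""
    if l11 <= 0.0 or l22 <= 0.0:
        raise ValueError("diagonal entries must be positive")
    L = np.array([[l11, 0.0], [l21, l22]])
    local = canonical_triangle() @ L.T          # apply L to each 2D vertex
    local = local - local.mean(axis=0)          # re-center at the origin
    local3 = np.concatenate([local, np.zeros((3, 1))], axis=1)  # zero-pad Z
    verts = local3 @ np.asarray(R, dtype=float).T + np.asarray(c, dtype=float)
    normal = np.asarray(R, dtype=float) @ np.array([0.0, 0.0, 1.0])
    return verts, normal

def triangle_area(verts):
    """Area of a 3D triangle given its three vertices as rows."""
    return 0.5 * np.linalg.norm(np.cross(verts[1] - verts[0], verts[2] - verts[0]))
```

For example, `triangle_vertices(2.0, 0.5, 3.0, np.eye(3), [0, 0, 5])` yields a triangle of area 6 centered at depth 5, with its normal along the z-axis.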
4. Gradient Flow and Optimization Stability
Several mechanisms are implemented to ensure stable optimization and effective gradient flow:
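- Non-degeneracy: Enforcing $\ell_{11} > 0$ and $\ell_{22} > 0$ (a positive diagonal for the Cholesky factor, hence positive area) prevents triangle collapse.
- Camera-facing initialization: Setting $R_{\text{res}} = I$ (all residual angles zero) at initialization ensures all triangles are visually present and contribute gradients from the start.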
- Local residual regression: Small local-angle residuals are more numerically stable to regress than global SO(3) poses or quaternions; direct global rotation prediction often leads to vanishing coverage and “dead” primitives.
- Shape-orientation decoupling: Learning 2D shape (the Cholesky factor $L$) and 3D pose separately simplifies the learning problem and prevents entangled optimization paths.
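The text does not state which output activations FLAT uses to enforce these constraints. As one hypothetical way to realize them, the sketch below maps seven unconstrained network outputs to a positive depth, a Cholesky factor with positive diagonal, and bounded residual angles. The softplus and tanh choices, the `max_angle` bound, the `eps` floor, and the function name `constrain` are all assumptions made for illustration.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return np.logaddexp(0.0, x)

def constrain(raw, max_angle=np.pi / 4, eps=1e-4):
    """Map raw outputs [depth, l11, l21, l22, alpha, beta, gamma] to valid parameters.

    Hypothetical activations, not the paper's: softplus keeps the depth and the
    diagonal positive, tanh keeps each residual angle within +/- max_angle.
    """
    raw = np.asarray(raw, dtype=float)
    t = softplus(raw[0]) + eps
    l11 = softplus(raw[1]) + eps
    l21 = raw[2]                       # off-diagonal entry is unconstrained
    l22 = softplus(raw[3]) + eps
    angles = max_angle * np.tanh(raw[4:7])
    return t, (l11, l21, l22), angles
```

Under this scheme, all-zero raw outputs give zero residual angles, so a network whose final layer starts near zero begins with camera-facing triangles of positive area.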
Ablation studies demonstrate that omitting the ray-centered residual rotation or substituting it with global quaternions or naive per-vertex offsets leads to catastrophic collapse in performance, with global world-space quaternion regression resulting in PSNR < 10 dB and SSIM < 0.4 on RealEstate10K. The full FLAT architecture achieves PSNR 21.45, SSIM 0.710, LPIPS 0.245, outperforming naïve and less structured representations (Kupyn et al., 23 Jun 2026).
5. Differentiable Rendering and Window Function
A novel product window function is introduced within FLAT’s differentiable triangle renderer. This function ameliorates poor gradient flow at the silhouette boundaries of triangle splats, facilitating more informative supervision and improved convergence during training. This window can be combined with the ray-centered parameterization or ablated independently for benchmarking contributions.
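The formula of the product window is not given here. Purely to illustrate why a product-form window can carry gradient across a silhouette, the sketch below multiplies three smooth per-edge terms, each a sigmoid of the signed distance to one edge of a 2D triangle. This specific form, the `sharpness` parameter, and the function names are assumptions, not FLAT's definition.

```python
import numpy as np

def edge_signed_distances(p, tri):
    """Signed distance from 2D point p to each edge line of a counter-clockwise
    triangle `tri` (3x2); positive values lie on the interior side."""
    p = np.asarray(p, dtype=float)
    a = np.asarray(tri, dtype=float)
    b = np.roll(a, -1, axis=0)
    e = b - a
    n = np.stack([-e[:, 1], e[:, 0]], axis=1)       # inward normals for CCW winding
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    return np.einsum("ij,ij->i", p - a, n)

def sigmoid(x):
    """Logistic function written via tanh to avoid overflow for large |x|."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def product_window(p, tri, sharpness=50.0):
    """Illustrative soft coverage: product of per-edge sigmoids, in (0, 1)."""
    return float(np.prod(sigmoid(sharpness * edge_signed_distances(p, tri))))
```

In this illustrative form the window is close to 1 deep inside the triangle, close to 0 far outside, and about 0.5 at the midpoint of an edge, so pixels on either side of the boundary still receive a nonzero derivative with respect to the vertex positions.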
Triangle splats predicted by FLAT are rendered differentiably, imposing a strong correspondence between scene representation and image evidence. At test time, a lightweight refinement pass aggregates the set of predicted triangles (“triangle soup”) into a fully opaque, contiguous surface suitable for real-time rendering in standard graphics engines.
6. Benchmarking, Comparison, and Tradeoffs
FLAT’s performance is systematically benchmarked against both volumetric Gaussian (3DGS) and 2D Gaussian/triangle splatting baselines using identical pipelines, isolating the geometric and representational differences. Comparisons reveal FLAT’s superiority in geometric accuracy metrics, while maintaining parity in image-level quality. The explicit surface representation produced by FLAT, combined with its well-conditioned optimization procedure, makes it well suited to downstream use. Ablation results for the different representations:
| Representation | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Global Quaternion | < 10 | < 0.4 | > 0.4 |
| 3-Offsets | 20.09 | 0.674 | 0.289 |
| Triangle Window + Param + Residual | 20.65 | 0.693 | 0.282 |
| Alt Dec (LongLRM) + Param + Residual | 21.24 | 0.701 | 0.275 |
| Full FLAT | 21.45 | 0.710 | 0.245 |
These empirical results confirm the necessity of the ray-centered residual rotation for learnability and geometric fidelity, with the Cholesky shape and triangle window each yielding incremental benefits (Kupyn et al., 23 Jun 2026).
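For reading the PSNR column, the standard definition is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below implements it in NumPy, assuming images scaled to [0, 1]. The paper's exact evaluation protocol (cropping, averaging across views) is not described here, and the function name `psnr` is illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Under that [0, 1] assumption, 21.45 dB corresponds to a root-mean-square pixel error of about 0.085, while a PSNR below 10 dB means an error above roughly 0.32.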
7. Downstream Applications, Limitations, and Future Directions
The output of FLAT is a set of surface-aligned triangle splats that can be directly converted into real-time renderable, game-engine-compatible assets. By decoding explicit geometric representations from video diffusion latents, FLAT strengthens the integration of generative 3D scene synthesis with graphics, simulation, and other physically-based downstream tasks.
A plausible implication is that the ray-centered parameterization may generalize to other surface-primitive decoders or generative 3D modeling architectures, given its demonstrated numerical advantages. However, learning in complex, unconstrained settings and extending the approach to more structured topologies (e.g., watertight meshes) remain open problems. Continued analysis of surface-aligned primitive encoding may yield further insight into latent-geometry correspondences and broader applicability.