
Feedforward Latent Triangle Splatting (FLAT)

Updated 25 June 2026
  • The paper presents a novel feedforward approach that directly decodes explicit, surface-aligned triangle primitives from video diffusion latents, overcoming challenges in primitive orientation and gradient flow.
  • It utilizes a ray-centered rotation parameterization and Cholesky-based shape regression to efficiently predict triangle poses and geometries, ensuring non-degenerate and stable outputs.
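  • Empirical results show that the full model outperforms ablated and alternative triangle parameterizations on PSNR, SSIM, and LPIPS, and that it improves geometric accuracy over volumetric baselines at comparable image quality. The output is render-ready for graphics pipelines.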

Feedforward Latent Triangle Splatting (FLAT) is a methodology for directly decoding explicit surface-aligned triangle primitives from the latent representations of video diffusion models in a single feedforward pass. Unlike prior solutions that synthesize volumetric primitives such as 3D Gaussians, which lack well-defined surfaces and are not readily deployable in graphics pipelines or real-time simulation, FLAT produces triangle splats that can be easily post-processed into standard, opaque mesh assets for downstream applications. The approach addresses major geometric and optimization challenges inherent to flat primitive prediction, particularly the instability arising from primitive orientation sensitivity and poor gradient flow during training. FLAT introduces a ray-centered rotation parameterization and a differentiable triangle rendering window tailored to improve stability and accuracy, enabling high-fidelity, geometrically accurate scene generation from single images (Kupyn et al., 23 Jun 2026).

1. Foundations and Motivation

Existing feedforward latent scene decoders in the context of generative modeling typically output volumetric primitives (e.g., 3D Gaussian splats) that are visually compelling but lack an explicit surface representation. This limits their integration with simulation, physics engines, and real-time graphics pipelines, where surface assets are required. The central problem is whether compressed video diffusion latents encode sufficient structural information to recover surface primitives, specifically triangle splats, in a single decoding step. The FLAT technique is designed to address this by mapping diffusion latents directly to explicit surface-aligned triangles.

Triangle splats are harder to regress and optimize than their volumetric counterparts. Predicting oriented surface primitives requires precise estimates of both shape and orientation, and a flat primitive's image footprint is highly sensitive to orientation: a triangle seen nearly edge-on covers almost no pixels and therefore receives almost no gradient. Small pose errors consequently cause poor gradient flow and fragile convergence.

2. Ray-Centered Rotation Parameterization

A cornerstone of FLAT is its use of a ray-centered coordinate system for parameterizing each triangle’s pose and shape. Each predicted triangle is anchored to a viewing “anchor ray” characterized by:

  • Ray origin $r_0 \in \mathbb{R}^3$
  • Ray direction $r_d \in S^2$ (unit length)

An orthonormal local basis $(u, v, w)$ is constructed:

  • $w = r_d$
  • $u = \text{normalize}(up_\text{world} \times r_d)$, where $up_\text{world}$ is a fixed up vector
  • $v = w \times u$

The local-to-world rotation matrix is $R_\text{frame} = [u, v, w] \in \mathbb{R}^{3 \times 3}$, with $u$, $v$, and $w$ as its columns. The center of the triangle is placed along the ray at $p_\text{center} = r_0 + D\,r_d$, where $D$ is a predicted depth.
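The anchor-frame construction above fits in a few lines of plain Python. This is a minimal sketch, not FLAT's implementation. The helper name `anchor_frame`, the choice of $+Y$ as $up_\text{world}$, and the explicit error for rays parallel to the up vector (where the cross product vanishes) are all assumptions made for illustration.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in a))
    if n < 1e-8:
        raise ValueError("cannot normalize a (near-)zero vector")
    return tuple(x / n for x in a)


def anchor_frame(r0, rd, depth, up_world=(0.0, 1.0, 0.0)):
    """Hypothetical helper: build the basis (u, v, w) and p_center for one anchor ray.

    ASSUMPTION: up_world defaults to +Y. The basis is undefined when rd is
    parallel to up_world; that case surfaces as a ValueError from normalize().
    """
    w = normalize(rd)                    # w = r_d (re-normalized defensively)
    u = normalize(cross(up_world, w))    # u = normalize(up_world x r_d)
    v = cross(w, u)                      # v = w x u, unit length by construction
    p_center = tuple(o + depth * d for o, d in zip(r0, w))  # p_center = r_0 + D r_d
    return (u, v, w), p_center


# Example: a ray looking down -Z from the origin, with predicted depth D = 2.
(u, v, w), p = anchor_frame((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), 2.0)
```

For this ray the sketch gives $u = -X$, $v = +Y$, $w = -Z$, and $p_\text{center} = (0, 0, -2)$. Because $u \times v = w$, the frame is right-handed, which makes $R_\text{frame}$ a proper rotation.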

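To obtain the final triangle orientation, a residual rotation $R_\text{res}$ is parameterized by three angles:

  • Two tilt angles: $\alpha$ about $u$ and $\beta$ about $v$
  • One spin angle: $\gamma$ about $w$

Rodrigues’ formula is used to construct these elementary rotations. A rotation by angle $\theta$ about a unit axis $k$ is

$$R_k(\theta) = I + \sin\theta\,[k]_\times + (1 - \cos\theta)\,[k]_\times^2,$$

where $[k]_\times$ is the skew-symmetric cross-product matrix of $k$. The resultant residual $R_\text{res}$ is the composition of the two tilts and the spin. It is expressed in the local frame, in which $u$, $v$, and $w$ are the X, Y, and Z axes.

Embedding in world coordinates is achieved via

$$R_\text{world} = R_\text{frame}\,R_\text{res},$$

and together with the center $p_\text{center} = r_0 + D\,r_d$, $R_\text{world}$ fixes the triangle's pose in world space.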
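The residual rotation (written here as $R_\text{res}$, with tilts $\alpha$ about $u$ and $\beta$ about $v$ and spin $\gamma$ about $w$) and its world embedding can be sketched as follows. Two points are assumptions. First, the source does not fix the order in which the tilts and the spin are composed, so this sketch picks $R_z(\gamma)\,R_y(\beta)\,R_x(\alpha)$, which applies the spin last. Second, the helper names `rodrigues`, `residual_rotation`, `world_rotation`, and `face_normal` are hypothetical.

```python
import math


def matmul(A, B):
    """Product of two 3x3 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rodrigues(axis, theta):
    """Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2 for a unit axis k."""
    kx, ky, kz = axis
    K = [[0.0, -kz, ky],
         [kz, 0.0, -kx],
         [-ky, kx, 0.0]]                 # skew-symmetric cross-product matrix [k]_x
    K2 = matmul(K, K)
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c * K2[i][j]
             for j in range(3)] for i in range(3)]


def residual_rotation(alpha, beta, gamma):
    """Residual in the local frame, where u, v, w are the local X, Y, Z axes.

    ASSUMPTION: composition order R_z(gamma) @ R_y(beta) @ R_x(alpha).
    """
    tilt_u = rodrigues((1.0, 0.0, 0.0), alpha)
    tilt_v = rodrigues((0.0, 1.0, 0.0), beta)
    spin_w = rodrigues((0.0, 0.0, 1.0), gamma)
    return matmul(spin_w, matmul(tilt_v, tilt_u))


def world_rotation(u, v, w, alpha, beta, gamma):
    """R_world = R_frame @ R_res, with R_frame holding u, v, w as its columns."""
    R_frame = [[u[i], v[i], w[i]] for i in range(3)]
    return matmul(R_frame, residual_rotation(alpha, beta, gamma))


def face_normal(R_world):
    """World-space normal: the third column of R_world (the image of local +Z)."""
    return tuple(R_world[i][2] for i in range(3))
```

With all three angles at zero, each elementary rotation is exactly the identity, so `face_normal` returns $w = r_d$ and the triangle lies perpendicular to its anchor ray. Predicting small residual angles around this state is what keeps the regression local.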

3. Triangle Shape and Normal Regression

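FLAT avoids direct per-vertex regression. Instead it predicts a lower-triangular (Cholesky) $2 \times 2$ matrix $L$, which guarantees non-degenerate (positive-area) triangles:

$$L = \begin{pmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{pmatrix}, \qquad l_{11}, l_{22} > 0 \;\Rightarrow\; \det L = l_{11}\,l_{22} > 0$$

The canonical local triangle, with vertices $x_1, x_2, x_3$, is an equilateral triangle of unit area centered at the origin in the XY plane. This shape is transformed by $L$ (zero-padded for Z), re-centered on its centroid, and then rotated and placed in the world. Because a linear map scales area by its determinant, the shaped triangle has area $\det L$:

$$\hat{x}_i = \begin{pmatrix} L & 0 \\ 0 & 0 \end{pmatrix} x_i \;-\; \frac{1}{3}\sum_{j=1}^{3} \begin{pmatrix} L & 0 \\ 0 & 0 \end{pmatrix} x_j$$

$$p_i = p_\text{center} + R_\text{world}\,\hat{x}_i, \qquad R_\text{world} = R_\text{frame}\,R_\text{res},$$

where $R_\text{res}$ is the residual rotation of Section 2. Since the shaped triangle lies in the local XY plane, its face normal is the local $+Z$ axis $e_z = (0, 0, 1)^\top$, which is mapped to world coordinates via $n = R_\text{world}\,e_z$. Because $R_\text{world}$ is orthogonal, $n$ remains unit-length.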
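The shape path can be sketched in the same style. Here $L$ is the lower-triangular factor with entries $l_{11}, l_{21}, l_{22}$. Several choices are illustrative assumptions rather than details given in the source: the canonical triangle's in-plane orientation (one vertex on $+Y$, counter-clockwise order), the helper names, and passing the world rotation in as a plain 3×3 list of rows.

```python
import math


def canonical_triangle():
    """Equilateral, unit-area triangle centered at the origin in the local XY plane.

    ASSUMPTION: one vertex on +Y, vertices in counter-clockwise order.
    """
    side = math.sqrt(4.0 / math.sqrt(3.0))   # (sqrt(3) / 4) * side**2 == 1
    radius = side / math.sqrt(3.0)           # circumradius == distance to the centroid
    angles = (math.pi / 2, math.pi / 2 + 2 * math.pi / 3, math.pi / 2 + 4 * math.pi / 3)
    return [(radius * math.cos(a), radius * math.sin(a), 0.0) for a in angles]


def shape_triangle(l11, l21, l22):
    """Apply the 2x2 Cholesky factor L (zero-padded for Z), then re-center."""
    if l11 <= 0.0 or l22 <= 0.0:
        raise ValueError("Cholesky diagonal must be strictly positive")
    pts = [(l11 * x, l21 * x + l22 * y, 0.0) for x, y, _ in canonical_triangle()]
    cx = sum(p[0] for p in pts) / 3.0
    cy = sum(p[1] for p in pts) / 3.0
    return [(x - cx, y - cy, 0.0) for x, y, _ in pts]


def place_triangle(local_pts, R_world, p_center):
    """p_i = p_center + R_world @ x_i for each local vertex."""
    return [tuple(p_center[i] + sum(R_world[i][k] * q[k] for k in range(3))
                  for i in range(3)) for q in local_pts]


def signed_area_xy(pts):
    """Signed area of a triangle lying in the XY plane (positive if counter-clockwise)."""
    (ax, ay, _), (bx, by, _), (qx, qy, _) = pts
    return 0.5 * ((bx - ax) * (qy - ay) - (by - ay) * (qx - ax))


local = shape_triangle(2.0, 0.5, 0.25)   # det L = 2.0 * 0.25 = 0.5
area = signed_area_xy(local)
```

The computed `area` matches $\det L = 0.5$, and the sign stays positive because a positive-determinant $L$ preserves vertex winding. That preserved winding is what keeps the normal $R_\text{world}\,e_z$ consistent with the placed vertices.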

4. Gradient Flow and Optimization Stability

Several mechanisms are implemented to ensure stable optimization and effective gradient flow:

  • Non-degeneracy: Keeping the diagonal of the Cholesky shape matrix strictly positive keeps the triangle's area positive and prevents collapse.
  • Camera-facing initialization: Initializing all three residual angles to zero makes the residual rotation the identity, so each triangle's normal coincides with its anchor-ray direction. Every triangle then lies perpendicular to its viewing ray, is visible, and contributes gradients from the start.
  • Local residual regression: Small local-angle residuals are more numerically stable to regress than global SO(3) poses or quaternions. Direct global rotation prediction often leads to vanishing coverage and “dead” primitives.
  • Shape-orientation decoupling: Learning the 2D shape (through the Cholesky matrix) separately from the 3D pose simplifies the learning problem and prevents entangled optimization paths.
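Two of these stabilizers can be checked numerically. The sketch assumes a hypothetical prediction head that maps unconstrained outputs to the Cholesky diagonal through a softplus plus a small floor `eps`; the source states only that the diagonal is kept positive. It also measures how squarely a triangle faces its anchor ray via $|n \cdot r_d|$, which for a small triangle approximates its projected area relative to its true area. With tilts $\alpha$ about $u$ and $\beta$ about $v$, and assuming the spin is applied before or after both tilts, this factor reduces to $|\cos\alpha \cos\beta|$.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cholesky_from_raw(a, b, c, eps=1e-4):
    """HYPOTHETICAL head: map raw outputs (a, b, c) to (l11, l21, l22).

    softplus alone underflows to 0.0 for very negative inputs (e.g. -800),
    so the floor eps keeps l11, l22, and det L = l11 * l22 strictly positive.
    """
    return softplus(a) + eps, b, softplus(c) + eps


def facing_factor(alpha, beta):
    """|n . r_d| for tilts alpha (about u) and beta (about v).

    ASSUMPTION: the spin is applied before or after both tilts, so it does not
    change the result. In the local frame the tilted normal is
    (sin(beta) cos(alpha), -sin(alpha), cos(beta) cos(alpha)), and r_d is local +Z.
    """
    n_local = (math.sin(beta) * math.cos(alpha),
               -math.sin(alpha),
               math.cos(beta) * math.cos(alpha))
    return abs(n_local[2])


# Zero residual: the triangle squarely faces its ray (factor 1.0).
# A 90-degree tilt: the triangle is edge-on, covers ~no pixels, and gets ~no gradient.
print(facing_factor(0.0, 0.0), facing_factor(math.pi / 2, 0.0))
```

The second factor is on the order of 1e-17, which illustrates how a badly oriented primitive can go “dead”. Starting every triangle at a factor of 1.0 and regressing only small residuals keeps primitives away from that regime early in training.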

Ablation studies show that the structure of the rotation parameterization is decisive. Replacing the ray-centered residual rotation with global world-space quaternion regression causes catastrophic collapse (PSNR < 10 dB and SSIM < 0.4 on RealEstate10K). Naive per-vertex offsets do train, but they fall short of the full model (20.09 dB PSNR). The full FLAT architecture achieves PSNR 21.45 dB, SSIM 0.710, and LPIPS 0.245, outperforming naive and less structured representations (Kupyn et al., 23 Jun 2026).

5. Differentiable Rendering and Window Function

FLAT's differentiable triangle renderer introduces a novel product window function. This function mitigates poor gradient flow at the silhouette boundaries of triangle splats, which gives more informative supervision and improves convergence during training. The window can be used together with the ray-centered parameterization or ablated independently to measure each component's contribution.
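The exact form of the product window is not specified in this section. Purely to illustrate why a smooth window helps at silhouettes, the sketch below assumes one plausible product form, a soft AND of per-edge sigmoids:

$$W(p) = \prod_{i=1}^{3} \operatorname{sigmoid}\!\big(d_i(p)/s\big)$$

Here $d_i(p)$ is the signed distance from pixel $p$ to edge $i$ (positive inside), and $s$ is a hypothetical sharpness parameter. This is not claimed to be FLAT's window. The sketch compares this smooth window with a hard inside/outside test, which has zero gradient just outside an edge.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def edge_distances(p, tri):
    """Signed distance from 2D point p to each edge line of a counter-clockwise triangle."""
    out = []
    for i in range(3):
        (ax, ay), (bx, by) = tri[i], tri[(i + 1) % 3]
        ex, ey = bx - ax, by - ay
        out.append((ex * (p[1] - ay) - ey * (p[0] - ax)) / math.hypot(ex, ey))
    return out


def hard_window(p, tri):
    """Binary coverage: 1 inside the triangle, 0 outside."""
    return 1.0 if all(d >= 0.0 for d in edge_distances(p, tri)) else 0.0


def product_window(p, tri, s=0.05):
    """ASSUMED smooth window: product of per-edge sigmoids (a soft AND)."""
    w = 1.0
    for d in edge_distances(p, tri):
        w *= sigmoid(d / s)
    return w


def finite_diff_y(f, p, tri, h=1e-4):
    """Central-difference derivative of a window with respect to the pixel's y."""
    return (f((p[0], p[1] + h), tri) - f((p[0], p[1] - h), tri)) / (2.0 * h)


tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # counter-clockwise
just_outside = (0.5, -0.02)                   # slightly below the bottom edge
grad_hard = finite_diff_y(hard_window, just_outside, tri)
grad_soft = finite_diff_y(product_window, just_outside, tri)
```

Just outside the bottom edge, `grad_hard` is exactly zero while `grad_soft` is positive. In a full renderer the smooth window would therefore pass a signal back to the vertices that tells the edge to move toward the pixel, whereas the hard test gives the optimizer nothing to follow.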

Because FLAT's triangle splats are rendered differentiably, image-space losses supervise the predicted geometry directly. At test time, a lightweight refinement pass aggregates the set of predicted triangles (“triangle soup”) into a fully opaque, contiguous surface suitable for real-time rendering in standard graphics engines.

6. Benchmarking, Comparison, and Tradeoffs

FLAT is systematically benchmarked against both volumetric Gaussian (3DGS) and 2D Gaussian/triangle splatting baselines using identical pipelines, which isolates the geometric and representational differences. These comparisons show that FLAT improves geometric accuracy while matching the baselines in image-level quality. Its explicit surface representation, combined with a well-conditioned optimization procedure, makes it well suited to downstream use.

Representation ablation (PSNR in dB; higher PSNR and SSIM and lower LPIPS are better):

Representation                                PSNR     SSIM     LPIPS
Global quaternion                             < 10     < 0.4    > 0.4
3-Offsets                                     20.09    0.674    0.289
Triangle Window + Param + Residual            20.65    0.693    0.282
Alt. decoder (LongLRM) + Param + Residual     21.24    0.701    0.275
Full FLAT                                     21.45    0.710    0.245

These empirical results confirm the necessity of the ray-centered residual rotation for learnability and geometric fidelity, with the Cholesky shape and triangle window each yielding incremental benefits (Kupyn et al., 23 Jun 2026).

7. Downstream Applications, Limitations, and Future Directions

The output of FLAT is a set of surface-aligned triangle splats that the lightweight refinement pass converts into real-time renderable, game-engine-compatible assets. By decoding explicit geometry from video diffusion latents, FLAT strengthens the integration of generative 3D scene synthesis with graphics, simulation, and other physically based downstream tasks.

A plausible implication is that ray-centered parameterization may generalize to other surface-primitive decoders and generative 3D modeling architectures, given its demonstrated numerical advantages. However, two problems remain open: learning in complex, unconstrained scenes, and extending the approach to more structured topologies such as watertight meshes. Further analysis of surface-aligned primitive encoding may clarify how latents correspond to geometry and how broadly the approach applies.

References (1)
