FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Published 23 Jun 2026 in cs.CV | (2606.24876v1)

Abstract: Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, oftentimes leading to poor gradient flow. FLAT solves this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io

Authors (6)

Summary

The paper introduces a feedforward decoder that constructs explicit, surface-aligned triangle representations from video diffusion latents, yielding more accurate geometry than blob-like 3D Gaussians.
Its ray-centered local triangle parameterization and product-based window function improve training stability and gradient flow, yielding high-fidelity geometric reconstructions.
Empirical evaluations on RealEstate10K and DL3DV show markedly better normal consistency with competitive PSNR and SSIM, and a lightweight refinement step produces opaque meshes suited to real-time rendering.

Feedforward Latent Triangle Splatting (FLAT): Explicit Geometric Scene Generation from Video Diffusion Latents

Motivation and Context

Single-image 3D scene generation is fundamentally under-constrained, requiring strong priors and explicit geometry to support downstream use in graphics, simulation, and robotics. Latent video diffusion models now provide high-quality, multi-view-consistent generative priors. However, prior feedforward latent decoders have been limited to volumetric 3D Gaussians: semi-transparent, blob-like primitives that are incompatible with standard graphics pipelines and incapable of precise surface representation. Mesh extraction from such representations is unreliable and computationally expensive. Direct feedforward decoding to explicit, surface-aligned primitives had remained unsolved because of stability and gradient issues in differentiable rendering.

Methodological Contributions

FLAT introduces a fully feedforward pipeline for explicit triangle splatting from the latent space of a frozen video diffusion model. The approach yields geometrically accurate, physically grounded scene representations suitable for real-time rendering using opaque surface primitives. Key innovations include:

Ray-centered local triangle parameterization: Stability in triangle regression is achieved through a constrained Cholesky-style shape transform and residual rotations around a ray-aligned frame. Triangles are predicted relative to local rays in camera space, avoiding degenerate cases and ensuring nonzero area.
Modified window function for differentiable rendering: The product-based window function improves gradient flow over traditional max-edge reduction, enhancing training stability and supervision across primitive boundaries.
Lightweight mesh conversion post-processing: After inference, semi-opaque triangle "soup" is converted into game-engine-ready, fully opaque meshes via a fast optimization and connectivity stitching, enabling immediate compatibility with standard rasterization engines.
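
The ray-centered parameterization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the canonical equilateral triangle, the use of softplus to keep the diagonal of the lower-triangular shape matrix positive, the axis-angle form of the residual tilt, and all function names (`decode_triangle`, `ray_frame`, and so on) are hypothetical choices made for this example.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softplus(x):
    # Numerically stable softplus, used here (an assumption) for positive diagonals.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def rotate(x, rotvec):
    """Rodrigues rotation of x by the axis-angle vector rotvec."""
    angle = math.sqrt(dot(rotvec, rotvec))
    if angle < 1e-12:
        return list(x)
    k = [c / angle for c in rotvec]
    kx, kd = cross(k, x), dot(k, x)
    return [x[i] * math.cos(angle) + kx[i] * math.sin(angle)
            + k[i] * kd * (1.0 - math.cos(angle)) for i in range(3)]

def ray_frame(ray_dir):
    """Right-handed orthonormal frame (u, v, w) with w along the ray."""
    w = normalize(ray_dir)
    helper = [0.0, 1.0, 0.0] if abs(w[1]) < 0.9 else [1.0, 0.0, 0.0]
    u = normalize(cross(helper, w))
    v = cross(w, u)
    return u, v, w

# Canonical 2D triangle (equilateral, centroid at the origin); an assumption.
CANONICAL = [(1.0, 0.0),
             (-0.5, math.sqrt(3.0) / 2.0),
             (-0.5, -math.sqrt(3.0) / 2.0)]

def decode_triangle(origin, ray_dir, depth, l11_raw, l21, l22_raw, tilt_u, tilt_v):
    u, v, w = ray_frame(ray_dir)
    center = [o + depth * d for o, d in zip(origin, w)]   # point on the ray
    # Lower-triangular shape matrix [[l11, 0], [l21, l22]] with positive diagonal,
    # so its determinant (and hence the triangle area) cannot reach zero.
    l11 = softplus(l11_raw) + 1e-6
    l22 = softplus(l22_raw) + 1e-6
    # Residual rotation: tilt the in-plane axes about an axis lying in the plane.
    rotvec = [tilt_u * a + tilt_v * b for a, b in zip(u, v)]
    u_r, v_r = rotate(u, rotvec), rotate(v, rotvec)
    verts = []
    for a, b in CANONICAL:
        x, y = l11 * a, l21 * a + l22 * b
        verts.append([c + x * p + y * q for c, p, q in zip(center, u_r, v_r)])
    return verts

def triangle_area(verts):
    e1 = [b - a for a, b in zip(verts[0], verts[1])]
    e2 = [b - a for a, b in zip(verts[0], verts[2])]
    n = cross(e1, e2)
    return 0.5 * math.sqrt(dot(n, n))
```

With zero tilt the triangle faces the camera along the ray, which is one way to read the point about avoiding triangles that vanish because they face the wrong way. The area equals the canonical area times `l11 * l22`, so it stays positive for any raw inputs.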

The decoder architecture leverages transfer learning by reusing the backbone of a pretrained VAE RGB decoder (Wan-2.1), conditionally infused with per-pixel camera ray embeddings, resulting in efficient decoding of local appearance and geometry.
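
To make the per-pixel camera rays concrete, the sketch below computes a world-space ray for one pixel of a pinhole camera. The pinhole model, the half-pixel center convention, and the world-to-camera extrinsics convention `x_cam = R x_world + t` are assumptions for illustration. The summary does not say how FLAT turns rays into embeddings, so only the ray geometry is shown, and the function names are hypothetical.

```python
import math

def camera_center(R, t):
    """World-space camera center C = -R^T t, assuming x_cam = R x_world + t."""
    return [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]

def pixel_ray(u, v, fx, fy, cx, cy, R):
    """Unit ray direction in world space through the center of pixel (u, v)."""
    # Back-project the pixel center with the inverse pinhole intrinsics.
    d_cam = [(u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0]
    # Rotate into world space with R^T (R maps world to camera).
    d_world = [sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]
```

Each pixel's ray is then the pair (camera center, direction). A decoder conditioned on these rays can predict geometry relative to them instead of in free world coordinates.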

Empirical Evaluation

FLAT is evaluated on RealEstate10K and DL3DV, and directly compared against 3D Gaussian Splatting (3DGS) and 2D Gaussian Splatting (2DGS) under an identical protocol, controlling for hyperparameters and representation. Results demonstrate:

Geometric quality: Cosine similarity between predicted and ground-truth normal maps reaches 0.853 for FLAT triangles, well above 2DGS (0.587). The 3DGS baseline yields near-random normals due to volumetric ambiguity.
Rendering fidelity: PSNR and SSIM scores are competitive with state-of-the-art feedforward pipelines. FLAT's triangle model achieves 21.45 PSNR and 0.245 LPIPS on RealEstate10K, only marginally lower than the highest 3DGS baselines (22.39 PSNR, 0.203 LPIPS).
Explicit mesh extraction: Opaque mesh conversion yields higher mesh rendering quality (21.23 PSNR, 0.749 SSIM) than TSDF (2DGS, 15.89 PSNR, 0.633 SSIM) and GS2Mesh (3DGS, 14.18 PSNR, 0.619 SSIM). Meshes extracted from FLAT are less sensitive to post-processing hyperparameters and have high local connectivity, with only 0.02% degenerate faces and an average triangle degree near the manifold ideal.
Pipeline flexibility: By attaching the FLAT decoder to different Wan-2.1 variants, diverse generation modes (image-to-video, text-to-video, video-to-video, interactive) seamlessly produce explicit, surface-aligned geometry.
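
The normal-map cosine similarity reported above can be computed roughly as follows. This is a minimal sketch of the generic metric, assuming normal maps are given as flat lists of 3-vectors and that invalid (near-zero) pixels are skipped. The paper's exact masking and averaging protocol is not described here, and the function name is hypothetical.

```python
import math

def mean_normal_cosine(pred, gt, eps=1e-8):
    """Mean cosine similarity between two lists of per-pixel normals."""
    total, count = 0.0, 0
    for p, g in zip(pred, gt):
        norm_p = math.sqrt(sum(c * c for c in p))
        norm_g = math.sqrt(sum(c * c for c in g))
        if norm_p < eps or norm_g < eps:
            continue  # skip pixels without a valid normal (assumed convention)
        total += sum(a * b for a, b in zip(p, g)) / (norm_p * norm_g)
        count += 1
    return total / count if count else float("nan")
```

A value of 1 means the predicted and reference normals agree everywhere, while values near 0 correspond to unrelated orientations.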

Ablations confirm that the combination of ray-centered parameterization, modified window function, and residual rotation is essential for stable, high-fidelity feedforward decoding. Post-optimization further tightens alignment, with a short refinement improving PSNR from 21.45 to 23.01 and SSIM from 0.710 to 0.790.
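
To put the refinement gain in perspective, a PSNR difference can be converted into a ratio of mean squared errors. The sketch below uses the standard definition PSNR = 10 log10(peak² / MSE). It treats the reported numbers as if they came from a single image, which is an assumption: benchmark PSNR is usually averaged over many images, so the ratio is only indicative.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)

def mse_ratio_from_psnr_gain(psnr_before, psnr_after):
    """MSE_after / MSE_before implied by a change in PSNR."""
    return 10.0 ** (-(psnr_after - psnr_before) / 10.0)

ratio = mse_ratio_from_psnr_gain(21.45, 23.01)
```

Here `ratio` comes out at roughly 0.70, so the 1.56 dB gain corresponds to about a 30% reduction in mean squared error under that single-image reading.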

Theoretical Implications

FLAT demonstrates that explicit surface-aligned geometry is directly recoverable from compressed video diffusion latents, bypassing expensive per-scene optimization. This shifts the paradigm from volumetric generative approaches toward explicit, mesh-compatible representations, suggesting latent generative models encode sufficient geometric priors for feedforward decoding. The ray-centered triangle parameterization and product window formulation represent advances in stable differentiable rendering for flat primitives, extending the range of geometrically faithful representations accessible from generative models.

The systematic comparison of 3DGS, 2DGS, and triangle splatting under controlled conditions illuminates tradeoffs: volumetric Gaussians are optimal for visual quality and pixel-level metrics, but triangles yield sharper details, explicit surfaces, and mesh-aligned geometry, vital for real-time rendering and downstream asset creation.

Practical Impact and Future Directions

FLAT enables practical, real-time 3D scene generation compatible with game engines and simulation platforms, using only a single input image. The decoder-swap design allows direct deployment with existing video generation pipelines, with improvements in upstream generative models instantly propagating to explicit scene generation. The method substantially lowers the barrier for asset creation in AR/VR, gaming, robotics, and digital twin applications.

Limitations remain in scene scale and mesh density; thin structures, reflections, and highly complex surfaces are challenging for triangle representation. Further research should focus on watertight mesh extraction, scaling to persistent large-scale environments, and integrating with long-horizon, world-consistent generative video models. Increasing training scale and incorporating richer multi-view supervision will likely improve fidelity and robustness.

Conclusion

FLAT delivers a feedforward solution for geometrically accurate, explicit 3D scene generation from video diffusion latents. The methodology advances parameterization and rendering for triangle splatting, achieving high geometric quality, mesh compatibility, and efficient pipeline integration. Results validate the practical and theoretical feasibility of direct surface primitive decoding, paving the way for explicit feedforward scene generation tightly integrated with latent generative models (2606.24876).

Explain it Like I'm 14

FLAT: Turning One Picture into an Explorable 3D World

What is this paper about?

This paper introduces FLAT, a new way to turn a single photo into a 3D scene you can explore from different angles in real time. Instead of building a scene from blurry “blobs” (which are easy for computers but bad for making game-ready models), FLAT learns to build the scene from flat triangle pieces—just like how most video games are made.

What questions are the researchers asking?

The team focuses on three simple questions:

Can we go from one picture to a full 3D scene quickly, without slow per-scene fine-tuning?
Can we predict clear, solid surfaces (made of triangles) instead of fuzzy volumes (blobs), so the result works in game engines?
What’s the trade-off between different 3D building blocks (3D Gaussians, 2D Gaussians, and triangles) when it comes to image quality, geometry accuracy, and ease of mesh export?

How does their method work (in everyday terms)?

Think of building a 3D scene like making a diorama:

A video diffusion model is like a smart “world imagination engine” trained on tons of videos. It doesn’t directly produce a finished 3D world, but it knows a lot about how the world should look from different views.
FLAT takes the “ideas” from this engine (called latent features) and, in a single pass, turns them into many small triangle pieces placed in 3D space. These triangles form the surfaces of the scene (walls, floors, furniture), much like the triangles used in video game graphics.

Two key innovations make this possible:

Ray-centered triangle prediction (stable orientation)
- Instead of guessing triangle positions and angles directly (which can easily go wrong), FLAT places each triangle around a camera ray (a line from the camera into the scene). It predicts:
  - How far along the ray the triangle sits (depth),
  - How big and skewed the triangle is (using a safe shape transform),
  - How it tilts relative to the ray (small “residual” rotations rather than full 3D spins).
- This avoids triangles collapsing into lines or vanishing because they’re facing the wrong way.
A better “soft edge” for rendering (smooth training)
- To train with triangles, the system needs a way to softly decide how much each pixel belongs to each triangle. If you use a harsh boundary, the learning signals dry up.
- FLAT uses a “product window” function that softly extends influence just beyond triangle edges, encouraging gradients (learning signals) to flow to all three vertices. It’s like slightly blurring the edges so the model learns more reliably where the triangle should go.
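
The difference between a hard "nearest edge" window and a product window can be illustrated numerically. In the sketch below, the max-based window uses signed edge distances normalized by the inradius, and the product window multiplies one shifted ramp per edge. The product form, the parameter names `eps` and `sigma`, and all helper names are assumptions made for this example; the paper's exact formula is not reproduced here.

```python
import math

def edges_ccw(tri):
    """Edges of a counter-clockwise 2D triangle as (nx, ny, d) with outward
    unit normals, so the signed distance nx*x + ny*y + d is negative inside."""
    out = []
    for i in range(3):
        (ax, ay), (bx, by) = tri[i], tri[(i + 1) % 3]
        ex, ey = bx - ax, by - ay
        n = math.hypot(ex, ey)
        nx, ny = ey / n, -ex / n
        out.append((nx, ny, -(nx * ax + ny * ay)))
    return out

def signed_dists(edges, p):
    return [nx * p[0] + ny * p[1] + d for nx, ny, d in edges]

def incenter(tri):
    A, B, C = tri
    a, b, c = math.dist(B, C), math.dist(C, A), math.dist(A, B)
    s = a + b + c
    return ((a * A[0] + b * B[0] + c * C[0]) / s,
            (a * A[1] + b * B[1] + c * C[1]) / s)

def max_window(edges, rho, p, sigma=1.0):
    # Only the single closest edge determines the value.
    phi = max(signed_dists(edges, p))
    return max(0.0, -phi / rho) ** sigma

def product_window(edges, rho, p, sigma=1.0, eps=0.1):
    # Assumed illustrative form: one ramp per edge, shifted outward by eps*rho.
    w = 1.0
    for dist in signed_dists(edges, p):
        w *= max(0.0, (eps * rho - dist) / ((1.0 + eps) * rho))
    return min(1.0, w) ** sigma

def edge_sensitivities(window, edges, rho, p, h=1e-6):
    """Finite-difference change of the window when each edge offset moves."""
    base = window(edges, rho, p)
    out = []
    for j in range(3):
        shifted = list(edges)
        nx, ny, d = edges[j]
        shifted[j] = (nx, ny, d + h)
        out.append((window(shifted, rho, p) - base) / h)
    return out

tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
edges = edges_ccw(tri)
rho = -max(signed_dists(edges, incenter(tri)))  # inradius
```

For a pixel near one edge, `edge_sensitivities(max_window, ...)` is nonzero for that edge only, while the product version responds to all three edges. The product version also stays positive slightly outside the triangle. That matches the intuition above: the learning signal reaches every vertex and does not stop abruptly at the boundary.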

Training and pipeline (big picture):

Input: one image + a planned camera path (how you’ll move through the scene).
A frozen video diffusion model produces a compact video-like feature map (the latent).
FLAT’s scene decoder (adapted from a pre-trained video model’s decoder) reads both the visual latent and camera information and outputs triangle positions, colors, and opacities.
Optional quick refinement: a short cleanup step turns the semi-transparent triangle “soup” into an opaque, clean mesh ready for real-time engines.

What did they find, and why does it matter?

Main results:

Better geometry with triangles: FLAT’s triangle-based scenes have much more accurate surface shapes and normals (the directions surfaces face). This matters for realism, lighting, physics, and interactions.
Competitive visuals: While 3D Gaussian methods (fuzzy blobs) often squeeze out slightly higher pixel-matching scores (like PSNR), FLAT’s triangles still produce strong image quality and look good from new views.
Game-engine friendly: Triangles are a native format in graphics engines. With a lightweight post-processing step, FLAT creates opaque meshes far more reliably than trying to extract meshes from Gaussian blobs. In tests, its meshes rendered much better than those extracted from 2D/3D Gaussians under the same conditions.

Trade-offs they measured:

3D Gaussians (3DGS): Great visual scores and easy to train, but poor surface precision and hard to convert into clean meshes.
2D Gaussians (2DGS): Better surface awareness than 3DGS but still “soft,” and mesh extraction is tricky.
Triangles (FLAT): Best at accurate surfaces and easiest to convert to game-ready meshes; visuals remain strong and real-time renderable.

Why it matters:

If you want scenes for games, AR/VR, robotics, or simulation, you need accurate surfaces—not just pretty pictures. FLAT brings us closer to one-click, from-photo-to-3D-world pipelines that run fast and export cleanly.

What’s the potential impact?

For creators and developers: Faster creation of explorable 3D scenes from a single photo, with outputs that plug directly into standard engines (Unreal, Unity, etc.).
For research and industry: A step toward combining powerful generative video models with explicit, accurate 3D geometry, improving both quality and practicality.
For future tools: Encourages more work on explicit surface prediction from generative models, paving the way for interactive world-building tools that don’t require hours of optimization.

In short: FLAT shows that we can decode triangles—the building blocks of real-time 3D—directly from a generative video model’s “thoughts,” producing scenes that look good, have accurate shapes, and export easily for real-time use.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, to guide future research:

Dependence on a single frozen video diffusion backbone (Uni3C/Wan-2.1): transferability to other video generators and robustness to latent distribution shifts are not evaluated.
Sensitivity to camera pose errors and scale: training relies on MapAnything and RealCam-Vid; robustness to noisy or miscalibrated test-time cameras (e.g., ARKit/phone sensors) is unreported.
Bias and quality of pseudo-geometry supervision: normals (NormalCrafter) and depth (MiDaS-style disparity) may inject bias; the impact of supervision noise and options for label-free or self-supervised training are not analyzed.
Limited appearance modeling: triangles appear to use per-primitive constant color/opacity without view-dependent BRDFs; handling specularities, reflections, and varying illumination remains open.
Non-surface phenomena: triangles are non-volumetric; performance on fog, semi-transparency, hair/fur, and thin volumetric structures is not addressed; hybrid surface–volume decoding remains unexplored.
Primitive budget and adaptivity: predicting one triangle per 2×2 pixel tile (after removing the final upsample) may limit fine detail and thin structures; adaptive or learned primitive allocation is not studied.
Window function hyperparameters: the sensitivity of training stability and silhouette sharpness to the ε shift and the sharpness σ is not quantified; no ablation of these parameters is provided.
Theoretical properties of the product window function: formal analysis of bias, gradient behavior near edges, and consistency vs. max-based windows is missing.
Rotation parameterization limits: the ray-centered residual rotation may struggle for grazing angles, wide FOVs, and highly oblique surfaces; failure cases and angle-dependent behavior are not characterized.
Long-trajectory consistency: the decoder is trained on short, causal sequences but used for longer paths; drift, geometry consistency, and stability over very long camera motions are not measured.
Latent consistency: video generators can be multi-view inconsistent; how such inconsistencies propagate to geometry and how to mitigate them is not studied.
Mesh conversion reliance on post-optimization: while “lightweight,” compute cost, latency, and sensitivity to trajectory coverage (especially for occluded/backside surfaces) are not quantified.
Mesh quality beyond photometrics: manifoldness, watertightness, self-intersections, triangle quality, and compatibility with physics engines are not evaluated.
Domain generalization: training/evaluation focuses on RealEstate10K and DL3DV; performance on highly reflective/transparent objects, outdoor clutter, or object-centric domains remains unknown.
Occlusion and coverage: the opaque mesh is derived from frames along a chosen trajectory; guarantees on coverage of unseen/occluded areas for arbitrary future views are not provided.
Temporal stability: no metrics or analysis of flicker/shimmer in rendered sequences (photometric and normal consistency across frames) are reported.
Real-time performance characterization: FPS, triangle counts, memory footprint, and device targets (desktop vs. mobile) are not benchmarked.
Comparative fairness and stability: 2DGS supervision with externally predicted normals diverged; improved 2DGS objectives and a deeper comparison across representations under optimal training remain open.
Rasterization robustness: front-to-back alpha compositing with soft triangles may encounter z-fighting/intersections; robustness of depth ordering and occlusion handling is not analyzed.
Robustness to test-time camera inaccuracies: tolerance to moderate intrinsics/extrinsics errors and lens distortions is not measured.
Multi-input scalability: the approach targets single-image input; how performance scales with multiple input images or multi-modal conditioning (e.g., text) is not explored.
Training efficiency: training requires 8×H100 for 200k iterations; avenues for distillation, parameter-efficient finetuning, or smaller/lightweight decoders are not investigated.
Uncertainty quantification: no mechanism to estimate or expose uncertainty in hallucinated unseen regions or low-support geometry for downstream decision-making.
Geometry metrics: evaluation focuses on normal similarity; distance-to-scan, absolute/relative depth accuracy, and metric scale accuracy are not reported.
Material and UV interoperability: triangles are colored per-primitive; procedures for generating UV-mapped textures/materials for standard engines are not described.
Backbone evolution: stability under updates to the video diffusion backbone and methods to re-align the decoder with changing latent distributions remain open.
Dynamic scenes: the method assumes static scenes; extension to 4D (time-varying geometry/appearance) is not addressed.
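
The rasterization-robustness item above refers to front-to-back alpha compositing. A minimal sketch of that standard accumulation for a single pixel is given below, assuming each overlapping primitive contributes a (depth, alpha, color) sample. The function name and the early-termination threshold are illustrative choices.

```python
def composite_front_to_back(samples, min_transmittance=1e-4):
    """Blend (depth, alpha, rgb) samples for one pixel, nearest first.

    Returns the accumulated color and the remaining transmittance."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for _, alpha, rgb in sorted(samples, key=lambda s: s[0]):
        weight = transmittance * alpha          # light still reaching this sample
        color = [c + weight * x for c, x in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:   # pixel is effectively opaque
            break
    return color, transmittance
```

Because the result depends on the sort order, two semi-transparent primitives at nearly equal depth can swap order between views and change the pixel color. This is the kind of ordering sensitivity the open question points at.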

Practical Applications

Practical Applications of FLAT (Feedforward Latent Triangle Splatting)

FLAT decodes explicit, non-volumetric triangle primitives directly from video diffusion latents in a single pass, producing geometrically accurate, mesh-ready 3D scenes from a single image with real-time rendering and an optional lightweight opaque-mesh conversion. This enables new workflows where strong generative priors meet standard graphics pipelines.

Below are actionable applications derived from the paper’s findings, organized by deployment horizon.

Immediate Applications

These applications can be piloted or deployed now with available compute and existing DCC/game-engine tooling.

Sector: Software, Gaming, Film/VFX
- Use case: One-click level blockouts and previz from a single concept frame or plate
- Description: Generate explorable 3D environments from a single image; export opaque meshes to Unreal/Unity/Blender for blocking, scouting camera moves, and set extensions.
- Tools/workflows:
  - “Image → FLAT → Triangles → Mesh (glTF/FBX/OpenUSD) → Unreal/Unity”
  - Editor plugins (Unreal/Unity) that wrap the FLAT pipeline and mesh-conversion postprocess
- Assumptions/dependencies:
  - Access to a frozen video diffusion backbone (e.g., Uni3C/Wan-2.1) and GPU for inference
  - Hallucinated occluded content: suitable for art/previz, not authoritative reconstruction
Sector: XR/AR
- Use case: Room/scene expansion and parallax AR from one snapshot
- Description: Turn a single smartphone photo into an explorable parallax scene or lightweight room proxy; export meshes for AR anchors and occlusion masks.
- Tools/workflows:
  - Mobile capture → off-device FLAT inference → mesh export → ARKit/ARCore import
- Assumptions/dependencies:
  - Off-device inference or cloud execution; on-device feasible only with model distillation
Sector: Robotics Simulation (R&D), Autonomy (simulation)
- Use case: Rapid generation of physics-friendly simulation environments
- Description: Use triangle-based output for physics-ready surfaces (unlike volumetric Gaussians) to create domain-randomized training scenes from sparse images.
- Tools/workflows:
  - “Photo bank → FLAT → opaque mesh → Isaac Sim/Unity Robotics/Gazebo”
  - Batch pipelines to synthesize scene variations for data augmentation
- Assumptions/dependencies:
  - Not for ground-truth geometry; suitable for diversity and domain randomization, not safety-critical validation
Sector: Real Estate, AEC, Marketing
- Use case: Interactive walk-through teasers and virtual staging from listing photos
- Description: Create short navigable clips or interactive mini-tours from a single image to boost engagement; stage furniture on the exported mesh.
- Tools/workflows:
  - Listing image → FLAT → mesh + textures → web viewer (three.js/Babylon.js) or USDZ for AR
- Assumptions/dependencies:
  - Include disclaimers; geometry is approximate and not for measurements or compliance
Sector: E-commerce, Creative Marketing
- Use case: Product scene backdrops and interactive “3D postcards” from hero shots
- Description: Surround a product photo with an explorable scene for web/mobile experiences; export meshes for lightweight web rendering.
- Tools/workflows:
  - Product photo → FLAT → mesh → WebGL/React-Three-Fiber component
- Assumptions/dependencies:
  - Creative acceptance of hallucinated content; brand compliance review
Sector: Academia, Education (Computer Graphics and Vision)
- Use case: Teaching differentiable rendering and explicit-vs-volumetric representation tradeoffs
- Description: Use the paper’s parameterization (ray-centered triangles, product window function) in assignments comparing 3DGS/2DGS/triangle splats under identical conditions.
- Tools/workflows:
  - Classroom labs: swap window functions, rotation parameterizations; evaluate normal accuracy vs PSNR/LPIPS
- Assumptions/dependencies:
  - Access to pretrained VAE/diffusion latents; course compute budget
Sector: 3D Tooling and Engine Vendors
- Use case: Engine-native triangle splat rendering and mesh extraction utilities
- Description: Integrate FLAT-like triangle splat rasterization and postprocess as engine features; offer importers for FLAT outputs and conversion to opaque meshes.
- Tools/workflows:
  - Engine modules: differentiable triangle splat shaders; mesh densification/edge stitching utilities
- Assumptions/dependencies:
  - Adoption of the paper’s window function and ray-centered parameterization for stability
Sector: Mapping/SLAM Research
- Use case: Priors for SfM/SLAM bootstrapping from a single frame
- Description: Use FLAT geometry as a prior to stabilize early mapping and to propose surfaces behind occlusions (research-use only).
- Tools/workflows:
  - SLAM pipeline → optional FLAT prior → robust fusion with incoming observations
- Assumptions/dependencies:
  - Handle hallucination carefully; use as a soft prior, never as ground truth

Long-Term Applications

These require further research, scaling, or productization (e.g., tighter geometric guarantees, on-device performance, regulatory clarity).

Sector: Autonomous Driving, Mobile Robotics (deployment)
- Use case: Online single-shot 3D scene scaffolds for planning and prediction
- Description: Generate geometric scaffolds from sparse inputs for fast planning in novel environments.
- Tools/workflows:
  - Perception stack → FLAT-derived scaffold → uncertainty-aware planning
- Assumptions/dependencies:
  - Needs calibrated metric scale, bounded hallucination, reliability under domain shift; safety certification
Sector: Industrial Digital Twins (Energy, Utilities, Manufacturing)
- Use case: Rapid twin bootstrapping from sparse imagery for training and procedure rehearsal
- Description: Quickly generate plausibly structured environments as a starting point for twins, then refine with scans.
- Tools/workflows:
  - Field photos → FLAT mesh → merge with LiDAR/photogrammetry → physics/IoT integration
- Assumptions/dependencies:
  - Geometry must be verified; integrate with metrology-grade data for fidelity
Sector: On-Device AR/VR (Consumer)
- Use case: Real-time on-device image-to-world reconstruction for interactive AR
- Description: Run compressed diffusion + FLAT decoding locally for instant scene parallax and anchoring.
- Tools/workflows:
  - Distilled video-latent backbones; mobile GPU kernels for triangle splats and mesh postprocess
- Assumptions/dependencies:
  - Aggressive model compression, energy-aware scheduling, privacy-preserving on-device inference
Sector: Procedural World Generation at Scale (Gaming, UGC Platforms)
- Use case: Mixed text+image conditioned world builders with explicit, editable geometry
- Description: Combine FLAT with text-conditioned video diffusion to produce editable, engine-ready worlds en masse.
- Tools/workflows:
  - Prompt + references → video latents → FLAT → semantic retopo and material assignment
- Assumptions/dependencies:
  - Multi-modal conditioning, semantic scene graphs, quality control for marketplace assets
Sector: Telepresence and Remote Collaboration
- Use case: Bandwidth-efficient 3D telepresence from sparse frames
- Description: Reconstruct a remote scene from minimal imagery and stream meshes rather than video.
- Tools/workflows:
  - Low-rate image capture → FLAT reconstruction → progressive mesh streaming
- Assumptions/dependencies:
  - Robustness to motion/lighting changes, temporal consistency across re-captures
Sector: Healthcare, Public Safety Training
- Use case: Scenario generation for simulations (ER rooms, disaster sites) from limited visuals
- Description: Create plausible training spaces for procedure rehearsal and triage training.
- Tools/workflows:
  - Reference photos → FLAT → domain-tuned assets and physics
- Assumptions/dependencies:
  - Domain-specific constraints, validation by SMEs; not for diagnostic or architectural accuracy
Sector: Finance/Real Estate Analytics
- Use case: “3D-ized” comps for marketing analytics and engagement prediction
- Description: Generate 3D previews of listings to study engagement; augment pricing models with 3D interaction metrics.
- Tools/workflows:
  - Portfolio images → FLAT → web 3D → interaction telemetry → model features
- Assumptions/dependencies:
  - Ethical disclosures; avoid using hallucinated geometry for appraisal or compliance
Sector: Standards and Policy (Cross-sector)
- Use case: Provenance, disclosures, and safety guidelines for generated 3D scenes
- Description: Establish standards for labeling generated geometry, indoor privacy, and data governance.
- Tools/workflows:
  - C2PA-like provenance for 3D assets; “Generated 3D” watermarks; disclosure in listings and research
- Assumptions/dependencies:
  - Multi-stakeholder alignment; legal frameworks for indoor reconstruction and training data consent
Sector: Academic Research Roadmap
- Use case: Architectures and losses for explicit-primitive decoding
- Description: Extend ray-centered parameterization, window functions, and normal supervision to other primitives (meshes with connectivity, convexes).
- Tools/workflows:
  - Benchmarks with matched training setups (3DGS/2DGS/triangles) and geometry metrics; open datasets of video latents and camera embeddings
- Assumptions/dependencies:
  - Community access to pretrained video VAEs and licensing for research redistribution

Key Cross-Cutting Dependencies and Assumptions

Reliance on frozen video diffusion backbones (e.g., Uni3C/Wan-2.1); inference cost and licensing affect deployment.
Camera conditioning quality and metric scale (e.g., MapAnything) materially impact geometric plausibility and usability.
Geometry is plausible but not survey-grade; use cases requiring precise measurements must fuse with scans.
Postprocessing (opacity binarization, stitching) is lightweight but still needed to achieve engine-ready opaque meshes.
Training data provenance and indoor privacy: generated 3D from private images requires clear consent and disclosure.
Model robustness: occlusions, reflective surfaces, and unusual layouts may induce artifacts; risk management needed for production.
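
The postprocessing dependency above (opacity binarization, stitching) can be illustrated with a deliberately simple stand-in. The sketch below thresholds per-triangle opacity and welds nearby vertices by snapping them to a grid. It is not the paper's procedure, which is described as a short optimization plus connectivity stitching; the threshold, tolerance, and function name are hypothetical.

```python
def binarize_and_weld(triangles, opacities, threshold=0.5, tol=1e-3):
    """Turn a triangle soup into an indexed mesh.

    triangles: list of three (x, y, z) vertices per triangle
    opacities: one opacity in [0, 1] per triangle
    Returns (vertices, faces), where faces index into vertices."""
    index = {}      # quantized position -> vertex id
    vertices = []
    faces = []
    for tri, alpha in zip(triangles, opacities):
        if alpha < threshold:
            continue                      # drop mostly transparent triangles
        face = []
        for v in tri:
            key = tuple(round(c / tol) for c in v)
            if key not in index:
                index[key] = len(vertices)
                vertices.append(tuple(v))
            face.append(index[key])
        if len(set(face)) == 3:           # skip faces collapsed by welding
            faces.append(tuple(face))
    return vertices, faces
```

Grid snapping can miss pairs of vertices that straddle a cell boundary, so a production pipeline would more likely use a spatial index or the optimization-based stitching the paper describes. The sketch only shows why shared vertices matter when exporting an engine-ready mesh.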

Glossary

2D Gaussian Splatting (2DGS): A surface-aligned splatting representation that models opaque-ish 2D Gaussian primitives for geometrically accurate rendering. "We also train 3DGS [33] and 2DGS variants [28] under identical conditions, enabling direct comparison of the representations."
3D Gaussian Splatting (3DGS): A volumetric representation using anisotropic 3D Gaussians to render radiance fields efficiently in real time. "A number of approaches thus follow a generate-then-optimize paradigm [67, 73, 17] wherein a 3D Gaussian Splatting [33] or NeRF [45] representation is optimized to fit frames generated by the video model."
Alpha-compositing: The process of accumulating colors and opacities along depth to produce the final rendered image. "The rendered image is then obtained by accumulating the contributions of all overlapping triangles in front-to-back depth order, following the standard alpha-compositing equation used in differentiable splatting methods [28, 33, 26]."
Anisotropic (Gaussians): Direction-dependent Gaussian primitives with different spreads along principal axes. "3D Gaussian Splatting [33] showed that collections of anisotropic 3D Gaussians enable high-quality real-time rendering."
Camera intrinsics and extrinsics: Intrinsics define the camera’s internal parameters; extrinsics define its pose in world coordinates. "where each $P_t = (K_t, R_t, t_t)$ denotes camera intrinsics and extrinsics"
Cholesky-style shape transform: A lower-triangular parameterization with positive diagonals ensuring non-degenerate 2D triangle shapes. "each decoder token predicts a ray-centered triangle defined by a constrained Cholesky-style shape transform and residual rotations around a ray-aligned frame"
Differentiable rasterization: A smoothing of rasterization operations to allow gradient-based optimization. "To enable differentiable rasterization, we assign each pixel $p$ a soft coverage value $I_m(p) \in [0,1]$ via a window function described below."
Differentiable triangle rendering: Rendering formulation for triangles that provides gradients for learning. "We represent the scene as a set of triangle splats, following differentiable triangle rendering [25]."
Disparity (scale-invariant): An inverse-depth measure used with a loss that is invariant to global scale. "We also supervise rendered depth with a scale-invariant disparity loss, as in MiDaS [48]."
Feedforward (decoder/model): A single-pass prediction without per-scene optimization or iterative refinement. "We introduce FLAT, a feedforward model that directly predicts semi-opaque triangle-splatting primitives [25, 24] from the latent space of a frozen video diffusion model in a single forward pass."
Incenter (of a triangle): The point equidistant from all triangle edges where the inscribed circle is centered. "Let $s_m$ be the triangle incenter and let $\rho_m = -\max_i L_{m,i}(s_m)$ denote its inradius in screen space."
Inradius (of a triangle): The radius of the largest circle that fits inside a triangle. "Let $s_m$ be the triangle incenter and let $\rho_m = -\max_i L_{m,i}(s_m)$ denote its inradius in screen space."
Latent space (video latents): Compressed feature space produced by the video diffusion VAE used for decoding scene parameters. "We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass."
LPIPS (perceptual loss): A learned perceptual metric for comparing images that correlates with human judgment. "In line with other feedforward 3D models [76, 3, 38], we use a pixel-wise L2 loss, along with a perceptual LPIPS loss [71], between the rendered and target frames."
Mamba (state space model architecture): A linear-time sequence model used in some decoders; here found insufficient for non-volumetric primitives. "Employing the LongLRM Mamba-based decoder used in Lyra also underperforms, suggesting that its limited capacity is insufficient for decoding complex non-volumetric primitives."
Marching cubes: A classic isosurface extraction algorithm from volumetric data. "Thus, traditional marching cubes or TSDF surface-extraction methods simply fail in most scenes."
MeshSplatting: A method that enables differentiable rendering with opaque meshes and connectivity. "MeshSplatting [24] further extends this line of work by enabling connectivity, allowing for differentiable mesh optimization."
NeRF (Neural Radiance Fields): A volumetric representation that models view-dependent radiance to synthesize novel views. "A number of approaches thus follow a generate-then-optimize paradigm [67, 73, 17] wherein a 3D Gaussian Splatting [33] or NeRF [45] representation is optimized to fit frames generated by the video model."
Opaque mesh: A triangle mesh with near-binary opacities suitable for standard graphics pipelines. "We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering."
Opacity regularization: A training term that encourages triangles’ opacities toward desired values to stabilize rendering. "we apply an opacity regularization term, as commonly used in feedforward 3D Gaussian methods [76, 3], and remove triangles with opacity below 40%."
Pinhole camera model: A projection model mapping 3D points to 2D pixels using camera intrinsics and pose. "we project each vertex to the image plane with a standard pinhole camera model."
Plücker ray embedding: A 6D representation of 3D lines/rays using direction and moment components. "Starting from the pixel-aligned Plücker ray embedding $I_{\mathrm{pl}} = (o \times d, d)$,"
PSNR (Peak Signal-to-Noise Ratio): An image quality metric measuring reconstruction fidelity. "and directly optimizes pixel-wise metrics such as PSNR."
Quaternion: A 4D rotation parameterization commonly used for 3D orientations. "We found this decomposition to be more numerically stable than predicting a full 3D rotation, such as a quaternion, for each triangle."
Ray-centered rotation parameterization: Predicting triangle orientation relative to the camera ray to stabilize training. "FLAT solves with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering."
Ray-tangent frame: A local coordinate frame tangent to the viewing ray used to lift 2D triangles to 3D. "Finally, the local 2D triangle is lifted to 3D using a ray-tangent frame."
Residual rotations (around a ray-aligned frame): Small rotations predicted relative to a canonical ray-aligned orientation. "By predicting residual rotations around a ray-aligned frame, triangles inherit position from the predicted ray depth, shape from the Cholesky-style 2D transform, and orientation from a locally constrained rotation."
RPPC parameterization (ray closest point): Encoding each ray by its direction together with the point on the ray closest to the origin, which exposes position and depth. "This RPPC parameterization better exposes the ray position and relative depth to the decoder."
Sigmoid-based window function: A smooth coverage function based on sigmoids used in splatting; contrasted with the proposed product window. "Window Function: Comparison of sigmoid-based window function [26, 14], max edge distance is used in [25] and ours."
SSIM (Structural Similarity Index): A perceptual similarity metric focusing on structural fidelity. "PSNR ↑ SSIM ↑ LPIPS ↓"
TSDF (Truncated Signed Distance Function): A volumetric field representing distance to the surface, used for mesh extraction. "Thus, traditional marching cubes or TSDF surface-extraction methods simply fail in most scenes."
Triangle soup: A set of unconnected, semi-transparent triangles prior to mesh cleanup. "a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation"
Triangle splats: Soft, semi-opaque triangle primitives used for differentiable rendering. "We represent the scene as a set of triangle splats, following differentiable triangle rendering [25]."
Variational Autoencoder (VAE): A neural autoencoder with a probabilistic latent space used to compress video frames. "The VAE encoder temporally downsamples the video by a factor of $r_t = 4$ and spatially by $r_s = 8$."
Video diffusion model: A generative model that denoises sequences to synthesize videos, providing strong priors. "Recent advances in video diffusion models [55, 46, 59, 35, 47] offer a viable path towards this goal."
Volumetric representation: Scene parameterization that models volume and transparency rather than explicit surfaces. "these are volumetric, semi-transparent blobs that are well-suited to training scene decoders via differentiable rendering."
Window function (triangle rendering): A smooth coverage function that approximates hard triangle boundaries to enable gradients. "A window function replaces the hard triangle with a smooth approximation, enabling effective gradient flow."
Zero-convolutional blocks: Zero-initialized convolutional layers used to inject conditioning signals safely. "We introduce camera conditioning via zero-convolutional blocks and attach lightweight output heads that map intermediate decoder features to triangle parameters rather than RGB pixels."
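
Several glossary entries above describe small, standard computations: the Plücker ray embedding $(o \times d, d)$, the incenter and inradius of a triangle, front-to-back alpha-compositing, and PSNR. The following is a minimal NumPy sketch of these textbook definitions, included only to make the terms concrete. All function names are hypothetical and none of this is taken from the FLAT code; in particular, normalizing the ray direction is an assumption, and the sketch does not implement the paper's product window function.

```python
import numpy as np

def plucker_embedding(origin, direction):
    """6D Plücker embedding (o x d, d) of a ray.
    Normalizing d is an assumption made here, not stated in the source."""
    d = direction / np.linalg.norm(direction)
    return np.concatenate([np.cross(origin, d), d])

def incenter_and_inradius(tri):
    """Incenter and inradius of a 2D triangle given as a (3, 2) vertex array."""
    a_pt, b_pt, c_pt = tri
    # Side lengths opposite to each vertex.
    a = np.linalg.norm(b_pt - c_pt)
    b = np.linalg.norm(c_pt - a_pt)
    c = np.linalg.norm(a_pt - b_pt)
    perimeter = a + b + c
    # The incenter is the side-length-weighted average of the vertices.
    incenter = (a * a_pt + b * b_pt + c * c_pt) / perimeter
    # Triangle area from the 2D cross product of two edge vectors.
    e1, e2 = b_pt - a_pt, c_pt - a_pt
    area = 0.5 * abs(e1[0] * e2[1] - e1[1] * e2[0])
    # Inradius = area / semi-perimeter.
    inradius = 2.0 * area / perimeter
    return incenter, inradius

def alpha_composite(colors, alphas):
    """Front-to-back alpha-compositing for one pixel.
    colors: (N, 3), alphas: (N,), both sorted nearest primitive first."""
    # Transmittance reaching primitive i is the product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance
    return weights @ colors

def psnr(image, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB (undefined when the images are identical)."""
    mse = np.mean((image - reference) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)
```

As a quick check of the geometry, the 3-4-5 right triangle with vertices (0, 0), (4, 0), (0, 3) has its incenter at (1, 1) and an inradius of 1, and compositing a half-transparent red primitive in front of a half-transparent green one gives the color (0.5, 0.25, 0).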

FLAT | Feedforward Latent Triangle Splatting

FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Summary

Feedforward Latent Triangle Splatting (FLAT): Explicit Geometric Scene Generation from Video Diffusion Latents

Motivation and Context

Methodological Contributions

Empirical Evaluation

Theoretical Implications

Practical Impact and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

FLAT: Turning One Picture into an Explorable 3D World

What is this paper about?

What questions are the researchers asking?

How does their method work (in everyday terms)?

What did they find, and why does it matter?

What’s the potential impact?

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Practical Applications of FLAT (Feedforward Latent Triangle Splatting)

Immediate Applications

Long-Term Applications

Key Cross-Cutting Dependencies and Assumptions

Glossary

Open Problems

Continue Learning

Collections

GitHub
