RenderFormer: Transformer-Based Neural Rendering
- The paper introduces a dual-stage transformer architecture that maps tokenized scene data to radiance patches, bypassing recursive light transport calculations.
- It employs a view-independent stage for modeling global light transport and a view-dependent stage for efficient image synthesis from ray-bundle tokens.
- Experimental results demonstrate competitive photorealism and a 50× speedup over Monte Carlo path-tracing, highlighting its potential for interactive graphics.
RenderFormer is a fully neural, transformer-based rendering framework that generates photorealistic images with full global illumination directly from triangle-mesh scene descriptions. Unlike traditional renderers based on physical simulation, RenderFormer requires no explicit per-scene training or fine-tuning, and it does not rely on recursive light transport calculations. Instead, it formulates image synthesis as a sequence-to-sequence mapping—transforming a tokenized scene representation into a tokenized image representation—via a dual-stage transformer architecture. RenderFormer achieves rapid, high-fidelity rendering of complex scenes, generalizes across novel objects and lighting conditions, and performs competitively with Monte Carlo path tracing at a fraction of its computational cost (Zeng et al., 28 May 2025).
1. Foundations and Problem Formulation
The classical rendering equation,

$$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (n \cdot \omega_i)\, d\omega_i,$$

recursively expresses light transport through scene geometry, materials, and light sources. RenderFormer circumvents this process, learning a mapping

$$\mathcal{I} = \mathcal{R}(\mathcal{T}, \mathcal{V}),$$

where $\mathcal{T}$ is a sequence of triangle tokens encoding scene geometry and appearance, $\mathcal{V}$ is a sequence of ray-bundle tokens parameterizing the camera view, and $\mathcal{I}$ encodes radiance over image patches. This paradigm replaces explicit simulation with a direct, globally informed, token-based mapping learned entirely from data.
Key components:
- Input tokens $\mathcal{T} = \{t_i\}$: Each $t_i$ encodes per-vertex normals, GGX BRDF reflectance parameters, and emission for one triangle.
- Output tokens $\mathcal{I} = \{p_j\}$: Each $p_j$ summarizes radiance for an $8 \times 8$-pixel patch, representing 64 rays.
- Ray-bundle tokens $\mathcal{V} = \{v_j\}$: Each $v_j$ encodes directions for the 64 rays of its corresponding patch.
- Sequence-to-sequence architecture: The mapping is realized by composing a view-independent transformer with a view-dependent transformer.
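The token interfaces above can be sketched with placeholder shapes; the latent width, triangle count, and random contents below are assumptions for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical shapes for the sequence-to-sequence formulation:
# triangle tokens form the scene sequence, and each ray-bundle token
# packs the 64 ray directions of one 8x8-pixel patch.
n_tris, d_model = 1024, 768              # d_model is an assumed latent width
res, patch = 512, 8                      # 512x512 image, 8x8-pixel patches

rng = np.random.default_rng(0)
triangle_tokens = rng.normal(size=(n_tris, d_model))        # sequence of triangle tokens
n_patches = (res // patch) ** 2                             # 64x64 = 4096 patches
ray_dirs = rng.normal(size=(n_patches, patch * patch, 3))   # 64 ray directions per patch
ray_bundle_tokens = ray_dirs.reshape(n_patches, -1)         # 64 rays x 3 = 192-D per token
```

The 192-D width of each ray-bundle token falls directly out of concatenating 64 three-dimensional ray directions.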
2. View-Independent Transformer Stage
This stage models all triangle-to-triangle light transport in a scene, producing latent per-triangle features that represent global illumination independently of camera view.
Embedding and Encoding
- Triangle tokens: Each triangle token is constructed as the sum of two components:
- Vertex normals: Per-vertex normals are NeRF-style positionally encoded (6 frequencies), linearly projected, and RMS normalized.
- Material/emission: A 10-D vector containing diffuse and specular albedo, roughness, and emission, linearly projected and RMS normalized.
- Relative positional encoding: A 3D rotary positional encoding (3D-RoPE) is applied to the three vertex positions of each triangle, using 6 frequencies. This produces a set of $54$ angles which, sine/cosine embedded, effect a block-diagonal rotation on the first $108$ dimensions of every $128$-D attention head, ensuring translation covariance and spatial awareness across attention layers.
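The 3D-RoPE bookkeeping above can be checked with a small sketch (illustrative only; the geometric frequency spacing is an assumption): 3 vertices × 3 coordinates × 6 frequencies gives 54 rotation angles, whose sine/cosine pairs fill the first 108 of each 128-D head.

```python
import numpy as np

def rope_angles(vertices, n_freqs=6):
    """vertices: (3, 3) array of one triangle's vertex positions.
    Returns the 54 rotation angles (3 vertices x 3 coords x 6 freqs)."""
    freqs = 2.0 ** np.arange(n_freqs)                  # assumed geometric spacing
    return (vertices[..., None] * freqs).reshape(-1)   # (3, 3, 6) -> 54 angles

tri = np.random.default_rng(1).normal(size=(3, 3))
angles = rope_angles(tri)
rotary = np.concatenate([np.sin(angles), np.cos(angles)])  # 108 rotated dims per head
```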
- Architecture:
- Input sequence: the triangle tokens plus $16$ global register tokens.
- 12 transformer layers, each implementing RMS-Norm, 6-head multi-head self-attention (QK-Norm), residual connections, and SwiGLU feed-forward layers.
- Self-attention as light transport:
- For the triangle-token matrix $T^{(l)}$ at layer $l$:

$$T^{(l+1)} = T^{(l)} + \mathrm{softmax}\!\left(\frac{Q^{(l)} K^{(l)\top}}{\sqrt{d}}\right) V^{(l)},$$

with $Q^{(l)}, K^{(l)}, V^{(l)}$ linear projections of $T^{(l)}$ and $d$ the per-head dimension.
- The self-attention weights serve as a learned model of inter-triangle light contribution.
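A single head of this triangle-to-triangle attention can be sketched in numpy (a minimal sketch, assuming random placeholder projections and a small head size):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def triangle_self_attention(tokens, Wq, Wk, Wv):
    """One attention head over triangle tokens; the softmax matrix plays the
    role of a learned triangle-to-triangle light-contribution matrix."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_tris, n_tris)
    return tokens + weights @ v, weights                # residual update

rng = np.random.default_rng(2)
toks = rng.normal(size=(32, 16))                        # 32 triangles, toy width
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out, w = triangle_self_attention(toks, Wq, Wk, Wv)
```

Each row of `w` sums to one, so every triangle's update is a convex combination of contributions from all other triangles, which is what makes the interpretation as learned light transport natural.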
3. View-Dependent Transformer Stage
This stage synthesizes the image from the perspective of a given camera, using cross-attention from ray-bundle tokens to the globally encoded triangle features.
Ray-Bundle Tokenization and Processing
Tokenization: For each patch $j$, $64$ camera rays are bundled; their directions are concatenated into a $192$-D vector, linearly projected, and RMS normalized, yielding the ray-bundle token $v_j$.
Architecture:
- 6 transformer layers, each with RMS-Norm, cross-attention (queries from ray-bundle tokens, keys/values from the triangle features), self-attention, SwiGLU feed-forward, and residual connections.
- The final 4 layers' outputs are input to a compact dense vision transformer that decodes a $768$-D bundle token per patch.
- A linear projection maps each bundle token to log-encoded HDR radiance for its $64$ rays.
- Cross-attention as view-dependent transport:
- For ray-bundle token $v_j^{(l)}$ at layer $l$, with keys $K$ and values $V$ projected from the triangle features:

$$v_j^{(l+1)} = v_j^{(l)} + \mathrm{softmax}\!\left(\frac{q_j^{(l)} K^{\top}}{\sqrt{d}}\right) V,$$

where $q_j^{(l)}$ is a linear projection of $v_j^{(l)}$.
- This aggregates view-dependent information from the full set of triangles.
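The cross-attention step can be sketched analogously to the view-independent stage (a toy numpy sketch with assumed shapes and random placeholder projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ray_bundle_cross_attention(ray_tokens, tri_feats, Wq, Wk, Wv):
    """Each ray-bundle token queries the full triangle sequence; the
    attention row for a patch weights every triangle's contribution."""
    q = ray_tokens @ Wq
    k, v = tri_feats @ Wk, tri_feats @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_patches, n_tris)
    return ray_tokens + weights @ v                     # residual update

rng = np.random.default_rng(3)
rays = rng.normal(size=(8, 16))        # 8 patches, toy width
tris = rng.normal(size=(32, 16))       # 32 globally encoded triangle features
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = ray_bundle_cross_attention(rays, tris, Wq, Wk, Wv)
```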
4. Training Protocol and Loss Definitions
Training is fully supervised and conducted entirely on synthetic data with randomized layouts, materials, and lighting.
Losses:
- $L_1$ pixel loss on log-transformed HDR radiance.
- Perceptual LPIPS loss on tone-mapped images.
- Total objective: a weighted sum of the pixel and LPIPS terms.
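The pixel term can be sketched as follows (an assumption: `log1p` and the $L_1$ norm stand in for the paper's exact transform; the LPIPS term is omitted since it requires a pretrained network):

```python
import numpy as np

def log_hdr(x):
    """Compress HDR radiance into log space before comparison."""
    return np.log1p(x)

def pixel_loss(pred_hdr, gt_hdr):
    """Mean absolute error between log-encoded HDR images."""
    return np.abs(log_hdr(pred_hdr) - log_hdr(gt_hdr)).mean()

gt = np.full((8, 8), 2.0)        # constant HDR target for illustration
zero_pred = np.zeros((8, 8))     # an all-black prediction
```

Operating in log space keeps very bright emitters from dominating the loss relative to dimly lit regions.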
- Optimization setup:
- Optimizer: AdamW, batch size 128 across 8 A100 GPUs, decoupled weight decay.
- Learning rate: linear warm-up over $8$k steps, then cosine decay.
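The warm-up-then-cosine schedule can be sketched as below; the peak learning rate is a placeholder, since its value is not stated here:

```python
import math

def lr_schedule(step, peak_lr=1e-3, warmup=8_000, total=500_000):
    """Linear warm-up to peak_lr over `warmup` steps, then cosine decay
    to zero at `total` steps (peak_lr is an assumed placeholder)."""
    if step < warmup:
        return peak_lr * step / warmup                  # linear warm-up
    t = (step - warmup) / (total - warmup)              # progress in [0, 1]
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))
```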
- Resolution and scheduling:
- Pre-training: $500$k iterations at reduced resolution and triangle budget.
- Fine-tuning: $512\times512$ pixels, up to $4096$ triangles, $100$k iterations.
- Data diversity:
- 2 million synthetic scenes, 8 million renders per target resolution.
- Training scenes constructed from 1–3 Objaverse objects, remeshed, and arranged in one of four layout templates.
- Materials employ random albedo/roughness per object/triangle, parametrized by GGX BRDF.
- 1–8 triangle emitters per scene with randomized intensity, placed outside the geometry.
- Cameras are always placed outside the mesh, with variable field of view and distances scaled relative to the scene bounding box.
- On-the-fly scene rotation via RoMa for rotational generalization.
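A scene sampler following the recipe above might look like the sketch below. Counts stated in the text are kept; elided specifics (triangle budget, emitter intensity range) are deliberately left out rather than guessed:

```python
import random

def sample_scene(rng, n_layouts=4):
    """Draw one randomized training-scene configuration (illustrative;
    not the paper's data-generation code)."""
    return {
        "layout": rng.randrange(n_layouts),   # one of four layout templates
        "n_objects": rng.randint(1, 3),       # 1-3 Objaverse objects
        "n_emitters": rng.randint(1, 8),      # 1-8 triangle emitters
    }

scene = sample_scene(random.Random(0))
```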
5. Experimental Results and Generalization
RenderFormer’s performance is evaluated both quantitatively and qualitatively, and its scalability and failure modes are systematically analyzed.
| Metric | Score on 512² |
|---|---|
| PSNR | 29.8 dB |
| SSIM | 0.952 |
| LPIPS | 0.055 |
| HDR-FLIP | 0.175 |
- Comparative performance: On held-out test scenes at $512\times512$ with up to $4096$ triangles, RenderFormer achieves 29.8 dB PSNR, 0.952 SSIM, 0.055 LPIPS, and 0.175 HDR-FLIP. Compared to Cycles path tracing (4096 spp plus denoising), RenderFormer renders roughly 50× faster (0.06 s vs 3–4 s per frame).
- Qualitative effects: Realizes sub-triangle shadows, soft/hard shadowing, diffuse interreflection, indirect lighting, glossy/specular reflections (up to 2–3 bounces on average), multiple emitters, and consistent shading under camera movement.
- Generalization:
- Handles up to 6000 triangles at inference (with graceful loss of detail).
- Robust to field of view and camera distances outside the training distribution, as long as camera is exterior to the mesh.
- Performance degrades gracefully with >8 light sources or sources inside the mesh; compositing renders for separate light conditions mitigates this.
- Early extension to textured BRDFs via per-triangle texture rasterization yielded blurred, yet plausible, outputs.
- Failure modes: Issues arise with excessive scene complexity, internal light sources, or significant deviations from training distributions, but failures are typically nondestructive.
6. Context and Implications
RenderFormer demonstrates that transformer-based, sequence-to-sequence frameworks can “solve” the rendering equation over triangle meshes with global illumination, solely via learned inter-object attention and ray-triangle aggregation. By eschewing per-scene adaptation and explicit physical simulation, RenderFormer can efficiently synthesize accurate, globally lit images for arbitrary scenes in a single pass, providing a substantial speedup over Monte Carlo renderers at comparable fidelity (Zeng et al., 28 May 2025).
A plausible implication is that transformer-only neural rendering pipelines represent a viable alternative to both classic rendering and earlier neural field/implicit-representation approaches. Framing rendering as two-stage attention—view-independent triangle transport, followed by pixel-level view aggregation—facilitates modeling effects previously considered out of reach for general neural renderers.
Open questions include further scaling (scene and triangle count), extending to textured materials, handling highly complex emissive configurations, and integrating with dynamic or temporal scene content. The approach’s independence from per-scene training strongly suggests applicability in content-creation, simulation, and interactive graphics workflows.