RenderFormer: Transformer-Based Neural Rendering
- The paper introduces a dual-stage transformer architecture that maps tokenized scene data to radiance patches, bypassing recursive light transport calculations.
- It employs a view-independent stage for modeling global light transport and a view-dependent stage for efficient image synthesis from ray-bundle tokens.
- Experimental results demonstrate competitive photorealism and a 50× speedup over Monte Carlo path-tracing, highlighting its potential for interactive graphics.
RenderFormer is a fully neural, transformer-based rendering framework that generates photorealistic images with full global illumination directly from triangle-mesh scene descriptions. Unlike traditional renderers based on physical simulation, RenderFormer requires no explicit per-scene training or fine-tuning, and it does not rely on recursive light transport calculations. Instead, it formulates image synthesis as a sequence-to-sequence mapping—transforming a tokenized scene representation into a tokenized image representation—via a dual-stage transformer architecture. RenderFormer achieves rapid, high-fidelity rendering of complex scenes, generalizes across novel objects and lighting conditions, and performs competitively with Monte Carlo path tracing at a fraction of its computational cost (Zeng et al., 28 May 2025).
1. Foundations and Problem Formulation
The classical rendering equation,

$$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (n \cdot \omega_i)\, d\omega_i,$$

recursively expresses light transport through scene geometry, materials, and light sources. RenderFormer circumvents this process, learning a mapping

$$\mathcal{I} = \mathcal{R}(\mathcal{T}, \mathcal{V}),$$

where $\mathcal{T}$ is a sequence of triangle tokens encoding scene geometry and appearance, $\mathcal{V}$ is a sequence of ray-bundle tokens parameterizing the camera view, and $\mathcal{I}$ encodes radiance over image patches. This paradigm replaces explicit simulation with a direct, globally informed, token-based mapping learned entirely from data.
Key components:
- Input tokens $\mathcal{T} = \{t_i\}$: Each $t_i$ encodes per-vertex normals, GGX BRDF reflectance parameters, and emission for one triangle.
- Output tokens $\mathcal{I} = \{p_j\}$: Each $p_j$ summarizes radiance for an $8 \times 8$-pixel patch, representing 64 rays.
- Ray-bundle tokens $\mathcal{V} = \{v_j\}$: Each $v_j$ encodes directions for the 64 rays of its corresponding patch.
- Sequence-to-sequence architecture: The mapping is realized by composing a view-independent transformer with a view-dependent transformer.
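The token interfaces above can be sketched with placeholder shapes; the latent width, triangle count, and random contents below are assumptions for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical shapes for the sequence-to-sequence formulation:
# triangle tokens form the scene sequence, and each ray-bundle token
# packs the 64 ray directions of one 8x8-pixel patch.
n_tris, d_model = 1024, 768              # d_model is an assumed latent width
res, patch = 512, 8                      # 512x512 image, 8x8-pixel patches

rng = np.random.default_rng(0)
triangle_tokens = rng.normal(size=(n_tris, d_model))        # sequence of triangle tokens
n_patches = (res // patch) ** 2                             # 64x64 = 4096 patches
ray_dirs = rng.normal(size=(n_patches, patch * patch, 3))   # 64 ray directions per patch
ray_bundle_tokens = ray_dirs.reshape(n_patches, -1)         # 64 rays x 3 = 192-D per token
```

The 192-D width of each ray-bundle token falls directly out of concatenating 64 three-dimensional ray directions.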
2. View-Independent Transformer Stage
This stage models all triangle-to-triangle light transport in a scene, producing latent per-triangle features that represent global illumination independently of camera view.
Embedding and Encoding
- Triangle tokens: Each triangle token is constructed as the sum of two components:
- Vertex normals: Per-vertex normals are NeRF-style positionally encoded (6 frequencies), linearly projected, and RMS normalized.
- Material/emission: A 10-D vector containing diffuse and specular albedo, roughness, and emission, linearly projected and RMS normalized.
- Relative positional encoding: A 3D rotary positional encoding (3D-RoPE) is applied to the three vertex positions of each triangle, using 6 frequencies. This produces a set of $54$ angles which, sine/cosine embedded, effect a block-diagonal rotation on the first $108$ dimensions of every $128$-D attention head, ensuring translation covariance and spatial awareness across attention layers.
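The 3D-RoPE bookkeeping above can be checked with a small sketch (illustrative only; the geometric frequency spacing is an assumption): 3 vertices × 3 coordinates × 6 frequencies gives 54 rotation angles, whose sine/cosine pairs fill the first 108 of each 128-D head.

```python
import numpy as np

def rope_angles(vertices, n_freqs=6):
    """vertices: (3, 3) array of one triangle's vertex positions.
    Returns the 54 rotation angles (3 vertices x 3 coords x 6 freqs)."""
    freqs = 2.0 ** np.arange(n_freqs)                  # assumed geometric spacing
    return (vertices[..., None] * freqs).reshape(-1)   # (3, 3, 6) -> 54 angles

tri = np.random.default_rng(1).normal(size=(3, 3))
angles = rope_angles(tri)
rotary = np.concatenate([np.sin(angles), np.cos(angles)])  # 108 rotated dims per head
```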
- Architecture:
- Input sequence: the triangle tokens plus $16$ global register tokens.
- 12 transformer layers, each implementing RMS-Norm, 6-head multi-head self-attention (QK-Norm), residual connections, and SwiGLU feed-forward layers.
- Self-attention as light transport:
- For the triangle-token matrix $T^{(l)}$ at layer $l$:

$$T^{(l+1)} = T^{(l)} + \mathrm{softmax}\!\left(\frac{Q^{(l)} K^{(l)\top}}{\sqrt{d}}\right) V^{(l)},$$

with $Q^{(l)}, K^{(l)}, V^{(l)}$ linear projections of $T^{(l)}$ and $d$ the per-head dimension.
- The self-attention weights serve as a learned model of inter-triangle light contribution.
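A single head of this triangle-to-triangle attention can be sketched in numpy (a minimal sketch, assuming random placeholder projections and a small head size):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def triangle_self_attention(tokens, Wq, Wk, Wv):
    """One attention head over triangle tokens; the softmax matrix plays the
    role of a learned triangle-to-triangle light-contribution matrix."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_tris, n_tris)
    return tokens + weights @ v, weights                # residual update

rng = np.random.default_rng(2)
toks = rng.normal(size=(32, 16))                        # 32 triangles, toy width
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out, w = triangle_self_attention(toks, Wq, Wk, Wv)
```

Each row of `w` sums to one, so every triangle's update is a convex combination of contributions from all other triangles, which is what makes the interpretation as learned light transport natural.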
3. View-Dependent Transformer Stage
This stage synthesizes the image from the perspective of a given camera, using cross-attention from ray-bundle tokens to the globally encoded triangle features.
Ray-Bundle Tokenization and Processing
Tokenization: For each patch $j$, $64$ camera rays are bundled; their directions are concatenated into a $192$-D vector, linearly projected, and RMS normalized, yielding the ray-bundle token $v_j$.
Architecture:
- 6 transformer layers, each with RMS-Norm, cross-attention (queries from ray-bundle tokens, keys/values from the triangle features), self-attention, SwiGLU feed-forward, and residual connections.
- The final 4 layers' outputs are input to a compact dense vision transformer that decodes a $768$-D bundle token per patch.
- A linear projection maps each bundle token to log-encoded HDR radiance for its $64$ rays.
- Cross-attention as view-dependent transport:
- For ray-bundle token $v_j^{(l)}$ at layer $l$, with keys $K$ and values $V$ projected from the triangle features:

$$v_j^{(l+1)} = v_j^{(l)} + \mathrm{softmax}\!\left(\frac{q_j^{(l)} K^{\top}}{\sqrt{d}}\right) V,$$

where $q_j^{(l)}$ is a linear projection of $v_j^{(l)}$.
- This aggregates view-dependent information from the full set of triangles.
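The cross-attention step can be sketched analogously to the view-independent stage (a toy numpy sketch with assumed shapes and random placeholder projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ray_bundle_cross_attention(ray_tokens, tri_feats, Wq, Wk, Wv):
    """Each ray-bundle token queries the full triangle sequence; the
    attention row for a patch weights every triangle's contribution."""
    q = ray_tokens @ Wq
    k, v = tri_feats @ Wk, tri_feats @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_patches, n_tris)
    return ray_tokens + weights @ v                     # residual update

rng = np.random.default_rng(3)
rays = rng.normal(size=(8, 16))        # 8 patches, toy width
tris = rng.normal(size=(32, 16))       # 32 globally encoded triangle features
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = ray_bundle_cross_attention(rays, tris, Wq, Wk, Wv)
```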
4. Training Protocol and Loss Definitions
Training is fully supervised and conducted entirely on synthetic data with randomized layouts, materials, and lighting.
Losses:
- $L_1$ pixel loss on log-transformed HDR radiance.
- Perceptual LPIPS loss on tone-mapped images.
- Total objective: a weighted sum of the pixel and LPIPS terms.
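The pixel term can be sketched as follows (an assumption: `log1p` and the $L_1$ norm stand in for the paper's exact transform; the LPIPS term is omitted since it requires a pretrained network):

```python
import numpy as np

def log_hdr(x):
    """Compress HDR radiance into log space before comparison."""
    return np.log1p(x)

def pixel_loss(pred_hdr, gt_hdr):
    """Mean absolute error between log-encoded HDR images."""
    return np.abs(log_hdr(pred_hdr) - log_hdr(gt_hdr)).mean()

gt = np.full((8, 8), 2.0)        # constant HDR target for illustration
zero_pred = np.zeros((8, 8))     # an all-black prediction
```

Operating in log space keeps very bright emitters from dominating the loss relative to dimly lit regions.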
- Optimization setup:
- Optimizer: AdamW, batch size 128 across 8 A100 GPUs, decoupled weight decay.
- Learning rate: linear warm-up over $8$k steps, then cosine decay.
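The warm-up-then-cosine schedule can be sketched as below; the peak learning rate is a placeholder, since its value is not stated here:

```python
import math

def lr_schedule(step, peak_lr=1e-3, warmup=8_000, total=500_000):
    """Linear warm-up to peak_lr over `warmup` steps, then cosine decay
    to zero at `total` steps (peak_lr is an assumed placeholder)."""
    if step < warmup:
        return peak_lr * step / warmup                  # linear warm-up
    t = (step - warmup) / (total - warmup)              # progress in [0, 1]
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))
```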
- Resolution and scheduling:
- Pre-training: $500$k iterations at reduced resolution and triangle budget.
- Fine-tuning: $512\times512$ pixels, up to $4096$ triangles, $100$k iterations.
- Data diversity:
- 2 million synthetic scenes, 8 million renders per target resolution.
- Training scenes constructed from 1–3 Objaverse objects, remeshed, and arranged in one of four layout templates.
- Materials employ random albedo/roughness per object/triangle, parametrized by GGX BRDF.
- 1–8 triangle emitters per scene with randomized intensity, placed outside the geometry.
- Cameras are always placed outside the mesh, with variable field of view and distances scaled relative to the scene bounding box.
- On-the-fly scene rotation via RoMa for rotational generalization.
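A scene sampler following the recipe above might look like the sketch below. Counts stated in the text are kept; elided specifics (triangle budget, emitter intensity range) are deliberately left out rather than guessed:

```python
import random

def sample_scene(rng, n_layouts=4):
    """Draw one randomized training-scene configuration (illustrative;
    not the paper's data-generation code)."""
    return {
        "layout": rng.randrange(n_layouts),   # one of four layout templates
        "n_objects": rng.randint(1, 3),       # 1-3 Objaverse objects
        "n_emitters": rng.randint(1, 8),      # 1-8 triangle emitters
    }

scene = sample_scene(random.Random(0))
```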
5. Experimental Results and Generalization
RenderFormer’s performance is evaluated both quantitatively and qualitatively, and its scalability and failure modes are systematically analyzed.
| Metric | Score on 512² |
|---|---|
| PSNR | 29.8 dB |
| SSIM | 0.952 |
| LPIPS | 0.055 |
| HDR-FLIP | 0.175 |
- Comparative performance: On held-out test scenes at $512\times512$ with up to $4096$ triangles, RenderFormer achieves 29.8 dB PSNR, 0.952 SSIM, 0.055 LPIPS, and 0.175 HDR-FLIP. Compared to Cycles path tracing (4096 spp plus denoising), RenderFormer renders roughly 50× faster (0.06 s vs 3–4 s per frame).
- Qualitative effects: Realizes sub-triangle shadows, soft/hard shadowing, diffuse interreflection, indirect lighting, glossy/specular reflections (up to 2–3 bounces on average), multiple emitters, and consistent shading under camera movement.
- Generalization:
- Handles up to 6000 triangles at inference (with graceful loss of detail).
- Robust to field of view and camera distances outside the training distribution, as long as camera is exterior to the mesh.
- Performance degrades gracefully with >8 light sources or sources inside the mesh; compositing renders for separate light conditions mitigates this.
- Early extension to textured BRDFs via per-triangle texture rasterization yielded blurred, yet plausible, outputs.
- Failure modes: Issues arise with excessive scene complexity, internal light sources, or significant deviations from training distributions, but failures are typically nondestructive.
6. Context and Implications
RenderFormer demonstrates that transformer-based, sequence-to-sequence frameworks can “solve” the rendering equation over triangle meshes with global illumination, solely via learned inter-object attention and ray-triangle aggregation. By eschewing per-scene adaptation and explicit physical simulation, RenderFormer can efficiently synthesize accurate, globally lit images for arbitrary scenes in a single pass, providing a substantial speedup over Monte Carlo renderers at comparable fidelity (Zeng et al., 28 May 2025).
A plausible implication is that transformer-only neural rendering pipelines represent a viable alternative to both classic rendering and earlier neural field/implicit-representation approaches. Framing rendering as two-stage attention—view-independent triangle transport, followed by pixel-level view aggregation—facilitates modeling effects previously considered out of reach for general neural renderers.
Open questions include further scaling (scene and triangle count), extending to textured materials, handling highly complex emissive configurations, and integrating with dynamic or temporal scene content. The approach’s independence from per-scene training strongly suggests applicability in content-creation, simulation, and interactive graphics workflows.