
RenderFormer: Transformer-Based Neural Rendering

Updated 20 February 2026
  • The paper introduces a dual-stage transformer architecture that maps tokenized scene data to radiance patches, bypassing recursive light transport calculations.
  • It employs a view-independent stage for modeling global light transport and a view-dependent stage for efficient image synthesis from ray-bundle tokens.
  • Experimental results demonstrate competitive photorealism and a 50× speedup over Monte Carlo path-tracing, highlighting its potential for interactive graphics.

RenderFormer is a fully neural, transformer-based rendering framework that generates photorealistic images with full global illumination directly from triangle mesh scene descriptions. Distinct from traditional physical simulation-based renderers, RenderFormer does not require explicit per-scene training or fine-tuning, nor does it rely on recursive light transport calculations. Instead, it formulates image synthesis as a sequence-to-sequence mapping, transforming a tokenized scene representation into a tokenized image representation via a dual-stage transformer architecture. RenderFormer achieves rapid and high-fidelity rendering of complex scenes, generalizes across novel objects and lighting conditions, and demonstrates performance competitive with Monte Carlo path-tracing approaches at a fraction of their computational cost (Zeng et al., 28 May 2025).

1. Foundations and Problem Formulation

The classical rendering equation,

L_o(x, \omega_o) = L_e(x, \omega_o) + \int_\Omega f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (\omega_i \cdot n)\, d\omega_i,

recursively expresses light transport through scene geometry, materials, and light sources. RenderFormer circumvents this process, learning a mapping

f(T; R) = \{p_k\}_{k=1}^M, \qquad p_k = g(r_k, h(T)),

where $T = \{t_i\}_{i=1}^N$ is a sequence of triangle tokens encoding scene geometry and appearance, $R = \{r_k\}_{k=1}^M$ is a sequence of ray-bundle tokens parameterizing the camera view, and $P = \{p_k\}_{k=1}^M$ encodes radiance over image patches. This paradigm replaces explicit simulation with a direct, globally informed, token-based mapping learned entirely from data.

Key components:

  • Input tokens $T = \{t_i\}$: Each $t_i$ encodes per-vertex normals, GGX BRDF reflectance parameters, and emission.
  • Output tokens $P = \{p_k\}$: Each $p_k$ summarizes radiance for an $8\times8$-pixel patch, representing 64 rays.
  • Ray-bundle tokens $R = \{r_k\}$: Each $r_k$ encodes directions for the 64 rays of its corresponding patch.
  • Sequence-to-sequence architecture: The mapping is realized by composing a view-independent transformer $h: T \rightarrow \hat{T}$ and a view-dependent transformer $g: (R, \hat{T}) \rightarrow P$.
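The token shapes above can be sketched with placeholder arrays. This is an illustrative NumPy sketch, not the paper's implementation: the helpers `h` and `g` are stand-ins and the projection is random; only the dimensions (768-D tokens, $8\times8$ patches, 64 rays per bundle) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1024                   # triangles in the scene
H = W = 256                # image resolution
M = (H // 8) * (W // 8)    # one ray-bundle token per 8x8 patch
D = 768                    # token width used throughout the paper

# Triangle tokens T: one 768-D feature per triangle (geometry + material).
T = rng.standard_normal((N, D)).astype(np.float32)

# Ray-bundle tokens R: 64 ray directions (3-D each) per patch, flattened
# to 192-D and linearly projected to 768-D (projection here is random).
ray_dirs = rng.standard_normal((M, 64, 3)).astype(np.float32)
W_proj = (rng.standard_normal((192, D)) / np.sqrt(192)).astype(np.float32)
R = ray_dirs.reshape(M, 192) @ W_proj

# Stand-ins for the two stages: h (view-independent), g (view-dependent).
h = lambda T: T                                                # placeholder
g = lambda R, T_hat: rng.standard_normal((R.shape[0], 64, 3))  # placeholder

P = g(R, h(T))   # per-patch radiance: one 8x8 RGB patch per bundle token
assert P.shape == (M, 64, 3)
```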

2. View-Independent Transformer Stage

This stage models all triangle-to-triangle light transport in a scene, producing latent per-triangle features that represent global illumination independently of camera view.

Embedding and Encoding

  • Triangle tokens: Each $t_i$ is constructed as the sum of two components:
    • Vertex normals: Per-vertex normals are NeRF-style positional encoded (6 frequencies), linearly projected to $\mathbb{R}^{768}$, and RMS normalized.
    • Material/emission: A 10-D vector containing diffuse and specular albedo, roughness, and emission, linearly projected and RMS normalized.
  • Relative positional encoding: A 3D rotary positional encoding (3D-RoPE) is applied to all three vertex positions per triangle, using 6 frequencies. This produces a $54$-angle set, sine/cosine embedded, effecting a block-diagonal rotation on the first $108$ dimensions of every $128$-D attention head to ensure translation covariance and spatial awareness across attention layers.
  • Architecture:
    • $N$ triangle tokens + $16$ global register tokens.
    • 12 transformer layers, each implementing: RMS-Norm, 6-head multi-head self-attention (QK-Norm), residual connections, and SwiGLU feed-forward layers ($768 \rightarrow 3072 \rightarrow 768$).
  • Self-attention as light transport:
    • For triangles $i, j$ at layer $\ell$:

    Q_i^{(\ell)} = W_Q t_i^{(\ell)}, \quad K_j^{(\ell)} = W_K t_j^{(\ell)}, \quad V_j^{(\ell)} = W_V t_j^{(\ell)}, \quad \alpha_{ij}^{(\ell)} = \mathrm{softmax}_j\!\left(\frac{Q_i^{(\ell)} (K_j^{(\ell)})^T}{\sqrt{d_h}}\right), \quad t_i^{(\ell+1)} = t_i^{(\ell)} + \sum_j \alpha_{ij}^{(\ell)} V_j^{(\ell)}

    with $d_h = 128$. The self-attention weights $\alpha_{ij}^{(\ell)}$ serve as a learned model of inter-triangle light contribution.
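The attention update above can be sketched as plain single-head NumPy; the weights here are random stand-ins, and the real model additionally applies RMS-Norm, QK-Norm, six heads, and 3D-RoPE.

```python
import numpy as np

def self_attention(t, W_Q, W_K, W_V):
    """One attention update over triangle tokens t of shape (N, d_h).

    alpha[i, j] acts as a learned triangle-to-triangle transport weight:
    token i accumulates features from every other triangle j.
    """
    Q, K, V = t @ W_Q, t @ W_K, t @ W_V
    logits = Q @ K.T / np.sqrt(K.shape[-1])             # (N, N)
    alpha = np.exp(logits - logits.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)          # softmax over j
    return t + alpha @ V                                # residual update

rng = np.random.default_rng(0)
N, d_h = 8, 128
t = rng.standard_normal((N, d_h))
W_Q, W_K, W_V = (rng.standard_normal((d_h, d_h)) / np.sqrt(d_h) for _ in range(3))
out = self_attention(t, W_Q, W_K, W_V)
assert out.shape == (N, d_h)
```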

3. View-Dependent Transformer Stage

This stage synthesizes the image from the perspective of a given camera, using cross-attention from ray-bundle tokens to the globally encoded triangle features.

Ray-Bundle Tokenization and Processing

  • Tokenization: For each $8\times8$ patch $k$, $64$ camera rays are bundled; their directions are concatenated into a $192$-D vector, linearly projected to $\mathbb{R}^{768}$ and RMS normalized, yielding $r_k$.

  • Architecture:

    • 6 transformer layers, each with RMS-Norm, cross-attention (queries $R$, keys/values $\hat{T}$), self-attention, SwiGLU feed-forward, and residuals.
    • The final 4 layers' outputs are input to a compact dense vision transformer that decodes a $768$-D bundle token per patch.
    • A linear projection maps each bundle token to log-encoded HDR radiance for its $64$ rays.
  • Cross-attention as view-dependent transport:
    • For each $r_k$ and $\hat{t}_i$ at layer $\ell$:

    Q_k^{(\ell)} = W_Q^r r_k^{(\ell)}, \quad K_i^{(\ell)} = W_K^t \hat{t}_i^{(\ell)}, \quad V_i^{(\ell)} = W_V^t \hat{t}_i^{(\ell)}, \quad \beta_{ki}^{(\ell)} = \mathrm{softmax}_i\!\left(\frac{Q_k^{(\ell)} (K_i^{(\ell)})^T}{\sqrt{d_h}}\right), \quad r_k^{(\ell+1)} = r_k^{(\ell)} + \sum_i \beta_{ki}^{(\ell)} V_i^{(\ell)}

    This aggregates view-dependent information from the full set of triangles.
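A minimal single-head version of this cross-attention step, with random stand-in weights and without the surrounding normalization, self-attention, and feed-forward layers, might look like:

```python
import numpy as np

def cross_attention(r, t_hat, W_Qr, W_Kt, W_Vt):
    """Ray-bundle tokens r (M, d) attend to encoded triangles t_hat (N, d).

    beta[k, i] weighs how much triangle i contributes to patch k's
    view-dependent radiance features.
    """
    Q = t_hat @ W_Kt if False else r @ W_Qr        # queries from ray bundles (M, d)
    K = t_hat @ W_Kt                               # keys from triangles    (N, d)
    V = t_hat @ W_Vt                               # values from triangles  (N, d)
    logits = Q @ K.T / np.sqrt(K.shape[-1])        # (M, N)
    beta = np.exp(logits - logits.max(axis=-1, keepdims=True))
    beta /= beta.sum(axis=-1, keepdims=True)       # softmax over triangles i
    return r + beta @ V                            # residual aggregation

rng = np.random.default_rng(0)
M, N, d = 16, 32, 128
r = rng.standard_normal((M, d))
t_hat = rng.standard_normal((N, d))
W_Qr, W_Kt, W_Vt = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = cross_attention(r, t_hat, W_Qr, W_Kt, W_Vt)
assert out.shape == (M, d)
```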

4. Training Protocol and Loss Definitions

Training is fully supervised and conducted entirely on synthetic data with randomized layouts, materials, and lighting.

  • Losses:

    • $L_1$ pixel loss on log-transformed HDR: $L_{L1} = \|\log(I_{pred}+1) - \log(I_{gt}+1)\|_1$
    • Perceptual LPIPS loss (on tone-mapped images): $L_{LPIPS}$
    • Total objective: $\mathcal{L} = L_{L1} + 0.05\, L_{LPIPS}$
  • Optimization setup:
    • Optimizer: AdamW, batch size 128 across 8 A100 GPUs, decoupled weight decay.
    • Learning rate: warm-up to $1\times 10^{-4}$ over $8$k steps, then cosine decay.
  • Resolution and scheduling:
    • Pre-training: $256^2$ pixels, $\leq 1536$ triangles, $500$k iterations.
    • Fine-tuning: $512^2$ pixels, $\leq 4096$ triangles, $100$k iterations.
  • Data diversity:
    • 2 million synthetic scenes, 8 million renders per target resolution.
    • Training scenes constructed from 1–3 Objaverse objects, remeshed to $\leq 4096$ triangles, in four layout templates.
    • Materials employ random albedo/roughness per object/triangle, parametrized by the GGX BRDF.
    • 1–8 triangle emitters per scene, intensity in $[2.5\textrm{k}, 5\textrm{k}]$, placed outside geometry.
    • Cameras are always outside the mesh, with variable FOV $[30^\circ, 60^\circ]$ and distances $[1.5, 2.0]$ times the bounding box.
    • On-the-fly scene rotation via RoMa for rotational generalization.
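The pixel-loss term above can be sketched directly. `log_l1_loss` is a hypothetical helper name, the mean reduction is an assumption (the text writes only an $L_1$ norm), and the LPIPS term is stubbed out since it requires a pretrained perceptual network.

```python
import numpy as np

def log_l1_loss(pred_hdr, gt_hdr):
    """L1 on log-transformed HDR radiance: |log(I_pred+1) - log(I_gt+1)|."""
    return np.abs(np.log1p(pred_hdr) - np.log1p(gt_hdr)).mean()

rng = np.random.default_rng(0)
pred = rng.uniform(0.0, 100.0, (64, 64, 3))   # toy HDR images
gt = rng.uniform(0.0, 100.0, (64, 64, 3))

pixel_loss = log_l1_loss(pred, gt)
lpips_loss = 0.0  # stub: real LPIPS needs a pretrained network
total = pixel_loss + 0.05 * lpips_loss        # weighting from the paper
assert total > 0.0 and log_l1_loss(gt, gt) == 0.0
```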

5. Experimental Results and Generalization

RenderFormer’s performance is evaluated both quantitatively and qualitatively, and its scalability and failure modes are systematically analyzed.

Metric      Score on 512²
PSNR        29.8 dB
SSIM        0.952
LPIPS       0.055
HDR-FLIP    0.175
  • Comparative performance: On held-out test scenes at $512^2$ with up to 4096 triangles, RenderFormer achieves $\approx$29.8 dB PSNR, $\approx$0.952 SSIM, $\approx$0.055 LPIPS, and $\approx$0.175 HDR-FLIP. Compared to Cycles path-tracing (4096 spp plus denoising), RenderFormer renders $\approx$50× faster ($\approx$0.06 s vs 3–4 s per frame).
  • Qualitative effects: Realizes sub-triangle shadows, soft/hard shadowing, diffuse interreflection, indirect lighting, glossy/specular reflections (up to 2–3 bounces on average), multiple emitters, and consistent shading under camera movement.
  • Generalization:
    • Handles up to $\approx 6000$ triangles at inference (with graceful loss of detail).
    • Robust to field of view and camera distances outside the training distribution, as long as camera is exterior to the mesh.
    • Performance degrades gracefully with >8 light sources or sources inside the mesh; compositing renders for separate light conditions mitigates this.
    • Early extension to textured BRDFs via per-triangle texture rasterization yielded blurred, yet plausible, outputs.
  • Failure modes: Issues arise with excessive scene complexity, internal light sources, or significant deviations from training distributions, but failures are typically nondestructive.
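The light-compositing workaround mentioned above relies on light transport being linear in emitter intensity: render once per emitter (or per small group) and sum the HDR results. A sketch with a stand-in `render_single_light`:

```python
import numpy as np

rng = np.random.default_rng(0)

def render_single_light(light_id):
    """Stand-in for one forward pass with only emitter `light_id` active."""
    return rng.uniform(0.0, 1.0, (64, 64, 3))  # fake HDR frame

# 12 emitters exceed the comfortable >8-source range, so render each
# separately and sum; superposition holds because transport is linear.
per_light = [render_single_light(i) for i in range(12)]
composite = np.sum(per_light, axis=0)
assert composite.shape == (64, 64, 3) and np.all(composite >= 0.0)
```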

6. Context and Implications

RenderFormer demonstrates that transformer-based, sequence-to-sequence frameworks can “solve” the rendering equation over triangle meshes with global illumination, solely via learned inter-object attention and ray-triangle aggregation. By eschewing per-scene adaptation and explicit physical simulation, RenderFormer efficiently synthesizes accurate, globally lit images for arbitrary scenes in a single pass, providing a substantial speedup over Monte Carlo renderers at comparable fidelity (Zeng et al., 28 May 2025).

A plausible implication is that transformer-only neural rendering pipelines represent a viable alternative to both classic rendering and earlier neural field/implicit-representation approaches. Framing rendering as two-stage attention—view-independent triangle transport, followed by pixel-level view aggregation—facilitates modeling effects previously considered out of reach for general neural renderers.

Open questions include further scaling (scene and triangle count), extending to textured materials, handling highly complex emissive configurations, and integrating with dynamic or temporal scene content. The approach’s independence from per-scene training strongly suggests applicability in content-creation, simulation, and interactive graphics workflows.
