tttLRM: Test-Time Training 3D Reconstruction

Updated 4 July 2026
  • tttLRM is a 3D reconstruction model that uses test-time training to update a fixed-size fast-weight memory, enabling linear complexity for long input sequences.
  • It decodes an implicit latent representation into explicit 3D outputs such as Gaussian splats and triplane grids, supporting high-resolution novel view synthesis.
  • The architecture integrates LaCT blocks, window attention, and distributed training to efficiently process multi-view inputs and support both feedforward and streaming reconstruction.

tttLRM, short for Test-Time Training Large Reconstruction Model, is a large 3D reconstruction model that uses test-time training (TTT) as its core sequence-modeling and memory mechanism. It enables long-context, autoregressive 3D reconstruction with explicit outputs, primarily 3D Gaussian Splatting, while keeping computation linear in the number of views (Wang et al., 23 Feb 2026; arXiv [2602.20160](/papers/2602.20160)). It is designed for settings in which many posed RGB images of a scene or object must be compressed into a fixed-size latent state and then decoded into explicit 3D structure suitable for high-resolution novel view synthesis, feedforward reconstruction, or streaming reconstruction and refinement from incoming observations. In this formulation, the model’s fast weights function as an implicit 3D representation in latent space, and virtual queries decode that latent into explicit formats such as Gaussian splats or a triplane NeRF (Wang et al., 23 Feb 2026).

1. Definition, scope, and stated contributions

tttLRM addresses a specific bottleneck in large reconstruction models for 3D: existing systems such as LRM, GS-LRM, and Long-LRM are described as being bottlenecked by quadratic attention and/or fixed small numbers of input views. The proposed replacement is a TTT/LaCT fast-weight layer that stores all multi-view information in a fixed-size, learned fast-weight memory, updates that memory at test time via gradient descent, and can be queried by virtual tokens to decode different explicit 3D formats (Wang et al., 23 Feb 2026).

The model is framed around three operating requirements that are all explicit in the source description: support for high-resolution novel view synthesis up to $1024^2$, ingestion of long sequences of views up to 64–128 views and more than 1M tokens, and operation in both feedforward mode and a streaming / autoregressive mode where images arrive over time (Wang et al., 23 Feb 2026). The explicit outputs are primarily 3D Gaussian splats, though the same latent can also be decoded into a triplane grid for NeRF-like rendering.

The stated contributions are organized around seven points. First, tttLRM is presented as a TTT-based long-context LRM in which fast weights serve as the scene memory, enabling long-context 3D reconstruction with linear complexity in the number of tokens or views. Second, it applies LaCT (Large Chunk Test-Time Training) to 3D as the core sequence module, using large token chunks for good GPU utilization and fixed-size memory compression. Third, it introduces an implicit-to-explicit decoding view in which fast weights represent an implicit latent 3D state that can be decoded into 3D Gaussian splats or triplane features. Fourth, it provides an autoregressive / streaming causal variant that updates fast weights as views arrive and progressively refines Gaussian splats. Fifth, it includes a distributed / sequence-parallel training and inference scheme that shards the sequence dimension across multiple GPUs. Sixth, it uses pretraining on novel view synthesis through TTT-LVSM, then fine-tunes for 3D reconstruction. Seventh, it reports state-of-the-art feedforward 3DGS reconstruction on DL3DV-10K scenes and Objaverse/GSO objects, outperforming Long-LRM and GS-LRM on PSNR, SSIM, and LPIPS while remaining much faster than optimization-based baselines (Wang et al., 23 Feb 2026).

A common misconception is that “test-time training” implies updating the entire network during inference. In tttLRM, that is not the case: slow weights are learned during training and are frozen at test time, while only the internal fast weights are updated per scene (Wang et al., 23 Feb 2026).

2. Architecture, tokenization, and explicit 3D decoding

The inputs are a set of posed RGB images and corresponding ray embeddings,

$$\{I_i \in \mathbb{R}^{H\times W \times 3}\}_{i=1}^N,\qquad \{R_i \in \mathbb{R}^{H\times W \times 9}\}_{i=1}^N,$$

where the ray embeddings encode position and direction per pixel. The main output is a set of 3D Gaussian primitives representing the scene, with an optional alternative output of a triplane grid for NeRF-style rendering (Wang et al., 23 Feb 2026).

Each image is patchified and tokenized according to

$$\{T_{i,j}\}_{i=1}^{N}{}_{j=1}^{HW/p^2} = \text{Tokenize}\big(\text{Patchify}([\{I_i\}_{i=1}^{N}, \{R_i\}_{i=1}^{N}])\big).$$

The model concatenates $I_i$ and $R_i$ along channels, patchifies using size $p \times p$, and linearly projects patches to tokens $T_{i,j} \in \mathbb{R}^d$. The typical setting is $p=8$ or $16$, with hidden dimension $d=768$ (Wang et al., 23 Feb 2026).
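The shapes implied by this tokenization can be checked with a little arithmetic. The sketch below is illustrative only: the function name `token_shapes` is hypothetical, and it assumes that the image height and width are divisible by the patch size and that each patch is flattened across its 3 RGB and 9 ray-embedding channels before the linear projection.

```python
def token_shapes(height, width, patch, num_views, rgb_ch=3, ray_ch=9):
    """Return (tokens_per_image, total_tokens, patch_vector_len)."""
    if height % patch or width % patch:
        raise ValueError("this sketch assumes H and W are divisible by p")
    tokens_per_image = (height // patch) * (width // patch)  # HW / p^2
    total_tokens = num_views * tokens_per_image              # N * HW / p^2
    patch_vector_len = patch * patch * (rgb_ch + ray_ch)     # flattened p x p x 12 patch
    return tokens_per_image, total_tokens, patch_vector_len


# Example: 32 views at 512x512 with p = 8.
per_image, total, patch_len = token_shapes(512, 512, 8, 32)
```

In this example each image contributes 4096 tokens, and each flattened patch has 768 entries before it is projected to the hidden dimension.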

The backbone is a stack of 24 LaCT blocks. Each block combines a window attention module, used locally within each image to capture within-view spatial relationships, with a TTT / LaCT fast-weight layer. In simplified form, omitting feed-forward MLPs, each layer performs

$$T \leftarrow \mathrm{WinAttn}(T),$$

followed by

$$W \leftarrow \mathrm{Update}(W, T), \qquad T \leftarrow \mathrm{Apply}(W, T).$$

Here, Update uses the LaCT rule, treating token projections as key–value pairs, computing gradients of an MSE loss, and updating the fast weights $W$ with a Muon optimizer; Apply uses the current fast weights to map tokens in a manner analogous to attention with a compressed KV cache. Both steps are implemented with linear scaling in the number of tokens (Wang et al., 23 Feb 2026).

A distinctive architectural feature is the use of virtual tokens. After processing input tokens and updating the fast weights $W$, tttLRM does not simply attach 3D heads to the input token stream. Instead, it introduces query tokens $T^{v}$ that are only applied, not used to update $W$:

$$T^{v} \leftarrow \mathrm{Apply}(W, T^{v}).$$

For 3DGS, these virtual tokens are virtual views with known cameras, often aligned with or subsampled from actual views. For triplane NeRF, they are learnable triplane feature tokens not necessarily tied to images. The queried virtual tokens are then passed to a token decoder that produces explicit 3D representation parameters (Wang et al., 23 Feb 2026).

For Gaussian decoding, each virtual-view patch token predicts Gaussian attributes including RGB color $c$, scale $s$ or anisotropic covariance parameters, rotation parameters, opacity $\alpha$, and a per-pixel depth $z$. Given ray origin $\mathbf{o}$ and direction $\mathbf{d}$, the Gaussian center is

$$\mu = \mathbf{o} + z\,\mathbf{d}.$$

Each Gaussian can therefore be written as

$$G = (\mu, \Sigma, c, \alpha),$$

where $\Sigma$ is obtained from scale and rotation. The union of Gaussians from all virtual views forms the scene representation, which is rendered by a standard 3DGS rasterizer (Wang et al., 23 Feb 2026).
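The decoding step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the rotation is parameterized as a quaternion in (w, x, y, z) order and that the covariance follows the usual 3DGS construction, covariance = R diag(scale)^2 R^T. All function names are hypothetical.

```python
import math


def gaussian_center(origin, direction, depth):
    """Center = ray origin + depth * ray direction (component-wise)."""
    return [o + depth * d for o, d in zip(origin, direction)]


def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix after normalizing it."""
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(scale, quat):
    """Covariance = R diag(scale)^2 R^T (assumed 3DGS-style parameterization)."""
    rot = quat_to_rotation(quat)
    return [
        [sum(rot[i][k] * scale[k] ** 2 * rot[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


# A point 2.5 units along the +z ray from the origin, with an axis-aligned Gaussian.
center = gaussian_center([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], 2.5)
sigma = covariance([1.0, 2.0, 3.0], (1.0, 0.0, 0.0, 0.0))
```

With the identity quaternion the covariance is diagonal, with the squared scales on the diagonal; a non-trivial rotation redistributes those variances across axes.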

This architecture makes a sharp distinction between latent memory and explicit geometry. The fast weights are the latent scene memory; the Gaussian or triplane outputs are decoded products of that memory rather than the memory itself. This suggests that the framework is intended as a generic latent-3D-to-explicit-3D decoder family rather than a single-format predictor.

3. Test-time training as fast-weight scene memory

In tttLRM, test-time training refers to optimization of the fast weights $W$ inside each LaCT layer while keeping the rest of the network fixed. The fast weights are initialized to a learned default from training and then updated at test time using gradients from a TTT objective defined over token-derived key–value pairs (Wang et al., 23 Feb 2026). Conceptually, for a chunk of tokens $\{T_i\}$ projected to keys $k_i$ and values $v_i$, LaCT defines a parametric function $f_W$ and minimizes

$$\mathcal{L}(W) = \sum_i \big\| f_W(k_i) - v_i \big\|_2^2,$$

with the update

$$W \leftarrow W - \eta\,\nabla_W \mathcal{L}(W),$$

where $\eta$ is a step size. The updates are performed in large chunks, stated to reach the million-token scale, for efficiency (Wang et al., 23 Feb 2026).
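To make the update-then-apply pattern concrete, here is a minimal sketch of one test-time step on a chunk of key–value pairs. It makes two simplifying assumptions that are not taken from the paper: the fast-weight function is a single linear map $f_W(k) = Wk$, and the step is plain gradient descent rather than the Muon optimizer. The function names are hypothetical.

```python
def apply_fast_weight(W, k):
    """f_W(k) = W k for a linear fast weight W given as a list of rows."""
    return [sum(w_ij * k_j for w_ij, k_j in zip(row, k)) for row in W]


def mse_loss(W, keys, values):
    """Sum over the chunk of ||f_W(k_i) - v_i||^2."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(apply_fast_weight(W, k), v))
        for k, v in zip(keys, values)
    )


def update_fast_weight(W, keys, values, lr):
    """One gradient-descent step on the chunk loss: W <- W - lr * dL/dW."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for k, v in zip(keys, values):
        residual = [p - t for p, t in zip(apply_fast_weight(W, k), v)]
        for i, r in enumerate(residual):
            for j, k_j in enumerate(k):
                grad[i][j] += 2.0 * r * k_j  # derivative of (W k - v)_i^2 w.r.t. W_ij
    return [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, grad)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 3.0]]
W0 = [[0.0, 0.0], [0.0, 0.0]]
W1 = update_fast_weight(W0, keys, values, lr=0.25)
```

After the step, the chunk loss is lower than before, and the memory footprint is unchanged: the whole chunk has been absorbed into a matrix whose size does not depend on the number of tokens.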

The scene-level training objective does not require explicit 3D supervision. The primary reconstruction loss is a rendering loss,

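$$\mathcal{L}_{\text{render}} = \mathcal{L}_{\text{MSE}} + \lambda_{\text{perc}}\,\mathcal{L}_{\text{perc}},$$

where the perceptual term uses VGG-19 features. For scenes, the training loss also includes a scale-invariant depth loss between Gaussian positions along the depth axis and pseudo-ground-truth depth from a monocular depth estimator, as well as opacity regularization to encourage sparsity in Gaussians:

$$\mathcal{L}_{\text{opacity}} = \frac{1}{M}\sum_{i=1}^{M} |\alpha_i|,$$

with $\alpha_i$ the opacity of the $i$-th of $M$ Gaussians. The total loss is

$$\mathcal{L} = \mathcal{L}_{\text{render}} + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}} + \lambda_{\text{opacity}}\,\mathcal{L}_{\text{opacity}}.$$

At training time these losses update the slow weights; at test time Gaussian generation is feedforward given the fast weights $W$, unless extra post-optimization is deliberately added (Wang et al., 23 Feb 2026).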

The linear-complexity claim follows from the LaCT formulation. For sequence length $L$ and token dimension $d$, each LaCT block has complexity

$$O(L\,d^2),$$

which is linear in $L$. This contrasts with conventional attention at

$$O(L^2\,d),$$

where each token attends to all others. In the fast-weight formulation, the entire key–value set is compressed into a fixed-size parameter matrix $W$, each token contributes a constant amount of gradient information, and no pairwise token–token interactions are formed (Wang et al., 23 Feb 2026).

This yields an interpretation of sequence modeling as online regression of values from keys, learned at test time. In the 3D setting, the consequence is that a fixed-size learned memory can summarize many posed observations without quadratic growth in cost.
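The scaling difference can be illustrated with simple cost proxies. The sketch below assumes the usual asymptotic forms, proportional to (tokens × dim²) for a fast-weight layer and (tokens² × dim) for full attention, and ignores constant factors, so it indicates trends rather than measured runtimes. The function names are hypothetical.

```python
def lact_cost(num_tokens, dim):
    """Proxy for a fast-weight layer: each token interacts with a dim x dim memory."""
    return num_tokens * dim * dim


def attention_cost(num_tokens, dim):
    """Proxy for full attention: every token attends to every other token."""
    return num_tokens * num_tokens * dim


def cost_ratio(num_tokens, dim):
    """attention / fast-weight proxy ratio, which simplifies to num_tokens / dim."""
    return attention_cost(num_tokens, dim) / lact_cost(num_tokens, dim)


# 32 views at 512x512 with patch size 8 gives 131072 tokens; hidden dimension 768.
ratio_32_views = cost_ratio(131072, 768)
ratio_64_views = cost_ratio(262144, 768)
```

Under these proxies the ratio grows in proportion to the token count: doubling the number of views doubles the fast-weight cost but quadruples the attention cost.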

4. Long-context and autoregressive reconstruction

In this framework, autoregressive means causal in the sequence of input views. The fast-weight state $W_t$ at time step $t$ depends only on views $1, \dots, t$, and as new views arrive the model updates $W_t$ and immediately produces updated Gaussians for query or virtual views (Wang et al., 23 Feb 2026). The recurrence is

$$W_t = \mathrm{Update}(W_{t-1}, T_t),$$

with $T_t$ the tokens of the views arriving at step $t$, and the decoded 3D state is a function of $W_t$.

The paper’s simplified algorithm initializes $W_0$. For each mini-batch of input views $B_t$, it first updates the fast weights,

$$W_t = \mathrm{Update}(W_{t-1}, B_t),$$

and then predicts Gaussians for query views $Q_t$,

$$G_t = \mathrm{Decode}\big(\mathrm{Apply}(W_t, Q_t)\big).$$

The Gaussian set at the final batch, $G_T$, is the current scene reconstruction (Wang et al., 23 Feb 2026).
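The control flow of this simplified algorithm can be sketched as a small driver loop. This is an assumption-laden illustration: `stream_reconstruct`, `update`, and `decode` are hypothetical names, the query views are held fixed across steps for brevity, and the toy stand-ins below replace the real fast-weight update and Gaussian decoder with a running mean so that the loop is runnable.

```python
def stream_reconstruct(view_batches, w_init, update, decode, query_views):
    """Causal reconstruction: update the state per batch, then re-decode the scene."""
    w = w_init
    history = []
    for batch in view_batches:
        w = update(w, batch)                # state after this batch depends only on past batches
        gaussians = decode(w, query_views)  # full output regenerated from the current state
        history.append(gaussians)
    return w, history


def toy_update(w, batch):
    """Toy state: a running (count, sum) over scalar 'views'."""
    return (w[0] + len(batch), w[1] + sum(batch))


def toy_decode(w, queries):
    """Toy decoder: every query reads out the running mean."""
    return [w[1] / w[0] for _ in queries]


final_state, history = stream_reconstruct(
    [[1.0, 3.0], [5.0, 7.0]], (0, 0.0), toy_update, toy_decode, ["q0"]
)
```

The state has the same size after every batch, and each entry of `history` is a complete output computed from the current state rather than an extension of the previous output.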

The model supports both feedforward and streaming operation. In feedforward mode, all $N$ views are processed together, either as one sequence or in large chunks, then the latent memory is queried with virtual tokens to decode Gaussians. In streaming mode, the sequence is processed in batches, for example 4 views at a time, with fast weights carrying cumulative information from earlier observations (Wang et al., 23 Feb 2026).

A crucial design point is that the output Gaussian set is regenerated from scratch at each autoregressive stage rather than simply appended to. This allows the model to correct earlier errors. The source description contrasts this with a cheaper Predict&Merge strategy, where only new Gaussians are predicted and then merged with previous ones. That strategy cannot correct old Gaussian errors and is reported to yield worse metrics than full regeneration.

The stated qualitative progression is: with only 4 views, reconstruction is coarse but consistent; at 8 views, coverage and fidelity improve; at 32 views, quality approaches full long-context feedforward reconstruction (Wang et al., 23 Feb 2026).

The paper also notes a capacity limit. Because the fast-weight memory $W$ has a fixed size, very long sequences or highly complex scenes eventually degrade performance, especially in outdoor or high-frequency scenes. To mitigate drift, the appendix explores selective fast-weight updates using an EMA of squared gradients as an approximate Fisher information estimate and an elastic regularization toward an EMA anchor of the fast weights. This produces a slight autoregressive improvement, reported as PSNR 24.81 $\to$ 24.95 (Wang et al., 23 Feb 2026).

A common misunderstanding is to equate “autoregressive” with token-by-token generation of images or geometry. Here it specifically denotes causal state updates over incoming views, with the fast weights $W_t$ acting as the recurrent state.

5. Pretraining, transfer, and empirical performance

The backbone architecture is stated to be identical to TTT-LVSM, a LaCT-based large view synthesis model. Pretraining uses a novel view synthesis task in which multi-view images are used to predict unseen views via the same fast-weight latent representation. For tttLRM, the backbone parameters are initialized from TTT-LVSM, while the Gaussian or triplane token decoder and, if needed, virtual token definitions are added or modified for reconstruction (Wang et al., 23 Feb 2026).

The reported effect of this initialization is both faster convergence and better final quality. For the Gaussian 3DGS decoder, training without pretraining gives PSNR = 32.77, SSIM = 0.969, LPIPS = 0.026, while with pretraining it gives PSNR = 33.14, SSIM = 0.972, LPIPS = 0.024. For the triplane NeRF decoder, the corresponding improvement is from PSNR = 26.40, SSIM = 0.903, LPIPS = 0.093 to PSNR = 27.87, SSIM = 0.925, LPIPS = 0.075 (Wang et al., 23 Feb 2026). The interpretation given in the source is that the TTT backbone learns a strong implicit view-consistent 3D prior during NVS pretraining.

The reported benchmarks cover both objects and scenes. On GSO, trained on Objaverse, at 512×512 with 8 input views, GS-LRM gives PSNR 32.83, SSIM 0.969, LPIPS 0.029, Time 0.7s, while tttLRM gives PSNR 34.02, SSIM 0.974, LPIPS 0.025, Time 0.3s (Wang et al., 23 Feb 2026). At 512×512 with 16 views, GS-LRM gives PSNR 33.55, SSIM 0.976, LPIPS 0.023, Time 2.5s, while tttLRM (10 virtual) gives PSNR 34.67, SSIM 0.978, LPIPS 0.022, Time 0.8s. At 24 views, the description states that tttLRM similarly outperforms GS-LRM in both quality and speed. It is also reported to scale to 1024×1024 resolution, where GS-LRM runs out of memory, and to produce high-fidelity single-view reconstructions when combined with a multi-view diffusion generator (Wang et al., 23 Feb 2026).

On DL3DV-140 with 32 views, the reported numbers are: 3DGS (30k iters) at PSNR 23.60, SSIM 0.779, LPIPS 0.213, 13 min; Scaffold-GS (30k) at 24.77, 0.805, 0.205, 16 min; Long-LRM (feedforward) at 24.10, 0.783, 0.254, 1.0 s; Long-LRM + 10-step optim at 25.60, 0.826, 0.233, 37 s; tttLRM (feedforward) at 25.07, 0.822, 0.215, 7.2 s; and tttLRM + 10-step optim at 26.37, 0.854, 0.201, 42 s (Wang et al., 23 Feb 2026). On DL3DV-140 with 64 views, Long-LRM (64v) gives 24.63, 0.799, 0.243; 3DGS (30k) gives 26.55, 0.852, 0.164; tttLRM (feedforward) gives 25.95, 0.844, 0.195; and tttLRM + 10-step optim gives 27.65, 0.880, 0.177. The source states that Tanks&Temples shows similar trends, with tttLRM better than Long-LRM and competitive with, and sometimes better than, optimization-based methods while being 100× faster (Wang et al., 23 Feb 2026).

These results situate tttLRM between feedforward explicit predictors and per-scene optimization pipelines. It preserves explicit 3DGS outputs and real-time rendering while narrowing the quality gap to optimization-heavy systems.

6. Scalability, relation to prior work, and limitations

The scalability argument is tied to both algorithmic complexity and distributed implementation. Letting $N$ denote the number of images, $H \times W$ the resolution, $p$ the patch size, and $L = N \cdot HW/p^2$ the total number of tokens, each LaCT block has complexity $O(L\,d^2)$, so cost is linear in both sequence length and image resolution. The appendix further reports that even 3 layers of standard attention become slower than 24 LaCT layers once the token count exceeds about 2M, corresponding to approximately 256 views at 540×960 (Wang et al., 23 Feb 2026).
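As a quick consistency check on the quoted figures, the token count for 256 views at 540×960 can be worked out directly. The calculation assumes a patch size of $p = 8$, one of the two typical settings mentioned earlier; the source does not state which setting applies to this comparison. Writing $L$ for the total number of tokens and $N$ for the number of views:

$$L = N \cdot \frac{H\,W}{p^2} = 256 \cdot \frac{540 \cdot 960}{8^2} = 256 \cdot 8100 = 2{,}073{,}600 \approx 2\text{M}.$$

With $p = 16$ the same input would give 518,400 tokens, so the 2M figure is consistent with $p = 8$.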

The distributed scheme shards tokens along the sequence dimension across GPUs. Each GPU receives a subset of image tokens; fast-weight updates $\Delta W$ are synchronized across GPUs via DDP; each GPU predicts Gaussians for its own virtual views; Gaussians are gathered into the global scene; and each GPU renders some target views and computes losses. This is stated to enable training on more than 1M tokens and scaling to 128 views (Wang et al., 23 Feb 2026).
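One reason sequence sharding fits a fast-weight update is that the gradient of a summed per-token loss is itself a sum over tokens, so per-shard gradients can be combined with a single reduction. The sketch below demonstrates only this additivity, under simplifying assumptions that are not from the paper: a vector-valued fast weight with scalar targets, strided sharding, and a plain sum standing in for an all-reduce. How the paper combines this with the Muon update is not shown. The function names are hypothetical.

```python
def chunk_gradient(w, keys, values):
    """Gradient of sum_i (w . k_i - v_i)^2 with respect to the weight vector w."""
    grad = [0.0] * len(w)
    for k, v in zip(keys, values):
        residual = sum(w_j * k_j for w_j, k_j in zip(w, k)) - v
        for j, k_j in enumerate(k):
            grad[j] += 2.0 * residual * k_j
    return grad


def sharded_gradient(w, keys, values, num_shards):
    """Split tokens across shards, compute local gradients, then sum them."""
    total = [0.0] * len(w)
    for s in range(num_shards):
        local = chunk_gradient(w, keys[s::num_shards], values[s::num_shards])
        total = [t + g for t, g in zip(total, local)]  # stands in for an all-reduce
    return total


keys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
values = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.5]
full = chunk_gradient(w, keys, values)
sharded = sharded_gradient(w, keys, values, num_shards=2)
```

Here `full` and `sharded` agree up to floating-point rounding, so every shard can apply the same update to its copy of the fast weights after the reduction.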

In relation to prior work, the paper positions tttLRM at the intersection of test-time training, NeRF, 3DGS, and long-context 3D reconstruction. Prior TTT has been used for classification, LLMs, and pointcloud 3D through systems such as TTT3R and Test3R, but not for explicit photorealistic NVS-grade 3D representations. NeRF provides implicit fields but typically requires slow per-scene optimization; 3DGS is explicit and faster to render but still commonly optimized per scene; GS-LRM and related models provide feedforward GS prediction from few views but remain limited by short sequences and quadratic attention. Long-LRM, Gamba/MvGamba, and Stream3R address long sequences with attention, SSMs, or causal transformers, but tttLRM instead uses fast-weight models that turn long-sequence modeling into online regression with linear complexity (Wang et al., 23 Feb 2026).

The stated limitations are equally specific. First, fixed memory capacity means that extremely complex scenes or very long sequences can degrade performance. Second, there is a slight quality loss relative to the best purely implicit NVS models such as LVSM, reflecting a trade-off between pure NVS quality and explicit, real-time-renderable 3D output. Third, the system has substantial implementation complexity, requiring integration of TTT/LaCT and sequence-parallel multi-GPU training. Fourth, autoregressive drift and forgetting remain present for very long streams, even though selective update helps (Wang et al., 23 Feb 2026).

Taken together, these points place tttLRM as a unified framework in which the same TTT-based backbone and fast-weight memory can support different explicit decoders, including GS and triplane, and plausibly others by changing the virtual tokens and decoder head. The central conceptual move is to treat fast weights not merely as an adaptation mechanism but as the latent 3D scene state from which explicit geometry can be generated.
