4RC: Unified 4D Reconstruction Framework

Updated 4 July 2026

4RC is a unified feed-forward 4D reconstruction framework that encodes a monocular video into a compact spatio-temporal latent space for querying 3D geometry and motion.
Its conditional decoder effectively retrieves dense scene structure and motion dynamics at any queried spatial location or time using a transformer-based approach.
Reported experiments indicate that 4RC outperforms prior methods on tasks such as dense tracking, camera pose estimation, and video depth estimation, setting a new state of the art among feed-forward 4D reconstruction methods.

4RC is a unified feed-forward framework for 4D reconstruction from monocular videos. It is defined by an encode-once, query-anywhere and anytime paradigm in which a transformer backbone encodes an entire video into a compact spatio-temporal latent space, and a conditional decoder queries 3D geometry and motion for any query frame at any target timestamp. In contrast to approaches that decouple motion from geometry or restrict output to sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics (Luo et al., 10 Feb 2026).

1. Problem setting and representational objective

The input to 4RC is a monocular video $\mathcal{V}=\{I_i\}_{i=1}^N$ with timestamps $t_i\in\{1,\dots,N\}$. The central objective is to learn a single compact spatio-temporal latent that can be queried for geometry or motion at any spatial location or time, rather than recomputing a representation for each frame pair or each target time (Luo et al., 10 Feb 2026).

The encoder $\mathcal{E}$ produces a latent

$\mathcal{F}=\mathcal{E}(\mathcal{V}),$

and the conditional decoder, writing $Z\equiv\mathcal{F}$ for this latent, is defined as

$Q(Z,x,t)\to [P(x),\Delta P(x,t)],$

where $P(x)$ is the 3D point at the canonical time of the query frame and $\Delta P(x,t)$ is its displacement to time $t$. This formulation makes geometry and motion jointly queryable from a shared video-level representation rather than from separate pipelines for static reconstruction and motion estimation (Luo et al., 10 Feb 2026).

A defining design choice is the minimally factorized representation of per-view 4D attributes. Instead of regressing a full 3D map for each query pair $(i,t)$, 4RC decomposes the output into base geometry and time-dependent relative motion. This suggests that the framework is structured to separate persistent scene structure from temporal variation while preserving a single end-to-end feed-forward model.

2. Spatio-temporal latent and 4D attribute decomposition

The latent is denoted

$Z\equiv \mathcal{F}\in\mathbb{R}^{N\times (M+D+C)},$

with, for each frame, a set of patch tokens, one camera token, and one time token (Luo et al., 10 Feb 2026).

From this latent, 4RC defines a factorized 4D representation. In generic query notation, its components are the base geometry $P(x)$, the relative motion $\Delta P(x,t)$, and a combined 4D attribute

$P(x,t)=P(x)+\Delta P(x,t),$

where $+$ denotes vector addition in world-coordinate space (Luo et al., 10 Feb 2026).

This decomposition is not merely notational. The paper explicitly frames it as a choice to avoid regressing a full 3D map for each query pair $(i,t)$. A plausible implication is that the model can allocate representational capacity to geometry once and reuse it across times, while learning motion as a conditional residual on top of that geometry.
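The reuse of base geometry across target times can be illustrated with a small sketch. The function names and the constant-velocity toy motion are hypothetical, chosen only to make the arithmetic concrete; nothing here is taken from the 4RC implementation.

```python
# Toy illustration of the factorized 4D attribute: P(x, t) = P(x) + dP(x, t).
# All names and the constant-velocity motion model are hypothetical.

def base_geometry(x):
    """Stand-in for P(x): a 3D point for pixel x at the query frame's own time."""
    u, v = x
    return (float(u), float(v), 2.0)

def relative_motion(x, t, t_query=0):
    """Stand-in for dP(x, t): displacement from the query time to time t."""
    velocity = (0.5, 0.0, -0.1)  # made-up per-timestep motion
    return tuple(c * (t - t_query) for c in velocity)

def point_at_time(base, delta):
    """Combine base geometry and relative motion by vector addition."""
    return tuple(b + d for b, d in zip(base, delta))

x = (3, 4)
base = base_geometry(x)  # computed once for the query pixel
# The same base point is reused for every target time; only the motion changes.
trajectory = {t: point_at_time(base, relative_motion(x, t)) for t in (0, 1, 2)}
print(trajectory)
```

At the query frame's own time the displacement is zero, so the combined attribute coincides with the base geometry; later times differ only by the motion residual.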

3. Encoder and conditional decoder architecture

The encoder $\mathcal{E}$ is a ViT backbone with 40 layers and DINOv2 pre-training. Each frame is patchified into patch tokens, and a learnable camera token and time token are appended. The architecture interleaves per-frame (spatial) self-attention with global cross-frame (temporal) self-attention to produce the final latent $\mathcal{F}$ (Luo et al., 10 Feb 2026).
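The difference between the two attention types can be sketched in terms of which tokens are visible to each other. The token counts, labels, and helper names below are illustrative assumptions; the sketch shows only the visibility pattern, not the attention computation or the actual 4RC layer ordering.

```python
# Toy token layout: each frame contributes patch tokens plus one camera token
# and one time token. Token counts and names are illustrative assumptions.

def build_tokens(num_frames, patches_per_frame):
    """Return a flat list of (frame_index, kind) labels, one per token."""
    tokens = []
    for i in range(num_frames):
        tokens += [(i, "patch")] * patches_per_frame
        tokens += [(i, "camera"), (i, "time")]
    return tokens

def attention_mask(tokens, mode):
    """Boolean mask[q][k]: may token q attend to token k?

    'frame'  -> per-frame (spatial) attention: same frame only.
    'global' -> cross-frame attention: every token sees every token.
    """
    if mode not in ("frame", "global"):
        raise ValueError(f"unknown mode: {mode}")
    return [[mode == "global" or q[0] == k[0] for k in tokens] for q in tokens]

tokens = build_tokens(num_frames=3, patches_per_frame=4)
frame_mask = attention_mask(tokens, "frame")
global_mask = attention_mask(tokens, "global")

print(len(tokens))          # 3 frames * (4 patches + 2 extra tokens)
print(sum(frame_mask[0]))   # tokens visible to token 0 under per-frame attention
print(sum(global_mask[0]))  # tokens visible to token 0 under global attention
```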

The geometry decoder operates on the query frame’s tokens from the latent and outputs a depth map, ray directions, and camera intrinsics/extrinsics. It is implemented as dual-DPT upsampling plus a small MLP head (Luo et al., 10 Feb 2026).
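One common way to turn a depth value and a ray direction into a 3D point is to scale the ray by its depth and map the result to world coordinates with the camera pose. The sketch below assumes depth is measured along the ray and that the extrinsics are a camera-to-world rotation and translation; the source does not state 4RC's conventions, so both are assumptions, as are all function names.

```python
# Back-projection sketch: point = R @ (depth * ray) + t.
# Assumes depth along the ray and camera-to-world extrinsics (R, t).

def matvec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return tuple(sum(R[r][c] * v[c] for c in range(3)) for r in range(3))

def backproject(depth, ray, R, t):
    """Lift one pixel to a world-space 3D point."""
    cam_point = tuple(depth * c for c in ray)      # point in the camera frame
    rotated = matvec(R, cam_point)                 # rotate into world axes
    return tuple(a + b for a, b in zip(rotated, t))  # add the camera position

# Camera rotated 90 degrees about the z-axis and shifted along x.
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
t = (1.0, 0.0, 0.0)

on_axis = backproject(depth=2.0, ray=(0.0, 0.0, 1.0), R=R, t=t)
off_axis = backproject(depth=1.0, ray=(1.0, 0.0, 0.0), R=R, t=t)
print(on_axis)   # a point 2 units along the optical axis, in world space
print(off_axis)  # a point along the camera x-axis, rotated onto world y
```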

The motion decoder is a lightweight 4-layer transformer with alternating self-attention and cross-attention. Given the query frame tokens and a target time token, it cross-attends into the video latent $\mathcal{F}$. Time-conditioning is injected via AdaLN in each attention block, and the decoder outputs a 3D displacement field, the relative motion $\Delta P(x,t)$ in the query notation (Luo et al., 10 Feb 2026).
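Adaptive layer normalization (AdaLN) conditions a block by predicting a scale and a shift from the conditioning signal and applying them after a parameter-free layer norm. The sketch below uses the common form $\mathrm{LN}(x)\cdot(1+\gamma)+\beta$. How 4RC derives $\gamma$ and $\beta$ from the target time is not specified here, so the linear map `time_to_modulation` is a hypothetical stand-in.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def time_to_modulation(t, dim):
    """Hypothetical conditioning network: maps a scalar time to (scale, shift)."""
    scale = [0.1 * t] * dim
    shift = [0.01 * t * (j + 1) for j in range(dim)]
    return scale, shift

def adaln(x, t):
    """AdaLN in the common form LN(x) * (1 + scale) + shift."""
    scale, shift = time_to_modulation(t, len(x))
    normed = layer_norm(x)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

features = [1.0, 2.0, 3.0, 4.0]
out_t0 = adaln(features, t=0.0)  # with t = 0 this reduces to a plain layer norm
out_t5 = adaln(features, t=5.0)  # same features, modulated for a later target time
print(out_t0)
print(out_t5)
```

The same query features yield different activations for different target times, which is the mechanism by which a single decoder can serve arbitrary temporal queries.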

Taken together, these components instantiate the paper’s encode-once, query-anywhere and anytime formulation: a single video encoder produces a reusable latent, and two specialized conditional heads recover geometry and motion under arbitrary temporal queries.

4. Training objective and inference procedure

4RC is trained end-to-end with an uncertainty-weighted loss supervising both geometry and motion. The loss consists of four groups of terms: depth plus smoothness, ray consistency, camera parameters, and motion plus temporal smoothness. The depth term applies an aleatoric-uncertainty loss to the predicted versus ground-truth depth and to their spatial gradients; the motion term similarly applies it to the predicted versus ground-truth displacement and to their temporal gradients. The term weights are either absorbed into the definition of the uncertainty loss or tuned as hyperparameters (Luo et al., 10 Feb 2026).
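A widely used form of aleatoric-uncertainty loss scales each residual by a predicted confidence and adds a log penalty so that the model cannot inflate its uncertainty for free. The sketch below applies that form to values and to their finite-difference gradients, mirroring the value-plus-gradient structure described above. The functional form, the log-uncertainty parameterization, the reuse of per-sample uncertainties for the gradient term, and the weight `grad_weight` are all assumptions, not the paper's definition.

```python
import math

def uncertainty_l1(pred, target, log_sigma):
    """Mean of |pred - target| * exp(-s) + s, with s the predicted log-uncertainty."""
    terms = [abs(p - g) * math.exp(-s) + s
             for p, g, s in zip(pred, target, log_sigma)]
    return sum(terms) / len(terms)

def finite_diff(values):
    """Forward differences along a 1D signal (a scanline or a time series)."""
    return [b - a for a, b in zip(values, values[1:])]

def value_and_gradient_loss(pred, target, log_sigma, grad_weight=1.0):
    """Loss on the values plus the same loss on their finite-difference gradients."""
    value_term = uncertainty_l1(pred, target, log_sigma)
    # Assumption: each difference reuses the uncertainty of its right-hand sample.
    grad_term = uncertainty_l1(finite_diff(pred), finite_diff(target), log_sigma[1:])
    return value_term + grad_weight * grad_term

pred_depth = [1.0, 1.2, 1.9, 2.0]
true_depth = [1.0, 1.0, 2.0, 2.0]
confident = [0.0, 0.0, 0.0, 0.0]  # log-uncertainty 0 everywhere
uncertain = [0.0, 1.0, 0.0, 0.0]  # higher uncertainty on the second sample

loss_confident = value_and_gradient_loss(pred_depth, true_depth, confident)
loss_uncertain = value_and_gradient_loss(pred_depth, true_depth, uncertain)
loss_perfect = value_and_gradient_loss(true_depth, true_depth, confident)
print(loss_confident, loss_uncertain, loss_perfect)
```

Applied along the spatial axis this corresponds to the depth-plus-smoothness term; applied along the time axis it corresponds to the motion-plus-temporal-smoothness term.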

At test time, the procedure is explicitly staged. First, the encoder is run once on all $N$ frames to obtain the latent $\mathcal{F}$. Second, the static geometry of a query frame $i$ is reconstructed by applying the geometry decoder to that frame’s tokens. Third, motion from frame $i$ to any timestamp $t$ is queried by feeding the query frame tokens and the target time token into the motion decoder. Fourth, the 3D point map at time $t$ is formed as

$P(x,t)=P(x)+\Delta P(x,t).$

The paper emphasizes that this allows dense or sparse queries at arbitrary spatio-temporal locations without re-running the full transformer (Luo et al., 10 Feb 2026).
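The staged procedure can be sketched as a control-flow skeleton. Every function body below is a toy placeholder with hypothetical names and made-up arithmetic; only the call pattern (one encoder pass, many decoder queries) reflects the description above.

```python
# Encode-once, query-many control flow with toy stand-ins for the three modules.
# Every function body here is a placeholder; only the call pattern is the point.

calls = {"encoder": 0}

def encode(video):
    """Stand-in encoder: runs once over all frames and returns a 'latent'."""
    calls["encoder"] += 1
    return {"frames": list(video)}

def decode_geometry(latent, frame_idx):
    """Stand-in geometry decoder: a single 3D point per frame, for brevity."""
    return (float(latent["frames"][frame_idx]), 0.0, 1.0)

def decode_motion(latent, frame_idx, t):
    """Stand-in motion decoder: displacement of the frame's point to time t."""
    return (0.25 * (t - frame_idx), 0.0, 0.0)

def query(latent, frame_idx, t):
    """Point at time t = base geometry of the query frame + relative motion."""
    base = decode_geometry(latent, frame_idx)
    delta = decode_motion(latent, frame_idx, t)
    return tuple(b + d for b, d in zip(base, delta))

video = [10, 11, 12, 13]   # four placeholder 'frames'
latent = encode(video)     # step 1: a single encoder pass
# Steps 2-4, repeated for several target times without re-encoding.
points = [query(latent, frame_idx=0, t=t) for t in range(len(video))]
print(points)
print(calls["encoder"])
```

The encoder counter stays at one regardless of how many spatio-temporal queries are issued, which is the amortization the staged procedure relies on.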

This inference profile is central to the framework’s identity. Rather than treating 4D reconstruction as repeated pairwise estimation, 4RC amortizes video understanding into a single latent and exposes geometry and motion through conditional decoding.

5. Empirical evaluation across reconstruction tasks

The reported experiments cover dense 3D tracking, sparse point tracking, camera pose estimation, and video depth. All results are reported after global Sim(3) or scale alignment as required. 4RC is reported to outperform prior and concurrent methods across these tasks, setting a new state of the art among feed-forward 4D reconstruction methods while matching or approaching the best static 3D methods on geometry and pose (Luo et al., 10 Feb 2026).

Task	Datasets	Selected reported results
Dense 3D tracking	Kubric, Waymo	Kubric: 85.44 APD, 1.022 EPE vs V-DPM 71.12 APD, 2.849 EPE; Waymo: 56.63 APD, 1.611 EPE vs 41.44 APD, 1.948 EPE
Sparse point tracking	PO, DR, ADT, PStudio	PO: 85.86 APD vs 83.36 by V-DPM, EPE 0.250 vs 0.196; ADT: 87.82 APD vs 80.80, EPE 0.148 vs 0.236
Camera pose and reconstruction	TUM-Dynamics, ScanNet, 7-Scenes, NRGBD	TUM-Dynamics: ATE 0.010 m, RPE = 0.008; 7-Scenes: Acc = 0.034 m vs 0.044 by Pi3; NRGBD: NC = 0.912
Video depth	Bonn, Sintel	Bonn: Rel = 0.051; Sintel: Rel = 0.311

The benchmarking protocol uses task-appropriate metrics: APD and EPE for dense and sparse tracking; ATE, translational and rotational RPE, Accuracy, Completeness, and Normal Consistency for pose and reconstruction; and Rel error and $\delta$ threshold accuracy for video depth. The breadth of the evaluation is notable because it treats 4RC not as a single-task estimator but as a general 4D reconstruction representation with multiple downstream query modalities.
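Three of these metrics have simple standard definitions, sketched below on toy data. The threshold of 1.25 is the conventional default for depth threshold accuracy and is assumed here rather than taken from the source. The Sim(3) or scale alignment step is omitted, and APD is not shown because its distance thresholds are not stated above.

```python
import math

def epe(pred_points, true_points):
    """End-point error: mean Euclidean distance between matched 3D points."""
    dists = [math.dist(p, g) for p, g in zip(pred_points, true_points)]
    return sum(dists) / len(dists)

def abs_rel(pred_depth, true_depth):
    """Absolute relative depth error: mean of |pred - true| / true."""
    errs = [abs(p - g) / g for p, g in zip(pred_depth, true_depth)]
    return sum(errs) / len(errs)

def delta_accuracy(pred_depth, true_depth, threshold=1.25):
    """Fraction of pixels whose depth ratio max(p/g, g/p) is below the threshold."""
    hits = [max(p / g, g / p) < threshold for p, g in zip(pred_depth, true_depth)]
    return sum(hits) / len(hits)

pred_pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
true_pts = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
epe_value = epe(pred_pts, true_pts)

pred_d = [1.0, 2.0, 4.0, 3.0]
true_d = [1.0, 2.0, 2.0, 4.0]
rel_value = abs_rel(pred_d, true_d)
delta_value = delta_accuracy(pred_d, true_d)
print(epe_value, rel_value, delta_value)
```

Lower is better for EPE and Rel, while higher is better for the threshold accuracy.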

6. Scope, misconceptions, and terminological ambiguity

Within the monocular-video literature, 4RC denotes “4D Reconstruction via Conditional Querying Anytime and Anywhere” (Luo et al., 10 Feb 2026). A common misunderstanding would be to view it as a specialized tracker or scene-flow estimator. The paper directly rejects that framing by contrasting 4RC with methods that decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow (Luo et al., 10 Feb 2026).

The acronym also appears in adjacent but distinct literatures. In the automotive-perception report “RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding,” the phrase “prior 4RC methods” refers to 4D radar-camera fusion methods for 3D object detection rather than monocular-video 4D reconstruction (Xiong et al., 20 May 2026). Similar alphanumeric strings are used for unrelated topics, including the RC4 fault-detection literature on FPGA implementations (Paul et al., 2014) and the $(4,3,3)$ exact-repair regenerating-code rate-region problem (Tian, 2013).

This suggests that disambiguation is necessary when “4RC” is encountered in arXiv-indexed research. In current usage, however, the standalone title-form “4RC” is specifically attached to the monocular-video framework that combines a video-level transformer latent with conditional geometry and motion querying (Luo et al., 10 Feb 2026).