Any4D: Unified 4D Modeling

Updated 17 December 2025
  • Any4D is a modeling framework that enables feed-forward, metric-scale 4D perception, reconstruction, and detection using diverse inputs like images, video, and sensor data.
  • It integrates open-prompt 4D synthesis and single-image 4D generation, leveraging transformers and diffusion models to produce temporally coherent dynamic scenes.
  • The framework uses modular representations and training objectives to achieve improved speed, accuracy, and temporal stability in applications such as robotics, AR/VR, and autonomous driving.

Any4D refers to a set of modeling paradigms and system architectures for general-purpose, feed-forward, metric-scale 4D (space-time) perception, reconstruction, generation, and detection. The unifying vision behind these approaches is to enable direct prediction or synthesis of temporally coherent 3D geometries and dynamics from diverse inputs—images, video, sensors, language prompts—without task-specific hand-tuning or multi-stage optimization. Current Any4D research encompasses dense metric 4D reconstruction, open-prompt 4D synthesis, 4D detection in video, annotation automation, avatar creation, and deformable object generation.

1. Unified Metric 4D Reconstruction

Any4D (Karhade et al., 11 Dec 2025) is a scalable, multi-view transformer framework for feed-forward, metric 4D reconstruction. Its input is a set of $N$ temporally synchronized frames from monocular or multi-sensor streams (RGB, RGB-D, radar Doppler, IMUs). The system produces dense, per-pixel predictions for 3D point coordinates and scene flow in a shared world reference frame, with metric scale. Unlike prior methods limited to 2-view scene flow or sparse point tracking, Any4D achieves temporally coherent, dense, and metrically accurate point maps and 4D scene motions, supporting a wide spectrum of sensors.

The outputs are factorized into egocentric (camera-local) and allocentric (scene-global) factors:

  • Egocentric factors: ray directions and normalized depths per view in local camera coordinates
  • Allocentric factors: camera extrinsics (translation/quaternion) and scene flow in world coordinates

Metric geometry is reconstructed as:

$$\tilde{G}_i(u,v) = \tilde{s}\,[\tilde{T}_i]\big(\tilde{R}_i(u,v)\,\tilde{D}_i(u,v)\big) \in \mathbb{R}^3$$

where $\tilde{s}$ is a learned metric scale, $\tilde{T}_i$ the camera pose, $\tilde{R}_i$ the ray direction, and $\tilde{D}_i$ the depth. The corresponding metric scene flow is $\tilde{s}\,\tilde{F}_i(u,v)$.
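
The factorized composition can be written as a short sketch; the shapes, argument names, and NumPy formulation below are illustrative assumptions, not the paper's API:

```python
import numpy as np

def reconstruct_pointmap(scale, T_wc, ray_dirs, depths):
    """Compose egocentric and allocentric factors into a metric world-frame point map.

    scale    : learned global metric scale (scalar), s~
    T_wc     : (4, 4) camera-to-world pose for view i, T~_i
    ray_dirs : (H, W, 3) per-pixel ray directions in camera coordinates, R~_i
    depths   : (H, W)   per-pixel normalized depths, D~_i
    Returns  : (H, W, 3) metric point map G~_i in the shared world frame
    """
    # Egocentric points in the local camera frame: R~_i(u,v) * D~_i(u,v)
    pts_cam = ray_dirs * depths[..., None]

    # Allocentric transform into world coordinates: [T~_i](.)
    R, t = T_wc[:3, :3], T_wc[:3, 3]
    pts_world = pts_cam @ R.T + t

    # Apply the learned metric scale, as in the formula above
    return scale * pts_world
```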

The core is a vision transformer with modality-agnostic encoders (RGB, depth, auxiliary sensors), alternating attention across all frames, and prediction heads for geometry and motion. The training procedure leverages mixed datasets (with partial or complete modalities) and dynamically reweights dynamic/static regions. Any4D attains 2–3× lower error and up to 15× higher speed than the best baselines on tasks such as sparse 3D tracking, dense scene flow, and video depth estimation, supporting applications in robotics, autonomous driving, and AR/VR (Karhade et al., 11 Dec 2025).
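
A minimal sketch of the alternating-attention pattern over per-frame patch tokens is given below; the module structure, dimensions, and use of torch.nn.MultiheadAttention are assumptions for illustration, not the released architecture:

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Alternate self-attention within each frame with attention across all frames.

    Tokens are arranged as (batch, frames, tokens_per_frame, dim). This is a
    generic sketch of frame-local vs. global (all-frame) attention.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, N, P, D)
        B, N, P, D = x.shape

        # 1) Frame-local attention: each frame's patch tokens attend to each other.
        h = x.reshape(B * N, P, D)
        q = self.norm1(h)
        h = h + self.frame_attn(q, q, q)[0]

        # 2) Global attention: tokens from all frames attend jointly, which is
        #    what ties geometry and motion together across time.
        g = h.reshape(B, N * P, D)
        q = self.norm2(g)
        g = g + self.global_attn(q, q, q)[0]
        return g.reshape(B, N, P, D)
```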

2. Open-Prompt 4D Generation

Any4D-generation architectures enable text/image-conditioned, camera-controlled, feed-forward 4D synthesis (Li et al., 24 Nov 2025). Given a natural-language prompt, an optional reference image (or a single image alone), and a user-specified camera trajectory, such systems generate multi-frame videos with explicit viewpoint control and then reconstruct the full dynamic 3D geometry (a 4D scene):

  • Stage 1: Viewpoint-Conditional Video Generation
    • Text/image cues are encoded into latents via a pretrained VAE.
    • Camera trajectories (intrinsics/extrinsics) are encoded using framewise Plücker embeddings, forming per-frame, multi-scale camera features (see the sketch after this list).
    • A video diffusion model (e.g., CogVideoX) conditions on text/image latents and camera features to synthesize temporally coherent videos under tight viewpoint control.
  • Stage 2: 4D Scene Lifting via 3D Gaussian Splatting
    • Generated video frames are analyzed by a depth estimator and 2D keypoint tracker.
    • Depth, keypoints, and RGB are fused into a persistent, temporally evolving 3D Gaussian representation. Each Gaussian's rigid motion is modeled as a combination of fixed and trainable SE(3) motion bases, with per-Gaussian, per-frame coefficients learned during fitting (a motion-basis sketch appears at the end of this section).
    • All parameters are optimized by minimizing RGB and depth reprojection errors; no adversarial or KL losses are used.
    • The final 4D model allows for high-fidelity, 20fps rendering of arbitrary novel views and non-captured time instants from the generated trajectory.
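
A minimal sketch of per-pixel Plücker ray embeddings for Stage 1's camera conditioning follows; the intrinsics/extrinsics conventions and variable names are assumptions, and the multi-scale feature construction is not shown:

```python
import torch

def plucker_embedding(K, c2w, H, W):
    """Per-pixel Plücker ray embeddings for one frame (a common construction;
    the paper's exact conventions may differ).

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world extrinsics
    Returns: (H, W, 6) embedding = (ray direction, moment = origin x direction)
    """
    # Pixel grid in homogeneous image coordinates
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)   # (H, W, 3)

    # Back-project to camera-frame ray directions, then rotate to world frame
    dirs_cam = pix @ torch.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # Camera origin in world coordinates, broadcast to every pixel
    origin = c2w[:3, 3].expand(H, W, 3)

    # Plücker coordinates: (direction, moment), moment = origin x direction
    moment = torch.cross(origin, dirs_world, dim=-1)
    return torch.cat([dirs_world, moment], dim=-1)                      # (H, W, 6)
```

Stacking these (H, W, 6) maps per frame yields the framewise camera features that condition the video diffusion model.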

This approach demonstrates flexible, open-prompt 4D generation without multi-view capture or depth sensors, achieving best-in-class PSNR (16.55 dB) and SSIM (0.61) on monocular dynamic scenes. Limitations include reduced performance under extreme viewpoint extrapolation or severe non-rigid deformation (Li et al., 24 Nov 2025).
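
Returning to Stage 2, the per-Gaussian motion model can be sketched as a linear blend of SE(3) motion bases; the tensor shapes, blend form, and names below are assumptions rather than the paper's exact parameterization:

```python
import torch

def blend_se3_bases(points, basis_R, basis_t, coeffs):
    """Move Gaussian centers by a per-Gaussian blend of SE(3) motion bases.

    points  : (G, 3)    Gaussian centers at the canonical time
    basis_R : (B, 3, 3) rotation part of each motion basis at the current frame
    basis_t : (B, 3)    translation part of each motion basis at the current frame
    coeffs  : (G, B)    per-Gaussian blending coefficients for this frame
    Returns : (G, 3)    blended Gaussian centers
    """
    # Apply every basis transform to every point: (B, G, 3)
    moved = torch.einsum("bij,gj->bgi", basis_R, points) + basis_t[:, None, :]

    # Blend per Gaussian with its coefficients: (G, 3)
    return torch.einsum("gb,bgi->gi", coeffs, moved)
```

Per-frame coefficients correspond to supplying a new `coeffs` matrix at each time step while the canonical `points` stay fixed.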

3. Feed-Forward 4D Generation from Single Images

Feed-forward architectures such as 4DNeX enable efficient, single-image-to-4D dynamic scene generation (Chen et al., 18 Aug 2025). 4DNeX constructs a large labeled dataset (4DNeX-10M) and defines a unified 6D video representation jointly modeling appearance (RGB) and geometry (XYZ per pixel across time). Crucial features include:

  • Dataset construction: Static and dynamic videos (from public and synthetic sources) are pseudo-annotated for geometry (DUSt3R, MonST3R, MegaSaM pipelines) and filtered for quality.
  • 6D representation: For each frame, paired appearance and geometry maps $(X_t^{\mathrm{RGB}}, X_t^{\mathrm{XYZ}})$ are produced, with XYZ being global point coordinates.
  • Adaptation of video diffusion models: RGB and XYZ channels are fused width-wise in the latent space to maximally leverage pretrained priors (a fusion sketch follows this list). LoRA adapters tune only a small percentage of parameters.
  • End-to-end pipeline: The system requires only a single RGB input, from which it generates smooth dynamic point clouds and view-consistent video, with optional post-optimization for camera/depth.
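
A minimal sketch of the width-wise RGB/XYZ latent fusion described above, assuming both modalities pass through the same pretrained video VAE; the tensor layout and function names are illustrative:

```python
import torch

def fuse_rgb_xyz_latents(z_rgb, z_xyz):
    """Width-wise fusion of RGB and XYZ latents into one joint latent video.

    z_rgb, z_xyz : (B, T, C, H, W) latents, one encoding appearance and one
                   encoding per-pixel world XYZ maps
    Returns      : (B, T, C, H, 2*W) latent in which the two modalities sit side
                   by side, so a pretrained video diffusion backbone can denoise
                   them jointly with minimal architectural change
    """
    assert z_rgb.shape == z_xyz.shape
    return torch.cat([z_rgb, z_xyz], dim=-1)   # concatenate along the width axis


def split_rgb_xyz_latents(z_joint):
    """Invert the fusion after denoising: split the joint latent back into halves."""
    W2 = z_joint.shape[-1]
    return z_joint[..., : W2 // 2], z_joint[..., W2 // 2:]
```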

Quantitative results show substantial improvements in dynamic realism (Dynamic Degree 58.0% vs 47.4% for Free4D), inference time (15 min vs 60 min), and competitive consistency scores (Chen et al., 18 Aug 2025).

4. Temporally Consistent 4D Object Detection

DetAny4D addresses streaming 4D object detection: assigning globally consistent 3D bounding boxes to objects in sequential RGB video (Hou et al., 24 Nov 2025). The system is built around DA4D, a 280,000+ sequence dataset with 3D boxes, depth, camera pose, and object prompts for diverse indoor/outdoor scenes. DetAny4D fuses features from foundational vision transformers, geometry priors from UniDepth-V2, and prompt encodings for open-vocabulary detection.

Key architectural innovations:

  • Geometry-aware spatiotemporal decoder: Stacks causal attention blocks to enforce temporal unidirectionality, enabling each query to aggregate evidence up to the current time (a mask sketch follows this list).
  • Multi-task heads: Jointly estimate depth, camera intrinsics, 3D box parameters, and pose. A multi-term loss aggregates spatial 2D/3D IoUs, center/size/orientation predictions, and a bespoke spatial+temporal Chamfer loss.
  • Object-query memory: Allows for seamless tracking of new and retiring object instances over arbitrary-length sequences.
  • No explicit smoothing: Temporal stability is enforced purely via causal attention and regularization, sharply reducing prediction jitter.
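
The temporal unidirectionality enforced by the causal attention blocks can be illustrated with a frame-level causal mask; this is a generic construction, not DetAny4D's exact implementation:

```python
import torch

def causal_frame_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask enforcing frame-level temporal causality.

    Entry [i, j] is True where attention is *blocked*: a token belonging to
    frame f_i may attend only to tokens from frames f_j <= f_i.
    """
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # Block attention whenever the query's frame precedes the key's frame
    return frame_id[:, None] < frame_id[None, :]


# Example: usable as attn_mask in torch.nn.MultiheadAttention (True = blocked)
mask = causal_frame_mask(num_frames=4, tokens_per_frame=16)   # (64, 64)
```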

DetAny4D attains higher per-frame AP3D and 10–30% lower temporal variance in box positions, compared to both framewise and multi-stage baselines. The system generalizes to open-set object categories, demonstrating robust performance on both seen and unseen classes (Hou et al., 24 Nov 2025).

5. Specialized Any4D Applications

Several distinct lines of research demonstrate the extensibility of the Any4D paradigm:

  • Automatic 4D Annotation: Auto4D decomposes LiDAR-based 4D annotation into sequential object size and motion path estimation for rigid objects (Yang et al., 2021). This approach leverages aggregation over full point trajectories and achieves a 25% reduction in human annotation effort at high-precision IoU thresholds by sequentially refining geometry and smoothing 4D motion paths.
  • Deformable 4D Generative Modeling: Dictionary-based Neural Fields (DNF) utilize SVD-based dictionaries of neural field atoms to disentangle shape and motion factors in deformable 4D sequences (Zhang et al., 6 Dec 2024). Transformer diffusion models in a shared latent and coefficient space enable unconstrained synthesis of high-fidelity, temporally coherent deformations, with results exceeding previous 4D generative baselines on MMD and coverage (a dictionary sketch follows this list).
  • Open-Domain 4D Avatarization: AvatarArtist employs parametric triplanes (Next3D), StyleGAN-based 4D GANs, and diffusion priors to support cross-domain, 4D avatar synthesis from single portrait images (Liu et al., 25 Mar 2025), achieving state-of-the-art measures on temporal consistency and perceptual identity.
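
The dictionary idea behind DNF can be illustrated, under simplifying assumptions, as a truncated SVD that factors per-sequence parameter vectors into shared atoms and per-sequence coefficients; the actual method couples this with neural fields and diffusion training:

```python
import numpy as np

def build_dictionary(params, rank):
    """Factor per-sequence parameter vectors into shared atoms + coefficients.

    params : (num_sequences, dim) matrix whose rows are, e.g., flattened
             neural-field weights or latent codes fitted per 4D sequence
    rank   : number of dictionary atoms to keep
    Returns (atoms, coeffs) with params ≈ coeffs @ atoms.
    """
    U, S, Vt = np.linalg.svd(params, full_matrices=False)
    atoms = Vt[:rank]                       # (rank, dim) shared dictionary atoms
    coeffs = U[:, :rank] * S[:rank]         # (num_sequences, rank) per-sequence codes
    return atoms, coeffs
```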

6. Training Objectives, Losses, and Representational Strategies

Any4D frameworks exploit modular representations (explicit geometric factors, scene flow, dictionary atoms, triplanes) and modular architectures (transformers with patch/instance tokens, conditional diffusion, NEAT fusion of primal features) to facilitate multi-task learning and fusion of disparate supervision signals. Losses are composed based on the available data:

  • Explicit $\ell_1$ or $\ell_2$ losses for depth, scene flow, and geometry
  • Chamfer losses, 2D/3D IoU, and corner alignment for object detection
  • Coefficient/weight-space regularization and orthogonality in SVD-based networks
  • Hybrid physical and perceptual losses (LPIPS, FID) for generative and avatarization applications

A common theme is strong architectural and training modularity, which makes it feasible to mix data with varying sensor modalities or annotation completeness while maintaining robust, generalizable 4D predictions.
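
A minimal sketch of such modular loss composition, masking out terms whose supervision a given sample lacks; the term names and weights are assumptions, not any specific paper's recipe:

```python
import torch.nn.functional as F

def composite_4d_loss(pred, target, available, weights=None):
    """Compose a training loss only from the supervision that a sample provides.

    pred, target : dicts of tensors with optional keys such as "depth", "flow", "points"
    available    : dict of booleans saying which targets exist for this sample
    weights      : optional per-term weights
    """
    weights = weights or {}
    total = 0.0
    for name in ("depth", "flow", "points"):
        if available.get(name, False):
            # Only penalize modalities for which ground truth is actually present
            total = total + weights.get(name, 1.0) * F.l1_loss(pred[name], target[name])
    return total
```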

7. Limitations, Extensions, and Future Directions

Current limitations across Any4D variants include challenges with extreme viewpoint extrapolation, severe occlusions, rapid non-rigid deformations, and reliance on upstream video/model capacities for texture and semantics. Datasets remain partially synthetic or pseudo-annotated, and some models are monocular-only. Future research directions identified include:

  • Incorporating explicit dynamical models (e.g., kinematic Kalman-style motion priors),
  • End-to-end joint training of generative and reconstruction modules,
  • Integration of richer, non-rigid or semantically-parameterized motion bases,
  • Extension to multi-agent, scene-wide, or open-vocabulary segmentation/annotation.

Overall, Any4D provides a foundation for scalable, unified, and interpretable modeling and understanding of 4D environments across sensing, generative, detection, and annotation domains, with applications to robotics, autonomous driving, XR, digital twins, and digital content synthesis (Karhade et al., 11 Dec 2025, Li et al., 24 Nov 2025, Zhang et al., 6 Dec 2024, Chen et al., 18 Aug 2025, Yang et al., 2021, Hou et al., 24 Nov 2025, Liu et al., 25 Mar 2025).
