RoboBrain 2.5: Depth in Sight, Time in Mind
- The paper introduces an 8B-parameter multimodal foundation model that integrates depth-aware 3D spatial reasoning with dense temporal value estimation.
- The model architecture employs a ViT-based visual encoder and Qwen3-VL transformer to fuse multi-view images with text for precise 3D keypoint prediction.
- Dense temporal estimation yields robust progress signals and collision-aware manipulation traces, enabling reliable closed-loop execution in robotics.
RoboBrain 2.5 is an 8B-parameter multimodal VLM-based embodied foundation model for general perception, spatial reasoning, and temporal modeling, trained on 12.4M curated samples spanning general perception, spatial reasoning from 2D to metric 3D, and temporal prediction. Its defining additions over RoboBrain 2.0 are Precise 3D Spatial Reasoning, described as “Depth in Sight,” and Dense Temporal Value Estimation, described as “Time in Mind.” In place of 2D pixel-relative grounding and sparse success signals, it predicts absolute, depth-aware 3D keypoints and dense task progress or regress values over time, with the stated aim of enabling collision-aware 3D manipulation traces and stable, step-aware feedback for closed-loop execution and reinforcement learning (Tan et al., 20 Jan 2026).
1. Terminological position and lineage
The label “RoboBrain” spans several distinct research programs, and the 2026 model called RoboBrain 2.5 should be distinguished from earlier works that use the same name. The 2014 paper "RoboBrain: Large-Scale Knowledge Engine for Robots" introduced a collaborative, cloud-based, graph-structured knowledge engine for robots and makes no mention of a version called “RoboBrain 2.5” (Saxena et al., 2014). The 2025 manipulation model "RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete" likewise does not define a “RoboBrain 2.5”; in that work, “2.5” refers to the Qwen2.5-7B-Instruct language backbone rather than a RoboBrain version (Ji et al., 28 Feb 2025). Similarly, RoboOS describes RoboBrain-1.5-OS, trained from Qwen2.5-VL-7B, not a 2.5 release (Tan et al., 6 May 2025).
Within this naming landscape, RoboBrain 2.5 specifically denotes the 2026 embodied AI foundation model titled "RoboBrain 2.5: Depth in Sight, Time in Mind." Its scope is narrower and more execution-oriented than the 2014 knowledge engine: it upgrades spatial and temporal grounding for reliable, physically compliant manipulation, with explicit support for depth-aware coordinate prediction, metric constraint comprehension, dense progress estimation, and trace-level feasibility checking (Tan et al., 20 Jan 2026).
2. Model architecture and interfaces
RoboBrain 2.5 uses a Qwen3-VL-family vision-language transformer as its base. A visual encoder consisting of a ViT with adapters produces image tokens, and a language decoder attends over visual tokens and text to produce structured outputs. Temporal context is modeled by feeding sequences of frames, including multi-view frames, as visual tokens; the transformer then aggregates context across time and viewpoints (Tan et al., 20 Jan 2026).
Its primary inputs are monocular RGB images from one or multiple views, together with text instructions encoding task descriptions and spatial constraints. Camera intrinsics are assumed known. For value estimation, the model consumes multi-view tuples of Initial, Goal, BEFORE, and AFTER states sampled along expert or real trajectories. Optional proprioception can be appended as tokens in downstream integration, but the two new capabilities are trained primarily from visual-only evidence (Tan et al., 20 Jan 2026).
The outputs are structured and explicitly tied to spatial and temporal grounding. For manipulation, the model predicts ordered keypoint sequences $\{(u_t, v_t, d_t)\}_{t=1}^{T}$, which define a manipulation trace in decoupled image-plane and depth coordinates. It also produces quantitative measures such as metric distances, clearances, object sizes, and left-to-right ordering in absolute units such as centimeters. For temporal modeling, it predicts hop values and a fused global progress signal, together with progress or regress classification or regression outputs and an auxiliary pairwise temporal ordering objective in Stage-1 (Tan et al., 20 Jan 2026).
3. Precise 3D Spatial Reasoning
The central spatial shift in RoboBrain 2.5 is from 2D pixel-relative grounding to depth-aware 3D prediction. The stated motivation is that pixel-relative outputs cannot ensure physical feasibility, particularly under occlusion, viewpoint shifts, or absolute geometric constraints such as clearances and collision avoidance. RoboBrain 2.5 therefore predicts decoupled coordinates and understands absolute metric constraints, so that predicted traces can be checked for physical realizability rather than interpreted only as image-plane hints (Tan et al., 20 Jan 2026).
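The representation is explicitly tied to camera geometry. Given camera intrinsics $K$, with focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$, and a 3D point $P_c = (X, Y, Z)^\top$ in the camera frame, projection is defined by
$$u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y.$$
Back-projection from image coordinates $(u, v)$ and depth $d$ is given by
$$X = \frac{(u - c_x)\, d}{f_x}, \qquad Y = \frac{(v - c_y)\, d}{f_y}, \qquad Z = d.$$
A rigid transform $(R, t)$ then maps the result into world or robot coordinates,
$$P_w = R\, P_c + t,$$
with homogeneous coordinates used when applying rigid transforms (Tan et al., 20 Jan 2026).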
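The camera geometry above can be illustrated with a minimal Python sketch. It assumes an ideal pinhole camera without lens distortion; the `Intrinsics` class, the function names, and the numeric values are hypothetical and chosen only for illustration, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    fx: float  # focal length in pixels along x
    fy: float  # focal length in pixels along y
    cx: float  # principal point, x
    cy: float  # principal point, y


def project(K, point_cam):
    """Pinhole projection of a camera-frame point (X, Y, Z) to (u, v, d)."""
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (K.fx * X / Z + K.cx, K.fy * Y / Z + K.cy, Z)


def back_project(K, u, v, d):
    """Lift pixel (u, v) with metric depth d to a camera-frame 3D point."""
    return ((u - K.cx) * d / K.fx, (v - K.cy) * d / K.fy, d)


def to_world(R, t, point_cam):
    """Rigid transform p_world = R @ p_cam + t, with R given as 3x3 nested lists."""
    return tuple(
        sum(R[i][j] * point_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


# Illustrative values only (assumed, in pixels and meters).
K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
p_cam = back_project(K, u=380.0, v=240.0, d=0.5)

# Assumed extrinsics: camera rotated 90 degrees about z, offset 1 m along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)
p_world = to_world(R, t, p_cam)
```

Projecting `p_cam` again with `project` recovers the original pixel and depth, which is a convenient sanity check on calibration values before a predicted trace is sent to a planner.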
The model is trained and evaluated against explicit metric constraints. Examples include Euclidean distance constraints
$$\lVert p_i - p_j \rVert_2 = d_{ij}$$
and planar constraints such as hovering $h_{\min}$–$h_{\max}$ cm above a surface with unit normal $n$,
$$h_{\min} \le n^\top (p - p_0) \le h_{\max},$$
where $p_0$ is a point on the surface. Pose, kinematic, and collision feasibility are checked post-prediction. Collision cost along a trace $\tau$ is defined as the sum of signed distances to obstacle meshes, with feasibility requiring distances greater than a threshold $\delta$, and kinematic reachability is assessed by inverse kinematics convergence for each waypoint together with joint-limit constraints (Tan et al., 20 Jan 2026).
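A hover-band and clearance check of this kind might look like the following sketch. It makes simplifying assumptions that are not from the paper: obstacles are approximated by spheres rather than meshes, only waypoints (not the segments between them) are tested, and all names, units (meters), and thresholds are hypothetical.

```python
import math


def hover_height(point, surface_point, normal):
    """Signed height of `point` above the plane through `surface_point`."""
    norm = math.sqrt(sum(c * c for c in normal))
    return sum((p - s) * n for p, s, n in zip(point, surface_point, normal)) / norm


def satisfies_hover(point, surface_point, normal, h_min, h_max):
    """True if the point hovers within [h_min, h_max] above the surface."""
    return h_min <= hover_height(point, surface_point, normal) <= h_max


def sphere_clearance(point, center, radius):
    """Signed distance from `point` to a sphere surface (negative inside)."""
    return math.dist(point, center) - radius


def trace_is_collision_free(trace, spheres, margin):
    """True if every waypoint keeps more than `margin` from every sphere."""
    return all(
        sphere_clearance(p, center, radius) > margin
        for p in trace
        for (center, radius) in spheres
    )


# Illustrative trace hovering 3 cm above a table at z = 0 (assumed values).
trace = [(0.0, 0.0, 0.03), (0.1, 0.0, 0.03), (0.2, 0.0, 0.03)]
table_point, table_normal = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
spheres = [((0.1, 0.0, 0.15), 0.05)]  # one spherical obstacle above the path

hover_ok = all(
    satisfies_hover(p, table_point, table_normal, 0.01, 0.05) for p in trace
)
clear_ok = trace_is_collision_free(trace, spheres, margin=0.02)
```

A real deployment would replace the sphere test with signed-distance queries against scene meshes and would add per-waypoint inverse kinematics checks, as the text describes.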
Manipulation is formulated as prediction of ordered 3D keypoints under language guidance. The paper’s example instruction, “water flowers left-to-right hovering 1–5 cm above each,” is decomposed into 3D Spatial Referring, which resolves left-right order and localizes objects, and 3D Spatial Measuring, which estimates absolute heights and clearances. Physical consistency is then assessed through valid start conditions, defined by grasp proximity to the target point cloud, valid end conditions, defined by being inside or near the destination 3D bounding box, and collision-free path validity. These notions are operationalized in the TraceSpatial benchmark through the measures “3D Start,” “3D End,” and overall “Success.” The supervised objective includes coordinate regression,
together with smoothness and metric-consistency terms when such labels are available (Tan et al., 20 Jan 2026).
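The start, end, and overall success conditions can be sketched as simple geometric predicates. This is only an illustration under stated assumptions: the destination box is treated as axis-aligned, the distance threshold and tolerance are hypothetical parameters, and the benchmark's actual scoring rules may differ.

```python
import math


def valid_start(trace, target_points, max_dist):
    """Start is valid if the first waypoint lies close to the target point cloud."""
    start = trace[0]
    return min(math.dist(start, p) for p in target_points) <= max_dist


def valid_end(trace, box_min, box_max, tol=0.0):
    """End is valid if the last waypoint is inside (or within `tol` of) the box."""
    end = trace[-1]
    return all(
        lo - tol <= c <= hi + tol for c, lo, hi in zip(end, box_min, box_max)
    )


def trace_success(trace, target_points, box_min, box_max, max_dist, tol, collision_free):
    """Overall success requires a valid start, a valid end, and a collision-free path."""
    return (
        valid_start(trace, target_points, max_dist)
        and valid_end(trace, box_min, box_max, tol)
        and collision_free
    )


# Illustrative values in meters (assumed).
target_points = [(0.30, 0.10, 0.02), (0.31, 0.10, 0.02)]
trace = [(0.30, 0.11, 0.03), (0.40, 0.20, 0.15), (0.60, 0.30, 0.05)]
box_min, box_max = (0.55, 0.25, 0.00), (0.65, 0.35, 0.10)

ok = trace_success(
    trace, target_points, box_min, box_max,
    max_dist=0.02, tol=0.01, collision_free=True,
)
```

Here `collision_free` is passed in as a precomputed flag so that the collision test can be swapped for whatever checker the surrounding stack provides.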
4. Dense Temporal Value Estimation
The temporal component of RoboBrain 2.5 frames value estimation as progress prediction from visual observations, with explicit robustness to viewpoint changes. Training trajectories are segmented at keyframes and densely sampled. Adaptive sampling within segments produces states $s_0, \dots, s_N$ with global progress $\Phi(s_i) \in [0, 1]$, where $\Phi(s_0) = 0$ at the initial state and $\Phi(s_N) = 1$ at the goal. This construction yields a dense temporal supervision signal rather than a single terminal success label (Tan et al., 20 Jan 2026).
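Hop-based relative progress normalization converts pairwise BEFORE and AFTER states into bounded progress or regress targets. For states $s_p$ and $s_q$ with global progress $\Phi(s_p)$ and $\Phi(s_q)$ in $[0, 1]$, the hop function is
$$H(s_p, s_q) = \begin{cases} \dfrac{\Phi(s_q) - \Phi(s_p)}{1 - \Phi(s_p)}, & \Phi(s_q) > \Phi(s_p), \\[2ex] \dfrac{\Phi(s_q) - \Phi(s_p)}{\Phi(s_p)}, & \Phi(s_q) < \Phi(s_p), \\[2ex] 0, & \Phi(s_q) = \Phi(s_p). \end{cases}$$
This normalizes supervision into $[-1, 1]$ with respect to remaining forward distance or already covered backward distance. The paper states that reconstruction of $\Phi$ by iteratively applying hops remains strictly within $[0, 1]$ (Tan et al., 20 Jan 2026).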
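The normalization described above can be made concrete with a short sketch. It assumes progress values in $[0, 1]$ with 0 at the initial state and 1 at the goal, and it follows the prose description (forward hops divided by remaining distance, backward hops divided by covered distance); the function names are hypothetical and the paper's exact formulation may differ.

```python
def hop(phi_before, phi_after):
    """Relative progress in [-1, 1] between two states with progress in [0, 1]."""
    delta = phi_after - phi_before
    if delta > 0:
        # Fraction of the remaining distance to the goal that was covered.
        return delta / (1.0 - phi_before)
    if delta < 0:
        # Fraction of the already covered distance that was lost.
        return delta / phi_before
    return 0.0


def apply_hop(phi, h):
    """Invert `hop`: update a progress value with a hop in [-1, 1]."""
    if h >= 0:
        return phi + h * (1.0 - phi)
    return phi + h * phi


# Iterating hops keeps the reconstructed progress inside [0, 1].
phi = 0.5
history = [phi]
for h in (0.5, -0.25, 1.0, -1.0, 0.3):
    phi = apply_hop(phi, h)
    history.append(phi)
```

Because a forward hop can at most close the remaining gap and a backward hop can at most undo what was covered, the update never leaves the unit interval, which is the boundedness property the text attributes to the scheme.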
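RoboBrain 2.5 combines three temporal perspectives. Incremental prediction is locally precise but susceptible to drift:
$$\Phi^{\mathrm{inc}}_t = \Phi_{t-1} + \Delta_t,$$
with
$$\Delta_t = \begin{cases} \bigl(1 - \Phi_{t-1}\bigr)\, H(s_{t-1}, s_t), & H(s_{t-1}, s_t) \ge 0, \\[1ex] \Phi_{t-1}\, H(s_{t-1}, s_t), & H(s_{t-1}, s_t) < 0, \end{cases}$$
where $H(s_a, s_b)$ is the predicted hop from state $s_a$ to state $s_b$ and $\Phi_{t-1}$ is the previous progress estimate. Anchored estimates are defined relative to the initial and goal states:
$$\Phi^{\mathrm{fwd}}_t = H(s_{\mathrm{init}}, s_t), \qquad \Phi^{\mathrm{bwd}}_t = 1 + H(s_{\mathrm{goal}}, s_t).$$
The fused estimate $\Phi_t$ then combines the incremental estimate with the two anchored estimates.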
To avoid out-of-distribution (OOD) reward hacking, the model uses bi-directional consistency checks and applies the resulting consistency weight in a conservative online update (Tan et al., 20 Jan 2026).
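One way to picture the fusion of incremental and anchored estimates is the sketch below. Several parts are assumptions made purely for illustration and are not confirmed by the source: the three estimates are fused by a plain average, and the consistency weight is taken as one minus the disagreement between the forward-anchored and backward-anchored estimates. All function names are hypothetical.

```python
def clamp01(x):
    """Clamp a value to the unit interval."""
    return max(0.0, min(1.0, x))


def incremental(prev_phi, h_step):
    """Update the previous estimate with the hop between consecutive frames."""
    if h_step >= 0:
        return prev_phi + h_step * (1.0 - prev_phi)
    return prev_phi + h_step * prev_phi


def forward_anchored(h_from_init):
    """A hop measured from the initial state (progress 0) is the progress itself."""
    return clamp01(h_from_init)


def backward_anchored(h_from_goal):
    """A hop measured from the goal (progress 1) is a regress of size |h|."""
    return clamp01(1.0 + h_from_goal)


def fuse(prev_phi, h_step, h_from_init, h_from_goal):
    """ASSUMPTION: fuse the three perspectives with an unweighted mean."""
    inc = incremental(prev_phi, h_step)
    fwd = forward_anchored(h_from_init)
    bwd = backward_anchored(h_from_goal)
    return (inc + fwd + bwd) / 3.0, (inc, fwd, bwd)


def conservative_update(prev_phi, fused_phi, fwd, bwd):
    """ASSUMPTION: trust the fused value less when the two anchors disagree."""
    weight = max(0.0, 1.0 - abs(fwd - bwd))
    return prev_phi + weight * (fused_phi - prev_phi), weight


# Consistent case: all three perspectives agree that progress is 0.5.
fused, (inc, fwd, bwd) = fuse(prev_phi=0.4, h_step=1.0 / 6.0,
                              h_from_init=0.5, h_from_goal=-0.5)
updated, weight = conservative_update(0.4, fused, fwd, bwd)
```

When the anchored estimates disagree strongly, the weight shrinks and the estimate stays close to its previous value, which is the intended effect of a conservative update on unfamiliar observations.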
The learning objective uses discretized hop bins with balanced temporal distances per bin and explicit zero-hop samples when the BEFORE and AFTER states coincide, typically optimized with cross-entropy over hop bins,
$$\mathcal{L}_{\mathrm{hop}} = -\sum_{b=1}^{B} y_b \log \hat{p}_b,$$
where $y$ is the one-hot target over $B$ bins and $\hat{p}$ is the predicted bin distribution. Stage-1 temporal pretraining also includes pairwise temporal ordering. The stated significance of this design is that dense feedback stabilizes downstream reinforcement learning, fused progress is drift-resistant, and multi-view conditioning promotes view-invariant embeddings that are more robust under occlusion and viewpoint shifts (Tan et al., 20 Jan 2026).
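Discretizing hops and scoring them with cross-entropy can be sketched as follows. The equal-width binning and the bin count are assumptions made for illustration; the source only states that hops are discretized into bins and optimized with cross-entropy, so the actual bin layout may differ.

```python
import math


def hop_to_bin(h, num_bins):
    """Map a hop in [-1, 1] to one of `num_bins` equal-width bins (assumed layout)."""
    if not -1.0 <= h <= 1.0:
        raise ValueError("hop must lie in [-1, 1]")
    idx = int((h + 1.0) / 2.0 * num_bins)
    return min(idx, num_bins - 1)  # h == 1.0 falls into the last bin


def cross_entropy(logits, target_bin):
    """Cross-entropy of a softmax over `logits` against a one-hot target bin."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_bin]


# With an odd bin count, a zero hop lands in the central bin.
num_bins = 21
zero_bin = hop_to_bin(0.0, num_bins)

# A uniform prediction costs log(num_bins); a confident correct one costs less.
uniform_loss = cross_entropy([0.0] * num_bins, zero_bin)
peaked = [0.0] * num_bins
peaked[zero_bin] = 5.0
peaked_loss = cross_entropy(peaked, zero_bin)
```

Choosing an odd number of bins gives zero-hop samples a dedicated central class, which is one simple way to represent the "no change" case explicitly.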
5. Data, supervision, and training regimen
The training corpus totals approximately 12.4M samples. General multimodal data contributes about 2.83M samples from Honey-Data-1M and LLaVA-OneVision 1.5-Instruct-Data after filtering, deduplication, and packing. Spatial reasoning data spans multiple subsets, including Visual Grounding, Object Pointing, Affordance, Spatial Understanding, and Spatial Referring. A new 3D Spatial Reasoning set contributes 1.74M samples and 8.08M QA pairs, built from CA-1M and ScanNet scans for metric-grounded reasoning and occupancy maps, plus cleaned and decomposed real or simulated manipulation videos from AgiBot-World, DROID, and RoboTwin 2.0 (Tan et al., 20 Jan 2026).
Dense Value Estimation is trained from a corpus of approximately 35M value samples derived from about 27M raw frames and down-sampled to about 3.5M for training. The sources are distributed across real robots at roughly 60%, including AgiBot-World, DROID, and RoboBrain-X; simulation at roughly 13%, including LIBERO, RoboCasa, and RoboTwin; and human egocentric data at roughly 26% from EgoDex. Multi-view setups and hop-based labels are applied consistently across these sources (Tan et al., 20 Jan 2026).
Training proceeds in two stages. Stage-1, Foundational Spatiotemporal Learning, uses 8.3M samples and a next-token prediction objective over mixed tasks to preserve general perception, 2D grounding, qualitative 3D understanding, planning, and coarse temporal logic through temporal ordering. Stage-2, Specific Spatiotemporal Enhancement, uses approximately 4.1M samples and adds metric 3D tracing, including prediction of $(u, v, d)$ sequences and distances in centimeters, together with dense hop prediction as classification or regression over frame pairs. Anti-forgetting is handled with 15% replay of Stage-1 data. The training infrastructure uses hybrid parallelism, dynamic memory pre-allocation, and validated cross-accelerator training on NVIDIA and Moore-Threads hardware with matched convergence (Tan et al., 20 Jan 2026).
6. Evaluation and practical robotics use
Evaluation spans 2D spatial reasoning, 3D spatial reasoning, dense temporal value estimation, and downstream deployment suitability. On 2D spatial reasoning, RoboBrain 2.5 reports an average score of 75.82 on CV-Bench, CrossPoint, RoboSpatial, RefSpatial, and EmbSpatial. On 3D spatial reasoning, it reports 64.17 on MSMU for NVIDIA and 61.66 for Moore-Threads, 78.31 on Q-Spatial for Moore-Threads, and on TraceSpatial it reports 3D Start 83 for NVIDIA, 3D End 65 for Moore-Threads, and Success 44 for NVIDIA or 36 for Moore-Threads. Error-based evaluations include VABench-V RMSE of 0.1189 for Moore-Threads and 0.1281 for NVIDIA, and ShareRobot-T RMSE of 0.1164 for NVIDIA and 0.1171 for Moore-Threads, improving over RoboBrain 2.0 at 0.1240. For dense temporal value estimation, the paper reports near-ceiling bidirectional consistency on LIBERO and RoboCasa at approximately 99%, forward and reverse VOC of 93.67 and 89.26 on DROID for Moore-Threads, 94.58 and 94.54 on Galaxea for Moore-Threads, and 80.67 and 81.12 on EgoDex for Moore-Threads, indicating transfer beyond robot morphology (Tan et al., 20 Jan 2026).
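For readers unfamiliar with the two metric families used here, the sketch below computes RMSE and a rank-based value-order correlation. It assumes that VOC is the Spearman rank correlation between predicted values and chronological frame order, which is how the metric is commonly defined in the value-learning literature; the source does not spell out the formula, and the reported figures appear to be scaled by 100.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def _pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def value_order_correlation(values):
    """ASSUMED definition: Spearman correlation of values with frame order."""
    frame_order = [float(i) for i in range(1, len(values) + 1)]
    return _pearson(_ranks(values), frame_order)


# Predicted progress with one out-of-order frame (illustrative values).
voc = value_order_correlation([0.1, 0.3, 0.2, 0.6])
```

A perfectly monotone progress curve scores 1, and a perfectly reversed one scores -1, so forward and reverse playback can be evaluated with the same function by reversing the frame sequence.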
The practical robotics pipeline is explicitly specified. First, one provides the text instruction and current RGB frames, optionally multi-view, together with camera intrinsics and extrinsics. Second, the model predicts the 3D trace of $(u, v, d)$ keypoints, which is back-projected to the camera frame and transformed to robot or world coordinates. Third, the trace is validated by start or end constraints and collision checks, with optional spline smoothing. Fourth, the resulting waypoints are passed to a motion planner or IK stack, including grasp and release phases. Fifth, during execution, streamed frames are sent to the value head together with Initial and Goal anchors to obtain hop values and the fused progress estimate for monitoring, recovery, and RL reward shaping (Tan et al., 20 Jan 2026).
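The monitoring step at the end of this pipeline can be illustrated with a small sketch that consumes a stream of progress estimates. The reward definition (difference of consecutive progress values) and the regress-detection rule are assumptions chosen for illustration, not the paper's method, and the window and threshold values are hypothetical.

```python
def shaped_rewards(progress):
    """ASSUMPTION: dense reward as the change in estimated progress per step."""
    return [after - before for before, after in zip(progress, progress[1:])]


def needs_recovery(progress, window=3, drop=0.1):
    """Flag a regress if progress fell by more than `drop` over `window` steps."""
    if len(progress) <= window:
        return False
    return progress[-1 - window] - progress[-1] > drop


# Illustrative progress stream: steady advance, then a slip (assumed values).
stream = [0.0, 0.2, 0.5, 0.45, 0.3, 0.25]
rewards = shaped_rewards(stream)
trigger = needs_recovery(stream)
```

Because the rewards telescope, their sum equals the net change in estimated progress over the episode, so a policy cannot accumulate reward by oscillating between states under this particular shaping choice.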
The required sensing assumptions are comparatively modest. Monocular RGB cameras with known intrinsics suffice because depth is predicted; multi-view cameras are recommended for robustness, and camera-to-base calibration through extrinsics is required. The model runs on standard GPUs, supports batched multi-view inference, and the 8B model is described as suitable for several-Hz closed-loop feedback with typical inference accelerators, although exact latency depends on hardware and sequence length (Tan et al., 20 Jan 2026).
7. Limitations, downstream adaptations, and prospective extensions
RoboBrain 2.5 assumes accurate intrinsics and extrinsics, static or slowly changing scenes, and consistent lighting. Severe occlusion or fast dynamics can degrade both trace predictions and value signals. The paper also identifies domain gaps for rare objects or unconventional viewpoints, where hop inconsistency may appear as a low consistency weight, and lists specific failure modes including tight clearances with heavy specular surfaces, non-rigid manipulation where depth-only keypoints under-specify shape-state change, and unexpected camera drift. Planned extensions include unified understanding and generation through world-modeling with image or video prediction, deployment to mobile manipulation and humanoids, a scalable model family with edge-optimized and “instruction vs. thinking” variants, and a self-evolving data engine that uses the value estimator to curate training data (Tan et al., 20 Jan 2026).
A subsequent use of RoboBrain 2.5 appears in "Technical Report of RoboSpatial Challenge at CVPR 2026: Selective Reasoning Activation and Reference-Frame Disambiguation for Embodied Spatial Reasoning," where RoboBrain2.5-8B-NV serves as the base model for RoboSpatialBrain. That system adds two inference-time mechanisms: a forced “> ” prefix with task-specific post-prompts for context and compatibility tasks, and an explicit reference-frame redirection pipeline that converts object-centric directives into camera-centric ones through object extraction, facing-direction estimation, and dual querying. In the RoboSpatial-Home benchmark, RoboSpatialBrain achieved first place with an official average success rate of 80.9%, with per-task success rates of 88.6 for Configuration, 83.8 for Compatibility, and 70.5 for Context. The same report also documents that compatibility-only LoRA fine-tuning on 24,000 EmbodiedScan-derived examples can help when no reasoning activation is used, but degrades the best prompted configuration, which the authors attribute to catastrophic forgetting of broader reasoning and instruction-following abilities (Xie et al., 30 Jun 2026).
Taken together, RoboBrain 2.5 occupies a specific position in embodied AI research: it is a foundation model centered on metric 3D grounding and dense execution-time value estimation, rather than a cloud knowledge graph or a generic long-horizon manipulation MLLM. Its technical signature lies in replacing pixel-relative outputs with depth-aware 3D traces and replacing sparse success signals with bounded, multi-view-consistent progress estimates, thereby aligning perception, spatial constraint satisfaction, and closed-loop execution within a single multimodal framework (Tan et al., 20 Jan 2026).