
Point Bridge: 3D Representations for Cross Domain Policy Learning

Published 22 Jan 2026 in cs.RO | (2601.16212v1)

Abstract: Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via vision-language models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: https://pointbridge3d.github.io/

Summary

  • The paper establishes a novel framework for zero-shot sim-to-real policy transfer using domain-agnostic 3D keypoint representations that reduce the domain gap.
  • It employs transformer-based policy learning over unified point-based representations, achieving up to 44% improvement on manipulation tasks.
  • The framework supports efficient co-training with minimal real data, offering significant gains in sample efficiency and robustness to distractors.

Point Bridge: Domain-Agnostic 3D Keypoint Representations for Cross-Domain Policy Learning

Motivation and Problem Statement

A central bottleneck for generalist robot policy learning is the limited availability and high collection cost of large-scale real-world robotic manipulation datasets. In contrast to the vision and language domains, Internet-scale data is difficult to obtain for robotics, where embodied physical interaction is required. Simulation and synthetic data generation, leveraged by advances in generative models and high-fidelity simulators, represent a promising alternative. However, the utility of simulation data for real-world deployment is severely constrained by the domain gap—primarily visual, geometric, and sensing differences between simulated and physical environments.

To overcome these limitations, "Point Bridge: 3D Representations for Cross Domain Policy Learning" (2601.16212) presents a framework for sim-to-real policy transfer based on unified, domain-agnostic point-based representations. This architecture achieves robust zero-shot sim-to-real transfer without explicit visual or object-level alignment, and further enables effective co-training with small amounts of real data as well as multitask learning in a language-conditioned setting.

Point Bridge Framework

Point Bridge consists of three main stages: (1) automated extraction of compact task-relevant 3D keypoints from both simulation and real-world observations, (2) transformer-based policy learning over unified point-based representations, and (3) an efficient, flexible real-time perception pipeline for deployment (Figure 1).

Figure 1: The Point Bridge perception pipeline uses state-of-the-art vision-language models and segmentation methods to extract task-relevant 3D keypoints from real images and task descriptions.

Unified Point-Based Scene Representation

Point Bridge replaces image-based or dense point-cloud representations with a compact set of 3D keypoints representing the robot and task-relevant objects, all expressed in a common reference frame. In simulation, keypoints are sampled from object meshes and projected into camera views, with simulated sensor noise added to mimic real sensing. In real deployments, keypoints are extracted by:

  1. Object Identification: A VLM (Gemini-2.5) analyzes the scene image and natural language task description to identify relevant object categories.
  2. Object Localization: Molmo-7B localizes objects at the pixel level, initializing masks for semantic object segmentation (SAM-2) to generate robust 2D object masks.
  3. 3D Projection: Uniformly sampled interior mask points are projected to 3D using FoundationStereo’s depth estimation and camera calibration. Points are subsampled to maintain coverage and computational efficiency.
  4. End-Effector Representation: Gripper pose is encoded as a set of rigidly offset keypoints, paralleling object keypoint procedures.
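The 3D projection step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: it assumes a pinhole camera with known intrinsics, a metric depth map (which in Point Bridge would come from FoundationStereo), and random subsampling; the function name and point budget are hypothetical.

```python
import numpy as np

def backproject_mask_points(pixels, depth, K, T_world_cam, n_keep=32):
    """Lift 2D mask pixels to 3D keypoints in a common world frame.

    pixels:      (N, 2) array of (u, v) pixel coordinates sampled
                 uniformly inside an object mask.
    depth:       (H, W) metric depth map (e.g. from a stereo model).
    K:           (3, 3) pinhole camera intrinsics.
    T_world_cam: (4, 4) camera-to-world extrinsic transform.
    """
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v, u]                      # per-pixel metric depth
    valid = z > 0                        # drop pixels with no depth
    u, v, z = u[valid], v[valid], z[valid]

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection into the camera frame.
    pts_cam = np.stack([(u - cx) * z / fx,
                        (v - cy) * z / fy,
                        z], axis=1)

    # Express all keypoints in the shared reference frame.
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_world = (T_world_cam @ pts_h.T).T[:, :3]

    # Subsample to a fixed budget for the policy input.
    if len(pts_world) > n_keep:
        idx = np.random.choice(len(pts_world), n_keep, replace=False)
        pts_world = pts_world[idx]
    return pts_world
```

Expressing all keypoints in one world frame is what makes the representation camera-agnostic, but it also explains the paper's dependence on accurate scene-camera calibration noted in the limitations.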

This unified abstraction minimizes sim-to-real alignment requirements by enforcing domain-invariant object and robot representations.

Scalable Data Generation and Co-Training

Original human-provided teleoperated demonstrations in simulation are expanded via synthetic data generation tools (MimicGen). Segments are adapted to novel scenes by SE(3) transformations that preserve end-effector geometries relative to scene objects. This approach maximizes policy generalization from limited source data. For enhanced sim-to-real transfer, small sets of real-world demonstrations can be incorporated for joint training—unifying sim and real representations via the keypoint pipeline.
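The SE(3) adaptation above can be sketched as re-expressing a demonstrated end-effector segment relative to the object's new pose. This is a hedged simplification of the MimicGen-style procedure, assuming 4x4 homogeneous pose matrices and a single reference object; the function name is illustrative.

```python
import numpy as np

def adapt_segment(ee_poses, T_obj_src, T_obj_new):
    """Retarget a demonstrated end-effector segment to a novel scene.

    Each 4x4 end-effector pose is re-expressed so that its pose
    *relative to the task object* is preserved when the object moves:
        T_ee_new = T_obj_new @ inv(T_obj_src) @ T_ee_src
    """
    delta = T_obj_new @ np.linalg.inv(T_obj_src)
    return [delta @ T for T in ee_poses]
```

Because only relative geometry is preserved, a handful of source demonstrations can be replayed across many randomized object placements, which is what makes the synthetic dataset scalable.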

Transformer-Based Policy Learning

Policies are learned using a decoder-only multi-task transformer, architecturally following BAKU, with point embeddings derived from PointNet encoders (combining robot and object keypoints). For multitask learning, language instructions are embedded (MiniLM). Action prediction targets include the end-effector pose and gripper state, and smoothness regularization is achieved through action chunking and temporal averaging.
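The action chunking and temporal averaging mentioned above can be sketched as follows. This is an illustrative ACT-style temporal ensembling scheme, not the paper's exact implementation: the class name and exponential weighting constant are assumptions.

```python
import numpy as np
from collections import deque

class ChunkedActionSmoother:
    """Temporal averaging over overlapping action chunks.

    At each control step the policy predicts a chunk of the next H
    actions. The action executed at a given timestep is an
    exponentially weighted average of all chunks that cover it,
    which smooths the commanded end-effector trajectory.
    """
    def __init__(self, horizon, m=0.1):
        self.m = m                          # weight decay for older chunks
        self.buf = deque(maxlen=horizon)    # recent chunks, newest last

    def step(self, new_chunk):
        """new_chunk: (H, action_dim) prediction; returns smoothed action."""
        self.buf.append(np.asarray(new_chunk))
        acts, wts = [], []
        # A chunk predicted `age` steps ago contributes its `age`-th action.
        for age, chunk in enumerate(reversed(self.buf)):
            acts.append(chunk[age])
            wts.append(np.exp(-self.m * age))
        w = np.array(wts) / np.sum(wts)
        return (np.stack(acts) * w[:, None]).sum(axis=0)
```

With m = 0 this reduces to a plain mean over overlapping predictions; larger m weights recent chunks more heavily, trading smoothness for responsiveness.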

Experimental Validation

Point Bridge is evaluated on six real-world manipulation tasks, featuring substantial object and environment domain shifts. The experiments leverage both large-scale synthetic simulation data (augmented from demonstrations) and real-world teleoperated examples (Figure 2).

Figure 2: Real robot rollouts visualizing successful execution of six diverse physical manipulation tasks with Point Bridge policies.

Sim-to-Real Transfer and Robustness

Point Bridge enables zero-shot sim-to-real policy transfer with up to 44% improvement over the strongest vision-based baselines in both single-task and multitask experiments. Notably, the domain-invariant point-based abstraction outperforms approaches reliant on careful visual alignment or high-fidelity simulators.

  • Zero-shot transfer is robust to large visual, object, and background discrepancies between sim and reality.
  • Point Bridge generalizes across previously unseen object instances due to the abstraction of geometry over pixel-level features.

Adding small amounts of real data through co-training results in success rate gains of up to 66% in multitask settings compared to vision-based sim-and-real co-training methods, demonstrating superior sample efficiency.

Analysis of Critical Pipeline Design Choices

Depth estimation accuracy is a key determinant of success; FoundationStereo-based 3D lifting dramatically outperforms RGB-D sensors and triangulation alternatives, particularly under challenging visibility or reflectivity conditions. Camera view alignment between simulated and real data is a performance bottleneck, though diversity in simulated camera viewpoints alleviates this.

Clutter and Distractor Robustness

Point Bridge’s VLM-guided scene filtering pipeline is robust to background distractors, in contrast to point cloud methods that lack semantic filtering (Figure 3).

Figure 3: Example scenes with significant background clutter and distractors in the real-robot setup, illustrating the necessity for robust scene filtering.

Failure Cases and Limitations

Failure modes are largely attributable to VLM errors in object identification or segmentation, particularly under severe occlusions or ambiguous contexts (Figure 4). As VLM technology improves, these issues are expected to diminish.

Figure 4: Representative failure cases of the VLM-guided scene filtering pipeline, showing missed or incorrectly segmented objects.

Other practical limitations include (1) dependence on accurate scene-camera calibration for reference frame consistency, (2) reduced control frequency relative to image-based policies due to perception pipeline overhead, and (3) the potential loss of spatial context through overly sparse abstractions.

Theoretical and Practical Implications

The Point Bridge framework demonstrates that domain-agnostic 3D point-based representations, extracted automatically by vision-language foundation models, are sufficient for scalable sim-to-real and cross-domain generalization in robotic policy learning. This removes the need for feature-level domain adaptation or manual object annotation at scale. Furthermore, the unified pipeline supports multitask learning and is amenable to augmentation with real or Internet-scale synthetic data. High task success rates beyond rigid-body scenarios suggest that this paradigm extends to articulated and deformable object manipulation.

The results indicate that as VLMs and large-scale synthetic scene generators improve, the domain gap will increasingly be addressed at the representation level, suggesting that future cross-domain robotic policy models can exploit “Internet-scale” synthetic and video datasets without object-level domain engineering. A promising direction for further study is the integration of hybrid, context-aware representations that provide both task-relevant abstraction and essential environmental spatial cues.

Conclusion

Point Bridge establishes a new paradigm for scalable cross-domain policy learning through domain-agnostic 3D representations and automated keypoint extraction, unlocking the vast potential of synthetic simulation data for highly sample-efficient, generalizable robotic manipulation. The architecture delivers robust zero-shot sim-to-real transfer, supports effective multitask and co-training with minimal real data, and is resilient to clutter and distractors by virtue of a strong VLM-guided filtering pipeline. Limitations regarding VLM failure modes, camera calibration dependence, and abstraction sparsity frame directions for continued research. The practical implication is a significant reduction in real-data collection costs and an expansion of scalable robotic generalization capabilities, accelerating progress toward practical, generalist embodied AI.
