POMA-3D: The Point Map Way to 3D Scene Understanding (2511.16567v2)
Abstract: In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
Explain it Like I'm 14
Explaining “POMA-3D: The Point Map Way to 3D Scene Understanding”
Overview: What is this paper about?
This paper introduces POMA-3D, a new way for computers to understand 3D spaces (like rooms or buildings). Instead of using regular 3D data formats, it uses “point maps,” which are like special images where every pixel carries its exact position in the real world. The big idea is to teach a 3D model using knowledge from powerful 2D models (that understand pictures and text) so it can handle many 3D tasks—even without color—just using geometry.
Goals: What questions did the researchers ask?
The researchers wanted to know:
- Can we train a strong 3D model using “point maps” that are easier to connect with 2D image models?
- Can we transfer the rich knowledge from 2D image–text models (like CLIP) to 3D understanding?
- Will this help with different 3D tasks, such as answering questions about a room, navigating, finding the right scene from a description, or locating where an agent is based on a text hint?
Methods: How did they do it?
Think of their approach as teaching the model in two steps, with smart data and training tricks.
What is a “point map”?
- Imagine a photo of a room. Normally, each pixel just shows color. In a point map, each pixel also knows its real 3D position (x, y, z) in the world.
- So you get the best of both worlds: a 2D grid (like an image) with accurate 3D geometry. This makes it much easier to align with 2D models while keeping full 3D structure.
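To make the grid-of-coordinates idea concrete, here is a minimal sketch (not the authors' code) of how a point map can be derived from a depth image with known camera intrinsics and a camera-to-world pose; the function name, the toy intrinsics, and the 640×480 resolution are illustrative assumptions.

```python
import numpy as np

def depth_to_point_map(depth, K, R, t):
    """Unproject a depth image into a point map: an (H, W, 3) grid where each
    pixel stores the 3D world coordinate of the surface it observes.

    depth : (H, W) depth in metres
    K     : (3, 3) camera intrinsic matrix
    R, t  : (3, 3) rotation and (3,) translation mapping camera -> world
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))       # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)     # homogeneous pixel coords, (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                      # back-project pixels to camera rays
    cam_pts = rays * depth[..., None]                    # scale rays by depth -> camera coords
    world_pts = cam_pts @ R.T + t                        # rigid transform into the world frame
    return world_pts.astype(np.float32)                  # (H, W, 3) point map

# Toy usage with an identity pose and a flat 2 m depth plane (illustrative values).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)
point_map = depth_to_point_map(depth, K, np.eye(3), np.zeros(3))
print(point_map.shape)  # (480, 640, 3)
```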
What is “self-supervised” learning?
- It’s learning without needing a teacher to label everything. The model learns patterns from the data itself, like figuring out how different views of the same room relate to each other.
Training data: ScenePoint
- They built a new dataset called ScenePoint:
- About 6,500 real indoor rooms (from RGB-D videos—color plus depth).
- 1,000,000 single images from an image–caption dataset, converted into point maps using a 3D reconstruction model (VGGT).
- Each view has captions, and scenes also have longer descriptions. This lets the model connect 3D geometry with text.
Connecting 3D with 2D: Vision–language alignment
- The model learns that a point map (3D), an image (2D), and a caption (text) all describe the same thing.
- It uses a powerful 2D foundation model (FG-CLIP, like CLIP) to “align” the 3D point map features with image and text features. Think of it as teaching the 3D model the meanings understood by 2D models.
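A minimal sketch of what such CLIP-style alignment can look like, assuming the point-map [CLS] embedding is contrasted against frozen image and caption embeddings from the 2D model; the symmetric InfoNCE form, variable names, and temperature value are illustrative, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def alignment_loss(pointmap_cls, image_cls, text_cls, tau=0.07):
    """Symmetric contrastive (InfoNCE-style) loss that pulls each point-map
    [CLS] embedding toward the frozen image and caption embeddings of the same
    view, and pushes it away from the other views in the batch.

    All inputs are (B, D) tensors; image_cls and text_cls come from the frozen
    2D foundation model and receive no gradient.
    """
    p = F.normalize(pointmap_cls, dim=-1)
    targets = torch.arange(p.size(0), device=p.device)     # matched pairs sit on the diagonal
    loss = 0.0
    for anchor in (image_cls, text_cls):
        a = F.normalize(anchor, dim=-1).detach()            # frozen 2D prior
        logits = p @ a.t() / tau                             # (B, B) similarity matrix
        loss = loss + 0.5 * (F.cross_entropy(logits, targets) +
                             F.cross_entropy(logits.t(), targets))
    return loss / 2

# Toy usage with random embeddings (batch of 8 views, 512-dim features).
B, D = 8, 512
print(alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)))
```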
Making different views agree: POMA-JEPA
- Rooms are seen from multiple views (like different camera angles). The model must keep its understanding consistent across these views.
- POMA-JEPA is a training step where the model predicts the features of hidden patches (masked parts) by looking at visible ones from other views.
- Important detail: 3D points don’t have a fixed order like pixels in an image. To avoid confusion, they compare sets of features in a way that doesn’t care about exact ordering (using a “Chamfer distance,” which is like measuring how close two clouds of points are overall).
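Below is a minimal sketch of a Chamfer-style distance between two sets of patch features, illustrating why the comparison ignores ordering; the exact loss used in POMA-JEPA may differ in details such as squaring or normalization.

```python
import torch

def chamfer_feature_loss(pred, target):
    """Chamfer distance between two sets of patch embeddings.

    pred, target : (N, D) and (M, D) feature sets for one view; the loss is
    symmetric and does not depend on how the patches are ordered.
    """
    d = torch.cdist(pred, target)                         # (N, M) pairwise L2 distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Permutation-invariance check: shuffling the target rows leaves the loss unchanged.
x = torch.randn(16, 64)
y = torch.randn(16, 64)
perm = torch.randperm(16)
assert torch.allclose(chamfer_feature_loss(x, y), chamfer_feature_loss(x, y[perm]))
```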
Two-stage training
- Warm-up stage: Train on single-view point maps from images to learn basic 3D–2D alignment.
- Main stage: Train on multi-view room data to learn strong scene-level understanding and view consistency.
Results: What did they find and why it matters?
POMA-3D turned out to be a strong backbone (starting point) for many 3D tasks, often beating previous methods—even when it only used 3D coordinates (no colors).
- 3D Question Answering: The model did especially well on tasks that require understanding spatial relationships and common sense in 3D rooms.
- Embodied Navigation: Given a room and an instruction, it chose good directions to move, and performed best or near-best in both simpler (4 directions) and harder (8 directions) settings.
- Scene Retrieval: Given a description (like “a room with three footstools and bookshelves”), it found the correct 3D scene much more reliably than other methods.
- Embodied Localization: From a simple text like “I am standing between two blackboards,” it retrieved the views that highlight the agent’s likely location in the room.
Why this is important:
- It shows that knowledge learned from huge 2D datasets (images and captions) can be reused to help 3D models—solving the problem that 3D data is limited and expensive to collect.
- Point maps act as a “bridge” between 2D and 3D, making alignment easier and more effective.
- The model works well without color—just from geometry—which is handy in cases where color is missing or unreliable.
Implications: What could this lead to?
This research points toward scalable, general-purpose 3D models that:
- Help robots and AR systems understand and navigate real spaces more reliably (e.g., find the kitchen from a description, or locate a bed from “I’m sitting on the bed”).
- Make it easier to build 3D “foundation models” by reusing existing large 2D datasets, saving time and resources.
- Encourage future work to combine 3D coordinates with color and texture for even richer understanding.
- Support both specialist tools (for a single job) and generalist systems (that can do many 3D tasks) with the same powerful 3D features.
In short, POMA-3D shows a practical and effective “point map” path to 3D scene understanding, helping computers learn from the vast knowledge already available in 2D image–text models.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, aimed to guide future research.
- Limited modality scope (geometry-only): POMA-3D uses only point-map coordinates (no color/texture). Quantify how much performance is lost/gained by adding appearance features; design and evaluate principled fusion (e.g., late fusion with RGB CLIP embeddings, joint geometric–photometric encoders, colorized point maps) across color-dependent tasks such as ScanQA.
- Reliance on 2D foundation models without scaling analysis: The point-map encoder is initialized from FG-CLIP ViT-B/16 and LoRA-tuned. Evaluate scaling laws (ViT-L/14, ViT-H/14), training from scratch versus 2D-initialization, and whether stronger 2D backbones materially change 3D performance or alignment stability.
- POMA-JEPA loss design not ablated: JEPA uses Chamfer Distance over patch embeddings to avoid order constraints. Test alternative permutation-invariant objectives (e.g., Earth Mover’s Distance/optimal transport, contrastive-set matching, Sinkhorn) and report failure modes (e.g., mode collapse) under different losses, predictor depths, and mask ratios.
- Patch-level versus [CLS]-level alignment: Current view/scene alignment uses [CLS] embeddings; it is unclear if local semantics are captured. Explore fine-grained patch-level or region-level alignment (e.g., cross-attention to text tokens), and measure improvements on fine-grained tasks (grounding, spatial relationships).
- Multi-view aggregation is simplistic: Scene embeddings are mean-pooled across views. Compare attention pooling, transformer-based cross-view fusion, or geometry-aware pooling (e.g., pose-weighted, visibility-aware) to improve holistic scene representations.
- Embodied localization lacks quantitative evaluation: Only qualitative results are shown. Establish a benchmark with ground-truth agent positions/regions and report metrics (IoU, Top-K recall/precision, localization error) to quantify performance.
- Generalization to outdoor, large-scale, and dynamic scenes is unknown: ScenePoint focuses on indoor RGB-D; assess transfer to LiDAR, outdoor datasets (e.g., KITTI, nuScenes), large spaces (malls, airports), and scenes with dynamic objects. Investigate gravity alignment or canonical axes for cross-scene comparability.
- Sensitivity to camera calibration and depth noise: Room point maps depend on accurate intrinsics/extrinsics; single-view point maps use VGGT predictions. Quantify robustness to pose/depth errors (synthetic perturbations), scale ambiguities, and occlusions; propose pose/noise-robust training (e.g., uncertainty-aware losses).
- Warm-up with pseudo point maps from 2D images is under-characterized: Single-view point maps have unconstrained scales/orientations. Measure their metric fidelity, how mis-specified frames affect downstream alignment, and whether normalization (e.g., unit-sphere scaling, gravity alignment) improves warm-up transfer.
- Caption quality and bias unchecked: View-level captions are LLM-generated and filtered by FG-CLIP similarity; scene-level captions are sourced externally. Audit caption correctness, bias, and diversity; evaluate how caption quality affects alignment and downstream performance (e.g., via controlled noise or adversarial perturbations).
- Dataset coverage remains modest for 3D scenes: Only ~6.5K room-level scenes (though supplemented with 1M images). Explore scaling with more real scenes, richer sensor modalities (LiDAR, stereo), and diverse layouts; report data–performance scaling curves and sample-efficiency.
- No direct evaluation on standard 3D visual grounding: Due to point cloud vs point-map misalignment, VG is repurposed to scene retrieval. Develop an evaluation protocol that bridges point-map grids to point-cloud annotations (e.g., reprojecting masks, voxel alignment, learned map–cloud registration) to test object-level grounding.
- Cross-lingual and long-text handling not studied: FG-CLIP is chosen for longer text capacity, but multilingual alignment and very long scene narratives are not evaluated. Test multilingual text encoders (e.g., Jina-CLIP-v2), code-switching, and long-instruction robustness.
- Hard negative mining and curriculum are absent: Negatives are drawn within-batch under maximum coverage sampling. Investigate hard-negative mining (cross-scene, near-duplicate scenes), curriculum strategies, and their effect on retrieval/localization robustness.
- View selection policy effects are unknown: 32-view maximum coverage sampling is used. Analyze the impact of view count, selection strategy (random vs coverage vs diversity), and redundancy; optimize selection for compute–accuracy trade-offs.
- Compute efficiency and memory footprint unreported: Point maps scale with image resolution (H×W×3), potentially heavy for multi-view training. Report throughput, memory usage, latency, and trade-offs across resolutions; explore compression (e.g., sparse grids, learned downsampling).
- Coordinate-frame consistency across scenes: Multi-view point maps share a world frame per scene, but cross-scene frames are arbitrary. Study normalization (e.g., canonical orientation via gravity/floor plane, scale alignment) to improve cross-scene retrieval and zero-shot transfer.
- Limited integration with generalist 3D LLMs: Due to resource constraints, only lightweight LoRA is used for one epoch. Evaluate deeper instruction tuning, full-model finetuning, and using POMA-3D as primary 3D backbone in scratch-trained 3D LLMs; report sample efficiency and stability.
- Comparison breadth across 3D modalities: While depth maps and point clouds are compared, analysis remains limited. Include stronger baselines (fused point clouds, meshes, voxels, NeRFs, 3DGS with geometry-aware encoders) under identical training recipes to isolate the benefits of point maps.
- Robustness to occlusion and visibility across views: POMA-JEPA masks per-view patches and reconstructs across views, but does not explicitly model visibility. Add visibility-aware masking, occlusion handling, and test performance under heavy occlusions or limited overlap.
- Explainability and geometry–semantics disentanglement: It is unclear which geometric cues drive predictions. Provide attribution (e.g., token/patch saliency maps, text-to-geometry attention visualizations) and measure whether embeddings encode metric properties (distances, angles) versus coarse layout.
- Failure-case analysis is minimal: Document cases where POMA-3D underperforms (e.g., color-dependent queries, cluttered scenes, repetitive layouts), and derive targeted remedies (appearance fusion, hard negatives, local alignment).
- Hyperparameter sensitivity is underreported: No ablations for predictor depth, mask aspect ratios, temperature τ, LoRA ranks, or EMA decay. Provide systematic sensitivity analyses and guidelines for stable training.
- Integration with SR-3D and related point-map approaches not compared: Directly benchmark against SR-3D’s positional-encoding strategy and hybrid encoders to quantify the gains of dedicated point-map pretraining.
- Safety, fairness, and privacy considerations: Indoor datasets may encode socioeconomic or cultural biases. Audit for bias, ensure privacy-preserving data handling, and evaluate fairness across room types and object distributions.
Practical Applications
Practical Applications Derived from POMA-3D
POMA-3D introduces a self-supervised 3D representation learned from point maps and aligned with 2D foundation models via view-to-scene vision–language alignment and the POMA-JEPA objective. It demonstrates transfer to 3D QA, embodied navigation, scene retrieval, and embodied localization, using only geometric inputs. Below are actionable applications, categorized by deployment horizon, with sector tags, workflows/products that could emerge, and feasibility notes.
Immediate Applications
- 3D scene search and retrieval across scan repositories
- Sector: software, AEC (architecture/engineering/construction), facilities, real estate
- What: Build a “3D search engine” to query large libraries of building scans (ScanNet/ARKitScenes-like) using natural language (e.g., “conference room with two whiteboards and six chairs”).
- Tools/products/workflow: PointMap Indexer service to embed point-map scenes; text-to-scene retrieval API; UI for scene ranking with preview (a minimal retrieval sketch appears after this list).
- Dependencies/assumptions: Access to room-level scans or RGB-D videos; ScenePoint-like curation; reliable view-to-scene embeddings; domain LoRA fine-tuning improves precision.
- Text-to-region embodied localization for mobile robots and AR
- Sector: robotics, AR/VR, smart home
- What: Localize an agent’s current area from a situational text (“I’m standing between two blackboards”), highlight top-K views, and plan movements towards the inferred region.
- Tools/products/workflow: Embodied Localization API; robot planner consumes top-K localized views and computes navigation waypoints; AR overlay that highlights regions satisfying a query.
- Dependencies/assumptions: Multi-view point maps covering the agent’s surroundings; consistent world-frame coordinates; simple path planners; may need object cues beyond geometry in some cases.
- Directional decision-making for navigation tasks from language
- Sector: robotics, logistics, healthcare operations
- What: Use POMA-3D’s improved 4-/8-directional navigation (MSNN benchmark) to decide movement based on text instructions (“go towards the bed”), all on geometry-only inputs.
- Tools/products/workflow: Lightweight navigation module plugged into mobile robots; instruction-to-direction classifier; integration with SLAM stacks to maintain global frame.
- Dependencies/assumptions: Geometry coverage of the environment; downstream fine-tuning on domain-specific layouts; possible benefit from limited color fusion (signage recognition).
- 3D question answering assistants for indoor environments
- Sector: education, retail, facilities management, robotics
- What: Query scenes for spatial, situated, and hypothetical reasoning (e.g., “Is there enough space to place a wheelchair next to the bed?”) using point-map features, outperforming prior 3D VLL baselines on SQA3D/Hypo3D.
- Tools/products/workflow: 3D QA engine backed by POMA-3D; LoRA-tuned LLM front-end; domain prompt templates; deployment for floor plan assessments and staff training.
- Dependencies/assumptions: Accurate camera calibration or reliable depth/pose prediction; geometric-only inputs may miss color-dependent cues (e.g., signage colors).
- Rapid dataset bootstrapping from image corpora
- Sector: academia, simulation, software
- What: Convert existing image-caption datasets to single-view point maps using VGGT (depth/pose prediction) for pretraining and task transfer in low-data 3D regimes.
- Tools/products/workflow: ScenePoint-style converter CLI; batching pipeline to produce point maps at scale; metadata consistency checker for camera parameters.
- Dependencies/assumptions: Quality of VGGT predictions; domain gap between pseudo-depth and real RGB-D; verification protocols for noisy geometry.
- Scene compliance pre-screening (geometry-focused)
- Sector: facilities, policy, safety/EHS
- What: Basic checks using geometric cues (clearances, doorway widths, obstacle-free paths) from point maps to flag potential compliance issues (ADA/OSHA geometries).
- Tools/products/workflow: Geometry rules engine (a simplified width-check sketch appears after this list); text-to-checklists QA (“Is the corridor ≥ 36 inches wide?”); batch scanning of building areas.
- Dependencies/assumptions: Reliable metric scale in point maps; limited without semantics (materials, colors); human-in-the-loop verification advised.
- Academic benchmarking and backbone replacement
- Sector: academia
- What: Use POMA-3D as a drop-in 3D backbone for studies on scene-level QA, grounding via retrieval, and embodied tasks; compare modalities (point maps vs depth maps vs point clouds).
- Tools/products/workflow: Open-sourced pretrained weights; training scripts for warm-up and main stages; ablation templates; Chamfer-based POMA-JEPA implementation.
- Dependencies/assumptions: Compute availability (A100-class recommended); access to FG-CLIP and LLaVA-OV for alignment baselines; consistent coordinate conventions.
- Privacy-preserving scene analytics
- Sector: policy, compliance, software
- What: Analytics using coordinate-only point maps to avoid storing raw RGB imagery, reducing PII exposure; apply scene retrieval and QA without faces/textures.
- Tools/products/workflow: Privacy mode pipeline that drops color channels; audit logs documenting coordinate-only storage; legal guidance integration.
- Dependencies/assumptions: Some tasks still need appearance cues (labels/signage); policy acceptance requires risk analyses and DPIA documentation.
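Referring back to the 3D scene search workflow above, here is a minimal retrieval sketch assuming scene embeddings from a POMA-3D-style encoder and a CLIP-aligned text encoder are already computed; the function and variable names are placeholders, not an actual API.

```python
import numpy as np

def rank_scenes(text_embedding, scene_embeddings, scene_ids, top_k=5):
    """Rank a library of pre-embedded 3D scenes against one text query by
    cosine similarity and return the top-k scene identifiers.

    text_embedding   : (D,) query embedding from a CLIP-aligned text encoder
    scene_embeddings : (N, D) mean-pooled scene embeddings (precomputed index)
    scene_ids        : list of N scene identifiers
    """
    q = text_embedding / np.linalg.norm(text_embedding)
    s = scene_embeddings / np.linalg.norm(scene_embeddings, axis=1, keepdims=True)
    scores = s @ q                                   # cosine similarities, (N,)
    order = np.argsort(-scores)[:top_k]              # highest similarity first
    return [(scene_ids[i], float(scores[i])) for i in order]
```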
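And for the geometry-focused compliance pre-screening item, the hypothetical check below estimates a corridor's clear width directly from point-map coordinates; the slicing logic, the assumed z-up world frame, and the axis-aligned corridor are deliberate simplifications for illustration only.

```python
import numpy as np

MIN_CORRIDOR_WIDTH_M = 0.9144  # 36 inches, per the example checklist item above

def corridor_width_ok(point_map, floor_z, height_band=(0.1, 2.0)):
    """Crude clearance check: keep points whose height above the floor falls
    inside a band (ignoring floor and ceiling) and measure the lateral extent
    along the corridor's narrow horizontal axis.

    point_map : (H, W, 3) world-frame coordinates; assumes z is 'up' and the
    corridor runs roughly along one horizontal axis (a strong simplification).
    """
    pts = point_map.reshape(-1, 3)
    h = pts[:, 2] - floor_z
    walls = pts[(h > height_band[0]) & (h < height_band[1])]
    if len(walls) == 0:
        return False, 0.0
    # Narrow axis = horizontal direction with the smaller spread of wall points.
    spans = walls[:, :2].max(axis=0) - walls[:, :2].min(axis=0)
    width = float(spans.min())
    return width >= MIN_CORRIDOR_WIDTH_M, width
```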
Long-Term Applications
- Generalist 3D foundation models spanning perception, language, and action
- Sector: robotics, software, education
- What: Train full-scale 3D LLMs from scratch using point-map encoders for broader task unification (QA, planning, manipulation, multi-agent coordination).
- Tools/products/workflow: Multi-task instruction tuning across 3D QA, navigation, grounding, manipulation; large-scale ScenePoint-like corpora; end-to-end policy heads.
- Dependencies/assumptions: Significant compute; robust datasets beyond indoor rooms; broader sensor diversity; safety evaluation protocols.
- Geometry+appearance fusion for holistic 3D understanding
- Sector: AR/VR, retail, safety/EHS
- What: Combine point-map geometry with color/texture features to improve tasks that depend on appearance (signage, material identification).
- Tools/products/workflow: Dual-stream encoder with late fusion; contrastive training aligning fused embeddings with text; adapters for domain-specific cues (e.g., fire safety symbols).
- Dependencies/assumptions: New datasets with synchronized RGB and geometry; careful calibration; deployment trade-offs for privacy and performance.
- Standardization of point-map formats and APIs
- Sector: policy, interoperability, AEC
- What: Industry standards governing point-map grids, world-frame conventions, and metadata for camera intrinsics/extrinsics to ensure cross-tool compatibility.
- Tools/products/workflow: Open specifications; validation suites; schema registries; compliance certification for vendors.
- Dependencies/assumptions: Multi-stakeholder governance; alignment with existing BIM/IFC standards; backward compatibility with point clouds and Gaussian splats.
- Digital twin analytics with language-driven querying
- Sector: energy, facilities, smart buildings
- What: Integrate POMA-3D embeddings into building digital twins to query space utilization, accessibility, and energy-related geometry (zoning, airflow paths).
- Tools/products/workflow: Twin-wide embedding store; QA/retrieval over live scans; coupling with HVAC/BMS data; alerting for reconfigurations.
- Dependencies/assumptions: Continuous scanning or SLAM updates; accurate metric scale; multi-sensor integration (IoT, occupancy).
- Outdoor adaptation for autonomous systems
- Sector: autonomous driving, public safety
- What: Extend point-map learning to outdoor scenes for pedestrian/vehicle navigation, disaster response, and infrastructure inspection.
- Tools/products/workflow: Dataset generation with multi-view outdoor imagery; robust VGGT depth/pose under diverse lighting; domain-adaptive POMA-JEPA masks.
- Dependencies/assumptions: Handling dynamic objects, weather, scale; safety-critical validation; integration with maps (GIS).
- Language-driven scene graph construction and spatial reasoning pipelines
- Sector: robotics, analytics, education
- What: Construct scene graphs (objects, relations, affordances) from point maps and align them with language for complex tasks (e.g., “place the box under the shelf facing the window”).
- Tools/products/workflow: 3D graph generators; relational QA modules; planning algorithms that leverage graph constraints.
- Dependencies/assumptions: Improved object grounding without color; possibly requires semantic segmentation or multi-modal fusion.
- Compliance automation and policy auditing at scale
- Sector: policy, EHS, healthcare
- What: Automated checking against building codes and hospital policies using geometric thresholds and language rules; continuous compliance monitoring.
- Tools/products/workflow: Policy-to-geometry compilers; dashboards showing pass/fail regions; audit trail generation.
- Dependencies/assumptions: Formalization of standards into machine-readable rules; validated measurement accuracy; legal acceptance.
- Real-time, on-device POMA-3D inference for mobile XR
- Sector: consumer software, hardware
- What: Optimize and compress point-map encoders for smartphones/AR headsets to enable on-device text-to-region search and navigation prompts.
- Tools/products/workflow: Distilled encoders; quantization; on-device CLIP-aligned text encoders; efficient multi-view aggregation.
- Dependencies/assumptions: Hardware constraints; battery/performance trade-offs; privacy requirements; user experience considerations.
- Warehouse/retail robotics with natural-language tasking
- Sector: logistics, retail
- What: Language-driven item pick-and-place and aisle navigation in complex indoor layouts based on geometry-first representations that generalize across stores/warehouses.
- Tools/products/workflow: Retrieval-guided path planning; shelf geometry reasoning; multilingual instruction support.
- Dependencies/assumptions: Object identity often requires appearance; fusion with barcode/RFID systems improves reliability.
- Cultural heritage and museum curation
- Sector: culture, education
- What: Index and retrieve exhibit spaces and artifacts in 3D with language queries for digital tours and curator workflows.
- Tools/products/workflow: Embedding catalogs per gallery; curator QA tools; visitor AR experiences with text-to-region guides.
- Dependencies/assumptions: High-fidelity scanning; fusion for material/color; rights management and preservation policies.
Notes on feasibility across applications:
- POMA-3D is strongest for geometry-driven tasks; color-dependent reasoning (e.g., signage interpretation) requires fusion with appearance features.
- Reliable multi-view coverage and world-coordinate consistency are critical; single-view conversions via VGGT introduce noise and benefit from domain adaptation.
- CLIP-aligned text/image priors are foundational; licensing and model access (FG-CLIP, LLaVA-OV, VGGT) affect deployment.
- Privacy gains from coordinate-only inputs are balanced by potential loss of semantic detail; enterprise policies should explicitly cover data retention and DPIA.
Glossary
- 3D Gaussian splatting: A representation/rendering technique that models a scene with Gaussian kernels in 3D to enable efficient view synthesis and alignment. "The key reason is that these methods primarily utilize point clouds~\cite{jia2024sceneverse,zhu20233d}, depth maps~\cite{mao2024opendlign,huang2023clip2point}, or 3D Gaussian splatting~\cite{li2025scenesplat,thai2025splattalk} for alignment"
- AdamW: An optimizer that decouples weight decay from gradient updates to improve training stability and generalization. "We adopt the AdamW~\cite{loshchilov2017decoupled} optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.98) with a learning rate of , a warm-up of 500 steps, and cosine decay scheduling."
- Canonical world coordinates: A fixed global coordinate frame that ensures multiple views share the same geometric reference. "Moreover, multi-view point maps are defined in canonical world coordinates, preserving the same global geometry as point clouds."
- Chamfer Distance: A symmetric distance between two point sets used to measure reconstruction alignment without requiring point-wise correspondence. "Here, we define the POMA-JEPA loss $\mathcal{L}_{\text{pjepa}}$ using the Chamfer Distance~\cite{fan2017point}, a widely used metric in masked point cloud modeling~\cite{pang2023masked} for its robustness to minor order misalignments, defined as:"
- CLIP: A large-scale vision–language model that learns aligned image–text embeddings via contrastive training. "Early 3D VLL methods align trainable 3D encoders with frozen 2D vision-language models (e.g., CLIP~\cite{radford2021learning})."
- Contrastive objectives: Training losses that pull matched pairs together and push mismatched pairs apart in embedding space. "3D vision-language learning (VLL) provide them through contrastive objectives without relying on downstream annotations."
- Context encoder: The trainable encoder that processes visible inputs to produce embeddings used for alignment and prediction. "Frozen image and language encoders are aligned with a trainable point map context encoder."
- CLS token: A special token embedding used to summarize the content of a sequence or image for downstream tasks. "where the three terms denote the [CLS] token embeddings of the point map, image, and per-view caption, respectively."
- Cosine decay scheduling: A learning rate schedule that decays the rate following a cosine curve to stabilize late-stage training. "We adopt the AdamW~\cite{loshchilov2017decoupled} optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.98) with a learning rate of , a warm-up of 500 steps, and cosine decay scheduling."
- Embodied localization: Identifying the agent’s active region in a scene from language descriptions of its state. "After pretraining, POMA-3D gains the ability to accurately locate the agent’s active region from textual queries in a zero-shot setting, a task we term embodied localization (see Fig.~\ref{fig:1})."
- Embodied navigation: Inferring actions or directions for an agent within a 3D environment based on multimodal context. "including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates)."
- Exact Match (EM@1, EM@10): Accuracy metrics counting whether the predicted answer exactly matches the ground truth among top-k outputs. "For LLM-based models, we report exact match (EM@1) scores, and for specialist models, we report both EM@1 and EM@10 metrics."
- Exponential Moving Average (EMA): A parameter update that averages model weights over time to stabilize targets. "the target encoder is updated as the EMA of the context encoder after each iteration."
- Extrinsic parameters: Camera pose parameters (rotation and translation) that map camera coordinates to world coordinates. "using the intrinsic matrix $K$ and the extrinsic parameters $[R \mid t]$, where $R$ and $t$ represent rotation and translation, respectively."
- Feed-forward 3D reconstruction models: Networks that directly predict 3D structure from images without iterative optimization. "This is enabled by recent advances in feed-forward 3D reconstruction models~\cite{wang2024dust3r,wang2025vggt,leroy2024grounding,huang2025no}."
- FG-CLIP: A CLIP variant with extended token capacity, used here as the frozen 2D backbone for alignment. "FG-CLIP~\cite{xie2025fg} is chosen as the backbone for view-to-scene vision–language alignment due to its extended text token capacity, enabling effective modeling of long scene descriptions."
- Intrinsic matrix: Camera parameters (focal lengths and principal point) mapping pixel coordinates to camera rays. "using the intrinsic matrix $K$ and the extrinsic parameters $[R \mid t]$, where $R$ and $t$ represent rotation and translation, respectively."
- JEPA (Joint Embedding-Predictive Architecture): A self-supervised framework that predicts masked/target embeddings from visible context embeddings. "Unlike traditional JEPAs~\cite{assran2023self, bardes2023v,assran2025v}, POMA-JEPA explicitly handles the order-agnostic nature of point maps in the world coordinate frame by enforcing permutation-invariant embedding prediction."
- LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning method that injects trainable low-rank matrices. "A context point map encoder, initialized from the frozen 2D image encoder and LoRA-finetuned, is used to bridge 3D and 2D modalities."
- Maximum sampling coverage: A frame selection strategy that chooses views covering the largest spatial extent of a scene. "Following the maximum sampling coverage strategy of Video-3D-LLM~\cite{zheng2025video}, the selected frames capture the maximum spatial extent of each scene."
- Mean pooling: Aggregating feature vectors by averaging them to form a single representation. "For scene-level alignment, the point map and image embeddings within each scene are mean-pooled into scene embeddings."
- Permutation-invariant: A property where the output does not depend on the ordering of input elements. "by enforcing permutation-invariant embedding prediction."
- Point cloud: A set of 3D points representing scene geometry without explicit connectivity. "Moreover, multi-view point maps are defined in canonical world coordinates, preserving the same global geometry as point clouds."
- Point map: A 2D grid where each pixel stores its corresponding 3D coordinate, preserving geometry while matching image-like input. "Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models."
- POMA-3D: The proposed self-supervised model that learns 3D scene representations from point maps. "In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps."
- POMA-JEPA: The paper’s JEPA variant tailored to point maps, enforcing multi-view geometric consistency. "we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views."
- R@M-N (Recall@N): Retrieval metric denoting the probability the correct item is in the top-N when composing queries from M texts. "The metric R@M-N denotes recall@N for retrieving the correct 3D scene from M referring texts."
- Scene retrieval: Selecting the correct 3D scene from a database given a language description. "including 3D question answering, embodied navigation, scene retrieval, and embodied localization"
- Visual grounding: Linking language expressions to specific objects or regions in a visual scene. "Early 3D understanding models were specialist, targeting specific 3D tasks such as instance segmentation~\cite{jiang2020pointgroup,hou20193d}, visual grounding~\cite{chen2020scanrefer,achlioptas2020referit3d}, or visual question answering (QA)~\cite{azuma2022scanqa,ma2022sqa3d}."
- Vision–Language Learning (VLL): Training paradigms that align visual and textual modalities to transfer semantic priors. "3D vision-language learning (VLL) provide them through contrastive objectives without relying on downstream annotations."
- ViT-B/16: A Vision Transformer variant (Base, 16-pixel patch size) used as the vision backbone. "The vision encoders are ViT-B/16 from FG-CLIP-Base~\cite{xie2025fg}."
- Voxelized patch aggregation: Converting 3D data into voxel patches to aggregate features for transformer-based models. "LLaVA-3D~\cite{zhu2024llava} extends 2D visual instruction tuning to 3D via voxelized patch aggregation"
- Zero-shot: Performing tasks without task-specific fine-tuning by leveraging aligned multimodal representations. "This cross-modal alignment enables 3D encoders to inherit rich knowledge from 2D counterparts, enabling zero-shot tasks such as object classification, retrieval, and detection~\cite{liu2023openshape, xue2023ulip, mao2024opendlign, zhou2023uni3d, xue2024ulip, qi2024shapellm}."