SurGe: Improved Surface Geometry in Point Maps
Abstract: Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.
Explain it Like I'm 14
What this paper is about
This paper is about teaching computers to understand the 3D shape of a scene from just one picture. Recent AI models already do a good job at getting the big, overall 3D layout right. But they often mess up small details, like thin poles, chair legs, or edges, which end up looking wavy, bumpy, or bent in 3D. The authors introduce SurGe, a new method that focuses on making these local details cleaner and more accurate.
The main questions the paper asks
- How can we measure the quality of small, local 3D shapes (like thin parts and sharp edges) in a way that clearly shows when a model gets them wrong?
- How can we train a model so that neighboring pixels in an image form smooth, consistent 3D surfaces instead of wavy or noisy ones?
- Can we design a better decoder (the part of the model that turns features into per-pixel 3D predictions) that captures fine details without creating new artifacts?
How they approached the problem (in simple terms)
Think of turning a photo into a 3D scene like building a sculpture based on a picture. Each pixel gets a 3D point, forming a “point map.” If nearby points don’t line up nicely, the surface looks wrinkly.
To fix this, the paper does three things:
- A better test for local surfaces: They add a new metric that looks at the direction each tiny surface patch faces. Imagine placing a tiny arrow on the surface at each point that shows which way it’s pointing. This “normal” tells you local shape. If the arrows flip around wildly, the surface is bad, even if the points are close. This metric makes local mistakes easy to see and measure.
- A better training rule to keep neighbors consistent: They introduce a “point gradient matching loss.” Instead of only checking if each predicted point is in the right place, they also check how neighboring points change relative to each other. They normalize these small differences by depth so the rule doesn’t care about the scene’s overall size (this property is called scale invariance). In everyday terms: rather than judging each height alone, they compare the slope between neighbors, and do it in a way that isn’t fooled if the whole object is closer or farther.
- A better decoder that focuses locally: They design a Neighborhood Attention Decoder (NAD). Attention is a method where the model “looks” more at the parts that matter. But full attention over an entire image is expensive and can cause blocky artifacts. Neighborhood Attention only looks around each spot’s nearby area, which is faster and better for local details. The decoder processes the image at multiple scales, getting progressively finer, so thin structures and edges are preserved without creating checkerboard or patchy artifacts.
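The training rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the choice to divide each forward difference by the depth of the base pixel, the L1 penalty, and the `eps` clamp are all assumptions. The optional `valid` mask is a stand-in for whatever pixel pairs the training setup excludes (the paper omits pairs near occlusion boundaries).

```python
import numpy as np

def normalized_forward_diffs(points, eps=1e-6):
    """Forward differences of a (H, W, 3) point map, divided by local depth z."""
    z = np.maximum(points[..., 2:3], eps)                # (H, W, 1), clamped away from 0
    dx = (points[:, 1:] - points[:, :-1]) / z[:, :-1]    # right neighbor minus pixel
    dy = (points[1:, :] - points[:-1, :]) / z[:-1, :]    # lower neighbor minus pixel
    return dx, dy

def point_gradient_matching_loss(pred, gt, valid=None):
    """Mean L1 gap between normalized finite differences of pred and gt."""
    if valid is None:
        valid = np.ones(pred.shape[:2], dtype=bool)
    pdx, pdy = normalized_forward_diffs(pred)
    gdx, gdy = normalized_forward_diffs(gt)
    # A pair only counts if both of its pixels are valid.
    mask_x = valid[:, 1:] & valid[:, :-1]
    mask_y = valid[1:, :] & valid[:-1, :]
    err_x = np.abs(pdx - gdx).sum(-1)[mask_x].sum()
    err_y = np.abs(pdy - gdy).sum(-1)[mask_y].sum()
    return float((err_x + err_y) / max(int(mask_x.sum() + mask_y.sum()), 1))

# Toy data (hypothetical): random points with depth between 1 and 5.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 5.0, size=(6, 8, 3))
pred = gt + 0.05 * rng.standard_normal(gt.shape)
print(point_gradient_matching_loss(pred, gt))
print(point_gradient_matching_loss(3.0 * pred, gt))  # rescaled prediction
```

The two printed values should agree up to floating-point rounding: multiplying the whole prediction by a constant scales both the differences and the depths, so the ratio is unchanged. That is the scale invariance the bullet describes.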
A few helpful analogies:
- Normals: Little arrows sticking out of the surface; if they twist randomly, the surface is messy.
- Gradient matching: Compare how two neighboring points go up or down, not just their absolute heights.
- Neighborhood Attention: Like focusing on nearby puzzle pieces to fit them together accurately, instead of scanning the whole table every time.
What they found and why it matters
The authors tested SurGe on eight standard benchmarks without any extra training on those datasets (this is called zero-shot). They measured three kinds of quality:
- Global accuracy of the whole 3D scene.
- Local point accuracy inside objects (after aligning each object).
- Local surface direction accuracy using their new “point map normal” metric.
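To make "accuracy after aligning" concrete, here is a rough sketch of a relative point error computed after a least-squares scale-and-shift alignment. The exact alignment and error definition used in the paper are not given in this summary, so the closed-form fit, the per-point error normalized by the ground-truth point norm, and the names `align_scale_shift` and `aligned_abs_rel` are assumptions for illustration. Passing an object mask gives a per-object (local) variant in the same spirit.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares scale s and translation t so that s * pred + t ~ gt; inputs are (N, 3)."""
    pm, gm = pred.mean(0), gt.mean(0)
    pc, gc = pred - pm, gt - gm
    s = (pc * gc).sum() / max((pc * pc).sum(), 1e-12)
    t = gm - s * pm
    return s, t

def aligned_abs_rel(pred, gt, mask=None):
    """Mean of ||aligned pred - gt|| / ||gt|| over the selected pixels."""
    pred = pred.reshape(-1, 3)
    gt = gt.reshape(-1, 3)
    if mask is not None:
        keep = mask.reshape(-1)
        pred, gt = pred[keep], gt[keep]
    s, t = align_scale_shift(pred, gt)
    err = np.linalg.norm(s * pred + t - gt, axis=-1)
    return float((err / np.maximum(np.linalg.norm(gt, axis=-1), 1e-12)).mean())

# Toy data (hypothetical): a prediction that is right up to scale and shift.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 5.0, size=(50, 3))
pred_exact = (gt - np.array([0.1, 0.2, 0.3])) / 2.5
pred_noisy = pred_exact + 0.05 * rng.standard_normal(gt.shape)
print(aligned_abs_rel(pred_exact, gt), aligned_abs_rel(pred_noisy, gt))
```

A prediction that differs from the ground truth only by scale and shift scores essentially zero, which is why a metric like this says little about small wavy errors unless they are large relative to the distance from the camera.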
Key results:
- SurGe gave the best average ranking on global 3D accuracy compared to strong recent methods.
- SurGe clearly improved local details: it consistently improved both local point accuracy and the new normal-based metric across the benchmarks.
- Visually, SurGe produced cleaner thin structures (like street signs, poles, and edges) and smoother, more stable surfaces.
Why it matters:
- Many real-world uses (robotics, AR/VR, mapping, 3D content creation) need both good overall structure and clean local detail. If edges wobble or thin parts bend, interactions and measurements can fail. SurGe improves exactly these weak spots.
What this could change going forward
- Better evaluation: The new normal-based metric makes it easier for researchers to see and fix local surface problems. This could become a standard way to assess 3D quality.
- Better training recipes: The point gradient matching loss shows that teaching models using neighbor-to-neighbor comparisons, in a scale-invariant way, leads to smoother, more realistic surfaces. Others can adapt this idea to different 3D tasks.
- Better architectures: The Neighborhood Attention Decoder suggests you can get the benefits of attention for fine details without the heavy cost or patchy artifacts. Future models for depth, surface normals, or even image restoration may use similar neighborhood attention blocks.
In short, SurGe doesn’t just make the big 3D picture look right; it also makes the small details precise and clean. That combination is crucial if we want reliable 3D from a single image in everyday applications.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The following list highlights what remains missing, uncertain, or unexplored in the paper and can guide future research:
Evaluation metrics and analysis
- Quantify how well the proposed point map normal metric correlates with human judgments of local surface quality across diverse scenes.
- Analyze sensitivity of the normal metric to ground-truth sparsity, noise, and resampling (e.g., LiDAR, SfM), and provide variants robust to sparse or noisy labels.
- Study the metric’s dependence on stencil choice (four-neighborhood cross products) and evaluate alternative normal estimators (diagonal/8-neighborhood, multi-scale, robust estimators).
- Investigate whether global rotational misalignments (if any) bias the normal metric, and whether a rotation-invariant evaluation variant is needed.
- Provide ablations demonstrating metric stability across image resolutions and aspect ratios.
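For readers who want to experiment with the stencil questions above, here is one possible normal estimator and angular-error metric. It is a sketch under assumptions: it uses a single cross product of central differences over the four-neighborhood, skips border pixels, and ignores masks. The paper's exact stencil and mask handling may differ, and the function names are hypothetical.

```python
import numpy as np

def point_map_normals(points, eps=1e-8):
    """Unit normals at interior pixels of a (H, W, 3) point map."""
    du = points[1:-1, 2:] - points[1:-1, :-2]   # right neighbor minus left neighbor
    dv = points[2:, 1:-1] - points[:-2, 1:-1]   # lower neighbor minus upper neighbor
    n = np.cross(du, dv)
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), eps)

def mean_angular_error_deg(pred, gt):
    """Mean angle in degrees between normals induced by pred and by gt."""
    cos = (point_map_normals(pred) * point_map_normals(gt)).sum(-1)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

# Toy data (hypothetical): a flat plane and a plane tilted by 45 degrees.
ys, xs = np.mgrid[0:8, 0:8].astype(float)
flat = np.stack([xs, ys, np.full_like(xs, 5.0)], axis=-1)
tilted = np.stack([xs, ys, 5.0 + xs], axis=-1)
print(mean_angular_error_deg(tilted, flat))
```

Swapping the two difference lines for diagonal or multi-scale differences is a direct way to test how sensitive the reported error is to the stencil choice.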
Loss design and supervision
- Evaluate alternative normalizations in the point gradient matching (PGM) loss (e.g., average depth, geometric mean, per-pair learned scale, epsilon-clipped normalization) and their stability near very small z.
- Extend PGM beyond 4-neighborhood forward differences to include diagonal neighbors, larger stencils, or multi-scale gradients, and assess rotational and scale robustness.
- Compare PGM directly to normal/edge-angle losses and higher-order (curvature/second-derivative) surface losses under identical training protocols.
- Explore combining PGM with explicit normal-based losses or learned confidence-weighted gradient matching to handle outliers and label noise.
- Validate the occlusion-boundary masking strategy: quantify how masking impacts boundary fidelity and whether learnable or geometry-aware boundary handling performs better without relying on dataset-specific edge conventions.
- Perform a systematic study of loss-weight sensitivity (including the fixed weight 10 for PGM) across datasets and label qualities.
Architectural choices and efficiency
- Provide inference-time compute, memory, and latency benchmarks for Neighborhood Attention Decoder (NAD) versus ConvStack/DPT/ViT decoders across resolutions; assess real-time feasibility.
- Ablate Neighborhood Attention hyperparameters (window size, number of heads, head dimension), dilation, and window scheduling across stages to understand accuracy–efficiency trade-offs.
- Compare NAD to alternative attention paradigms (e.g., Swin-style shifted windows, deformable attention, dynamic convolution/modulated kernels) for local mixing.
- Study the impact of removing LayerNorm on training stability and generalization across backbones, precisions (e.g., FP16/bfloat16), and hardware.
- Analyze the reliance on the encoder for long-range context: does NAD’s local attention limit very long-range interactions, and can occasional global or cross-scale attention improve consistency?
- Test scalability of NAD with larger/smaller backbones and different token budgets; provide scaling laws for performance vs. compute.
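As background for the hyperparameters discussed above (especially window size), the following is a deliberately naive, single-head sketch of neighborhood attention on a small feature map. It is not the NAD block: there are no learned projections, heads, RoPE, or QK normalization, and the border handling (shifting the window so it stays inside the map) is an assumption about how neighborhoods are clamped.

```python
import numpy as np

def neighborhood_attention(q, k, v, window=3):
    """Naive single-head neighborhood attention.

    q, k, v: (H, W, C) arrays; window: odd size with window <= min(H, W).
    """
    H, W, C = q.shape
    r = window // 2
    out = np.empty(v.shape, dtype=float)
    for i in range(H):
        # Shift the window at borders so it always covers `window` rows.
        i0 = min(max(i - r, 0), H - window)
        for j in range(W):
            j0 = min(max(j - r, 0), W - window)
            keys = k[i0:i0 + window, j0:j0 + window].reshape(-1, C)
            vals = v[i0:i0 + window, j0:j0 + window].reshape(-1, C)
            logits = keys @ q[i, j] / np.sqrt(C)
            weights = np.exp(logits - logits.max())   # numerically stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ vals
    return out

# Toy data (hypothetical): a 6x6 map with 4 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))
y = neighborhood_attention(x, x, x, window=3)
print(y.shape)
```

Each query compares against window × window keys instead of all H × W positions, so the cost grows with the window area rather than with the image area. That is the trade-off the window-size ablation would probe.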
Generalization and robustness
- Assess robustness to image degradations (noise, blur, compression, exposure changes, rolling shutter) and to challenging material properties (transparency, specularity) that can destabilize local geometry.
- Evaluate performance under diverse camera intrinsics and sensor models; characterize sensitivity to intrinsics miscalibration and propose intrinsic-aware conditioning if needed.
- Provide analysis on hard categories (e.g., extremely thin structures against highly textured/far backgrounds, textureless regions, repetitive patterns), including failure cases and targeted remedies.
- Explore uncertainty estimation for local geometry (per-pixel confidence) and its use in loss weighting, post-processing, and downstream tasks.
Scope and applicability
- Investigate applicability to multi-view architectures: can NAD and PGM transfer to many-view feedforward models without introducing patch-aligned or window-aligned artifacts?
- Examine whether the normal metric can guide test-time refinement or self-supervised adaptation, and whether PGM can be used in semi-/self-supervised or sparse-label regimes.
- Address absolute scale recovery: the work focuses on affine-invariant evaluation; methods to predict or calibrate metric scale (without external cues) remain unexplored.
- Analyze domain and data-mix effects: quantify the contribution of each dataset category, the role of synthetic vs. real data, and identify optimal curricula for local-surface learning.
- Consider joint prediction of auxiliary geometry (e.g., explicit normals, curvature, occlusion boundaries) and their synergies with point maps under the proposed training scheme.
Reproducibility and standardization
- Standardize and release evaluation code/checklists for the normal metric (including handling of masks and boundary cases) to facilitate broader adoption and comparability.
- Provide detailed reporting on training stability (seed variance), hyperparameter robustness, and ablations that isolate the individual contributions of NAD and PGM at full training scale.
Practical Applications
Immediate Applications
Below are concrete ways SurGe’s findings and components can be applied today across sectors. Each item names the sector(s), the use case, likely tools/workflows, and key assumptions/dependencies.
- Sector: Robotics, Industrial Automation
  - Use case: Safer navigation and manipulation around thin structures (cables, rods, fences), better grasp planning on slender objects, improved obstacle avoidance in cluttered scenes from a single onboard camera.
  - Tools/workflows: Drop-in point-map module in perception stacks; SurGe as a prior for visual odometry/SLAM; local surface normals to filter spurious points and enforce grasp pose constraints.
  - Assumptions/dependencies: Monocular scale ambiguity must be resolved (IMU/wheel odometry/stereo/known object sizes); GPU availability on robot; domain adaptation may be needed for factory/surgical lighting.
- Sector: AR/VR, Mobile
  - Use case: Higher-fidelity occlusion and placement in AR (correct handling of chair legs, wires, handles), improved collision proxies in AR physics, single-image room captures for quick previews.
  - Tools/workflows: Integrate SurGe as a depth/point-map provider in ARKit/ARCore pipelines; generate mesh proxies from point maps; use point-normal metric in QA to gate AR releases.
  - Assumptions/dependencies: For metric scale, rely on device sensors (ToF/LiDAR, IMU); latency budgets may require model distillation or reduced-token variants.
- Sector: VFX, Games, Creative Tools
  - Use case: Single-frame geometry extraction for relighting, defocus, fog, and volumetric effects; cleaner meshes from thin geometry for asset integration; improved matte edges via normals.
  - Tools/workflows: DCC plugins (Blender, Unreal, Nuke) that convert SurGe point maps to watertight meshes; normal-aware smoothing to retain sharp edges without ripples; QA with the normal-based metric.
  - Assumptions/dependencies: Studio compute or workstation GPU; scene scale derived from production metadata or SfM alignment.
- Sector: E-commerce, 3D Digitization
  - Use case: Rapid 3D previews for products with thin parts (jewelry, eyewear, furniture frames) from sparse images; background separation leveraging consistent local surfaces.
  - Tools/workflows: Point-to-mesh pipelines with normal-guided meshing; automated quality checks with point-map normal error; SurGe-initialized MVS to reduce capture time.
  - Assumptions/dependencies: Multi-view capture still preferred for metric scale and completeness; domain tuning for studio lighting/backgrounds.
- Sector: Construction, Real Estate, Interior Design
  - Use case: Quick scene measurement proxies and spatial planning from a single photo (e.g., measuring fixtures, detecting protrusions), cleaner floorplan cues from thin edges (moldings, rails).
  - Tools/workflows: Mobile app that predicts point maps, extracts planes and edge segments via normals; mesh export to CAD/BIM.
  - Assumptions/dependencies: Metric scale from known references (door height) or AR frameworks; accuracy bounds communicated to users.
- Sector: Mapping/Photogrammetry/NeRF and 3D Gaussian Splatting pipelines
  - Use case: Faster and more stable multi-view reconstruction by seeding with SurGe point maps; fewer artifacts on thin structures and repetitive textures.
  - Tools/workflows: Use SurGe outputs as priors for MVS, NeRF, or 3DGS initialization; normal-based metric to pick frames with reliable geometry for pose-graph optimization.
  - Assumptions/dependencies: Consistent intrinsics/extrinsics; pipeline integration handles monocular scale alignment across views.
- Sector: Research and ML Engineering (Academia/Industry)
  - Use case: Better training signals for point-map models via the point gradient matching loss; stronger local-geometry evaluation with the proposed normal metric; NAD as a drop-in decoder for dense geometry tasks.
  - Tools/workflows: Replace or augment edge/normal losses with depth-normalized finite-difference supervision; add point-map normal metric to validation dashboards; adopt NAD blocks to reduce patch artifacts without full attention cost.
  - Assumptions/dependencies: Access to training data with sufficient label density/quality; reproduction of training schedules; licensing of pretrained backbones (e.g., DINOv2).
- Sector: Standards, Safety, Policy
  - Use case: Procurement and benchmark criteria that include local-surface quality (thin-structure fidelity) beyond global errors; regression tests for AD/robotics perception stacks.
  - Tools/workflows: Incorporate the point-map normal mean angular error into evaluation suites; require local-instance metrics in RFPs and model cards.
  - Assumptions/dependencies: Community and vendor agreement to report additional metrics; availability of instance masks or dense GT for evaluation.
Long-Term Applications
These opportunities become feasible with further research, scaling, or system integration.
- Sector: Autonomous Driving, Advanced Robotics
  - Use case: Robust perception of road furniture (signposts, poles, wires), debris detection, and fine manipulation (cable routing, textile handling) using monocular cameras.
  - Tools/workflows: Sensor fusion to resolve scale (LiDAR/radar/IMU), real-time SurGe variants on automotive-grade SoCs, continual learning with normal-based local-surface QA.
  - Assumptions/dependencies: Real-time performance on edge; safety validation under adverse weather/night; domain-shift robustness.
- Sector: Consumer AR Glasses and Edge Devices
  - Use case: On-device, persistent AR with high-fidelity occlusion and physics from glance-level captures; live scene editing.
  - Tools/workflows: Model compression (quantization/pruning) of NAD, knowledge distillation; mixed-precision kernels and neighborhood-attention accelerators.
  - Assumptions/dependencies: Hardware support for efficient neighborhood attention; energy constraints; privacy-compliant on-device inference.
- Sector: Cultural Heritage, Inspection, Metrology
  - Use case: Accurate reconstruction of fine ornamentation, crack detection, and defect metrology from limited imagery.
  - Tools/workflows: Domain-adapted SurGe models; normal-guided meshing with uncertainty estimates; integration into inspection robots/drones.
  - Assumptions/dependencies: Calibrated capture protocols; certified accuracy and traceability to standards.
- Sector: Medical Imaging and Endoscopy (Research Transfer)
  - Use case: Monocular 3D reconstruction in endoscopy (vessel and tool geometry), with improved fidelity on thin structures.
  - Tools/workflows: Adapt NAD and pairwise depth-normalized supervision to medical domain; synthetic-to-real transfer with curated datasets.
  - Assumptions/dependencies: Regulatory approval, domain-specific training data, handling of specularities/fluids.
- Sector: Generative 3D and Video
  - Use case: Geometry-consistent conditioning for 3D generative models and text-to-4D; normal-based losses for training generative models with stronger local surface priors.
  - Tools/workflows: Use SurGe point maps as geometry supervision for diffusion/autoregressive 3D pipelines; enforce normal-consistency constraints across frames.
  - Assumptions/dependencies: Scalable training on multimodal datasets; temporal consistency modules; license compatibility for pretrained encoders.
- Sector: Geospatial/Remote Sensing
  - Use case: Monocular 3D inference for aerial/satellite images to aid DSM/DTM refinement, especially for man-made thin structures (powerlines, poles).
  - Tools/workflows: Domain adaptation with aerial datasets; fusion with stereo/LiDAR; QA with normal metric for small-structure completeness.
  - Assumptions/dependencies: Different imaging geometries and scales; atmospheric effects; rigorous georeferencing.
- Sector: Benchmarking and Regulation
  - Use case: New standard suites emphasizing local-surface fidelity for safety-critical perception; certification processes that penalize surface ripples/blockiness.
  - Tools/workflows: Public benchmarks reporting point-map normal errors alongside global metrics; dataset design prioritizing thin-structure annotations.
  - Assumptions/dependencies: Community adoption; tooling to compute normals from diverse GT sources.
- Sector: Platforms and Tooling
  - Use case: Turnkey SDKs that output point maps, normals, and meshes from single images; “thin-structure mode” in capture apps; training libraries implementing the point gradient matching loss and NAD.
  - Tools/workflows: Open-source packages and commercial APIs; CAD/DCC plugins; AutoML recipes with local-surface metrics.
  - Assumptions/dependencies: Sustainable maintenance; interoperability with existing photogrammetry/engine toolchains; IP/licensing for pretrained components.
Cross-cutting assumptions and dependencies
- Scale ambiguity: Monocular predictions are affine-invariant; applications requiring metric units must fuse additional cues (IMU, stereo, LiDAR, known dimensions).
- Compute and latency: The reference model uses a large ViT (DINOv2-L) and multi-stage decoder; mobile/real-time deployments require compression or smaller backbones.
- Data and domain shift: Zero-shot generalization is strong but not guaranteed; fine-tuning and targeted data curation may be necessary for specialized domains.
- Integration maturity: The point-map normal metric and the point gradient matching loss are immediately useful for model training/evaluation, but downstream pipelines may need engineering to exploit normals and improved local surfaces (e.g., meshing, uncertainty, QA gates).
- Licensing and reproducibility: Use of pretrained backbones and datasets must comply with licenses; reproducing training at full scale requires significant compute (multi-GPU).
Glossary
- AbsRel: Absolute Relative Error; a common metric in depth/geometry estimation measuring relative point error after alignment. "global affine-invariant AbsRel evaluates all valid pixels"
- AdamW: An optimization algorithm that decouples weight decay from the gradient update to improve training stability. "We optimize with AdamW~\cite{loshchilov2019adamw} for $120$K steps at total batch size $128$, using peak learning rates for the decoder and for the backbone, and a reciprocal square root schedule~\cite{zhai2022scalingvisiontransformers} with $1$K warmup steps and a cooldown."
- Affine-invariant: Property of methods/metrics that are unchanged under affine transformations (scaling, translation) of 3D coordinates. "affine-invariant point map AbsRel metrics in global and instance-wise local forms."
- Affine-invariant depth: Depth parameterization invariant to affine transformations (often in log space), enabling scale- and shift-agnostic supervision. "predict affine-invariant depth in log space instead of point maps."
- Cross-attend: An attention mechanism where one set of features attends to another set (often across resolutions or modalities). "utilizes NA to cross-attend between high-resolution image features of a small CNN and a low-resolution feature map of a vision foundation model in order to compute high-resolution features."
- Cross-view completion: A pretraining objective that learns to predict or fill in features across different camera views. "a ViT pretrained with cross-view completion~\cite{weinzaepfel2022croco_v1,weinzaepfel2023croco_v2}"
- DINOv2: A large-scale self-supervised vision foundation model used to initialize Vision Transformers. "SurGe combines a DINOv2~\cite{oquab2023dinov2} encoder with our Neighborhood Attention Decoder (NAD)."
- DPT head: A decoder architecture (from Dense Prediction Transformers) for upsampling and predicting dense maps from ViT features. "This decoder is often implemented as a DPT head~\cite{ranftl2021dpt,yang2024depth_anything_v1,yang2024depth_anything_v2,bochkovskii2024depthpro}"
- Edge angle loss: A surface-consistency loss that penalizes differences in angles between edges to encourage coherent local geometry. "point normals and edge angle losses supervise local orientation but do not constrain displacement magnitude"
- FFN (Feed-Forward Network): The position-wise MLP block inside Transformer layers that processes each token independently. "a Transformer-style~\cite{vaswani2017attention} residual block with a Neighborhood Attention layer and a pointwise FFN"
- Finite differences: Discrete approximations of derivatives computed as differences between neighboring pixels/points. "a point gradient matching loss that supervises depth-normalized 3D finite differences"
- LayerNorm (Layer Normalization): A normalization technique applied to activations to stabilize and accelerate training. "omits the usual pre-attention and pre-FFN LayerNorm~\cite{ba2016layernormalization} layers."
- LiDAR: A sensor that measures distances using laser light, providing sparse or dense depth annotations. "for LiDAR labels we keep only and ."
- Log-depth gradient matching: A loss comparing gradients of log-depth, yielding scale-invariant supervision across neighboring pixels. "the log-depth gradient matching loss originally proposed for monocular depth~\cite{li2018megadepth}"
- Monocular geometry estimation: Recovering dense 3D structure from a single image. "Monocular geometry estimation seeks to recover dense 3D scene structure from a single image."
- Neighborhood Attention: An attention variant where each token attends only to a local spatial neighborhood to reduce cost and inject locality. "Neighborhood Attention (NA)~\cite{hassani2023nat}"
- Neighborhood Attention Decoder (NAD): The proposed multiscale decoder that uses Neighborhood Attention for local feature mixing and upsampling. "a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing."
- Occlusion boundaries: Image regions where surfaces overlap in depth, often causing ambiguous or inconsistent local supervision. "omit pairs near occlusion boundaries."
- Pairwise scale invariance: A property where supervision depends on relative scales within local pixel pairs, not global scale. "while preserving its pairwise scale invariance."
- Pixel shuffle: An upsampling operation that rearranges channel data into higher spatial resolution. "a final pixel shuffle~\cite{shi2016pixelshuffle} for point predictions."
- Point gradient matching loss: The proposed loss comparing depth-normalized 3D finite differences between neighboring predicted and ground-truth points. "a point gradient matching loss that supervises depth-normalized 3D finite differences"
- Point map: A dense 3D representation assigning each pixel a 3D point in space. "predict a point map, assigning each pixel a 3D point."
- Point map normal metric: An evaluation metric that compares local surface normals induced by neighboring 3D point differences. "we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions."
- Point map normals: Surface normals computed from local differences in the point map to characterize orientation of surfaces. "instead compares point map normals induced by neighboring point differences, and therefore reflects this degradation."
- QK normalization: A stabilization technique that normalizes queries and keys in attention to control attention logits. "we use QK~normalization~\cite{zhai2022scalingvisiontransformers,dehghani2023vit22b}."
- Reciprocal square root schedule: A learning rate schedule that decays proportionally to the reciprocal of the square root of training step. "a reciprocal square root schedule~\cite{zhai2022scalingvisiontransformers}"
- RoPE (Rotary Positional Embeddings): A method for injecting relative positional information into attention via rotations in feature space. "use window-matched RoPE~\cite{su2024rope}"
- ROE alignment: An affine alignment procedure (from MoGe) applied before point-map loss computation. "the global affine-invariant point map loss with ROE~\cite{wang2025moge} alignment"
- Scale-invariant: Unaffected by global scaling, often desirable in depth/geometry losses and metrics. "A scale-invariant point gradient matching loss, inspired by log-depth gradient matching~\cite{li2018megadepth}"
- SfM (Structure-from-Motion): A technique that recovers 3D structure and camera motion from image sequences. "for SfM labels we omit and "
- Transposed convolution: A learned upsampling layer that increases spatial resolution by reversing the convolution operation. "a transposed convolution with stride $2$, followed by a convolution"
- Unprojection: Converting depth (or similar scalar fields) into 3D points using camera intrinsics or reference geometry. "Unprojecting affine-invariant log-depth to 3D geometry requires a reference point map"
- ViT (Vision Transformer): A Transformer architecture applied to images via patch tokenization for feature encoding/decoding. "a ViT~\cite{dosovitskiy2021vit} encoder with a Neighborhood Attention Decoder (NAD, \cref{sec:arch})."
- Zero-shot: Evaluation without task-specific fine-tuning on the target datasets. "Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank"
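One glossary entry, pixel shuffle, is easier to grasp with a tiny example. The sketch below rearranges channels into a higher-resolution grid in plain NumPy. The channel ordering follows the common convention in which each group of r × r consecutive channels fills one r × r output block; that ordering, the function name, and the toy input are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C * r * r, H, W) features into (C, H * r, W * r)."""
    c_in, h, w = x.shape
    assert c_in % (r * r) == 0, "channel count must be divisible by r * r"
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, row offset, col offset)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, row offset, w, col offset)
    return x.reshape(c, h * r, w * r)  # merge offsets into the spatial axes

# Toy data (hypothetical): 4 channels of size 2x2 become 1 channel of size 4x4.
features = np.arange(4 * 2 * 2).reshape(4, 2, 2)
upsampled = pixel_shuffle(features, 2)
print(upsampled.shape)
print(upsampled[0])
```

No values are invented or interpolated in this step: every output pixel is copied from exactly one input channel, so the layers before it decide what the fine detail looks like.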