
SurGe: Improved Surface Geometry in Point Maps

Published 29 May 2026 in cs.CV | (2605.31577v1)

Abstract: Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.

Summary

  • The paper introduces a new point map normal metric to measure local surface orientation using mean angular error.
  • The paper presents a novel point gradient matching loss and an attention-based decoder that accurately captures thin and complex 3D structures.
  • The paper reports the best average rank for global point map AbsRel across eight zero-shot monocular geometry benchmarks, together with consistent improvements in local point map and point map normal evaluations.

SurGe: Improved Surface Geometry in Point Maps – A Technical Analysis

Introduction and Motivation

Monocular 3D scene reconstruction has advanced rapidly with feedforward architectures that predict dense 3D point maps. While these methods achieve strong global geometric accuracy, they commonly exhibit errors in local surface geometry: inconsistent surface orientation, distortions on thin structures, and high-frequency surface noise. Standard global pointwise error metrics reflect these errors only weakly, which motivates a shift in both evaluation and modeling toward local geometric consistency.

SurGe directly targets this gap with a new evaluation metric, a new training loss, and a new decoder, improving local surface geometry while retaining strong global accuracy.

Methodological Contributions

SurGe makes three principal technical contributions:

  1. Point Map Normal Metric: A new evaluation metric that quantifies the mean angular error between local surface normals derived from point map neighborhoods, providing a direct measure of local surface orientation correctness.
  2. Point Gradient Matching Loss ($\mathcal{L}_{\text{pgm}}$): A novel, scale-invariant, pairwise loss function supervising the local orientation and magnitude of 3D point gradients. This loss extends ideas from the log-depth gradient matching loss but adapts them coherently for vector-valued (3D) point maps, enforcing local coherence at a finer geometrical level.
  3. Neighborhood Attention Decoder (NAD): An attention-based upsampling decoder that, unlike convolutional alternatives, uses Neighborhood Attention for local, content-adaptive feature mixing at multiple resolutions. This design avoids the limitations of spatial convolution (fixed, content-independent kernels) and pure Transformer decoders (patch-level artifacts), and empirically produces more accurate geometry on thin structures and in regions with high spatial complexity.
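
As a rough illustration of the first contribution, the sketch below estimates normals from a point map and reports the mean angular error between prediction and ground truth. It is a NumPy sketch under assumptions, not the paper's evaluation code: the function names, the four-neighbor cross-product stencil, and the restriction to interior pixels are illustrative choices, and validity masks are omitted.

```python
import numpy as np

def point_map_normals(points, eps=1e-12):
    """Unit normals for the interior pixels of an (H, W, 3) point map."""
    c = points[1:-1, 1:-1]
    right = points[1:-1, 2:] - c
    left = points[1:-1, :-2] - c
    down = points[2:, 1:-1] - c
    up = points[:-2, 1:-1] - c
    # Assumed stencil: sum the cross products of the four adjacent neighbor pairs.
    n = (np.cross(right, down) + np.cross(down, left)
         + np.cross(left, up) + np.cross(up, right))
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), eps)

def mean_angular_error_deg(pred_points, gt_points):
    """Mean angle in degrees between predicted and ground-truth normals."""
    n_pred = point_map_normals(pred_points)
    n_gt = point_map_normals(gt_points)
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

# Toy check: a fronto-parallel plane versus a plane tilted by 45 degrees.
v, u = np.meshgrid(np.arange(5.0), np.arange(6.0), indexing="ij")
flat = np.stack([u, v, np.full_like(u, 2.0)], axis=-1)
tilted = np.stack([u, v, 2.0 + u], axis=-1)
err_same = mean_angular_error_deg(flat, flat)
err_tilt = mean_angular_error_deg(tilted, flat)
```

In this toy case `err_same` is 0 and `err_tilt` is 45 degrees up to floating-point error. Because the normals depend only on the direction of neighboring differences, multiplying a point map by a positive constant leaves the result unchanged.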

Architectural Design

The SurGe pipeline comprises a DINOv2-initialized ViT-Large encoder followed by the NAD. The NAD operates in five progressive upsampling stages, each composed of multiple NAD blocks employing Neighborhood Attention. Stages execute feature mixing and upsampling, culminating in a per-pixel 3D point map output using an $(\xi,\eta,\rho)$ parameterization with exponential scale. The NAD intentionally omits LayerNorm; instead, it relies on QK normalization for training stability, hypothesizing that this better preserves activation scales relevant for regression tasks.

Critically, NAD's local attention combines the spatial sensitivity of convolutions with the adaptivity of feature-driven attention. It enables precise recovery of high-frequency geometric details while remaining computationally tractable at high resolution.
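
To make the local mixing concrete, here is a minimal single-head sketch of neighborhood attention with QK normalization. Everything in it is an assumption for illustration: the function names, the L2 form of QK normalization, the fixed logit `scale`, and the border handling (the window is shifted to stay inside the image). The actual NAD block is multi-head, includes a pointwise FFN, and would rely on fused kernels rather than Python loops.

```python
import numpy as np

def l2_normalize(x, eps=1e-6):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def neighborhood_attention(x, w_q, w_k, w_v, kernel=3, scale=10.0):
    """Each pixel of an (H, W, C) feature map attends to a kernel x kernel window."""
    H, W, _ = x.shape
    assert kernel % 2 == 1 and kernel <= min(H, W)
    q = l2_normalize(x @ w_q)  # QK normalization (assumed L2 form)
    k = l2_normalize(x @ w_k)
    v = x @ w_v
    out = np.empty_like(v)
    r = kernel // 2
    for i in range(H):
        i0 = min(max(i - r, 0), H - kernel)  # shift the window at the borders
        for j in range(W):
            j0 = min(max(j - r, 0), W - kernel)
            k_win = k[i0:i0 + kernel, j0:j0 + kernel].reshape(-1, k.shape[-1])
            v_win = v[i0:i0 + kernel, j0:j0 + kernel].reshape(-1, v.shape[-1])
            logits = scale * (k_win @ q[i, j])
            weights = np.exp(logits - logits.max())  # numerically stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ v_win
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
y = neighborhood_attention(x, w_q, w_k, w_v)

# Perturbing a far-away pixel leaves the output at (0, 0) untouched.
x_far = x.copy()
x_far[7, 7] += 1.0
y_far = neighborhood_attention(x_far, w_q, w_k, w_v)
```

The mixing weights depend on the features (unlike a convolution kernel), while the cost per pixel is bounded by the window size rather than by the image size.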

Learning Objective

The supervision scheme combines:

  • Global affine-invariant point map loss
  • Multi-scale local patch-wise losses (with various spatial support)
  • The new point gradient matching loss ($\mathcal{L}_{\text{pgm}}$)

The $\mathcal{L}_{\text{pgm}}$ term matches depth-normalized finite 3D point differences between neighboring pixels, enforcing that local 3D structures are not only positionally accurate but also form locally coherent surfaces. The loss is weighted empirically (factor 10 relative to the global loss) and is masked near occlusion boundaries to avoid irreducible errors due to ground truth ambiguity.
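
The following NumPy sketch shows one way such a loss could look. The exact normalization and penalty are not given in this summary, so the choices here are assumptions: forward differences to the right and bottom neighbors, division by the mean depth of each pixel pair, an L1 penalty, and optional boolean pair masks standing in for the occlusion-boundary masking. All function and argument names are hypothetical.

```python
import numpy as np

def depth_normalized_gradients(points, eps=1e-6):
    """Forward differences of an (H, W, 3) point map, divided by pair depth."""
    z = points[..., 2]
    dx = points[:, 1:] - points[:, :-1]   # right neighbor minus pixel
    dy = points[1:, :] - points[:-1, :]   # bottom neighbor minus pixel
    zx = 0.5 * (z[:, 1:] + z[:, :-1])     # assumed normalizer: mean pair depth
    zy = 0.5 * (z[1:, :] + z[:-1, :])
    return (dx / np.maximum(zx, eps)[..., None],
            dy / np.maximum(zy, eps)[..., None])

def point_gradient_matching_loss(pred, gt, mask_x=None, mask_y=None):
    """L1 distance between depth-normalized gradients of pred and gt.

    mask_x has shape (H, W-1) and mask_y has shape (H-1, W); False entries
    drop pixel pairs, e.g. those near occlusion boundaries.
    """
    pdx, pdy = depth_normalized_gradients(pred)
    gdx, gdy = depth_normalized_gradients(gt)
    ex = np.abs(pdx - gdx).sum(axis=-1)
    ey = np.abs(pdy - gdy).sum(axis=-1)
    if mask_x is not None:
        ex = ex[mask_x]
    if mask_y is not None:
        ey = ey[mask_y]
    return float(np.concatenate([ex.ravel(), ey.ravel()]).mean())

rng = np.random.default_rng(0)
v, u = np.meshgrid(np.arange(4.0), np.arange(5.0), indexing="ij")
gt = np.stack([0.1 * u, 0.1 * v, 2.0 + 0.05 * u], axis=-1)
pred = gt + 0.01 * rng.normal(size=gt.shape)
loss = point_gradient_matching_loss(pred, gt)
loss_scaled = point_gradient_matching_loss(3.0 * pred, gt)
```

Scaling the prediction by a positive constant scales each difference and its depth normalizer by the same factor, so `loss_scaled` matches `loss` up to rounding. This is the pairwise scale invariance the text refers to. With the stated weighting, the term would enter the total objective multiplied by 10.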

Empirical Results and Analysis

SurGe is evaluated on eight standard zero-shot monocular geometry benchmarks (including NYUv2, KITTI, ETH3D, iBims-1, GSO, Sintel, DDAD, DIODE), employing both established global alignment metrics and the newly proposed point map normal error. Key results include:

  • Best average rank for global point map AbsRel versus prior monocular and many-view models.
  • Consistent improvement in local surface evaluations, with SurGe improving both local point map and point map normal errors across the benchmarks.
  • Qualitative advances on thin and complex structures, with reduced blockiness, bending, or oscillations versus convolutional and Transformer decoder alternatives.
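
Average rank is a summary statistic, so it helps to see how it is computed. The sketch below ranks methods per benchmark and averages the ranks. The error values are hypothetical placeholders chosen for illustration and are not results from the paper; the function name is likewise an assumption.

```python
import numpy as np

def average_rank(errors):
    """errors: (n_methods, n_benchmarks), lower is better. Rank 1 is best.

    Ties are broken by row order in this simple sketch.
    """
    ranks = errors.argsort(axis=0).argsort(axis=0) + 1
    return ranks.mean(axis=1)

# Hypothetical errors for three methods on three benchmarks.
errors = np.array([
    [0.05, 0.10, 0.07],
    [0.06, 0.08, 0.09],
    [0.04, 0.12, 0.08],
])
avg = average_rank(errors)
best_method = int(avg.argmin())
```

In this toy table the first method has the best average rank even though it has the lowest error on only one benchmark, which is why a best average rank should not be read as the best score on every dataset.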

Ablation studies further elucidate contributions:

  • NAD outperforms both convolutional decoders (ConvStack, DPT head) and ViT decoders across all regimes, with the gap especially pronounced in local metrics.
  • Replacing $\mathcal{L}_{\text{pgm}}$ with conventional surface losses (normal angle, log-depth gradient) reduces both local and often global geometric fidelity.
  • Scaling convolutional decoder capacity narrows the gap but does not close it, emphasizing the efficacy of attention-based local mixing.

There are modest increases in inference latency and memory relative to ConvStack-L, but the computational overhead is marginal compared to encoder costs and justified by accuracy gains.

Implications and Future Directions

SurGe shows that substantial improvements in local 3D geometry are possible with architectural and loss-function changes tailored to geometric coherence. The point map normal metric makes a previously under-explored evaluation axis explicit and may influence future model selection and development in geometric vision.

Practically, these advances are relevant to downstream 3D perception tasks that are sensitive to local geometry, such as SLAM, scene interaction, and robotic manipulation.

Further research directions include:

  • Architectural efficiency: Developing lighter-weight attention mechanisms to reduce decoder overhead at high resolution.
  • Generalization to sparser or lower-quality labels: Adapting losses and architecture for real-world data regimes.
  • Extension to dynamic and sequence data: Investigating temporal consistency and generalizing local surface metrics across time.

Conclusion

SurGe advances monocular 3D reconstruction by directly targeting local surface geometry, combining a new evaluation metric, a pairwise supervision loss, and an attention-based multiscale decoder. The resulting model achieves the best average rank for global point map AbsRel across eight zero-shot benchmarks and consistently improves local point map and point map normal evaluations. It makes a case for prioritizing local geometric fidelity in dense 3D vision tasks, with implications for future research in computer vision and embodied AI.

Explain it Like I'm 14

What this paper is about

This paper is about teaching computers to understand the 3D shape of a scene from just one picture. Recent AI models already do a good job at getting the big, overall 3D layout right. But they often mess up small details, like thin poles, chair legs, or edges, which end up looking wavy, bumpy, or bent in 3D. The authors introduce SurGe, a new method that focuses on making these local details cleaner and more accurate.

The main questions the paper asks

  • How can we measure the quality of small, local 3D shapes (like thin parts and sharp edges) in a way that clearly shows when a model gets them wrong?
  • How can we train a model so that neighboring pixels in an image form smooth, consistent 3D surfaces instead of wavy or noisy ones?
  • Can we design a better decoder (the part of the model that turns features into per-pixel 3D predictions) that captures fine details without creating new artifacts?

How they approached the problem (in simple terms)

Think of turning a photo into a 3D scene like building a sculpture based on a picture. Each pixel gets a 3D point, forming a “point map.” If nearby points don’t line up nicely, the surface looks wrinkly.

To fix this, the paper does three things:

  • A better test for local surfaces: They add a new metric that looks at the direction each tiny surface patch faces. Imagine placing a tiny arrow on the surface at each point that shows which way it’s pointing. This “normal” tells you local shape. If the arrows flip around wildly, the surface is bad, even if the points are close. This metric makes local mistakes easy to see and measure.
  • A better training rule to keep neighbors consistent: They introduce a “point gradient matching loss.” Instead of only checking if each predicted point is in the right place, they also check how neighboring points change relative to each other. They normalize these small differences by depth so the rule doesn’t care about the scene’s overall size (this is called scale-invariant). In everyday terms: rather than judging each height alone, they compare the slope between neighbors—and do it in a way that isn’t fooled if the whole object is closer or farther.
  • A better decoder that focuses locally: They design a Neighborhood Attention Decoder (NAD). Attention is a method where the model “looks” more at the parts that matter. But full attention over an entire image is expensive and can cause blocky artifacts. Neighborhood Attention only looks around each spot’s nearby area, which is faster and better for local details. The decoder processes the image at multiple scales, getting progressively finer, so thin structures and edges are preserved without creating checkerboard or patchy artifacts.

A few helpful analogies:

  • Normals: Little arrows sticking out of the surface; if they twist randomly, the surface is messy.
  • Gradient matching: Compare how two neighboring points go up or down, not just their absolute heights.
  • Neighborhood Attention: Like focusing on nearby puzzle pieces to fit them together accurately, instead of scanning the whole table every time.

What they found and why it matters

The authors tested SurGe on eight standard benchmarks without any extra training on those datasets (this is called zero-shot). They measured three kinds of quality:

  • Global accuracy of the whole 3D scene.
  • Local point accuracy inside objects (after aligning each object).
  • Local surface direction accuracy using their new “point map normal” metric.

Key results:

  • SurGe gave the best average ranking on global 3D accuracy compared to strong recent methods.
  • SurGe clearly improved local details: it got the best scores on both local point accuracy and the new normal-based metric across multiple datasets.
  • Visually, SurGe produced cleaner thin structures (like street signs, poles, and edges) and smoother, more stable surfaces.

Why it matters:

  • Many real-world uses (robotics, AR/VR, mapping, 3D content creation) need both good overall structure and clean local detail. If edges wobble or thin parts bend, interactions and measurements can fail. SurGe improves exactly these weak spots.

What this could change going forward

  • Better evaluation: The new normal-based metric makes it easier for researchers to see and fix local surface problems. This could become a standard way to assess 3D quality.
  • Better training recipes: The point gradient matching loss shows that teaching models using neighbor-to-neighbor comparisons, in a scale-invariant way, leads to smoother, more realistic surfaces. Others can adapt this idea to different 3D tasks.
  • Better architectures: The Neighborhood Attention Decoder suggests you can get the benefits of attention for fine details without the heavy cost or patchy artifacts. Future models for depth, surface normals, or even image restoration may use similar neighborhood attention blocks.

In short, SurGe doesn’t just make the big 3D picture look right; it also makes the small details precise and clean. That combination is crucial if we want reliable 3D from a single image in everyday applications.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list highlights what remains missing, uncertain, or unexplored in the paper and can guide future research:

Evaluation metrics and analysis

  • Quantify how well the proposed point map normal metric correlates with human judgments of local surface quality across diverse scenes.
  • Analyze sensitivity of the normal metric to ground-truth sparsity, noise, and resampling (e.g., LiDAR, SfM), and provide variants robust to sparse or noisy labels.
  • Study the metric’s dependence on stencil choice (four-neighborhood cross products) and evaluate alternative normal estimators (diagonal/8-neighborhood, multi-scale, robust estimators).
  • Investigate whether global rotational misalignments (if any) bias the normal metric, and whether a rotation-invariant evaluation variant is needed.
  • Provide ablations demonstrating metric stability across image resolutions and aspect ratios.

Loss design and supervision

  • Evaluate alternative normalizations in the point gradient matching (PGM) loss (e.g., average depth, geometric mean, per-pair learned scale, epsilon-clipped normalization) and their stability near very small z.
  • Extend PGM beyond 4-neighborhood forward differences to include diagonal neighbors, larger stencils, or multi-scale gradients, and assess rotational and scale robustness.
  • Compare PGM directly to normal/edge-angle losses and higher-order (curvature/second-derivative) surface losses under identical training protocols.
  • Explore combining PGM with explicit normal-based losses or learned confidence-weighted gradient matching to handle outliers and label noise.
  • Validate the occlusion-boundary masking strategy: quantify how masking impacts boundary fidelity and whether learnable or geometry-aware boundary handling performs better without relying on dataset-specific edge conventions.
  • Perform a systematic study of loss-weight sensitivity (including the fixed weight 10 for PGM) across datasets and label qualities.

Architectural choices and efficiency

  • Provide inference-time compute, memory, and latency benchmarks for Neighborhood Attention Decoder (NAD) versus ConvStack/DPT/ViT decoders across resolutions; assess real-time feasibility.
  • Ablate Neighborhood Attention hyperparameters (window size, number of heads, head dimension), dilation, and window scheduling across stages to understand accuracy–efficiency trade-offs.
  • Compare NAD to alternative attention paradigms (e.g., Swin-style shifted windows, deformable attention, dynamic convolution/modulated kernels) for local mixing.
  • Study the impact of removing LayerNorm on training stability and generalization across backbones, precisions (e.g., FP16/bfloat16), and hardware.
  • Analyze the reliance on the encoder for long-range context: does NAD’s local attention limit very long-range interactions, and can occasional global or cross-scale attention improve consistency?
  • Test scalability of NAD with larger/smaller backbones and different token budgets; provide scaling laws for performance vs. compute.

Generalization and robustness

  • Assess robustness to image degradations (noise, blur, compression, exposure changes, rolling shutter) and to challenging material properties (transparency, specularity) that can destabilize local geometry.
  • Evaluate performance under diverse camera intrinsics and sensor models; characterize sensitivity to intrinsics miscalibration and propose intrinsic-aware conditioning if needed.
  • Provide analysis on hard categories (e.g., extremely thin structures against highly textured/far backgrounds, textureless regions, repetitive patterns), including failure cases and targeted remedies.
  • Explore uncertainty estimation for local geometry (per-pixel confidence) and its use in loss weighting, post-processing, and downstream tasks.

Scope and applicability

  • Investigate applicability to multi-view architectures: can NAD and PGM transfer to many-view feedforward models without introducing patch-aligned or window-aligned artifacts?
  • Examine whether the normal metric can guide test-time refinement or self-supervised adaptation, and whether PGM can be used in semi-/self-supervised or sparse-label regimes.
  • Address absolute scale recovery: the work focuses on affine-invariant evaluation; methods to predict or calibrate metric scale (without external cues) remain unexplored.
  • Analyze domain and data-mix effects: quantify the contribution of each dataset category, the role of synthetic vs. real data, and identify optimal curricula for local-surface learning.
  • Consider joint prediction of auxiliary geometry (e.g., explicit normals, curvature, occlusion boundaries) and their synergies with point maps under the proposed training scheme.

Reproducibility and standardization

  • Standardize and release evaluation code/checklists for the normal metric (including handling of masks and boundary cases) to facilitate broader adoption and comparability.
  • Provide detailed reporting on training stability (seed variance), hyperparameter robustness, and ablations that isolate the individual contributions of NAD and PGM at full training scale.

Practical Applications

Immediate Applications

Below are concrete ways SurGe’s findings and components can be applied today across sectors. Each item names the sector(s), the use case, likely tools/workflows, and key assumptions/dependencies.

  • Sector: Robotics, Industrial Automation
    • Use case: Safer navigation and manipulation around thin structures (cables, rods, fences), better grasp planning on slender objects, improved obstacle avoidance in cluttered scenes from a single onboard camera.
    • Tools/workflows: Drop-in point-map module in perception stacks; SurGe as a prior for visual odometry/SLAM; local surface normals to filter spurious points and enforce grasp pose constraints.
    • Assumptions/dependencies: Monocular scale ambiguity must be resolved (IMU/wheel odometry/stereo/known object sizes); GPU availability on robot; domain adaptation may be needed for factory/surgical lighting.
  • Sector: AR/VR, Mobile
    • Use case: Higher-fidelity occlusion and placement in AR (correct handling of chair legs, wires, handles), improved collision proxies in AR physics, single-image room captures for quick previews.
    • Tools/workflows: Integrate SurGe as a depth/point-map provider in ARKit/ARCore pipelines; generate mesh proxies from point maps; use point-normal metric in QA to gate AR releases.
    • Assumptions/dependencies: For metric scale, rely on device sensors (ToF/LiDAR, IMU); latency budgets may require model distillation or reduced-token variants.
  • Sector: VFX, Games, Creative Tools
    • Use case: Single-frame geometry extraction for relighting, defocus, fog, and volumetric effects; cleaner meshes from thin geometry for asset integration; improved matte edges via normals.
    • Tools/workflows: DCC plugins (Blender, Unreal, Nuke) that convert SurGe point maps to watertight meshes; normal-aware smoothing to retain sharp edges without ripples; QA with the normal-based metric.
    • Assumptions/dependencies: Studio compute or workstation GPU; scene scale derived from production metadata or SfM alignment.
  • Sector: E-commerce, 3D Digitization
    • Use case: Rapid 3D previews for products with thin parts (jewelry, eyewear, furniture frames) from sparse images; background separation leveraging consistent local surfaces.
    • Tools/workflows: Point-to-mesh pipelines with normal-guided meshing; automated quality checks with point-map normal error; SurGe-initialized MVS to reduce capture time.
    • Assumptions/dependencies: Multi-view capture still preferred for metric scale and completeness; domain tuning for studio lighting/backgrounds.
  • Sector: Construction, Real Estate, Interior Design
    • Use case: Quick scene measurement proxies and spatial planning from a single photo (e.g., measuring fixtures, detecting protrusions), cleaner floorplan cues from thin edges (moldings, rails).
    • Tools/workflows: Mobile app that predicts point maps, extracts planes and edge segments via normals; mesh export to CAD/BIM.
    • Assumptions/dependencies: Metric scale from known references (door height) or AR frameworks; accuracy bounds communicated to users.
  • Sector: Mapping/Photogrammetry/NeRF and 3D Gaussian Splatting pipelines
    • Use case: Faster and more stable multi-view reconstruction by seeding with SurGe point maps; fewer artifacts on thin structures and repetitive textures.
    • Tools/workflows: Use SurGe outputs as priors for MVS, NeRF, or 3DGS initialization; normal-based metric to pick frames with reliable geometry for pose-graph optimization.
    • Assumptions/dependencies: Consistent intrinsics/extrinsics; pipeline integration handles monocular scale alignment across views.
  • Sector: Research and ML Engineering (Academia/Industry)
    • Use case: Better training signals for point-map models via the point gradient matching loss; stronger local-geometry evaluation with the proposed normal metric; NAD as a drop-in decoder for dense geometry tasks.
    • Tools/workflows: Replace or augment edge/normal losses with depth-normalized finite-difference supervision; add point-map normal metric to validation dashboards; adopt NAD blocks to reduce patch artifacts without full attention cost.
    • Assumptions/dependencies: Access to training data with sufficient label density/quality; reproduction of training schedules; licensing of pretrained backbones (e.g., DINOv2).
  • Sector: Standards, Safety, Policy
    • Use case: Procurement and benchmark criteria that include local-surface quality (thin-structure fidelity) beyond global errors; regression tests for AD/robotics perception stacks.
    • Tools/workflows: Incorporate the point-map normal mean angular error into evaluation suites; require local-instance metrics in RFPs and model cards.
    • Assumptions/dependencies: Community and vendor agreement to report additional metrics; availability of instance masks or dense GT for evaluation.

Long-Term Applications

These opportunities become feasible with further research, scaling, or system integration.

  • Sector: Autonomous Driving, Advanced Robotics
    • Use case: Robust perception of road furniture (signposts, poles, wires), debris detection, and fine manipulation (cable routing, textile handling) using monocular cameras.
    • Tools/workflows: Sensor fusion to resolve scale (LiDAR/radar/IMU), real-time SurGe variants on automotive-grade SoCs, continual learning with normal-based local-surface QA.
    • Assumptions/dependencies: Real-time performance on edge; safety validation under adverse weather/night; domain-shift robustness.
  • Sector: Consumer AR Glasses and Edge Devices
    • Use case: On-device, persistent AR with high-fidelity occlusion and physics from glance-level captures; live scene editing.
    • Tools/workflows: Model compression (quantization/pruning) of NAD, knowledge distillation; mixed-precision kernels and neighborhood-attention accelerators.
    • Assumptions/dependencies: Hardware support for efficient neighborhood attention; energy constraints; privacy-compliant on-device inference.
  • Sector: Cultural Heritage, Inspection, Metrology
    • Use case: Accurate reconstruction of fine ornamentation, crack detection, and defect metrology from limited imagery.
    • Tools/workflows: Domain-adapted SurGe models; normal-guided meshing with uncertainty estimates; integration into inspection robots/drones.
    • Assumptions/dependencies: Calibrated capture protocols; certified accuracy and traceability to standards.
  • Sector: Medical Imaging and Endoscopy (Research Transfer)
    • Use case: Monocular 3D reconstruction in endoscopy (vessel and tool geometry), with improved fidelity on thin structures.
    • Tools/workflows: Adapt NAD and pairwise depth-normalized supervision to medical domain; synthetic-to-real transfer with curated datasets.
    • Assumptions/dependencies: Regulatory approval, domain-specific training data, handling of specularities/fluids.
  • Sector: Generative 3D and Video
    • Use case: Geometry-consistent conditioning for 3D generative models and text-to-4D; normal-based losses for training generative models with stronger local surface priors.
    • Tools/workflows: Use SurGe point maps as geometry supervision for diffusion/autoregressive 3D pipelines; enforce normal-consistency constraints across frames.
    • Assumptions/dependencies: Scalable training on multimodal datasets; temporal consistency modules; license compatibility for pretrained encoders.
  • Sector: Geospatial/Remote Sensing
    • Use case: Monocular 3D inference for aerial/satellite images to aid DSM/DTM refinement, especially for man-made thin structures (powerlines, poles).
    • Tools/workflows: Domain adaptation with aerial datasets; fusion with stereo/LiDAR; QA with normal metric for small-structure completeness.
    • Assumptions/dependencies: Different imaging geometries and scales; atmospheric effects; rigorous georeferencing.
  • Sector: Benchmarking and Regulation
    • Use case: New standard suites emphasizing local-surface fidelity for safety-critical perception; certification processes that penalize surface ripples/blockiness.
    • Tools/workflows: Public benchmarks reporting point-map normal errors alongside global metrics; dataset design prioritizing thin-structure annotations.
    • Assumptions/dependencies: Community adoption; tooling to compute normals from diverse GT sources.
  • Sector: Platforms and Tooling
    • Use case: Turnkey SDKs that output point maps, normals, and meshes from single images; “thin-structure mode” in capture apps; training libraries implementing the point gradient matching loss and NAD.
    • Tools/workflows: Open-source packages and commercial APIs; CAD/DCC plugins; AutoML recipes with local-surface metrics.
    • Assumptions/dependencies: Sustainable maintenance; interoperability with existing photogrammetry/engine toolchains; IP/licensing for pretrained components.

Cross-cutting assumptions and dependencies

  • Scale ambiguity: Monocular predictions are affine-invariant; applications requiring metric units must fuse additional cues (IMU, stereo, LiDAR, known dimensions).
  • Compute and latency: The reference model uses a large ViT (DINOv2-L) and multi-stage decoder; mobile/real-time deployments require compression or smaller backbones.
  • Data and domain shift: Zero-shot generalization is strong but not guaranteed; fine-tuning and targeted data curation may be necessary for specialized domains.
  • Integration maturity: The point-map normal metric and the point gradient matching loss are immediately useful for model training/evaluation, but downstream pipelines may need engineering to exploit normals and improved local surfaces (e.g., meshing, uncertainty, QA gates).
  • Licensing and reproducibility: Use of pretrained backbones and datasets must comply with licenses; reproducing training at full scale requires significant compute (multi-GPU).
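
For the scale-ambiguity point above, one simple way to bring a prediction into metric units is to fit a scale and a translation against a handful of reference points with known metric coordinates (for example from LiDAR or a known object size). The sketch below uses the closed-form least-squares solution; the function name and the choice of a single scalar scale plus a 3D shift are assumptions for illustration, not a procedure taken from the paper.

```python
import numpy as np

def fit_scale_shift(pred_pts, ref_pts):
    """Least-squares s, t such that s * pred_pts + t approximates ref_pts.

    pred_pts, ref_pts: (N, 3) arrays of corresponding points.
    """
    p_mean = pred_pts.mean(axis=0)
    g_mean = ref_pts.mean(axis=0)
    pc = pred_pts - p_mean
    gc = ref_pts - g_mean
    s = float((pc * gc).sum() / (pc * pc).sum())
    t = g_mean - s * p_mean
    return s, t

# Synthetic check: the reference is an exactly scaled and shifted prediction.
rng = np.random.default_rng(0)
pred = rng.normal(size=(50, 3)) + np.array([0.0, 0.0, 3.0])
ref = 2.5 * pred + np.array([0.1, -0.2, 0.3])
s, t = fit_scale_shift(pred, ref)
aligned = s * pred + t
```

With noisy or sparse references a robust variant (for example with outlier rejection) would likely be preferable to plain least squares.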

Glossary

  • AbsRel: Absolute Relative Error; a common metric in depth/geometry estimation measuring relative point error after alignment. "global affine-invariant AbsRel evaluates all valid pixels"
  • AdamW: An optimization algorithm that decouples weight decay from the gradient update to improve training stability. "We optimize with AdamW~\cite{loshchilov2019adamw} for $120$K steps at total batch size $128$, using peak learning rates $3\times10^{-4}$ for the decoder and $3\times10^{-5}$ for the backbone, and a reciprocal square root schedule~\cite{zhai2022scalingvisiontransformers} with $1$K warmup steps and a $10\%$ cooldown."
  • Affine-invariant: Property of methods/metrics that are unchanged under affine transformations (scaling, translation) of 3D coordinates. "affine-invariant point map AbsRel metrics in global and instance-wise local forms."
  • Affine-invariant depth: Depth parameterization invariant to affine transformations (often in log space), enabling scale- and shift-agnostic supervision. "predict affine-invariant depth in log space instead of point maps."
  • Cross-attend: An attention mechanism where one set of features attends to another set (often across resolutions or modalities). "utilizes NA to cross-attend between high-resolution image features of a small CNN and a low-resolution feature map of a vision foundation model in order to compute high-resolution features."
  • Cross-view completion: A pretraining objective that learns to predict or fill in features across different camera views. "a ViT pretrained with cross-view completion~\cite{weinzaepfel2022croco_v1,weinzaepfel2023croco_v2}"
  • DINOv2: A large-scale self-supervised vision foundation model used to initialize Vision Transformers. "SurGe combines a DINOv2~\cite{oquab2023dinov2} encoder with our Neighborhood Attention Decoder (NAD)."
  • DPT head: A decoder architecture (from Dense Prediction Transformers) for upsampling and predicting dense maps from ViT features. "This decoder is often implemented as a DPT head~\cite{ranftl2021dpt,yang2024depth_anything_v1,yang2024depth_anything_v2,bochkovskii2024depthpro}"
  • Edge angle loss: A surface-consistency loss that penalizes differences in angles between edges to encourage coherent local geometry. "point normals and edge angle losses supervise local orientation but do not constrain displacement magnitude"
  • FFN (Feed-Forward Network): The position-wise MLP block inside Transformer layers that processes each token independently. "a Transformer-style~\cite{vaswani2017attention} residual block with a Neighborhood Attention layer and a pointwise FFN"
  • Finite differences: Discrete approximations of derivatives computed as differences between neighboring pixels/points. "a point gradient matching loss that supervises depth-normalized 3D finite differences"
  • LayerNorm (Layer Normalization): A normalization technique applied to activations to stabilize and accelerate training. "omits the usual pre-attention and pre-FFN LayerNorm~\cite{ba2016layernormalization} layers."
  • LiDAR: A sensor that measures distances using laser light, providing sparse or dense depth annotations. "for LiDAR labels we keep only $\mathcal{L}_\mathrm{glob}$ and $\mathcal{L}_{\mathrm{loc},4}$."
  • Log-depth gradient matching: A loss comparing gradients of log-depth, yielding scale-invariant supervision across neighboring pixels. "the log-depth gradient matching loss originally proposed for monocular depth~\cite{li2018megadepth}"
  • Monocular geometry estimation: Recovering dense 3D structure from a single image. "Monocular geometry estimation seeks to recover dense 3D scene structure from a single image."
  • Neighborhood Attention: An attention variant where each token attends only to a local spatial neighborhood to reduce cost and inject locality. "Neighborhood Attention (NA)~\cite{hassani2023nat}"
  • Neighborhood Attention Decoder (NAD): The proposed multiscale decoder that uses Neighborhood Attention for local feature mixing and upsampling. "a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing."
  • Occlusion boundaries: Image regions where a nearer surface occludes a farther one, producing a depth discontinuity that often makes local supervision ambiguous or inconsistent. "omit pairs near occlusion boundaries."
  • Pairwise scale invariance: A property where supervision depends on relative scales within local pixel pairs, not global scale. "while preserving its pairwise scale invariance."
  • Pixel shuffle: An upsampling operation that rearranges channel data into higher spatial resolution. "a final pixel shuffle~\cite{shi2016pixelshuffle} for point predictions."
  • Point gradient matching loss: The proposed loss comparing depth-normalized 3D finite differences between neighboring predicted and ground-truth points. "a point gradient matching loss that supervises depth-normalized 3D finite differences"
  • Point map: A dense 3D representation assigning each pixel a 3D point in space. "predict a point map, assigning each pixel a 3D point."
  • Point map normal metric: An evaluation metric that compares local surface normals induced by neighboring 3D point differences. "we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions."
  • Point map normals: Surface normals computed from local differences in the point map to characterize orientation of surfaces. "instead compares point map normals induced by neighboring point differences, and therefore reflects this degradation."
  • QK normalization: A stabilization technique that normalizes queries and keys in attention to control attention logits. "we use QK~normalization~\cite{zhai2022scalingvisiontransformers,dehghani2023vit22b}."
  • Reciprocal square root schedule: A learning rate schedule in which the rate decays in proportion to the reciprocal of the square root of the training step. "a reciprocal square root schedule~\cite{zhai2022scalingvisiontransformers}"
  • RoPE (Rotary Positional Embeddings): A method for injecting relative positional information into attention via rotations in feature space. "use window-matched RoPE~\cite{su2024rope}"
  • ROE alignment: An affine alignment procedure (from MoGe) applied before point-map loss computation. "the global affine-invariant point map loss $\mathcal{L}_\mathrm{glob}$ with ROE~\cite{wang2025moge} alignment"
  • Scale-invariant: Unaffected by global scaling, often desirable in depth/geometry losses and metrics. "A scale-invariant point gradient matching loss, inspired by log-depth gradient matching~\cite{li2018megadepth}"
  • SfM (Structure-from-Motion): A technique that recovers 3D structure and camera motion from image sequences. "for SfM labels we omit $\mathcal{L}_{\mathrm{loc},64}$ and $\mathcal{L}_\mathrm{pgm}$"
  • Transposed convolution: A learned upsampling layer that increases spatial resolution by applying the transpose of a strided convolution; it is not a true inverse of convolution. "a transposed $2\times 2$ convolution with stride $2$, followed by a $3\times 3$ convolution"
  • Unprojection: Converting depth (or similar scalar fields) into 3D points using camera intrinsics or reference geometry. "Unprojecting affine-invariant log-depth to 3D geometry requires a reference point map"
  • ViT (Vision Transformer): A Transformer architecture applied to images via patch tokenization for feature encoding/decoding. "a ViT~\cite{dosovitskiy2021vit} encoder with a Neighborhood Attention Decoder (NAD, \cref{sec:arch})."
  • Zero-shot: Evaluation without task-specific fine-tuning on the target datasets. "Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank"
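
The "Point map normals" and "Point map normal metric" entries above can be illustrated with a minimal NumPy sketch. It assumes forward differences to the right and lower neighbors and a plain mean angular error; the function names, the choice of neighbors, and the absence of any validity masking are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def point_map_normals(points):
    """Unit normals from an (H, W, 3) point map via forward differences.

    Assumption: the normal at a pixel is the cross product of the
    differences to its right and lower neighbors. Returns (H-1, W-1, 3).
    """
    dx = points[:-1, 1:] - points[:-1, :-1]   # step to the right neighbor
    dy = points[1:, :-1] - points[:-1, :-1]   # step to the lower neighbor
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-12, None)

def mean_angular_error_deg(pred, gt):
    """Mean angle (degrees) between normals induced by two point maps."""
    n_pred = point_map_normals(pred)
    n_gt = point_map_normals(gt)
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

# Toy data: a fronto-parallel plane (z = 1) and a plane tilted by 45 degrees (z = x).
ys, xs = np.meshgrid(np.arange(4.0), np.arange(5.0), indexing="ij")
flat = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
tilted = np.stack([xs, ys, xs], axis=-1)
```

Because the normals depend only on the direction of neighboring differences, uniformly rescaling a point map leaves this error unchanged, which is why such a metric isolates local surface orientation from global scale.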

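The "Finite differences", "Pairwise scale invariance", and "Point gradient matching loss" entries can be made concrete with a hedged sketch. The source only states that the loss supervises depth-normalized 3D finite differences; the normalizer used here (the mean z of each pixel pair), the L1 penalty, and the function names are assumptions chosen for illustration and may differ from the paper's definition.

```python
import numpy as np

def normalized_differences(points, eps=1e-8):
    """Depth-normalized finite differences of an (H, W, 3) point map.

    Assumption: each neighbor difference is divided by the mean depth (z)
    of the two pixels involved, which cancels any global scale factor.
    """
    z = points[..., 2:3]
    dx = (points[:, 1:] - points[:, :-1]) / (0.5 * (z[:, 1:] + z[:, :-1]) + eps)
    dy = (points[1:, :] - points[:-1, :]) / (0.5 * (z[1:, :] + z[:-1, :]) + eps)
    return dx, dy

def gradient_matching_l1(pred, gt):
    """L1 distance between predicted and ground-truth normalized differences."""
    pdx, pdy = normalized_differences(pred)
    gdx, gdy = normalized_differences(gt)
    return float(np.abs(pdx - gdx).mean() + np.abs(pdy - gdy).mean())
```

Under these assumptions, a prediction that equals the ground truth up to a global scale yields a loss of approximately zero, while local perturbations of neighboring points are penalized in both direction and magnitude.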