Digital Surface Models (DSMs) Overview
- DSMs are raster elevation datasets that record the first reflective elevation encountered, capturing both ground surfaces and above-ground structures.
- They are generated using photogrammetry, LiDAR, and hybrid fusion techniques to derive detailed 3D shapes and volumetric attributes for diverse applications.
- Advanced DSM processing leverages void filling, boundary sharpening, and deep learning to enhance accuracy and support urban, environmental, and energy mapping studies.
A Digital Surface Model (DSM) is a raster elevation dataset in which each cell records the elevation of the Earth's surface, including all natural and artificial objects such as buildings, vegetation, and infrastructure. DSMs fundamentally contrast with Digital Terrain Models (DTMs), which represent the bare earth surface with all superstructures removed. The explicit inclusion of non-terrain objects in DSMs enables direct analysis of built environments, canopy structures, and anthropogenic modifications, and underpins a vast array of geospatial, remote sensing, and urban modeling applications (Mutreja et al., 24 Mar 2025).
1. Definition, Distinction, and Primary Roles of DSMs
A DSM is a gridded surface that encodes, at each location, the first reflective elevation encountered from above (e.g., building rooftops, tree canopies, or the terrain itself where nothing stands above it). This is formally distinct from a DTM, which is ideally obtained from the DSM by removing all above-ground objects, often via a DSM-to-DTM filtering operation (Mutreja et al., 24 Mar 2025, Dhaouadi et al., 13 Nov 2025). DSMs enable direct extraction of object heights, 3D shapes, and volumetric and morphological attributes, and serve as foundational layers for land-cover mapping, digital twins, shadow modeling, line-of-sight analysis, and solar resource estimation (Batchu et al., 2024).
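The DSM/DTM relationship can be made concrete with a normalized DSM (nDSM), the per-cell difference between the two grids. The following is a minimal sketch, assuming both rasters are already co-registered NumPy arrays in metres; the function name `normalized_dsm`, the 2 m height threshold, and the 0.5 m cell size are illustrative choices, not values taken from the cited works.

```python
import numpy as np

def normalized_dsm(dsm, dtm):
    """Height above ground (nDSM) as the per-cell difference DSM - DTM."""
    dsm = np.asarray(dsm, dtype=float)
    dtm = np.asarray(dtm, dtype=float)
    if dsm.shape != dtm.shape:
        raise ValueError("DSM and DTM must be co-registered grids of equal shape")
    # Small negative differences are usually noise, so clamp them to zero.
    return np.clip(dsm - dtm, 0.0, None)

# Toy scene: flat ground at 100 m with a 2x2-cell building 12 m tall.
dtm = np.full((4, 4), 100.0)
dsm = dtm.copy()
dsm[1:3, 1:3] += 12.0

ndsm = normalized_dsm(dsm, dtm)
gsd = 0.5                      # assumed cell size in metres
building_mask = ndsm > 2.0     # assumed height threshold for "above ground"
volume_m3 = ndsm[building_mask].sum() * gsd ** 2
```

Object heights are read directly from `ndsm`, and a volume estimate follows from summing heights over a mask and multiplying by the cell area.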
2. Generation Methodologies: Photogrammetric and Fused Approaches
Contemporary DSM generation primarily leverages photogrammetric stereo, LiDAR, radargrammetry, or hybrid fusion techniques:
- Stereo Photogrammetry: Multiple high-resolution images (typically from satellites or UAVs) are rectified and matched using algorithms such as semi-global matching (SGM), with resulting disparities triangulated to derive dense point clouds. These point clouds are rasterized to form DSMs (Stucker et al., 2021, Qin, 2019). Quality is directly affected by image geometry (convergence angles), radiometric conditions, occlusions, and surface texture (Stucker et al., 2021, Batchu et al., 2024).
- Multi-View Fusion: Multi-view DSMs are generated by ranking and fusing depth maps from carefully selected stereo pairs, optimized using a reference such as sparse LiDAR. Adaptive median fusion, which incorporates spatial and spectral cues, increases robustness and reduces salt-and-pepper noise compared with plain per-cell medians (Qin, 2019).
- Hybrid Techniques: Advanced DSM refinement employs encoder-decoder and conditional generative adversarial network (cGAN) architectures with dual encoders (e.g., for PAN and DSM inputs). Early fusion strategies (at the bottleneck) improve output sharpness and rectilinearity of building forms beyond late fusion models (Bittner et al., 2019, Bittner et al., 2019).
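Two of the steps named above, rasterizing a point cloud into a DSM and fusing several DSMs by a per-cell median, can be sketched in a few lines. This is an illustrative baseline only: it keeps the highest point per cell and uses the plain per-cell median, not the adaptive fusion of Qin (2019). The function names and the grid origin convention (`x_min`, `y_max` at the upper-left corner, rows increasing southward) are assumptions.

```python
import warnings
import numpy as np

def rasterize_max_z(points, x_min, y_max, gsd, shape):
    """Rasterize (x, y, z) points into a DSM, keeping the highest z per cell."""
    pts = np.asarray(points, dtype=float)
    cols = np.floor((pts[:, 0] - x_min) / gsd).astype(int)
    rows = np.floor((y_max - pts[:, 1]) / gsd).astype(int)
    inside = (rows >= 0) & (rows < shape[0]) & (cols >= 0) & (cols < shape[1])
    dsm = np.full(shape, -np.inf)
    # Unbuffered maximum so several points in one cell are all considered.
    np.maximum.at(dsm, (rows[inside], cols[inside]), pts[inside, 2])
    dsm[np.isinf(dsm)] = np.nan  # cells without points become voids
    return dsm

def fuse_median(dsm_stack):
    """Plain per-cell median over co-registered DSMs, ignoring voids (NaN)."""
    stack = np.asarray(dsm_stack, dtype=float)
    with warnings.catch_warnings():
        # Cells that are void in every input stay NaN; silence NumPy's notice.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmedian(stack, axis=0)

points = [(0.25, 1.75, 10.0), (0.30, 1.70, 14.0), (1.5, 0.5, 11.0)]
dsm = rasterize_max_z(points, x_min=0.0, y_max=2.0, gsd=1.0, shape=(2, 2))

stack = [[[10.0, 20.0]], [[11.0, np.nan]], [[30.0, 21.0]]]
fused = fuse_median(stack)
```

In the fusion example the outlier value 30.0 in the first cell does not affect the result, which is the robustness property that motivates median-based fusion.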
3. DSM Processing: Filtering, Completion, and Registration
Raw DSMs frequently exhibit artifacts due to sensor noise, insufficient texture, occlusions, and temporal changes. Key processing and enhancement strategies include:
- Void Filling and Height Completion: Standard spatial interpolation (IDW, kriging, splines) performs poorly for complex urban and vegetated terrain. Modern approaches employ guided inpainting using edge-aware anisotropic diffusion, where guidance images (e.g., RGB orthophotos) steer the completion of missing DSM values. Diffusion-based inpainting methods such as Dfilled have demonstrated superior preservation of sharp structures and faithful completion of large voids, using diffusion tensors derived from optical gradients and void masks simulated realistically with Perlin noise (Panangian et al., 26 Jan 2025). Foundation models enable sensor-agnostic DSM completion by propagating metric information from available priors to missing regions via semantic correspondence in Vision Transformer (ViT) features, combined with test-time-adapted monocular depth predictions (Rafaeli et al., 2 Apr 2026).
- Boundary Sharpening: Dense-matching algorithms over-smooth depth edges. Graph-cut and plane-fitting post-processing, leveraging line-segment cues from orthophotos, can rectify building outlines, reduce boundary RMSE, and align DSM discontinuities with physical wall positions (Lu et al., 2019).
- Wide-area Registration: Large mosaics of DSM tiles require global alignment. Memory-efficient grid-aware nearest neighbor search enables pairwise ICP between tiles without k-d-tree overhead. Pose graphs over tile overlaps are solved by motion averaging to enforce global consistency, reducing registration-induced errors to sub-meter levels (Xu et al., 2024).
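The idea behind guided, edge-aware void filling can be illustrated with a simplified scheme in which heights diffuse into voids from their four neighbours, with each neighbour weighted down when the guidance image changes between the two cells. This is a scalar-weight sketch, not the tensor-based Dfilled method; the function name, the exponential weighting, and the parameter `k` are assumptions made for illustration.

```python
import numpy as np

def guided_diffusion_fill(dsm, guide, n_iter=200, k=0.05):
    """Fill NaN voids by neighbour averaging, damped across guidance edges."""
    dsm = np.asarray(dsm, dtype=float)
    guide = np.asarray(guide, dtype=float)
    h, w = dsm.shape
    void = np.isnan(dsm)
    filled = np.where(void, np.nanmean(dsm), dsm)  # initial guess in voids

    # Offsets into a 1-pixel padded array: up, down, left, right neighbours.
    shifts = [(0, 1), (2, 1), (1, 0), (1, 2)]
    pad_g = np.pad(guide, 1, mode="edge")
    weights = [np.exp(-np.abs(pad_g[r:r + h, c:c + w] - guide) / k)
               for r, c in shifts]
    w_sum = np.maximum(sum(weights), 1e-12)  # guard against total underflow

    for _ in range(n_iter):
        pad_z = np.pad(filled, 1, mode="edge")
        acc = np.zeros_like(filled)
        for wgt, (r, c) in zip(weights, shifts):
            acc += wgt * pad_z[r:r + h, c:c + w]
        # Only void cells are updated; measured heights stay fixed.
        filled = np.where(void, acc / w_sum, dsm)
    return filled

# Toy scene: ground (5 m) on the left, roof (20 m) on the right,
# with a void that straddles the edge visible in the guidance image.
guide = np.zeros((5, 6))
guide[:, 3:] = 1.0
dsm = np.where(guide > 0.5, 20.0, 5.0)
dsm[1:4, 2:4] = np.nan
filled = guided_diffusion_fill(dsm, guide)
```

Because the guidance edge suppresses exchange between columns 2 and 3, the void is completed with ground height on one side and roof height on the other instead of a smooth ramp, which is the behaviour plain interpolation fails to deliver.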
4. Deep Learning and Self-Supervised DSM Enhancement
Recent DSM workflows increasingly depend on deep, multi-modal learning frameworks:
- Multi-Task Learning: Encoder-decoder architectures with shared representations and multi-head decoders leverage auxiliary tasks such as roof-type segmentation alongside height regression. Uncertainty-based multi-task loss balancing, surface-normal constraints, and adversarial terms enable structural regularization and improved roof geometry (Liebel et al., 2020).
- Self-Supervised Pre-training: Dual-encoder models (e.g., HiRes-FusedMIM) learn joint representations from high-resolution RGB and DSM data using masked patch modeling and contrastive alignment heads. Incorporating DSM as an explicit modality during pre-training yields improvements for classification, semantic segmentation, and instance segmentation tasks—particularly in building-centric benchmarks (Mutreja et al., 24 Mar 2025). Aggressive patch masking (60%) and per-city normalization are critical for robust spatial feature extraction across modalities.
- Diffusion Models and Residual Priors: Single-view DSM estimation and DSM-to-DTM translation benefit from conditional denoising diffusion probabilistic models (DDPMs), which achieve state-of-the-art accuracy and allow for uncertainty quantification (Corley et al., 2023, Dhaouadi et al., 13 Nov 2025). Residual refinement networks (ResDepth) model DSM refinement as a residual correction task, encoding geometric and urban priors with strong cross-city transfer (Stucker et al., 2021).
- Early Fusion and GAN-based Refinement: Early fusion of spectral (PAN) and depth (DSM) features within cGAN generators enhances planar regularity, edge delineation, and completion of partially occluded roofs, outperforming late-fusion counterparts (Bittner et al., 2019).
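Uncertainty-based loss balancing, mentioned above for multi-task DSM filtering, is often written with one learnable log-variance per task. The sketch below shows that common homoscedastic-uncertainty form; whether Liebel et al. (2020) use exactly this parameterization is an assumption, and the function name and example loss values are illustrative.

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine task losses L_i as sum(0.5 * exp(-s_i) * L_i + 0.5 * s_i).

    s_i = log(sigma_i^2) is a per-task log-variance that would be learned
    jointly with the network; here it is simply passed in.
    """
    losses = np.asarray(task_losses, dtype=float)
    s = np.asarray(log_vars, dtype=float)
    return float(np.sum(0.5 * np.exp(-s) * losses + 0.5 * s))

# Hypothetical per-task losses: height regression and roof-type segmentation.
losses = [2.0, 0.5]
equal_weighting = uncertainty_weighted_loss(losses, [0.0, 0.0])
# For fixed losses the objective is minimised at s_i = log(L_i), which
# down-weights the task with the larger loss.
balanced = uncertainty_weighted_loss(losses, np.log(losses))
```

The additive `0.5 * s_i` term keeps the optimizer from driving every weight to zero, so the balance between height regression and auxiliary tasks is learned instead of hand-tuned.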
5. Quantitative Performance and Benchmark Results
DSMs, when processed with state-of-the-art methods, demonstrate significant accuracy improvements across tasks:
| Task | Method / Model | Metric (dataset) | Value | Reference |
|---|---|---|---|---|
| Building-level semantic segmentation | HiRes-FusedMIM (RGB+DSM) | mIoU (Vaihingen) | 74.40% | (Mutreja et al., 24 Mar 2025) |
| DSM void inpainting | Dfilled (diffusion) | RMSE (real voids) | 2.91 m | (Panangian et al., 26 Jan 2025) |
| Stereo DSM filtering | Multi-task CNN ensemble | RMSE (Berlin) | 1.94 m | (Liebel et al., 2020) |
| Single-view height | Conditional DDPM | RMSE (Vaihingen) | 1.760 m | (Corley et al., 2023) |
| Building shape | WNet-cGAN (fused) | RMSE (Berlin) | 4.36 m | (Bittner et al., 2019) |
| DTM extraction | GrounDiff | RMSE (ALS2DTM/DALES) | 0.51 m | (Dhaouadi et al., 13 Nov 2025) |
Across these studies, multi-modal fusion, deep residual architectures, and guidance from orthophotos or foundation-model semantic features yield substantial gains in edge sharpness, volumetric accuracy, and downstream utility. For instance, including the DSM during HiRes-FusedMIM pre-training improved GeoNRW mIoU by +2.29% over RGB-only pre-training (Mutreja et al., 24 Mar 2025), while GrounDiff achieved up to 93% RMSE reduction over previous DTM extraction methods (Dhaouadi et al., 13 Nov 2025).
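Most of the figures above are RMSE values or relative RMSE reductions. As a reminder of how such numbers are computed, here is a minimal sketch that evaluates a predicted height grid against a reference while skipping void cells; the arrays and the baseline value passed to `relative_reduction` are toy data, not results from the cited papers.

```python
import numpy as np

def rmse(pred, ref):
    """Root-mean-square height error over cells valid in both grids."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    valid = ~(np.isnan(pred) | np.isnan(ref))
    return float(np.sqrt(np.mean((pred[valid] - ref[valid]) ** 2)))

def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers an error relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline

ref = np.array([[10.0, 12.0], [14.0, np.nan]])   # reference heights, one void
pred = np.array([[11.0, 12.0], [12.0, 15.0]])    # predicted heights
err = rmse(pred, ref)                            # errors 1, 0, -2 on valid cells
gain = relative_reduction(2.0, 1.0)              # toy baseline vs. improved RMSE
```

Values from different rows of the table are not directly comparable, since they refer to different datasets, ground sampling distances, and evaluation masks.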
6. Application Domains and Integrated Workflows
High-resolution DSMs are indispensable across building-scale and regional applications:
- Urban and Building Modeling: DSMs at 0.2–0.5 m GSD resolve roof topology, enable shadow/solar analyses, digital twin generation (LoD2+), and support urban planning, risk assessment, and infrastructure management (Mutreja et al., 24 Mar 2025, Batchu et al., 2024).
- Hydrology and Environmental Monitoring: DSMs support volumetric assessments, flood inundation modeling, automated water surface elevation (WSE) extraction in rivers (using CNNs, neuroevolution, or FBEWMA regression), and visibility estimation for air traffic (Szostak et al., 2021, Szostak et al., 2023, Andreu et al., 2015).
- Solar Mapping: Automated roof segmentation and DSM-based pitch/azimuth retrieval underpin gigascale solar potential mapping, as demonstrated for global Solar API deployments. Integrated DSM and affinity mask heads in Swin-B/UNet architectures yield sub-meter building MAE and pitch errors near 5° (Batchu et al., 2024).
- Change Detection and Road Modeling: DSM updating blends prior DTM information, semantic ViT features, and monocular depth for robust, up-to-date surface representation. Diffusion-based methods specifically target ground extraction for road smoothness (Rafaeli et al., 2 Apr 2026, Dhaouadi et al., 13 Nov 2025).
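DSM-based pitch and azimuth retrieval, as used in solar mapping, reduces in its simplest form to computing the slope and the downslope direction of the surface from finite differences. The sketch below assumes a north-up raster whose rows increase southward and whose columns increase eastward; the function name and the toy roof plane are illustrative, and production pipelines typically fit planes per roof segment instead of differentiating per cell.

```python
import numpy as np

def slope_aspect(dsm, gsd):
    """Per-cell pitch (degrees) and downslope azimuth (degrees from north)."""
    dsm = np.asarray(dsm, dtype=float)
    d_row, d_col = np.gradient(dsm, gsd)
    dz_dx = d_col     # change in height towards the east
    dz_dy = -d_row    # change in height towards the north
    pitch = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    # The surface faces the downslope direction; flat cells report 0 here
    # although their azimuth is undefined.
    azimuth = np.degrees(np.arctan2(-dz_dx, -dz_dy)) % 360.0
    return pitch, azimuth

# Toy roof plane rising towards the east with a 1:2 gradient.
gsd = 0.5
x = np.arange(5) * gsd
dsm = np.tile(100.0 + 0.5 * x, (4, 1))
pitch, azimuth = slope_aspect(dsm, gsd)
```

A plane that rises towards the east faces west, so the sketch returns an azimuth of 270° with a pitch of about 26.6° for every cell.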
7. Future Directions and Open Challenges
Emerging trends and challenges in DSM research include:
- Self-supervised and Domain-adaptive Completion: Zero-training, test-time adaptive height completion via foundation features expands to arbitrary sensors and domains, mitigating dataset-specific limitations (Rafaeli et al., 2 Apr 2026).
- Robust Automation of DSM→DTM Filtering: Diffusion models and confidence-gated fusion remove dependencies on manually tuned morphological parameters, promising generalized ground surface extraction (Dhaouadi et al., 13 Nov 2025).
- Integrative Multi-modality Fusion: Extending DSM workflows to fuse LiDAR, SAR, and temporal stacks aims to boost accuracy for change detection and 3D reconstruction pipelines (Mutreja et al., 24 Mar 2025).
- Computational Scalability: O(N) complexity algorithms for mosaic registration and patchwise tiling (PrioStitch) maintain tractability for wide-area DSM processing at city and region scale (Xu et al., 2024, Dhaouadi et al., 13 Nov 2025).
- Semantic-aware Structural Completion: Feature-space correspondence via ViT embeddings enables object-centric height propagation, critical for infilling missing or out-of-date urban elements (Rafaeli et al., 2 Apr 2026).
The trajectory of DSM research is characterized by combined advances in photogrammetric data acquisition, multi-modal deep learning, semantic alignment, and scalable global registration, with direct impact on domains ranging from urban simulation to environmental monitoring and global energy mapping.
References:
- (Mutreja et al., 24 Mar 2025) HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications
- (Panangian et al., 26 Jan 2025) Dfilled: Repurposing Edge-Enhancing Diffusion for Guided DSM Void Filling
- (Corley et al., 2023) Single-View Height Estimation with Conditional Diffusion Probabilistic Models
- (Stucker et al., 2021) ResDepth: A Deep Residual Prior For 3D Reconstruction From High-resolution Satellite Images
- (Batchu et al., 2024) Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping
- (Liebel et al., 2020) A Generalized Multi-Task Learning Approach to Stereo DSM Filtering in Urban Areas
- (Xu et al., 2024) Large-scale DSM registration via motion averaging
- (Rafaeli et al., 2 Apr 2026) Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
- (Dhaouadi et al., 13 Nov 2025) GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models
- (Bittner et al., 2019, Bittner et al., 2019, Qin, 2019, Szostak et al., 2021, Szostak et al., 2023, Andreu et al., 2015, Lu et al., 2019, Marà et al., 2021)