HexaLCSeg: Historical Land-Cover Segmentation
- The HexaLCSeg dataset is a high-resolution land-cover segmentation resource derived from declassified KH-9 imagery, providing built-up mapping at 1 m native and 100 m deliverable resolutions.
- It employs advanced preprocessing and segmentation methods including georeferencing, radiometric correction, and Random Forest classification to ensure spatial accuracy and consistency.
- The dataset significantly enhances dasymetric population modeling by correcting historical census misallocations, demonstrated by high validation metrics (95% overall accuracy and a Kappa of 0.90).
HexaLCSeg is a high-resolution land-cover segmentation dataset derived from declassified Hexagon KH-9 reconnaissance imagery, designed to enhance the spatial accuracy of historical built-up land mapping and subsequently improve dasymetric modeling of population distributions. Developed as part of efforts to refine 1970s rural and peri-urban population estimates in the Greater Istanbul area, HexaLCSeg provides temporally consistent built-up fractions at 1 m native and 100 m deliverable resolutions, enabling granular reconstruction of settlement patterns often missed by conventional global products (Gerrits et al., 14 Dec 2025).
1. Source Imagery and Preprocessing
HexaLCSeg is constructed from U.S. Hexagon KH-9 reconnaissance satellite film scans (declass 3 series), with target acquisition in spring 1977. The imagery is panchromatic, with a native ground sampling distance (GSD) of approximately 0.6–1.2 m per pixel. Preprocessing includes:
- Georeferencing and Orthorectification: Automated tie-point detection against basemaps (Landsat, OSM), manual ground control points on stable features, and local rubber-sheeting keep residuals within 1–2 pixels. All products are projected to World Mollweide (EPSG:54009) and resampled to a standardized 100 m grid for workflow consistency.
- Radiometric Correction: Frame-wise histogram matching to reference tiles normalizes illumination across mosaics, with linear gain/bias adjustments reducing vignetting and film-scan artefacts.
- Mosaicking and Clipping: Seam-lines are calculated with overlap-weighting, and mosaics are clipped to approximately 710 km² covering Arnavutköy (≈ 620 km²) and Çekmeköy (≈ 90 km²).
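The frame-wise histogram matching step can be illustrated with a minimal NumPy sketch. It assumes each frame and its reference tile are already loaded as 2-D arrays of digital numbers; the function name `match_histogram` is hypothetical, and this is one common way to match histograms rather than the authors' implementation.

```python
import numpy as np


def match_histogram(source, reference):
    """Remap source grey levels so their empirical CDF follows the reference CDF.

    Both inputs are 2-D arrays of digital numbers (an assumption of this sketch).
    """
    # Unique grey levels, their position in the flattened source, and their counts.
    src_vals, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True
    )
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical cumulative distribution of each image.
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size

    # For each source quantile, look up the reference grey level at that quantile.
    matched_vals = np.interp(src_cdf, ref_cdf, ref_vals)
    return matched_vals[src_idx].reshape(source.shape)


# Illustrative use with synthetic tiles (not real KH-9 data).
rng = np.random.default_rng(0)
reference_tile = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
dark_frame = 0.5 * reference_tile + 10.0  # same scene, different illumination
normalized = match_histogram(dark_frame, reference_tile)
```

Because the mapping depends only on pixel ranks, a frame that differs from its reference by a monotonic brightness change is restored to the reference grey levels.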
2. Semantic Segmentation Methodology
HexaLCSeg employs an object-based image analysis (GeOBIA) workflow implemented in eCognition (Trimble):
- Multi-Resolution Segmentation: With a scale parameter of 20, segment sizes correspond to ~20 m²; shape-to-color weight is 0.1:0.9 (favoring spectral over spatial homogeneity) and compactness is set at 0.5.
- Feature Computation: Segments are characterized by gray-level co-occurrence texture (PanTex), mean and variance, and an NDVI proxy trained using contemporary imagery to delineate sparse vegetation from built-up extents.
- Classification: A Random Forest classifier (100 trees, maximum depth 20, Gini impurity) is trained on roughly 1,000 hand-labeled segments stratified by land cover class. Input features include mean DN, PanTex entropy, segment area, and neighbor-class majority.
- Post-processing: Majority filter smoothing (3 × 3 object window) and small-object removal (merging built-up segments < 50 m² into adjacent classes) improve spatial coherence.
- Classes: The output schema follows the ESA WorldCover taxonomy: built-up, cropland, bare/sparse vegetation, grassland/shrubland, tree cover, and water bodies.
- Output Layer: The built-up layer is a binary mask at 1 m resolution, downsampled to 100 m grids, with each cell value representing the proportion of built-up cover in [0,1].
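The aggregation from the 1 m binary mask to 100 m built-up fractions amounts to a block mean. The sketch below assumes the mask is a 2-D NumPy array whose dimensions are exact multiples of the block size; `built_up_fraction` is a hypothetical helper name, and edge handling for partial blocks is left out.

```python
import numpy as np


def built_up_fraction(mask_1m, block=100):
    """Aggregate a binary 1 m mask into per-block built-up fractions in [0, 1]."""
    rows, cols = mask_1m.shape
    if rows % block or cols % block:
        raise ValueError("mask dimensions must be multiples of the block size")
    # Split the array into (block x block) tiles and average within each tile.
    tiles = mask_1m.reshape(rows // block, block, cols // block, block)
    return tiles.mean(axis=(1, 3), dtype=np.float32)


# Synthetic 200 m x 200 m mask: half of the top-left 100 m cell is built-up.
mask = np.zeros((200, 200), dtype=np.uint8)
mask[:100, :50] = 1
fraction = built_up_fraction(mask)
print(fraction)  # top-left cell is 0.5, the other three cells are 0.0
```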
3. Dataset Specifications
HexaLCSeg’s chief deliverable is the 100 m built-up fraction raster in GeoTIFF format, accompanied by:
- Intermediate Data: ESRI shapefiles or GeoPackages containing vector segments.
- Metadata: ISO 19115-style XML with fields for acquisition_date, GSD, classifier_version, segmentation_params, validation_accuracy.
- Layer Schema:
| BAND_1 (GeoTIFF) | Data Mask |
|---|---|
| Built-up fraction (0–1 float32) | valid = 1, no-data = 0 |
Temporal coverage is a single epoch (1977). The built-up mask has a native spatial resolution of 1 m; the built-up fraction raster used in population modeling workflows is delivered at 100 m.
4. Validation and Accuracy Metrics
Validation of the HexaLCSeg built-up mask leverages 500 stratified reference points per AOI, derived from high-resolution modern basemaps and aerial imagery, and 200 hand-digitized 1977 urban footprints. Stratified sampling ensures equitable allocation across six land cover classes. The confusion matrix for built-up vs. non-built yields:
| | Reference Built-up | Reference Non-built |
|---|---|---|
| Pred. Built-up | 440 | 30 |
| Pred. Non-built | 20 | 510 |
Associated metrics are:
- Overall accuracy: 0.95
- User’s accuracy (built-up): 0.936
- Producer’s accuracy (built-up): 0.957
- F1 score (built-up): ≈ 0.946
- Kappa: 0.90
This signifies high correspondence between the segmentation product and reference standards.
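The reported metrics can be recomputed directly from the four confusion-matrix counts. The following sketch uses the standard definitions of overall, user's and producer's accuracy, F1, and Cohen's Kappa; the function name `binary_metrics` is illustrative only.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy metrics for a 2 x 2 confusion matrix (built-up vs. non-built)."""
    n = tp + fp + fn + tn
    overall = (tp + tn) / n
    users = tp / (tp + fp)        # precision: correct share of predicted built-up
    producers = tp / (tp + fn)    # recall: detected share of reference built-up
    f1 = 2 * users * producers / (users + producers)
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (overall - expected) / (1 - expected)
    return {
        "overall": overall,
        "users": users,
        "producers": producers,
        "f1": f1,
        "kappa": kappa,
    }


metrics = binary_metrics(tp=440, fp=30, fn=20, tn=510)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# overall: 0.950
# users: 0.936
# producers: 0.957
# f1: 0.946
# kappa: 0.900
```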
5. Integration with Dasymetric Population Modeling
HexaLCSeg forms a core input to the dasymetric redistribution of population in the GHS-POP workflow for 1975–1990:
- Zonal Allocation: For census zone $z$ with population $P_z$, and 100 m cell $i \in z$ endowed with built-up fraction $b_i$, each cell receives:
$$P_i = P_z \cdot \frac{b_i}{\sum_{j \in z} b_j}$$
Cells with $b_i = 0$ receive $P_i = 0$, ensuring allocation strictly to built-up extents.
- Workflow:
- Harmonize and geocode settlement-level census data.
- Align HexaLCSeg raster to census grid.
- Redistribute the zonal population $P_z(t)$ (for year $t$) to cells proportional to their built-up fraction $b_i$.
- Mosaic outputs into 100 m annual population grids.
- Annotate outputs with versioned metadata.
- Evaluate error via comparison with GHSL baseline (pixel-wise/Moran’s I, zonal sMAPE, RMSE).
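The proportional redistribution step, in which each cell's share of its zone's population equals its built-up fraction divided by the zone's summed built-up fraction, can be sketched as follows. The inputs are assumed to be co-registered NumPy arrays on the 100 m grid, and `allocate_population`, `zone_ids`, and `zone_pop` are hypothetical names. Leaving a zone unallocated when it contains no built-up cells is a choice made for this sketch; the source does not state how that case is handled.

```python
import numpy as np


def allocate_population(zone_ids, built_frac, zone_pop):
    """Distribute each zone's population over its cells in proportion to built-up fraction.

    zone_ids   : integer array, census zone id of every 100 m cell
    built_frac : float array of the same shape, built-up fraction in [0, 1]
    zone_pop   : dict mapping zone id to census population
    """
    pop = np.zeros(built_frac.shape, dtype=np.float64)
    for zone, total in zone_pop.items():
        in_zone = zone_ids == zone
        weight_sum = built_frac[in_zone].sum()
        if weight_sum > 0:
            pop[in_zone] = total * built_frac[in_zone] / weight_sum
        # Assumption: zones without any built-up cell receive no population here.
    return pop


# Toy grid with two zones; values are illustrative, not from the dataset.
zone_ids = np.array([[1, 1], [2, 2]])
built_frac = np.array([[0.2, 0.6], [0.0, 0.5]])
grid = allocate_population(zone_ids, built_frac, {1: 800.0, 2: 100.0})
```

In this toy grid the cell with zero built-up fraction receives no population, and each zone's cells sum back to the zone's census total.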
For the 1975 Arnavutköy example, population bias was +1.3%, sMAPE 0.18, RMSE 48 persons/100 m cell (built-up).
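Bias, sMAPE, and RMSE of this kind can be computed from a pair of co-registered population grids. The source does not spell out its formulas, so the sketch below assumes common definitions: bias as the percentage difference of totals, sMAPE as the mean of 2|e − b| / (|e| + |b|) over cells where the denominator is non-zero, and RMSE over all cells. The function and argument names are hypothetical.

```python
import numpy as np


def error_metrics(estimate, baseline):
    """Compare an estimated population grid against a baseline grid."""
    est = np.asarray(estimate, dtype=np.float64)
    base = np.asarray(baseline, dtype=np.float64)

    # Percentage bias of the total population (assumes a non-zero baseline total).
    bias_pct = 100.0 * (est.sum() - base.sum()) / base.sum()

    # Symmetric MAPE, skipping cells where both grids are zero (assumed definition).
    denom = np.abs(est) + np.abs(base)
    valid = denom > 0
    smape = np.mean(2.0 * np.abs(est - base)[valid] / denom[valid])

    # Root-mean-square error in persons per cell.
    rmse = np.sqrt(np.mean((est - base) ** 2))
    return bias_pct, smape, rmse


# Illustrative values only.
bias_pct, smape, rmse = error_metrics([110.0, 90.0, 0.0], [100.0, 100.0, 0.0])
```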
6. Applications and Comparative Significance
The integration of HexaLCSeg into historical population modeling enables correction of misallocations prevalent in legacy global products, particularly over-assignment of population in undeveloped areas and underestimation of fragmented rural settlements. Comparative analyses show that GHSL surfaces misallocate populations, whereas the Hexagon-derived built-up fractions rectify such errors, especially in peri-urban and rural mosaics. Incorporation of LAU-2 census data further refines intra-zonal distributions, enhancing consistency with historical spatial extents. A plausible implication is that similar methods could be applied globally, leveraging the world-wide coverage of declassified reconnaissance missions to address data scarcities in diverse settings (Gerrits et al., 14 Dec 2025).
7. Data Accessibility and Future Directions
HexaLCSeg facilitates reproducible historical land-cover and population mapping for late-20th-century periods lacking high-resolution earth observation or granular census micro-data. By providing metadata-compliant rasters and vector artifacts, it supports integration with other spatial demographic frameworks. While current coverage is limited to select Istanbul metropolitan peripheries for 1977, the methodology’s reliance on globally available KH-9 archives suggests potential scaling to other metropolitan or data-scarce regions, subject to local ground-truth constraints and continued digitization of historical imagery (Gerrits et al., 14 Dec 2025).