LandCoverNet: Global High-Res Land Cover Dataset
- LandCoverNet is a globally benchmarked land cover dataset derived from multispectral Sentinel-2 imagery, offering 10 m resolution through advanced processing and consensus annotation.
- It employs robust methodology with strict cloud filtering, temporal sampling, and alignment with GlobeLand30 to achieve high classification accuracy across 7 classes.
- The dataset supports practical applications like SDG monitoring, precision agriculture, urban mapping, and environmental modeling, and is openly licensed under CC BY 4.0.
LandCoverNet is a globally benchmarked, pixel-wise land cover training dataset derived from multispectral Sentinel-2 satellite imagery at 10 m spatial resolution, designed to support high-resolution land cover classification for monitoring Sustainable Development Goals (SDGs), precision agriculture, urban assessment, and environmental modeling (Alemohammad et al., 2020). By combining advanced pre-processing pipelines, consensus-based human annotation, and robust spatial and temporal sampling strategies, LandCoverNet addresses prior gaps in geographic coverage, label-agreement confidence, and open licensing.
1. Motivation and Dataset Scope
LandCoverNet was initiated to fulfill the demand for high-quality, globally representative, and openly licensed land cover (LC) training data, crucial for consistent SDG monitoring across agricultural, ecological, and urban domains (Alemohammad et al., 2020). Previous LC benchmarks such as MODIS or GlobeLand30 have suffered from inadequate spatial resolution (250 m–1 km) or restrictive licenses. LandCoverNet’s v1.0 release covers Africa with 1,980 image chips (256×256 pixels each, totaling approximately 589 million labeled pixels). Planned versions will expand to 9,000 chips (from 300 tiles) globally. The primary sensor is Sentinel-2 MSI L2A, leveraging both 10 m and resampled 20 m spectral bands for annual pixel-wise classification.
2. Data Sources and Preprocessing Pipeline
Sentinel-2 Imagery
- 10 spectral bands employed: B2 (blue), B3 (green), B4 (red), B8 (NIR) at 10 m; B5–B7, B8A, B11, B12 (red-edge, SWIR) at 20 m, resampled to 10 m.
- Temporal sampling: 24 “least-cloudy” scenes per tile per year, including one per calendar month and the 12 lowest-cloud scenes available. All valid pixels are interpolated, forming a 240-dimensional annual feature vector per pixel (Alemohammad et al., 2020).
- Cloud-filtering: Only non-cloudy, non-shadow, and non-saturated pixels are included; pixels with cloud probability ≥10% are discarded (Nachmany et al., 2018).
- Initial experiments used four 100 km × 100 km tiles in Europe for development, transitioning to broad global coverage contingent on pipeline scalability (Nachmany et al., 2018).
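The temporal sampling rule above (one scene per calendar month plus the 12 least-cloudy scenes, for 24 total per tile-year) can be sketched as follows; this is an illustrative implementation, and the scene list and cloud percentages are placeholders, not drawn from the dataset API.

```python
from datetime import date

def select_scenes(scenes, total=24):
    """Pick up to `total` scenes per tile-year: the least-cloudy scene
    from each calendar month, then fill the remaining slots with the
    overall least-cloudy scenes not already chosen.

    scenes: list of (acquisition_date, cloud_pct) tuples.
    """
    # Least-cloudy scene per calendar month (up to 12 picks).
    by_month = {}
    for d, cloud in scenes:
        best = by_month.get(d.month)
        if best is None or cloud < best[1]:
            by_month[d.month] = (d, cloud)
    chosen = set(by_month.values())

    # Fill remaining slots with the globally least-cloudy scenes.
    for d, cloud in sorted(scenes, key=lambda s: s[1]):
        if len(chosen) >= total:
            break
        chosen.add((d, cloud))
    return sorted(chosen)
```

With 24 selected scenes and 10 spectral bands, stacking all band values per pixel yields the 240-dimensional annual feature vector described above.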
GlobeLand30 Labels and Alignment
- Reference source: GlobeLand30 2010 dataset at 30 m resolution, reported accuracy >80%.
- Preprocessing: Nearest-neighbor regridding to 10 m grid followed by strict agreement filtering. Only pixels where GlobeLand30 class exactly matches the corresponding scene-level Sentinel-2 20 m classification are retained for training; no partial-thresholding allowed (Nachmany et al., 2018).
- Agreement criterion: any pixel whose contemporaneous Sentinel-2-derived class disagrees with the historical GlobeLand30 label is excluded from training.
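The alignment step can be illustrated with a small sketch: each 30 m GlobeLand30 pixel maps to a 3×3 block of 10 m pixels (nearest-neighbor), and only pixels whose regridded class exactly matches the Sentinel-2 scene classification are retained. The class codes here are placeholders.

```python
def regrid_nn(labels_30m):
    """Nearest-neighbor regrid of a 30 m label grid to 10 m:
    each 30 m pixel becomes a 3x3 block of identical 10 m pixels."""
    out = []
    for row in labels_30m:
        expanded = [v for v in row for _ in range(3)]
        out.extend([expanded[:] for _ in range(3)])
    return out

def agreement_mask(globeland_10m, sentinel_10m):
    """Retain only pixels where the regridded GlobeLand30 class
    exactly matches the scene-level Sentinel-2 classification."""
    return [[g == s for g, s in zip(grow, srow)]
            for grow, srow in zip(globeland_10m, sentinel_10m)]
```

Pixels where the mask is False are simply dropped from the training pool, implementing the strict (non-thresholded) agreement filter.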
3. Class Taxonomy and Annotation Procedure
LandCoverNet defines seven Level-3 classes (grouped into Level-1/2 hierarchies) through annual NDVI-type time-series inspection:
- Bare: Snow/Ice, Water, Bare ground–Artificial (urban, quarries), Bare ground–Natural (soil, deserts)
- Vegetation: Woody (trees, shrubs), Non-woody–Cultivated (crops), Non-woody–(Semi)Natural (grasslands) (Alemohammad et al., 2020)
Class assignments use a vegetation cover threshold:
- “Bare”: annual maximum vegetation cover below 10 %
- “Vegetation”: annual maximum vegetation cover of 10 % or more
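The Level-1 split reduces to a one-line rule on the annual vegetation-cover time series; expressing cover as a fraction in [0, 1] is an assumption of this sketch.

```python
def level1_class(annual_veg_cover):
    """Assign the Level-1 class from a pixel's annual vegetation-cover
    series: 'Bare' if the annual maximum stays below 10%,
    otherwise 'Vegetation'."""
    return "Vegetation" if max(annual_veg_cover) >= 0.10 else "Bare"
```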
For annotation, a model-assist workflow is used:
- Guess-labels via tile-wise Balanced Random Forest, mapping GlobeLand30 (2010) classes onto LandCoverNet taxonomy and discarding pixels with Sentinel-2 SCL/classification mismatch.
- 75% train / 25% test split among retained pixels.
- Human annotators (n=3 per task) label chips using full temporal RGB/NIR/NDVI/NDWI/SWIR views with pan/zoom; skill calibrated against 540 expert-labeled control tasks.
- Consensus label for each pixel is computed via accuracy-weighted Bayesian averaging: s(c) = Σ_i w_i · 1[y_i = c] / Σ_i w_i, where y_i is annotator i's label, w_i is that annotator's accuracy estimated from the control tasks, and 1[·] is the indicator function (Alemohammad et al., 2020). The pixel is assigned to the class c with the highest s(c), and s(c) serves as the pixel's consensus score.
4. Dataset Structure, Access, and Quality Assessment
Structure and Formats
- Data hierarchy: Tile (100×100 km), Chip (256×256 px), Task (32×32 px).
- Each chip: GeoTIFF stack of 24-scene Sentinel-2 bands, per-pixel label raster (7 classes), and consensus score (0–1).
- Metadata: tile/chip IDs, bounds, year, scene dates, consensus histogram.
Licensing and Distribution
- CC BY 4.0 license.
- Available through Radiant MLHub (www.mlhub.earth) and www.landcover.net via REST API or Python client (Alemohammad et al., 2020).
Quality Assessment
- Annotator competence calibrated on 540 gold-standard expert tasks.
- Africa v1.0 consensus distribution:
- ≃60% pixels with score = 1.0 (perfect consensus)
- ≃34% in (0.6, 1.0)
- ≃6% < 0.6
- This suggests strong inter-annotator reliability for primary classes; pixels of low consensus may be down-weighted or reviewed (Alemohammad et al., 2020).
5. Quantitative Evaluation and Experimental Outcomes
Scene-level Random Forest classifiers (scikit-learn RandomForestClassifier) trained on the initial European tiles yielded:
- Mean overall accuracy (OA): 88.75% across four tiles; all scenes >80% (Nachmany et al., 2018)
- High OA (>95%): water bodies, permanent snow/ice
- Moderate OA (85–90%): woody, cultivated vegetation—benefiting from red-edge bands
- Lower OA (80–85%): wetlands, semi-natural vegetation, reflecting nuanced spectral character
- Producer’s accuracy (PA) and user’s accuracy (UA) are computed per class from the confusion matrix n_ij (reference class i, predicted class j): PA_c = n_cc / Σ_j n_cj (correctly classified pixels over the reference total for class c, i.e. the complement of omission error) and UA_c = n_cc / Σ_i n_ic (correctly classified pixels over the predicted total for class c, i.e. the complement of commission error).
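These per-class accuracies follow directly from the confusion matrix; a minimal sketch, with n[i][j] counting reference-class-i pixels predicted as class j:

```python
def producers_users_accuracy(confusion):
    """Per-class producer's (PA) and user's (UA) accuracy from a
    confusion matrix confusion[i][j] = count of pixels with reference
    class i predicted as class j."""
    k = len(confusion)
    pa, ua = [], []
    for c in range(k):
        ref_total = sum(confusion[c][j] for j in range(k))   # row sum
        pred_total = sum(confusion[i][c] for i in range(k))  # column sum
        pa.append(confusion[c][c] / ref_total if ref_total else 0.0)
        ua.append(confusion[c][c] / pred_total if pred_total else 0.0)
    return pa, ua
```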
Confusions were most prominent between artificial bare ground and seasonally unclassified/mixed ground truth pixels, attributed to off-season agricultural signatures.
6. Applications, Limitations, and Future Directions
Applications
LandCoverNet supports:
- Global/regional LC modeling for SDG indicators
- Precision agriculture (crop vs. natural vegetation differentiation)
- Urban expansion and impervious surface mapping
- Water-body monitoring and change detection
- Benchmarking time-series algorithms (e.g., TCNs, LSTMs, TempCNNs) (Alemohammad et al., 2020)
Limitations
- Class taxonomy is restricted to 7 major classes; lacks crop/forest subclasses and excludes snow/ice in Africa v1.0.
- Human labeling error is present; pixels with low consensus require prudence.
- Despite stratified sampling by MODIS-IGBP, rare LC-geography combinations may be underrepresented.
- The published dataset does not impose fixed splits; users should utilize consensus scores for weighting and sample selection.
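The recommendation to use consensus scores for weighting can be realized with a simple scheme such as the following; the linear ramp and the 0.6 floor are illustrative assumptions, not part of the dataset.

```python
def sample_weights(consensus_scores, floor=0.6):
    """Down-weight low-consensus pixels when building a training set:
    full weight at perfect consensus (1.0), linearly reduced below it,
    and zero below the floor. The linear scheme and 0.6 floor are
    illustrative choices, not prescribed by the dataset."""
    weights = []
    for s in consensus_scores:
        if s < floor:
            weights.append(0.0)
        else:
            weights.append((s - floor) / (1.0 - floor))
    return weights
```

Such weights can be passed to most classifiers (e.g. via scikit-learn's `sample_weight` argument) or used to exclude low-consensus pixels outright.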
Future Directions
Planned expansions include global coverage, annual label aggregation via temporal consensus rules
with cloud-free observations, and extensive crowdsourced validation through MLHub.Earth to support crop-specific and thematic labeling (Nachmany et al., 2018).
7. Comparison with Prior Datasets
LandCoverNet represents an advance over GlobeLand30 and MODIS by:
- Providing 10 m pixel-wise labels derived from annual time-series of Sentinel-2, rather than single-epoch or low-resolution products
- Open licensing, enabling unrestricted use
- Consensus-based annotation protocol that yields a quantifiable agreement score for each pixel, allowing granularity in sample selection

A plausible implication is that LandCoverNet will serve as a foundational resource for benchmarking and improving global LC modeling, as well as for downstream tasks requiring spatially and temporally explicit ground truth (Alemohammad et al., 2020; Nachmany et al., 2018).