Multi-Temporal Sentinel-2 Imagery
- Multi-temporal Sentinel-2 imagery is a dense time series of satellite data with 10 m resolution, high revisit frequency, and multi-spectral coverage for precise Earth observation.
- Analysis of such data employs deep learning techniques such as GRU, LSTM, ConvLSTM, and temporal attention to extract spatiotemporal features from seasonal and phenological dynamics.
- Fusion strategies integrating multi-modal inputs and temporal aggregation enhance mapping accuracy and robust change detection across diverse land cover applications.
Multi-temporal Sentinel-2 imagery refers to dense time series of satellite data collected by the ESA Sentinel-2 system, which offers decametric spatial resolution (10 m for core bands), high revisit frequency (global median ~5 days), and broad multi-spectral coverage. Such imagery supports diverse Earth observation tasks including land cover mapping, agricultural potential estimation, field boundary delineation, change detection, urban mapping, and super-resolution reconstruction. Multi-temporal approaches exploit the temporal dynamics (phenology, seasonality, disturbance events) implicit in sequences of surface reflectance, often in combination with derived vegetation indices and cloud-filtering protocols.
1. Data Acquisition and Temporal Structure
Sentinel-2 provides Level-1C (TOA) and Level-2A (BOA) reflectance products with 13 spectral bands: four at 10 m (Blue B2, Green B3, Red B4, NIR B8), six at 20 m (B5, B6, B7, B8A, B11, B12), and three at 60 m (atmospheric). Typical multi-temporal datasets assemble between 10 and >100 dates per site per annum, selected to minimize cloud cover (e.g., filtering to <2–5 % cloudy pixels per scene) and to maximize temporal regularity (e.g., monthly, seasonal medians, or custom phenological benchmarks) (Sakka et al., 13 Jun 2025, Dimitrovski et al., 1 Oct 2024, Zahid et al., 24 Nov 2024, Sultana et al., 12 Dec 2025). Cloud gaps are commonly filled by linear interpolation or gap-filling algorithms (Gbodjo et al., 2019), or alternatively dropped or smoothed via monthly averaging/tabular aggregation (Dimitrovski et al., 1 Oct 2024, Garioud et al., 2023).
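As a minimal sketch of the cloud-filtering and gap-filling step (array layouts and the cloud-mask source are assumptions, not a protocol from the cited studies), the following fills cloud-masked observations by linear interpolation along the acquisition dates and optionally aggregates to monthly medians:

```python
import numpy as np

def gap_fill_series(reflectance, cloud_mask, timestamps):
    """Linearly interpolate cloud-masked observations along the time axis.

    reflectance : (T, B) array of per-date band values for one pixel/object
    cloud_mask  : (T,) boolean array, True where the observation is cloudy
    timestamps  : (T,) acquisition dates as ordinal days or day-of-year
    """
    filled = reflectance.astype(float).copy()
    t = np.asarray(timestamps, dtype=float)
    valid = ~cloud_mask
    if valid.sum() >= 2:  # need at least two clear dates to interpolate
        for b in range(filled.shape[1]):
            filled[:, b] = np.interp(t, t[valid], filled[valid, b])
    return filled

def monthly_median(reflectance, months):
    """Aggregate a (T, B) series into per-month medians, ignoring NaNs."""
    out = []
    for m in range(1, 13):
        sel = months == m
        out.append(np.nanmedian(reflectance[sel], axis=0) if sel.any()
                   else np.full(reflectance.shape[1], np.nan))
    return np.stack(out)  # (12, B)
```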
Per-date variables include:
- Native reflectance bands (B2, B3, B4, B8), possibly with upsampling of 20/60 m bands to 10 m or 5 m in research settings (Sakka et al., 13 Jun 2025, Tarasiewicz et al., 2023).
- Derived NDVI, computed as NDVI = (B8 − B4)/(B8 + B4) from the NIR and red bands; other indices such as EVI, SAVI, NDWI, IRECI, or texture metrics sometimes accompany it (Sultana et al., 12 Dec 2025, Zahid et al., 24 Nov 2024). An index-stack sketch follows this list.
- Vegetation index stacks reflecting seasonal dynamics (Zahid et al., 24 Nov 2024, Sultana et al., 12 Dec 2025).
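The index stack referenced above can be sketched as follows; band array names and the choice of EVI/NDWI formulations are illustrative assumptions (standard MODIS-style EVI, McFeeters NDWI), not tied to a specific cited configuration:

```python
import numpy as np

def index_stack(b2, b3, b4, b8, eps=1e-6):
    """Compute a per-date vegetation/water index stack from the 10 m bands.

    Inputs are (T, H, W) reflectance arrays scaled to [0, 1]:
    b2=blue, b3=green, b4=red, b8=NIR.  Returns (T, 3, H, W).
    """
    ndvi = (b8 - b4) / (b8 + b4 + eps)
    evi = 2.5 * (b8 - b4) / (b8 + 6.0 * b4 - 7.5 * b2 + 1.0 + eps)
    ndwi = (b3 - b8) / (b3 + b8 + eps)  # McFeeters NDWI (green vs. NIR)
    return np.stack([ndvi, evi, ndwi], axis=1)
```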
Object-based aggregation is applied in some studies for noise reduction: high-res segmentation yields super-pixels/objects, followed by per-date averaging to form object-level multivariate time series (Gbodjo et al., 2019, Benedetti et al., 2018).
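A sketch of this object-based aggregation, assuming a precomputed integer segment map and a per-date image cube (shapes and variable names are hypothetical):

```python
import numpy as np

def object_time_series(cube, segments):
    """Average a multi-temporal image cube over segments to get object series.

    cube     : (T, B, H, W) multi-temporal, multi-band array
    segments : (H, W) integer label map with object IDs 0..K-1
    returns  : (K, T, B) object-level multivariate time series
    """
    T, B, H, W = cube.shape
    labels = segments.ravel()
    K = labels.max() + 1
    counts = np.bincount(labels, minlength=K)   # pixels per object
    flat = cube.reshape(T, B, H * W)
    series = np.zeros((K, T, B))
    for t in range(T):
        for b in range(B):
            sums = np.bincount(labels, weights=flat[t, b], minlength=K)
            series[:, t, b] = sums / np.maximum(counts, 1)
    return series
```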
2. Temporal Feature Extraction and Model Formulations
Deep learning is dominant in extracting the spatio-temporal signatures embedded in multi-temporal Sentinel-2 sequences. Principal architectures include:
- Recurrent units: Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) blocks, often with fully connected input enrichment, are standard for processing per-object or per-pixel time series (Gbodjo et al., 2019, Benedetti et al., 2018, Mazzia et al., 2020). FCGRU (fully-connected enriched GRU) formulations expand the raw input via learned nonlinear projections prior to gating (a combined FCGRU-plus-attention sketch follows this list):
$$\tilde{x}_t = \tanh\!\left(W_2\,\tanh\!\left(W_1 x_t + b_1\right) + b_2\right), \qquad h_t = \mathrm{GRU}\!\left(\tilde{x}_t,\, h_{t-1}\right)$$
(cf. (Gbodjo et al., 2019), Eq. 1), where $x_t$ is the raw per-date input, $W_1, W_2, b_1, b_2$ are the learned enrichment parameters, and $h_t$ the resulting hidden state.
- Temporal attention mechanisms: Learnable attention weights on hidden states improve selective focus, with both softmax and tanh activations used. In HOb2sRNN, customized tanh-attention (without normalization to sum-to-1) allows up- or down-weighting each time step independently (including negative contributions), critical for handling strongly seasonal or ambiguous phenology (Gbodjo et al., 2019):
$$\lambda_t = \tanh\!\left(u_a^{\top} \tanh\!\left(W_a h_t + b_a\right)\right), \qquad c = \sum_{t} \lambda_t\, h_t,$$
where $h_t$ is the hidden state of time step $t$, $W_a$, $b_a$, $u_a$ are learnable attention parameters, and $\lambda_t \in (-1, 1)$ is the unnormalized attention score of date $t$.
- ConvLSTM or attention across space and time: For spatially-resolved segmentation or change detection, convolutional LSTM layers within encoder-decoder frameworks (e.g., U-Net+ConvLSTM) are used; these maintain spatial context in hidden states while propagating temporal information (Papadomanolaki et al., 2019, Dimitrovski et al., 1 Oct 2024). Recent transformer-based architectures (multi-head self-attention over temporal tokens) target long series (e.g., 45 dates) for domain-adversarial adaptation (Martini et al., 2021) and sequence modeling. A minimal ConvLSTM-cell sketch also follows this list.
- Self-supervised and multi-modal approaches: Self-supervised pretraining exploits radiometric invariance between overlapping patches and triplet-margin losses to learn transferrable features for change detection (Leenstra et al., 2021). Multi-modal fusion combines Sentinel-2 with other modalities (e.g., Sentinel-1 SAR, aerial VHR, PlanetScope) either by late fusion in latent space (Hafner et al., 2023, Dimitrovski et al., 1 Oct 2024, Garioud et al., 2023), or by reconstructing missing optical features from SAR (Hafner et al., 2023).
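A minimal PyTorch-style sketch of an FCGRU encoder with tanh temporal attention, in the spirit of the enrichment and attention equations above; layer widths, activations, and the classifier head are illustrative assumptions rather than the exact HOb2sRNN configuration:

```python
import torch
import torch.nn as nn

class FCGRUAttention(nn.Module):
    """Sketch of an FCGRU encoder with tanh (unnormalized) temporal attention."""

    def __init__(self, n_bands, enrich_dim=64, hidden_dim=128, n_classes=10):
        super().__init__()
        # fully connected input enrichment applied per time step before gating
        self.enrich = nn.Sequential(
            nn.Linear(n_bands, enrich_dim), nn.Tanh(),
            nn.Linear(enrich_dim, enrich_dim), nn.Tanh(),
        )
        self.gru = nn.GRU(enrich_dim, hidden_dim, batch_first=True)
        # tanh attention: scores are not softmax-normalized, so they may be
        # negative and need not sum to one
        self.att_proj = nn.Linear(hidden_dim, hidden_dim)
        self.att_vec = nn.Linear(hidden_dim, 1, bias=False)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                              # x: (batch, T, n_bands)
        h, _ = self.gru(self.enrich(x))                # (batch, T, hidden_dim)
        scores = torch.tanh(self.att_vec(torch.tanh(self.att_proj(h))))
        context = (scores * h).sum(dim=1)              # weighted sum over time
        return self.classifier(context)

# usage on a batch of 8 pixel/object series with 20 dates and 10 bands
logits = FCGRUAttention(n_bands=10)(torch.randn(8, 20, 10))
```

Because the scores are unnormalized, individual dates can be up- or down-weighted (even negatively), matching the behavior described for the customized tanh attention.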
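For spatially resolved sequences, a generic ConvLSTM cell (a textbook formulation, not a specific cited implementation) illustrates how hidden and cell states keep spatial context while temporal information propagates:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell: all four gates are computed by one convolution."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):                 # x: (batch, C_in, H, W)
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)            # update cell state
        h = o * torch.tanh(c)                    # update hidden state
        return h, c

# unroll over a (batch, T, C, H, W) Sentinel-2 patch sequence
cell = ConvLSTMCell(in_channels=4, hidden_channels=16)
seq = torch.randn(2, 6, 4, 64, 64)
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros_like(h)
for t in range(seq.shape[1]):
    h, c = cell(seq[:, t], (h, c))               # h holds spatio-temporal features
```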
3. Fusion Strategies: Temporal, Spectral, Modal, and Spatial
Multi-temporal Sentinel-2 imagery is maximally exploited using advanced fusion schemes:
- Temporal fusion: Recursive pairwise merging (e.g., HighResNet) (Okabayashi et al., 25 Apr 2024), temporal max-pooling across features (Jindgar et al., 25 Sep 2024, Martini et al., 2021), and permutation-invariant mean-pooling (SPInet) (Valsesia et al., 2022); a pooling sketch follows this list.
- Spectral fusion: Simultaneous integration across bands at each GSD, and cross-resolution fusion for super-resolving lower-resolution bands (DeepSent) (Tarasiewicz et al., 2023).
- Modal fusion: Multi-source architectures combine features from Sentinel-2 optical, Sentinel-1 radar, VHR aerial, or PlanetScope, with dedicated feature branches and fusion nodes (Gbodjo et al., 2019, Benedetti et al., 2018, Hafner et al., 2023, Garioud et al., 2023, Dimitrovski et al., 1 Oct 2024). Two-stream fusion with weighted loss/classification heads is recommended for operational land cover mapping (Gbodjo et al., 2019).
- Spatial fusion: Object aggregation, tiling, and super-pixel strategies are applied for spatial noise and speckle reduction, critical in crop mapping and land-cover segmentation (Gbodjo et al., 2019, Benedetti et al., 2018, Sakka et al., 13 Jun 2025).
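A sketch of latent-space temporal pooling covering the temporal max-pooling and permutation-invariant mean-pooling options named above, assuming per-date feature maps already produced by a shared encoder:

```python
import torch

def temporal_pool(features, mode="max"):
    """Fuse per-date latent features into a single representation.

    features : (batch, T, C, H, W) feature maps from a shared per-date encoder
    mode     : "max" (temporal max-pooling) or "mean" (permutation-invariant mean)
    """
    if mode == "max":
        return features.max(dim=1).values
    if mode == "mean":
        return features.mean(dim=1)
    raise ValueError(f"unknown mode: {mode}")

fused = temporal_pool(torch.randn(2, 12, 64, 32, 32), mode="mean")  # (2, 64, 32, 32)
```

Both poolings discard acquisition order by construction: the max variant keeps the strongest per-channel response across dates, while the mean variant weights all dates equally.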
4. Applications: Land Cover Mapping, Change Detection, Agricultural Analytics, Super-Resolution, and Field Delineation
- Land Cover and Crop Classification: Recurrent convolutional architectures (Pixel R-CNN, FCGRU+attention) learn phenological signatures to classify >15 crop/vegetation classes with overall accuracy up to 96.5 % and correspondingly high Cohen’s κ (Mazzia et al., 2020, Gbodjo et al., 2019, Benedetti et al., 2018). Object-based aggregation and multi-source fusion further improve results.
- Functional Field Boundary Extraction: Multi-date NDVI stacks facilitate boundary delineation, encoding crop growth and senescence and improving IoU by 5–8 pp compared to single-date input (Zahid et al., 24 Nov 2024). Transfer learning experiments indicate sensitivity to scale and geography; multi-region training increases generalizability.
- Change Detection: Multi-temporal image pairs enable shallow CNN-based self-supervised pretraining on unlabeled stacks, supporting unsupervised and supervised change vector analysis (Leenstra et al., 2021, Papadomanolaki et al., 2019). ConvLSTM-augmented networks outperform bi-temporal-only approaches, with F1 gains up to +1.5 pp (Papadomanolaki et al., 2019).
- Agricultural Potential Mapping: Monthly Sentinel-2 cubes are used for pixel-wise ordinal regression on viticulture, market gardening, and field crops (Sakka et al., 13 Jun 2025). Multi-label and spatio-temporal (3D-CNN, ConvLSTM) tasks are supported; baseline UNet accuracy is enhanced using ordinal targets.
- Super-Resolution: Multi-temporal fusion recovers fine spatial structure at 2.5–3.3 m GSD by merging temporal sequences with recursive fusion and prior-informed deep SISR backbones (SEN4X, DeepSent, SPInet) (Retnanto et al., 30 May 2025, Tarasiewicz et al., 2023, Valsesia et al., 2022, Okabayashi et al., 25 Apr 2024). Multi-modal super-resolved segmentation at 2.5 m (SPInet) achieves MCC=0.802–0.862, outperforming standard CNN baselines by +0.119 MCC (Valsesia et al., 2022). Temporal attention and permutation invariance increase robustness to date order and cloud noise.
- Semantic Segmentation with Pre-trained Backbones: Latent space temporal-max fusion yields +5–17 % mIoU improvement over single-image or output-fusion approaches using SWIN, U-Net, or ViT pre-trained architectures (Jindgar et al., 25 Sep 2024, Dimitrovski et al., 1 Oct 2024).
- Invasive Species Monitoring: Multi-seasonal feature engineering offers accuracy comparable to high-resolution aerial imagery, with Sentinel-2 model M76* (OA=68 %, κ=0.55) slightly outperforming the aerial reference (OA=67 %, κ=0.52). NDVI, EVI, SAVI, NDWI, IRECI, TDVI, NLI, MNLI computed per season, together with texture metrics, form the feature basis (Sultana et al., 12 Dec 2025); a feature-engineering sketch follows below.
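The multi-seasonal feature engineering can be sketched with scikit-learn as follows; the index count, season boundaries, and random-forest settings are illustrative assumptions, not the exact M76* feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def seasonal_features(index_series, months, seasons=((12, 1, 2), (3, 4, 5),
                                                     (6, 7, 8), (9, 10, 11))):
    """Per-season medians of per-date indices.

    index_series : (N, T, I) array of I indices over T dates for N samples
    months       : (T,) acquisition month of each date
    returns      : (N, 4 * I) seasonal feature matrix
    """
    feats = []
    for season in seasons:
        sel = np.isin(months, season)
        feats.append(np.nanmedian(index_series[:, sel, :], axis=1))
    return np.concatenate(feats, axis=1)

# illustrative training run on random data: 500 samples, 24 dates, 5 indices
rng = np.random.default_rng(0)
X = seasonal_features(rng.random((500, 24, 5)), months=np.tile(np.arange(1, 13), 2))
y = rng.integers(0, 3, size=500)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
```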
5. Quantitative Findings and Comparative Performance
A sampling of representative quantitative results is presented for quick reference.
| Application | Model/Method | mIoU / OA / F1 / MCC | Dataset / Region | Notable Finding |
|---|---|---|---|---|
| Land cover mapping | HOb2sRNN (S2-only) | F1=78.7–87.6 % | Reunion, Senegal | Multi-source fusion: +1 pp F1 |
| Land cover segmentation | M³Fusion GRU+att + CNN | OA=90.7 % | Reunion | Fusion head: +3 pp OA over RF |
| Crop classification | Pixel R-CNN (LSTM+CNN) | OA=96.5 % | North Italy | +20 pp above RF/SVM/XGBoost |
| Field boundary delineation | UNet (NDVI stack) | IoU=0.74 | Netherlands, Pakistan | NDVI temporal stacking: +5–8 pp IoU |
| Change detection | U-Net+ConvLSTM | OA=96 % / F1=57.78 % | OSCD urban scenes | ConvLSTM over 5 dates: +1.5 pp F1 vs. bi-temporal |
| Urban mapping (cloud cover) | U-Net (S2+S1, SAR-based reconstruction) | F1=0.423 | SpaceNet-7, 14 sites | Retains S2 features via SAR reconstruction |
| Semantic segmentation | FLAIR U-TAE branch | mIoU=39.68 % | France (IGN FLAIR) | Best when fused with aerial VHR |
| Super-resolution segmentation | SPInet (PIUnet+MRF, 2.5 m SR mask) | MCC=0.802 | AI4EO Italy | +0.12 MCC vs DeepLabv3 |
| HR SR for urban mapping | SEN4X (MISR+SISR) | mIoU_macro=51.6 % | Hanoi, Vietnam | +2.7 pp mIoU (SISR), +12.9 pp (MISR) |
| Invasive grass species | S2 RF (multi-season/phenology: M76*) | OA=68 %, κ=0.55 | Victoria, Australia | Slightly outperforms best aerial |
6. Best Practices, Limitations, and Future Directions
- Best Practices:
- Normalize input reflectances to [0,1], filter cloud-contaminated scenes.
- Aggregate input time series by object/patch or context window (e.g., 128×128); a patch-extraction sketch appears at the end of this section.
- Prefer deep temporal architectures (FCGRU+attention, ConvLSTM, temporal transformers) with supplementary attention or hierarchical pretraining for limited-label regimes (Gbodjo et al., 2019, Martini et al., 2021).
- For fusion, latent-space temporal-max, recursive multi-image fusion, and permutation-invariant mean pools are recommended.
- For operational mapping, object-based multi-temporal S2+S1 fusion with attention mechanisms is effective (Gbodjo et al., 2019).
- Multi-temporal NDVI stacking for boundary extraction leverages phenological cues better than raw bands, with reduced compute (Zahid et al., 24 Nov 2024).
- Limitations:
- Sentinel-2 spatial resolution constrains detection of sub-pixel objects (roads, narrow field boundaries); super-resolution or modal fusion partially addresses this.
- Geographic or phenological domain gaps degrade cross-region model transfer; domain-adversarial training alleviates but does not eliminate mismatch (Martini et al., 2021).
- Monthly averaging may undersample rapid events and blur phenology; finer temporal grids are preferable when computational resources allow.
- Object-based, MLP/SVM baselines approach deep model performance in label-scarce regimes but fail to match multi-modal RNNs.
- Future Directions:
- Longer time series (5–45 dates) for improved modeling of phenological cycles, weighted against compute cost.
- Advanced temporal encoders: deep transformers, attention-unified ConvLSTM/self-attention hybrids.
- Joint diffusion, adversarial, and spectral-angle mapper losses to balance fidelity and perceptual realism in SR (Okabayashi et al., 25 Apr 2024, Retnanto et al., 30 May 2025).
- Incorporation of active learning, semi-supervised labeling, or topological priors for functional field delineation (Zahid et al., 24 Nov 2024).
- Fusion with SAR, VHR, or planetary data for domain-invariant feature reuse and enhanced robustness (Hafner et al., 2023, Garioud et al., 2023, Dimitrovski et al., 1 Oct 2024).
- Expanded use for continuous variables (crop yield, density) and irregular geographic domains (Sakka et al., 13 Jun 2025, Sultana et al., 12 Dec 2025).
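A preprocessing sketch for the normalization and patching practices listed above; the 1/10000 reflectance scaling is the commonly used Sentinel-2 quantization (the exact offset depends on the processing baseline), and patch size/stride stand in for the assumed context window:

```python
import numpy as np

def normalize_reflectance(dn, scale=10000.0):
    """Convert Sentinel-2 digital numbers to [0, 1] reflectance (clipped)."""
    return np.clip(dn.astype(np.float32) / scale, 0.0, 1.0)

def extract_patches(cube, size=128, stride=128):
    """Tile a (T, B, H, W) cube into (N, T, B, size, size) patches."""
    T, B, H, W = cube.shape
    patches = []
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            patches.append(cube[:, :, i:i + size, j:j + size])
    return np.stack(patches) if patches else np.empty((0, T, B, size, size))
```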
Multi-temporal Sentinel-2 imagery forms the backbone of modern remote sensing pipelines, enabling rich statistical, deep learning, and multi-modal fusion approaches for accurate, scalable Earth surface monitoring. Multiple sequential acquisitions offer critical temporal cues for both discrete and continuous mapping tasks, rendering simple single-date/pixel approaches obsolete for most practical applications.