SECOND Dataset & SECOND-CC Overview
- SECOND Dataset is a benchmark suite for remote sensing change detection, offering pixel- and semantic-level evaluations via aligned bitemporal imagery.
- SECOND-CC extends the dataset with paired semantic maps and multiple human-generated captions to robustly assess natural language descriptions of changes.
- The resource aids urban planning and disaster assessment by simulating real-world challenges like misregistration, illumination shifts, and seasonal variations.
The SECOND Dataset (SEmantic Change detectiON Dataset) is a benchmark suite for remote sensing change detection. Derivative works such as the SECOND-CC dataset, designed for Remote Sensing Image Change Captioning (RSICC), extend its scope to natural language description of bitemporal changes. These resources address the need for robust, real-world evaluation of remote-sensing algorithms under diverse conditions involving illumination, viewpoint, and georegistration challenges (Karaca et al., 17 Jan 2025). This article focuses primarily on SECOND-CC, the most significant and widely cited extension, owing to its detailed annotation protocols, semantic map coverage, and natural language captioning utility.
1. Origin, Scope, and Motivation
The SECOND Dataset was created to fill gaps in standardized evaluation for pixel-level and semantic-level change detection in aerial/satellite imagery. SECOND-CC, its change-captioning extension, specifically targets the RSICC task: generating natural language descriptions of changes between bitemporal RGB image pairs. The motivation for SECOND-CC is to benchmark neural and hybrid architectures under real-world conditions—such as spatial misalignment, variable illumination, seasonal changes, blurring, and different ground resolutions—that strongly impact algorithm performance but are underrepresented in synthetic or idealized datasets.
SECOND-CC leverages multisource high-resolution imagery from Hangzhou, Chengdu, and Shanghai, drawn from the original SECOND dataset. It introduces paired semantic segmentation masks, multiple per-pair human captions, and explicit land-cover transition categories to support both change detection and change captioning tasks.
2. Data Composition and Annotation Pipeline
The SECOND-CC dataset comprises 6,041 pairs of co-registered bitemporal RGB images, each 512×512 pixels (further split into non-overlapping 256×256 tiles, which reduces per-sample complexity and enlarges the sample count). Each image pair is accompanied by two single-channel semantic masks, one per timepoint, encoding six land-cover classes: low vegetation, non-vegetated ground, tree, water, building, and playground. No-change pixels remain unlabeled.
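The 512×512-to-256×256 tiling can be sketched in a few lines of NumPy (a minimal illustration; the dataset's actual preprocessing scripts may differ in detail):

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 256) -> list:
    """Split an H x W (x C) array into non-overlapping tile x tile patches,
    scanning row-major (top-left to bottom-right)."""
    h, w = img.shape[0], img.shape[1]
    return [
        img[y:y + tile, x:x + tile]
        for y in range(0, h - tile + 1, tile)
        for x in range(0, w - tile + 1, tile)
    ]

# Each 512x512 RGB image yields four 256x256 tiles.
rgb = np.zeros((512, 512, 3), dtype=np.uint8)
tiles = tile_image(rgb)
print(len(tiles), tiles[0].shape)  # 4 (256, 256, 3)
```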
Human annotation was conducted by seven in-house labelers over approximately one year. Each image pair received five expertly crafted captions, guided by a protocol specifying change type, location, color, shape, and intensity descriptors while avoiding redundancy, trivial artifacts, and excessive vocabulary growth (kept below 2,000 unique tokens). The final corpus includes 30,205 sentences. Each pair is additionally assigned one “most significant change” subcategory label from a set of 30 defined land-cover transitions, plus a no-change category.
Quality control steps included peer review against standardized guidelines, with rounds of revision for outliers (excessively short, long, or off-topic captions). Semantic segmentation maps were registered tile-wise to maximize alignment fidelity with the RGB imagery.
| Data Component | Description / Value | Notes |
|---|---|---|
| RGB Pairs | 6,041 pairs (split to 256×256 tiles) | Mixed 0.5–3 m/pixel; 3 bands (uint8) |
| Semantic Maps | 2 per pair (12,082 total) | Classes: 6 (+ unlabelled for no-change) |
| Human Captions | 5 per pair (30,205 total sentences) | Avg. length ≈ 10.4 words (σ ≈ 4.8) |
| Change Categories | 30 transitions + 1 no-change | Distribution balanced across splits |
| Dataset Splits | Train: 4,219 / Val: 595 / Test: 1,227 | Proportion 7:1:2 |
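The split sizes in the table can be sanity-checked directly (pair counts taken from the table above):

```python
# Pair counts per split, as listed in the dataset table.
splits = {"train": 4219, "val": 595, "test": 1227}
total = sum(splits.values())
print(total)  # 6041, matching the full pair count
for name, n in splits.items():
    # Proportions come out close to the stated 7:1:2 ratio.
    print(f"{name}: {n / total:.1%}")
```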
3. Technical Attributes and Dataset Structure
The dataset structure is hierarchical. Each split (train, validation, test) contains:
- /rgb/: before/after images
- /sem/: paired semantic maps
- captions.json: ID-to-caption mapping (five per pair)
- categories.csv: subcategory labels
The pairs are distributed approximately 72% in the “Change” class and 28% in “No-Change.” Change subcategories (e.g., “building→tree,” “non-veg ground→building”) are proportionally represented; the largest subcategory involves non-vegetated ground transitioning to building or vice versa.
Semantic maps use PNG or an equivalent indexed single-channel format, in which pixel values in {0,…,5} correspond to land-cover classes and a reserved sentinel value (255, or 0 in some encodings) marks unclassified/no-change pixels. Image and semantic masks are spatially aligned, with misregistration effects intentionally preserved to mirror operational deployment scenarios.
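Given this mask convention, per-pixel land-cover transitions can be tallied with NumPy. The sketch below assumes the class ordering listed in Section 2 and a 255 sentinel for unlabeled pixels; the exact encoding should be verified against the released files:

```python
import numpy as np

# Class order assumed to follow the dataset description (Section 2).
CLASSES = ["low vegetation", "non-vegetated ground", "tree",
           "water", "building", "playground"]
UNLABELED = 255  # assumed sentinel for unclassified/no-change pixels

def change_transitions(sem_t1: np.ndarray, sem_t2: np.ndarray) -> dict:
    """Map (from_class, to_class) -> pixel count over labeled changed pixels."""
    labeled = (sem_t1 != UNLABELED) & (sem_t2 != UNLABELED)
    changed = labeled & (sem_t1 != sem_t2)
    stacked = np.stack([sem_t1[changed], sem_t2[changed]])  # shape (2, N)
    if stacked.shape[1] == 0:
        return {}
    cols, counts = np.unique(stacked, axis=1, return_counts=True)
    return {(CLASSES[a], CLASSES[b]): int(n)
            for (a, b), n in zip(cols.T, counts)}

t1 = np.full((4, 4), UNLABELED, dtype=np.uint8)
t2 = t1.copy()
t1[0, 0] = 4  # building at time 1
t2[0, 0] = 2  # tree at time 2
print(change_transitions(t1, t2))  # {('building', 'tree'): 1}
```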
4. Evaluation Protocols and Metrics
SECOND-CC adopts widely accepted natural language evaluation metrics as referenced in COCO and RS captioning benchmarks:
- BLEU-N (N=1…4): n-gram precision with brevity penalty.
- CIDEr-D: consensus-based n-gram similarity with TF-IDF weighting.
- METEOR: alignment-based recall and precision.
- ROUGE_L: longest common subsequence F-score.
- SPICE: semantic proposition overlap.
The key aggregate metric is the arithmetic mean of the individual scores,

Score = (BLEU-4 + METEOR + ROUGE_L + CIDEr-D + SPICE) / 5,

where, for "No-Change" cases (where reference captions are identical), CIDEr-D is omitted and the mean is taken over the remaining four metrics.
Primary metric computations are carried out per tile and per caption, using five references per candidate. All details, including reference-length normalization and clipped n-gram precision for BLEU, adhere to COCO/NLG standards.
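A minimal sketch of the aggregation step, assuming the five per-pair metric scores have already been produced by a COCO-style evaluation toolkit; the dictionary keys and example values are illustrative:

```python
METRICS = ["BLEU-4", "METEOR", "ROUGE_L", "CIDEr-D", "SPICE"]

def aggregate_score(scores: dict, no_change: bool = False) -> float:
    """Mean of the five metrics; CIDEr-D is dropped for No-Change pairs,
    whose identical references make consensus weighting degenerate."""
    keys = [m for m in METRICS if not (no_change and m == "CIDEr-D")]
    return sum(scores[m] for m in keys) / len(keys)

# Hypothetical per-pair scores, for illustration only.
example = {"BLEU-4": 0.40, "METEOR": 0.30, "ROUGE_L": 0.55,
           "CIDEr-D": 1.20, "SPICE": 0.25}
print(round(aggregate_score(example), 3))        # 0.54  (mean of all five)
print(round(aggregate_score(example, True), 3))  # 0.375 (CIDEr-D omitted)
```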
5. Applications and Benchmarking Context
SECOND-CC is utilized for:
- Urban planning and environmental monitoring: Automated summarization of construction, land-use, and vegetation changes.
- Disaster assessment: Rapid textual reporting of infrastructure damage or redevelopment from multisource remote imagery.
- Algorithmic benchmarking: Robustness testing for models incorporating both visual (RGB) and semantic (mask) features, particularly under spatial misalignment, seasonal/illumination variation, and variable sampling distance.
The dataset has enabled evaluation of models such as RSICCformer, Chg2Cap, and PSNet, and supports thorough ablation and attention-visualization studies (Karaca et al., 17 Jan 2025). MModalCC, the accompanying baseline framework, demonstrated improvements of +4.6% BLEU-4 and +9.6% CIDEr over prior methods.
SECOND-CC’s use of challenging real-world artifacts distinguishes it from synthetic or perfectly-aligned RSICC datasets, offering a more realistic performance ceiling.
6. Limitations and Known Caveats
Key limitations include:
- Misalignment and Noise: Ground-truth registration preserves spatial misalignments (a few pixels), seasonal and illumination artifacts, and resolution mismatches, which may overwhelm models in subtle-change scenarios.
- Partial Semantic Annotation: Only change regions are manually labeled in semantic maps; non-change areas lack class annotation, complicating fully supervised end-to-end learning.
- Geographic Domain Specificity: With source imagery limited to Hangzhou, Chengdu, and Shanghai, SECOND-CC may not generalize to rural, non-Chinese, or non-urban land types without additional fine-tuning or domain adaptation.
- Annotation Dependency: Human captioning, while peer-reviewed, may still reflect subjective saliency or annotator biases within change description granularity.
- Scalability for Semantic Masks: Future extensions may need to incorporate automatically derived masks (with segmentation errors) to test real-world pipeline robustness.
7. Access, Replicability, and Future Directions
SECOND-CC is distributed via GitHub (https://github.com/ChangeCapsInRS/SecondCC) with full image, segmentation, and annotation structure. The data organization supports seamless scripting for supervised, semi-supervised, and multimodal fusion models. The dataset’s design and accompanying protocols are extensible, enabling adaptation to new metropolitan areas or integration with additional semantic classes.
A plausible implication is that widespread usage and further extensions of SECOND-CC could inform standards for RSICC benchmark development, especially in the context of automatic annotation quality assessment and urban-scale change detection under adverse conditions (Karaca et al., 17 Jan 2025). Future work is expected to explore model transferability across diverse remote sensing domains and to systematically evaluate the impact of imperfect semantic segmentation on captioning accuracy.