SocioSeg Dataset: Urban & Social Network Analysis
- SocioSeg is a dual dataset comprising an urban benchmark for socio-semantic segmentation and a large-scale social network dataset for analyzing socio-economic segregation.
- The urban dataset features a three-tiered hierarchy with 13,000 geo-referenced scenes and rigorous annotation, structured for training, validation, and testing.
- The social network dataset maps multi-layer relationships among 17M+ residents, using robust metrics to quantify income and categorical assortativity.
SocioSeg refers to two distinct datasets: (1) the Urban Socio-Semantic Segmentation dataset, focused on hierarchical socio-semantic segmentation in overhead urban imagery, and (2) the SocioSeg social-network dataset for large-scale analysis of socio-economic segregation among the Dutch population. Both are characterized by scale, methodological rigor, and taxonomies or network structures that allow social structure and urban environments to be studied at a granularity not previously available (Wang et al., 15 Jan 2026, Kazmina et al., 2023).
1. Urban Socio-Semantic Segmentation Dataset (SocioSeg): Overview and Taxonomy
The Urban Socio-Semantic Segmentation dataset named SocioSeg is the first large-scale, hierarchically organized benchmark for the segmentation of socially defined urban entities (e.g., schools, parks, hospitals) from overhead (satellite) imagery (Wang et al., 15 Jan 2026). It comprises approximately 13,000 geo-referenced scenes spanning all provinces and major municipalities in China. Each scene contains:
- A high-resolution ($0.5$–$1$ m GSD) $512 \times 512$ RGB satellite image.
- A co-registered $512 \times 512$ digital map tile, both sourced from the Amap public API.
The dataset's core innovation is its three-tiered socio-semantic taxonomy:
- Socio-Name: 5,000 classes, each corresponding to a named Area of Interest (AOI), e.g., “Beijing Normal University.”
- Socio-Class: 90 third-level point-of-interest (POI) categories (e.g., "college," "park," "hospital").
- Socio-Function: 10 urban function super-categories aggregating the 90 Socio-Classes (e.g., "educational," "recreational," "healthcare").
Each Socio-Name is a leaf node under a Socio-Class, which itself is grouped within a Socio-Function, producing a hierarchy supporting evaluation at varying semantic depths.
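The hierarchy can be pictured as two lookup tables chained together. The following is a minimal sketch, assuming a toy slice of the taxonomy; the dictionary names, the function names, and the specific parent assignments are illustrative and are not taken from the released label files.

```python
# Hypothetical toy slice of the taxonomy. The real dataset has about
# 5,000 Socio-Names, 90 Socio-Classes, and 10 Socio-Functions.
NAME_TO_CLASS = {
    "Beijing Normal University": "college",
}
CLASS_TO_FUNCTION = {
    "college": "educational",
    "park": "recreational",
    "hospital": "healthcare",
}


def resolve(socio_name: str) -> tuple[str, str, str]:
    """Return (Socio-Name, Socio-Class, Socio-Function) for a named AOI."""
    socio_class = NAME_TO_CLASS[socio_name]
    socio_function = CLASS_TO_FUNCTION[socio_class]
    return socio_name, socio_class, socio_function


def label_at(socio_name: str, level: str) -> str:
    """Pick the label used for evaluation at one semantic depth."""
    name, cls, func = resolve(socio_name)
    return {"name": name, "class": cls, "function": func}[level]
```

Calling `label_at("Beijing Normal University", "function")` walks both tables and yields the coarse label, which is how one mask can be scored at any of the three depths.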
2. Data Sources, Co-registration, and Annotation Pipeline
All imagery and map data in SocioSeg are retrieved from Amap’s public web API, ensuring data modality congruence (Web Mercator projection, perfect grid alignment, spatial resolution match). The satellite layer undergoes only normalization and resizing beyond vendor corrections.
Ground truth segmentation masks derive from Amap’s AOI polygon database, processed through a two-stage pipeline:
- Rasterization: AOI polygons are rasterized to match the $512 \times 512$ scene grid.
- Manual Verification: Three trained annotators review each scene for semantic and geometric alignment, discarding ambiguous/misaligned samples.
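The rasterization stage can be illustrated with a dependency-free sketch. It assumes the AOI polygon has already been projected into pixel coordinates of the scene grid and marks a pixel as foreground when its center lies inside the polygon (even-odd rule). The function names are hypothetical, and the actual pipeline may use a different fill rule or a GIS library.

```python
def point_in_polygon(x: float, y: float, polygon: list[tuple[float, float]]) -> bool:
    """Even-odd ray-casting test for a point against a simple polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon: list[tuple[float, float]], size: int = 512) -> list[list[int]]:
    """Return a size x size binary mask (list of rows), testing pixel centers."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(size)]
        for row in range(size)
    ]
```

This brute-force version tests every pixel against every edge, which is adequate for illustration; a scanline fill would be the usual choice at scale.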
For quality assurance, inter-annotator agreement is measured with Cohen's kappa over a random 500-scene subset: $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ the expected chance agreement. The reported agreement indicates strong annotation consistency.
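As a worked illustration, Cohen's kappa for one pair of annotators is (observed agreement minus chance agreement) divided by (one minus chance agreement). The sketch below assumes each annotator gives one categorical decision per scene; the `"keep"`/`"drop"` labels in the usage note are hypothetical, and how decisions from three annotators are combined is not specified here.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

For example, `cohens_kappa(["keep", "keep", "drop", "drop"], ["keep", "keep", "drop", "keep"])` has observed agreement 0.75 and chance agreement 0.5, giving a kappa of 0.5.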
3. Dataset Statistics and Splits
SocioSeg is partitioned into 7,800/1,300/4,000 scenes for training, validation, and testing (approximately a 6:1:3 ratio). Each split mirrors the full Socio-Name, Socio-Class, and Socio-Function distributions. At the Socio-Function level, “residential” comprises about 20% of pixels and “industrial” about 5%. Socio-Class distributions are highly skewed: “residential area” (15%), “park” (10%), “embassy” (rare, 50 scenes).
4. Segmentation Evaluation Metrics
SocioSeg employs mainstream and robust evaluation metrics for segmentation accuracy:
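- Intersection-over-Union (IoU) for each class:
$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$
where $P$ is the predicted mask and $G$ the ground truth.
- Generalized IoU (gIoU), adjusting for the non-overlapping part of the convex hull $C$ of $P \cup G$:
$$\mathrm{gIoU} = \mathrm{IoU} - \frac{|C \setminus (P \cup G)|}{|C|}$$
- Instance-level F1 Score, combining precision and recall:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
with
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$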
These metrics support both coarse (function-level) and extremely fine-grained (name-level) segmentation evaluation.
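The three metrics can be sketched on masks represented as sets of pixel coordinates. This is a minimal illustration, not the benchmark's evaluation code: the function names are hypothetical, the enclosing hull region is passed in by the caller as a pixel set rather than computed, and the rule for counting an instance as a true positive is left outside the sketch.

```python
def iou(pred: set, truth: set) -> float:
    """Intersection-over-Union of two pixel sets."""
    union = pred | truth
    if not union:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return len(pred & truth) / len(union)


def giou(pred: set, truth: set, hull: set) -> float:
    """Generalized IoU: IoU minus the share of the hull not covered by the union.

    `hull` must be a non-empty pixel set enclosing both masks.
    """
    union = pred | truth
    return iou(pred, truth) - len(hull - union) / len(hull)


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from instance-level match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For two disjoint masks IoU is 0 regardless of how far apart they are, whereas gIoU becomes more negative as the empty part of the hull grows, which is why it is the more informative signal for near misses.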
5. Data Access, Format, and Usage
SocioSeg is distributed under a CC BY-NC-SA license. The dataset and codebase are accessible at https://github.com/AMAP-ML/SocioReasoner. Directory structure for each split:
- /images: $512 \times 512$ RGB PNG satellite tiles
- /maps: $512 \times 512$ RGB PNG digital map tiles
- /masks: 8-bit indexed PNG semantic masks, with palette entries referencing Socio-Name, Socio-Class, or Socio-Function
- Metadata JSON per split, containing scene IDs, geographic coordinates, AOI names.
This structure facilitates direct adaptation for training, validation, and reproducibility in segmentation and vision-language reasoning experiments.
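A small helper can make the per-split layout concrete. The sketch below assumes that each split lives in its own directory (for example `train`) and that the three files of a scene share the name `<scene_id>.png`. Both conventions are assumptions for illustration and should be checked against the repository before use.

```python
from pathlib import Path


def scene_paths(root: str, split: str, scene_id: str) -> dict[str, Path]:
    """Build the expected image, map, and mask paths for one scene.

    Assumes hypothetical naming: <root>/<split>/<subdir>/<scene_id>.png
    """
    base = Path(root) / split
    return {
        "image": base / "images" / f"{scene_id}.png",
        "map": base / "maps" / f"{scene_id}.png",
        "mask": base / "masks" / f"{scene_id}.png",
    }


def missing_files(root: str, split: str, scene_ids: list[str]) -> list[Path]:
    """List expected files that are absent, as a quick integrity check."""
    return [
        path
        for scene_id in scene_ids
        for path in scene_paths(root, split, scene_id).values()
        if not path.exists()
    ]
```

Running `missing_files` over the scene IDs listed in a split's metadata JSON is a cheap way to confirm a download is complete before training.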
6. SocioSeg Social-Network Dataset: Population-Scale Network Construction and Attributes
The SocioSeg social-network dataset was constructed for the population-scale analysis of socio-economic segregation in the Netherlands (Kazmina et al., 2023). It covers all 17,249,802 registered residents as of October 2018, with two aggregation levels:
- Person–person network: 17,249,802 nodes, 1,325,677,157 edges.
- Household–household network: 7,666,119 nodes, 914,165,057 edges.
Tie types are encoded in multilayer structure:
- Family (parent–child, siblings, extended kin)
- Household co-residence
- School classmates (by institution, year, type)
- Workplace colleagues (sampling 100 geographically nearest colleagues for large firms)
- Next-door neighbors (nearest ten neighboring households by geocoordinates)
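The neighbor layer described above amounts to a k-nearest-neighbor graph over household locations. The following is a minimal brute-force sketch, assuming planar coordinates and integer household IDs. The function name is hypothetical, and at population scale a spatial index would be needed instead of the quadratic search used here.

```python
import math


def nearest_neighbor_edges(coords: dict[int, tuple[float, float]], k: int = 10) -> set[tuple[int, int]]:
    """Link each household to its k nearest households.

    Returns undirected, deduplicated edges as (smaller_id, larger_id) pairs.
    """
    edges = set()
    for u, (ux, uy) in coords.items():
        # Sort all other households by distance to u; ties are broken by ID.
        others = sorted(
            (v for v in coords if v != u),
            key=lambda v: (math.hypot(coords[v][0] - ux, coords[v][1] - uy), v),
        )
        for v in others[:k]:
            edges.add((min(u, v), max(u, v)))
    return edges
```

Because the relation "v is among u's nearest" is not symmetric, storing each pair once in sorted order is what makes the resulting layer undirected and deduplicated.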
Node attributes (derived from “founding” adults) include:
- Income (binned by decile, D=1...10)
- Education (ordered categorical or “mixed”)
- Ethnic group (eight-level nominal)
- Migrant generation (nominal: native, first-gen, second-gen, mixed)
7. File Structure, Assortativity Measures, and Analytical Workflows
The dataset format:
- nodes.csv: household_id, income_decile, education_code, ethnic_code, migrant_gen
- edges_family.csv, edges_school.csv, edges_work.csv, edges_neighbors.csv: household_id_u, household_id_v (undirected, deduplicated)
- (optional) edges_full_network.csv: household_id_u, household_id_v, layer_id
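Income assortativity is the primary segregation metric. Let $e_{ij}$ denote the share of ties between income deciles $i$ and $j$, yielding overall income assortativity
$$r = \frac{\sum_{i,j} i\,j\,e_{ij} - \mu_a \mu_b}{\sigma_a \sigma_b}$$
where $\mu_a, \mu_b$ are the row/column means and $\sigma_a^2, \sigma_b^2$ the corresponding variances.
Discrete assortativity for categorical attributes (e.g., education) uses
$$r = \frac{\sum_i e_{ii} - \sum_i a_i b_i}{1 - \sum_i a_i b_i}$$
where $a_i = \sum_j e_{ij}$ and $b_i = \sum_j e_{ji}$ are the row/col marginal sums. For continuous income,
$$r = \frac{\sum_{(u,v) \in E} (x_u - \bar{x})(x_v - \bar{x})}{\sum_{(u,v) \in E} (x_u - \bar{x})^2}$$
where $x_u$ is the income of node $u$, each undirected tie in $E$ is counted in both directions, and $\bar{x}$ is the mean of incident-endpoint incomes.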
Python pseudocode for typical loading, network construction, and assortativity calculation is provided, facilitating reproducibility via standard libraries (pandas, networkx).
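A dependency-free sketch of the assortativity calculation is given here for illustration; it is not the authors' code. It assumes edges are held in memory as pairs of household IDs and attributes as a dictionary keyed by household ID, builds the tie-share matrix $e_{ij}$ by counting every undirected edge in both directions, and then applies the standard mixing-matrix forms for ordinal and categorical attributes. All function names are hypothetical.

```python
import math
from collections import defaultdict


def mixing_matrix(edges, attr):
    """Share of tie ends e[(i, j)] joining attribute values i and j.

    Each undirected edge is counted in both directions, so e is symmetric.
    """
    counts = defaultdict(float)
    for u, v in edges:
        counts[(attr[u], attr[v])] += 1
        counts[(attr[v], attr[u])] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def marginals(e):
    """Row sums a_i and column sums b_j of the mixing matrix."""
    a, b = defaultdict(float), defaultdict(float)
    for (i, j), share in e.items():
        a[i] += share
        b[j] += share
    return a, b


def numeric_assortativity(e):
    """Pearson-type coefficient for ordinal values such as income deciles."""
    a, b = marginals(e)
    mu_a = sum(i * w for i, w in a.items())
    mu_b = sum(j * w for j, w in b.items())
    var_a = sum(i * i * w for i, w in a.items()) - mu_a ** 2
    var_b = sum(j * j * w for j, w in b.items()) - mu_b ** 2
    if var_a <= 0 or var_b <= 0:
        return float("nan")  # undefined when every tie end has the same value
    cov = sum(i * j * w for (i, j), w in e.items()) - mu_a * mu_b
    return cov / math.sqrt(var_a * var_b)


def categorical_assortativity(e):
    """Discrete coefficient: (trace - chance) / (1 - chance)."""
    a, b = marginals(e)
    trace = sum(w for (i, j), w in e.items() if i == j)
    chance = sum(a[i] * b[i] for i in list(a))
    if chance == 1.0:
        return float("nan")  # undefined when only one category occurs
    return (trace - chance) / (1 - chance)
```

With real files, the edge pairs would come from the `household_id_u` and `household_id_v` columns and the attribute dictionary from `nodes.csv`. networkx offers comparable functions (`numeric_assortativity_coefficient` and `attribute_assortativity_coefficient`) for graphs that fit in memory.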
8. Privacy, Ethics, Data Access, and Applications
The social-network SocioSeg dataset is pseudonymized: personal identifiers are removed and all linkage is performed within the secure CBS infrastructure, with access conditional on project approval for legal, ethical, and data-protection compliance. The raw tables are available to accredited researchers under contract with Statistics Netherlands (Kazmina et al., 2023). Only aggregate results leave the protected environment.
Key findings emerging from this structure include that income assortativity in the social network is more than double that observed when calculated over spatial neighborhoods, providing evidence that social-network segregation exceeds spatial segregation. Segregation levels also exhibit strong context-dependence, differing across the family, school, work, and neighbor layers, and are substantially higher in larger municipalities.
SocioSeg thus denotes two landmark datasets—one pioneering fine-grained, hierarchical socio-semantic segmentation in geospatial AI; the other enabling multi-layer, population-scale social-network analysis of socio-economic segregation (Wang et al., 15 Jan 2026, Kazmina et al., 2023). Both establish new standards for the methodological depth and richness of data necessary to probe complex social constructs at scale.