
SocioSeg Dataset: Urban & Social Network Analysis

Updated 20 January 2026
  • SocioSeg is a dual dataset comprising an urban benchmark for socio-semantic segmentation and a large-scale social network dataset for analyzing socio-economic segregation.
  • The urban dataset features a three-tiered hierarchy with 13,000 geo-referenced scenes and rigorous annotation, structured for training, validation, and testing.
  • The social network dataset maps multi-layer relationships among 17M+ residents, using robust metrics to quantify income and categorical assortativity.

SocioSeg refers to two distinct datasets: (1) the Urban Socio-Semantic Segmentation dataset, a benchmark for hierarchical socio-semantic segmentation of overhead urban imagery, and (2) the SocioSeg social-network dataset for large-scale analysis of socio-economic segregation among the Dutch population. Both combine large scale with a purpose-built taxonomy or network structure, which lets them capture aspects of social structure and urban environments at a granularity not previously available (Wang et al., 15 Jan 2026, Kazmina et al., 2023).

1. Urban Socio-Semantic Segmentation Dataset (SocioSeg): Overview and Taxonomy

The Urban Socio-Semantic Segmentation dataset named SocioSeg is the first large-scale, hierarchically organized benchmark for the segmentation of socially defined urban entities (e.g., schools, parks, hospitals) from overhead (satellite) imagery (Wang et al., 15 Jan 2026). It comprises approximately 13,000 geo-referenced scenes spanning all provinces and major municipalities in China. Each scene contains:

  • A high-resolution ($0.5$–$1$ m GSD) $512 \times 512$ RGB satellite image.
  • A co-registered $512 \times 512$ digital map tile; both the image and the map tile are sourced from the Amap public API.

The dataset's core innovation is its three-tiered socio-semantic taxonomy:

  • Socio-Name: approximately 5,000 classes, each corresponding to a named Area of Interest (AOI), e.g., “Beijing Normal University.”
  • Socio-Class: 90 third-level point-of-interest (POI) categories (e.g., "college," "park," "hospital").
  • Socio-Function: 10 urban function super-categories aggregating the 90 Socio-Classes (e.g., "educational," "recreational," "healthcare").

Each Socio-Name is a leaf node under a Socio-Class, which itself is grouped within a Socio-Function, producing a hierarchy supporting evaluation at varying semantic depths.
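The three-level hierarchy can be pictured as a simple lookup from name to class to function. The sketch below is illustrative only: the `SocioLabel` type, the `CLASS_TO_FUNCTION` table, and the specific class-to-function pairings are assumptions built from the examples quoted above, not the dataset's actual label files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocioLabel:
    socio_name: str      # named AOI, e.g. "Beijing Normal University"
    socio_class: str     # POI category, e.g. "college"
    socio_function: str  # urban function, e.g. "educational"


# Hypothetical excerpt of the Socio-Class -> Socio-Function grouping.
CLASS_TO_FUNCTION = {
    "college": "educational",
    "park": "recreational",
    "hospital": "healthcare",
}


def make_label(socio_name, socio_class):
    """Attach the parent Socio-Function to a (name, class) pair."""
    return SocioLabel(socio_name, socio_class, CLASS_TO_FUNCTION[socio_class])


def label_at(label, level):
    """Return the label text at one semantic depth: 'name', 'class' or 'function'."""
    return {
        "name": label.socio_name,
        "class": label.socio_class,
        "function": label.socio_function,
    }[level]


bnu = make_label("Beijing Normal University", "college")
print(label_at(bnu, "function"))
```

Evaluating a prediction at a coarser depth then amounts to comparing `label_at(...)` values at that level rather than the leaf names.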

2. Data Sources, Co-registration, and Annotation Pipeline

All imagery and map data in SocioSeg are retrieved from Amap’s public web API, so the two modalities share the same Web Mercator projection, tile grid, and spatial resolution. Beyond the vendor's own corrections, the satellite layer undergoes only normalization and resizing.

Ground truth segmentation masks derive from Amap’s AOI polygon database, processed through a two-stage pipeline:

  1. Rasterization: AOI polygons are rasterized to match the $512 \times 512$ scene grid.
  2. Manual Verification: Three trained annotators review each scene for semantic and geometric alignment, discarding ambiguous/misaligned samples.
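As a rough illustration of the rasterization stage, the sketch below fills a binary mask by testing each pixel center against a polygon with the even-odd ray-casting rule. It assumes the polygon vertices are already expressed in pixel coordinates of the scene grid; the function names are hypothetical, and the source does not say which rasterizer was actually used.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd rule: count polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize_polygon(vertices, size=512):
    """Return a size x size mask (rows of 0/1) with 1 where the pixel center is inside."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, vertices) else 0
         for col in range(size)]
        for row in range(size)
    ]


# Tiny example on a 4 x 4 grid; vertices are (x, y) = (column, row) coordinates.
square = [(1, 1), (3, 1), (3, 3), (1, 3)]
for mask_row in rasterize_polygon(square, size=4):
    print(mask_row)
```

A pure-Python loop like this is slow for full 512-pixel tiles; it is meant to show the geometry, not to serve as a production rasterizer.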

For quality assurance, inter-annotator agreement measured by Cohen’s kappa reaches $\kappa = 0.854$ on a random 500-scene subset, where $\kappa = \frac{p_o - p_e}{1 - p_e}$, $p_o$ is the observed agreement, and $p_e$ is the agreement expected by chance. This indicates strong annotation consistency.
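The kappa formula can be checked on a toy example. The sketch below computes Cohen's kappa for two annotators; the "keep"/"discard" labels are hypothetical, and because Cohen's kappa is defined for a pair of raters, how the agreement of three annotators was aggregated in the source is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # p_o: fraction of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


rater_1 = ["keep", "keep", "discard", "keep"]
rater_2 = ["keep", "discard", "discard", "keep"]
print(cohens_kappa(rater_1, rater_2))  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```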

3. Dataset Statistics and Splits

SocioSeg is partitioned into 7,800/1,300/4,000 scenes for training, validation, and testing (approximately a 6:1:3 ratio). Each split mirrors the full Socio-Name, Socio-Class, and Socio-Function distributions. At the Socio-Function level, “residential” comprises about 20% of pixels and “industrial” about 5%. Socio-Class distributions are highly skewed: “residential area” (about 15%), “park” (about 10%), and “embassy” (rare, fewer than 50 scenes).

4. Segmentation Evaluation Metrics

SocioSeg is evaluated with three standard segmentation metrics:

  • Intersection-over-Union (IoU) for each class:

$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$

where $P$ is the predicted mask and $G$ the ground truth.

  • Generalized IoU (gIoU), which additionally penalizes the empty area of the smallest convex region $C$ enclosing both $P$ and $G$:

$$\mathrm{gIoU} = \mathrm{IoU} - \frac{|C \setminus (P \cup G)|}{|C|}$$

  • Instance-level F1 Score, combining precision and recall:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

with

$$\mathrm{Precision} = \frac{|P \cap G|}{|P|}, \quad \mathrm{Recall} = \frac{|P \cap G|}{|G|}$$

These metrics support both coarse (function-level) and extremely fine-grained (name-level) segmentation evaluation.
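The three metrics can be sketched on masks represented as sets of (row, column) pixels. Two simplifications are assumed here and should be read as such: F1 is computed directly from the mask-overlap definitions of precision and recall given above, and the enclosing region $C$ in gIoU is approximated by the axis-aligned bounding box of $P \cup G$ rather than a true convex hull. The function names are hypothetical.

```python
def iou(pred, gt):
    """Intersection-over-Union of two masks given as sets of (row, col) pixels."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


def f1_score(pred, gt):
    """F1 from Precision = |P & G| / |P| and Recall = |P & G| / |G|."""
    overlap = len(pred & gt)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gt)
    return 2 * precision * recall / (precision + recall)


def giou_bbox(pred, gt):
    """gIoU with C approximated by the bounding box of the union (an assumption)."""
    union = pred | gt
    if not union:
        return 0.0
    rows = [r for r, _ in union]
    cols = [c for _, c in union]
    enclosing = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return iou(pred, gt) - (enclosing - len(union)) / enclosing


pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt = {(0, 1), (0, 2), (1, 1), (1, 2)}
print(iou(pred, gt), f1_score(pred, gt), giou_bbox(pred, gt))
```

Unlike IoU, gIoU stays informative for disjoint masks: two pixels separated by a gap give an IoU of 0 but a negative gIoU that grows in magnitude with the gap.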

5. Data Access, Format, and Usage

SocioSeg is distributed under a CC BY-NC-SA license. The dataset and codebase are accessible at https://github.com/AMAP-ML/SocioReasoner. Directory structure for each split:

  • /images: $512 \times 512$ RGB PNG satellite tiles
  • /maps: $512 \times 512$ RGB PNG digital map tiles
  • /masks: 8-bit indexed PNG semantic masks, with palette entries referencing Socio-Name, Socio-Class, or Socio-Function
  • Metadata JSON per split, containing scene IDs, geographic coordinates, AOI names.

This structure facilitates direct adaptation for training, validation, and reproducibility in segmentation and vision-language reasoning experiments.

6. SocioSeg Social-Network Dataset: Population-Scale Network Construction and Attributes

The SocioSeg social-network dataset was constructed for the population-scale analysis of socio-economic segregation in the Netherlands (Kazmina et al., 2023). It covers all 17,249,802 registered residents as of October 2018, with two aggregation levels:

  • Person–person network: 17,249,802 nodes, 1,325,677,157 edges.
  • Household–household network: 7,666,119 nodes, 914,165,057 edges.

Tie types are encoded in multilayer structure:

  • Family (parent–child, siblings, extended kin)
  • Household co-residence
  • School classmates (by institution, year, type)
  • Workplace colleagues (sampling 100 geographically nearest colleagues for large firms)
  • Next-door neighbors (nearest ten neighboring households by geocoordinates)

Node attributes (derived from “founding” adults) include:

  • Income (binned by decile, $D = 1, \dots, 10$)
  • Education (ordered categorical or “mixed”)
  • Ethnic group (eight-level nominal)
  • Migrant generation (nominal: native, first-gen, second-gen, mixed)

7. File Structure, Assortativity Measures, and Analytical Workflows

The dataset format:

  • nodes.csv: household_id, income_decile, education_code, ethnic_code, migrant_gen
  • edges_family.csv, edges_school.csv, edges_work.csv, edges_neighbors.csv: household_id_u, household_id_v (undirected, deduplicated)
  • (optional) edges_full_network.csv: household_id_u, household_id_v, layer_id

Income assortativity is the primary segregation metric. Let $X_{ij}$ denote the share of ties between income deciles $i$ and $j$; the overall income assortativity is then

$$\rho_\mathrm{scalar} = \frac{\sum_{i,j} i \cdot j \cdot X_{ij} - \mu_\mathrm{row} \cdot \mu_\mathrm{col}}{\sigma_\mathrm{row} \cdot \sigma_\mathrm{col}}$$

where $\mu_\mathrm{row}, \mu_\mathrm{col}$ are the row and column means and $\sigma_\mathrm{row}^2, \sigma_\mathrm{col}^2$ the corresponding variances.
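A minimal sketch of this calculation, assuming the mixing matrix is given as a list of rows whose entries sum to 1 and whose row/column index $k$ (1-based) is the decile value. The function name `income_assortativity` is hypothetical.

```python
from math import sqrt


def income_assortativity(X):
    """Scalar assortativity from a mixing matrix X of tie shares (entries sum to 1)."""
    n = len(X)
    levels = range(1, n + 1)  # decile values 1..n
    row = [sum(X[i]) for i in range(n)]                       # row marginals
    col = [sum(X[i][j] for i in range(n)) for j in range(n)]  # column marginals
    mu_row = sum(k * p for k, p in zip(levels, row))
    mu_col = sum(k * p for k, p in zip(levels, col))
    var_row = sum(k * k * p for k, p in zip(levels, row)) - mu_row ** 2
    var_col = sum(k * k * p for k, p in zip(levels, col)) - mu_col ** 2
    cross = sum((i + 1) * (j + 1) * X[i][j] for i in range(n) for j in range(n))
    return (cross - mu_row * mu_col) / sqrt(var_row * var_col)


# Two income groups with 80% of ties inside the same group.
X = [[0.4, 0.1],
     [0.1, 0.4]]
print(income_assortativity(X))
```

For this toy matrix the cross term is 2.4, both means are 1.5 and both variances are 0.25, so the result is (2.4 − 2.25) / 0.25 = 0.6; a matrix with all entries equal to 0.25 gives 0.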

Discrete assortativity for categorical attributes (e.g., education) uses

$$\rho_\mathrm{discrete} = \frac{\mathrm{Tr}(X) - \sum_{i} a_i b_i}{1 - \sum_{i} a_i b_i}$$

where $a_i, b_i$ are the row and column marginal sums. For continuous income,

$$r = \frac{\sum_{e=(u,v)} (x_u - \mu)(x_v - \mu)}{\sum_{e=(u,v)} (x_u - \mu)^2}$$

with $\mu$ the mean of incident-endpoint incomes.
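Both remaining measures can be sketched in a few lines. The code below assumes a normalized mixing matrix for the categorical case and, for the continuous case, counts every undirected tie in both directions so that $\mu$ is the mean over all edge endpoints; that symmetrization is an assumption about how the sum over $e=(u,v)$ is taken, and the function names are hypothetical.

```python
def discrete_assortativity(X):
    """Categorical assortativity from a mixing matrix X whose entries sum to 1."""
    n = len(X)
    a = [sum(X[i]) for i in range(n)]                       # row marginals a_i
    b = [sum(X[i][j] for i in range(n)) for j in range(n)]  # column marginals b_i
    trace = sum(X[i][i] for i in range(n))
    expected = sum(a[i] * b[i] for i in range(n))
    return (trace - expected) / (1 - expected)


def edge_income_correlation(edges, income):
    """Pearson-style r over ties; each undirected tie contributes (u, v) and (v, u)."""
    pairs = [(income[u], income[v]) for u, v in edges]
    pairs += [(income[v], income[u]) for u, v in edges]
    mu = sum(x_u for x_u, _ in pairs) / len(pairs)  # mean over edge endpoints
    numerator = sum((x_u - mu) * (x_v - mu) for x_u, x_v in pairs)
    denominator = sum((x_u - mu) ** 2 for x_u, _ in pairs)
    return numerator / denominator


print(discrete_assortativity([[0.4, 0.1], [0.1, 0.4]]))
print(edge_income_correlation([("a", "b"), ("c", "d")],
                              {"a": 1.0, "b": 1.0, "c": 3.0, "d": 3.0}))
```

In the second example every tie connects nodes with identical income, so the correlation is 1; a single tie between incomes 1 and 3 would give −1.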

The source reports Python pseudocode for typical loading, network construction, and assortativity calculation, which supports reproducibility with standard libraries (pandas, networkx).
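As a hedged illustration of such a workflow (not the source's pseudocode), the sketch below reads the node and edge tables with Python's built-in `csv` module instead of pandas or networkx and builds the income mixing matrix $X_{ij}$. The column names follow the file layout listed above; the function names and the in-memory sample data are invented for the example.

```python
import csv
import io


def load_income_deciles(nodes_file):
    """Map household_id -> income decile (1..10) from an open nodes.csv."""
    return {row["household_id"]: int(row["income_decile"])
            for row in csv.DictReader(nodes_file)}


def income_mixing_matrix(edges_file, decile, n_bins=10):
    """Share of ties between income deciles i and j, counted symmetrically."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for row in csv.DictReader(edges_file):
        u, v = row["household_id_u"], row["household_id_v"]
        if u not in decile or v not in decile:
            continue  # skip ties whose endpoints lack an income attribute
        i, j = decile[u] - 1, decile[v] - 1
        counts[i][j] += 1
        counts[j][i] += 1  # undirected tie: record both orientations
    total = sum(map(sum, counts))
    return [[c / total for c in r] for r in counts] if total else counts


# Made-up sample standing in for nodes.csv and edges_family.csv.
nodes_csv = io.StringIO(
    "household_id,income_decile,education_code,ethnic_code,migrant_gen\n"
    "h1,1,2,0,native\nh2,1,3,0,native\nh3,10,4,1,first-gen\n"
)
edges_csv = io.StringIO("household_id_u,household_id_v\nh1,h2\nh1,h3\n")

decile = load_income_deciles(nodes_csv)
X = income_mixing_matrix(edges_csv, decile)
print(X[0][0], X[0][9], X[9][0])
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`; at the scale of hundreds of millions of edges a streaming pass like this one avoids holding the full edge list in memory.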

8. Privacy, Ethics, Data Access, and Applications

The social-network SocioSeg dataset is pseudonymized: personal identifiers are removed and all linkage is performed within the secure CBS infrastructure, with access conditional on project approval for legal, ethical, and data-protection compliance. The raw tables are available to accredited researchers under contract with Statistics Netherlands (Kazmina et al., 2023). Only aggregate results leave the protected environment.

Key findings emerging from this structure include that income assortativity ($\rho = 0.217$ in the social network) is more than double that observed when calculated over spatial neighborhoods ($\rho = 0.103$), providing robust evidence that social-network segregation exceeds spatial segregation. Segregation levels also exhibit strong context-dependence: $\rho = 0.14$ (family), $\rho = 0.16$ (school), $\rho = 0.20$ (work), $\rho = 0.34$ (neighbors), and are substantially higher in larger municipalities.


SocioSeg thus denotes two datasets: one for fine-grained, hierarchical socio-semantic segmentation in geospatial AI, the other for multi-layer, population-scale social-network analysis of socio-economic segregation (Wang et al., 15 Jan 2026, Kazmina et al., 2023). Both establish new standards for the methodological depth and richness of data necessary to probe complex social constructs at scale.
