CoMa-20K: Urban & Stellar Benchmark
- CoMa-20K is a dual-domain dataset comprising an urban massing benchmark and an astronomical catalog of compact stellar systems.
- The urban subset provides 20,000 samples with detailed geometric, programmatic, economic, and visual data to support generative architecture models.
- The astronomical catalog lists over 23,000 entries, offering key metrics on globular clusters and ultra-compact dwarfs for studies of spatial distribution and metallicity.
CoMa-20K refers to two distinct datasets published under the same name in unrelated research domains: (1) a multimodal architectural massing benchmark for urban context-aware generative modeling (Maslov et al., 13 Jan 2026), and (2) a Hubble Space Telescope-based catalog of compact stellar systems in the Coma galaxy cluster, focusing on globular clusters (GCs) and ultra-compact dwarfs (UCDs) (Pomeroy et al., 2 Jun 2025). Each serves as a reference resource for data-driven research in its field.
1. Architectural Massing: Dataset Structure and Content
The architectural CoMa-20K dataset comprises 20,000 unique development sites, curated to enable automated massing generation in urban planning. Each sample includes detailed geometric data, programmatic and economic attributes, and visual context images.
- Sample breakdown: The dataset consists of 18,000 samples for train/validation and 2,000 held-out test samples.
- Building functions: Dominated by “Residential Apartment” but encompasses “Commercial Accommodation,” “Educational/Research,” “Student Accommodation,” and others.
- Per-site statistics: Sites contain 1–4 buildings (mode 1). Residential units per building show modes at 1 and at 10–50 units; commercial spaces per building usually number fewer than 5, with a long tail up to ~100. Usable area is right-skewed (mode ~1,000 m²; maximum in the tens of thousands of m²), most buildings have fewer than 10 floors, and typical public capacity is around 100 persons.
2. Massing Geometry and Representation Standards
Each building massing is encoded as a sequence of horizontal extrusions, with each extrusion defined by polygonal footprint coordinates in a local metric Cartesian system and by lower and upper elevation bounds. Original GIS data is reprojected from EPSG:4326 to a local metric CRS.
- Computational formulas: Closed-form expressions are given for the usable area per building and for the average floor height.
- Formats: Geometries are supplied as structured JSON (tokenized for model outputs), OBJ meshes for visualization, and multi-view images in PNG/JPEG formats (exact resolutions unstated).
A plausible implication is that this level of detail permits direct ingestion by vision-LLM pipelines and downstream CAD/BIM workflows for generative design tasks.
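The extrusion encoding described in this section can be illustrated with a minimal Python sketch. The class name `Extrusion`, its field names, and the helper functions are hypothetical and do not reproduce the dataset's actual JSON schema; the sketch also does not implement the dataset's own formulas for usable area or average floor height. It only shows how a footprint polygon and two elevation bounds determine footprint area (via the shoelace formula), volume, and overall height.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in metres, local Cartesian frame


@dataclass
class Extrusion:
    """One horizontal extrusion (hypothetical schema, for illustration only)."""
    footprint: List[Point]  # polygon vertices in order, first vertex not repeated
    z_min: float            # lower elevation bound in metres
    z_max: float            # upper elevation bound in metres


def polygon_area(points: List[Point]) -> float:
    """Unsigned area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def extrusion_volume(e: Extrusion) -> float:
    """Footprint area multiplied by the extrusion height."""
    return polygon_area(e.footprint) * (e.z_max - e.z_min)


def massing_volume(extrusions: List[Extrusion]) -> float:
    """Total volume, assuming the extrusions do not overlap."""
    return sum(extrusion_volume(e) for e in extrusions)


def massing_height(extrusions: List[Extrusion]) -> float:
    """Vertical extent from the lowest to the highest elevation bound."""
    return max(e.z_max for e in extrusions) - min(e.z_min for e in extrusions)


# Illustrative building: a 20 m x 10 m podium with a 10 m x 10 m tower on top.
podium = Extrusion([(0.0, 0.0), (20.0, 0.0), (20.0, 10.0), (0.0, 10.0)], 0.0, 6.0)
tower = Extrusion([(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)], 6.0, 30.0)
building = [podium, tower]

print(polygon_area(podium.footprint))  # footprint area in square metres
print(massing_volume(building))        # total volume in cubic metres
print(massing_height(building))        # overall height in metres
```

Stacking extrusions in this way lets stepped or podium-and-tower forms be described with a few polygons, which keeps the tokenized JSON short.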
3. Programmatic, Economic, and Contextual Data
Building requirements are specified as plain string labels within JSON dictionaries, including fields such as number of dwellings, number of commercial spaces, lists of offices and public spaces (with capacities), floor heights, and usable areas. The dataset references “economical requirements” but does not expose explicit cost/revenue variables or calculation formulas.
Context visual data comprises five distinct image modalities per site: high-detail photorealistic renderings, low-detail schematics, raw mesh visualizations, and two cartographic maps. Environments are synthesized by assembling 3D meshes of all buildings in the target CLUE block and adjacent blocks, with semantic enhancement applied via Qwen-Image-Edit.
4. Benchmarking, Evaluation Protocols, and Empirical Results
The primary research objective associated with CoMa-20K is the conditional generation of 3D massings using Vision-LLMs (VLMs):
- Task setup: Inputs combine textual requirements, site contour coordinates, and high-detail contextual images; outputs are structured JSON describing horizontal extrusions.
- Model adaptations: Fine-tuned models (Qwen3-VL-2B, 4B, 8B via LoRA; 3 epochs; AdamW optimizer, LR=, batch=128) and zero-shot models (Qwen3-VL-235B-Instruct, using explicit architectural constraints in prompts) are evaluated.
- Output assessment: Metrics include Pattern Match (extractable JSON rate), JSON Validity (ratio parseable), ID IoU, Floor Error, Area Error, Site IoU, and Contextual Relevance (VLM-judge binary score).
| Model | Pattern | JSON | ID IoU | FloorErr | AreaErr | Site IoU | Relevance |
|---|---|---|---|---|---|---|---|
| CoMa-2B | 0.84 | 0.63 | 0.46 | 0.76 | 1.30 | 0.01 | 0.15 |
| CoMa-4B | 0.95 | 0.72 | 0.71 | 0.49 | 2.86 | 0.03 | 0.18 |
| CoMa-8B | 0.94 | 0.79 | 0.75 | 0.42 | 1.90 | 0.05 | 0.24 |
| Qwen3-VL-235B | 1.00 | 0.99 | 0.99 | 0.12 | 0.79 | 0.10 | 0.25 |
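The zero-shot Qwen3-VL-235B leads on every metric in the table, yet Site IoU (at most 0.10) and Contextual Relevance (at most 0.25) remain low for all models. These results highlight both the difficulty of context-sensitive massing generation and the empirical progress made with large-scale VLMs (Maslov et al., 13 Jan 2026).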
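The two format-level metrics, Pattern Match and JSON Validity, can be sketched as follows. This is an illustrative reading of the metric descriptions, not the authors' evaluation code: the regular expression, the function names, and the choice to divide both rates by the total number of responses are assumptions.

```python
import json
import re
from typing import Dict, List, Optional

# Assumed extraction pattern: the outermost {...} span in a model response.
JSON_PATTERN = re.compile(r"\{.*\}", re.DOTALL)


def extract_json_text(response: str) -> Optional[str]:
    """Return the candidate JSON substring, or None if no pattern is found."""
    match = JSON_PATTERN.search(response)
    return match.group(0) if match else None


def is_parseable(text: Optional[str]) -> bool:
    """True if the extracted text parses as JSON."""
    if text is None:
        return False
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def format_metrics(responses: List[str]) -> Dict[str, float]:
    """Pattern Match and JSON Validity as fractions of all responses (assumed)."""
    if not responses:
        raise ValueError("at least one response is required")
    extracted = [extract_json_text(r) for r in responses]
    n = len(responses)
    return {
        "pattern_match": sum(t is not None for t in extracted) / n,
        "json_validity": sum(is_parseable(t) for t in extracted) / n,
    }


# Made-up responses: valid with preamble, trailing comma, refusal, valid.
responses = [
    'Here is the massing: {"buildings": [{"id": 1, "floors": 5}]}',
    '{"buildings": [{"id": 1, "floors": 5,}]}',
    "I cannot produce a massing for this site.",
    '{"buildings": []}',
]
print(format_metrics(responses))
```

Under these assumptions, JSON Validity can never exceed Pattern Match, which matches the ordering of the two columns in the table.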
5. Astronomical Catalog: Observational Scope and Data Structure
In astrophysics, CoMa-20K denotes the catalog of 23,351 compact stellar systems (CSSs)—including 22,426 globular clusters (GCs) and 523 ultra-compact dwarf (UCD) candidates—based on 26 HST/ACS pointings in the core ( Mpc radius) of the Coma cluster, centered at Mpc.
- Selection: Initial extraction via DAOPHOT, followed by visual inspection and multi-band (F475W, F814W) filtering to exclude contaminants. UCDs are defined by (F814W mag; $1.3 <$(F475W–F814W)).
- Morphological criteria: UCDs (–$100$ pc) are marginally resolved at this distance; GCs remain largely unresolved.
- Catalog fields: Each object lists equatorial coordinates, F475W and F814W magnitudes (Vega), color index, half-light radius (arcsec, pc), photometric S/N, membership flag, absolute magnitude , and stellar mass estimates via (Salpeter IMF, 10 Gyr, [Z/H]=).
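The derived catalog fields follow from standard photometric conversions, sketched below in Python. All numerical inputs (a 100 Mpc distance, the solar absolute magnitude, and the mass-to-light ratio) are placeholder assumptions chosen for illustration and are not the catalog's adopted values; the function names are likewise hypothetical. Extinction and K-corrections are ignored.

```python
import math

ARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi  # about 206264.8


def distance_modulus(distance_mpc: float) -> float:
    """m - M = 5 log10(d / 10 pc), with d converted from Mpc to pc."""
    return 5.0 * math.log10(distance_mpc * 1.0e6 / 10.0)


def absolute_magnitude(apparent_mag: float, distance_mpc: float) -> float:
    """Absolute magnitude from apparent magnitude (no extinction, no K-correction)."""
    return apparent_mag - distance_modulus(distance_mpc)


def arcsec_to_pc(theta_arcsec: float, distance_mpc: float) -> float:
    """Small-angle conversion of an angular size to a physical size in parsecs."""
    return theta_arcsec / ARCSEC_PER_RADIAN * distance_mpc * 1.0e6


def stellar_mass(abs_mag: float, solar_abs_mag: float, mass_to_light: float) -> float:
    """Stellar mass in solar masses: M/L times luminosity in solar units."""
    luminosity = 10.0 ** (-0.4 * (abs_mag - solar_abs_mag))
    return mass_to_light * luminosity


# Placeholder inputs (assumptions, not values taken from the catalog).
DISTANCE_MPC = 100.0
SOLAR_ABS_MAG = 4.1   # hypothetical solar absolute magnitude in the chosen band
MASS_TO_LIGHT = 2.0   # hypothetical M/L in solar units

m_f814w = 22.0        # illustrative apparent magnitude
M_f814w = absolute_magnitude(m_f814w, DISTANCE_MPC)
print(M_f814w)
print(arcsec_to_pc(0.1, DISTANCE_MPC))
print(stellar_mass(M_f814w, SOLAR_ABS_MAG, MASS_TO_LIGHT))
```

At a distance of this order, 0.1 arcsec corresponds to a few tens of parsecs, which is consistent with UCDs being only marginally resolved while GCs remain unresolved.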
6. Statistical Relations, Spatial Distribution, and Metallicity Trends
Key empirical relations extracted from the catalog include:
- Mass–magnitude (“blue tilt”) relation: , with ; more massive clusters appear increasingly red, supporting self-enrichment models.
- Luminosity function: The bright end exhibits a Gaussian GCLF, , with , mag. UCDs are overrepresented compared to pure GCLF extrapolation by 336 sources.
- Spatial density: UCDs cluster more tightly around giant ellipticals (higher Sérsic index , smaller ) compared to GCs. Bound fractions (within ): 86% for UCDs, 76% for GCs; thus UCDs remain preferentially near parent galaxies, with only ~14% in intracluster space.
- Metallicity: Red (metal-rich) UCDs are 5–10 times denser within 10–25 kpc of the large ellipticals than blue (metal-poor) UCDs, which are more dispersed and exhibit lower local densities.
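Two of the relations above, the Gaussian luminosity function and the Sérsic surface-density profile, can be sketched in Python. This is a generic illustration and not the catalog's fitting code: the function names are hypothetical, the parameter values are made up, and the coefficient b_n uses the common approximation b_n ≈ 2n − 1/3.

```python
import math


def gaussian_cdf(m: float, mu: float, sigma: float) -> float:
    """Fraction of a Gaussian luminosity function brighter than magnitude m."""
    return 0.5 * (1.0 + math.erf((m - mu) / (sigma * math.sqrt(2.0))))


def expected_brighter_than(m_cut: float, n_total: float, mu: float, sigma: float) -> float:
    """Expected count with magnitude below m_cut (brighter objects have smaller m)."""
    return n_total * gaussian_cdf(m_cut, mu, sigma)


def bright_end_excess(n_observed: float, m_cut: float, n_total: float,
                      mu: float, sigma: float) -> float:
    """Observed count minus the Gaussian extrapolation at the bright end."""
    return n_observed - expected_brighter_than(m_cut, n_total, mu, sigma)


def sersic_density(r: float, sigma_e: float, r_e: float, n: float) -> float:
    """Sersic profile: sigma_e * exp(-b_n * ((r / r_e)**(1/n) - 1))."""
    b_n = 2.0 * n - 1.0 / 3.0  # approximation to the exact b_n
    return sigma_e * math.exp(-b_n * ((r / r_e) ** (1.0 / n) - 1.0))


# Made-up parameters, for illustration only.
MU, SIGMA, N_TOTAL = 27.0, 1.4, 1000.0
print(bright_end_excess(520.0, MU, N_TOTAL, MU, SIGMA))

# Central concentration: density at r_e / 10 relative to density at r_e.
for n in (1.0, 4.0):
    ratio = sersic_density(1.0, 1.0, 10.0, n) / sersic_density(10.0, 1.0, 10.0, n)
    print(n, ratio)
```

The loop shows why a higher Sérsic index indicates tighter clustering: for the same effective radius, the profile with the larger index rises more steeply toward the center.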
7. Accessibility, Reproducibility, and Significance
- Massing dataset: Based entirely on public City of Melbourne data, though the paper does not specify URL, API, or licensing information for direct download; users must contact authors or consult the project repository upon publication. Usage restrictions are not detailed (Maslov et al., 13 Jan 2026).
- Astrophysical catalog: Machine-readable tables are available via the AAS journal supplement, MAST, and CDS/VizieR repositories (program IDs 10861, 11711, 12918), with example code for Gaussian Mixture Model fitting, luminosity function modeling, and Sérsic profile fitting provided at https://github.com/CoMa-20K (Pomeroy et al., 2 Jun 2025).
Both datasets serve as reference benchmarks in their respective fields: the massing dataset enables reproducible, multimodal benchmarking of generative architectural models, and the astronomical catalog advances the census, classification, and theoretical modeling of compact stellar systems in rich cluster environments.