Radio Galaxy Zoo Overview
- Radio Galaxy Zoo is a large-scale citizen science and machine learning initiative for classifying extragalactic radio sources with complex morphologies using surveys like FIRST and ATLAS.
- The project integrates human visual assessments with automated algorithms to accurately group radio components and identify host galaxies, enhancing studies of AGN feedback and cosmic structure.
- RGZ catalogs facilitate rare object discoveries, train machine learning pipelines, and set a foundation for scaling classification methods for next-generation surveys like ASKAP/EMU and SKA.
Radio Galaxy Zoo
Radio Galaxy Zoo (RGZ) is a large-scale citizen science and machine learning initiative dedicated to the morphological classification and host association of extragalactic radio sources, particularly those exhibiting complex, extended structures such as jets, lobes, and tails. Operating primarily on data from the FIRST and ATLAS surveys, and recently expanded to cover new-generation data from ASKAP/EMU and MeerKAT, RGZ delivers reliable, labeled catalogs crucial for AGN and galaxy evolution studies, rare object searches, and the training of automated algorithms destined for surveys of the scale anticipated with the SKA and its pathfinders.
1. Scientific Motivation and Project Design
The primary scientific aim of RGZ is to associate spatially extended, often multi-component radio emission with the correct host galaxy. This task routinely defeats automated cross-matching algorithms because of the morphological complexity and angular separation of radio components in wide-area surveys. Visual inspection by expert astronomers is infeasible at the ∼10⁸ sources projected for surveys like EMU and SKA. RGZ applies the distributed classification power of citizen scientists to produce high-fidelity component groupings and host identifications for tens to hundreds of thousands of complex sources, providing reference datasets for machine-learning pipelines and enabling systematic studies of radio AGN feedback, galaxy environment, and cosmic structure (Wong et al., 2024, Willett, 2016, Banfield et al., 2015, Tang et al., 19 Jun 2025, Vardoulaki et al., 24 Sep 2025).
Key project features include:
- Cross-identification of all spatially resolved FIRST (1.4 GHz, 5″) and ATLAS (1.4 GHz, 12–17″) sources with infrared catalogs (WISE 3.4 μm, SWIRE 3.6 μm).
- Emphasis on extended/bent/jetted morphologies (FR I/II, WAT/NAT, hybrids, giants) excluded by positional algorithms.
- Multifaceted role: training set generation, direct science cataloging, and rare-object discovery.
2. Data Sources, Preprocessing, and Citizen-Science Workflow
Primary Surveys
- FIRST: 9,000+ deg², ∼175,000 resolved sources (cutouts: 3′×3′ at 1.37″/px).
- ATLAS: 6.3 deg² focus fields, deep sensitivity, higher-res IR via SWIRE.
Preprocessing Steps
- Detection of radio components via thresholding and median-absolute-deviation filtering.
- Overlays: radio contours (at ≥3σ or ≥4σ) on infrared grayscale or heatmaps.
- Consensus interface: Volunteers group radio components by physical source and locate host on IR image, with up to 20 independent annotations per complex subject.
- Host associations via kernel density estimation (KDE) peak of IR clicks.
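The host-location step above can be illustrated with a minimal pure-Python sketch. It assumes a fixed-bandwidth Gaussian kernel evaluated on a regular pixel grid; the function name `kde_peak`, the bandwidth, and the grid step are hypothetical choices for illustration and are not taken from the RGZ pipeline.

```python
import math

def kde_peak(clicks, bandwidth=5.0, grid_step=1.0):
    """Return the (x, y) grid point where a Gaussian KDE of the clicks peaks.

    clicks: list of (x, y) pixel positions from independent volunteers.
    bandwidth and grid_step are illustrative assumptions, in pixels.
    """
    if not clicks:
        raise ValueError("need at least one click")
    xs = [c[0] for c in clicks]
    ys = [c[1] for c in clicks]
    x0, y0 = min(xs), min(ys)
    # Evaluate the density only over the bounding box of the clicks.
    nx = int((max(xs) - x0) / grid_step) + 1
    ny = int((max(ys) - y0) / grid_step) + 1
    best_density, best_point = -1.0, None
    for i in range(nx):
        gx = x0 + i * grid_step
        for j in range(ny):
            gy = y0 + j * grid_step
            # Unnormalised sum of Gaussian kernels centred on each click.
            density = sum(
                math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2 * bandwidth ** 2))
                for x, y in clicks
            )
            if density > best_density:
                best_density, best_point = density, (gx, gy)
    return best_point

# Four volunteers agree on one IR source; a fifth clicks elsewhere.
clicks = [(100, 100), (101, 99), (99, 101), (100, 101), (140, 60)]
peak = kde_peak(clicks)
```

Taking the density peak rather than the mean click position keeps a single discrepant click from dragging the host position away from the majority choice.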
Consensus and Reliability
- Weighted consensus metric (incorporating per-user agreement with expert “gold samples”) ensures reliability: average R ≈ 0.83 overall, rising to ≈0.91 for hosts W1<17 mag (Wong et al., 2024).
- Gold-verified high consensus (CL ≥0.65) entries adopted; lower CL sources excluded.
- Expert/volunteer agreement exceeds 85% for high-consensus cases; rarity and complexity reduce this for extreme sources.
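One simple way to picture a weighted consensus level is as the weighted share of votes received by the most popular answer. The sketch below makes that assumption; the function name, the weight values, and the string encoding of answers are hypothetical, and the actual RGZ weighting scheme may differ. Only the 0.65 threshold is taken from the text above.

```python
from collections import defaultdict

CL_THRESHOLD = 0.65  # consensus cut quoted in the text above

def weighted_consensus(votes, user_weights, default_weight=1.0):
    """Return (winning_answer, consensus_level) for one subject.

    votes: list of (user_id, answer) pairs; answer must be hashable.
    user_weights: per-user weights, e.g. derived from agreement with a
        gold sample (how they are derived is outside this sketch).
    """
    totals = defaultdict(float)
    for user_id, answer in votes:
        totals[answer] += user_weights.get(user_id, default_weight)
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("no weighted votes")
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_weight

# Hypothetical subject: three users group components A and B together,
# one highly weighted user keeps them separate.
votes = [("u1", "A+B"), ("u2", "A+B"), ("u3", "A|B"), ("u4", "A+B")]
weights = {"u1": 1.0, "u2": 0.5, "u3": 2.0, "u4": 1.0}
answer, level = weighted_consensus(votes, weights)
keep = level >= CL_THRESHOLD
```

In this toy case the majority answer wins but falls below the cut, so the entry would be excluded rather than adopted with low confidence.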
| Survey | N_sources | Area (deg²) | Resolution (″) | IR counterpart | Main Cutout Size |
|---|---|---|---|---|---|
| FIRST | ~99,000 | ~9,000 | 5 | WISE | 3′ × 3′ |
| ATLAS | ~600 | 6.3 | 12–17 | SWIRE | 2′ × 2′ |
| EMU (ASKAP) | >10⁷ | ~30,000 | 10–18 | WISE/DES | 6′ × 6′ |
3. Morphological Classification, Host Demographics, and Rare Object Discovery
RGZ classifies sources by component and peak count (N_comp c N_peak p) as well as by traditional and semantic morphological tags derived both from citizen input and automated metrics (Wong et al., 2024, Bowles et al., 2023):
- Single, double, triple, multi-lobed, bent (WAT/NAT), hybrid (HyMoRS), giant radio galaxies (GRG), X-shaped, hourglass, diffuse, compact.
- Semantic taxonomy: plain-English tags extracted via NLP (e.g. “bent,” “bridge,” “tail”), enabling flexible multi-label classification without fixed class boundaries (Bowles et al., 2023).
Host population analysis leverages IR color–color diagrams (e.g., W1–W2 vs. W2–W3) to separate ellipticals, QSOs, LIRGs, and star-forming hosts (Banfield et al., 2015). RGZ identifies:
- Majority hosts: ellipticals/LIRGs, with ∼15% in QSO locus, ∼20% LIRGs/starbursts at high W2–W3.
- A redder subpopulation consistent with dust-rich merger-driven AGN fueling.
- Host types correlate with radio power: FR I in LIRGs/stars, FR II in QSOs.
Rare object discoveries include:
- Multiple new GRG candidates (>1 Mpc), including a 4.6 Mpc system (third largest known as of 2015).
- 25 new candidate HyMoRS spanning L_1.4 GHz ∼10²³–10²⁶ W Hz⁻¹ sr⁻¹ at 0.14<z<1.0 (Kapinska et al., 2017).
- Spiral-host doubles, “green bean” galaxy hybrids, and rare cluster-aligned systems.
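Flagging a giant radio galaxy candidate requires converting the largest angular size of the source into a projected physical size at the host redshift. The sketch below does this for a flat ΛCDM cosmology; the values H0 = 70 km/s/Mpc and Ωm = 0.3 are assumptions for illustration, as are the function names and the example source, and the 1 Mpc threshold follows the text above.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s

def angular_diameter_distance_mpc(z, h0=70.0, omega_m=0.3, steps=1000):
    """Angular diameter distance in Mpc for a flat LambdaCDM cosmology.

    h0 and omega_m are assumed values; steps must be even (Simpson's rule).
    """
    omega_l = 1.0 - omega_m

    def inv_e(zp):
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_l)

    # Composite Simpson's rule for the comoving-distance integral.
    h = z / steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    comoving = (C_KM_S / h0) * total * h / 3.0
    return comoving / (1.0 + z)

def projected_size_mpc(theta_arcsec, z, **cosmo):
    """Projected linear size in Mpc of a source spanning theta_arcsec at redshift z."""
    theta_rad = math.radians(theta_arcsec / 3600.0)
    return theta_rad * angular_diameter_distance_mpc(z, **cosmo)

# Hypothetical source: 10 arcmin across, host at z = 0.3.
size = projected_size_mpc(600.0, 0.3)
is_grg_candidate = size > 1.0  # threshold as quoted in the text
```

The result is a projected size, so it is a lower limit on the true extent of a source inclined to the plane of the sky.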
4. Environmental Impact and Statistical Trends
RGZ datasets underpin research on radio AGN environmental dependence, jet–ICM interaction, and AGN duty cycles (Garon et al., 2019, Rodman et al., 2018, Banfield et al., 2016):
- 87% of spatially extended RGZ sources are cluster-associated, more centrally concentrated than general galaxy samples (Σ_radio ∝ r^(−1.10±0.03)).
- Bending angle (Δθ) of radio jets/tails increases toward cluster centers and in higher mass clusters, tracking ICM ram pressure (P_icm).
- Radial orientation and tail asymmetry diagnostics provide evidence for radial orbits of bent sources within clusters, with inward tails suppressed by higher ICM densities.
- Wide-angle tail (WAT) and NAT systems discovered in RGZ are effective signposts for both rich and poor clusters, including previously undetected low-mass groups (Banfield et al., 2016).
- Asymmetry in lobe lengths/luminosities correlates quantitatively with environmental galaxy density, validating classical FR II dynamical models with D ∝ ρ_amb^(−0.29±0.07) (Rodman et al., 2018).
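Power-law indices like those quoted for the radial surface density and the lobe-length scaling are commonly estimated by a straight-line fit in log-log space. The following is a minimal sketch of that approach using unweighted least squares; the function name and the synthetic data are hypothetical, and the cited analyses may use weighted or more robust estimators.

```python
import math

def fit_power_law(r, y):
    """Fit y = A * r**alpha by unweighted least squares in log-log space.

    Returns (A, alpha). All inputs must be positive.
    """
    lx = [math.log(v) for v in r]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    alpha = sxy / sxx                     # slope = power-law index
    amplitude = math.exp(my - alpha * mx)  # intercept = log(A)
    return amplitude, alpha

# Synthetic, noise-free profile with a known index, for checking the fit.
radii = [0.1, 0.2, 0.5, 1.0, 2.0]
surface_density = [3.0 * r ** -1.10 for r in radii]
amplitude, alpha = fit_power_law(radii, surface_density)
```

Because the fit is unweighted and ignores measurement errors, it returns only a point estimate; the uncertainties quoted in the text would require a proper error analysis.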
5. Machine Learning Pipelines and Algorithmic Integration
RGZ labels directly enable scalable, state-of-the-art machine learning for source association, morphological classification, and cross-identification (Lukic et al., 2018, Alger et al., 2018, Galvin et al., 2019, Chen et al., 2023, Tang et al., 19 Jun 2025, Vardoulaki et al., 24 Sep 2025):
- CNN architectures (3–4 conv layers) achieve 93.5–97.4% accuracy on compact vs. extended or multi-class tasks for RGZ morphologies; data augmentation is essential for performance (Lukic et al., 2018).
- Cross-identification of radio sources with IR hosts via ML approaches (logistic regression, random forest, simple CNN) matches or exceeds nearest-neighbor baseline, ~93–96% for resolved sources, and is robust to crowdsourced vs. expert training labels (Alger et al., 2018).
- Dimensionality-reduction pipelines (e.g., rotation-invariant self-organising maps) condense two-channel (radio+IR) images to ∼200-dimensional fingerprints, supporting >85% transfer-accuracy for crowdsourced labels and allowing continuous, uncertainty-aware morphological scaling (Galvin et al., 2019).
- Multi-modal transformer pipelines jointly ingest images (Vision Transformer) and text (BERT) from RadioTalk, yielding up to 34% F1 improvement for rare morphologies and discovering >10,000 RGZ sources missed in the main DR1 catalog (Chen et al., 2023).
- New frameworks for EMU (ASKAP) integrate citizen scientist pattern recognition, Kolmogorov-complexity anomaly detection, natural language processing of semantic tags, and iterative active learning, producing catalog- and training-grade reliability (>90% agreement with experts for N_voters ≥15) (Tang et al., 19 Jun 2025, Vardoulaki et al., 24 Sep 2025).
| Model | Classification Task | Accuracy (%) | Notable Features |
|---|---|---|---|
| CNN (3 conv) | Compact vs. Extended (RGZ DR1) | 97.4 | Image augmentation, no σ‐clip |
| CNN (4-class) | Compact, 1-, 2-, ≥3-component | 93.5 | Some confusion on multi-lobed |
| Random Forest | Host galaxy cross-ID (ATLAS-SWIRE) | 93–96 | Comparable for RGZ & expert labels |
| Rot-inv SOM | # Peaks/Components regression | 85.7 (max) | Uncertainty-aware, rotation invariant |
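The nearest-neighbor baseline against which the ML cross-identification methods are compared can be sketched in a few lines. This is an illustration only: the function names, the catalogue layout, the coordinates, and the 10″ search radius are assumptions, not details of the cited work.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def nearest_neighbour_host(radio_pos, ir_catalogue, max_sep_arcsec=10.0):
    """Return the id of the closest IR source within max_sep_arcsec, else None.

    radio_pos: (ra, dec) in degrees.
    ir_catalogue: iterable of (source_id, ra, dec) in degrees.
    """
    best_id, best_sep = None, float("inf")
    for source_id, ra, dec in ir_catalogue:
        sep = angular_separation_deg(radio_pos[0], radio_pos[1], ra, dec) * 3600.0
        if sep < best_sep:
            best_id, best_sep = source_id, sep
    return best_id if best_sep <= max_sep_arcsec else None

# Hypothetical IR catalogue and radio position.
ir_sources = [("ir_a", 10.0, -30.0), ("ir_b", 10.001, -30.0005)]
host = nearest_neighbour_host((10.0009, -30.0004), ir_sources)
```

A purely positional match of this kind works for compact sources but has no way to pick a host lying between two widely separated lobes, which is the case the learned methods and volunteer labels are meant to handle.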
6. Taxonomy, Semantic Tagging, and Catalog Outputs
Recognizing the limitations of rigid hierarchical morphological classification, RGZ EMU and related projects have transitioned to flat or graph-based systems of multi-label tags derived from NLP and citizen-science annotation (Bowles et al., 2023):
- 22 semantic tags (e.g., bent, core, jet, bridge, amorphous, hourglass, tail) now span both traditional and complex morphologies.
- Tagging pipeline: plain-English free-form annotation → vector embedding (spaCy/GloVe) → clustering → multi-label assignment.
- Tagging enhances rare-morphology recovery (e.g., ∼82% for star-forming galaxies) and supports arbitrary Boolean queries for scientific or ML use.
- Catalogs released in machine- and human-readable formats (CSV, VO-table), with host, morphology, semantic tags, cross-match flags, consensus weights, and ML confidences (Wong et al., 2024, Tang et al., 19 Jun 2025).
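Boolean queries over a flat multi-label tag set reduce to set operations. The sketch below shows one way this could look; the source identifiers, the in-memory dictionary layout, and the `query_tags` helper are hypothetical, while the tag names are among those listed above.

```python
def query_tags(catalogue, all_of=(), any_of=(), none_of=()):
    """Return sorted source ids whose tags satisfy a Boolean query.

    all_of: every tag must be present (AND).
    any_of: at least one tag must be present (OR); ignored if empty.
    none_of: no tag may be present (NOT).
    """
    all_of, any_of, none_of = set(all_of), set(any_of), set(none_of)
    matches = []
    for source_id, tags in catalogue.items():
        if not all_of <= tags:
            continue
        if any_of and not (any_of & tags):
            continue
        if none_of & tags:
            continue
        matches.append(source_id)
    return sorted(matches)

# Hypothetical multi-label catalogue: each source may carry several tags.
catalogue = {
    "src_001": {"bent", "tail"},
    "src_002": {"core", "jet"},
    "src_003": {"bent", "core", "jet"},
    "src_004": {"amorphous"},
}

bent_without_core = query_tags(catalogue, all_of={"bent"}, none_of={"core"})
jet_or_amorphous = query_tags(catalogue, any_of={"jet", "amorphous"})
```

Because tags are independent labels rather than branches of a hierarchy, a source such as `src_003` can match queries on "bent" and on "jet" at the same time.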
7. Future Developments and Scientific Impact
RGZ and its derivative projects are central to the success of next-generation radio sky surveys, shaping both catalog completeness and the training of fully automated pipelines (Tang et al., 19 Jun 2025):
- Extended to southern hemisphere surveys (ASKAP/EMU) via RGZ EMU, targeting 4 million extended sources and integrating machine learning, citizen science, and NLP at scale.
- Prototypes incorporate deep-learning architectures (CNNs, UNet/transformers), multi-modal fusion, and active learning for rare-object prioritization.
- Catalog outputs are cross-matched with polarization (POSSUM), H I (WALLABY), and deep optical/IR (Euclid, LSST) for direct environmental, AGN feedback, and cosmological studies.
- Synthetic data from guided diffusion models supplement training sets, address contrast/bias for rare morphologies, and enable restoration/inpainting for incomplete observations (Potevineau et al., 12 Jan 2026).
- Ongoing catalog releases (e.g., RGZ DR2, EMUCAT) expand coverage, reliability, and access, setting standards for open, reproducible, and scale-flexible astronomical databases.
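Active learning for rare-object prioritization can take many forms. One common and simple strategy, shown here purely as an illustration, is uncertainty sampling: sources whose predicted class probabilities have the highest entropy are sent to volunteers first. The function names, class probabilities, and source ids below are hypothetical, and this is not claimed to be the selection rule used by RGZ EMU.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_for_review(predictions, top_k=None):
    """Order source ids from most to least uncertain.

    predictions: dict mapping source_id -> list of class probabilities.
    top_k: optionally keep only the top_k most uncertain sources.
    """
    ranked = sorted(
        predictions,
        key=lambda source_id: predictive_entropy(predictions[source_id]),
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]

# Hypothetical classifier outputs over three morphology classes.
predictions = {
    "src_a": [0.98, 0.01, 0.01],  # confident
    "src_b": [0.40, 0.35, 0.25],  # ambiguous
    "src_c": [0.70, 0.20, 0.10],
}
review_queue = rank_for_review(predictions)
```

Ranking by entropy concentrates limited volunteer effort on the sources the model finds hardest, which is where rare or unusual morphologies tend to appear.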
By combining citizen science and machine learning, RGZ has established the viability of distributed, accurate morphological classification and host identification in radio astronomy. It underpins studies of AGN feedback, cosmic magnetism, and cluster/filamentary structure, and serves as a model for scientific catalog generation in the era of petabyte-scale astronomical surveys.