
Multi-Source Benchmark for Organelle Segmentation

Updated 25 January 2026
  • The paper introduces a benchmark dataset from diverse EM sources with over 100,000 images for segmented multi-organelle evaluation.
  • It employs a novel 3D Label Propagation Algorithm combined with expert refinement to achieve near-perfect instance segmentation.
  • Experiments show that state-of-the-art models (e.g., Mask2Former) exhibit performance drops due to local-context limitations in heterogeneous EM data.

Accurate instance-level segmentation of cellular organelles in electron microscopy (EM) is a foundational requirement for quantitative analysis of subcellular architecture and inter-organelle interactions. Existing benchmarks, built on small and curated datasets, inadequately represent the broad heterogeneity and spatial context present in real-world EM data. To address these limitations, Lu et al. introduce a large-scale, multi-source benchmark for multi-organelle instance segmentation in the wild, comprising over 100,000 2D EM images extracted from a diverse array of cell types and imaging modalities, and annotated for five key organelle classes using a connectivity-aware Label Propagation Algorithm (3D LPA) with expert refinement (Lu et al., 18 Jan 2026). This benchmark reveals significant generalization gaps and failure modes in state-of-the-art segmentation models, highlighting the need for advances in architectures and annotation protocols to bridge the gap between local-context algorithms and the demands of large-scale heterogeneous EM volumes.

1. Dataset Architecture and Multi-Source Sampling

The benchmark dataset integrates EM slices from several distinct sources to maximize morphotype and imaging heterogeneity:

  • Sources: The primary data structures are from OpenOrganelle (whole-cell FIB-SEM volumes), BetaSeg (3D FIB-SEM reconstructions of mouse β-cells), and private in-house FIB-SEM/TEM volumes (stitched and registered).
  • Cell types: Dataset samples span multiple cell lines (neurons, β-cells, cultured mammalian cells), representing a broad spectrum of morphological and functional diversity.
  • Imaging modalities: The majority of data is acquired via FIB-SEM, with a smaller subset from TEM-tilt series, ensuring cross-modality representation.
  • Volume and resolution: The dataset includes over 100,000 unique 2D EM images with native resolutions spanning ~4–16 nm/px (in-plane) and z-spacing of 4–15 nm. These volumes vary in anisotropy, and all are resampled to 8.0 nm/px in-plane for training, mitigating scale-induced variation.
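The resampling step described above can be sketched with `scipy.ndimage.zoom` (an illustrative helper; the authors' exact resampling pipeline and interpolation order are not specified in the summary):

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_target(img, native_nm_per_px, target_nm_per_px=8.0):
    """Rescale a 2D EM slice so that one pixel covers target_nm_per_px.

    A native 4 nm/px image is shrunk by 4/8 = 0.5; a 16 nm/px image is
    enlarged by 16/8 = 2.0. Bilinear interpolation (order=1) is assumed.
    """
    factor = native_nm_per_px / target_nm_per_px
    return zoom(img, factor, order=1)
```

Aligning every source to a common 8 nm/px grid removes apparent size differences between organelles that stem purely from acquisition resolution.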

2. Organelle Class Definitions and Morphological Variability

Five organelle classes are annotated at the instance level, capturing considerable spatial and morphological diversity:

  1. Mitochondria (“Mito”): Rod- and spheroid-shaped structures, with wide volumetric ranges.
  2. Nucleus: Large contiguous volumes characterized by smooth or invaginated envelopes.
  3. Endoplasmic Reticulum (“ER”): Sprawling, interconnected networks of tubules and sheets, often distributed globally across fields of view.
  4. Endosome (“Endo”): Vesicular compartments, variable in electron density and size.
  5. Golgi apparatus (“Golgi”): Stacked cisternal membranes with considerable morphological diversity, generally compact yet variable.

The benchmark systematically samples organelle instances at three volumetric scales: small (<5k voxels), medium (5k–10k voxels), and large (>10k voxels).
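The volumetric stratification can be expressed as a simple bucketing rule (the handling of the exact 10k boundary is an assumption; the summary only gives the three ranges):

```python
def size_bucket(n_voxels: int) -> str:
    """Assign an organelle instance to the benchmark's volumetric stratum
    by its voxel count. Boundary handling at exactly 10k voxels is assumed."""
    if n_voxels < 5_000:
        return "small"
    if n_voxels <= 10_000:
        return "medium"
    return "large"
```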

3. Annotation Protocols: Connectivity-Aware 3D Label Propagation and Expert Curation

Three major annotation workflows are employed to generate high-quality instance labels:

  • BetaSeg: Semantic labels are processed by the 3D Label Propagation Algorithm (3D LPA) to generate instance proposals, which are further subjected to expert proofreading.
  • OpenOrganelle: Semantic plus instance labels are provided, refined directly by expert annotators.
  • Private volumes: Initial coarse labels are derived from pretrained models, followed by expert iterative proofreading.

Connectivity-Aware Label Propagation Algorithm (3D LPA)

Let $V \subset \mathbb{Z}^3$ be the voxel grid and $S : V \to \mathbb{Z}_{\ge 0}$ the input semantic volume ($S(p) = 0$ denotes background). The objective is to compute an integer instance label volume $L : V \to \mathbb{N}$.

  • Initialization ($t = 0$): Every foreground voxel is assigned a unique identifier (UID):

L^{(0)}(p) = \begin{cases} \mathrm{UID}(p) & \text{if } S(p) > 0 \\ 0 & \text{otherwise} \end{cases}

  • Neighborhood propagation: $\mathcal{A}(p) = \{ q \in V \mid q \text{ adjacent to } p \}$ with 26-connectivity in 3D.
  • Iterative update:

L^{(t+1)}(p) = \min\left( \{ L^{(t)}(p) \} \cup \{ L^{(t)}(q) \mid q \in \mathcal{A}(p) \} \right)

Background voxels remain at zero.

  • Convergence: The process iterates until $L^{(T)} = L^{(T-1)}$, at which point each connected foreground component adopts the minimal UID, effectively segmenting disjoint instances.

There is no explicit energy function; the algorithm performs minimum-label propagation on the 26-connected voxel graph.
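The update rule above can be sketched in NumPy (an illustrative implementation, not the authors' code): each sweep replaces every foreground voxel's label with the minimum over itself and its 26 neighbors, and the loop stops when a sweep changes nothing.

```python
import numpy as np

def lpa_3d(S, max_iters=100_000):
    """Minimum-label propagation over a 26-connected 3D voxel grid.

    S: integer semantic volume, 0 = background.
    Returns L where each 26-connected foreground component carries the
    minimal initial UID of its voxels (distinct components -> distinct ids).
    """
    fg = S > 0
    # t = 0: a unique positive id per foreground voxel; background stays 0.
    L = np.where(fg, np.arange(1, S.size + 1).reshape(S.shape), 0).astype(np.int64)
    INF = np.iinfo(np.int64).max
    offsets = [(dz, dy, dx)
               for dz in (-1, 0, 1) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dz, dy, dx) != (0, 0, 0)]
    for _ in range(max_iters):
        M = np.where(fg, L, INF)          # background never wins the min
        best = M.copy()
        for off in offsets:               # min over all 26 neighbors
            shifted = np.full_like(M, INF)
            src = tuple(slice(max(d, 0), M.shape[i] + min(d, 0))
                        for i, d in enumerate(off))
            dst = tuple(slice(max(-d, 0), M.shape[i] + min(-d, 0))
                        for i, d in enumerate(off))
            shifted[dst] = M[src]
            np.minimum(best, shifted, out=best)
        new_L = np.where(fg, best, 0)
        if np.array_equal(new_L, L):      # convergence: L^(T) == L^(T-1)
            return new_L
        L = new_L
    return L
```

In practice the same result can be obtained with a single connected-components pass; the iterative form mirrors the propagation formulation given in the text.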

Annotation Accuracy and Expert Review

Annotators consisted of one senior expert and six trainees. For BetaSeg, initial 3D LPA proposals attained approximately 73% instance-wise correctness prior to proofreading; the remaining ~27% (incorrect splits and merges) were resolved through iterative consensus review, yielding near-perfect consistency in the released labels.

4. Benchmarked Segmentation Models and Training Schemes

Quantitative and qualitative evaluation was conducted across several representative model architectures:

U-Net (Bottom-Up Pipeline)

  • Network: 2D U-Net with skip connections, outputting pixel-wise class logits with optional boundary channel.
  • Decoding: Instance masks inferred by watershed segmentation on affinity maps, followed by 3D connected components.
  • Training: Pixel-wise cross-entropy (CE) loss; auxiliary boundary BCE when enabled. 512×512 input crops. Adam optimizer (lr=0.005, batch size=64) over 9,000 iterations using four A100 GPUs, with scale normalization at 8 nm/px.
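The bottom-up decoding step can be sketched as follows; this is a scipy-only stand-in for the watershed step (seeds from high-confidence interior pixels, nearest-seed assignment for the rest), with thresholds chosen for illustration:

```python
import numpy as np
from scipy import ndimage as ndi

def decode_instances(fg_prob, boundary_prob, fg_thresh=0.5, bnd_thresh=0.2):
    """Bottom-up decoding of pixel-wise predictions into an instance map.

    fg_prob, boundary_prob: 2D float maps in [0, 1] (hypothetical model
    outputs). Seeds are foreground pixels where the boundary channel is
    quiet; remaining foreground pixels join their nearest seed - a simple
    stand-in for the watershed step described in the text.
    """
    fg = fg_prob > fg_thresh
    seeds, _ = ndi.label(fg & (boundary_prob < bnd_thresh))
    # Nearest-seed assignment via the distance transform's index map.
    _, idx = ndi.distance_transform_edt(seeds == 0, return_indices=True)
    inst = seeds[tuple(idx)]
    inst[~fg] = 0
    return inst
```

In the full 3D pipeline these 2D decodings would then be linked across slices by 3D connected components, as the summary notes.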

SAM Variants (Prompt-able Vision Transformer Models)

  • Base encoder: Vision Transformer; mask decoder prompted by box/point proposals.
  • Variants:
    • SAM(a): Vanilla, fully automatic (no prompts).
    • SAM(p): Prompt-guided with synthesized centroid and boundary points.
    • micro-SAM(a/p): Lightweight, microscopy-trained SAM in analogous configurations.
  • Evaluation: Zero-shot; no fine-tuning on benchmark data.
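For the prompt-guided SAM(p) setting, centroid point prompts can be synthesized from instance masks roughly as follows (the authors' exact prompt-synthesis rules, including how boundary points are chosen, are not given in the summary, so this is an assumption):

```python
import numpy as np
from scipy import ndimage as ndi

def centroid_point_prompts(instance_map):
    """Synthesize one positive point prompt per instance: its center of mass.

    instance_map: 2D integer array, 0 = background, k > 0 = instance id.
    Returns a list of (x, y) pixel coordinates, the ordering SAM's
    predictor expects for point prompts.
    """
    ids = [i for i in np.unique(instance_map) if i != 0]
    coms = ndi.center_of_mass(instance_map > 0, labels=instance_map, index=ids)
    return [(float(x), float(y)) for (y, x) in coms]
```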

Mask2Former (End-to-End Transformer)

  • Architecture: Multi-scale deformable attention; decoder outputs per-query mask and class logits; Hungarian assignment to ground truth.
  • Losses: Standard instance segmentation losses (class CE, mask dice/BCE).
  • Training: Identical to U-Net pipeline.
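The Hungarian assignment of decoder queries to ground-truth instances can be illustrated with `scipy.optimize.linear_sum_assignment`; the cost weighting below is a sketch, not the paper's exact matching loss:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(class_prob, mask_dice):
    """Hungarian matching of N decoder queries to M ground-truth instances.

    class_prob, mask_dice: (N, M) arrays; the pairwise cost rewards high
    class confidence and high mask overlap (illustrative equal weighting).
    Returns a list of (query_index, gt_index) pairs minimizing total cost.
    """
    cost = -class_prob + (1.0 - mask_dice)   # lower cost = better match
    q_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(q_idx.tolist(), gt_idx.tolist()))
```

Unmatched queries (N > M) are trained toward the "no object" class, per the standard Mask2Former recipe.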

5. Quantitative Performance and Failure Modality Analysis

Segmentation model effectiveness is evaluated by per-class Dice coefficient and Intersection-over-Union (IoU). AP/mAP are not reported.
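Both metrics follow their standard definitions over binarized class masks; a minimal reference implementation:

```python
import numpy as np

def per_class_dice_iou(pred, gt, cls):
    """Dice coefficient and IoU for one class, from integer label maps.

    Empty-vs-empty is scored as a perfect match (1.0), a common convention
    (the paper's edge-case handling is not specified in the summary).
    """
    p, g = pred == cls, gt == cls
    inter = np.logical_and(p, g).sum()
    denom = p.sum() + g.sum()
    union = np.logical_or(p, g).sum()
    dice = 2.0 * inter / denom if denom else 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)
```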

Homogeneous Subset Performance

All values are Dice coefficients.

| Model | Mito | Nucleus | ER | Endo | Golgi | Average |
| --- | --- | --- | --- | --- | --- | --- |
| U-Net | 0.692 | 0.879 | 0.248 | 0.216 | 0.337 | 0.605 |
| Mask2Former | 0.829 | 0.941 | 0.273 | 0.537 | 0.408 | 0.672 |

In-the-Wild Benchmark Results

All values are Dice coefficients.

| Model | Mito | Nucleus | ER | Endo | Golgi | Average |
| --- | --- | --- | --- | --- | --- | --- |
| U-Net | 0.650 | 0.840 | 0.180 | 0.210 | 0.300 | 0.436 |
| Mask2Former | 0.825 | 0.940 | 0.255 | 0.530 | 0.390 | 0.588 |

This suggests that Mask2Former outperforms U-Net across most classes, but both models exhibit substantial performance degradation in globally distributed organelles such as ER and Golgi, especially in heterogeneous datasets.

Observed Generalization Gaps

  • Transitioning from homogeneous to in-the-wild data induces an average Dice drop of roughly 0.08 for Mask2Former (0.672 → 0.588).
  • ER segmentation is systematically the worst-performing across models, reflecting its inherently global, interconnected morphology which is not adequately captured by local-context architectures.
  • Golgi performance is also suboptimal but less so than ER.

6. Architectural Limitations and Future Research Directions

Patch-based and local-context models, standard in current segmentation frameworks, exhibit fundamental limitations:

  • Generalization gap: Despite scale alignment, model performance decreases sharply when faced with cross-resolution and modality heterogeneity.
  • Local-global mismatch: 512×512 input crops artificially fragment globally continuous organelles, precluding recovery of true instance topology. Lu et al. note, “Models that only see 512×512 fragments are physically incapable of resolving the global instance topology of the ER.”

Recommendations for Future Work

Lu et al. advocate for the following directions:

  • Development of multi-scale and global-context architectures (e.g., full-slice attention, graph-based connectivity modules).
  • Utilization of cross-slice or 3D context to maintain long-range morphological continuity.
  • New partitioning strategies that avoid arbitrary cropping, such as overlapping windows with connectivity stitching.
  • Dynamic receptive fields tailored to organelle morphology.
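One of these directions, overlapping windows with connectivity stitching, can be sketched in 2D as follows (tile and overlap sizes are hypothetical; the paper does not prescribe an implementation):

```python
import numpy as np
from scipy import ndimage as ndi

def tiled_label(mask, tile=64, overlap=16):
    """Label a binary mask tile-by-tile, then stitch ids across overlaps.

    Per-tile connected components get globally unique ids; wherever two
    tiles overlap, coinciding ids are merged with union-find, so instances
    spanning tile borders come out whole instead of fragmented.
    """
    parent = {}

    def find(a):
        root = a
        while parent.get(root, root) != root:
            root = parent[root]
        parent[a] = root          # path compression
        return root

    H, W = mask.shape
    out = np.zeros((H, W), np.int64)
    next_id, step = 1, tile - overlap
    for y in range(0, H, step):
        for x in range(0, W, step):
            sub = np.asarray(mask[y:y + tile, x:x + tile], bool)
            lab, n = ndi.label(sub)
            lab = np.where(lab > 0, lab + next_id - 1, 0)
            next_id += n
            region = out[y:y + tile, x:x + tile]
            both = (region > 0) & (lab > 0)
            for a, b in set(zip(region[both].tolist(), lab[both].tolist())):
                parent[find(a)] = find(b)   # same pixel, two ids -> merge
            region[lab > 0] = lab[lab > 0]
    # Resolve union-find roots and relabel instances consecutively.
    roots = {i: find(i) for i in np.unique(out) if i != 0}
    remap = {r: k for k, r in enumerate(sorted(set(roots.values())), 1)}
    final = np.zeros_like(out)
    for i, r in roots.items():
        final[out == i] = remap[r]
    return final
```

The same stitching idea extends to 3D tiles and to merging model predictions rather than connected components, which is the setting the recommendation targets.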

A plausible implication is that fundamental advances in architectural and annotation strategies are necessary to bridge the mismatch between large-scale, heterogeneous EM volumes and the constraints of current patch-based segmentation frameworks (Lu et al., 18 Jan 2026).
