- The paper introduces an unsupervised method for cell instance segmentation based on object-centric embeddings, learned by training a CNN so that embedding differences reflect spatial pixel offsets.
- The unsupervised method achieves competitive results on diverse microscopy datasets, outperforming baselines on SEG score on six of nine benchmarks and reaching an F1 of 0.64 on TissueNet.
- Used as pseudo labels to support supervised training, the unsupervised predictions cut manual annotation needs by an order of magnitude, reaching an F1 of 0.75 on TissueNet.
The paper presents a method for unsupervised cell instance segmentation in microscopy images by learning object-centric embeddings (OCEs). The core idea is to embed small image patches with a convolutional neural network (CNN) so that differences between their embeddings reflect the corresponding spatial offsets in the image. In other words, the embedding function is trained to minimize
L = ∑_{i,j ∈ P} ϕ(d(i,j) − r(i,j)) + λ_reg ∥r(i)∥²,
- where i and j denote pixel positions in a restricted neighborhood P (with ∥i − j∥₂ ≤ κ),
- d(i,j)=i−j is the spatial displacement in the image,
- r(i,j)=f(patchi)−f(patchj) represents the difference in their embeddings,
- and λreg is a regularization hyperparameter.
The training objective is constructed based on the theoretical insight that in microscopy images, under the assumptions that (i) objects (cells) have similar appearances, (ii) are randomly distributed, and (iii) local patches are informative about their intra-object positions, the expected offset between patches originating from different objects has zero mean. Thus, when averaging over many pairs, the inter-object offsets vanish and the expected embedding difference approximates the intra-object spatial offset.
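This zero-mean argument can be checked with a quick simulation. The sketch below is a toy model (not from the paper): it draws pixel pairs from independently, uniformly placed objects and verifies that the inter-object offsets average out to approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: objects are placed uniformly at random in a 512x512 field.
# For a pixel in one object and a pixel in a *different* object, the
# offset between them is symmetric around zero, so averaging over many
# inter-object pairs makes those offsets vanish.
n_pairs = 200_000
pos_i = rng.uniform(0, 512, size=(n_pairs, 2))  # pixel from object A
pos_j = rng.uniform(0, 512, size=(n_pairs, 2))  # pixel from object B
inter_object_offsets = pos_j - pos_i

mean_offset = inter_object_offsets.mean(axis=0)
print(mean_offset)  # close to (0, 0)
```

With both positions uniform on the field, the offset distribution is symmetric, so the empirical mean shrinks toward zero as the number of pairs grows.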
Several key technical points are highlighted in the work:
- Self-Supervised Loss and Damping:
To mitigate high-variance contributions from patch pairs belonging to different objects, a damping function is applied. The authors use a sigmoid-based distance measure
ϕ(δ) = (1 + exp(−τ∥δ∥₂²))⁻¹,
with the temperature parameter τ controlling the damping strength. This formulation stabilizes training by bounding the contribution of pairs with large residuals.
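A minimal NumPy sketch of this damped objective on toy patch embeddings (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def phi(delta, tau=1.0):
    """Sigmoid damping: bounds the contribution of large residuals."""
    sq = np.sum(delta ** 2, axis=-1)
    return 1.0 / (1.0 + np.exp(-tau * sq))

def oce_loss(embeddings, coords, kappa=2.0, lam=1e-4, tau=1.0):
    """Toy version of the OCE objective over all patch pairs.

    embeddings: (n, 2) predicted embedding per patch
    coords:     (n, 2) pixel position of each patch
    Pairs are restricted to the neighborhood ||i - j||_2 <= kappa.
    """
    d = coords[None, :, :] - coords[:, None, :]          # spatial offsets d(i, j)
    r = embeddings[None, :, :] - embeddings[:, None, :]  # embedding offsets r(i, j)
    mask = np.linalg.norm(d, axis=-1) <= kappa           # neighborhood P
    pair_term = phi(d - r, tau)[mask].sum()
    reg_term = lam * np.sum(embeddings ** 2)             # regularization
    return pair_term + reg_term

# Sanity check: embeddings that exactly reproduce pixel coordinates
# (zero residual; phi sits at its minimum of 0.5) score better than
# heavily perturbed embeddings.
xs, ys = np.meshgrid(np.arange(5.0), np.arange(5.0))
coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
rng = np.random.default_rng(1)
good = oce_loss(coords, coords)
bad = oce_loss(coords + rng.normal(0, 5, coords.shape), coords)
```

Note that ϕ(0) = 0.5, so the pair term is bounded below by a constant rather than zero; what matters for training is its gradient toward small residuals.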
- Network Architecture and Field of View (FoV):
The embedding function is implemented as a U-Net–style CNN. For self-supervised learning, the network is configured with a small FoV (16×16 pixels) to ensure that patches do not encapsulate an entire object while still containing sufficient discriminative information. For supervised refinement experiments, a deeper U-Net is used to expand the FoV.
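The required FoV can be sanity-checked with the standard receptive-field recurrence for a stack of conv/pool layers (a generic helper, not from the paper; the example layer stack is illustrative):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    The field grows by (kernel_size - 1) * jump at each layer, where
    jump is the product of the strides of all preceding layers.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Example: two 3x3 convs, a 2x2 max-pool, then two more 3x3 convs.
fov = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)])
print(fov)  # 14
```

This is how one would verify that a chosen architecture stays near the intended 16x16-pixel FoV rather than silently covering whole cells.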
- Instance Segmentation Pipeline:
- Foreground-Background Separation:
- Noise augmentation (via salt and pepper noise) is applied repeatedly to generate multiple noisy instantiations of the raw image. The per-pixel variance over the predicted embeddings is computed, and a bimodal variance distribution allows the use of a threshold (e.g., Otsu’s method) to effectively separate the foreground (cell regions) from the background.
- Clustering:
- Mean-shift clustering, as implemented in scikit-learn, is subsequently applied to the foreground regions’ embeddings to group patches into individual cell instances.
- Empirical Evaluations:
- F1 Score: The harmonic mean of precision and recall, with true positives determined by an intersection-over-union (IoU) threshold (typically 0.5).
- SEG Score: Computes the average IoU for matched ground truth and predicted objects.
- Notable numerical results from the unsupervised setting include:
- On the Simulated dataset, the method achieved an F1 score of 0.83 and SEG of 0.65.
- On the aggregated TissueNet data, performance reached F1 0.64 and SEG 0.52.
- In six out of nine datasets, the proposed approach outperforms baselines such as the pre-trained models from Schmidt et al. (StarDist) and Cellpose when measured by the SEG score, while performing comparably on the remaining datasets.
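The two metrics can be sketched for label images as follows. This is a simplified re-implementation for illustration; the SEG matching rule here follows the cell-tracking-challenge convention that a prediction must cover more than half of a ground-truth object to count as a match.

```python
import numpy as np

def iou_matrix(gt, pred):
    """Pairwise IoU between instances of two label images (0 = background)."""
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pred_ids = [j for j in np.unique(pred) if j != 0]
    ious = np.zeros((len(gt_ids), len(pred_ids)))
    for a, i in enumerate(gt_ids):
        for b, j in enumerate(pred_ids):
            inter = np.sum((gt == i) & (pred == j))
            union = np.sum((gt == i) | (pred == j))
            ious[a, b] = inter / union if union else 0.0
    return ious

def f1_score(gt, pred, thresh=0.5):
    """F1 with greedy one-to-one matching at an IoU threshold."""
    ious = iou_matrix(gt, pred)
    tp = 0
    while ious.size and ious.max() >= thresh:
        a, b = np.unravel_index(ious.argmax(), ious.shape)
        tp += 1
        ious[a, :] = 0.0                    # each object matched at most once
        ious[:, b] = 0.0
    n_gt, n_pred = ious.shape
    fp, fn = n_pred - tp, n_gt - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

def seg_score(gt, pred):
    """Mean IoU of each gt object with its best-overlapping prediction,
    counting 0 when no prediction covers more than half of the object."""
    scores = []
    for i in np.unique(gt):
        if i == 0:
            continue
        m = gt == i
        covering = pred[m]
        covering = covering[covering != 0]
        if covering.size == 0:
            scores.append(0.0)
            continue
        j = np.bincount(covering).argmax()  # most frequent overlapping label
        inter = np.sum(m & (pred == j))
        if inter <= 0.5 * m.sum():
            scores.append(0.0)
            continue
        scores.append(inter / np.sum(m | (pred == j)))
    return float(np.mean(scores)) if scores else 0.0
```

F1 is all-or-nothing per object (a match either clears the IoU threshold or not), while SEG rewards partial overlap quality, which is why the two scores can rank methods differently.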
- Supported Supervised Learning:
Beyond unsupervised segmentation, the paper investigates how the unsupervised predictions can serve as pseudo labels to support supervised training. A hybrid training regimen is presented in which a small fraction (as low as 1%) of manual annotations is combined with pseudo ground truth estimates. This supported supervision achieves an F1 score of 0.75 ± 0.03 on TissueNet, significantly better than the purely unsupervised score of 0.64. The results indicate that the annotation burden can be reduced by an order of magnitude without compromising segmentation accuracy.
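The data-mixing side of this regimen can be sketched as follows (a toy illustration of selecting a 1% manually annotated subset and falling back to pseudo labels elsewhere; all names are hypothetical, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 1000
frac_manual = 0.01  # as low as 1% of the training set

manual_idx = rng.choice(n_images, size=int(frac_manual * n_images),
                        replace=False)
is_manual = np.zeros(n_images, dtype=bool)
is_manual[manual_idx] = True

# Each training image uses the manual annotation where available and the
# unsupervised OCE prediction as pseudo ground truth otherwise.
label_source = np.where(is_manual, "manual", "pseudo")
n_manual = int(is_manual.sum())
```

The supervised model then trains on the full set of images, with only the small manual subset carrying human-verified labels.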
- Sensitivity to Scale and Limitations:
The authors examine the effect of patch scale on segmentation performance. The patch size must be chosen carefully: small enough to exclude whole objects, yet large enough to capture reliable spatial features. In practice, the method demonstrates robustness across a range of scale factors for different datasets (e.g., for Immune and Lung tissue types), while also highlighting typical failure modes, such as under-segmentation of outliers (cells with substantially different morphology) and difficulties when cells exhibit non-random spatial arrangements.
In summary, the paper introduces a theoretically motivated, self-supervised approach for learning spatially informative object-centric embeddings. By aligning embedding differences with spatial offsets and utilizing a robust post-processing pipeline (including noise-based variance thresholding and mean-shift clustering), the method yields competitive cell instance segmentation results across multiple imaging modalities. Moreover, the framework is shown to be effective in both fully unsupervised settings and as a means to support downstream supervised training with minimal annotations.