Oral Cytology Dataset: Multi-Institutional Benchmark

Updated 22 November 2025

The Oral Cytology Dataset is a specialized collection of high-resolution, multi-institutional oral mucosa images with detailed annotations for a range of diagnostic tasks.
It supports a range of machine learning frameworks, from supervised to weakly supervised MIL pipelines, by incorporating cellular and regional labels.
The dataset enables robust digital pathology research through standardized imaging protocols, multi-level annotations, and rigorous cross-center validations.

Oral cytology datasets are specialized collections of digitized images, annotations, and metadata derived from cytological brush or liquid-based samples of oral mucosa. They are designed as benchmarks for the development of computational pathology methods, particularly for early detection of oral squamous cell carcinoma (OSCC), and support a spectrum of supervised, weakly supervised, and unsupervised machine learning tasks. These datasets comprise whole-slide images (WSIs), patch-level and cell-level annotations, and multi-class diagnostic labels across multiple patient and institutional sources. The recent emergence of large, multicenter, and rigorously annotated oral cytology datasets, such as the Oral Cytology Dataset (Jain et al., 11 Jun 2025; Mukherjee et al., 15 Nov 2025), has catalyzed research in digital cytology, facilitating the development of algorithms robust to domain shift, annotation incompleteness, and pathological heterogeneity.

1. Origins, Institutional Scope, and Imaging Protocols

The Oral Cytology Dataset, introduced by Jain et al. (11 Jun 2025) and extended by Mukherjee et al. (15 Nov 2025), is presented as the first large, multi-institutional oral cytology resource. It comprises WSIs sourced from ten collaborating medical centers across India, covering 234 patients and 368 glass slides (one WSI per slide). The centers include both tertiary referral hospitals and regional cancer centers, providing demographic and epidemiological diversity. Slide preparation follows uniform Standard Operating Procedures (SOPs):

Staining protocols: Both Papanicolaou (PAP) and May–Grünwald–Giemsa (MGG) stains are used, with 184 slides per protocol.
Scanning resolution: Slides are digitized using a 3DHISTECH scanner at 40× magnification. Image resolutions are 0.125 µm/pixel (58 WSIs) and 0.24 µm/pixel (310 WSIs), with average WSI size approximately 83,000 × 168,000 pixels.
File format: WSIs are provided in multi-resolution MRXS format (~700 GB total dataset size), supporting conventional digital pathology workflows.
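
To give a sense of scale, the figures above can be turned into physical slide extents and an uncompressed memory footprint. The following is a minimal sketch using only the numbers quoted in this section; the helper names are illustrative, and because the average WSI size is reported across both resolutions, the per-resolution extents are rough estimates only.

```python
# Figures quoted in the text above; helper names are illustrative.
AVG_WSI_SIZE_PX = (83_000, 168_000)   # approximate average WSI size (width, height)
PIXEL_SIZES_UM = (0.125, 0.24)        # reported resolutions in micrometres per pixel


def physical_size_mm(size_px, um_per_px):
    """Convert a pixel extent to millimetres at the given pixel size."""
    return tuple(n * um_per_px / 1000.0 for n in size_px)


def raw_rgb_gigabytes(size_px):
    """Uncompressed 8-bit RGB footprint in GB (1 GB = 1e9 bytes)."""
    width, height = size_px
    return width * height * 3 / 1e9


for um_per_px in PIXEL_SIZES_UM:
    w_mm, h_mm = physical_size_mm(AVG_WSI_SIZE_PX, um_per_px)
    print(f"{um_per_px} um/px -> {w_mm:.1f} mm x {h_mm:.1f} mm")

print(f"raw RGB: {raw_rgb_gigabytes(AVG_WSI_SIZE_PX):.1f} GB")
```

At 0.24 µm/pixel an average-sized WSI spans roughly 19.9 mm × 40.3 mm, and holding it uncompressed would take about 41.8 GB, which is why patch-based processing is the norm for these slides.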

2. Annotation Paradigm and Label Structures

Annotations in the Oral Cytology Dataset exhibit multi-tiered granularity:

Patient-/WSI-level labels: Each patient/slide is assigned a single diagnosis by an expert oral pathologist, selected from:
- Healthy (normal epithelium)
- Benign lesion (non-neoplastic/reactive)
- Oral Potentially Malignant Disorder (OPMD)
- Oral Squamous Cell Carcinoma (OSCC)
These weak labels, encoded as $y_i \in \{0,1,2,3\}$, are used for bag-level supervision in deep multiple instance learning (MIL) frameworks (Mukherjee et al., 15 Nov 2025).
Region-level annotations: Polygonal regions of interest (ROIs), typically 2048 × 2048 px, are preselected by pathologists to encompass highly cellular, diagnostically informative areas. For some analyses (e.g., RAA-MIL), only a subset of these manually selected patches is distributed, without exhaustive tiling.
Cell-/nucleus-level annotations: For a substantial subset, segmented nuclear instance masks (QuPath-compatible GeoJSON), bounding boxes, and centroid coordinates are provided. Total annotated nuclei: $N = 39{,}246$, with detailed class-wise counts (e.g., Cat I: $n_1 = 19{,}888$; Cat IV: $n_4 = 7{,}161$). Each nucleus is labeled according to standard cytological categories, supporting instance segmentation and classification benchmarks (Jain et al., 11 Jun 2025).
Quality control: All region and nucleus selections undergo review by two expert pathologists, with disagreements resolved by consensus.
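
As an illustration of how nucleus-level annotations of this kind might be consumed, the sketch below derives a class label, bounding box, and centroid from polygon features. It assumes the general layout QuPath uses when exporting GeoJSON (a FeatureCollection whose features carry a Polygon geometry and a `properties.classification.name` field); the dataset's actual field names and class strings are not specified here, so the embedded example record and the function name `nucleus_records` are hypothetical.

```python
import json

# Hypothetical record in the general shape of a QuPath GeoJSON export;
# the dataset's real field names and class strings may differ.
EXAMPLE_GEOJSON = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "geometry": {"type": "Polygon",
                "coordinates": [[[10, 10], [30, 10], [30, 20], [10, 20], [10, 10]]]},
   "properties": {"classification": {"name": "Cat I"}}}
]}
"""


def nucleus_records(geojson_text):
    """Return one dict per polygon with its class, bounding box, and centroid."""
    records = []
    for feature in json.loads(geojson_text)["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # this sketch skips multi-polygons and points
        ring = geometry["coordinates"][0]   # exterior ring, closed
        xs = [p[0] for p in ring[:-1]]      # drop the repeated closing vertex
        ys = [p[1] for p in ring[:-1]]
        label = feature.get("properties", {}).get("classification", {}).get("name")
        records.append({
            "label": label,
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
            # Mean of the vertices: a cheap proxy, not the area centroid.
            "centroid": (sum(xs) / len(xs), sum(ys) / len(ys)),
        })
    return records


records = nucleus_records(EXAMPLE_GEOJSON)
print(records[0])
```

The vertex mean is used for brevity; where precise centroids matter, the provided centroid coordinates or an area-weighted computation would be the better choice.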

3. Patch Extraction, Data Organization, and Preprocessing

Patch-level representations are integral to enabling deep learning on gigapixel WSIs. The data curation process entails:

Patch size: 2048 × 2048 px raw patches are manually extracted from cellular regions, as opposed to regular grid tiling.
Patch quantity per WSI: Variable, determined by slide-specific cellularity; no explicit formula, but typically several patches per slide (~5 as reported for the cell-level benchmark).
Patch vetting: Only high-cellularity and diagnostically valuable regions are extracted (background and artifact-prone areas are omitted).
Data splits: Stratified splitting at the patient level, with cross-validation and held-out test cohorts to ensure class balance. In the RAA-MIL task (Mukherjee et al., 15 Nov 2025), 162 patients (162 WSIs) are partitioned: 129 for train/val (with 5-fold cross-validation) and 33 patients for a final test set never used for tuning.
Preprocessing: Patches are resized to 224 × 224 px for DINO ViT model ingestion; stain normalization is not applied in the RAA-MIL pipeline but is noted as future work.
Augmentation: Not reported as used in RAA-MIL. In broader cytology literature, standard augmentations (random flips, rotations, color jitter) are common (Jiang et al., 2022).
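
The patient-level split described above (33 held-out test patients, 129 patients in 5 cross-validation folds) can be sketched as follows. This is not the authors' splitting code: the function name, the synthetic patient IDs, the seed, and the largest-remainder allocation are all assumptions, and the per-class counts are borrowed from the 162-WSI distribution reported elsewhere in this article purely as an example.

```python
import random
from collections import defaultdict


def stratified_split(labels, n_test, n_folds, seed=0):
    """Split patient IDs into a held-out test set and stratified CV folds.

    labels: dict mapping patient ID -> class label.
    Returns (test_ids, folds), where folds is a list of n_folds ID lists.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pid, label in sorted(labels.items()):
        by_class[label].append(pid)

    # Proportional test quota per class, largest-remainder rounding so the
    # quotas sum to exactly n_test.
    total = len(labels)
    quotas = {c: len(p) * n_test / total for c, p in by_class.items()}
    take = {c: int(q) for c, q in quotas.items()}
    leftover = n_test - sum(take.values())
    by_fraction = sorted(quotas, key=lambda c: quotas[c] - take[c], reverse=True)
    for c in by_fraction[:leftover]:
        take[c] += 1

    test_ids, folds, slot = [], [[] for _ in range(n_folds)], 0
    for label in sorted(by_class):
        pids = by_class[label]
        rng.shuffle(pids)
        test_ids += pids[:take[label]]
        # Deal the remaining patients round-robin so folds stay balanced.
        for pid in pids[take[label]:]:
            folds[slot % n_folds].append(pid)
            slot += 1
    return test_ids, folds


# Synthetic patient IDs with illustrative class counts (sum = 162).
counts = {0: 97, 1: 14, 2: 26, 3: 25}
labels = {f"P{c}_{i:03d}": c for c, n in counts.items() for i in range(n)}

test_ids, folds = stratified_split(labels, n_test=33, n_folds=5)
print(len(test_ids), [len(f) for f in folds])
```

Splitting on patient IDs rather than on patches keeps all patches from one patient on the same side of the split, which is what prevents leakage between training and evaluation.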

Table 1 summarizes primary dataset elements:

Element	Reported Value	Source
# Patients	Total: 234 (full), Subset: 162 (RAA-MIL)	(Jain et al., 11 Jun 2025, Mukherjee et al., 15 Nov 2025)
# Slides (WSI)	368 (PAP: 184, MGG: 184)	(Jain et al., 11 Jun 2025)
Patch size (raw)	2048 × 2048 px	(Mukherjee et al., 15 Nov 2025)
Nucleus annotations	39,246 nuclei (instance masks and classes)	(Jain et al., 11 Jun 2025)

4. Diagnostic Classes, Distributions, and Demographic Metadata

The dataset encodes multi-class diagnostic and cytological information:

Categorical labels: Four patient-level categories (Healthy, Benign, OPMD, OSCC) for classification; four nucleus-level categories (I–IV: normal, reactive, dysplasia, malignant) for segmentation/classification.
Class distributions: For cell-level labels, $P(\textrm{Cat I}) \approx 51\%$ and $P(\textrm{Cat IV}) \approx 18\%$. For patient-level diagnostic groups within the segmentation subset (162 WSIs), the counts are 97 Cat I, 14 Cat II, 26 Cat III, and 25 Cat IV (Jain et al., 11 Jun 2025). In the RAA-MIL test set, Benign cases make up 9% (3/33 patients); the remaining test-set class counts are not detailed.
Demographics: Age (18–70 yrs, mean 44, median 45) and gender are tracked, along with clinical metadata (lesion site, risk factors). Specific age/sex breakdown for the RAA-MIL subset: not reported.
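
The class proportions quoted above follow directly from the reported counts, and the same counts can be used to derive class weights for an imbalanced loss. The sketch below checks the nucleus-level fractions and computes "balanced" inverse-frequency weights from the 162-WSI distribution. Inverse-frequency weighting is a common general-purpose choice, not something the source reports using, and the function name is illustrative.

```python
# Counts quoted in this article.
NUCLEI_TOTAL = 39_246
NUCLEI_COUNTS = {"Cat I": 19_888, "Cat IV": 7_161}   # Cat II/III counts not given
WSI_COUNTS = {"Cat I": 97, "Cat II": 14, "Cat III": 26, "Cat IV": 25}

# Nucleus-level class fractions.
nuclei_fraction = {c: n / NUCLEI_TOTAL for c, n in NUCLEI_COUNTS.items()}


def inverse_frequency_weights(counts):
    """Balanced weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    return {c: total / (len(counts) * n) for c, n in counts.items()}


weights = inverse_frequency_weights(WSI_COUNTS)

for c, p in nuclei_fraction.items():
    print(f"P({c}) = {p:.3f}")
for c, w in weights.items():
    print(f"weight[{c}] = {w:.3f}")
```

With these counts the rarest group (14 WSIs) receives a weight about seven times that of the most common group (97 WSIs), which gives a sense of how strongly the imbalance would need to be compensated.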

5. Benchmarking Tasks, Algorithms, and Performance Metrics

The Oral Cytology Dataset supports a hierarchy of computational pathology tasks:

Nucleus instance segmentation: Models such as U-Net, U-Net++, HoverNet, and StarDist are benchmarked using pixel-level Dice, Aggregated Jaccard Index (AJI), Panoptic Quality (PQ), and class-wise F-scores. HoverNet achieves the best panoptic quality ($\mathrm{PQ} = 0.542$), but all models perform worse on the rare (III/IV) classes, indicating class imbalance challenges (Jain et al., 11 Jun 2025).
Patch/slide-level diagnostic classification: RAA-MIL introduces the first weakly supervised patient-level benchmark (one bag of patches per WSI, one label per patient). The Region-Affinity Attention MIL (RAA-MIL) model achieves an average accuracy of 72.7% and a weighted F1-score of 0.69, outperforming baseline MIL architectures on the held-out set (Mukherjee et al., 15 Nov 2025).
Cell classification: Assigning each nucleus to its cytological category (I–IV) using downstream classifiers.
Anomaly and out-of-distribution detection: Cross-center evaluation (geographic domain shift) is supported via explicit partitioning of centers into train/validation/test sets (Jain et al., 11 Jun 2025).
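
For reference, two of the metrics named above can be written down compactly. The sketch below uses the standard definitions: Panoptic Quality as the summed IoU of matched instance pairs divided by TP + ½FP + ½FN (a pair counts as a match when its IoU exceeds 0.5), and weighted F1 as the support-weighted mean of per-class F1 scores. The function names and input conventions are assumptions, and the cited papers' exact evaluation protocols may differ in detail.

```python
def panoptic_quality(pair_ious, n_pred, n_gt, threshold=0.5):
    """PQ from candidate prediction/ground-truth pair IoUs.

    pair_ious: IoU of each candidate pair, with every instance appearing in
               at most one pair; pairs at or below the threshold are unmatched.
    n_pred, n_gt: total number of predicted and ground-truth instances.
    """
    tp_ious = [iou for iou in pair_ious if iou > threshold]
    tp = len(tp_ious)
    fp = n_pred - tp          # predictions without a match
    fn = n_gt - tp            # ground-truth instances without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom else 0.0


def weighted_f1(confusion):
    """Support-weighted F1; confusion[i][j] = true class i predicted as j."""
    total = sum(map(sum, confusion))
    score = 0.0
    for c in range(len(confusion)):
        tp = confusion[c][c]
        support = sum(confusion[c])
        predicted = sum(row[c] for row in confusion)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        both = precision + recall
        f1 = 2 * precision * recall / both if both else 0.0
        score += f1 * support / total
    return score


# Toy inputs: two matched nuclei plus one spurious prediction, and a
# two-class confusion matrix with one misclassified sample.
pq = panoptic_quality([0.8, 0.6], n_pred=3, n_gt=2)
f1 = weighted_f1([[1, 1], [0, 2]])
print(pq, f1)
```

Because weighted F1 scales each class by its support, strong performance on the majority class can mask weak performance on rare classes, so per-class scores remain worth reporting alongside it.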

6. Public Access, Code Support, and Processing Pipelines

The dataset and companion resources are released under a research/academic license:

Repository and DOI: Oral Cytology Dataset (Jain et al., 11 Jun 2025) at https://arxiv.org/abs/2506.09661, DOI: 10.48550/arXiv.2506.09661
Annotation files: JSON or XML-based region/nucleus metadata, directly compatible with OpenSlide and QuPath.
Pipeline code: Python and Jupyter notebook samples are provided for slide loading, patch extraction, and mask conversion.
Illustrative extraction snippet:

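import json

import numpy as np
import openslide
from PIL import Image

# Open the slide; the dataset ships MRXS files, which OpenSlide can read.
slide = openslide.OpenSlide("/path/to/slide.mrxs")
# Patch list: assumed to be a JSON array of {"x", "y", "width", "height"}
# records giving level-0 pixel coordinates of each selected region.
with open("/path/to/slide_patches.json") as f:
    patch_list = json.load(f)
# read_region returns an RGBA PIL image; drop alpha, then resize to 224 x 224.
patches = [
    np.array(
        slide.read_region((info["x"], info["y"]), 0, (info["width"], info["height"]))
        .convert("RGB")
        .resize((224, 224), Image.BILINEAR)
    )
    for info in patch_list
]
slide.close()
bag = np.stack(patches)  # shape [P, 224, 224, 3]: one bag of P patches per WSI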

A plausible implication is that adoption of these standardized pipelines will facilitate reproducible MIL-based oral cytology research across laboratories and model architectures.
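
To make the MIL setting concrete, the sketch below shows generic attention-based pooling, in which a bag of per-patch embeddings is reduced to a single slide-level vector by a learned softmax weighting. This is not the RAA-MIL architecture, whose region-affinity mechanism is not described here; the function name, toy dimensions, and random parameters are all assumptions, and plain Python lists are used only to keep the example dependency-free (a real pipeline would use a tensor library and trained weights).

```python
import math
import random


def attention_pool(bag, V, w):
    """Pool a bag of instance embeddings into one vector.

    bag: list of P embeddings, each a list of D floats.
    V:   L x D projection matrix; w: length-L scoring vector.
    Instance score: w . tanh(V h); weights: softmax of scores over the bag.
    """
    scores = []
    for h in bag:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(a * z for a, z in zip(w, hidden)))
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(bag[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, bag)) for d in range(dim)]
    return pooled, weights


# Toy bag: 5 patches with 8-dimensional embeddings and random parameters.
rng = random.Random(0)
P, D, L = 5, 8, 4
toy_bag = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(P)]
V = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(L)]
w = [rng.gauss(0, 0.1) for _ in range(L)]

pooled, weights = attention_pool(toy_bag, V, w)
print([round(a, 3) for a in weights])
```

The pooled vector would then be passed to a classifier trained against the patient-level label, and the attention weights indicate which patches contributed most to the prediction.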

7. Position in the Broader Cytology Dataset Landscape

Relative to pre-existing oral cytology datasets, the Oral Cytology Dataset sets a new standard in scale, annotation depth, and multi-institutional variation:

Existing resources: The only other extensively annotated and public dataset is Oral 2021 by Matias et al., with 1,934 images and 4,287 nuclei labeled for segmentation/classification (Jiang et al., 2022). This contrasts with the much larger scale and broader spectrum of tasks supported by the Oral Cytology Dataset.
Public vs. restricted access: Full MRXS images from Jain et al. are public for research; other real-world datasets (e.g., those used by MIDA-group (Koriakina et al., 2022, Acerbis et al., 9 Apr 2025)) may require direct author contact and data-use agreements.
Modalities: Extensions such as the multimodal dataset in Lian et al. (Lian et al., 2024) incorporate autofluorescence imaging, enhancing weakly supervised learning pipelines.
Benchmarking best practices: Standardized train/validation/test partitioning, rigorous annotation review, and cross-center generalization assessment are consistently recommended (Jain et al., 11 Jun 2025, Mukherjee et al., 15 Nov 2025, Jiang et al., 2022).

In summary, the Oral Cytology Dataset and derivatives provide a comprehensive, multiclass, multi-granularity standard for computational oral cytology benchmarking. Their rigorous annotation, standardized curation, and open access are advancing the development and evaluation of robust, generalizable AI methods for early OSCC screening and diagnosis, enabling multi-scale learning from nuclei to slide and population levels.