OrganCMNIST: CT Scan Dataset for Optimal Transport

Updated 25 September 2025

OrganCMNIST is a medical imaging dataset of 2,000 CT scans represented as discrete probability measures for optimal transport analysis.
The dataset is used to benchmark efficient Nyström approximation methods that reduce computation by sampling only 10–20% of the distance matrix columns.
Empirical results indicate stable classification accuracy and reliable low-dimensional embeddings, underscoring its significance in scalable medical image analysis.

OrganCMNIST is a medical imaging dataset consisting of organ CT scans, originating from the MedMNIST benchmark. It is specifically referenced as a target for robust manifold learning and classification via Wasserstein distance matrices, with a focus on computational efficiency using partial matrix estimations. The dataset comprises 2,000 images, each representing a discrete probability measure, and is employed in research contexts requiring optimal transport metrics, multidimensional scaling (MDS), and related dimensionality reduction and classification workflows.

1. Dataset Characteristics and Construction

OrganCMNIST consists of 2,000 CT scan images of organs. Each image is treated as a discrete probability measure—suitable for applications using the quadratic Wasserstein metric, defined by

$W_2^2(\mu, \nu) = \min_{P \in \Gamma(\mu, \nu)} \langle C, P \rangle,$

where $C_{ij} = \|x_i - y_j\|^2$ is the cost matrix between the support points $x_i$ of $\mu$ and $y_j$ of $\nu$, and $P$ ranges over $\Gamma(\mu, \nu)$, the couplings whose row and column sums match the weights of $\mu$ and $\nu$. The dataset is part of the MedMNIST benchmark and is used primarily as an evaluation ground for algorithms that estimate Wasserstein distance matrices (Rana et al., 23 Sep 2025).
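As a minimal sketch of how an image can be read as a discrete probability measure, the snippet below normalizes pixel intensities into weights, uses pixel coordinates as support points, and builds the squared-distance cost matrix $C$. The 28x28 random arrays are stand-ins for real scans, and the helper names `image_to_measure` and `squared_cost` are hypothetical. The snippet does not solve the transport problem; it only evaluates $\langle C, P \rangle$ at the independent coupling $P = ab^T$, which is feasible and therefore gives an upper bound on $W_2^2$.

```python
import numpy as np

def image_to_measure(img):
    """Turn a 2-D nonnegative image into (support points, weights)."""
    img = np.asarray(img, dtype=float)
    rows, cols = np.nonzero(img > 0)          # keep pixels with positive mass
    points = np.stack([rows, cols], axis=1).astype(float)
    mass = img[rows, cols]
    return points, mass / mass.sum()          # weights sum to 1

def squared_cost(x, y):
    """C[i, j] = ||x_i - y_j||^2 for supports x (n, 2) and y (m, 2)."""
    diff = x[:, None, :] - y[None, :, :]
    return (diff ** 2).sum(axis=-1)

rng = np.random.default_rng(0)
img_a = rng.random((28, 28))   # assumption: random stand-ins for two scans
img_b = rng.random((28, 28))

x, a = image_to_measure(img_a)
y, b = image_to_measure(img_b)
C = squared_cost(x, y)

# The independent coupling has the right marginals, so <C, P> >= W_2^2.
P_indep = np.outer(a, b)
upper_bound = float((C * P_indep).sum())
print(f"upper bound on W_2^2: {upper_bound:.3f}")
```

An exact value would require a linear-programming or entropic optimal transport solver in place of the independent coupling.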

2. Wasserstein Distance Matrix Formulation

Given $N$ data measures from OrganCMNIST, one constructs a squared Wasserstein distance matrix $D$ with entries

$D_{ij} = W_2(\mu_i, \mu_j)^2.$

Computing the full $D$ is expensive: each of its $N(N-1)/2$ distinct off-diagonal entries requires solving an optimal transport problem. Because every image is a high-dimensional probability measure, each pairwise solve is itself costly, which motivates matrix approximation and sampling methods for scalable analysis.
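To make the cost concrete, the following back-of-the-envelope sketch counts how many optimal transport solves are needed for the full matrix versus a column-sampled one, assuming $D$ is symmetric with a zero diagonal so that each unordered pair is solved once. The function names are illustrative, and the count ignores the per-solve cost.

```python
def pairwise_solves(n):
    """Distinct unordered pairs among n measures."""
    return n * (n - 1) // 2

def sampled_solves(n, c):
    """Pairs needed to fill c full columns of a symmetric, zero-diagonal matrix."""
    # pairs inside the sampled set + pairs between sampled and unsampled items
    return c * (c - 1) // 2 + c * (n - c)

N = 2000
full = pairwise_solves(N)
for rate in (0.10, 0.20):
    c = round(rate * N)
    part = sampled_solves(N, c)
    print(f"{rate:.0%} of columns: {part} of {full} solves ({part / full:.1%})")
```

Under these assumptions, sampling 10% of the columns needs 379,900 of the 1,999,000 pairwise solves, and 20% needs 719,800.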

3. Efficient Matrix Estimation: Nyström Approach

The Nyström method is utilized to approximate $D$ from a small subset of its columns, dramatically reducing computational requirements. Specifically, for a symmetric distance matrix $D$ partitioned as

$D = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},$

with $A \in \mathbb{R}^{m \times m}$ for $m \ll N$ and $B \in \mathbb{R}^{m \times (N-m)}$, the assumption of (near) low-rank structure means the columns of $B$ lie (approximately) in the column space of $A$. Using the Moore–Penrose pseudoinverse $A^\dagger$, $C$ can be estimated by

$C \approx B^T A^\dagger B.$

In practice, $c$ columns are selected at random (index set $I \subset \{1, \dots, N\}$) and the estimate

$D_{\text{est}} = C_{\text{obs}} \, U^\dagger \, C_{\text{obs}}^T$

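is computed, where $C_{\text{obs}} = D(:,I)$ and $U = D(I,I)$, with further adjustments to enforce symmetry and a zero diagonal (Rana et al., 23 Sep 2025). Here $C_{\text{obs}}$ denotes the observed columns and is distinct from the block $C$ in the partition above.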
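The column-sampling estimate can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the test matrix is a synthetic squared Euclidean distance matrix (exactly low rank, with rank at most $d+2$), whereas real Wasserstein matrices are only approximately low rank. The function name `nystrom_estimate` and the cutoff `rcond=1e-10` are choices made for this example.

```python
import numpy as np

def nystrom_estimate(C_obs, idx, rcond=1e-10):
    """Estimate D from its sampled columns C_obs = D[:, idx]."""
    U = C_obs[idx, :]                          # D(I, I), shape (c, c)
    # U is rank deficient here, so tiny singular values must be cut off.
    U_pinv = np.linalg.pinv(U, rcond=rcond)
    D_est = C_obs @ U_pinv @ C_obs.T
    D_est = 0.5 * (D_est + D_est.T)            # enforce symmetry
    np.fill_diagonal(D_est, 0.0)               # enforce zero diagonal
    return D_est

# Synthetic stand-in: squared distances between N points in R^d.
rng = np.random.default_rng(0)
N, d, c = 60, 3, 12
X = rng.normal(size=(N, d))
sq = (X ** 2).sum(axis=1)
D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
np.fill_diagonal(D, 0.0)

idx = rng.choice(N, size=c, replace=False)     # sampled column indices I
C_obs = D[:, idx]                              # only these columns are "computed"
D_est = nystrom_estimate(C_obs, idx)

rel_err = np.linalg.norm(D_est - D) / np.linalg.norm(D)
print(f"relative Frobenius error: {rel_err:.2e}")
```

On this exactly low-rank example the reconstruction is accurate to numerical precision; on Wasserstein matrices the error depends on how close $D$ is to low rank and on the sampling rate.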

4. Dimensionality Reduction and Classification Workflow

Once $D_{\text{est}}$ is acquired, multidimensional scaling (MDS) is performed by centering the estimated matrix:

$B = -\frac{1}{2} H D_{\text{est}} H \quad \text{with} \quad H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$

A truncated singular value decomposition of $B$ then yields low-dimensional embeddings. These are supplied as features to classifiers including linear discriminant analysis (LDA), k-nearest neighbors with $k = 1$ (KNN), support vector machine (SVM), and random forest.
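The centering and truncation steps can be sketched as classical MDS in NumPy. The example below is a hedged illustration on synthetic planar points rather than on an estimated Wasserstein matrix; the function name `classical_mds` is hypothetical. With an estimated matrix, the centered matrix may have small negative eigenvalues, which the SVD silently treats by magnitude.

```python
import numpy as np

def classical_mds(D_sq, dim):
    """Embed N items in R^dim from an (N, N) matrix of squared distances."""
    n = D_sq.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * H @ D_sq @ H                    # centered Gram-like matrix
    U, s, _ = np.linalg.svd(B)                 # singular values in descending order
    return U[:, :dim] * np.sqrt(s[:dim])       # rows are embedded points

# Synthetic check: squared distances between points in the plane.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
sq = (X ** 2).sum(axis=1)
D_sq = sq[:, None] + sq[None, :] - 2.0 * X @ X.T

Y = classical_mds(D_sq, dim=2)

# The embedding should reproduce the input squared distances.
sq_y = (Y ** 2).sum(axis=1)
D_rec = sq_y[:, None] + sq_y[None, :] - 2.0 * Y @ Y.T
print("max abs deviation:", float(np.abs(D_rec - D_sq).max()))
```

The rows of `Y` would then serve as feature vectors for whichever classifier is used downstream.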

Empirical results on OrganCMNIST show that the Nyström method maintains stable classification accuracy when only 10–20% of the matrix columns are computed. For example, at a 10% sampling rate (∼103 columns), the relative error of the Nyström approximation is roughly $2.85 \times 10^{-2}$, about half the error obtained via matrix completion ($5.65 \times 10^{-2}$). The experimental accuracy curves level off in this 10–20% range regardless of classifier type (Rana et al., 23 Sep 2025).

5. Theoretical Analysis: Stability and Guarantees

The stability of low-dimensional embeddings derived from Nyström-approximated Wasserstein matrices is formalized in Theorem 3.1 of Rana et al. (23 Sep 2025). If $m \gtrsim \mathcal{O}((d+2)\log(d+2))$ columns are selected, where $d$ is the target embedding dimension, the MDS embedding $Z$ is provably close to the true embedding $Y$, up to orthogonal transformation. Specifically, under bounded noise $E$, the Procrustes error is bounded by

$\min_{Q \in O(d)} \|Z - QY\|_2 \leq (1+\sqrt{2}) \|Y^\dagger\|_2 \|E\|_2 \left( \frac{5N}{(d+2)(1-\delta)} + 3\sqrt{ \frac{N}{(d+2)(1-\delta)} + 2 } \right)$
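The left-hand side of this bound can be estimated numerically with the orthogonal Procrustes solution. The sketch below assumes $Z$ and $Y$ are stored as $d \times N$ arrays, matching the product $QY$ with $Q \in O(d)$. It uses the SVD-based alignment, which is optimal in the Frobenius norm; the spectral-norm error at that alignment is therefore an upper bound on the minimum in the theorem, not necessarily the minimum itself. The function name `procrustes_error` is illustrative.

```python
import numpy as np

def procrustes_error(Z, Y):
    """Align Y to Z with an orthogonal Q and return (spectral error, Q).

    Z, Y: arrays of shape (d, N). Q is the Frobenius-optimal alignment.
    """
    U, _, Vt = np.linalg.svd(Z @ Y.T)
    Q = U @ Vt
    return float(np.linalg.norm(Z - Q @ Y, 2)), Q

rng = np.random.default_rng(2)
d, N = 2, 50
Y = rng.normal(size=(d, N))

theta = 0.7                                    # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Exactly rotated copy: the aligned error should vanish.
err_clean, Q_clean = procrustes_error(R @ Y, Y)

# Rotated copy with a small perturbation E.
E = 1e-3 * rng.normal(size=(d, N))
err_noisy, _ = procrustes_error(R @ Y + E, Y)

print(f"clean: {err_clean:.2e}, noisy: {err_noisy:.2e}")
```

In the noisy case the reported error cannot exceed the Frobenius norm of `E`, since the true rotation is itself a feasible alignment.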

This establishes that reliable embeddings—and thus reliable downstream classification—can be obtained from highly incomplete matrix measurements, provided a sufficient number of columns are sampled.

6. Comparative Performance and Computational Implications

Comparisons between the Nyström method and matrix completion on OrganCMNIST show that Nyström achieves lower approximation error and stable classification accuracy with less computational overhead. In particular, Table 1 of Rana et al. (23 Sep 2025) reports that, at a fixed sampling rate, the Nyström method consistently yields lower relative error in the reconstructed Wasserstein matrix than matrix completion.

A plausible implication is that for large-scale medical imaging datasets where computational resources are constrained, the Nyström approach to Wasserstein matrix estimation enables robust and scalable manifold learning and classification procedures that were previously impractical.

7. Significance in Large-Scale Optimal Transport

The deployment of OrganCMNIST within studies examining Wasserstein matrix estimation methods provides a reference use case demonstrating that optimal transport-based analysis is tractable on large image datasets. By enabling the computation of faithful embeddings and stable classification results with only a small fraction of required distance matrix entries, methodologies validated on OrganCMNIST constitute a practical advancement in medical image analysis and general manifold learning—especially under sampling- and computation-limited regimes.

This approach makes advanced tools from optimal transport practical for high-dimensional settings found in medical imaging repositories such as MedMNIST. The robust stability of embeddings and classification performance, even with only $O((d+2)\log(d+2))$ sampled columns, indicates potential for broader application in datasets with similar structure and requirements.


References

Recovering Wasserstein Distance Matrices from Few Measurements (2025)
