OrganCMNIST: CT Scan Dataset for Optimal Transport

Updated 25 September 2025
  • OrganCMNIST is a medical imaging dataset of 2,000 CT scans represented as discrete probability measures for optimal transport analysis.
  • The dataset is used to benchmark efficient Nyström approximation methods that reduce computation by sampling only 10–20% of the distance matrix columns.
  • Empirical results indicate stable classification accuracy and reliable low-dimensional embeddings, underscoring its significance in scalable medical image analysis.

OrganCMNIST is a medical imaging dataset consisting of organ CT scans, originating from the MedMNIST benchmark. It is specifically referenced as a target for robust manifold learning and classification via Wasserstein distance matrices, with a focus on computational efficiency using partial matrix estimations. The dataset comprises 2,000 images, each representing a discrete probability measure, and is employed in research contexts requiring optimal transport metrics, multidimensional scaling (MDS), and related dimensionality reduction and classification workflows.

1. Dataset Characteristics and Construction

OrganCMNIST consists of 2,000 CT scan images of organs. Each image is treated as a discrete probability measure—suitable for applications using the quadratic Wasserstein metric, defined by

$$W_2(\mu, \nu) = \min_{P \in \Gamma(\mu, \nu)} \langle C, P \rangle,$$

where $C_{ij} = \|x_i - y_j\|^2$ is the cost matrix and $P$ is a coupling respecting image marginals. The dataset is incorporated as part of the MedMNIST benchmark and is primarily referenced as an evaluation ground for algorithms estimating Wasserstein distance matrices (Rana et al., 23 Sep 2025).
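
For concreteness, the sketch below shows one way to realize this setup in Python, assuming 2-D grayscale arrays and the POT library (`ot.dist`, `ot.emd2`) for the exact transport cost; the helper names and the intensity normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def image_to_measure(img):
    """Turn a grayscale image into a discrete probability measure on its pixel grid."""
    h, w = img.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    support = grid.reshape(-1, 2).astype(float)   # pixel coordinates x_i
    weights = img.reshape(-1).astype(float)
    weights /= weights.sum()                      # normalize intensities to a probability vector
    return weights, support

def w2_squared(img_a, img_b):
    """Exact OT cost <C, P*> with squared-Euclidean ground cost between two images-as-measures."""
    a, xa = image_to_measure(img_a)
    b, xb = image_to_measure(img_b)
    C = ot.dist(xa, xb, metric="sqeuclidean")     # cost matrix C_ij = ||x_i - y_j||^2
    return ot.emd2(a, b, C)                       # solves min_{P in Gamma(a,b)} <C, P>
```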

2. Wasserstein Distance Matrix Formulation

Given $N$ data measures from OrganCMNIST, one constructs a squared Wasserstein distance matrix $D$ with entries

$$D_{ij} = W_2(\mu_i, \mu_j)^2.$$

Computation of the full $D$ is computationally expensive due to the quadratic optimal transport problem. The dataset's structure—each image as a high-dimensional probability measure—exacerbates the cost of pairwise distance computation, motivating research into matrix approximation and sampling methods to allow scalable analysis.
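
A direct construction makes the quadratic cost explicit: every pair of the $N$ images requires one exact OT solve. The loop below is illustrative only, reusing the hypothetical `w2_squared` helper from the previous sketch.

```python
import numpy as np

def full_distance_matrix(images):
    """All-pairs matrix D with D[i, j] = exact OT cost between images i and j (O(N^2) solves)."""
    N = len(images)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            D[i, j] = D[j, i] = w2_squared(images[i], images[j])
    return D
```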

3. Efficient Matrix Estimation: Nyström Approach

The Nyström method is utilized to approximate $D$ from a small subset of its columns, dramatically reducing computational requirements. Specifically, for a symmetric distance matrix $D$ partitioned as

$$D = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},$$

with $A \in \mathbb{R}^{m \times m}$ for $m \ll n$, and $B \in \mathbb{R}^{m \times (n-m)}$, the assumption of (near) low-rank structure allows expressing $B$ in $A$'s column space. Using the Moore–Penrose pseudoinverse $A^\dagger$, $C$ can be estimated by

$$C \approx B^T A^\dagger B.$$

Practically, $c$ random columns are selected (set $I \subset \{1, \dots, N\}$) and the estimate

$$D_{\text{est}} = C_{\text{obs}} \, U^\dagger \, C_{\text{obs}}^T$$

is computed, where $C_{\text{obs}} = D(:, I)$ and $U = D(I, I)$, with further adjustments to enforce symmetry and zero diagonals (Rana et al., 23 Sep 2025).
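
A minimal sketch of this estimator follows, again assuming the hypothetical `w2_squared` helper and uniform random column sampling; the paper's exact sampling scheme and post-processing may differ.

```python
import numpy as np

def nystrom_estimate(images, c, seed=0):
    """Estimate D from c exactly computed columns, then symmetrize and zero the diagonal."""
    rng = np.random.default_rng(seed)
    N = len(images)
    I = rng.choice(N, size=c, replace=False)                  # sampled column index set
    C_obs = np.array([[w2_squared(images[i], images[j]) for j in I]
                      for i in range(N)])                     # C_obs = D(:, I), shape (N, c)
    U = C_obs[I, :]                                           # U = D(I, I), shape (c, c)
    D_est = C_obs @ np.linalg.pinv(U) @ C_obs.T               # D_est = C_obs U^+ C_obs^T
    D_est = 0.5 * (D_est + D_est.T)                           # enforce symmetry
    np.fill_diagonal(D_est, 0.0)                              # enforce zero diagonal
    return D_est, I
```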

4. Dimensionality Reduction and Classification Workflow

Once $D_{\text{est}}$ is acquired, multidimensional scaling (MDS) is performed by centering the estimated matrix:

$$B = -\frac{1}{2} H D_{\text{est}} H \quad \text{with} \quad H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$$

Subsequent truncated singular value decomposition yields low-dimensional embeddings. These are supplied as features to classifiers including linear discriminant analysis (LDA), 1-nearest neighbor (KNN), support vector machine (SVM), and random forest.
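A compact sketch of this pipeline is given below, using a symmetric eigendecomposition of the centered matrix in place of the truncated SVD and scikit-learn classifiers; the embedding dimension, classifier hyperparameters, and train/test split are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def mds_embedding(D_est, d):
    """Classical MDS: double-center the squared-distance matrix and keep the top-d components."""
    N = D_est.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N                 # centering matrix
    B = -0.5 * H @ D_est @ H
    vals, vecs = np.linalg.eigh(B)                      # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]                    # indices of the d largest
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# Hypothetical usage with labels y and index arrays train_idx / test_idx:
# Z = mds_embedding(D_est, d=10)
# for clf in (LinearDiscriminantAnalysis(), KNeighborsClassifier(n_neighbors=1),
#             SVC(), RandomForestClassifier()):
#     clf.fit(Z[train_idx], y[train_idx])
#     print(type(clf).__name__, clf.score(Z[test_idx], y[test_idx]))
```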

Empirical results on OrganCMNIST demonstrate that the Nyström method maintains stable classification accuracy when only 10–20% of the matrix columns are computed. For example, at a 10% sampling rate (∼103 columns), the relative error in the Nyström approximation is roughly $2.85 \times 10^{-2}$, which is significantly lower than that obtained via matrix completion ($5.65 \times 10^{-2}$). Classification accuracy stabilizes after 10–20% of the columns are used, regardless of classifier type, as shown by experimental curves (Rana et al., 23 Sep 2025).

5. Theoretical Analysis: Stability and Guarantees

The stability of low-dimensional embeddings derived from Nyström-approximated Wasserstein matrices is formalized in Theorem 3.1 of (Rana et al., 23 Sep 2025). If $m \gtrsim \mathcal{O}((d+2)\log(d+2))$ columns are selected, where $d$ is the target embedding dimension, the MDS embedding $Z$ is provably close to the true embedding $Y$, up to orthogonal transformation. Specifically, under bounded noise $E$, the Procrustes error is bounded by

$$\min_{Q \in O(d)} \|Z - QY\|_2 \leq (1+\sqrt{2}) \|Y^\dagger\|_2 \|E\|_2 \left( \frac{5N}{(d+2)(1-\delta)} + 3\sqrt{\frac{N}{(d+2)(1-\delta)} + 2} \right)$$

This establishes that reliable embeddings—and thus reliable downstream classification—can be obtained from highly incomplete matrix measurements, provided a sufficient number of columns are sampled.
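
As a practical sanity check of this statement, one can compare the embedding obtained from the Nyström estimate against the embedding from a fully computed matrix after the best orthogonal alignment. The sketch below uses SciPy's `orthogonal_procrustes` and reports a Frobenius-norm variant of the alignment error rather than the spectral norm appearing in the bound.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_error(Z, Y):
    """Minimum over orthogonal Q of ||Z Q - Y||_F between two N x d embeddings."""
    Q, _ = orthogonal_procrustes(Z, Y)   # best rotation/reflection mapping Z onto Y
    return np.linalg.norm(Z @ Q - Y)

# e.g. procrustes_error(mds_embedding(D_est, d=10), mds_embedding(D_full, d=10))
```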

6. Comparative Performance and Computational Implications

Performance comparisons between the Nyström method and matrix completion demonstrate that Nyström achieves lower approximation errors and stable classification accuracy for the OrganCMNIST dataset, with less computational overhead. Specifically, empirical results in Table 1 of (Rana et al., 23 Sep 2025) show that, for a fixed sampling rate, the Nyström method consistently produces lower relative errors in reconstructed Wasserstein matrices than matrix completion.

A plausible implication is that for large-scale medical imaging datasets where computational resources are constrained, the Nyström approach to Wasserstein matrix estimation enables robust and scalable manifold learning and classification procedures that were previously impractical.

7. Significance in Large-Scale Optimal Transport

The deployment of OrganCMNIST within studies examining Wasserstein matrix estimation methods provides a reference use case demonstrating that optimal transport-based analysis is tractable on large image datasets. By enabling the computation of faithful embeddings and stable classification results with only a small fraction of required distance matrix entries, methodologies validated on OrganCMNIST constitute a practical advancement in medical image analysis and general manifold learning—especially under sampling- and computation-limited regimes.

This approach makes advanced tools from optimal transport practical for high-dimensional settings found in medical imaging repositories such as MedMNIST. The robust stability of embeddings and classification performance, even with only $O((d+2)\log(d+2))$ sampled columns, indicates potential for broader application in datasets with similar structure and requirements.
