Class-Wise Cluster Assignments
- Class-wise cluster assignments are the mapping of entities to clusters that preserves distinct class characteristics and statistical properties.
- They utilize hard, soft, and hierarchical methods—including spectral clustering, EM, and contrastive learning—to optimize assignment fidelity.
- These assignments support applications in astroinformatics, high-dimensional analysis, and weakly supervised tasks, enhancing interpretability and model accuracy.
Class-wise cluster assignments refer to the allocation of observed or latent entities to categories (classes) or clusters in a manner that preserves, represents, or exploits per-class statistical or structural properties. This concept appears across multiple domains—ranging from astroinformatics and unsupervised deep learning to weak supervision, model-based clustering, and information-theoretic coding—whenever the clustering process or its analysis is conditioned on, stratified by, or used to uncover discrete classes. As such, class-wise cluster assignments support nuanced inference on structure, improve interpretability, enable robust evaluation, and facilitate downstream modeling by explicitly handling the mapping of instances to class-like partitions.
1. Formalization and Key Definitions
A class-wise cluster assignment is a surjection $\phi: \mathcal{X} \to \{1, \dots, K\}$ from a (possibly labeled) set of entities $\mathcal{X}$ to a finite index set representing classes or clusters. In typical scenarios, the assignment partitions $\mathcal{X}$ into disjoint subsets $\mathcal{X}_1, \dots, \mathcal{X}_K$, with each cluster or class potentially associated with additional label semantics or statistical properties.
The assignment can be one of the following (see the sketch after this list):
- Hard: Each object is assigned exclusively to a single cluster.
- Soft: An object has a vector of assignment probabilities (e.g., output of softmax or similar probabilistic mapping), as in deep and contrastive learning settings (Shen et al., 2021, Chen et al., 2023).
- Hierarchical/conditional: Assignment occurs within or across classes, relevant for hierarchical or class-stratified tasks (Filho et al., 23 Dec 2025).
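A minimal NumPy sketch contrasting these modes on toy data; the `logits` scores and `classes` labels are hypothetical placeholders, not drawn from any cited method:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))   # 6 entities, 3 candidate clusters (toy scores)

# Soft assignment: one probability vector per entity (softmax over scores).
soft = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Hard assignment: each entity belongs to exactly one cluster.
hard = soft.argmax(axis=1)

# Hierarchical/conditional view: group assignments by a known class label;
# class-stratified schemes re-cluster within each class (see Section 2).
classes = np.array([0, 0, 0, 1, 1, 1])
within_class = {c: hard[classes == c] for c in np.unique(classes)}
print(soft.round(2), hard, within_class)
```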
2. Methodological Approaches
Spectral, Likelihood, and Model-Based Methods
In latent class modeling, clustering typically proceeds via a sequence of initialization and refinement. For binary response matrices $R \in \{0,1\}^{N \times J}$, spectral clustering (via SVD or eigendecomposition) embeds each subject in $\mathbb{R}^K$, and a subsequent likelihood-based maximization assigns final class labels:
$$\hat z_i = \arg\max_{k \in [K]} \sum_{j=1}^{J} \left[ R_{ij} \log \hat\theta_{jk} + (1 - R_{ij}) \log(1 - \hat\theta_{jk}) \right],$$
where $\hat\theta_{jk}$ are the estimated item parameters for class $k$ (Lyu et al., 8 Jun 2025).
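A compact sketch of this two-stage recipe, assuming the Bernoulli latent class likelihood written above; it is a plain reading of the spectral-then-refine idea, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_then_likelihood(R, K, eps=1e-6):
    """Two-stage class-wise assignment for a binary response matrix R
    (N subjects x J items) under a latent class model with class-wise
    Bernoulli item parameters theta[j, k]."""
    # Stage 1: spectral embedding via truncated SVD, then k-means init.
    U, s, _ = np.linalg.svd(R.astype(float), full_matrices=False)
    z0 = KMeans(n_clusters=K, n_init=10).fit_predict(U[:, :K] * s[:K])

    # Estimate per-class item parameters from the initial partition.
    theta = np.stack([R[z0 == k].mean(axis=0) for k in range(K)], axis=1)
    theta = np.clip(theta, eps, 1 - eps)                        # J x K

    # Stage 2: refine each subject's label by Bernoulli log-likelihood.
    loglik = R @ np.log(theta) + (1 - R) @ np.log(1 - theta)    # N x K
    return loglik.argmax(axis=1)
```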
In mixture models for high-dimensional data, the Multinomial Cluster-Weighted Model (MCWM) defines responsibilities (posterior cluster membership probabilities)
$$\tau_{ik} = \frac{\pi_k \, f(\mathbf{y}_i \mid \mathbf{x}_i, \boldsymbol\theta_k) \, g(\mathbf{x}_i \mid \boldsymbol\phi_k)}{\sum_{k'=1}^{K} \pi_{k'} \, f(\mathbf{y}_i \mid \mathbf{x}_i, \boldsymbol\theta_{k'}) \, g(\mathbf{x}_i \mid \boldsymbol\phi_{k'})}.$$
With the EM algorithm, hard assignments are obtained by maximizing $\tau_{ik}$ over $k$ for each $i$. Class-wise assignment rates (the proportion of label-$c$ instances falling in cluster $k$) offer further stratification (Olobatuyi et al., 2022).
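The E-step and the hard-assignment rule can be sketched generically; the log-density inputs below stand in for whatever per-cluster model is used (the MCWM plugs in its multinomial response and covariate densities):

```python
import numpy as np

def e_step(log_dens, log_pi):
    """Responsibilities tau[i, k] from per-cluster log-densities (N x K)
    and log mixture weights (K,): a generic EM E-step computed in log
    space for numerical stability."""
    log_tau = log_dens + log_pi                       # broadcast over rows
    log_tau -= log_tau.max(axis=1, keepdims=True)
    tau = np.exp(log_tau)
    return tau / tau.sum(axis=1, keepdims=True)

# Hard class-wise assignment: z_hat = e_step(log_dens, log_pi).argmax(axis=1)
```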
Weakly Supervised and Multiple-Instance Settings
Class-wise cluster recovery under weak supervision is exemplified by unique class count (UCC) methods in multiple-instance learning. If one can perfectly predict, for each bag $\sigma$, the number of distinct classes it contains, then the true per-instance class assignments can be inferred by agglomerating these bag-level constraints. A neural UCC-classifier pipeline produces instance embeddings that, upon clustering, approximate the true per-class cluster partition as well as fully supervised models under certain conditions (Oner et al., 2019).
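A schematic of the final recovery step, assuming a hypothetical `embed` function (e.g., the feature extractor of a trained UCC classifier); the theoretical result concerns when this clustering provably matches the true labels:

```python
import numpy as np
from sklearn.cluster import KMeans

def recover_instance_labels(bags, embed, n_classes):
    """Cluster pooled instance embeddings into n_classes groups; under the
    UCC assumptions (sufficient bag diversity and class coverage), these
    clusters coincide with the true per-instance classes."""
    X = np.concatenate([embed(bag) for bag in bags], axis=0)
    return KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
```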
Deep and Contrastive Learning-Based Class-Cluster Assignments
Self-supervised representation learning frameworks such as SwAV and Twin-Contrast Clustering (TCC) encode class-wise cluster structure via prototype-based assignments and contrastive losses (Caron et al., 2020, Shen et al., 2021):
- Prototype or anchor-based soft assignment (SwAV): if $\mathbf{z}$ is a feature vector and $\mathbf{c}_k$ a prototype,
$$p_k = \frac{\exp(\mathbf{z}^\top \mathbf{c}_k / \tau)}{\sum_{k'} \exp(\mathbf{z}^\top \mathbf{c}_{k'} / \tau)},$$
with balancing via optimal transport constraints to ensure equal cluster usage.
- Categorical assignment confidence (TCC): an assignment confidence $q_{ik}$ for each instance $i$ and cluster $k$, computed via a softmax over cluster-level similarities.
Class-wise cluster consistency and assignment regularity are enforced through contrastive objectives at both the instance and cluster levels.
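A sketch of the balanced-assignment step in the SwAV style; the temperature `eps` and iteration count are illustrative defaults, not the paper's settings:

```python
import numpy as np

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Turn feature-prototype scores (N x K) into balanced soft
    assignments: alternating row/column normalization pushes the
    assignment matrix toward equal cluster usage while keeping one
    unit of probability mass per sample."""
    Q = np.exp(scores / eps).T        # K x N
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True); Q /= K    # equalize cluster usage
        Q /= Q.sum(axis=0, keepdims=True); Q /= N    # one unit per sample
    return (Q * N).T                  # N x K: rows are soft assignments

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16)); z /= np.linalg.norm(z, axis=1, keepdims=True)
c = rng.normal(size=(4, 16)); c /= np.linalg.norm(c, axis=1, keepdims=True)
q = sinkhorn(z @ c.T)                # rows sum to 1, columns roughly balanced
```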
In hierarchical approaches, class-wise K-means on autoencoder bottleneck embeddings produces within-class clusters (pseudo-labels), which are then used for hierarchical classification tasks such as fine-grained categorization (Filho et al., 23 Dec 2025).
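The within-class clustering step can be sketched as follows, assuming embeddings `Z` from an autoencoder bottleneck and ground-truth class labels `y`; this is a generic rendering of per-class K-means, not the FGDCC training loop:

```python
import numpy as np
from sklearn.cluster import KMeans

def classwise_kmeans(Z, y, n_sub):
    """Run K-means independently within each class on embeddings Z,
    yielding hierarchical pseudo-labels (class, within-class cluster)."""
    sub = np.zeros(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = min(n_sub, len(idx))       # guard against very small classes
        sub[idx] = KMeans(n_clusters=k, n_init=10).fit_predict(Z[idx])
    return np.stack([y, sub], axis=1)  # one (class, sub-cluster) pair per row
```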
Table: Representative methodologies for class-wise cluster assignment
| Method | Assignment Mode | Core Objective/Step |
|---|---|---|
| Spectral + Likelihood Refinement (Lyu et al., 8 Jun 2025) | Hard | SVD-embed, k-means, then likelihood maximization for labels |
| MCWM (Olobatuyi et al., 2022) | Soft/Hard | Posterior $\tau_{ik}$, hard via $\arg\max_k \tau_{ik}$, class-wise rates |
| Weakly supervised UCC (Oner et al., 2019) | Hard (recoverable) | Instance embedding via UCC prediction, then clustering |
| TCC (Shen et al., 2021) | Soft/Hard | Assignment confidence $q_{ik}$, cluster-level and instance-level contrastive loss |
| SwAV (Caron et al., 2020) | Soft | Balanced assignments via Sinkhorn, soft assignment matrix $Q$ |
| FGDCC (Filho et al., 23 Dec 2025) | Multi-level Hard | Per-class K-means on AE features, hierarchical assignment in two-level classification |
3. Assignment in Astronomical and Scientific Domains
An early illustration of class-wise cluster assignments arises in astronomical studies of young stellar objects (YSOs). Here, per-object spectral indices (from SEDs) are thresholded to assign each object to a physically motivated class (Class I, Flat, Class II). Spatially, membership fractions for each class within and beyond the core cluster radius $r_c$ quantify mass segregation and evolutionary gradients, e.g., $\mathrm{Fraction}(\text{Class I},\, r \le r_c) = 0.36$ versus $\mathrm{Fraction}(\text{Class II},\, r \le r_c) = 0.25$. This spatial stratification provides insight into cluster formation and dynamical processes (Majaess et al., 2011).
4. Alignment, Consistency, and Theoretical Guarantees
Deep clustering paradigms have formalized cluster assignment alignment using cross-view and cross-instance objectives. In multiview settings, cross-view contrastive learning (CVCL) aligns soft assignment distributions across views, pulling together corresponding cluster assignment vectors while pushing apart non-matching assignments, via a cluster-level contrastive loss of the form
$$\ell_k = -\log \frac{\exp\!\big(s(\mathbf{q}_k^{a}, \mathbf{q}_k^{b}) / \tau\big)}{\sum_{k'=1}^{K} \exp\!\big(s(\mathbf{q}_k^{a}, \mathbf{q}_{k'}^{b}) / \tau\big)},$$
where $\mathbf{q}_k^{v}$ is cluster $k$'s assignment vector in view $v$ and $s$ is a similarity (e.g., cosine). Alignment at the cluster level yields higher purity, normalized mutual information (NMI), and more balanced partitions (Chen et al., 2023).
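A NumPy rendering of this cluster-level term, under the assumption of cosine similarity and matched cluster indices across the two views:

```python
import numpy as np

def cluster_level_loss(Qa, Qb, tau=0.5):
    """Contrast cluster assignment vectors across two views: column k of
    each view's soft-assignment matrix (N x K) is that cluster's
    assignment vector; matching columns are positives, others negatives."""
    A = Qa.T / np.linalg.norm(Qa.T, axis=1, keepdims=True)   # K x N
    B = Qb.T / np.linalg.norm(Qb.T, axis=1, keepdims=True)
    sim = (A @ B.T) / tau                                    # K x K cosine
    sim -= sim.max(axis=1, keepdims=True)                    # stabilize
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                       # pull diagonal up
```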
In weakly supervised settings, a perfect unique class count classifier enables exact recovery of per-instance assignments, given sufficient bag diversity and class coverage (Oner et al., 2019).
For latent class models, spectral + likelihood refinement (SOLA) achieves minimax-optimal mis-clustering rates under separability and balance constraints, matching theoretical lower bounds in high dimensions (Lyu et al., 8 Jun 2025).
In conformal prediction, class-wise embedding and clustering of label score distributions enables calibration of predictive sets at the cluster level with rigorous (approximate) coverage guarantees (Ding et al., 2023).
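A sketch of cluster-level calibration under a split-conformal setup: classes are mapped to clusters (by some clustering of their score distributions, here taken as given via a hypothetical `cluster_of_class` map), and a finite-sample-corrected quantile is computed per cluster. The correction follows the standard split-conformal recipe, not necessarily the paper's exact estimator:

```python
import numpy as np

def clustered_qhats(scores, y, cluster_of_class, alpha=0.1):
    """Per-cluster conformal thresholds: pool calibration nonconformity
    scores of all classes mapped to the same cluster, then take the
    finite-sample-corrected (1 - alpha) quantile within each cluster."""
    qhat = {}
    for g in set(cluster_of_class.values()):
        members = [c for c, gg in cluster_of_class.items() if gg == g]
        pooled = scores[np.isin(y, members)]
        n = len(pooled)
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        qhat[g] = np.quantile(pooled, level)
    return qhat

# Prediction sets then include every label whose score clears its
# cluster's threshold: {c : score(x, c) <= qhat[cluster_of_class[c]]}.
```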
5. Evaluation, Information-Theoretic Coding, and Practical Impact
The evaluation of cluster–class alignment leverages cross-tabulation, Adjusted Rand Index, accuracy, and coverage metrics. In MCWM, the class-wise cluster allocation is explicitly quantified by
$$\mathrm{rate}(c, k) = \frac{\#\{i : y_i = c,\ \hat z_i = k\}}{\#\{i : y_i = c\}}$$
to measure how well clusters recover or represent ground-truth classes.
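This rate matrix, plus an overall agreement score, is a few lines of NumPy/scikit-learn (the labels below are purely illustrative):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def rate_matrix(y, z_hat, n_classes, n_clusters):
    """rate[c, k] = #{i : y_i = c and z_hat_i = k} / #{i : y_i = c}."""
    rate = np.zeros((n_classes, n_clusters))
    for c in range(n_classes):
        mask = y == c
        if mask.any():
            rate[c] = np.bincount(z_hat[mask], minlength=n_clusters) / mask.sum()
    return rate

y = np.array([0, 0, 1, 1, 2, 2]); z = np.array([1, 1, 0, 0, 2, 0])
print(rate_matrix(y, z, 3, 3))       # rows sum to 1 per class
print(adjusted_rand_score(y, z))     # overall agreement
```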
From an information-theoretic perspective, the assignment map itself can be the object of compression. Random Cycle Coding (RCC) is an optimal algorithm for losslessly encoding the cluster assignments of $N$ objects into $K$ clusters, with net code length equal to the information content of the partition,
$$\log_2 \binom{N}{n_1, \dots, n_K} = \log_2 N! - \sum_{k=1}^{K} \log_2 n_k!,$$
where $n_k$ is the size of cluster $k$. RCC achieves theoretically minimal rates and high efficiency for vector database indexing and storage (Severo et al., 30 Nov 2024).
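As a back-of-envelope check of this expression (not the RCC encoder itself), the bit count can be computed stably with log-gamma:

```python
from math import lgamma, log

def assignment_bits(cluster_sizes):
    """log2(N! / prod(n_k!)): information content of assigning N objects
    to clusters of the given sizes; lgamma(n + 1) = ln(n!) avoids overflow."""
    N = sum(cluster_sizes)
    nats = lgamma(N + 1) - sum(lgamma(n + 1) for n in cluster_sizes)
    return nats / log(2)

# e.g., one million vectors in three index buckets:
print(assignment_bits([500_000, 300_000, 200_000]))
```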
6. Applications, Challenges, and Contemporary Directions
Class-wise cluster assignments enable:
- Structure recovery and label discovery in survey and high-dimensional data (Lyu et al., 8 Jun 2025, Olobatuyi et al., 2022)
- Weakly supervised or bag-level label exploitation (Oner et al., 2019)
- Fine-grained visual categorization via intra-class structure modeling (Filho et al., 23 Dec 2025)
- Conformal prediction with better class-conditional guarantees when classes are data-sparse (Ding et al., 2023)
- Improved unsupervised feature learning and clustering stability in deep architectures (Caron et al., 2020, Shen et al., 2021, Chen et al., 2023)
Challenges remain in balancing cluster uniformity (entropy regularization), minimizing assignment errors under misspecification, handling high intra-class variability (which motivates class-wise clustering), and extending efficient compression to fine-grained and multi-level assignment structures for storage or transmission (Severo et al., 30 Nov 2024, Filho et al., 23 Dec 2025). Current research is marked by the continued development of hybrid models that integrate domain labels, data-driven clusterings, and structural priors.
References:
- "A Cluster of Class I/f/II YSOs Discovered Near the Cepheid SU Cas" (Majaess et al., 2011)
- "Weakly Supervised Clustering by Exploiting Unique Class Count" (Oner et al., 2019)
- "You Never Cluster Alone" (Shen et al., 2021)
- "Spectral Clustering with Likelihood Refinement is Optimal for Latent Class Recovery" (Lyu et al., 8 Jun 2025)
- "Multinomial Cluster-Weighted Models for High-Dimensional Data" (Olobatuyi et al., 2022)
- "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments" (Caron et al., 2020)
- "Class-Conditional Conformal Prediction with Many Classes" (Ding et al., 2023)
- "FGDCC: Fine-Grained Deep Cluster Categorization -- A Framework for Intra-Class Variability Problems in Plant Classification" (Filho et al., 23 Dec 2025)
- "Deep Multiview Clustering by Contrasting Cluster Assignments" (Chen et al., 2023)
- "Random Cycle Coding: Lossless Compression of Cluster Assignments via Bits-Back Coding" (Severo et al., 30 Nov 2024)