Joint Feature Weighting & Clustering
- Joint Feature Weighting and Clustering is an unsupervised paradigm that simultaneously infers feature relevance and assigns observations to clusters, improving interpretability in high-dimensional data.
- It embeds feature weighting directly into clustering objectives using methods such as weighted K-means, mean-shift, and matrix factorization with regularization techniques like entropy and sparsity.
- Empirical benchmarks indicate these methods enhance performance metrics like accuracy and mutual information, outperforming sequential feature selection and clustering approaches.
Joint feature weighting and clustering refers to a class of unsupervised learning algorithms that simultaneously infer the relevance of input features and assign observations to clusters. In high-dimensional settings where many features are irrelevant or redundant, this paradigm is essential for robust, interpretable, and accurate clustering. Modern joint approaches avoid the pitfalls of separate feature selection and clustering by embedding feature-weight learning directly into the clustering objective or its optimization loop, often leveraging convex or block-coordinate frameworks, regularization terms (entropy, sparsity, group structure), and explicit integration of feature weights into distance, affinity, or factorization steps.
1. Mathematical Frameworks for Joint Feature Weighting and Clustering
Contemporary joint feature weighting and clustering algorithms integrate feature relevance parameters directly into clustering cost functions. Notable archetypes include:
- Feature-weighted mean-shift clustering: Adaptively weights features within Gaussian kernel density estimates to suppress irrelevant dimensions, minimizing an entropy-regularized cost of the form $\sum_{j=1}^{d} w_j D_j + \lambda \sum_{j=1}^{d} w_j \log w_j$, where the weight vector $\mathbf{w} = (w_1, \dots, w_d)$ lies on the simplex and $D_j$ measures within-cluster dispersion along feature $j$ (Chakraborty et al., 2020).
- Feature-weighted K-means and its variants: Generalize the K-means objective with either global or cluster-specific weights, e.g., $\sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{j=1}^{d} w_{kj}^{\beta} \, (x_{ij} - c_{kj})^2$, with the normalization $\sum_{j} w_{kj} = 1$ and sometimes entropy or sparsity constraints on the weights (Amorim, 2015).
- Matrix factorization approaches: Embed feature-weights into NMF, imposing simplex or pairwise orthogonality constraints for vectorized weighting of features, as in FNMF (Chen et al., 2021), or into deep matrix factorization architectures with dynamically updated exponents to control selection sparsity (Khalafaoui et al., 2024).
- Sparse and information-theoretic models: Penalize features via group-lasso or projected constraints or select features by their contribution to mutual information or geometric clustering objectives (Ohl et al., 2023, Costa et al., 28 Jan 2026).
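The weighted objectives above can be evaluated directly. The following minimal sketch (a generic global-weight variant with an illustrative exponent `beta`, not taken from any single cited method) shows how the learned weights enter the distance term:

```python
# Sketch of a feature-weighted K-means cost with global weights w on the
# simplex. The exponent beta and the squared Euclidean distance are
# illustrative assumptions, not the formulation of any one cited paper.
def weighted_cost(X, labels, centers, w, beta=2.0):
    """Sum over points of sum_j w_j**beta * (x_ij - c_lj)**2."""
    return sum(
        sum(wj ** beta * (xj - cj) ** 2 for wj, xj, cj in zip(w, x, centers[l]))
        for x, l in zip(X, labels)
    )

# Two points in one cluster centered at [1, 0]; only feature 0 varies,
# so down-weighting feature 0 shrinks the cost.
cost = weighted_cost([[0.0, 0.0], [2.0, 0.0]], [0, 0], [[1.0, 0.0]], [0.5, 0.5])
```

Setting `w = [1.0, 0.0]` recovers the unweighted cost along feature 0 alone, which is how such objectives suppress irrelevant dimensions.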
This direct coupling is central: it ensures that the optimized feature weights not only reflect within-cluster homogeneity but also maximize between-cluster separation under the learned cluster assignment.
2. Optimization Paradigms and Update Mechanisms
Most joint weighting and clustering algorithms employ alternating minimization (block coordinate descent):
- Weighted mean-shift: Alternates between updating cluster prototypes via feature-weighted kernel averages and updating the feature weights via a closed-form, entropy-regularized within-cluster variance minimization. Both steps yield explicit updates (Chakraborty et al., 2020).
- K-means variants: The update loop typically consists of (i) an E-step: given the current weights $\mathbf{w}$, assign points to clusters minimizing the weighted distance; (ii) an M-step for centroids; and (iii) an M-step for weights—usually inverse-variance or entropy-weighted, exp-normalized updates, sometimes with cluster-specific weights and exponent parameters (Amorim, 2015).
- Matrix factorization (shallow and deep): Utilize block updates for factor matrices, feature weights, and (in multi-view) view/fusion assignments; closed-form solutions are derived for each block, ensuring monotonic decrease of the composite objective. Certain deep methods dynamically adapt feature selection hyperparameters via control-theoretic PI rules that tie the degree of sparsity to reductions in reconstruction loss (Chen et al., 2021, Khalafaoui et al., 2024).
- Geometry-aware and information-theoretic methods: Alternate between clustering assignments (by maximizing mutual information, MMD, or Wasserstein distances) and feature weights (by projecting feature relevance statistics onto a simplex or sparsity-constrained set), commonly using proximal gradient or projection algorithms (Ohl et al., 2023, Costa et al., 28 Jan 2026).
- Multi-view and fusion-based models: Employ double self-weighted schemes (e.g., DSMC), alternating adaptive feature weighting in each view and global view weighting, fusing via consensus clustering (Fang et al., 2020).
Convergence is monotonic (the objective never increases) for all major approaches. Termination is declared either when the weight vectors and cluster assignments stabilize or when the per-iteration objective reduction falls below a nominal threshold.
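The three-block loop for the K-means-style variant can be sketched end to end. This is a toy implementation under stated assumptions (global entropy-regularized weights, deterministic initialization from the first $k$ points), not a faithful reproduction of any cited algorithm; each block minimizes the composite objective given the others, so the recorded objective never increases:

```python
import math

def weighted_kmeans(X, k, lam=1.0, iters=20):
    d = len(X[0])
    w = [1.0 / d] * d
    # Deterministic initialization from the first k points (an assumption
    # made for reproducibility; real implementations seed more carefully).
    centers = [list(X[i]) for i in range(k)]
    history = []
    for _ in range(iters):
        # (i) E-step: assign each point to the nearest centroid under w.
        labels = []
        for x in X:
            dists = [sum(w[j] * (x[j] - c[j]) ** 2 for j in range(d)) for c in centers]
            labels.append(min(range(k), key=dists.__getitem__))
        # (ii) M-step for centroids: per-cluster feature means.
        for ci in range(k):
            members = [x for x, l in zip(X, labels) if l == ci]
            if members:
                centers[ci] = [sum(m[j] for m in members) / len(members) for j in range(d)]
        # (iii) M-step for weights: closed-form exp-normalized update that
        # minimizes sum_j w_j*D_j + lam * sum_j w_j*log(w_j) over the simplex.
        D = [sum((x[j] - centers[l][j]) ** 2 for x, l in zip(X, labels)) for j in range(d)]
        scores = [math.exp(-Dj / lam) for Dj in D]
        total = sum(scores)
        w = [s / total for s in scores]
        obj = sum(w[j] * D[j] for j in range(d)) + lam * sum(wj * math.log(wj) for wj in w)
        history.append(obj)
    return labels, w, history

# Two clusters separated along feature 0; feature 1 is uninformative noise,
# so the learned weight on feature 0 should dominate.
X = [[0.0, 0.9], [5.0, 0.1], [0.2, 0.2], [5.2, 0.8], [0.1, 0.5], [5.1, 0.5]]
labels, w, hist = weighted_kmeans(X, 2)
```

Note how the weight update is the softmax of the negated per-feature dispersions: small $\lambda$ concentrates mass on the single most homogeneous feature, while large $\lambda$ drives the weights toward uniform.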
3. Types and Structures of Feature Weights
Feature weights appear at several levels of granularity and under various constraints:
| Weight Structure | Description | Example Algorithms |
|---|---|---|
| Global feature weights | One weight per feature for all clusters | Weighted K-means, FNMF |
| Cluster-specific weights | Each cluster has its own weight vector for features | AWK, Entropy-weighted K-means |
| Block/group weights | Feature weights tied to predefined groups (blocks) | SYNCLUS, FG-K-Means |
| Multi-component/ensemble | Multiple weight vectors/components per sample or view | FNMF (multi-component), DMFAW |
| Deep/learned hierarchical | Feature weights emerge via neural or deep architectures | Deep matrix factorization, Sparse GEMINI, DIB |
Normalization is commonly enforced via simplex constraints ($w_j \ge 0$, $\sum_j w_j = 1$), and sparsity is imposed via explicit $\ell_1$- or $\ell_0$-norm constraints or entropy penalties. Some models additionally encourage orthogonality or diversity among multiple weight vectors.
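The simplex constraint is typically enforced by Euclidean projection. A standard sort-based routine (the classic algorithm, not specific to any one cited paper) suffices:

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {w : w_j >= 0, sum_j w_j = 1}, via the classic sort-based rule:
    find the threshold theta and clip each coordinate to max(v_j - theta, 0)."""
    u = sorted(v, reverse=True)
    css = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        css += ui  # cumulative sum of the i largest coordinates
        t = (css - 1.0) / i
        if ui - t > 0:  # coordinate ui is still active at this threshold
            theta = t
    return [max(x - theta, 0.0) for x in v]
```

Projection-based methods (Section 2) apply this after each gradient step on the weights, so feasibility holds at every iteration.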
4. Extensions to Multi-view, Multi-modal, and Robust Clustering
Sophisticated models operate in the multi-view and multi-modal regime, integrating view-level and feature-level weighting into unified clustering/fusion objectives:
- Double Self-weighted Multi-view Clustering (DSMC) assigns adaptive feature weights within each view (one weight matrix per view) and adaptive global view weights, performing consensus clustering via a Procrustes-rotated, feature-filtered graph fusion (Fang et al., 2020).
- Deep Matrix Factorization with Adaptive Weights for Multi-View Clustering (DMFAW) utilizes deep NMF, PI-type control of weighting exponents, dynamic feature weighting per view, orthogonal permutations for partition alignment, and late fusion for consensus (Khalafaoui et al., 2024).
- Subspace and Local-Structure Regularized Models (JFLMSC): Integrate feature weighting, local graph learning, and robust self-representation, jointly optimizing view-specific weights, self-expressive subspaces, and consensus spectral embeddings (Lina et al., 2020).
- Auto-weighted Factorization Approaches (RFA-LCF): Simultaneously factorize robustly preprocessed data, encode local structure, auto-weight learned similarity graphs, and enforce feature sparsity in the projection (Zhang et al., 2019).
Empirical evidence shows that simultaneous feature and view weighting enhances robustness to noise and heterogeneity, consistently outperforming sequential or unweighted baselines in both accuracy and mutual information metrics.
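The view-level self-weighting idea can be illustrated with the widely used closed-form rule that emerges when per-view losses enter the objective under a square-root reformulation. This is a generic sketch: the rule $\alpha_v = 1/(2\sqrt{e_v})$ is assumed for illustration, not taken verbatim from DSMC or DMFAW:

```python
import math

def auto_view_weights(view_errors, eps=1e-12):
    """Closed-form self-weighting: minimizing sum_v sqrt(e_v) is equivalent to
    minimizing sum_v alpha_v * e_v with alpha_v = 1 / (2 * sqrt(e_v)), so
    views with larger loss e_v automatically receive smaller weights.
    eps guards against division by zero for a perfectly fit view."""
    return [1.0 / (2.0 * math.sqrt(e + eps)) for e in view_errors]

# A view with four times the reconstruction error gets half the weight.
alphas = auto_view_weights([1.0, 4.0])
```

Because the weights are recomputed in closed form after every block update, no extra hyperparameter governs the view trade-off, which is the appeal of "self-weighted" schemes.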
5. Filter-based and Model-agnostic Feature Weighting Techniques
In addition to embedded approaches, a large family of filter methods assesses feature relevance prior to clustering and uses the resulting weights in standard distance- or kernel-based algorithms:
- Variance, PCA loading, F-test, Minkowski norms, and mRMR: Assign weights to features based on unconditional or pseudo-labeled criteria (variance, loadings, ANOVA/F-statistics, mutual information, redundancy).
- SHAP-based approaches: Employ a surrogate supervised model (e.g., random forest trained on pseudo-labels derived from initial clustering) and compute global feature importance via mean absolute SHAP values. These weights are used to rescale features for downstream clustering (Galis et al., 12 Mar 2025).
SHAP-based and ensemble filter strategies have been shown to achieve up to 40% improvement in clustering ARI versus unweighted or classical filter-based weighting, especially in scenarios with complex, nonlinear feature interactions (Galis et al., 12 Mar 2025).
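A minimal filter-style baseline of the kind listed above takes only a few lines. Variance-based weighting is used here purely as an illustrative stand-in for the richer criteria (PCA loadings, mRMR, SHAP) discussed in this section:

```python
def variance_filter_weights(X):
    """Filter-style weights: per-feature variance, normalized to the simplex.
    Assumes at least one feature varies (nonzero total variance); note that
    raw variance also rewards high-variance noise, one reason embedded and
    SHAP-based weighting can outperform simple filters."""
    n, d = len(X), len(X[0])
    means = [sum(x[j] for x in X) / n for j in range(d)]
    var = [sum((x[j] - means[j]) ** 2 for x in X) / n for j in range(d)]
    total = sum(var)
    return [v / total for v in var]

def rescale(X, w):
    """Rescale features so plain Euclidean distances become weighted ones."""
    return [[xj * wj for xj, wj in zip(x, w)] for x in X]
```

After `rescale`, any off-the-shelf clustering algorithm operates on the weighted geometry with no modification, which is the defining property of filter methods.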
6. Empirical Benchmarks and Theoretical Guarantees
Extensive benchmarking across synthetic and real datasets, including high-dimensional gene expression and image data, demonstrates the potency of joint feature weighting and clustering:
- WBMS achieves perfect or near-perfect recovery of cluster structures and informative features, with cubic convergence rates, and outperforms mean-shift variants, G-means, DP-means, and others (Chakraborty et al., 2020).
- FNMF delivers 10–20% NMI/accuracy gains over standard NMF/canonical adaptive neighbor/centroid methods, with rapid convergence and resilience to noise or occlusion in vision datasets (Chen et al., 2021).
- Sparse DIB and GEMINI match or surpass sparse K-means and wrapper methods in ARI, particularly when the number of truly informative features is small relative to the ambient dimension (Costa et al., 28 Jan 2026, Ohl et al., 2023).
- Multi-view models such as JFLMSC and DSMC yield consistent improvement in accuracy/NMI/ARI across image, text, and multi-view databases, outperforming both single-view and naive multi-view baselines (Lina et al., 2020, Fang et al., 2020).
All frameworks demonstrate objective monotonicity, and many offer (local) convergence guarantees to stationary points or minimizers.
7. Open Questions and Future Directions
Several challenges and avenues for further research have been explicitly identified:
- Automatic discovery of multi-feature groups and block structure: Rather than user-defined blocks, learn biclustered or subspace patterns within the data (Amorim, 2015).
- Cluster-specific, hierarchical, or context-dependent weighting: Move beyond global or cluster-local weights to model more complex feature relevance, e.g., via learned attention, nested structures, or temporal/spatial domains.
- Integration with robust statistical estimators and handling of outliers: Extend current frameworks to incorporate M-estimator losses or heavy-tailed models (Amorim, 2015).
- Hybrid and semi-supervised approaches for parameter tuning: Leverage small amounts of side information (anchor labels, external metrics) to guide hyperparameter selection and enhance identification of sparsity/exponent controls (Amorim, 2015, Ohl et al., 2023).
- Scalability and computational complexity: Address the unfavorable scaling of kernel and information-theoretic models through approximate nearest-neighbor, mini-batch, or distributed methods (Costa et al., 28 Jan 2026, Ohl et al., 2023).
These directions reflect the need for algorithms capable of simultaneous selection of salient features, interpretable cluster structures, and robust adaptation to diverse unsupervised domains.