PLS-SEM-KM: Joint Latent Modeling & Clustering
- PLS-SEM-KM is a methodological framework that simultaneously estimates latent variable path models and identifies homogeneous clusters in heterogeneous data.
- The approach optimizes measurement, structural, and clustering models jointly, thereby improving path coefficient recovery and cluster recognition over sequential methods.
- Empirical studies show enhanced performance, with higher ARI and penalized R² compared to traditional two-step PLS-SEM followed by K-means, in both simulations and real-world applications.
Partial Least Squares Structural Equation Modeling with K-Means Clustering (PLS-SEM-KM) is a methodology for simultaneous estimation of latent variable path models and identification of homogeneous clusters in heterogeneous data. It integrates Partial Least Squares Structural Equation Modeling (PLS-SEM) with a reduced K-means component, yielding cluster assignments that are homogeneous in terms of the latent variable model's structural relationships. PLS-SEM-KM addresses critical limitations of sequential “PLS-SEM → K-means” strategies and improves the accuracy of both path coefficient recovery and cluster recognition in data containing latent subpopulations (Fordellone et al., 2018).
1. Motivation and Background
Traditional SEM and its variance-based counterpart PLS-SEM assume a homogeneous underlying population of units. In practice, many real-world applications involve data with hidden latent segmentations (clusters) that introduce heterogeneity. Ignoring such latent segmentation biases parameter estimation and distorts model fit metrics.
A widely employed workaround—the two-step approach—involves (i) fitting a global PLS-SEM to the entire dataset, extracting latent variable (LV) scores, then (ii) applying K-means clustering to those scores. However, prior work (Sarstedt and Ringle, 2010; Vichi and Kiers, 2001) demonstrates that this sequential/tandem approach is sensitive to manifest variables with high variance that do not inform cluster membership. Factor or PLS extraction optimizes total variance directions, whereas clustering aims to maximize between-cluster variance; this misalignment can obscure true clusters.
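The misalignment between total-variance directions and cluster structure can be illustrated with a small NumPy sketch, using PCA as a stand-in for any total-variance-driven score extraction (an illustrative assumption, not the PLS-SEM estimator itself): a high-variance manifest variable that carries no cluster information captures the leading component, hiding the segmentation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
labels = np.repeat([0, 1], n // 2)

# x1: high-variance manifest variable carrying no cluster information
x1 = rng.normal(0.0, 10.0, n)
# x2: low-variance variable carrying the true cluster separation
x2 = labels + rng.normal(0.0, 0.1, n)
X = np.column_stack([x1, x2])
X = X - X.mean(0)

# leading principal direction = dominant total-variance direction
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]

# pc1 locks onto the noisy x1 axis, so clustering the 1-D scores X @ pc1
# cannot recover the segmentation hidden in x2
print(abs(pc1[0]) > 0.99)  # True: dominated by the uninformative variable
```

Any subsequent K-means on the one-dimensional scores then partitions units by the uninformative variable, which is precisely the failure mode the joint approach is designed to avoid.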
PLS-SEM-KM circumvents these limitations by jointly optimizing the measurement, structural, and cluster models, so that the extracted latent variable hyperplane reflects both structural relations across variables and the underlying segmentation of units. Empirical evidence indicates superior clustering (higher Adjusted Rand Index, ARI) and improved path recovery compared with two-step and mixture-SEM techniques.
2. Model Structure
PLS-SEM-KM is defined by two standard PLS-SEM submodels and a K-means component:
A. Structural Model (Inner):
Let n denote the number of observations, M the number of endogenous LVs, and Q the number of exogenous LVs:

H = HB + ΞΓ + Z,

where H (n × M) and Ξ (n × Q) are matrices of endogenous and exogenous LVs, B (M × M) and Γ (Q × M) are matrices of path coefficients, and Z (n × M) is a matrix of residuals.
B. Measurement Model (Outer):
X = HΛ′_H + ΞΛ′_Ξ + E

for the reflective mode, where X (n × J) is the manifest variable matrix, Λ_H (J × M) and Λ_Ξ (J × Q) are loading matrices, and E (n × J) is the measurement error.
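As a concrete instance of these two submodels, the following NumPy sketch generates data from a small hypothetical model with one exogenous LV, two endogenous LVs, and a block-diagonal reflective measurement model; all dimensions and coefficient values are illustrative assumptions, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(1)
n, Q, M, J = 200, 1, 2, 6   # observations, exogenous LVs, endogenous LVs, manifests

# exogenous LVs Xi (n x Q)
Xi = rng.normal(size=(n, Q))

# structural model H = H B + Xi Gamma + Z  =>  H = (Xi Gamma + Z)(I - B)^{-1}
B = np.array([[0.0, 0.5],
              [0.0, 0.0]])          # M x M, strictly triangular (acyclic paths)
Gamma = np.array([[0.7, 0.3]])      # Q x M
Z = 0.2 * rng.normal(size=(n, M))
H = (Xi @ Gamma + Z) @ np.linalg.inv(np.eye(M) - B)

# reflective measurement model: X = H Lam_H' + Xi Lam_Xi' + E,
# written compactly with all LVs stacked and one block-diagonal loading matrix
Y = np.column_stack([H, Xi])        # all LVs (n x (M + Q))
Lam = np.zeros((J, M + Q))
Lam[[0, 1], 0] = 0.9                # manifests 0-1 reflect H1
Lam[[2, 3], 1] = 0.9                # manifests 2-3 reflect H2
Lam[[4, 5], 2] = 0.9                # manifests 4-5 reflect Xi
X = Y @ Lam.T + 0.1 * rng.normal(size=(n, J))
print(X.shape)  # (200, 6)
```

With the block-diagonal loading pattern, manifests within a block are strongly correlated, as a reflective measurement model requires.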
3. Mathematical Formulation
The core innovation of PLS-SEM-KM is the simultaneous incorporation of a reduced K-means model:
- U (n × K): binary cluster membership matrix (u_ik = 1 if observation i belongs to cluster k)
- Ȳ (K × P): cluster centroids in the latent space, where P = M + Q is the total number of LVs
- A (J × P): full orthonormal loading matrix
The estimation problem is summarized as:

min_{U, Ȳ, A} ‖X − UȲA′‖²

with constraints A′A = I_P, u_ik ∈ {0, 1}, and Σ_k u_ik = 1 for each observation i.
The clustering constraint is interpreted as a De Soete–Carroll reduced K-means in the latent space: equivalently, X = UȲA′ + E. While no explicit scalar “loss + gain” objective is stated, the method can be seen as minimizing the within-cluster sum of squares ‖XA − UȲ‖² in the latent space, while maximizing the PLS criterion of covariance between LVs.
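The equivalence between the full loss and its latent-space within-cluster part follows from the orthonormality constraint A′A = I_P: the loss splits into the within-cluster sum of squares inside the subspace plus the variance left outside it. A short NumPy check of this identity (all quantities randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, J, P, K = 30, 5, 2, 3
X = rng.normal(size=(n, J))

# random orthonormal A (J x P), random memberships U, centroids Ybar
A, _ = np.linalg.qr(rng.normal(size=(J, P)))
U = np.eye(K)[rng.integers(0, K, n)]
Ybar = rng.normal(size=(K, P))

full = np.linalg.norm(X - U @ Ybar @ A.T) ** 2      # ||X - U Ybar A'||^2
latent = np.linalg.norm(X @ A - U @ Ybar) ** 2      # within-subspace K-means loss
resid = np.linalg.norm(X - X @ A @ A.T) ** 2        # variance outside the subspace

# for orthonormal A the cross term vanishes (A'(I - AA') = 0), so:
print(np.isclose(full, latent + resid))  # True
```

This decomposition is why minimizing the full loss simultaneously picks a subspace and a clustering inside it, rather than clustering after a variance-driven projection.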
4. Estimation Algorithm
The algorithm follows a block-coordinate approach reminiscent of Wold’s original PLS, but includes steps for cluster assignment and centroid updating:
- Input: standardized data X, measurement design matrix D_Λ, structural design matrix D_B, and K (number of clusters, typically found using the gap statistic).
- Initialization:
  - Λ = D_Λ (block-diagonal loading structure)
  - Randomly initialize the membership matrix U
  - Ȳ = (U′U)⁻¹U′XΛ (initial centroids)
  - Set iteration counter t = 0, tolerance ε > 0, and maximum iterations T
- Iterate until convergence or maximum iterations:
  - Calculate latent scores: Y = XΛ, with columns scaled to unit variance
  - Compute the covariance matrix of the scores: C = n⁻¹Y′Y
  - Inner weights: E = D_B ∘ sign(C) (centroid scheme)
  - Update scores: Ỹ = YE
  - Update loadings: Λ ∝ X′Ỹ, restricted to the pattern in D_Λ, then re-orthonormalize
  - Update memberships: u_ik = 1 if k = argmin_{k′} ‖y_i − ȳ_{k′}‖², and 0 otherwise
  - Update centroids: Ȳ = (U′U)⁻¹U′Y
  - Check stopping rule: if ‖Y^(t+1) − Y^(t)‖ < ε, halt; otherwise set t ← t + 1 and repeat
- After convergence, estimate path coefficients (B̂, Γ̂) by OLS regression of each endogenous LV on its parent LVs.
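The clustering core of the iteration — alternating cluster assignment, centroid updates, and loading updates — can be sketched as a reduced K-means alternating least squares. This is a simplified sketch that omits the PLS inner-weighting and design-restricted loading steps; the function name and loop structure are illustrative, not the authors' implementation.

```python
import numpy as np

def reduced_kmeans(X, K, P, n_iter=30, seed=0):
    """ALS for the reduced K-means core: min ||X - U Ybar A'||^2 over
    binary memberships U (n x K), centroids Ybar (K x P), and
    orthonormal loadings A (J x P)."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    A = Vt[:P].T                                  # init: top-P principal axes
    Ybar = (X @ A)[rng.choice(n, K, replace=False)]
    losses = []
    for _ in range(n_iter):
        S = X @ A                                 # latent scores (n x P)
        # assignment: nearest centroid in the latent space
        g = ((S[:, None] - Ybar[None]) ** 2).sum(-1).argmin(1)
        U = np.eye(K)[g]
        counts = np.maximum(U.sum(0), 1)[:, None]
        # loadings: top-P eigenvectors of X' P_U X, where P_U projects
        # each row onto its cluster mean
        PUX = U @ ((U.T @ X) / counts)
        _, V = np.linalg.eigh(X.T @ PUX)
        A = V[:, -P:]
        # centroids re-expressed in the updated loadings
        Ybar = (U.T @ (X @ A)) / counts
        losses.append(np.linalg.norm(X - U @ Ybar @ A.T) ** 2)
    return U, Ybar, A, losses

# usage on synthetic data: 3 latent clusters embedded in 5 dimensions
rng = np.random.default_rng(3)
centers = np.array([[0, 0], [10, 0], [0, 10]], float)
lab = np.repeat([0, 1, 2], 50)
latent = centers[lab] + 0.3 * rng.normal(size=(150, 2))
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))      # orthonormal embedding
X = latent @ Q.T + 0.05 * rng.normal(size=(150, 5))
U, Ybar, A, losses = reduced_kmeans(X, K=3, P=2)
```

Each substep (assignment, then jointly optimal loadings and centroids given the memberships) cannot increase the loss, so the iteration converges monotonically to a local minimum.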
5. Assumptions and Computational Considerations
PLS-SEM-KM requires metric data, preferably standardized. No distributional (e.g., normality) assumptions are imposed. The major computational difficulty is the nonconvexity introduced by the binary membership matrix U, which can create local minima; the practical remedy is to perform multiple random starts (typically 10–20), retaining the solution with the highest penalized R² or best cluster-separation criterion.
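The multi-start strategy is generic and can be sketched in a few lines; `toy_fit` here is a hypothetical stand-in for one randomly seeded PLS-SEM-KM run returning a loss and a solution object.

```python
import numpy as np

def best_of_restarts(fit, n_starts=15):
    """Run a nonconvex estimator from several random seeds and keep the
    solution with the lowest loss; fit(seed) returns (loss, solution)."""
    return min((fit(seed) for seed in range(n_starts)), key=lambda r: r[0])

# hypothetical stand-in whose result depends on the seed, to show selection
def toy_fit(seed):
    rng = np.random.default_rng(seed)
    loss = rng.uniform(1.0, 2.0)
    return loss, {"seed": seed}

loss, sol = best_of_restarts(toy_fit, n_starts=15)
```

A criterion to be maximized (e.g., penalized R²) fits the same pattern by negating it before comparison.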
Computational complexity per iteration is roughly O(n(JP + KP)), dominated by score computation and cluster assignment. For common use cases (n in the hundreds to low thousands, tens of manifest variables, and a handful of LVs and clusters), runtime is on the order of seconds.
Selection of the number of clusters K is typically made using the gap statistic, either on PLS scores or directly on the residual sum of squares in the reduced (latent) space.
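A minimal version of the gap statistic (Tibshirani et al.'s comparison of log within-cluster dispersion against uniform reference data) can be sketched with a tiny K-means; the farthest-point seeding and all sizes are illustrative choices, not the source's settings.

```python
import numpy as np

def kmeans_W(X, K, n_iter=25):
    """Tiny Lloyd's algorithm (deterministic farthest-point seeding);
    returns the within-cluster sum of squared distances W."""
    C = X[[0]]
    for _ in range(K - 1):                         # farthest-point seeding
        d = ((X[:, None] - C[None]) ** 2).sum(-1).min(1)
        C = np.vstack([C, X[d.argmax()]])
    for _ in range(n_iter):                        # Lloyd iterations
        g = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([X[g == k].mean(0) if np.any(g == k) else C[k]
                      for k in range(K)])
    return ((X - C[g]) ** 2).sum()

def gap_statistic(X, K_max=4, B=10, seed=0):
    """Gap(K) = mean_b log W_K(uniform reference b) - log W_K(data)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    gaps = []
    for K in range(1, K_max + 1):
        logW = np.log(kmeans_W(X, K))
        ref = [np.log(kmeans_W(rng.uniform(lo, hi, X.shape), K))
               for _ in range(B)]
        gaps.append(np.mean(ref) - logW)
    return np.array(gaps)

# two well-separated clusters: Gap rises sharply at K = 2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(20, 0.5, (100, 2))])
gaps = gap_statistic(X)
```

In PLS-SEM-KM the same computation would be applied to the latent scores (or the reduced-space residual sum of squares) rather than raw data.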
6. Empirical Evaluation
Simulation Studies
Fordellone and Vichi (2019) evaluate PLS-SEM-KM on 7,200 datasets covering various path models, sample sizes, cluster balances, noise levels, and numbers of clusters. Performance is assessed by:
- Penalized R²
- Adjusted Rand Index (ARI) between recovered and true cluster assignments
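The ARI compares two labelings via their contingency table and is invariant to label permutations; a minimal NumPy implementation of the standard formula:

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two cluster labelings a and b."""
    a, b = np.asarray(a), np.asarray(b)
    n = a.size
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    C = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(C, (ia, ib), 1)                      # contingency table
    comb2 = lambda x: x * (x - 1) / 2.0            # pairs within a count
    s_ij = comb2(C).sum()
    s_a, s_b = comb2(C.sum(1)).sum(), comb2(C.sum(0)).sum()
    expected = s_a * s_b / comb2(n)                # chance-level agreement
    maximum = 0.5 * (s_a + s_b)
    return (s_ij - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (label-invariant)
```

An ARI of 1 indicates identical partitions up to relabeling; values near 0 indicate chance-level agreement, which is why it is the natural metric for recovered vs. true cluster assignments.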
Key results:
- PLS-SEM-KM outperforms FIMIX-PLS in nearly all scenarios, often by 10–30 percentage points in ARI
- Sequential PLS → K-means yields ARI around 0.65, while PLS-SEM-KM achieves substantially higher ARI under low noise
- The gap statistic reliably identifies the correct number of clusters
- With moderate noise and 15 random starts, true cluster recovery rate is approximately 90%
Real-World Case Study
Applying PLS-SEM-KM to the European Consumer Satisfaction Index (ECSI) for mobile telephony (24 manifest indicators, 7 latent constructs):
- Data normalized to [0, 100] scale
- Gap statistic indicates K = 3
- Model fit: average communality ≈ 0.59, with acceptable average structural R², GoF, and penalized R²
- Clusters: 1 (high image/expectations/satisfaction), 2 (moderate satisfaction), 3 (low satisfaction), corresponding roughly to latent satisfaction strata
- Recovered structural path coefficients are consistent with literature, but now cluster-adjusted
7. Significance and Applications
PLS-SEM-KM provides a unified framework for extracting both latent causal structures and segmentations directly informed by model structure. Avoiding tandem PLS-SEM and clustering overcomes traditional pitfalls and empirical evidence demonstrates robust improvements in both clustering and path estimation, especially in settings with substantial heterogeneity. The distribution-free nature, computational tractability, and capability to recover and interpret latent classes with respect to structural relationships underscore its value in marketing science, consumer research, and other domains where observed units may comprise structurally distinct subpopulations (Fordellone et al., 2018).