PLS-SEM-KM: Joint Latent Modeling & Clustering
- PLS-SEM-KM is a methodological framework that simultaneously estimates latent variable path models and identifies homogeneous clusters in heterogeneous data.
- The approach optimizes measurement, structural, and clustering models jointly, thereby improving path coefficient recovery and cluster recognition over sequential methods.
- Empirical studies show enhanced performance, with higher ARI and penalized R² compared to traditional two-step PLS-SEM followed by K-means, in both simulations and real-world applications.
Partial Least Squares Structural Equation Modeling with K-Means Clustering (PLS-SEM-KM) is a methodology for simultaneous estimation of latent variable path models and identification of homogeneous clusters in heterogeneous data. It integrates Partial Least Squares Structural Equation Modeling (PLS-SEM) with a reduced K-means component, yielding cluster assignments that are homogeneous in terms of the latent variable model's structural relationships. PLS-SEM-KM addresses critical limitations of sequential “PLS-SEM → K-means” strategies and improves the accuracy of both path coefficient recovery and cluster recognition in data containing latent subpopulations (Fordellone et al., 2018).
1. Motivation and Background
Traditional SEM and its variance-based counterpart PLS-SEM assume a homogeneous underlying population of units. In practice, many real-world applications involve data with hidden latent segmentations (clusters) that introduce heterogeneity. Ignoring such latent segmentation biases parameter estimation and distorts model fit metrics.
A widely employed workaround—the two-step approach—involves (i) fitting a global PLS-SEM to the entire dataset, extracting latent variable (LV) scores, then (ii) applying K-means clustering to those scores. However, prior work (Sarstedt and Ringle, 2010; Vichi and Kiers, 2001) demonstrates that this sequential/tandem approach is sensitive to manifest variables with high variance that do not inform cluster membership. Factor or PLS extraction optimizes total variance directions, whereas clustering aims to maximize between-cluster variance; this misalignment can obscure true clusters.
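The misalignment between total-variance directions and cluster structure can be illustrated with a small NumPy sketch, using PCA as a stand-in for any total-variance-driven score extraction (an illustrative assumption, not the PLS-SEM estimator itself): a high-variance manifest variable that carries no cluster information captures the leading component, hiding the segmentation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
labels = np.repeat([0, 1], n // 2)

# x1: high-variance manifest variable carrying no cluster information
x1 = rng.normal(0.0, 10.0, n)
# x2: low-variance variable carrying the true cluster separation
x2 = labels + rng.normal(0.0, 0.1, n)
X = np.column_stack([x1, x2])
X = X - X.mean(0)

# leading principal direction = dominant total-variance direction
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]

# pc1 locks onto the noisy x1 axis, so clustering the 1-D scores X @ pc1
# cannot recover the segmentation hidden in x2
print(abs(pc1[0]) > 0.99)  # True: dominated by the uninformative variable
```

Any subsequent K-means on the one-dimensional scores then partitions units by the uninformative variable, which is precisely the failure mode the joint approach is designed to avoid.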
PLS-SEM-KM circumvents these limitations by jointly optimizing the measurement, structural, and cluster models, so that the extracted latent variable hyperplane reflects both structural relations across variables and the underlying segmentation of units. Empirical evidence indicates superior clustering (higher Adjusted Rand Index, ARI) and improved path recovery compared with two-step and mixture-SEM techniques.
2. Model Structure
PLS-SEM-KM is defined by two standard PLS-SEM submodels and a K-means component:
A. Structural Model (Inner):
Let n denote the number of observations, M the number of endogenous LVs, and Q the number of exogenous LVs:

H = HB + ΞΓ + Z,

where H (n × M) and Ξ (n × Q) are matrices of endogenous and exogenous LVs, B (M × M) and Γ (Q × M) are matrices of path coefficients, and Z (n × M) is a matrix of residuals.
B. Measurement Model (Outer):
X = HΛ′_H + ΞΛ′_Ξ + E

for the reflective mode, where X (n × J) is the manifest variable matrix, Λ_H (J × M) and Λ_Ξ (J × Q) are loading matrices, and E (n × J) is the measurement error.
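As a concrete instance of these two submodels, the following NumPy sketch generates data from a small hypothetical model with one exogenous LV, two endogenous LVs, and a block-diagonal reflective measurement model; all dimensions and coefficient values are illustrative assumptions, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(1)
n, Q, M, J = 200, 1, 2, 6   # observations, exogenous LVs, endogenous LVs, manifests

# exogenous LVs Xi (n x Q)
Xi = rng.normal(size=(n, Q))

# structural model H = H B + Xi Gamma + Z  =>  H = (Xi Gamma + Z)(I - B)^{-1}
B = np.array([[0.0, 0.5],
              [0.0, 0.0]])          # M x M, strictly triangular (acyclic paths)
Gamma = np.array([[0.7, 0.3]])      # Q x M
Z = 0.2 * rng.normal(size=(n, M))
H = (Xi @ Gamma + Z) @ np.linalg.inv(np.eye(M) - B)

# reflective measurement model: X = H Lam_H' + Xi Lam_Xi' + E,
# written compactly with all LVs stacked and one block-diagonal loading matrix
Y = np.column_stack([H, Xi])        # all LVs (n x (M + Q))
Lam = np.zeros((J, M + Q))
Lam[[0, 1], 0] = 0.9                # manifests 0-1 reflect H1
Lam[[2, 3], 1] = 0.9                # manifests 2-3 reflect H2
Lam[[4, 5], 2] = 0.9                # manifests 4-5 reflect Xi
X = Y @ Lam.T + 0.1 * rng.normal(size=(n, J))
print(X.shape)  # (200, 6)
```

With the block-diagonal loading pattern, manifests within a block are strongly correlated, as a reflective measurement model requires.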
3. Mathematical Formulation
The core innovation of PLS-SEM-KM is the simultaneous incorporation of a reduced K-means model:
- U (n × K): binary cluster membership matrix (u_ik = 1 if observation i belongs to cluster k)
- Ȳ (K × P): cluster centroids in the latent space, where P = M + Q is the total number of LVs
- A (J × P): full orthonormal loading matrix
The estimation problem is summarized as:

min_{U, Ȳ, A} ‖X − UȲA′‖²

with constraints A′A = I_P, u_ik ∈ {0, 1}, and Σ_k u_ik = 1 for each observation i.
The clustering constraint is interpreted as a De Soete–Carroll reduced K-means in the latent space: equivalently, X = UȲA′ + E. While no explicit scalar “loss + gain” objective is stated, the method can be seen as minimizing the within-cluster sum of squares ‖XA − UȲ‖² in the latent space, while maximizing the PLS criterion of covariance between LVs.
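The equivalence between the full loss and its latent-space within-cluster part follows from the orthonormality constraint A′A = I_P: the loss splits into the within-cluster sum of squares inside the subspace plus the variance left outside it. A short NumPy check of this identity (all quantities randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, J, P, K = 30, 5, 2, 3
X = rng.normal(size=(n, J))

# random orthonormal A (J x P), random memberships U, centroids Ybar
A, _ = np.linalg.qr(rng.normal(size=(J, P)))
U = np.eye(K)[rng.integers(0, K, n)]
Ybar = rng.normal(size=(K, P))

full = np.linalg.norm(X - U @ Ybar @ A.T) ** 2      # ||X - U Ybar A'||^2
latent = np.linalg.norm(X @ A - U @ Ybar) ** 2      # within-subspace K-means loss
resid = np.linalg.norm(X - X @ A @ A.T) ** 2        # variance outside the subspace

# for orthonormal A the cross term vanishes (A'(I - AA') = 0), so:
print(np.isclose(full, latent + resid))  # True
```

This decomposition is why minimizing the full loss simultaneously picks a subspace and a clustering inside it, rather than clustering after a variance-driven projection.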
4. Estimation Algorithm
The algorithm follows a block-coordinate approach reminiscent of Wold’s original PLS, but includes steps for cluster assignment and centroid updating:
- Input: standardized data X, measurement design matrix D_Λ, structural design matrix D_B, and K (number of clusters, typically found using the gap statistic).
- Initialization:
  - Λ = D_Λ (block-diagonal loading structure)
  - Randomly initialize the membership matrix U
  - Ȳ = (U′U)⁻¹U′XΛ (initial centroids)
  - Set iteration counter t = 0, tolerance ε > 0, and maximum iterations T
- Iterate until convergence or maximum iterations:
  - Calculate latent scores: Y = XΛ, with columns scaled to unit variance
  - Compute the covariance matrix of the scores: C = n⁻¹Y′Y
  - Inner weights: E = D_B ∘ sign(C) (centroid scheme)
  - Update scores: Ỹ = YE
  - Update loadings: Λ ∝ X′Ỹ, restricted to the pattern in D_Λ, then re-orthonormalize
  - Update memberships: u_ik = 1 if k = argmin_{k′} ‖y_i − ȳ_{k′}‖², and 0 otherwise
  - Update centroids: Ȳ = (U′U)⁻¹U′Y
  - Check stopping rule: if ‖Y^(t+1) − Y^(t)‖ < ε, halt; otherwise set t ← t + 1 and repeat
- After convergence, estimate path coefficients (B̂, Γ̂) by OLS regression of each endogenous LV on its parent LVs.
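The clustering core of the iteration — alternating cluster assignment, centroid updates, and loading updates — can be sketched as a reduced K-means alternating least squares. This is a simplified sketch that omits the PLS inner-weighting and design-restricted loading steps; the function name and loop structure are illustrative, not the authors' implementation.

```python
import numpy as np

def reduced_kmeans(X, K, P, n_iter=30, seed=0):
    """ALS for the reduced K-means core: min ||X - U Ybar A'||^2 over
    binary memberships U (n x K), centroids Ybar (K x P), and
    orthonormal loadings A (J x P)."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    A = Vt[:P].T                                  # init: top-P principal axes
    Ybar = (X @ A)[rng.choice(n, K, replace=False)]
    losses = []
    for _ in range(n_iter):
        S = X @ A                                 # latent scores (n x P)
        # assignment: nearest centroid in the latent space
        g = ((S[:, None] - Ybar[None]) ** 2).sum(-1).argmin(1)
        U = np.eye(K)[g]
        counts = np.maximum(U.sum(0), 1)[:, None]
        # loadings: top-P eigenvectors of X' P_U X, where P_U projects
        # each row onto its cluster mean
        PUX = U @ ((U.T @ X) / counts)
        _, V = np.linalg.eigh(X.T @ PUX)
        A = V[:, -P:]
        # centroids re-expressed in the updated loadings
        Ybar = (U.T @ (X @ A)) / counts
        losses.append(np.linalg.norm(X - U @ Ybar @ A.T) ** 2)
    return U, Ybar, A, losses

# usage on synthetic data: 3 latent clusters embedded in 5 dimensions
rng = np.random.default_rng(3)
centers = np.array([[0, 0], [10, 0], [0, 10]], float)
lab = np.repeat([0, 1, 2], 50)
latent = centers[lab] + 0.3 * rng.normal(size=(150, 2))
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))      # orthonormal embedding
X = latent @ Q.T + 0.05 * rng.normal(size=(150, 5))
U, Ybar, A, losses = reduced_kmeans(X, K=3, P=2)
```

Each substep (assignment, then jointly optimal loadings and centroids given the memberships) cannot increase the loss, so the iteration converges monotonically to a local minimum.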
5. Assumptions and Computational Considerations
PLS-SEM-KM requires metric data, preferably standardized. No distributional (e.g., normality) assumptions are imposed. The major computational difficulty is the nonconvexity introduced by the binary membership matrix U, which can create local minima; the practical remedy is to perform multiple random starts (typically 10–20), retaining the solution with the highest penalized R² or best cluster-separation criterion.
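The multi-start strategy is generic and can be sketched in a few lines; `toy_fit` here is a hypothetical stand-in for one randomly seeded PLS-SEM-KM run returning a loss and a solution object.

```python
import numpy as np

def best_of_restarts(fit, n_starts=15):
    """Run a nonconvex estimator from several random seeds and keep the
    solution with the lowest loss; fit(seed) returns (loss, solution)."""
    return min((fit(seed) for seed in range(n_starts)), key=lambda r: r[0])

# hypothetical stand-in whose result depends on the seed, to show selection
def toy_fit(seed):
    rng = np.random.default_rng(seed)
    loss = rng.uniform(1.0, 2.0)
    return loss, {"seed": seed}

loss, sol = best_of_restarts(toy_fit, n_starts=15)
```

A criterion to be maximized (e.g., penalized R²) fits the same pattern by negating it before comparison.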
Computational complexity per iteration is roughly O(n(JP + KP)), dominated by score computation and cluster assignment. For common use cases (n in the hundreds to low thousands, tens of manifest variables, and a handful of LVs and clusters), runtime is on the order of seconds.
Selection of the number of clusters K is typically made using the gap statistic, either on PLS scores or directly on the residual sum of squares in the reduced (latent) space.
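A minimal version of the gap statistic (Tibshirani et al.'s comparison of log within-cluster dispersion against uniform reference data) can be sketched with a tiny K-means; the farthest-point seeding and all sizes are illustrative choices, not the source's settings.

```python
import numpy as np

def kmeans_W(X, K, n_iter=25):
    """Tiny Lloyd's algorithm (deterministic farthest-point seeding);
    returns the within-cluster sum of squared distances W."""
    C = X[[0]]
    for _ in range(K - 1):                         # farthest-point seeding
        d = ((X[:, None] - C[None]) ** 2).sum(-1).min(1)
        C = np.vstack([C, X[d.argmax()]])
    for _ in range(n_iter):                        # Lloyd iterations
        g = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([X[g == k].mean(0) if np.any(g == k) else C[k]
                      for k in range(K)])
    return ((X - C[g]) ** 2).sum()

def gap_statistic(X, K_max=4, B=10, seed=0):
    """Gap(K) = mean_b log W_K(uniform reference b) - log W_K(data)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    gaps = []
    for K in range(1, K_max + 1):
        logW = np.log(kmeans_W(X, K))
        ref = [np.log(kmeans_W(rng.uniform(lo, hi, X.shape), K))
               for _ in range(B)]
        gaps.append(np.mean(ref) - logW)
    return np.array(gaps)

# two well-separated clusters: Gap rises sharply at K = 2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(20, 0.5, (100, 2))])
gaps = gap_statistic(X)
```

In PLS-SEM-KM the same computation would be applied to the latent scores (or the reduced-space residual sum of squares) rather than raw data.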
6. Empirical Evaluation
Simulation Studies
Fordellone and Vichi (2019) evaluate PLS-SEM-KM on 7,200 datasets covering various path models, sample sizes, cluster balances, noise levels, and numbers of clusters. Performance is assessed by:
- Penalized R²
- Adjusted Rand Index (ARI) between recovered and true cluster assignments
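The ARI compares two labelings via their contingency table and is invariant to label permutations; a minimal NumPy implementation of the standard formula:

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two cluster labelings a and b."""
    a, b = np.asarray(a), np.asarray(b)
    n = a.size
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    C = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(C, (ia, ib), 1)                      # contingency table
    comb2 = lambda x: x * (x - 1) / 2.0            # pairs within a count
    s_ij = comb2(C).sum()
    s_a, s_b = comb2(C.sum(1)).sum(), comb2(C.sum(0)).sum()
    expected = s_a * s_b / comb2(n)                # chance-level agreement
    maximum = 0.5 * (s_a + s_b)
    return (s_ij - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (label-invariant)
```

An ARI of 1 indicates identical partitions up to relabeling; values near 0 indicate chance-level agreement, which is why it is the natural metric for recovered vs. true cluster assignments.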
Key results:
- PLS-SEM-KM outperforms FIMIX-PLS in nearly all scenarios, often by 10–30 percentage points in ARI
- Sequential PLS → K-means yields ARI around 0.65, while PLS-SEM-KM achieves substantially higher ARI under low noise
- The gap statistic reliably identifies the correct number of clusters
- With moderate noise and 15 random starts, true cluster recovery rate is approximately 90%
Real-World Case Study
Applying PLS-SEM-KM to the European Consumer Satisfaction Index (ECSI) for mobile telephony (24 manifest indicators, 7 latent constructs):
- Data normalized to [0, 100] scale
- Gap statistic indicates K = 3
- Model fit: average communality ≈ 0.59, with acceptable average structural R², GoF, and penalized R²
- Clusters: 1 (high image/expectations/satisfaction), 2 (moderate satisfaction), 3 (low satisfaction), corresponding roughly to latent satisfaction strata
- Recovered structural path coefficients are consistent with literature, but now cluster-adjusted
7. Significance and Applications
PLS-SEM-KM provides a unified framework for extracting both latent causal structures and segmentations directly informed by model structure. Avoiding tandem PLS-SEM and clustering overcomes traditional pitfalls and empirical evidence demonstrates robust improvements in both clustering and path estimation, especially in settings with substantial heterogeneity. The distribution-free nature, computational tractability, and capability to recover and interpret latent classes with respect to structural relationships underscore its value in marketing science, consumer research, and other domains where observed units may comprise structurally distinct subpopulations (Fordellone et al., 2018).