
PLS-SEM-KM: Joint Latent Modeling & Clustering

Updated 2 April 2026
  • PLS-SEM-KM is a methodological framework that simultaneously estimates latent variable path models and identifies homogeneous clusters in heterogeneous data.
  • The approach optimizes measurement, structural, and clustering models jointly, thereby improving path coefficient recovery and cluster recognition over sequential methods.
  • Empirical studies show enhanced performance, with higher Adjusted Rand Index (ARI) and penalized $R^2$ than the traditional two-step approach of PLS-SEM followed by K-means, in both simulations and real-world applications.

Partial Least Squares Structural Equation Modeling with K-Means Clustering (PLS-SEM-KM) is a methodology for simultaneous estimation of latent variable path models and identification of homogeneous clusters in heterogeneous data. It integrates Partial Least Squares Structural Equation Modeling (PLS-SEM) with a reduced K-means component, yielding cluster assignments that are homogeneous in terms of the latent variable model's structural relationships. PLS-SEM-KM addresses critical limitations of sequential “PLS-SEM → K-means” strategies and improves the accuracy of both path coefficient recovery and cluster recognition in data containing latent subpopulations (Fordellone et al., 2018).

1. Motivation and Background

Traditional SEM and its variance-based counterpart PLS-SEM assume a homogeneous underlying population of units. In practice, many real-world applications involve data with hidden latent segmentations (clusters) that introduce heterogeneity. Ignoring such latent segmentation biases parameter estimation and distorts model fit metrics.

A widely employed workaround—the two-step approach—involves (i) fitting a global PLS-SEM to the entire dataset, extracting latent variable (LV) scores, then (ii) applying K-means clustering to those scores. However, prior work (Sarstedt and Ringle, 2010; Vichi and Kiers, 2001) demonstrates that this sequential/tandem approach is sensitive to manifest variables with high variance that do not inform cluster membership. Factor or PLS extraction optimizes total variance directions, whereas clustering aims to maximize between-cluster variance; this misalignment can obscure true clusters.
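The variance/clustering misalignment can be seen in a toy example (illustrative only, not from the cited papers): when a high-variance manifest variable carries no cluster information, the first principal component tracks that noise direction rather than the low-variance dimension that actually separates the groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Cluster-informative dimension: two groups 2 units apart, low variance.
labels = np.repeat([0, 1], n // 2)
informative = labels * 2.0 + rng.normal(0, 0.3, n)
# High-variance dimension that carries no cluster information.
noise = rng.normal(0, 10.0, n)
X = np.column_stack([informative, noise])
X = X - X.mean(axis=0)

# First principal component: the direction of maximal total variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ Vt[0]

# PC1 aligns with the noise axis, so clustering on it misses the groups.
r_inf = abs(np.corrcoef(pc1, informative)[0, 1])
r_noise = abs(np.corrcoef(pc1, noise)[0, 1])
```

Here `r_noise` is near 1 while `r_inf` is near 0: any clustering performed on the extracted component inherits the noise, which is exactly the failure mode the joint approach avoids.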

PLS-SEM-KM circumvents these limitations by jointly optimizing the measurement, structural, and cluster models, so that the extracted latent variable hyperplane reflects both structural relations across variables and the underlying segmentation of units. Empirical evidence indicates superior clustering (higher Adjusted Rand Index, ARI) and improved path recovery compared with two-step and mixture-SEM techniques.

2. Model Structure

PLS-SEM-KM is defined by two standard PLS-SEM submodels and a K-means component:

A. Structural Model (Inner):

Let $n$ denote the number of observations, $H$ the number of endogenous LVs, and $L$ the number of exogenous LVs:

$$\mathbf{H} = \mathbf{H}\,\mathbf{B}^T + \mathbf{\Xi}\,\mathbf{\Gamma}^T + \mathbf{Z},$$

where $\mathbf{H}$ ($n \times H$) and $\mathbf{\Xi}$ ($n \times L$) are the matrices of endogenous and exogenous LV scores, $\mathbf{B}$ and $\mathbf{\Gamma}$ are matrices of path coefficients, and $\mathbf{Z}$ is a matrix of residuals.

B. Measurement Model (Outer):

$$\mathbf{X} = \mathbf{\Xi}\,\mathbf{\Lambda}_\xi^T + \mathbf{H}\,\mathbf{\Lambda}_\eta^T + \mathbf{E}$$

for the reflective mode, where $\mathbf{X}$ ($n \times J$, with $J$ manifest variables) is the manifest variable matrix, $\mathbf{\Lambda}_\xi$ and $\mathbf{\Lambda}_\eta$ are loading matrices, and $\mathbf{E}$ is the measurement error matrix.
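A minimal numpy sketch of data generated according to the two submodels may clarify the notation; the dimensions, path coefficients, and loading values below are illustrative assumptions, not values from the paper. Since $\mathbf{H}$ appears on both sides of the structural equation, it is solved as $\mathbf{H} = (\mathbf{\Xi}\mathbf{\Gamma}^T + \mathbf{Z})(\mathbf{I} - \mathbf{B}^T)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L, H = 500, 1, 2          # observations, exogenous LVs, endogenous LVs

Gamma = np.array([[0.8], [0.0]])   # exogenous -> endogenous paths (H x L)
B = np.array([[0.0, 0.0],
              [0.6, 0.0]])         # endogenous -> endogenous paths (H x H)

Xi = rng.normal(size=(n, L))                  # exogenous LV scores
Z = rng.normal(scale=0.3, size=(n, H))        # structural residuals
# Solve H = H B^T + Xi Gamma^T + Z  =>  H = (Xi Gamma^T + Z)(I - B^T)^{-1}
Eta = (Xi @ Gamma.T + Z) @ np.linalg.inv(np.eye(H) - B.T)

# Reflective measurement model: three indicators per LV, loadings 0.9.
LV = np.hstack([Xi, Eta])                     # n x (L + H) latent scores
Lam = np.kron(np.eye(L + H), 0.9 * np.ones((3, 1)))   # J x (L + H)
E = rng.normal(scale=0.4, size=(n, 3 * (L + H)))
X = LV @ Lam.T + E                            # n x J manifest variables
```

Each block of three indicators loads on exactly one LV, matching the block-diagonal loading structure assumed at initialization in Section 4.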

3. Mathematical Formulation

The core innovation of PLS-SEM-KM is the simultaneous incorporation of a reduced K-means model:

  • $\mathbf{U}$ ($n \times K$): binary cluster membership matrix ($u_{ik} = 1$ if observation $i$ belongs to cluster $k$)
  • $\bar{\mathbf{C}}$ ($K \times (H+L)$): matrix of cluster centroids in the latent space
  • $\mathbf{\Lambda}$ ($J \times (H+L)$): full orthonormal loading matrix

The estimation problem is summarized as minimizing $\|\mathbf{X} - \mathbf{U}\bar{\mathbf{C}}\mathbf{\Lambda}^T\|^2$ with constraints: $\mathbf{\Lambda}^T\mathbf{\Lambda} = \mathbf{I}$, $u_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} u_{ik} = 1$.

The clustering constraint is interpreted as a De Soete–Carroll reduced K-means in the latent space: equivalently, the latent scores $\mathbf{X}\mathbf{\Lambda}$ are partitioned so that $\mathbf{X}\mathbf{\Lambda} \approx \mathbf{U}\bar{\mathbf{C}}$. While no explicit scalar “loss + gain” objective is stated, the method can be seen as minimizing the within-cluster sum of squares in the latent space while maximizing the PLS criterion of covariance between connected LVs.
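The objective above can be evaluated directly; a brief numpy sketch follows (a toy illustration, assuming orthonormal loadings and given memberships). Because $\mathbf{\Lambda}$ is orthonormal, the optimal centroids for fixed $\mathbf{U}$ and $\mathbf{\Lambda}$ are simply the cluster means of the latent scores $\mathbf{X}\mathbf{\Lambda}$.

```python
import numpy as np

def reduced_kmeans_loss(X, U, Lam):
    """Reduced K-means objective ||X - U Cbar Lam^T||_F^2, with centroids
    Cbar set to their optimum (cluster means of the latent scores)."""
    T = X @ Lam                             # n x P latent scores
    sizes = U.sum(axis=0)                   # cluster sizes
    Cbar = (U.T @ T) / sizes[:, None]       # K x P centroids
    return np.linalg.norm(X - U @ Cbar @ Lam.T) ** 2

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
Lam, _ = np.linalg.qr(rng.normal(size=(4, 2)))   # orthonormal loadings
U = np.zeros((10, 3))
U[np.arange(10), np.arange(10) % 3] = 1          # binary memberships
loss = reduced_kmeans_loss(X, U, Lam)
```

The loss decomposes as the within-cluster scatter of the latent scores plus the reconstruction error of the discarded directions, which is why minimizing it ties clustering quality to the choice of $\mathbf{\Lambda}$.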

4. Estimation Algorithm

The algorithm follows a block-coordinate approach reminiscent of Wold’s original PLS, but includes steps for cluster assignment and centroid updating:

  • Input: standardized data matrix $\mathbf{X}$ ($n \times J$), measurement (outer) design matrix, structural (inner) design matrix, and $K$ (number of clusters, typically found using the gap statistic).
  • Initialization:
    • $\mathbf{\Lambda}^{(0)}$: block-diagonal loading structure implied by the measurement design
    • Randomly initialize the membership matrix $\mathbf{U}^{(0)}$
    • Centroids $\bar{\mathbf{C}}^{(0)} = (\mathbf{U}^T\mathbf{U})^{-1}\mathbf{U}^T\mathbf{X}\mathbf{\Lambda}^{(0)}$
    • Iteration counter $t = 0$, tolerance $\varepsilon > 0$, maximum number of iterations
  • Iterate until convergence or maximum iterations:

    1. Calculate latent scores: $\mathbf{T} = \mathbf{X}\mathbf{\Lambda}$, with columns standardized
    2. Compute covariances between the scores of LVs connected in the structural design
    3. Inner weights: set each weight from the covariance between connected LVs, according to the chosen inner weighting scheme
    4. Update scores by inner estimation: replace each LV score with the standardized weighted sum of the scores of its connected LVs
    5. Update loadings by regressing each manifest variable on the score of its LV, then re-orthonormalize $\mathbf{\Lambda}$
    6. Update memberships: assign each observation to the cluster whose centroid is nearest in the latent space
    7. Update centroids: $\bar{\mathbf{C}} = (\mathbf{U}^T\mathbf{U})^{-1}\mathbf{U}^T\mathbf{X}\mathbf{\Lambda}$
    8. Check stopping rule: if the decrease in the objective falls below $\varepsilon$ (or the maximum number of iterations is reached), halt; otherwise set $t \leftarrow t + 1$ and repeat.
  • After convergence, estimate the path coefficients ($\mathbf{B}$, $\mathbf{\Gamma}$) by OLS regression of each endogenous LV on its parents.
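The alternation between loading extraction and cluster assignment can be sketched in numpy. The fragment below is a deliberately simplified reduced-K-means core: it omits the PLS inner/outer weighting steps (2–5 above) and should be read as an illustration of the clustering alternation, not as the full PLS-SEM-KM algorithm.

```python
import numpy as np

def reduced_kmeans(X, K, P, n_iter=50, seed=0):
    """Alternate loading extraction and K-means assignment in the
    reduced space (clustering core only; no PLS weighting steps)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(0, K, n)
    Lam = None
    for _ in range(n_iter):
        U = np.zeros((n, K))
        U[np.arange(n), labels] = 1
        # Loadings: leading eigenvectors of X^T M X, where M projects
        # onto the space spanned by the cluster memberships.
        M = U @ np.linalg.pinv(U.T @ U) @ U.T
        _, vecs = np.linalg.eigh(X.T @ M @ X)
        Lam = vecs[:, ::-1][:, :P]              # P leading eigenvectors
        T = X @ Lam                             # latent scores
        # K-means step in the reduced space: centroids, then reassignment.
        Cbar = np.linalg.pinv(U.T @ U) @ U.T @ T
        d = ((T[:, None, :] - Cbar[None, :, :]) ** 2).sum(-1)
        new_labels = d.argmin(1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, Lam

# Toy data: two well-separated groups along the first coordinate.
rng = np.random.default_rng(3)
X = rng.normal(0, 0.2, (60, 3))
X[30:, 0] += 6.0
X = X - X.mean(axis=0)
labels, Lam = reduced_kmeans(X, K=2, P=1)
```

Because the objective is nonconvex in the binary memberships, practice (as noted in Section 5) is to rerun such a loop from multiple random starts and keep the best solution.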

5. Assumptions and Computational Considerations

PLS-SEM-KM requires metric data, preferably standardized. No distributional (e.g., normality) assumptions are imposed. The major computational difficulty is the nonconvexity introduced by the binary membership matrix $\mathbf{U}$, which can create local minima; the practical solution is to perform multiple random starts (typically 10–20), retaining the solution with the highest penalized $R^2$ or best cluster-separation criterion.

Computational complexity per iteration is dominated by operations linear in the number of observations $n$. For common use cases (a few hundred to a few thousand observations, tens of manifest variables, and a small number of clusters), runtime is on the order of seconds.

Selection of $K$ (number of clusters) is typically made using the gap statistic, either on PLS scores or directly on the residual sum of squares in the reduced (latent) space.
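The gap-statistic selection can be sketched as follows: within-cluster dispersion of the (latent) scores is compared against that of uniform reference data, and the $K$ with the largest gap is chosen. This is a generic implementation under standard assumptions (Tibshirani-style uniform reference over the bounding box), not code from the paper.

```python
import numpy as np

def kmeans_wss(X, K, seed=0, n_iter=100):
    """Lloyd's K-means; returns the total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        lab = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        newC = np.array([X[lab == k].mean(0) if (lab == k).any() else C[k]
                         for k in range(K)])
        if np.allclose(newC, C):
            break
        C = newC
    lab = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
    return ((X - C[lab]) ** 2).sum()

def gap(X, K, B=10, seed=0):
    """Gap statistic: mean log dispersion of uniform reference data
    minus the log dispersion observed in X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    ref = [np.log(kmeans_wss(rng.uniform(lo, hi, X.shape), K, seed=b))
           for b in range(B)]
    return np.mean(ref) - np.log(kmeans_wss(X, K, seed=seed))

# Toy latent scores with two clear clusters: the gap should peak at K = 2.
rng = np.random.default_rng(4)
T = np.vstack([rng.normal(-3, 0.5, (40, 2)), rng.normal(3, 0.5, (40, 2))])
gaps = {K: gap(T, K) for K in (1, 2, 3)}
best = max(gaps, key=gaps.get)
```

In the PLS-SEM-KM workflow, `T` would be the matrix of latent scores (or the reduced-space residuals) rather than raw simulated points.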

6. Empirical Evaluation

Simulation Studies

Fordellone and Vichi (2019) evaluate PLS-SEM-KM on 7,200 datasets covering various path models, sample sizes, cluster balances, noise levels, and numbers of clusters. Performance is assessed by:

  • Penalized $R^2$
  • Adjusted Rand Index (ARI) between recovered and true cluster assignments
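The ARI used above compares two partitions regardless of label naming: it is 1 for identical partitions and near 0 for random agreement. A small self-contained implementation (standard contingency-table formula, written here for illustration):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two labelings: 1 = identical partitions, ~0 = chance."""
    n = len(a)
    pairs = lambda c: sum(comb(v, 2) for v in Counter(c).values())
    both = sum(comb(v, 2) for v in Counter(zip(a, b)).values())
    pa, pb = pairs(a), pairs(b)
    expected = pa * pb / comb(n, 2)     # chance-level pair agreement
    max_index = (pa + pb) / 2
    return (both - expected) / (max_index - expected)

true_labels  = [0, 0, 0, 1, 1, 1]
found_labels = [1, 1, 1, 0, 0, 0]      # same partition, relabeled
ari = adjusted_rand_index(true_labels, found_labels)   # → 1.0
```

Because ARI is invariant to label permutation, a clustering that recovers the true groups under different label names still scores 1.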

Key results:

  • PLS-SEM-KM outperforms FIMIX-PLS in nearly all scenarios, often by 10–30 percentage points in ARI
  • Sequential PLS → K-means yields ARI around 0.65, while PLS-SEM-KM achieves substantially higher ARI under low noise
  • The gap statistic reliably identifies the correct $K$
  • With moderate noise and 15 random starts, the true-cluster recovery rate is approximately 90%

Real-World Case Study

Applying PLS-SEM-KM to the European Consumer Satisfaction Index (ECSI) for mobile telephony ($n = 250$, 24 manifest variables, 7 latent constructs):

  • Data normalized to [0, 100] scale
  • Gap statistic indicates $K = 3$
  • Model fit: average communality $\approx$ 0.59, with satisfactory average structural $R^2$, goodness-of-fit (GoF), and penalized $R^2$
  • Clusters: 1 (high image/expectations/satisfaction), 2 (moderate satisfaction), 3 (low satisfaction), corresponding roughly to latent satisfaction strata
  • Recovered structural path coefficients are consistent with literature, but now cluster-adjusted

7. Significance and Applications

PLS-SEM-KM provides a unified framework for extracting both latent causal structures and segmentations directly informed by model structure. By avoiding the tandem application of PLS-SEM and clustering, it overcomes the traditional pitfalls, and empirical evidence demonstrates robust improvements in both clustering and path estimation, especially in settings with substantial heterogeneity. Its distribution-free nature, computational tractability, and capacity to recover and interpret latent classes with respect to structural relationships underscore its value in marketing science, consumer research, and other domains where observed units may comprise structurally distinct subpopulations (Fordellone et al., 2018).
