
Simultaneous Clustering & PLS-SEM

Updated 2 April 2026
  • The paper demonstrates that simultaneous clustering and PLS-SEM yield superior subgroup recovery by jointly optimizing latent variable estimation and clustering assignment.
  • The methodology integrates reflective measurement models with structural path coefficients and K-means clustering in a single optimization, achieving higher ARI and penalized R² benchmarks.
  • This unified framework ensures that the latent space supports both measurement validity and causal network structure, offering actionable insights for segmenting complex data.

Simultaneous clustering and Partial Least Squares Structural Equation Modeling (PLS-SEM) refers to a unified statistical framework for unsupervised segmentation and causal modeling, designed for contexts in which the classical assumption that all units are homogeneous with respect to the underlying structural relationships does not hold. The simultaneous approach—realized in the Partial Least Squares K-Means (PLS-SEM-KM) algorithm—integrates non-hierarchical clustering and reflective PLS-SEM estimation within a single optimization, in contrast to tandem or sequential procedures that first estimate the SEM and then cluster the latent scores. This synthesis enables group discovery that is both statistically and substantively aligned with the causal network of latent variables, supporting more robust insights in research fields where between-group heterogeneity in structural equations is critical (Fordellone et al., 2018).

1. Motivation and Conceptual Foundations

Traditional SEM and PLS-SEM assume unit-level homogeneity. Segmentation strategies in the literature typically estimate models for pre-defined groups, or apply clustering algorithms—such as K-means—to latent scores after SEM estimation. However, these sequential (“tandem”) approaches exhibit two central weaknesses: (1) PLS-SEM identifies latent directions maximizing variance in manifest variables, potentially ignoring clustering structure if spurious directions dominate variance; (2) clustering is agnostic to causal relationships, assigning units to clusters purely on proximity in latent space without ensuring homogeneity in SEM pathways (Fordellone et al., 2018).

The simultaneous framework addresses these issues by orienting the latent space both to fit the measurement and structural model and to support cluster recovery. Cluster centroids exist in the PLS-SEM-constrained latent space, ensuring segmentation respects the architecture of the causal network. As a result, both measurement and group structure are recovered more faithfully, especially when manifest variables contain irrelevant noise dimensions or group differences are directly related to structural equation features.

2. Mathematical Formulation

The simultaneous clustering and PLS-SEM model structurally integrates three components: the SEM inner model, the measurement (reflective) outer model, and K-means clustering over the latent space. Let $X \in \mathbb{R}^{n \times J}$ be the matrix of standardized manifest variables for $n$ units and $J$ manifest variables. The $P = H + L$ latent variables consist of $H$ exogenous and $L$ endogenous constructs, with associated block loadings $\Lambda = [\Lambda_H, \Lambda_L]$. The matrices $B$ and $\Gamma$ collect the path coefficients among the endogenous latent variables and from the exogenous to the endogenous latent variables, respectively.

Structural Model (Inner Model)

$$H = H B^{T} + \Xi \Gamma^{T} + Z$$

where $H$ is the matrix of endogenous latent scores, $\Xi$ is the matrix of exogenous latent scores, and $Z$ collects the structural residuals.

Measurement Model (Reflective Outer Model)

$$X = [\Xi, H]\,\Lambda^{T} + E = \Xi \Lambda_H^{T} + H \Lambda_L^{T} + E$$

with $E$ the $n \times J$ matrix of measurement residuals and the identification constraint $\Lambda^{T}\Lambda = I$ (columnwise orthonormal loadings).
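
To make these two equations concrete, the following numpy sketch simulates latent scores from the inner model and builds manifest variables through block loadings. The dimensions (two exogenous constructs, one endogenous construct, three indicators per block) and all numeric values are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, H_exo, L_endo = 500, 2, 1                 # units, exogenous LVs, endogenous LVs (assumed)

# Exogenous latent scores Xi (n x H) and path coefficient matrices
Xi = rng.normal(size=(n, H_exo))
Gamma = np.array([[0.6, 0.3]])               # exogenous -> endogenous paths (L x H), assumed values
B = np.zeros((L_endo, L_endo))               # endogenous -> endogenous paths (none here)

# Inner model H = H B^T + Xi Gamma^T + Z, solved for H
Z = 0.3 * rng.normal(size=(n, L_endo))
H_scores = (Xi @ Gamma.T + Z) @ np.linalg.inv(np.eye(L_endo) - B.T)

# Outer reflective model X = [Xi, H] Lambda^T + E with block-structured loadings
scores = np.hstack([Xi, H_scores])           # n x P latent score matrix
blocks = [3, 3, 3]                           # indicators per construct (assumed)
Lambda = np.zeros((sum(blocks), H_exo + L_endo))
row = 0
for p, b in enumerate(blocks):
    Lambda[row:row + b, p] = 0.8             # equal within-block loadings (assumed)
    row += b
X = scores @ Lambda.T + 0.4 * rng.normal(size=(n, sum(blocks)))
X = (X - X.mean(0)) / X.std(0)               # standardized manifest variables, as in the model setup
```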

Reduced-Space K-means Clustering

In the latent score space spanned by $[\Xi, H] \in \mathbb{R}^{n \times P}$, assign each unit to the closest centroid:

$$u_{ik} = 1 \ \text{ if } \ k = \arg\min_{k'} \lVert [\xi_i, h_i] - c_{k'} \rVert^2, \qquad u_{ik} = 0 \ \text{ otherwise},$$

where $U = (u_{ik}) \in \{0,1\}^{n \times K}$ is the cluster membership indicator matrix, $[\xi_i, h_i]$ is the $i$-th row of the latent score matrix, and $C \in \mathbb{R}^{K \times P}$ (with rows $c_k$) gives the latent centroids.
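
A minimal sketch of this assignment step, assuming a latent score matrix `scores` (n x P) and a centroid matrix `C` (K x P) such as those defined above:

```python
import numpy as np

def assign_clusters(scores, C):
    """Assign each unit to the nearest latent centroid and return the indicator matrix U."""
    d2 = ((scores[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # n x K squared distances
    labels = d2.argmin(axis=1)                                     # index of the closest centroid
    U = np.eye(C.shape[0])[labels]                                 # n x K binary membership matrix
    return U, labels
```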

Joint Optimization Problem

The objective is to minimize a combined loss:

$$\min_{U,\, C,\, \Lambda,\, B,\, \Gamma} \; \lVert X - U C \Lambda^{T} \rVert^2 + \lVert H - H B^{T} - \Xi \Gamma^{T} \rVert^2$$

subject to the membership constraints on $U$ and the orthonormality constraint on $\Lambda$; the objective is the sum of the measurement-model and structural-model squared residuals. Alternate forms eliminate the nuisance latent-score matrices by expressing the objective only in terms of $X$, $U$, $C$, and $\Lambda$, retaining the core structure for joint optimization (Fordellone et al., 2018).
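
Under the reconstruction given above, the combined loss can be evaluated with a single function; equal weighting of the two residual terms is an assumption for illustration, and the arguments mirror the symbols of this section (U, C, Λ, Ξ, H, B, Γ).

```python
import numpy as np

def joint_loss(X, U, C, Lam, Xi, H, B, Gamma):
    """Sum of measurement-model and structural-model squared residuals (unweighted)."""
    E = X - U @ C @ Lam.T                    # measurement residuals under the clustered latent structure
    Z = H - H @ B.T - Xi @ Gamma.T           # structural residuals of the inner model
    return float((E ** 2).sum() + (Z ** 2).sum())
```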

3. Algorithmic Procedure

PLS-SEM-KM employs an alternating optimization reminiscent of Wold’s alternating least squares. At each iteration, cluster centroids are updated in the manifest variable space, projected into latent space, and used to generate new assignments. Structural and measurement loadings are updated using inner and outer PLS steps, leveraging the outer and inner design matrices that specify the block structure (indicator-to-construct assignment) and the network adjacency among latent variables. Cluster reassignment uses the minimum Euclidean distance between each unit’s projection through the current loading matrix and the similarly projected cluster centers.

Convergence is typically declared when the change in cluster centroids and loadings falls below a small tolerance threshold, or after a maximum number of iterations. Owing to the discrete nature of the membership matrix U, multiple random starts mitigate local minima.

Outlined PLS-SEM-KM Procedure:

1. Initialize a random partition of the n units into K clusters (or random centroids).
2. Update measurement loadings and latent scores via the outer and inner PLS steps, respecting the block and path design matrices.
3. Update cluster centroids in the manifest space and project them into the latent space via the current loadings.
4. Reassign each unit to the nearest projected centroid (minimum Euclidean distance).
5. Repeat steps 2-4 until centroids and loadings stabilize or the iteration limit is reached; retain the best of multiple random starts.

After cluster assignments stabilize, path coefficients are estimated via ordinary least squares regression of endogenous latent variables on their parents in the structural model.
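
The Python sketch below mirrors this alternation in simplified form: block loadings are re-estimated from the cluster-structured data via a first principal component, centroids are computed in the manifest space and projected through the current loadings, and path coefficients are obtained by OLS after convergence. It is an illustrative reading of the procedure, not the authors' published implementation; the multi-start, tolerance, and initialization choices are assumptions.

```python
import numpy as np

def pls_sem_km_sketch(X, blocks, exo, endo, K, n_starts=20, max_iter=100, tol=1e-6, seed=0):
    """Simplified alternating sketch of simultaneous clustering and PLS-SEM.
    blocks: list of column-index lists (one per construct); exo/endo: construct indices."""
    rng = np.random.default_rng(seed)
    X = (X - X.mean(0)) / X.std(0)
    n, J = X.shape
    best = None
    for _ in range(n_starts):
        labels = rng.integers(K, size=n)                      # random initial partition
        prev = np.inf
        for _ in range(max_iter):
            U = np.eye(K)[labels]                             # n x K indicator matrix
            Nk = np.maximum(U.sum(0), 1.0)
            Xbar = (U.T @ X) / Nk[:, None]                    # centroids in the manifest space (K x J)
            # outer step: block loadings from the cluster-structured data U @ Xbar
            Lam = np.zeros((J, len(blocks)))
            for p, cols in enumerate(blocks):
                _, _, Vt = np.linalg.svd((U @ Xbar)[:, cols], full_matrices=False)
                Lam[cols, p] = Vt[0] / np.linalg.norm(Vt[0])  # unit-norm loadings for block p
            # project units and centroids into the latent space, then reassign
            scores, C = X @ Lam, Xbar @ Lam
            d2 = ((scores[:, None, :] - C[None, :, :]) ** 2).sum(2)
            labels, loss = d2.argmin(1), d2.min(1).sum()
            if prev - loss < tol:
                break
            prev = loss
        if best is None or loss < best[0]:
            best = (loss, labels.copy(), scores.copy())
    loss, labels, scores = best
    # structural step: OLS of the endogenous latent scores on their exogenous parents
    paths, *_ = np.linalg.lstsq(scores[:, exo], scores[:, endo], rcond=None)
    return labels, scores, paths
```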

4. Parameter Selection and Computational Complexity

The number of clusters K is determined using the Gap statistic or a pseudo-F criterion, applied either to the total deviance explained in the manifest-variable space or in the latent score space. Regularization is limited to a tradeoff parameter balancing the clustering and SEM objectives, though PLS-SEM-KM alternates these objectives without requiring explicit tuning in practical applications.
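
One common concrete realization of the pseudo-F idea is the Calinski-Harabasz index applied to the latent scores; the sketch below uses scikit-learn, and both the tooling and the candidate range for K are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def select_K(latent_scores, K_range=range(2, 9), seed=0):
    """Choose K by maximizing the pseudo-F (Calinski-Harabasz) index on the latent scores."""
    crit = {K: calinski_harabasz_score(
                latent_scores,
                KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(latent_scores))
            for K in K_range}
    return max(crit, key=crit.get), crit
```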

The per-iteration cost is dominated by operations linear in nJ when the manifest-variable count J is large: the centroid, assignment, and loading updates scale with n, J, and K (number of clusters), while P (number of latent variables) and K are typically small relative to J. Iteration counts to convergence are typically modest, and the need for multiple random starts multiplies total runtimes accordingly.

| Step | Computational order | Dominant inputs |
|---|---|---|
| Centroids (J-space) | O(nJ) | n, J, K |
| Latent-score update | O(nJP) | X, Λ |
| Covariance/PLS updates | O(nJP) | X, latent scores, design matrices |
| Cluster assignment | O(nKP) | latent scores, projected centroids, K |

5. Simulation and Empirical Validation

Extensive simulation studies evaluate PLS-SEM-KM on synthetic data generated using mixtures of Gaussians for the exogenous latent scores, with cluster means located at simplex vertices, and structural models with specified path coefficients and noise. Manifest variables include pure-noise features that mask the group structure.
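
A data generator in the spirit of that design is sketched below; the number of clusters, path coefficients, loading values, and noise scales are illustrative assumptions rather than the paper's simulation settings.

```python
import numpy as np

def make_clustered_sem_data(n=300, K=3, noise_cols=4, seed=0):
    """Synthetic data in the spirit of the simulation design: exogenous latent scores drawn
    from a Gaussian mixture with means at (scaled) simplex vertices, one endogenous LV driven
    by assumed structural paths, reflective indicators, and pure-noise manifest variables."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=n)                                # true cluster labels
    means = 3.0 * np.eye(K)                                    # cluster means at scaled simplex vertices
    Xi = means[z] + rng.normal(scale=0.5, size=(n, K))         # exogenous latent scores (n x K)
    gamma = np.full(K, 0.5)                                    # assumed exogenous -> endogenous paths
    eta = Xi @ gamma + 0.3 * rng.normal(size=n)                # endogenous latent scores
    scores = np.column_stack([Xi, eta])                        # n x (K + 1) latent scores
    P = K + 1
    Lam = np.zeros((3 * P + noise_cols, P))
    for p in range(P):
        Lam[3 * p:3 * p + 3, p] = 0.8                          # three reflective indicators per construct
    X = scores @ Lam.T + 0.4 * rng.normal(size=(n, 3 * P + noise_cols))
    return X, z, scores                                        # the last noise_cols columns are pure noise
```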

Key metrics:

  • Adjusted Rand Index (ARI): cluster recovery relative to ground truth.
  • Penalized R²: combines the mean R² of the endogenous latent variables with the between-cluster deviance (illustrated in the sketch below).
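
The ARI computation below uses scikit-learn; for the penalized R², only its two ingredients (mean endogenous R² and the between-cluster share of latent deviance) are computed separately, since the exact penalization formula from the paper is not reproduced here. The function signature is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def evaluate(true_labels, est_labels, scores, exo, endo):
    """Cluster recovery (ARI) plus the two ingredients of a penalized-R^2-style summary."""
    ari = adjusted_rand_score(true_labels, est_labels)
    # mean R^2 of the endogenous latent variables regressed on the exogenous ones (OLS)
    Xi, Hs = scores[:, exo], scores[:, endo]
    coef, *_ = np.linalg.lstsq(Xi, Hs, rcond=None)
    resid = Hs - Xi @ coef
    r2 = 1.0 - (resid ** 2).sum(0) / ((Hs - Hs.mean(0)) ** 2).sum(0)
    # between-cluster share of the total deviance of the latent scores
    grand = scores.mean(0)
    between = sum((est_labels == k).sum() * ((scores[est_labels == k].mean(0) - grand) ** 2).sum()
                  for k in np.unique(est_labels))
    total = ((scores - grand) ** 2).sum()
    return ari, float(np.mean(r2)), between / total
```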

Results:

  • PLS-SEM-KM achieves substantially higher ARI than sequential PLS-SEM followed by K-means on the latent scores.
  • On the penalized R² criterion, PLS-SEM-KM also outperforms FIMIX-PLS.
  • The Gap statistic recovers the correct number of clusters K in the large majority of replicates.
  • Under high noise, increasing the number of random starts restores near-perfect cluster recovery (Fordellone et al., 2018).

On real data from the European Customer Satisfaction Index (ECSI), PLS-SEM-KM identifies three consumer segments, with cluster-specific latent score profiles corresponding to high, medium, and low satisfaction. Model fit statistics (average communality, endogenous R², penalized R²) confirm simultaneous achievement of the structural and clustering objectives.

6. Substantive Interpretation and Methodological Impact

PLS-SEM-KM facilitates discovery of population segments distinguished by their structure in the SEM causal network, enabling segmentation that is interpretable both statistically and substantively in terms of latent constructs and their measured indicators. The detected clusters are homogeneous with respect to the system of structural relationships. This is illustrated in applications where customer satisfaction, loyalty, and related constructs are linked through a known causal structure; identified clusters manifest coherent SEM pathways, supporting nuanced characterization of subpopulations.

The joint methodology is demonstrated to be more reliable for segmentation than tandem PLS-SEM and clustering or finite mixtures of PLS-SEM (FIMIX-PLS), especially in settings characterized by noisy indicators or structural heterogeneity tied to latent variable relationships. This suggests the simultaneous approach provides a decisive methodological advance in multivariate causal modeling for heterogeneous populations (Fordellone et al., 2018).
