Simultaneous Clustering & PLS-SEM
- The paper demonstrates that simultaneous clustering and PLS-SEM yield superior subgroup recovery by jointly optimizing latent variable estimation and clustering assignment.
- The methodology integrates reflective measurement models with structural path coefficients and K-means clustering in a single optimization, achieving higher ARI and penalized R²* benchmarks.
- This unified framework ensures that the latent space supports both measurement validity and causal network structure, offering actionable insights for segmenting complex data.
Simultaneous clustering and Partial Least Squares Structural Equation Modeling (PLS-SEM) refers to a unified statistical framework for unsupervised segmentation and causal modeling, designed for contexts in which true population heterogeneity is not fully captured by the classical assumption that all units are homogeneous with respect to underlying structural relationships. The simultaneous approach—realized in the Partial Least Squares K-Means (PLS-SEM-KM) algorithm—integrates non-hierarchical clustering and reflective PLS-SEM estimation within a single optimization, in contrast to tandem or sequential procedures that separately estimate SEM and then cluster latent scores. This synthesis enables group discovery that is both statistically and substantively aligned with the causal network of latent variables, supporting more robust insights in research fields where between-group heterogeneity in structural equations is critical (Fordellone et al., 2018).
1. Motivation and Conceptual Foundations
Traditional SEM and PLS-SEM assume unit-level homogeneity. Segmentation strategies in the literature typically estimate models for pre-defined groups, or apply clustering algorithms—such as K-means—to latent scores after SEM estimation. However, these sequential (“tandem”) approaches exhibit two central weaknesses: (1) PLS-SEM identifies latent directions maximizing variance in manifest variables, potentially ignoring clustering structure if spurious directions dominate variance; (2) clustering is agnostic to causal relationships, assigning units to clusters purely on proximity in latent space without ensuring homogeneity in SEM pathways (Fordellone et al., 2018).
The simultaneous framework addresses these issues by orienting the latent space both to fit the measurement and structural model and to support cluster recovery. Cluster centroids exist in the PLS-SEM-constrained latent space, ensuring segmentation respects the architecture of the causal network. As a result, both measurement and group structure are recovered more faithfully, especially when manifest variables contain irrelevant noise dimensions or group differences are directly related to structural equation features.
2. Mathematical Formulation
The simultaneous clustering and PLS-SEM model integrates three components: the SEM inner model, the reflective (measurement) outer model, and K-means clustering over the latent space. Let $X$ be the $n \times J$ matrix of standardized manifest variables for $n$ units and $J$ manifest variables. The latent variables consist of exogenous constructs $\Xi$ and endogenous constructs $H$, with associated block loadings $\Lambda$. Matrices $B$ and $\Gamma$ capture the path coefficients among endogenous latent variables and between exogenous and endogenous latent variables, respectively.
Structural Model (Inner Model)
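$$H = H B^{\top} + \Xi \Gamma^{\top} + Z$$
where $H$ contains endogenous latent scores and $\Xi$ exogenous latent scores; $B$ and $\Gamma$ are the path coefficient matrices and $Z$ collects the structural residuals.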
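As a minimal sketch of the inner model, assume the common matrix form $H = HB^{\top} + \Xi\Gamma^{\top} + Z$. The symbols, the function name `simulate_endogenous`, and the toy path values below are illustrative assumptions, not taken from the paper. Under that form, endogenous scores can be generated by solving for $H$ in reduced form, $H = (\Xi\Gamma^{\top} + Z)(I - B^{\top})^{-1}$:

```python
import numpy as np

def simulate_endogenous(Xi, B, Gamma, Z):
    """Return H solving H = H @ B.T + Xi @ Gamma.T + Z (reduced form)."""
    M = B.shape[0]
    rhs = Xi @ Gamma.T + Z  # n x M right-hand side
    # H (I - B^T) = rhs  is equivalent to  (I - B) H^T = rhs^T
    return np.linalg.solve(np.eye(M) - B, rhs.T).T

# Toy network (hypothetical values): xi -> eta1 -> eta2
rng = np.random.default_rng(0)
n = 200
Xi = rng.standard_normal((n, 1))            # one exogenous construct
Gamma = np.array([[0.8], [0.0]])            # xi affects eta1 only
B = np.array([[0.0, 0.0], [0.5, 0.0]])      # eta1 affects eta2
Z = 0.1 * rng.standard_normal((n, 2))       # structural residuals
H = simulate_endogenous(Xi, B, Gamma, Z)
```

The solve step requires $I - B$ to be invertible, which holds for any recursive (acyclic) path network such as the one above.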
Measurement Model (Reflective Outer Model)
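$$X = Y \Lambda^{\top} + E, \qquad Y = [\Xi, H]$$
with $\Lambda^{\top}\Lambda = I$ for identification, where $Y$ stacks the exogenous and endogenous latent scores and $E$ collects the measurement residuals.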
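To illustrate the reflective outer model, the sketch below assumes the form $X = Y\Lambda^{\top} + E$ with orthonormal loadings, and builds a block-structured $\Lambda$ in which each manifest variable loads on exactly one construct. The helper `block_loadings`, the equal within-block weights, and the block sizes are hypothetical choices made only for the example.

```python
import numpy as np

def block_loadings(block_sizes):
    """Build a J x P loading matrix with disjoint blocks and unit-norm columns."""
    J, P = sum(block_sizes), len(block_sizes)
    Lam = np.zeros((J, P))
    start = 0
    for p, size in enumerate(block_sizes):
        # Equal loadings inside the block, scaled so the column has norm 1
        Lam[start:start + size, p] = 1.0 / np.sqrt(size)
        start += size
    return Lam

Lam = block_loadings([3, 2, 4])       # 9 manifest variables, 3 constructs
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 3))      # latent scores
X = Y @ Lam.T                         # noise-free manifest variables
Y_hat = X @ Lam                       # projection back to the latent space
```

Because the blocks do not overlap and each column has unit norm, $\Lambda^{\top}\Lambda = I$ holds by construction, so projecting noise-free data with $X\Lambda$ returns the latent scores exactly.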
Reduced-Space K-means Clustering
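In the latent score space $Y$, assign each unit to the closest centroid:
$$u_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{k'} \lVert y_i - c_{k'} \rVert^2 \\ 0 & \text{otherwise} \end{cases}$$
where $U = [u_{ik}]$ is the $n \times K$ cluster membership indicator matrix, $y_i$ is the latent score vector of unit $i$, and $C$ gives the latent centroids, with rows $c_k$.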
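The assignment rule can be sketched as follows. The function name `assign_to_centroids` and the toy scores are assumptions for illustration; the code only shows the nearest-centroid step in the latent space, not the full algorithm.

```python
import numpy as np

def assign_to_centroids(Y, C):
    """Return (U, labels): binary membership matrix and nearest-centroid index."""
    # Squared Euclidean distance from every unit to every centroid (n x K)
    d2 = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    U = np.zeros((Y.shape[0], C.shape[0]), dtype=int)
    U[np.arange(Y.shape[0]), labels] = 1   # exactly one 1 per row
    return U, labels

Y = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [4.9, 5.2]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
U, labels = assign_to_centroids(Y, C)
print(labels)  # [0 0 1 1]
```

Each row of `U` contains a single 1, which is the cluster constraint referred to in the joint problem.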
Joint Optimization Problem
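The objective is to minimize a combined loss:
$$\min_{U, C, \Lambda, B, \Gamma} \; \lVert Y - UC \rVert^2 + \alpha \, \mathcal{L}_{\mathrm{SEM}}$$
subject to cluster and orthonormality constraints ($U$ binary with a single 1 per row, $\Lambda^{\top}\Lambda = I$), where $\mathcal{L}_{\mathrm{SEM}} = \lVert X - Y\Lambda^{\top} \rVert^2 + \lVert H - HB^{\top} - \Xi\Gamma^{\top} \rVert^2$ is the sum of measurement- and structural-model squared residuals and $\alpha$ weights the SEM term against the clustering term. Alternate forms eliminate nuisance latent variables by expressing the objective only in terms of $X$, $U$, $C$, and $\Lambda$, retaining the core structure for joint optimization (Fordellone et al., 2018).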
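A hedged sketch of how such a combined loss could be evaluated is given below. It assumes the loss is the latent-space K-means term plus a weighted sum of outer and inner squared residuals; the exact weighting, the symbol `alpha`, and the function name `joint_loss` are assumptions for illustration and may differ from the formulation in the paper.

```python
import numpy as np

def joint_loss(X, Xi, H, Lam, B, Gamma, U, C, alpha=1.0):
    """Clustering term plus alpha times (measurement + structural) residuals."""
    Y = np.hstack([Xi, H])                                   # latent scores
    cluster = np.sum((Y - U @ C) ** 2)                       # K-means term
    outer = np.sum((X - Y @ Lam.T) ** 2)                     # measurement model
    inner = np.sum((H - H @ B.T - Xi @ Gamma.T) ** 2)        # structural model
    return cluster + alpha * (outer + inner)

# Noise-free toy case in which every term vanishes
Xi = np.array([[1.0], [1.0], [-1.0], [-1.0]])
B = np.array([[0.0]])
Gamma = np.array([[0.5]])
H = Xi @ Gamma.T
Lam = np.eye(2)
X = np.hstack([Xi, H]) @ Lam.T
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
C = np.array([[1.0, 0.5], [-1.0, -0.5]])
loss_exact = joint_loss(X, Xi, H, Lam, B, Gamma, U, C)
loss_noisy = joint_loss(X + 0.1, Xi, H, Lam, B, Gamma, U, C)
```

In the toy case the data satisfy all three model components exactly, so the loss is zero; perturbing `X` leaves the clustering and structural terms untouched and raises only the measurement term.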
3. Algorithmic Procedure
PLS-SEM-KM employs an alternating optimization reminiscent of Wold's alternating least squares. At each iteration, cluster centroids are updated in the manifest variable space, projected into latent space, and used to generate new assignments. Structural and measurement loadings are updated using inner and outer PLS steps, which rely on two design matrices that specify the block structure of the measurement model and the adjacency of the path network. Cluster reassignment uses the minimum Euclidean distance between each unit's manifest representation and the cluster centers projected via the current loading matrix.
Convergence is typically declared when the change in cluster centroids and loadings falls below a tolerance threshold, or after a maximum number of iterations. Because the membership matrix $U$ is discrete, the objective has local minima, and multiple random starts are used to mitigate them.
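The stopping rule and the multi-start strategy can be illustrated with a stand-in. The sketch below runs plain K-means (Lloyd's algorithm) on fixed scores, not PLS-SEM-KM itself: there are no PLS updates, and the names `kmeans_once` and `best_of_starts`, the tolerance, and the iteration cap are hypothetical defaults.

```python
import numpy as np

def kmeans_once(Y, K, rng, tol=1e-8, max_iter=100):
    """One run of plain K-means from a random start; returns labels, C, loss."""
    C = Y[rng.choice(len(Y), size=K, replace=False)]   # random initial centroids
    for _ in range(max_iter):
        labels = ((Y[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        # Recompute centroids; keep the old one if a cluster is empty
        new_C = np.array([Y[labels == k].mean(0) if np.any(labels == k) else C[k]
                          for k in range(K)])
        shift = np.abs(new_C - C).max()
        C = new_C
        if shift < tol:          # centroids stopped moving
            break
    labels = ((Y[:, None] - C[None]) ** 2).sum(-1).argmin(1)
    loss = ((Y - C[labels]) ** 2).sum()
    return labels, C, loss

def best_of_starts(Y, K, n_starts=10, seed=0):
    """Run several random starts and keep the solution with the lowest loss."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(Y, K, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])

Y = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, C, loss = best_of_starts(Y, K=2)
```

Keeping the run with the smallest loss is the usual way to guard against a poor local minimum; in a full implementation the same wrapper would surround the alternating PLS and clustering updates.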
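Outlined PLS-SEM-KM Procedure:
- Initialize the cluster memberships from a random start.
- Update the cluster centroids in the manifest variable space and project them into the latent space via the current loadings.
- Update the measurement and structural estimates with the outer and inner PLS steps, using the design matrices.
- Reassign each unit to the cluster with the closest projected centroid.
- Repeat the update and reassignment steps until the change in centroids and loadings falls below the tolerance or the maximum number of iterations is reached.
- Repeat from several random starts and retain the best solution.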
After cluster assignments stabilize, path coefficients are estimated via ordinary least squares regression of endogenous latent variables on their parents in the structural model.
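The final regression step can be sketched with ordinary least squares on latent scores. The construct names, the parent lists, and the helper `path_coefficients` are hypothetical, and the example omits an intercept on the assumption that the scores are centered.

```python
import numpy as np

def path_coefficients(scores, parents):
    """OLS of each endogenous construct on its parents (no intercept)."""
    coefs = {}
    for child, pa in parents.items():
        A = np.column_stack([scores[p] for p in pa])           # n x (#parents)
        beta, *_ = np.linalg.lstsq(A, scores[child], rcond=None)
        coefs[child] = dict(zip(pa, beta))
    return coefs

# Noise-free toy scores with known (made-up) path values
rng = np.random.default_rng(0)
image = rng.standard_normal(100)
expectation = rng.standard_normal(100)
satisfaction = 0.7 * image + 0.2 * expectation
loyalty = 0.9 * satisfaction
scores = {"image": image, "expectation": expectation,
          "satisfaction": satisfaction, "loyalty": loyalty}
parents = {"satisfaction": ["image", "expectation"], "loyalty": ["satisfaction"]}
coefs = path_coefficients(scores, parents)
```

With noise-free scores the regression recovers the generating coefficients up to floating-point error; with real scores the same call returns the least-squares estimates.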
4. Parameter Selection and Computational Complexity
The number of clusters $K$ is determined using the Gap statistic or the pseudo-$F$ criterion, applied either to the total deviance explained in $X$ or in the latent space $Y$. Regularization is limited to the latent cluster-SEM tradeoff parameter $\alpha$, though PLS-SEM-KM alternates these objectives without requiring explicit tuning in practical applications.
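As a small illustration of the pseudo-$F$ criterion, the sketch below uses the Calinski-Harabasz form, the ratio of between-cluster to within-cluster deviance, each divided by its degrees of freedom. Whether the paper uses exactly this form is an assumption, and the function name `pseudo_f` and the toy data are made up for the example.

```python
import numpy as np

def pseudo_f(Y, labels):
    """Pseudo-F: [between / (K - 1)] / [within / (n - K)]."""
    labels = np.asarray(labels)
    n, ks = len(Y), np.unique(labels)
    K = len(ks)
    grand = Y.mean(axis=0)
    between = sum((labels == k).sum() * ((Y[labels == k].mean(0) - grand) ** 2).sum()
                  for k in ks)
    within = sum(((Y[labels == k] - Y[labels == k].mean(0)) ** 2).sum() for k in ks)
    return (between / (K - 1)) / (within / (n - K))

Y = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
f_two = pseudo_f(Y, [0, 0, 0, 1, 1, 1])     # the two natural groups
f_three = pseudo_f(Y, [0, 0, 0, 1, 1, 2])   # one group split further
```

On these six points the two-cluster partition scores higher than the three-cluster one, so the criterion would select $K = 2$.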
The per-iteration computational complexity is dominated by the terms involving the manifest variable count $J$ when $J$ is large: the centroid, cluster-assignment, and loading steps scale with the number of units $n$, with $J$, and with $K$ (number of clusters), but $P$ (number of latent variables) and $K$ are typically small by comparison. Total runtime grows with the number of iterations to convergence, and the need for multiple random starts multiplies it accordingly.
Each iteration consists of four steps: centroid computation in the $J$-dimensional manifest space, the latent-score update, the covariance/PLS updates, and cluster assignment.
5. Simulation and Empirical Validation
Extensive simulation studies evaluate PLS-SEM-KM on synthetic data generated using mixtures of Gaussians for exogenous latent scores, with cluster means located at simplex vertices, and structural models with specified path coefficients and noise. Manifest variables include pure noise features to mask group structure.
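A simplified sketch of this kind of data generation is shown below. It draws exogenous scores from a Gaussian mixture centered on the vertices of a scaled standard simplex and appends pure-noise columns; it skips the measurement and structural models, and the function name, `spread`, `scale`, and the sizes are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def simulate_clustered_data(n, K, n_noise, spread=0.3, scale=3.0, seed=0):
    """Gaussian mixture at simplex vertices plus pure-noise columns."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=n)            # true cluster of each unit
    vertices = scale * np.eye(K)                   # K vertices in R^K
    Xi = vertices[labels] + spread * rng.standard_normal((n, K))
    noise = rng.standard_normal((n, n_noise))      # carries no group information
    X = np.hstack([Xi, noise])
    return X, Xi, labels, vertices

X, Xi, labels, vertices = simulate_clustered_data(n=600, K=3, n_noise=4)
```

The noise columns have the same scale as the informative ones, which is what makes a variance-driven tandem analysis liable to pick up directions unrelated to the groups.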
Key metrics:
- Adjusted Rand Index (ARI): cluster recovery relative to ground truth.
- Penalized $R^{2*}$: combines mean $R^2$ for endogenous LVs with between-cluster deviance.
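For reference, the Adjusted Rand Index can be computed from pair counts as in the sketch below, which follows the standard Hubert-Arabie definition. The function name and the toy label vectors are chosen only for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """ARI from the contingency counts of two partitions."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))       # contingency table cells
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)                # chance agreement
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                            # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, relabeled
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that disagree
print(ari_same, ari_cross)  # 1.0 -0.5
```

The index is invariant to relabeling of clusters, equals 1 for identical partitions, and can be negative when agreement is below chance.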
Results:
- PLS-SEM-KM achieves a higher ARI than sequential PLS-SEM followed by K-means, which it outperforms significantly.
- PLS-SEM-KM attains a higher penalized $R^{2*}$ than FIMIX-PLS.
- The Gap statistic recovers the correct number of clusters $K$ in most replicates.
- Under high noise, multiple random starts restore near-perfect recovery (Fordellone et al., 2018).
On real data from the European Consumer Satisfaction Index (ECSI), PLS-SEM-KM identifies three consumer segments, with cluster-specific latent score profiles corresponding to high, medium, and low satisfaction. Model fit statistics (average communality, endogenous $R^2$, and penalized $R^{2*}$) confirm simultaneous achievement of structural and clustering objectives.
6. Substantive Interpretation and Methodological Impact
PLS-SEM-KM facilitates discovery of population segments distinguished by their structure in the SEM causal network, enabling segmentation that is interpretable both statistically and substantively in terms of latent constructs and their measured indicators. The detected clusters are homogeneous with respect to the system of structural relationships. This is illustrated in applications where customer satisfaction, loyalty, and related constructs are linked through a known causal structure; identified clusters manifest coherent SEM pathways, supporting nuanced characterization of subpopulations.
The joint methodology is demonstrated to be more reliable for segmentation than tandem PLS-SEM and clustering or finite mixtures of PLS-SEM (FIMIX-PLS), especially in settings characterized by noisy indicators or structural heterogeneity tied to latent variable relationships. This suggests the simultaneous approach provides a decisive methodological advance in multivariate causal modeling for heterogeneous populations (Fordellone et al., 2018).