Tensor Latent Dirichlet Allocation (TLDA)

Updated 18 November 2025
  • TLDA is a spectral algorithm that leverages low-order moment tensors and whitening techniques to recover latent topic structures.
  • It employs tensor power methods and joint diagonalization to overcome non-convexity issues found in traditional LDA inference.
  • The approach is computationally efficient and scalable, providing provable, globally consistent parameter recovery for large and sparse text corpora.

Tensor Latent Dirichlet Allocation (TLDA) is a class of spectral algorithms for learning the parameters of Latent Dirichlet Allocation (LDA) models using low-order moment tensors and their decompositions. TLDA leverages algebraic properties of the observable moments under LDA to obtain provable and computationally efficient parameter estimators, bypassing the non-convexity and local optima issues of traditional likelihood-based inference. The core methodology involves centering and whitening empirical moment tensors, then applying tensor decomposition (e.g., robust tensor power method or joint diagonalization) to recover the topic–word matrix and Dirichlet prior. TLDA generalizes the singular value decomposition for matrices to higher-order tensors, yielding globally consistent estimators for LDA under mild identifiability conditions.

1. Moment Tensor Characterization of LDA Models

Under LDA, each document is generated by sampling a topic mixture $h \sim \mathrm{Dir}(\boldsymbol{\alpha})$ (with $\alpha_0 = \sum_j \alpha_j$), followed by i.i.d. draws of words according to $P[w = e_v] = \sum_j h_j \mu_{j,v}$, where $\mu_j \in \Delta^{d-1}$ are topic-word distributions and $d$ is the vocabulary size. This generative process admits explicit population moments for the one-hot indicators $w_1, w_2, w_3$ of distinct words within a document:

  • First-order moment: $\mathbf{m}_1 = \mathbb{E}[w_1] = \sum_{j=1}^k (\alpha_j / \alpha_0)\, \mu_j$.
  • Centered second-order moment: $M_2 = \mathbb{E}[w_1 \otimes w_2] - (\alpha_0/(\alpha_0+1))\, \mathbf{m}_1 \otimes \mathbf{m}_1 = \sum_{j=1}^k (\alpha_j / (\alpha_0(\alpha_0+1)))\, \mu_j \otimes \mu_j$.
  • Centered third-order moment: $M_3 = \mathbb{E}[w_1 \otimes w_2 \otimes w_3] - (\alpha_0/(\alpha_0+2))\big(\mathbb{E}[w_1 \otimes w_2] \otimes \mathbf{m}_1 + \text{perms.}\big) + (2\alpha_0^2/((\alpha_0+1)(\alpha_0+2)))\, \mathbf{m}_1^{\otimes 3} = \sum_{j=1}^k (2\alpha_j / (\alpha_0(\alpha_0+1)(\alpha_0+2)))\, \mu_j^{\otimes 3}$.

These moments encode the model structure: both $M_2$ and $M_3$ are explicit symmetric sums of rank-one terms determined by $\{\mu_j\}$, with weights fixed by the Dirichlet parameters (Anandkumar et al., 2012, Do et al., 29 Sep 2025, Ruffini et al., 2016). A minimal sketch of estimating these moments from data follows.
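
The centered moments above translate directly into estimators over a document–term count matrix. Below is a minimal NumPy sketch, not taken from the cited papers, of forming $\hat{\mathbf{m}}_1$ and $\hat{M}_2$ from per-document word counts using pairs of distinct word positions; the function name and arguments (`counts`, `alpha0`) are illustrative, and $\alpha_0$ is assumed known or fixed in advance.

```python
import numpy as np

def empirical_lda_moments(counts, alpha0):
    """Estimate m1 and the centered M2 from a (num_docs x vocab_size) count matrix.

    Cross moments are taken over pairs of distinct word positions within each
    document (every document is assumed to contain at least two words), which
    removes the diagonal bias of naive outer products.
    """
    counts = np.asarray(counts, dtype=float)
    doc_len = counts.sum(axis=1, keepdims=True)              # words per document
    m1 = (counts / doc_len).mean(axis=0)                      # estimate of E[w1]

    # E[w1 (x) w2]: per document (c c^T - diag(c)) / (L (L - 1)), averaged over documents
    pair_scale = doc_len * (doc_len - 1.0)
    scaled = counts / pair_scale
    e_w1w2 = (counts.T @ scaled - np.diag(scaled.sum(axis=0))) / counts.shape[0]

    M2 = e_w1w2 - (alpha0 / (alpha0 + 1.0)) * np.outer(m1, m1)
    return m1, M2
```

The contractions of the third moment needed later can be estimated in the same spirit without ever materializing a dense $d \times d \times d$ tensor.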

2. Whitening and Reduction to Orthogonal Tensor Decomposition

To facilitate decomposition, TLDA whitens the second moment. Let $M_2 = U D U^\top$ be the top-$k$ eigen-decomposition. The whitening matrix $W = U D^{-1/2}$ guarantees $W^\top M_2 W = I_k$. Applying $W$ to all modes, the whitened third-order tensor is $T = M_3(W, W, W) = \sum_{j=1}^k \lambda_j v_j^{\otimes 3}$, where the vectors $v_j = \sqrt{w_j}\, W^\top \mu_j$ (with $w_j = \alpha_j/(\alpha_0(\alpha_0+1))$ the weight of $\mu_j$ in $M_2$) are orthonormal and the $\lambda_j$ are scalar weights. The resulting tensor admits a symmetric CP decomposition into $k$ rank-one orthogonal factors (Anandkumar et al., 2012, Do et al., 29 Sep 2025, Ruffini et al., 2016).

Whitening renders subsequent tensor operations tractable: all heavy linear algebra is confined to $k \times k \times k$ tensors, avoiding the curse of dimensionality ($d \gg k$). In practice, whitening and forming $T$ can be performed using randomized SVD techniques and sparse updates for scalability (Kangaslahti et al., 11 Nov 2025).
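
As a concrete illustration of this step, the sketch below, which assumes small dense inputs purely for readability (`M2` as a $d \times d$ array and `M3` as a $d \times d \times d$ array), computes $W = U D^{-1/2}$ from the top-$k$ eigenpairs of $\hat{M}_2$ and contracts $\hat{M}_3$ along all three modes; practical implementations contract implicitly, as noted above.

```python
import numpy as np

def whiten(M2, M3, k):
    """Whiten M2 and form the k x k x k tensor T = M3(W, W, W).

    M2: (d, d) centered second moment; M3: (d, d, d) centered third moment.
    Dense M3 is used here only for clarity; practical implementations contract
    M3 against W implicitly without ever materializing a d^3 array.
    """
    eigvals, eigvecs = np.linalg.eigh(M2)
    idx = np.argsort(eigvals)[::-1][:k]                  # top-k eigenpairs
    U, D = eigvecs[:, idx], eigvals[idx]
    W = U / np.sqrt(D)                                   # W = U D^{-1/2}, so W^T M2 W = I_k

    # T_{abc} = sum_{ijl} M3_{ijl} W_{ia} W_{jb} W_{lc}
    T = np.einsum('ijl,ia,jb,lc->abc', M3, W, W, W)
    return W, U, D, T
```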

3. Tensor Power Method and Joint Diagonalization Algorithms

Once the whitened tensor $T$ is formed, TLDA extracts components using spectral methods. The robust tensor power method iteratively contracts and deflates $T$ to recover the pairs $(\lambda_j, v_j)$; a hedged code sketch follows the steps below:

  • Initialize a random unit vector $\theta_0$.
  • Loop: $\theta_t \leftarrow T(I, \theta_{t-1}, \theta_{t-1}) / \|T(I, \theta_{t-1}, \theta_{t-1})\|$ until convergence.
  • Deflate: $T \leftarrow T - \lambda_j v_j^{\otimes 3}$ after each extracted component (Anandkumar et al., 2012, Ruffini et al., 2016).

Random restarts ensure quadratic convergence to each $v_j$ with high probability. Stopping criteria involve the cubic form $T(\theta, \theta, \theta)$ and contraction norms. Algorithmic variants include joint diagonalization approaches, particularly when connecting LDA to discrete ICA via Gamma–Poisson models, where joint diagonalization of projected matrices yields the components more stably, especially under sparse or variable document-length regimes (Podosinnikova et al., 2015).
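
Below is a hedged sketch of the power iteration with deflation; it is simplified relative to the robust variant of Anandkumar et al. (2012), which prescribes specific restart counts and iteration budgets, but it follows the same contract–normalize–deflate pattern.

```python
import numpy as np

def tensor_power_method(T, k, n_restarts=10, n_iter=100, tol=1e-8, rng=None):
    """Recover (lambda_j, v_j) from a whitened k x k x k symmetric tensor T.

    Simplified sketch: for each component, run several randomly initialized
    power iterations, keep the candidate with the largest cubic form
    T(theta, theta, theta), then deflate.
    """
    rng = np.random.default_rng(rng)
    T = T.copy()
    lambdas, vecs = [], []
    for _ in range(k):
        best_val, best_theta = -np.inf, None
        for _ in range(n_restarts):
            theta = rng.standard_normal(T.shape[0])
            theta /= np.linalg.norm(theta)
            for _ in range(n_iter):
                new = np.einsum('ijl,j,l->i', T, theta, theta)      # T(I, theta, theta)
                new /= np.linalg.norm(new)
                converged = np.linalg.norm(new - theta) < tol
                theta = new
                if converged:
                    break
            val = np.einsum('ijl,i,j,l->', T, theta, theta, theta)  # T(theta, theta, theta)
            if val > best_val:
                best_val, best_theta = val, theta
        lambdas.append(best_val)
        vecs.append(best_theta)
        # deflate: remove the recovered rank-one component
        T -= best_val * np.einsum('i,j,l->ijl', best_theta, best_theta, best_theta)
    return np.array(lambdas), np.array(vecs)   # vecs[j] is v_j
```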

4. Parameter Recovery and Identifiability Guarantees

Each recovered pair $(v_j, \lambda_j)$ is mapped back ("unwhitened") to the original parameter space (a code sketch appears below):

  • $\mu_j \approx (W^\top)^{+} v_j \,/\, \|(W^\top)^{+} v_j\|_1$ (topic-word distribution), where $(W^\top)^{+} = U D^{1/2}$ is the unwhitening (pseudo-inverse) map.
  • $\alpha_j \propto \|(W^\top)^{+} v_j\|_1^2$, or equivalently $\alpha_j \propto \lambda_j^{-2}$ via the explicit cubic-form relations.

The procedure yields unique recovery up to permutation and sign, provided the topic-word matrix $\Phi$ (or $\Theta$) is full column rank (Kruskal rank $r$) and $\alpha_j > 0$. TLDA algorithms are globally identifiable: the CP decomposition is unique under mild conditions, and the mapping from the discrete LDA model $G = \sum_j \alpha_j \delta_{\mu_j}$ to the observed distribution is one-to-one if $N \geq \lceil (2K-1)/(r-1) \rceil$, where $N$ is the document length (Do et al., 29 Sep 2025).
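
A minimal sketch of this recovery step is given below, assuming the outputs of the earlier whitening and power-method sketches (`U`, `D`, `vecs`) and a known concentration $\alpha_0$; rescaling the $\alpha_j$ so that $\sum_j \alpha_j = \alpha_0$ is one common convention for fixing the proportionality constant, not a prescription from the cited papers.

```python
import numpy as np

def unwhiten(U, D, vecs, alpha0):
    """Map whitened factors back to topic-word distributions and Dirichlet weights.

    (W^T)^+ = U D^{1/2} is the unwhitening map; each column is normalized onto
    the simplex to give mu_j, and alpha_j is taken proportional to the squared
    L1 norm of the unwhitened vector, rescaled to sum to alpha0.
    """
    B = U * np.sqrt(D)                               # (W^T)^+ = U D^{1/2}, shape (d, k)
    raw = B @ vecs.T                                 # column j is (W^T)^+ v_j
    raw = raw * np.sign(raw.sum(axis=0, keepdims=True))   # resolve residual sign ambiguity
    col_l1 = np.abs(raw).sum(axis=0)

    mu = np.clip(raw, 0.0, None)                     # clip small negative entries from noise
    mu = mu / mu.sum(axis=0, keepdims=True)          # project columns onto the simplex

    alpha = alpha0 * col_l1**2 / (col_l1**2).sum()   # alpha_j proportional to ||(W^T)^+ v_j||_1^2
    return mu, alpha
```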

If the topic number is over-specified ($K > K_0$), TLDA is robust: extra components collapse or receive zero mass at the optimum. Empirical moment estimators converge at rate $O_p(1/\sqrt{N})$ (Ruffini et al., 2016).

5. Computational Complexity, Scalability, and Sample Efficiency

TLDA algorithms scale favorably:

  • Whitened moment computations: $O(dk^2)$ for the range basis, $O(k^3)$ for the SVD/Cholesky factorization.
  • Tensor contraction and deflation: $O(k^5 \log k + k^3 \log(1/\epsilon))$ for the power method (Anandkumar et al., 2012, Ruffini et al., 2016).
  • Overall runtime is $O(Nd + d^2 k + k^5)$, dominated by the tensor formation and decomposition steps. SVDs live in $k$-dimensional space; memory requirements do not scale with $d^3$.
  • Sample complexity to guarantee $\|\hat{\mu}_i - \mu_i\|_2 \leq \epsilon$: $n = \tilde{O}(k^2 / \lambda_{\min}^6)$ for a fixed spectral gap $\lambda_{\min}$ (Anandkumar et al., 2012, Do et al., 29 Sep 2025).

Online TLDA implementations achieve strictly linear scaling in $N$ (the number of documents) through incremental PCA and stochastic tensor SGD. GPU-accelerated TLDA achieves a 3–4× speedup over parallel CPU LDA and can process corpora of up to a billion documents without forming explicit $d^3$ tensors (Kangaslahti et al., 11 Nov 2025).
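
The cited online and GPU systems are described here only at a high level; the sketch below illustrates the general idea of a streaming moment estimator that folds in one minibatch of documents at a time (the class and method names are hypothetical, and the dense $d \times d$ second moment is kept only for clarity).

```python
import numpy as np

class StreamingMoments:
    """Running estimates of m1 and E[w1 (x) w2], updated one minibatch at a time.

    Illustrative only: the cited online/GPU TLDA systems additionally update the
    whitening matrix incrementally and optimize the decomposition with stochastic
    gradient steps rather than storing dense moments.
    """
    def __init__(self, vocab_size):
        self.n_docs = 0
        self.m1 = np.zeros(vocab_size)
        self.e_w1w2 = np.zeros((vocab_size, vocab_size))

    def update(self, counts):
        counts = np.asarray(counts, dtype=float)         # (batch, vocab), doc length >= 2
        L = counts.sum(axis=1, keepdims=True)
        batch = counts.shape[0]
        batch_m1 = (counts / L).mean(axis=0)
        scaled = counts / (L * (L - 1.0))
        batch_e = (counts.T @ scaled - np.diag(scaled.sum(axis=0))) / batch

        # running averages weighted by the number of documents seen so far
        w_old = self.n_docs / (self.n_docs + batch)
        w_new = batch / (self.n_docs + batch)
        self.m1 = w_old * self.m1 + w_new * batch_m1
        self.e_w1w2 = w_old * self.e_w1w2 + w_new * batch_e
        self.n_docs += batch

    def centered_M2(self, alpha0):
        return self.e_w1w2 - (alpha0 / (alpha0 + 1.0)) * np.outer(self.m1, self.m1)
```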

6. Extensions, Robustness, and Empirical Findings

TLDA methodology extends to discrete independent component analysis (ICA) models via Gamma–Poisson or Poisson-multinomial connections. Cumulant-based tensors under variable document lengths improve sample efficiency, especially for sparse topics; joint diagonalization techniques (JD) further increase robustness to model mis-specification and improve finite-sample performance (Podosinnikova et al., 2015).
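
As a simplified illustration of the diagonalization idea, rather than the full orthogonal joint diagonalization procedure of Podosinnikova et al. (2015), one can contract the whitened tensor with a single random vector and eigendecompose the resulting $k \times k$ matrix, reading off all components at once:

```python
import numpy as np

def single_contraction_diagonalization(T, rng=None):
    """Recover factors of a whitened symmetric tensor via one random contraction.

    T(I, I, g) = sum_j lambda_j <v_j, g> v_j v_j^T, so its eigenvectors are the v_j
    whenever the values lambda_j <v_j, g> are distinct (which holds generically).
    This single-contraction version is only a sketch of the idea.
    """
    rng = np.random.default_rng(rng)
    g = rng.standard_normal(T.shape[0])
    M = np.einsum('ijl,l->ij', T, g)           # T(I, I, g), a k x k symmetric matrix
    _, eigvecs = np.linalg.eigh(M)
    vecs = eigvecs.T                           # rows are candidate v_j (up to sign)
    lambdas = np.einsum('ijl,ki,kj,kl->k', T, vecs, vecs, vecs)   # lambda_j = T(v_j, v_j, v_j)
    return lambdas, vecs
```

Jointly diagonalizing several such contractions $T(I, I, g_1), \dots, T(I, I, g_m)$, rather than a single one, is what gives the joint diagonalization variants their added stability.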

Empirical studies confirm:

  • TLDA matches or improves topic-word recovery and log-likelihood scores relative to variational or MCMC methods.
  • GPU and CPU TLDA scale to millions and billions of documents, retaining high coherence and fast convergence.
  • Parameter recovery is provably consistent and empirically accurate on simulated and real data sets, including social media corpora with highly non-uniform word distributions (Kangaslahti et al., 11 Nov 2025).

7. Theoretical Consistency and Posterior Contraction

Tensor methods for LDA yield near-parametric posterior contraction rates for estimated densities, topic parameters, and per-document allocations. Provided the topics are bounded away from the boundary of the simplex and the priors are regular, the global topic learning rate is $O(\sqrt{\log(mN)/m})$ for a corpus of $m$ documents of length $N$, and document-specific allocation rates benefit from borrowing statistical strength across the corpus, achieving $O(\sqrt{\log \tilde{N}/\tilde{N}})$ contraction for a new document of length $\tilde{N}$ (Do et al., 29 Sep 2025).

Identifiability holds under weaker conditions than in earlier literature: uniqueness is guaranteed for $K$ linearly independent topics and $N \geq 3$ tokens per document, with robustness to misspecified topic numbers and model overfitting. These theoretical guarantees support TLDA as a rigorous alternative to variational and sampling-based LDA approaches, with established consistency and efficiency under broad conditions.
