Tensor Latent Dirichlet Allocation (TLDA)

Updated 18 November 2025
  • TLDA is a spectral algorithm that leverages low-order moment tensors and whitening techniques to recover latent topic structures.
  • It employs tensor power methods and joint diagonalization to overcome non-convexity issues found in traditional LDA inference.
  • The approach is computationally efficient and scalable, providing provable, globally consistent parameter recovery for large and sparse text corpora.

Tensor Latent Dirichlet Allocation (TLDA) is a class of spectral algorithms for learning the parameters of Latent Dirichlet Allocation (LDA) models using low-order moment tensors and their decompositions. TLDA leverages algebraic properties of the observable moments under LDA to obtain provable and computationally efficient parameter estimators, bypassing the non-convexity and local optima issues of traditional likelihood-based inference. The core methodology involves centering and whitening empirical moment tensors, then applying tensor decomposition (e.g., robust tensor power method or joint diagonalization) to recover the topic–word matrix and Dirichlet prior. TLDA generalizes the singular value decomposition for matrices to higher-order tensors, yielding globally consistent estimators for LDA under mild identifiability conditions.

1. Moment Tensor Characterization of LDA Models

Under LDA, each document is generated by sampling a topic mixture $h \sim \mathrm{Dir}(\boldsymbol{\alpha})$ (with $\alpha_0 = \sum_j \alpha_j$), followed by i.i.d. draws of words according to $P[w = e_v] = \sum_j h_j \mu_{j,v}$, where $\mu_j \in \Delta^{d-1}$ are topic-word distributions and $d$ is the vocabulary size. This generative process admits explicit population moments for the one-hot indicators $w_1, w_2, w_3$ of distinct words within a document:

  • First-order moment: $\mathbf{m}_1 = \mathbb{E}[w_1] = \sum_{j=1}^k (\alpha_j / \alpha_0)\, \mu_j$.
  • Centered second-order moment: $M_2 = \mathbb{E}[w_1 \otimes w_2] - (\alpha_0/(\alpha_0+1))\, \mathbf{m}_1 \otimes \mathbf{m}_1 = \sum_{j=1}^k (\alpha_j / (\alpha_0(\alpha_0+1)))\, \mu_j \otimes \mu_j$.
  • Centered third-order moment: $M_3 = \mathbb{E}[w_1 \otimes w_2 \otimes w_3] - (\alpha_0/(\alpha_0+2))\big(\mathbb{E}[w_1 \otimes w_2] \otimes \mathbf{m}_1 + \text{perms.}\big) + (2\alpha_0^2/((\alpha_0+1)(\alpha_0+2)))\, \mathbf{m}_1^{\otimes 3} = \sum_{j=1}^k (2\alpha_j / (\alpha_0(\alpha_0+1)(\alpha_0+2)))\, \mu_j^{\otimes 3}$.

These moments encode the model structure: both $M_2$ and $M_3$ are explicit symmetric sums of rank-one terms determined by $\{\mu_j\}$, with weights fixed by the Dirichlet parameters (Anandkumar et al., 2012, Do et al., 29 Sep 2025, Ruffini et al., 2016). A minimal sketch of estimating these moments from data follows.
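
The centered moments above translate directly into estimators over a document–term count matrix. Below is a minimal NumPy sketch, not taken from the cited papers, of forming $\hat{\mathbf{m}}_1$ and $\hat{M}_2$ from per-document word counts using pairs of distinct word positions; the function name and arguments (`counts`, `alpha0`) are illustrative, and $\alpha_0$ is assumed known or fixed in advance.

```python
import numpy as np

def empirical_lda_moments(counts, alpha0):
    """Estimate m1 and the centered M2 from a (num_docs x vocab_size) count matrix.

    Cross moments are taken over pairs of distinct word positions within each
    document (every document is assumed to contain at least two words), which
    removes the diagonal bias of naive outer products.
    """
    counts = np.asarray(counts, dtype=float)
    doc_len = counts.sum(axis=1, keepdims=True)              # words per document
    m1 = (counts / doc_len).mean(axis=0)                      # estimate of E[w1]

    # E[w1 (x) w2]: per document (c c^T - diag(c)) / (L (L - 1)), averaged over documents
    pair_scale = doc_len * (doc_len - 1.0)
    scaled = counts / pair_scale
    e_w1w2 = (counts.T @ scaled - np.diag(scaled.sum(axis=0))) / counts.shape[0]

    M2 = e_w1w2 - (alpha0 / (alpha0 + 1.0)) * np.outer(m1, m1)
    return m1, M2
```

The contractions of the third moment needed later can be estimated in the same spirit without ever materializing a dense $d \times d \times d$ tensor.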

2. Whitening and Reduction to Orthogonal Tensor Decomposition

To facilitate decomposition, TLDA whitens the second moment. Let $M_2 = U D U^\top$ be the top-$k$ eigen-decomposition. The whitening matrix $W = U D^{-1/2}$ guarantees $W^\top M_2 W = I_k$. Applying $W$ to all modes, the whitened third-order tensor is $T = M_3(W, W, W) = \sum_{j=1}^k \lambda_j v_j^{\otimes 3}$, where the vectors $v_j = \sqrt{w_j}\, W^\top \mu_j$ (with $w_j = \alpha_j/(\alpha_0(\alpha_0+1))$ the weight of $\mu_j$ in $M_2$) are orthonormal and the $\lambda_j$ are scalar weights. The resulting tensor admits a symmetric CP decomposition into $k$ rank-one orthogonal factors (Anandkumar et al., 2012, Do et al., 29 Sep 2025, Ruffini et al., 2016).

Whitening renders subsequent tensor operations tractable: all heavy linear algebra is confined to $k \times k \times k$ tensors, avoiding the curse of dimensionality ($d \gg k$). In practice, whitening and forming $T$ can be performed using randomized SVD techniques and sparse updates for scalability (Kangaslahti et al., 11 Nov 2025).
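
As a concrete illustration of this step, the sketch below, which assumes small dense inputs purely for readability (`M2` as a $d \times d$ array and `M3` as a $d \times d \times d$ array), computes $W = U D^{-1/2}$ from the top-$k$ eigenpairs of $\hat{M}_2$ and contracts $\hat{M}_3$ along all three modes; practical implementations contract implicitly, as noted above.

```python
import numpy as np

def whiten(M2, M3, k):
    """Whiten M2 and form the k x k x k tensor T = M3(W, W, W).

    M2: (d, d) centered second moment; M3: (d, d, d) centered third moment.
    Dense M3 is used here only for clarity; practical implementations contract
    M3 against W implicitly without ever materializing a d^3 array.
    """
    eigvals, eigvecs = np.linalg.eigh(M2)
    idx = np.argsort(eigvals)[::-1][:k]                  # top-k eigenpairs
    U, D = eigvecs[:, idx], eigvals[idx]
    W = U / np.sqrt(D)                                   # W = U D^{-1/2}, so W^T M2 W = I_k

    # T_{abc} = sum_{ijl} M3_{ijl} W_{ia} W_{jb} W_{lc}
    T = np.einsum('ijl,ia,jb,lc->abc', M3, W, W, W)
    return W, U, D, T
```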

3. Tensor Power Method and Joint Diagonalization Algorithms

Once the whitened tensor $T$ is formed, TLDA extracts components using spectral methods. The robust tensor power method iteratively contracts and deflates $T$ to recover the pairs $(\lambda_j, v_j)$; a hedged code sketch follows the steps below:

  • Initialize a random unit vector $\theta_0$.
  • Loop: $\theta_t \leftarrow T(I, \theta_{t-1}, \theta_{t-1}) / \|T(I, \theta_{t-1}, \theta_{t-1})\|$ until convergence.
  • Deflate: $T \leftarrow T - \lambda_j v_j^{\otimes 3}$ after each extracted component (Anandkumar et al., 2012, Ruffini et al., 2016).

Random restarts ensure quadratic convergence to each $v_j$ with high probability. Stopping criteria involve the cubic form $T(\theta, \theta, \theta)$ and contraction norms. Algorithmic variants include joint diagonalization approaches, particularly when connecting LDA to discrete ICA via Gamma–Poisson models, where joint diagonalization of projected matrices yields the components more stably, especially under sparse or variable document-length regimes (Podosinnikova et al., 2015).
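
Below is a hedged sketch of the power iteration with deflation; it is simplified relative to the robust variant of Anandkumar et al. (2012), which prescribes specific restart counts and iteration budgets, but it follows the same contract–normalize–deflate pattern.

```python
import numpy as np

def tensor_power_method(T, k, n_restarts=10, n_iter=100, tol=1e-8, rng=None):
    """Recover (lambda_j, v_j) from a whitened k x k x k symmetric tensor T.

    Simplified sketch: for each component, run several randomly initialized
    power iterations, keep the candidate with the largest cubic form
    T(theta, theta, theta), then deflate.
    """
    rng = np.random.default_rng(rng)
    T = T.copy()
    lambdas, vecs = [], []
    for _ in range(k):
        best_val, best_theta = -np.inf, None
        for _ in range(n_restarts):
            theta = rng.standard_normal(T.shape[0])
            theta /= np.linalg.norm(theta)
            for _ in range(n_iter):
                new = np.einsum('ijl,j,l->i', T, theta, theta)      # T(I, theta, theta)
                new /= np.linalg.norm(new)
                converged = np.linalg.norm(new - theta) < tol
                theta = new
                if converged:
                    break
            val = np.einsum('ijl,i,j,l->', T, theta, theta, theta)  # T(theta, theta, theta)
            if val > best_val:
                best_val, best_theta = val, theta
        lambdas.append(best_val)
        vecs.append(best_theta)
        # deflate: remove the recovered rank-one component
        T -= best_val * np.einsum('i,j,l->ijl', best_theta, best_theta, best_theta)
    return np.array(lambdas), np.array(vecs)   # vecs[j] is v_j
```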

4. Parameter Recovery and Identifiability Guarantees

Each recovered pair $(v_j, \lambda_j)$ is mapped back ("unwhitened") to the original parameter space (a code sketch appears below):

  • $\mu_j \approx (W^\top)^{+} v_j \,/\, \|(W^\top)^{+} v_j\|_1$ (topic-word distribution), where $(W^\top)^{+} = U D^{1/2}$ is the unwhitening (pseudo-inverse) map.
  • $\alpha_j \propto \|(W^\top)^{+} v_j\|_1^2$, or equivalently $\alpha_j \propto \lambda_j^{-2}$ via the explicit cubic-form relations.

The procedure yields unique recovery up to permutation and sign, provided the topic-word matrix $\Phi$ (or $\Theta$) is full column rank (Kruskal rank $r$) and $\alpha_j > 0$. TLDA algorithms are globally identifiable: the CP decomposition is unique under mild conditions, and the mapping from the discrete LDA model $G = \sum_j \alpha_j \delta_{\mu_j}$ to the observed distribution is one-to-one if $N \geq \lceil (2K-1)/(r-1) \rceil$, where $N$ is the document length (Do et al., 29 Sep 2025).
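
A minimal sketch of this recovery step is given below, assuming the outputs of the earlier whitening and power-method sketches (`U`, `D`, `vecs`) and a known concentration $\alpha_0$; rescaling the $\alpha_j$ so that $\sum_j \alpha_j = \alpha_0$ is one common convention for fixing the proportionality constant, not a prescription from the cited papers.

```python
import numpy as np

def unwhiten(U, D, vecs, alpha0):
    """Map whitened factors back to topic-word distributions and Dirichlet weights.

    (W^T)^+ = U D^{1/2} is the unwhitening map; each column is normalized onto
    the simplex to give mu_j, and alpha_j is taken proportional to the squared
    L1 norm of the unwhitened vector, rescaled to sum to alpha0.
    """
    B = U * np.sqrt(D)                               # (W^T)^+ = U D^{1/2}, shape (d, k)
    raw = B @ vecs.T                                 # column j is (W^T)^+ v_j
    raw = raw * np.sign(raw.sum(axis=0, keepdims=True))   # resolve residual sign ambiguity
    col_l1 = np.abs(raw).sum(axis=0)

    mu = np.clip(raw, 0.0, None)                     # clip small negative entries from noise
    mu = mu / mu.sum(axis=0, keepdims=True)          # project columns onto the simplex

    alpha = alpha0 * col_l1**2 / (col_l1**2).sum()   # alpha_j proportional to ||(W^T)^+ v_j||_1^2
    return mu, alpha
```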

If the topic number is over-specified ($K > K_0$), TLDA is robust: extra components collapse or receive zero mass at the optimum. Empirical moment estimators converge at rate $O_p(1/\sqrt{N})$ (Ruffini et al., 2016).

5. Computational Complexity, Scalability, and Sample Efficiency

TLDA algorithms scale favorably:

  • Whitened moment computations: $O(dk^2)$ for the range basis, $O(k^3)$ for the SVD/Cholesky factorization.
  • Tensor contraction and deflation: $O(k^5 \log k + k^3 \log(1/\epsilon))$ for the power method (Anandkumar et al., 2012, Ruffini et al., 2016).
  • Overall runtime is $O(Nd + d^2 k + k^5)$, dominated by the tensor formation and decomposition steps. SVDs live in $k$-dimensional space; memory requirements do not scale with $d^3$.
  • Sample complexity to guarantee $\|\hat{\mu}_i - \mu_i\|_2 \leq \epsilon$: $n = \tilde{O}(k^2 / \lambda_{\min}^6)$ for a fixed spectral gap $\lambda_{\min}$ (Anandkumar et al., 2012, Do et al., 29 Sep 2025).

Online TLDA implementations achieve strictly linear scaling in $N$ (the number of documents) through incremental PCA and stochastic tensor SGD. GPU-accelerated TLDA achieves a 3–4× speedup over parallel CPU LDA and can process corpora of up to a billion documents without forming explicit $d^3$ tensors (Kangaslahti et al., 11 Nov 2025).
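
The cited online and GPU systems are described here only at a high level; the sketch below illustrates the general idea of a streaming moment estimator that folds in one minibatch of documents at a time (the class and method names are hypothetical, and the dense $d \times d$ second moment is kept only for clarity).

```python
import numpy as np

class StreamingMoments:
    """Running estimates of m1 and E[w1 (x) w2], updated one minibatch at a time.

    Illustrative only: the cited online/GPU TLDA systems additionally update the
    whitening matrix incrementally and optimize the decomposition with stochastic
    gradient steps rather than storing dense moments.
    """
    def __init__(self, vocab_size):
        self.n_docs = 0
        self.m1 = np.zeros(vocab_size)
        self.e_w1w2 = np.zeros((vocab_size, vocab_size))

    def update(self, counts):
        counts = np.asarray(counts, dtype=float)         # (batch, vocab), doc length >= 2
        L = counts.sum(axis=1, keepdims=True)
        batch = counts.shape[0]
        batch_m1 = (counts / L).mean(axis=0)
        scaled = counts / (L * (L - 1.0))
        batch_e = (counts.T @ scaled - np.diag(scaled.sum(axis=0))) / batch

        # running averages weighted by the number of documents seen so far
        w_old = self.n_docs / (self.n_docs + batch)
        w_new = batch / (self.n_docs + batch)
        self.m1 = w_old * self.m1 + w_new * batch_m1
        self.e_w1w2 = w_old * self.e_w1w2 + w_new * batch_e
        self.n_docs += batch

    def centered_M2(self, alpha0):
        return self.e_w1w2 - (alpha0 / (alpha0 + 1.0)) * np.outer(self.m1, self.m1)
```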

6. Extensions, Robustness, and Empirical Findings

TLDA methodology extends to discrete independent component analysis (ICA) models via Gamma–Poisson or Poisson-multinomial connections. Cumulant-based tensors under variable document lengths improve sample efficiency, especially for sparse topics; joint diagonalization techniques (JD) further increase robustness to model mis-specification and improve finite-sample performance (Podosinnikova et al., 2015).
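
As a simplified illustration of the diagonalization idea, rather than the full orthogonal joint diagonalization procedure of Podosinnikova et al. (2015), one can contract the whitened tensor with a single random vector and eigendecompose the resulting $k \times k$ matrix, reading off all components at once:

```python
import numpy as np

def single_contraction_diagonalization(T, rng=None):
    """Recover factors of a whitened symmetric tensor via one random contraction.

    T(I, I, g) = sum_j lambda_j <v_j, g> v_j v_j^T, so its eigenvectors are the v_j
    whenever the values lambda_j <v_j, g> are distinct (which holds generically).
    This single-contraction version is only a sketch of the idea.
    """
    rng = np.random.default_rng(rng)
    g = rng.standard_normal(T.shape[0])
    M = np.einsum('ijl,l->ij', T, g)           # T(I, I, g), a k x k symmetric matrix
    _, eigvecs = np.linalg.eigh(M)
    vecs = eigvecs.T                           # rows are candidate v_j (up to sign)
    lambdas = np.einsum('ijl,ki,kj,kl->k', T, vecs, vecs, vecs)   # lambda_j = T(v_j, v_j, v_j)
    return lambdas, vecs
```

Jointly diagonalizing several such contractions $T(I, I, g_1), \dots, T(I, I, g_m)$, rather than a single one, is what gives the joint diagonalization variants their added stability.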

Empirical studies confirm:

  • TLDA matches or improves topic-word recovery and log-likelihood scores relative to variational or MCMC methods.
  • GPU and CPU TLDA scale to millions and billions of documents, retaining high coherence and fast convergence.
  • Parameter recovery is provably consistent and empirically accurate on simulated and real data sets, including social media corpora with highly non-uniform word distributions (Kangaslahti et al., 11 Nov 2025).

7. Theoretical Consistency and Posterior Contraction

Tensor methods for LDA yield near-parametric posterior contraction rates for estimated densities, topic parameters, and per-document allocations. Provided the topics are bounded away from the boundary of the simplex and the priors are regular, the global topic learning rate is $O(\sqrt{\log(mN)/m})$ for a corpus of $m$ documents of length $N$, and document-specific allocation rates benefit from borrowing statistical strength across the corpus, achieving $O(\sqrt{\log \tilde{N}/\tilde{N}})$ contraction for a new document of length $\tilde{N}$ (Do et al., 29 Sep 2025).

Identifiability holds under weaker conditions than in earlier literature: uniqueness is guaranteed for $K$ linearly independent topics and $N \geq 3$ tokens per document, with robustness to misspecified topic numbers and model overfitting. These theoretical guarantees support TLDA as a rigorous alternative to variational and sampling-based LDA approaches, with established consistency and efficiency under broad conditions.
