- The paper introduces a unified identifiability theorem that generalizes conditions by leveraging higher dimensionality and variable diversity.
- It establishes quantitative convergence rates with polynomial dependence on dimension, mitigating the curse of dimensionality in density estimation.
- A proposed recovery algorithm extracts the component densities under an incoherence condition, with estimation error controlled by the error in the joint density.
This paper addresses the identifiability and estimation problems in high-dimensional nonparametric latent structure models. These models represent a data distribution $\mu$ as a mixture of $m$ component distributions, where each component is a product of $d$ marginal measures:

$$\mu = \sum_{k=1}^{m} \pi_k \left(\mu_k^1 \times \mu_k^2 \times \cdots \times \mu_k^d\right)$$
The paper aims to overcome limitations in existing theoretical frameworks, particularly the reliance on strong linear independence conditions on the marginal measures $\mu_k^j$.
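For concreteness, such a model is easy to simulate. The sketch below draws samples from a hypothetical discrete instance; the support size, dimensions, and mixing weights are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy instance: m = 3 components, d = 5 variables,
# each marginal mu_k^j a discrete distribution on 4 support points.
m, d, s = 3, 5, 4
pi = np.array([0.5, 0.3, 0.2])                      # mixing proportions pi_k
marginals = rng.dirichlet(np.ones(s), size=(m, d))  # marginals[k, j] = mu_k^j

def sample(n):
    """Draw n observations: pick a latent component k ~ pi, then draw the
    d coordinates independently from that component's marginals."""
    ks = rng.choice(m, size=n, p=pi)
    X = np.empty((n, d), dtype=int)
    for i, k in enumerate(ks):
        for j in range(d):
            X[i, j] = rng.choice(s, p=marginals[k, j])
    return X

X = sample(500)
```

The conditional independence across coordinates, given the latent label, is exactly what the product structure in the displayed mixture encodes.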
Key Contributions
The main contributions of the paper are:
- A Unified Identifiability Theorem: The paper introduces a new identifiability theorem that generalizes previous conditions. It explains how higher dimensionality, combined with diversity in variables, aids identifiability even when linear independence doesn't hold.
- Quantitative Rates of Convergence: It establishes a perturbation theory for estimating the component densities under an incoherence condition. This theory shows how errors in estimating the joint density propagate to errors in estimating the components. Additionally, near-optimal minimax risk bounds for high-dimensional nonparametric density estimation are derived, demonstrating that sample complexity scales polynomially with dimension, thus circumventing the "curse of dimensionality."
- A Recovery Algorithm: A practical algorithm is developed for recovering component densities from an estimator of the joint density. This algorithm relies on an incoherence condition rather than the stricter linear independence.
Identifiability without Linear Independence
Existing identifiability results, such as Allman et al. (2009), often rely on the "linear independence condition": for each dimension $j$, the marginals $\{\mu_1^j,\dots,\mu_m^j\}$ are linearly independent. This condition fails in important cases such as conditional i.i.d. models (where $\mu_k^1=\cdots=\mu_k^d$ for each component $k$) and Bernoulli mixture models (where the $\mu_k^j$ are Bernoulli distributions, so linear independence fails for $m\ge 3$).
To address this, the paper introduces the concept of $\ell$-independence:
- The $j$-th variable is $\ell$-independent if every subset of $\{\mu_k^j\}_{k=1}^m$ of cardinality $\ell$ is linearly independent.
- $\mathrm{Ind}_\mu(j)$ denotes the maximum such $\ell$ for the $j$-th variable.
- For a subset $S\subseteq[d]$, $\tau_\mu(S)\triangleq\min\{m,\ \mathrm{Ind}_\mu(S)-|S|+1\}$ denotes the total excess independence in $S$.
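On a discrete support these quantities can be computed by brute force. A minimal sketch, where $\mathrm{Ind}_\mu(j)$ is the Kruskal rank of the marginals viewed as columns, and where reading $\mathrm{Ind}_\mu(S)$ as the sum of the per-variable values is an assumption made for illustration:

```python
import itertools
import numpy as np

def kruskal_rank(V):
    """Largest l such that every l columns of V are linearly independent;
    with columns the m marginals of variable j, this is Ind_mu(j)."""
    n = V.shape[1]
    for l in range(1, n + 1):
        for cols in itertools.combinations(range(n), l):
            if np.linalg.matrix_rank(V[:, cols]) < l:
                return l - 1
    return n

def tau(ind_values, m):
    """tau_mu(S) = min(m, Ind_mu(S) - |S| + 1), taking Ind_mu(S) as the
    sum of Ind_mu(j) over j in S (an assumed reading, for illustration)."""
    return min(m, sum(ind_values) - len(ind_values) + 1)

# Three Bernoulli-like marginals on a 2-point support: any two of them are
# linearly independent, but three vectors in R^2 cannot be, so Ind = 2.
V = np.array([[0.9, 0.5, 0.2],
              [0.1, 0.5, 0.8]])
```

This is exactly the Bernoulli obstruction mentioned above: for $m\ge 3$ full linear independence is impossible on a two-point support, yet $2$-independence still holds.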
Theorem \ref{identi_thm} (Main Identifiability Result):
The model $\mu$ is identifiable if there exists a partition $S_1,S_2,S_3$ of the $d$ dimensions such that

$$\tau_\mu(S_1)+\tau_\mu(S_2)+\tau_\mu(S_3)\ \ge\ 2m+2.$$

Conversely, a non-identifiable $\mu$ exists whenever every partition satisfies $\tau_\mu(S_1)+\tau_\mu(S_2)+\tau_\mu(S_3)\le 2m+1$.
A key corollary concerns the separability condition (where $\mu_k^j \neq \mu_{k'}^j$ for $k \neq k'$):
Corollary \ref{sepa_thm}: If the number of separable variables satisfies $N(\mu)\ge 2m-1$, then $\mu$ is identifiable. This unifies results from prior work showing $d=2m-1$ as a critical threshold.
The proof of Theorem \ref{identi_thm} uses Hilbert space embedding techniques. The measures $\mu_k^j$ are represented as functions $f_k^j$ in $L^2(\xi)$ for some dominating measure $\xi$, and the joint density is mapped to a tensor in $L^2(\xi)^{\otimes d}$. The proof relies on:
- Relating the Kruskal rank of the set of component functions to the Kruskal rank of their Gram matrices.
- Lemma \ref{Hadamard}: a novel result showing that the Kruskal rank of a Hadamard product of Gram matrices $A\circ B$ is at least $\min\{n,\ k_A+k_B-1\}$, where $k_A,k_B$ are the Kruskal ranks of $A,B$. This lemma is crucial for establishing the total Kruskal rank condition.
- An extension of Kruskal's theorem for tensor decompositions in Hilbert spaces (Lemma \ref{Extension}).
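The Hadamard-product lemma can be spot-checked numerically on random Gram matrices; this is a sanity check on generic instances, not a proof:

```python
import itertools
import numpy as np

def kruskal_rank(M):
    """Largest l such that every l columns of M are linearly independent."""
    n = M.shape[1]
    for l in range(1, n + 1):
        if any(np.linalg.matrix_rank(M[:, c]) < l
               for c in itertools.combinations(range(n), l)):
            return l - 1
    return n

rng = np.random.default_rng(1)
n = 5
XA = rng.standard_normal((3, n))
XB = rng.standard_normal((3, n))
A, B = XA.T @ XA, XB.T @ XB           # random rank-3 Gram matrices
kA, kB = kruskal_rank(A), kruskal_rank(B)
k_hadamard = kruskal_rank(A * B)      # A * B is the Hadamard (entrywise) product
# Lemma: kruskal_rank(A o B) >= min(n, kA + kB - 1)
ok = k_hadamard >= min(n, kA + kB - 1)
```

For generic rank-3 Gram matrices here, $k_A=k_B=3$, so the lemma predicts the Hadamard product has full Kruskal rank $\min\{5, 3+3-1\}=5$.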
Rate of Convergence under Incoherence
This section assumes each $\mu_k^j$ has a density $f_k^j$.
1. Recovering Component Density: Perturbation Analysis
The paper introduces an incoherence condition:
- A set of functions $\{f_k\}_{k=1}^m$ in a Hilbert space is $\mu$-incoherent if $|\langle f_k,f_{k'}\rangle|\le \mu\,\|f_k\|_2\|f_{k'}\|_2$ for all $k\neq k'$, with $\mu<1$.
Assumption \ref{assume} (Estimable Condition):
- The $f_k^j$ are square integrable, and for each $j$ the set $\{f_k^j\}_{k=1}^m$ is $\mu$-incoherent.
- The mixing proportions satisfy $\pi_k\ge \zeta>0$.
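The incoherence parameter of a family of densities can be estimated directly; in the sketch below, discretizing the $L^2$ inner products on a uniform grid is an assumption made for illustration:

```python
import numpy as np

def incoherence(F, dx):
    """Max of |<f_k, f_k'>| / (||f_k||_2 ||f_k'||_2) over k != k', where the
    rows of F are densities sampled on a uniform grid with spacing dx."""
    G = (F @ F.T) * dx                    # Gram matrix of L2 inner products
    norms = np.sqrt(np.diag(G))
    C = np.abs(G) / np.outer(norms, norms)
    np.fill_diagonal(C, 0.0)
    return C.max()

# Disjointly supported densities are 0-incoherent; identical ones give 1,
# violating the mu < 1 requirement.
x = np.linspace(0, 1, 1000, endpoint=False)
dx = x[1] - x[0]
f1 = np.where(x < 0.5, 2.0, 0.0)          # uniform density on [0, 1/2)
f2 = np.where(x >= 0.5, 2.0, 0.0)         # uniform density on [1/2, 1)
mu_disjoint = incoherence(np.vstack([f1, f2]), dx)   # -> 0.0
mu_equal = incoherence(np.vstack([f1, f1]), dx)      # -> 1.0
```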
Theorem \ref{Angle_thm} (Robust Identifiability):
If $f$ is $(\mu,\zeta)$-estimable and an estimator $\tilde f$ (also a mixture of product densities) satisfies $\|f-\tilde f\|_2\le \epsilon$ (for sufficiently small $\epsilon$), then there exists a permutation $\sigma$ such that the errors in component densities and mixing proportions are bounded proportionally to $\epsilon$:

$$\|f_k^j-\tilde f_{\sigma(k)}^j\|_2 \lesssim \epsilon, \qquad \|\pi-\sigma(\tilde\pi)\|_2 \lesssim \epsilon.$$

The constants depend on $C$ (a uniform bound on $\|f_k^j\|_\infty$), $m$, $\mu$, and $\zeta$. Thus, under incoherence, small errors in estimating the joint density translate into small errors in estimating the components.
The proof sketch involves:
- Considering marginal densities to reduce the problem to $d=2m-1$.
- Representing the densities $f,\tilde f$ as tensors $T,\tilde T$ in $L^2([0,1])^{\otimes(2m-1)}$.
- Analyzing the mode-1 multiplication $T\times_1 w$ (where $w$ is a test function) and its matrix unfolding $T_w=A D_{\pi,w} B^*$.
- Using Weyl's inequality (Lemma \ref{Weyl}) for singular values: $\sup_{\|w\|_2=1}\max_k |\sigma_k(T_w)-\sigma_k(\tilde T_w)|\le \epsilon$.
- Using a probabilistic method (Lemma \ref{lemma_test}) to find test functions $w$ that distinguish components or expose contradictions if the theorem's conclusion fails. In particular, if a component $\tilde f_k^j$ is far from all true components $f_{k'}^j$, one can construct a $w_0$ such that $\sigma_m(\tilde T_{w_0})=0$ while $\sigma_m(T_{w_0})>\epsilon$, a contradiction. Similar arguments establish the permutation's uniqueness and consistency across dimensions.
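The Weyl step can be illustrated on plain matrices: singular values move by at most the operator norm of the perturbation, which is in turn bounded by its Frobenius ($L^2$) norm. A generic numerical check, not the paper's tensor setting:

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.standard_normal((6, 6))
E = rng.standard_normal((6, 6))
E *= 1e-3 / np.linalg.norm(E)        # perturbation with Frobenius norm 1e-3

s_true = np.linalg.svd(T, compute_uv=False)
s_pert = np.linalg.svd(T + E, compute_uv=False)

# Weyl's inequality: max_k |sigma_k(T) - sigma_k(T + E)| <= ||E||_op <= ||E||_F
gap = np.max(np.abs(s_true - s_pert))
```

In the proof this is applied to every unfolding $T_w$ at once, which is why the bound is uniform over test functions $w$ with $\|w\|_2=1$.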
2. Estimation of the Joint Distribution under Hölder Smoothness
The paper considers the density class $\mathcal G_{\mathcal F}(m,d)$ in which the component densities $f_k^j$ belong to a Hölder smoothness class $\mathcal F_{L,q}$.
Theorem \ref{Holder_rate} (Minimax Rates):
For estimating $f\in\mathcal G_{\mathcal F_{L,q}}(m,d)$ from $n$ i.i.d. samples:
- Under the Hellinger distance ($H$): the minimax risk satisfies $R^*_{H,\mathcal F_{L,q}(m,d)}\asymp n^{-q/(q+1)}$ (up to log factors and polynomial terms in $m,d$). For $n\ge m d^{1+1/q}$:

$$\left(\frac{n}{\log n}\right)^{-\frac{q}{q+1}} d \;\lesssim\; R_H^* \;\lesssim\; n^{-\frac{q}{q+1}}\, m^{\frac{q}{q+1}}\, d$$

- Under the total variation distance (TV): the minimax risk satisfies $R^*_{TV,\mathcal F_{L,q}(m,d)}\asymp n^{-2q/(2q+1)}$ (similarly). For all $n\ge 1$:

$$\left(\frac{n}{\log n}\right)^{-\frac{2q}{2q+1}} \;\lesssim\; R_{TV}^* \;\lesssim\; n^{-\frac{2q}{2q+1}}\, m^{\frac{2q}{2q+1}}\, d^{\frac{2q+2}{2q+1}}$$
These rates show that the sample complexity depends only polynomially on $d$, in contrast with standard nonparametric density estimation, where the rates $n^{-q/(q+d)}$ and $n^{-2q/(2q+d)}$ degrade exponentially with dimension. The latent product structure thus mitigates the curse of dimensionality. The proofs use classical information-theoretic arguments based on metric entropy.
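The gap between the two regimes is easy to quantify. The snippet below compares the rate exponents for an illustrative smoothness $q=1$ and dimension $d=50$; the numbers are chosen purely for illustration:

```python
# Hellinger-type rate with the latent product structure: n^{-q/(q+1)},
# versus the classical nonparametric rate: n^{-q/(q+d)}.
def structured_rate(n, q):
    return n ** (-q / (q + 1))

def classical_rate(n, q, d):
    return n ** (-q / (q + d))

n, q, d = 10**6, 1.0, 50
r_structured = structured_rate(n, q)    # 1e6^(-1/2)  = 1e-3
r_classical = classical_rate(n, q, d)   # 1e6^(-1/51) ~ 0.76
```

With a million samples, the structured model achieves error on the order of $10^{-3}$, while the unstructured rate has barely moved off a constant.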
Algorithm for Recovery of Components
A practical algorithm based on simultaneous diagonalization (Leurgans et al., 1993) is proposed to recover the $f_k^j$ from an estimator $\hat f$ of the joint density $f$. It focuses on the case $d=2m-1$.
Algorithm 1:
- Given $\hat f(x_1,\dots,x_{2m-1})$, compute $\hat T^+(y,z)=\int \hat f(y,z,x_{2m-1})\,dx_{2m-1}$, where $y=(x_1,\dots,x_{m-1})$ and $z=(x_m,\dots,x_{2m-2})$.
- Find $\hat T^{+,m}$, the best rank-$m$ SVD approximation of $\hat T^+$: $\hat T^{+,m}=\sum_{k=1}^m \hat\lambda_k\,\hat\phi_k(y)\,\hat\psi_k(z)$.
- Choose a subset $A\subset[0,1]$ (the support of $x_{2m-1}$).
- Compute a matrix $\hat\eta_A$ with entries $\hat\eta_{lt}=\frac{1}{\hat\lambda_t}\int_A \hat\phi_l(y)\,\hat f(y,z,x_{2m-1})\,\hat\psi_t(z)\,dy\,dz\,dx_{2m-1}$.
- Find the eigenvectors $\hat w_k$ of $\hat\eta_A$.
- Recover estimates of $f_k^1$ by first forming $\hat g_k(y)=\sum_h \hat w_{kh}\hat\phi_h(y)$, normalizing to $\hat h_k=\hat g_k/\|\hat g_k\|_1$, and then marginalizing $\hat f_k^1=\int \hat h_k\,dx_2\cdots dx_{m-1}$.
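The steps above can be sketched end to end on a discrete grid. This is a toy discretization with three grouped blocks playing the roles of $y$, $z$, and $x_{2m-1}$; all sizes and distributions are illustrative, and the exact-tensor input stands in for the estimator $\hat f$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy discrete analogue: m components, each a product of three blocks
# u_k (the grouped y), v_k (the grouped z), w_k (the last variable).
m, s = 3, 8
pi = np.array([0.5, 0.3, 0.2])
U = rng.dirichlet(np.ones(s), size=m).T   # column k is u_k (sums to 1)
V = rng.dirichlet(np.ones(s), size=m).T
W = rng.dirichlet(np.ones(s), size=m).T
T = np.einsum('k,ak,bk,ck->abc', pi, U, V, W)   # joint probability tensor

# Step 1: marginalize out the last block: T_plus = U diag(pi) V^T.
T_plus = T.sum(axis=2)

# Step 2: rank-m SVD of T_plus.
P, lam, Qt = np.linalg.svd(T_plus)
P, lam, Q = P[:, :m], lam[:m], Qt[:m].T

# Steps 3-4: restrict the last variable to a subset A, form eta_A, and take
# its eigenvectors; eta_A is similar to diag(sum_{c in A} w_k(c)).
A = np.arange(s // 2)
T_A = T[:, :, A].sum(axis=2)
eta_A = (P.T @ T_A @ Q) / lam             # divide column t by lam_t
_, Wvec = np.linalg.eig(eta_A)

# Step 5: map eigenvectors back and L1-normalize to recover the u_k
# (up to permutation of the components).
G = P @ Wvec.real
G *= np.sign(G.sum(axis=0))               # fix signs so columns sum positively
U_hat = G / G.sum(axis=0)
```

Because the input tensor is exact, the recovered columns of `U_hat` match the true `U` up to a permutation; with an estimated $\hat f$ the same pipeline incurs the $\epsilon$-proportional error quantified in Theorem \ref{algo_error} below.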
Theorem \ref{algo_error} (Correctness of Algorithm 1):
If $f$ is $(\mu,\zeta)$-estimable, $\|f_k^j\|_\infty\le C$, and the set $A$ is chosen so that the values $\int_A f_k^{2m-1}(x)\,dx$ are well separated and bounded away from zero, then whenever $\|\hat f-f\|_2\le\epsilon$ (for small $\epsilon$), Algorithm 1 outputs $\hat f_k^1$ such that

$$\|\hat f_k^1-f_{\sigma(k)}^1\|_2 \lesssim \epsilon.$$

The constants depend on $C$, $m$, the incoherence $\mu$, the proportion bound $\zeta$, and properties of the set $A$ (its measure $\mu_0$ and separation $\delta$). The algorithm relies only on incoherence, not linear independence. For $d>2m-1$, it can be applied repeatedly to submodels.
Simulations:
The algorithm was tested on:
- A conditional i.i.d. model ($m=3$, $d=5$, support size 4 for each $f_k^j$).
- A Bernoulli mixture model ($m=3$, $d=5$, with $\alpha_k^j=0.1j+0.2(k-1)$).
An empirical estimate $\hat f$ was computed from $n$ samples, and the error $e=\sum_k\|f_k^1-\hat f_k^1\|_2$ was reported.
Results showed a linear decay of log error with increasing log sample size, confirming the theoretical linear relationship between joint density error and component density error. The algorithm performed well even without linear independence.
Discussion
The paper significantly advances the understanding of high-dimensional nonparametric latent structure models by:
- Providing a unified identifiability theory beyond linear independence.
- Establishing quantitative convergence rates that show polynomial dependence on dimension.
- Proposing a practical recovery algorithm with theoretical guarantees under incoherence.
Future work includes:
- Refining identifiability conditions (e.g., removing the 3-partition requirement).
- Developing methods to utilize information from more than $2m-1$ variables more effectively for estimation.
The appendices contain detailed proofs for the theorems and lemmas presented. For instance, Appendix A covers proofs for Section 2 (Identifiability), Appendix B for Theorem \ref{Angle_thm}, Appendix C for Theorem \ref{Holder_rate}, and Appendix D for Algorithm 1 and Theorem \ref{algo_error}.