Tucker Decomposition in Multilinear Models
- Tucker decomposition is a multilinear algebra method that factorizes high-order tensors into a core tensor and mode-specific factor matrices.
- It enables flexible low-rank representations, supporting model compression, the capture of multiway correlations, and improved image-classification performance.
- The method offers depth efficiency, with an exponential CP-rank separation from shallow networks, at the cost of a core whose size grows exponentially in the tensor order.
Tucker decomposition is a central methodology in multilinear algebra that factorizes a high-order tensor into a small core tensor and mode-specific factor matrices, yielding flexible low-rank representations especially adapted to multidimensional data analysis and machine learning. The decomposition is widely applied for model compression, capturing multiway correlations, and constructing expressive architectures in computer vision, statistics, and signal processing.
1. Formal Definition and Algebraic Structure
Given an $N$-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, its Tucker decomposition of multilinear rank $(R_1, R_2, \ldots, R_N)$ is
$$\mathcal{X} = \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)},$$
with core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ and factor matrices $A^{(n)} \in \mathbb{R}^{I_n \times R_n}$ for $n = 1, \ldots, N$ (Liu et al., 2019). Elementwise, this reads
$$x_{i_1 i_2 \cdots i_N} = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1 r_2 \cdots r_N}\, a^{(1)}_{i_1 r_1} a^{(2)}_{i_2 r_2} \cdots a^{(N)}_{i_N r_N},$$
providing a highly expressive multilinear model. Tucker generalizes the matrix SVD to higher orders and is strictly more flexible than the CP decomposition, which corresponds to the special case of a superdiagonal core.
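To make the algebra concrete, the following NumPy sketch (not taken from Liu et al., 2019; function names such as `tucker_reconstruct` and `hosvd` are illustrative) reconstructs a tensor from a core and factor matrices via mode-$n$ products and computes a truncated higher-order SVD, one standard way to obtain a Tucker approximation.

```python
import numpy as np

def tucker_reconstruct(core, factors):
    """Reconstruct a full tensor from a Tucker core and per-mode factor matrices.

    core:    array of shape (R_1, ..., R_N)
    factors: list of N matrices, factors[n] has shape (I_n, R_n)
    """
    x = core
    for n, a in enumerate(factors):
        # Mode-n product: contract the n-th axis of the current tensor with A^{(n)}.
        x = np.tensordot(a, x, axes=(1, n))   # the new I_n axis appears first
        x = np.moveaxis(x, 0, n)              # move it back to position n
    return x

def hosvd(x, ranks):
    """Truncated higher-order SVD: a standard way to compute a Tucker approximation."""
    factors = []
    for n, r in enumerate(ranks):
        unfold = np.moveaxis(x, n, 0).reshape(x.shape[n], -1)  # mode-n unfolding
        u, _, _ = np.linalg.svd(unfold, full_matrices=False)
        factors.append(u[:, :r])                               # leading R_n left singular vectors
    core = x
    for n, a in enumerate(factors):
        core = np.tensordot(a.T, core, axes=(1, n))            # project each mode onto its basis
        core = np.moveaxis(core, 0, n)
    return core, factors

# Example: decompose and reconstruct a random 3-way tensor.
x = np.random.randn(8, 9, 10)
core, factors = hosvd(x, ranks=(4, 4, 4))
x_hat = tucker_reconstruct(core, factors)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```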
2. Tucker-Decomposition Network Architecture
The Tucker network formalism models a multi-patch input (e.g., an object $X$ composed of $N$ subpatches $\mathbf{x}_1, \ldots, \mathbf{x}_N$) as follows (Liu et al., 2019):
- Representation layer: $f_\theta(\mathbf{x}_i) \in \mathbb{R}^M$ (e.g., a ReLU-activated convolution), giving $M$ features per patch.
- Mode-wise projection: for each mode $n$, project the patch features via $A^{(n)} \in \mathbb{R}^{M \times R_n}$ to $\mathbf{v}^{(n)} = (A^{(n)})^\top f_\theta(\mathbf{x}_n) \in \mathbb{R}^{R_n}$.
- Core-tensor layer: the score of class $y$ is the full contraction $h_y(X) = \mathcal{G}^{y} \times_1 \mathbf{v}^{(1)} \times_2 \cdots \times_N \mathbf{v}^{(N)}$, with one class-wise $N$-way core $\mathcal{G}^{y} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$ per class.
- Depth: one representation layer, $N$ mode-wise projections, a product-pooling (contraction) layer, and a final linear classification; a minimal sketch of this forward pass appears below.
This network is strictly deeper than shallow CP architectures and can be recursively deepened by hierarchical nesting.
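The sketch below walks through the forward pass just described, reusing the notation of Section 1 ($N$ patches, $M$ features, uniform rank $R$, $Y$ classes). It is an illustration under those assumptions, not the reference implementation: the representation layer is stood in for by a plain ReLU, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def tucker_network_scores(patches, repr_fn, factors, cores):
    """Forward pass of a Tucker-network classifier (illustrative sketch).

    patches : list of N raw input patches
    repr_fn : representation layer mapping a patch to an M-dimensional feature vector
    factors : list of N projection matrices, factors[n] has shape (M, R_n)
    cores   : array of shape (Y, R_1, ..., R_N), one N-way core per class
    """
    # Mode-wise projection: one R_n-dimensional vector per patch.
    v = [factors[n].T @ repr_fn(p) for n, p in enumerate(patches)]

    # Product pooling: contract every mode of the class-wise cores
    # against the corresponding projected feature vector.
    scores = cores
    for vec in v:
        # Axis 0 is always the class axis; axis 1 is the current leading core mode.
        scores = np.tensordot(scores, vec, axes=(1, 0))
    return scores  # shape (Y,): one score per class

# Toy usage: N=3 patches, M=16 features, uniform rank R=4, Y=10 classes.
rng = np.random.default_rng(0)
N, M, R, Y = 3, 16, 4, 10
repr_fn = lambda p: np.maximum(0.0, p)              # stand-in ReLU representation layer
patches = [rng.standard_normal(M) for _ in range(N)]
factors = [rng.standard_normal((M, R)) for _ in range(N)]
cores = rng.standard_normal((Y,) + (R,) * N)
print(tucker_network_scores(patches, repr_fn, factors, cores))
```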
3. Expressive Power: Depth Separation Theorem
The main theorem states that for an $N$-way tensor generated by a Tucker model with uniform rank $R$, the CP-rank is generically exponentially larger: for $N$ even it is at least $R^{N/2}$, and for $N$ odd at least $R^{(N-1)/2}$ (Liu et al., 2019). More precisely, emulating a Tucker block with a single-layer CP (shallow) network requires exponentially many neurons,
$$\mathrm{CP\text{-}rank}(\mathcal{X}) \;\geq\; \operatorname{rank}\!\big([\mathcal{G}]\big) \;=\; R^{\lfloor N/2 \rfloor} \quad \text{generically},$$
where $[\mathcal{G}]$ is the balanced matricization of the core $\mathcal{G}$.
Proof sketch: the rank of any matricization lower-bounds the CP-rank. When the mode-wise factors have full column rank, matricizing the Tucker tensor preserves the rank of the corresponding matricization of the core, and a generic core attains the maximal rank $R^{\lfloor N/2 \rfloor}$ outside a set of measure zero. The exponential gap therefore holds generically and fails only for degenerate cores.
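The genericity claim is easy to check numerically. The snippet below (an illustration under the stated full-rank and genericity assumptions, not code from the paper) builds a random Tucker tensor of order $N=4$ with uniform rank $R=3$ and verifies that its balanced matricization has rank $R^{N/2}$, which lower-bounds the CP-rank.

```python
import numpy as np

# Numerical check: for a random core and full-column-rank factors, the balanced
# matricization of the Tucker tensor generically has rank R**(N//2).
rng = np.random.default_rng(1)
N, R, I = 4, 3, 5                                  # order, uniform rank, mode dimension

core = rng.standard_normal((R,) * N)
factors = [rng.standard_normal((I, R)) for _ in range(N)]

x = core
for n, a in enumerate(factors):                    # mode-n products, as in Section 1
    x = np.moveaxis(np.tensordot(a, x, axes=(1, n)), 0, n)

# Balanced matricization: the first N//2 modes index rows, the rest index columns.
mat = x.reshape(I ** (N // 2), -1)
print("matricization rank:", np.linalg.matrix_rank(mat))   # expected: R**(N//2) = 9
print("lower bound R^(N//2):", R ** (N // 2))
```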
4. Comparison to Hierarchical Tensor Networks
Hierarchical Tucker (HT) formats arrange the $N$ modes in a binary-tree structure, using small cores at internal nodes and shared factor matrices at the leaves. Theoretically, any HT decomposition of order $N$ can be rewritten as a Tucker decomposition and vice versa. Tucker networks are as "rich" as HT for the same leaf rank, but HT distributes parameters across many small cores, reducing the core cost from the $R^N$ of a single Tucker core to polynomial quantities (Liu et al., 2019). Tucker's single-block simplicity is advantageous, but HT may be preferable for large $N$ because of this parameter scaling.
5. Parameter and Computational Complexity
In the notation of Section 2 ($N$ modes, $M$ representation features per patch, uniform projection rank $R$, $Y$ classes, and CP rank $Z$ for the shallow network), the parameter counts scale as:
- Shallow CP: $O\!\big(Z(NM + Y)\big)$ for the $Z$ rank-1 terms' mode-wise projections and output weights.
- Tucker: $O\!\big(NMR + Y R^{N}\big)$, dominated by the class-wise cores.
- HT: $O\!\big(NMR + N R^{3} + Y R^{2}\big)$, since the binary tree replaces the single large core with $O(N)$ third-order cores.
Contraction of the projected (outer-product) feature vectors against the cores is handled by a specialized product-pooling layer (Liu et al., 2019).
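The scalings above can be tabulated directly. The helper functions below are hypothetical and encode only the counts implied by the architecture description, so exact constants may differ from those reported by Liu et al. (2019); the example makes the $R^N$ growth of the Tucker core visible.

```python
# Parameter-count scalings for the three architecture families (illustrative).
def shallow_cp_params(N, M, Z, Y):
    return Z * (N * M + Y)          # Z rank-1 terms: per-mode projections + output weights

def tucker_params(N, M, R, Y):
    return N * M * R + Y * R ** N   # N projection matrices + one N-way core per class

def ht_params(N, M, R, Y):
    # N leaf matrices, ~N third-order internal cores, and an output core at the root.
    return N * M * R + (N - 1) * R ** 3 + Y * R ** 2

N, M, R, Z, Y = 8, 64, 4, 32, 10
print("shallow CP:", shallow_cp_params(N, M, Z, Y))
print("Tucker:    ", tucker_params(N, M, R, Y))
print("HT:        ", ht_params(N, M, R, Y))
```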
6. Empirical Validation in Image Classification
Experiments use TensorFlow, batch normalization, and SGD on the MNIST and CIFAR-10 datasets. Tucker, HT, and shallow CP architectures are matched for parameter count on each dataset (23K parameters on CIFAR-10). Metrics include test accuracy, parameter sensitivity, and convergence behavior.
Key results (Liu et al., 2019):
- Tucker networks outperform both HT and shallow CP networks in training and test accuracy.
- On MNIST, Tucker reaches its best test accuracy faster than the alternatives and attains a higher final value.
- For CIFAR-10, best Tucker networks consistently exceed HT and shallow by several percentage points at fixed model sizes.
- The optimal Tucker rank varies by dataset, but the best Tucker configuration leads in every setting tested.
7. Practical Implications, Advantages, and Limitations
Advantages:
- Mode-wise rank control enables flexible model size–expressivity tradeoffs.
- Depth efficiency: exponential separation from shallow CP; theoretical guarantees.
- Single-block architecture is easier to operationalize in deep learning frameworks.
- Empirical superiority at constrained parameter budgets in computer vision.
Limitations:
- Core size grows as $R^{N}$ (more generally $\prod_n R_n$); practical only when $N$ is moderate and $R$ is small.
- Lack of spatial sharing may hamper hierarchical feature extraction in very deep nets.
- The final product-pooling step incurs computational overhead for large $N$ and $R$.
Overall, Tucker-Decomposition Networks occupy an intermediate point between CP and HT designs, providing depth efficiency, parameter flexibility, and practical performance (Liu et al., 2019). Future extensions may further explore sparsity, sharing mechanisms, or hierarchical interleaving to extend Tucker principles to larger-scale architectures.