TLSTM: Advanced Tree & Tensorized LSTM Models

Updated 3 February 2026
  • TLSTM comprises advanced LSTM variants—including TreeLSTM, TT-LSTM, and tLSTM—that extend traditional sequence models to capture hierarchical structures and improve parameter efficiency.
  • TreeLSTM and its variant LdTreeLSTM model dependency trees by incorporating separate LSTM modules to handle diverse edge types and sibling correlations, leading to enhanced generative accuracy.
  • TT-LSTM and tLSTM leverage tensor-train factorization and hidden state tensorization respectively, significantly reducing parameter counts while maintaining or boosting performance on sequence tasks.

TLSTM refers to a set of advanced Long Short-Term Memory (LSTM) network architectures in which the core LSTM framework is generalized beyond standard sequential modeling. This encompasses (1) the Tree Long Short-Term Memory network (TreeLSTM), which models tree-structured rather than purely sequential data; (2) the Tensor-Train LSTM (TT-LSTM), where the weight matrices are factorized with tensor decompositions for parameter reduction; and (3) the Tensorized LSTM (tLSTM), which replaces vector-valued hidden states with higher-dimensional tensors, allowing for implicit deep computation and parameter sharing. Each of these distinct architectures extends the traditional LSTM to address a specific challenge (hierarchical structure modeling, parameter efficiency in large models, or resource-constrained sequence learning) while maintaining or improving modeling power relative to standard LSTM architectures (Zhang et al., 2015, Samui et al., 2018, He et al., 2017).

1. Tree Long Short-Term Memory (TreeLSTM): Generative Dependency Modeling

TreeLSTM generalizes standard sequence LSTMs by defining computation on tree-structured data, specifically modeling the generation of dependency trees for sentences. The network explicitly parameterizes four separate LSTM modules, one per “edge type” in a tree: generation of the first left child (Gen-L), the next left sibling (Gen-Nx-L), the first right child (Gen-R), and the next right sibling (Gen-Nx-R). The path to a word in the tree is encoded as a traversal over these edge types, and each path is associated with a hidden state vector computed by the appropriate LSTM module. All hidden layers and embeddings are shared.

Letting $h_t$ denote the shared hidden representation for the path to word $w_t$, the conditional probability of $w_t$ given its path $\mathcal{D}(w_t)$ is given by a softmax:

$$P(w_t \mid \mathcal{D}(w_t)) = \frac{\exp(y_{t,w_t})}{\sum_{k=1}^{|V|} \exp(y_{t,k})}$$

where $y_t = W_{ho} h_t$ and $W_{ho}$ is the output matrix (Zhang et al., 2015).

The TreeLSTM specifies a generative model for the joint probability of a dependency tree $T$ over a sentence $S$ as:

$$P(S, T) = \prod_{w \in \mathrm{BFS}(T) \setminus \{\mathrm{ROOT}\}} P(w \mid \mathcal{D}(w))$$
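
A minimal sketch of this generative computation is given below, assuming hypothetical components (a toy embedding table, one LSTM module per edge type, and a shared output matrix `W_ho`); it only illustrates dispatching a path through the four edge-type modules and summing log-probabilities over a BFS traversal, not the authors' implementation.

```python
import numpy as np

# Hypothetical setup: one LSTM module per edge type, a shared embedding table,
# and a shared output matrix W_ho (all sizes and names are illustrative).
EDGE_TYPES = ["GEN-L", "GEN-NX-L", "GEN-R", "GEN-NX-R"]
V, E, H = 1000, 32, 64                     # toy vocabulary, embedding, hidden sizes
rng = np.random.default_rng(0)

embed = rng.normal(scale=0.1, size=(V, E)) # shared word embeddings
W_ho = rng.normal(scale=0.1, size=(V, H))  # shared output matrix

def make_lstm_params():
    return (rng.normal(scale=0.1, size=(4 * H, E)),   # input weights
            rng.normal(scale=0.1, size=(4 * H, H)),   # recurrent weights
            rng.normal(scale=0.1, size=(4 * H,)))     # biases

modules = {t: make_lstm_params() for t in EDGE_TYPES} # four separate LSTMs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, x, h, c):
    """One step of the edge-type module selected for the current edge."""
    W, U, b = params
    i, f, o, g = np.split(W @ x + U @ h + b, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c), c

def path_state(path):
    """Encode the path D(w) as a sequence of (edge_type, parent_word_id) hops."""
    h, c = np.zeros(H), np.zeros(H)
    for edge_type, word_id in path:
        h, c = lstm_step(modules[edge_type], embed[word_id], h, c)
    return h

def log_prob_word(word_id, path):
    """log P(w | D(w)) via the shared softmax layer y_t = W_ho h_t."""
    y = W_ho @ path_state(path)
    y -= y.max()                                      # numerical stability
    return y[word_id] - np.log(np.exp(y).sum())

def log_prob_tree(bfs_words_and_paths):
    """log P(S, T): sum over non-ROOT words in BFS order of log P(w | D(w))."""
    return sum(log_prob_word(w, p) for w, p in bfs_words_and_paths)

# Toy tree with ROOT = word id 0: w=5 is ROOT's first right child, and
# w=7 is the first left child of w=5.
toy = [(5, [("GEN-R", 0)]),
       (7, [("GEN-R", 0), ("GEN-L", 5)])]
print(log_prob_tree(toy))
```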

2. Left–Right Dependent Correlations and LdTreeLSTM

Standard TreeLSTM models left and right subtrees independently. To capture correlations between left and right children, LdTreeLSTM introduces an additional LSTM (Ld) that sweeps over the left dependents from furthest to nearest. When generating a right child, this sweep produces a summary vector $q_K$ encoding all previously generated left dependents, which is passed to the right-child generator. The input to the Gen-R LSTM is then the concatenation of the usual embedding and $q_K$, explicitly allowing the right-child LSTM to condition on the sequence of left siblings (Zhang et al., 2015).
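
The sketch below illustrates only the data flow of this idea: simplified tanh recurrences stand in for the Ld and Gen-R LSTM modules, and the weight names (`W_ld`, `W_genr`) are invented for the example.

```python
import numpy as np

# Data-flow sketch only: tanh recurrences stand in for the Ld and Gen-R LSTMs.
E, H = 32, 64
rng = np.random.default_rng(1)
W_ld = rng.normal(scale=0.1, size=(H, E + H))         # "Ld" sweep over left dependents
W_genr = rng.normal(scale=0.1, size=(H, E + H + H))   # Gen-R additionally sees q_K

def ld_summary(left_dependent_embs):
    """q_K: sweep the left dependents from furthest to nearest."""
    q = np.zeros(H)
    for x in left_dependent_embs:                     # assumed ordered furthest -> nearest
        q = np.tanh(W_ld @ np.concatenate([x, q]))
    return q

def gen_right(parent_emb, prev_h, left_dependent_embs):
    """Right-child generation conditioned on [embedding ; previous state ; q_K]."""
    q_K = ld_summary(left_dependent_embs)
    return np.tanh(W_genr @ np.concatenate([parent_emb, prev_h, q_K]))

left_deps = [rng.normal(size=E) for _ in range(3)]
h_right = gen_right(rng.normal(size=E), np.zeros(H), left_deps)
print(h_right.shape)                                  # (64,)
```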

This architectural enhancement yields improved accuracy on tasks where such left–right dependencies are critical.

3. Tensor-Train Factorization: TT-LSTM

TT-LSTM addresses the computational inefficiency and memory demands of deep and wide LSTM models. It factorizes each weight matrix $W \in \mathbb{R}^{H \times (H+D)}$ for the four gates in an LSTM cell using the Tensor-Train (TT) format. The row and column dimensions are reshaped into products over $N$ factors ($H = \prod_k I_k$, $H + D = \prod_k J_k$), and $W$ is approximated using $N$ low-order core tensors $G_k$:

$$\widehat{W}(i_1, \ldots, i_N; j_1, \ldots, j_N) = \sum_{r_0, \ldots, r_N} \prod_{k=1}^{N} G_k[i_k, j_k, r_{k-1}, r_k]$$

where $r_0 = r_N = 1$ and the $r_k$ are the TT ranks (Samui et al., 2018).
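
The following sketch reconstructs a TT-factorized matrix entrywise from its core tensors, using illustrative factorization dimensions and ranks rather than the configuration from the paper; in practice the full matrix is never materialized, and the gate pre-activations are computed directly by TT-matrix multiplication.

```python
import numpy as np

# Entrywise TT reconstruction with illustrative factorization dimensions and ranks.
rng = np.random.default_rng(0)

I = [4, 4, 4]            # prod(I) = 64  -> rows, playing the role of H
J = [4, 4, 5]            # prod(J) = 80  -> cols, playing the role of H + D
r = [1, 3, 3, 1]         # TT ranks, with r_0 = r_N = 1

# Core tensors G_k of shape (I_k, J_k, r_{k-1}, r_k).
G = [rng.normal(scale=0.1, size=(I[k], J[k], r[k], r[k + 1]))
     for k in range(len(I))]

def tt_entry(i_idx, j_idx):
    """W_hat(i_1..i_N; j_1..j_N) as a chain of r_{k-1} x r_k matrix products."""
    m = np.ones((1, 1))
    for k, (ik, jk) in enumerate(zip(i_idx, j_idx)):
        m = m @ G[k][ik, jk]
    return m[0, 0]

def reconstruct():
    """Materialize the full prod(I) x prod(J) matrix (for inspection only)."""
    W = np.zeros((np.prod(I), np.prod(J)))
    for row in range(W.shape[0]):
        for col in range(W.shape[1]):
            W[row, col] = tt_entry(np.unravel_index(row, I),
                                   np.unravel_index(col, J))
    return W

full_params = np.prod(I) * np.prod(J)                                   # 5,120
tt_params = sum(I[k] * J[k] * r[k] * r[k + 1] for k in range(len(I)))   # 252
print(reconstruct().shape, full_params, tt_params)
```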

The resultant TT-LSTM layer dramatically reduces parameter counts: for TT rank 4 and an $N = 3$ decomposition with typical LSTM configurations, the parameter count per layer drops by more than 99% compared to the uncompressed LSTM (e.g., 10,264 vs. 2,623,488 for one layer). TT-LSTM retains the standard gate equations, with the linear transformations implemented via TT-matrix multiplication, preserving the original LSTM dynamics while providing orders-of-magnitude savings in memory.

4. Tensorized LSTM (tLSTM): Hidden State Tensorization and Cross-Layer Convolution

tLSTM represents each hidden state not as a vector but as a higher-dimensional tensor (e.g., $H_t \in \mathbb{R}^{P \times M}$ in the 2D case), with $P$ as the width (or depth/time) dimension and $M$ as the channel dimension. Temporal computation leverages cross-layer convolution kernels shared across $P$, enabling efficient network widening without parameter growth. Inputs are concatenated along a new axis at each timestep, and cross-layer convolutions (with kernels such as $W^h \in \mathbb{R}^{K \times M^{i} \times M^{o}}$) operate along the width/depth axis.
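
A minimal sketch of such a cross-layer convolution is shown below, assuming a 2D hidden tensor of shape (P, M) and zero-padding along the width/depth axis; the padding and all sizes are illustrative rather than taken from the paper.

```python
import numpy as np

# Cross-layer convolution sketch on a tensorized hidden state H_t of shape (P, M),
# with a kernel of shape (K, M_in, M_out) shared across all positions p.
rng = np.random.default_rng(0)
P, M_in, M_out, K = 6, 16, 16, 3

H_t = rng.normal(size=(P, M_in))                 # hidden tensor at time t
W_h = rng.normal(scale=0.1, size=(K, M_in, M_out))

def cross_layer_conv(H, W):
    """Convolve along the width/depth axis P; kernel sharing across p means
    widening P adds no parameters."""
    K, _, M_out = W.shape
    pad = K // 2
    H_pad = np.pad(H, ((pad, pad), (0, 0)))      # zero-pad along P only
    out = np.zeros((H.shape[0], M_out))
    for p in range(H.shape[0]):
        window = H_pad[p:p + K]                  # (K, M_in) slice around position p
        out[p] = np.einsum('km,kmo->o', window, W)
    return out

A_t = cross_layer_conv(H_t, W_h)
print(A_t.shape)                                 # (P, M_out): gates are computed from this
```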

Deep computation is realized implicitly by delaying the output through the tensor slices, merging depth into time. For an effective depth of $L$ layers, the output at time $t$ is read from the “bottom” slice $H_{t+L-1}$, resulting in $O(T+L)$ runtime for a sequence of length $T$ rather than the $O(TL)$ of stacked LSTM layers (He et al., 2017).
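
The schedule itself can be illustrated with a toy update rule (not the actual tLSTM cell): predictions lag the inputs by L-1 steps and are read from the bottom slice, so a length-T sequence is processed in T+L-1 steps.

```python
import numpy as np

# Toy illustration of "depth merged into time": with effective depth L, the
# prediction for input x_t is read at step t + L - 1, so T inputs take
# T + L - 1 steps rather than T * L. The update rule here is a stand-in only.
rng = np.random.default_rng(0)
T, L, P, M = 8, 3, 4, 5
xs = rng.normal(size=(T, M))

H = np.zeros((P, M))                                 # tensorized hidden state
outputs = []
for step in range(T + L - 1):
    x = xs[step] if step < T else np.zeros(M)        # feed zeros after the input ends
    H = np.roll(H, 1, axis=0)                        # stand-in: shift information downward
    H[0] = np.tanh(x + H[0])
    if step >= L - 1:                                # output for x_{step-L+1} is ready
        outputs.append(H[-1].copy())                 # read from the "bottom" slice

print(len(outputs))   # == T: one delayed output per input, in T + L - 1 steps
```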

All LSTM gating computations are performed on tensor-valued activations, including a dynamic memory-cell convolution parameterized by an additional “memory-kernel” module.

5. Model Compression and Computational Efficiency

Both TT-LSTM and tLSTM provide distinct methods for increasing modeling capacity or reducing resource footprints:

  • TT-LSTM: The parameter count per gate is proportional to $\sum_k I_k J_k r_{k-1} r_k$ instead of $H(H+D)$. With modest TT ranks and factorization dimensions, typical LSTM models can be compressed to 0.4–0.5% of their original size (Table 1); see the worked example after the table below.
  • tLSTM: Network width increases by growing the hidden tensor’s spatial sides ($P$), while the parameter count remains constant due to kernel sharing. Deepening is achieved by output delay, not by stacking, thus maintaining constant sequential runtime.
| Architecture | Parameter Efficiency | Widening/Deepening Mechanism |
|---|---|---|
| TT-LSTM | Reduction of >99% | TT-matrix decomposition of weight matrices |
| tLSTM | Reduced via kernel sharing | Cross-layer convolution; widening via tensor dimensions |
| TreeLSTM | -- | Four LSTMs for tree edge types; tree-structured path state |
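
As a worked example of the per-gate arithmetic (with a hypothetical configuration, not the exact setup reported in the papers), the snippet below compares $\sum_k I_k J_k r_{k-1} r_k$ against $H(H+D)$:

```python
# Hypothetical per-gate configuration: H = 512, D = 512, factorized as
# I = (8, 8, 8), J = (8, 16, 8), with TT ranks r = (1, 4, 4, 1).
I, J, r = (8, 8, 8), (8, 16, 8), (1, 4, 4, 1)

full_per_gate = 512 * (512 + 512)                                   # H * (H + D) = 524,288
tt_per_gate = sum(I[k] * J[k] * r[k] * r[k + 1] for k in range(3))  # 2,560

print(tt_per_gate / full_per_gate)  # ~0.0049, i.e. roughly half a percent of the original
```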

6. Empirical Evaluation and Applications

TreeLSTM and LdTreeLSTM architectures provide improved performance on both generative language modeling and dependency parsing reranking. For the MSR sentence completion challenge, TreeLSTM (hidden size $d = 400$) achieved 56.7% accuracy, with LdTreeLSTM reaching 60.67%, outperforming a standard large LSTM (57.0%) and previous log-bilinear models (54.7%). In dependency parsing reranking on the PTB dev/test sets, TreeLSTM provided small but measurable improvements over strong MSTParser baselines (UAS = 91.79%; LdTreeLSTM: UAS = 91.99%) (Zhang et al., 2015).

TT-LSTM achieves nearly equivalent speech enhancement quality (PESQ/STOI metrics) to full-capacity LSTMs in monaural speech enhancement while being roughly 200× smaller (e.g., 32,760 vs. 6.9 million parameters) (Samui et al., 2018). tLSTM matches or exceeds the performance of comparably sized state-of-the-art RNNs on standard sequence learning tasks, including character-level language modeling, algorithmic addition/memorization, and sequential MNIST, without additional parameter or runtime overhead (He et al., 2017).

7. Practical Considerations and Optimization Guidelines

For TT-LSTM, the TT-rank selection ($r_k$) and factorization dimensions ($I_k$, $J_k$) govern the compression-performance tradeoff. Training follows the standard LSTM pipeline (BPTT, Adam/SGD, batch normalization, dropout), with no TT-specific regularizer required.

tLSTM models require careful tuning of the spatial dimensions ($P$ or $P_1, P_2$), kernel size ($K$), and delay parameter (typically $L-1$). Channel normalization proved critical for stability. Delay ensures full temporal coverage in deep computation, and dynamic memory-cell convolution is necessary for maintaining gradient flow and capacity.
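
A minimal sketch of channel normalization follows, under the assumption that each location's channel vector is normalized to zero mean and unit variance with a learned per-channel gain and bias (the paper's exact formulation may differ):

```python
import numpy as np

# Channel normalization sketch on a (P, M) hidden tensor: each location p has
# its M-channel vector normalized, with learned gain and bias (assumed form).
def channel_norm(H, gain, bias, eps=1e-5):
    mean = H.mean(axis=-1, keepdims=True)            # per-location channel mean
    var = H.var(axis=-1, keepdims=True)              # per-location channel variance
    return gain * (H - mean) / np.sqrt(var + eps) + bias

P, M = 4, 16
H = np.random.default_rng(0).normal(size=(P, M))
gain, bias = np.ones(M), np.zeros(M)
print(channel_norm(H, gain, bias).std(axis=-1))      # ~1 for every location
```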

TreeLSTM models with large vocabularies use Noise Contrastive Estimation for efficient softmax training. Regularization via gradient clipping and dropout is recommended.
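
A hedged sketch of an NCE-style objective for the output softmax is shown below; the noise distribution, the number of noise samples, and all variable names are illustrative assumptions rather than the configuration used by Zhang et al. (2015).

```python
import numpy as np

# NCE sketch: instead of normalizing over the full vocabulary, the true word
# is discriminated against k noise samples drawn from a noise distribution.
rng = np.random.default_rng(0)
V, H, k = 1000, 64, 10
W_ho = rng.normal(scale=0.1, size=(V, H))
noise_probs = np.full(V, 1.0 / V)                    # stand-in noise distribution

def nce_loss(h, target):
    """Binary logistic loss over {true word} + k noise words, with scores
    shifted by log(k * q(w)) as in standard NCE."""
    noise = rng.choice(V, size=k, p=noise_probs)
    words = np.concatenate([[target], noise])
    scores = W_ho[words] @ h - np.log(k * noise_probs[words])
    labels = np.concatenate([[1.0], np.zeros(k)])
    probs = 1.0 / (1.0 + np.exp(-scores))
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).sum()

print(nce_loss(rng.normal(size=H), target=42))
```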


Tree- and tensor-structured LSTM variants robustly extend LSTM capabilities in both structure and scale. This enables efficient modeling of hierarchical syntax, significant model compression, and scalable widening/deepening for sequence learning without prohibitive computational costs (Zhang et al., 2015, Samui et al., 2018, He et al., 2017).
