Intrinsic Sparse Structures in LSTM
- Intrinsic Sparse Structures (ISS) are defined as groups of weights aligned with each hidden state dimension in LSTMs to maintain dimension consistency.
- The ISS method employs group Lasso regularization and groupwise pruning to remove entire hidden dimensions, optimizing model size and efficiency.
- Empirical evaluations show that ISS achieves significant computational speedups and memory savings while preserving predictive performance on benchmarks.
Intrinsic Sparse Structures (ISS) are a principled framework for learning structurally sparse Long Short-Term Memory (LSTM) and related Recurrent Neural Network (RNN) architectures. ISS operates by identifying and pruning groups of weights and connections aligned with each hidden dimension, ensuring the “dimension consistency” required for valid recurrent computation. The approach transforms large, over-parameterized RNNs into smaller models with comparable predictive behavior and substantial computational and memory efficiency gains. ISS is operationalized via group Lasso regularization and subsequent groupwise pruning, achieving real-world speedups with minimal loss in predictive performance (Wen et al., 2017).
1. Formal Definition and Embedding of ISS in LSTM
ISS are defined as groups of weights and connections in LSTM units that are uniquely aligned with specific coordinates of the hidden state. For a hidden size $n$ and each dimension $k \in \{1, \dots, n\}$, the $k$-th ISS component consists of all columns indexed by $k$ in the gate weight matrices (the input-to-hidden and hidden-to-hidden matrices producing the $k$-th coordinates of the gates and of the candidate update), the $k$-th rows of the hidden-to-hidden matrices (which consume $h_{t-1,k}$), the $k$-th bias entries, as well as the $k$-th rows of any downstream weight matrices that read $h_{t,k}$. Removing an ISS component, i.e., simultaneously setting all of these columns and rows to zero, decreases the size of all LSTM gates, the candidate vector, and the hidden and cell states by one, and ensures dimension consistency throughout the model.
Dimension consistency is mandatory in LSTM computation: all gates ($i_t$, $f_t$, $o_t$), the candidate update ($u_t$), the cell state ($c_t$), and the output state ($h_t$) must have matching dimensionality. ISS ensures this invariance by defining sparsity at the hidden-dimension level rather than at the granularity of individual weights.
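The grouping above can be sketched in plain Python, under the convention $y = xW$ (columns of a matrix produce output coordinates, rows consume input coordinates); the matrix names `W_xi`, `W_hi`, `W_out`, etc. are illustrative, not part of the original formulation:

```python
def iss_component(k):
    """Index triples (matrix_name, axis, index) belonging to the k-th ISS.

    axis "col" marks slices that produce coordinate k; axis "row" marks
    slices that consume it. Matrix names are hypothetical placeholders.
    """
    gates = ["i", "f", "o", "u"]  # input/forget/output gates and candidate u
    group = []
    for g in gates:
        group.append((f"W_x{g}", "col", k))  # input-to-hidden: produces gate[k]
        group.append((f"W_h{g}", "col", k))  # hidden-to-hidden: produces gate[k]
        group.append((f"W_h{g}", "row", k))  # hidden-to-hidden: consumes h[k]
        group.append((f"b_{g}", "elt", k))   # bias entry for coordinate k
    group.append(("W_out", "row", k))        # downstream reader of h[k]
    return group
```

Zeroing every slice listed for a given `k` is exactly what makes the later removal of dimension `k` dimension-consistent.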
2. Training Objective: Group-Lasso Regularization
The ISS learning objective augments the task-specific loss $E(W)$ with a group Lasso penalty over ISS groups:

$$\min_{W} \; E(W) + \lambda \sum_{g=1}^{G} \left\| w^{(g)} \right\|_2,$$

where $w^{(g)}$, $g = 1, \dots, G$, are the ISS groups (one per hidden dimension), and $\lambda$ balances sparsity versus task performance. Each group aggregates all parameters producing or consuming the $g$-th hidden coordinate. This formulation induces sparsity at the hidden-dimension level, driving entire ISS components toward zero during training.
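A minimal sketch of the penalty term, with each ISS group flattened into a plain list of weights:

```python
import math

def group_lasso_penalty(groups, lam):
    """Group Lasso penalty: lam * sum over groups g of ||w_g||_2.

    `groups` is a list of flat weight lists, one per hidden dimension;
    `lam` is the sparsity/accuracy trade-off coefficient.
    """
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)
```

Because the L2 norm of a group is non-differentiable only at the all-zero point, gradient-based training drives whole groups to exactly zero rather than scattering small weights.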
3. Optimization Algorithm and Emergence of Groupwise Sparsity
Optimization is accomplished with mini-batch stochastic gradient descent, modified to incorporate the subgradient of the group Lasso term. Specifically, for each group $w^{(g)}$:

$$w^{(g)} \leftarrow w^{(g)} - \eta \left( \nabla_{w^{(g)}} E + \lambda \, \frac{w^{(g)}}{\left\| w^{(g)} \right\|_2} \right),$$

where $\eta$ is the learning rate. To maintain numerical stability as $\|w^{(g)}\|_2 \to 0$, a small $\epsilon$ is added under the square root of the norm if needed. After each update (or every several updates), any group with $\|w^{(g)}\|_2$ below a threshold $\tau$ is explicitly set to zero, yielding hard pruning of the corresponding ISS component. Pruned ISS components can be safely excised from the model without violating dimensional constraints.
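The per-group update and thresholding rule can be sketched as follows; the hyperparameter names (`lr`, `lam`, `eps`, `tau`) and their default values are illustrative:

```python
import math

def iss_sgd_step(w_g, grad_g, lr, lam, eps=1e-8, tau=1e-4):
    """One SGD step on a single ISS group with the group-Lasso subgradient,
    followed by hard thresholding of the whole group.

    w <- w - lr * (grad + lam * w / sqrt(||w||^2 + eps)); zero if ||w|| < tau.
    """
    norm = math.sqrt(sum(w * w for w in w_g) + eps)  # eps avoids division by 0
    updated = [w - lr * (g + lam * w / norm) for w, g in zip(w_g, grad_g)]
    if math.sqrt(sum(w * w for w in updated)) < tau:  # prune the whole group
        updated = [0.0] * len(updated)
    return updated
```

Note that already-zeroed groups stay at zero: their task gradient contribution is the only force that could revive them, and the thresholding step suppresses any tiny drift.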
4. Transitioning from Learned ISS to Structurally Sparse LSTMs
Following groupwise-sparse training, the LSTM (or RHN) can be transformed into a smaller model:
- Identify ISS groups with zero norm.
- Remove the corresponding hidden dimensions from state vectors.
- Excise associated columns from all relevant gate/input weight matrices, and rows from any downstream matrices consuming $h_k$.
- Adjust the remaining architectural hyperparameters (bias vectors, the recorded hidden size).
This procedure maintains functional integrity, as removal is always dimensionally consistent across the recurrent architecture. No manual interventions for correcting tensor dimensions are required.
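The column/row excision step can be illustrated on plain lists-of-lists; `W_produce` stands for any matrix whose columns produce hidden coordinates and `W_consume` for a downstream matrix whose rows read them (both names are hypothetical):

```python
def prune_hidden_dims(W_produce, W_consume, dead):
    """Remove pruned hidden dimensions from a producing matrix (drop columns)
    and a downstream consuming matrix (drop rows).

    `dead` is the set of hidden indices whose ISS groups were zeroed.
    """
    keep = [k for k in range(len(W_consume)) if k not in dead]
    W_produce = [[row[k] for k in keep] for row in W_produce]  # drop columns
    W_consume = [W_consume[k] for k in keep]                   # drop rows
    return W_produce, W_consume
```

Since the same `keep` index list drives both sides, the producing matrix's column count and the consuming matrix's row count shrink in lockstep, which is the dimension-consistency guarantee in miniature.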
5. Empirical Results and Trade-Offs
ISS enables substantial compression and acceleration while preserving predictive accuracy, as demonstrated by several benchmarks:
| Model/Dataset | Baseline | ISS-Compressed | Inference Speedup | Accuracy Impact |
|---|---|---|---|---|
| Penn Treebank (LSTM LM) | 2×1500-hidden LSTM<br>66.0M params<br>78.6 perplexity | 373/315 hidden<br>21.8M params<br>78.7 perplexity | 10.6× | None (or negligible) |
| Recurrent Highway Network (Penn Treebank) | Width 830<br>23.5M params<br>65.4 perplexity | Width 517<br>11.1M params<br>65.4 perplexity | ≈2× (parameter reduction) | None |
| SQuAD/BiDAF | LSTMs of size 100<br>2.69M weights | Pruned to 52/12 dims<br>1.01M weights | 2× | Slight drop in EM/F1 |
The baseline column reports initial model sizes; learning ISS from scratch versus applying it with fine-tuning to a pretrained model yields further size/accuracy trade-offs.
Real speedup exceeds theoretical multiply-add reduction due to improved cache locality when matrix shapes shrink. Directly designing smaller dense models from scratch (e.g., 373/315 hidden LSTM) yields worse accuracy compared to ISS-pruned-from-large.
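As a sanity check on the baseline figure, a back-of-envelope parameter count for the 2×1500-hidden Penn Treebank model, assuming the standard 10k vocabulary and a 1500-dimensional embedding (exact embedding-tying and bias details are assumptions):

```python
# Rough parameter count for a 2-layer, 1500-hidden LSTM language model.
vocab, emb, hidden, layers = 10_000, 1_500, 1_500, 2

embedding = vocab * emb                      # input embedding table
# Each layer: 4 gate matrices over the [input; hidden] concatenation.
lstm = layers * 4 * hidden * (emb + hidden)
softmax = hidden * vocab + vocab             # output projection + bias
total = embedding + lstm + softmax
print(f"{total / 1e6:.1f}M parameters")      # ~66.0M, matching the table
```

The count lands at roughly 66.0M, consistent with the baseline row above.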
6. Implementation, Practical Considerations, and Limitations
- Hyperparameter tuning: $\lambda$ is chosen via validation. The pruning threshold $\tau$ is set as large as possible without degrading validation performance.
- Dropout: With regularization via group Lasso, dropout rates can often be decreased.
- Training loop modifications: Standard gradients are simply augmented by group Lasso subgradients; no complex solvers are required. Zeroing of small-norm groups after each mini-batch stabilizes resultant sparsity.
- Deployment: The pruned LSTM is a smaller dense recurrent model, compatible with standard BLAS/cuBLAS/Intel MKL routines for efficient inference.
- Limitations:
  - Hyperparameter tuning for $\lambda$ and $\tau$ is required.
  - For very small LSTMs (e.g., hidden size around 50), little sparsification is possible without loss of accuracy.
- Direct applicability is limited to RNN-style gating architectures; new grouping strategies are needed for Transformer or CNN-RNN hybrids.
- Potential extension directions include dynamic ISS (re-activating dimensions during training), automated per-layer penalization, or Bayesian/variational group sparsity to capture uncertainty in dimension necessity.
ISS constitutes a systematic methodology for groupwise pruning in gated RNN architectures, targeting hidden-state dimensionalities. By associating all relevant weights and connections per dimension, the method achieves automatic, structurally-consistent reduction in model size—yielding high computational efficiency and maintaining predictive performance across language modeling and question answering tasks (Wen et al., 2017).