VAMPIRE: Efficient Variational Pretraining
- VAMPIRE is a lightweight, semi-supervised text classification framework that pretrains a variational autoencoder on in-domain bag-of-words data to generate document features.
- It extracts features by combining intermediate layer activations with a topic vector, capturing both high-level semantics and detailed patterns for improved classification.
- Empirical results show VAMPIRE achieves competitive accuracy on low-resource benchmarks with fewer parameters and faster training than large-scale Transformer models.
VAMPIRE (VAriational Methods for Pretraining In Resource-limited Environments) is a lightweight, semi-supervised text classification framework that pretrains a variational autoencoder (VAE) on in-domain, unlabeled data to produce document features for downstream classification under data- and resource-constrained regimes. Unlike computationally intensive neural language models, VAMPIRE is designed for rapid, effective deployment where labeled data and compute are scarce, achieving strong empirical results on multiple benchmarks (Gururangan et al., 2019).
1. Unigram Document VAE: Generative Model, Inference, and ELBO
VAMPIRE’s core is a bag-of-words VAE formulated over document count vectors. Each document is represented as $\mathbf{x} \in \mathbb{N}^{|V|}$, with $|V|$ the vocabulary size. The generative process samples $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ from a spherical Gaussian prior, then applies a softmax to yield topic proportions $\boldsymbol{\theta} = \mathrm{softmax}(\mathbf{z})$. Word probabilities are computed via $p(w \mid \boldsymbol{\theta}) = \mathrm{softmax}(\mathbf{W}\boldsymbol{\theta} + \mathbf{b})$, and the likelihood factorizes over tokens: $p(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(w_i \mid \boldsymbol{\theta})$. The variational posterior is a diagonal Gaussian, $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2(\mathbf{x})))$, with parameters output by a multi-layer perceptron (MLP) encoder on $\mathbf{x}$. A sample is drawn with the reparameterization trick, $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The variational lower bound (ELBO) is $\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$. In practice, the expectation is approximated via a single sample.
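The generative model and single-sample ELBO estimate above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the dimensions, the single tanh encoder layer, and the random initializations are all assumptions for demonstration; pretraining would learn the weight matrices by maximizing this quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

V, H, K = 1000, 64, 50   # vocab size, encoder hidden units, latent dim (illustrative)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialized parameters (pretraining would optimize these).
W_enc = rng.normal(0, 0.05, (V, H))
W_mu  = rng.normal(0, 0.05, (H, K))
W_sig = rng.normal(0, 0.05, (H, K))
W_dec = rng.normal(0, 0.05, (K, V))   # topic-to-word logits

def elbo(x):
    """Single-sample ELBO estimate for one bag-of-words count vector x."""
    h = np.tanh(x @ W_enc)                     # encoder MLP (one hidden layer)
    mu, log_sigma = h @ W_mu, h @ W_sig        # diagonal Gaussian posterior params
    eps = rng.standard_normal(K)
    z = mu + np.exp(log_sigma) * eps           # reparameterization trick
    theta = softmax(z)                         # topic proportions
    log_p_w = np.log(softmax(theta @ W_dec))   # per-word log-probabilities
    rec = float(x @ log_p_w)                   # sum_i x_i * log p(w_i | theta)
    # Closed-form KL between diagonal Gaussian posterior and N(0, I) prior:
    kl = 0.5 * float(np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma))
    return rec - kl

x = rng.poisson(0.05, V).astype(float)         # synthetic word counts
print(elbo(x))
```

Since the reconstruction term is a sum of log-probabilities weighted by nonnegative counts and the KL term is nonnegative, the estimate is always nonpositive.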
2. VAE Feature Extraction for Downstream Classification
After pretraining, the VAE is frozen. Features for each document are extracted from both the encoder MLP’s intermediate activations $\mathbf{h}_1, \dots, \mathbf{h}_K$ and the final topic vector $\boldsymbol{\theta}$. The document embedding is a learnable weighted sum over these representations, $\mathbf{e}(\mathbf{x}) = \sum_{k} \lambda_k \mathbf{r}_k$ with $\mathbf{r}_k \in \{\mathbf{h}_1, \dots, \mathbf{h}_K, \boldsymbol{\theta}\}$, where the $\lambda_k$ are softmax-normalized scalars learned jointly with the downstream classifier. This combination captures both topic-level and generic feature structure for classification.
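A minimal sketch of this softmax-normalized weighted sum (the same "scalar mix" idea used by ELMo-style feature extraction). It assumes the encoder layers and the topic vector have been brought to a common dimensionality; the layer count, sizes, and values are illustrative only.

```python
import numpy as np

def scalar_mix(layers, s):
    """Weighted sum of layer representations with softmax-normalized scalars.

    `layers`: list of equal-length vectors (e.g. encoder hidden states plus
    the topic vector, assumed projected to one common size).
    `s`: unnormalized mixing weights, learned with the downstream classifier.
    """
    lam = np.exp(s - s.max())
    lam /= lam.sum()                              # softmax over the K scalars
    return sum(l * w for l, w in zip(layers, lam))

rng = np.random.default_rng(1)
layers = [rng.normal(size=8) for _ in range(3)]   # e.g. [h1, h2, theta]
s = np.zeros(3)                                   # equal weights before training
e = scalar_mix(layers, s)                         # uniform average in this case
```

With all mixing weights initialized to zero, the embedding starts as the plain average of the layers; training then shifts weight toward the most useful representations.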
3. Model Architecture and Training Protocols
Key hyperparameters in VAMPIRE include the vocabulary size, a latent dimension of up to $125$ (selected via random search), an encoder depth of up to $3$ layers, and hidden layers of up to $128$ units (tanh or ReLU activations). Decoder activations are linear prior to the softmax.
Regularization methods:
- Dropout: rate of $0.4$–$0.5$ applied to the latent document representation
- Batch normalization on reconstruction logits
- KL-annealing: KL term scheduled from $0$ to $1$ over 1,000 updates (linear/sigmoid)
Optimization uses Adam (learning rate selected via random search), batch size $64$, and up to $50$ epochs. Early stopping tracks topic coherence (NPMI) on held-out data (patience $= 5$ epochs), not marginal likelihood.
4. Classifier Integration and Semi-supervised Learning Regime
The downstream classifier is a sequence-to-vector encoder implemented as a Deep Averaging Network (DAN). DAN embeds tokens (random/GloVe/contextual), averages them, and applies a 1-layer MLP with dropout.
The final classifier input is the concatenation of the DAN’s averaged-embedding representation with the frozen VAMPIRE document embedding, passed through the DAN’s MLP for prediction. The loss is standard cross-entropy, supervised only on the labeled examples. The VAE remains frozen during classifier training. For semi-supervised learning, VAMPIRE leverages large in-domain unlabeled corpora for VAE pretraining, but does not employ iterative pseudo-labeling as in self-training baselines.
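A forward pass through this classifier can be sketched as follows. The embedding table, layer sizes, and random initializations are illustrative assumptions; in the real system the VAMPIRE features come from the frozen pretrained VAE rather than a random vector.

```python
import numpy as np

rng = np.random.default_rng(2)
E, F, H, C = 50, 16, 32, 4    # embed dim, VAMPIRE feature dim, MLP hidden, classes

emb = rng.normal(0, 0.1, (1000, E))      # token embedding table (random init here)
W1 = rng.normal(0, 0.1, (E + F, H))      # MLP over [averaged embeddings; VAE features]
W2 = rng.normal(0, 0.1, (H, C))

def classify(token_ids, vampire_feats):
    """DAN forward pass: average token embeddings, concatenate the frozen
    VAMPIRE features, apply one hidden layer, softmax over classes."""
    avg = emb[token_ids].mean(axis=0)                               # deep averaging
    h = np.maximum(0.0, np.concatenate([avg, vampire_feats]) @ W1)  # ReLU hidden layer
    logits = h @ W2
    p = np.exp(logits - logits.max())
    return p / p.sum()                                              # class probabilities

probs = classify(np.array([3, 17, 256]), rng.normal(size=F))
```

Because only `emb`, `W1`, and `W2` receive gradients in training, the pretrained document features act as a fixed auxiliary input, which keeps the supervised model small.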
5. Empirical Performance and Comparative Results
VAMPIRE demonstrates strong performance on low-resource classification tasks. With only $200$ labeled examples, it outperforms both a purely supervised DAN and a DAN with self-training across AG News, Yahoo, IMDB, and Toxic Tweets, with run-to-run standard deviations of $\pm2.0$ (AG News), $\pm0.8$ (Yahoo), $\pm0.6$ (IMDB), and $\pm0.9$ (Toxic Tweets); full accuracy tables are given in Gururangan et al. (2019).
With $500$ or more labels, VAMPIRE continues to outperform classical baselines and approaches the accuracy of large, fine-tuned contextual models such as ELMo/BERT on in-domain data. Model size is approximately $3.8$M parameters, training in about $7$ minutes on a single GPU, compared to $159$M parameters and hours of training for a Transformer LM on the same corpus.
6. Design Rationale and Distinguishing Characteristics
VAMPIRE’s advantages derive from:
- Spherical Gaussian latent prior with topic-based decoding via a softmax transformation
- Bag-of-words document encoding and unigram reconstruction, yielding interpretable, topic-like document representations
- Extraction of both high-level ($\boldsymbol{\theta}$) and hidden-layer ($\mathbf{h}_k$) encoder states as transferable document features
- KL-annealing and NPMI-based early stopping for robust pretraining without overfitting
- Semi-supervised protocol that avoids the complexity of iterative pseudo-labeling
This design enables significant gains in accuracy and computational efficiency under resource constraints (Gururangan et al., 2019).
7. Limitations and Future Directions
VAMPIRE is limited to bag-of-words document representations, and thus does not directly encode word order or syntactic phenomena. Its relative advantage diminishes when sufficient labeled data and compute permit pretraining and fine-tuning of large contextual models. Extension to richer document architectures and adaptation to languages where bag-of-words semantics are less informative remain plausible avenues for future research. The VAMPIRE codebase is made available by the original authors for further study and experimentation (Gururangan et al., 2019).