VAMPIRE: Efficient Variational Pretraining

Updated 26 November 2025
  • VAMPIRE is a lightweight, semi-supervised text classification framework that pretrains a variational autoencoder on in-domain bag-of-words data to generate document features.
  • It extracts features by combining intermediate layer activations with a topic vector, capturing both high-level semantics and detailed patterns for improved classification.
  • Empirical results show VAMPIRE achieves competitive accuracy on low-resource benchmarks with fewer parameters and faster training than large-scale Transformer models.

VAMPIRE (VAriational Methods for Pretraining In Resource-limited Environments) is a lightweight, semi-supervised text classification framework that pretrains a variational autoencoder (VAE) on in-domain, unlabeled data to produce document features for downstream classification under data- and resource-constrained regimes. Unlike computationally intensive large neural language models, VAMPIRE is designed for rapid, effective deployment where labeled data and compute are scarce, achieving strong empirical results on multiple benchmarks (Gururangan et al., 2019).

1. Unigram Document VAE: Generative Model, Inference, and ELBO

VAMPIRE's core is a bag-of-words VAE defined over document count vectors. Each document $x$ is represented by its counts $c \in \mathbb{N}^V$, where $V$ is the vocabulary size. The generative process samples $z \in \mathbb{R}^K$ from a spherical Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$, then applies a softmax to obtain topic proportions $\theta = \mathrm{softmax}(z) \in \Delta^K$. Word probabilities are computed as $\eta = \mathrm{softmax}(b + B\theta) \in \Delta^V$, and the likelihood factorizes as

$$\log p(x \mid z) = \sum_{j=1}^{V} c_j \log \eta_j.$$

The variational posterior $q(z \mid x)$ is a diagonal Gaussian whose parameters are produced by a multi-layer perceptron (MLP) encoder applied to $c$:

$$h = \mathrm{MLP}(c), \quad \mu = W_\mu h + b_\mu, \quad \log\sigma = W_\sigma h + b_\sigma, \quad \sigma = \exp(\log\sigma).$$

A sample $z = \mu + \sigma \odot \epsilon$ is drawn with $\epsilon \sim \mathcal{N}(0, I)$ (the reparameterization trick). The evidence lower bound (ELBO) is

$$\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right).$$

In practice, the expectation is approximated with a single Monte Carlo sample.
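The generative pass, reparameterized sampling, and single-sample ELBO above can be sketched numerically. The sizes and randomly initialized weights below are illustrative stand-ins for a trained model, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, H = 1000, 81, 128  # vocab size, latent topics, hidden units (illustrative)

# Hypothetical encoder/decoder weights; real weights come from training.
W1 = rng.normal(0, 0.01, (H, V)); b1 = np.zeros(H)
W_mu = rng.normal(0, 0.01, (K, H)); b_mu = np.zeros(K)
W_ls = rng.normal(0, 0.01, (K, H)); b_ls = np.zeros(K)
B = rng.normal(0, 0.01, (V, K)); b_dec = np.zeros(V)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def elbo(c):
    """Single-sample ELBO estimate for a count vector c of shape (V,)."""
    h = np.tanh(W1 @ c + b1)                      # encoder MLP
    mu = W_mu @ h + b_mu
    log_sigma = W_ls @ h + b_ls
    eps = rng.normal(size=K)
    z = mu + np.exp(log_sigma) * eps              # reparameterization trick
    theta = softmax(z)                            # topic proportions
    eta = softmax(b_dec + B @ theta)              # word probabilities
    log_px_z = c @ np.log(eta)                    # bag-of-words log-likelihood
    # KL(N(mu, sigma^2) || N(0, I)) in closed form
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma)
    return log_px_z - kl

c = rng.integers(0, 5, size=V).astype(float)      # toy bag-of-words counts
print(elbo(c))
```

The closed-form Gaussian KL avoids a second Monte Carlo estimate; only the reconstruction term is sampled.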

2. VAE Feature Extraction for Downstream Classification

After pretraining, the VAE is frozen. Features for each document are extracted from both the encoder MLP's intermediate activations $h^{(1)}, \ldots, h^{(n)}$ and the final topic vector $\theta$. The document embedding is a learnable weighted sum

$$r = \lambda_0 \theta + \sum_{k=1}^{n} \lambda_k h^{(k)},$$

where $\{\lambda_0, \ldots, \lambda_n\}$ are softmax-normalized scalar weights. This mixture captures both topic-level and generic feature structure for classification.
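The scalar-weighted combination can be sketched as follows, assuming (so the sum is well defined) that $\theta$ and all hidden layers share one dimension; `raw_lambdas` stands in for the unnormalized learnable scalars:

```python
import numpy as np

def vampire_features(theta, hidden_states, raw_lambdas):
    """Softmax-normalize the scalar weights, then mix theta and the
    encoder's hidden activations into one document embedding r."""
    lam = np.exp(raw_lambdas - np.max(raw_lambdas))  # stable softmax
    lam = lam / lam.sum()
    layers = [theta] + list(hidden_states)           # lambda_0 weights theta
    return sum(l * v for l, v in zip(lam, layers))
```

With all raw weights equal, the result is the plain average of $\theta$ and the hidden states; training the scalars lets the classifier emphasize whichever layer transfers best.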

3. Model Architecture and Training Protocols

Key hyperparameters in VAMPIRE include: vocabulary size $V = 30{,}000$; latent dimension $K \approx 80$–$125$ (selected by random search); encoder depth $n = 2$–$3$; hidden layers of $\approx 80$–$128$ units with tanh or ReLU activations. Decoder activations are linear prior to the softmax.

Regularization methods:

  • $z$-dropout: a dropout rate of $0.4$–$0.5$ applied to $z$
  • Batch normalization on the reconstruction logits
  • KL annealing: the KL term's weight is scheduled from $0$ to $1$ over $\sim 1{,}000$ updates (linear or sigmoid schedule)
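The annealing schedules above can be sketched as a simple weight function; the sigmoid's center and sharpness here are illustrative choices, not values from the paper:

```python
import math

def kl_weight(step, total=1000, schedule="linear"):
    """Weight on the KL term, annealed from 0 toward 1 over `total` updates."""
    t = min(step / total, 1.0)
    if schedule == "linear":
        return t
    # Sigmoid schedule: centered at total/2; slope 10 is an arbitrary choice.
    return 1.0 / (1.0 + math.exp(-10.0 * (t - 0.5)))
```

Starting the KL weight near zero lets the encoder learn informative posteriors before regularization toward the prior kicks in, mitigating posterior collapse.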

Optimization uses Adam (learning rate sampled in $[1 \times 10^{-4}, 1 \times 10^{-2}]$, typically $2 \times 10^{-4}$), batch size $64$, and up to $50$ epochs. Early stopping tracks topic coherence (NPMI) on held-out data with a patience of $5$ epochs, rather than marginal likelihood.
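NPMI topic coherence, the early-stopping criterion above, can be estimated from document-level co-occurrence counts. This is a generic sketch of the standard NPMI formula, not the authors' exact implementation:

```python
import math
from itertools import combinations

def npmi_coherence(top_words, doc_sets, eps=1e-12):
    """Average NPMI over word pairs from a topic's top words.
    doc_sets: one set of word types per reference document."""
    n = len(doc_sets)
    def p(*ws):
        return sum(all(w in d for w in ws) for d in doc_sets) / n
    scores = []
    for wi, wj in combinations(top_words, 2):
        pij = p(wi, wj)
        if pij == 0:
            scores.append(-1.0)  # convention: unseen pair gets minimum NPMI
            continue
        pmi = math.log(pij / (p(wi) * p(wj) + eps))
        scores.append(pmi / (-math.log(pij)))  # normalize to [-1, 1]
    return sum(scores) / len(scores)
```

A pair of words that always co-occur scores $+1$; independent words score near $0$; words that never co-occur score $-1$, so higher averages indicate more coherent topics.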

4. Classifier Integration and Semi-supervised Learning Regime

The downstream classifier is a sequence-to-vector encoder $f_{\text{s2v}}(x)$ implemented as a Deep Averaging Network (DAN). The DAN embeds tokens (with random, GloVe, or contextual embeddings), averages them, and applies a one-layer MLP with dropout.

The final classifier input is the concatenation $[r;\ \mathrm{DAN}(x)]$, passed through the DAN's MLP for prediction. The loss is standard cross-entropy, supervised only on labeled examples $\mathcal{D}_L$. The VAE remains frozen during classifier training. For semi-supervised learning, VAMPIRE leverages large in-domain unlabeled corpora $\mathcal{D}_U$ for VAE pretraining, but does not employ iterative pseudo-labeling as in self-training baselines.
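A minimal sketch of the classifier's forward pass, combining frozen VAMPIRE features $r$ with the DAN's averaged token embeddings; weight shapes and the ReLU choice are illustrative, and dropout is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def dan_classify(token_embs, r, W1, b1, W2, b2):
    """One-layer-MLP DAN over the concatenated input [r; mean(embeddings)]."""
    avg = token_embs.mean(axis=0)          # average the token embeddings
    x = np.concatenate([r, avg])           # classifier input [r; DAN(x)]
    h = np.maximum(0.0, W1 @ x + b1)       # hidden layer (ReLU)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # softmax over classes

# Toy sizes: 5 tokens, 8-dim embeddings, 6-dim r, 16 hidden units, 3 classes.
embs = rng.normal(size=(5, 8))
r = rng.normal(size=6)
W1, b1 = rng.normal(size=(16, 14)), np.zeros(16)
W2, b2 = rng.normal(size=(3, 16)), np.zeros(3)
probs = dan_classify(embs, r, W1, b1, W2, b2)
```

Only `W1`, `b1`, `W2`, `b2` (and the embeddings) receive gradients during classifier training; $r$ comes from the frozen VAE.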

5. Empirical Performance and Comparative Results

VAMPIRE demonstrates strong performance on low-resource classification tasks. Test accuracies with only $200$ labeled examples are:

Dataset        VAMPIRE         DAN (sup.)   DAN + self-training
AG News        82.2% (±2.0)    68.5%        73.8%
Yahoo          74.1% (±0.8)    67.7%        68.5%
IMDB           83.9% (±0.6)    68.8%        77.3%
Toxic Tweets   59.9% (±0.9)    54.5%        57.5%

With $500$–$2{,}500$ labels, VAMPIRE continues to outperform classical baselines and approaches the accuracy of large, fine-tuned contextual models such as ELMo and BERT on in-domain data. The model has approximately $3.8$M parameters and trains in about $7$ minutes on a single GPU, compared with $159$M parameters and $>12$ hours for a Transformer language model on the same corpus.

6. Design Rationale and Distinguishing Characteristics

VAMPIRE’s advantages derive from:

  • Spherical Gaussian latent prior with topic-based decoding via a softmax transformation
  • Bag-of-words document encoding and unigram reconstruction, yielding interpretable, topic-like document representations
  • Extraction of both high-level ($\theta$) and hidden-layer ($h^{(k)}$) encoder states as transferable document features
  • KL-annealing and NPMI-based early stopping for robust pretraining without overfitting
  • Semi-supervised protocol that avoids the complexity of iterative pseudo-labeling

This design enables significant gains in accuracy and computational efficiency under resource constraints (Gururangan et al., 2019).

7. Limitations and Future Directions

VAMPIRE is limited to bag-of-words document representations, and thus does not directly encode word order or syntactic phenomena. Its advantage diminishes when labeled data and compute suffice to pretrain and fine-tune large contextual models. Extensions to richer document architectures and adaptation to languages where bag-of-words semantics are less clear remain plausible avenues for future research. The VAMPIRE codebase is available from the original authors for further study and experimentation (Gururangan et al., 2019).
