VAMPIRE: Efficient Variational Pretraining
- VAMPIRE is a lightweight, semi-supervised text classification framework that pretrains a variational autoencoder on in-domain bag-of-words data to generate document features.
- It extracts features by combining intermediate layer activations with a topic vector, capturing both high-level semantics and detailed patterns for improved classification.
- Empirical results show VAMPIRE achieves competitive accuracy on low-resource benchmarks with fewer parameters and faster training than large-scale Transformer models.
VAMPIRE (VAriational Methods for Pretraining In Resource-limited Environments) is a lightweight, semi-supervised text classification framework that pretrains a variational autoencoder (VAE) on in-domain, unlabeled data to produce document features for downstream classification under data- and resource-constrained regimes. Unlike computationally intensive neural language models, VAMPIRE is designed for rapid, effective deployment where labeled data and compute are scarce, achieving strong empirical results on multiple benchmarks (Gururangan et al., 2019).
1. Unigram Document VAE: Generative Model, Inference, and ELBO
VAMPIRE’s core is a bag-of-words VAE formulated over document count vectors. Each document is represented as $\mathbf{x} \in \mathbb{N}^{|V|}$, with $|V|$ the vocabulary size. The generative process samples $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ from a spherical Gaussian prior, then applies a softmax to yield topic proportions $\boldsymbol{\theta} = \mathrm{softmax}(\mathbf{z})$. Word probabilities are computed via $p(w \mid \boldsymbol{\theta}) = \mathrm{softmax}(\mathbf{W}\boldsymbol{\theta} + \mathbf{b})$, and the likelihood factorizes over tokens: $p(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(w_i \mid \boldsymbol{\theta})$. The variational posterior is a diagonal Gaussian, $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2(\mathbf{x})))$, with parameters output by a multi-layer perceptron (MLP) encoder on $\mathbf{x}$. A sample is drawn with the reparameterization trick, $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The variational lower bound (ELBO) is $\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$. In practice, the expectation is approximated via a single sample.
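The generative model and single-sample ELBO estimate above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the dimensions, the single tanh encoder layer, and the random initializations are all assumptions for demonstration; pretraining would learn the weight matrices by maximizing this quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

V, H, K = 1000, 64, 50   # vocab size, encoder hidden units, latent dim (illustrative)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialized parameters (pretraining would optimize these).
W_enc = rng.normal(0, 0.05, (V, H))
W_mu  = rng.normal(0, 0.05, (H, K))
W_sig = rng.normal(0, 0.05, (H, K))
W_dec = rng.normal(0, 0.05, (K, V))   # topic-to-word logits

def elbo(x):
    """Single-sample ELBO estimate for one bag-of-words count vector x."""
    h = np.tanh(x @ W_enc)                     # encoder MLP (one hidden layer)
    mu, log_sigma = h @ W_mu, h @ W_sig        # diagonal Gaussian posterior params
    eps = rng.standard_normal(K)
    z = mu + np.exp(log_sigma) * eps           # reparameterization trick
    theta = softmax(z)                         # topic proportions
    log_p_w = np.log(softmax(theta @ W_dec))   # per-word log-probabilities
    rec = float(x @ log_p_w)                   # sum_i x_i * log p(w_i | theta)
    # Closed-form KL between diagonal Gaussian posterior and N(0, I) prior:
    kl = 0.5 * float(np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma))
    return rec - kl

x = rng.poisson(0.05, V).astype(float)         # synthetic word counts
print(elbo(x))
```

Since the reconstruction term is a sum of log-probabilities weighted by nonnegative counts and the KL term is nonnegative, the estimate is always nonpositive.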
2. VAE Feature Extraction for Downstream Classification
After pretraining, the VAE is frozen. Features for each document are extracted from both the encoder MLP’s intermediate activations $\mathbf{h}_1, \dots, \mathbf{h}_K$ and the final topic vector $\boldsymbol{\theta}$. The document embedding is a learnable weighted sum over these representations, $\mathbf{e}(\mathbf{x}) = \sum_{k} \lambda_k \mathbf{r}_k$ with $\mathbf{r}_k \in \{\mathbf{h}_1, \dots, \mathbf{h}_K, \boldsymbol{\theta}\}$, where the $\lambda_k$ are softmax-normalized scalars learned jointly with the downstream classifier. This combination captures both topic-level and generic feature structure for classification.
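A minimal sketch of this softmax-normalized weighted sum (the same "scalar mix" idea used by ELMo-style feature extraction). It assumes the encoder layers and the topic vector have been brought to a common dimensionality; the layer count, sizes, and values are illustrative only.

```python
import numpy as np

def scalar_mix(layers, s):
    """Weighted sum of layer representations with softmax-normalized scalars.

    `layers`: list of equal-length vectors (e.g. encoder hidden states plus
    the topic vector, assumed projected to one common size).
    `s`: unnormalized mixing weights, learned with the downstream classifier.
    """
    lam = np.exp(s - s.max())
    lam /= lam.sum()                              # softmax over the K scalars
    return sum(l * w for l, w in zip(layers, lam))

rng = np.random.default_rng(1)
layers = [rng.normal(size=8) for _ in range(3)]   # e.g. [h1, h2, theta]
s = np.zeros(3)                                   # equal weights before training
e = scalar_mix(layers, s)                         # uniform average in this case
```

With all mixing weights initialized to zero, the embedding starts as the plain average of the layers; training then shifts weight toward the most useful representations.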
3. Model Architecture and Training Protocols
Key hyperparameters in VAMPIRE include the vocabulary size, a latent dimension of up to $125$ (selected via random search), an encoder depth of up to $3$ layers, and hidden layers of up to $128$ units (tanh or ReLU activations). Decoder activations are linear prior to the softmax.
Regularization methods:
- Dropout: rate of $0.4$–$0.5$ applied to the latent document representation
- Batch normalization on reconstruction logits
- KL-annealing: KL term scheduled from $0$ to $1$ over 1,000 updates (linear/sigmoid)
Optimization uses Adam (learning rate selected via random search), batch size $64$, and up to $50$ epochs. Early stopping tracks topic coherence (NPMI) on held-out data (patience $= 5$ epochs), not marginal likelihood.
4. Classifier Integration and Semi-supervised Learning Regime
The downstream classifier is a sequence-to-vector encoder implemented as a Deep Averaging Network (DAN). DAN embeds tokens (random/GloVe/contextual), averages them, and applies a 1-layer MLP with dropout.
The final classifier input is the concatenation of the DAN’s averaged-embedding representation with the frozen VAMPIRE document embedding, passed through the DAN’s MLP for prediction. The loss is standard cross-entropy, supervised only on the labeled examples. The VAE remains frozen during classifier training. For semi-supervised learning, VAMPIRE leverages large in-domain unlabeled corpora for VAE pretraining, but does not employ iterative pseudo-labeling as in self-training baselines.
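A forward pass through this classifier can be sketched as follows. The embedding table, layer sizes, and random initializations are illustrative assumptions; in the real system the VAMPIRE features come from the frozen pretrained VAE rather than a random vector.

```python
import numpy as np

rng = np.random.default_rng(2)
E, F, H, C = 50, 16, 32, 4    # embed dim, VAMPIRE feature dim, MLP hidden, classes

emb = rng.normal(0, 0.1, (1000, E))      # token embedding table (random init here)
W1 = rng.normal(0, 0.1, (E + F, H))      # MLP over [averaged embeddings; VAE features]
W2 = rng.normal(0, 0.1, (H, C))

def classify(token_ids, vampire_feats):
    """DAN forward pass: average token embeddings, concatenate the frozen
    VAMPIRE features, apply one hidden layer, softmax over classes."""
    avg = emb[token_ids].mean(axis=0)                               # deep averaging
    h = np.maximum(0.0, np.concatenate([avg, vampire_feats]) @ W1)  # ReLU hidden layer
    logits = h @ W2
    p = np.exp(logits - logits.max())
    return p / p.sum()                                              # class probabilities

probs = classify(np.array([3, 17, 256]), rng.normal(size=F))
```

Because only `emb`, `W1`, and `W2` receive gradients in training, the pretrained document features act as a fixed auxiliary input, which keeps the supervised model small.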
5. Empirical Performance and Comparative Results
VAMPIRE demonstrates strong performance on low-resource classification tasks. With only $200$ labeled examples, it outperforms both a purely supervised DAN and a DAN with self-training across AG News, Yahoo, IMDB, and Toxic Tweets, with run-to-run standard deviations of $\pm2.0$ (AG News), $\pm0.8$ (Yahoo), $\pm0.6$ (IMDB), and $\pm0.9$ (Toxic Tweets); full accuracy tables are given in Gururangan et al. (2019).
With $500$ or more labels, VAMPIRE continues to outperform classical baselines and approaches the accuracy of large, fine-tuned contextual models such as ELMo/BERT on in-domain data. Model size is approximately $3.8$M parameters, training in about $7$ minutes on a single GPU, compared to $159$M parameters and hours of training for a Transformer LM on the same corpus.
6. Design Rationale and Distinguishing Characteristics
VAMPIRE’s advantages derive from:
- Spherical Gaussian latent prior with topic-based decoding via a softmax transformation
- Bag-of-words document encoding and unigram reconstruction, yielding interpretable, topic-like document representations
- Extraction of both high-level ($\boldsymbol{\theta}$) and hidden-layer ($\mathbf{h}_k$) encoder states as transferable document features
- KL-annealing and NPMI-based early stopping for robust pretraining without overfitting
- Semi-supervised protocol that avoids the complexity of iterative pseudo-labeling
This design enables significant gains in accuracy and computational efficiency under resource constraints (Gururangan et al., 2019).
7. Limitations and Future Directions
VAMPIRE is limited to bag-of-words document representations, and thus does not directly encode word order or syntactic phenomena. Its relative advantage diminishes when sufficient labeled data and compute permit pretraining and fine-tuning of large contextual models. Extension to richer document architectures and adaptation to languages where bag-of-words semantics are less informative remain plausible avenues for future research. The VAMPIRE codebase is made available by the original authors for further study and experimentation (Gururangan et al., 2019).