- The paper introduces a scalable conditional VAE that uses a Transformer encoder and autoregressive decoder with SELFIES for valid molecular generation.
- The model achieves strong performance on the MOSES and GuacaMol benchmarks by organizing a smooth, semantically rich latent space aligned with key chemical properties.
- It employs LoRA adapters for efficient fine-tuning, enabling property-guided conditional generation and robust adaptation to new molecular design tasks.
Introduction
The STAR-VAE framework introduces a scalable, conditional variational autoencoder architecture for molecular generation, leveraging a bidirectional Transformer encoder and an autoregressive Transformer decoder trained on SELFIES representations. The model is designed to address the challenges of learning broad chemical distributions, enabling property-guided conditional generation, and supporting efficient adaptation to new tasks with limited labeled data. The use of SELFIES ensures syntactic validity of generated molecules, while the latent-variable formalism provides a principled basis for conditional generation and smooth latent space organization.
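For concreteness, the snippet below shows how SELFIES round-tripping works using the open-source `selfies` package; the authors' exact preprocessing pipeline is not detailed here, so treat this as an illustrative sketch.

```python
# Minimal round-trip sketch using the open-source `selfies` package (the paper's
# exact preprocessing is not specified here; the molecule is an arbitrary example).
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin, as an illustrative input
selfies_str = sf.encoder(smiles)                  # SMILES -> SELFIES
tokens = list(sf.split_selfies(selfies_str))      # token stream fed to the model

# Any sequence of SELFIES tokens decodes to a syntactically valid molecule,
# which is why no post-generation validity filtering is required.
recovered = sf.decoder(selfies_str)
print(tokens[:5], recovered)
```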
Model Architecture and Training
STAR-VAE consists of three main components: a Transformer-based encoder, a latent bottleneck, and an autoregressive Transformer decoder. The encoder processes SELFIES token embeddings and outputs the mean and variance of a Gaussian latent distribution. The latent vector, sampled via the reparameterization trick, serves as a compact molecular representation and conditions the decoder. Both encoder and decoder utilize absolute positional encoding, with a maximum sequence length of 71 tokens, covering over 99% of drug-like molecules in the training set.
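A hedged PyTorch sketch of the encoder-to-latent path is shown below; the embedding width, pooling strategy, and module names are assumptions for illustration, while the layer count, head count, latent dimensionality, and maximum sequence length follow the figures reported in this section.

```python
# Hedged PyTorch sketch of the encoder-to-latent path (module names and the
# mean-pooling step are illustrative choices, not the authors' code).
import torch
import torch.nn as nn

class SelfiesEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=12, n_heads=8,
                 latent_dim=256, max_len=71):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)      # absolute positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, tokens):                              # tokens: (B, L)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok_emb(tokens) + self.pos_emb(positions))
        pooled = h.mean(dim=1)                              # pooling choice assumed
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar
```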
The model is trained on 79 million drug-like molecules curated from PubChem, filtered for pharmaceutical relevance. Training employs the Adam optimizer with a learning rate of 10⁻⁵ and a KL divergence term weighted by β = 1.1 to prevent posterior collapse. The architecture comprises 12 encoder and 12 decoder layers, 8 attention heads, and a latent dimensionality of 256, totaling 89.2 million parameters.
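A minimal sketch of the corresponding training objective, assuming the unconditional case with a standard-normal prior, a token-level cross-entropy reconstruction term, and the β-weighted KL term stated above (the padding-token id is a placeholder):

```python
# Minimal sketch of the training objective: token-level reconstruction plus the
# beta-weighted KL term (beta = 1.1 and lr = 1e-5 as stated above; pad_id is a placeholder).
import torch
import torch.nn.functional as F

def vae_loss(logits, targets, mu, logvar, beta=1.1, pad_id=0):
    # logits: (B, L, V), targets: (B, L); padded positions are ignored.
    recon = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)
    # KL between the posterior N(mu, sigma^2) and a standard-normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    return recon + beta * kl

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```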
Conditional generation is formalized through three distributions: the property-conditioned prior p_θ(z | y), the approximate posterior q_φ(z | x, y), and the decoder likelihood p_θ(x | z, y). Both the prior and posterior are modeled as Gaussians, with parameters predicted from property embeddings. The decoder generates molecules autoregressively, conditioned on the latent vector and property information.
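Under this formulation, training maximizes the standard conditional ELBO, written here with the β-weighting used during training (this is the textbook cVAE bound rather than a formula quoted from the paper):

```latex
\log p_\theta(x \mid y) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log p_\theta(x \mid z, y)\bigr]
  \;-\; \beta \, \mathrm{KL}\bigl(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid y)\bigr)
```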
Property conditioning is implemented via LoRA adapters in the attention projections of both encoder and decoder. These low-rank matrices enable efficient fine-tuning for property-guided generation, with the majority of model parameters frozen. Additionally, classifier guidance is incorporated at inference time, using a differentiable property predictor to steer decoding toward desired property values via gradient-based modification of decoder logits.
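The sketch below shows one common way to realize such a LoRA adapter around a frozen attention projection; the rank, scaling, and wrapping pattern are assumptions rather than the authors' exact configuration.

```python
# Hedged sketch of a LoRA adapter around a frozen linear projection (rank and
# scaling values are illustrative, not the paper's configuration).
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# e.g. wrap an attention projection (hypothetical attribute name):
# attn.q_proj = LoRALinear(attn.q_proj, rank=8)
```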
Experimental Results
Unconditional Generation
On the MOSES benchmark, STAR-VAE achieves perfect scores for validity, uniqueness, and novelty, and maintains high internal diversity. In the GuacaMol benchmark, the model attains a KL-divergence score of 0.916 when evaluated against the reference distribution, outperforming baselines on key physicochemical descriptors such as BertzCT, MolLogP, MolWt, and TPSA. Seed-conditioned inference further demonstrates the model's ability to reproduce property distributions of external datasets without additional fine-tuning, highlighting the robustness and adaptability of the learned latent space.
Conditional Generation
Synthetic Accessibility and Blood-Brain Barrier Permeability
UMAP visualizations of the conditional latent space reveal smooth, semantically organized manifolds aligned with synthetic accessibility (SA) and blood-brain barrier permeability (BBBP) scores. Conditioning signals consistently shape the latent space, producing clusters that reflect the continuous nature of the properties and support interpolation and controllable generation.
Target-Conditioned Generation
On the Tartarus protein-ligand design benchmark, STAR-VAE (CVAE variant) is fine-tuned for target-specific docking scores. The model shifts the distribution of docking scores for generated molecules toward stronger predicted binding affinities compared to the baseline VAE. For targets 1SYH and 6Y2F, CVAE-generated ligands achieve significantly stronger mean docking scores (p < 0.0001), demonstrating effective capture of target-specific molecular features. The approach produces many high-scoring, diverse molecules rather than optimizing for a single outlier, which is advantageous for downstream drug discovery applications.
Implementation Considerations
STAR-VAE's architecture is amenable to large-scale training and efficient adaptation. The use of SELFIES eliminates the need for post-generation validity checks, and the Transformer backbone supports parallelization and scalability. LoRA-based conditioning enables rapid fine-tuning with limited labeled data, reducing computational requirements for property-guided tasks. The latent-variable formalism facilitates both unconditional exploration and property-aware steering, with smooth latent spaces supporting interpolation and transfer across molecular domains.
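As a simple illustration of latent-space interpolation, the sketch below walks a linear path between two encoded molecules and decodes each intermediate point; `model.encode` and `model.decode` are hypothetical helper names, not part of a published API.

```python
# Illustrative latent interpolation between two encoded molecules; `model.encode`
# and `model.decode` are hypothetical helpers returning (z, mu, logvar) and a
# decoded SELFIES string, respectively.
import torch

def interpolate(model, tokens_a, tokens_b, steps=8):
    _, mu_a, _ = model.encode(tokens_a)
    _, mu_b, _ = model.encode(tokens_b)
    molecules = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * mu_a + t * mu_b      # linear path through the latent space
        molecules.append(model.decode(z))  # autoregressive decoding from z
    return molecules
```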
Potential limitations include the reliance on property predictors for effective conditioning, which may be constrained by the availability and quality of labeled data. The model's performance on highly specialized or out-of-distribution chemical spaces may require further investigation, particularly in real-world drug discovery scenarios.
Implications and Future Directions
The results demonstrate that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with robust tokenization, principled conditioning, and parameter-efficient fine-tuning. The latent-variable approach provides a unified framework for distribution learning and goal-oriented generation, with strong empirical performance on both unconditional and conditional benchmarks.
Future work should focus on enhancing controllability by jointly incorporating molecular property information and latent seed embeddings, as well as extending validation to real-world discovery tasks beyond benchmark datasets. Comprehensive computational and experimental validation will be necessary to assess the relevance and utility of generated molecules in practical drug design pipelines.
Conclusion
STAR-VAE advances the state of molecular generative modeling by integrating a Transformer-based latent-variable architecture with SELFIES representations and efficient conditional adaptation. The framework achieves strong alignment with target property distributions, competitive benchmark performance, and smooth, semantically organized latent spaces. These results support the continued relevance of latent-variable models for scalable and controllable molecular generation, with promising avenues for future research in property-guided design and real-world application.