
Tailoring xLSTM for biological and chemical sequences and benchmarking against domain-specific LLMs

Determine how to adapt the xLSTM architecture and training procedures specifically for biological and chemical sequence modeling (including genomic DNA, protein sequences, and SMILES representations of small molecules), and ascertain how xLSTM-based models compare in performance to other domain-specific large language model architectures across these modalities.


Background

The xLSTM architecture is a recurrent sequence model whose runtime scales linearly with sequence length, making it attractive for long-context tasks. Biological and chemical sequences require modeling long-range dependencies (e.g., genomic regulation, protein folding context, and domain-conditioned molecule design), a regime in which the quadratic complexity of Transformer self-attention becomes a bottleneck and state-space models have recently emerged as alternatives. Given these domain needs and architectural trade-offs, it was unclear prior to this work how best to tailor xLSTM to these modalities and how it would compare to established domain-specific LLMs, such as Transformer-based and SSM-based models.
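
To make the architectural trade-off concrete, the sketch below contrasts the quadratic cost of full self-attention with the linear cost of a generic recurrent state update. It is an illustrative toy (a simple exponential-moving-average recurrence in NumPy), not the xLSTM update rule or any code from the paper; all names, shapes, and constants are assumptions chosen for the example.

```python
import numpy as np

def attention_scores(x):
    # Full pairwise token-token scores: time and memory scale as O(T^2).
    return x @ x.T / np.sqrt(x.shape[1])

def recurrent_scan(x, decay=0.9):
    # Single left-to-right pass with a fixed-size state: scales as O(T).
    # This is a generic stand-in for a recurrent cell, not xLSTM itself.
    state = np.zeros(x.shape[1])
    outputs = []
    for token in x:
        state = decay * state + (1.0 - decay) * token
        outputs.append(state.copy())
    return np.stack(outputs)

T, d = 8192, 64                         # long-context setting, e.g. genomic DNA
x = np.random.randn(T, d).astype(np.float32)
print(attention_scores(x[:256]).shape)  # (256, 256): pairwise cost grows quadratically with length
print(recurrent_scan(x).shape)          # (8192, 64): state-update cost grows linearly with length
```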

This uncertainty motivates the development of the Bio-xLSTM variants (DNA-xLSTM, Prot-xLSTM, and Chem-xLSTM) and their comprehensive benchmarking across generative modeling, representation learning, and in-context learning. The open question, stated explicitly in the paper, frames the need for principled design choices and a comparative evaluation against domain-specific baselines.

References

However, it remains unclear how to best tailor xLSTM for biological and chemical sequences and how xLSTM compares to other domain-specific LLM architectures.

Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences (arXiv:2411.04165, Schmidinger et al., 6 Nov 2024), Section 1, Introduction.