Elucidating the Design Space of Multimodal Protein Language Models (2504.11454v3)

Published 15 Apr 2025 in cs.LG, cs.AI, and q-bio.QM

Abstract: Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve the structure generation diversity, and notably, folding abilities of our 650M model by reducing the RMSD from 5.52 to 2.36 on PDB testset, even outperforming 3B baselines and on par with the specialized folding models. Project page and code: https://bytedance.github.io/dplm/dplm-2.1/.

Summary

  • The paper identifies tokenization loss and generative shortcomings in protein language models and introduces bit-based predictions and residual diffusion to recover fine structural details.
  • The approach leverages a geometry-aware architecture with representation alignment and flow matching, significantly boosting folding accuracy to an RMSD of 2.37Å.
  • The integration of multimer data and targeted fine-tuning demonstrates scalable improvements in protein folding performance and generation diversity.

This paper investigates the design space of multimodal protein language models (PLMs) that integrate sequence and structure information, focusing on models that use tokenized representations of 3D structure, such as DPLM-2 (2410.13782) and ESM3 (2407.10758). While these models offer a unified framework for protein modeling, generation, and design, their limitations stem primarily from the structure tokenization process and the subsequent structure-token prediction by the language model.

Problem Identification:

The authors identify two major bottlenecks:

  1. Tokenization Loss: Converting continuous 3D coordinates into discrete tokens via vector quantization inevitably loses fine-grained structural details. For the tokenizer alone, quantization noticeably degrades reconstruction: RMSD rises from 1.31Å to 1.98Å and TM-score drops from 0.97 to 0.94. (Observation O1; a minimal quantization sketch appears after this list.)
  2. Inaccurate Structure Token Prediction: High reconstruction accuracy of the structure tokenizer doesn't directly translate to high-quality structure generation by the PLM. A tokenizer with better reconstruction (ESM3's) led to worse folding performance in the PLM compared to one with poorer reconstruction (DPLM-2's), highlighting the PLM's generative capability as a critical factor. (Observation O2) Furthermore, predicting index-based structure tokens is highly challenging and inaccurate (e.g., 8.6% accuracy on CAMEO folding), as small bit-level changes can lead to large index changes. Bit-level prediction accuracy is much higher (77.2%) and aligns better with structural quality metrics (RMSD, TM-score). (Observation O3)
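
To make Observation O1 concrete, the sketch below shows where tokenization loss enters: the tokenizer's continuous per-residue latents are snapped to the nearest codebook entry before decoding, so any detail the codebook cannot express is discarded. This is a minimal illustration with assumed shapes and a random codebook, not the paper's actual tokenizer.

```python
import torch

# Hypothetical stand-ins for the structure tokenizer's components. In the paper,
# an encoder maps backbone coordinates to per-residue latents and a decoder
# reconstructs 3D coordinates; here we only show where quantization loss enters.
codebook = torch.randn(8192, 64)             # (num_codes, latent_dim), illustrative sizes

def quantize(z_cont: torch.Tensor):
    """Snap continuous latents to their nearest codebook entries (vector quantization)."""
    dists = torch.cdist(z_cont, codebook)    # (num_residues, num_codes)
    indices = dists.argmin(dim=-1)           # discrete structure tokens
    z_quant = codebook[indices]              # quantized latents fed to the decoder
    return indices, z_quant

z_cont = torch.randn(128, 64)                # continuous encoder output (hypothetical)
indices, z_quant = quantize(z_cont)

# The residual below is exactly the information discarded by tokenization;
# ResDiff (Section 3) trains a small diffusion module to recover it.
residual = z_cont - z_quant
print(residual.norm(dim=-1).mean())
```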

Proposed Design Space Enhancements:

To address these issues, the paper explores improvements across generative modeling, architecture, representation learning, and data:

  1. Improved Generative Modeling for Structure Prediction (Section 3):
    • ResDiff (Residual Diffusion): A lightweight continuous diffusion module is trained to predict the quantization residual ($\mathbf{z}_{\text{cont}} - \mathbf{z}_{\text{quant}}$) conditioned on the PLM's hidden states and predicted discrete tokens, aiming to recover the fine-grained detail lost during tokenization. It demonstrably refines local structures (e.g., improving secondary structure formation) and consistently improves folding metrics across different model variants (a minimal training-step sketch appears after this list).
    • Bit-based Language Modeling: Instead of predicting a single index token per residue (a $2^K$-way classification for $K$ bits), the model predicts each of the $K$ bits independently ($K$ binary classifications). This provides finer-grained supervision, significantly improving both token prediction accuracy and structural metrics (e.g., folding RMSD on the PDB date split reduced from 5.31Å to 3.22Å); see the head-and-loss sketch after this list.
    • Hybrid Data-Space Modeling (Flow Matching): The structure encoder, PLM, and structure decoder are treated collectively as a single denoising function operating in continuous space (atomic coordinates, or the feature space after encoding). This composite denoiser is integrated into a flow-based sampling framework (Euler integration with flow matching fine-tuning), enabling direct data-space structure generation and further improving folding accuracy (RMSD of 2.87Å on the PDB date split), potentially on par with specialized folding models. The trade-offs are reduced generation diversity and longer training time (the encoder must be run during training), while inference sampling becomes significantly faster (10x fewer steps).
  2. Structure-Aware Architecture and Representation Learning (Section 4):
    • GeoDPLM (Geometry-aware Architecture): Inspired by AlphaFold3 (2405.11777), geometric modules operating on 2D pair representations are added to the PLM's Transformer blocks. This includes structure attention (refining single and pair representations using transitions, triangle updates/attention) and SeqStruct attention (blending pair representations with sequence/structure representations). Ablations show that adding pair representations and transition layers (especially for structure representations) is most effective for improving folding and generation diversity, while triangle operations are computationally expensive and offer limited benefit in this context.
    • REPA (Representation Alignment): The PLM's hidden representations are aligned (via a cosine similarity loss) with precomputed structure/pair representations from a specialized folding model, ESMFold (2304.05702), transferring structural semantics and improving representation quality. REPA improves folding performance and significantly boosts the diversity of unconditionally generated structures; its benefits overlap somewhat with bit-based modeling, as both provide smoother, higher-dimensional learning signals than index tokens (a minimal alignment-loss sketch appears after this list).
  3. Data Exploration: Multimers (Section 5):
    • The paper curates a PDB-Multimer dataset and investigates the impact of training on multi-chain proteins.
    • Scaling monomer data improves the structure tokenizer's reconstruction performance on multimer data, indicating that monomer training transfers to the multimer setting.
    • Using chain linkers (e.g., glycine linkers) or simple position index offsets (adding chain_index * offset to the position embeddings) helps the model differentiate chains and improves multimer modeling; see the sketch after this list.
    • Fine-tuning the PLM on a mix of monomer and multimer data improves folding performance on both monomer (CAMEO) and multimer test sets, suggesting multimer data provides richer structural interactions beneficial for robust modeling.
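
The ResDiff bullet above can be sketched as a small conditional denoiser trained to regress the quantization residual. The module architecture, the linear noising schedule, and all dimensions below are assumptions for illustration; the paper's actual diffusion parameterization may differ.

```python
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """Lightweight denoiser that reconstructs the quantization residual
    z_cont - z_quant from a noised copy, conditioned on the PLM hidden state
    and the embedding of the predicted discrete structure token.
    (Illustrative architecture and shapes, not the paper's exact design.)"""

    def __init__(self, latent_dim=64, plm_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + plm_dim + latent_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, noisy_residual, plm_hidden, token_emb, t):
        t_feat = t.expand(noisy_residual.shape[0], 1)        # broadcast diffusion time
        cond = torch.cat([noisy_residual, plm_hidden, token_emb, t_feat], dim=-1)
        return self.net(cond)                                # predicted clean residual

def resdiff_training_step(denoiser, z_cont, z_quant, plm_hidden, token_emb):
    residual = z_cont - z_quant                  # fine detail discarded by quantization
    t = torch.rand(1)                            # random interpolation time in [0, 1]
    noise = torch.randn_like(residual)
    noisy_residual = (1 - t) * residual + t * noise   # simple linear noising (assumed)
    pred = denoiser(noisy_residual, plm_hidden, token_emb, t)
    return ((pred - residual) ** 2).mean()       # regress the clean residual

# Per-residue tensors with hypothetical sizes.
denoiser = ResidualDenoiser()
loss = resdiff_training_step(
    denoiser,
    z_cont=torch.randn(128, 64), z_quant=torch.randn(128, 64),
    plm_hidden=torch.randn(128, 1024), token_emb=torch.randn(128, 64),
)
```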
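
The index-based vs. bit-based contrast (Observation O3 and the bit-based language modeling bullet) comes down to the output head and its loss: one hard $2^K$-way decision per residue versus $K$ independent binary decisions. A minimal sketch with assumed sizes; the value of $K$ and the hidden dimension are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 13                       # bits per structure token, i.e. a 2**K-entry codebook (assumed)
hidden_dim = 1024

# Index-based head: one 2^K-way classification per residue.
index_head = nn.Linear(hidden_dim, 2 ** K)

# Bit-based head: K independent binary classifications per residue.
bit_head = nn.Linear(hidden_dim, K)

def index_to_bits(indices: torch.Tensor) -> torch.Tensor:
    """Decompose codebook indices into their K-bit binary codes."""
    shifts = torch.arange(K, device=indices.device)
    return ((indices.unsqueeze(-1) >> shifts) & 1).float()   # (num_residues, K)

hidden = torch.randn(128, hidden_dim)           # PLM hidden states (hypothetical)
target_idx = torch.randint(0, 2 ** K, (128,))   # ground-truth structure tokens

# Index-based loss: an all-or-nothing 2^K-way decision per residue.
loss_index = F.cross_entropy(index_head(hidden), target_idx)

# Bit-based loss: each bit is supervised directly, so a prediction that gets
# most bits right still receives useful learning signal.
loss_bits = F.binary_cross_entropy_with_logits(bit_head(hidden), index_to_bits(target_idx))
```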
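
REPA's alignment objective can be sketched as a cosine-similarity loss between projected PLM hidden states and frozen representations precomputed by a folding model such as ESMFold. The single linear projector, the dimensions, and which layer's representations are aligned are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Project PLM hidden states into the folding model's representation space.
# (Dimensions and the choice of a single linear projection are assumptions.)
plm_dim, fold_dim = 1024, 384
projector = nn.Linear(plm_dim, fold_dim)

def repa_loss(plm_hidden: torch.Tensor, fold_repr: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity alignment between PLM states and precomputed,
    frozen structural representations (e.g., from ESMFold)."""
    pred = F.normalize(projector(plm_hidden), dim=-1)
    target = F.normalize(fold_repr.detach(), dim=-1)   # targets stay frozen
    return 1.0 - (pred * target).sum(dim=-1).mean()    # 1 - mean cosine similarity

plm_hidden = torch.randn(128, plm_dim)    # per-residue PLM hidden states (hypothetical)
fold_repr = torch.randn(128, fold_dim)    # precomputed folding-model representations
aux_loss = repa_loss(plm_hidden, fold_repr)
```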
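
The position-index offset for multimers is simple enough to write out directly: each chain's residue positions are shifted by chain_index * offset, so residues from different chains never share a position id and the model can tell chains apart without any architectural change. The offset value below is illustrative.

```python
import torch

CHAIN_OFFSET = 512   # illustrative; any value larger than the longest chain works

def multimer_position_ids(chain_ids: torch.Tensor) -> torch.Tensor:
    """Per-chain residue positions plus chain_index * offset."""
    pos = torch.zeros_like(chain_ids)
    for c in chain_ids.unique():
        mask = chain_ids == c
        pos[mask] = torch.arange(int(mask.sum()), device=chain_ids.device)
    return pos + chain_ids * CHAIN_OFFSET

# Example: a dimer with chains of length 3 and 4.
chain_ids = torch.tensor([0, 0, 0, 1, 1, 1, 1])
print(multimer_position_ids(chain_ids))   # tensor([  0,   1,   2, 512, 513, 514, 515])
```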

Key Results and Combinations (Section 6):

  • Combining the proposed methods yields significant improvements. The Geo + Bit-based modeling approach offers a strong balance between folding performance, generation diversity, and training efficiency.
  • Adding Flow Matching (FM) to Geo+Bit further improves folding but can hurt diversity.
  • Adding REPA to Geo+Bit improves diversity significantly but offers less additional folding improvement compared to FM.
  • Supervised fine-tuning (SFT) specifically for folding (predicting structure tokens given sequence) further boosts folding metrics but harms unconditional co-generation capabilities.
  • The best performing model configuration (GeoDPLM 650M + Bit-based + FM + ResDiff + Folding SFT) achieves a folding RMSD of 2.37Å on the PDB date split, outperforming the 3B parameter ESMFold (2.84Å) and a 3B DPLM-2 variant (3.15Å).

Conclusion:

The paper systematically identifies key limitations in token-based multimodal PLMs related to tokenization and prediction accuracy. It proposes and validates effective design choices including bit-based modeling, residual recovery, hybrid flow-based sampling, geometry-aware architectures, representation alignment, and multimer data integration. These methods significantly enhance structural modeling capabilities, enabling a 650M parameter multimodal PLM to achieve folding results comparable to or exceeding those of specialized, larger models, while also improving generation diversity and representation learning.