
Protein Structure Tokenization via Geometric Byte Pair Encoding

Published 13 Nov 2025 in q-bio.QM and cs.AI (arXiv:2511.11758v1)

Abstract: Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an SE(3) end-frame loss. GeoBPE offers compression (>10× reduction in bits-per-residue at similar distortion rate), data efficiency (>10× less training data), and generalization (maintains test/train distortion ratio of 1.0–1.1). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across 12 tasks and 24 test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs. Code is available at https://github.com/shiningsunnyday/PT-BPE/.

Summary

  • The paper introduces GeoBPE, a geometry-guided framework that constructs hierarchical vocabularies from protein backbones using motif clustering and differentiable inverse kinematics.
  • It employs iterative motif clustering, adaptive quantization, and global geometry correction to robustly tokenize continuous, noisy protein structures.
  • Empirical results demonstrate improved token efficiency, up to 49% improvement in generative diversity metrics, and domain recall >99.9%, using minimal training data.

Geometric Byte Pair Encoding for Protein Structure Tokenization

Motivation and Background

The paper presents GeoBPE, a geometry-guided, multi-resolution tokenizer for protein backbone structures (2511.11758). Existing protein structure tokenizers (PSTs), primarily based on VQ-VAEs, vectorize local backbone fragments and quantize them via fixed-size discrete codebooks. Although effective for sequence-structure tasks, these approaches lack hierarchical interpretability, multi-scale adaptability, and architecture independence. Moreover, their codebooks often collapse, resulting in poor token efficiency and degraded performance on out-of-distribution (OOD) data. Traditional structural alphabets—fixed sets of backbone motifs—capture modularity but do not address continuous geometric variability or enable hierarchical representation learning.

GeoBPE adapts byte pair encoding (BPE) from natural language processing, which merges frequent symbol pairs into hierarchical, variable-length tokens. The core challenge in geometric analogs is robustly discretizing continuous, noisy backbone conformations while preserving global spatial consistency. GeoBPE resolves this through a combination of motif clustering, adaptive quantization, and global geometric correction via differentiable inverse kinematics.
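For reference, classical BPE repeatedly merges the most frequent adjacent symbol pair into a new vocabulary item; GeoBPE's Geo-Pair merges follow the same loop, with geometric clustering in place of exact symbol matching. A minimal sketch of one textual BPE round (the helper names are ours, not the paper's):

```python
from collections import Counter

def most_frequent_pair(seq):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace each left-to-right occurrence of `pair` with a merged symbol."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# One BPE round on a toy symbol sequence:
seq = list("abababcd")
pair = most_frequent_pair(seq)   # ('a', 'b') occurs three times
seq = merge_pair(seq, pair)      # ['ab', 'ab', 'ab', 'c', 'd']
```

In the geometric analog, "symbols" are backbone motifs, and two occurrences of a Geo-Pair count as the same pair only up to the RMSD-based clustering described below.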

Algorithmic Framework

GeoBPE transforms a protein backbone into a discrete token sequence representing a hierarchical decomposition of structural motifs. The algorithm operates in iterative steps:

  • Motif Clustering and Quantization: Frequent motif pairs (Geo-Pairs) are identified and clustered with k-medoids over RMSD distances. Each cluster center (medoid) serves as a prototype, quantizing raw occurrences to their nearest medoid, yielding denoised geometric symbols.
  • Hierarchical Vocabulary Construction: Replacement of motif pairs by prototypes recursively builds a multi-scale vocabulary of structural primitives. This allows the vocabulary size and resolution to be tuned dynamically.
  • Global Geometry Correction: Local quantization introduces structural drift. GeoBPE compensates by optimizing glue angles (boundary dihedrals and bond angles) at motif boundaries through differentiable inverse kinematics, minimizing a global SE(3) end-frame loss. This enforces fold integrity across the protein and prevents cumulative geometric artifacts.
  • Forest Merge Trees: The segmentation and merging steps produce a hierarchical tree structure, where leaves represent fine-grained residues and internal nodes encapsulate higher-order motifs. Embeddings from pretrained protein language models (PLMs) can be aggregated and propagated within this tree, supporting localized and protein-level feature extraction.
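The clustering-and-quantization step can be pictured with a minimal k-medoids routine over a precomputed distance matrix, standing in for pairwise RMSD between Geo-Pair occurrences. This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def k_medoids(D, k, iters=50, seed=0):
    """Minimal k-medoids over a precomputed distance matrix D (a stand-in
    for pairwise RMSD between Geo-Pair occurrences). Illustration only."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # New medoid = member minimizing total distance within its cluster.
                new[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)

# Toy "occurrences" along a line form two clear clusters; quantization maps
# each occurrence to its nearest medoid prototype.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = k_medoids(D, k=2)
```

In GeoBPE the distances would be RMSD values between superposed fragment pairs, and the medoid conformations become the vocabulary's denoised prototypes.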

GeoBPE is designed to be architecture-agnostic, capable of inducing strong inductive biases for downstream models and supporting discrete backbone generation via transformer-based language modeling.
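Of the steps above, the glue-angle correction is the least familiar. A toy planar-chain analog conveys the idea: treat the chain's joint angles as the free "glue" variables and run gradient descent on an end-point mismatch loss. Finite differences here stand in for the paper's differentiable inverse kinematics and SE(3) end-frame loss; all names are illustrative:

```python
import math

def end_point(angles, link=1.0):
    """Forward kinematics of a planar chain: accumulate joint angles."""
    x = y = theta = 0.0
    for a in angles:
        theta += a
        x += link * math.cos(theta)
        y += link * math.sin(theta)
    return x, y

def fit_angles(angles, target, lr=0.02, steps=5000, eps=1e-5):
    """Minimize the squared end-point error w.r.t. the angles by
    finite-difference gradient descent -- a toy stand-in for differentiable
    inverse kinematics under an SE(3) end-frame loss."""
    angles = list(angles)
    def loss(a):
        x, y = end_point(a)
        return (x - target[0]) ** 2 + (y - target[1]) ** 2
    for _ in range(steps):
        base = loss(angles)
        grad = []
        for i in range(len(angles)):
            bumped = angles[:]
            bumped[i] += eps
            grad.append((loss(bumped) - base) / eps)
        angles = [a - lr * g for a, g in zip(angles, grad)]
    return angles

# Pull a 3-link chain's endpoint toward the reachable target (2.0, 1.0):
angles = fit_angles([0.1, 0.1, 0.1], target=(2.0, 1.0))
```

In GeoBPE the optimized variables are boundary dihedrals and bond angles between quantized motifs, and the loss compares full SE(3) end frames (rotation plus translation) rather than a 2D endpoint, which is what prevents quantization drift from accumulating along the backbone.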

Empirical Results

GeoBPE is extensively benchmarked against VQ-VAE and other PSTs across compression-distortion tradeoffs, generalization, token efficiency, and downstream predictive tasks:

  • Compression and Distortion: GeoBPE forms a smooth Pareto front, achieving 0.27–0.36× the bits-per-residue of ProToken with only an 18–22% drop in LDDT versus ESM3, despite being trained on 0.02–7% of their data volume. Test/train RMSD ratios remain exceptionally low (1.16–1.28), indicating robust generalization to OOD data, while standard PSTs degrade significantly (up to 6.4× test RMSD).
  • Token Efficiency and Generative Diversity: GeoBPE avoids codebook collapse, yielding utilization rates above 40% and perplexity levels superior to those of VQ-VAEs. Small Structure LLM evaluations demonstrate GeoBPE's ability to generate 99% unique/designable backbones, improving scTM and diversity metrics by up to 49% compared to baselines.
  • Downstream Functional and Structural Prediction: GeoBPE-induced features surpass other discrete and continuous tokenizers in AUROC for binding/catalytic/conserved site prediction, Spearman's ρ for flexibility regression, and macro-F1 for fold classification. Performance gains over ESM3 reach 15.44% (functional), 21.18% (physicochemical), and 43.28% (fold classification).
  • Interpretable Motif-Function Alignment: GeoBPE tokens correlate tightly with CATH and PFAM domain boundaries, achieving mean domain recall >99.9% and F1/IoU scores of ~0.996/0.992. Motif boundaries often coincide with ligand-binding grooves, transmembrane cavities, and catalytically active scaffolds, providing an interpretable vocabulary linked to biochemical function.
  • Data Efficiency and Robustness: GeoBPE trained on just 1% of available protein structures matches the performance of models trained on the full data. Task-specific tokenization yields no significant gain, indicating broad generality of the geometric motif vocabulary.

Limitations and Future Directions

GeoBPE presently encodes only backbone geometry, excluding sequence and side-chain atoms. Feature extraction remains dependent on pretrained PLMs, and generative capabilities are demonstrated in small-scale autoregressive settings. While the method scales up to vocabularies of 21K tokens, ultra-high-resolution tokenization and integration with large generative models are open challenges.

Prospective research directions include the integration of side-chain geometries, end-to-end multimodal pretraining of PLMs grounded in GeoBPE token hierarchies, and further development of generative models capable of direct structure synthesis from geometric token sequences. The interpretability and multi-resolution architecture of GeoBPE also suggest promising applications in protein engineering, evolutionary analysis, and rational functional annotation.

Conclusion

GeoBPE constitutes a rigorous, geometry-centric framework for protein structure tokenization, realizing discrete, hierarchical motif vocabularies that support low-distortion reconstruction, efficient compression, robust generalization, and strong downstream predictive transfer. It addresses key limitations of current PSTs by providing multi-resolution, interpretable representations tightly coupled to structural and functional organization in proteins. The method establishes a viable foundation for structure-native protein language modeling and motivates the continued exploration of geometry-grounded tokenization in biological modeling.
