Tokenization for molecular foundation models

Determine a principled tokenization scheme for molecular foundation models that represents continuous three-dimensional molecular configurations as discrete tokens, evaluating alternatives such as voxelization, graph encoding, and point-cloud representations.

Background

In discussing physics-grounded molecular foundation models, the paper identifies tokenization as a critical unresolved design choice. Unlike text, molecular configurations are continuous and geometric, requiring conversion to discrete tokens that preserve symmetry, locality, and physical constraints.

Potential tokenization strategies include voxel grids, molecular graphs with geometric features, and point-cloud encodings. Selecting an approach that supports downstream learning tasks while maintaining physical fidelity is presented as an open question.
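To make the trade-off concrete, the following is a minimal sketch, not the paper's method, that contrasts two of these options on a water-like geometry. It assumes NumPy; the element vocabulary `ELEMENTS`, the function names, the grid parameters, and the bin width are all hypothetical choices made for illustration. `voxel_tokens` assigns each atom a token from its element and the grid cell it falls in, while `distance_bin_tokens` stands in for a graph encoding by binning pairwise distances into edge tokens.

```python
import numpy as np

# Hypothetical element vocabulary, chosen for illustration only.
ELEMENTS = {"H": 0, "C": 1, "N": 2, "O": 3}


def voxel_tokens(coords, elements, origin, voxel_size, grid_shape):
    """Voxelization: one integer token per atom from (element, grid cell)."""
    coords = np.asarray(coords, dtype=float)
    # Index of the grid cell containing each atom along x, y, z.
    cells = np.floor((coords - np.asarray(origin)) / voxel_size).astype(int)
    if np.any(cells < 0) or np.any(cells >= np.asarray(grid_shape)):
        raise ValueError("atom lies outside the voxel grid")
    # Flatten the 3D cell index into a single integer per atom.
    flat = np.ravel_multi_index(tuple(cells.T), grid_shape)
    n_cells = int(np.prod(grid_shape))
    elem = np.array([ELEMENTS[e] for e in elements])
    return elem * n_cells + flat


def distance_bin_tokens(coords, bin_width, n_bins):
    """Graph-style encoding: one token per atom pair from the binned distance."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)  # all unordered atom pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    # Distances beyond the last bin are clamped into it.
    return np.minimum((dist / bin_width).astype(int), n_bins - 1)


# Water-like geometry in angstroms (illustrative coordinates).
coords = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
elements = ["O", "H", "H"]

# Rotate the molecule about the z-axis; the physics is unchanged.
theta = 0.7
rot = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
rotated = coords @ rot.T

grid = dict(origin=(-4.0, -4.0, -4.0), voxel_size=0.5, grid_shape=(16, 16, 16))
voxel_before = voxel_tokens(coords, elements, **grid)
voxel_after = voxel_tokens(rotated, elements, **grid)
edge_before = distance_bin_tokens(coords, bin_width=0.5, n_bins=32)
edge_after = distance_bin_tokens(rotated, bin_width=0.5, n_bins=32)

print("voxel tokens:", voxel_before, "->", voxel_after)
print("distance-bin tokens:", edge_before, "->", edge_after)
```

In this sketch the voxel tokens depend on the molecule's orientation relative to the grid, whereas the distance-bin tokens are unchanged by the rotation but discard absolute position and, on their own, chirality. Neither choice is endorsed by the source; the example only illustrates why symmetry and fidelity pull in different directions when continuous coordinates are discretized.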

References

Key open questions: Tokenization: How to embed continuous molecular configurations into discrete tokens? Voxelization? Graph encoding? Point clouds?

Learning Biomolecular Motion: The Physics-Informed Machine Learning Paradigm (2511.06585 - Deshpande, 10 Nov 2025) in Section 7, Future Directions—Physics-Grounded Foundation Models