
GraphXForm: Graph transformer for computer-aided molecular design with application to extraction (2411.01667v1)

Published 3 Nov 2024 in cs.LG, physics.chem-ph, and q-bio.BM

Abstract: Generative deep learning has become pivotal in molecular design for drug discovery and materials science. A widely used paradigm is to pretrain neural networks on string representations of molecules and fine-tune them using reinforcement learning on specific objectives. However, string-based models face challenges in ensuring chemical validity and enforcing structural constraints like the presence of specific substructures. We propose to instead combine graph-based molecular representations, which can naturally ensure chemical validity, with transformer architectures, which are highly expressive and capable of modeling long-range dependencies between atoms. Our approach iteratively modifies a molecular graph by adding atoms and bonds, which ensures chemical validity and facilitates the incorporation of structural constraints. We present GraphXForm, a decoder-only graph transformer architecture, which is pretrained on existing compounds and then fine-tuned using a new training algorithm that combines elements of the deep cross-entropy method with self-improvement learning from language modeling, allowing stable fine-tuning of deep transformers with many layers. We evaluate GraphXForm on two solvent design tasks for liquid-liquid extraction, showing that it outperforms four state-of-the-art molecular design techniques, while it can flexibly enforce structural constraints or initiate the design from existing molecular structures.

GraphXForm: Graph Transformer for Computer-Aided Molecular Design with Application to Extraction

The paper presents GraphXForm, a graph-based molecular design methodology that leverages the transformer architecture for chemical design tasks, specifically solvent design for liquid-liquid extraction. The methodology addresses limitations of string-based molecular representations such as SMILES or SELFIES, namely difficulties in ensuring chemical validity and in embedding structural constraints during compound generation.

Core Contributions

The primary contribution of this research is the integration of graph-based molecular representations with transformer architectures to enable the generation of molecular structures that inherently satisfy chemical validity. Unlike string-based models, the graph-based method allows for seamless incorporation of structural constraints from the beginning, ensuring more feasible and usable chemical designs.
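The graph-based action space described above can be sketched in a few lines: a molecule is grown by "add atom" and "add bond" actions, and a valence check restricts the enumeration to chemically valid actions. The valence table, action encoding, and class names below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of valence-checked iterative graph construction.
# Atom symbols and max valences are a small illustrative subset.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}

class MolGraph:
    def __init__(self, seed="C"):
        self.atoms = [seed]   # atom symbols by index
        self.bonds = {}       # (i, j) with i < j -> bond order

    def free_valence(self, i):
        used = sum(o for (a, b), o in self.bonds.items() if i in (a, b))
        return MAX_VALENCE[self.atoms[i]] - used

    def valid_actions(self):
        """Enumerate only actions that keep the graph chemically valid."""
        acts = []
        for i in range(len(self.atoms)):
            if self.free_valence(i) < 1:
                continue
            for sym in MAX_VALENCE:           # attach a new atom to i
                acts.append(("add_atom", i, sym))
            for j in range(i + 1, len(self.atoms)):
                if (i, j) not in self.bonds and self.free_valence(j) >= 1:
                    acts.append(("add_bond", i, j))
        return acts

    def apply(self, action):
        if action[0] == "add_atom":
            _, i, sym = action
            self.atoms.append(sym)
            self.bonds[(i, len(self.atoms) - 1)] = 1  # single bond to new atom
        else:
            _, i, j = action
            self.bonds[(i, j)] = self.bonds.get((i, j), 0) + 1

g = MolGraph("C")              # start from a single carbon
g.apply(("add_atom", 0, "O"))  # C-O
g.apply(("add_atom", 0, "C"))  # C(-O)-C
```

Because invalid actions are never enumerated, every intermediate graph corresponds to a chemically valid (partial) molecule, which is the core advantage over post-hoc validity filtering of generated strings.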

Key features of GraphXForm include:

  • Molecular Graph Iterative Modification: The paper proposes a graph transformer model, GraphXForm, which modifies the molecular graph by adding atoms and bonds iteratively. This approach naturally ensures the chemical validity of generated molecules.
  • Decoder-Only Architecture with Novel Training: The design uses a decoder-only graph transformer architecture, pretrained on existing molecules and fine-tuned with a novel training algorithm that combines elements of the deep cross-entropy method with self-improvement learning from language modeling. This enables stable fine-tuning of transformers, even for deep models with many layers.
  • Empirical Evaluation on Solvent Design: The paper benchmarks GraphXForm against four state-of-the-art molecular design models across two solvent design tasks. The results indicate that GraphXForm not only outperforms these comparative techniques in solvent design but also demonstrates flexibility in enforcing structural constraints and leveraging existing molecular designs.
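The fine-tuning idea in the second bullet can be illustrated on a toy continuous problem: sample candidates from the current policy, keep an elite fraction ranked by objective value, and refit the policy to the elites. GraphXForm applies this loop to sampled molecules, imitating the elite ones with a cross-entropy loss; the Gaussian policy and toy objective below are purely illustrative, not the paper's setup.

```python
# Toy cross-entropy-method loop: the "policy" is a Gaussian over one
# design variable; each step refits it to the best-scoring samples.
import random
import statistics

rng = random.Random(0)

def objective(x):
    return -(x - 3.0) ** 2  # toy score, maximized at x = 3

def cem_step(mu, sigma, n=200, elite_frac=0.1):
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    samples.sort(key=objective, reverse=True)
    elites = samples[: int(n * elite_frac)]
    # Refit the policy to the elites; small floor keeps exploration alive.
    return statistics.mean(elites), statistics.stdev(elites) + 1e-3

mu, sigma = 0.0, 5.0
for _ in range(15):
    mu, sigma = cem_step(mu, sigma)
```

The appeal of this self-improvement scheme over policy-gradient fine-tuning is that each update is a plain supervised (cross-entropy) step on self-generated elite data, which tends to remain stable even for very deep networks.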

Evaluation and Numerical Results

GraphXForm has been applied to solvent design tasks targeting liquid-liquid extraction processes, which are critical in industries such as biotechnology. Performance was evaluated using objective functions based on activity coefficients at infinite dilution for two distinct tasks: the separation of isobutanol (IBA) from water, and a process involving 3,5-dimethoxybenzaldehyde (DMBA) and (R)-3,3’,5,5’-tetramethoxy-benzoin (TMB).

Results show that GraphXForm consistently outperforms existing methods on both tasks, achieving higher maximal and mean objective values across multiple runs while reliably enforcing structural constraints. Notably, GraphXForm derived more chemically feasible solvent structures that accommodate specified constraints such as ring sizes and bond types.
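To make the evaluation setup concrete, an activity-coefficient-based extraction objective might score a candidate solvent roughly as follows. The functional form, variable names, and numbers are hypothetical illustrations; the paper defines its own task-specific objectives from the predicted infinite-dilution activity coefficients.

```python
# Hypothetical extraction-solvent score: a good solvent has a low
# activity coefficient for the solute (high capacity), a high solute
# activity coefficient in water, and limited miscibility with water.
import math

def solvent_objective(g_solute_solvent, g_solute_water, g_water_solvent):
    capacity = g_solute_water / g_solute_solvent  # partitioning toward solvent
    return math.log(capacity) + math.log(g_water_solvent)

# Candidate A partitions the solute more strongly than candidate B.
score_a = solvent_objective(1.5, 50.0, 20.0)
score_b = solvent_objective(10.0, 50.0, 20.0)
```

In practice such scores are computed from thermodynamic models or machine-learned predictors of the activity coefficients, so the design loop never needs experimental data for each candidate.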

Theoretical and Practical Implications

The theoretical implications center on the effective combination of transformer architectures with graph-based representations in deep learning. This approach not only broadens the application scope of transformers in generative tasks but also guarantees chemical validity by construction, which is often challenging for conventional string-based methods.

On the practical side, GraphXForm opens avenues for more efficient and chemically valid molecular design in fields like drug discovery and materials science. The flexibility to impose structural constraints and to initiate designs from preexisting structures makes for a user-friendly, adaptive design framework, enhancing its utility in real-world chemical engineering applications.
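Constraint enforcement of this kind reduces, in a graph-action formulation, to masking: actions that would violate a user constraint are simply removed before sampling. The helper below is a hedged sketch with hypothetical names, showing a heavy-atom cap and a protected seed substructure whose internal bonding must not be altered.

```python
# Hypothetical action-masking helper: filter out actions that violate
# user constraints before the policy samples from them.
def mask_actions(actions, protected, max_atoms, n_atoms):
    allowed = []
    for act in actions:
        # Cap the total number of heavy atoms.
        if act[0] == "add_atom" and n_atoms >= max_atoms:
            continue
        # Keep the fixed seed substructure intact: no new bonds inside it.
        if act[0] == "add_bond" and act[1] in protected and act[2] in protected:
            continue
        allowed.append(act)
    return allowed

acts = [("add_atom", 0, "C"), ("add_bond", 0, 1), ("add_bond", 1, 2)]
out = mask_actions(acts, protected={0, 1}, max_atoms=3, n_atoms=3)
```

Because the mask is applied at every generation step, every sampled molecule satisfies the constraints by construction rather than being filtered afterwards.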

Conclusion and Future Directions

GraphXForm showcases the successful blending of deep learning methodology with chemical design needs, underlining the applicability of graph transformer models for generating viable chemical structures. It demonstrates that operating directly on molecular graphs substantially improves the flexibility and validity of molecule generation. Future developments could expand the model’s capabilities by incorporating additional chemical elements, broadening its range of applications. Furthermore, coupling GraphXForm with large language models could provide a more intuitive interface for specifying constraints, streamlining the workflow of chemical researchers and engineers.

Authors (7)
  1. Jonathan Pirnay (7 papers)
  2. Jan G. Rittig (11 papers)
  3. Alexander B. Wolf (1 paper)
  4. Martin Grohe (92 papers)
  5. Jakob Burger (7 papers)
  6. Alexander Mitsos (45 papers)
  7. Dominik G. Grimm (7 papers)
Citations (1)