Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient (2405.18075v1)
Abstract: Across scientific domains, generating new models or optimizing existing ones while meeting specific criteria is crucial. Traditional machine learning frameworks for guided design use a generative model and a surrogate model (discriminator), requiring large datasets. However, real-world scientific applications often have limited data and complex landscapes, making data-hungry models inefficient or impractical. We propose a new framework, PropEn, inspired by "matching", which enables implicit guidance without training a discriminator. By matching each sample with a similar one that has a better property value, we create a larger training dataset that inherently indicates the direction of improvement. Matching, combined with an encoder-decoder architecture, forms a domain-agnostic generative framework for property enhancement. We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution, allowing efficient design optimization. Extensive evaluations in toy problems and scientific applications, such as therapeutic protein design and airfoil optimization, demonstrate PropEn's advantages over common baselines. Notably, the protein design results are validated with wet lab experiments, confirming the competitiveness and effectiveness of our approach.
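The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_matched_pairs`, the Euclidean distance, and the parameters `dist_threshold` and `min_gain` are assumptions made for illustration, and the toy property is invented. The sketch pairs each sample with every nearby sample that has a better property value, which is one plausible reading of "matching"; the abstract does not fix these details.

```python
import math

def build_matched_pairs(xs, ys, dist_threshold, min_gain=0.0):
    """Pair each sample with every nearby sample whose property is better.

    xs: list of design vectors (tuples of floats)
    ys: list of property values, higher is better (assumed convention)
    dist_threshold, min_gain: hypothetical matching thresholds
    """
    pairs = []
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            if i == j:
                continue
            close = math.dist(xi, xj) <= dist_threshold  # similar design
            better = (yj - yi) > min_gain                # improved property
            if close and better:
                pairs.append((xi, xj))  # (input, target) training pair
    return pairs

# Toy data: 1-D designs with property y = -(x - 2)^2, maximized at x = 2.
xs = [(0.0,), (0.5,), (1.0,), (1.5,), (2.0,)]
ys = [-(x[0] - 2.0) ** 2 for x in xs]

pairs = build_matched_pairs(xs, ys, dist_threshold=1.1)
print(len(xs), "samples ->", len(pairs), "matched pairs")
```

On this toy data the five samples yield seven matched pairs, and every target lies closer to the optimum than its input. An encoder-decoder trained to map each input to its target would therefore be nudged in the direction of improvement, which is the intuition the abstract describes.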