Generating and designing DNA with deep generative models (1712.06148v1)

Published 17 Dec 2017 in cs.LG, q-bio.GN, and stat.ML

Abstract: We propose generative neural network methods to generate DNA sequences and tune them to have desired properties. We present three approaches: creating synthetic DNA sequences using a generative adversarial network; a DNA-based variant of the activation maximization ("deep dream") design method; and a joint procedure which combines these two approaches together. We show that these tools capture important structures of the data and, when applied to designing probes for protein binding microarrays, allow us to generate new sequences whose properties are estimated to be superior to those found in the training data. We believe that these results open the door for applying deep generative models to advance genomics research.

Summary

The paper introduces a deep generative framework for designing DNA sequences, built from three methods: a GAN, activation maximization, and a joint procedure that combines the two.
The GAN is a Wasserstein GAN with gradient penalty, and design proceeds by gradient ascent over a continuous representation of the otherwise discrete DNA sequences; the trained models capture key genomic features.
Experimental results show the model can generate DNA probes whose estimated binding affinities exceed those in the training data, suggesting applications in synthetic biology.

The paper "Generating and designing DNA with deep generative models" applies deep generative models to the generation and design of DNA sequences. It connects machine learning methods with genomics, proposing ways to create synthetic DNA sequences and tune them toward desired properties.

Research Summary

The authors explore three deep generative methodologies: a GAN-based approach, a method inspired by activation maximization (akin to "deep dream"), and a combined method that integrates both strategies. The approaches were applied to practical tasks, such as designing DNA probes for protein-binding microarrays (PBMs), where they generated sequences whose properties are estimated to be superior to those in the original training datasets.

The paper discusses the nature of DNA sequence data, noting that it resembles both natural language data and computer vision data. This characteristic informs the design and application of the models. The research shows how deep generative models can explore the vast space of possible DNA sequences, tailor sequences to specific desired properties, and propose sequences not present in the training data.
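To make the data representation concrete, here is a minimal sketch of one-hot encoding, assuming the four-letter alphabet A, C, G, T in that order. The helper names `one_hot` and `decode` are illustrative and do not come from the paper. Each base becomes a length-4 row, so a sequence can be read either as a string of discrete tokens or as a small numeric array.

```python
ALPHABET = "ACGT"  # assumed base ordering, chosen for illustration


def one_hot(seq):
    """Encode a DNA string as a list of length-4 one-hot rows, one row per base."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        row[index[base]] = 1.0  # raises KeyError for symbols outside ACGT
        rows.append(row)
    return rows


def decode(rows):
    """Map each row back to the base holding its largest value (argmax)."""
    return "".join(ALPHABET[row.index(max(row))] for row in rows)


encoded = one_hot("GATTACA")
print(len(encoded), len(encoded[0]))  # number of positions, number of channels
print(decode(encoded))
```

Because `decode` takes an argmax per row, it also works on rows that are not strictly one-hot, which is what allows continuous versions of a sequence to be mapped back to letters.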

Methodologies

GAN-Based Generation: The authors implement a GAN architecture adapted for DNA sequences, leveraging the Wasserstein GAN with gradient penalty (WGAN-GP) to address the associated challenges of generating discrete sequence data. The model comprises a generator that learns to produce sequences and a discriminator that distinguishes between real and generated sequences.
Activation Maximization: This approach, adapted from image processing, optimizes DNA sequences to enhance specific properties. By relaxing the one-hot encoding of a DNA sequence into a continuous representation, the authors can apply gradient ascent to push a sequence toward higher values of a desired property.
Joint Approach: Combining the generative and optimization strategies, this architecture designs sequences that score well on a target property while retaining the realistic features captured by the GAN.
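The activation maximization idea can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the "predictor" is a hypothetical linear scorer that rewards matching a fixed motif (the paper uses a learned model), and the continuous representation is assumed to be a per-position softmax over unconstrained logits. The gradient through the softmax is written out by hand so the example needs only the standard library.

```python
import math

ALPHABET = "ACGT"


def softmax(logits):
    """Numerically stable softmax over one position's four logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_score(probs, weights):
    """Hypothetical predictor: sum of weight * probability over positions and bases."""
    return sum(w * p for wrow, prow in zip(weights, probs) for w, p in zip(wrow, prow))


def activation_maximization(weights, steps=100, lr=1.0):
    """Gradient ascent on per-position logits to increase toy_score."""
    logits = [[0.0] * len(ALPHABET) for _ in weights]
    for _ in range(steps):
        for pos, wrow in enumerate(weights):
            probs = softmax(logits[pos])
            expected = sum(w * p for w, p in zip(wrow, probs))
            # Chain rule through the softmax: d score / d logit_b = p_b * (w_b - E_p[w])
            for b in range(len(ALPHABET)):
                logits[pos][b] += lr * probs[b] * (wrow[b] - expected)
    return logits


motif = "TGCA"  # illustrative target, not taken from the paper
weights = [[1.0 if base == target else 0.0 for base in ALPHABET] for target in motif]

logits = activation_maximization(weights)
probs = [softmax(row) for row in logits]
designed = "".join(ALPHABET[row.index(max(row))] for row in probs)
print(designed, toy_score(probs, weights))
```

Optimizing logits rather than raw probabilities keeps every position a valid distribution over the four bases, and a final argmax turns the continuous result back into a discrete sequence.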

Experimental Results

The paper reports several computational experiments that test the framework. When applied to real genomic data, such as exon splice site signals, the GAN model captured critical sequence features like splice motifs, suggesting that it could scale to more complex generative tasks, such as gene or genome design.

In the context of designing DNA probes with desired binding affinities, the authors show that their joint method can produce sequences whose estimated binding strength surpasses that of existing sequences, even when the model was trained only on a limited set of weaker binders. Because these gains are estimated by a model rather than measured experimentally, they indicate what the optimization can achieve under that model.
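A minimal sketch of the joint idea, under loud assumptions: the "generator" below is a fixed random linear map followed by a softmax (a stand-in for a trained GAN generator), the "predictor" is a hypothetical motif-matching score, and the gradient with respect to the latent code is approximated by central finite differences instead of backpropagation. All function names are illustrative. The point is only that the search moves through the generator's latent space, not directly over sequences.

```python
import math
import random

ALPHABET = "ACGT"


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_generator(seq_len, latent_dim, seed=0):
    """Stand-in generator: fixed random linear map from latent code to per-position base probabilities."""
    rng = random.Random(seed)
    maps = [[[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in ALPHABET]
            for _ in range(seq_len)]

    def generate(z):
        return [softmax([sum(g * zi for g, zi in zip(row, z)) for row in pos]) for pos in maps]

    return generate


def predictor(probs, motif="TGCA"):
    """Hypothetical property score: expected number of positions matching the motif."""
    return sum(p[ALPHABET.index(b)] for p, b in zip(probs, motif))


def optimize_latent(generate, score, z0, steps=300, lr=0.02, eps=1e-4):
    """Gradient ascent on the latent code; returns the best code seen, including the start."""
    z = list(z0)
    best_z, best_s = list(z), score(generate(z))
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            up, down = list(z), list(z)
            up[i] += eps
            down[i] -= eps
            grad.append((score(generate(up)) - score(generate(down))) / (2 * eps))
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
        s = score(generate(z))
        if s > best_s:
            best_z, best_s = list(z), s
    return best_z, best_s


generate = make_generator(seq_len=4, latent_dim=3)
z0 = [0.0, 0.0, 0.0]
start_score = predictor(generate(z0))
z_best, best_score = optimize_latent(generate, predictor, z0)
sequence = "".join(ALPHABET[p.index(max(p))] for p in generate(z_best))
print(sequence, start_score, best_score)
```

Since every candidate is produced by the generator, the optimized sequence stays within what the generator can emit, which is the motivation the summary gives for combining the two methods.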

Implications and Future Directions

This research underscores the transformative potential of deep generative models in genomics. By automating and enhancing the design of DNA sequences, these models could significantly impact fields like synthetic biology and genome editing. The methods introduced here could lead to new avenues for producing tailored genetic constructs with applications in biofuels, pharmaceuticals, and more.

The paper also suggests several avenues for future research, such as adding experimental validation stages or developing more advanced conditional generative models to extend this machine-assisted design framework for DNA sequences. Adapting or combining these approaches with emerging machine learning paradigms might open further opportunities for exploration and application.

Overall, this work is an early step towards using deep learning for genomic design, and it invites computational biologists and machine learning researchers to reconsider what is possible in DNA synthesis and manipulation.
