
Convolutional neural network models for cancer type prediction based on gene expression (1906.07794v1)

Published 18 Jun 2019 in q-bio.GN, cs.LG, and q-bio.QM

Abstract: Background Precise prediction of cancer types is vital for cancer diagnosis and therapy. Important cancer marker genes can be inferred through predictive model. Several studies have attempted to build machine learning models for this task however none has taken into consideration the effects of tissue of origin that can potentially bias the identification of cancer markers. Results In this paper, we introduced several Convolutional Neural Network (CNN) models that take unstructured gene expression inputs to classify tumor and non-tumor samples into their designated cancer types or as normal. Based on different designs of gene embeddings and convolution schemes, we implemented three CNN models: 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN. The models were trained and tested on combined 10,340 samples of 33 cancer types and 731 matched normal tissues of The Cancer Genome Atlas (TCGA). Our models achieved excellent prediction accuracies (93.9-95.0%) among 34 classes (33 cancers and normal). Furthermore, we interpreted one of the models, known as 1D-CNN model, with a guided saliency technique and identified a total of 2,090 cancer markers (108 per class). The concordance of differential expression of these markers between the cancer type they represent and others is confirmed. In breast cancer, for instance, our model identified well-known markers, such as GATA3 and ESR1. Finally, we extended the 1D-CNN model for prediction of breast cancer subtypes and achieved an average accuracy of 88.42% among 5 subtypes. The codes can be found at https://github.com/chenlabgccri/CancerTypePrediction.

Citations (168)

Summary

  • The paper introduces three CNN architectures (1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN) that achieve 93.9-95.0% accuracy for cancer type prediction across 34 classes (33 cancers and normal).
  • The models incorporate a normal class to counteract tissue origin bias, ensuring the detection of true cancer-specific markers.
  • A guided saliency technique applied to the 1D-CNN identifies 2,090 cancer marker genes, including well-known breast cancer markers such as GATA3 and ESR1, pointing to the method's potential for diagnostics and personalized treatment.

Evaluation of CNN Models for Cancer Type Prediction Using Gene Expression Data

In this paper, the authors apply convolutional neural networks (CNNs) to classify cancer types from gene expression profiles. They address a gap in earlier machine learning models, which did not account for the way tissue of origin can bias the identification of cancer markers. The work covers the design, implementation, and evaluation of three CNN architectures (1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN), trained and tested on 10,340 samples of 33 cancer types and 731 matched normal tissues from The Cancer Genome Atlas (TCGA).

Model Design and Performance

The paper introduces three CNN architectures for predicting cancer types:

  1. 1D-CNN Model: Takes the gene expression profile as a vector and applies one-dimensional convolution. It is the simplest of the three, with a single convolutional layer and fewer hyperparameters, which lowers the risk of overfitting on high-dimensional genomic data.
  2. 2D-Vanilla-CNN Model: Reshapes the gene expression vector into a two-dimensional, image-like input. It needs more parameters and computation than the 1D-CNN and converges more slowly during training, while reaching comparable accuracy.
  3. 2D-Hybrid-CNN Model: Uses two parallel convolutional pathways, in a style reminiscent of ResNet, to capture vertical and horizontal global features. It is more computationally expensive than the 1D-CNN.
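The difference between the vector input of the 1D-CNN and the image-like input of the 2D models can be illustrated with a minimal pure-Python sketch. The helper names `conv1d` and `to_2d`, the kernel weights, the stride, and the toy expression values are all assumptions chosen for illustration; they are not the authors' implementation or hyperparameters.

```python
import math

def conv1d(expr, kernel, stride):
    """Slide one kernel over the expression vector (no padding), then apply ReLU."""
    k = len(kernel)
    out = []
    for start in range(0, len(expr) - k + 1, stride):
        window = expr[start:start + k]
        activation = sum(w * x for w, x in zip(kernel, window))
        out.append(max(0.0, activation))
    return out

def to_2d(expr, n_cols):
    """Zero-pad the vector and reshape it into rows of n_cols values (image-like input)."""
    n_rows = math.ceil(len(expr) / n_cols)
    padded = list(expr) + [0.0] * (n_rows * n_cols - len(expr))
    return [padded[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# Toy expression values for 7 genes (hypothetical, not TCGA data).
expr = [0.5, 1.0, 0.0, 2.0, 1.5, 0.0, 3.0]

features = conv1d(expr, kernel=[1.0, -1.0, 0.5], stride=2)  # 1D-CNN style input
image = to_2d(expr, n_cols=3)                               # 2D-CNN style input

print(features)  # [0.0, 0.0, 3.0]
print(image)     # [[0.5, 1.0, 0.0], [2.0, 1.5, 0.0], [3.0, 0.0, 0.0]]
```

In a real model the kernel weights would be learned and many kernels would run in parallel; the sketch only shows how the same expression values are presented to a 1D or a 2D convolution.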

Across these models, reported accuracy ranges from 93.9% to 95.0% on the 34-class task (33 cancer types plus normal). Including the normal class accounts for tissue of origin, which makes the predictions and the markers derived from them easier to interpret in a clinical setting.

Methodological Insights

The authors argue that CNN architectures with limited depth are preferable in genomic contexts, given sample size constraints and potential overfitting issues associated with more complex models. They retained simplicity in gene input ordering and explored kernel configurations to naturally encapsulate gene interactions within the neural network framework.

Adding a normal class to the prediction layer reduces the influence of tissue of origin and helps expose cancer-type-specific markers. Without it, as in other deep learning studies, the markers a model relies on may reflect the tissue a sample came from rather than the cancer itself.
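The labelling idea can be sketched in a few lines of Python: every non-tumor sample, whatever its tissue, is mapped to one shared "Normal" class, so the classifier cannot separate tumor from normal by tissue identity alone. The function name `make_label` and the four toy samples are assumptions for illustration; BRCA and LUAD are used only as example TCGA cancer type codes.

```python
def make_label(cancer_type, is_tumor):
    """Tumor samples keep their cancer type; all normal samples share one class."""
    return cancer_type if is_tumor else "Normal"

# Toy (cancer_type, is_tumor) pairs; hypothetical, not real TCGA records.
samples = [("BRCA", True), ("BRCA", False), ("LUAD", True), ("LUAD", False)]

labels = [make_label(cancer_type, is_tumor) for cancer_type, is_tumor in samples]
classes = sorted(set(labels))                       # cancer types plus "Normal"
class_index = {name: i for i, name in enumerate(classes)}
targets = [class_index[label] for label in labels]  # integer targets for training

print(labels)   # ['BRCA', 'Normal', 'LUAD', 'Normal']
print(targets)  # [0, 2, 1, 2]
```

With the full dataset described in the abstract, the same mapping would yield 34 classes: 33 cancer types and one normal class.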

Saliency Map Interpretation

A key advantage is the use of a guided gradient saliency technique to derive a gene-effect score matrix that pinpoints cancer-specific markers. In total, 2,090 markers were identified, including well-documented ones such as GATA3 and ESR1 in breast cancer as well as less-studied candidates that warrant further biological validation and study.
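Once a gene-effect score matrix is available, turning it into a marker list is a ranking step. The sketch below assumes the matrix is stored as one list of per-gene scores per class and selects the top `k` genes for each class; the function name `top_markers`, the score values, and the placeholder gene names `GENE_C` and `GENE_D` are made up for illustration and are not results from the paper. It does not compute the saliency scores themselves.

```python
def top_markers(scores, genes, k):
    """Return the k highest-scoring genes for each class.

    scores: dict mapping class name -> list of per-gene effect scores
    genes:  list of gene names, aligned with each score list
    """
    result = {}
    for cls, row in scores.items():
        ranked = sorted(range(len(genes)), key=lambda i: row[i], reverse=True)
        result[cls] = [genes[i] for i in ranked[:k]]
    return result

# Toy score matrix (hypothetical values, two classes, four genes).
genes = ["GATA3", "ESR1", "GENE_C", "GENE_D"]
scores = {
    "BRCA":   [0.9, 0.8, 0.1, 0.2],
    "Normal": [0.1, 0.3, 0.7, 0.2],
}

markers = top_markers(scores, genes, k=2)
unique_markers = sorted({g for per_class in markers.values() for g in per_class})

print(markers)         # {'BRCA': ['GATA3', 'ESR1'], 'Normal': ['GENE_C', 'ESR1']}
print(unique_markers)  # ['ESR1', 'GATA3', 'GENE_C']
```

Because a gene can rank highly for more than one class, the number of unique markers can be smaller than the number of classes times `k`.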

Implications for Cancer Diagnosis

The paper offers a promising approach to cancer diagnostics: using the interpretability of CNN models to identify actionable cancer markers and to support finer subtyping (for example, the extended 1D-CNN reached an average accuracy of 88.42% across 5 breast cancer subtypes). Differences among the identified markers may also point to novel biological pathways involved in cancer development.

Future Prospects

The research opens avenues for integrating multi-omic layers (e.g., DNA methylation, somatic mutations) to refine classification frameworks further, potentially bridging gaps in current precision medicine paradigms. Moreover, expanding to larger, varied datasets like GTEx can offer enhanced insights into the interplay between different genomic aberrations across diverse cancer backgrounds.

Overall, this paper contributes to computational oncology by pairing deep learning methods with clinical applicability, emphasizing both classification performance and biological interpretation. Further development and validation are still needed to fully exploit the potential of CNNs in genomic medicine.