
Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions (2402.00077v2)

Published 30 Jan 2024 in q-bio.GN, cs.LG, and stat.ME

Abstract: Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for learning real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging such multi-institutional sequencing data presents significant challenges. Variations in gene panels result in loss of information when the analysis is conducted on common gene sets. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features to preserve information beyond common genes and maximize the utilization of all available data while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized and noise-reduced lower-dimensional latent variables, the true mutation pattern unique to each individual is captured. We assess the model's performance and parameter estimation through extensive simulation studies. The extracted latent features from the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data.

Authors (4)
  1. Yuan Chen (113 papers)
  2. Ronglai Shen (4 papers)
  3. Xiwen Feng (1 paper)
  4. Katherine Panageas (1 paper)
Citations (1)

Summary

  • The paper presents the Bridge model that integrates and harmonizes genomic data from multiple institutions to enhance cancer patient survival predictions.
  • It addresses challenges like gene panel variations, diverse sequencing techniques, and patient heterogeneity by employing quantile-matched latent variables.
  • Simulation studies assess the model's performance and parameter estimation, and in the GENIE BPC data the extracted latent features consistently perform well in predicting patient survival across six cancer types.

The paper "Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions" addresses how to use multi-institutional sequencing data to improve the understanding and treatment of cancer. Because cancer is driven by genomic alterations, tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents an opportunity to derive real-world evidence for precision oncology.

GENIE BPC Database:

The paper uses data from the GENIE (Genomics Evidence Neoplasia Information Exchange) Biopharma Collaborative (BPC), an effort led by the American Association for Cancer Research. This database links genomic data with clinical information for patients treated at multiple cancer centers, creating a resource for oncological research and clinical decision-making.

Challenges Identified:

The paper identifies several challenges inherent in dealing with multi-institutional genomic data:

  1. Variations in gene panels: Different institutions utilize varying gene panels for sequencing, leading to information loss when data is analyzed only on common genes.
  2. Sequencing techniques and patient heterogeneity: Differences in sequencing methods and patient populations across institutions complicate data integration.
  3. Data dimensionality and gene mutation patterns: The high dimensionality of genomic data, coupled with sparse and often weak gene mutation signals, poses significant hurdles.
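The information loss from restricting analysis to common genes can be illustrated with a small sketch. The institution names and gene panels below are hypothetical and chosen only for illustration; they are not taken from GENIE BPC. The sketch compares the intersection of the panels (what a common-gene analysis keeps) with their union, and builds a per-site coverage mask that records which genes each site actually sequenced.

```python
# Hypothetical panels (illustrative only): institution -> genes on its panel.
panels = {
    "site_A": {"TP53", "KRAS", "EGFR", "BRAF"},
    "site_B": {"TP53", "KRAS", "PIK3CA"},
    "site_C": {"TP53", "KRAS", "EGFR", "PIK3CA", "ALK"},
}


def common_genes(panels):
    """Genes sequenced at every institution (intersection of all panels)."""
    return set.intersection(*panels.values())


def all_genes(panels):
    """Genes sequenced at one or more institutions (union of all panels)."""
    return set.union(*panels.values())


def coverage_mask(panels):
    """Return the sorted union of genes and, per site, a list of booleans
    marking which of those genes the site's panel covers."""
    genes = sorted(all_genes(panels))
    mask = {site: [g in panel for g in genes] for site, panel in panels.items()}
    return genes, mask


common = common_genes(panels)
union = all_genes(panels)
genes, mask = coverage_mask(panels)

# Genes that a common-gene analysis would discard entirely.
dropped = sorted(union - common)
print(f"kept {len(common)} of {len(union)} genes; dropped: {dropped}")
```

In this toy setup only two of six genes survive the intersection. A coverage mask of this kind is one simple way to distinguish "not sequenced" from "sequenced and not mutated", a distinction that a common-gene analysis never has to make but any method using all available genes does.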

Bridge Model Introduction:

To address these challenges, the paper introduces the Bridge model, which employs a quantile-matched latent variable approach. The key features and benefits of this model include:

  • Integrated feature derivation: It preserves information beyond common gene sets, maximizing the utility of all available data.
  • Information sharing and learning efficiency: The model leverages shared information across datasets to enhance learning efficiency and improve generalizability.
  • Noise-reduced latent variables: By extracting lower-dimensional latent variables with reduced noise, it captures true mutation patterns unique to each patient.
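This summary does not describe the Bridge model's estimation procedure, so the following is not the authors' method. It is a minimal sketch of the general idea behind quantile matching only: scores from one institution are mapped onto a shared reference distribution by matching empirical quantiles, which makes values comparable across sites while preserving their within-site ordering. The function names, the mid-rank quantile convention, and the linear interpolation are all assumptions made for illustration.

```python
def empirical_quantiles(values):
    """Mid-rank quantile in (0, 1) of each value within its own sample.
    Ties are broken by input order (an arbitrary simplification)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    quantiles = [0.0] * n
    for rank, i in enumerate(order):
        quantiles[i] = (rank + 0.5) / n
    return quantiles


def quantile_of(sorted_ref, p):
    """Linearly interpolated p-th quantile of an already sorted sample."""
    pos = p * (len(sorted_ref) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_ref) - 1)
    frac = pos - lo
    return sorted_ref[lo] * (1 - frac) + sorted_ref[hi] * frac


def quantile_match(site_values, reference):
    """Map each site value to the reference value at the same quantile."""
    ref = sorted(reference)
    return [quantile_of(ref, p) for p in empirical_quantiles(site_values)]


# Hypothetical per-patient scores from one site, on a different scale
# than the (also hypothetical) pooled reference scores.
site_scores = [10.0, 30.0, 20.0]
reference_scores = [0.0, 1.0, 2.0, 3.0, 4.0]
matched = quantile_match(site_scores, reference_scores)
print(matched)
```

After matching, the site's scores lie within the range of the reference sample and keep their original rank order, so a patient who scored highest at their own site still scores highest after harmonization.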

Model Performance:

Extensive simulation studies were conducted to evaluate the Bridge model's performance and parameter estimation. Separately, in the real GENIE BPC data, the latent features extracted by the Bridge model consistently excel in predicting patient survival across six cancer types.
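This summary does not state which metric was used to score survival prediction. A common choice for this kind of evaluation is Harrell's concordance index (C-index), and the sketch below assumes it purely for illustration. It computes the fraction of comparable patient pairs in which the patient with the shorter observed survival time was assigned the higher risk score. The input data are made up.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up times
    events : 1 if the event (e.g. death) was observed, 0 if censored
    risk   : predicted risk scores (higher means shorter expected survival)

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's observed time. Tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up example: four patients, the third one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
good_risk = [0.9, 0.7, 0.4, 0.1]   # higher risk for earlier events
bad_risk = [0.1, 0.4, 0.7, 0.9]    # exactly reversed ordering
flat_risk = [0.5, 0.5, 0.5, 0.5]   # uninformative scores

print(concordance_index(times, events, good_risk))
print(concordance_index(times, events, bad_risk))
print(concordance_index(times, events, flat_risk))
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to uninformative scores, and 0.0 to a fully reversed ordering. In practice the risk scores would come from a survival model fitted on the extracted latent features rather than being supplied by hand.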

In summary, the paper presents the Bridge model, a quantile-matched latent variable approach for integrating and harmonizing multi-institutional genomic data. The model addresses gene panel variation, differences across institutions, and sparse, high-dimensional mutation data, and its latent features predict patient survival well across six cancer types in GENIE BPC, supporting its use in precision oncology research.
