Cancer Subtype Identification through Integrating Inter and Intra Dataset Relationships in Multi-Omics Data

Published 2 Dec 2023 in cs.LG, q-bio.GN, and stat.AP | (2312.02195v1)

Abstract: The integration of multi-omics data has emerged as a promising approach for gaining comprehensive insights into complex diseases such as cancer. This paper proposes a novel approach to identify cancer subtypes through the integration of multi-omics data for clustering. The proposed method, Linear Inter and Intra Dataset Affinity Fusion (LIDAF), utilises affinity matrices based on linear relationships between and within different omics datasets. Canonical Correlation Analysis is employed to create distance matrices based on Euclidean distances between canonical variates. The distance matrices are converted to affinity matrices, which are then fused in a three-step process. The proposed LIDAF addresses the limitations of the existing method, resulting in improved clustering performance as measured by the Adjusted Rand Index and the Normalized Mutual Information score. Moreover, the proposed LIDAF approach demonstrates a notable enhancement in 50% of the log10 rank p-values obtained from Cox survival analysis, surpassing the performance of the best reported method and highlighting its potential for identifying distinct cancer subtypes.


Authors (3)

Summary

The paper introduces LIDAF, which uses Canonical Correlation Analysis to construct and fuse affinity matrices for improved cancer subtype identification.
Experiments on ten cancer types show that LIDAF outperforms existing methods as measured by ARI, NMI, and survival analysis p-values.
The method handles high dimensionality and missing values through removal and imputation of missing entries, Z-score standardization, a Yeo-Johnson transformation, and Gaussian mixture-based feature selection.

Recent advancements in computational techniques have dramatically enhanced our ability to analyze vast and complex biological datasets. Multi-omics data, which includes various types of molecular data such as gene expression, miRNA, and DNA methylation profiles, is particularly challenging due to the high dimensionality and variability among different data types.

To address these challenges, the paper introduces a method called Linear Inter and Intra Dataset Affinity Fusion (LIDAF). The approach uses Canonical Correlation Analysis (CCA) to obtain canonical variates, computes Euclidean distances between them, and converts the resulting distance matrices into affinity matrices that quantify the linear relationships within and across omics datasets. These affinity matrices are then fused in a three-step process, and the fused matrix is used for clustering patients into cancer subtypes.

The effectiveness of LIDAF was assessed through experiments on ten cancer types, including breast invasive carcinoma (BRCA) and kidney renal clear cell carcinoma (KIRC). Clustering quality was evaluated with the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), which measure how closely the patient groups produced by clustering the multi-omics profiles agree with reference labels.
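As a small illustration of these two metrics, the sketch below scores a made-up clustering against made-up reference subtypes using scikit-learn. The label vectors are invented for the example and do not come from the paper.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical reference subtypes and cluster assignments for nine patients.
true_subtypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted = [1, 1, 1, 0, 0, 2, 2, 2, 2]

ari = adjusted_rand_score(true_subtypes, predicted)
nmi = normalized_mutual_info_score(true_subtypes, predicted)

# Both metrics ignore how clusters are numbered: a pure relabelling scores 1.0.
relabelled = [2, 2, 2, 0, 0, 0, 1, 1, 1]
ari_relabelled = adjusted_rand_score(true_subtypes, relabelled)

print(f"ARI={ari:.3f}  NMI={nmi:.3f}  ARI(relabelled)={ari_relabelled:.3f}")
```

ARI is corrected for chance, so random assignments score near zero, while NMI ranges from 0 to 1.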

Results showed that LIDAF overcame the limitations of the existing method it was compared against, demonstrating superior clustering performance. The method also improved on the best previously reported method in 50% of the log10 rank p-values obtained from Cox survival analysis, meaning that in those cases the identified subtypes were more significantly associated with survival (lower p-values). Cox proportional hazards regression assesses the effect of variables on patient survival time.

One of the challenges addressed by LIDAF is the curse of dimensionality, which arises when multiple high-dimensional omics datasets are integrated into a single analysis. Missing values are handled through removal and imputation, followed by Z-score standardization and a Yeo-Johnson transformation, which reduces skewness so that feature distributions are closer to normal. Relevant features are then selected using a Gaussian Mixture Model with Bayesian inference.
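The preprocessing steps named above could be sketched as follows with scikit-learn. The 20% missingness threshold, the median imputation, the synthetic data, and the function name `preprocess` are assumptions for illustration; the summary does not specify these details, and the Gaussian mixture feature selection step is omitted because its procedure is not described here.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler


def preprocess(X, max_missing_fraction=0.2):
    """Drop sparse features, impute, z-score, then apply a Yeo-Johnson transform."""
    X = np.asarray(X, dtype=float)
    # Removal: discard features with too many missing values (threshold is an assumption).
    keep = np.isnan(X).mean(axis=0) <= max_missing_fraction
    pipeline = make_pipeline(
        SimpleImputer(strategy="median"),  # assumption: median imputation
        StandardScaler(),                  # Z-score standardization
        PowerTransformer(method="yeo-johnson", standardize=True),
    )
    return pipeline.fit_transform(X[:, keep]), keep


# Synthetic, right-skewed data with some missing entries.
rng = np.random.default_rng(1)
X = rng.lognormal(size=(100, 5))
X[3, 0] = np.nan
X[10, 2] = np.nan
X[::2, 4] = np.nan  # half of this feature is missing, so it is removed

X_clean, kept = preprocess(X)
print(X_clean.shape, kept)
```

Fitting the transformer on training data only and reusing it on held-out patients would avoid information leakage in a real evaluation.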

Overall, the LIDAF method shows great promise in improving the accuracy of cancer subtype identification from multi-omics data by capturing both the diversity and the commonality among various data types. Its versatility could pave the way for its application to other complex datasets, both within healthcare and in other domains where data fusion and pattern identification are essential.
