- The paper introduces LIDAF, which uses Canonical Correlation Analysis to construct and fuse affinity matrices for improved cancer subtype identification.
- Experiments on ten cancer types show that LIDAF outperforms existing methods as measured by ARI, NMI, and survival-analysis p-values.
- The method effectively tackles high dimensionality and missing values through advanced imputation, transformation, and Gaussian mixture-based feature selection.
Recent advancements in computational techniques have dramatically enhanced our ability to analyze vast and complex biological datasets. Multi-omics data, which includes various types of molecular data such as gene expression, miRNA, and DNA methylation profiles, is particularly challenging due to the high dimensionality and variability among different data types.
To address these challenges, a new method called Linear Inter and Intra Dataset Affinity Fusion (LIDAF) has been introduced. It uses Canonical Correlation Analysis (CCA) to construct affinity matrices that quantify the linear relationships within and across omics datasets. These matrices are then fused in a three-step process to improve clustering performance and, in turn, the precision of cancer subtype identification.
The effectiveness of the LIDAF method was assessed through comprehensive experiments involving ten different cancer types, including Breast invasive carcinoma (BRCA), Kidney renal clear cell carcinoma (KIRC), and others. The evaluation was carried out using metrics such as the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), which measure the success of clustering in subdividing patients into groups based on their multi-omics profiles.
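Both metrics compare a predicted grouping against reference labels and ignore how the clusters happen to be numbered. The short example below uses scikit-learn's implementations; the label vectors are made up for illustration and are not data from the paper.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Illustrative reference subtypes for nine patients (invented labels).
reference = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping with the cluster ids permuted: a perfect clustering.
permuted = [1, 1, 1, 0, 0, 0, 2, 2, 2]

# A clustering that misassigns two patients.
imperfect = [0, 0, 1, 1, 1, 1, 2, 2, 0]

# Identical partitions, so both scores are 1.0 (up to floating-point rounding).
ari_perfect = adjusted_rand_score(reference, permuted)
nmi_perfect = normalized_mutual_info_score(reference, permuted)

# Partial agreement gives scores strictly below 1.
ari_imperfect = adjusted_rand_score(reference, imperfect)
nmi_imperfect = normalized_mutual_info_score(reference, imperfect)

print(ari_perfect, nmi_perfect)
print(ari_imperfect, nmi_imperfect)
```

ARI is corrected for chance, so a random assignment scores near zero, while NMI ranges from 0 to 1. Reporting both, as the study does, guards against a method looking good on only one notion of agreement.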
Results showed that LIDAF achieved better clustering performance than the state-of-the-art methods it was compared against. It also outperformed the previously best-reported methods in survival analysis, achieving more significant p-values in Cox proportional hazards regression, which assesses the impact of variables on patient survival time.
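For readers unfamiliar with the survival model mentioned above, the standard textbook form of the Cox proportional hazards model is shown below. It is included for orientation only and is not taken from the paper. Using the cluster assignment as the covariate is an assumption about how such subtype comparisons are usually set up.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)
```

Here $h_0(t)$ is the baseline hazard, $x$ is the covariate vector for a patient (for example, an encoding of the assigned subtype), and $\beta$ holds the regression coefficients. The reported p-value tests the null hypothesis $\beta = 0$, so a smaller p-value indicates stronger evidence that the identified subtypes differ in survival.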
One challenge LIDAF addresses is the "curse of dimensionality," which arises when multiple omics datasets are integrated into a single analysis. It also handles missing values through removal and imputation, followed by Z-score standardization and a Yeo-Johnson transformation that brings skewed distributions closer to normal. To select relevant features, LIDAF uses a Gaussian Mixture Model with Bayesian inference.
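The preprocessing chain described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' pipeline: the 20% missingness threshold, mean imputation, the function names `preprocess` and `select_bimodal_features`, and the synthetic data are all hypothetical. In particular, the feature-selection step below compares one-component and two-component Gaussian mixtures by BIC as a simple stand-in, and the paper's actual Bayesian inference procedure may differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import PowerTransformer, StandardScaler


def preprocess(x, max_missing=0.2):
    """Removal, imputation, Z-score scaling, then Yeo-Johnson (assumed settings)."""
    # Removal: drop features whose fraction of missing values is too high.
    keep = np.isnan(x).mean(axis=0) <= max_missing
    x = x[:, keep]
    # Imputation: fill remaining gaps with the feature mean (assumed strategy).
    x = SimpleImputer(strategy="mean").fit_transform(x)
    # Z-score standardization.
    x = StandardScaler().fit_transform(x)
    # Yeo-Johnson transform to reduce skewness.
    x = PowerTransformer(method="yeo-johnson").fit_transform(x)
    return x, keep


def select_bimodal_features(x, random_state=0):
    """Keep features better explained by two Gaussians than by one (BIC stand-in)."""
    selected = []
    for j in range(x.shape[1]):
        col = x[:, [j]]
        bic_one = GaussianMixture(n_components=1, random_state=random_state).fit(col).bic(col)
        bic_two = GaussianMixture(n_components=2, random_state=random_state).fit(col).bic(col)
        if bic_two < bic_one:  # lower BIC means a better fit after the complexity penalty
            selected.append(j)
    return np.array(selected, dtype=int)


# Synthetic data: feature 0 separates two patient groups, feature 1 is noise,
# and feature 2 is mostly missing.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 100)
raw = np.column_stack([
    np.where(groups == 0, -3.0, 3.0) + rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])
raw[::2, 2] = np.nan   # 50% missing, so this feature is removed
raw[:5, 0] = np.nan    # a few gaps, so this feature is imputed

processed, kept = preprocess(raw)
selected = select_bimodal_features(processed)
print(kept, selected)
```

The intuition behind the mixture-based selection is that a feature whose values form two distinct modes is more likely to separate patient groups than one that looks like a single bell curve.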
Overall, the LIDAF method shows great promise in improving the accuracy of cancer subtype identification from multi-omics data by capturing both the diversity and the commonality among various data types. Its versatility could pave the way for its application to other complex datasets, both within healthcare and in other domains where data fusion and pattern identification are essential.