Fast and Scalable Inference of Multi-Sample Cancer Lineages
The paper presents LICHeE (Lineage Inference for Cancer Heterogeneity and Evolution), a computational method that automates the reconstruction of cancer cell lineages. It is designed for multiple tumor samples taken from a patient at different stages or from different regions, and it provides insight into cancer progression and heterogeneity. LICHeE uses the variant allele frequencies (VAFs) of somatic single nucleotide variants (SSNVs), obtained by deep sequencing across the samples, to infer lineage trees and decompose each sample into distinct subclones.
Key Methodological Insights
The primary challenge LICHeE addresses is the complexity of somatic phylogenetics, which is driven by tumor heterogeneity. Traditional tree-building methods do not adequately account for the stochastic nature of somatic mutations or for the presence of distinct subclonal populations within tumors. LICHeE handles both in a three-step pipeline: it partitions SSNVs into groups based on their presence across samples, clusters the SSNVs within these groups using Gaussian Mixture Models (GMMs), and constructs an evolutionary constraint network. This network, a directed acyclic graph, encodes potential precedence relationships among SSNV clusters.
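The grouping and network steps can be illustrated with a minimal sketch. It is not LICHeE's implementation: the presence threshold `present_min`, the error margin `eps`, the use of one mean centroid per group in place of GMM clustering, and the strict total-VAF ordering that keeps the graph acyclic are all assumptions made for illustration.

```python
from collections import defaultdict

def presence_profile(vafs, present_min=0.05):
    """Binary string marking the samples in which an SSNV is treated as present."""
    return "".join("1" if v >= present_min else "0" for v in vafs)

def group_by_profile(ssnv_vafs, present_min=0.05):
    """Partition SSNVs into groups that share the same presence profile."""
    groups = defaultdict(list)
    for name, vafs in ssnv_vafs.items():
        groups[presence_profile(vafs, present_min)].append(name)
    return dict(groups)

def centroid(names, ssnv_vafs):
    """Per-sample mean VAF of a set of SSNVs (a stand-in for a GMM cluster mean)."""
    n_samples = len(ssnv_vafs[names[0]])
    return [sum(ssnv_vafs[n][i] for n in names) / len(names) for i in range(n_samples)]

def constraint_edges(centroids, eps=0.02):
    """Add edge u -> v when u's VAF is at least v's (within eps) in every sample.

    Assumption: u must also have a strictly larger total VAF than v, which
    prevents edges in both directions and so keeps the graph acyclic.
    """
    edges = []
    for u, cu in centroids.items():
        for v, cv in centroids.items():
            if u == v or sum(cu) <= sum(cv):
                continue
            if all(a + eps >= b for a, b in zip(cu, cv)):
                edges.append((u, v))
    return edges

# Hypothetical VAFs for five SSNVs across three samples.
ssnv_vafs = {
    "s1": [0.50, 0.48, 0.51],
    "s2": [0.49, 0.50, 0.47],
    "s3": [0.30, 0.25, 0.00],
    "s4": [0.28, 0.27, 0.00],
    "s5": [0.00, 0.10, 0.20],
}
groups = group_by_profile(ssnv_vafs)
centroids = {profile: centroid(names, ssnv_vafs) for profile, names in groups.items()}
edges = constraint_edges(centroids)
print(groups)
print(edges)
```

In this toy input the group present in all three samples can precede both of the others, while the two partially present groups cannot precede each other because each has a higher VAF than the other in at least one sample.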
The method's efficacy stems from two elements: the evolutionary constraint network, which lets it search the space of valid lineage trees efficiently, and VAF constraints, which ensure that the resulting trees are biologically consistent. LICHeE scales well and runs quickly, reconstructing lineage trees in seconds from datasets comprising hundreds of SSNVs. The method is benchmarked on both simulated data and real cancer datasets, where it demonstrates high sensitivity in SSNV group assignment and high accuracy in tree topology reconstruction.
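A VAF constraint of this kind can be sketched as a check on a candidate tree. The version below assumes a sum rule: in every sample, the combined VAF of a node's children may not exceed the VAF of the node itself, up to an error margin. The function name, the `eps` margin, the tree encoding, and the example values are hypothetical and are not taken from the paper.

```python
def satisfies_sum_constraint(tree, centroids, eps=0.02):
    """Check an assumed VAF sum rule on a candidate lineage tree.

    tree maps each parent node to a list of its children;
    centroids maps each node to its per-sample VAFs.
    """
    for parent, children in tree.items():
        for i, parent_vaf in enumerate(centroids[parent]):
            child_total = sum(centroids[child][i] for child in children)
            if child_total > parent_vaf + eps:
                return False  # children would account for more cells than the parent
    return True

# Hypothetical cluster centroids over three samples.
centroids = {
    "root": [0.50, 0.50, 0.50],
    "A": [0.495, 0.49, 0.49],
    "B": [0.29, 0.26, 0.00],
    "C": [0.00, 0.10, 0.20],
    "D": [0.30, 0.30, 0.30],
}
tree_ok = {"root": ["A"], "A": ["B", "C"]}
tree_bad = {"root": ["A"], "A": ["B", "D"]}

print(satisfies_sum_constraint(tree_ok, centroids))
print(satisfies_sum_constraint(tree_bad, centroids))
```

The second tree fails because B and D together would exceed A's VAF in the first sample, so sibling clusters B and D cannot both descend directly from A under this rule.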
Experimental Validation
LICHeE is evaluated on simulated lineage trees and on multiple publicly available cancer datasets. In the simulation studies, it achieves high sensitivity (94-99%) for SSNV group assignment even at lower coverage (100x), and it preserves ancestor-descendant relationships with few reversal errors. Its robustness to VAF variance induced by copy number variants (CNVs) suggests that it is applicable to highly complex cancer genomes.
On real datasets, including clear cell renal cell carcinoma (ccRCC) and high-grade serous ovarian cancer (HGSC), LICHeE generates lineage trees that often agree with those produced by extensive manual analysis, and it provides additional insight into tumor heterogeneity. In the ccRCC data, for instance, LICHeE identifies heterogeneous subclones that traditional maximum parsimony approaches do not reveal. In the HGSC data, it exposes the failure of other methods (e.g. neighbor-joining with Pearson correlation distances) to capture evidence-backed lineages. Lastly, in breast cancer xenoengraftment studies, the lineage trees derived by LICHeE align well with single-cell sequencing reconstructions, underscoring the method's consistency and reliability.
Implications and Future Directions
LICHeE represents a significant advance in multi-sample cancer phylogenetic inference, providing a scalable approach for analyzing large cancer sequencing datasets. Because its VAF constraints reflect how cancer progresses and how samples mix distinct subclones, the resulting trees can improve understanding of tumor evolution and may inform the development of targeted cancer therapies.
Future developments could extend LICHeE to accommodate lower-coverage sequencing data, to incorporate aneuploidies and larger CNVs directly, and to analyze cancer evolution using more comprehensive genomic data. As cancer genomic research continues to expand, methods like LICHeE will be important for understanding cancer evolution and improving therapeutic strategies.