
LangCell: Language-Cell Pre-training for Cell Identity Understanding (2405.06708v5)

Published 9 May 2024 in q-bio.GN, cs.AI, and cs.CL

Abstract: Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insight into its biological characteristics. Understanding cell identity from transcriptomic data, such as annotating cell types, has become an important task in bioinformatics. Because these semantic aspects are determined by human experts, AI models cannot effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained only on a single modality, transcriptomics data, and lack an understanding of cell identity knowledge. As a result, they must be fine-tuned for downstream tasks and struggle when labeled data with the desired semantic labels is lacking. To address this issue, we propose an innovative solution: constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce $\textbf{LangCell}$, the first $\textbf{Lang}$uage-$\textbf{Cell}$ pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that works effectively in zero-shot cell identity understanding scenarios, and it also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.
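The zero-shot setting described in the abstract follows the CLIP-style cross-modal matching pattern: a cell embedding is compared against text embeddings of candidate identity descriptions, and the best-scoring description wins. The sketch below illustrates that matching step only; the function name, the temperature value, and the random vectors standing in for real encoder outputs are all illustrative assumptions, not LangCell's actual implementation.

```python
import numpy as np

def zero_shot_classify(cell_emb, text_embs, temperature=0.07):
    """Score one cell embedding against candidate label-text embeddings
    via temperature-scaled cosine similarity, returning a probability
    distribution over the candidate labels (CLIP-style matching)."""
    cell = cell_emb / np.linalg.norm(cell_emb)
    texts = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = texts @ cell / temperature          # cosine similarities, scaled
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    return probs / probs.sum()

# Toy demo: random vectors stand in for real cell/text encoder outputs.
rng = np.random.default_rng(0)
labels = ["T cell", "B cell", "monocyte"]
text_embs = rng.normal(size=(3, 64))
cell_emb = text_embs[1] + 0.1 * rng.normal(size=64)  # a cell near "B cell"
probs = zero_shot_classify(cell_emb, text_embs)
print(labels[int(np.argmax(probs))])
```

Because classification reduces to embedding similarity, new cell types can be scored at inference time simply by writing a new text description, with no labeled cells or fine-tuning required, which is what makes the zero-shot scenario possible.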

