Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
194 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
46 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning (2306.04371v1)

Published 7 Jun 2023 in cs.CE

Abstract: Single-cell RNA sequencing (scRNA-seq) data is a potent tool for comprehending the "language of life" and can provide insights into various downstream biomedical tasks. Large-scale LLMs are starting to be used for cell representation learning. However, current LLM-based cell representation learning methods depend solely on the BERT architecture, causing an anisotropic embedding space that leads to inefficient semantic representation. Contrastive learning alleviates this problem by distributing the embeddings uniformly. As a larger batch size in contrastive learning results in better representation, the practical application of contrastive learning in cell representation learning is hampered by the high dimensionality of scRNA-seq data and the large parameter volume of LLMs. To address the batch size limitation, we propose a novel divide-and-conquer contrastive learning approach to decouple the batch size from the GPU memory size for cell representation learning. Based on our divide-and-conquer contrastive learning approach, we introduce Single-Cell LLM CelLLM, a large-scale cell representation learning model to handle high-dimensional scRNA-seq data with tens of thousands of genes. CelLLM has over 50 million parameters trained with 2 million scRNA-seq data and makes the first attempt to learn cell LLMs from both normal cells and cancer cells. CelLLM achieves new state-of-the-art (SOTA) results in all evaluated downstream tasks: including a 71.8 F_1-score for cell type annotation (a 3.0% absolute improvement over scBERT), an average F_1-score of 88.9 for single-cell drug sensitivity prediction in a few-shot scenario (an 8.3% absolute improvement), and a 93.4 Pearson's correlation for single-omics cell line drug sensitivity prediction (a 6.2% absolute improvement).

Citations (7)

Summary

We haven't generated a summary for this paper yet.