scDataset: Scalable Single-Cell Loader
- scDataset is a scalable, modular library that enables efficient minibatch streaming from large single-cell omics datasets using block sampling.
- It leverages a PyTorch IterableDataset architecture to minimize memory usage and support high-throughput, distributed deep learning training.
- Empirical benchmarks demonstrate up to 129× speedup over traditional loaders while preserving full metadata without format conversion.
scDataset is a scalable, modular data loading library specifically designed for efficient minibatch streaming from large-scale single-cell omics datasets, facilitating deep learning workflows that require random shuffling and memory efficiency. Its core innovation is to provide high-throughput, flexible, and randomized sampling directly from AnnData files—the de facto standard for single-cell data—without requiring format conversion or excessive memory usage. The framework is implemented as a PyTorch IterableDataset and achieves significant speedups over competing solutions for datasets of unprecedented size, such as Tahoe 100M (>100 million cells) (2506.01883).
1. Technical Motivation and Community Context
The expansion of single-cell omics datasets to hundreds of millions of cells, exemplified by Tahoe 100M, has rendered previous data loading paradigms inadequate. Deep learning frameworks require shuffled minibatch access for stochastic gradient descent and distributed training, yet existing solutions introduce major bottlenecks:
- AnnLoader, which works natively with AnnData, is constrained by slow single-sample random disk access (~20 samples/s), resulting in single-epoch processing times exceeding 58 days for large datasets.
- Alternate approaches, such as HuggingFace Datasets and BioNeMo, require conversion to dense formats, increasing storage requirements up to sixfold (e.g., 1.9 TB for Tahoe 100M) and often sacrificing support for rich metadata.
scDataset addresses these challenges by enabling direct, shuffled, high-throughput minibatch access to data on disk, removing the need for format conversion or for loading the full dataset into memory. This democratizes large-scale training: modern deep models, including foundation models, can be trained over these datasets on standard compute.
2. Architecture and Design Principles
scDataset is constructed as a PyTorch IterableDataset, leveraging this paradigm to enable sequential, block-wise, and buffered data access patterns without requiring random-access for each sample. The architecture centers on a modular API in which users define four callback functions for I/O and (pre)processing transforms, making the framework extensible to a broad array of backends (e.g., AnnData, HDF5, HuggingFace, BioNeMo).
Because scDataset builds on PyTorch's DataLoader machinery, multiprocessing and distributed training are supported natively; multiple workers can stream data in parallel for maximum throughput, whereas AnnLoader lacks multiprocessing support.
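The exact callback names and constructor signature belong to the library and are best taken from its repository; the sketch below only illustrates the general pattern, with hypothetical fetch_callback and batch_callback arguments wrapped in a PyTorch IterableDataset over a backed AnnData file.

```python
import anndata as ad
from torch.utils.data import IterableDataset, get_worker_info

class CallbackStream(IterableDataset):
    """Illustrative callback-driven stream over a backed AnnData file
    (hypothetical interface; see the scDataset repository for the real API)."""

    def __init__(self, h5ad_path, index_blocks, fetch_callback, batch_callback):
        self.h5ad_path = h5ad_path              # opened lazily, once per worker
        self.index_blocks = index_blocks        # e.g. block-shuffled index arrays (Section 3)
        self.fetch_callback = fetch_callback    # reads a slice of rows from the backend
        self.batch_callback = batch_callback    # turns fetched rows into model-ready tensors

    def __iter__(self):
        adata = ad.read_h5ad(self.h5ad_path, backed="r")  # on-disk access, metadata retained
        blocks = self.index_blocks
        info = get_worker_info()
        if info is not None:                    # shard blocks across DataLoader workers
            blocks = blocks[info.id::info.num_workers]
        for idx in blocks:
            yield self.batch_callback(self.fetch_callback(adata, idx))
```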
3. Block Sampling and Batched Fetching Algorithms
Efficient random access to large, on-disk datasets is realized using two central techniques: block sampling and batched fetching.
3.1 Block Sampling
Random sampling at the cell level is inefficient for on-disk data due to the latency of many small, non-contiguous reads. Instead, scDataset adopts block sampling: the global index array over all cells is partitioned into contiguous blocks of size $b$, the block order is randomly shuffled, and the $b$ contiguous cells within each block are loaded with a single disk read. This greatly reduces the frequency of random seeks on disk.
Formally, given the cell index array $I = (1, 2, \dots, N)$, the blocks are $B_j = \big((j-1)b + 1, \dots, \min(jb, N)\big)$ for $j = 1, \dots, \lceil N/b \rceil$, and cells are visited in the order $B_{\pi(1)}, B_{\pi(2)}, \dots$ for a uniformly random permutation $\pi$ of the block indices.
The block size $b$ controls the balance between throughput and batch diversity: a larger $b$ improves sequential I/O, while a smaller $b$ increases shuffle diversity at the cost of I/O efficiency.
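As a concrete illustration of the index arithmetic (a minimal sketch, not scDataset's internal code; the function name and signature are assumptions), the block-shuffled read order can be built as follows:

```python
import numpy as np

def block_shuffled_order(n_cells, block_size, seed=0):
    """Partition [0, n_cells) into contiguous blocks and shuffle the block order.
    Cells inside a block stay contiguous, so each block is one sequential read."""
    rng = np.random.default_rng(seed)
    starts = np.arange(0, n_cells, block_size)
    blocks = [np.arange(s, min(s + block_size, n_cells)) for s in starts]
    order = rng.permutation(len(blocks))          # shuffle blocks, not individual cells
    return [blocks[i] for i in order]

# Example: 10 cells, block size 4 -> blocks [0..3], [4..7], [8..9] in random order
print(block_shuffled_order(10, 4))
```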
3.2 Batched Fetching
To amortize I/O cost and remediate the batch locality induced by block-wise shuffling, scDataset employs batched fetching: multiple blocks, determined by a fetch factor $f$, are loaded into memory at once; the resulting buffer of $f \cdot b$ cells is shuffled again at the cell level, and shuffled minibatches are emitted from it.
Algorithmically, each iteration (i) takes the next $f$ blocks in the shuffled block order, (ii) reads them from disk as a few large, mostly sequential reads, (iii) applies a uniformly random permutation to the $f \cdot b$ buffered cells, and (iv) yields consecutive minibatches from the permuted buffer until it is exhausted.
This two-level hierarchy allows tunable tradeoffs between speed and randomness, with empirical results indicating that near-full entropy (i.e., batch diversity) is maintained for moderate block sizes and fetch factors.
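Building on that block order, a minimal sketch of the batched-fetching loop (again illustrative rather than the library's internals; read_rows is a hypothetical callable that returns the requested rows, e.g. a slice of a backed AnnData .X) is:

```python
import numpy as np

def iter_minibatches(blocks, fetch_factor, batch_size, read_rows, seed=0):
    """Yield shuffled minibatches from a block-shuffled index order.

    blocks: list of contiguous index arrays (see block_shuffled_order above).
    read_rows: callable mapping a sorted index array to a 2-D array of rows."""
    rng = np.random.default_rng(seed)
    for i in range(0, len(blocks), fetch_factor):
        idx = np.concatenate(blocks[i:i + fetch_factor])  # fetch several blocks at once
        rows = read_rows(np.sort(idx))                    # few large, mostly sequential reads
        perm = rng.permutation(rows.shape[0])             # re-shuffle within the in-memory buffer
        for j in range(0, len(perm), batch_size):
            yield rows[perm[j:j + batch_size]]            # emit randomized minibatches
```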
4. Empirical Performance and Benchmarks
Comprehensive benchmarks against state-of-the-art loaders underscore the performance and efficiency of scDataset:
| Loader/Framework | Format Conversion | Metadata Support | Throughput (Tahoe 100M, single core) | Multiprocessing | Storage (Tahoe 100M) |
|---|---|---|---|---|---|
| AnnLoader | No | Yes | 1× baseline (~20 samples/s) | No | 314 GB (AnnData) |
| HuggingFace Datasets | Yes | Partial (lossy) | 21–27× slower than scDataset | Yes | ~1.9 TB |
| BioNeMo | Yes | No | 18× slower than scDataset | Yes | 1.1–2.2 TB |
| scDataset | No | Full | 48–129× faster than AnnLoader | Yes (native) | 314 GB (AnnData) |
On the Tahoe 100M dataset, scDataset achieves up to a 48× speedup over AnnLoader on a single core (block size 64, fetch factor 64; one epoch completes in under 11 hours) and reaches 2,593 samples/s with native multi-worker loading. No metadata is lost, since AnnData remains the backend.
Batch diversity, measured by entropy over experimental batch labels (e.g., plate), is maintained at levels comparable to true random shuffling.
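The metric itself is straightforward to compute; a minimal sketch (assuming plate labels for the cells of one minibatch are available as a pandas Series, which is an assumption about how the metric is evaluated) is:

```python
import numpy as np
import pandas as pd

def batch_label_entropy(labels: pd.Series) -> float:
    """Shannon entropy (in nats) of the label distribution within one minibatch;
    higher values mean the minibatch mixes experimental batches more evenly."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

# e.g. batch_label_entropy(adata.obs["plate"].iloc[minibatch_indices])
```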
5. Integration, Extensibility, and Usage
scDataset's integration with PyTorch and AnnData ensures compatibility with existing bioinformatics and deep learning pipelines. Its modular callback interface enables application to other omics modalities or custom data stores. Users can employ the library without altering AnnData schema or converting inputs, allowing seamless transition from exploratory analysis to large-scale model training.
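As a usage illustration (a sketch reusing the hypothetical CallbackStream and block_shuffled_order helpers from earlier sections, with a made-up file name and sizes, not scDataset's actual constructor), such a stream plugs into a standard PyTorch DataLoader, and multi-worker loading is switched on with num_workers:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def fetch_dense(adata, idx):
    """Read one mostly-sequential slice from the backed AnnData and densify it."""
    x = adata[np.sort(idx)].X
    return x.toarray() if hasattr(x, "toarray") else np.asarray(x)

def to_tensor(rows):
    return torch.as_tensor(rows, dtype=torch.float32)

blocks = block_shuffled_order(n_cells=100_000, block_size=64)
stream = CallbackStream("example.h5ad", blocks, fetch_dense, to_tensor)

# batch_size=None because the dataset already yields ready-made minibatches (one per block here).
loader = DataLoader(stream, batch_size=None, num_workers=4)
for batch in loader:
    ...  # forward/backward pass on `batch`
```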
The open-source implementation is accessible at https://github.com/Kidara/scDataset. It supports reproducible research by maintaining consistency in data loading, batch formation, and metadata compatibility.
6. Impact and Opportunities for Single-Cell and Biomedical AI
By providing high-throughput, memory-efficient, and randomized minibatch access directly from AnnData, scDataset enables efficient training of deep neural networks—including foundation models—over the new generation of archive-scale single-cell datasets. This allows computational biology laboratories to engage in large-scale modeling without specialized hardware, format conversions, or loss of annotation/metadata, and lowers both computational and storage barriers for experimentation.
A plausible implication is that scDataset will accelerate the development and deployment of “virtual cell” models and in silico experimentation infrastructures, supporting scalable integration of omics, imaging, and clinical metadata with deep models. This enhancement of accessibility, reproducibility, and scale is expected to benefit the broader computational biology and AI research communities.