Scalable and Performant Data Loading (2504.20067v1)

Published 23 Apr 2025 in cs.DC

Abstract: We present SPDL (Scalable and Performant Data Loading), an open-source, framework-agnostic library designed for efficiently loading array data to GPU. Data loading is often a bottleneck in AI applications, and is challenging to optimize because it requires coordination of network calls, CPU-bound tasks, and GPU device transfer. On top of that, Python's GIL (Global Interpreter Lock) makes it difficult to gain performance improvement from multi-threading. We found that when data preprocessing functions release the GIL entirely, it is possible to execute them concurrently in a thread pool, thereby improving the workflow performance. Our benchmark shows that compared to the PyTorch DataLoader, SPDL can iterate through the ImageNet dataset 74% faster while using 38% less CPU and 50GB less memory. When training ViT-B/16 model, SPDL can send data to the GPU at a speed that does not starve the training. Additionally, when using SPDL on Python 3.13t, without changing any code, the throughput is further improved by 33%, thanks to the disabled GIL. SPDL can improve the performance of current AI model training, and receives further performance improvements when Free-Threaded Python is adopted in production systems. SPDL is available at https://github.com/facebookresearch/spdl.

Summary

  • The paper presents SPDL as a novel, framework-agnostic library that overcomes GIL limitations by confining GIL contention to a few threads and dispatching GIL-releasing operations to a dedicated thread pool.
  • It employs an asynchronous event loop and modular pipeline stages to seamlessly coordinate data acquisition, preprocessing, and GPU transfer.
  • Benchmarks on AWS and Free-Threaded Python demonstrate SPDL's superior throughput, reduced resource usage, and robust performance compared to existing data loaders.

Data loading is a critical bottleneck in training modern machine learning models, especially as GPUs become faster and require increasingly high data throughput. This process typically involves multiple stages: data acquisition (often network-bound), pre-processing (CPU- and memory-bound), and GPU transfer (PCIe bandwidth-bound). Efficiently coordinating these stages and their diverse bottlenecks is challenging. A significant obstacle in Python-based data loading is the Global Interpreter Lock (GIL), which prevents true multi-threading for CPU-bound tasks, often leading practitioners to use multi-processing as a workaround.

While multi-processing helps bypass the GIL, it introduces its own overheads, including slow worker launch times, high static memory consumption (due to duplicated data like dataset indices), substantial inter-process communication (IPC) overhead (serialization/deserialization), sequential deserialization in the main process when receiving data, and difficulty synchronizing state across processes.

Existing data loading libraries address these issues with varying approaches and trade-offs:

  • PyTorch DataLoader: Provides a simple Dataset/DataLoader API abstraction but hides internal logic, making optimization difficult. It relies heavily on multi-processing, incurring the associated overheads.
  • NVIDIA DALI: Employs thread-based parallelism but requires users to learn a Domain-Specific Language (DSL) to define pipelines, increasing the learning curve and maintenance complexity.
  • FFCV: Achieves high performance for computer vision by requiring data to be converted into a proprietary format. This adds an extra step to the workflow, limits flexibility for custom datasets or combining multiple datasets, and makes modifying the data format challenging.
  • Decord: Designed for video loading but can have unbounded resource usage by opening all videos at initialization and keeping decoders alive. Its initialization time scales with dataset size, and it is not robust to malformed files.

The paper presents SPDL (Scalable and Performant Data Loading) (2504.20067), an open-source, framework-agnostic library designed to provide high throughput, efficiency, and flexibility for array data loading to GPUs. SPDL's core insight is to leverage multi-threading effectively, even within the constraints of the GIL, by carefully structuring parallelization. It achieves this by restricting GIL contention to a minimal number of threads (the main thread and a dedicated scheduler thread) and dispatching only performance-critical, GIL-releasing operations (typically implemented in C/C++) to a larger thread pool. This contrasts with traditional approaches that parallelize the entire pipeline, leading to GIL contention across many threads.
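
The effect is easy to demonstrate outside SPDL. The following standalone sketch (not SPDL code) shows why the approach works: a C-implemented function that releases the GIL, here zlib.compress from the standard library, scales across a thread pool, whereas equivalent pure-Python work would serialize on the GIL.

import time
import zlib
from concurrent.futures import ThreadPoolExecutor

DATA = bytes(range(256)) * 40_000  # ~10 MB payload

def gil_releasing_task(_: int) -> int:
    # zlib.compress drops the GIL while compressing large buffers,
    # so multiple threads can run it truly in parallel.
    return len(zlib.compress(DATA))

def timed(num_threads: int) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(gil_releasing_task, range(8)))
    return time.perf_counter() - start

print(f"1 thread:  {timed(1):.2f}s")
print(f"8 threads: {timed(8):.2f}s")  # markedly faster on a multi-core machine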

SPDL is built around an asynchronous event loop running in a background scheduler thread. This event loop manages tasks across pipeline stages, seamlessly handling both asynchronous (e.g., network calls) and synchronous functions. Synchronous functions are delegated to a thread pool. Pipeline stages are connected via queues, which naturally handle backpressure; if a downstream stage (like model training) slows down, queues fill up, blocking upstream tasks and preventing excessive resource consumption.
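
In outline, this architecture can be approximated with standard-library primitives. The sketch below is illustrative, not SPDL internals: it runs an asyncio event loop in a background scheduler thread, delegates a synchronous stage to a thread pool, and uses a bounded queue as the sink so that a slow consumer naturally throttles upstream stages.

import asyncio
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def decode(item: int) -> int:
    # Stand-in for a synchronous, ideally GIL-releasing, preprocessing function.
    return item * 2

sink = queue.Queue(maxsize=3)             # bounded sink: fills up -> backpressure
pool = ThreadPoolExecutor(max_workers=4)

async def pipeline() -> None:
    loop = asyncio.get_running_loop()
    for item in range(10):                                       # source stage
        result = await loop.run_in_executor(pool, decode, item)  # sync stage in pool
        # A blocking put would stall the event loop, so it is also delegated;
        # when the consumer is slow, the queue fills and upstream tasks wait.
        await loop.run_in_executor(None, sink.put, result)
    await loop.run_in_executor(None, sink.put, None)             # end-of-stream sentinel

# Run the event loop in a background "scheduler" thread, as SPDL does.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()
future = asyncio.run_coroutine_threadsafe(pipeline(), loop)

while (item := sink.get()) is not None:   # the main thread iterates over results
    print(item)
future.result()                           # surface any pipeline errors
loop.call_soon_threadsafe(loop.stop)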

The design principles guiding SPDL include:

  • High throughput: Maximizing the speed at which data is delivered to the GPU.
  • Visibility & Tunability: Allowing users to understand which stage is a bottleneck and configure concurrency for individual stages.
  • No domain-specific language: Using standard Python functions for pipeline definition to lower the learning curve.
  • Seamless asynchronous support: Integrating asynchronous operations efficiently as they are not constrained by the GIL.
  • Flexibility: Enabling the creation of diverse data loading pipelines.
  • Robustness: Handling sample processing failures gracefully.
  • Framework-agnostic: Decoupling data loading from specific deep learning frameworks.

SPDL provides a PipelineBuilder interface for constructing data loading workflows from user-defined functions. A pipeline starts with a source (e.g., an iterable of file paths), followed by chained pipe and aggregate operations, and ends with a sink. The pipe method allows specifying a function (synchronous or asynchronous) and its desired concurrency. The aggregate method is used for batching operations. The build method finalizes the pipeline, allowing configuration of the thread pool size. The resulting Pipeline object is iterable, and the auto_stop context manager ensures proper cleanup of background threads.

import asyncio
from typing import Iterable

import spdl.io
import torch
from spdl.dataloader import PipelineBuilder

def source() -> Iterable[str]:
    # Generates URLs or file paths.
    yield "path/to/image1.jpg"
    yield "path/to/image2.jpg"
    # ...

async def download(url: str) -> bytes:
    # Example async function (e.g., using aiohttp); a real implementation
    # would fetch the bytes over the network.
    print(f"Downloading {url}...")
    await asyncio.sleep(0.01)   # simulate network latency without blocking the event loop
    return b"dummy_image_data"  # dummy payload standing in for real image bytes

def decode_and_resize(data: bytes) -> spdl.io.ImageFrames:
    # Example using SPDL's GIL-releasing IO functions.
    # Note: well-formed image bytes are assumed here; the dummy payload
    # above would raise a decoding error.
    print("Decoding and resizing...")
    packets = spdl.io.demux_image(data)
    filter_desc = spdl.io.get_video_filter_desc(
        scale_width=224, scale_height=224, pix_fmt="rgb24"
    )
    frames = spdl.io.decode_packets(packets, filter_desc=filter_desc)
    return frames

def batch_and_transfer(frames: list[spdl.io.ImageFrames]) -> torch.Tensor:
    # Example batching and transferring to GPU using pre-allocated memory and a
    # dedicated stream. In a real setup the storage and stream would be managed
    # externally and passed in, e.g.:
    #   buffer = spdl.io.convert_frames(frames, storage=PAGE_LOCKED_STORAGE)
    #   cuda_buffer = spdl.io.transfer_buffer(
    #       buffer,
    #       device_config=spdl.io.cuda_config(
    #           device_index=0, stream=CUDA_STREAM.cuda_stream),
    #   )
    #   return spdl.io.to_torch(cuda_buffer)
    print(f"Batching {len(frames)} frames and transferring...")
    # Dummy return for illustration (requires a CUDA device):
    return torch.randn(len(frames), 3, 224, 224, device="cuda", dtype=torch.float32)

pipeline = (
    PipelineBuilder()
    .add_source(source())                    # Stage 1: source generation
    .pipe(download, concurrency=12)          # Stage 2: async download, max 12 concurrent
    .pipe(decode_and_resize, concurrency=4)  # Stage 3: decode and resize, max 4 concurrent
    .aggregate(32)                           # Stage 4: batching, waits for 32 items
    .pipe(batch_and_transfer)                # Stage 5: GPU transfer (default concurrency of 1)
    .add_sink(buffer_size=3)                 # Sink: buffer for completed batches
    .build(num_threads=16)                   # Build with 16 threads in the pool
)

with pipeline.auto_stop():
    for i, batch in enumerate(pipeline):
        print(f"Received batch {i}: {batch.shape}")
        # Use the batch for model training/inference.
        if i >= 2:  # stop after a few batches for this example
            break

print("Pipeline stopped.")

SPDL provides high-performance I/O functions, implemented in C++ using libraries like FFmpeg [2006]. These functions are designed to release the GIL during their execution. They minimize memory copies by working with internal data structures and converting them directly to pre-allocated, page-locked memory suitable for efficient GPU transfer. Functions like spdl.io.demux_image, spdl.io.decode_packets, spdl.io.convert_frames, and spdl.io.transfer_buffer facilitate this process, enabling direct copies to CUDA device memory via a specified stream, avoiding the default stream used for model computation.
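
Putting those functions together, a single sample's path from raw bytes to a GPU tensor looks roughly like the sketch below. It uses only the spdl.io functions named above, but the signatures are paraphrased from the paper's example and may differ across SPDL versions; the pre-allocated page-locked storage argument is omitted for brevity.

import spdl.io
import torch

SIDE_STREAM = torch.cuda.Stream(device=0)  # dedicated transfer stream (requires CUDA)

def load_image_to_gpu(data: bytes) -> torch.Tensor:
    packets = spdl.io.demux_image(data)       # parse the container (GIL released)
    frames = spdl.io.decode_packets(packets)  # decode to raw frames (GIL released)
    buffer = spdl.io.convert_frames(frames)   # contiguous array; pre-allocated
                                              # page-locked storage can be passed here
    cuda_buffer = spdl.io.transfer_buffer(    # copy on the side stream, not the
        buffer,                               # default stream used by the model
        device_config=spdl.io.cuda_config(
            device_index=0, stream=SIDE_STREAM.cuda_stream),
    )
    return spdl.io.to_torch(cuda_buffer)      # expose the buffer as a torch.Tensor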

Benchmarks on an AWS p4d.24xlarge instance demonstrated SPDL's performance benefits. In data loading-only tests on ImageNet, SPDL iterated through the dataset 74% faster than PyTorch DataLoader while using 38% less CPU and 50GB less memory, largely due to avoiding multi-processing overhead. In end-to-end inference and training benchmarks using ViT-B/16, SPDL consistently outperformed PyTorch DataLoader and DALI, achieving throughput close to the theoretical maximum of the model without data loading delays. For example, in the training benchmark, SPDL's peak performance was near that of a dummy data loader that returns pre-generated tensors.

A key finding highlighted by the paper is SPDL's compatibility with Free-Threaded Python (like the experimental Python 3.13t). Running SPDL on 3.13t resulted in a 33% performance improvement compared to 3.12, without any code changes. This indicates that SPDL's design, which minimizes GIL contention in Python code while relying on GIL-releasing native implementations, is well-suited for the future of Python where the GIL may be optional. The paper also shows that SPDL on current Python 3.12 already achieves 67% of the potential speedup seen with FT-Python compared to PyTorch DataLoader on 3.12.
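
For readers experimenting with this, CPython 3.13 exposes a runtime check for whether the GIL is actually disabled; a minimal probe (not SPDL-specific) might look like:

import sys

# sys._is_gil_enabled() was added in CPython 3.13; guard for older versions.
if hasattr(sys, "_is_gil_enabled") and not sys._is_gil_enabled():
    print("Running free-threaded: thread pools scale without GIL contention.")
else:
    print("GIL is active: only GIL-releasing native calls run in parallel.")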

The appendix includes a benchmark comparing SPDL's video decoding performance against Decord on the Kinetics 400 dataset [2017]. SPDL achieved similar throughput to Decord using fewer threads and resources, and was more robust to file errors, unlike Decord which fails initialization on malformed files and has significant initialization overhead for large datasets.

In summary, SPDL is presented as a robust, efficient, and scalable data loading library that leverages a sophisticated multi-threading architecture to overcome the performance limitations of multi-processing and Python's GIL for array data. Its flexibility, tunability, and resource efficiency offer significant improvements for ML training workflows, and its design positions it to benefit immediately from advancements like Free-Threaded Python. SPDL is available as an open-source project.
