- The paper introduces GetBatch, a storage-native primitive that retrieves entire batches in one operation to reduce network round-trip overhead.
- It demonstrates significant improvements, including a 15x throughput boost for 10 KiB objects and substantial reductions in P95 and P99 latencies.
- The design simplifies client-side data loading via AIStore Python SDK, ensuring deterministic ordering and reproducible random sampling in distributed ML training.
Authoritative Summary of "GetBatch: Distributed Multi-Object Retrieval for ML Data Loading"
Problem Statement and Motivation
Modern ML pipelines increasingly rely on distributed storage for large-scale datasets, creating a fundamental tension between random-access sampling (critical for robust training) and sequential access (necessary for efficient I/O). Conventional object stores expose only per-object GET operations, which incur significant per-request overhead: network round trips, control-plane dispatching, and connection management. These costs dominate latency and depress throughput, particularly for small objects and high-concurrency training environments. Existing mitigation strategies, such as sharded archives (TARs), compromise sampling randomness and require complex client-side workarounds, including buffer management and order shuffling.
GetBatch: Core Design and Execution Model
GetBatch introduces a storage-native, batch-level retrieval primitive integrated into NVIDIA AIStore. The critical innovation is treating the entire batch as a single atomic storage operation: the client samples an arbitrary batch, then requests it in full with a single streaming request. Internally, GetBatch orchestrates distributed parallel retrieval: a Designated Target (DT) node assembles the response and maintains strict ordering, while the other storage nodes act as senders for the objects they own locally. This distributed model coordinates retrieval across shards, disks, and cluster nodes, with immediate serialization into a single, deterministic TAR archive.
Notably, GetBatch enforces deterministic output ordering regardless of physical data placement, guaranteeing reproducible sample-label alignment during training and simplifying downstream data-loader logic. Client-side integration is provided via the AIStore Python SDK, allowing seamless adoption into frameworks such as PyTorch and Lhotse without modifying sampling logic.
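A minimal sketch of this client-side pattern follows. It assumes the SDK exposes a `get_batch`-style call returning one TAR; the `get_batch` name and the `list_all_objects` usage are assumptions here, not confirmed AIStore SDK API:

```python
# Hypothetical sketch of the GetBatch client pattern; get_batch() is an
# assumed method name, not confirmed AIStore Python SDK API.
import io
import random
import tarfile

from aistore.sdk import Client  # AIStore Python SDK

client = Client("http://aistore-proxy:51080")
bucket = client.bucket("speech-train")

# Client-side random sampling stays fully flexible: pick any 256 objects.
rng = random.Random(1234)  # fixed seed => reproducible sampling
names = [obj.name for obj in bucket.list_all_objects()]
batch = rng.sample(names, k=256)

# One streaming request for the whole batch; the response is a single TAR
# whose member order matches `batch` regardless of physical placement.
tar_bytes = bucket.get_batch(batch)  # assumed API

with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
    for name, member in zip(batch, tar.getmembers()):
        payload = tar.extractfile(member).read()
        # member i corresponds to batch[i]: deterministic sample-label alignment
```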
Execution Options and Fault Handling
GetBatch exposes configurable execution options (illustrated in the sketch after this list):
- Streaming Mode (`strm`): emits serialized output as soon as the earliest entries become available, reducing time-to-first-byte and improving accelerator utilization.
- Continue-on-error (`coer`): surfaces soft errors as explicit placeholders in the output, preserving batch alignment and preventing aborts from missing samples or transient failures.
- Colocation Hints (`coloc`): optimize DT selection for physical locality, reducing inter-node transfers for clustered datasets.
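A hedged illustration of how these options might appear in a request body; the field names mirror the abbreviations above, but the wire format is an assumption, not the documented API:

```python
# Assumed request shape; field names mirror the option abbreviations above
# and are NOT confirmed wire format.
batch = ["clip-000017.wav", "clip-004242.wav", "clip-009001.wav"]

batch_request = {
    "objects": [{"bucket": "speech-train", "name": n} for n in batch],
    "strm": True,   # stream TAR entries as soon as they are ready
    "coer": True,   # continue on soft errors, emitting explicit placeholders
    "coloc": True,  # hint: pick a Designated Target close to the data
}

# With coer=True a failed object still occupies its slot (as a placeholder
# entry), so response index i keeps corresponding to batch[i].
```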
Admission control is strictly enforced on the DT to prevent resource exhaustion, rejecting new work under memory pressure and throttling under CPU/disk pressure. Observability is addressed via per-node Prometheus metrics to distinguish bottlenecks in network coordination versus local resource contention.
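The admission logic can be pictured as follows; this is a schematic sketch with assumed thresholds, not the AIStore implementation:

```python
# Schematic admission-control policy on the Designated Target; the 0.9/0.8
# thresholds are illustrative assumptions.
def admit_batch(mem_util: float, cpu_util: float, disk_util: float) -> str:
    if mem_util > 0.9:
        return "reject"    # memory pressure: refuse new work outright
    if cpu_util > 0.8 or disk_util > 0.8:
        return "throttle"  # CPU/disk pressure: accept but slow intake
    return "accept"
```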
Empirical Results and Numerical Analysis
Synthetic Benchmark
On a 16-node AIStore cluster, GetBatch yields up to a 15x throughput improvement for 10 KiB objects compared to per-object GET operations. The speedup systematically decreases with object size: 6.2x for 100 KiB and 1.7x for 1 MiB, corroborating that per-request overhead becomes less dominant as transfer time grows.
Throughput Results:

| Object Size | GET Baseline | GetBatch (Max) | Speedup |
|-------------|--------------|----------------|---------|
| 10 KiB      | 0.5 GiB/s    | 7.3 GiB/s      | 15x     |
| 100 KiB     | 4.2 GiB/s    | 26.1 GiB/s     | 6.2x    |
| 1 MiB       | 22.3 GiB/s   | 37.0 GiB/s     | 1.7x    |
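The trend is consistent with a simple fixed-overhead model: if every per-object GET pays a roughly constant per-request cost on top of its transfer time, and GetBatch amortizes that cost across the batch, the achievable speedup shrinks as objects grow. A back-of-the-envelope sketch, where the overhead and bandwidth constants are assumptions, so it reproduces the trend rather than the exact figures:

```python
# Fixed-overhead model: per-object GET time ~ T_OH + size/BW, while GetBatch
# amortizes T_OH across the batch. Both constants are illustrative guesses.
T_OH = 3e-3   # assumed per-request overhead (round trip + dispatch), seconds
BW = 50e6     # assumed per-stream transfer bandwidth, bytes/second

for kib in (10, 100, 1024):
    transfer = kib * 1024 / BW
    speedup = (T_OH + transfer) / transfer  # -> 1.0 as transfer dominates
    print(f"{kib:>5} KiB: predicted ~{speedup:.1f}x")
# Prints ~15.6x, ~2.5x, ~1.1x: right direction, and close at 10 KiB; the
# measured 6.2x/1.7x also reflect costs this one-constant model omits.
```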
Production Training Workload
In distributed training of the Canary-1B-Flash ASR model, GetBatch demonstrates substantial improvements in latency stability:
- P95 batch retrieval latency reduced by 2x (Random GET: 3,668.7 ms → GetBatch: 1,808.6 ms)
- P99 per-object tail latency improved by 3.7x (Random GET: 53.5 ms → GetBatch: 14.5 ms)
- Step-time jitter reduced by 40% (due to narrower P99-to-P50 latency spread)
Aggregate bandwidth is comparable at scale, but tail latency and request amplification significantly impact training efficiency (GPU idle cycles). GetBatch's streaming retrieval model replaces hundreds of independent requests with a single coordinated response per batch, mitigating straggler effects and control-plane overhead.
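To make the jitter metric concrete, step-time jitter above is the spread between P99 and P50 retrieval latency. A small sketch with synthetic numbers; the distributions below are invented to mimic a heavy versus narrow tail, not taken from the paper's data:

```python
# Synthetic illustration of tail-latency spread; the lognormal parameters are
# made up to mimic a heavy-tailed vs. narrower latency distribution.
import random

def percentile(xs, q):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q * len(xs)))]

random.seed(0)
per_object = [random.lognormvariate(2.0, 0.9) for _ in range(10_000)]  # ms
batched = [random.lognormvariate(2.0, 0.4) for _ in range(10_000)]     # ms

for label, xs in (("per-object GET", per_object), ("GetBatch", batched)):
    p50, p99 = percentile(xs, 0.50), percentile(xs, 0.99)
    print(f"{label:>14}: p50={p50:6.1f}  p99={p99:6.1f}  jitter={p99 - p50:6.1f} ms")
```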
Practical Implications and Integration
GetBatch reconciles random-sampling flexibility with optimal I/O efficiency, substantially simplifying client-side batch assembly and error handling. The integration pattern enables PyTorch and Lhotse data loaders to issue a single batch request per training step, decoupling sample selection from data retrieval. This design eliminates the need for client-managed concurrent connections and complex reassembly logic, reducing code complexity and the likelihood of errors in large-scale deployments.
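A hedged sketch of that pattern in PyTorch: the `__getitems__` batched-fetch hook exists in recent PyTorch versions, while `fetch_batch` and `decode` are hypothetical helpers standing in for the SDK call and sample parsing:

```python
# Sketch only: fetch_batch/decode are hypothetical; __getitems__ lets recent
# PyTorch DataLoaders hand the Dataset a whole batch of indices at once.
from torch.utils.data import DataLoader, Dataset


class GetBatchDataset(Dataset):
    def __init__(self, object_names, fetch_batch, decode):
        self.object_names = object_names
        self.fetch_batch = fetch_batch  # e.g. wraps bucket.get_batch(...)
        self.decode = decode            # raw bytes -> training sample

    def __len__(self):
        return len(self.object_names)

    def __getitems__(self, indices):
        names = [self.object_names[i] for i in indices]
        blobs = self.fetch_batch(names)  # ONE coordinated storage request
        return [self.decode(b) for b in blobs]


# Sampling logic is unchanged: shuffle=True still draws arbitrary batches,
# but each batch now arrives as a single deterministic response.
# loader = DataLoader(dataset, batch_size=256, shuffle=True)
```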
The architectural approach enables training frameworks to maintain perfect randomness without sacrificing I/O performance, making GetBatch particularly well-suited for datasets comprising small samples (images, audio segments, text) common in speech and vision pipelines.
Existing solutions (HTTP/2 multiplexing, gRPC streaming, client-side caching) address only portions of the overhead and complexity. None provide server-coordinated, deterministic ordering or eliminate request amplification at the storage layer. Sequential-access formats (WebDataset, TFRecord, FFCV, Petastorm) optimize for I/O but introduce friction for randomized sampling and complicate reproducibility.
Significantly, GetBatch operates orthogonally to GPU-accelerated preprocessing frameworks (DALI, SPDL), targeting storage retrieval rather than transformation. Synergistic integration could yield end-to-end pipeline optimization from storage to device.
Scalability, Limitations, and Future Directions
GetBatch's DT model distributes serialization and ordering load across cluster nodes. While disk saturation is the first limiting factor under sustained pressure, graceful degradation mechanisms (throttling, admission control) prevent systemic collapse. Nevertheless, extreme concurrency or batch sizes can create serialization bottlenecks; further scaling experiments are warranted for larger clusters.
A critical limitation is the reliance on AIStore as the storage backend: batched-retrieval semantics are absent from the standard S3 API and other widespread object stores. Adoption of such primitives in broader storage systems would dramatically improve scalable ML pipeline efficiency. Proposed extensions, such as server-side shuffling, could further simplify downstream training code.
Conclusion
GetBatch establishes batch-level retrieval as a fundamental storage primitive, bridging the gap between random access sampling and high-throughput sequential I/O for distributed ML training. The empirical evidence demonstrates robust throughput scalability (up to 15x for small objects), substantial latency stability improvements, and greatly simplified client logic. Its approach offers beneficial implications for reproducible, efficient training across diverse ML workloads. Broader adoption and integration with other storage and preprocessing frameworks represent promising avenues for further optimization in large-scale ML ecosystem architectures.