GetBatch: Optimizing Multi-Object Retrieval
- GetBatch is an object store API that streamlines ML training by batching GET requests for small objects, reducing per-request overhead.
- It employs deterministic ordering, pipelined streaming, and robust error recovery to maintain performance in distributed storage environments.
- By amortizing per-request overhead into a one-time batch cost, GetBatch achieves up to 15× speedup for small object retrieval, crucial for high-throughput ML pipelines.
GetBatch is an object store API designed to optimize distributed multi-object retrieval in ML training pipelines. It promotes batch retrieval to a first-class storage operation, replacing individual GET requests with a single deterministic, fault-tolerant streaming execution. This approach directly addresses the dominant per-request overhead associated with traditional random access to small objects residing in distributed storage clusters, yielding substantial improvements in throughput and tail latency (Aizman et al., 25 Feb 2026).
1. Motivation and Formal Problem Definition
Machine learning training frequently utilizes batches comprising hundreds to thousands of samples. When datasets are stored in an object store and accessed using "map-style" random reads, each batch typically translates into numerous independent HTTP GET requests, one per sample. Each GET request incurs significant control-plane and transport-layer overhead due to TCP handshakes, HTTP parsing, request scheduling, and related setup costs. For small objects (10 KiB–100 KiB), this overhead often dominates total retrieval time.
The cost model can be formalized as follows. Let $B$ denote the batch size, $s$ the average sample size in bytes, $W$ the available network bandwidth, and $c$ the per-request overhead. The cost of retrieving a batch via individual GETs is:

$$T_{\mathrm{GET}} = B \left( c + \frac{s}{W} \right)$$

For GetBatch:

$$T_{\mathrm{GetBatch}} = C + \frac{B s}{W}$$

where $C$ is a one-time cost for the entire batch. For small objects ($s/W \ll c$), the speedup $T_{\mathrm{GET}} / T_{\mathrm{GetBatch}} \approx B c / (C + B s / W)$ grows with increasing batch size.
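The cost model above can be checked numerically. A minimal sketch, using illustrative parameter values (not measured constants from the paper):

```python
def t_get(B, s, W, c):
    """Total time for B independent GETs: per-request overhead plus transfer."""
    return B * (c + s / W)

def t_getbatch(B, s, W, C):
    """Total time for one GetBatch call: one-time batch cost plus transfer."""
    return C + B * s / W

# Illustrative values (assumptions, not from the paper): 10 KiB objects,
# ~12.5 GiB/s link, 1 ms per-request overhead, 5 ms one-time batch cost.
s = 10 * 1024        # bytes
W = 12.5 * 2**30     # bytes/s
c, C = 1e-3, 5e-3    # seconds

for B in (256, 1024, 4096):
    speedup = t_get(B, s, W, c) / t_getbatch(B, s, W, C)
    print(f"B={B:5d}  speedup ~ {speedup:.1f}x")
```

Because $s/W \ll c$ here, the modeled speedup grows with $B$, matching the qualitative trend in the measurements of Section 5 (the absolute numbers depend on the real $c$ and $C$).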
2. API Semantics and Programming Model
GetBatch is invoked via an HTTP GET operation to the object store gateway, transmitting a JSON payload specifying the batch parameters. The JSON object includes:
- `"mime"`: specifies the output format (default `"tar"`).
- `"in"`: list of entries, each with `bucket`, `objname`, and optional `archpath`.
- `"strm"`: enables/disables streaming mode.
- `"coer"`: continue-on-error mode toggle.
- `"coloc"`: colocation hints toggle.
Sample JSON Payload:
```json
{
  "mime": "tar",
  "in": [
    {"bucket": "imagenet", "objname": "images/img_0001.jpg"},
    {"bucket": "shards", "objname": "train-0003.tar", "archpath": "labels/0003.txt"}
  ],
  "strm": true,
  "coer": false,
  "coloc": false
}
```
Client-Side Python SDK Example:
```python
from aistore.sdk import Client

client = Client("http://ais-gateway")
batch = client.batch(
    entries=[("imagenet", "img_0001.jpg"), ...],
    mime="tar", strm=True, coer=False, coloc=False)
for meta, content in batch.get():
    process_sample(content)
```
PyTorch DataLoader Integration:
```python
class MyDataset(IterableDataset):
    def __iter__(self):
        while True:
            paths = self.sampler.next_batch()
            batch = client.batch(paths, bucket)
            for _, blob in batch.get():
                yield self.process(blob)

dl = DataLoader(MyDataset(), batch_size=None)
```
GetBatch preserves response entry ordering to exactly match the order of the "in" list.
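Since the response arrives as a single TAR stream whose member order mirrors the `"in"` list, a client can consume it sequentially with the standard library. A minimal sketch (the entry names and in-memory stand-in response are illustrative, not part of the API):

```python
import io
import tarfile

def iter_batch(stream):
    """Yield (name, bytes) pairs from a TAR-formatted GetBatch response.
    Members arrive in the exact order of the request's "in" list."""
    with tarfile.open(fileobj=stream, mode="r|*") as tar:  # streaming read
        for member in tar:
            fobj = tar.extractfile(member)
            yield member.name, (fobj.read() if fobj else b"")

# Build a stand-in response whose member order mimics a request order.
requested = ["images/img_0001.jpg", "labels/0003.txt"]
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in requested:
        data = name.encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

received = [name for name, _ in iter_batch(buf)]
```

Because ordering is positional, the client never needs to match entries back to the request by name; index alignment is sufficient even when `coer` placeholders are present.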
3. System Architecture and Execution Pipeline
The GetBatch protocol involves the following components:
- Client: The training loop worker that initiates the request.
- Proxy: A stateless HTTP gateway mediating client communication.
- Designated Target (DT): The storage node that orchestrates ordered assembly and streaming.
- Senders: Storage nodes providing requested objects (or subsets).
- Peer-to-peer Connections: Persistent, direct links among storage targets.
Execution Phases:
- DT Registration: The Proxy selects a DT (via consistent hashing or colocation hints), forwards the batch specification, and receives an execution ID (execID).
- Sender Activation: Proxy instructs all relevant storage targets to fetch and send their respective entries for the execID.
- Redirection and Ordered Streaming: Proxy redirects the client to the DT endpoint, where data is merged in strict order and streamed (TAR or similar format). If `strm=true`, streaming commences as soon as the first in-order entry is available (pipelined).
ASCII Sequence Diagram:
```
Client
  │ GET /getbatch {JSON}
  ▼
Proxy ──┐
  │     │ choose DT
  │     ├─> DT: register execID
  │     └─> all Targets ("you are sender for execID")
  ▼              ▲
Client ◀──302────┘
  │
  └────────────▶ DT (streaming response)
```
Ordering Guarantee: Output strictly follows client-specified entry order, regardless of sender arrival times.
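The ordering guarantee can be realized by buffering out-of-order arrivals at the DT and emitting entries as soon as all predecessors have been sent. A minimal sketch of this merge logic (not the AIStore implementation):

```python
def ordered_stream(arrivals):
    """Merge out-of-order (index, data) arrivals into strictly ordered output,
    yielding each entry as soon as all its predecessors have been emitted
    (pipelined: no need to wait for the full batch)."""
    pending = {}
    next_idx = 0
    for idx, data in arrivals:
        pending[idx] = data
        # Drain every consecutive entry that is now ready.
        while next_idx in pending:
            yield next_idx, pending.pop(next_idx)
            next_idx += 1

# Senders deliver entries 2, 0, 1, 3 for a 4-entry batch:
out = list(ordered_stream([(2, b"c"), (0, b"a"), (1, b"b"), (3, b"d")]))
```

Memory use is bounded by how far ahead of `next_idx` senders run, which is one reason the DT needs the admission control described in Section 4.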
4. Fault Tolerance, Admission Control, and Observability
GetBatch distinguishes between hard errors, which abort the entire batch (e.g., JSON parsing failures, out-of-memory), and soft errors (missing objects, transient network failures, target timeouts).
Continue-on-Error (coer) Mode:
- Soft errors are handled by emitting a placeholder entry in the archive, maintaining positional alignment.
- Recovery attempts are made up to a configurable threshold (`R_max`), querying alternate targets via "get-from-neighbor" strategies.
Recovery Algorithm (informal):
```
For each missing/timed-out entry E at DT:
    for attempt in 1..R_max:
        DT asks another target for E's data
        if success: insert into ordered stream; break
    if still missing and coer=True: emit placeholder
    else: abort request
```
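The recovery loop above can be sketched as a small runnable function. The callable-per-target shape and the empty-bytes placeholder are illustrative assumptions, not the AIStore API:

```python
def recover_entry(entry, targets, r_max, coer, placeholder=b""):
    """Attempt up to r_max recoveries from alternate targets; on exhaustion,
    emit a placeholder (coer=True) or abort the whole batch (coer=False)."""
    for target in targets[:r_max]:
        try:
            return target(entry)      # "get-from-neighbor" attempt
        except IOError:
            continue                  # soft error: try the next target
    if coer:
        return placeholder            # keeps positional alignment in the stream
    raise RuntimeError(f"unrecoverable entry: {entry}")

# Stand-in targets: one that times out, one that serves the object.
def bad(entry):
    raise IOError("timeout")

def good(entry):
    return b"data:" + entry.encode()
```

Note that the placeholder path preserves the ordering guarantee: downstream consumers see an entry at every requested position, failed or not.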
Admission Control:
- Memory overload at DT blocks new requests (HTTP 429).
- CPU/disk pressure triggers calibrated throttling (sleeps per work item).
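The two admission-control behaviors can be sketched as a single decision function. Thresholds and the throttle scaling are illustrative assumptions, not AIStore's calibrated values:

```python
HTTP_OK, HTTP_429 = 200, 429

def admit(mem_used_frac, cpu_load, mem_limit=0.9, cpu_limit=0.8, max_sleep=0.05):
    """Admission decision for a new GetBatch request at the DT.
    Memory overload rejects outright (HTTP 429); CPU/disk pressure maps to a
    per-work-item sleep (throttling) instead of rejection."""
    if mem_used_frac >= mem_limit:
        return HTTP_429, 0.0          # reject: too much buffered batch data
    throttle = 0.0
    if cpu_load > cpu_limit:
        # Scale the sleep with how far past the threshold we are.
        over = (cpu_load - cpu_limit) / (1.0 - cpu_limit)
        throttle = min(max_sleep, max_sleep * over)
    return HTTP_OK, throttle
```

Rejecting on memory but throttling on CPU reflects the asymmetry noted above: buffered entries cannot be shed incrementally, whereas CPU pressure can be relieved by pacing work items.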
Observability Metrics (via Prometheus):
- `work_items_total`
- `bytes_sent_local`, `bytes_sent_remote`
- `rxwait`
- `throttle`
- `errors_hard`, `errors_soft`
- `recovery_attempts`, `recovery_failures`
5. Performance Analysis
Empirical evaluation was performed on a 16-node AIStore cluster (OCI, 16×NVMe SSD, 100 Gbps), with 80 concurrent workers across 8 client nodes, using varying object and batch sizes.
Sustained Throughput (GiB/s) and Speedup:
| Object Size | GET | GetBatch (256) | GetBatch (1024) | GetBatch (4096) |
|---|---|---|---|---|
| 10 KiB | 0.5 | 4.5 (9×) | 6.0 (12×) | 7.3 (15×) |
| 100 KiB | 4.2 | 20.7 (4.9×) | 24.1 (5.7×) | 26.1 (6.2×) |
| 1 MiB | 22.3 | 32.4 (1.5×) | 35.2 (1.6×) | 37.0 (1.7×) |
Production-Scale Training Pipeline ("Canary-1B-Flash"):
- Hardware: 128 × A100-80GB GPUs, 16 AIStore nodes
- Data loader: 1024 workers, dynamic bucketing, OOMptimizer
Batch and Per-Object Latency (ms):
| Method | Batch P50 | Batch P95 | Batch P99 | Batch Avg | Per-Object (P50/P95/P99/Avg) |
|---|---|---|---|---|---|
| Sequential I/O | 243.7 | 431.2 | 638.9 | 261.4 | 1.2 / 5.2 / 6.8 / 2.0 |
| Random GET | 934.7 | 3668.7 | 4814.3 | 1320.0 | 9.1 / 27.3 / 53.5 / 12.3 |
| GetBatch | 427.5 | 1808.6 | 2744.7 | 624.7 | 5.1 / 10.5 / 14.5 / 5.7 |
GetBatch reduced P95 batch retrieval latency by 2× (3668→1808 ms) and P99 per-object tail latency by 3.7× (53.5→14.5 ms) relative to random GET, while batch-time jitter (P99–P50) dropped by 40%.
6. Comparative Analysis and Limitations
Independent GETs require $O(B)$ control-plane interactions per batch, with corresponding round trips, scheduling load, and per-operation parsing. By contrast, GetBatch collapses all control-plane scheduling and HTTP parsing into a single operation, then streams bulk data to the client. For small objects, GetBatch achieves significant speedup through amortized overhead and streaming, with gains of up to 15× observed for 10 KiB objects.
Algorithmic Complexity:
| Approach | Complexity |
|---|---|
| Independent GET | $O(B)$ control-plane operations |
| GetBatch | $O(1)$ control-plane operations plus streamed data transfer |
Notable Limitations:
- Requires AIStore as the backend; not a general-purpose S3 extension.
- Single DT per request may introduce serialization bottlenecks at extreme batch sizes/concurrency; mitigated by admission control.
- No built-in server-side sampling or shuffling; client must handle sampling, though determinism of ordering aids reproducibility.
- Scalability beyond 16–32 nodes and tight integration with GPU-side data pipelines (e.g., NVIDIA DALI) are areas for further development.
7. Conclusion and Future Directions
GetBatch presents a deterministic, pipelined, and fault-tolerant mechanism for large-scale ML data loading, delivering high throughput and robust tail-latency improvements in distributed training environments. Its support for streaming, strict ordering, and error recovery allows for integration with common frameworks via thin SDK layers. Research directions include scaling the Designated Target component for larger clusters, extending batched-GET APIs to a broader set of object stores, and supporting downstream preprocessing integration. The API is open-source within NVIDIA AIStore and is documented at https://github.com/NVIDIA/aistore (Aizman et al., 25 Feb 2026).