GetBatch: Optimizing Multi-Object Retrieval
- GetBatch is an object store API that streamlines ML training by batching GET requests for small objects, reducing per-request overhead.
- It employs deterministic ordering, pipelined streaming, and robust error recovery to maintain performance in distributed storage environments.
- By amortizing per-request overhead into a one-time batch cost, GetBatch achieves up to 15× speedup for small object retrieval, crucial for high-throughput ML pipelines.
GetBatch is an object store API designed to optimize distributed multi-object retrieval in ML training pipelines. It promotes batch retrieval to a first-class storage operation, replacing individual GET requests with a single deterministic, fault-tolerant streaming execution. This approach directly addresses the dominant per-request overhead associated with traditional random access to small objects residing in distributed storage clusters, yielding substantial improvements in throughput and tail latency (Aizman et al., 25 Feb 2026).
1. Motivation and Formal Problem Definition
Machine learning training frequently utilizes batches comprising hundreds to thousands of samples. When datasets are stored in an object store and accessed using "map-style" random reads, each batch typically translates into numerous independent HTTP GET requests, one per sample. Each GET request incurs significant control-plane and transport-layer overhead due to TCP handshakes, HTTP parsing, request scheduling, and related setup costs. For small objects (10 KiB–100 KiB), this overhead often dominates total retrieval time.
The cost model can be formalized as follows. Let $B$ denote the batch size, $s$ the average sample size in bytes, $W$ the available network bandwidth, and $c$ the per-request overhead. The cost of retrieving a batch via individual GETs is:

$$T_{\mathrm{GET}} = B \left( c + \frac{s}{W} \right)$$

For GetBatch:

$$T_{\mathrm{GetBatch}} = C + \frac{B s}{W}$$

where $C$ is a one-time cost for the entire batch. For small objects ($s/W \ll c$), the speedup $T_{\mathrm{GET}} / T_{\mathrm{GetBatch}} \approx B c / (C + B s / W)$ grows with increasing batch size.
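The cost model above can be checked numerically. A minimal sketch, using illustrative parameter values (not measured constants from the paper):

```python
def t_get(B, s, W, c):
    """Total time for B independent GETs: per-request overhead plus transfer."""
    return B * (c + s / W)

def t_getbatch(B, s, W, C):
    """Total time for one GetBatch call: one-time batch cost plus transfer."""
    return C + B * s / W

# Illustrative values (assumptions, not from the paper): 10 KiB objects,
# ~12.5 GiB/s link, 1 ms per-request overhead, 5 ms one-time batch cost.
s = 10 * 1024        # bytes
W = 12.5 * 2**30     # bytes/s
c, C = 1e-3, 5e-3    # seconds

for B in (256, 1024, 4096):
    speedup = t_get(B, s, W, c) / t_getbatch(B, s, W, C)
    print(f"B={B:5d}  speedup ~ {speedup:.1f}x")
```

Because $s/W \ll c$ here, the modeled speedup grows with $B$, matching the qualitative trend in the measurements of Section 5 (the absolute numbers depend on the real $c$ and $C$).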
2. API Semantics and Programming Model
GetBatch is invoked via an HTTP GET operation to the object store gateway, transmitting a JSON payload specifying the batch parameters. The JSON object includes:
- `"mime"`: specifies the output format (default `"tar"`).
- `"in"`: list of entries, each with `bucket`, `objname`, and optional `archpath`.
- `"strm"`: enables/disables streaming mode.
- `"coer"`: continue-on-error mode toggle.
- `"coloc"`: colocation hints toggle.
Sample JSON Payload:
```json
{
  "mime": "tar",
  "in": [
    {"bucket": "imagenet", "objname": "images/img_0001.jpg"},
    {"bucket": "shards", "objname": "train-0003.tar", "archpath": "labels/0003.txt"}
  ],
  "strm": true,
  "coer": false,
  "coloc": false
}
```
Client-Side Python SDK Example:
```python
from aistore.sdk import Client

client = Client("http://ais-gateway")
batch = client.batch(
    entries=[("imagenet", "img_0001.jpg"), ...],
    mime="tar", strm=True, coer=False, coloc=False)
for meta, content in batch.get():
    process_sample(content)
```
PyTorch DataLoader Integration:
```python
class MyDataset(IterableDataset):
    def __iter__(self):
        while True:
            paths = self.sampler.next_batch()
            batch = client.batch(paths, bucket)
            for _, blob in batch.get():
                yield self.process(blob)

dl = DataLoader(MyDataset(), batch_size=None)
```
GetBatch preserves response entry ordering to exactly match the order of the "in" list.
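Since the response arrives as a single TAR stream whose member order mirrors the `"in"` list, a client can consume it sequentially with the standard library. A minimal sketch (the entry names and in-memory stand-in response are illustrative, not part of the API):

```python
import io
import tarfile

def iter_batch(stream):
    """Yield (name, bytes) pairs from a TAR-formatted GetBatch response.
    Members arrive in the exact order of the request's "in" list."""
    with tarfile.open(fileobj=stream, mode="r|*") as tar:  # streaming read
        for member in tar:
            fobj = tar.extractfile(member)
            yield member.name, (fobj.read() if fobj else b"")

# Build a stand-in response whose member order mimics a request order.
requested = ["images/img_0001.jpg", "labels/0003.txt"]
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in requested:
        data = name.encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

received = [name for name, _ in iter_batch(buf)]
```

Because ordering is positional, the client never needs to match entries back to the request by name; index alignment is sufficient even when `coer` placeholders are present.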
3. System Architecture and Execution Pipeline
The GetBatch protocol involves the following components:
- Client: The training loop worker that initiates the request.
- Proxy: A stateless HTTP gateway mediating client communication.
- Designated Target (DT): The storage node that orchestrates ordered assembly and streaming.
- Senders: Storage nodes providing requested objects (or subsets).
- Peer-to-peer Connections: Persistent, direct links among storage targets.
Execution Phases:
- DT Registration: The Proxy selects a DT (via consistent hashing or colocation hints), forwards the batch specification, and receives an execution ID (execID).
- Sender Activation: Proxy instructs all relevant storage targets to fetch and send their respective entries for the execID.
- Redirection and Ordered Streaming: Proxy redirects the client to the DT endpoint, where data is merged in strict order and streamed (TAR or similar format). If `strm=true`, streaming commences as soon as the first in-order entry is available (pipelined).
ASCII Sequence Diagram:
```
Client
  │ GET /getbatch {JSON}
  ▼
Proxy ──┐
  │     │ choose DT
  │     ├─> DT: register execID
  │     └─> all Targets ("you are sender for execID")
  ▼              ▲
Client ◀──302────┘
  │
  └────────────▶ DT (streaming response)
```
Ordering Guarantee: Output strictly follows client-specified entry order, regardless of sender arrival times.
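The ordering guarantee can be realized by buffering out-of-order arrivals at the DT and emitting entries as soon as all predecessors have been sent. A minimal sketch of this merge logic (not the AIStore implementation):

```python
def ordered_stream(arrivals):
    """Merge out-of-order (index, data) arrivals into strictly ordered output,
    yielding each entry as soon as all its predecessors have been emitted
    (pipelined: no need to wait for the full batch)."""
    pending = {}
    next_idx = 0
    for idx, data in arrivals:
        pending[idx] = data
        # Drain every consecutive entry that is now ready.
        while next_idx in pending:
            yield next_idx, pending.pop(next_idx)
            next_idx += 1

# Senders deliver entries 2, 0, 1, 3 for a 4-entry batch:
out = list(ordered_stream([(2, b"c"), (0, b"a"), (1, b"b"), (3, b"d")]))
```

Memory use is bounded by how far ahead of `next_idx` senders run, which is one reason the DT needs the admission control described in Section 4.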
4. Fault Tolerance, Admission Control, and Observability
GetBatch distinguishes between hard errors, which abort the entire batch (e.g., JSON parsing failures, out-of-memory), and soft errors (missing objects, transient network failures, target timeouts).
Continue-on-Error (coer) Mode:
- Soft errors are handled by emitting a placeholder entry in the archive, maintaining positional alignment.
- Recovery attempts are made up to a configurable threshold (`R_max`), querying alternate targets via "get-from-neighbor" strategies.
Recovery Algorithm (informal):
```
For each missing/timed-out entry E at DT:
    for attempt in 1..R_max:
        DT asks another target for E's data
        if success: insert into ordered stream; break
    if still missing and coer=True: emit placeholder
    else: abort request
```
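The recovery loop above can be sketched as a small runnable function. The callable-per-target shape and the empty-bytes placeholder are illustrative assumptions, not the AIStore API:

```python
def recover_entry(entry, targets, r_max, coer, placeholder=b""):
    """Attempt up to r_max recoveries from alternate targets; on exhaustion,
    emit a placeholder (coer=True) or abort the whole batch (coer=False)."""
    for target in targets[:r_max]:
        try:
            return target(entry)      # "get-from-neighbor" attempt
        except IOError:
            continue                  # soft error: try the next target
    if coer:
        return placeholder            # keeps positional alignment in the stream
    raise RuntimeError(f"unrecoverable entry: {entry}")

# Stand-in targets: one that times out, one that serves the object.
def bad(entry):
    raise IOError("timeout")

def good(entry):
    return b"data:" + entry.encode()
```

Note that the placeholder path preserves the ordering guarantee: downstream consumers see an entry at every requested position, failed or not.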
Admission Control:
- Memory overload at DT blocks new requests (HTTP 429).
- CPU/disk pressure triggers calibrated throttling (sleeps per work item).
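The two admission-control behaviors can be sketched as a single decision function. Thresholds and the throttle scaling are illustrative assumptions, not AIStore's calibrated values:

```python
HTTP_OK, HTTP_429 = 200, 429

def admit(mem_used_frac, cpu_load, mem_limit=0.9, cpu_limit=0.8, max_sleep=0.05):
    """Admission decision for a new GetBatch request at the DT.
    Memory overload rejects outright (HTTP 429); CPU/disk pressure maps to a
    per-work-item sleep (throttling) instead of rejection."""
    if mem_used_frac >= mem_limit:
        return HTTP_429, 0.0          # reject: too much buffered batch data
    throttle = 0.0
    if cpu_load > cpu_limit:
        # Scale the sleep with how far past the threshold we are.
        over = (cpu_load - cpu_limit) / (1.0 - cpu_limit)
        throttle = min(max_sleep, max_sleep * over)
    return HTTP_OK, throttle
```

Rejecting on memory but throttling on CPU reflects the asymmetry noted above: buffered entries cannot be shed incrementally, whereas CPU pressure can be relieved by pacing work items.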
Observability Metrics (via Prometheus):
- `work_items_total`
- `bytes_sent_local`, `bytes_sent_remote`
- `rxwait`
- `throttle`
- `errors_hard`, `errors_soft`
- `recovery_attempts`, `recovery_failures`
5. Performance Analysis
Empirical evaluation was performed on a 16-node AIStore cluster (OCI, 16×NVMe SSD, 100 Gbps), with 80 concurrent workers across 8 client nodes, using varying object and batch sizes.
Sustained Throughput (GiB/s) and Speedup:
| Object Size | GET | GetBatch (256) | GetBatch (1024) | GetBatch (4096) |
|---|---|---|---|---|
| 10 KiB | 0.5 | 4.5 (9×) | 6.0 (12×) | 7.3 (15×) |
| 100 KiB | 4.2 | 20.7 (4.9×) | 24.1 (5.7×) | 26.1 (6.2×) |
| 1 MiB | 22.3 | 32.4 (1.5×) | 35.2 (1.6×) | 37.0 (1.7×) |
Production-Scale Training Pipeline ("Canary-1B-Flash"):
- Hardware: 128 × A100-80GB GPUs, 16 AIStore nodes
- Data loader: 1024 workers, dynamic bucketing, OOMptimizer
Batch and Per-Object Latency (ms):
| Method | Batch P50 | Batch P95 | Batch P99 | Batch Avg | Per-Object (P50/P95/P99/Avg) |
|---|---|---|---|---|---|
| Sequential I/O | 243.7 | 431.2 | 638.9 | 261.4 | 1.2 / 5.2 / 6.8 / 2.0 |
| Random GET | 934.7 | 3668.7 | 4814.3 | 1320.0 | 9.1 / 27.3 / 53.5 / 12.3 |
| GetBatch | 427.5 | 1808.6 | 2744.7 | 624.7 | 5.1 / 10.5 / 14.5 / 5.7 |
GetBatch reduced P95 batch retrieval latency by 2× (3668→1808 ms) and P99 per-object tail latency by 3.7× (53.5→14.5 ms) relative to random GET, while batch-time jitter (P99–P50) dropped by 40%.
6. Comparative Analysis and Limitations
Independent GETs require $O(B)$ control-plane interactions per batch, with corresponding round trips, scheduling load, and per-operation parsing. By contrast, GetBatch collapses all control-plane scheduling and HTTP parsing into a single operation, then streams bulk data to the client. For small objects, GetBatch achieves significant speedup through amortized overhead and streaming, with gains of up to 15× observed for 10 KiB objects.
Algorithmic Complexity:
| Approach | Complexity |
|---|---|
| Independent GET | $O(B)$ control-plane operations |
| GetBatch | $O(1)$ control-plane operations plus streamed data transfer |
Notable Limitations:
- Requires AIStore as the backend; not a general-purpose S3 extension.
- Single DT per request may introduce serialization bottlenecks at extreme batch sizes/concurrency; mitigated by admission control.
- No built-in server-side sampling or shuffling; client must handle sampling, though determinism of ordering aids reproducibility.
- Scalability beyond 16–32 nodes and tight integration with GPU-side data pipelines (e.g., NVIDIA DALI) are areas for further development.
7. Conclusion and Future Directions
GetBatch presents a deterministic, pipelined, and fault-tolerant mechanism for large-scale ML data loading, delivering high throughput and robust tail-latency improvements in distributed training environments. Its support for streaming, strict ordering, and error recovery allows for integration with common frameworks via thin SDK layers. Research directions include scaling the Designated Target component for larger clusters, extending batched-GET APIs to a broader set of object stores, and supporting downstream preprocessing integration. The API is open-source within NVIDIA AIStore and is documented at https://github.com/NVIDIA/aistore (Aizman et al., 25 Feb 2026).