
Binocular Batching for LLM Inference

Updated 3 February 2026
  • Binocular batching is a technique that partitions LLM inference requests into two bins—short and long—based on predicted execution times to reduce overall service time.
  • By grouping requests with similar expected processing times, the method minimizes waiting delays and improves throughput by up to 31% in uniform scenarios.
  • The approach employs queueing-theoretic models and lightweight predictors, ensuring robust performance even with moderate prediction errors.

Binocular batching is the two-bin instantiation of the Multi-Bin Batching framework, designed to increase LLM inference throughput via queueing-theoretic control policies that group requests with similar execution lengths. The method addresses the inefficiency of traditional batching under heterogeneous request runtimes, where hardware sits underutilized while waiting for the longest request in a batch to complete. By partitioning incoming requests—via prediction—into “short” and “long” bins at the median of the execution-time distribution, binocular batching forms near-uniform-length batches and provably improves both resource utilization and LLM inference throughput (Guldogan et al., 2024).

1. Queueing-Theoretic Model and Problem Formulation

Binocular batching operates within a queueing-theoretic framework. The system is modeled as follows:

  • Requests arrive according to a Poisson process with rate $\lambda$.
  • The server maintains a single, infinite-buffer queue, processing fixed-size batches.
  • Each incoming request’s service (generation) time $l$ is independent and identically distributed, following a uniform distribution $\mathcal U[l_{\min}, l_{\max}]$.
  • The service time of a batch, $t_{\mathrm{service}}$, is defined as $\max\{l_1, \dots, l_B\}$, i.e., the maximum of the batch’s request times. This setup reflects typical GPU-backed LLM inference workloads, where throughput is constrained by the slowest request in each batch.
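
This max-of-the-batch service model can be sampled directly. A minimal sketch (the values $l_{\min}=1$, $l_{\max}=20$, $B=128$ are the illustrative ones used later in this article):

```python
import random

random.seed(0)

L_MIN, L_MAX, B = 1.0, 20.0, 128   # illustrative values from this article

# One batch of i.i.d. uniform request lengths; the batch's service time
# is the maximum, since the server waits for the slowest request.
batch = [random.uniform(L_MIN, L_MAX) for _ in range(B)]
t_service = max(batch)
print(f"batch service time: {t_service:.2f} (mean request length: 10.5)")
```

For a large batch the maximum sits close to $l_{\max}$, far above the mean request length, which is exactly the waste that binning targets.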

2. Two-Bin Partitioning and Threshold Assignment

The engine of binocular batching is the partition of the request-length domain $[l_{\min}, l_{\max}]$ into two intervals of equal probability mass under the uniform distribution. The unique optimal placement for the separation threshold, which maximizes throughput under the uniform law, is:

  • $l_0 = l_{\min}$
  • $l_1 = l_{\min} + \tfrac{1}{2}(l_{\max} - l_{\min})$
  • $l_2 = l_{\max}$

Bin 1 comprises $\{l : l_{\min} \leq l < l_1\}$ ("short"), and Bin 2 comprises $\{l : l_1 \leq l \leq l_{\max}\}$ ("long"). In practical deployments, the unknown true $l$ is replaced by a contemporaneous or historical prediction $\hat l$, using lightweight regression proxies or small models, and the request is binned based on whether $\hat l < l_1$ or not.
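
The binning rule reduces to a single threshold comparison on the predicted length. A sketch under stated assumptions — the linear proxy and its coefficients below are made up for illustration, not the paper's predictor:

```python
# Hypothetical linear proxy: predict generation length from prompt length.
# The coefficients are illustrative only.
def predict_length(prompt_tokens: int) -> float:
    return 0.5 + 0.02 * prompt_tokens

L_MIN, L_MAX = 1.0, 20.0
l1 = L_MIN + 0.5 * (L_MAX - L_MIN)   # equal-mass threshold: 10.5

def assign_bin(prompt_tokens: int) -> int:
    """Return 1 ("short") if the predicted length falls below l1, else 2 ("long")."""
    return 1 if predict_length(prompt_tokens) < l1 else 2

print(assign_bin(100))   # predicted 2.5  -> bin 1
print(assign_bin(900))   # predicted 18.5 -> bin 2
```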

3. Binocular Batching Control Policy

The control policy is implemented as two independent sub-queues, each corresponding to one bin. Pending requests are enqueued according to their estimated generation length. When either bin’s sub-queue accumulates $B$ requests, a batch is formed and appended to a central service queue. The server pulls from this central queue in a first-formed-batch, first-served (FIFO) fashion. The full policy, matching Algorithm 1 in (Guldogan et al., 2024) specialized to $k=2$, is:

Initialize empty queues Q1, Q2; central service queue S = []
Upon arrival of request r:
    estimate execution length hat_l
    if hat_l < l1:
        Q1.enqueue(r)
    else:
        Q2.enqueue(r)
    for Qi in {Q1, Q2}:
        while Qi.size() >= B:
            batch = Qi.dequeue(B)   # remove the B oldest requests
            S.append(batch)
While server is free and S is not empty:
    batch = S.pop_front()           # FIFO: serve batches in formation order
    run_model(batch)                # service time = max(actual request lengths)

Requests are thus always processed in homogeneous-length cohorts, mitigating performance bottlenecking by outlier requests.
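
The policy above can be rendered as a short executable sketch. The batch size, threshold, and arrival sequence are illustrative, oracle length knowledge is assumed, and `run_model` is stubbed to return the batch's maximum length:

```python
from collections import deque

B = 4        # batch size (small, for illustration)
L1 = 10.5    # bin threshold

q1, q2, service_queue = deque(), deque(), deque()

def enqueue(length: float) -> None:
    """Bin an arriving request by its (here: oracle) length; form batches of B."""
    q = q1 if length < L1 else q2
    q.append(length)
    if len(q) >= B:                          # bin full: form a batch
        service_queue.append([q.popleft() for _ in range(B)])

def run_model(batch):
    return max(batch)                        # stub: service time = slowest request

for l in [2.0, 12.0, 3.0, 14.0, 1.5, 11.0, 2.5, 19.0]:
    enqueue(l)

served = []
while service_queue:                         # FIFO over formed batches
    batch = service_queue.popleft()
    served.append((batch, run_model(batch)))
print(served)
```

Note how the short batch costs only its own maximum (3.0) rather than being dragged to 19.0 by a long outlier, which is the entire point of the partition.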

4. Analytical Formulas, Throughput, and Optimality

The expected service time for a batch in binocular batching is the probabilistic mixture of the maxima over the two uniform intervals. Let $a$ and $b$ denote a bin’s endpoints, and $B$ the batch size:

  • For uniform $[a,b]$, $\mathbb E[\max] = \tfrac{1}{B+1}\,a + \tfrac{B}{B+1}\,b$ — the maximum of $B$ i.i.d. uniforms concentrates near the upper endpoint.

The key throughput and resource-utilization formulas for binocular ($k=2$) batching are:

  • Expected service time:

$\mathbb E[t_{\mathrm{service},2}] = \frac{l_{\max} + l_{\min}}{2} + \frac{1}{2}\left( \frac{B}{B+1}\,l_{\max} + \frac{1}{B+1}\,l_{\min} - \frac{l_{\max} + l_{\min}}{2} \right)$

  • Throughput:

$c_2 = \dfrac{B}{\mathbb E[t_{\mathrm{service},2}]}$

  • Stability condition (resource utilization):

$\rho = \dfrac{\lambda\, \mathbb E[t_{\mathrm{service},2}]}{B} < 1$

Throughput optimality is established by convexity arguments: minimizing $\mathbb E[t_{\mathrm{service},2}]$ with respect to $l_1$ is achieved when both bins carry equal probability mass, producing strictly higher throughput than single-queue ($k=1$) batching, and performance increases monotonically with bin count ($c_k$ is increasing in $k$).
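
The formulas above can be checked numerically. The sketch below evaluates the 1-bin and 2-bin expected service times in closed form, using the parameters from the simulation in the next section ($l_{\min}=1$, $l_{\max}=20$, $B=128$):

```python
def expected_max_uniform(a: float, b: float, B: int) -> float:
    """E[max of B i.i.d. U[a, b]] = a/(B+1) + b*B/(B+1)."""
    return a / (B + 1) + b * B / (B + 1)

l_min, l_max, B = 1.0, 20.0, 128
m = l_min + 0.5 * (l_max - l_min)   # equal-mass threshold l1 = 10.5

# 1-bin: every batch spans the full interval.
t1 = expected_max_uniform(l_min, l_max, B)

# 2-bin: a batch is "short" or "long" with probability 1/2 each.
t2 = 0.5 * expected_max_uniform(l_min, m, B) + 0.5 * expected_max_uniform(m, l_max, B)

c1, c2 = B / t1, B / t2
print(f"1-bin: {c1:.2f} req/s, 2-bin: {c2:.2f} req/s, gain {c2 / c1 - 1:.1%}")
```

This reproduces the throughputs quoted below: roughly $6.45$ vs. $8.43$ requests/sec, a gain of about $31\%$.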

5. Empirical Results and Performance Gains

Empirical findings in (Guldogan et al., 2024) demonstrate notable throughput improvements:

  • In a simulated uniform scenario with $l_{\min}=1$, $l_{\max}=20$, and $B=128$, standard (1-bin) throughput is approximately $6.45$ requests/sec, while the 2-bin (binocular) scheme achieves $8.43$ requests/sec, a gain of nearly $31\%$.
  • In LLM-in-the-loop deployments (Phi-3.5-mini-instruct, $B=8$, oracle length knowledge), 2-bin throughput is $1.2$–$1.3\times$ higher than single-queue, and end-to-end experiments report $20$–$30\%$ throughput improvements.

Robustness analysis demonstrates that with two bins and length-prediction error probability $p_e$ up to $0.1$, throughput remains within $90\%$ of the oracle (error-free) two-bin throughput, indicating resilience to moderate prediction inaccuracies.

6. Implementation Notes and Practical Trade-offs

Predicting request execution length can leverage simple linear proxies or compact regressors; the predictor need only distinguish “short” from “long” with $>85\%$ correctness to capture the majority of the throughput gains. For non-uniform length distributions, the bin threshold should correspond to the 50th percentile (median) of predicted lengths.
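
For a non-uniform workload, the threshold is just the empirical median of predicted lengths. A minimal sketch, assuming a (hypothetical) history of predictions is available:

```python
import statistics

# Hypothetical history of predicted generation lengths (illustrative values).
predicted_lengths = [1.2, 2.0, 2.4, 3.1, 5.0, 8.7, 9.9, 14.0, 17.5, 19.2]

# Bin threshold = 50th percentile, so each bin receives half the traffic.
l1 = statistics.median(predicted_lengths)
print(f"threshold l1 = {l1}")   # midpoint of 5.0 and 8.7 -> 6.85
```

In production the history would be refreshed periodically so the threshold tracks drift in the workload's length distribution.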

A trade-off of binocular batching is that splitting the arrival stream into two bins halves the arrival rate per bin, slightly increasing the time to fill a batch. Under typical loads, this overhead is limited ($\lesssim 5$–$10$ ms) and is counterbalanced by the $20$–$30\%$ throughput increase. Production systems should implement “max-wait” timers (e.g., $10$ ms) to bound latency growth during temporary underload. Integration with GPU batch feeders requires maintaining two sub-queues at the front-end and dispatching batch-ready groups promptly to the hardware scheduler.
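
The max-wait safeguard amounts to a deadline check at dispatch time. The sketch below is one possible shape for a single sub-queue, not the paper's implementation; the 10 ms constant is the example value from the text, and timestamps are passed in explicitly so the logic is testable:

```python
import collections

B = 8
MAX_WAIT = 0.010   # 10 ms cap on a partial batch's wait (example value)

# Each entry is (request, arrival_time); popleft yields the oldest.
queue: collections.deque = collections.deque()

def maybe_dispatch(now: float):
    """Dispatch a full batch, or a partial one if the head has waited MAX_WAIT."""
    if not queue:
        return None
    full = len(queue) >= B
    timed_out = now - queue[0][1] >= MAX_WAIT
    if full or timed_out:
        n = min(B, len(queue))
        return [queue.popleft()[0] for _ in range(n)]
    return None

# Usage: three requests arrive at t=0; at t=0.005 nothing fires, and at
# t=0.012 the 10 ms timer expires, flushing a partial batch of 3.
for r in ["a", "b", "c"]:
    queue.append((r, 0.0))
assert maybe_dispatch(0.005) is None
print(maybe_dispatch(0.012))   # -> ['a', 'b', 'c']
```

The timer trades a small loss in batch homogeneity for a hard bound on queueing latency during underload.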

7. Summary and Context within Multi-Bin Batching

Binocular batching is a specialization of Multi-Bin Batching to $k=2$. It partitions incoming inference requests into "short" and "long" bins based on a central threshold, then batches and serves them to maximize throughput. The method provably and empirically reduces expected batch service times by minimizing the per-batch maximum, and consistently achieves significant throughput advantages over naive batching in both simulation and LLM production settings, with robust performance under length-prediction errors and minimal practical implementation overhead (Guldogan et al., 2024).

