Split Buffering in Clos-Network Switches

Updated 20 December 2025

Split buffering is a technique that decomposes a traditional central buffering stage into two bufferless modules separated by virtual-output-module queues (VOMQs), enhancing load balance and performance.
In the load-balancing Clos-network (LBC) switch, VOMQs enable deterministic cyclic interconnection and in-order forwarding while maintaining 100% throughput under admissible traffic.
Deterministic scheduling and in-sequence hold-down mechanisms eliminate the need for memory speedup, reducing configuration complexity and ensuring stable operation.

Split buffering, as implemented in Clos-network packet switches, refers to the architectural arrangement in which what is traditionally a single central stage is decomposed into two bufferless stages with intervening buffers. The most prominent design exemplifying split buffering is the Split-Central-Buffered Load-Balancing Clos-network (LBC) switch, which organizes its central modules into a central-input stage and a central-output stage, placing virtual-output-module queues (VOMQs) between them. This mechanism enables load-balanced and in-order cell forwarding with high throughput and minimal configuration complexity, without the need for memory speedup or expanded central-stage replication (Sule et al., 2018).

1. Architectural Foundations and Module Decomposition

The LBC switch is configured as a four-stage Clos topology with $N=nk$ inputs and outputs. Each port is labeled $IP(i,s)$ or $OP(j,d)$, where $i,j \in \{0,\ldots,k-1\}$ and $s,d \in \{0,\ldots,n-1\}$. The LBC architecture consists of:

Input stage (IM): $k$ independent $n \times m$ bufferless crossbars, denoted $IM(i)$.
Central-input stage (CIM): $m$ independent $k \times k$ bufferless crossbars, denoted $CIM(r)$.
Central-output stage (COM): $m$ independent $k \times k$ bufferless crossbars, denoted $COM(r)$.
Output stage (OM): $k$ independent $m \times n$ buffered crossbars, denoted $OM(j)$, each with crosspoint buffers $CB(r, j, d)$.

The split-central concept divides the central stage into two consecutive bufferless modules (CIM and COM), with buffer resources (VOMQs) interleaved between them. The data path is:

$IP(i, s) \to IM(i) \to L_{IM}(i, r) \to CIM(r) \to VOMQ(r, i, j) \to COM(r) \to OM(j) \to OP(j,d)$

This structure avoids congestion and ordering issues found in traditional buffered and unbuffered three-stage Clos switches (Sule et al., 2018).
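As a rough illustration of the module inventory above, the following Python sketch records the size of each stage and maps a flat port index to its $(i, s)$ label. The class name `LBCDimensions`, the flat numbering $u = i \cdot n + s$, and the dictionary layout are assumptions made for this example, not notation from the source.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LBCDimensions:
    """Hypothetical container for the LBC size parameters n, m, k."""
    n: int  # ports per input module (and per output module)
    m: int  # number of CIMs (and of COMs)
    k: int  # number of IMs (and of OMs)

    @property
    def num_ports(self) -> int:
        # N = n * k inputs and outputs
        return self.n * self.k

    def port_label(self, u: int) -> Tuple[int, int]:
        """Map flat index u to (module i, local port s); assumes u = i*n + s."""
        if not 0 <= u < self.num_ports:
            raise ValueError("port index out of range")
        return divmod(u, self.n)

    def module_inventory(self) -> Dict[str, Tuple[int, Tuple[int, int]]]:
        """Return {stage: (module count, (inputs, outputs) per module)}."""
        return {
            "IM": (self.k, (self.n, self.m)),
            "CIM": (self.m, (self.k, self.k)),
            "COM": (self.m, (self.k, self.k)),
            "OM": (self.k, (self.m, self.n)),
        }
```

For example, `LBCDimensions(n=2, m=3, k=4)` describes an 8-port switch with four $2 \times 3$ input modules and three $4 \times 4$ crossbars in each central stage.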

2. Split-Central Buffering and Virtual-Output-Module Queues

Split buffering is realized by locating queues between the two central stages. Although an alternative with virtual-output-port queues (VOPQs), one per output port, is conceivable, the LBC design employs VOMQs placed at the outputs of the CIMs. For each $CIM(r)$ and destination $OM(j)$, a VOMQ $(r, *, j)$ is instantiated, into which all cells exiting $CIM(r)$ and destined for $OM(j)$ are placed, regardless of their originating IM. From these VOMQs, cells are delivered to one of the inputs of $COM(r)$. Figures in the primary reference depict this arrangement and show that VOMQs multiplex cell arrivals originating from multiple IMs (Sule et al., 2018).
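The queue organization described here can be sketched as a bank of FIFO queues keyed by central module $r$ and destination module $j$. This is a minimal illustration only: the class `VOMQBank`, the `Cell` tuple layout, and the method names are hypothetical, and keying on $(r, j)$ follows the description above in which the originating IM does not select the queue.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Hypothetical cell layout: (source IM i, destination OM j, sequence number)
Cell = Tuple[int, int, int]


class VOMQBank:
    """FIFO queues between CIM and COM, one per (central module r, destination OM j)."""

    def __init__(self, m: int, k: int) -> None:
        self.queues: Dict[Tuple[int, int], Deque[Cell]] = {
            (r, j): deque() for r in range(m) for j in range(k)
        }

    def enqueue(self, r: int, cell: Cell) -> None:
        # The queue is chosen by destination OM only, so cells from
        # different IMs share the same queue.
        _, dst_om, _ = cell
        self.queues[(r, dst_om)].append(cell)

    def dequeue(self, r: int, j: int) -> Optional[Cell]:
        # Return the head-of-line cell, or None if the queue is empty.
        q = self.queues[(r, j)]
        return q.popleft() if q else None

    def occupancy(self, r: int, j: int) -> int:
        return len(self.queues[(r, j)])
```

Cells from two different IMs that traverse the same $CIM(r)$ toward the same $OM(j)$ land in one queue and leave in arrival order.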

3. Deterministic Cyclic Interconnection Scheduling

All bufferless stages (IM, CIM, COM) operate under a deterministic, cyclic configuration regime composed of $k$ disjoint matchings per stage. During time slot $t$:

IM(i): $IP(i,s)$ connects to $CIM$ index $r = (s + t) \bmod m$.
CIM(r): Its input from $IM(i)$ is connected to output port $p = (i + t) \bmod k$.
COM(r): In reversed order, input $p$ connects to $OM$ index $j = (p - t) \bmod k$.

Each matching is a permutation, and their cyclic composition forms permutation matrices $\Pi(t)$ and $\Phi(t)$. Over $k$ slots,

$P_1 = \sum_{t=0}^{k-1} \Pi(t), \quad P_2 = \sum_{t=0}^{k-1} \Phi(t)$

Each permutation contributes one nonzero per row and column, ensuring balanced and deadlock-free interconnection. This pre-determined, periodic schedule is crucial for load balancing and low configuration complexity (Sule et al., 2018).
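The cyclic rules above translate directly into constant-time index arithmetic. The sketch below, with hypothetical function names, computes each stage's connection for a slot $t$, checks that a per-slot configuration is a permutation, and sums the CIM matchings over $k$ slots. For cyclic shifts that sum is the all-ones matrix, which is one way to read the balance property: every input reaches every output exactly once per period.

```python
from typing import List


def im_target_cim(s: int, t: int, m: int) -> int:
    """CIM index reached by local input port s of an IM in slot t."""
    return (s + t) % m


def cim_output_port(i: int, t: int, k: int) -> int:
    """CIM output port connected to the input coming from IM(i) in slot t."""
    return (i + t) % k


def com_target_om(p: int, t: int, k: int) -> int:
    """OM index reached from COM input p in slot t (reverse rotation)."""
    return (p - t) % k


def is_permutation(values: List[int], size: int) -> bool:
    """True if values hits each index 0..size-1 exactly once."""
    return sorted(values) == list(range(size))


def summed_cim_matchings(k: int) -> List[List[int]]:
    """Sum the k per-slot CIM permutation matrices over one period."""
    total = [[0] * k for _ in range(k)]
    for t in range(k):
        for i in range(k):
            total[i][cim_output_port(i, t, k)] += 1
    return total
```

Because each mapping is a single modular addition or subtraction, no arbitration or matching computation is needed per slot.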

4. In-Sequence Cell Forwarding Policy

To guarantee in-order delivery, which distributed buffering can otherwise violate, the LBC switch introduces a hold-down mechanism for VOQ departures. When a cell $\tau$ of flow $y$ departs its VOQ and enters a VOMQ with occupancy $\delta$, all subsequent cells of the same flow are held at the VOQ for exactly $\delta \cdot k$ additional slots before release is permitted. To implement this, each input port (IP) maintains, per reachable VOMQ, a counter $IPC(r, j)$. When $VOQ(i, s, j, d)$ transmits a cell and $IPC=\sigma$, a hold-down timer of $(\sigma - 1)k$ slots is set; the two expressions coincide when $\sigma$ counts the cell just sent, so that $\sigma = \delta + 1$. This rule ensures that younger cells cannot overtake older ones because of variable queuing times in the central buffers (Sule et al., 2018).
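A minimal sketch of the hold-down timer follows, using the $(\sigma - 1)k$ rule stated above. The class `HoldDownGate` and its methods are hypothetical, and the sketch assumes that without any hold-down the next cell of a flow could leave in the following slot, so the timer is added on top of that one-slot gap.

```python
def hold_down_slots(sigma: int, k: int) -> int:
    """Hold-down duration in slots for counter value sigma: (sigma - 1) * k."""
    if sigma < 1:
        raise ValueError("sigma must be at least 1")
    return (sigma - 1) * k


class HoldDownGate:
    """Tracks the earliest slot at which a flow's next cell may leave its VOQ."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.release_slot = 0  # no hold-down pending initially

    def can_send(self, now: int) -> bool:
        return now >= self.release_slot

    def record_send(self, now: int, sigma: int) -> None:
        # Assumption: the next slot is the earliest normal opportunity;
        # the hold-down adds (sigma - 1) * k slots beyond it.
        self.release_slot = now + 1 + hold_down_slots(sigma, self.k)
```

With $k = 4$ and $\sigma = 3$, a cell sent in slot 0 blocks its flow for 8 additional slots, so the next cell becomes eligible in slot 9.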

5. Throughput and Queuing Stability Analysis

For admissible i.i.d. traffic, with $R_1 = [\lambda_{u,v}]$ as the $N \times N$ arrival-rate matrix (row/column sums ≤1), the system load balancing is analyzed as follows:

After IM+CIM permutations:

$R_2 = \frac{1}{k} \left( (R_1 * \mathbf{1}) \circ P_1 \right)$

where $*$ denotes multiplication by the all-ones matrix and $\circ$ denotes elementwise multiplication. Each input's traffic is thus distributed equally across $k$ paths.

After COM:

$R_3(j) = R_2(j) \circ P_2, \quad j=0,\ldots,k-1$

The OM stage aggregates these to recover each output's share:

$R_4(v) = R_1(v), \qquad R_5(v) = \mathbf{1}^T R_4(v) = \sum_u \lambda_{u,v}$

Drift arguments (Appendix A of Sule et al., 2018) show that VOQs, VOMQs, and crosspoint buffers (CBs) remain weakly stable under these admissible loads, thereby guaranteeing 100% throughput under all admissible i.i.d. traffic.
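The admissibility condition on $R_1$ and the per-output aggregate $R_5(v) = \sum_u \lambda_{u,v}$ are easy to check numerically. The sketch below uses plain nested lists and hypothetical function names; it illustrates the definitions only and does not reproduce the stability proof.

```python
from typing import List

Matrix = List[List[float]]


def row_sums(R: Matrix) -> List[float]:
    """Total load offered by each input u."""
    return [sum(row) for row in R]


def col_sums(R: Matrix) -> List[float]:
    """Total load destined to each output v, i.e. R_5(v)."""
    return [sum(col) for col in zip(*R)]


def is_admissible(R: Matrix, tol: float = 1e-9) -> bool:
    """True if no input and no output is loaded beyond rate 1."""
    return all(x <= 1 + tol for x in row_sums(R) + col_sums(R))


def per_path_rates(R: Matrix, k: int) -> Matrix:
    """Rate each flow places on one of k paths under an even split."""
    return [[x / k for x in row] for row in R]
```

A matrix in which two inputs each send 0.9 of their load to the same output is rejected, since that output's column sum is 1.8.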

6. Elimination of Memory-Speedup and Central-Stage Replication

Unlike buffered Clos designs that require central-stage memory to operate above line rate or expanded central buffer capacity to avoid internal bottlenecks, the LBC with split buffering employs only line-rate bufferless crossbars (IM, CIM, COM) and low-complexity VOMQs. Each VOMQ is served at least once every $k$ slots, allowing service rates to match offered load directly and eliminating requirements for speedup or additional internal replication. Configuration cost per slot remains $O(1)$, with mappings computed directly via cyclic permutation rules (Sule et al., 2018).

7. Empirical Performance and Comparative Evaluation

Simulation studies contrasted LBC with an Output-Queued (OQ) ideal switch, conventional buffered-central Clos (MMM), and MMM with expanded memory (MMMⁱᵉ), for $N=64$ and $N=256$. Key outcomes:

Uniform Bernoulli traffic: LBC attained 100% throughput, with mean queueing delay close to OQ and outperforming MMM at high loads ($\rho > 0.95$).
Uniform bursty traffic (ON-OFF, mean burst $\ell=10,30$): LBC maintained 100% throughput; MMM dropped to roughly 80% and 75% for $\ell=10$ and $\ell=30$, respectively.
Unbalanced nonuniform patterns (Bernoulli, diagonal weighting $\omega=0.6$): LBC preserved 100% throughput, delay near OQ baseline.
Hot-spot traffic: LBC sustained full throughput and low delay akin to OQ, even under single-output traffic concentration scenarios.
Stress tests: Scenarios with multiple flows targeting a single OM or module-specific hot-spots exhibited bounded delay and at most single-cell crosspoint buffer occupancy.
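To make the uniform and unbalanced scenarios concrete, the sketch below builds the corresponding arrival-rate matrices. The source does not give the formula behind the diagonal weighting $\omega$; the code assumes a commonly used unbalanced model in which a fraction $\omega$ of each input's load goes to the same-index output and the remainder is spread uniformly over all outputs. Function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def uniform_matrix(N: int, rho: float) -> Matrix:
    """Each input spreads load rho evenly over all N outputs."""
    return [[rho / N] * N for _ in range(N)]


def unbalanced_matrix(N: int, rho: float, w: float) -> Matrix:
    """Assumed model: fraction w of the load goes to output u == v,
    and the remaining (1 - w) is spread uniformly over all N outputs."""
    off_diag = rho * (1 - w) / N
    diag = rho * (w + (1 - w) / N)
    return [[diag if u == v else off_diag for v in range(N)] for u in range(N)]
```

Under this model every row and column still sums to $\rho$, so the matrix stays admissible for $\rho \le 1$, and $\omega = 0$ reduces to the uniform case.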

These findings collectively demonstrate that split-central buffered architectures in LBC achieve deterministic, high-throughput operation under various load profiles, in-sequence forwarding, and configuration simplicity without resorting to conventionally necessary hardware enhancements (Sule et al., 2018).


References

Sule et al. (2018). A Split-Central-Buffered Load-Balancing Clos-Network Switch with In-Order Forwarding.
