
Split Buffering in Clos-Network Switches

Updated 20 December 2025
  • Split buffering is a technique that decomposes a traditional central buffering stage into two bufferless modules separated by virtual-output-module queues (VOMQs), enhancing load balance and performance.
  • In the LBC switch, VOMQs enable deterministic cyclic interconnection and in-order forwarding while maintaining 100% throughput under various traffic loads.
  • Deterministic scheduling and in-sequence hold-down mechanisms eliminate the need for memory speedup, reducing configuration complexity and ensuring stable operation.

Split buffering, as implemented in Clos-network packet switches, refers to the architectural arrangement in which what is traditionally a single central stage is decomposed into two bufferless stages with intervening buffers. The most prominent design exemplifying split buffering is the Split-Central-Buffered Load-Balancing Clos-network (LBC) switch, which organizes its central modules into a central-input stage and a central-output stage, placing virtual-output-module queues (VOMQs) between them. This mechanism enables load-balanced and in-order cell forwarding with high throughput and minimal configuration complexity, without the need for memory speedup or expanded central-stage replication (Sule et al., 2018).

1. Architectural Foundations and Module Decomposition

The LBC switch is configured as a four-stage Clos topology with $N = nk$ inputs and outputs. Each port is labeled $IP(i,s)$ or $OP(j,d)$, where $i,j \in \{0,\ldots,k-1\}$ and $s,d \in \{0,\ldots,n-1\}$. The LBC architecture consists of:

  1. Input stage (IM): $k$ independent $n \times m$ bufferless crossbars, denoted $IM(i)$.
  2. Central-input stage (CIM): $m$ independent $k \times k$ bufferless crossbars, denoted $CIM(r)$.
  3. Central-output stage (COM): $m$ independent $k \times k$ bufferless crossbars, denoted $COM(r)$.
  4. Output stage (OM): $k$ independent $m \times n$ buffered crossbars, denoted $OM(j)$, each with crosspoint buffers $CB(r, j, d)$.

The split-central concept divides the central stage into two consecutive bufferless modules (CIM and COM), with buffer resources (VOMQs) interleaved between them. The data path is:

$$IP(i, s) \to IM(i) \to L_{IM}(i, r) \to CIM(r) \to VOMQ(r, i, j) \to COM(j) \to OM(j) \to OP(j,d)$$

This structure avoids congestion and ordering issues found in traditional buffered and unbuffered three-stage Clos switches (Sule et al., 2018).
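The module inventory listed above can be tabulated with a short sketch. The class name `LBCDimensions` and its methods are hypothetical, introduced only for illustration; the counts and crossbar sizes come from the four-stage description in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LBCDimensions:
    """Hypothetical helper holding the three Clos parameters of the LBC switch."""
    n: int  # ports per input module (and per output module)
    m: int  # number of central modules (CIM/COM pairs)
    k: int  # number of input modules (and output modules)

    @property
    def num_ports(self) -> int:
        # N = n * k inputs and outputs
        return self.n * self.k

    def module_inventory(self) -> dict:
        # Maps stage name -> (number of modules, (crossbar rows, crossbar columns))
        return {
            "IM": (self.k, (self.n, self.m)),
            "CIM": (self.m, (self.k, self.k)),
            "COM": (self.m, (self.k, self.k)),
            "OM": (self.k, (self.m, self.n)),
        }


dims = LBCDimensions(n=4, m=4, k=4)
inventory = dims.module_inventory()
```

With `n = m = k = 4` this describes a 16-port switch built from four modules per stage.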

2. Split-Central Buffering and Virtual-Output-Module Queues

Split buffering is realized by placing queues between the two central stages. Although an alternative based on virtual-output-port queues (VOPQs), one per output port, is conceivable, the LBC design employs VOMQs placed at the outputs of the CIMs. For each $CIM(r)$ and destination $OM(j)$, a queue $VOMQ(r, *, j)$ is instantiated, into which all cells exiting $CIM(r)$ and destined for $OM(j)$ are placed, regardless of their originating IM. From these VOMQs, cells are delivered to one of the inputs of $COM(j)$. Figures in the primary reference depict this arrangement and clarify that VOMQs multiplex cell arrivals originating from multiple IMs (Sule et al., 2018).
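As a rough illustration of the queueing arrangement, the sketch below keeps one FIFO per (central module, destination OM) pair and accepts cells from any originating IM. The class `VOMQBank` and the tuple layout of a cell are assumptions made for this example and are not taken from the reference.

```python
from collections import defaultdict, deque


class VOMQBank:
    """Hypothetical model of the VOMQs: one FIFO per (r, j) pair."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, r, j, cell):
        # Cells leaving CIM(r) for OM(j) share one queue, whatever their source IM.
        self._queues[(r, j)].append(cell)

    def dequeue(self, r, j):
        # Returns the oldest cell, or None when the queue is empty.
        queue = self._queues[(r, j)]
        return queue.popleft() if queue else None

    def occupancy(self, r, j):
        return len(self._queues[(r, j)])


bank = VOMQBank()
# Assumed cell layout: (source IM, sequence number).
bank.enqueue(r=0, j=2, cell=(1, 0))
bank.enqueue(r=0, j=2, cell=(3, 0))
```

Cells from IM 1 and IM 3 end up in the same queue and leave it in arrival order.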

3. Deterministic Cyclic Interconnection Scheduling

All bufferless stages (IM, CIM, COM) operate under a deterministic, cyclic configuration regime composed of $k$ disjoint matchings per stage. During time slot $t$:

  • IM(i): $IP(i,s)$ connects to CIM index $r = (s + t) \bmod m$.
  • CIM(r): Its input from $IM(i)$ is connected to output port $p = (i + t) \bmod k$.
  • COM(r): In reversed order, input $p$ connects to OM index $j = (p - t) \bmod k$.

Each matching is a permutation, and their cyclic composition forms permutation matrices $\Pi(t)$ and $\Phi(t)$. Over $k$ slots,

$$P_1 = \sum_{t=0}^{k-1} \Pi(t), \quad P_2 = \sum_{t=0}^{k-1} \Phi(t)$$

Each permutation contributes one nonzero per row and column, ensuring balanced and deadlock-free interconnection. This pre-determined, periodic schedule is crucial for load balancing and low configuration complexity (Sule et al., 2018).
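The cyclic rules can be checked numerically. The following sketch encodes the three modular mappings stated above and builds the 0/1 matrix of the CIM matching for a given slot; the function names are hypothetical. Because the $k$ cyclic shifts are disjoint, summing them over one period yields the all-ones matrix, meaning every input reaches every output exactly once per cycle.

```python
def im_to_cim(s, t, m):
    # IM stage: input port s connects to central module r = (s + t) mod m.
    return (s + t) % m


def cim_output(i, t, k):
    # CIM stage: the input from IM(i) connects to output port p = (i + t) mod k.
    return (i + t) % k


def com_to_om(p, t, k):
    # COM stage: input p connects to output module j = (p - t) mod k.
    return (p - t) % k


def cim_permutation(t, k):
    # k x k 0/1 matrix with a single 1 in row i at column (i + t) mod k.
    return [[1 if cim_output(i, t, k) == p else 0 for p in range(k)]
            for i in range(k)]


def sum_over_cycle(k):
    # Adds the k matchings of one full period, entry by entry.
    total = [[0] * k for _ in range(k)]
    for t in range(k):
        perm = cim_permutation(t, k)
        for i in range(k):
            for p in range(k):
                total[i][p] += perm[i][p]
    return total


cycle_sum = sum_over_cycle(4)
```

Each mapping needs only one addition and one modulo per port, so no matching algorithm has to run at slot time.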

4. In-Sequence Cell Forwarding Policy

To guarantee in-order delivery, which can otherwise be violated by distributed buffering, the LBC switch introduces a hold-down mechanism for VOQ departures. When a cell $\tau$ for flow $y$ departs its VOQ and enters a VOMQ with occupancy $\delta$, all subsequent cells of the same flow are held at the VOQ for exactly $\delta \cdot k$ additional slots before permitting release. Each input IP maintains, per reachable VOMQ, a counter $IPC(r, j)$. When $VOQ(i, s, j, d)$ transmits a cell and $IPC = \sigma$, a hold-down timer of $(\sigma - 1)k$ slots is set. This rule ensures that younger cells cannot overtake older ones due to variable queuing times in the central buffers (Sule et al., 2018).
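A minimal sketch of the timer rule follows, assuming the counter value $\sigma$ is read at the moment of transmission and that the next cell of the flow becomes eligible one slot after the hold-down expires. The class `FlowGate`, the clamping of negative values to zero, and the "+ 1" slot offset are assumptions of this example, not details confirmed by the reference.

```python
def hold_down_slots(sigma, k):
    # Hold-down of (sigma - 1) * k slots; clamped at zero as a defensive assumption.
    return max(sigma - 1, 0) * k


class FlowGate:
    """Hypothetical per-flow gate that blocks a VOQ while its hold-down runs."""

    def __init__(self, k):
        self.k = k
        self.next_eligible_slot = 0

    def can_send(self, now):
        return now >= self.next_eligible_slot

    def on_send(self, now, sigma):
        # Assumed convention: the slot used for sending plus the hold-down period.
        self.next_eligible_slot = now + 1 + hold_down_slots(sigma, self.k)


gate = FlowGate(k=4)
gate.on_send(now=10, sigma=3)  # hold-down of (3 - 1) * 4 = 8 slots
```

Under these assumptions the flow is blocked through slot 18 and may send again at slot 19.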

5. Throughput and Queuing Stability Analysis

For admissible i.i.d. traffic, with $R_1 = [\lambda_{u,v}]$ as the $N \times N$ arrival-rate matrix (row/column sums $\le 1$), the system load balancing is analyzed as follows:

  • After IM+CIM permutations:

$$R_2 = \frac{1}{k} \left( (R_1 * \mathbf{1}) \circ P_1 \right)$$

where $*$ is matrix multiplication by the all-ones matrix and $\circ$ is elementwise multiplication. Each input's traffic is equally distributed along $k$ paths.

  • After COM:

$$R_3(j) = R_2(j) \circ P_2, \quad j = 0, \ldots, k-1$$

  • The OM stage aggregates these to recover each output's share:

$$R_4(v) = R_1(v), \qquad R_5(v) = \mathbf{1}^T R_4(v) = \sum_u \lambda_{u,v}$$

Drift arguments in Appendix A show that VOQs, VOMQs, and crosspoint buffers (CBs) remain weakly stable under these admissible loads, thereby guaranteeing 100% throughput under all admissible i.i.d. traffic (Sule et al., 2018).
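Two ingredients of this analysis are easy to illustrate: the admissibility condition on the arrival-rate matrix and the recovery of each output's aggregate load as a column sum. The sketch below is not the stability proof; it uses plain nested lists and hypothetical function names, and `per_path_share` only illustrates the statement that traffic is split evenly along $k$ paths.

```python
def is_admissible(rates, tol=1e-9):
    # Admissible: every row sum and every column sum is at most 1.
    size = len(rates)
    rows_ok = all(sum(row) <= 1 + tol for row in rates)
    cols_ok = all(sum(rates[u][v] for u in range(size)) <= 1 + tol
                  for v in range(size))
    return rows_ok and cols_ok


def output_load(rates):
    # Aggregate load per output v: the sum over inputs u of lambda[u][v].
    size = len(rates)
    return [sum(rates[u][v] for u in range(size)) for v in range(size)]


def per_path_share(rates, k):
    # Each flow's rate divided evenly across k paths.
    return [[rate / k for rate in row] for row in rates]


uniform = [[0.25] * 4 for _ in range(4)]  # fully loaded uniform traffic, N = 4
loads = output_load(uniform)
shares = per_path_share(uniform, k=2)
```

Summing a flow's share over its `k` paths returns the original rate, so the aggregate seen by each output is unchanged by the load balancing.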

6. Elimination of Memory-Speedup and Central-Stage Replication

Unlike buffered Clos designs that require central-stage memory to operate above line rate or expanded central buffer capacity to avoid internal bottlenecks, the LBC with split buffering employs only line-rate bufferless crossbars (IM, CIM, COM) and low-complexity VOMQs. Each VOMQ is served at least once every $k$ slots, allowing service rates to match offered load directly and eliminating requirements for speedup or additional internal replication. Configuration cost per slot remains $O(1)$, with mappings computed directly via cyclic permutation rules (Sule et al., 2018).

7. Empirical Performance and Comparative Evaluation

Simulation studies contrasted LBC with an ideal Output-Queued (OQ) switch, a conventional buffered-central Clos switch (MMM), and MMM with expanded memory (MMMⁱᵉ), for $N = 64$ and $N = 256$. Key outcomes:

  • Uniform Bernoulli traffic: LBC attained 100% throughput, with mean queueing delay close to OQ and outperforming MMM at high loads ($\rho > 0.95$).
  • Uniform bursty traffic (ON-OFF, mean burst length $\ell = 10, 30$): LBC maintained 100% throughput; MMM dropped to about 80% and 75%, respectively.
  • Unbalanced nonuniform patterns (Bernoulli, diagonal weighting $\omega = 0.6$): LBC preserved 100% throughput, with delay near the OQ baseline.
  • Hot-spot traffic: LBC sustained full throughput and low delay akin to OQ, even under single-output traffic concentration scenarios.
  • Stress tests: Scenarios with multiple flows targeting a single OM or module-specific hot-spots exhibited bounded delay and at most single-cell crosspoint buffer occupancy.
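For readers who want to reproduce a comparable workload, the sketch below builds a rate matrix for the unbalanced pattern. It assumes the commonly used unbalanced traffic model, in which a fraction $\omega$ of each input's load goes to the output with the same index and the rest is spread uniformly over all outputs; the text above does not spell out the exact model, so this form and the function name are assumptions.

```python
def unbalanced_rates(num_ports, rho, omega):
    # Assumed model: lambda[u][v] = rho * (omega + (1 - omega) / N) if u == v,
    # and rho * (1 - omega) / N otherwise.
    spread = rho * (1.0 - omega) / num_ports
    return [[spread + (rho * omega if u == v else 0.0)
             for v in range(num_ports)]
            for u in range(num_ports)]


rates = unbalanced_rates(num_ports=8, rho=0.9, omega=0.6)
row_sums = [sum(row) for row in rates]
col_sums = [sum(rates[u][v] for u in range(8)) for v in range(8)]
```

Every row and column of the resulting matrix sums to `rho`, so the pattern stays admissible for any `rho` up to 1. Setting `omega = 0` gives uniform traffic and `omega = 1` gives a fully diagonal pattern.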

These findings collectively demonstrate that split-central buffered architectures in LBC achieve deterministic, high-throughput operation under various load profiles, in-sequence forwarding, and configuration simplicity without resorting to conventionally necessary hardware enhancements (Sule et al., 2018).
