Split Buffering in Clos-Network Switches
- Split buffering is a technique that decomposes a traditional central buffering stage into two bufferless modules separated by virtual-output-module queues (VOMQs), enhancing load balance and performance.
- In the LBC switch, VOMQs enable deterministic cyclic interconnection and in-order forwarding while maintaining 100% throughput under various traffic loads.
- Deterministic scheduling and in-sequence hold-down mechanisms eliminate the need for memory speedup, reducing configuration complexity and ensuring stable operation.
Split buffering, as implemented in Clos-network packet switches, refers to the architectural arrangement in which what is traditionally a single central stage is decomposed into two bufferless stages with intervening buffers. The most prominent design exemplifying split buffering is the Split-Central-Buffered Load-Balancing Clos-network (LBC) switch, which organizes its central modules into a central-input stage and a central-output stage, placing virtual-output-module queues (VOMQs) between them. This mechanism enables load-balanced and in-order cell forwarding with high throughput and minimal configuration complexity, without the need for memory speedup or expanded central-stage replication (Sule et al., 2018).
1. Architectural Foundations and Module Decomposition
The LBC switch is configured as a four-stage Clos topology with inputs and outputs. Each port is labeled or , where and . The LBC architecture consists of:
- Input stage (IM): independent bufferless crossbars, denoted .
- Central-input stage (CIM): independent bufferless crossbars, denoted .
- Central-output stage (COM): independent bufferless crossbars, denoted .
- Output stage (OM): independent buffered crossbars, denoted , each with crosspoint buffers .
The split-central concept divides the central stage into two consecutive bufferless modules (CIM and COM), with buffer resources (VOMQs) interleaved between them. The data path is: IM → CIM → VOMQ → COM → OM.
This structure avoids congestion and ordering issues found in traditional buffered and unbuffered three-stage Clos switches (Sule et al., 2018).
2. Split-Central Buffering and Virtual-Output-Module Queues
Split buffering is realized by placing queues between the two central stages. Although virtual-output-port queues (VOPQs), one per output port, are a conceivable alternative, the LBC design employs VOMQs placed at the outputs of CIMs. For each and destination , a VOMQ is instantiated, into which all cells exiting and destined for are placed, regardless of their originating IM. From these VOMQs, cells are delivered to one of the inputs of . Figures in the primary reference depict this arrangement and show that each VOMQ multiplexes cells arriving from multiple IMs (Sule et al., 2018).
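The queueing arrangement described above can be illustrated with a minimal sketch. It assumes one FIFO per (CIM output, destination OM) pair. The class name `CentralInputModule`, the key layout, and the cell tuples are hypothetical and are not taken from the reference.

```python
from collections import defaultdict, deque


class CentralInputModule:
    """Toy model of a CIM whose outputs feed VOMQs (assumed keying, see lead-in)."""

    def __init__(self, index):
        self.index = index
        # One FIFO per (output_port, dest_om) pair; created lazily on first use.
        self.vomqs = defaultdict(deque)

    def enqueue(self, output_port, dest_om, cell):
        # Cells are stored in arrival order, whatever their originating IM.
        self.vomqs[(output_port, dest_om)].append(cell)

    def dequeue(self, output_port, dest_om):
        # Return the head-of-line cell, or None when the VOMQ is empty.
        queue = self.vomqs[(output_port, dest_om)]
        return queue.popleft() if queue else None


cim = CentralInputModule(index=0)
# Two cells from different IMs, both leaving output 1 and bound for OM 2.
cim.enqueue(output_port=1, dest_om=2, cell=("IM0", "cell-a"))
cim.enqueue(output_port=1, dest_om=2, cell=("IM3", "cell-b"))
first = cim.dequeue(1, 2)
second = cim.dequeue(1, 2)
third = cim.dequeue(1, 2)
```

Because the queue is keyed by destination module rather than by source, cells from `IM0` and `IM3` share one FIFO and leave in arrival order.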
3. Deterministic Cyclic Interconnection Scheduling
All bufferless stages (IM, CIM, COM) operate under a deterministic, cyclic configuration regime composed of disjoint matchings per stage. During time slot :
- IM(i): connects to index .
- CIM(r): Its input from is connected to output port .
- COM(r): In reversed order, input connects to index .
Each matching is a permutation, and their cyclic composition forms permutation matrices and . Over slots,
Each permutation contributes one nonzero per row and column, ensuring balanced and deadlock-free interconnection. This pre-determined, periodic schedule is crucial for load balancing and low configuration complexity (Sule et al., 2018).
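A cyclic schedule of this kind can be sketched in a few lines. The exact index expressions used by the LBC stages are not reproduced here, so the rule below (input `a` connects to output `(a + t) mod k` in slot `t`) is an assumed stand-in, and `crossbar_config` and `connection_counts` are hypothetical helper names.

```python
def crossbar_config(t, k):
    """Input-to-output map of a k x k bufferless crossbar in time slot t.

    Assumed rule: input a connects to output (a + t) mod k.
    """
    return [(a + t) % k for a in range(k)]


def connection_counts(k):
    """Count how often each input meets each output over k consecutive slots."""
    counts = [[0] * k for _ in range(k)]
    for t in range(k):
        for a, b in enumerate(crossbar_config(t, k)):
            counts[a][b] += 1
    return counts


k = 4
# Every slot's configuration is a permutation: each output is used exactly once.
all_permutations = all(
    sorted(crossbar_config(t, k)) == list(range(k)) for t in range(k)
)
counts = connection_counts(k)
```

Under this assumed rule, every input is connected to every output exactly once per cycle of `k` slots, so `counts` is the all-ones matrix. The configuration for a slot is computed directly from the slot number, with no matching or arbitration step.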
4. In-Sequence Cell Forwarding Policy
To guarantee in-order delivery, which can otherwise be violated by distributed buffering, the LBC switch introduces a hold-down mechanism for VOQ departures. When a cell for flow departs its VOQ and enters a VOMQ with occupancy , all subsequent cells of the same flow are held at the VOQ for exactly additional slots before permitting release. Each input IP maintains, per reachable VOMQ, a counter . When transmits a cell and , a hold-down timer of slots is set. This rule ensures that younger cells cannot overtake older ones due to variable queuing times in the central buffers (Sule et al., 2018).
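The hold-down rule can be sketched as a per-flow timer at the input port. The exact hold-down duration is not recoverable from the text above, so the sketch takes it as a caller-supplied function of the VOMQ occupancy. `HoldDownGate`, `hold_slots`, and the choice of holding for exactly `occupancy` slots in the usage lines are all assumptions made for illustration.

```python
class HoldDownGate:
    """Per-flow hold-down timer at an input port (illustrative sketch)."""

    def __init__(self, hold_slots):
        # hold_slots: hypothetical function mapping VOMQ occupancy -> extra slots.
        self.hold_slots = hold_slots
        # flow -> first time slot at which the flow may send again.
        self.release_at = {}

    def can_send(self, flow, now):
        return now >= self.release_at.get(flow, 0)

    def send(self, flow, now, vomq_occupancy):
        """Try to release one cell of `flow` in slot `now`; return True on success."""
        if not self.can_send(flow, now):
            return False
        # No hold-down when the target VOMQ is empty (assumed condition).
        hold = self.hold_slots(vomq_occupancy) if vomq_occupancy > 0 else 0
        self.release_at[flow] = now + 1 + hold
        return True


# Assumption for this example only: hold for as many slots as the occupancy.
gate = HoldDownGate(hold_slots=lambda occupancy: occupancy)
sent = [gate.send("flow-A", now=t, vomq_occupancy=3) for t in range(6)]
```

With an occupancy of 3, the flow sends in slot 0, is held in slots 1 to 3, and sends again in slot 4. This is the behavior that keeps a younger cell from entering a shorter queue and overtaking an older one.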
5. Throughput and Queuing Stability Analysis
For admissible i.i.d. traffic, with as the arrival-rate matrix (row/column sums ≤1), the system load balancing is analyzed as follows:
- After IM+CIM permutations:
where is matrix multiplication by all-ones matrix, is elementwise multiplication. Each input's traffic is equally distributed along paths.
- After COM:
- The OM stage aggregates these to recover each output's share:
Drift arguments in Appendix A of the primary reference show that VOQs, VOMQs, and crosspoint buffers (CBs) remain weakly stable under admissible i.i.d. traffic, which guarantees 100% throughput for such traffic (Sule et al., 2018).
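The load-balancing step in this analysis can be checked numerically on a small example. The sketch below assumes each flow is split evenly over a fixed number of internal paths. The matrix, the path count of 4, and the helper names are illustrative and are not taken from the reference.

```python
from fractions import Fraction


def is_admissible(rates):
    """True when every row sum and every column sum is at most 1."""
    rows_ok = all(sum(row) <= 1 for row in rates)
    cols_ok = all(sum(col) <= 1 for col in zip(*rates))
    return rows_ok and cols_ok


def split_evenly(rates, paths):
    """Return one rate matrix per path, each carrying 1/paths of every flow."""
    return [[[Fraction(r) / paths for r in row] for row in rates]
            for _ in range(paths)]


def aggregate(per_path):
    """Sum the per-path matrices element by element."""
    n_rows, n_cols = len(per_path[0]), len(per_path[0][0])
    return [[sum(p[i][j] for p in per_path) for j in range(n_cols)]
            for i in range(n_rows)]


# Illustrative 2 x 2 arrival-rate matrix with row and column sums of 3/4 and 1.
rates = [[Fraction(1, 2), Fraction(1, 4)],
         [Fraction(1, 4), Fraction(3, 4)]]
paths = 4
per_path = split_evenly(rates, paths)
# Heaviest load any single path carries toward any single output.
worst_path_load = max(sum(col) for p in per_path for col in zip(*p))
recovered = aggregate(per_path)
```

In this example the heaviest per-path load toward any output is 1/4, the reciprocal of the path count, and summing the paths recovers the original matrix. That is the intuition behind the stability result: an internal queue served once per cycle sees at most the matching share of an admissible load.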
6. Elimination of Memory-Speedup and Central-Stage Replication
Unlike buffered Clos designs that require central-stage memory to operate above line rate or expanded central buffer capacity to avoid internal bottlenecks, the LBC with split buffering employs only line-rate bufferless crossbars (IM, CIM, COM) and low-complexity VOMQs. Each VOMQ is served at least once every slots, allowing service rates to match offered load directly and eliminating requirements for speedup or additional internal replication. Configuration cost per slot remains , with mappings computed directly via cyclic permutation rules (Sule et al., 2018).
7. Empirical Performance and Comparative Evaluation
Simulation studies contrasted LBC with an Output-Queued (OQ) ideal switch, conventional buffered-central Clos (MMM), and MMM with expanded memory (MMMⁱᵉ), for and . Key outcomes:
- Uniform Bernoulli traffic: LBC attained 100% throughput, with mean queueing delay close to OQ and outperforming MMM at high loads ().
- Uniform bursty traffic (ON-OFF, mean burst ): LBC maintained 100% throughput; MMM dropped to ~80% and 75% at respective burst durations.
- Unbalanced nonuniform patterns (Bernoulli, diagonal weighting ): LBC preserved 100% throughput, delay near OQ baseline.
- Hot-spot traffic: LBC sustained full throughput and low delay akin to OQ, even under single-output traffic concentration scenarios.
- Stress tests: Scenarios with multiple flows targeting a single OM or module-specific hot-spots exhibited bounded delay and at most single-cell crosspoint buffer occupancy.
These findings collectively demonstrate that the split-central-buffered LBC switch achieves high throughput under a range of traffic profiles, with in-sequence forwarding and simple deterministic configuration, and without memory speedup or central-stage expansion (Sule et al., 2018).