
Blocking Linearizable Stack Implementation

Updated 15 January 2026
  • The paper introduces the SEC stack, which integrates sharding, FAA counters, lightweight elimination, and combining to boost throughput.
  • The design partitions operations into batches managed by aggregators, reducing contention by handling push/pop coordination locally.
  • Empirical results indicate the SEC stack achieves up to 6× speedup over prior approaches, particularly in high-contention, multi-thread environments.

A blocking linearizable stack implementation provides a concurrent stack data structure that ensures linearizability—every operation appears to take effect instantaneously at some point between invocation and response—while allowing threads to potentially block under contention. The “Sharded Elimination and Combining” (SEC) stack is a recent, highly efficient realization of this paradigm, combining sharding, fetch-and-add (FAA) counters, lightweight elimination, and aggregation-based combining to achieve substantially higher throughput and scalability, particularly in high-contention and large-thread-count systems (Singh et al., 8 Jan 2026).

1. High-Level Design and Sharding

The SEC stack replaces the classic single-pointer Treiber stack structure with a set of K “aggregators.” Each aggregator manages up to P threads, where every thread is statically mapped to an aggregator by integer division of its thread ID, i.e., aggregator ⌊tid/P⌋. Each aggregator organizes its incoming stack operations (push and pop) into successive “batches.” Within a batch, threads announce their operations and coordinate to apply elimination or combining.

Critical structures:

  • Aggregators: K independent managers for partitioning threads and their batches.
  • Batch: Tracks push and pop announcements with FAA counters (B.pushCount, B.popCount), keeps elimination slots in B.eliminationArray[P], and maintains freeze (batch cutoff) snapshots.
  • Global Stack: A singly-linked list identified by an atomic pointer, stackTop.

Sharding disperses contention and ensures that only a subset of threads contend for batch-level resources at any given time, substantially reducing cross-thread interference compared to centralized stacks (Singh et al., 8 Jan 2026).
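The structures above can be sketched in Python. This is illustrative only: field names follow the text, but the plain-Python fields stand in for the paper's atomic counters, and the `Batch` constructor and `aggregator_of` helper are assumptions made for the sketch.

```python
class Batch:
    """One batch of announced operations; field names follow the text,
    plain fields stand in for the paper's atomic counters."""
    def __init__(self, P):
        self.pushCount = 0                  # bumped with FAA by push threads
        self.popCount = 0                   # bumped with FAA by pop threads
        self.eliminationArray = [None] * P  # one announcement slot per thread
        self.isFreezerDecided = False       # TestSet target for freezer election
        self.pushCountAtFreeze = None       # snapshots taken by the freezer
        self.popCountAtFreeze = None
        self.isBatchApplied = False         # set by the combiner

def aggregator_of(tid, P):
    """Static thread-to-aggregator map: integer division of the thread ID."""
    return tid // P

# With P = 4 threads per aggregator, threads 0-3 share aggregator 0,
# threads 4-7 share aggregator 1, and so on.
assert [aggregator_of(t, 4) for t in range(8)] == [0, 0, 0, 0, 1, 1, 1, 1]
```

The static mapping means an operation never has to search for a lightly loaded aggregator; its shard is fixed by its thread ID.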

2. Key Algorithms: Push/Pop, Elimination, and Combining

SEC’s push and pop operations orchestrate synchronization via batch-local FAA and lightweight flag-passing, minimizing direct contention and expensive global synchronization.

2.1 Push Algorithm

A push thread:

  1. Allocates a new stack node and atomically obtains a unique sequence number in its batch via FAA(B.pushCount).
  2. Records its node in the batch’s elimination array.
  3. The first thread in a batch becomes the “freezer” by winning a TestSet on B.isFreezerDecided, snapshots batch counters, and initiates a new batch.
  4. Elimination phase: If its sequence number is less than both snapshot counts, it will be eliminated by a pop (and vice versa).
  5. Combining phase: If it “survived” elimination, exactly one thread per batch (determined by push/pop count parity) gathers the remaining pushes into a substack and splices it onto stackTop with a single CAS. Non-combiner threads spin on isBatchApplied to retrieve their result.
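Steps 1–4 of the push path can be walked through single-threaded. In this sketch FAA is simulated with plain fields, the `Batch` is trimmed to the fields used here, and the helper names (`faa`, `announce_push`, `freeze`, `is_eliminated`) are assumptions, not the paper's API.

```python
# Single-threaded walk-through of the push fast path (illustrative only).
class Batch:
    def __init__(self, P):
        self.pushCount = 0
        self.popCount = 0
        self.eliminationArray = [None] * P
        self.pushCountAtFreeze = 0
        self.popCountAtFreeze = 0

def faa(batch, field):
    """Fetch-and-add, simulated: return the old value, then increment."""
    old = getattr(batch, field)
    setattr(batch, field, old + 1)
    return old

def announce_push(batch, value):
    seq = faa(batch, "pushCount")        # step 1: unique sequence number
    batch.eliminationArray[seq] = value  # step 2: publish the node
    return seq

def freeze(batch):
    """Step 3: the freezer snapshots the counters, cutting off the batch."""
    batch.pushCountAtFreeze = batch.pushCount
    batch.popCountAtFreeze = batch.popCount

def is_eliminated(batch, seq):
    # Step 4: sequence numbers below both snapshot counts pair off.
    return seq < min(batch.pushCountAtFreeze, batch.popCountAtFreeze)

b = Batch(8)
s0 = announce_push(b, "x")
s1 = announce_push(b, "y")
faa(b, "popCount")              # one concurrent pop announced in this batch
freeze(b)
assert is_eliminated(b, s0)     # first push pairs with the pop
assert not is_eliminated(b, s1) # second push survives to combining
```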

2.2 Pop Algorithm

A pop thread symmetrically:

  1. Atomically increments B.popCount and announces its operation.
  2. Freezer logic and batch transitioning proceed as in push.
  3. During elimination, pairs off with a waiting push (by index).
  4. Surviving pops collectively remove a prefix of stackTop via a single CAS (by the designated combiner), dispensing stack values back to waiting pop threads.
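The pop-side pairing in step 3 reduces to an index check against the frozen snapshot: the i-th pop takes the value published by the i-th push. A minimal sketch, with the `pop_eliminate` helper an assumption of this illustration:

```python
# Pop-side elimination after freeze: index-aligned pairing with pushes.
def pop_eliminate(elimination_array, i, push_at_freeze, pop_at_freeze):
    if i < min(push_at_freeze, pop_at_freeze):
        return elimination_array[i]  # paired: take the i-th push's value
    return None                      # survivor: handled by the combiner

# Frozen batch snapshot: 2 pushes announced, 3 pops announced.
slots = ["a", "b", None, None]
assert pop_eliminate(slots, 0, 2, 3) == "a"
assert pop_eliminate(slots, 1, 2, 3) == "b"
assert pop_eliminate(slots, 2, 2, 3) is None  # third pop goes to combining
```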

2.3 Elimination and Combining

  • Elimination: Immediately following batch freeze, index-aligned push/pop pairs (the i-th push and i-th pop with i < min(B.pushCountAtFreeze, B.popCountAtFreeze)) exchange values via B.eliminationArray[i] and return, requiring only two FAA instructions and a write/read to the array.
  • Combining: The “surviving” set of pushes (or pops)—that is, those not eliminated within the batch—are combined by a single thread which performs bulk insertion (or removal) on stackTop using a single atomic CAS.

This design avoids per-thread hashing or interaction on global arrays: each aggregator exclusively manages only its own batch and threads (Singh et al., 8 Jan 2026).
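The single-CAS splice performed by the combiner can be sketched as follows. CAS is simulated here with a lock, and the `Stack`, `Node`, and `combine_pushes` names are assumptions of the sketch, not the paper's identifiers.

```python
import threading

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

class Stack:
    def __init__(self):
        self.stackTop = None
        self._lock = threading.Lock()

    def cas_top(self, expected, new):
        """Compare-and-swap on stackTop, simulated with a lock."""
        with self._lock:
            if self.stackTop is expected:
                self.stackTop = new
                return True
            return False

def combine_pushes(stack, surviving_values):
    """One combiner links the batch's surviving pushes into a substack
    locally, then splices it onto stackTop with a single CAS."""
    while True:
        old_top = stack.stackTop
        sub_top = old_top
        for v in surviving_values:           # build the substack off to the side
            sub_top = Node(v, sub_top)
        if stack.cas_top(old_top, sub_top):  # one CAS publishes the whole batch
            return

s = Stack()
combine_pushes(s, [1, 2, 3])
assert s.stackTop.value == 3        # last-linked value is the new top
assert s.stackTop.next.value == 2
```

The point of the pattern is that contention on stackTop scales with the number of batches, not the number of operations.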

3. Blocking and Linearizability Properties

Linearization Points

  • Eliminated operations: The linearization point is when a pop reads a push’s node from eliminationArray[i].
  • Combined operations: Non-eliminated pushes or pops are linearized at the moment the batch combiner’s CAS (insertion or removal) succeeds.
  • The stack’s state invariant: At time t, stackTop is the result of all operations linearized prior to t; eliminated pairs leave no trace, and combined batches are applied atomically in order.

Blocking Semantics

SEC is blocking, not lock-free: threads may spin on the batch’s isFreezerDecided or isBatchApplied flags waiting for the batch leader (freezer/combiner) to proceed. Progress requires at least one thread per batch to be scheduled and complete the freeze/combining phase. Under typical fair scheduling, batches complete without indefinite starvation (Singh et al., 8 Jan 2026).
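The blocking hand-off can be illustrated with two threads: a non-combiner spins on isBatchApplied until the combiner sets it. Everything besides the flag name is an assumption of this sketch (the batch work itself is elided).

```python
import threading

class MiniBatch:
    def __init__(self):
        self.isBatchApplied = False
        self.result = None

def combiner(batch):
    batch.result = "applied"     # apply the batch to stackTop (elided)
    batch.isBatchApplied = True  # release the spinners

def waiter(batch, out):
    while not batch.isBatchApplied:  # blocking: progress depends on the combiner
        pass
    out.append(batch.result)

b = MiniBatch()
out = []
t = threading.Thread(target=waiter, args=(b, out))
t.start()
combiner(b)
t.join()
assert out == ["applied"]
```

If the combiner thread is descheduled before setting the flag, the waiter spins indefinitely, which is exactly why the design is blocking rather than lock-free.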

4. Complexity and Empirical Performance

Operation Complexity

  • In the elimination or combining fast path, each operation performs two FAA instructions, one TestSet (for freezer election), and O(1) CAS operations in total.
  • Under high contention, the batch size increases, maximizing elimination and combining opportunities and thus minimizing global pointer contention.
  • In the worst case, if a thread repeatedly misses the freeze boundary, it may retry across several batches (bounded in expectation by system load).

Empirical Results

Summary of key performance results:

| Machine        | Cores | Max Threads | SEC vs. Prior                      |
|----------------|-------|-------------|------------------------------------|
| Emerald Rapids | 24    | 56          | Up to 2× faster (all workloads)    |
| IceLake-SP     | 48    | 96          | 2.3×–2.6× faster up to 96 threads  |

Further highlights:

  • SEC is up to 6× faster than FA-Stack on push-only workloads at 56 threads.
  • SEC maintains a 1.8×–2.1× speedup over timestamp stacks (TSI) even under read-heavy workloads.
  • Batched elimination accounted for 78% of operations; combining served 22%.
  • The optimal configuration was K = 2 aggregators for the tested workloads; higher K led to excessive thread dispersion and reduced batch sizes/elimination rates (Singh et al., 8 Jan 2026).

5. Comparative Advantages and Applicable Workloads

SEC outperforms prior stacks—elimination-backoff (EB), flat-combining (FC), CCSynch, and TSI—by:

  • Dispersing contention across aggregators, avoiding high-CAS collision rates present in EB stacks.
  • Employing two-FAA elimination instead of EB’s triple-CAS sequence, reducing remote memory traffic.
  • Assigning only one thread as combiner per batch, sharply reducing the frequency of stackTop CAS operations.
  • Achieving strong scalability to large thread counts (≥32), scenarios where FC and CCSynch serialize too many operations under a single combiner.
  • Avoiding the global-scan overhead which hinders timestamp stacks, particularly in read-heavy regimes.

6. Operational Characterization and Deployment Guidelines

The SEC stack’s design is tuned for environments with:

  • High thread counts (≥ 32), where classic elimination/concurrent stacks suffer due to pointer contention and linearization bottlenecks.
  • Workloads that feature frequent updates (push/pop balance), in which the batch elimination achieves maximal cancellation and minimal global synchronization.
  • Systems where fast-path batching is not only beneficial for throughput but crucial for keeping remote-memory synchronization and pointer update costs manageable.

A plausible implication is that, in low-contention settings or in environments with highly skewed operation distributions, tuning K (the number of aggregators) and P (threads per aggregator) is necessary to maximize batching benefits while avoiding underutilization of elimination opportunities (Singh et al., 8 Jan 2026).
