Blocking Linearizable Stack Implementation
- The paper introduces the SEC stack, which integrates sharding, FAA counters, lightweight elimination, and combining to boost throughput.
- The design partitions operations into batches managed by aggregators, reducing contention by handling push/pop coordination locally.
- Empirical results indicate the SEC stack achieves up to 6× speedup over prior approaches, particularly in high-contention, multi-thread environments.
A blocking linearizable stack implementation provides a concurrent stack data structure that ensures strict linearizability—every operation appears to take effect instantaneously at some point between its invocation and response—while allowing threads to block under contention. The “Sharded Elimination and Combining” (SEC) stack is a recent, highly efficient realization of this paradigm, combining sharding, fetch-and-add (FAA) counters, lightweight elimination, and aggregation-based combining to achieve substantially higher throughput and scalability, particularly on high-contention, large-thread-count systems (Singh et al., 8 Jan 2026).
1. High-Level Design and Sharding
The SEC stack replaces the classic single-pointer Treiber stack structure with a set of “aggregators.” Each aggregator manages up to $k$ threads, where every thread is statically mapped to an aggregator by integer division of its thread ID, i.e., thread $t$ is assigned to aggregator $\lfloor t/k \rfloor$. Each aggregator organizes its incoming stack operations (push and pop) into successive “batches.” Within a batch, threads announce their operations and coordinate to apply elimination or combining.
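The static thread-to-aggregator mapping can be sketched in a few lines; the function name `aggregatorFor` and the value of `k` are illustrative assumptions, not identifiers from the paper:

```go
package main

import "fmt"

// aggregatorFor maps a thread ID to its aggregator by integer division,
// assuming k statically assigned threads per aggregator.
func aggregatorFor(threadID, k int) int {
	return threadID / k
}

func main() {
	k := 8 // hypothetical threads-per-aggregator setting
	for _, tid := range []int{0, 7, 8, 15, 16} {
		fmt.Printf("thread %2d -> aggregator %d\n", tid, aggregatorFor(tid, k))
	}
}
```

Because the mapping is static, a thread contends only with the other threads of its own aggregator, never with the whole thread population.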
Critical structures:
- Aggregators: independent managers for partitioning threads and their batches.
- Batch: Tracks push and pop announcements with FAA counters, keeps elimination slots in a per-batch array, and maintains freeze (batch-cutoff) snapshots.
- Global Stack: A singly-linked list identified by an atomic top pointer.
Sharding disperses contention and ensures that only a subset of threads contend for batch-level resources at any given time, substantially reducing cross-thread interference compared to centralized stacks (Singh et al., 8 Jan 2026).
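A minimal Go sketch of this data layout follows; the field and type names (`pushCnt`, `popCnt`, `frozen`, `elim`, `top`) are assumptions chosen to match the description above, not the paper's identifiers:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// node is a cell of the global singly-linked stack.
type node struct {
	value int
	next  *node
}

// batch tracks one round of announcements within a single aggregator.
type batch struct {
	pushCnt int64          // FAA counter for announced pushes
	popCnt  int64          // FAA counter for announced pops
	frozen  int32          // TestSet flag set by the batch freezer
	elim    []atomic.Value // elimination slots for push/pop pairing
}

// aggregator manages the current batch of its statically assigned threads.
type aggregator struct {
	cur *batch
}

// secStack shards threads across aggregators over one global list.
type secStack struct {
	aggs []aggregator
	top  unsafe.Pointer // atomic pointer to the head *node of the global stack
}

func newSECStack(numAggs, batchSlots int) *secStack {
	s := &secStack{aggs: make([]aggregator, numAggs)}
	for i := range s.aggs {
		s.aggs[i].cur = &batch{elim: make([]atomic.Value, batchSlots)}
	}
	return s
}

func main() {
	s := newSECStack(4, 8)
	fmt.Println(len(s.aggs), len(s.aggs[0].cur.elim)) // 4 8
}
```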
2. Key Algorithms: Push/Pop, Elimination, and Combining
SEC’s push and pop operations orchestrate synchronization via batch-local FAA and lightweight flag-passing, minimizing direct contention and expensive global synchronization.
2.1 Push Algorithm
A push thread:
- Allocates a new stack node and atomically obtains a unique sequence number in its batch via an FAA on the batch's push counter.
- Records its node in the batch’s elimination array.
- The first thread in a batch becomes the “freezer” by winning a TestSet on the batch's freeze flag; it snapshots the batch counters and initiates a new batch.
- Elimination phase: If the push's sequence number is less than both frozen counter snapshots, it is eliminated by the matching pop (and vice versa).
- Combining phase: If it “survived” elimination, exactly one thread per batch (determined by push/pop count parity) gathers the remaining pushes into a substack and splices it onto the global stack with a single CAS. Non-combiner threads spin on their batch's result flags to retrieve their result.
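The announcement and freezer-election steps above can be sketched with Go atomics; `batchState` and `announcePush` are illustrative names, and the CAS here stands in for the TestSet instruction (the counter snapshot and new-batch handoff are omitted):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// batchState holds just the state needed for the push fast path.
type batchState struct {
	pushCnt int64 // announced pushes (FAA counter)
	frozen  int32 // 0 = open, 1 = frozen
}

// announcePush atomically claims a batch-local sequence number (one FAA)
// and reports whether this thread won the freezer election (one TestSet,
// modeled here as a single-winner CAS on the freeze flag).
func announcePush(b *batchState) (seq int64, isFreezer bool) {
	seq = atomic.AddInt64(&b.pushCnt, 1) - 1 // FAA yields a unique slot index
	isFreezer = atomic.CompareAndSwapInt32(&b.frozen, 0, 1)
	return seq, isFreezer
}

func main() {
	b := &batchState{}
	s0, f0 := announcePush(b)
	s1, f1 := announcePush(b)
	fmt.Println(s0, f0, s1, f1) // first caller gets seq 0 and wins the freeze
}
```

Only the FAA and the TestSet touch shared state, which is why the fast path stays cheap even when many threads of one aggregator announce concurrently.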
2.2 Pop Algorithm
A pop thread symmetrically:
- Atomically increments the batch's pop counter via FAA and announces its operation.
- Freezer logic and batch transitioning proceed as in push.
- During elimination, pairs off with a waiting push (by index).
- Surviving pops collectively remove a prefix of the global stack via a single CAS (performed by the designated combiner), which dispenses stack values back to the waiting pop threads.
2.3 Elimination and Combining
- Elimination: Immediately following batch freeze, index-aligned push/pop pairs (the $i$-th push with the $i$-th pop, for $i$ below both frozen counts) exchange values via the elimination array and return, requiring only two FAA instructions and a write/read of the array slot.
- Combining: The “surviving” set of pushes (or pops)—that is, those not eliminated within the batch—are combined by a single thread, which performs bulk insertion (or removal) on the global stack using a single atomic CAS.
This design avoids per-thread hashing or interaction on global arrays: each aggregator exclusively manages only its own batch and threads (Singh et al., 8 Jan 2026).
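The index-aligned pairing logic reduces to taking a prefix; this toy model (function `eliminate`, operating on plain slices rather than the concurrent elimination array) is an assumption-laden illustration of which operations cancel and which survive:

```go
package main

import "fmt"

// eliminate pairs the i-th pop with the i-th push inside a frozen batch.
// pushes holds the values announced in the batch; min(#pushes, numPops)
// pairs cancel without ever touching the global stack.
func eliminate(pushes []int, numPops int) (matched, survivors []int) {
	n := numPops
	if len(pushes) < n {
		n = len(pushes)
	}
	return pushes[:n], pushes[n:]
}

func main() {
	pushes := []int{10, 20, 30}                // three announced pushes
	matched, survivors := eliminate(pushes, 2) // two announced pops
	fmt.Println(matched, survivors)            // pops receive 10 and 20; push 30 survives
}
```

Only the `survivors` slice ever reaches the combining phase, so a balanced push/pop mix drives global-stack traffic toward zero.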
3. Blocking and Linearizability Properties
Linearization Points
- Eliminated operations: The linearization point is when a pop reads a push’s node from the elimination array.
- Combined operations: Non-eliminated pushes or pops are linearized at the moment the batch combiner’s CAS (insertion or removal) succeeds.
- The stack’s state invariant: At any time $t$, the global stack is the result of all operations linearized prior to $t$; eliminated pairs leave no trace, and combined batches are applied atomically in order.
Blocking Semantics
SEC is blocking, not lock-free: threads may spin on the batch’s freeze or result flags waiting for the batch leader (freezer/combiner) to proceed. Progress requires at least one thread per batch to be scheduled and complete the freeze/combining phase. Under typical fair scheduling, batches complete without indefinite starvation (Singh et al., 8 Jan 2026).
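The spin-until-published pattern behind these blocking semantics can be shown with an atomic flag; `result`, `publish`, and `await` are illustrative names for a generic publish/spin pair, not identifiers from the paper:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// result is published once by the batch combiner; waiters spin until ready.
type result struct {
	ready int32
	value int
}

// publish writes the value, then releases spinners via an atomic store.
func publish(r *result, v int) {
	r.value = v
	atomic.StoreInt32(&r.ready, 1)
}

// await spins on the flag: blocking, since progress depends on the
// combiner thread being scheduled to call publish.
func await(r *result) int {
	for atomic.LoadInt32(&r.ready) == 0 {
	}
	return r.value
}

func main() {
	r := &result{}
	done := make(chan int)
	go func() { done <- await(r) }()
	publish(r, 42)
	fmt.Println(<-done) // 42
}
```

Go's atomic store/load pair gives the happens-before edge that makes the plain write to `value` visible to the spinner, mirroring why a descheduled combiner stalls its whole batch.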
4. Complexity and Empirical Performance
Operation Complexity
- In the elimination or combining fast path, each operation performs $2$ FAA instructions, $1$ TestSet (for freezer election), and at most $1$ CAS in total.
- Under high contention, the batch size increases, maximizing elimination and combining opportunities and thus minimizing global pointer contention.
- In the worst case, if a thread repeatedly misses the freeze boundary, it may retry across several batches (bounded in expectation by system load).
Empirical Results
Summary of key performance results:
| Machine | Cores | Max Threads | SEC vs. Prior |
|---|---|---|---|
| Emerald Rapids | 24 | 56 | Up to 2× faster (all workloads) |
| IceLake-SP | 48 | 96 | 2.3×–2.6× faster up to 96 threads |
Further highlights:
- SEC is markedly faster than FA-Stack on push-only workloads at 56 threads.
- SEC maintains a consistent speedup over timestamp stacks (TSI) even under read-heavy workloads.
- Batched elimination accounted for a substantial share of operations, with combining serving the remainder.
- An optimal threads-per-aggregator setting existed for the tested workloads; dispersing threads across more aggregators reduced batch sizes and elimination rates (Singh et al., 8 Jan 2026).
5. Comparative Advantages and Applicable Workloads
SEC outperforms prior stacks—elimination-backoff (EB), flat-combining (FC), CCSynch, and TSI—by:
- Dispersing contention across aggregators, avoiding high-CAS collision rates present in EB stacks.
- Employing two-FAA elimination instead of EB’s triple-CAS sequence, reducing remote memory traffic.
- Assigning only one thread as combiner per batch, sharply reducing the frequency of CAS operations.
- Achieving strong scalability to large thread counts (≥32), scenarios where FC and CCSynch serialize too many operations under a single combiner.
- Avoiding the global-scan overhead which hinders timestamp stacks, particularly in read-heavy regimes.
6. Operational Characterization and Deployment Guidelines
The SEC stack’s design is tuned for environments with:
- High thread counts, where classic elimination/concurrent stacks suffer due to pointer contention and linearization bottlenecks.
- Workloads that feature frequent updates (push/pop balance), in which the batch elimination achieves maximal cancellation and minimal global synchronization.
- Systems where fast-path batching is not only beneficial for throughput but crucial for keeping remote-memory synchronization and pointer update costs manageable.
A plausible implication is that, in low-contention settings or in environments with highly skewed operation distributions, tuning the number of aggregators and the number of threads per aggregator is necessary to maximize batching benefits while avoiding underutilization of elimination opportunities (Singh et al., 8 Jan 2026).
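As a minimal tuning aid under the assumptions above (a fixed threads-per-aggregator value $k$, static mapping), the aggregator count is just a ceiling division; `numAggregators` is an illustrative helper, not part of the paper's interface:

```go
package main

import "fmt"

// numAggregators returns how many aggregators a deployment needs for
// nThreads with k threads statically mapped per aggregator (ceiling division).
func numAggregators(nThreads, k int) int {
	return (nThreads + k - 1) / k
}

func main() {
	fmt.Println(numAggregators(96, 8))  // 12 aggregators
	fmt.Println(numAggregators(96, 16)) // 6 aggregators: larger batches, less dispersion
}
```

Raising $k$ shrinks the aggregator count and enlarges batches (more elimination opportunities); lowering it disperses contention further, the trade-off the deployment guidelines describe.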