Egalitarian Paxos: Leaderless SMR
- EPaxos is a leaderless state-machine replication protocol that enables any replica to coordinate command commits using explicit dependency sets.
- It leverages workload commutativity to achieve fast-path commits, while its EPaxos* variant resolves ballot ambiguities and restores linearizability guarantees.
- EPaxos defines optimal quorum and fault-tolerance bounds, making it a scalable and reliable choice for distributed systems under varying failure scenarios.
Egalitarian Paxos (EPaxos) is a fault-tolerant, leaderless state-machine replication (SMR) protocol that allows any replica to act as coordinator, avoiding the single-point-of-failure and performance bottlenecks inherent in classic Paxos and Raft protocols. EPaxos was designed to exploit commutativity in workloads for lower commit latency, but early formulations were complicated by specification ambiguities, subtle liveness and safety bugs, and incomplete correctness proofs. Subsequent work introduced a rigorous, simplified, and provably optimal variant, denoted EPaxos* (Ryabinin et al., 4 Nov 2025), and clarified the necessary ballot management to restore linearizability guarantees (Sutra, 2019).
1. Design Principles and Motivation
EPaxos departs from leader-based SMR by enabling any replica to initiate and coordinate the commitment of client commands. This design reduces client-to-leader round-trip times and avoids centralized failure bottlenecks. EPaxos employs the following principles:
- Leaderless coordination: Any replica can coordinate any command.
- Fast-path commit: If no concurrent conflicting commands exist and at most $e$ processes fail (with $f$ denoting the maximum tolerated number of process crashes), a conflict-free command can be committed in two message delays.
- Commutativity-aware ordering: Commands are modeled with explicit dependency sets, so nonconflicting (commutative) operations may be executed in any topological order.
This approach is particularly effective in workloads where command conflicts are rare. Given $n = 2f + 1$ replicas, EPaxos maintains nonzero throughput as long as at most $f$ processes, chosen arbitrarily, crash or become unreachable.
The initial promise of EPaxos in reducing global coordination and latency motivated further analysis of its protocol details and correctness properties.
2. Protocol Workflow and Data Structures
Each replica manages, for each command ID $\mathit{id}$:
- $\mathit{dep}[\mathit{id}]$ (the dependency set)
- $\mathit{phase}[\mathit{id}] \in \{\mathsf{pre}, \mathsf{acc}, \mathsf{com}\}$ (indicating pre-accepted, accepted, or committed)
- Ballot tracking: the original protocol used a single ballot variable, while the corrected variant uses two: a join/prepare ballot and an accepted ballot.
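The per-command state listed above could be represented roughly as follows. This is a minimal sketch, not the authors' data layout: the names `CommandState`, `Replica`, `joined_ballot`, and `accepted_ballot` are illustrative assumptions, and ballots are simplified to plain integers.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PRE_ACCEPTED = "pre"
    ACCEPTED = "acc"
    COMMITTED = "com"


@dataclass
class CommandState:
    """State one replica keeps for one command ID (field names are hypothetical)."""
    cmd_id: str
    payload: object = None
    deps: frozenset = frozenset()      # the dependency set
    phase: Phase = Phase.PRE_ACCEPTED  # pre-accepted, accepted, or committed
    joined_ballot: int = 0             # highest ballot joined (corrected variant)
    accepted_ballot: int = 0           # highest ballot at which a value was accepted


class Replica:
    def __init__(self):
        self.log = {}  # maps command ID -> CommandState

    def record(self, cmd_id, payload, deps):
        """Create or update the local record for a pre-accepted command."""
        state = self.log.setdefault(cmd_id, CommandState(cmd_id))
        state.payload = payload
        state.deps = frozenset(deps)
        return state
```

Keeping the two ballots as separate fields makes the distinction between "joined" and "accepted" explicit in the record itself.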
Client submission and normal-case commit:
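- On receiving a client's command $c$, a coordinator computes its initial dependency set, consisting of the conflicting commands it already knows about, and broadcasts it in a PreAccept message to all replicas.
- Each recipient records the command as pre-accepted in its local state and may extend the dependency set by adding new known conflicts, replying with PreAcceptOK.
- The coordinator collects enough PreAcceptOK replies (at least $n - f$ of the $n$ replicas for progress, where $f$ is the number of tolerated crashes) and takes the union of the reported dependency sets.
- Fast Path: If the replies come from a fast quorum (at least $n - e$ replicas, where $e$ is the fast-path failure threshold) and every reply reports the initial dependency set unchanged, broadcast Commit immediately (commit in two message delays).
- Slow Path: Otherwise, broadcast Accept with the merged dependency set, wait for AcceptOK responses from a quorum, then issue Commit.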
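The coordinator's choice between the two paths can be sketched as a pure function. This is a simplified illustration under stated assumptions: the function name `decide_path` and its return labels are hypothetical, the quorum sizes are taken to be $n - f$ (slow) and $n - e$ (fast), and timing concerns such as how long to wait for a fast quorum are ignored.

```python
def decide_path(initial_deps, replies, n, f, e):
    """Pick the commit path from the PreAcceptOK replies gathered so far.

    replies maps a replica name to the dependency set it reported
    (the coordinator's own reply is assumed to be included).
    Returns ("wait", None), ("fast", deps), or ("slow", deps).
    """
    initial = frozenset(initial_deps)
    if len(replies) < n - f:
        return ("wait", None)  # not enough replies to make progress
    unchanged = all(frozenset(d) == initial for d in replies.values())
    if len(replies) >= n - e and unchanged:
        return ("fast", initial)  # commit in two message delays
    # Slow path: propose the union of all reported dependencies via Accept.
    merged = initial.union(*replies.values())
    return ("slow", merged)
```

For example, with `n=5, f=2, e=2`, three identical replies yield the fast path, while a single reply that adds a dependency forces the slow path with the merged set.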
Commands are executed in a topological order induced by the dependency graph. If dependency cycles arise, a deterministic rule is used to break them.
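One way to derive such an execution order is to compute strongly connected components of the dependency graph and emit them dependencies-first. The sketch below uses Tarjan's algorithm and breaks cycles by sorting command IDs inside each component; that tie-break is an assumption chosen for illustration, since the text only requires that the rule be deterministic. The function name `execution_order` is hypothetical.

```python
def execution_order(deps):
    """Order committed commands so that dependencies run before dependents.

    deps maps each committed command ID to the set of IDs it depends on.
    Commands on a dependency cycle are ordered by sorted ID (assumed rule).
    """
    index, low = {}, {}
    stack, on_stack, order = [], set(), []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in sorted(deps.get(v, ())):
            if w not in deps:
                continue  # not committed locally; a real replica would wait
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a strongly connected component: pop it whole.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            order.extend(sorted(component))  # deterministic cycle break

    for v in sorted(deps):
        if v not in index:
            visit(v)
    return order
```

Tarjan's algorithm completes a component only after every component it depends on, so appending components as they finish already yields a dependencies-first order. The recursive form is fine for small graphs; a production replica would likely use an iterative variant.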
3. Recovery Procedures and Correctness
Original versions of EPaxos lacked both rigorous proofs and a robust recovery protocol, leading to safety and liveness violations, particularly in the handling of ballot transitions and concurrent recovery attempts (Ryabinin et al., 4 Nov 2025).
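Key steps in simplified recovery (EPaxos*):
- Each command is associated with a dynamic leader detector. If a process suspects that progress on the command has stalled and the detector currently designates that process, it starts recovery for the command.
- The recovery coordinator, in a new ballot, broadcasts a recovery request to all replicas and collects replies from a quorum.
- The recovery logic distinguishes three cases:
- If some reply (among those with the maximal accepted ballot) is committed, re-commit.
- If some reply is accepted, re-accept.
- Otherwise, search for a sufficiently large subset of the quorum whose members all report the pre-accepted phase with identical dependency sets. If such a subset exists, propose that dependency set.
- The validator broadcasts a validation request to the quorum, collecting potential invalidations before proceeding to commit or abort.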
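The three-way case analysis above can be sketched as follows. This is an illustration only, not the EPaxos* recovery algorithm: the reply layout (`phase`, `deps`, `abal`), the function name `recovery_decision`, and the `min_matching` threshold are all hypothetical, and the outcome when no case applies is left as "undetermined" because the text does not specify it.

```python
from collections import Counter


def recovery_decision(replies, min_matching):
    """Classify quorum replies during recovery of one command.

    Each reply is a dict with keys "phase" ("pre", "acc", or "com"),
    "deps" (a frozenset), and "abal" (accepted ballot, an int).
    min_matching is an assumed threshold on identical pre-accepted replies.
    """
    if not replies:
        return ("undetermined", None)
    top = max(r["abal"] for r in replies)
    latest = [r for r in replies if r["abal"] == top]
    for r in latest:
        if r["phase"] == "com":
            return ("commit", r["deps"])  # case 1: re-commit
    for r in latest:
        if r["phase"] == "acc":
            return ("accept", r["deps"])  # case 2: re-accept
    # Case 3: look for enough pre-accepted replies with identical dependencies.
    counts = Counter(r["deps"] for r in replies if r["phase"] == "pre")
    if counts:
        deps, count = counts.most_common(1)[0]
        if count >= min_matching:
            return ("validate", deps)  # candidate still has to pass validation
    return ("undetermined", None)
```

Returning "validate" rather than "commit" in the third case reflects that the candidate dependency set is only proposed; the validation round decides whether it is committed or aborted.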
Table: Comparison of Recovery Logic
| Property | Original EPaxos | EPaxos* |
|---|---|---|
| Ballot variables | Single | Two per command |
| Recovery state pollution | TentativePreAccept | Stateless validation |
| Deadlock possibility | Potential, unresolved | Avoided through Waiting |
Agreement and visibility invariants are carefully established:
- No two commits for the same command can select different dependencies.
- For any two conflicting committed commands, at least one must appear in the other's dependency set (visibility).
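Both invariants are easy to state as checks over a recorded trace, which can be handy in tests or simulations. The sketch below assumes a hypothetical trace format and a caller-supplied `conflicts` predicate; neither comes from the protocol specification.

```python
from itertools import combinations


def agreement_holds(commits):
    """commits: iterable of (replica, cmd_id, deps) commit records.

    True if every command was committed with a single dependency set.
    """
    chosen = {}
    for _replica, cmd_id, deps in commits:
        deps = frozenset(deps)
        if chosen.setdefault(cmd_id, deps) != deps:
            return False
    return True


def visibility_holds(committed_deps, conflicts):
    """committed_deps: dict mapping cmd_id to its committed dependency set.

    conflicts(a, b) is an assumed predicate saying whether a and b conflict.
    True if each conflicting pair has one command in the other's dependencies.
    """
    for a, b in combinations(sorted(committed_deps), 2):
        if conflicts(a, b) and a not in committed_deps[b] and b not in committed_deps[a]:
            return False
    return True
```

Checks like these only detect violations after the fact; they do not replace the proofs that the protocol maintains the invariants.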
4. Critical Bugs and Specification Corrections
Sutra (2019) demonstrates the consequences of the original EPaxos conflating the current ballot with the ballot of the accepted value:
- Only a single ballot variable was maintained, omitting the classic distinction between "highest ballot joined" and "highest ballot at which a value was accepted."
- In ballot transitions during recovery, this allows divergent decisions for the same instance, violating the invariant that all replicas should agree on dependencies for any command.
- The fix is to adopt two ballot variables per instance, stored alongside the accepted value:
- the highest ballot seen (joined)
- the highest ballot at which an accepted value was stored
- the dependency set accepted at that ballot
- This amendment restores the "safe value" construction, ensuring that any leader picks the dependencies accepted at the maximal lower ballot, which is a core Paxos safety property.
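The two-ballot bookkeeping can be illustrated with a small acceptor-style sketch. It follows the classic Paxos pattern rather than the exact EPaxos message flow, and the names `Instance`, `bal`, `abal`, `adeps`, and `safe_value` are assumptions made for this example.

```python
class Instance:
    """Per-instance acceptor state with separate joined and accepted ballots."""

    def __init__(self):
        self.bal = 0       # highest ballot joined
        self.abal = 0      # highest ballot at which a value was accepted
        self.adeps = None  # dependency set accepted at abal, if any

    def on_prepare(self, b):
        """Join ballot b; report what was accepted and at which ballot."""
        if b <= self.bal:
            return None  # stale ballot, reject
        self.bal = b     # abal and adeps deliberately stay untouched
        return (self.abal, self.adeps)

    def on_accept(self, b, deps):
        """Accept deps at ballot b unless a higher ballot was joined."""
        if b < self.bal:
            return False
        self.bal = self.abal = b
        self.adeps = frozenset(deps)
        return True


def safe_value(promises, default):
    """Pick the value accepted at the highest ballot among the promises.

    promises is a non-empty list of (abal, adeps) pairs; default is used
    only when no replica in the quorum has accepted anything.
    """
    _abal, deps = max(promises, key=lambda p: p[0])
    return default if deps is None else deps
```

With a single ballot variable, a replica that accepted a value at ballot 1 and then joined ballot 3 would report its value as if it had been accepted at ballot 3, which can mislead a later recovery. Keeping `abal` separate preserves the true acceptance ballot.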
5. Quorum Size, Fault Tolerance, and Optimality
EPaxos* generalizes quorum and failure parameters across the space of possible tradeoffs: $n \ge \max\{2e + f - 1,\ 2f + 1\}$, where:
- $n$: total replicas
- $f$: crash-tolerance for liveness
- $e$: maximum failures for "fast-path" commit
Theorem 2 (Ryabinin et al., 4 Nov 2025) establishes this as a global lower bound: no SMR protocol (with $f$-crash resilience and $e$-fastness) can do better. This subsumes the original EPaxos setting ($n = 2f + 1$, $e = \lceil (f+1)/2 \rceil$) but further admits all $f$ and $e$ satisfying the bound, yielding an optimal protocol.
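As a quick arithmetic check, the bound can be evaluated for concrete thresholds. The sketch below assumes the bound has the form $n \ge \max\{2e + f - 1,\ 2f + 1\}$ and that the original EPaxos fast threshold is $e = \lceil (f+1)/2 \rceil$; the function names are illustrative.

```python
def min_replicas(f, e):
    """Smallest n allowed by the assumed bound n >= max(2e + f - 1, 2f + 1)."""
    return max(2 * e + f - 1, 2 * f + 1)


def original_fast_threshold(f):
    """ceil((f + 1) / 2), computed with integer arithmetic."""
    return (f + 2) // 2


def configuration_ok(n, f, e):
    """True if n replicas suffice for f-resilience and e-fastness."""
    return n >= min_replicas(f, e)
```

For instance, `min_replicas(2, 2)` gives 5, matching the original setting of $n = 2f + 1$ with $f = 2$, while raising the fast threshold to $e = 3$ at the same $f$ requires 7 replicas.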
Failure scenarios and liveness: EPaxos* ensures that, after recovery, a single process can always make progress via ballot preemption, and Waiting messages are introduced to detect and resolve cyclic deadlocks during recovery.
6. Comparison with Classic Paxos and Related Work
Classic Paxos and Raft rely on a single distinguished leader, which introduces latency and throughput issues under asynchronous, partitioned, or geo-distributed deployment. EPaxos dispenses with leader reliance, instead leveraging commutativity for fast-path commits and uniform throughput irrespective of coordinator placement.
Key distinctions:
- Leader elimination: Critical bottleneck removed; all replicas are equivalent.
- Commutativity-exploiting dependency graphs: Only conflicting commands have enforced order; commutative commands can commit in parallel.
- Formal proofs and correction: Early ambiguous specification and incomplete proofs are fully rectified in EPaxos* (Ryabinin et al., 4 Nov 2025).
Several replication protocols are inspired by, or refine, EPaxos's leaderless, dependency-graph paradigm. The rigorous analysis of ballot management and recovery protocol correctness in (Sutra, 2019) further informs protocol design for safe, linearizable replication in asynchronous, crash-prone environments.
7. Formal Results and Theoretical Foundations
Theoretical underpinnings of EPaxos* include:
- Theorem 2 (Lower Bound): $n \ge \max\{2e + f - 1,\ 2f + 1\}$ is mandatory for $f$-resilient, $e$-fast SMR.
- Theorem 3 (Correctness): EPaxos* is $f$-resilient, $e$-fast, and satisfies both safety (agreement and visibility via ballots, dependencies, and validation) and liveness (via single-coordinator mechanisms and deadlock-avoidance).
- Lemma 4 (Validation Safety): In recovery, if enough pre-accept responses with identical dependencies exist in the quorum, only this dependency set is valid for commit.
- Lemma 5 (Wait Abort): If a process observes "Waiting" from more than a protocol-defined threshold of processes regarding a command, then that command could not have committed on the fast path, and abort is safe.
These results establish EPaxos* as the first provably safe, leaderless SMR protocol optimal across all relevant parameters, resolving prior ambiguity, suboptimality, and correctness holes.