Egalitarian Paxos: A Leaderless Consensus Protocol
- Egalitarian Paxos is a leaderless state-machine replication protocol that orders client commands collaboratively using quorum-based voting.
- It commits conflict-free commands in two message delays and falls back to a slow path for conflicting commands or failures.
- EPaxos* refines the original design with a stateless validation phase that simplifies recovery and ensures rigorous correctness.
Egalitarian Paxos is a leaderless state-machine replication protocol designed to overcome the single-leader bottleneck of classical consensus protocols such as Paxos and Raft. In contrast to leader-based designs, Egalitarian Paxos allows all processes (“replicas”) to submit and order client commands collaboratively by leveraging quorums and exploiting the commutativity of commands. This enables fast decisions for conflict-free commands, load balancing, and resilience to failures, at the cost of significant design complexity. Subsequent work has identified subtle errors in its specification, prompting revisions and the development of a simpler, rigorously proven version, referred to as EPaxos*.
1. Protocol Structure and Motivation
Classical state-machine replication (SMR) protocols such as Paxos and Raft sequence client commands via a distinguished leader. This design creates two main drawbacks: (i) the leader becomes a single point of failure; (ii) clients not co-located with the leader incur additional message delays (3 delays vs. 2 for the leader’s own proposals). Egalitarian Paxos (EPaxos) removes this role entirely, permitting any replica to propose commands which are collaboratively ordered via quorum-based voting.
A key innovation is the collaborative determination of dependencies among commands, capitalizing on the observation that many service operations commute (i.e., their order does not affect the state or outputs). When submitted commands commute, EPaxos enables their commit in two message delays (“fast path”). For conflicting (non-commutative) commands or in the presence of failures, EPaxos guarantees progress via a slow (classic Paxos-style) path.
Each process maintains a set of identifiers and tracks the dependencies between commands to construct a partial order. The protocol is parameterized by $n$ (the number of replicas), $f$ (maximum failures tolerated for slow-path liveness), and $e$ (maximum failures tolerated for fast-path commit). A fundamental bound ensures:
$$n \ge \max\{2e + f - 1,\; 2f + 1\}$$
This balances the trade-off between rapid agreement (“fast-path” resilience, $e$) and fault tolerance ($f$), and the bound is provably optimal for any leaderless consensus protocol matching these properties (Ryabinin et al., 4 Nov 2025).
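The relationship between the three parameters can be checked mechanically. The following is a minimal sketch, assuming the bound takes the form n ≥ max(2e + f − 1, 2f + 1) reported for EPaxos* (Ryabinin et al., 4 Nov 2025) and that the fast and slow quorums have sizes n − e and n − f as in the pseudocode of Section 3; the function name `quorum_sizes` is hypothetical.

```python
def quorum_sizes(n: int, f: int, e: int) -> tuple[int, int]:
    """Return (fast_quorum, slow_quorum) = (n - e, n - f).

    Assumption: a configuration is admissible when
    n >= max(2e + f - 1, 2f + 1); otherwise a ValueError is raised.
    """
    if n <= 0 or f < 0 or e < 0:
        raise ValueError("n must be positive; f and e must be non-negative")
    required = max(2 * e + f - 1, 2 * f + 1)
    if n < required:
        raise ValueError(f"n={n} is too small: need n >= {required}")
    return n - e, n - f


if __name__ == "__main__":
    print(quorum_sizes(5, 2, 2))  # five replicas, f = 2, e = 2
    print(quorum_sizes(7, 3, 2))  # seven replicas, f = 3, e = 2
```

Under this assumption, raising e shrinks the fast quorum but increases the number of replicas required for a given f.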
2. System Model and Command Dependencies
The protocol executes on a set of $n$ replicas in an asynchronous, reliable message-passing environment with at most $f$ crash-stop failures. Replicas may simultaneously propose commands on behalf of clients. Each command has a unique identifier. Two commands $c$ and $c'$ commute if their execution order is irrelevant: executing them in either order from any state yields the same final state and the same outputs.
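As an illustration of commutativity, consider a key-value store. The sketch below is not taken from the protocol specification: the `Command` type and the `conflicts` predicate are hypothetical, and the rule is deliberately conservative (two commands conflict when they touch the same key and at least one of them writes).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    op: str                      # "get" or "put" (hypothetical operation set)
    key: str
    value: Optional[str] = None  # only meaningful for "put"


def conflicts(a: Command, b: Command) -> bool:
    """Conservative conflict test: same key and at least one write.

    Commands for which this returns False commute: running them in
    either order gives the same store contents and the same results.
    """
    return a.key == b.key and "put" in (a.op, b.op)


if __name__ == "__main__":
    put_x = Command("put", "x", "1")
    get_x = Command("get", "x")
    put_y = Command("put", "y", "2")
    print(conflicts(put_x, get_x))  # same key, one write
    print(conflicts(put_x, put_y))  # different keys
```

A finer predicate (for example, treating two identical writes as commuting) would admit more fast-path commits, at the price of a more complex definition.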
Each replica maintains, for every command identifier id, the payload cmd[id], a dependency set dep[id], and per-command phase and ballot variables phase[id] and bal[id]. The dependency set dep[id] comprises all known conflicting commands active at the time of proposal or recovery. The final execution order of commands is determined by a topological sort of the dependency graph formed by the dependencies across all committed commands.
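The execution step can be sketched as a topological sort with deterministic tie-breaking. This is an illustrative simplification, not the protocol's execution algorithm: it assumes the committed dependency graph is acyclic (cyclic dependencies need an extra cycle-breaking rule, which is not shown) and it ignores dependencies on commands that are not in the committed set, whereas a real replica would wait for them to commit. The name `execution_order` is hypothetical.

```python
import heapq


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Order committed commands so that dependencies execute first.

    deps maps a committed command id to the ids it depends on.
    Ties are broken by the smallest id (Kahn's algorithm with a heap).
    """
    # Keep only dependencies on commands that are themselves committed.
    pending = {cid: {d for d in ds if d in deps} for cid, ds in deps.items()}
    dependents: dict[str, list[str]] = {cid: [] for cid in deps}
    for cid, ds in pending.items():
        for d in ds:
            dependents[d].append(cid)

    ready = [cid for cid, ds in pending.items() if not ds]
    heapq.heapify(ready)
    order: list[str] = []
    while ready:
        cid = heapq.heappop(ready)
        order.append(cid)
        for nxt in dependents[cid]:
            pending[nxt].discard(cid)
            if not pending[nxt]:
                heapq.heappush(ready, nxt)

    if len(order) != len(deps):
        raise ValueError("dependency cycle: a cycle-breaking rule is needed")
    return order


if __name__ == "__main__":
    print(execution_order({"a": set(), "b": {"a"}, "c": set(), "d": {"b", "c"}}))
```

Because every replica applies the same tie-breaking rule to the same committed graph, all replicas derive the same order.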
3. Fast and Slow Path Commit Logic
3.1 Fast Path
The fast path operates in ballot 0. When a client command is submitted, the proposer selects the initial set of known conflicts and broadcasts a PreAccept message to all replicas. Replicas that have not participated in a higher ballot pre-accept the command, merge in any newly discovered conflicts, and respond with their computed dependency set.
Pseudocode (abbreviated):
    – Client submission, at proposer p:
        id ← newId(p)
        initDep[id] ← { id' | p knows cmd[id'] and cmd[id'] conflicts with cmd }
        send PreAccept(id, cmd, initDep[id]) to ALL
    – On receive PreAccept(id, c, D), if bal[id] = 0 and phase[id] = none:
        cmd[id] ← c
        initDep[id] ← D
        dep[id] ← D ∪ { id' | id' ≠ id and cmd[id'] conflicts with c }
        phase[id] ← PreAccepted
        reply PreAcceptOK(id, dep[id]) to sender
    – On proposer collecting PreAcceptOK(id, D_q) from every q in a set Q:
        D ← ⋃ { D_q | q ∈ Q }
        if |Q| ≥ n−e and ∀q ∈ Q: D_q = initDep[id]:
            send Commit(0, id, cmd[id], D) to ALL
        else if |Q| ≥ n−f:
            send Accept(0, id, cmd[id], D) to ALL
If the proposer receives PreAcceptOK responses from a fast quorum of $n - e$ replicas, all matching its initial dependency set, it commits the command in two message delays. Otherwise, it transitions to the slow path. Taking the fast path requires that no more than $e$ processes fail and that all concurrent commands commute.
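The proposer's decision after collecting replies can be sketched as follows. This mirrors the abbreviated pseudocode above rather than a full implementation: the function name `decide` and the return convention are hypothetical, `replies` is assumed to include the proposer's own dependency set, and the slow path is assumed to propose the union of the reported dependency sets. Timeouts and message handling are omitted.

```python
def decide(n: int, f: int, e: int, init_dep: frozenset, replies: list):
    """Choose the next step from the PreAcceptOK dependency sets.

    Returns ("commit", deps) on the fast path, ("accept", deps) when
    only a slow quorum is usable, and ("wait", None) otherwise.
    """
    # Fast path: a quorum of n - e replies, all equal to the initial set.
    if len(replies) >= n - e and all(d == init_dep for d in replies):
        return "commit", init_dep
    # Slow path: a quorum of n - f replies; propose the merged dependencies.
    if len(replies) >= n - f:
        return "accept", frozenset().union(init_dep, *replies)
    return "wait", None


if __name__ == "__main__":
    a = frozenset({"a"})
    print(decide(5, 2, 2, a, [a, a, a]))
    print(decide(5, 2, 2, a, [a, frozenset({"a", "b"}), a]))
    print(decide(5, 2, 2, a, [a, a]))
```

In the second call one replica reports an extra conflict, so the dependency sets no longer match and the command must go through the Accept phase with the merged set.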
3.2 Slow Path
When the fast-path quorum ($n - e$ replicas) cannot be assembled, or differing dependency sets are reported, EPaxos reverts to a classical, ballot-driven Paxos phase using a “slow” quorum of size $n - f$. This ensures commands are safely chosen and recoverable under up to $f$ failures.
4. Failure Recovery and Protocol Refinements
Original EPaxos’s recovery protocol suffered from both technical complexity and substantial correctness pitfalls, notably in the management of dependencies and ballot state during coordinator hand-offs. This complexity was exacerbated by state changes prior to safety checks in the original “TentativePreAccept” phase, occasionally resulting in deadlocks (Sutra, 2019), and, more subtly, by a flawed single-ballot-variable reconstruction that allowed replicas to forget previously voted dependency sets.
The revised protocol, EPaxos*, replaces these mechanisms with a stateless “Validate” phase. During recovery:
- A recovering coordinator queries all replicas with a fresh ballot.
- If a committed or accepted value exists in the highest observed ballot, that value is chosen.
- If a potential fast-path commit is recoverable and validated (i.e., a quorum of PreAccepted states with matching dependencies can be shown), that value is promoted.
- Otherwise, the slot is filled with a no-op.
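The decision rule in the list above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the EPaxos* recovery algorithm itself: the `Reply` record, the `NOOP` marker, and the `recover` function are hypothetical names, and the whole Validate phase is abstracted into a caller-supplied `validate` callback that returns a recoverable fast-path candidate or `None`.

```python
from dataclasses import dataclass
from typing import Callable, Optional

NOOP = "noop"  # hypothetical marker for the no-op command


@dataclass(frozen=True)
class Reply:
    phase: str            # "none", "preaccepted", "accepted" or "committed"
    voted_ballot: int     # last ballot in which the replica voted
    cmd: Optional[str]
    deps: frozenset


def recover(replies: list, validate: Callable[[list], Optional[Reply]]):
    """Pick the (command, dependencies) pair the new coordinator proposes."""
    # 1. A committed value is final.
    for r in replies:
        if r.phase == "committed":
            return r.cmd, r.deps
    # 2. Otherwise take the accepted value from the highest voted ballot.
    accepted = [r for r in replies if r.phase == "accepted"]
    if accepted:
        best = max(accepted, key=lambda r: r.voted_ballot)
        return best.cmd, best.deps
    # 3. Otherwise promote a validated potential fast-path commit, if any.
    candidate = validate(replies)
    if candidate is not None:
        return candidate.cmd, candidate.deps
    # 4. Nothing can have been committed: fill the slot with a no-op.
    return NOOP, frozenset()


if __name__ == "__main__":
    old = Reply("accepted", 1, "put x", frozenset({"a"}))
    new = Reply("accepted", 2, "put x", frozenset({"a", "b"}))
    print(recover([old, new], lambda rs: None))
```

The callback keeps the sketch honest about what it leaves out: deciding whether a fast-path commit may have happened is the subtle part of recovery.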
Key state variables now include both a “last ballot joined” and a “last ballot voted,” as well as the last-voted dependency set, enforcing the invariant that no process “forgets” previously accepted values upon moving to a higher ballot. Failure to observe this distinction permits divergent dependencies and inconsistent orders for conflicting commands, violating linearizability; this issue is explicitly discussed and resolved in rigorous TLA+ formalizations (Sutra, 2019). EPaxos*’s recovery protocol is thus both simpler and correct.
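The distinction between the two ballots can be made concrete with a small per-command state record. The sketch below is illustrative only: the `Slot` class and its field names `bal`, `abal`, and `deps` are assumptions chosen for the example, not identifiers taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    bal: int = 0                    # last ballot joined
    abal: int = 0                   # last ballot in which a vote was cast
    deps: frozenset = frozenset()   # dependency set voted for in abal

    def join(self, b: int) -> bool:
        """Join a higher ballot without touching the recorded vote."""
        if b <= self.bal:
            return False
        self.bal = b
        return True

    def vote(self, b: int, deps) -> bool:
        """Vote in ballot b unless a higher ballot was already joined."""
        if b < self.bal:
            return False
        self.bal = self.abal = b
        self.deps = frozenset(deps)
        return True


if __name__ == "__main__":
    s = Slot()
    s.vote(0, {"a"})
    s.join(3)
    # The replica has joined ballot 3 but still reports its ballot-0 vote.
    print(s.bal, s.abal, sorted(s.deps))
```

With a single ballot variable, joining ballot 3 would overwrite the evidence that the vote for `{"a"}` was cast in ballot 0, which is the kind of forgetting the invariant rules out.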
5. Correctness and Optimality
The protocol’s safety is governed by two invariants:
- Agreement: No two commits for the same command identifier differ in payload or dependency set.
- Visibility: For any pair of conflicting committed commands, at least one command appears in the other’s dependency set.
These ensure that every conflicting pair of commands induces a directed edge in the final dependency graph, which, upon topological sorting (with deterministic tie-breaking), yields a unique consistent execution order (i.e., linearizability).
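Both invariants are easy to state as checks over a recorded trace, which can be useful in tests. The following is a minimal sketch with hypothetical names (`agreement_violations`, `visibility_violations`); it assumes commits are logged as (id, payload, dependency set) triples and that a `conflict` predicate over command ids is available.

```python
from itertools import combinations
from typing import Callable


def agreement_violations(commits: list) -> set:
    """Ids that were committed with differing payloads or dependency sets.

    commits is a list of (command id, payload, dependency set) triples,
    one per Commit observed at any replica.
    """
    first: dict = {}
    bad: set = set()
    for cid, payload, deps in commits:
        if first.setdefault(cid, (payload, deps)) != (payload, deps):
            bad.add(cid)
    return bad


def visibility_violations(deps: dict, conflict: Callable[[str, str], bool]) -> list:
    """Conflicting committed pairs where neither depends on the other."""
    return [
        (a, b)
        for a, b in combinations(sorted(deps), 2)
        if conflict(a, b) and a not in deps[b] and b not in deps[a]
    ]


if __name__ == "__main__":
    deps = {"a": frozenset(), "b": frozenset({"a"}), "c": frozenset()}
    print(visibility_violations(deps, lambda x, y: True))
```

An empty result from both functions on every trace is what the Agreement and Visibility invariants require.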
Liveness (“non-blocking progress”) is guaranteed for:
- f-resilience: at most f crash failures, via classic slow-path Paxos-style recovery.
- e-fast path: at most e failures and commuting commands allow two-message-delay commit.
EPaxos* is shown to be optimal: for any protocol achieving f-resilience and e-fast path, $n \ge \max\{2e + f - 1,\; 2f + 1\}$ is both necessary and sufficient (Ryabinin et al., 4 Nov 2025).
6. Comparison to Classical and Original EPaxos
| Protocol Variant | Leaderless | Fast Path Delays | Fast Quorum Size | Recovery Design | Known Bugs |
|---|---|---|---|---|---|
| Classic Paxos | No | 3 | N/A | Multi-ballot, majority | None reported |
| Original EPaxos | Yes | 2 | n − e | TentativePreAccept (stateful) | Ballot confusion, deadlocks |
| EPaxos* (revised) | Yes | 2 | n − e | Stateless Validate | None reported |
Original EPaxos’s recovery protocol conflated the ballot a replica had joined with the ballot in which it last voted and allowed state changes before agreement, enabling correctness violations and deadlocks (Sutra, 2019). EPaxos* fixes these issues by adopting a stateless Validate phase and properly accounting for all prior votes during recovery, which eliminates the deadlock conditions. The resulting protocol is simpler to implement and admits rigorous correctness proofs.
7. Practical Significance and Impact
Egalitarian Paxos’s leaderless, quorum-centric design makes it attractive in wide-area deployments with geographically distributed clients, balancing commit responsibility among processes and minimizing latency for non-conflicting operations. The generalized parameterization in EPaxos* enables practitioners to trade crash resilience for fast-path availability by choosing $e$ and $f$ to suit application requirements. The formal clarification and corrections to the protocol’s recovery logic highlight the necessity of precise specification and careful implementation, even for established distributed systems protocols.
EPaxos has served as the foundation for further research and protocol design in distributed consensus, and EPaxos* now provides a robust, theoretically optimal, and practically implementable reference point for future developments in high-performance, leaderless state-machine replication protocols (Ryabinin et al., 4 Nov 2025, Sutra, 2019).