Egalitarian Paxos: A Leaderless Consensus Protocol

Updated 11 November 2025

Egalitarian Paxos is a leaderless state-machine replication protocol that orders client commands collaboratively using quorum-based voting.
It commits conflict-free commands in two message delays and falls back to a slow path for conflicting commands or failures.
EPaxos* refines the original design with a stateless validation phase that simplifies recovery and ensures rigorous correctness.

Egalitarian Paxos is a leaderless state-machine replication protocol designed to overcome the single-leader bottleneck of classical consensus protocols such as Paxos and Raft. In contrast to leader-based designs, Egalitarian Paxos allows all processes (“replicas”) to submit and order client commands collaboratively by leveraging quorums and exploiting the commutativity of commands. This enables fast decisions for conflict-free commands, load balancing, and resilience to failures, at the cost of significant design complexity. Subsequent work has identified subtle errors in its specification, prompting revisions and the development of a simpler, rigorously correct version, referred to as EPaxos*.

1. Protocol Structure and Motivation

Classical state-machine replication (SMR) protocols such as Paxos and Raft sequence client commands via a distinguished leader. This design creates two main drawbacks: (i) the leader becomes a single point of failure; (ii) clients not co-located with the leader incur additional message delays (3 delays vs. 2 for the leader’s own proposals). Egalitarian Paxos (EPaxos) removes this role entirely, permitting any replica to propose commands which are collaboratively ordered via quorum-based voting.

A key innovation is the collaborative determination of dependencies among commands, capitalizing on the observation that many service operations commute (i.e., their order does not affect the state or outputs). When submitted commands commute, EPaxos enables their commit in two message delays (“fast path”). For conflicting (non-commutative) commands or in the presence of failures, EPaxos guarantees progress via a slow (classic Paxos-style) path.

Each process maintains a set of command identifiers and tracks the dependencies between commands to construct a partial order. The protocol is parameterized by $n$ (the number of replicas), $f$ (maximum failures tolerated for slow-path liveness), and $e$ (maximum failures tolerated for fast-path commit). These parameters must satisfy:

$n \geq \max\{2e + f - 1,\; 2f + 1\}.$

This balances the trade-off between rapid agreement (“fast-path” resilience, $e$) and fault tolerance ($f$), and the bound is provably optimal for any leaderless consensus protocol matching these properties (Ryabinin et al., 4 Nov 2025).
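
As a small worked illustration of the bound, the following Python sketch computes the smallest admissible $n$ for given $e$ and $f$, together with the quorum sizes $n-e$ and $n-f$ used later in the text. The function names `min_replicas` and `quorum_sizes` are hypothetical helpers introduced here, not part of any EPaxos implementation.

```python
def min_replicas(e: int, f: int) -> int:
    """Smallest n with n >= max(2e + f - 1, 2f + 1)."""
    if e < 0 or f < 0:
        raise ValueError("e and f must be non-negative")
    return max(2 * e + f - 1, 2 * f + 1)


def quorum_sizes(n: int, e: int, f: int) -> dict:
    """Fast (n - e) and slow (n - f) quorum sizes for an admissible n."""
    if n < min_replicas(e, f):
        raise ValueError("n violates n >= max(2e + f - 1, 2f + 1)")
    return {"fast": n - e, "slow": n - f}


if __name__ == "__main__":
    for e, f in [(1, 1), (2, 2), (3, 2)]:
        n = min_replicas(e, f)
        print(e, f, n, quorum_sizes(n, e, f))
```

For example, $e = f = 2$ gives $n = \max\{5, 5\} = 5$, while raising $e$ to 3 with $f = 2$ requires $n = 7$.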

2. System Model and Command Dependencies

The protocol executes on a set of $n$ replicas $\{p_1,\ldots,p_n\}$ in an asynchronous, reliable message-passing environment with at most $f$ crash-stop failures. Replicas may simultaneously propose commands on behalf of clients. Each command $c$ has a unique identifier. Two commands $c$ and $c'$ commute if their execution order is irrelevant: $c \diamond c'$.

Each replica maintains, for every command $c$, the payload, a dependency set, and per-command phase and ballot variables. The dependency set for $c$, $\mathrm{deps}(c)$, comprises all known conflicting commands active at the time of proposal or recovery. The final execution order of commands is determined by a topological sort of the dependency graph formed by dependencies across all committed commands.
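
To make the notions of conflict and dependency set concrete, here is a minimal Python sketch for a key-value service. The command model (`get`/`put` on a key) and the rule that two commands conflict when they touch the same key and at least one is a write are assumptions chosen for illustration; the protocol itself only requires some conflict relation. The names `Command`, `conflicts`, and `local_deps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    op: str   # "get" or "put" (assumed command model)
    key: str


def conflicts(a: Command, b: Command) -> bool:
    """Assumed conflict relation: same key and at least one write."""
    return a.key == b.key and "put" in (a.op, b.op)


def local_deps(new_cmd: Command, known: dict) -> frozenset:
    """Identifiers of locally known commands that conflict with new_cmd."""
    return frozenset(cid for cid, c in known.items() if conflicts(new_cmd, c))


if __name__ == "__main__":
    known = {
        "a": Command("put", "x"),
        "b": Command("get", "x"),
        "c": Command("put", "y"),
    }
    print(sorted(local_deps(Command("get", "x"), known)))
    print(sorted(local_deps(Command("put", "x"), known)))
```

Two reads of the same key commute under this relation, so a read only depends on earlier writes to its key, whereas a write depends on every earlier command on that key.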

3. Fast and Slow Path Commit Logic

3.1 Fast Path

The fast path operates in ballot 0. When a client command is submitted, a proposer selects the initial set of known conflicts and broadcasts a PreAccept message to all replicas. Replicas that have not participated in earlier ballots pre-accept the command, merging any new discovered conflicts, and respond with their computed dependency set.

Pseudocode (abbreviated):

– Client submission:
    proposer p: 
    id ← newId(p)
    initDep[id] ← { id' | p knows cmd[id'] conflicting }
    send PreAccept(id, cmd, initDep[id]) to ALL

– On receive PreAccept(id, c, D) if bal[id]=0, phase[id]=none:
    cmd[id] ← c
    initDep[id] ← D
    dep[id] ← D ∪ { id' | cmd[id'] conflicts c }
    phase[id] ← PreAccepted
    reply PreAcceptOK(id, dep[id]) to sender

– On proposer collecting PreAcceptOK(id, D_q) from each q ∈ Q:
    if |Q| ≥ n−e and ∀q ∈ Q: D_q = initDep[id]:
        send Commit(0, id, cmd[id], initDep[id]) to ALL
    else if |Q| ≥ n−f:
        D ← ⋃_{q ∈ Q} D_q
        send Accept(0, id, cmd[id], D) to ALL

If the proposer receives $n-e$ matching PreAcceptOK responses, it commits the command in two message delays. Otherwise, it transitions to the slow path. Taking the fast path therefore requires that no more than $e$ processes fail and that all concurrently submitted commands commute.
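
The proposer's choice between the two paths can be sketched in Python as follows. This is an illustrative reading of the abbreviated pseudocode, not a reference implementation: the function name `decide`, the reply representation, and the use of the union of reported dependency sets on the slow path are assumptions. The rule is evaluated on whatever replies have been collected so far; a real proposer might keep waiting for a fast quorum before falling back.

```python
def decide(n: int, e: int, f: int, init_dep: frozenset, replies: dict):
    """Pick the next step from PreAcceptOK replies (replica -> dependency set).

    Returns ("commit", deps), ("accept", deps), or ("wait", None).
    """
    all_match = all(d == init_dep for d in replies.values())
    if len(replies) >= n - e and all_match:
        # Fast path: a fast quorum reported exactly the initial dependencies.
        return ("commit", init_dep)
    if len(replies) >= n - f:
        # Slow path: propose the union of all reported dependencies (assumed).
        return ("accept", frozenset().union(init_dep, *replies.values()))
    return ("wait", None)


if __name__ == "__main__":
    init = frozenset({"a"})
    same = {"p1": init, "p2": init, "p3": init}
    mixed = {"p1": init, "p2": init, "p3": frozenset({"a", "b"})}
    print(decide(5, 2, 2, init, same))
    print(decide(5, 2, 2, init, mixed))
    print(decide(5, 2, 2, init, {"p1": init}))
```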

3.2 Slow Path

When the fast-path quorum ($n-e$) cannot be assembled, or conflicting dependency sets are detected, EPaxos reverts to a classical, ballot-driven Paxos phase using a “slow” quorum of size $n-f$. This ensures commands are safely chosen and recoverable under up to $f$ failures.

Original EPaxos’s recovery protocol suffered from both technical complexity and substantial correctness pitfalls, notably in the management of dependencies and ballot state during coordinator hand-offs. This complexity was exacerbated by state changes prior to safety checks in the original “TentativePreAccept” phase, occasionally resulting in deadlocks (Sutra, 2019), and, more subtly, by a flawed single-ballot-variable reconstruction that allowed replicas to forget previously voted dependency sets.

The revised protocol, EPaxos*, replaces these mechanisms with a stateless “Validate” phase. During recovery:

A recovering coordinator queries all replicas with a fresh ballot.
If a committed or accepted value exists in the highest observed ballot, that value is chosen.
If a potential fast-path commit is recoverable and validated (i.e., a quorum of PreAccepted states with matching dependencies can be shown), that value is promoted.
Otherwise, the slot is filled with a no-op.
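
The decision rule summarized in these steps can be sketched in Python as below. This is a deliberately simplified illustration: it omits the Validate message exchange itself, and the number of matching PreAccepted replies needed to promote a value is passed in as `validate_threshold`, a hypothetical parameter whose correct value is not given in this article. The `Reply` type and the `recover` function are likewise assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    phase: str        # "committed", "accepted", "preaccepted", or "none"
    vbal: int         # ballot in which this state was last voted
    deps: frozenset   # dependency set reported by the replica


NOOP = ("noop", frozenset())


def recover(replies: list, validate_threshold: int):
    """Choose what the recovering coordinator proposes for one command."""
    committed = [r for r in replies if r.phase == "committed"]
    if committed:
        return ("cmd", committed[0].deps)
    accepted = [r for r in replies if r.phase == "accepted"]
    if accepted:
        # Take the accepted value from the highest ballot observed.
        return ("cmd", max(accepted, key=lambda r: r.vbal).deps)
    counts = Counter(r.deps for r in replies if r.phase == "preaccepted")
    if counts:
        deps, count = counts.most_common(1)[0]
        if count >= validate_threshold:
            # Possible fast-path commit: promote the matching dependencies.
            return ("cmd", deps)
    return NOOP
```

The ordering of the checks mirrors the list above: an already committed or accepted value always wins over a candidate fast-path value, and a no-op is proposed only when neither exists.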

Key state variables now include both a “last ballot joined” and a “last ballot voted,” as well as the last-voted dependency set, enforcing the invariant that no process “forgets” previously accepted values upon moving to a higher ballot. Failure to observe this distinction permits divergent dependencies and inconsistent orders for conflicting commands, violating linearizability; this issue is explicitly discussed and resolved in rigorous TLA+ formalizations (Sutra, 2019). EPaxos*’s recovery protocol is thus both simpler and correct.
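
The distinction between the two ballot variables can be illustrated with a minimal per-command acceptor state in Python. The class and field names (`Slot`, `bal`, `vbal`, `vdeps`) are hypothetical; the point of the sketch is only that joining a higher ballot updates `bal` and leaves the last vote untouched.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    bal: int = 0                    # last ballot joined
    vbal: int = 0                   # last ballot in which a vote was cast
    vdeps: frozenset = frozenset()  # dependency set of that last vote
    phase: str = "none"

    def join(self, b: int):
        """Join ballot b and report the last vote; None if b is stale."""
        if b <= self.bal:
            return None
        self.bal = b                # the vote (vbal, vdeps) is NOT erased
        return (self.vbal, self.phase, self.vdeps)

    def accept(self, b: int, deps: frozenset) -> bool:
        """Vote for deps in ballot b unless a higher ballot was joined."""
        if b < self.bal:
            return False
        self.bal = self.vbal = b
        self.vdeps = deps
        self.phase = "accepted"
        return True
```

With a single ballot variable, `join` would overwrite the ballot associated with the earlier vote, and a later recovery could no longer tell which reported dependency set is the most recent one.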

5. Correctness and Optimality

The protocol’s safety is governed by two invariants:

Agreement: No two commits for the same command identifier differ in payload or dependency set.
Visibility: For any pair of conflicting committed commands, at least one command appears in the other’s dependency set.
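
Both invariants are straightforward to check over a recorded trace, which can be useful in tests. The Python sketch below assumes a hypothetical trace format: a list of `(id, payload, deps)` tuples for Agreement, and a map from committed identifier to dependency set plus a caller-supplied conflict predicate for Visibility.

```python
def check_agreement(commits: list) -> bool:
    """No two commits of the same id differ in payload or dependency set."""
    seen = {}
    for cid, payload, deps in commits:
        if seen.setdefault(cid, (payload, deps)) != (payload, deps):
            return False
    return True


def check_visibility(deps: dict, conflicts) -> bool:
    """Every conflicting committed pair is linked in at least one direction."""
    ids = sorted(deps)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if conflicts(a, b) and a not in deps[b] and b not in deps[a]:
                return False
    return True
```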

These ensure that every conflicting pair of commands induces a directed edge in the final dependency graph, which, upon topological sorting (with deterministic tie-breaking), yields a unique consistent execution order (i.e., linearizability).
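
One way to derive such an order is sketched below in Python. Because Visibility allows two conflicting commands to appear in each other's dependency sets, the graph may contain cycles; the sketch handles this by grouping commands into strongly connected components (Tarjan's algorithm) and ordering commands inside a component by identifier. That tie-breaking rule, the function name `execution_order`, and the requirement that every dependency is itself committed are assumptions made for illustration.

```python
def execution_order(deps: dict) -> list:
    """Order committed commands so that dependencies execute first.

    deps maps each committed id to the set of ids it depends on.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    order = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in sorted(deps[v]):
            if w not in deps:
                raise ValueError(f"dependency {w!r} is not committed")
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component; its dependencies were emitted earlier.
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            order.extend(sorted(scc))  # deterministic tie-break by id (assumed)

    for v in sorted(deps):
        if v not in index:
            visit(v)
    return order


if __name__ == "__main__":
    graph = {"a": set(), "b": {"a"}, "c": {"b", "d"}, "d": {"c"}}
    print(execution_order(graph))
```

Every replica that runs this procedure on the same committed graph obtains the same sequence, which is what the deterministic tie-breaking is for. The recursive formulation is fine for small graphs; a long dependency chain would need an iterative variant.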

Liveness (“non-blocking progress”) is guaranteed for:

f-resilience: at most f crash failures, via classic slow-path Paxos-style recovery.
e-fast path: at most e failures and commuting commands allow two-message-delay commit.

EPaxos* is shown to be optimal: for any protocol achieving f-resilience and e-fast path, $n \geq \max\{2e + f - 1, 2f + 1\}$ is both necessary and sufficient (Ryabinin et al., 4 Nov 2025).

6. Comparison to Classical and Original EPaxos

Protocol Variant	Leaderless	Fast Path Delays	Fast Quorum Size	Recovery Design	Known Bugs
Classic Paxos	No	3	N/A	Multi-ballot, majority	None reported
Original EPaxos	Yes	2	n-e (as defined)	TentativePreAccept, state	Ballot confusion, deadlocks
EPaxos* (revised)	Yes	2	n-e (as defined)	Stateless Validate	None known

Original EPaxos’s recovery protocol introduced ambiguity between ballot variables and allowed state changes before agreement, enabling correctness violations and deadlocks (Sutra, 2019). EPaxos* fixes these issues by adopting a stateless Validate phase and by tracking the last ballot joined separately from the last ballot voted, so that all prior votes are accounted for during recovery and the deadlock conditions are eliminated. The resulting protocol is simpler to implement and admits rigorous correctness proofs.

7. Practical Significance and Impact

Egalitarian Paxos’s leaderless, quorum-centric design makes it attractive in wide-area deployments with geographically distributed clients, balancing commit responsibility among processes and minimizing latency for non-conflicting operations. The generalized parameterization in EPaxos* enables practitioners to trade crash resilience for operational speed by choosing $f$ and $e$ tuned to application requirements. The formal clarification and corrections to the protocol’s recovery logic highlight the necessity of precise specification and careful implementation, even for established distributed systems protocols.

EPaxos has served as the foundation for further research and protocol design in distributed consensus, and EPaxos* now provides a robust, theoretically optimal, and practically implementable reference point for future developments in high-performance, leaderless state-machine replication protocols (Ryabinin et al., 4 Nov 2025, Sutra, 2019).

References

Ryabinin et al. Making Democracy Work: Fixing and Simplifying Egalitarian Paxos (Extended Version) (2025)

Sutra. On the correctness of Egalitarian Paxos (2019)
