
Egalitarian Paxos: Leaderless SMR

Updated 10 November 2025
  • EPaxos is a leaderless state-machine replication protocol that enables any replica to coordinate command commits using explicit dependency sets.
  • It leverages workload commutativity to achieve fast-path commits, while its EPaxos* variant resolves ballot ambiguities and restores linearizability guarantees.
  • EPaxos defines optimal quorum and fault-tolerance bounds, making it a scalable and reliable choice for distributed systems under varying failure scenarios.

Egalitarian Paxos (EPaxos) is a fault-tolerant, leaderless state-machine replication (SMR) protocol that allows any replica to act as coordinator, avoiding the single-point-of-failure and performance bottlenecks inherent in classic Paxos and Raft protocols. EPaxos was designed to exploit commutativity in workloads for lower commit latency, but early formulations were complicated by specification ambiguities, subtle liveness and safety bugs, and incomplete correctness proofs. Subsequent work introduced a rigorous, simplified, and provably optimal variant, denoted EPaxos* (Ryabinin et al., 4 Nov 2025), and clarified the necessary ballot management to restore linearizability guarantees (Sutra, 2019).

1. Design Principles and Motivation

EPaxos departs from leader-based SMR by enabling any replica to initiate and coordinate the commitment of client commands. This design reduces client-to-leader round-trip times and avoids centralized failure bottlenecks. EPaxos employs the following principles:

  • Leaderless coordination: Any replica can coordinate any command.
  • Fast-path commit: A command with no concurrent conflicting commands can be committed in two message delays, provided at most $e = \lceil\frac{f+1}{2}\rceil$ replicas have failed (where $f$ is the maximum number of process crashes tolerated).
  • Commutativity-aware ordering: Commands are modeled with explicit dependency sets, so nonconflicting (commutative) operations may be executed in any topological order.

This approach is particularly effective in workloads where command conflicts are rare. Given $n = 2f+1$ replicas, EPaxos maintains nonzero throughput as long as at most $f$ processes crash or become unreachable.
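The quorum arithmetic above can be made concrete with a small sketch. The function name `fast_path_params` and the dictionary keys are illustrative choices, not part of any EPaxos implementation; the formulas are the ones stated in this section ($n = 2f+1$, $e = \lceil\frac{f+1}{2}\rceil$, fast quorum $n-e$, progress quorum $n-f$).

```python
import math

def fast_path_params(f: int) -> dict:
    """Quorum parameters for the original EPaxos setting n = 2f + 1.

    Hypothetical helper: the names are illustrative, the formulas come
    from the text above.
    """
    n = 2 * f + 1
    e = math.ceil((f + 1) / 2)   # failures tolerated on the fast path
    return {
        "n": n,
        "e": e,
        "fast_quorum": n - e,    # replies needed for a fast-path commit
        "slow_quorum": n - f,    # replies needed for progress (a majority)
    }

for f in (1, 2, 3):
    print(f, fast_path_params(f))
```

For $f = 2$ this gives five replicas, with a fast quorum and a progress quorum of three replies each; for $f = 3$ the fast quorum (five) is larger than the progress quorum (four).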

The initial promise of EPaxos in reducing global coordination and latency motivated further analysis of its protocol details and correctness properties.

2. Protocol Workflow and Data Structures

Each replica $p$ manages, for each command ID $\mathit{id}$:

  • $\mathit{cmd}[\mathit{id}] \in \{\bot\} \cup \{\text{payloads}\}$
  • $\mathit{dep}[\mathit{id}] \subseteq \mathit{IDs}$ (the dependency set)
  • $\mathit{phase}[\mathit{id}] \in \{\mathrm{pre}, \mathrm{acc}, \mathrm{com}\}$ (indicating pre-accepted, accepted, or committed)
  • Ballot tracking: the original protocol used a single $\mathit{bal}[\mathit{id}]$, while the corrected variant uses two: $\mathit{bal}[\mathit{id}]$ (join/prepare ballot) and $\mathit{abal}[\mathit{id}]$ (accepted ballot).
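One way to picture this per-command state is as a small record keyed by command ID. The sketch below is only an illustration: the class names `Phase`, `CommandState`, and `Replica`, the use of `None` for $\bot$, and integer ballots are all assumptions made here, not details taken from the protocol papers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet, Optional

class Phase(Enum):
    PRE = "pre"   # pre-accepted
    ACC = "acc"   # accepted
    COM = "com"   # committed

@dataclass
class CommandState:
    cmd: Optional[bytes] = None          # None plays the role of ⊥ (no payload known)
    dep: FrozenSet[str] = frozenset()    # dependency set: IDs of conflicting commands
    phase: Optional[Phase] = None        # None until the command is first seen
    bal: int = 0                         # highest ballot joined (join/prepare ballot)
    abal: int = 0                        # ballot at which the current value was accepted

@dataclass
class Replica:
    state: Dict[str, CommandState] = field(default_factory=dict)

    def entry(self, cmd_id: str) -> CommandState:
        """Return the state for cmd_id, creating an empty record on first use."""
        return self.state.setdefault(cmd_id, CommandState())

r = Replica()
s = r.entry("p1.7")          # hypothetical command ID
s.cmd, s.phase = b"put x=1", Phase.PRE
```

Keeping `bal` and `abal` as separate fields mirrors the corrected variant; the original protocol would have only the first.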

Client submission and normal-case commit:

  1. On receiving a client's command $c$, a coordinator $p$ computes its initial dependency set:

$$D_0 = \{\mathit{id}' \mid \mathit{cmd}[\mathit{id}'] \neq \bot \wedge \mathit{id}' \text{ conflicts with } c\}$$

and broadcasts $\mathrm{PreAccept}(\mathit{id}, c, D_0)$ to all replicas.

  2. Each recipient $q$ sets local state and may extend $D_0$ by adding new known conflicts, replying $\mathrm{PreAcceptOK}(\mathit{id}, D_q)$.
  3. The coordinator collects enough $\mathrm{PreAcceptOK}$ replies (at least $n-f$ for progress) from a set of replicas $Q$. Let $D = \bigcup_{q \in Q} D_q$.
    • Fast Path: If $|Q| \geq n-e$ and $D_q = D_0$ for all $q \in Q$, broadcast $\mathrm{Commit}(0, \mathit{id}, c, D)$ immediately (commit in two message delays).
    • Slow Path: Otherwise, broadcast $\mathrm{Accept}(0, \mathit{id}, c, D)$ and wait for $n-f$ responses, then issue $\mathrm{Commit}$.
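The coordinator's choice between the two paths can be sketched as a pure function over the collected replies. This is a minimal illustration under stated assumptions: `decide_path` is a hypothetical name, replies are modeled as a dictionary from replica to $D_q$, and the coordinator's own reply is assumed to be included in that dictionary.

```python
from typing import Dict, FrozenSet, Optional, Tuple

def decide_path(
    n: int, f: int, e: int,
    d0: FrozenSet[str],
    replies: Dict[str, FrozenSet[str]],
) -> Optional[Tuple[str, FrozenSet[str]]]:
    """Pick the next broadcast after collecting PreAcceptOK replies.

    replies maps each replica q in Q to its dependency set D_q.
    Returns None while fewer than n - f replies have arrived.
    """
    if len(replies) < n - f:
        return None                                  # keep waiting
    d = frozenset().union(*replies.values())         # D = union of all D_q
    fast = len(replies) >= n - e and all(dq == d0 for dq in replies.values())
    return ("Commit" if fast else "Accept", d)

d0 = frozenset({"a"})
same = {"p": d0, "q": d0, "r": d0}
mixed = {"p": d0, "q": d0, "r": frozenset({"a", "b"})}
print(decide_path(5, 2, 2, d0, same))    # all replies agree with D_0: fast path
print(decide_path(5, 2, 2, d0, mixed))   # one reply extended D_0: slow path
```

In the second call the union $D$ carries the extra dependency `b` into the Accept message, which is what the slow path then fixes at a quorum.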

Commands are executed in a topological order induced by the dependency graph. If dependency cycles arise, a deterministic rule is used to break them.
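A common way to realize such an order is to compute strongly connected components of the dependency graph and execute them dependencies-first. The sketch below uses Tarjan's algorithm and breaks cycles by sorting command IDs inside each component; that tie-break is an assumption chosen for illustration, since this section does not state which deterministic rule the protocol uses. The function name `execution_order` is likewise hypothetical, and all listed commands are assumed to be committed.

```python
from typing import Dict, List, Set

def execution_order(dep: Dict[str, Set[str]]) -> List[str]:
    """Order committed commands so that dependencies run first.

    dep maps a command ID to the IDs it depends on. Cycles are broken by
    sorting IDs within each strongly connected component (assumed rule).
    """
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    order: List[str] = []
    counter = 0

    def visit(v: str) -> None:
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in sorted(dep.get(v, ())):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the root of a component
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            order.extend(sorted(scc))     # deterministic order inside a cycle

    for v in sorted(dep):
        if v not in index:
            visit(v)
    return order

# b and c depend on each other (a cycle); d depends on c; a is independent.
deps = {"a": set(), "b": {"a", "c"}, "c": {"b"}, "d": {"c"}}
print(execution_order(deps))
```

Tarjan's algorithm emits a component only after every component reachable from it, so with edges pointing from a command to its dependencies the emission order is already a valid execution order. The recursive form is kept for brevity and would need an explicit stack for very long dependency chains.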

3. Recovery Procedures and Correctness

Original versions of EPaxos lacked both rigorous proofs and a robust recovery protocol, leading to safety and liveness violations, particularly in the handling of ballot transitions and concurrent recovery attempts (Ryabinin et al., 4 Nov 2025).

Key steps in simplified recovery (EPaxos*):

  1. Each command $\mathit{id}$ is associated with a dynamic leader detector $\Omega[\mathit{id}]$. If a process $p$ suspects that progress has stalled and $p \neq \Omega[\mathit{id}]$, it sends $\mathrm{TryRecover}(\mathit{id})$ to $\Omega[\mathit{id}]$.
  2. The coordinator, in a new ballot $b > \mathit{bal}[\mathit{id}]$, broadcasts $\mathrm{Recover}(b, \mathit{id})$ to all and collects $\mathrm{RecoverOK}(b, \mathit{id}, \mathit{abal}_q, \mathit{cmd}_q, \mathit{dep}_q, \mathit{initDep}_q, \mathit{phase}_q)$ from a quorum $Q$.
  3. The recovery logic distinguishes three cases, where $U \subseteq Q$ denotes the replicas reporting the maximal $\mathit{abal}_q$:
    • If some $q \in U$ is committed, re-commit.
    • If some $q$ is accepted, re-accept.
    • Otherwise, search for a set $R \subseteq Q$ with $|R| \geq |Q| - e$ such that every $q \in R$ has $\mathit{phase}_q = \mathrm{pre}$ and $\mathit{dep}_q = \mathit{initDep}_q$. If such a set exists, propose the value reported by some $q \in R$.
  4. The validator broadcasts $\mathrm{Validate}$ to $Q$, collecting potential invalidations before proceeding to commit or abort.
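The three-way case analysis in step 3 can be sketched as follows. This is a simplified reading of the steps above, not the EPaxos* pseudocode: the record `RecoverOK`, the function `recovery_decision`, and the returned labels are hypothetical, the "accepted" case is assumed to range over $U$ like the "committed" case, and the final branch is left as an opaque label because this section does not say what happens when no suitable set $R$ exists.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple

@dataclass(frozen=True)
class RecoverOK:
    abal: int
    cmd: Optional[str]
    dep: FrozenSet[str]
    init_dep: FrozenSet[str]
    phase: str                      # "pre", "acc", or "com"

Decision = Tuple[str, Optional[str], Optional[FrozenSet[str]]]

def recovery_decision(replies: Dict[str, RecoverOK], e: int) -> Decision:
    """Classify a quorum of RecoverOK replies (replica -> reply)."""
    max_abal = max(r.abal for r in replies.values())
    u = [r for r in replies.values() if r.abal == max_abal]   # the set U
    for r in u:
        if r.phase == "com":
            return ("commit", r.cmd, r.dep)
    for r in u:                      # assumption: also restricted to U
        if r.phase == "acc":
            return ("accept", r.cmd, r.dep)
    # Candidate set R: pre-accepted replies whose dependencies were not extended.
    cands = [r for r in replies.values()
             if r.phase == "pre" and r.dep == r.init_dep]
    if cands and len(cands) >= len(replies) - e:
        return ("validate", cands[0].cmd, cands[0].dep)
    return ("unspecified", None, None)   # not described in this section

fs = frozenset
q = {
    "p": RecoverOK(0, "c", fs({"a"}), fs({"a"}), "pre"),
    "q": RecoverOK(0, "c", fs({"a"}), fs({"a"}), "pre"),
    "r": RecoverOK(0, "c", fs({"a", "b"}), fs({"a"}), "pre"),
}
print(recovery_decision(q, e=2))
```

In the example two of three replies are pre-accepted with unchanged dependencies, which meets $|R| \geq |Q| - e = 1$, so the coordinator would move on to the validation step with dependency set `{"a"}`.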

Table: Comparison of Recovery Logic

| Property | Original EPaxos | EPaxos* |
|---|---|---|
| Ballot variables | Single | Two per command |
| Recovery state pollution | TentativePreAccept | Stateless validation |
| Deadlock possibility | Potential, unresolved | Avoided through Waiting |

Agreement and visibility invariants are carefully established:

  • No two commits for the same command can select different dependencies.
  • Any two conflicting commands must satisfy $\mathit{id} \in \mathit{dep}[\mathit{id}']$ or $\mathit{id}' \in \mathit{dep}[\mathit{id}]$ (visibility).
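The visibility invariant lends itself to a simple check over a set of committed commands, for example in a test harness. The sketch below is illustrative only: `visibility_violations` and the key-based `conflicts` predicate are hypothetical, and conflict is modeled here as "touches the same key".

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

def visibility_violations(
    dep: Dict[str, Set[str]],
    conflicts: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Return conflicting pairs where neither command lists the other as a dependency."""
    return [
        (a, b)
        for a, b in combinations(sorted(dep), 2)
        if conflicts(a, b) and a not in dep[b] and b not in dep[a]
    ]

# Hypothetical workload: two commands conflict when they touch the same key.
key = {"c1": "x", "c2": "x", "c3": "y"}
same_key = lambda a, b: key[a] == key[b]

print(visibility_violations({"c1": set(), "c2": {"c1"}, "c3": set()}, same_key))
print(visibility_violations({"c1": set(), "c2": set(), "c3": set()}, same_key))
```

The first call finds no violation because `c2` records `c1`; the second reports the pair `("c1", "c2")`, since both touch key `x` and neither sees the other.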

4. Critical Bugs and Specification Corrections

Sutra (2019) demonstrates the consequences of EPaxos conflating the joined ballot with the accepted ballot:

  • Only $\mathit{bal}[\mathit{id}]$ was maintained, omitting the classic distinction between "highest ballot joined" and "highest ballot at which a value was accepted."
  • In ballot transitions during recovery, this allows divergent decisions for the same instance, violating the invariant that all replicas should agree on dependencies for any command.
  • The fix is to track two ballots per instance, together with the value accepted at the second:
    • $\mathit{bal}$: highest seen ballot
    • $\mathit{vbal}$: highest ballot at which an accepted value was stored
    • $\mathit{vdep}$: dependency set at $\mathit{vbal}$
  • This amendment restores the "safe value" construction, ensuring that any leader picks dependencies accepted at the maximal lower ballot, a core Paxos safety property.
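The distinction can be illustrated with a classic Paxos-style acceptor. This is a generic sketch of the two-ballot bookkeeping, not the EPaxos message handlers: `Instance`, `on_prepare`, `on_accept`, and `safe_deps` are hypothetical names, ballots are plain integers, and `-1` is used here to mean "nothing accepted yet".

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Promise = Tuple[int, Optional[FrozenSet[str]]]

@dataclass
class Instance:
    bal: int = 0                                # highest ballot seen/joined
    vbal: int = -1                              # ballot of the last accepted value
    vdep: Optional[FrozenSet[str]] = None       # dependencies accepted at vbal

    def on_prepare(self, b: int) -> Optional[Promise]:
        """Join ballot b; report what was accepted and at which ballot."""
        if b <= self.bal:
            return None                         # stale ballot, ignore
        self.bal = b                            # vbal and vdep stay untouched
        return (self.vbal, self.vdep)

    def on_accept(self, b: int, dep: FrozenSet[str]) -> bool:
        if b < self.bal:
            return False
        self.bal = self.vbal = b
        self.vdep = dep
        return True

def safe_deps(promises: List[Promise]) -> Optional[FrozenSet[str]]:
    """Pick the dependencies accepted at the highest vbal, if any."""
    vbal, vdep = max(promises, key=lambda p: p[0])
    return vdep if vbal >= 0 else None

inst = Instance()
inst.on_accept(1, frozenset({"x"}))
promise = inst.on_prepare(3)       # joins ballot 3 but still reports vbal = 1
print(inst.bal, promise)
```

With a single ballot variable, joining ballot 3 would overwrite the record that `{"x"}` was accepted at ballot 1, and a later recovery could no longer tell which reported value is the most recent one.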

5. Quorum Size, Fault Tolerance, and Optimality

EPaxos* generalizes quorum and failure parameters across the space of possible tradeoffs: $n \geq \max\{2e + f - 1,\, 2f + 1\}$ where:

  • $n$: total replicas
  • $f$: crash-tolerance for liveness
  • $e$: maximum failures for "fast-path" commit

Theorem 2 (Ryabinin et al., 4 Nov 2025) establishes this as a global lower bound: no SMR protocol that is $f$-resilient and $e$-fast can use fewer replicas. This subsumes the original EPaxos setting ($n = 2f + 1$, $e = \lceil\frac{f+1}{2}\rceil$) and further admits all $f$ and $e$ satisfying the bound, yielding an optimal protocol.
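As a quick sanity check that the original setting meets the bound, the following worked step (added here for illustration, not quoted from the cited paper) uses only the fact that $\lceil\frac{f+1}{2}\rceil \le \frac{f+2}{2}$ for every integer $f \ge 0$:

```latex
\begin{align*}
n = 2f+1,\quad e = \left\lceil \tfrac{f+1}{2} \right\rceil \le \tfrac{f+2}{2}
  &\;\Longrightarrow\; 2e + f - 1 \;\le\; (f+2) + f - 1 \;=\; 2f+1 \;=\; n \\
  &\;\Longrightarrow\; \max\{2e + f - 1,\; 2f+1\} \;=\; 2f+1 \;=\; n .
\end{align*}
```

For example, $f = 2$ gives $e = 2$ and $2e + f - 1 = 5 = n$, while $f = 3$ gives $e = 2$ and $2e + f - 1 = 6 \le 7 = n$.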

Failure scenarios and liveness: EPaxos* ensures that, after recovery, a single process can always make progress via ballot preemption, and Waiting messages are introduced to detect and resolve cyclic deadlocks during recovery.

6. Comparison with Leader-Based Protocols

Classic Paxos and Raft rely on a static leader, introducing latency and throughput issues under asynchronous, partitioned, or geo-distributed deployment. EPaxos dispenses with leader reliance, instead leveraging commutativity for fast-path commits and uniform throughput irrespective of coordinator placement.

Key distinctions:

  • Leader elimination: Critical bottleneck removed; all replicas are equivalent.
  • Commutativity-exploiting dependency graphs: Only conflicting commands have enforced order; commutative commands can commit in parallel.
  • Formal proofs and correction: Early ambiguous specification and incomplete proofs are fully rectified in EPaxos* (Ryabinin et al., 4 Nov 2025).

Several replication protocols are inspired by, or refine, EPaxos's leaderless, dependency-graph paradigm. The rigorous analysis of ballot management and recovery protocol correctness in (Sutra, 2019) further informs protocol design for safe, linearizable replication in asynchronous, adversarial environments.

7. Formal Results and Theoretical Foundations

Theoretical underpinnings of EPaxos* include:

  • Theorem 2 (Lower Bound):

$$n \geq \max\{2e + f - 1,\, 2f + 1\}$$

is mandatory for $f$-resilient, $e$-fast SMR.

  • Theorem 3 (Correctness): EPaxos* is $f$-resilient, $e$-fast, and satisfies both safety (agreement and visibility via ballots, dependencies, and validation) and liveness (via single-coordinator mechanisms and deadlock-avoidance).
  • Lemma 4 (Validation Safety): In recovery, if enough pre-accept responses with identical dependencies exist in the quorum, only this dependency set is valid for commit.
  • Lemma 5 (Wait Abort): If a process observes "Waiting" from more than $n-f-e$ processes regarding a command, then that command could not have committed on the fast path, and abort is safe.
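The numeric conditions in Theorem 2 and Lemma 5 are easy to evaluate for concrete parameters. The helpers below are a small sketch with hypothetical names (`min_replicas`, `wait_abort_is_safe`); they encode only the two inequalities as stated above and say nothing about how the protocol gathers the counts.

```python
def min_replicas(e: int, f: int) -> int:
    """Smallest n allowed by the Theorem 2 bound n >= max{2e + f - 1, 2f + 1}."""
    return max(2 * e + f - 1, 2 * f + 1)

def wait_abort_is_safe(waiting: int, n: int, f: int, e: int) -> bool:
    """Lemma 5 threshold: more than n - f - e processes reported Waiting."""
    return waiting > n - f - e

for e, f in [(1, 2), (2, 2), (3, 2)]:
    print(f"e={e}, f={f}: n >= {min_replicas(e, f)}")

print(wait_abort_is_safe(waiting=2, n=5, f=2, e=2))
```

With $f = 2$, raising $e$ from 2 to 3 pushes the minimum from five replicas to seven, which shows how a larger fast-path tolerance is paid for in replica count.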

These results establish EPaxos* as the first provably safe, leaderless SMR protocol optimal across all relevant parameters, resolving prior ambiguity, suboptimality, and correctness holes.
