Making Democracy Work: Fixing and Simplifying Egalitarian Paxos (Extended Version) (2511.02743v1)

Published 4 Nov 2025 in cs.DC

Abstract: Classical state-machine replication protocols, such as Paxos, rely on a distinguished leader process to order commands. Unfortunately, this approach makes the leader a single point of failure and increases the latency for clients that are not co-located with it. As a response to these drawbacks, Egalitarian Paxos introduced an alternative, leaderless approach, that allows replicas to order commands collaboratively. Not relying on a single leader allows the protocol to maintain non-zero throughput with up to $f$ crashes of any processes out of a total of $n = 2f+1$. The protocol furthermore allows any process to execute a command $c$ fast, in $2$ message delays, provided no more than $e = \lceil\frac{f+1}{2}\rceil$ other processes fail, and all concurrently submitted commands commute with $c$; the latter condition is often satisfied in practical systems. Egalitarian Paxos has served as a foundation for many other replication protocols. But unfortunately, the protocol is very complex, ambiguously specified and suffers from nontrivial bugs. In this paper, we present EPaxos* -- a simpler and correct variant of Egalitarian Paxos. Our key technical contribution is a simpler failure-recovery algorithm, which we have rigorously proved correct. Our protocol also generalizes Egalitarian Paxos to cover the whole spectrum of failure thresholds $f$ and $e$ such that $n \ge \max\{2e+f-1, 2f+1\}$ -- the number of processes that we show to be optimal.

Summary

  • The paper introduces a simplified, fully specified variant of EPaxos that resolves ambiguity and safety issues in leaderless state-machine replication.
  • The paper rigorously proves safety and liveness using invariant-based methods and matches the known lower bounds for fast consensus.
  • The paper presents a novel recovery mechanism with a validation phase that optimizes fast and slow paths for improved geo-distributed performance.

Fixing and Simplifying Egalitarian Paxos: A Rigorous Approach to Leaderless State-Machine Replication

Introduction and Motivation

The paper addresses the complexity, ambiguity, and correctness issues in the original Egalitarian Paxos (EPaxos) protocol, a leaderless state-machine replication (SMR) protocol designed to overcome the single-leader bottleneck of classic protocols like Paxos and Raft. EPaxos enables any replica to propose and order commands, allowing for lower-latency execution and improved availability, especially in geo-distributed deployments. However, the original EPaxos protocol suffers from intricate and underspecified recovery logic, subtle safety bugs, and a lack of rigorous correctness proofs, particularly for its non-thrifty variant and optimal fast-path quorum sizes.

The authors present a new protocol, EPaxos* (referred to below simply as EPaxos), that retains the democratic, leaderless nature of the original but provides a simpler, fully specified, and provably correct design. The new protocol generalizes the fast-path/slow-path tradeoff, matches known lower bounds for fast consensus, and provides a rigorous proof of safety and liveness.

System Model and Problem Statement

The system consists of n processes, up to f of which may crash, communicating over reliable links in a partially synchronous network. The goal is to implement linearizable SMR, where each process maintains a replica of a deterministic state machine and executes client commands in a way that ensures consistency, integrity, validity, and liveness.

A key property of modern SMR protocols is the ability to execute conflict-free commands quickly (in two message delays) under favorable conditions (few failures, no conflicting concurrent commands). The protocol is parameterized by e (the number of failures tolerated on the fast path) and f (the total number of failures tolerated for liveness). The paper establishes that any f-resilient, e-fast SMR protocol requires n ≥ max{2e+f−1, 2f+1} processes, matching the lower bound for fast consensus.
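
To make the bound concrete, here is a small Python sketch (illustrative, not from the paper) that computes the minimum number of replicas for a chosen pair (e, f) and the largest e that a given (n, f) admits; the function names are ours.

```python
def min_processes(e: int, f: int) -> int:
    # Lower bound from the paper: n >= max{2e + f - 1, 2f + 1}.
    return max(2 * e + f - 1, 2 * f + 1)

def max_fast_path_failures(n: int, f: int) -> int:
    # Largest e compatible with n and f, or -1 if even f alone is infeasible.
    if n < 2 * f + 1:
        return -1
    return (n - f + 1) // 2  # from 2e + f - 1 <= n

# Example: with n = 2f + 1 = 5 replicas and f = 2, the fast path tolerates
# e = 2 = ceil((f + 1) / 2) failures, matching the optimized protocol.
assert min_processes(e=2, f=2) == 5
assert max_fast_path_failures(n=5, f=2) == 2
```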

Protocol Overview

Dependency Graph Abstraction

EPaxos achieves SMR by having processes agree on a dependency graph, where vertices are commands and edges represent ordering constraints due to conflicts. Each process maintains local arrays mapping command identifiers to payloads, dependency sets, and phases (pre-accepted, accepted, committed). Commands are executed in an order consistent with the dependency graph, breaking cycles deterministically.
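
As a rough illustration of this bookkeeping (the names Replica, CmdState, and Phase are ours, not the paper's pseudocode), each process might hold the following per-command state and compute initial dependencies from locally known conflicting commands:

```python
from dataclasses import dataclass, field
from enum import Enum

class Phase(Enum):
    PRE_ACCEPTED = 1
    ACCEPTED = 2
    COMMITTED = 3

@dataclass
class CmdState:
    payload: object                              # the client command (or Nop)
    deps: set = field(default_factory=set)       # ids of conflicting commands
    phase: Phase = Phase.PRE_ACCEPTED

class Replica:
    def __init__(self, conflicts):
        self.conflicts = conflicts  # predicate: do two payloads conflict?
        self.cmds: dict = {}        # command id -> CmdState

    def initial_deps(self, payload) -> set:
        # Initial dependencies: locally known commands conflicting with payload.
        return {cid for cid, st in self.cmds.items()
                if self.conflicts(st.payload, payload)}
```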

Commit Protocol

When a client submits a command, the initial coordinator assigns it a unique identifier and computes its initial dependencies (conflicting commands known locally). The coordinator broadcasts a PreAccept message to all processes. Recipients may augment the dependency set with additional conflicts and reply with PreAcceptOK. If the coordinator receives n−e responses (a fast quorum) that all report the same dependencies, it commits the command immediately (fast path). Otherwise, it enters a slow path, collecting n−f AcceptOK responses before committing.

The protocol ensures that for any two conflicting commands committed at ballot 0, at least one is in the dependency set of the other, preserving consistency.
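
A sketch of the coordinator-side choice between the two paths, under the n−e and n−f thresholds above (the helper name and reply format are assumptions; the slow path's dependency union follows the "unifies them" description in the simplified explanation later on this page):

```python
def on_preaccept_replies(replies: dict, n: int, e: int, f: int):
    """Coordinator-side path selection from PreAcceptOK replies.

    `replies` maps a replier's id to the dependency set it returned.
    Returns ("fast", deps), ("slow", deps) or ("wait", None); the slow
    path must still gather n - f AcceptOK responses before committing.
    """
    if len(replies) >= n - e:                        # a fast quorum answered
        dep_sets = list(replies.values())
        if all(d == dep_sets[0] for d in dep_sets):  # all dependencies agree
            return "fast", dep_sets[0]               # commit in 2 message delays
    if len(replies) >= n - f:                        # fall back to the slow path
        union = set().union(*replies.values())       # unify reported dependencies
        return "slow", union                         # run the Accept round on these
    return "wait", None                              # keep waiting for replies
```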

Execution Protocol

A background task at each process executes committed commands in batches, respecting the dependency graph. Strongly connected components are executed in topological order, with cycles broken by command identifiers. This ensures that all processes execute conflicting commands in the same order.
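
The ordering rule can be sketched as follows (illustrative only; deps maps each committed command id to its committed dependency set, and execute applies a single command). Tarjan's algorithm emits strongly connected components with dependencies first, and command ids break ties inside a cycle:

```python
def execute_batch(deps: dict, execute) -> None:
    """Run committed commands respecting the dependency graph."""
    index, low, on_stack, stack = {}, {}, set(), []
    order = []          # SCCs, dependencies emitted before dependants
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:              # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            order.append(scc)

    for v in deps:
        if v not in index:
            strongconnect(v)
    for scc in order:
        for cmd in sorted(scc):             # deterministic tie-break inside a cycle
            execute(cmd)
```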

Recovery Protocol

If a coordinator fails, another process can take over by initiating a recovery protocol. The new coordinator selects a higher ballot and collects state from a quorum. If a slow-path decision was previously made, it resumes from the accepted value. If a fast-path decision is suspected, a novel validation phase is used: the coordinator proposes the candidate dependencies to the quorum, which checks for invalidating or potentially invalidating commands (i.e., conflicting commands that could break consistency if the fast path is taken). If any are found, the recovery is aborted (the command is replaced with a no-op); otherwise, the command is accepted.

The validation phase is a key technical contribution, providing a simple and correct mechanism for safe recovery, in contrast to the original EPaxos's complex and error-prone tentative pre-accept phase.
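
Very roughly, the recovery decision described above can be summarized by the following sketch. It is deliberately simplified and the names are ours: the paper's actual rules for picking the fast-path candidate and for what counts as an invalidating command are more precise, and validation is performed by the quorum members rather than by a single local check.

```python
def recover(quorum_states, validated_by_quorum):
    """Decide what a new coordinator proposes at its higher ballot (sketch).

    `quorum_states` holds one record per quorum member: its phase for the
    command, the last ballot at which it accepted a value (abal), and the
    payload/deps it stores. `validated_by_quorum` stands in for the
    validation phase: True iff no (potentially) invalidating conflicting
    command was reported by the quorum.
    """
    # 1. A slow-path decision may already exist: resume from the value
    #    accepted at the highest ballot.
    accepted = [s for s in quorum_states if s["phase"] in ("accepted", "committed")]
    if accepted:
        best = max(accepted, key=lambda s: s["abal"])
        return best["payload"], best["deps"]

    # 2. A fast-path decision is only possible if the command was
    #    pre-accepted at ballot 0; validate the candidate dependencies.
    pre = [s for s in quorum_states if s["phase"] == "pre_accepted"]
    if pre:
        candidate = pre[0]
        if validated_by_quorum(candidate["deps"]):
            return candidate["payload"], candidate["deps"]

    # 3. Otherwise the fast path cannot have been taken safely:
    #    replace the command with a no-op.
    return "Nop", set()
```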

Optimized Protocol and Lower Bound Matching

The protocol is further optimized to match the lower bound n ≥ max{2e+f−1, 2f+1} by refining the recovery logic. In particular, the coordinator aborts recovery if the initial coordinator is in the recovery quorum or if certain intersection properties do not hold, ensuring that the fast path could not have been taken. This allows, for example, e = ⌈(f+1)/2⌉ for n = 2f+1, as originally intended by EPaxos.

Thrifty Variant

A thrifty variant is described, where the coordinator selects a fixed fast quorum a priori and communicates only with it. This allows the fast path even if the rest of the system disagrees, but sacrifices fast-path availability in the presence of failures within the selected quorum.

Correctness and Rigorous Proofs

The paper provides a detailed, invariant-based proof of safety (agreement and visibility invariants) and liveness. The proof covers all protocol variants (baseline, optimized, thrifty) and addresses subtle cases in recovery, including deadlock avoidance and the handling of cycles in dependency waiting. The protocol is shown to be the first to match the optimal fast-path/slow-path tradeoff for SMR.

Comparison with Original EPaxos

The original EPaxos protocol is shown to be ambiguous, incomplete, and buggy, especially in its use of ballot variables and recovery logic. The new protocol avoids these pitfalls by separating the validation phase from dependency commitment, using two ballot variables (as in classic Paxos), and providing a complete specification for both thrifty and non-thrifty variants. The new recovery protocol is both simpler and more robust, avoiding deadlocks and ensuring progress.
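
For intuition, the two-ballot bookkeeping works along the lines of a classic Paxos acceptor. The sketch below is illustrative: the variable names bal and abal follow the Knowledge Gaps section of this page, while the method names and record format are assumptions. bal tracks the highest ballot the process has joined, abal the ballot at which it last accepted a value.

```python
class BallotState:
    """Per-command ballot bookkeeping in the style of a Paxos acceptor."""

    def __init__(self):
        self.bal = 0        # highest ballot this process has joined
        self.abal = -1      # ballot at which it last accepted a value
        self.value = None   # the value (payload + deps) accepted at abal

    def on_recover(self, b):
        # A recovering coordinator asks us to join ballot b; we report the
        # last ballot that actually voted, not merely the ballot we sit in.
        if b > self.bal:
            self.bal = b
            return {"abal": self.abal, "value": self.value}
        return None         # reject: already joined a ballot >= b

    def on_accept(self, b, value):
        if b >= self.bal:
            self.bal = b
            self.abal = b   # remember which ballot voted
            self.value = value
            return True
        return False
```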

Practical and Theoretical Implications

Practically, the new EPaxos protocol provides a drop-in replacement for systems using the original EPaxos, such as distributed databases and transaction systems, with improved correctness and optimal fast-path performance. The protocol is particularly well-suited for geo-distributed deployments with small replica sets, where fast-path availability is critical.

Theoretically, the protocol clarifies the tradeoffs between fast-path and slow-path fault tolerance, matches known lower bounds, and provides a rigorous foundation for future work on leaderless consensus and SMR. The validation phase technique may be applicable to other consensus protocols with fast paths.

Future Directions

Potential future work includes extending the protocol to Byzantine fault tolerance, integrating with reconfiguration and dynamic membership, and further optimizing for wide-area deployments. The validation-based recovery approach may inspire new designs in other leaderless or multi-leader consensus protocols.

Conclusion

The paper presents a new, rigorously specified, and provably correct variant of Egalitarian Paxos, addressing longstanding issues in the original protocol. The new EPaxos matches optimal lower bounds for fast-path consensus, provides a simple and robust recovery mechanism, and is suitable for practical deployment in fault-tolerant, geo-distributed systems. This work establishes a new standard for leaderless SMR protocols, both in terms of theoretical optimality and practical implementability.


Explain it Like I'm 14

Overview

This paper is about a way for many computers to agree on the order of tasks (like saving data or updating a record) even if some computers crash. It focuses on a “leaderless” method called Egalitarian Paxos, which lets any computer help decide the order, instead of relying on one leader. The authors show that the original method was complicated and had some mistakes, and they present a simpler, fixed, and proven-correct version they call EPaxos. They also show the best possible number of computers you need to make this work quickly and safely.

What questions does the paper try to answer?

  • How can we make a leaderless agreement protocol easier to understand, correctly specified, and free of bugs?
  • Can we design the recovery (what happens when a coordinator fails) to be simpler and provably correct?
  • How many computers do we need to:
    • keep the system running when some crash (f failures), and
    • still make fast decisions when a smaller number crash (e failures)?
  • Is there a best (optimal) way to balance “fast decisions” and “overall fault tolerance” with the number of computers n in the system?

How does the method work? (Explained simply)

Think of a group of students working together on a shared to-do list. Each task has a unique ID number. Some tasks don’t interfere with each other and can be done in any order (they “commute”). Other tasks do interfere and must be done in a specific order (they “conflict”). The protocol helps all students agree on the order so the final outcome is consistent, even if some students leave (crash) or get disconnected.

Here are the key ideas:

  • State-machine replication: Every computer keeps a copy of the same “machine” (think: a program with a current state) and applies tasks in the same logical order so all copies stay in sync.
  • Dependency graph: Picture tasks as dots and arrows between them. An arrow from A to B means “do A before B.” The protocol builds and agrees on this graph. If two tasks conflict, at least one arrow is added between them to force an order. If tasks don’t conflict, no arrow is needed—they can happen in any order.
  • Fast path vs. slow path:
    • Fast path: If there are few failures (up to e) and no conflicting tasks happen at the same time, the coordinator can confirm a task in just two message steps. This is fast because the coordinator hears consistent answers from a “fast quorum” (a big enough group) and commits immediately.
    • Slow path: If answers disagree or there are more failures, the coordinator collects everyone’s proposals, unifies them, and confirms with another round. This takes longer but is safe.
  • Ballots and recovery: If the original coordinator (the student who started a task) disappears, another computer takes over. “Ballots” are like numbered rounds that say who’s in charge of this task right now. The new coordinator asks others what they know and then safely finishes the decision, making sure it matches any decision already made. The authors fix a bug in the original protocol’s ballot bookkeeping: each process records the last ballot at which it accepted a value, not just the ballot it is currently in, which prevents inconsistent decisions.
  • Execution: Once a task and its dependencies are fully agreed, every computer runs it in the same logically valid order. If there’s a cycle (like A depends on B and B depends on A), they break ties by comparing task IDs, ensuring everyone is consistent.

Analogy: Tasks are like Lego pieces. Some pieces snap together in any order (commuting tasks), while others need a specific piece before they can be added (conflicting tasks). The system builds a map (the dependency graph) showing which pieces must go first. Then all players follow the same map to build matching models.

What did they find, and why does it matter?

  • Simpler and correct protocol: EPaxos cleans up and simplifies the original design, especially the recovery process. The authors provide rigorous proofs that it works correctly.
  • Bug fixes:
    • They fix an important mistake in the original protocol’s use of ballot variables that could lead to wrong results (non-linearizable behavior).
    • They identify and fix a new deadlock problem that could stall recovery even when there are only a few tasks.
  • Fast decisions with the right number of computers: The paper proves a key rule for how many computers n you need to tolerate failures and still be fast:
    • To be safe with up to f crashes and still make fast decisions with up to e crashes, you need at least:
    • n ≥ max{2e + f − 1, 2f + 1}
    • EPaxos achieves this bound, meaning it uses the smallest possible number of computers while meeting these goals. This is what “optimal” means here.
  • Flexible trade-offs: For a fixed number of computers n, you can trade overall crash tolerance f against fast-decision tolerance e. For example, with n = 5, you can choose e and f to balance performance and safety depending on how your system is used.
  • Practical impact: Leaderless design avoids a single “boss” computer that can become slow or a bottleneck, especially over long distances (for example, servers spread across the world). EPaxos keeps working, even if several machines crash, and can be quicker when tasks don’t conflict.

Why does this research matter in the real world?

  • Faster services: If tasks often don’t conflict (common in many real systems), EPaxos can confirm them quickly, improving user experience.
  • More reliable systems: It keeps going even when multiple computers fail and avoids depending on one leader, which improves availability.
  • Clearer foundations: A simpler, well-specified, proven-correct protocol helps engineers build trustworthy systems. It also guides future research and influences the design of distributed databases and transaction systems.
  • Better planning: The optimal bound helps teams choose how many servers to run and how to balance speed and safety, especially in global, high-latency environments.

In short, this paper makes the “democratic” approach to agreement in distributed systems safer, simpler, and optimally efficient, which can lead to faster and more reliable apps and services that people use every day.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper introduces a simpler EPaxos variant and outlines correctness claims, but leaves several aspects missing, uncertain, or unexplored. Future researchers can address the following concrete gaps:

  • Mechanized verification: Provide a complete, machine-checked specification (e.g., TLA+, Coq/Isabelle) of the full non-thrifty and thrifty variants, including recovery, Validate/Waiting logic, and all edge cases; ensure the proof is aligned with the executable pseudocode.
  • Recovery nomination and contention: Fully specify and prove the coordinator nomination mechanism and its interaction with failure detectors (e.g., Ω), including scenarios with simultaneous recoveries, dueling coordinators, and how ballot ownership is ensured without ambiguity.
  • Nop resubmission semantics: Define the client-side and replica-side protocol for resubmitting commands that were committed as Nop, including deduplication, idempotence, exactly-once semantics, and how to preserve linearizability under retries.
  • Durability and crash-restart: Clarify the persistent state requirements (e.g., durability of bal[], abal[], identifiers, dependency state) to ensure correctness under process crashes and restarts; specify how the system recovers state after reboot without violating invariants.
  • Performance evaluation: Empirically quantify steady-state throughput/latency, fast-path hit rate, and recovery overhead (including Validate and Waiting messages) compared to original EPaxos, Generalized Paxos, and leader-based SMR (e.g., Paxos/Raft), especially for WAN deployments with n ∈ {3,5}.
  • Fast quorum selection strategy: Specify practical algorithms for selecting communication targets in the non-thrifty protocol under failures and varying latency, and evaluate the trade-off between thrifty and non-thrifty modes (robustness vs. fast-path frequency).
  • Dependency-graph maintenance: Design and evaluate efficient, incremental algorithms for SCC computation and execution ordering, including garbage collection policies, memory bounds for dep[], and handling long-lived cycles.
  • Conflict detection in practice: Provide methods and tooling to derive, validate, and enforce commutativity/conflict rules at runtime for real applications; analyze the impact of misclassification (false commutes/conflicts) on safety and liveness.
  • Liveness under infinite submissions: The protocol only guarantees liveness with finitely many submitted commands; investigate protocol modifications or scheduling/backpressure mechanisms that ensure progress (avoid indefinite waiting/Nop churn) under continuous high-conflict workloads.
  • Reconfiguration and dynamic membership: Extend EPaxos to support reconfiguration (joins/leaves, rolling upgrades) while preserving f-resilience and e-fast properties; analyze how the lower bound n ≥ max{2e+f−1, 2f+1} adapts under reconfiguration.
  • Network realism: Assess robustness to message loss, retransmissions, reordering, and partial connectivity (beyond “reliable links”); specify retransmit/timeout policies and their effect on fast-path conditions and recovery correctness.
  • Parameter selection guidance: Provide analytical and empirical guidance on choosing (n, f, e) for target availability and latency profiles, including recommended defaults and sensitivity analysis across geographic placements.
  • Byzantine tolerance: Explore whether the EPaxos approach can be extended to Byzantine faults while preserving leaderless fast decisions; derive appropriate quorum sizes and lower bounds analogous to the crash-fault case.
  • Timing assumptions: Examine how misestimation of Δ and unknown GST in partial synchrony impact fast-path timeliness and failure suspicion; propose adaptive calibration or Δ-free fast-path analyses.
  • Starvation/fairness under contention: Analyze whether contentious commands can be perpetually Nop’ed or delayed due to Waiting/validation dependencies; provide fairness guarantees or bounded retry strategies.
  • Transactional integration: Detail how EPaxos composes with multi-key transactions and cross-shard dependencies (e.g., in systems like Cassandra Accord), including mapping command-level dependencies to transactional graphs.
  • Alternatives to Nop: Investigate safer or more performant conflict-resolution mechanisms than committing Nop (e.g., serialization fences, dynamic priority rules) and quantify their impact on throughput and ordering guarantees.
  • Recovery complexity bounds: Characterize worst-case time and message complexity of the recovery (especially Validate and Waiting phases) as functions of n, f, e, and conflict density; identify scalability bottlenecks.
  • Adversarial workload resilience: Consider denial-of-service scenarios where clients deliberately trigger conflicts to force slow paths or Nops; develop mitigations (rate limiting, prioritization, conflict clustering).
  • Implementation artifacts: Provide a reproducible, open-source reference implementation aligned with the paper’s pseudocode, including tests that cover failure and recovery corner cases; contrast with the original EPaxos artifacts.
  • Comparative analysis depth: Deliver a systematic, quantitative comparison against EPaxos, Generalized Paxos, Caesar, Tempo, and Atlas, beyond bug-fixing—cover feature differences, quorum sizes, fast-path applicability, and fault tolerance envelopes.
  • Geo-distributed quorum composition: Study fast-quorum placement strategies across regions to minimize WAN latency while meeting e-fast guarantees; quantify trade-offs between quorum size, cross-region latency, and failure domains.
  • Extended fault models: Address omission faults, network partitions beyond f, and correlated failures; specify safety/liveness behavior and recovery strategies under such conditions.
  • Application-level commutativity frameworks: Provide reusable patterns/APIs for applications to declare commutativity, with verification support and runtime enforcement to reduce developer burden and prevent specification drift.

Practical Applications

Immediate Applications

The following applications can be deployed now by leveraging EPaxos’s simplified, rigorously proven leaderless state-machine replication, its fast-path for commuting operations, and its optimal trade-off between failure tolerance and latency:

  • Industry (Software/Cloud Databases): Multi-master, geo-distributed replication for key-value stores, document stores, and metadata services
    • What: Replace or augment leader-based SMR (e.g., Paxos/Raft) with EPaxos to reduce client-facing write latency, avoid leader bottlenecks, and sustain throughput under arbitrary crashes.
    • Where: NoSQL (e.g., Cassandra-like architectures), KV stores, object stores, service registries, replicated configuration services.
    • Tools/Workflows: EPaxos library with APIs to submit commands and declare conflict classes; client libraries that route writes to nearest replica; operational dashboards tracking fast-path hit rate and recovery events.
    • Assumptions/Dependencies: Crash fault tolerance only (not Byzantine); partial synchrony model and eventual reliable links; command commutativity classification available in the application; n, e, f chosen to satisfy n ≥ max{2e+f−1, 2f+1}; durable storage of protocol state; a failure detector (e.g., Ω-like) to trigger recovery.
  • Industry (Microservices/SaaS): Active-active, write-anywhere deployments across regions
    • What: Reduce tail latency for write operations and eliminate leader failover stalls in multi-region SaaS platforms.
    • Where: User profile services, feature flags, entitlement stores, configuration distribution.
    • Tools/Workflows: Per-command conflict tagging (e.g., idempotent PUTs, shard-local counters); systematic measurement of commuting-operation rate to drive design and SLOs.
    • Assumptions/Dependencies: Non-concurrent conflicting operations required to hit the fast path; careful definition of conflict sets to avoid false conflicts and ensure safety.
  • Industry (Edge/IoT): Leaderless replication for gateways and small edge clusters
    • What: Keep local services consistent across a small set of devices (n ∈ {3, 5}) in the presence of crashes, with low latency for nearby clients.
    • Where: Industrial IoT control, home automation hubs, retail edge compute.
    • Tools/Workflows: Lightweight EPaxos implementation, device identity-based command IDs, local fast-quorum configuration for n=3 or n=5 clusters.
    • Assumptions/Dependencies: Stable membership, reliable links after GST; crash-only failures; application-level commutativity for common operations.
  • Industry (FinTech/Internal Control Planes): High-availability control plane services with predictable write latency
    • What: Use EPaxos in internal, permissioned environments to replicate configuration and risk policies without leader hot spots.
    • Where: Policy stores, entitlement managers, feature toggles, deployment coordinators.
    • Tools/Workflows: Operational playbooks for quorum sizing (e.g., n=5, f=2, e=1); latency budgets tied to Δ (message delay bound).
    • Assumptions/Dependencies: Not suitable for adversarial settings (no BFT); requires robust conflict detection and durable state.
  • Academia (Education/Research): Teaching and evaluating consensus with rigorous proofs and simpler recovery
    • What: Use EPaxos as a teaching exemplar and research baseline with proven correctness and an optimal e-fast/f-resilient trade-off.
    • Where: Distributed systems courses, research labs.
    • Tools/Workflows: Formal specs (e.g., TLA+)/model-checking assignments; experiment harnesses comparing EPaxos vs Paxos/Raft vs Generalized Paxos under controlled workloads.
    • Assumptions/Dependencies: Availability of formal tools; well-defined conflict relations for test workloads.
  • Industry (Observability/Operations): Runtime invariant checking and recovery telemetry
    • What: Operational monitoring targeting invariants (Agreement, Visibility), fast-path hit rates, and recovery behavior.
    • Where: SRE dashboards, incident response workflows.
    • Tools/Workflows: Metrics exporters for quorum health, ballots, dependency graph sizes; alerts for recovery deadlocks; validation of the “e-fast” assumptions in production.
    • Assumptions/Dependencies: Accurate instrumentation; persistent storage for crash recovery.
  • Daily Life (User Experience): Lower write latency in global apps
    • What: Faster acknowledgement for user actions (e.g., settings changes, content posts) when operations commute.
    • Where: Collaborative apps, social platforms, cloud storage metadata updates.
    • Tools/Workflows: Client-side routing to nearest replica; operation design favoring commutativity (e.g., additive counters, idempotent updates).
    • Assumptions/Dependencies: High fast-path hit rate depends on workload design; partial synchrony and stable quorums; consistent conflict classification.
  • Industry (Compatibility/Modernization): Correct, clarified successor to original EPaxos
    • What: Replace ambiguous or buggy EPaxos variants in existing systems with the simpler, proven EPaxos recovery and correct ballot handling.
    • Where: Systems inspired by EPaxos (e.g., leaderless transaction coordination) that currently suffer from ambiguous recovery semantics.
    • Tools/Workflows: Migration guides; regression tests validating Agreement and Visibility across failure scenarios; phased rollout.
    • Assumptions/Dependencies: Backward-compatible wire formats or adapters; thorough testing under crash scenarios.

Long-Term Applications

The following applications require further research, engineering, scaling, or ecosystem development before broad deployment:

  • Industry (Managed Cloud Services): “Consensus-as-a-Service” with tunable e/f parameters
    • What: A managed leaderless SMR service that lets customers choose n, e, f per cluster to balance fast-path fault tolerance and overall resilience.
    • Where: Cloud platforms offering replicated logs/state machines as a service.
    • Tools/Products: SDKs, service-level dashboards, automatic tuning based on failure patterns and commutativity metrics.
    • Assumptions/Dependencies: Dynamic reconfiguration and membership changes with formal safety proofs; operational maturity across diverse workloads.
  • Industry (Databases/Transactions): Broad integration into mainstream relational/NoSQL systems
    • What: Deep integration of EPaxos into transaction layers (e.g., multi-master SQL), with commutativity-aware concurrency control to maximize fast-path usage.
    • Where: Distributed SQL, OLTP services, hybrid transactional/analytical systems.
    • Tools/Products: Transaction planners that infer/annotate commutativity; dependency-graph-aware schedulers; migration tools from leader-based replication.
    • Assumptions/Dependencies: Robust application-level conflict detection; schema and API design that expose commutativity; thorough performance validation.
  • Academia (Formal Methods): End-to-end verification pipelines and continuous correctness
    • What: CI pipelines combining EPaxos’s invariants with automated model checking and runtime invariants to prevent regressions.
    • Where: Safety-critical or compliance-heavy domains (e.g., healthcare, finance).
    • Tools/Products: Property-driven code generators; runtime monitors that enforce Agreement/Visibility online.
    • Assumptions/Dependencies: Scalable verification for production code; developer adoption of formal annotations.
  • Policy (Standards/Procurement): Requirements for formally proven consensus in public-sector systems
    • What: Guidance that prioritizes protocols with rigorous correctness proofs for national registries, health records, and critical infrastructure control planes.
    • Where: Government digital services, regulated industries.
    • Tools/Products: Reference specifications; certification frameworks evaluating liveness/safety under crash faults and WAN deployments.
    • Assumptions/Dependencies: Policy acceptance and standardization; training and capacity building to adopt leaderless SMR.
  • Industry (Adaptive Systems): Automated commutativity detection and fast-path optimization
    • What: Toolchains that infer conflicts from code or schemas to maximize fast-path execution automatically.
    • Where: Large microservice ecosystems, data platforms with diverse operations.
    • Tools/Products: Static analyzers/DSLs for conflict classes; runtime learning systems that refine conflict sets while preserving safety.
    • Assumptions/Dependencies: Sound analysis to avoid unsafe misclassifications; governance for evolving conflict definitions.
  • Industry (Reconfiguration/Upgrades): Seamless membership changes and rolling updates under EPaxos
    • What: Support for adding/removing replicas and upgrading software without losing fast-path guarantees.
    • Where: Cloud clusters, edge networks, enterprise data centers.
    • Tools/Products: Reconfiguration protocols proven safe for leaderless SMR; operational tooling for phased rollouts.
    • Assumptions/Dependencies: Formal proofs for dynamic membership; careful handling of ballots across configuration epochs.
  • Cross-Sector (Robotics/Energy/Healthcare): Small-cluster control systems needing high availability without leaders
    • What: Consistent, crash-tolerant control state for supervisory systems (e.g., hospital device coordination, microgrid controllers, autonomous fleet managers).
    • Where: Safety-critical operational tech.
    • Tools/Products: Hardened EPaxos implementations with deterministic execution; domain-specific conflict models.
    • Assumptions/Dependencies: Strict latency and synchrony constraints; certification and validation requirements; robust failure detectors and durable state.
  • Research (Beyond Crash Faults): Extending ideas toward Byzantine resilience or hybrid fault models
    • What: Explore whether EPaxos’s recovery simplifications and dependency-graph approach can inform BFT protocols or mixed fault models.
    • Where: Permissioned blockchain or adversarial environments.
    • Tools/Products: Novel protocol designs; proofs of optimality under extended fault models.
    • Assumptions/Dependencies: Significant theoretical advances; performance trade-offs and larger quorums likely.
  • Industry (Network/Hardware Acceleration): Fast-path optimization with programmable networks
    • What: Use SmartNICs or in-network aggregation to accelerate quorum collection and validation steps.
    • Where: Latency-sensitive clusters and WAN deployments.
    • Tools/Products: NIC offloads for message routing and quorum counting; Δ-aware network scheduling.
    • Assumptions/Dependencies: Hardware support; careful co-design preserving safety/liveness.
  • Education (Curriculum/Community): EPaxos as a canonical, teachable, leaderless consensus protocol
    • What: Standardize EPaxos in curricula and community materials as the go-to example of correct leaderless SMR.
    • Where: Universities, professional training.
    • Tools/Products: Labs, visualizers for dependency graphs, recovery simulations.
    • Assumptions/Dependencies: Widespread teaching materials; open-source reference implementations and proofs.

Notes on Key Assumptions and Dependencies

  • Model and Faults: EPaxos is crash-fault tolerant (not Byzantine). It assumes partial synchrony with an unknown GST and a known message-delay bound Δ; reliable links after GST; durable state across crashes.
  • Quorums and Parameters: Fast-path guarantees require n ≥ max{2e+f−1, 2f+1}. Choosing n, e, f involves trade-offs between fast-path robustness and overall resiliency.
  • Fast Path Conditions: A command executes fast when there are ≤ e crashes, the run is synchronous, and no concurrent conflicting commands exist (commutativity is critical).
  • Commutativity and Dependency Graphs: Applications must define conflict relations correctly. Over-approximation reduces fast-path usage; under-approximation endangers safety.
  • Recovery and Ballots: Correct ballot tracking (both current and last accepted) and nomination mechanisms are required; failure detectors identify coordinators to recover commands.
  • Scope of Use: EPaxos is most impactful for small clusters (n ∈ {3, 5}) and WAN deployments where an extra message delay costs hundreds of milliseconds.

Glossary

  • Agreement (Invariant): A consensus property ensuring all processes commit the same payload and dependency set for a command. "If a command is committed at two processes with dependency sets D and D' and payloads c and c', then D = D' and c = c'."
  • ballot: A numbered round, owned by a coordinator, that partitions the protocol's progress for a specific command. "As in Paxos, the lifecycle of the command is divided into a series of ballots, each managed by a designated coordinator."
  • commute: A property of two commands that can be executed in either order yielding the same state and outputs. "provided c commutes with all concurrently submitted commands"
  • conflict: The relation between commands that do not commute and therefore must be ordered consistently. "If c and c' do not commute, we say that they conflict."
  • coordinator: The process responsible for driving agreement on a command's dependencies; may change via recovery. "In EPaxos a client submits its command to one of the processes, called the initial coordinator, which will drive the agreement on the command's dependencies."
  • dependency graph: A directed graph whose vertices are commands and edges constrain execution order. "EPaxos implements SMR by getting the processes to agree on a dependency graph -- a directed graph whose vertices are application commands and edges represent constraints on their execution order."
  • dependency set: For a command, the set of identifiers of its direct dependencies in the dependency graph. "Processes also maintain local copies of the dependency graph in an array dep, which maps the identifier of a command to the set of identifiers of its dependencies (a dependency set)."
  • E-faulty synchronous run: A synchronous execution model where exactly E processes crash at the start, and messages are delivered round-by-round.
  • e-fast: A property that guarantees conflict-free commands complete in two message delays despite up to e failures. "A protocol is e-fast if for all E of size e, every E-faulty synchronous run of the protocol is fast."
  • Egalitarian Paxos (EPaxos): A leaderless SMR protocol enabling replicas to collaboratively order commands and achieve fast decisions. "Egalitarian Paxos (EPaxos) introduced a leaderless approach that allows processes to order commands collaboratively."
  • fast path: The protocol branch that commits a command in two message delays under favorable conditions. "In this case the protocol takes a fast path, which contacts a fast quorum of n − ⌈(f+1)/2⌉ = f + ⌊(f+1)/2⌋ processes."
  • fast quorum: A quorum large enough to enable the fast path; typically of size at least n−e. "quorum Q is fast, i.e., contains at least n−e processes"
  • fault tolerance: The ability of a system to continue functioning despite process crashes.
  • Generalized Paxos: A Paxos variant supporting fast decisions under fewer failures than EPaxos. "Generalized Paxos decides fast when the number of failures does not exceed ⌊f/2⌋"
  • Global Stabilization Time (GST): The unknown time after which message delays are bounded and clocks are accurate. "after some global stabilization time (GST) messages take at most Δ units of time to reach their destinations."
  • leaderless consensus: Consensus achieved without a single distinguished leader process. "We thank Benedict Elliot Smith, whose work on leaderless consensus in Apache Cassandra inspired this paper."
  • linearizability: A correctness criterion where operations appear to occur atomically in an order consistent with real time. "An SMR protocol coordinates the execution of client commands at the processes to ensure that the system is linearizable"
  • liveness: The property that commands eventually execute at all correct processes under finite submissions. "In any execution with finitely many submitted commands, if a command is submitted by a correct process or executed at some process, then it is eventually executed at all correct processes."
  • Nop: A special placeholder payload used during recovery; it is not executed and conflicts with all commands. "committing a command with a special Nop payload that is not executed by the protocol and conflicts with all commands"
  • Omega failure detector: A standard mechanism used to nominate a single coordinator in asynchronous systems. "this nomination is done using standard techniques based on failure detectors"
  • partial synchrony: A timing model where after GST, message delays are bounded but GST itself is unknown. "The system is partially synchronous: after some global stabilization time (GST) messages take at most Δ units of time to reach their destinations."
  • quorum: A set of processes large enough to ensure intersection and agreement, typically of size at least n−f. "After the coordinator receives PreAcceptOK from a quorum Q of at least n−f processes"
  • recovery protocol: The procedure by which a new coordinator takes over to complete a command after a suspected failure. "If the initial coordinator is suspected of failure, another process executes a recovery protocol to take over as a new coordinator."
  • slow path: The protocol branch requiring an extra round to commit when fast-path conditions are not met. "If the coordinator does not manage to assemble a fast quorum agreeing on the dependencies, it takes a slow path"
  • State-machine replication (SMR): Replicating a deterministic state machine across processes to build a fault-tolerant service. "State-machine replication (SMR) is a classic approach to implementing fault-tolerant services."
  • strongly connected components (SCC): Maximal subgraphs where each vertex is reachable from any other; used to structure execution. "Split G into strongly connected components and sort them in the topological order"
  • thrifty: A variant where a process selects a fast quorum a priori and communicates only with it. "This version of the protocol, which the authors call thrifty, takes the fast path more frequently in failure-free scenarios and is simpler to prove correct."
  • topological order: An ordering of components consistent with dependency edges, used during execution. "Split G into strongly connected components and sort them in the topological order"
  • Visibility (Invariant): Ensures that any two conflicting committed commands are related by a dependency edge.