Optimizing Distributed Protocols with Query Rewrites [Technical Report] (2404.01593v2)
Abstract: Distributed protocols such as 2PC and Paxos lie at the core of many systems in the cloud, but standard implementations do not scale. New scalable distributed protocols are developed through careful analysis and rewrites, but this process is ad hoc and error-prone. This paper presents an approach for scaling any distributed protocol by applying rule-driven rewrites, borrowing from query optimization. Distributed protocol rewrites entail a new burden: reasoning about spatiotemporal correctness. We leverage order-insensitivity and data dependency analysis to systematically identify correct coordination-free scaling opportunities. We apply this analysis to create preconditions and mechanisms for coordination-free decoupling and partitioning, two fundamental vertical and horizontal scaling techniques. Manual rule-driven applications of decoupling and partitioning improve the throughput of 2PC by $5\times$ and Paxos by $3\times$, and match state-of-the-art throughput in recent work. These results point the way toward automated optimizers for distributed protocols based on correct-by-construction rewrite rules.