
Finding Cross-rule Optimization Bugs in Datalog Engines (2402.12863v1)

Published 20 Feb 2024 in cs.SE

Abstract: Datalog is a popular and widely used declarative logic programming language. Datalog engines apply many cross-rule optimizations; bugs in them can cause incorrect results. To detect such optimization bugs, we propose an automated testing approach called Incremental Rule Evaluation (IRE), which synergistically tackles the test oracle and test case generation problem. The core idea behind the test oracle is to compare the results of an optimized program and a program without cross-rule optimization; any difference indicates a bug in the Datalog engine. Our core insight is that, for an optimized, incrementally-generated Datalog program, we can evaluate all rules individually by constructing a reference program to disable the optimizations that are performed among multiple rules. Incrementally generating test cases not only allows us to apply the test oracle for every new rule generated; it also lets us ensure that every newly added rule generates a non-empty result with a given probability and eschew recomputing already-known facts. We implemented IRE as a tool named Deopt, and evaluated Deopt on four mature Datalog engines, namely Soufflé, CozoDB, μZ, and DDlog, and discovered a total of 30 bugs. Of these, 13 were logic bugs, while the remaining were crash and error bugs. Deopt can detect all bugs found by queryFuzz, a state-of-the-art approach. Out of the bugs identified by Deopt, queryFuzz might be unable to detect 5. Our incremental test case generation approach is efficient; for example, for test cases containing 60 rules, our incremental approach can produce 1.17× (for DDlog) to 31.02× (for Soufflé) as many valid test cases with non-empty results as the naive random method. We believe that the simplicity and the generality of the approach will lead to its wide adoption in practice.
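
The oracle described in the abstract can be illustrated with a toy sketch in Python. This is not Deopt's implementation: the rule encoding, the naive evaluator `run_program`, the helper names `eval_rule` and `ire_check`, and the deliberately broken `buggy_engine` are all hypothetical, and the sketch assumes that every new rule is positive, non-recursive, and defines a fresh relation. It only shows the shape of the comparison: the full program is run through the engine, the newest rule is run alone over already-known facts, and the two results are compared.

```python
from itertools import product

# Assumed encoding: a rule is (head, body), each atom is (relation, variables).
# Facts map a relation name to a set of tuples.

def eval_rule(rule, facts):
    """Derive the head tuples of one rule by a naive join over its body atoms."""
    (_, head_vars), body = rule
    derived = set()
    # Try every combination of one tuple per body atom.
    for combo in product(*(facts.get(rel, set()) for rel, _ in body)):
        binding, consistent = {}, True
        for (_, atom_vars), tup in zip(body, combo):
            for var, val in zip(atom_vars, tup):
                # A variable must keep the same value across all atoms.
                if binding.setdefault(var, val) != val:
                    consistent = False
        if consistent:
            derived.add(tuple(binding[v] for v in head_vars))
    return derived

def run_program(rules, facts):
    """Stand-in for a Datalog engine: naive fixpoint over all rules."""
    db = {rel: set(tups) for rel, tups in facts.items()}
    changed = True
    while changed:
        changed = False
        for rule in rules:
            head_rel = rule[0][0]
            new = eval_rule(rule, db) - db.get(head_rel, set())
            if new:
                db.setdefault(head_rel, set()).update(new)
                changed = True
    return db

def buggy_engine(rules, facts):
    """Hypothetical engine whose fake cross-rule optimization loses the
    last rule's results whenever the program has more than one rule."""
    db = run_program(rules, facts)
    if len(rules) > 1:
        db[rules[-1][0][0]] = set()
    return db

def ire_check(engine, rules, edb):
    """Add rules one at a time; return the first relation whose result in the
    full program differs from the single-rule reference program, else None."""
    known = {rel: set(tups) for rel, tups in edb.items()}
    program = []
    for rule in rules:
        program.append(rule)
        head_rel = rule[0][0]
        # Full program: cross-rule optimizations may apply.
        optimized = engine(program, edb).get(head_rel, set())
        # Reference program: only the new rule, earlier results given as facts.
        reference = engine([rule], known).get(head_rel, set())
        if optimized != reference:
            return head_rel
        known[head_rel] = reference
    return None

edb = {"edge": {(1, 2), (2, 3)}}
rules = [
    # path2(x, z) :- edge(x, y), edge(y, z).
    (("path2", ("x", "z")), [("edge", ("x", "y")), ("edge", ("y", "z"))]),
    # src(x) :- path2(x, z).
    (("src", ("x",)), [("path2", ("x", "z"))]),
]

print(ire_check(run_program, rules, edb))   # None
print(ire_check(buggy_engine, rules, edb))  # src
```

Feeding earlier results to the reference program as plain facts is what keeps the engine from optimizing across rules in this sketch, and it also means known facts are not recomputed for each new rule.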

