DyPyBench: A Benchmark of Executable Python Software (2403.00539v1)

Published 1 Mar 2024 in cs.SE

Abstract: Python has emerged as one of the most popular programming languages, extensively utilized in domains such as machine learning, data analysis, and web applications. Python's dynamic nature and extensive usage make it an attractive candidate for dynamic program analysis. However, unlike for other popular languages, there is currently no comprehensive benchmark suite of executable Python projects, which hinders the development of dynamic analyses. This work addresses this gap by presenting DyPyBench, the first benchmark of Python projects that is large-scale, diverse, ready to run (i.e., with fully configured and prepared test suites), and ready to analyze (by integrating with the DynaPyt dynamic analysis framework). The benchmark encompasses 50 popular open-source projects from various application domains, with a total of 681k lines of Python code and 30k test cases. DyPyBench enables various applications in testing and dynamic analysis, of which we explore three in this work: (i) Gathering dynamic call graphs and empirically comparing them to statically computed call graphs, which exposes and quantifies limitations of existing call graph construction techniques for Python. (ii) Using DyPyBench to build a training data set for LExecutor, a neural model that learns to predict values that otherwise would be missing at runtime. (iii) Using dynamically gathered execution traces to mine API usage specifications, which establishes a baseline for future work on specification mining for Python. We envision DyPyBench to provide a basis for other dynamic analyses and for studying the runtime behavior of Python code.
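
Of the three applications, the comparison of dynamic and static call graphs is the most mechanical, so a small illustration may help. The paper gathers call graphs through the DynaPyt framework; the sketch below is only a minimal stand-in that uses Python's standard sys.settrace hook, and the function names and recall metric are illustrative choices, not the paper's implementation.

```python
import sys

def collect_dynamic_call_graph(entry_point):
    """Run entry_point under a tracing hook and record caller -> callee edges.

    sys.settrace only sees pure-Python frames; calls into C extensions are
    invisible, so the recorded edge set is a lower bound on actual calls.
    """
    edges = set()

    def tracer(frame, event, arg):
        if event == "call" and frame.f_back is not None:
            # co_name conflates same-named methods from different classes;
            # a real analysis would use qualified names instead.
            edges.add((frame.f_back.f_code.co_name, frame.f_code.co_name))
        return None  # no per-line tracing needed

    sys.settrace(tracer)
    try:
        entry_point()
    finally:
        sys.settrace(None)
    return edges

def static_call_graph_recall(dynamic_edges, static_edges):
    """Treat dynamically observed edges as ground truth for executed paths.

    Edges that appear dynamically but are absent statically expose false
    negatives of the static call graph construction.
    """
    if not dynamic_edges:
        return 1.0
    return len(dynamic_edges & static_edges) / len(dynamic_edges)
```

Running collect_dynamic_call_graph on a test entry point yields edges such as ('test_parse', 'parse'); comparing them against the edge set of a static tool in the style of PyCG (reference 53) then quantifies how many executed calls the static analysis misses.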
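The specification-mining application turns execution traces into candidate API usage rules. The paper builds on sequential pattern mining in the style of PrefixSpan (reference 28); the stand-in below is deliberately simpler, counting only ordered pairs of calls, and the trace data is hypothetical.

```python
from collections import Counter

def mine_pair_rules(traces, min_support=2):
    """Find ordered API-call pairs (a before b) that recur across traces.

    traces: one call sequence per observed execution, e.g.
    [["open", "read", "close"], ...]. The support of a pair is the number
    of traces in which some call to a precedes some call to b.
    """
    support = Counter()
    for trace in traces:
        pairs_in_trace = set()  # count each pair at most once per trace
        for i, first in enumerate(trace):
            for second in trace[i + 1:]:
                pairs_in_trace.add((first, second))
        support.update(pairs_in_trace)
    return {pair: n for pair, n in support.items() if n >= min_support}

# Hypothetical traces of a networking API:
traces = [
    ["socket", "connect", "send", "close"],
    ["socket", "connect", "recv", "close"],
]
print(mine_pair_rules(traces))
# e.g. {('socket', 'connect'): 2, ('connect', 'close'): 2, ('socket', 'close'): 2}
```

A real miner would additionally rank candidate rules by confidence and filter out pairs explained by chance, as Perracotta-style approaches do (reference 68).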

References (70)
  1. DDUO: General-Purpose Dynamic Analysis for Differential Privacy. In 34th IEEE Computer Security Foundations Symposium, CSF 2021, Dubrovnik, Croatia, June 21-25, 2021. IEEE, 1–15. https://doi.org/10.1109/CSF51468.2021.00043
  2. Constructing call graphs of Scala programs. In European Conference on Object-Oriented Programming. Springer, 54–79.
  3. Typilus: Neural Type Hints. In PLDI.
  4. Mining specifications. In Symposium on Principles of Programming Languages (POPL). ACM, 4–16.
  5. Dynamic inference of static types for Ruby. In POPL. 459–472.
  6. David F Bacon and Peter F Sweeney. 1996. Fast static analysis of C++ virtual function calls. In Proceedings of the 11th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications. 324–341.
  7. Triangulating Python Performance Issues with SCALENE. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 51–64.
  8. SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1059–1070. https://doi.org/10.1109/ICSE48619.2023.00096
  9. Christian Bienia. 2011. Benchmarking Modern Multiprocessors. Ph.D. Dissertation. Princeton University.
  10. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 169–190.
  11. An infrastructure for adaptive dynamic optimization. In International Symposium on Code Generation and Optimization, 2003. CGO 2003. IEEE, 265–275.
  12. Automatic root cause quantification for missing edges in JavaScript call graphs. In 36th European Conference on Object-Oriented Programming (ECOOP 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
  13. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  14. Dynamic Slicing of Python Programs. In IEEE 38th Annual Computer Software and Applications Conference, COMPSAC 2014, Vasteras, Sweden, July 21-25, 2014. IEEE Computer Society, 219–228. https://doi.org/10.1109/COMPSAC.2014.30
  15. Dynamic slicing of Python programs. In 2014 IEEE 38th Annual Computer Software and Applications Conference. IEEE, 219–228.
  16. Dytan: a generic dynamic taint analysis framework. In International Symposium on Software Testing and Analysis (ISSTA). ACM, 196–206.
  17. BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. CoRR abs/1903.06725 (2019). arXiv:1903.06725 http://arxiv.org/abs/1903.06725
  18. LAVA: Large-Scale Automated Vulnerability Addition. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016. 110–121.
  19. Blanket Execution: Dynamic Similarity Testing for Program Binaries and Components. In Proceedings of the 23rd USENIX Security Symposium, San Diego, CA, USA, August 20-22, 2014. 303–317.
  20. Aryaz Eghbali and Michael Pradel. 2022. DynaPyt: A Dynamic Analysis Framework for Python. In ESEC/FSE ’22: 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM.
  21. Efficient construction of approximate call graphs for JavaScript IDE services. In 35th International Conference on Software Engineering, ICSE ’13, San Francisco, CA, USA, May 18-26, 2013. 752–761.
  22. Cormac Flanagan and Stephen N. Freund. 2004. Atomizer: a dynamic atomicity checker for multithreaded programs. In Symposium on Principles of Programming Languages (POPL). ACM, 256–267.
  23. Cormac Flanagan and Stephen N. Freund. 2010. The RoadRunner dynamic analysis framework for concurrent programs. In Workshop on Program Analysis for Software Tools and Engineering (PASTE). ACM, 1–8.
  24. DLint: Dynamically Checking Bad Coding Practices in JavaScript. In International Symposium on Software Testing and Analysis (ISSTA). 94–105.
  25. Luca Di Grazia and Michael Pradel. 2022. The Evolution of Type Annotations in Python: An Empirical Study.. In ESEC/FSE.
  26. An Empirical Study of Flaky Tests in Python. In 14th IEEE Conference on Software Testing, Verification and Validation, ICST 2021, Porto de Galinhas, Brazil, April 12-16, 2021. IEEE, 148–158. https://doi.org/10.1109/ICST49551.2021.00026
  27. BugsJS: A benchmark of JavaScript bugs. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). IEEE, 90–101.
  28. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering. IEEE, 215–224.
  29. Magma: A ground-truth fuzzing benchmark. Proceedings of the ACM on Measurement and Analysis of Computing Systems 4, 3 (2020), 1–29.
  30. John L. Henning. 2006. SPEC CPU2006 benchmark descriptions. SIGARCH Computer Architecture News 34, 4 (2006), 1–17.
  31. Matthias Höschele and Andreas Zeller. 2016. Mining input grammars from dynamic taints. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, Singapore, September 3-7, 2016. 720–725.
  32. Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In International Symposium on Software Testing and Analysis, ISSTA ’14, San Jose, CA, USA - July 21 - 26, 2014. 437–440.
  33. Daniel Lehmann and Michael Pradel. 2019. Wasabi: A Framework for Dynamically Analyzing WebAssembly. In ASPLOS.
  34. That’s a Tough Call: Studying the Challenges of Call Graph Construction for WebAssembly. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023. 892–903. https://doi.org/10.1145/3597926.3598104
  35. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
  36. Pin: Building customized program analysis tools with dynamic instrumentation. ACM SIGPLAN Notices 40, 6 (2005), 190–200.
  37. Stephan Lukasczyk. 2019. Generating Tests to Analyse Dynamically-Typed Programs. In 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019. IEEE, 1226–1229. https://doi.org/10.1109/ASE.2019.00146
  38. DiSL: a domain-specific language for bytecode instrumentation. In Proceedings of the 11th International Conference on Aspect-oriented Software Development, AOSD 2012, Potsdam, Germany, March 25-30, 2012, Robert Hirschfeld, Éric Tanter, Kevin J. Sullivan, and Richard P. Gabriel (Eds.). ACM, 239–250. https://doi.org/10.1145/2162049.2162077
  39. An empirical study of static call graph extractors. ACM Transactions on Software Engineering and Methodology (TOSEM) 7, 2 (1998), 158–191.
  40. Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. In Conference on Programming Language Design and Implementation (PLDI). ACM, 89–100.
  41. Robert O’Callahan and Jong-Deok Choi. 2003. Hybrid dynamic data race detection. In Symposium on Principles and Practice of Parallel Programming (PPOPP). ACM, 167–178.
  42. Jibesh Patra and Michael Pradel. 2021. Semantic bug seeding: a learning-based approach for creating realistic bugs. In ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, Diomidis Spinellis, Georgios Gousios, Marsha Chechik, and Massimiliano Di Penta (Eds.). ACM, 906–918. https://doi.org/10.1145/3468264.3468623
  43. StateFormer: Fine-grained type recovery from binaries using generative state modeling. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 690–702.
  44. Trex: Learning execution semantics from micro-traces for binary similarity. arXiv preprint arXiv:2012.08680 (2020).
  45. Michael Pradel and Satish Chandra. 2022. Neural software analysis. Commun. ACM 65, 1 (2022), 86–96. https://doi.org/10.1145/3460348
  46. TypeWriter: Neural Type Prediction with Search-based Validation. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020. 209–220. https://doi.org/10.1145/3368089.3409715
  47. Michael Pradel and Thomas R. Gross. 2009. Automatic Generation of Object Usage Specifications from Large Method Traces. In International Conference on Automated Software Engineering (ASE). 371–382.
  48. TypeDevil: Dynamic Type Inconsistency Analysis for JavaScript. In International Conference on Software Engineering (ICSE).
  49. Python 3 Types in the Wild: A Tale of Two Type Systems. In DLS.
  50. Judge: identifying, understanding, and evaluating sources of unsoundness in call graphs. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, Beijing, China, July 15-19, 2019, Dongmei Zhang and Anders Møller (Eds.). ACM, 251–261. https://doi.org/10.1145/3293882.3330555
  51. Automated API property inference techniques. IEEE Transactions on Software Engineering 39, 5 (2012), 613–637.
  52. Bugs.jar: A large-scale, diverse dataset of real-world Java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories. 10–13.
  53. PyCG: Practical Call Graph Generation in Python. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 1646–1657. https://doi.org/10.1109/ICSE43902.2021.00146
  54. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Transactions on Computer Systems 15, 4 (1997), 391–411.
  55. Marija Selakovic and Michael Pradel. 2016. Performance Issues and Optimizations in JavaScript: An Empirical Study. In International Conference on Software Engineering (ICSE). 61–72.
  56. Jalangi: A Selective Record-Replay and Dynamic Analysis Framework for JavaScript. In European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 488–498.
  57. Da Capo con Scala: Design and analysis of a Scala benchmark suite for the Java virtual machine. In Proceedings of the 2011 ACM International Conference on Object-Oriented Programming Systems Languages and Applications. 657–676.
  58. Beatriz Souza and Michael Pradel. 2023. LExecutor: Learning-Guided Execution. In FSE.
  59. On the Recall of Static Call Graph Construction in Practice. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 1049–1060. https://doi.org/10.1145/3377811.3380441
  60. Frank Tip and Jens Palsberg. 2000. Scalable propagation-based call graph construction algorithms. In Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications. 281–293.
  61. Performance Problems You Can Fix: A Dynamic Analysis of Memoization Opportunities. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). 607–622.
  62. Striking a Balance: Pruning False-Positives from Static Call Graphs. In ICSE.
  63. BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1556–1560.
  64. Go with the flow: profiling copies to find runtime bloat. In Conference on Programming Language Design and Implementation (PLDI). ACM, 419–430.
  65. Python predictive analysis for bug detection. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, Thomas Zimmermann, Jane Cleland-Huang, and Zhendong Su (Eds.). ACM, 121–132. https://doi.org/10.1145/2950290.2950357
  66. Python probabilistic type inference with natural language support. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016. 607–618. https://doi.org/10.1145/2950290.2950343
  67. DLInfer: Deep Learning with Static Slicing for Python Type Inference. In ICSE.
  68. Perracotta: Mining temporal API rules from imperfect traces. In International Conference on Software Engineering (ICSE). ACM, 282–291.
  69. GoBench: A benchmark suite of real-world Go concurrency bugs. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 187–199.
  70. Faster or Slower? Performance Mystery of Python Idioms Unveiled with Empirical Evidence. In ICSE. arXiv preprint arXiv:2301.12633.