Scalable and Precise Application-Centered Call Graph Construction for Python (2305.05949v5)
Abstract: Call graph construction is the foundation of inter-procedural static analysis. PyCG is the state-of-the-art approach for constructing call graphs for Python programs. Unfortunately, PyCG does not scale to large programs when adapted to whole-program analysis, where both the application and its dependent libraries are analyzed. Moreover, PyCG is flow-insensitive and does not fully support Python's features, which hinders its accuracy. To overcome these drawbacks, we propose a scalable and precise approach for constructing application-centered call graphs for Python programs, and implement it as a prototype tool, JARVIS. JARVIS maintains a type graph (i.e., type relations of program identifiers) for each function in a program to allow type inference. Taking one function as input, JARVIS generates the call graph on-the-fly: flow-sensitive intra-procedural analysis and inter-procedural analysis are conducted in turn, with strong updates applied. Our evaluation on a micro-benchmark of 135 small Python programs and a macro-benchmark of 6 real-world Python applications demonstrates that JARVIS significantly improves on PyCG: it is at least 67% faster, 84% higher in precision, and at least 20% higher in recall.
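To make the contrast between flow-insensitive analysis and flow-sensitive analysis with strong updates concrete, the following is a minimal sketch on a toy straight-line program. It is an illustration only, not the implementation of JARVIS or PyCG; the statement encoding, the function names (`flow_insensitive_edges`, `flow_sensitive_edges`), and the callee names (`parse_json`, `parse_yaml`) are all hypothetical.

```python
from collections import defaultdict

# Toy straight-line program (hypothetical encoding): each statement is
# ("assign", variable, function_name) or ("call", variable).
PROGRAM = [
    ("assign", "f", "parse_json"),
    ("call", "f"),
    ("assign", "f", "parse_yaml"),
    ("call", "f"),
]


def flow_insensitive_edges(program, caller="main"):
    """Ignore statement order: a call may reach any value ever assigned."""
    points_to = defaultdict(set)
    for stmt in program:
        if stmt[0] == "assign":
            _, var, func = stmt
            points_to[var].add(func)  # values only accumulate
    edges = []
    for index, stmt in enumerate(program):
        if stmt[0] == "call":
            for callee in sorted(points_to[stmt[1]]):
                edges.append((index, caller, callee))
    return edges


def flow_sensitive_edges(program, caller="main"):
    """Walk statements in order; an assignment replaces the old value set."""
    points_to = {}
    edges = []
    for index, stmt in enumerate(program):
        if stmt[0] == "assign":
            _, var, func = stmt
            points_to[var] = {func}  # strong update: overwrite, do not merge
        else:
            for callee in sorted(points_to.get(stmt[1], set())):
                edges.append((index, caller, callee))
    return edges


if __name__ == "__main__":
    print("flow-insensitive:", flow_insensitive_edges(PROGRAM))
    print("flow-sensitive:  ", flow_sensitive_edges(PROGRAM))
```

In this sketch the flow-insensitive variant reports both callees at each call site (two spurious edges), while the flow-sensitive variant with strong updates reports exactly one callee per call site. This is the kind of precision gap the abstract refers to, shown here under heavily simplified assumptions (no branches, loops, or inter-procedural flow).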