DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions (2109.01002v4)

Published 2 Sep 2021 in cs.SE

Abstract: Input constraints are useful for many software development tasks. For example, the input constraints of a function enable the generation of valid inputs, i.e., inputs that follow these constraints, to test the function more deeply. API functions of deep learning (DL) libraries have DL-specific input constraints, which are described informally in free-form API documentation. Existing constraint-extraction techniques are ineffective for extracting DL-specific input constraints. To fill this gap, we design and implement a new technique, DocTer, that analyzes API documentation to extract DL-specific input constraints for DL API functions. DocTer features a novel algorithm that automatically constructs rules to extract API parameter constraints from syntactic patterns in the form of dependency parse trees of API descriptions. These rules are then applied to a large volume of API documents in popular DL libraries to extract their input parameter constraints. To demonstrate the effectiveness of the extracted constraints, DocTer uses them to enable the automatic generation of valid and invalid inputs for testing DL API functions. Our evaluation on three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that DocTer extracts input constraints with a precision of 85.4%. DocTer detects 94 bugs in 174 API functions, including one previously unknown security vulnerability that is now documented in the CVE database, while a baseline technique without input constraints detects only 59 bugs. Most (63) of the 94 bugs were previously unknown, and 54 of them have been fixed or confirmed by developers after we reported them. In addition, DocTer detects 43 inconsistencies in documentation, 39 of which have been fixed or confirmed.
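The abstract compresses two technical steps: (1) constructing extraction rules from dependency parse trees of parameter descriptions, and (2) using the extracted constraints to generate conforming and violating inputs for fuzzing. As a rough illustration of the second step only, here is a minimal Python sketch. It is not DocTer's implementation: the `constraint` dict, the `gen_valid`/`gen_invalid` helpers, and the choice of `tf.nn.relu` as the API under test are all hypothetical stand-ins for the paper's constraint categories (dtype, structure, shape, valid values) and its generation engine. It assumes `numpy` and `tensorflow` are installed.

```python
import numpy as np
import tensorflow as tf

# Hypothetical constraints for one tensor parameter, standing in for what
# DocTer would extract from that parameter's documentation.
constraint = {
    "dtype": [np.float32, np.int32],   # allowed element types
    "ndim":  [0, 1, 2, 3, 4],          # allowed ranks
    "range": (-100.0, 100.0),          # plausible value range
}

def gen_valid(c, rng):
    """Draw an input that satisfies every extracted constraint."""
    dtype = rng.choice(c["dtype"])
    rank = rng.choice(c["ndim"])
    shape = tuple(rng.integers(1, 5, size=rank))
    lo, hi = c["range"]
    return np.asarray(rng.uniform(lo, hi, size=shape)).astype(dtype)

def gen_invalid(c, rng):
    """Deliberately violate one constraint (here: an unsupported dtype)."""
    lo, hi = c["range"]
    return rng.uniform(lo, hi, size=(3, 3)).astype(np.complex64)

rng = np.random.default_rng(0)
for make in (gen_valid, gen_invalid):
    x = make(constraint, rng)
    try:
        tf.nn.relu(x)  # the API function under test
        print(f"{make.__name__}: dtype={x.dtype}, shape={x.shape} -> accepted")
    except Exception as e:  # a graceful rejection; a segfault would not be caught here
        print(f"{make.__name__}: dtype={x.dtype}, shape={x.shape} -> {type(e).__name__}")
```

A conforming input that crashes the library and a violating input that is silently accepted rather than rejected are both bug signals. A real harness would also isolate each call in a separate process, since the memory-safety bugs this style of fuzzing surfaces can kill the test process outright.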
