
Multi-modal Learning for WebAssembly Reverse Engineering (2404.03171v1)

Published 4 Apr 2024 in cs.SE, cs.LG, and cs.PL

Abstract: The increasing adoption of WebAssembly (Wasm) for performance-critical and security-sensitive tasks drives demand for WebAssembly program comprehension and reverse engineering. Recent studies have introduced ML-based WebAssembly reverse engineering tools, yet generalizing such task-specific ML solutions remains challenging because their effectiveness hinges on an ample supply of high-quality, task-specific labeled data. Moreover, prior work overlooks the high-level semantics present in source code and its documentation. Given the abundance of documented source code that can be compiled into WebAssembly, we propose to learn representations of source code, documentation, and WebAssembly concurrently and to harness their mutual relationships for effective WebAssembly reverse engineering. In this paper, we present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering. WasmRev is pre-trained with self-supervised learning on a large-scale multi-modal corpus encompassing source code, code documentation, and the compiled WebAssembly, without requiring labeled data. It incorporates three tailored multi-modal pre-training tasks that capture various characteristics of WebAssembly and its cross-modal relationships. WasmRev is trained only once to produce general-purpose representations that broadly support WebAssembly reverse engineering tasks through few-shot fine-tuning with far less labeled data, improving data efficiency. We fine-tune WasmRev on three important reverse engineering tasks: type recovery, function purpose identification, and WebAssembly summarization. Our results show that WasmRev, pre-trained on this multi-modal corpus, establishes a robust foundation for these tasks, achieving high task accuracy and outperforming state-of-the-art ML methods for WebAssembly reverse engineering.
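
To make the pre-training setup concrete, below is a minimal PyTorch sketch of how a BERT-style encoder might consume a single concatenated (documentation, source code, WebAssembly) sequence and train with a masked-token objective. This is an illustrative reconstruction, not the authors' implementation: the toy vocabulary, special tokens, segment embeddings, layer sizes, and the single masked-language-modeling loss are all simplifying assumptions standing in for WasmRev's three tailored pre-training tasks, which are defined in the paper itself.

```python
# Illustrative sketch only -- NOT the authors' implementation. The toy
# vocabulary, special tokens, layer sizes, and single masked-token loss
# are simplifying assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000                # assumed toy vocabulary
PAD, CLS, SEP, MASK = 0, 1, 2, 3

class MultiModalEncoder(nn.Module):
    """BERT-style encoder over a concatenated (docs, source, wasm) sequence."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(max_len, d_model)
        self.seg = nn.Embedding(3, d_model)       # 0=docs, 1=source, 2=wasm
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, ids, segments):
        pos = torch.arange(ids.size(1), device=ids.device).unsqueeze(0)
        x = self.tok(ids) + self.pos(pos) + self.seg(segments)
        h = self.encoder(x, src_key_padding_mask=(ids == PAD))
        return h, self.mlm_head(h)                # hidden states, MLM logits

def pack_example(doc_ids, src_ids, wasm_ids):
    """[CLS] docs [SEP] source [SEP] wasm [SEP], with per-modality segment ids."""
    ids = [CLS] + doc_ids + [SEP] + src_ids + [SEP] + wasm_ids + [SEP]
    segs = ([0] * (len(doc_ids) + 2) + [1] * (len(src_ids) + 1)
            + [2] * (len(wasm_ids) + 1))
    return torch.tensor([ids]), torch.tensor([segs])

# One masked-token prediction step, standing in for one pre-training task.
model = MultiModalEncoder()
ids, segs = pack_example([10, 11], [20, 21, 22], [30, 31, 32, 33])
labels = ids.clone()
masked = torch.zeros_like(ids, dtype=torch.bool)
masked[0, [2, 5, 9]] = True                       # mask a few content tokens
ids = ids.masked_fill(masked, MASK)
_, logits = model(ids, segs)
loss = nn.functional.cross_entropy(
    logits.view(-1, VOCAB_SIZE)[masked.view(-1)],
    labels.view(-1)[masked.view(-1)])
loss.backward()
print(f"toy MLM loss: {loss.item():.3f}")
```

For few-shot fine-tuning, one would plausibly reuse the pre-trained encoder's hidden states, e.g., feeding the [CLS] representation to a small task head (a classifier for function purpose identification, or a decoder for summarization) trained on a handful of labeled examples, which is the data-efficiency benefit the abstract claims.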
