Exploiting Code Symmetries for Learning Program Semantics (2308.03312v9)

Published 7 Aug 2023 in cs.LG, cs.CR, and cs.PL

Abstract: This paper tackles the challenge of teaching code semantics to LLMs for program analysis by incorporating code symmetries into the model architecture. We introduce a group-theoretic framework that defines code symmetries as semantics-preserving transformations, where forming a code symmetry group enables precise and efficient reasoning of code semantics. Our solution, SymC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. SymC obtains superior performance on five program analysis tasks, outperforming state-of-the-art code models without any pre-training. Our results suggest that code LLMs that encode the code structural prior via the code symmetry group generalize better and faster.
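To make the abstract's core idea concrete, below is a minimal sketch, in PyTorch, of self-attention whose scores are restricted to the edges of a program dependence graph (PDG), together with a numerical check that permuting the statements and the graph consistently permutes the output. This is an illustrative assumption of how such an equivariant layer could look, not the authors' SymC implementation; the function name `pdg_masked_attention`, the toy adjacency matrix, and the single-head formulation are all hypothetical.

```python
# Hypothetical sketch (not the authors' SymC code): single-head self-attention
# restricted to program-dependence-graph (PDG) edges. For any permutation
# matrix P, the layer satisfies attn(P x, P A P^T) == P attn(x, A).
import torch

torch.manual_seed(0)

def pdg_masked_attention(x, adj, w_q, w_k, w_v):
    """Self-attention over statement embeddings x (n x d), with attention
    restricted to edges of the PDG adjacency `adj` (n x n, 0/1, self-loops
    included so every row has at least one admissible key)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    scores = scores.masked_fill(adj == 0, float("-inf"))  # attend only along PDG edges
    return torch.softmax(scores, dim=-1) @ v

# Toy example: 4 statements with a hand-made PDG (self-loops on the diagonal).
n, d = 4, 8
x = torch.randn(n, d)
adj = torch.tensor([[1, 1, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 1, 1],
                    [0, 0, 0, 1]])
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

# A permutation of the statements, applied as a permutation matrix P.
perm = torch.tensor([2, 0, 3, 1])
P = torch.eye(n)[perm]

out = pdg_masked_attention(x, adj, w_q, w_k, w_v)
out_perm = pdg_masked_attention(P @ x, P @ adj.float() @ P.T, w_q, w_k, w_v)

# Equivariance check: permuting the inputs and the PDG permutes the outputs.
print(torch.allclose(P @ out, out_perm, atol=1e-5))  # True
```

The check prints `True` because row-wise softmax commutes with a simultaneous row and column permutation of the score matrix, so the masked attention layer is equivariant to the permutation group acting on statements and their dependence edges, which is the structural prior the abstract attributes to SymC.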
