Exploiting Code Symmetries for Learning Program Semantics (2308.03312v9)
Abstract: This paper tackles the challenge of teaching code semantics to LLMs for program analysis by incorporating code symmetries into the model architecture. We introduce a group-theoretic framework that defines code symmetries as semantics-preserving transformations, where forming a code symmetry group enables precise and efficient reasoning about code semantics. Our solution, SymC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. SymC achieves superior performance on five program analysis tasks, outperforming state-of-the-art code models without any pre-training. Our results suggest that code LLMs that encode the structural prior of code via the code symmetry group generalize better and faster.
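The central architectural property described above is equivariance of self-attention to permutations of program statements. As a minimal sketch (not the SymC implementation, which restricts the group to permutations that preserve the program dependence graph), the snippet below checks the underlying property: plain single-head self-attention without positional encodings commutes with any permutation of its input rows. All dimensions and weight matrices here are hypothetical placeholders.

```python
# Minimal sketch, assuming a toy setup: verify that standard self-attention
# (no positional encodings) is permutation-equivariant. SymC builds on this
# property but constrains it to the symmetry group of the program dependence
# graph; that restriction is not modeled here.
import numpy as np

rng = np.random.default_rng(0)
d = 8   # embedding dimension (hypothetical)
n = 5   # number of statements / tokens (hypothetical)

W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def self_attention(X):
    """Plain single-head self-attention over the rows of X (shape n x d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V

X = rng.standard_normal((n, d))
perm = rng.permutation(n)        # a permutation acting on statement order
P = np.eye(n)[perm]              # its permutation matrix

# Equivariance: attending to the permuted input equals permuting the output.
lhs = self_attention(P @ X)
rhs = P @ self_attention(X)
print(np.allclose(lhs, rhs))     # True
```

Because the row-wise softmax commutes with simultaneous row and column permutations of the score matrix, the check prints `True` for every permutation; restricting attention so that only dependence-preserving permutations keep this property is the design choice the paper attributes to SymC.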