Redundancy and Concept Analysis for Code-trained Language Models (2305.00875v2)

Published 1 May 2023 in cs.SE, cs.AI, and cs.LG

Abstract: Code-trained LLMs have proven to be highly effective for various code intelligence tasks. However, they can be challenging to train and deploy for many software engineering applications due to computational bottlenecks and memory constraints. Implementing effective strategies to address these issues requires a better understanding of these 'black box' models. In this paper, we perform the first neuron-level analysis for source code models to identify 'important' neurons within latent representations. We achieve this by eliminating neurons that are highly similar or irrelevant to the given task. This approach helps us understand which neurons and layers can be eliminated (redundancy analysis) and where important code properties are located within the network (concept analysis). Using redundancy analysis, we make observations relevant to knowledge transfer and model optimization applications. We find that over 95% of the neurons are redundant with respect to our code intelligence tasks and can be eliminated without significant loss in accuracy. We also discover several subsets of neurons that can make predictions with baseline accuracy. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts within latent code representations which could be used to influence model predictions. We trace individual and subsets of important neurons to specific code properties and identify 'number' neurons, 'string' neurons, and higher-level 'text' neurons for token-level tasks and higher-level concepts important for sentence-level downstream tasks. This also helps us understand how decomposable and transferable task-related features are and can help devise better techniques for transfer learning, model compression, and the decomposition of deep neural networks into modules.
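
The core idea behind the redundancy and concept analyses can be illustrated with a minimal sketch: take per-token activations (in practice, hidden states from a pre-trained code model such as CodeBERT), greedily drop neurons whose activations are near-duplicates of already-kept neurons, and verify with a simple linear probe that the surviving subset still predicts the property of interest; ranking probe weights then points to the 'important' neurons for that property. The 0.9 correlation threshold, the synthetic activation matrix, and the helper names below are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of correlation-based neuron redundancy analysis plus a
# linear-probe check, under the assumptions stated above. The activation
# matrix is synthetic so the sketch runs standalone; in practice it would
# hold hidden states extracted from a code model (e.g., CodeBERT).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for token-level activations and labels
# (e.g., token type: number vs. other).
n_tokens, n_neurons = 2000, 256
X = rng.normal(size=(n_tokens, n_neurons))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n_tokens)  # deliberately redundant neuron
y = (X[:, 0] > 0).astype(int)                         # label driven by neuron 0


def filter_redundant_neurons(acts: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedily keep one neuron per group of highly correlated neurons."""
    corr = np.abs(np.corrcoef(acts, rowvar=False))
    kept: list[int] = []
    for i in range(acts.shape[1]):
        if all(corr[i, j] < threshold for j in kept):
            kept.append(i)
    return kept


def probe_accuracy(acts: np.ndarray, labels: np.ndarray) -> float:
    """Train a linear probe on the given neurons and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)


kept = filter_redundant_neurons(X, threshold=0.9)
print(f"kept {len(kept)}/{n_neurons} neurons after redundancy filtering")
print(f"probe accuracy, all neurons : {probe_accuracy(X, y):.3f}")
print(f"probe accuracy, kept neurons: {probe_accuracy(X[:, kept], y):.3f}")

# Concept-analysis flavour: rank neurons by probe weight magnitude to find
# those most associated with the probed property (e.g., 'number' tokens).
probe = LogisticRegression(max_iter=1000).fit(X, y)
top_neurons = np.argsort(-np.abs(probe.coef_[0]))[:10]
print("top neurons for this property:", top_neurons)
```

The greedy correlation filter mirrors the paper's notion of eliminating highly similar neurons; replacing the synthetic matrix with real hidden states extracted via the Hugging Face transformers library would turn this sketch into a practical experiment.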

Authors (4)
  1. Arushi Sharma (8 papers)
  2. Zefu Hu (1 paper)
  3. Christopher Quinn (1 paper)
  4. Ali Jannesari (56 papers)
Citations (1)