Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization (2402.14182v1)
Abstract: Recent LLMs have demonstrated proficiency in summarizing source code. However, as in many other domains of machine learning, LLMs of code lack sufficient explainability: informally, we lack a formulaic or intuitive understanding of what and how models learn from code. One partial form of explainability would be evidence that, as models learn to produce higher-quality code summaries, they also come to deem important the same code parts that human programmers do. In this paper, we report negative results from our investigation of the explainability of LLMs in code summarization through the lens of human comprehension. We measure human focus on code using eye-tracking metrics, such as fixation counts and duration, during code summarization tasks. To approximate LLM focus, we employ SHAP (SHapley Additive exPlanations), a state-of-the-art model-agnostic, black-box, perturbation-based approach, to identify which code tokens influence the generation of summaries. Under these settings, we find no statistically significant relationship between the LLMs' focus and human programmers' attention. Furthermore, alignment between model and human foci in this setting does not appear to dictate the quality of the LLM-generated summaries. Our study highlights an inability to align human focus with SHAP-based measures of model focus. This result calls for future investigation of several open questions for explainable LLMs in code summarization and software engineering tasks in general, including the training mechanisms of LLMs for code, whether there is an alignment between human and model attention on code, whether human attention can improve the development of LLMs, and what other measures of model focus are appropriate for improving explainability.
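The analysis the abstract describes has two measurable halves: a per-token importance score for the model (via SHAP, a perturbation-based Shapley-value method) and per-token fixation statistics for human programmers, compared with a rank correlation (the reference list includes Spearman's classic paper). As a rough illustration of that idea, not the paper's actual implementation, the sketch below approximates Shapley values by permutation sampling; `score_fn`, the `<mask>` placeholder token, and `n_samples` are hypothetical stand-ins for whatever summary-quality scorer and masking scheme one plugs in.

```python
import random

from scipy.stats import spearmanr


def shapley_token_attribution(tokens, score_fn, n_samples=200, mask="<mask>", seed=0):
    """Monte-Carlo (permutation-sampling) approximation of per-token Shapley values.

    tokens:   list of code tokens for one function.
    score_fn: maps a (partially masked) token list to a scalar, e.g. the
              quality of the summary the LLM generates for that input;
              this is an assumed caller-supplied hook, not a real API.
    """
    rng = random.Random(seed)
    n = len(tokens)
    phi = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)            # random permutation of token positions
        current = [mask] * n          # start from fully masked code
        prev_score = score_fn(current)
        for i in order:
            current[i] = tokens[i]    # reveal token i
            cur_score = score_fn(current)
            phi[i] += cur_score - prev_score  # marginal contribution of token i
            prev_score = cur_score
    return [p / n_samples for p in phi]


def focus_alignment(model_focus, fixation_counts):
    """Rank-correlate model attributions with human fixation counts per token."""
    rho, p_value = spearmanr(model_focus, fixation_counts)
    return rho, p_value
```

Correlating the resulting attributions with a participant's fixation counts over the same tokens yields the kind of alignment statistic the study then tests for significance.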
- Tobii Pro Fusion user manual, Jun 2023.
- Using developer eye movements to externalize the mental model used in code summarization tasks. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications (2019), pp. 1–9.
- A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653 (2020).
- Modeling programmer attention as scanpath prediction. arXiv preprint arXiv:2308.13920 (2023).
- Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning (2018), pp. 302–312.
- srcML: An infrastructure for the exploration, analysis, and manipulation of source code: A tool demonstration. In 2013 IEEE International Conference on Software Maintenance (2013), IEEE, pp. 516–519.
- Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32, 10 (2020), 4291–4308.
- Evaluating feature importance estimates.
- Where to look when repairing code? Comparing the attention of neural models and developers. arXiv preprint arXiv:2305.07287 (2023).
- Improving sentence compression by learning to predict gaze. arXiv preprint arXiv:1604.03357 (2016).
- Is model attention aligned with human attention? An empirical study on large language models for code generation. arXiv preprint arXiv:2306.01220 (2023).
- Recommendations for datasets for source code summarization. arXiv preprint arXiv:1904.02660 (2019).
- StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161 (2023).
- On the reliability and explainability of automated code generation approaches. arXiv preprint arXiv:2302.09587 (2023).
- A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (2017).
- Molnar, C. Interpretable machine learning. Lulu.com, 2020.
- CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
- OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt/, 2023. Accessed: 11/20/2023.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- Extracting meaningful attention on source code: An empirical study of developer and neural model code exploration. arXiv preprint arXiv:2210.05506 (2022).
- Thinking like a developer? Comparing the attention of humans with neural models of code. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2021), IEEE, pp. 867–879.
- An empirical study on the patterns of eye movement during summarization tasks. In 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2015), IEEE, pp. 1–10.
- Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
- Eye-tracking metrics in software engineering. In 2015 Asia-Pacific Software Engineering Conference (APSEC) (2015), IEEE, pp. 96–103.
- Learning important features through propagating activation differences. In International Conference on Machine Learning (2017), PMLR, pp. 3145–3153.
- Improving natural language processing tasks with human gaze-guided neural attention. Advances in Neural Information Processing Systems 33 (2020), 6327–6341.
- Spearman, C. The proof and measurement of association between two things. The American Journal of Psychology 15, 1 (1904), 72–101.
- Axiomatic attribution for deep networks. In International Conference on Machine Learning (2017), PMLR, pp. 3319–3328.
- Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
- WheaCha: A method for explaining the predictions of code summarization models. arXiv preprint arXiv:2102.04625 (2021).
- A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (2022), pp. 1–10.
- An extensive study on pre-trained models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (2022), pp. 39–51.
- What does transformer learn about source code? arXiv preprint arXiv:2207.08466 (2022).
- Using human attention to extract keyphrase from microblog post. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019), pp. 5867–5872.
- Diet code is healthy: Simplifying programs for pre-trained models of code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2022), pp. 1073–1084.
Authors: Jiliang Li, Yifan Zhang, Zachary Karas, Collin McMillan, Kevin Leach, Yu Huang