Perplexed: Understanding When Large Language Models are Confused (2404.06634v1)

Published 9 Apr 2024 in cs.SE

Abstract: LLMs have become dominant in the NLP field, driving rapid progress in a short amount of time. However, their limitations remain poorly understood and have primarily been explored through tailored datasets that analyze a specific human-level skill, such as negation or name resolution. In this paper, we introduce perplexed, a library for exploring where a particular LLM is perplexed. To show the flexibility and types of insights that can be gained with perplexed, we conducted a case study focused on LLMs for code generation, using an additional tool we built to help analyze code models called codetokenizer. Specifically, we explore success and failure cases at the token level of code LLMs under different scenarios pertaining to the type of coding structure the model is predicting, e.g., a variable name or operator, and how predicting internal versus external method invocations impacts performance. From this analysis, we found that our studied code LLMs performed worst on coding structures where the code was not syntactically correct. Additionally, we found the models generally performed worse at predicting internal method invocations than external ones. We have open-sourced both of these tools to allow the research community to better understand LLMs in general and LLMs for code generation.
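The core idea, scoring a model token by token and asking where it is most surprised, can be sketched with off-the-shelf tooling. The snippet below is a minimal illustration, not the perplexed library's actual API: it uses plain Hugging Face transformers, and the model name and code sample are arbitrary stand-ins for the code LLMs studied in the paper.

```python
# Minimal sketch of token-level perplexity analysis in the spirit of the
# paper's perplexed library. NOT the library's actual API: plain Hugging
# Face transformers, with "gpt2" as an arbitrary stand-in checkpoint.
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

code = "def add(a, b):\n    return a + b\n"
enc = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, vocab_size)

# Cross-entropy of each token given its prefix: the prediction for the
# token at position i lives at logits[:, i - 1, :], so shift logits left
# and labels right before scoring.
shift_logits = logits[:, :-1, :]
shift_labels = enc["input_ids"][:, 1:]
losses = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
    reduction="none",
)

# Rank tokens by surprisal. The paper's analysis goes further, bucketing
# these per-token scores by code structure (variable names, operators,
# internal vs. external method calls) using its codetokenizer tool.
tokens = tokenizer.convert_ids_to_tokens(shift_labels[0].tolist())
for tok, loss in sorted(zip(tokens, losses.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{tok!r}: per-token perplexity = {math.exp(loss):.1f}")
```

Aggregating such per-token losses over a corpus, grouped by each token's syntactic category, is what lets this kind of analysis pinpoint where a code LLM is "confused", e.g., the paper's findings that syntactically invalid code and internal method invocations are the weak spots.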

Authors (2)
  1. Nathan Cooper (35 papers)
  2. Torsten Scholak (14 papers)