ChatGPT vs LLaMA: Impact, Reliability, and Challenges in Stack Overflow Discussions (2402.08801v1)
Abstract: Since its release in November 2022, ChatGPT has shaken up Stack Overflow, the premier platform for developers' queries on programming and software development. Demonstrating an ability to generate instant, human-like responses to technical questions, ChatGPT has ignited debates within the developer community about the evolving role of human-driven platforms in the age of generative AI. Two months after ChatGPT's release, Meta released its answer with its own LLM called LLaMA: the race was on. We conducted an empirical study analyzing questions from Stack Overflow and using these LLMs to address them. This way, we aim to (i) measure how user engagement with Stack Overflow evolves over time; (ii) quantify the reliability of LLMs' answers and their potential to replace Stack Overflow in the long term; (iii) identify and understand why LLMs fail; and (iv) compare the LLMs with each other. Our empirical results are unequivocal: ChatGPT and LLaMA challenge human expertise, yet do not outperform it for some domains, while a significant decline in user posting activity has been observed. Furthermore, we also discuss the impact of our findings regarding the usage and development of new LLMs.
- Leuson Da Silva
- Jordan Samhi
- Foutse Khomh