Open-Source AI-based SE Tools: Opportunities and Challenges of Collaborative Software Learning (2404.06201v1)

Published 9 Apr 2024 in cs.SE and cs.AI

Abstract: LLMs have become instrumental in advancing software engineering (SE) tasks, showcasing their efficacy in code understanding and beyond. As with traditional SE tools, open-source collaboration is key to building excellent products. For AI models, however, the essential resource is data: collaboration on AI-based SE models hinges on maximising the sources of high-quality data. Yet high-quality data often holds commercial or sensitive value, making it less accessible to open-source AI-based SE projects. This reality presents a significant barrier to the development and enhancement of AI-based SE tools within the software engineering community. Researchers therefore need solutions that enable open-source AI-based SE models to tap into resources held by different organisations. Addressing this challenge, our position paper investigates one solution for facilitating access to diverse organizational resources for open-source AI models while respecting privacy and commercial sensitivities. We introduce a governance framework centered on federated learning (FL), designed to foster the joint development and maintenance of open-source AI code models while safeguarding data privacy and security. Additionally, we present guidelines for developers on AI-based SE tool collaboration, covering data requirements, model architecture, updating strategies, and version control. Given the significant influence of data characteristics on FL, our research examines the effect of code data heterogeneity on FL performance.
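
For readers unfamiliar with how federated learning lets organisations co-train a model without pooling their code data, the sketch below illustrates a FedAvg-style training loop on a toy problem. This is a minimal illustration under assumed details, not the paper's implementation: the linear model, learning rate, client data, and helper names (local_step, fed_avg) are all hypothetical.

```python
# Minimal FedAvg-style sketch: several organisations jointly train a shared model
# without exchanging raw (private) data, only model updates. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)


def local_step(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
               lr: float = 0.02, epochs: int = 5) -> np.ndarray:
    """Run a few epochs of gradient descent on one client's private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient for a linear model
        w -= lr * grad
    return w


def fed_avg(client_weights: list, client_sizes: list) -> np.ndarray:
    """Server aggregation: average client updates weighted by local data size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))


# Three organisations with heterogeneous (non-IID) local data distributions.
dim = 4
true_w = rng.normal(size=dim)
clients = []
for size in (50, 120, 30):
    X = rng.normal(size=(size, dim)) + rng.normal(size=dim)  # shifted features per client
    y = X @ true_w + 0.1 * rng.normal(size=size)
    clients.append((X, y))

global_w = np.zeros(dim)
for _ in range(30):  # communication rounds
    updates = [local_step(global_w, X, y) for X, y in clients]
    global_w = fed_avg(updates, [len(y) for _, y in clients])

# With non-IID clients the aggregate approaches, but need not exactly match, true_w.
print("aggregated weights:", np.round(global_w, 3))
print("true weights:      ", np.round(true_w, 3))
```

The effect of code data heterogeneity studied in the paper corresponds, in this toy setting, to how far the per-client feature shifts pull the aggregated model away from the global optimum.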

Authors (10)
  1. Zhihao Lin (16 papers)
  2. Wei Ma (106 papers)
  3. Tao Lin (167 papers)
  4. Yaowen Zheng (9 papers)
  5. Jingquan Ge (3 papers)
  6. Jun Wang (990 papers)
  7. Jacques Klein (89 papers)
  8. Yang Liu (2253 papers)
  9. Li Li (655 papers)
  10. Tegawendé Bissyandé (3 papers)
Citations (1)