Bridge-Coder: Unlocking LLMs' Potential to Overcome Language Gaps in Low-Resource Code (2410.18957v1)
Abstract: LLMs demonstrate strong proficiency in generating code for high-resource programming languages (HRPLs) like Python but struggle significantly with low-resource programming languages (LRPLs) such as Racket or D. This performance gap deepens the digital divide, preventing developers using LRPLs from benefiting equally from LLM advancements and reinforcing disparities in innovation within underrepresented programming communities. While generating additional training data for LRPLs is promising, it faces two key challenges: manual annotation is labor-intensive and costly, and LLM-generated LRPL code is often of subpar quality. The underlying cause is the gap between natural language and programming languages (the NL-PL Gap), which is especially pronounced for LRPLs due to limited aligned data. In this work, we introduce a novel approach called Bridge-Coder, which leverages LLMs' intrinsic capabilities to enhance their performance on LRPLs. Our method consists of two key stages. First, Bridge Generation creates a high-quality dataset by utilizing LLMs' general knowledge understanding, proficiency in HRPLs, and in-context learning abilities. Second, Bridged Alignment progressively improves the alignment between NL instructions and LRPLs. Experimental results across multiple LRPLs show that Bridge-Coder significantly enhances model performance, demonstrating the effectiveness and generalization of our approach. Furthermore, we offer a detailed analysis of the key components of our method, providing valuable insights for future work aimed at addressing the challenges associated with LRPLs.
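To make the two-stage idea in the abstract concrete, here is a minimal sketch of how Bridge Generation (solve in an HRPL first, then translate to the LRPL with in-context examples) and Bridged Alignment (progressively shift training from bridge-assisted to direct NL-to-LRPL pairs) could be wired together. This is an illustrative assumption, not the paper's actual pipeline: `llm_generate` is a hypothetical placeholder for any chat-style LLM call, Racket is used as an example LRPL, the prompt templates are invented, and the two-phase curriculum is one plausible reading of "progressively improves the alignment".

```python
# Illustrative sketch of the Bridge-Coder idea (assumptions noted in the lead-in).
from dataclasses import dataclass


@dataclass
class BridgedSample:
    instruction: str   # natural-language task description
    hrpl_bridge: str   # high-resource solution (e.g., Python) used as the bridge
    lrpl_code: str     # low-resource target (e.g., Racket) produced via the bridge


def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM completion call."""
    raise NotImplementedError


def bridge_generation(instruction: str,
                      few_shot_pairs: list[tuple[str, str]]) -> BridgedSample:
    """Stage 1 (Bridge Generation): exploit the model's HRPL proficiency first,
    then translate the bridge into the LRPL using in-context examples."""
    # Step 1: produce a Python (HRPL) solution for the NL instruction.
    hrpl_code = llm_generate(f"Write a Python solution for:\n{instruction}")

    # Step 2: translate the Python bridge into Racket, guided by few-shot pairs.
    shots = "\n\n".join(f"Python:\n{py}\nRacket:\n{rkt}" for py, rkt in few_shot_pairs)
    lrpl_code = llm_generate(
        f"{shots}\n\nTranslate the following Python code to Racket.\n"
        f"Python:\n{hrpl_code}\nRacket:"
    )
    return BridgedSample(instruction, hrpl_code, lrpl_code)


def bridged_alignment(samples: list[BridgedSample]) -> None:
    """Stage 2 (Bridged Alignment), read here as a two-phase curriculum:
    first train on (instruction + bridge) -> LRPL, then on instruction -> LRPL."""
    phase_1 = [(s.instruction + "\n" + s.hrpl_bridge, s.lrpl_code) for s in samples]
    phase_2 = [(s.instruction, s.lrpl_code) for s in samples]
    for stage_data in (phase_1, phase_2):
        # fine_tune(model, stage_data)  # standard supervised fine-tuning per phase
        pass
```

The key design point the sketch tries to capture is that the LRPL data never has to be written from scratch by the model in one step: the HRPL solution acts as an intermediate representation that is easier to generate correctly and easier to translate than to synthesize directly.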