
Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation (2506.03535v1)

Published 4 Jun 2025 in cs.SE

Abstract: Current research on LLMs with retrieval-augmented code generation (RACG) mainly focuses on single-language settings, leaving cross-lingual effectiveness and security unexplored. Multi-lingual RACG systems are valuable for migrating codebases across programming languages (PLs), yet face risks from error propagation (e.g. adversarial data corruption) in cross-lingual transfer. We construct a dataset spanning 13 PLs with nearly 14k instances to explore the utility and robustness of multi-lingual RACG systems. Our investigation reveals four key insights: (1) Effectiveness: multi-lingual RACG significantly enhances code generation by multi-lingual code LLMs; (2) Inequality: Java demonstrates superior cross-lingual utility over Python in RACG; (3) Robustness: adversarial attacks degrade performance significantly in mono-lingual RACG but show mitigated impact in cross-lingual scenarios; counterintuitively, perturbed code may even improve RACG in cross-lingual scenarios; (4) Specialization: domain-specific code retrievers significantly outperform general text retrievers. These findings establish a foundation for developing effective and secure multi-lingual code assistants.
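Conceptually, an RACG pipeline retrieves code snippets relevant to the task and prepends them to the generation prompt; in the cross-lingual setting studied here, the retrieved context may come from a different PL than the target (e.g. Java snippets retrieved for a Python task). The sketch below illustrates that flow with a toy lexical (Jaccard-overlap) retriever and a hypothetical corpus; it is not the paper's retriever, whose experiments use dedicated code and text retrieval models.

```python
# Minimal RACG sketch: retrieve top-k similar snippets, build an augmented
# prompt. Retriever, corpus, and all names here are illustrative assumptions.

def tokenize(code: str) -> set:
    """Crude lexical tokenization: strip parentheses, split on whitespace."""
    return set(code.replace("(", " ").replace(")", " ").split())

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Rank corpus snippets by Jaccard similarity to the query."""
    q = tokenize(query)
    def score(snippet: str) -> float:
        s = tokenize(snippet)
        return len(q & s) / max(len(q | s), 1)
    return sorted(corpus, key=score, reverse=True)[:k]

def build_prompt(task: str, corpus: list) -> str:
    """Prepend retrieved snippets (possibly from another PL) to the task."""
    context = "\n".join(retrieve(task, corpus))
    return f"# Retrieved context:\n{context}\n# Task:\n{task}"

# Cross-lingual example: Java snippets retrieved for a Python task.
corpus = [
    "public int add(int a, int b) { return a + b; }",
    "public int mul(int a, int b) { return a * b; }",
    'public String greet(String name) { return "Hi " + name; }',
]
prompt = build_prompt("def add(a, b):", corpus)
```

In a real system, the augmented `prompt` would then be passed to a code LLM; the retrieval step is where both the paper's utility gains and its adversarial-corruption risks enter the pipeline.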

Authors (8)
  1. Qiming Zhu (7 papers)
  2. Jialun Cao (24 papers)
  3. Xuanang Chen (14 papers)
  4. Yaojie Lu (61 papers)
  5. Hongyu Lin (94 papers)
  6. Xianpei Han (103 papers)
  7. Le Sun (111 papers)
  8. Shing-Chi Cheung (54 papers)
