- The paper introduces a multi-agent collaboration framework where language-specific agents work together to create a high-quality multilingual instruction dataset for code LLMs.
- Agents utilize generation memory and self-critique mechanisms to improve data quality and explicitly bridge cross-lingual gaps during the data generation process.
- Fine-tuning models like Qwen2.5-xCoder on this collaboratively generated dataset enhances their capability to transfer knowledge between different programming languages, boosting performance on multilingual benchmarks.
The paper "Multi-Agent Collaboration for Multilingual Code Instruction Tuning" (2502.07487) introduces a framework designed to enhance cross-lingual knowledge transfer during the instruction tuning phase of LLMs for code-related tasks. The core issue addressed is the prevalent practice of training or fine-tuning code LLMs in isolation for each programming language, thereby neglecting potential synergies and shared semantic structures across languages.
Multi-Agent Collaboration Framework
The proposed methodology centers on a multi-agent system in which each agent specializes in a specific programming language. These language-specific agents collaborate to generate a diverse, high-quality multilingual instruction dataset. The framework aims to bridge the gap between programming languages by facilitating knowledge transfer during the data generation process itself, prior to the final model fine-tuning stage.
Key components of the framework include:
- Language-Specific Agents: Intelligent components, each possessing expertise in a particular programming language (e.g., Python agent, Java agent, C++ agent).
- Seed Data Generation: Initial instruction data is generated for each language independently, typically derived from existing code snippets. This serves as the starting point for the collaborative process.
- Collaborative Instruction Generation: The core of the framework involves agents "discussing" and collaborating. An agent can propose an instruction relevant to its language. Other agents then attempt to formulate equivalent instructions and solutions in their respective languages. This process explicitly encourages the generation of parallel or related instructions across multiple languages.
- Generation Memory: Each agent maintains a memory of its generated instructions and solutions, along with self-critiques summarizing the merits and faults of its contributions. This memory serves two purposes: (a) it prevents redundant generation and (b) it allows agents to refine their generation strategies based on past successes and failures, potentially improving the quality of subsequent instructions.
- Data Filtering and Selection: The collaboratively generated multilingual instruction data undergoes a quality control process before being used for fine-tuning. The paper implies that high-quality examples exhibiting successful cross-lingual mapping are prioritized.
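The agent-plus-memory design above can be sketched minimally. This is a hypothetical illustration, not the paper's implementation: the class and field names (`LanguageAgent`, `MemoryEntry`, `remember`, `seen`) are assumptions chosen to show how generation memory can support both redundancy checks and critique storage.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    instruction: str
    solution: str
    critique: str   # summarized merits and faults of this generation
    accepted: bool  # whether it passed the quality check

@dataclass
class LanguageAgent:
    """Sketch of a language-specific agent with a generation memory."""
    language: str
    memory: list = field(default_factory=list)

    def remember(self, instruction, solution, critique, accepted):
        # Store the generation together with its self-critique.
        self.memory.append(MemoryEntry(instruction, solution, critique, accepted))

    def seen(self, instruction):
        # Prevent redundant generation by checking past instructions.
        return any(e.instruction == instruction for e in self.memory)

agent = LanguageAgent("Python")
agent.remember("Write a factorial function.", "def fact(n): ...",
               "correct and idiomatic", accepted=True)
```

Keeping the critique alongside each entry is what lets later prompting or filtering steps learn from prior faults rather than only deduplicating.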
Data Generation and Knowledge Transfer Mechanism
The process begins by generating initial language-specific instruction-following data. This can be achieved using techniques similar to self-instruct, where code snippets are used as context to prompt an LLM to generate relevant instructions and corresponding code solutions within that specific language.
```
function generate_seed_data(language, code_snippets):
    seed_instructions = []
    for snippet in code_snippets:
        # Prompt an LLM to generate an instruction & solution based on the snippet
        instruction, solution = generate_instruction_from_snippet(language, snippet)
        seed_instructions.append((instruction, solution, language))
    return seed_instructions

python_seeds = generate_seed_data("Python", python_code_corpus)
java_seeds = generate_seed_data("Java", java_code_corpus)
```
Once seed data is available, the collaborative phase commences. An agent, say the Python agent, might select or generate an instruction like "Write a Python function to calculate the factorial of a number." It shares this instruction (and potentially its solution) with other agents. The Java agent, for instance, would then attempt to create a corresponding instruction and Java code solution: "Write a Java method to calculate the factorial of a number." This might involve translation of the instruction text and adaptation of the code logic to Java syntax and paradigms.
```
function collaborative_generation(agents, seed_data):
    collaborative_dataset = []
    # Initialize agents with seed data and memory
    for agent in agents:
        agent.initialize_memory(seed_data[agent.language])
    for iteration in range(NUM_COLLABORATION_STEPS):
        proposing_agent = select_agent(agents)
        instruction, solution = proposing_agent.propose_instruction()
        generated_pairs = [(instruction, solution, proposing_agent.language)]
        for responding_agent in agents:
            if responding_agent != proposing_agent:
                # Attempt to generate an equivalent instruction/solution in the target language
                cross_lingual_instruction, cross_lingual_solution = responding_agent.respond(
                    instruction, solution, proposing_agent.language)
                # Self-critique and update memory
                critique = responding_agent.critique_generation(
                    cross_lingual_instruction, cross_lingual_solution)
                responding_agent.update_memory(
                    cross_lingual_instruction, cross_lingual_solution, critique)
                if quality_check(critique):  # keep only generations deemed high-quality
                    generated_pairs.append(
                        (cross_lingual_instruction, cross_lingual_solution,
                         responding_agent.language))
        # Add the set of related instructions/solutions to the dataset
        if len(generated_pairs) > 1:  # ensure cross-lingual transfer occurred
            collaborative_dataset.append(generated_pairs)
    return collaborative_dataset
```
The generation memory allows agents to learn from this process. If the Java agent successfully translates a Python concept, it reinforces that pattern. If it fails or produces incorrect code, the critique mechanism flags this, discouraging similar errors in the future. The summarized merits and faults help refine the agent's internal strategies for cross-lingual mapping and code generation.
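One plausible way to close this feedback loop is to fold the summarized faults from memory into the agent's next cross-lingual prompt. The sketch below is a hypothetical prompt format (the paper does not specify exact wording); `memory` is assumed to be a list of dicts with `critique` and `accepted` keys.

```python
def build_retry_prompt(memory, new_instruction, source_lang, target_lang):
    """Fold the most recent faults from the agent's generation memory into the
    next cross-lingual request, so earlier mapping errors are not repeated."""
    faults = [m["critique"] for m in memory if not m["accepted"]][-3:]  # last few faults
    lessons = "\n".join(f"- avoid: {f}" for f in faults)
    return (
        f"Translate this {source_lang} task into idiomatic {target_lang}.\n"
        f"Lessons from past critiques:\n{lessons}\n"
        f"Task: {new_instruction}\n"
    )

memory = [
    {"critique": "used Python-style indexing in Java", "accepted": False},
    {"critique": "correct and idiomatic", "accepted": True},
]
prompt = build_retry_prompt(memory, "Reverse a linked list.", "Python", "Java")
```

Only failed generations are surfaced as "lessons"; successful ones need no corrective reminder.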
Model Fine-tuning and Evaluation
The final multilingual instruction dataset, enriched through multi-agent collaboration, is used to fine-tune a base code LLM. In this paper, Qwen2.5-xCoder was the model fine-tuned using this approach. The fine-tuning objective remains standard instruction following, but the dataset now contains explicit cross-lingual links and parallel examples, hypothesized to force the model to learn shared representations and transfer mechanisms between languages.
$$
\mathcal{L}_{\text{FT}} = - \sum_{\{(I_1, S_1, L_1), \dots, (I_k, S_k, L_k)\} \in D_{\text{collab}}} \; \sum_{j=1}^{k} \log P(S_j \mid I_j, \text{Model})
$$

where $D_{\text{collab}}$ is the collaboratively generated dataset containing tuples of instruction ($I$), solution ($S$), and language ($L$).
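The objective can be made concrete with a toy computation. This is a sketch, not the paper's training code: `logprob(instruction, solution)` stands in for the model's $\log P(S \mid I)$, and the dataset layout (a list of cross-lingual groups of `(instruction, solution, language)` tuples) mirrors the pseudocode above.

```python
import math

def finetune_loss(collab_dataset, logprob):
    """Sum the negative log-likelihood of each solution given its instruction,
    over every language variant in each collaborative group."""
    total = 0.0
    for group in collab_dataset:  # group: related pairs across languages
        for instruction, solution, language in group:
            total -= logprob(instruction, solution)
    return total

# Toy check: a constant-probability "model" assigning P(S|I) = 0.5 to each pair.
data = [[("Factorial in Python", "def fact(n): ...", "Python"),
         ("Factorial in Java", "static long fact(int n) { ... }", "Java")]]
loss = finetune_loss(data, lambda i, s: math.log(0.5))  # 2 pairs -> 2 * log 2
```

Grouping by collaborative set rather than flattening the data makes the cross-lingual parallelism explicit, which is the property the fine-tuning is hypothesized to exploit.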
The effectiveness of this method was evaluated on multilingual programming benchmarks. The paper reports that Qwen2.5-xCoder, fine-tuned using the multi-agent collaborative data, demonstrated superior performance compared to baseline models trained without this cross-lingual collaborative data generation phase. The results suggest improved capabilities in sharing common programming knowledge across different languages, effectively reducing the cross-lingual gap in code understanding and generation tasks. Specific metrics or benchmark details are not fully elaborated here but are present in the original paper's experimental section. The claim is that the collaborative data generation explicitly fosters the learning of transferable skills.
Conclusion
The multi-agent collaboration framework presents a novel approach to data generation for multilingual code instruction tuning. By simulating collaboration between language-specific agents and leveraging generation memory with self-critique, the method produces a dataset enriched with cross-lingual examples. Fine-tuning on this dataset appears to enhance the underlying code LLM's ability to transfer knowledge between programming languages, leading to improved performance on multilingual benchmarks as demonstrated with Qwen2.5-xCoder. This technique offers a pathway to build more versatile code LLMs that better exploit the inherent similarities across programming languages.