
M2G-Eval-Coder Models

Updated 3 January 2026
  • M2G-Eval-Coder Models are a family of language models that support fine-grained multilingual code generation across multiple structural levels.
  • They leverage multi-granularity pre-training and group-aware policy optimization using both supervised fine-tuning and reinforcement learning.
  • The models achieve state-of-the-art performance in code infilling and synthesis across 18 programming languages at Class, Function, Block, and Line levels.

M2G-Eval-Coder Models are a family of LLMs for code generation, designed to achieve fine-grained multilingual performance across multiple structural levels of software. Developed as part of the M2G-Eval framework, these models implement advanced supervised fine-tuning and reinforcement learning methodologies on top of the Qwen3-8B transformer backbone. M2G-Eval-Coder introduces multi-granularity pre-training and group-aware policy optimization, enabling state-of-the-art results on code infilling and synthesis tasks in 18 programming languages at the Class, Function, Block, and Line levels. This comprehensive approach supports nuanced evaluation and diagnosis of code generation capabilities, particularly in complex and cross-linguistic scenarios (Xu et al., 27 Dec 2025).

1. Model Architecture and Training Paradigm

The M2G-Eval-Coder models are built on Qwen3-8B, an 8B-parameter, decoder-only transformer, pre-trained on an extensive corpus from The-Stack-v2 and web text. The architecture features:

  • Multi-head self-attention, GELU activations, and standardized layer normalization with causal masking.
  • A context window extended to 32,768 tokens, facilitating multi-file retrieval and long-span infilling without any architectural changes to the transformer core.
  • All parameters are subject to tuning during both the supervised and reinforcement learning phases.

Training is conducted in two stages: (i) supervised fine-tuning (SFT) on curated, granular code tasks and (ii) Group Relative Policy Optimization (GRPO) reinforcement learning that directly incorporates structural and linguistic groupings into the optimization procedure (Xu et al., 27 Dec 2025).

2. Multi-granularity and Multilingual Data Curation

M2G-Eval-Coder models rely on a dataset of approximately 17,000 curated code generation tasks, sampled from 150,000 GitHub repositories (pre-2024 for training, post-2024 and hash-disjoint for test/validation). Eighteen programming languages are covered, partitioned into:

  • Full-granularity languages (e.g., Python, Java, C#, C++, JS): tasks available at the Class, Function, Block, and Line granularity.
  • Partial-granularity languages (e.g., R, Verilog, HTML, Rust): primarily Block/Line level.

Each data instance $\tau=(\ell,g,P,M,y^*)$ consists of a language label $\ell$, granularity $g$, structured prompt $P$ (including in-file/cross-file context and LLM-generated descriptions), masked region $M$, and ground truth $y^*$. Natural language descriptions are generated by a strong teacher model (Qwen3-Coder-480B), and tasks are quality-filtered by length-normalized edit similarity:

$$S = 1 - \frac{\operatorname{ED}(\hat{y}, y^*)}{\max(|\hat{y}|, |y^*|)}$$

where ED is the edit distance between draft $\hat{y}$ and $y^*$; tasks with $0.1 \leq S \leq 0.45$ are retained for training (Xu et al., 27 Dec 2025).
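The similarity score and the retention filter can be sketched directly from the formula; the sketch below uses a standard Levenshtein dynamic program, and all function names are illustrative rather than taken from the released code:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                            # deletion
                         cur[j - 1] + 1,                         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))   # substitution
        prev = cur
    return prev[n]

def edit_similarity(draft: str, gold: str) -> float:
    """Length-normalized edit similarity S in [0, 1]."""
    if not draft and not gold:
        return 1.0
    return 1.0 - edit_distance(draft, gold) / max(len(draft), len(gold))

def keep_for_training(draft: str, gold: str, lo: float = 0.1, hi: float = 0.45) -> bool:
    """Quality filter: retain tasks whose draft-vs-gold similarity lies in [lo, hi]."""
    return lo <= edit_similarity(draft, gold) <= hi
```

Note that the filter deliberately discards near-identical drafts (S too high, trivially easy) as well as near-unrelated ones (S too low), keeping tasks of intermediate difficulty.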

3. Fine-tuning Procedures: SFT and Group-Relative RL

Supervised Fine-Tuning (SFT)

SFT objectives minimize cross-entropy on the masked region across all training samples:

$$L_{\mathrm{SFT}}(\theta) = -\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)$$

  • Optimizer: AdamW, BF16 precision, grad-accum=2, global batch=16.
  • Five epochs over the entire dataset; monitored on a 1,286-example, contamination-controlled test set at regular intervals.
  • Total SFT compute: ~10 GPU-hours on 8×A100 (80GB).
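Because loss is computed only on the masked region, the objective reduces to summing negative log-probabilities over masked target tokens; a minimal pure-Python sketch, with toy per-token probabilities standing in for a real model:

```python
import math

def sft_loss(token_probs, mask):
    """L_SFT = -sum over masked positions of log p_theta(y_i | x_i).

    token_probs: model probability assigned to the gold token at each position.
    mask: 1 where the position lies in the masked region M, else 0
          (context tokens contribute no loss).
    """
    return -sum(math.log(p) for p, m in zip(token_probs, mask) if m)

# Toy example: 5 positions, loss counted only on the 3 masked ones.
probs = [0.9, 0.8, 0.5, 0.95, 0.7]
mask  = [0,   1,   1,   0,    1]
loss = sft_loss(probs, mask)   # -(log 0.8 + log 0.5 + log 0.7)
```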

Group Relative Policy Optimization (GRPO)

M2G-Eval-Coder introduces GRPO to target under-performing groups (by language or granularity), dynamically adjusting advantage signals during RL. The optimization target is:

$$\max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta} \left[\sum_g w_g \sum_t A^{(g)}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

where $A^{(g)}$ is the group-level advantage (normalized by a moving baseline), $w_g$ is the group weight, and the reward is the normalized edit similarity $S$. KL-regularization ($\lambda = 0.001$) constrains divergence from the SFT policy. GRPO is executed for ~300 gradient steps over 5,000 high-quality tasks, for 15 epochs (~90 GPU-hours), yielding the M2G-Eval-Coder "M_RL" variant (Xu et al., 27 Dec 2025).
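The group-relative advantage can be sketched as follows. This is a simplification, not the paper's exact procedure: the baseline here is a plain per-group mean/standard-deviation normalization rather than a moving baseline, and the group weights are illustrative:

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_relative_advantages(rewards, groups, group_weights):
    """Weighted, group-normalized advantages A^(g) for a batch of rollouts.

    rewards: per-rollout scalar rewards (here: edit similarity S).
    groups:  group label per rollout (language or granularity).
    group_weights: w_g, up-weighting under-performing groups.
    """
    by_group = defaultdict(list)
    for r, g in zip(rewards, groups):
        by_group[g].append(r)
    # Per-group (mean, std); guard against zero std for degenerate groups.
    stats = {g: (mean(rs), pstdev(rs) or 1.0) for g, rs in by_group.items()}
    return [
        group_weights[g] * (r - stats[g][0]) / stats[g][1]
        for r, g in zip(rewards, groups)
    ]

rewards = [0.30, 0.40, 0.10, 0.20]
groups  = ["python", "python", "verilog", "verilog"]
weights = {"python": 1.0, "verilog": 2.0}  # under-performing group up-weighted
adv = group_relative_advantages(rewards, groups, weights)  # ~[-1, 1, -2, 2]
```

Normalizing within each group means a mediocre reward in a hard group can still earn a positive advantage, which is what lets the optimizer target weak languages or granularities.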

4. Evaluation Framework and Metrics

Evaluation is performed on a 1,286-sample human-curated, post-2024, hash-disjoint test set, with 10 experts performing language-level reviews. All tasks are evaluated at the following granularity levels:

  • Class: Complete class infilling.
  • Function: Function body synthesis.
  • Block: Block-level completion inside a function or method.
  • Line: Single-line or small-span infilling.

The primary metric is normalized edit similarity:

$$S = 1 - \frac{\operatorname{ED}(\hat{y}, y^*)}{\max(|\hat{y}|, |y^*|)}, \qquad S \in [0,1]$$

No pass@k or exact-match metrics are used, as $S$ provides a continuous, structure-agnostic measure of output agreement (Xu et al., 27 Dec 2025).
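Reported scores then reduce to averaging per-task $S$ within each granularity bucket; a minimal aggregation sketch, with a data layout assumed for illustration rather than taken from the evaluation harness:

```python
from collections import defaultdict

def mean_s_by_granularity(results):
    """Average edit-similarity score S per granularity level.

    results: iterable of (granularity, S) pairs from an evaluation run.
    """
    buckets = defaultdict(list)
    for gran, s in results:
        buckets[gran].append(s)
    return {gran: sum(ss) / len(ss) for gran, ss in buckets.items()}

results = [("line", 0.50), ("line", 0.40), ("class", 0.20), ("class", 0.30)]
summary = mean_s_by_granularity(results)  # {"line": 0.45, "class": 0.25}
```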

5. Empirical Results and Comparative Analysis

Empirical evaluation across 30 models (28 baselines, SFT, and RL-tuned M2G-Eval-Coder) demonstrates:

  • Difficulty hierarchy: Mean $S(\mathrm{Line}) > S(\mathrm{Block}) \approx S(\mathrm{Function}) > S(\mathrm{Class})$ across all models. Class-level tasks are consistently the most challenging; few models exceed $S = 0.3$ for Class infilling.
  • Performance increments: M2G-Eval-Coder "M_RL" achieves $\sim 32.2\%$ mean $S$ across 18 languages, compared to $28.4\%$ for "M_SFT" and $\sim 26.1\%$ for the Qwen3-8B base.
  • Full vs. partial-granularity: Full-granularity languages benefit more from M2G-Eval-Coder training, especially as the structural complexity of the task increases. This gap is smallest at the Line level and largest for Class-level completion.
  • Cross-language transfer: Pearson correlation coefficients across languages exceed 0.7, indicating that improvements are not isolated to syntactic surface features but reflect deeper code reasoning. Paradigmatic clustering is observed (e.g., OOP languages cluster tightly; DSLs like Verilog/HTML display outlier patterns).
  • Stability: Violin-boxplots of per-task $S$ show that M2G-Eval-Coder "M_RL" produces robust, low-variance improvements with few language or granularity outliers (Xu et al., 27 Dec 2025).
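The cross-language transfer claim rests on Pearson correlations between per-model score vectors for pairs of languages; a self-contained sketch of that computation, with made-up scores rather than numbers from the paper:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sqrt(sum((x - mx) ** 2 for x in xs))
    vy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

# Per-model mean S on two languages (illustrative numbers, not from the paper).
python_scores = [0.26, 0.28, 0.32, 0.22]
java_scores   = [0.25, 0.27, 0.31, 0.20]
r = pearson(python_scores, java_scores)  # near 1: models rank similarly
```

A high correlation means models that score well on one language tend to score well on the other, which is the basis for the "deeper code reasoning" interpretation above.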

6. Theoretical and Practical Implications

The systematic increase in difficulty from Line-level to Class-level tasks reveals that current LLMs, even with advanced fine-tuning and RL techniques, still struggle to synthesize complex software artifacts holistically, especially when required to integrate multi-file, cross-context information. Conversely, high cross-language correlation suggests a strong inductive bias toward generic programming concepts. A plausible implication is that further improvements may require not just more data or scale, but more advanced retrieval, reasoning, or explicit tool-use mechanisms.

7. Limitations and Prospective Directions

M2G-Eval-Coder inherits the following limitations:

  • Task representativeness: While 18 languages and 1,286 evaluation samples mark a significant advance, this coverage cannot encapsulate the full heterogeneity of modern software projects.
  • Metric constraints: The continuous edit similarity $S$ captures syntactic and shallow semantic agreement but may underweight deep correctness, efficiency, or security.
  • Generalization: Strict contamination control is enforced, yet the test distribution is limited to post-2024 codebases drawn from repositories similar to the training data. There is no explicit evaluation on noisy, partially specified, or truly adversarial prompts.
  • Compute: The dual-stage tuning pipeline requires significant computational resources (~100 GPU-hours total on 8×A100), possibly limiting replication in resource-constrained settings.

Future research directions suggested in the original study include integration with emerging evaluation standards (e.g., for multi-modal or multi-stage code generation), task expansion to include security auditing or vulnerability detection, and adoption of code-coverage/static analysis in the scoring pipeline. Introducing partially specified or ambiguous prompts is targeted to better simulate real-world development contexts (Xu et al., 27 Dec 2025).


For comprehensive technical details, see "M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation" (Xu et al., 27 Dec 2025).
