FineInstructions: Code Completion Paradigm
- FineInstructions is an instruction-aware fill-in-the-middle paradigm that integrates explicit developer instructions with prefix and suffix code context to improve infill accuracy.
- The approach employs a dedicated instruction token to clearly delimit intent, ensuring models respond accurately to natural language guidance.
- Empirical results show that Instruction-aware Fill-in-the-Middle (IFIM) improves Pass@1 by up to 9.0 percentage points on IHumanEval and 10.2 on IRepoMasterEval (IRME) when instructions are provided, while maintaining robust baseline performance without them.
Instruction-Aware Fill-in-the-Middle (IFIM) Paradigm for Code Completion
Instruction-aware Fill-in-the-Middle (IFIM) is a code completion paradigm designed to bridge the gap between a developer’s natural language instructions and effective “fill-in-the-middle” (FIM) modeling in LLMs. Traditional FIM approaches leverage code context—prefix and suffix—to predict missing segments but often fail to integrate explicit developer intent, especially when code context is ambiguous. IFIM addresses this shortfall by structurally incorporating instruction spans, allowing models to significantly improve their responsiveness to developer guidance while maintaining robust baseline infill performance in the absence of instructions (Sun et al., 29 Sep 2025).
1. Formal Objective and Model Structure
1.1 Conventional FIM
The standard FIM training objective splits each code completion instance into three contiguous token spans:
- $P$: Prefix (tokens before the edit point)
- $M$: Middle (the region to be predicted)
- $S$: Suffix (tokens after the edit)
The model is trained to maximize the conditional likelihood
$$\mathbb{E}_{(P, M, S) \sim \mathcal{D}} \big[ \log p_\theta(M \mid P, S) \big]$$
where $\mathcal{D}$ is the code infilling dataset.
1.2 IFIM Extension
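IFIM extends this to quadruplets $(P, I, M, S)$ by introducing an explicit instruction span $I$:
$$\mathbb{E}_{(P, I, M, S) \sim \mathcal{D}_{\text{IFIM}}} \big[ \log p_\theta(M \mid P, I, S) \big]$$
Combining IFIM and FIM samples in training yields:
$$\lambda \, \mathbb{E}_{(P, I, M, S) \sim \mathcal{D}_{\text{IFIM}}} \big[ \log p_\theta(M \mid P, I, S) \big] + (1 - \lambda) \, \mathbb{E}_{(P, M, S) \sim \mathcal{D}} \big[ \log p_\theta(M \mid P, S) \big]$$
where $\mathcal{D}_{\text{IFIM}}$ is the instruction-annotated dataset, $\mathcal{D}$ is the plain code infilling dataset, and $\lambda \in [0, 1]$ is the fraction of IFIM samples ($\lambda = 1$ is pure IFIM). A dedicated instruction token `<INS>` delimits $I$ to ensure architectural consistency and avoid head reinitialization.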
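The source does not say how IFIM and FIM samples are combined in practice. A minimal sketch, assuming a per-sample random choice: each annotated sample keeps its instruction with probability `ifim_ratio` and is otherwise downgraded to a plain FIM sample. The `Sample` type, `mix_samples`, and `ifim_ratio` are hypothetical names used only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    prefix: str
    middle: str
    suffix: str
    instruction: Optional[str] = None  # None marks a plain FIM sample


def mix_samples(samples: List[Sample], ifim_ratio: float, seed: int = 0) -> List[Sample]:
    """Keep the instruction on roughly `ifim_ratio` of the samples (assumed scheme)."""
    rng = random.Random(seed)
    mixed = []
    for s in samples:
        # rng.random() is in [0, 1), so ratio 1.0 keeps every instruction
        # and ratio 0.0 drops every instruction.
        if s.instruction is not None and rng.random() < ifim_ratio:
            mixed.append(s)
        else:
            mixed.append(Sample(s.prefix, s.middle, s.suffix, None))
    return mixed
```

With `ifim_ratio=1.0` this reduces to pure IFIM training data, and with `ifim_ratio=0.0` to conventional FIM data.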
2. Data Generation Pipeline
2.1 Sourcing and Preparation
- Middle span $M$: Sampled as 1–3 contiguous code lines from open-source repositories.
- Surrounding context: Remaining lines become $P$ (prefix) and $S$ (suffix).
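The splitting step can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the uniform choice of span length and start line, and the function name `split_fim`, are hypothetical.

```python
import random
from typing import Tuple


def split_fim(source: str, rng: random.Random, max_middle_lines: int = 3) -> Tuple[str, str, str]:
    """Split a source file into (prefix, middle, suffix) with a 1-3 line middle."""
    lines = source.splitlines(keepends=True)
    if not lines:
        raise ValueError("source has no lines")
    # Pick how many contiguous lines form the middle span, then where it starts.
    n = rng.randint(1, min(max_middle_lines, len(lines)))
    start = rng.randint(0, len(lines) - n)
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:start + n])
    suffix = "".join(lines[start + n:])
    return prefix, middle, suffix


src = "a = 1\nb = 2\nc = a + b\nprint(c)\n"
p, m, s = split_fim(src, random.Random(0))
# Concatenating the three spans always reproduces the original file.
```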
2.2 Automated Instruction Synthesis
- GPT-4o is prompted by marking $M$ with `<explain>…</explain>` and tasked to generate a single, concise, intent-focused sentence (average ≈10 tokens) describing the purpose and intent of $M$.
- Only one-sentence instructions are retained. Overlap with standard public code completion datasets (HumanEval, MBPP) is removed.
Final dataset scale: 122,900 samples, 70% Python, average instruction ≈10 tokens.
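A rough sketch of the two mechanical steps in instruction synthesis: wrapping the middle span in `<explain>` tags and keeping only one-sentence outputs. The prompt wording sent to GPT-4o is not given in the source and is not reproduced here; the helper names and the punctuation-based sentence heuristic are assumptions.

```python
import re


def mark_middle(prefix: str, middle: str, suffix: str) -> str:
    """Wrap the middle span in <explain> tags inside its surrounding code."""
    return f"{prefix}<explain>{middle}</explain>{suffix}"


def is_single_sentence(text: str) -> bool:
    """Heuristic filter (assumption): one line, no sentence break inside it."""
    stripped = text.strip()
    if not stripped or "\n" in stripped:
        return False
    # Split after '.', '!' or '?' when followed by whitespace.
    parts = [p for p in re.split(r"(?<=[.!?])\s+", stripped) if p]
    return len(parts) == 1


candidates = [
    "Compute the running total of the list.",
    "Sum the values. Then return them.",
]
kept = [c for c in candidates if is_single_sentence(c)]
# kept == ["Compute the running total of the list."]
```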
3. Model Architecture, Input Ordering, and Training
- Base LLMs: Deepseek-Coder (6.7B; default FIM mode PMS) and Qwen2.5-Coder (7B; default FIM mode PSM).
- The `<INS>` token denotes the instruction boundary, leveraging low-frequency vocabulary entries to avoid architectural disruption.
- Empirically, input orderings with “I-before-M” maximize performance:
  - Deepseek: `<PRE>` $P$ `<INS>` $I$ `<MID>` $S$ `<SUF>` (“PIMS”)
  - Qwen2.5: `<PRE>` $P$ `<SUF>` $S$ `<INS>` $I$ `<MID>` (“PSIM”)
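To make the two orderings concrete, here is a minimal prompt builder. It assumes the layouts `<PRE> P <INS> I <MID> S <SUF>` (PIMS) and `<PRE> P <SUF> S <INS> I <MID>` (PSIM), with the model generating the middle after the prompt. The sentinel strings are placeholders: each real model uses its own special tokens, and `build_ifim_prompt` is a hypothetical helper.

```python
from typing import Optional


def build_ifim_prompt(prefix: str, suffix: str, instruction: Optional[str], order: str) -> str:
    """Assemble an infilling prompt; omit the <INS> span when no instruction is given."""
    ins = ["<INS>", instruction] if instruction else []
    if order == "PIMS":
        # Instruction sits between the prefix and the hole marker.
        parts = ["<PRE>", prefix, *ins, "<MID>", suffix, "<SUF>"]
    elif order == "PSIM":
        # Instruction sits after the suffix, immediately before generation starts.
        parts = ["<PRE>", prefix, "<SUF>", suffix, *ins, "<MID>"]
    else:
        raise ValueError(f"unknown order: {order}")
    return "".join(parts)


print(build_ifim_prompt("def add(a, b):\n", "\n", "return the sum", "PSIM"))
```

Dropping the instruction yields an ordinary FIM prompt, which is what makes the format backward compatible with instruction-free completion.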
Hyperparameters
- Framework: Huggingface Transformers + PyTorch, 2×A6000 GPUs.
- Optimizer: Adafactor, learning rate [value missing in source], 15 warmup steps, linear decay.
- Deepseek-Coder: batch size [value missing in source] (token accumulation), context 1216 tokens.
- Qwen2.5-Coder: batch size [value missing in source], context 1216 tokens.
- Training: 2 epochs (about 100,000 steps). An IFIM sample ratio of 1 (100% IFIM) is optimal for instruction following; 25–75% mixtures were explored for ablation.
4. Evaluation Protocol, Benchmarks, and Results
4.1 Benchmarks & Metrics
- IHumanEval: 312 Python infilling problems with docstrings removed.
- IRepoMasterEval (IRME): 256 file-level code infill tasks, context truncated to 20 lines both sides.
- Primary metric: Pass@1 (proportion of completions passing all unit tests).
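With one completion per problem, Pass@1 is simply the fraction of problems whose completion passes every unit test. A minimal sketch, where the boolean input list is a hypothetical representation of the test outcomes:

```python
from typing import Sequence


def pass_at_1(passed: Sequence[bool]) -> float:
    """Fraction of problems whose single completion passed all unit tests."""
    if not passed:
        raise ValueError("no results to score")
    return sum(passed) / len(passed)


# Four problems, three of which pass all their tests.
print(pass_at_1([True, True, False, True]))  # 0.75
```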
4.2 Empirical Results
| Model | Setting | IHumanEval | IRME |
|---|---|---|---|
| Deepseek-base | w/ ins. | 84.6% | 10.9% |
| Deepseek-IFIM | w/ ins. | 93.6% | 21.1% |
| Deepseek-base | w/o ins. | 68.6% | 7.4% |
| Deepseek-IFIM | w/o ins. | 78.2% | 16.0% |
| Qwen2.5-base | w/ ins. | 91.0% | 18.4% |
| Qwen2.5-IFIM | w/ ins. | 95.8% | 20.3% |
| Qwen2.5-base | w/o ins. | 76.0% | 10.2% |
| Qwen2.5-IFIM | w/o ins. | 76.3% | 13.3% |
- IFIM yields +4.8 to +9.0 percentage-point improvements on IHumanEval and +1.9 to +10.2 on IRME when instructions are provided.
- IFIM additionally raises performance in no-instruction settings (e.g., Deepseek: +9.6 on IHumanEval, +8.6 on IRME).
- “I-before-M” ordering outperforms alternatives by 3–5 points.
- Mixing in standard FIM samples (IFIM sample ratio between 0.25 and 0.75) can better preserve baseline infilling when no instructions are given, but pure IFIM (ratio of 1) is optimal for instruction following.
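The per-model improvements can be recomputed directly from the results table. The snippet below copies the Pass@1 values from the table and subtracts the base score from the IFIM score; the dictionary layout is just one convenient way to hold them.

```python
# (IHumanEval, IRME) Pass@1 in percent, copied from the results table.
scores = {
    ("Deepseek", "w/ ins."): {"base": (84.6, 10.9), "ifim": (93.6, 21.1)},
    ("Deepseek", "w/o ins."): {"base": (68.6, 7.4), "ifim": (78.2, 16.0)},
    ("Qwen2.5", "w/ ins."): {"base": (91.0, 18.4), "ifim": (95.8, 20.3)},
    ("Qwen2.5", "w/o ins."): {"base": (76.0, 10.2), "ifim": (76.3, 13.3)},
}

# Percentage-point gain of IFIM over the base model, rounded to one decimal.
deltas = {
    key: tuple(round(i - b, 1) for b, i in zip(row["base"], row["ifim"]))
    for key, row in scores.items()
}

for (model, setting), (d_ihe, d_irme) in deltas.items():
    print(f"{model:9s} {setting:9s} IHumanEval {d_ihe:+.1f}  IRME {d_irme:+.1f}")
```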
CFIM ablation: Inlining instructions as comments (CFIM) severely degrades performance (e.g., Deepseek: 4.3% on IRME), underscoring the necessity of an explicit `<INS>`-delimited instruction span.
5. Design Implications, Limitations, and Best Practices
- Effective IFIM-derived “FineInstructions” should be a single, clear sentence (5–15 tokens), describing what to achieve in the missing code (not how).
- Use an explicit delimiter (e.g., `<INS>` or IDE-friendly `#! ...`) for instructions, enabling seamless post-completion removal.
- The IFIM dataset as built is Python-focused; testing cross-language robustness and scaling to larger model sizes (30B+) are essential next steps.
- Harvesting “wild” data (e.g., inline developer comments, logs) is promising but requires robust filtering and intent extraction.
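Post-completion removal of an editor-side delimiter can be as simple as dropping the marked lines once the infill is accepted. A minimal sketch, assuming instructions occupy whole lines that begin with `#! ` (marker plus a space); the helper name is hypothetical.

```python
def strip_instruction_lines(code: str, marker: str = "#! ") -> str:
    """Remove whole lines that start with the instruction marker."""
    kept = [
        line
        for line in code.splitlines(keepends=True)
        # Requiring the trailing space leaves a shebang such as
        # '#!/usr/bin/env python' untouched; '#! /usr/bin/...' would still match.
        if not line.lstrip().startswith(marker)
    ]
    return "".join(kept)


completed = (
    "def total(xs):\n"
    "    #! return the sum of xs\n"
    "    return sum(xs)\n"
)
print(strip_instruction_lines(completed))
```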
6. Impact and Prospects
IFIM provides a backward-compatible, instruction-aware extension of standard FIM pretraining for code LLMs (Sun et al., 29 Sep 2025). It delivers substantial gains in following finely specified developer intent (more than 8 percentage points for Deepseek-Coder on both benchmarks), while preserving or improving the model’s performance in vanilla infilling scenarios lacking explicit instructions. The approach reconciles the historical trade-off between infilling competence and instruction adherence that standard instruction tuning imposes on code completion systems, establishing a new state of the art on both synthetic and real-world programming benchmarks.