FineInstructions: Code Completion Paradigm

Updated 3 July 2026

FineInstructions is an instruction-aware fill-in-the-middle paradigm that integrates explicit developer instructions with prefix and suffix code context to improve infill accuracy.
The approach employs a dedicated instruction token to clearly delimit intent, ensuring models respond accurately to natural language guidance.
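Empirical results show that the instruction-aware FIM (IFIM) approach yields Pass@1 gains of roughly 8–10 percentage points for Deepseek-Coder (smaller gains for Qwen2.5-Coder) on the IHumanEval and IRME benchmarks while maintaining robust baseline infilling performance.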

Instruction-Aware Fill-in-the-Middle (IFIM) Paradigm for Code Completion

Instruction-aware Fill-in-the-Middle (IFIM) is a code completion paradigm designed to bridge the gap between a developer’s natural language instructions and effective “fill-in-the-middle” (FIM) modeling in LLMs. Traditional FIM approaches leverage code context—prefix and suffix—to predict missing segments but often fail to integrate explicit developer intent, especially when code context is ambiguous. IFIM addresses this shortfall by structurally incorporating instruction spans, allowing models to significantly improve their responsiveness to developer guidance while maintaining robust baseline infill performance in the absence of instructions (Sun et al., 29 Sep 2025).

1. Formal Objective and Model Structure

1.1 Conventional FIM

The standard FIM training objective splits each code completion instance into three contiguous token spans:

$P$ : Prefix (tokens before the edit point)
$M$ : Middle (the region to be predicted)
$S$ : Suffix (tokens after the edit)

The model is trained to maximize the conditional likelihood of the middle span, i.e., to minimize the negative log-likelihood

$\mathcal{L}_\mathrm{FIM} = -\mathbb{E}_{(P, M, S) \sim \mathcal{D}} [\log p_\theta(M \mid P, S)]$

where $\mathcal{D}$ is the code infilling dataset.
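
Because code LLMs are autoregressive, the middle-span likelihood above factorizes token by token. Writing $M = (m_1, \dots, m_{|M|})$ (notation introduced here for illustration, not taken from the source), the log term expands as

$\log p_\theta(M \mid P, S) = \sum_{t=1}^{|M|} \log p_\theta(m_t \mid m_{<t}, P, S)$

so, under this objective, the loss is accumulated over the tokens of $M$ only, with $P$ and $S$ serving as conditioning context.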

1.2 IFIM Extension

IFIM extends this to quadruplets by introducing an explicit instruction span $I$:

$\mathcal{L}_\mathrm{IFIM} = -\mathbb{E}_{(P, I, S, M) \sim \mathcal{D}} [\log p_\theta(M \mid P, I, S)]$

Combining IFIM and FIM samples in training yields

$\mathcal{L} = \alpha\,\mathcal{L}_\mathrm{IFIM} + (1-\alpha)\,\mathcal{L}_\mathrm{FIM},\quad \alpha \in [0,1]$

A dedicated instruction token ($<$INS$>$) delimits $I$ to ensure architectural consistency and avoid head reinitialization.
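
As a minimal sketch of the combined objective, the snippet below weights the mean negative log-likelihood of an IFIM batch against that of a plain FIM batch. The function name `mixed_objective` and the per-sample NLL inputs are assumptions made for illustration; the source does not describe how the mixture is implemented.

```python
def _mean(values):
    """Arithmetic mean; an empty batch contributes 0.0."""
    values = list(values)
    if not values:
        return 0.0
    return sum(values) / len(values)


def mixed_objective(ifim_nlls, fim_nlls, alpha):
    """Return alpha * L_IFIM + (1 - alpha) * L_FIM.

    ifim_nlls / fim_nlls: per-sample negative log-likelihoods of the
    middle span (hypothetical inputs, e.g. produced by a training loop).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * _mean(ifim_nlls) + (1.0 - alpha) * _mean(fim_nlls)


# alpha = 1.0 trains on instruction-conditioned samples only.
loss = mixed_objective([2.0, 4.0], [1.0], alpha=0.5)
```

An equivalent option in expectation is to format each training sample as IFIM with probability $\alpha$ and as plain FIM otherwise.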

2. Data Generation Pipeline

2.1 Sourcing and Preparation

Middle span $M$: Sampled as 1–3 contiguous code lines from open-source repositories.
Surrounding context: Remaining lines become $P$ (prefix) and $S$ (suffix).
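
The split described above can be sketched as follows. This is an illustrative helper, not the authors' pipeline: the function name, the uniform choice of span length and position, and the toy input are all assumptions.

```python
import random


def split_prefix_middle_suffix(source, rng, max_middle_lines=3):
    """Pick 1..max_middle_lines contiguous lines as the middle span M.

    Everything before the span becomes the prefix P and everything
    after it becomes the suffix S. Sampling is uniform (an assumption).
    """
    lines = source.splitlines(keepends=True)
    if not lines:
        raise ValueError("source must contain at least one line")
    span = rng.randint(1, min(max_middle_lines, len(lines)))
    start = rng.randint(0, len(lines) - span)
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:start + span])
    suffix = "".join(lines[start + span:])
    return prefix, middle, suffix


code = "def add(a, b):\n    total = a + b\n    return total\n\nprint(add(1, 2))\n"
p, m, s = split_prefix_middle_suffix(code, random.Random(0))
# The three spans always reassemble into the original file.
assert p + m + s == code
```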

2.2 Automated Instruction Synthesis

GPT-4o is prompted with $M$ marked by $<$explain$>$…$<$/explain$>$ tags and asked to generate a single, concise sentence (average ≈10 tokens) describing the purpose and intent of $M$.
Only one-sentence instructions are retained. Overlap with standard public code completion datasets (HumanEval, MBPP) is removed.
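
A rough sketch of the two mechanical steps here, marking the middle span and filtering for one-sentence instructions, is given below. The helper names and the regex heuristic are assumptions; the actual prompt wording and filtering rules used with GPT-4o are not given in the source, and no API call is shown.

```python
import re


def mark_middle_for_explanation(prefix, middle, suffix):
    """Wrap the middle span in <explain>...</explain> inside its file context."""
    return f"{prefix}<explain>{middle}</explain>{suffix}"


def is_single_sentence(instruction):
    """Heuristic filter (an assumption): one line, no sentence break inside.

    Sentence-ending punctuation followed by more text is treated as a
    second sentence, so abbreviations such as "e.g. x" are rejected too.
    """
    text = instruction.strip()
    if not text or "\n" in text:
        return False
    return re.search(r"[.!?]\s+\S", text) is None


marked = mark_middle_for_explanation("x = 1\n", "y = x + 1\n", "print(y)\n")
candidates = [
    "Increment x and store the result in y.",
    "Add one. Then print it.",
]
kept = [c for c in candidates if is_single_sentence(c)]
```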

Final dataset scale: 122,900 samples, 70% Python, average instruction ≈10 tokens.

3. Model Architecture, Input Ordering, and Training

Base LLMs: Deepseek-Coder (6.7B; default FIM mode PMS) and Qwen2.5-Coder (7B; default FIM mode PSM).
The $<$INS$>$ token denotes the instruction boundary, leveraging low-frequency vocabulary entries to avoid architectural disruption.
Empirically, input orderings with “I-before-M” maximize performance:
- Deepseek: $<$PRE$>$ $P$ $<$INS$>$ $I$ $<$MID$>$ $S$ $<$SUF$>$ (“PIMS”)
- Qwen2.5: $<$PRE$>$ $P$ $<$SUF$>$ $S$ $<$INS$>$ $I$ $<$MID$>$ (“PSIM”)
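
The two orderings can be illustrated with a small prompt builder. This is a sketch under stated assumptions: the sentinel strings `<PRE>`, `<SUF>`, `<MID>`, and `<INS>` are the abstract names used in this article rather than the literal tokenizer entries of either model, the Deepseek-style layout is assumed to place the suffix text between `<MID>` and `<SUF>`, and `build_ifim_prompt` is a hypothetical helper.

```python
# Abstract sentinels; real models use their own special-token strings.
PRE, SUF, MID, INS = "<PRE>", "<SUF>", "<MID>", "<INS>"


def build_ifim_prompt(prefix, suffix, instruction=None, layout="PSIM"):
    """Assemble an infilling prompt; the model then generates the middle span.

    With instruction=None the prompt degrades to the plain FIM layout,
    which is how instruction-free completion stays supported.
    """
    ins = f"{INS}{instruction}" if instruction else ""
    if layout == "PIMS":  # Deepseek-style: instruction precedes the hole marker
        return f"{PRE}{prefix}{ins}{MID}{suffix}{SUF}"
    if layout == "PSIM":  # Qwen2.5-style: instruction sits right before <MID>
        return f"{PRE}{prefix}{SUF}{suffix}{ins}{MID}"
    raise ValueError(f"unknown layout: {layout!r}")


prompt = build_ifim_prompt(
    "def area(r):\n", "    return a\n", "Compute the circle area", "PSIM"
)
```

In both layouts the instruction appears before the position where the middle span is produced or marked, which is the “I-before-M” property.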

Hyperparameters

Framework: Hugging Face Transformers + PyTorch, 2×A6000 GPUs.
Optimizer: Adafactor, learning rate […], 15 warmup steps, linear decay.
Deepseek-Coder: batch […] (token accumulation), context 1216 tokens.
Qwen2.5-Coder: batch […], context 1216 tokens.
Training: 2 epochs (≈100,000 steps). $\alpha = 1$ (100% IFIM) is optimal for instruction following; 25–75% mixtures were explored for ablation.

4. Evaluation Protocol, Benchmarks, and Results

4.1 Benchmarks & Metrics

IHumanEval: 312 Python infilling problems with docstrings removed.
IRepoMasterEval (IRME): 256 file-level code infill tasks, context truncated to 20 lines both sides.
Primary metric: Pass@1 (proportion of completions passing all unit tests).
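
The evaluation mechanics above can be sketched in a few lines: truncating file context to a fixed number of lines on each side, and scoring Pass@1 as the share of problems whose single completion passes every unit test. Both helper names and the boolean test-result representation are assumptions for illustration; the benchmark harness itself is not described in the source.

```python
def truncate_context(prefix, suffix, max_lines=20):
    """Keep the last max_lines of the prefix and the first max_lines of the suffix."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    p = prefix.splitlines(keepends=True)
    s = suffix.splitlines(keepends=True)
    return "".join(p[-max_lines:]), "".join(s[:max_lines])


def pass_at_1(results):
    """results: one list of booleans per problem, one entry per unit test,
    for the single completion sampled for that problem."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(1 for tests in results if tests and all(tests))
    return solved / len(results)


# Four problems; two completions pass all of their tests.
score = pass_at_1([[True, True], [True, False], [True], [False]])
```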

4.2 Empirical Results

Model	Setting	IHumanEval	IRME
Deepseek-base	w/ ins.	84.6%	10.9%
Deepseek-IFIM	w/ ins.	93.6%	21.1%
Deepseek-base	w/o ins.	68.6%	7.4%
Deepseek-IFIM	w/o ins.	78.2%	16.0%
Qwen2.5-base	w/ ins.	91.0%	18.4%
Qwen2.5-IFIM	w/ ins.	95.8%	20.3%
Qwen2.5-base	w/o ins.	76.0%	10.2%
Qwen2.5-IFIM	w/o ins.	76.3%	13.3%

IFIM yields +4.8 to +9.0 percentage-point improvements on IHumanEval and +1.9 to +10.2 on IRME when instructions are provided.
IFIM additionally raises performance in no-instruction settings (e.g., Deepseek: +9.6 on IHumanEval, +8.6 on IRME).
“I-before-M” ordering outperforms alternatives by 3–5 points.
Mixing in standard FIM samples ($\alpha$ between 0.25 and 0.75) can better preserve baseline infilling when no instructions are given, but pure IFIM ($\alpha = 1$) is optimal for instruction following.
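
The per-model gains can be recomputed directly from the results table. The snippet below copies the table's Pass@1 values and takes the IFIM-minus-base difference for each model and setting; only the dictionary layout and function name are introduced here.

```python
# Pass@1 scores (%) copied from the results table: (IHumanEval, IRME).
table = {
    ("Deepseek", "w/ ins."): {"base": (84.6, 10.9), "IFIM": (93.6, 21.1)},
    ("Deepseek", "w/o ins."): {"base": (68.6, 7.4), "IFIM": (78.2, 16.0)},
    ("Qwen2.5", "w/ ins."): {"base": (91.0, 18.4), "IFIM": (95.8, 20.3)},
    ("Qwen2.5", "w/o ins."): {"base": (76.0, 10.2), "IFIM": (76.3, 13.3)},
}


def gains(scores):
    """Percentage-point gain of IFIM over the base model, rounded to 0.1."""
    out = {}
    for key, row in scores.items():
        out[key] = tuple(
            round(ifim - base, 1) for base, ifim in zip(row["base"], row["IFIM"])
        )
    return out


for (model, setting), (d_he, d_irme) in gains(table).items():
    print(f"{model:9s} {setting:9s} IHumanEval {d_he:+.1f}  IRME {d_irme:+.1f}")
```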

CFIM ablation: Inlining instructions as comments (CFIM) severely degrades performance (e.g., Deepseek: 4.3% on IRME), underscoring the necessity of an explicit $<$INS$>$-delimited instruction span.
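
To make the contrast concrete, here is one way a CFIM-style input might be formed: the instruction is appended to the prefix as an ordinary code comment and the result is fed through the plain FIM layout, with no dedicated instruction token. The exact comment placement used in the ablation is not stated in the source, so `build_cfim_prefix` and its formatting are assumptions.

```python
def build_cfim_prefix(prefix, instruction, comment_marker="#"):
    """Append the instruction to the prefix as a plain comment line (assumed layout)."""
    if prefix and not prefix.endswith("\n"):
        prefix += "\n"
    return f"{prefix}{comment_marker} {instruction}\n"


cfim_prefix = build_cfim_prefix("x = load()\n", "Normalize x to unit length")
# Plain FIM prompt in the abstract PSM notation; the instruction is now
# indistinguishable from any other comment in the code context.
cfim_prompt = f"<PRE>{cfim_prefix}<SUF>print(x)\n<MID>"
```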

5. Design Implications, Limitations, and Best Practices

Effective IFIM-derived “FineInstructions” should be a single, clear sentence (5–15 tokens), describing what to achieve in the missing code (not how).
Use an explicit delimiter (e.g., $<$INS$>$ or IDE-friendly #! ...) for instructions, enabling seamless post-completion removal.
The IFIM dataset as built is Python-focused; testing cross-language robustness and scaling to larger model sizes (30B+) are essential next steps.
Harvesting “wild” data (e.g., inline developer comments, logs) is promising but requires robust filtering and intent extraction.
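
The first two practices lend themselves to simple tooling. The sketch below checks that an instruction is a single line of 5–15 tokens and strips `#!`-prefixed instruction lines after completion. Whitespace-separated words stand in for model tokens, and both helper names, along with the rule that preserves a shebang on the first line, are assumptions rather than anything specified in the source.

```python
def looks_like_fine_instruction(text, min_tokens=5, max_tokens=15):
    """Single line with 5-15 whitespace-separated words (a proxy for tokens)."""
    stripped = text.strip()
    if "\n" in stripped:
        return False
    return min_tokens <= len(stripped.split()) <= max_tokens


def strip_instruction_lines(code, marker="#!"):
    """Drop lines that start with the instruction marker, keeping a shebang."""
    kept = []
    for index, line in enumerate(code.splitlines(keepends=True)):
        is_shebang = index == 0 and line.startswith("#!/")
        if line.lstrip().startswith(marker) and not is_shebang:
            continue
        kept.append(line)
    return "".join(kept)


source = "#!/usr/bin/env python\nx = 1\n    #! add one to x\nx += 1\n"
cleaned = strip_instruction_lines(source)
```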

6. Impact and Prospects

IFIM provides a backward-compatible, instruction-aware extension of standard FIM pretraining for code LLMs (Sun et al., 29 Sep 2025). It delivers substantial (>8 percentage-point) gains in following finely specified developer intent, while preserving or improving the model’s performance in vanilla infilling scenarios lacking explicit instructions. The approach reconciles the historical trade-off—imposed by standard instruction tuning—between infilling competence and instruction adherence in code completion systems, establishing a new state of the art on both synthetic and real-world programming benchmarks.


References

Sun et al. (2025). Bridging Developer Instructions and Code Completion Through Instruction-Aware Fill-in-the-Middle Paradigm.
