
Fill-in-the-Middle Code Completion

Updated 23 August 2025
  • Fill-in-the-Middle (FIM) code completion is a technique that generates code infills between a given prefix and suffix to produce syntactically and semantically valid programs.
  • It leverages advanced strategies such as AST-aware masking, structured pretraining, and cross-file context integration to enhance code completion accuracy.
  • Recent innovations, including dual-context conditioning, reinforcement learning, and human-in-the-loop decoding, have significantly improved FIM model performance in real-world settings.

Fill-in-the-Middle (FIM) code completion is a paradigm in automated program synthesis and development assistance whereby a system generates code that must fit between a given prefix and suffix—i.e., it fills in the “middle” of an incomplete code sequence. FIM models must reconcile both preceding and succeeding code context, a challenge that distinguishes this setting from standard left-to-right completion and fundamentally shapes the required methodologies, evaluation metrics, and empirical outcomes. In recent years, a wide range of technical innovations—including advances in model objectives, syntactic and semantic constraint integration, contextual retrieval, and real-world evaluation—have led to substantial improvements in FIM code completion, with contemporary research focusing on both architectural and dataset-level enhancements.

1. Foundational Concepts and Problem Setting

FIM code completion tasks ask a model to generate code that appears between a given prefix p and suffix s, formally producing a middle segment x such that the completed file [p; x; s] is syntactically and semantically valid. The conditional modeling objective is:

P(x | p, s)

The paradigm’s complexity stems from the bidirectional conditioning: inserted code must interact coherently with definitions, data flow, and control flow in both directions, unlike left-to-right next-word prediction.

The FIM setting generalizes many practical developer workflows—such as editing or extending code in complex files, inserting new blocks, or refactoring inner function bodies—requiring models to capture global program context.
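The core validity requirement can be illustrated with a minimal check: assemble the completed file [p; x; s] and verify that it parses. The sketch below uses Python's `ast` module on invented example snippets; it is a toy validator, not any particular system's implementation.

```python
import ast

def is_valid_infill(prefix: str, middle: str, suffix: str) -> bool:
    """Return True if the completed file [p; x; s] parses as valid Python."""
    try:
        ast.parse(prefix + middle + suffix)
        return True
    except SyntaxError:
        return False

# Invented example: clamp a value into [lo, hi].
prefix = "def clamp(x, lo, hi):\n"
suffix = "    return x\n"
good_middle = "    x = max(lo, min(x, hi))\n"
bad_middle = "    x = max(lo, min(x, hi)\n"  # unbalanced parenthesis
```

Semantic validity (type correctness, data-flow coherence) is of course stricter than this parse check; the parse is only the first gate.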

2. Training Objectives and Pretraining Strategies

2.1 Standard and AST-Aware FIM Objectives

The canonical FIM objective, heavily utilized in systems such as DeepSeek-Coder, Stable Code, and aiXcoder-7B, splits code into prefix, middle, and suffix, and then reformulates input/output sequences for autoregressive training. The “Prefix-Suffix-Middle” (PSM) and “Suffix-Prefix-Middle” (SPM) formats are frequently used, with special tokens demarcating boundaries (e.g., <fim_start>, <fim_hole>, <fim_end>):

Sample = <fim_start> p <fim_hole> s <fim_end> x

Empirical results indicate a 50% PSM/SPM split provides strong infilling and left-to-right generation (Guo et al., 25 Jan 2024; Pinnaparaju et al., 1 Apr 2024).
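The sample construction can be sketched directly. The PSM branch follows the formula above; the SPM branch is one common layout, and the exact sentinel token names vary across model families, so treat both as illustrative.

```python
FIM_START, FIM_HOLE, FIM_END = "<fim_start>", "<fim_hole>", "<fim_end>"

def make_fim_sample(prefix: str, middle: str, suffix: str, mode: str = "PSM") -> str:
    """Reformat a (prefix, middle, suffix) split into one training sequence."""
    if mode == "PSM":
        # Prefix-Suffix-Middle: the model sees prefix, then suffix, then learns the middle.
        return f"{FIM_START}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"
    if mode == "SPM":
        # Suffix-Prefix-Middle: suffix first, so the prefix flows directly into the middle.
        return f"{FIM_START}{FIM_HOLE}{suffix}{FIM_END}{prefix}{middle}"
    raise ValueError(f"unknown mode: {mode}")

sample = make_fim_sample("def f():\n", "    return 1\n", "\nprint(f())\n")
```

In a 50/50 PSM/SPM regime, each training document is routed to one of the two layouts at random before tokenization.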

Recent advances propose structure-aware masking using Abstract Syntax Trees (ASTs). AST-FIM (Gong et al., 30 May 2025) samples the “middle” as an entire subtree (such as an if block or function definition), strictly preserving authentic boundaries:

  • Single-node masking: randomly select an AST node (proportional to size), mask it.
  • Aligned span masking: determine a random character span, and map to the smallest AST subtree covering it (or a maximal set of contiguous child nodes).

This approach increases alignment with real code editing, as validated on the Real-FIM-Eval benchmark (GitHub commits) where AST-FIM models yield up to 5-point improvements over random-FIM (Gong et al., 30 May 2025).
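A minimal version of single-node masking can be written with Python's `ast` module: choose a random statement node and take its exact source span as the middle, so the hole always falls on authentic syntactic boundaries. This is an illustrative sketch, not the AST-FIM implementation (which also handles size-proportional sampling and aligned spans).

```python
import ast
import random

def ast_fim_split(source: str, seed: int = 0):
    """Single-node masking: a random statement node's exact source span
    becomes the FIM middle, so hole boundaries are always syntactic."""
    tree = ast.parse(source)
    stmts = [n for n in ast.walk(tree) if isinstance(n, ast.stmt)]
    node = random.Random(seed).choice(stmts)
    lines = source.splitlines(keepends=True)
    # Convert the node's 1-based (line, col) span into character offsets.
    start = sum(len(l) for l in lines[: node.lineno - 1]) + node.col_offset
    end = sum(len(l) for l in lines[: node.end_lineno - 1]) + node.end_col_offset
    return source[:start], source[start:end], source[end:]

code = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
prefix, middle, suffix = ast_fim_split(code)
```

By construction the three pieces concatenate back to the original file, and the middle is always a complete statement (an `if` block, a `return`, or the whole function).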

2.2 Structured and Curriculum-Based Pretraining

Structured Fill-In-the-Middle (SFIM) (Jiang et al., 17 Oct 2024) leverages ASTs to mask semantically significant code fragments, randomly selecting a non-leaf node in a function’s tree to create the completion span. The loss function spans both PSM and SPM variants:

loss_SFIM = -log p([prefix; suffix; middle]) - log p([suffix; prefix; middle])

Curriculum-based strategies (Sagtani et al., 21 Dec 2024; Yu et al., 21 Aug 2025) improve small-to-mid-scale model efficacy by emphasizing hard-to-complete, high-complexity spans (measured via AST node metrics or code complexity heuristics), sampled preferentially during fine-tuning.
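The curriculum idea of preferring high-complexity spans can be sketched with a simple heuristic: here, the number of AST nodes in a candidate span stands in for the complexity metrics used in the cited work, and sampling weight is proportional to that count.

```python
import ast
import random

def span_complexity(snippet: str) -> int:
    """Complexity heuristic: count AST nodes in the candidate span
    (a stand-in for the AST-node metrics in the cited work)."""
    try:
        return sum(1 for _ in ast.walk(ast.parse(snippet)))
    except SyntaxError:
        return 0

def curriculum_sample(spans: list, rng: random.Random) -> str:
    """Sample a completion span with probability proportional to its
    complexity, so harder spans are emphasized during fine-tuning."""
    weights = [span_complexity(s) for s in spans]
    return rng.choices(spans, weights=weights, k=1)[0]

spans = ["x = 1", "y = [i * i for i in range(10) if i % 2]"]
picked = curriculum_sample(spans, random.Random(0))
```

A real curriculum would also schedule the weighting over training time (easy first, hard later, or vice versa); the weighted sampler is the common building block.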

3. Integrating Context and Repository-Level Knowledge

3.1 Cross-File and Graph-Based Augmentation

FIM completion in large repositories often fails if models only observe in-file context. CoCoMIC (Ding et al., 2022) introduces a project-level program context graph, combining in-file tokens and retrieved cross-file entities using a joint attention mechanism. The CCFINDER tool statically analyzes a project, extracting functions, classes, and globals, and links them by import/member/function/variable edges; targeted retrieval is performed via depth-limited search over this graph.

GraphCoder (Liu et al., 11 Jun 2024) refines this with a Code Context Graph (CCG) encoding control-flow, data, and control dependencies:

  • Coarse filtering uses bag-of-words measures over context slices.
  • Fine-grained reranking computes a decay-weighted subgraph edit distance, optimally aligning dependencies to the query context, prioritizing edits closest to the insertion point.

Both methods systematically outperform token-sequence or windowed retrieval, yielding +6% EM or identifier match improvements on repository-scale FIM tasks.
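The two-stage retrieval pattern (coarse filter, then fine rerank) can be illustrated at its first stage with a bag-of-identifiers Jaccard filter. The actual GraphCoder pipeline operates over graph slices and reranks with a decay-weighted subgraph edit distance; the snippets and names below are invented for illustration.

```python
import re

def bag_of_words(code: str) -> set:
    """Identifier bag for a code slice."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def coarse_filter(query: str, candidates: list, k: int = 2) -> list:
    """Stage 1: rank candidate context slices by Jaccard similarity of
    identifier bags, keeping the top-k for the (omitted) fine rerank."""
    q = bag_of_words(query)
    def jaccard(c: str) -> float:
        b = bag_of_words(c)
        return len(q & b) / max(1, len(q | b))
    return sorted(candidates, key=jaccard, reverse=True)[:k]

query = "total = price * quantity"
candidates = [
    "def apply_discount(price): ...",
    "log.info('start')",
    "quantity = cart.count()",
]
top = coarse_filter(query, candidates)
```

The cheap set operations make stage 1 scalable to repository-sized candidate pools; only the survivors pay the cost of graph alignment.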

3.2 Telemetry and Metadata Integration

Transformer-based invocation filtering (Moor et al., 23 May 2024) demonstrates that combining code context with IDE-derived telemetry (e.g., document length, cursor position, time since last suggestion) in classifier heads or attention layers allows for smart, latency-optimized FIM triggering, outperforming heuristic or logistic regression baselines in real-time settings.
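The feature plumbing behind such an invocation filter can be sketched as follows. The linear scorer and its weights are invented for this sketch; the cited approach learns the decision with transformer or classifier heads over the same kinds of telemetry signals.

```python
def should_trigger(features: dict, threshold: float = 0.5) -> bool:
    """Score IDE telemetry features and decide whether to request a completion.
    Weights are illustrative, not learned."""
    weights = {
        "seconds_since_last_suggestion": 0.02,  # staleness encourages triggering
        "cursor_at_line_end": 0.4,              # natural completion point
        "document_length_kb": -0.001,           # large files cost latency
    }
    score = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return score >= threshold

fire = should_trigger({"seconds_since_last_suggestion": 30,
                       "cursor_at_line_end": 1,
                       "document_length_kb": 100})
```

The practical payoff is latency: suppressing low-value invocations saves both model compute and user interruptions.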

4. Syntax and Output Constraints

4.1 Syntactic Validity via Quotient Parsing

Constrained generation ensures that infills are always syntactically valid in the presence of complex right context. The method of left and right quotienting (Melcer et al., 28 Feb 2024) extends Earley parsing for context-free and even some context-sensitive grammars:

L/R = { l ∈ Σ* | ∃ r ∈ R : l ∘ r ∈ L }

By computing the quotient language and maintaining incremental lexer/parser state, the generation process can reject any prefix that cannot, when joined with the given suffix, be extended to a full program. Empirical results show constrained FIM decoding boosts syntactic validity to ≈89.5%, compared with 65% for unconstrained autocompletion.

4.2 Subtoken Alignment and Byte-Level Decoding

Character- and token-level FIM can induce “sub-token” prediction issues when splits do not align to token boundaries, resulting in high perplexity and errors. FIM-SE (Ren et al., 27 May 2024) addresses this by enforcing line-level alignment and inserting explicit start/end markers, yielding up to +11% improvement on single- and multi-line infilling tasks.

Byte-level decoding (Phan et al., 11 Oct 2024) further solves tokenization bias by interpreting LM output in the byte space rather than token space. Using the Byte-Token Representation Lemma:

P(x_1^n) = Σ_{t ∈ cover(x_1^n)} P(t)

the next-byte probability is accurately computed, eliminating errors when the prompt ends mid-token and recovering up to 18% performance in standard FIM benchmarks.
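A toy version of the marginalization: collapse a next-token distribution onto first bytes, so decoding can proceed byte-by-byte even when the prompt ends mid-token. This simplifies the lemma, which sums over all token sequences covering a byte string, down to a single step; the vocabulary and probabilities are invented.

```python
from collections import defaultdict

def next_byte_distribution(token_probs: dict) -> dict:
    """P(byte) is the sum of P(token) over tokens whose UTF-8 encoding
    starts with that byte."""
    dist = defaultdict(float)
    for token, p in token_probs.items():
        encoded = token.encode("utf-8")
        if encoded:
            dist[encoded[0]] += p
    return dict(dist)

# Invented next-token distribution from a hypothetical LM step.
token_probs = {"in": 0.5, "int": 0.3, "for": 0.2}
byte_dist = next_byte_distribution(token_probs)
```

Here the byte `i` absorbs probability mass from both `in` and `int`, which is exactly the effect lost when decoding is forced to commit to whole tokens.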

5. Evaluation and Real-World Benchmarks

Modern FIM evaluation now occurs on large, domain-representative suites such as SAFIM (Gong et al., 7 Mar 2024), Real-FIM-Eval (Gong et al., 30 May 2025), and aiXcoder/FIM-Eval (Jiang et al., 17 Oct 2024):

  • SAFIM leverages ASTs to mask meaningful code—blocks, expressions, API calls—for multilingual assessment (Python, Java, C++, C#). Evaluation is via pass@1 (unit tests or syntax matching).
  • Real-FIM-Eval is derived from 30,000+ real GitHub commits/excerpts across 12 languages.
  • aiXcoder/FIM-Eval, CrossCodeEval, coLT, and ExecRepoBench capture file-level and cross-file repository settings.

Rigorous metrics—pass@1, exact match (EM), edit similarity (ES), CodeBLEU, prefix match (PM), and latency (L)—ensure multidimensional benchmarking. Empirical studies highlight:

  • FIM-specific pretraining confers strong generalization, often closing the gap with larger, non-FIM-tuned LLMs; pretraining/data quality can matter more than raw parameter count (Gong et al., 7 Mar 2024, Jiang et al., 17 Oct 2024).
  • Syntax-aware post-processing (AST-based truncation) increases robust evaluation on real code, especially for models lacking end-of-sequence awareness.
  • Post-processing of output remains necessary for instruction-tuned models on random-span infilling, but can be omitted for complete-line FIM when models have been sufficiently supervised (Ahmad et al., 24 May 2025).
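Two of the standard metrics can be computed in a few lines. Note that benchmark definitions of edit similarity differ in detail; this sketch uses `difflib`'s matching-block ratio as a stand-in (many benchmarks use Levenshtein-based similarity instead).

```python
import difflib

def exact_match(pred: str, ref: str) -> bool:
    """EM: whitespace-trimmed string equality."""
    return pred.strip() == ref.strip()

def edit_similarity(pred: str, ref: str) -> float:
    """ES in [0, 1], here via difflib's matching-block ratio."""
    return difflib.SequenceMatcher(None, pred, ref).ratio()

em = exact_match("return max(a, b)", "return max(a, b)")
es = edit_similarity("return max(a, b)", "return min(a, b)")
```

ES rewards near-misses that EM scores as zero, which is why benchmarks report both alongside execution-based pass@1.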

6. Model Innovations and Comparative Results

  • DeepSeek-Coder (Guo et al., 25 Jan 2024), Stable Code (Pinnaparaju et al., 1 Apr 2024), aiXcoder-7B (Jiang et al., 17 Oct 2024), and SynthCoder (Yu et al., 21 Aug 2025) implement dual-context FIM objectives (PSM/SPM), structure-aware masking, and large-scale multi-objective training, achieving competitive or state-of-the-art results across FIM benchmarks with parameter counts as low as 3–7B.
  • Horizon-Length Prediction (HLP) (Ding et al., 4 Oct 2024) augments FIM with lookahead: the model predicts the normalized number of remaining middle tokens as a planning signal. This improves open-domain FIM performance by up to 24%, also improving code reasoning benchmarks, with negligible training and zero inference cost.
  • Reinforcement learning via immediate rewards (IRCoCo (Li et al., 30 Jan 2024)) further mitigates exposure bias, dynamically adapting completions for both local context and real-time edits; gains include +40% EM and +7.9% edit similarity over strong SFT/DRL baselines.
  • HiLDe (González et al., 28 May 2025) introduces human-in-the-loop decoding by exposing token-level uncertainty and semantic alternatives during FIM completion. User studies show a 31% reduction in code vulnerabilities and improved decision-making, especially on security-critical tasks.
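The HLP planning signal is straightforward to construct as a supervision target: at each decoding step, the fraction of middle tokens still to be generated. The exact normalization in the cited work may differ; this is a sketch of one simple choice.

```python
def hlp_targets(middle_tokens: list) -> list:
    """At step i, the fraction of middle tokens still to be generated
    after emitting token i (one simple normalization choice)."""
    n = len(middle_tokens)
    return [(n - i - 1) / n for i in range(n)]

targets = hlp_targets(["x", "=", "1", "\n"])
```

An auxiliary regression head trained on these targets gives the model an explicit sense of how much room remains before the suffix, at zero inference cost since the head is dropped at deployment.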

A summary of architectural and methodological contributions appears below:

| Model/System | FIM Type | Notable Feature(s) | Performance Gain(s) |
|---|---|---|---|
| DeepSeek-Coder, Stable Code | PSM/SPM, repo FIM | Large window, dual-context, FIM | 81.2% mean infill accuracy (Guo et al., 25 Jan 2024) |
| aiXcoder-7B, SynthCoder | SFIM, AST-mask | Structured completion, curriculum | Outperforms CodeLlama-34B on FIM (Jiang et al., 17 Oct 2024; Yu et al., 21 Aug 2025) |
| CoCoMIC, GraphCoder | Cross-file, graph | CCFINDER, CCG for retrieval | +33.9% EM, +6% EM/ID match (Ding et al., 2022; Liu et al., 11 Jun 2024) |
| FIM-SE, Byte-level, HLP | Char/byte | Line-level alignment, horizon planning | +11% multi-line FIM, +18% byte-level (Ren et al., 27 May 2024; Phan et al., 11 Oct 2024; Ding et al., 4 Oct 2024) |
| IRCoCo | RL-finetuned | Immediate rewards | +40% EM over SFT/DRL (Li et al., 30 Jan 2024) |
| HiLDe | Human-in-the-loop | Token-level uncertainty UI | 31% fewer vulnerabilities (González et al., 28 May 2025) |

7. Open Challenges and Directions

FIM code completion continues to confront open problems:

  • The search space for valid completions remains vast. Efficiently ranking and pruning candidates—especially incorporating type systems, accessibility, and lexical constraints—remains an active area (Nguyen et al., 2019, Jiang et al., 17 Oct 2024).
  • Out-of-vocabulary tokens, project-specific APIs, and cross-language adaptation remain challenging, motivating dynamic retrieval, graph-based, or hybrid symbolic-neural methods.
  • The trade-off between accuracy and latency, particularly in low-resource or real-time IDE scenarios, drives research in small/efficient models, quantization (Stable Code), and workload-aware invocation filtering (Moor et al., 23 May 2024).
  • Integrating explicit boundary awareness (random span vs. complete-line FIM) (Ahmad et al., 24 May 2025), robust semantic post-processing (AST-based truncation), and better generalization to variable-length, cross-file, and mixed-modality settings (e.g., code and comments or code and math reasoning via MathFimer (Yan et al., 17 Feb 2025)) remains a focus.

Emerging techniques—such as horizon-length awareness (Ding et al., 4 Oct 2024), curriculum/context augmentation (Sagtani et al., 21 Dec 2024), byte-level decoding (Phan et al., 11 Oct 2024), human-interactive decision loops (González et al., 28 May 2025), and direct instruction fine-tuning—are expected to play a key role in the ongoing development of FIM code completion models, as new benchmarks (e.g., SAFIM, Real-FIM-Eval) more closely capture the true complexity of software editing and maintenance scenarios.
