Fill-in-the-Middle Code Completion
- Fill-in-the-middle (FIM) code completion is a paradigm that synthesizes the missing code fragment between a given prefix and suffix while preserving both syntactic and semantic validity.
- Recent advances leverage dual-channel transformers, AST-based masking, and non-monotonic decoding to capture complex code structures across various programming languages.
- Integrating incremental parsing, repository-level context fusion, and robust post-processing enhances overall reliability and practical performance in modern editing environments.
Fill-in-the-middle (FIM) code completion denotes the task and modeling paradigm wherein a system generates the missing code fragment located between a given prefix (code preceding the gap) and a suffix (code following the gap). This contrasts with classic left-to-right, next-token completion, and is especially relevant for modern code editing environments and repository-scale code assistance, where developers insert or replace code fragments in arbitrary locations. Recent advances in code-oriented LLMs, program analysis integration, and structure-aware data augmentation have enabled robust and contextually nuanced FIM solutions across diverse programming languages and tasks.
1. Formal Problem Definition and Theoretical Foundations
In the FIM completion task, given a code prefix $x_{\text{pre}}$ and a code suffix $x_{\text{suf}}$, the model is tasked with synthesizing a “middle” $x_{\text{mid}}$ that, when inserted between $x_{\text{pre}}$ and $x_{\text{suf}}$, yields a syntactically and (ideally) semantically valid program. This infilling scenario necessitates bidirectional context modeling, as the generated code must connect fluently to both the prefix and suffix. Formally, the model estimates

$$P(x_{\text{mid}} \mid x_{\text{pre}}, x_{\text{suf}}),$$

with variants including line-level, statement-level, and arbitrary-span infilling.
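In practice, decoder-only models are often trained on a sentinel-token rearrangement of each document, such as the widely used prefix–suffix–middle (PSM) order, so that an autoregressive model conditions on both sides before generating the middle. A minimal sketch, using placeholder sentinel strings rather than any particular model's special tokens:

```python
def build_fim_prompt(prefix: str, suffix: str,
                     pre_tok: str = "<fim_prefix>",
                     suf_tok: str = "<fim_suffix>",
                     mid_tok: str = "<fim_middle>") -> str:
    """Arrange a FIM example in prefix-suffix-middle (PSM) order.

    The model sees the prefix and suffix up front and generates the
    middle after the final sentinel, conditioning on both sides.
    The sentinel strings here are illustrative placeholders; real
    models define their own special-token vocabulary.
    """
    return f"{pre_tok}{prefix}{suf_tok}{suffix}{mid_tok}"

prompt = build_fim_prompt(
    prefix="def mean(xs):\n    total = ",
    suffix="\n    return total / len(xs)\n",
)
```

At inference time the generated middle is read off after the final sentinel and spliced back between the original prefix and suffix.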
Several empirical and theoretical contributions leverage context-free grammar (CFG) quotienting and incremental parsing to guarantee syntactic validity during generation. For instance, parsing state machines based on incremental Earley parsers can be extended to handle the right quotient of the CFG with respect to the fixed suffix, ensuring that the generated tokens at any generation step constitute a valid prefix with an available completion path to the suffix (Melcer et al., 28 Feb 2024). This approach is formalized by asking whether, for a constructed prefix $p$ and the fixed suffix $s$, there exists a continuation $c$ such that $p \cdot c \cdot s \in L(G)$, the language of valid programs.
Additionally, language modeling approaches have drawn from the “naturalness” hypothesis of code, positing that source code, like natural language, exhibits regularities exploitable by statistical models to prioritize likely completions (Nguyen et al., 2019).
2. Model Architectures, Training Paradigms, and Representations
Fundamental advances in FIM code completion emerge from carefully designed neural architectures, masking strategies, and representational choices:
- Transformer-Based and Dual-Channel Models: CodeFill (Izadi et al., 2022) employs a parallel Transformer architecture, separately encoding token names and corresponding AST token types, merging their representations for enhanced syntactic and naming awareness. Multi-task learning objectives (token value prediction, token type prediction, and statement completion) enable the model to capture both fine-grained lexical relationships and long-range grammatical dependencies.
- Syntax- and Structure-Aware Masking: AST-FIM (Gong et al., 30 May 2025) replaces random character-span masking with AST-driven masking at pretraining, selecting entire syntactic structures (e.g., function definitions or control blocks) as the “middle.” This produces cohesive, semantically meaningful training samples, and ensures the model observes realistic code edit patterns, leading to improved performance on real-world infilling benchmarks.
- Self-Infilling and Non-Monotonic Decoding: The self-infilling framework (Zheng et al., 2023) exploits FIM-trained LLMs’ inherent capacity for both left-to-right and fill-in-the-middle decoding. Decoding is made non-monotonic through controlled interruptions (using an uncertainty threshold) and looping (cyclic refinement of prefix–middle–suffix), yielding regularized, less degenerate generations and faithfully integrating both near and distant context.
- Instruction- and Retrieval-Augmented Infilling: Instruction-aware FIM (IFIM) (Sun et al., 29 Sep 2025) extends the completion context to explicitly include developer intent via natural language instructions, allowing models to better resolve underspecified infilling regions. Retrieval-augmented setups (e.g., ProCC (Tan et al., 13 May 2024)) supplement context with semantically similar code fragments, retrieved using combinations of prompt engineering and adaptive selection algorithms, improving the quality of completions especially when the context is ambiguous or sparse.
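The structure-aware masking idea above can be illustrated with Python's standard `ast` module: instead of masking a random character span, select one complete syntactic node as the "middle." This is a deliberately simplified sketch of the approach (masking a top-level statement), not the cited pipeline:

```python
import ast

def ast_mask_example(source: str):
    """Split `source` into (prefix, middle, suffix) where `middle` is
    the exact text of one complete top-level statement -- a simplified
    stand-in for AST-driven FIM masking."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    # pick the last top-level statement as the masked "middle"
    node = tree.body[-1]
    start, end = node.lineno - 1, node.end_lineno  # 1-based -> slice
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    return prefix, middle, suffix

src = "x = 1\ny = 2\nprint(x + y)\n"
pre, mid, suf = ast_mask_example(src)
```

Because the masked span is always a complete node, every training example asks the model to regenerate a cohesive syntactic unit rather than an arbitrary character range.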
3. Constraint Handling, Syntax Preservation, and Post-Processing
Ensuring that FIM completions are syntactically valid and contextually appropriate requires integrating program analysis and syntax-aware post-processing:
- Incremental Earley Parsing and Lexical Disambiguation: By coupling token-by-token generation with an incremental parser (adapting Earley's algorithm), completion systems can reject tokens that cannot form a valid continuation in the quotient grammar tailored for the right context (Melcer et al., 28 Feb 2024). This methodology supports context-sensitive features, such as Python’s indentation-based blocks and maximal-munch lexing, by maintaining branched lexer/parser states when ambiguous lexeme boundaries arise.
- Syntax-Aware Post-Processing: For evaluation and practical code integration, SAFIM (Gong et al., 7 Mar 2024) applies rigorous tree-based truncation and expansion so that generated output forms valid AST nodes precisely slotting into the original code structure. This multi-stage truncation and verification reduces compile-time errors and increases unit-test pass rates across LLMs.
- Empirical Observation on Truncation: Post-processing remains essential for “random span” FIM tasks where the middle region is not cleanly line- or block-delimited (Ahmad et al., 24 May 2025). However, for completions targeting entire, well-delimited regions, instruction-tuned and fine-tuned models demonstrate substantial reduction in boundary errors, sometimes outperforming the same outputs after post-processing.
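A minimal form of syntax-aware truncation can be sketched as follows: cut the raw generation at the longest line boundary for which prefix + truncated middle + suffix parses. This toy version uses Python's `ast.parse` as the validity oracle and is far simpler than SAFIM's tree-based truncation and expansion:

```python
import ast

def truncate_to_valid(prefix: str, generated: str, suffix: str) -> str:
    """Return the longest line-boundary truncation of `generated` such
    that prefix + truncation + suffix parses as valid Python; fall back
    to the full generation if no truncation parses."""
    lines = generated.splitlines(keepends=True)
    for cut in range(len(lines), -1, -1):
        candidate = "".join(lines[:cut])
        try:
            ast.parse(prefix + candidate + suffix)
            return candidate
        except SyntaxError:
            continue
    return generated
```

Real systems operate on the parse tree itself rather than line boundaries, but the principle, keeping only output that yields a valid program in context, is the same.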
4. Context Utilization and Repository-Level Completion
Modern FIM methods leverage not only the immediate edit neighborhood, but also repository-level information:
- Repository Context Fusion: RepoFusion (Shrivastava et al., 2023) retrieves multiple code snippets from across a repository (via prompt proposals, BM25, or nearest-neighbor search), encoding them in parallel and fusing representations at decoding. This approach enables small models to rival much larger models (e.g., StarCoderBase) by allowing them to reference project-specific definitions, imports, and contextual dependencies, and is validated empirically on extensive Java codebases.
- Context Collection Challenges: Comprehensive studies on context optimization (Ustalov et al., 5 Oct 2025) reveal that the quality of retrieved and assembled context (from project- or repository-wide symbol graphs, code chunks, and cross-file relations) significantly determines infilling performance. Effective approaches combine static program analysis, BM25/FAISS retrieval, and tailored heuristics to supply the most relevant context to LLMs, optimize within token budget constraints, and prevent data leakage from future versions.
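A lightweight version of repository context collection, lexical scoring of candidate snippets against the edit neighborhood under a token budget, can be sketched as follows. This is a crude overlap-based stand-in for the BM25/FAISS retrieval pipelines described above:

```python
def rank_context(query: str, snippets: list[str], budget: int) -> list[str]:
    """Score snippets by whitespace-token overlap with the query (a
    crude BM25 stand-in) and greedily pack the highest-scoring ones
    into a token budget."""
    q_tokens = set(query.split())
    scored = sorted(
        snippets,
        key=lambda s: len(q_tokens & set(s.split())),
        reverse=True,
    )
    chosen, used = [], 0
    for snip in scored:
        cost = len(snip.split())
        if used + cost <= budget:
            chosen.append(snip)
            used += cost
    return chosen
```

Production systems replace the overlap score with BM25 or embedding similarity and the greedy packer with budget-aware heuristics, but the shape of the problem, rank then pack under a token limit, is as shown.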
5. Datasets, Benchmarks, and Evaluation Metrics
FIM research is rigorously benchmarked across diverse datasets and competitive settings:
- Standardized Benchmarks: SAFIM (Gong et al., 7 Mar 2024) targets multiple structural infilling tasks (algorithmic blocks, control-flow predicates, and API calls) and includes post-April-2022 data to minimize training data contamination, enabling credible evaluation of model generalization. Real-FIM-Eval (Gong et al., 30 May 2025) derives infilling regions from actual GitHub commits, capturing authentic developer edit patterns across twelve languages.
- Metrics Beyond Exact Match: Evaluations use Pass@1 (unit-test pass rate on the first completion), Edit Similarity, ROUGE-L, METEOR, and character-level F-score (chrF) (Ustalov et al., 5 Oct 2025) to assess both functional equivalence and structural proximity between generated and ground-truth segments. Embedding-level cosine similarity (Zhang et al., 21 Feb 2025), for semantic closeness, and latency, for real-time usability, are also commonly reported.
- Comparative Analyses: Cross-model analyses (Zhang et al., 21 Feb 2025, Gong et al., 7 Mar 2024) uncover trade-offs between model size, speed, and syntactic robustness, finding that well-tuned FIM pretraining and data quality often matter more than sheer parameter count for effective infilling.
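Two of the lighter-weight metrics are easy to state precisely. The sketch below uses `difflib.SequenceMatcher` as an approximation of normalized edit similarity, alongside a whitespace-insensitive exact match; reported benchmarks typically use true Levenshtein-based ratios:

```python
import difflib

def edit_similarity(pred: str, ref: str) -> float:
    """Approximate edit similarity in [0, 1] via difflib's matching-block
    ratio (close to, but not identical with, 1 - normalized Levenshtein
    distance)."""
    return difflib.SequenceMatcher(None, pred, ref).ratio()

def exact_match(pred: str, ref: str) -> bool:
    """Strict equality after stripping surrounding whitespace."""
    return pred.strip() == ref.strip()
```

Functional metrics such as Pass@1 instead execute the completed program against unit tests, so they require a sandboxed runner rather than string comparison.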
6. Curriculum Learning, Horizon Control, and Mitigating Model Limitations
FIM task difficulty is dynamically addressed through curriculum design, horizon-length prediction, and preference optimization:
- Curriculum and Context Emphasis: Extracting and upweighting “difficult-to-complete” patterns based on AST complexity (e.g., deeply nested calls, dependency-resolving expressions) and supplementing each example with symbol-rich context (from BM25 or static analysis) yields disproportionately higher gains in model acceptance and persistence rates, especially in latency-sensitive, small-parameter regimes (Sagtani et al., 21 Dec 2024).
- Horizon-Length Prediction (HLP): Traditional next-token prediction only weakly encodes when to stop an infill, leading to off-by-one and over-/under-generation errors. HLP (Ding et al., 4 Oct 2024) augments pretraining with an auxiliary head predicting the normalized number of remaining tokens at each step within the infill region, yielding substantial gains (up to 24% relative) in FIM performance without additional inference cost.
- Direct Preference Optimization (DPO) and AST Segmentation: Fine-grained DPO combined with AST-based block splitting enables dense generation of preferred/dispreferred code pairs for better calibration—even when test-case–verified training data is scarce. When DPO loss is restricted to only the generated “middle,” models achieve targeted improvements without harming the correctness of the preserved context (Ren et al., 27 Aug 2025, Yu et al., 21 Aug 2025).
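The HLP supervision signal is simple to construct from a tokenized middle: at each position, the target is the fraction of the infill still to be generated. The sketch below shows one plausible normalization of that target (the exact scheme in the cited work may differ, and the auxiliary prediction head itself is model-specific and omitted):

```python
def hlp_targets(middle_tokens: list[str]) -> list[float]:
    """For each position i in the infill region, return the fraction of
    the middle remaining after emitting token i -- the regression target
    for a horizon-length prediction head. Assumes normalization by the
    total middle length."""
    n = len(middle_tokens)
    return [(n - 1 - i) / n for i in range(n)]
```

The target decays monotonically to zero at the final infill token, giving the model an explicit signal for when the middle should terminate.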
7. Extensions, Emerging Directions, and Cross-Domain Generality
FIM code completion has recently informed related domains and inspired further research directions:
- Mathematical Reasoning Step Expansion: The MathFimer approach (Yan et al., 17 Feb 2025) applies the FIM paradigm to mathematical chain-of-thought data, inserting or expanding reasoning steps between given prefix and suffix chains, leading to gains in accuracy and model reasoning depth.
- Instruction-aware Completion: Integrating explicit natural language instructions (“instruction-aware FIM”) (Sun et al., 29 Sep 2025) addresses the challenge of underspecified developer intent, harmonizing instruction-following and pure code infilling capabilities in a backward-compatible manner.
- Multimodal, Multi-perspective and Plug-and-Play Systems: ProCC and similar frameworks (Tan et al., 13 May 2024) dynamically select among multiple semantically diverse retrievers (e.g., hypothetical lines, textual summaries), chosen by contextual bandit algorithms, for context enrichment, and support plug-and-play augmentation over fine-tuned baselines.
- Limitations and Open Problems: Achieving robust infilling across all programming paradigms, modeling long-range, cross-file dependencies within token constraints, mitigating hallucinations, and eliminating dependence on fragile boundary heuristics remain active areas of research.
In summary, fill-in-the-middle code completion has evolved into a vibrant domain encompassing program analysis, neural modeling, context retrieval, syntax-aware post-processing, and curriculum-guided training. Advances in structure-aware pretraining, horizon-length awareness, direct preference optimization, and repository-scale context fusion yield robust and contextually faithful code infilling, not only narrowing the gap between automated and human-in-the-loop development practices but also extending to domains such as mathematical reasoning and hybrid instruction–context code generation.