Fill-in-the-Middle (FIM) in Language Models
- Fill-in-the-Middle (FIM) is a data transformation technique that restructures input sequences into prefix, middle, and suffix segments for flexible infilling.
- It is applied in domains like code completion, protein engineering, and mathematical reasoning to improve generative quality and contextual reconstruction.
- FIM employs diverse masking strategies and planning objectives to overcome challenges in tokenization and syntax alignment, enhancing model performance.
Fill-in-the-Middle (FIM) is a family of data transformation techniques and training objectives for LLMs—most notably decoder-only transformers—designed to endow models with the ability to infill arbitrary spans between a prefix and suffix, rather than restricting generation to left-to-right completion. Originally motivated by the requirements of code completion, editing, and revision tools, FIM has become a core paradigm for both pretraining and evaluation of LLMs across natural language, code, mathematics, protein sequences, and more. The essential mechanism is to restructure input data into (prefix, middle, suffix) triplets, training the model to reconstruct the missing middle segment conditioned on observed prefix and suffix. Diverse architectural, data, and algorithmic variants have subsequently emerged to address key challenges in syntax alignment, tokenization boundaries, semantic intent ambiguity, and planning horizon.
1. Core Definitions and Theoretical Foundations
The canonical FIM transformation begins with a sequence $x = (x_1, \dots, x_n)$ and samples two indices $0 \le i \le j \le n$ to produce:
- Prefix: $x_{\text{pre}} = (x_1, \dots, x_i)$
- Middle (masked out during infill): $x_{\text{mid}} = (x_{i+1}, \dots, x_j)$
- Suffix: $x_{\text{suf}} = (x_{j+1}, \dots, x_n)$

The model is presented with $x_{\text{pre}}$ and $x_{\text{suf}}$ (often marked by explicit [PRE] and [SUF] sentinels) alongside a [MID] marker preceding the region to be generated. For standard autoregressive models, the learning objective is

$$\mathcal{L}(\theta) = -\sum_{t=i+1}^{j} \log p_\theta\!\left(x_t \mid x_{\text{pre}},\, x_{\text{suf}},\, x_{i+1}, \dots, x_{t-1}\right),$$

as in (Bavarian et al., 2022, Guo et al., 25 Jan 2024, Lee et al., 2023), among many others.
This generic form can be instantiated in various prompt formats (Prefix–Suffix–Middle (PSM), Suffix–Prefix–Middle (SPM), document- vs. context-level FIM) and is employed across tasks ranging from code and protein design (Lee et al., 2023) and type prediction (Cassano et al., 2023) to mathematical step expansion (Yan et al., 17 Feb 2025). A minimal sketch of the PSM and SPM constructions follows.
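The sketch below illustrates the transformation under the definitions above; the sentinel strings and helper names are illustrative placeholders, not any particular model's special-token vocabulary.

```python
import random

# Illustrative sentinels; real tokenizers define their own special tokens.
PRE, SUF, MID = "[PRE]", "[SUF]", "[MID]"

def split_fim(text: str, rng: random.Random) -> tuple[str, str, str]:
    """Sample character indices i <= j and split text into
    (prefix, middle, suffix). Assumes len(text) >= 1."""
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    return text[:i], text[i:j], text[j:]

def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Prefix-Suffix-Middle: the model is trained to emit `middle` after [MID]."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def to_spm(prefix: str, middle: str, suffix: str) -> str:
    """Suffix-Prefix-Middle: suffix is shown first; the target is still `middle`."""
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"

rng = random.Random(0)
pre, mid, suf = split_fim("def add(a, b):\n    return a + b\n", rng)
print(to_psm(pre, mid, suf))
```

At inference time the prompt simply ends at the [MID] sentinel; the model generates the middle and should terminate with an end-of-infill signal.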
2. Architectural and Data Augmentation Variants
FIM does not require architectural changes beyond special marker tokens for [PRE], [SUF], and [MID]. Key design choices include:
- Data Transformation Rate (FIM rate): Substantial evidence shows that mixing FIM and standard left-to-right pretraining at ratios between 50% and 90% does not harm (and in some cases improves) canonical generative perplexity, while sharply increasing infilling quality (Bavarian et al., 2022, Guo et al., 25 Jan 2024, Sagtani et al., 21 Dec 2024). Training at a 100% FIM rate, however, may hurt left-to-right quality.
- Masking Strategy: Line-, token-, or character-level random splitting yields differing robustness to out-of-distribution spans; character-level slicing is the most robust but may produce mid-token splits (Ren et al., 27 May 2024, Bavarian et al., 2022). A sketch combining FIM-rate mixing with both split granularities follows this list.
- AST-aware Masking: To address the semantic mismatch between randomly masked tokens and syntactic code editing, structure-aware FIM (AST-FIM) selects “middle” spans aligned to AST nodes or contiguous AST child subsequences, preserving parseability and mirroring real-world edits. This yields improvements of up to 5 points on FIM benchmarks (Gong et al., 30 May 2025).
- Protein Sequences: In protein engineering, ProtFIM applies FIM to amino acid residues, demonstrating that sequence-aware FIM surpasses standard AR models for reconstructing masked residues under structural preservation constraints (Lee et al., 2023).
- Mathematics: In mathematical reasoning, MathFimer applies FIM to reasoning chains, masking intermediate steps to generate more detailed solutions, leading to 2–5 pp accuracy gains on GSM8K/MATH (Yan et al., 17 Feb 2025).
- Instruction-Aware FIM: IFIM explicitly incorporates developer-supplied natural language instructions into FIM formatting (“PSIM” mode), substantially boosting intent accuracy (+9–10 pp pass@1) without degrading FIM baseline performance (Sun et al., 29 Sep 2025).
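As referenced in the list above, the following is a minimal sketch of FIM-rate mixing with either split granularity; the 0.5 default rate, the guard for very short documents, and the function names are assumptions for illustration.

```python
import random

def split_chars(text: str, rng: random.Random):
    # Character-level split: robust to arbitrary spans, but may cut mid-token.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    return text[:i], text[i:j], text[j:]

def split_lines(text: str, rng: random.Random):
    # Line-level split: boundaries align with newlines, avoiding mid-token cuts.
    lines = text.splitlines(keepends=True)
    i, j = sorted(rng.sample(range(len(lines) + 1), 2))
    return "".join(lines[:i]), "".join(lines[i:j]), "".join(lines[j:])

def transform(doc: str, fim_rate: float = 0.5, char_level: bool = True,
              rng: random.Random | None = None) -> str:
    """With probability `fim_rate`, rewrite `doc` into PSM format;
    otherwise keep it as a plain left-to-right training example."""
    rng = rng or random.Random()
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc
    split = split_chars if char_level else split_lines
    pre, mid, suf = split(doc, rng)
    return f"[PRE]{pre}[SUF]{suf}[MID]{mid}"
```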
| Variant | Main Motivation | Empirical Gain Example |
|---|---|---|
| Random-char FIM | General infilling ability | Strong for line-masked, weaker for random spans (Bavarian et al., 2022) |
| AST-FIM | Syntactically valid completions | +4–7 pp SAFIM pass@1 (Gong et al., 30 May 2025) |
| FIM-SE | Avoid sub-token splits in char-level infill | +4–11 pp HumanEval random-span (Ren et al., 27 May 2024) |
| IFIM | Instruction clarity for intent | +9–10 pp pass@1 instruction infilling (Sun et al., 29 Sep 2025) |
3. Challenges, Post-Processing, and Constrained Decoding
Although FIM models can generate the required middle segments, several practical challenges arise:
- Decoding Boundaries: Many pretrained and instruction-tuned models “bleed” prefix/suffix or over-generate, failing to stop at the true infill boundary. Automatic evaluation scripts thus often rely on dataset-specific truncation or post-processing heuristics, e.g., matching the number of output lines to the reference (Ahmad et al., 24 May 2025); a minimal sketch of such a heuristic follows this list.
- Tokenization and Sub-tokens: Mid-token splits (e.g., in SPM prompts) induce out-of-distribution contexts for subword-tokenized models, leading to high perplexity or “tokenization bias.” Byte-level decoding using the Byte-Token Representation Lemma recovers the correct distribution and improves pass@1 on SPM infilling by 18 points (Phan et al., 11 Oct 2024).
- Sub-token Elimination: FIM-SE avoids any sub-token prediction by turning character splits into line-aligned masking, with explicit <START> and <END> constraints for incomplete boundary lines (Ren et al., 27 May 2024).
- Syntax Constraints: Ensuring syntactic correctness can be enforced during generation via incremental LCFL parsers and left/right quotienting of grammars. This reduces Python FIM syntax errors from 35% to 10%, nearly matching checked oracle baselines (Melcer et al., 28 Feb 2024).
- Short Middle Spans: Generic FIM fails when the middle region to be infilled is extremely short (e.g., type annotations). Task-tailored FIM objectives, or additional search using a program decomposition tree with fine-tuned “fill-in-the-type” training, are necessary for high-quality type prediction in TypeScript (Cassano et al., 2023).
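A minimal sketch of the line-count truncation heuristic mentioned above (the exact rules vary per evaluation script; this version is an assumption for illustration):

```python
def truncate_to_reference_lines(generated: str, reference_middle: str) -> str:
    """Keep only as many lines of the generation as the reference middle has,
    a common guard against models that over-generate past the infill boundary."""
    n_ref = len(reference_middle.splitlines())
    return "\n".join(generated.splitlines()[:n_ref])
```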
4. Evaluation Methodologies and Benchmarks
Evaluation of FIM models employs a mix of exact-match, unit test (pass@1), or structure-aware metrics. Key recent benchmarks include:
- SAFIM (Syntax-Aware FIM): A suite of 17,720 examples targeting code block, control-flow, and API-call completion, combining AST masking, execution-based correctness (unit tests), and multi-language coverage (Gong et al., 7 Mar 2024).
- HumanEval Infilling: Line- and random-span masking for Python code, with automated function tests.
- Real-FIM-Eval: 30K+ GitHub commit-derived infill tasks, measuring cross-context character-level perplexity (Gong et al., 30 May 2025).
- SEIFER: Protein infilling, with secondary structure recovery and AlphaFold2-based validation (Lee et al., 2023).
- Mathematical Reasoning: NuminaMath-FIM for reasoning chain infill and expansion, with GSM8K/MATH benchmarks (Yan et al., 17 Feb 2025).
Model assessment must specify the prompt convention (PSM/SPM), truncation regime, and task granularity (random-span vs. line-aligned). Best practice increasingly favors line- or AST-aligned masking and loss computation to facilitate standardized, boundary-aware evaluation (Ahmad et al., 24 May 2025, Gong et al., 30 May 2025). A sketch of the standard pass@k estimator appears after the table.
| Benchmark | Domain | Key Metric | Ref. |
|---|---|---|---|
| SAFIM | Code | Pass@1 | (Gong et al., 7 Mar 2024) |
| HumanEval FIM | Code | Pass@1 | (Bavarian et al., 2022) |
| Real-FIM-Eval | Code | Perplexity | (Gong et al., 30 May 2025) |
| SEIFER | Protein | Retrieval@k, pLDDT | (Lee et al., 2023) |
| NuminaMath-FIM | Mathematics | Benchmark accuracy | (Yan et al., 17 Feb 2025) |
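For the unit-test benchmarks above, pass@k is typically computed with the standard unbiased estimator popularized alongside HumanEval; a minimal version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c pass
    the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=4, k=1))  # 0.4
```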
5. Advanced Objectives and Planning in FIM
Vanilla next-token FIM training is myopic: at each decoding step, the model predicts only one token ahead and cannot “plan” when to stop or how to smoothly align left/right context. Two critical directions address this:
- Horizon-Length Prediction (HLP): Adds an auxiliary head to regress the normalized number of tokens remaining in the infill region at each step, transforming FIM into a planning task. HLP eliminates the need for brittle post-processing, increases FIM pass@1 by up to 24% in relative terms, and endows the model with strong boundary-awareness (Ding et al., 4 Oct 2024); a sketch of the auxiliary objective follows this list.
- Bidirectional and Any-Order Generation: The FiLM approach trains models to mask tokens at variable rates using Beta-distributed noise and enables any-order (not just middle) infilling with bidirectional context via a [MASK] token and full self-attention, achieving ROUGE and human preference gains over left-to-right CLMs (Shen et al., 2023).
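To make the HLP idea concrete, below is a minimal PyTorch-style sketch of the auxiliary target and head; the linear head, MSE loss, and 0.1 weighting are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def horizon_targets(middle_len: int) -> torch.Tensor:
    """Normalized tokens-remaining at each middle-span step:
    1.0 at the first middle token, 1/middle_len at the last."""
    remaining = torch.arange(middle_len, 0, -1, dtype=torch.float32)
    return remaining / middle_len

class HorizonHead(nn.Module):
    """Auxiliary regression head over decoder hidden states (assumed linear)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> (batch, seq) horizon predictions
        return self.proj(hidden).squeeze(-1)

def joint_loss(lm_loss: torch.Tensor, pred_horizon: torch.Tensor,
               target_horizon: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """Standard FIM next-token loss plus the horizon regression term."""
    return lm_loss + weight * F.mse_loss(pred_horizon, target_horizon)
```

At inference time, the horizon prediction can serve as a learned stopping signal in place of brittle truncation heuristics.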
6. Applications, Extensions, and Future Directions
FIM methods have permeated numerous LLM application domains:
- Code Completion, Editing, and Refactoring: State-of-the-art models (DeepSeek-Coder, StarCoder2, CodeLlama) universally employ FIM or a variant in pretraining, with AST-FIM recommended for real-world code editing alignment (Guo et al., 25 Jan 2024, Gong et al., 30 May 2025).
- Mathematical Solution Expansion: FIM-based intermediate step expansion in MathFimer boosts mathematical reasoning benchmarks, with negligible computational overhead compared to search or teacher-student expansion (Yan et al., 17 Feb 2025).
- Protein and Type Prediction: Domain-specific FIM (ProtFIM, FIT-fine-tuning) addresses task-unique constraints, such as sub-token-free infilling or brief middle spans (Lee et al., 2023, Cassano et al., 2023).
- Instruction-Following and Contextual Disambiguation: Explicit inclusion of intent comments (IFIM) unlocks the use of FIM in ambiguous or under-specified completion settings (Sun et al., 29 Sep 2025).
Emerging areas of research include automatic curriculum mining for hard FIM cases (Sagtani et al., 21 Dec 2024), integrating static/semantic context (Gong et al., 30 May 2025), parser-coupled and byte-level decoding (Phan et al., 11 Oct 2024), and enforcing tightly constrained code completion in strongly typed or syntactically demanding languages (Melcer et al., 28 Feb 2024). Scaling to 100B+ parameter models and adapting to “wild” instruction/comment sources remain open directions.
7. Limitations, Controversies, and Best Practices
While FIM is now widely adopted, several important caveats are established:
- FIM-pretrained models improve infilling and can even improve conventional left-to-right performance (Gong et al., 7 Mar 2024), but pushing the FIM rate above 90% can slightly degrade base generative quality (Bavarian et al., 2022).
- Out-of-distribution tokenization and decoding boundary ambiguity require careful evaluation and often benefit from explicit boundary constraints (byte-level or line-aligned) (Phan et al., 11 Oct 2024, Ren et al., 27 May 2024).
- FIM alone does not resolve all intent or context ambiguity; expanded variants (instruction-aware, structure-aware, or planning-augmented) yield further accuracy gains.
- Off-the-shelf instruction-tuned models are not inherently FIM-ready; modest fine-tuning on explicit prefix/middle/suffix examples is necessary to teach correct stopping (Ahmad et al., 24 May 2025).
- Evaluation practices must specify post-processing, truncation, and prompt conventions for comparability.
Best-practice recommendations include joint training on multiple FIM prompt formats (PSM/SPM), favoring character- or AST-aligned masking for real-world infill, supplementing with planning objectives (HLP), explicit line or byte boundaries, and structure- or intent-aware augmentation suited to application needs (Bavarian et al., 2022, Yan et al., 17 Feb 2025, Gong et al., 30 May 2025, Ding et al., 4 Oct 2024).