Fill-in-the-Middle (FIM) in Language Models
- Fill-in-the-Middle (FIM) is a data transformation technique that restructures input sequences into prefix, middle, and suffix segments for flexible infilling.
- It is applied in domains like code completion, protein engineering, and mathematical reasoning to improve generative quality and contextual reconstruction.
- FIM employs diverse masking strategies and planning objectives to overcome challenges in tokenization and syntax alignment, enhancing model performance.
Fill-in-the-Middle (FIM) is a family of data transformation techniques and training objectives for LLMs—most notably decoder-only transformers—designed to endow models with the ability to infill arbitrary spans between a prefix and suffix, rather than restricting generation to left-to-right completion. Originally motivated by the requirements of code completion, editing, and revision tools, FIM has become a core paradigm for both pretraining and evaluation of LLMs across natural language, code, mathematics, protein sequences, and more. The essential mechanism is to restructure input data into (prefix, middle, suffix) triplets, training the model to reconstruct the missing middle segment conditioned on observed prefix and suffix. Diverse architectural, data, and algorithmic variants have subsequently emerged to address key challenges in syntax alignment, tokenization boundaries, semantic intent ambiguity, and planning horizon.
1. Core Definitions and Theoretical Foundations
The canonical FIM transformation begins with a sequence $x = (x_1, \dots, x_n)$ and samples two indices $0 \le i \le j \le n$ to produce:
- Prefix: $x_{\text{pre}} = (x_1, \dots, x_i)$
- Middle (masked out during infill): $x_{\text{mid}} = (x_{i+1}, \dots, x_j)$
- Suffix: $x_{\text{suf}} = (x_{j+1}, \dots, x_n)$

The model is presented with $x_{\text{pre}}$ and $x_{\text{suf}}$ (often marked by explicit [PRE] and [SUF] sentinels) alongside a [MID] marker preceding the region to be generated. For standard autoregressive models, the learning objective is

$$\mathcal{L}(\theta) = -\sum_{t=i+1}^{j} \log p_\theta\!\left(x_t \mid x_{\text{pre}},\, x_{\text{suf}},\, x_{i+1}, \dots, x_{t-1}\right),$$

as in (Bavarian et al., 2022, Guo et al., 25 Jan 2024, Lee et al., 2023), among many others.
This generic form can be instantiated in various prompt formats (Prefix–Suffix–Middle (PSM), Suffix–Prefix–Middle (SPM), document- vs. context-level FIM) and is employed across tasks ranging from code and protein design (Lee et al., 2023) and type prediction (Cassano et al., 2023) to mathematical step expansion (Yan et al., 17 Feb 2025). A minimal sketch of the PSM and SPM constructions follows.
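The sketch below illustrates the transformation under the definitions above; the sentinel strings and helper names are illustrative placeholders, not any particular model's special-token vocabulary.

```python
import random

# Illustrative sentinels; real tokenizers define their own special tokens.
PRE, SUF, MID = "[PRE]", "[SUF]", "[MID]"

def split_fim(text: str, rng: random.Random) -> tuple[str, str, str]:
    """Sample character indices i <= j and split text into
    (prefix, middle, suffix). Assumes len(text) >= 1."""
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    return text[:i], text[i:j], text[j:]

def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Prefix-Suffix-Middle: the model is trained to emit `middle` after [MID]."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def to_spm(prefix: str, middle: str, suffix: str) -> str:
    """Suffix-Prefix-Middle: suffix is shown first; the target is still `middle`."""
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"

rng = random.Random(0)
pre, mid, suf = split_fim("def add(a, b):\n    return a + b\n", rng)
print(to_psm(pre, mid, suf))
```

At inference time the prompt simply ends at the [MID] sentinel; the model generates the middle and should terminate with an end-of-infill signal.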
2. Architectural and Data Augmentation Variants
FIM does not require architectural changes beyond special marker tokens for [PRE], [SUF], and [MID]. Key design choices include:
- Data Transformation Rate (FIM rate): Substantial evidence shows that mixing FIM and standard left-to-right pretraining at ratios between 50% and 90% does not harm (and in some cases improves) canonical generative perplexity, while sharply increasing infilling quality (Bavarian et al., 2022, Guo et al., 25 Jan 2024, Sagtani et al., 21 Dec 2024). Training at a 100% FIM rate, however, may hurt left-to-right quality.
- Masking Strategy: Line-, token-, or character-level random splitting yields differing robustness to out-of-distribution spans; character-level slicing is the most robust but may produce mid-token splits (Ren et al., 27 May 2024, Bavarian et al., 2022). A sketch combining FIM-rate mixing with both split granularities follows this list.
- AST-aware Masking: To address the semantic mismatch between randomly masked tokens and syntactic code editing, structure-aware FIM (AST-FIM) selects “middle” spans aligned to AST nodes or contiguous AST child subsequences, preserving parseability and mirroring real-world edits. This yields improvements of up to 5 points on FIM benchmarks (Gong et al., 30 May 2025).
- Protein Sequences: In protein engineering, ProtFIM applies FIM to amino acid residues, demonstrating that sequence-aware FIM surpasses standard AR models for reconstructing masked residues under structural preservation constraints (Lee et al., 2023).
- Mathematics: In mathematical reasoning, MathFimer applies FIM to reasoning chains, masking intermediate steps to generate more detailed solutions, leading to 2–5 pp accuracy gains on GSM8K/MATH (Yan et al., 17 Feb 2025).
- Instruction-Aware FIM: IFIM explicitly incorporates developer-supplied natural language instructions into FIM formatting (“PSIM” mode), substantially boosting intent accuracy (+9–10 pp pass@1) without degrading FIM baseline performance (Sun et al., 29 Sep 2025).
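As referenced in the list above, the following is a minimal sketch of FIM-rate mixing with either split granularity; the 0.5 default rate, the guard for very short documents, and the function names are assumptions for illustration.

```python
import random

def split_chars(text: str, rng: random.Random):
    # Character-level split: robust to arbitrary spans, but may cut mid-token.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    return text[:i], text[i:j], text[j:]

def split_lines(text: str, rng: random.Random):
    # Line-level split: boundaries align with newlines, avoiding mid-token cuts.
    lines = text.splitlines(keepends=True)
    i, j = sorted(rng.sample(range(len(lines) + 1), 2))
    return "".join(lines[:i]), "".join(lines[i:j]), "".join(lines[j:])

def transform(doc: str, fim_rate: float = 0.5, char_level: bool = True,
              rng: random.Random | None = None) -> str:
    """With probability `fim_rate`, rewrite `doc` into PSM format;
    otherwise keep it as a plain left-to-right training example."""
    rng = rng or random.Random()
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc
    split = split_chars if char_level else split_lines
    pre, mid, suf = split(doc, rng)
    return f"[PRE]{pre}[SUF]{suf}[MID]{mid}"
```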
| Variant | Main Motivation | Empirical Gain Example |
|---|---|---|
| Random-char FIM | General infilling ability | Strong for line-masked, weaker for random spans (Bavarian et al., 2022) |
| AST-FIM | Syntactically valid completions | +4–7 pp SAFIM pass@1 (Gong et al., 30 May 2025) |
| FIM-SE | Avoid sub-token splits in char-level infill | +4–11 pp HumanEval random-span (Ren et al., 27 May 2024) |
| IFIM | Instruction clarity for intent | +9–10 pp pass@1 instruction infilling (Sun et al., 29 Sep 2025) |
3. Challenges, Post-Processing, and Constrained Decoding
Although FIM models can generate the required middle segments, several practical challenges arise:
- Decoding Boundaries: Many pretrained and instruction-tuned models “bleed” prefix/suffix or over-generate, failing to stop at the true infill boundary. Automatic evaluation scripts thus often rely on dataset-specific truncation or post-processing heuristics, e.g., matching the number of output lines to the reference (Ahmad et al., 24 May 2025); a minimal sketch of such a heuristic follows this list.
- Tokenization and Sub-tokens: Mid-token splits (e.g., in SPM prompts) induce out-of-distribution contexts for subword-tokenized models, leading to high perplexity or “tokenization bias.” Byte-level decoding using the Byte-Token Representation Lemma recovers the correct distribution and improves pass@1 on SPM infilling by 18 points (Phan et al., 11 Oct 2024).
- Sub-token Elimination: FIM-SE avoids any sub-token prediction by turning character splits into line-aligned masking, with explicit <START> and <END> constraints for incomplete boundary lines (Ren et al., 27 May 2024).
- Syntax Constraints: Ensuring syntactic correctness can be enforced during generation via incremental LCFL parsers and left/right quotienting of grammars. This reduces Python FIM syntax errors from 35% to 10%, nearly matching checked oracle baselines (Melcer et al., 28 Feb 2024).
- Short Middle Spans: Generic FIM fails when the middle region to be infilled is extremely short (e.g., type annotations). Task-tailored FIM objectives, or additional search using a program decomposition tree with fine-tuned “fill-in-the-type” training, are necessary for high-quality type prediction in TypeScript (Cassano et al., 2023).
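A minimal sketch of the line-count truncation heuristic mentioned above (the exact rules vary per evaluation script; this version is an assumption for illustration):

```python
def truncate_to_reference_lines(generated: str, reference_middle: str) -> str:
    """Keep only as many lines of the generation as the reference middle has,
    a common guard against models that over-generate past the infill boundary."""
    n_ref = len(reference_middle.splitlines())
    return "\n".join(generated.splitlines()[:n_ref])
```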
4. Evaluation Methodologies and Benchmarks
Evaluation of FIM models employs a mix of exact-match, unit test (pass@1), or structure-aware metrics. Key recent benchmarks include:
- SAFIM (Syntax-Aware FIM): A suite of 17,720 examples targeting code block, control-flow, and API-call completion, combining AST masking, execution-based correctness (unit tests), and multi-language coverage (Gong et al., 7 Mar 2024).
- HumanEval Infilling: Line- and random-span masking for Python code, with automated function tests.
- Real-FIM-Eval: 30K+ GitHub commit-derived infill tasks, measuring cross-context character-level perplexity (Gong et al., 30 May 2025).
- SEIFER: Protein infilling, with secondary structure recovery and AlphaFold2-based validation (Lee et al., 2023).
- Mathematical Reasoning: NuminaMath-FIM for reasoning chain infill and expansion, with GSM8K/MATH benchmarks (Yan et al., 17 Feb 2025).
Model assessment must specify the prompt convention (PSM/SPM), truncation regime, and task granularity (random-span vs. line-aligned). Best practice increasingly favors line- or AST-aligned masking and loss computation to facilitate standardized, boundary-aware evaluation (Ahmad et al., 24 May 2025, Gong et al., 30 May 2025). A sketch of the standard pass@k estimator appears after the table.
| Benchmark | Domain | Key Metric | Ref. |
|---|---|---|---|
| SAFIM | Code | Pass@1 | (Gong et al., 7 Mar 2024) |
| HumanEval FIM | Code | Pass@1 | (Bavarian et al., 2022) |
| Real-FIM-Eval | Code | Perplexity | (Gong et al., 30 May 2025) |
| SEIFER | Protein | Retrieval@k, pLDDT | (Lee et al., 2023) |
| NuminaMath-FIM | Mathematics | Benchmark accuracy | (Yan et al., 17 Feb 2025) |
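For the unit-test benchmarks above, pass@k is typically computed with the standard unbiased estimator popularized alongside HumanEval; a minimal version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c pass
    the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=4, k=1))  # 0.4
```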
5. Advanced Objectives and Planning in FIM
Vanilla next-token FIM training is myopic: at each decoding step, the model predicts only one token ahead and cannot “plan” when to stop or how to smoothly align left/right context. Two critical directions address this:
- Horizon-Length Prediction (HLP): Adds an auxiliary head to regress the normalized number of tokens remaining in the infill region at each step, transforming FIM into a planning task. HLP eliminates the need for brittle post-processing, increases FIM pass@1 by up to 24% in relative terms, and endows the model with strong boundary-awareness (Ding et al., 4 Oct 2024); a sketch of the auxiliary objective follows this list.
- Bidirectional and Any-Order Generation: The FiLM approach trains models to mask tokens at variable rates using Beta-distributed noise and enables any-order (not just middle) infilling with bidirectional context via a [MASK] token and full self-attention, achieving ROUGE and human preference gains over left-to-right CLMs (Shen et al., 2023).
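To make the HLP idea concrete, below is a minimal PyTorch-style sketch of the auxiliary target and head; the linear head, MSE loss, and 0.1 weighting are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def horizon_targets(middle_len: int) -> torch.Tensor:
    """Normalized tokens-remaining at each middle-span step:
    1.0 at the first middle token, 1/middle_len at the last."""
    remaining = torch.arange(middle_len, 0, -1, dtype=torch.float32)
    return remaining / middle_len

class HorizonHead(nn.Module):
    """Auxiliary regression head over decoder hidden states (assumed linear)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> (batch, seq) horizon predictions
        return self.proj(hidden).squeeze(-1)

def joint_loss(lm_loss: torch.Tensor, pred_horizon: torch.Tensor,
               target_horizon: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """Standard FIM next-token loss plus the horizon regression term."""
    return lm_loss + weight * F.mse_loss(pred_horizon, target_horizon)
```

At inference time, the horizon prediction can serve as a learned stopping signal in place of brittle truncation heuristics.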
6. Applications, Extensions, and Future Directions
FIM methods have permeated numerous LLM application domains:
- Code Completion, Editing, and Refactoring: State-of-the-art models (DeepSeek-Coder, StarCoder2, CodeLlama) universally employ FIM or a variant in pretraining, with AST-FIM recommended for real-world code editing alignment (Guo et al., 25 Jan 2024, Gong et al., 30 May 2025).
- Mathematical Solution Expansion: FIM-based intermediate step expansion in MathFimer boosts mathematical reasoning benchmarks, with negligible computational overhead compared to search or teacher-student expansion (Yan et al., 17 Feb 2025).
- Protein and Type Prediction: Domain-specific FIM (ProtFIM, FIT-fine-tuning) addresses task-unique constraints, such as sub-token-free infilling or brief middle spans (Lee et al., 2023, Cassano et al., 2023).
- Instruction-Following and Contextual Disambiguation: Explicit inclusion of intent comments (IFIM) unlocks the use of FIM in ambiguous or under-specified completion settings (Sun et al., 29 Sep 2025).
Emerging areas of research include automatic curriculum mining for hard FIM cases (Sagtani et al., 21 Dec 2024), integrating static/semantic context (Gong et al., 30 May 2025), parser-coupled and byte-level decoding (Phan et al., 11 Oct 2024), and enforcing tightly constrained code completion in strongly typed or syntactically demanding languages (Melcer et al., 28 Feb 2024). Scaling to 100B+ parameter models and adapting to “wild” instruction/comment sources remain open directions.
7. Limitations, Controversies, and Best Practices
While FIM is now widely adopted, several important caveats are established:
- FIM-pretrained models improve infilling and can even improve conventional left-to-right performance (Gong et al., 7 Mar 2024), but pushing the FIM rate above 90% can slightly degrade base generative quality (Bavarian et al., 2022).
- Out-of-distribution tokenization and decoding boundary ambiguity require careful evaluation and often benefit from explicit boundary constraints (byte-level or line-aligned) (Phan et al., 11 Oct 2024, Ren et al., 27 May 2024).
- FIM alone does not resolve all intent or context ambiguity; expanded variants (instruction-aware, structure-aware, or planning-augmented) yield further accuracy gains.
- Off-the-shelf instruction-tuned models are not inherently FIM-ready; modest fine-tuning on explicit prefix/middle/suffix examples is necessary to teach correct stopping (Ahmad et al., 24 May 2025).
- Evaluation practices must specify post-processing, truncation, and prompt conventions for comparability.
Best-practice recommendations include joint training on multiple FIM prompt formats (PSM/SPM), favoring character- or AST-aligned masking for real-world infill, supplementing with planning objectives (HLP), explicit line or byte boundaries, and structure- or intent-aware augmentation suited to application needs (Bavarian et al., 2022, Yan et al., 17 Feb 2025, Gong et al., 30 May 2025, Ding et al., 4 Oct 2024).