Fill-In-the-Middle Objective
- Fill-In-the-Middle (FIM) is a span-based generation objective where models reconstruct a masked segment using both preceding and succeeding context for bidirectional conditioning.
- It employs specialized delimiters and sampling strategies to infill variable-length spans in applications such as code completion, mathematical reasoning, and protein sequence design.
- Empirical results demonstrate FIM’s ability to boost infilling performance, improve syntax accuracy, and enhance reasoning across diverse domains.
Fill-In-the-Middle (FIM) is a span-based generation objective for LLMs that extends standard left-to-right (autoregressive) pretraining by teaching the model to reconstruct an arbitrary masked span (the “middle”) given both the preceding (“prefix”) and succeeding (“suffix”) context. Originally motivated by the demands of program synthesis and code completion, FIM-style objectives have enabled a new class of language and code models with strong bidirectional infilling capabilities, robust context conditioning, and practical advantages in tasks ranging from code completion and block/sentence infilling to step expansion in mathematical reasoning and protein sequence design.
1. Formal Definition and Mathematical Objective
The canonical FIM objective decomposes a sequence $x$ into three contiguous segments: prefix $P$, infill (mask/middle) $M$, and suffix $S$. The model is explicitly trained to reconstruct $M$ given both $P$ and $S$:

$$\mathcal{L}_{\text{FIM}} = -\,\mathbb{E}_{x \sim \mathcal{D}} \sum_{t=1}^{|M|} \log p_\theta\!\left(m_t \mid P,\, S,\, m_{<t}\right),$$

where $\mathcal{D}$ is the training distribution (code files, mathematical solution steps, or biological sequences) and $m_t$ is the $t$-th token in $M$ (Sagtani et al., 2024, Gong et al., 2024, Bavarian et al., 2022, Lee et al., 2023, Phan et al., 2024).
In code-oriented models, FIM examples are often constructed by sampling random or structure-aligned spans, then presenting the model with prompts in formats such as Prefix–Suffix–Middle (PSM) or Suffix–Prefix–Middle (SPM), using newly introduced delimiters or sentinel tokens (e.g., fim_start, fim_hole, fim_end) (Guo et al., 2024). The loss is always computed only over the infill span.
The FIM setting belongs to the family of span corruption/infill objectives, generalizing left-to-right (which only conditions on past context), and subsuming classical token-level Masked Language Modeling (MLM), span-infilling for denoising (T5, BART), and permutation-based models (Shen et al., 2023).
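The PSM/SPM serializations described above are simple to implement. The sketch below is illustrative only: the sentinel strings and the random character-level split are stand-ins for model-specific special tokens and span-sampling policies.

```python
import random

# Illustrative sentinel strings; real models reserve dedicated special tokens
# in the vocabulary for these delimiters rather than plain text like this.
PRE, SUF, MID = "<FIM_PRE>", "<FIM_SUF>", "<FIM_MID>"

def to_fim_example(doc: str, mode: str = "PSM", rng=random) -> str:
    """Split a document at two random character boundaries into
    (prefix, middle, suffix) and serialize it in PSM or SPM order."""
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    if mode == "PSM":   # Prefix-Suffix-Middle
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    if mode == "SPM":   # Suffix-Prefix-Middle
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"
    raise ValueError(f"unknown mode: {mode}")
```

During training, the ordinary next-token loss is applied to this serialized string (masked so that only the tokens after the middle sentinel contribute), which is what teaches the model to emit the middle conditioned on both contexts.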
2. Motivation and Core Properties
Standard autoregressive (next-token) LLMs learn only to predict the next token given the preceding context, which biases them towards rightward generation and makes them inherently left-contextual. This unidirectionality limits their capacity to produce coherent infillings for arbitrary gaps within a sequence unless architectural or pretraining changes are made.
FIM modeling directly addresses several critical requirements:
- Bidirectional Conditioning: FIM allows tokens in the infilled region to attend to both previous and future context, leading to more semantically consistent and structurally valid infills (Gong et al., 2024, Sagtani et al., 2024).
- Span Generality: By supporting reconstruction of arbitrary-length masked spans (instead of only single-token or rightmost continuation), FIM empowers models to handle variable-length completions and structural edits (code blocks, mathematical steps, sequence segments) (Gong et al., 30 May 2025, Lee et al., 2023).
- Data Augmentation and Robustness: FIM objectives do not require architectural changes; rather, the dataset is transformed or augmented so that models see a mix of standard next-token and FIM-formatted sequences, achieving “FIM-for-free” bidirectionality with no perceptible loss in autoregressive performance (Bavarian et al., 2022, Guo et al., 2024).
In mathematical reasoning, FIM uniquely enables “step expansion”: interpolating more granular intermediate steps into a human-verified chain-of-thought (CoT) without requiring outputs to be regenerated from scratch, thereby enriching solution quality and improving downstream task performance (Yan et al., 17 Feb 2025).
3. Construction of FIM Training Data
Three main design choices govern practical FIM data construction:
- Span Selection: Random contiguous spans may be sampled at the character, token, or line/block level. Random character-level splits yield generalization across arbitrary substrings, but structure-aware masking—e.g., masking spans aligned to Abstract Syntax Tree (AST) subtrees (Gong et al., 30 May 2025), or homologous protein regions (Lee et al., 2023)—can better match real-world editing operations and improve infilling accuracy.
- Prompt Formatting: FIM relies on special delimiters, with popular formats including:
- PSM (Prefix–Suffix–Middle): `⟨PRE⟩ prefix ⟨SUF⟩ suffix ⟨MID⟩ middle`
- SPM (Suffix–Prefix–Middle): `⟨SUF⟩ suffix ⟨PRE⟩ prefix ⟨MID⟩ middle`
- Context addition: Prepended semantic or repository context may be included, especially in code (e.g., C₁…Cₖ before prefix) (Sagtani et al., 2024).
- Mixing Ratios: Typically, 50–90% of the dataset is processed into FIM format (using the above transformations), with the remainder used for standard left-to-right next-token prediction. This preserves left-to-right capabilities (“for free”) and achieves optimal infilling performance without trade-off (Bavarian et al., 2022, Guo et al., 2024).
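At the dataset level, the mixing ratio amounts to a per-document coin flip applied during preprocessing. A minimal sketch, with a caller-supplied FIM transformation since the exact serialization varies by model:

```python
import random

def mix_fim(docs, to_fim, fim_rate=0.5, seed=0):
    """Apply the FIM transformation `to_fim` to roughly `fim_rate` of the
    documents; the remainder stay as plain left-to-right sequences."""
    rng = random.Random(seed)
    return [to_fim(d) if rng.random() < fim_rate else d for d in docs]
```

Setting `fim_rate` in the 0.5–0.9 range reproduces the mix described above while leaving enough untouched documents to preserve left-to-right capability.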
4. Extensions, Variants, and Benchmarking
4.1 Structured- and Context-Aware FIM
- AST-FIM: Masks are chosen to align with AST subtrees, ensuring that the infill spans maintain code structure and reflect authentic developer editing patterns; AST-FIM outperforms random-span FIM by up to 5 points on code benchmarks (Gong et al., 30 May 2025).
- Context Augmentation: Augmenting FIM prompts with symbol definitions or repository-level background via static analysis tools (e.g., TypeScript compiler, Tree-sitter) improves infilling, especially on difficult cross-file dependencies and for small models (Sagtani et al., 2024).
- Curriculum FIM: By oversampling difficult code structures (e.g., function/class definitions, call expressions, control flow), curriculum-based span sampling can target model weaknesses and yields the largest gains in low-compute regimes (Sagtani et al., 2024).
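As an illustration of structure-aligned masking, Python's standard `ast` module already exposes the node offsets needed to make the masked span coincide with a whole subtree. This is a toy version of the idea, not the sampling procedure of AST-FIM itself:

```python
import ast
import random

def ast_aligned_span(source: str, rng=random):
    """Pick a random AST node with known source offsets and return the
    (prefix, middle, suffix) split whose middle is exactly that node."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree)
             if hasattr(n, "lineno") and hasattr(n, "end_col_offset")]
    node = rng.choice(nodes)
    # Convert (line, column) positions to flat string offsets.
    # Note: ast column offsets are UTF-8 byte offsets, which coincide
    # with character offsets for ASCII sources like this sketch assumes.
    lines = source.splitlines(keepends=True)
    start = sum(len(l) for l in lines[:node.lineno - 1]) + node.col_offset
    end = sum(len(l) for l in lines[:node.end_lineno - 1]) + node.end_col_offset
    return source[:start], source[start:end], source[end:]
```

Because the middle is a complete subtree, the resulting training examples resemble real edits (replacing a statement, expression, or function body) rather than arbitrary substrings.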
4.2 Task-Specific Adaptations
- Instruction-aware FIM (IFIM): Extends FIM to support an explicit natural language instruction segment between prefix and middle, enabling the model to condition infilling on developer intent while preserving standard FIM capabilities (Sun et al., 29 Sep 2025).
- Horizon-Length Prediction (HLP): Augments standard FIM with an auxiliary regression task that at each infill step estimates the normalized distance to the reconnection point (i.e., number of tokens left before suffix), addressing the lack of lookahead and improving both output boundary accuracy and success rates in open-domain code completion (Ding et al., 2024).
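One plausible formulation of the HLP regression target (the exact normalization in (Ding et al., 2024) may differ) simply counts, at each infill position, the fraction of the middle still to be generated:

```python
def hlp_targets(middle_tokens):
    """For each position t in the infill span, the normalized count of
    tokens still to be generated before reconnecting with the suffix."""
    n = len(middle_tokens)
    return [(n - t - 1) / n for t in range(n)]
```

The target decays linearly to zero at the last middle token, giving the model an explicit signal for where the infill should reconnect with the suffix.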
4.3 Benchmarks
- SAFIM: Syntax-Aware FIM with ∼18K examples across languages, measuring algorithmic, control-flow, and API-aware block completion; establishes that FIM-pretrained models achieve higher infilling and also improved left-to-right performance (Gong et al., 2024).
- Real-FIM-Eval: >30K real-world edits from GitHub commits across 12 languages, evaluating both “add” (insertion) and “edit” (replace) forms of FIM; AST-FIM achieves best results (Gong et al., 30 May 2025).
- SEIFER: Middle-span protein editing with structure-aware constraints; ProtFIM (FIM-trained) surpasses left-to-right and permutation-based baselines (Lee et al., 2023).
5. Empirical Impact and Model Comparisons
FIM training confers substantial empirical benefits:
- Mathematical Reasoning: MathFimer-style FIM augmentation increases GSM8K accuracy by +5.61 points, MATH by +3.52, and yields up to +15.54 on repeated expansion (Yan et al., 17 Feb 2025).
- Code Completion: On HumanEval infilling and multi-language single-line FIM, DeepSeek-Coder and similar FIM models consistently outperform left-to-right baselines, with mean accuracies of 70%–78% (Python/Java/JS), exceeding comparably sized causal models (Guo et al., 2024). FIM pretraining universally improves both infilling and left-to-right code generation benchmarks (Gong et al., 2024).
- Protein Design: ProtFIM achieves retrieval@5 = 0.73 versus 0.70 for best left-to-right baselines, and superior property prediction on the FLIP benchmark (Lee et al., 2023).
- Tokenization Correction: Byte-level FIM decoding eliminates tokenization bias at span boundaries, raising SPM random-span FIM pass@1 from 45.0% (tokenized) to 63.9% (byte-corrected), a ≈19-point absolute gain (Phan et al., 2024).
- Syntactic Validity: Constrained FIM decoding via quotient parsing halves syntax error rates for Python FIM code tasks, with 89.5% syntax-valid completions versus 65% unconstrained (Melcer et al., 2024).
Ablations indicate that most FIM advantages cannot be recovered by fine-tuning standard LMs on infilling tasks alone; pretraining or retraining with FIM objectives is crucial (Bavarian et al., 2022). Further, FIM's empirical gains are robust to model size but especially critical for small, latency-sensitive models (e.g., 1–2B parameters) (Sagtani et al., 2024).
6. Limitations, Practical Considerations, and Future Directions
Despite broad success, FIM models have open challenges:
- Boundary Awareness: Next-token objectives cannot by themselves teach models to align infill boundaries precisely, often leading to overwriting or duplication of suffix context. Solutions include post-processing (e.g., syntax-aware truncation (Gong et al., 2024)), horizon-length prediction architectures (Ding et al., 2024), or constrained decoding (Melcer et al., 2024).
- Tokenization Effects: Conventional byte-pair or subword tokenization leads to bias near span boundaries (“tokenization bias”); byte-level sampling resolves but incurs compute overhead (Phan et al., 2024).
- Instruction Integration: Naive insertion of NL instructions as code comments degrades infilling capability; explicit instruction channels (IFIM) are necessary to reconcile instruction-following with accurate FIM (Sun et al., 29 Sep 2025).
- Generalization and Transfer: While FIM does not reduce next-token generation performance, its absolute effectiveness depends on span-choice alignment (random vs. block- or function-aware) and model exposure to real edit distributions (e.g., via AST or commit benchmarks) (Gong et al., 30 May 2025, Gong et al., 2024).
- Evaluation Complexity: Strict evaluation necessitates bespoke post-processing for each language and span type, as free-form model outputs may include extraneous line breaks, comments, or prefix/suffix duplicates (Ahmad et al., 24 May 2025).
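As a crude illustration of boundary-aware post-processing (much weaker than the syntax-aware truncation or constrained decoding cited above), one can trim a generated infill at the first point where it begins restating the suffix; the overlap length here is an arbitrary illustrative threshold:

```python
def truncate_at_suffix(generated: str, suffix: str, min_overlap: int = 8) -> str:
    """Heuristic: if the generated infill contains the beginning of the
    suffix, cut it off there to avoid duplicating suffix context."""
    probe = suffix[:min_overlap]
    if probe and probe in generated:
        return generated[:generated.index(probe)]
    return generated
```

The heuristic can misfire when the middle legitimately contains text resembling the suffix, which is precisely why syntax-aware or learned boundary mechanisms are preferred in practice.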
Emerging directions include combinatorial FIM objectives (multi-span or recursive infilling), explicit modeling of output horizon (HLP), and grammar-constrained decoding spanning not just syntax but also semantic properties. There is also ongoing work on unifying FIM with any-order generation and scalable variable-rate masking for general-purpose infilling (Shen et al., 2023). The technique is being extended to new domains—biosequence editing, instructional code infilling, mathematical proof refinement—and is central to the next generation of tool-oriented, context-sensitive LLMs.