Fill-in-the-Middle (FIM) Modeling
- Fill-in-the-Middle (FIM) is a sequence modeling paradigm that infills missing segments using both prefix and suffix contexts, enabling robust applications in code completion, document editing, and reasoning.
- The method transforms data with special sentinel tokens and mixed formatting (e.g., PSM/SPM) to effectively condition on both the preceding and succeeding contexts of an omitted span.
- FIM has demonstrated practical improvements across domains, including boosting code reasoning, mathematical chain-of-thought, protein design, and multilingual sequence modeling by significant margins.
Fill-in-the-Middle (FIM) is a sequence modeling paradigm that extends conventional left-to-right (autoregressive) language modeling by enabling models to generate a missing span (“middle”) given both a prefix and a suffix (the “left” and “right” contexts). Originally developed for code and text infilling, FIM now underlies state-of-the-art methods in code completion, code reasoning, mathematical chain-of-thought, type prediction, protein design, and multilingual sequence modeling. Its central mechanism is the reordering of training data and the introduction of special markers to teach models to condition on both sides of an infilling site—an ability essential for robust program repair, document editing, and reasoning with intermediate steps.
1. Canonical FIM Paradigm and Variants
Given a sequence $x = (x_1, \dots, x_T)$, FIM randomly selects two cutpoints $0 < p < p+m < T$, yielding three contiguous spans:
- Prefix: $x_{1:p}$
- Middle: $x_{p+1:p+m}$
- Suffix: $x_{p+m+1:T}$
Training transforms the input sequence (using format variants such as PSM/SPM, i.e., Prefix-Suffix-Middle/Suffix-Prefix-Middle) by concatenating the prefix and suffix with sentinel tokens and asking the model to infill the middle segment:
- Input (PSM): `<pre> prefix <suf> suffix <mid> middle <eoi>`
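The standard loss is autoregressive cross-entropy computed only on the middle tokens, while preserving positional information through sentinels. The model is optimized to maximize the conditional log-likelihood of the middle given both contexts:
$$\sum_{t=1}^{m} \log p_\theta\big(\text{middle}_t \mid \text{prefix},\ \text{suffix},\ \text{middle}_{<t}\big)$$
where $m$ is the length of the middle span.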
In practical training setups, FIM is typically mixed with conventional left-to-right objectives to retain next-token generation abilities (Bavarian et al., 2022, Guo et al., 2024, Gong et al., 30 May 2025).
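The span selection and PSM template described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the sentinel strings, the helper names `split_document` and `to_psm`, and the choice to include `<eoi>` in the supervised target are assumptions made for the example.

```python
import random

# Sentinel strings are illustrative; real tokenizers reserve dedicated token ids.
PRE, SUF, MID, EOI = "<pre>", "<suf>", "<mid>", "<eoi>"


def split_document(doc: str, rng: random.Random) -> tuple[str, str, str]:
    """Pick two character-level cutpoints and return (prefix, middle, suffix)."""
    if len(doc) < 3:
        raise ValueError("need at least 3 characters for three non-empty spans")
    # Two distinct cutpoints in 1..len(doc)-1 keep every span non-empty.
    a, b = sorted(rng.sample(range(1, len(doc)), 2))
    return doc[:a], doc[a:b], doc[b:]


def to_psm(prefix: str, middle: str, suffix: str) -> tuple[str, str]:
    """Return (context, target); the loss would be computed on `target` only."""
    context = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    target = f"{middle}{EOI}"  # assumption: the end marker is also supervised
    return context, target
```

In a real pipeline the context and target would be tokenized and concatenated, with a loss mask that is zero over the context positions and one over the target positions.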
Format Choices and Best Practices
- Transformation Rate: Each document is FIM-transformed with a fixed probability, the FIM rate (typically $0.5$–$0.9$) (Bavarian et al., 2022).
- Format Mixing: Alternating between PSM and SPM templates improves flexibility (Bavarian et al., 2022).
- Span Selection: Character-level random splits maximize transfer across inference settings, outperforming line- or token-level splits (Bavarian et al., 2022).
- Context-Level Transform: Applying FIM after packing maximizes infill performance (Bavarian et al., 2022).
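The transformation rate and format mixing listed above can be combined in a single formatting step, sketched below. The function name `format_example`, the parameters `fim_rate` and `spm_share`, and the exact SPM sentinel arrangement are assumptions for illustration; implementations differ in how they order sentinels in SPM mode.

```python
import random


def format_example(prefix: str, middle: str, suffix: str, rng: random.Random,
                   fim_rate: float = 0.5, spm_share: float = 0.5) -> str:
    """Return a training string: plain left-to-right text or a FIM-formatted one."""
    if rng.random() >= fim_rate:
        # Leave the document untouched to keep ordinary next-token training.
        return prefix + middle + suffix
    if rng.random() < spm_share:
        # SPM: suffix first, then prefix, then the middle to be generated.
        return f"<suf>{suffix}<pre>{prefix}<mid>{middle}<eoi>"
    # PSM: prefix first, then suffix, then the middle.
    return f"<pre>{prefix}<suf>{suffix}<mid>{middle}<eoi>"
```

Setting `fim_rate=0.0` recovers pure left-to-right training, while `spm_share` controls how often the SPM template is chosen among transformed documents.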
2. Methodological Advances and Variants
Structural and Semantic Extensions
- AST-FIM: Masks entire Abstract Syntax Tree subtrees to create syntactically aligned infilling spans, outperforming random character-level masking by 5–7 points on real-world infilling tasks at both 1B and 8B scale (Gong et al., 30 May 2025).
- FIM-SE: Imposes character-level constraints and introduces line-level markers to eliminate sub-token boundary errors, substantially improving random-span and single-line infilling (by 8–12 points) on benchmarks such as HumanEval (Ren et al., 2024).
- Instruction-aware FIM (IFIM): Incorporates explicit developer-provided instructions into FIM prompts via special `<INS>` delimiters, preserving general infilling capacity and dramatically enhancing instruction-following in code completion (Sun et al., 29 Sep 2025).
- Horizon-Length Prediction (HLP): Supplementary loss ensures models explicitly predict the length (horizon) of the middle segment at each generation step. HLP increases alignment with input boundaries, eliminating the need for dataset-specific truncation heuristics and improving both file-level and repository-level FIM by up to 24% relative (Ding et al., 2024).
- Search-and-Replace Infilling (SRI): Replaces FIM’s rigid context assumption with a patch-based editing approach that first recalls the target context (SEARCH phase) and then applies replacements (REPLACE phase), preserving latency and improving context-aware code repair (Zhang et al., 19 Jan 2026).
- Direct Preference Optimization (DPO) with AST granularity: Pairs FIM splits with DPO for fine-grained feedback and curriculum schemes based on code block type and difficulty, yielding consistent gains in pass@1 metrics (Ren et al., 27 Aug 2025).
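To make the idea of syntax-aligned masking concrete, the sketch below masks one whole statement subtree of a Python file using the standard-library `ast` module. It only illustrates the general idea behind AST-FIM and is not the published method, which the source describes as using Tree-sitter across many languages. The function name `ast_aligned_split` is hypothetical, and the code assumes ASCII source, because `ast` reports column offsets in UTF-8 bytes.

```python
import ast
import random


def ast_aligned_split(source: str, rng: random.Random) -> tuple[str, str, str]:
    """Mask one statement subtree and return (prefix, middle, suffix)."""
    tree = ast.parse(source)
    # Candidate spans: every statement node in the tree.
    nodes = [n for n in ast.walk(tree) if isinstance(n, ast.stmt)]
    if not nodes:
        raise ValueError("source contains no statements")
    node = rng.choice(nodes)

    # Character offset at which each line starts (ASCII source assumed).
    line_start = [0]
    for line in source.splitlines(keepends=True):
        line_start.append(line_start[-1] + len(line))

    start = line_start[node.lineno - 1] + node.col_offset
    end = line_start[node.end_lineno - 1] + node.end_col_offset
    return source[:start], source[start:end], source[end:]
```

Unlike a random character-level split, the middle here always covers a complete syntactic unit, so the prefix and suffix never cut through a statement.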
Broad Application Domains
- Mathematical Reasoning: MathFimer applies FIM to chain-of-thought expansion, teaching LLMs to insert missing intermediate steps in mathematical solution chains, reliably improving accuracy on GSM8K, MathInstruct, and MATH datasets by 2–5 points (Yan et al., 17 Feb 2025).
- Protein Design: ProtFIM applies FIM to mask and recover segments in amino-acid sequences, outperforming both standard AR models and even 30× larger models on structure recovery metrics (Lee et al., 2023).
- Type Prediction: FIM fine-tuning for TypeScript/Python type annotation prediction (“fill-in-the-type”) achieves 14.5 points higher type-check success than standard FIM, especially when paired with program decomposition and search (Cassano et al., 2023).
3. Model Architectures and Training Protocols
FIM training, whether on code or text, is architecturally straightforward:
- Base models are standard decoder-only Transformers (e.g., GPT-style), with added sentinels for context separation (Bavarian et al., 2022, Guo et al., 2024, Gong et al., 30 May 2025).
- Structural extensions such as AST-FIM operate entirely at the data pre-processing stage; the model architecture remains unchanged (Gong et al., 30 May 2025).
- Auxiliary heads (e.g., HLP’s linear horizon head) add negligible (<0.01%) parameter overhead and are discarded during inference (Ding et al., 2024).
- FIM is compatible with bidirectional architectures: FiLM enables truly arbitrary-order (non-causal) infilling with global attention, and achieves perplexity and ROUGE scores competitive with autoregressive baselines (Shen et al., 2023).
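As a rough illustration of what an auxiliary horizon head could be trained on, the sketch below computes one plausible regression target: the fraction of the middle that remains after each generated token. The normalization, the L1 loss, and the function names `horizon_targets` and `l1_horizon_loss` are assumptions for this example and are not taken from the source.

```python
def horizon_targets(middle_len: int) -> list[float]:
    """Fraction of the middle still to be generated after each of its tokens."""
    if middle_len <= 0:
        raise ValueError("middle_len must be positive")
    return [(middle_len - t) / middle_len for t in range(1, middle_len + 1)]


def l1_horizon_loss(predicted: list[float], targets: list[float]) -> float:
    """Mean absolute error between predicted and true remaining fractions."""
    if len(predicted) != len(targets):
        raise ValueError("length mismatch")
    return sum(abs(p - y) for p, y in zip(predicted, targets)) / len(targets)
```

In such a setup the predictions would come from a small linear layer on the hidden states, and the loss would be added to the usual next-token loss during training only.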
Optimizer, learning rate schedules, context window, and tokenizer selection are typically held constant between FIM and standard AR pre-training runs to ensure comparability (Bavarian et al., 2022, Guo et al., 2024).
4. Evaluation Methodologies and Benchmarks
FIM evaluation requires benchmarks designed to probe infilling capabilities:
- HumanEval-Infilling, InCoder tasks, MBPP: Sample (prefix, missing line[s], suffix) tuples from real code function bodies. Metrics: pass@1 (unit test execution), line exact match (EM), CodeBLEU (Guo et al., 2024, Bavarian et al., 2022, Gong et al., 30 May 2025).
- SAFIM: Syntax-aware benchmark masking whole AST nodes (blocks, control-flow, API calls). Reports pass@1 (all tests pass), exact match, token F1, and character-level perplexity (Gong et al., 2024).
- RepoEval, CrossCodeEval: Repository-level, multi-file FIM to measure long-horizon and cross-file infilling (Sagtani et al., 2024, Ding et al., 2024).
- MathFimer, SEIFER: Domain-adapted FIM for stepwise mathematical reasoning or secondary-structure-preserving protein infilling (Yan et al., 17 Feb 2025, Lee et al., 2023).
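Two of the text-based metrics named above, exact match and token F1, are simple enough to sketch directly. The whitespace tokenization and the stripping of surrounding whitespace are simplifying assumptions; benchmark harnesses may normalize and tokenize differently.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True when the infilled text equals the reference, ignoring outer whitespace."""
    return prediction.strip() == reference.strip()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Execution-based pass@1 is a different kind of metric: it requires running the completed program against unit tests and cannot be computed from strings alone.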
Post-processing is often required before scoring, because raw outputs frequently contain extraneous code tokens beyond the target span:
- Complete-line truncation for line-based tasks;
- Overlap removal for random spans (ensuring no duplication of context) (Ahmad et al., 24 May 2025, Gong et al., 2024).
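Both post-processing steps can be sketched as small string utilities. The function names and the longest-overlap rule are assumptions for illustration; benchmark-specific harnesses may apply different heuristics.

```python
def truncate_to_lines(completion: str, n_lines: int) -> str:
    """Keep only the first `n_lines` lines of a raw completion."""
    lines = completion.splitlines(keepends=True)
    return "".join(lines[:n_lines])


def remove_suffix_overlap(completion: str, suffix: str) -> str:
    """Drop the longest tail of `completion` that repeats the start of `suffix`."""
    for k in range(min(len(completion), len(suffix)), 0, -1):
        if completion.endswith(suffix[:k]):
            return completion[:-k]
    return completion
```

For example, `remove_suffix_overlap("a + b\n    return c", "\n    return c\n")` returns `"a + b"`. A very short overlap, such as a single shared character, can be coincidental, so a practical version might require a minimum overlap length.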
5. Limitations, Pitfalls, and Remedies
Boundary Unawareness: Vanilla FIM-trained models frequently overrun or underrun the target span boundary, particularly when the middle’s size or exact boundaries are unspecified. Heuristic truncation—by line count or syntax-aware AST truncation—has been the default, but this is unreliable in open-domain or non-dataset-aligned settings, leading to 5–14% relative loss in pass@1 if omitted (Ding et al., 2024, Ahmad et al., 24 May 2025).
Tokenization Bias: Standard FIM models degenerate on mid-token cutpoints due to sub-token fragmentation, resulting in invalid completions and low pass rates (e.g., 45% vs 64% for SPM prompts). Exact byte-level sampling algorithms, which marginalize over all aligned tokenizations, restore correct next-byte distributions and boost pass@1 by as much as 18 points (Phan et al., 2024).
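The fragmentation problem can be reproduced with a toy tokenizer. The vocabulary and the greedy longest-match rule below are assumptions chosen to keep the example small; production tokenizers use learned BPE merges, but the effect is the same: a cutpoint inside a token yields a token sequence that ordinary text never produces.

```python
# Toy vocabulary, purely illustrative.
VOCAB = ["return", "ret", "urn", "r", "e", "t", "u", "n", " ", "x"]


def greedy_tokenize(text: str) -> list[str]:
    """Longest-match-first tokenization over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"cannot tokenize {text[i]!r}")
        tokens.append(match)
        i += len(match)
    return tokens


whole = greedy_tokenize("return x")                           # ['return', ' ', 'x']
split = greedy_tokenize("ret") + greedy_tokenize("urn x")     # ['ret', 'urn', ' ', 'x']
```

A model that has mostly seen `return` as one token assigns low probability to `urn` following `ret`, which is why mid-token prompts degrade unless decoding accounts for alternative tokenizations of the same bytes.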
Context-Only Rigidness: The “optimal context” assumption in FIM—treating context as ground truth—makes FIM unable to correct contextual errors. SRI (Search-and-Replace Infilling) and IFIM (Instruction-aware FIM) generalize the paradigm to context-aware patching and mixed instruction following without degrading infilling (Zhang et al., 19 Jan 2026, Sun et al., 29 Sep 2025).
Scaling and Transfer: Gains from FIM pretraining saturate at moderate model sizes; data quality, pretraining signals, and prompt engineering (including structural and curriculum design) have higher leverage than raw parameter scaling on practical infilling tasks (Gong et al., 2024).
Architectural Compatibility: FIM is maximally effective in architectures with unrestricted self-attention, but adaptation to strict causal models (Code Llama, SantaCoder, StarCoder) is routine via sentinel markers (Bavarian et al., 2022, Guo et al., 2024).
6. Key Results and Quantitative Impact
| Model or Technique | Evaluation | Reported gain | Other Impact |
|---|---|---|---|
| HLP (Horizon-Length) | Repo-level infilling | +24% rel. | Eliminates truncation, improves code reasoning (Ding et al., 2024) |
| AST-FIM | SAFIM/Real-FIM-Eval | +5–7 pts | Generalizes to 100+ languages via Tree-sitter (Gong et al., 30 May 2025) |
| FIM-SE | Random/single-line | +8–12 pts | Addresses sub-token boundary errors in character infilling (Ren et al., 2024) |
| IFIM | HumanEval-infilling | +9–10 pts | Preserves/increases performance without instruction (Sun et al., 29 Sep 2025) |
| MathFimer | GSM8K, MATH | +2–5 pts | Step expansion in math chain-of-thought (Yan et al., 17 Feb 2025) |
| ProtFIM | SEIFER Benchmark | +0.03 R@K | Outperforms 2.7B ProGen2 on structure infilling (Lee et al., 2023) |
| DPO+AST+Curriculum | HumanEval/MBPP/BigCodeBench | +1–2 pts | Granular pair alignment for code infill feedback (Ren et al., 27 Aug 2025) |
7. Future Directions and Theoretical Significance
- Long-Horizon Planning as Objective: Direct supervision of horizon length (e.g., HLP) encourages internalization of not just language syntax, but also end-of-segment planning—this is empirically linked to improved multi-step reasoning (Ding et al., 2024).
- Structural and Context-aware Prompting: Structure-aligned masking (AST, curriculum mining) and explicit instruction injection (IFIM) show that model-agnostic improvements arise from input formatting, not fundamental changes in architecture (Gong et al., 30 May 2025, Sun et al., 29 Sep 2025).
- Generalization to Non-Code Domains: Step-expansion in math and protein domains demonstrates that FIM is a paradigm for nonmonotonic, bidirectional, and intent-aware generation beyond just code infilling (Yan et al., 17 Feb 2025, Lee et al., 2023).
- Integration with Parsing and Formal Methods: Grammar-constrained decoding and right-quotient parsing further constrain FIM completions to ensure syntactic and, eventually, semantic correctness, with negligible inference cost (Melcer et al., 2024).
- Post-processing Redundancy: As models internalize boundary planning (via objectives like HLP) or character-level constraints (FIM-SE), reliance on output truncation and heuristic boundary defenses is expected to diminish (Ding et al., 2024, Ren et al., 2024).
Through progressive data-centric innovations, the FIM paradigm has reshaped generation objectives in code, reasoning, and sequence modeling. It supports a shift from strictly monotonic left-to-right (L2R) generation to context-aware, structure-sensitive completion with robust stopping criteria and task-driven format specialization.