Fill-in-the-Middle (FIM) Capabilities Overview
- Fill-in-the-Middle (FIM) is a method that generates missing text spans using both preceding and following contexts, enhancing applications like code completion.
- Recent strategies employ AST-based masking and curriculum learning to preserve syntactic structure and improve infilling performance in complex data.
- Advances such as constrained decoding, grammar quotienting, and horizon-length prediction significantly reduce syntax errors and bolster dynamic editing.
Fill-in-the-Middle (FIM) refers to the capability of LLMs, particularly in code and structured sequence generation domains, to generate or infill arbitrary contiguous spans given both prefix (left context) and suffix (right context). Unlike traditional left-to-right or causal autoregressive models, FIM-trained models can condition on both preceding and succeeding context, allowing for powerful applications in code completion, editing, and structured content infilling across several domains.
1. Formalization and Core Objectives
Let a sequence x be partitioned into a prefix p, a middle (infill) span m, and a suffix s. The FIM task is defined as inferring m conditioned on p and s, with the training objective of maximizing log P(m | p, s). Special sentinel tokens (e.g., ⟨PRE⟩, ⟨SUF⟩, ⟨MID⟩) are inserted to demarcate these regions. The primary variant, Prefix–Suffix–Middle (PSM), presents the model with prefix and suffix as context and supervises generation of the middle (Bavarian et al., 2022, Sun et al., 29 Sep 2025, Guo et al., 2024, Gong et al., 30 May 2025).
FIM examples are mixed with left-to-right (L2R) autoregressive examples during pretraining, preserving both infilling and generative capabilities (Bavarian et al., 2022).
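The PSM transformation above can be sketched in a few lines. This is a minimal illustration; the sentinel strings are placeholders, not any particular model's actual special tokens:

```python
# Minimal sketch of constructing a PSM-format FIM training example.
# Sentinel strings are illustrative placeholders.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_psm_example(doc: str, start: int, end: int) -> str:
    """Split doc into (prefix, middle, suffix) and serialize as PSM.

    The model conditions on prefix and suffix and is supervised to
    generate the middle after the MID sentinel.
    """
    prefix, middle, suffix = doc[:start], doc[start:end], doc[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

example = make_psm_example("def add(a, b):\n    return a + b\n", 15, 32)
```

In practice, a fraction of pretraining documents are transformed this way while the remainder stay in plain left-to-right order.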
2. Training Strategies and Masking Schemes
Random vs Structure-Aware FIM
Traditionally, FIM “splits” are randomly sampled at the character or token level, yielding robust but sometimes unstructured or semantically broken spans (Bavarian et al., 2022). More recent work proposes AST-FIM, where middle spans are aligned with abstract syntax tree (AST) nodes, preserving complete syntactic units (function, block, or expression subtrees). Masking approaches include:
- Random Span (Rand-FIM): Uniformly samples start/end positions without regard to syntax.
- Single-Node Masking: Samples entire AST subtrees, proportional to span length.
- Aligned-Span Masking: Random interval is expanded to the minimal set of adjacent AST children covering it (Gong et al., 30 May 2025).
AST-FIM yields training examples aligned with real edit operations, improving FIM pass rates by up to 5 points on syntax-aware code infilling benchmarks (Gong et al., 30 May 2025).
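Single-node masking can be sketched with Python's built-in `ast` module. This is a simplification of the scheme described above: subtree selection here is uniform over statement nodes for brevity, whereas the paper weights selection by span length:

```python
# Illustrative sketch of single-node AST masking: pick one statement-level
# AST subtree and mask its exact source span. Uniform selection is a
# simplification; AST-FIM weights subtrees by span length.
import ast
import random

def single_node_mask(source: str, seed: int = 0):
    """Return (prefix, middle, suffix) where middle is one AST subtree."""
    tree = ast.parse(source)
    # Candidate subtrees: statement nodes with recorded source positions.
    nodes = [n for n in ast.walk(tree) if isinstance(n, ast.stmt)]
    node = random.Random(seed).choice(nodes)
    seg = ast.get_source_segment(source, node)
    # Locate the segment (assumes it occurs once, fine for this demo).
    start = source.index(seg)
    return source[:start], seg, source[start + len(seg):]

src = "x = 1\ny = x + 2\nprint(y)\n"
prefix, middle, suffix = single_node_mask(src)
assert prefix + middle + suffix == src  # masking is a clean 3-way split
```

Because the masked span is a complete statement, the resulting training example mirrors a realistic edit operation rather than an arbitrary character range.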
Curriculum and Context Augmentation
Curriculum learning strategies identify and upsample “hard” infill cases, such as symbol-rich or structurally complex spans (measured via the number of unique identifiers or low historical acceptance rate). Context augmentation further prepends semantically relevant code definitions, often extracted via static analysis or code search, to assist infilling (Sagtani et al., 2024).
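A hedged sketch of the upsampling idea: score each FIM example by the number of unique identifiers in its masked span (one of the hardness proxies mentioned above) and duplicate examples above a threshold. The threshold and factor values are illustrative assumptions:

```python
# Sketch of curriculum upsampling: identifier-rich middle spans are
# treated as "hard" and repeated in the training mix. Threshold and
# duplication factor are illustrative, not values from the paper.
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def hardness(middle: str) -> int:
    """Proxy difficulty: count of unique identifiers in the span."""
    return len(set(IDENT.findall(middle)))

def upsample(examples, threshold=3, factor=2):
    """Repeat examples whose middle span is identifier-rich."""
    out = []
    for ex in examples:
        copies = factor if hardness(ex["middle"]) >= threshold else 1
        out.extend([ex] * copies)
    return out
```

A production pipeline would combine this with acceptance-rate statistics and retrieved context, per the strategies above.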
3. Prompting, Decoding, and Post-Processing
Prompt templates encode the FIM task as sequence infilling marked by sentinel tokens. Common formats include:
- PSM (Prefix–Suffix–Middle): ⟨PRE⟩ p ⟨SUF⟩ s ⟨MID⟩ m
- SPM (Suffix–Prefix–Middle): Alternative ordering for left-to-right execution (Bavarian et al., 2022, Gong et al., 2024).
Syntax-aware post-processing is critical for accurately extracting the infilled segment at inference time. Techniques include AST-based truncation, line-based cutoffs, and handling cases where models overshoot/undershoot segment boundaries (Gong et al., 2024, Ahmad et al., 24 May 2025). Supervised FIM-specific fine-tuning substantially reduces the necessity of post-processing for line-aligned spans but not for arbitrary token spans (Ahmad et al., 24 May 2025).
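The prompt formats and a simple line-based cutoff can be sketched as follows. The sentinel strings are placeholders, and the truncation heuristic shown (stop when the model regenerates the first suffix line) is only one of the strategies discussed above:

```python
# Sketch of PSM/SPM prompt assembly plus a line-based cutoff for
# extracting the infilled middle. Sentinels and the exact SPM ordering
# are illustrative assumptions.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def psm_prompt(prefix: str, suffix: str) -> str:
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"

def spm_prompt(prefix: str, suffix: str) -> str:
    # One common SPM arrangement: suffix first, then prefix.
    return f"{SUF}{suffix}{PRE}{prefix}{MID}"

def truncate_middle(generated: str, suffix: str) -> str:
    """Cut generation at the first line duplicating the suffix's start."""
    first_suffix_line = suffix.lstrip("\n").split("\n", 1)[0]
    kept = []
    for line in generated.split("\n"):
        if first_suffix_line and line.strip() == first_suffix_line.strip():
            break
        kept.append(line)
    return "\n".join(kept)
```

AST-based truncation replaces the string comparison with a parse-validity check, at higher implementation cost.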
4. Architectural, Algorithmic, and Decoding Advances
Constrained Decoding with Grammar Quotienting
To guarantee syntactic validity, efficient constrained decoding is achieved by quotienting context-free (and sensitive) grammars:
- Left/Right Quotients: Given left context l and right context r over a language L, form the quotient l\L/r, the set of middle strings m such that l·m·r ∈ L.
- Incremental Parsing: An Earley-style recognizer is extended to efficiently check if the middle span, combined with prefix and suffix, admits a valid parse (Melcer et al., 2024).
- Context-sensitive quotienting: Handles indentation (Python), leftmost-longest lexing, and parenthesis-checking, allowing interactive FIM decoding with O(1) amortized complexity per token for regular grammars.
Constrained FIM decoding raised the pass rate on Python code from 65.0% (unconstrained) to 89.5%, while guaranteeing the absence of syntax errors in the output (Melcer et al., 2024).
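A toy stand-in for the quotient machinery illustrates the decoding loop: at each step, keep only candidate tokens for which the prefix, middle-so-far, and suffix can still be completed into a valid string. Balanced parentheses here replace the full Earley/grammar-quotient apparatus, which this sketch does not attempt to reproduce:

```python
# Toy constrained FIM decoding over the balanced-parentheses language,
# standing in for the grammar-quotient/Earley machinery in the text.
def viable(text: str) -> bool:
    """True if text is a prefix of some balanced-parenthesis string."""
    depth = 0
    for ch in text:
        depth += {"(": 1, ")": -1}.get(ch, 0)
        if depth < 0:
            return False
    return True

def can_complete(head: str, suffix: str) -> bool:
    """Can head + (some middle continuation) + suffix be balanced?"""
    if not viable(head):
        return False
    depth = need = 0
    for ch in suffix:
        depth += {"(": 1, ")": -1}.get(ch, 0)
        need = min(need, depth)
    required = -depth  # entry depth at which the full string balances
    return required >= 0 and required >= -need

def filter_tokens(prefix, middle, suffix, candidates):
    """Keep only next tokens that leave a valid completion reachable."""
    return [t for t in candidates
            if can_complete(prefix + middle + t, suffix)]
```

A real implementation checks viability against the quotient grammar incrementally instead of rescanning the whole string per token.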
Inferential Planning: Horizon-Length Prediction
Horizon-Length Prediction (HLP) introduces a lookahead planning head that estimates the number of tokens left to infill. The auxiliary regression loss over the normalized remaining length h_t = (T − t)/T, where T is the length of the middle span and t the current position, imbues the model with explicit infilling horizon-awareness, yielding up to 24% relative gains on real-world evaluation without rule-based output truncation (Ding et al., 2024).
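The regression targets and a plain mean-squared-error loss can be sketched directly from that definition; the actual prediction-head architecture is not reproduced here:

```python
# Sketch of Horizon-Length Prediction targets: at each position t of an
# infill span of length T, the auxiliary head regresses the normalized
# remaining length (T - t) / T. MSE is shown as the regression loss.
def hlp_targets(middle_len: int):
    """Normalized remaining length at each middle position."""
    T = middle_len
    return [(T - t) / T for t in range(T)]

def hlp_loss(predictions, targets):
    assert len(predictions) == len(targets)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)

targets = hlp_targets(4)  # [1.0, 0.75, 0.5, 0.25]
```

The target decays linearly to zero as the model approaches the suffix, giving it an explicit signal for when to stop.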
Tokenization and Byte-Level Decoding
Standard tokenized models exhibit sharp degradation (“tokenization bias”) when FIM prefixes end in mid-token, causing invalid tokenizations. Phan et al. introduce exact byte-level marginalization—zero-shot, inference-time conversion of tokenized LMs to byte-level decoders—eliminating this bias and yielding an 18.9 percentage point gain in pass rate for SPM prompts, especially on mid-token-aligned FIM settings (Phan et al., 2024).
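The bias is easy to demonstrate with a toy greedy longest-match tokenizer over an invented vocabulary (the vocabulary and tokenizer below are illustrative assumptions, not a real model's):

```python
# Toy demonstration of tokenization bias: a FIM prefix that cuts through
# what training data would encode as a single token yields piece
# sequences the model rarely saw. Vocabulary is invented for the demo.
VOCAB = ["print", "pri", "nt", "p", "r", "i", "n", "t"]

def greedy_tokenize(text: str):
    tokens, i = [], 0
    while i < len(text):
        # Longest vocabulary entry matching at position i (demo assumes
        # single characters always match, so no fallback is needed).
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

# Training text encodes "print" as one token, so the pieces below
# almost never occur in training contexts:
assert greedy_tokenize("print") == ["print"]
assert greedy_tokenize("prin") == ["pri", "n"]
```

Byte-level marginalization sidesteps this by scoring continuations at the byte level, summing over all tokenizations consistent with the given bytes.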
5. Alignment, Instruction-Tuning, and Extensions
Instruction-Aware FIM
Vanilla FIM training fails to leverage inline developer intent (e.g., comments or instructions). Standard instruction tuning, prevalent in chat-style LLMs, improves instruction following but degrades infilling accuracy by up to 10–20 points. The IFIM protocol injects an explicit instruction section i so that the model learns on (p, s, i, m) quadruples, maintaining infilling performance while recovering instruction sensitivity. With 100% IFIM data, pass@1 rises from 84.6% (baseline) to 93.6% with instructions, with no regression on zero-instruction tasks (Sun et al., 29 Sep 2025).
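An instruction-augmented training instance can be serialized analogously to PSM; the sentinel names and their ordering below are assumptions for illustration, not the protocol's actual token set:

```python
# Sketch of an IFIM-style quadruple (prefix, suffix, instruction,
# middle) serialized with placeholder sentinels. Sentinel names and
# ordering are illustrative assumptions.
PRE, SUF, INST, MID = ("<fim_prefix>", "<fim_suffix>",
                       "<fim_instruction>", "<fim_middle>")

def ifim_example(prefix: str, suffix: str,
                 instruction: str, middle: str) -> str:
    return f"{PRE}{prefix}{SUF}{suffix}{INST}{instruction}{MID}{middle}"

ex = ifim_example("def sort(xs):\n", "\n", "use insertion sort",
                  "    for i in range(1, len(xs)): ...")
```

At inference time the instruction slot can be left empty, which is why zero-instruction performance need not regress.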
Editing Paradigms: Search-and-Replace Infilling
FIM treats the surrounding context as immutable and cannot correct an erroneous prefix or suffix. The Search-and-Replace Infilling (SRI) framework models explicit identification (search) and targeted replacement (replace) in one pass, combining edit-centric capability with FIM's low-latency workflow (Zhang et al., 19 Jan 2026). SRI dramatically reduces adversarial code injection vulnerabilities (>80–90% reduction), makes completion robust to noisy context (exact match: +33 points), and preserves general code-generation competency without added latency.
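The edit application side of such a framework can be sketched as follows; how the (search, replace) pair is decoded and applied here is an assumption for illustration, not SRI's actual specification:

```python
# Minimal sketch of applying a search-and-replace edit: the model emits
# a (search, replace) pair instead of a raw middle, so erroneous
# context can be corrected rather than only extended. First-occurrence
# matching is an illustrative assumption.
def apply_sri_edit(document: str, search: str, replace: str) -> str:
    """Replace the first exact occurrence of `search` with `replace`."""
    idx = document.find(search)
    if idx == -1:
        raise ValueError("search span not found in document")
    return document[:idx] + replace + document[idx + len(search):]

code = "def area(r):\n    return 3.14 * r\n"
fixed = apply_sri_edit(code, "3.14 * r", "3.14159 * r * r")
```

Note that the edited span may lie inside the original prefix or suffix, which plain FIM cannot touch.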
FIM in New Domains: Protein and Mathematical Sequences
- Protein Design: ProtFIM, trained with FIM transformations on protein sequences, yields improved Recovery@5 from 0.73 to 0.78 over non-FIM LMs for secondary structure-preserving infilling (Lee et al., 2023).
- Mathematical Reasoning: MathFimer applies FIM to insert intermediate CoT steps, improving GSM8K accuracy from 67.78% to 75.21% (Meta-Llama-3.1-8B) and supports multi-round expansion without reliance on stronger external generators (Yan et al., 17 Feb 2025).
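The MathFimer-style data construction can be sketched as masking one intermediate step of a chain-of-thought solution; this is an illustrative simplification of the method, not its exact pipeline:

```python
# Sketch of step-level FIM for chain-of-thought data: mask one
# intermediate step so the model learns to infill it from the
# surrounding steps. An illustrative simplification of MathFimer.
def make_step_fim(steps, masked_index):
    prefix = "\n".join(steps[:masked_index])
    middle = steps[masked_index]
    suffix = "\n".join(steps[masked_index + 1:])
    return prefix, middle, suffix

steps = ["x + 3 = 7", "x = 7 - 3", "x = 4"]
prefix, middle, suffix = make_step_fim(steps, 1)
```

Repeated application over a solution supports the multi-round expansion mentioned above.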
6. Evaluation, Benchmarks, and Empirical Findings
Comprehensive FIM evaluation requires syntax-aware and functional correctness benchmarks. SAFIM (Gong et al., 2024) defines three AST-based categories (Algorithmic block, Control-flow, API-call), spanning 17,720 multilingual examples. AST-FIM (Gong et al., 30 May 2025) and curriculum/context-based FIM (Sagtani et al., 2024) demonstrate that structural masking and context retrieval improve pass@1 by several points (up to 5 on SAFIM splits), with curriculum/context-augmented fine-tuning yielding especially strong gains for small models in low-latency scenarios.
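The pass@1 figures reported here are conventionally computed with the standard unbiased pass@k estimator (from the HumanEval evaluation methodology, assumed rather than stated in the sources above): with n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k).

```python
# Standard unbiased pass@k estimator over n samples with c passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer than k failures: some sample always passes
    return 1.0 - comb(n - c, k) / comb(n, k)

assert abs(pass_at_k(10, 5, 1) - 0.5) < 1e-9
```

Averaging this quantity over all benchmark problems gives the pass@1 numbers tabulated below.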
Empirical highlights (pass@1 unless otherwise noted):
| Model / Method | SAFIM-Alg | HumanEval-FIM | Multiline (SWE-bench) |
|---|---|---|---|
| Rand-FIM-1B | 28.2 | ||
| AST-FIM-1B | 33.5 | ||
| StarCoder2-7B + curriculum | 13.4 | ||
| DeepSeek-1B + curriculum | 8.5 | ||
| Chat LLMs (GPT-4o, Claude) | 19–23 | ||
| SRI-Coder-32B (editing) | 61.6 |
Syntax- and context-aware decoding, with output boundary supervision and targeted fine-tuning, consistently yields improvements across code and mathematical FIM tasks (Melcer et al., 2024, Ahmad et al., 24 May 2025, Gong et al., 30 May 2025, Sagtani et al., 2024).
7. Limitations and Future Directions
Current FIM approaches exhibit several limitations:
- Models not specifically tuned with FIM data or horizon prediction fail to respect segment boundaries or context, requiring post-processing, especially for random token splits.
- Some FIM decoding frameworks rely on EBNF grammars, which may not fully capture PEG or tab/space semantics (Melcer et al., 2024).
- AST-FIM’s performance depends on the availability and accuracy of AST parses.
Directions for further research include:
- Integration of richer static analysis (e.g., type systems, undefined variable checks) via quotient grammars (Melcer et al., 2024).
- Expansion of instruction-augmented and editing-centric FIM to cross-lingual, multi-domain, and large-model settings (Sun et al., 29 Sep 2025, Zhang et al., 19 Jan 2026).
- Broader benchmarks in real-world, multi-file codebases and stepwise reasoning domains (Yan et al., 17 Feb 2025, Gong et al., 30 May 2025).
- Guaranteeing output boundaries for arbitrary (non-line) spans via hybrid planning/decoding objectives (Ding et al., 2024, Ahmad et al., 24 May 2025).
Fill-in-the-Middle is now the methodological standard for robust, context-aware sequence generation in code and beyond, with ongoing advances unifying structural, instructional, and editing capabilities across parameter scales and domains.