Fill-In-the-Middle (FIM) Modeling
- Fill-In-the-Middle (FIM) is a neural language modeling paradigm that predicts missing interior spans using both prefix and suffix contexts across diverse applications.
- It leverages specialized tokenization, AST-based masking, and KV-cache optimizations to enhance performance and efficiency in code completion and document editing.
- FIM integrates instruction-aware techniques and boundary planning to improve infilling precision in domains such as natural language processing, protein design, and structured code generation.
Fill-In-the-Middle (FIM) is a neural language modeling paradigm that generalizes classic left-to-right generative objectives by training a model to predict and generate a contiguous span removed from the interior of a sequence, conditioned on both its left (“prefix”) and right (“suffix”) contexts. In code, natural language, and even protein sequence modeling, FIM provides a direct mechanism to solve infilling tasks—such as code completion, document editing, or reasoning step augmentation—where future and past context are simultaneously salient. FIM is now integral to state-of-the-art code LLMs and is implemented at scale for both synthetic and real-world workflows.
1. Formal Definition and Training Objective
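Let a sequence $x = (x_1, \dots, x_n)$ be partitioned by indices $0 \le i < j \le n$ into three spans:
- Prefix $p = x_{1:i}$
- Middle $m = x_{i+1:j}$ (the “hole” to infill)
- Suffix $s = x_{j+1:n}$
The FIM modeling task is to learn the conditional distribution
$$P_\theta(m \mid p, s),$$
where $\theta$ are model parameters. The typical data transformation for a decoder-only transformer adds sentinel tokens to demarcate spans, yielding the prompt
$$\texttt{<PRE>}\; p\; \texttt{<SUF>}\; s\; \texttt{<MID>},$$
and seeks to autoregressively generate $m$ token by token. The cross-entropy loss is minimized over $m$, formally:
$$\mathcal{L}_{\mathrm{FIM}}(\theta) = -\,\mathbb{E}_{(p, m, s) \sim \mathcal{D}} \sum_{t=1}^{|m|} \log P_\theta\big(m_t \mid \texttt{<PRE>}\, p\, \texttt{<SUF>}\, s\, \texttt{<MID>},\, m_{<t}\big),$$
where $\mathcal{D}$ is the corpus of $(p, m, s)$ triplets. Interleaving FIM-structured and ordinary left-to-right (L2R) training retains both autoregressive sequence modeling and infilling capabilities (Bavarian et al., 2022, Guo et al., 2024).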
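The data transformation described above can be sketched in a few lines. This is a minimal illustration, not any paper's reference implementation: it works on characters rather than tokens, and the function name `fim_transform`, the end-of-text marker `<EOT>`, and the `fim_rate` parameter name are assumptions made for the example.

```python
import random

# Sentinel strings; real models use dedicated vocabulary entries, and <EOT> is assumed here.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def fim_transform(doc, rng, fim_rate=0.5):
    """Return (prompt, target) for one training document.

    With probability `fim_rate` the document is cut at two random character
    positions and rearranged into prefix-suffix-middle order; otherwise it is
    kept as an ordinary left-to-right example with an empty prompt.
    """
    if not doc or rng.random() >= fim_rate:
        return "", doc
    # Two distinct cut points guarantee a non-empty middle span.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    prompt = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    # The loss would be computed on `target` only, matching the objective over the middle.
    return prompt, middle + EOT

example_prompt, example_target = fim_transform(
    "def add(a, b):\n    return a + b\n", random.Random(0), fim_rate=1.0
)
```

Setting `fim_rate` below 1.0 interleaves FIM and L2R examples in the same stream, which is the mixing regime the paragraph above refers to.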
2. Architectures, Prompt Formats, and Operational Regimes
FIM is natively implemented in decoder-only transformer architectures. Prompt engineering is critical for span demarcation and cache management:
- Tokenization/Delimiters: FIM utilizes special tokens (e.g., <PRE>, <SUF>, <MID>) to mark prefix, suffix, and the start of the middle span (Guo et al., 2024, Bavarian et al., 2022).
- Prompt Rearrangement: The dominant format is Prefix-Suffix-Middle (PSM), but Suffix-Prefix-Middle (SPM) is also used for inference/serving efficiency (Bavarian et al., 2022, Guo et al., 28 May 2025). A 50/50 PSM+SPM mix provides broad compatibility.
- KV-cache Reuse: The EFIM prompt rearrangement maximizes reuse of the key-value (KV) cache by placing only user-updated increments after static contexts. In addition, fragment-tokenization training lets the model generate subtokens at arbitrary boundaries. The combined approach reduces latency by up to 52% and raises throughput by 98% without loss of infilling performance (Guo et al., 28 May 2025).
- Instruction Augmentation: The Instruction-Aware FIM (IFIM) framework extends each example with a structured instruction, giving a quadruple of prefix, suffix, instruction, and middle. The instruction is marked in the prompt by its own special tokens, and the model is trained to incorporate developer intent (Sun et al., 29 Sep 2025).
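To make the difference between the two prompt orders concrete, the sketch below builds both and measures how much of the prompt stays unchanged when the user types one more character. The exact SPM sentinel layout varies between models, so the layout in `format_spm` is an assumption for illustration, and character overlap is used as a rough stand-in for reusable cached tokens.

```python
import os
import random

def format_psm(prefix, suffix):
    """Prefix-Suffix-Middle prompt."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

def format_spm(prefix, suffix):
    """Suffix-Prefix-Middle prompt (one simple layout, assumed here)."""
    return f"<SUF>{suffix}<PRE>{prefix}<MID>"

def mixed_prompt(prefix, suffix, rng, spm_rate=0.5):
    """Pick SPM with probability `spm_rate`, else PSM (e.g. a 50/50 training mix)."""
    fmt = format_spm if rng.random() < spm_rate else format_psm
    return fmt(prefix, suffix)

def shared_leading_chars(a, b):
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))

# The user has typed one more character at the cursor; the suffix is unchanged.
suffix = "\n\nprint(f(2))\n"
before, after = "def f(x):\n    return ", "def f(x):\n    return x"

psm_shared = shared_leading_chars(format_psm(before, suffix), format_psm(after, suffix))
spm_shared = shared_leading_chars(format_spm(before, suffix), format_spm(after, suffix))
```

In this toy case the SPM prompts share the whole suffix plus the old prefix, while the PSM prompts diverge as soon as the prefix changes, which is the intuition behind using SPM-style orders for serving efficiency.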
3. Specialized FIM Strategies and Domain Adaptations
FIM has evolved with structural and contextual enhancements across multiple tasks:
- Structure-Aware FIM: Masking entire Abstract Syntax Tree (AST) subtrees (as opposed to random tokens/chars) aligns masked spans with semantically meaningful code constructs. This structurally coherent masking (AST-FIM) delivers up to +7 Pass@1 gain over random-character FIM on standard code infilling benchmarks, and matches human editing patterns (Gong et al., 30 May 2025).
- Curriculum and Code Context: Incorporating context and hard-to-complete code patterns (curriculum learning) enhances FIM performance, especially for smaller models. Statistics from fine-tuning on curriculum and context-rich datasets report improvements in Pass@1, Prefix Match, and edit similarity on multi-line infilling and CCEval (Sagtani et al., 2024).
- Instruction-Conditioned FIM: IFIM achieves Pass@1 gains of roughly 9 to 12 percentage points on instruction-guided infilling (e.g., DeepSeek-Coder: 84.6% to 93.6% on IHumanEval), while core FIM capability without instructions is preserved or slightly improved. Physically separated instruction tokens (not comments) are critical for accurate instruction following (Sun et al., 29 Sep 2025).
- Horizon Planning: By augmenting the next-token loss with a horizon-length regression objective (HLP), models internalize the “distance-to-suffix” at each infilling step, boosting alignment with infilling boundaries and improving repository-level and file-level pass rates by up to 24% relative, obviating the need for heuristic post-processing (Ding et al., 2024).
- Byte-Level Decoding: Precise handling of mid-token boundaries in random-span infilling is resolved by exact byte-level marginalization over all tokenizations, yielding absolute pass rate gains of ~18% over token-level decoding (Phan et al., 2024).
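The structure-aware masking idea can be illustrated with Python's standard `ast` module. This is a simplified sketch, not the AST-FIM procedure from the cited paper: it samples uniformly over statement and expression nodes, handles only Python source, and assumes ASCII text because `col_offset` values are byte offsets. The helper name `ast_fim_split` is hypothetical.

```python
import ast
import random

def ast_fim_split(source, rng):
    """Split `source` into (prefix, middle, suffix) where the middle is the
    exact source text of one randomly chosen AST statement or expression."""
    tree = ast.parse(source)
    # Character offset at which each line starts (valid for ASCII source only).
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    # Statement and expression nodes carry start and end positions.
    candidates = [n for n in ast.walk(tree) if isinstance(n, (ast.stmt, ast.expr))]
    node = rng.choice(candidates)
    begin = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset
    return source[:begin], source[begin:end], source[end:]

src = "def area(w, h):\n    total = w * h\n    return total\n"
prefix, middle, suffix = ast_fim_split(src, random.Random(0))
```

Unlike random character spans, every middle produced this way is a complete syntactic unit, such as a whole statement, a call argument, or a function body element.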
4. Evaluation Protocols and Benchmarks
FIM evaluation metrics center on syntax, semantics, and boundary control:
- Pass@k: Probability that at least one of k generated fills passes all reference unit tests; Pass@1 reduces to the fraction of single samples that pass (code) (Gong et al., 2024).
- Exact Match (EM): Token-wise or character-wise exact equality with ground truth (Ahmad et al., 24 May 2025).
- Perplexity: Exponential of the average negative log-likelihood of the ground-truth tokens (Gong et al., 2024, Gong et al., 30 May 2025).
- Specialized benchmarks:
- SAFIM: Syntax-aware, execution-based code infilling, including block, control-flow, and API call completion (Gong et al., 2024).
- Real-FIM-Eval: Derived from >30,000 GitHub commits across 12 languages, assessing real-world code editing (Gong et al., 30 May 2025).
- HumanEval-infilling and RepoMasterEval: Single/multi-line and real-world repo infilling for code (Sun et al., 29 Sep 2025).
- SEIFER: Secondary structure infilling for protein engineering (Lee et al., 2023).
- Others: CCEval (acceptance and persistence in IDEs), CrossCodeEval (context-aware completion), Multi-line Infilling from SWE-bench (Sagtani et al., 2024, Zhang et al., 19 Jan 2026).
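The three core metrics above are simple to compute once per-sample results are available. The sketch below uses the widely used unbiased pass@k estimator (from n samples per problem, of which c pass); the function names and the assumption that log-probabilities are natural logs are choices made for this example.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k given n samples, c of which pass all tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def exact_match(predicted, reference):
    """Character-wise exact equality with the ground-truth middle."""
    return predicted == reference

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

score = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to c / n, the plain fraction of samples that pass.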
5. Empirical Findings and Best Practices
A cross-paper synthesis yields these high-level insights:
- Infilling does not degrade L2R: FIM pretraining at moderate rates (e.g., a FIM rate of 0.5, meaning about half of the training documents are transformed) does not harm left-to-right performance and is “free” in terms of perplexity and sample quality on L2R tasks (Bavarian et al., 2022, Guo et al., 2024, Gong et al., 30 May 2025).
- Boundary Awareness Is Central: Post-processing of generated output (to remove extraneous lines or ensure alignment with suffix) is necessary for random-span infilling, but superfluous for line-aligned tasks when FIM is trained with explicit span boundaries (Ahmad et al., 24 May 2025, Ding et al., 2024).
- FIM is critical for context-sensitive code completion: Models lacking FIM objectives underperform even when scaling up, and data quality in pretraining (syntax/alignment, AST-aware masks) outweighs raw parameter count (Gong et al., 2024, Gong et al., 30 May 2025).
- AST-based masking and curriculum: Realistic structure masking converges faster and achieves higher accuracy than random spans; curriculum learning and added context give further, complementary gains (Sagtani et al., 2024, Ren et al., 27 Aug 2025).
- Instruction integration: IFIM, with explicit special-token instructions, closes the gap between code LLMs and natural developer workflows, far outperforming comment-based or inline instruction schemes (Sun et al., 29 Sep 2025).
- Domain transfer: FIM supports protein design, where ProtFIM recovers mid-chain amino acids and matches or outperforms larger CLM/PLM baselines (Lee et al., 2023); math reasoning step expansion, where MathFimer lifts benchmark scores by up to 8 percentage points (Yan et al., 17 Feb 2025); and general text tasks with FiLM (Shen et al., 2023).
| Model/Paper | Domain | FIM Variant | Notable Result(s) |
|---|---|---|---|
| DeepSeek-Coder (Guo et al., 2024) | Code | PSM (50%) | SOTA open-source infilling |
| IFIM (Sun et al., 29 Sep 2025) | Code | Instruction-aware FIM | +9 to +12 pp Pass@1 |
| AST-FIM (Gong et al., 30 May 2025) | Code | AST-structure masking | +4–7 pts pass@1 over Rand-FIM |
| ProtFIM (Lee et al., 2023) | Protein | [PRE]/[SUF]/[MID] FIM | Outperforms 2–30x CLMs |
| FiLM (Shen et al., 2023) | Text | Any-order masked infilling | +5–14 ROUGE-PPL gap vs AR |
| EFIM (Guo et al., 28 May 2025) | Code | KV-cache-optimized FIM | –52% latency, +98% throughput |
| MathFimer (Yan et al., 17 Feb 2025) | Math reasoning | Step-infill in solution chain | Up to +8 pp on GSM8K/MATH |
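The heuristic post-processing mentioned under boundary awareness can take many forms. One simple variant, sketched here as an assumption rather than a method from the cited papers, cuts the generated middle at the point where it starts reproducing the beginning of the suffix. The function name `truncate_at_suffix` and the `min_overlap` threshold are hypothetical.

```python
def truncate_at_suffix(generated, suffix, min_overlap=8):
    """Cut `generated` where it begins to repeat the start of `suffix`.

    The first `min_overlap` non-whitespace-leading characters of the suffix are
    used as a probe; very short suffixes are ignored to avoid false matches.
    """
    probe = suffix.lstrip()[:min_overlap]
    if len(probe) < min_overlap:
        return generated
    cut = generated.find(probe)
    return generated if cut == -1 else generated[:cut]

# The model overshot: it produced the middle and then re-generated the suffix.
suffix = "\n    return total\n"
generated = "total = w * h\n    return total\n\nprint(area(2, 3))\n"
trimmed = truncate_at_suffix(generated, suffix)
```

Heuristics of this kind are brittle for short or repetitive suffixes, which is one reason boundary-aware training is preferred where it is available.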
6. Limitations, Extensions, and Future Directions
FIM is robust but has known limitations and open research directions:
- Contextual Repair: Standard FIM cannot correct errors in the conditioning context (prefix/suffix). Methods like SRI (Search-and-Replace Infilling) internalize editing/verification cycles, enabling bug-fixing in the context at FIM-level latency (Zhang et al., 19 Jan 2026).
- Syntax Guarantee: Unconstrained decoders still admit syntax errors. Left/right quotient-based constrained decoding using context-sensitive grammars can raise syntactic correctness from 65% to 89.5% in Python FIM, with minor inference overhead (Melcer et al., 2024).
- Subtoken and Byte Handling: Fragment-tokenization and byte-level marginalization remove pitfalls near token boundaries, markedly improving random-span fill (Guo et al., 28 May 2025, Phan et al., 2024).
- Post-processing: Needed only for random/partial-line tasks; high-quality FIM + supervised fine-tuning yields models that learn exact output boundaries (Ahmad et al., 24 May 2025).
- Scaling Laws: FiLM and AST-FIM show that the gap between infilling and autoregressive performance shrinks at scale and with code-structure alignment, suggesting further gains with increased compute or bidirectional generation (Shen et al., 2023, Gong et al., 30 May 2025).
- Expanded Curriculum, Context, and Instructions: Combining structural, context-aware, and instruction-rich examples is essential for high infilling accuracy, persistence, and human alignment (Sagtani et al., 2024, Sun et al., 29 Sep 2025, Ren et al., 27 Aug 2025).
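As a point of comparison for the syntax-guarantee item above, the sketch below shows a much simpler alternative: post-hoc rejection of candidate fills that do not parse. This is not the quotient-based constrained decoding of the cited work; it only filters finished candidates, works for Python source only, and the function names are hypothetical.

```python
import ast

def is_valid_fill(prefix, middle, suffix):
    """True if prefix + middle + suffix parses as Python source."""
    try:
        ast.parse(prefix + middle + suffix)
    except (SyntaxError, ValueError):
        return False
    return True

def first_valid_fill(prefix, candidates, suffix):
    """Return the first candidate middle that yields parseable code, or None."""
    for middle in candidates:
        if is_valid_fill(prefix, middle, suffix):
            return middle
    return None

prefix, suffix = "def f(x):\n    return (", ")\n"
chosen = first_valid_fill(prefix, ["x +", "x + 1"], suffix)
```

Rejection filtering costs extra samples when many candidates fail, whereas constrained decoding prevents invalid continuations during generation.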
7. Impact, Applications, and Best Practices
FIM is now standard in foundation code LLMs, code assistants, and editing tools. Key best practices distilled from the literature include:
- Use moderate FIM rates (50%), character-level span selection, and context-level masking (Bavarian et al., 2022, Guo et al., 2024).
- For enhanced realism and efficiency, mask AST subtrees rather than random tokens (Gong et al., 30 May 2025).
- For performance-critical applications, apply EFIM for cache reuse and fragment tokenization for subtoken robustness (Guo et al., 28 May 2025).
- For instruction-guided flows, employ explicit structured instruction tokens (IFIM) rather than in-line comments (Sun et al., 29 Sep 2025).
- Use HLP loss for robust boundary planning, especially as post-processing is phased out in evaluation (Ding et al., 2024).
- To guarantee syntax, incorporate constrained decoding using left/right grammar quotients (Melcer et al., 2024).
- For domain transfer (proteins, math, text), adapt prompt and masking formats to respect semantic units—secondary structure, reasoning steps, or paragraphs (Lee et al., 2023, Yan et al., 17 Feb 2025, Shen et al., 2023).
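The horizon-length idea in the list above can be sketched as an auxiliary regression target added to the usual token loss. Everything specific here is an assumption for illustration: the target is taken to be the normalized number of remaining middle tokens, the regression term is an L1 distance, and `weight` is a hypothetical mixing coefficient. The cited paper should be consulted for the actual formulation.

```python
def hlp_targets(middle_len):
    """Assumed target at middle position t (0-based): fraction of the middle
    still to be generated, (M - t) / M."""
    return [(middle_len - t) / middle_len for t in range(middle_len)]

def combined_loss(token_nll, horizon_pred, weight=0.1):
    """Mean next-token loss plus a weighted L1 horizon-regression term.

    token_nll:    per-token negative log-likelihoods over the middle span
    horizon_pred: the model's predicted remaining fraction at each position
    """
    targets = hlp_targets(len(token_nll))
    ce = sum(token_nll) / len(token_nll)
    reg = sum(abs(p - y) for p, y in zip(horizon_pred, targets)) / len(targets)
    return ce + weight * reg

targets = hlp_targets(4)
```

A target that decreases toward zero as the middle nears completion gives the model an explicit signal for when to stop and hand over to the suffix.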
Fill-In-the-Middle thus provides a general, extensible, and empirically validated paradigm for sequence infilling across domains, with structural, efficiency, and instruction-following enhancements emerging as the main determinants of state-of-the-art performance (Sun et al., 29 Sep 2025, Ding et al., 2024, Gong et al., 30 May 2025, Ren et al., 2024, Sagtani et al., 2024, Zhang et al., 19 Jan 2026, Ren et al., 27 Aug 2025, Gong et al., 2024, Guo et al., 28 May 2025, Ahmad et al., 24 May 2025, Phan et al., 2024, Melcer et al., 2024, Lee et al., 2023, Shen et al., 2023, Yan et al., 17 Feb 2025).