Fill-In-the-Middle (FIM) Modeling

Updated 25 March 2026

Fill-In-the-Middle (FIM) is a neural language modeling paradigm that predicts missing interior spans using both prefix and suffix contexts across diverse applications.
It leverages specialized tokenization, AST-based masking, and KV-cache optimizations to enhance performance and efficiency in code completion and document editing.
FIM integrates instruction-aware techniques and boundary planning to improve infilling precision in domains such as natural language processing, protein design, and structured code generation.

Fill-In-the-Middle (FIM) is a neural language modeling paradigm that generalizes the classic left-to-right generative objective: the model is trained to generate a contiguous span removed from the interior of a sequence, conditioned on both its left (“prefix”) and right (“suffix”) contexts. In code, natural language, and protein sequence modeling, FIM directly addresses infilling tasks such as code completion, document editing, and reasoning-step augmentation, where the text both before and after the gap constrains the answer. FIM is now integral to state-of-the-art code LLMs and is deployed at scale in both synthetic and real-world workflows.

1. Formal Definition and Training Objective

Let a sequence $X = (x_1, \dots, x_n)$ be partitioned by indices $1\leq a < b \leq n$ into three spans:

Prefix $P = (x_1, \dots, x_a)$
Middle $M = (x_{a+1}, \dots, x_b)$ (the “hole” to infill)
Suffix $S = (x_{b+1}, \dots, x_n)$

The FIM modeling task is to learn the conditional distribution

$P_\theta(M \mid P, S)$

where $\theta$ are the model parameters. The typical data transformation for a decoder-only transformer inserts sentinel tokens to demarcate the spans, yielding the prompt $\langle\mathrm{PRE}\rangle\,P\,\langle\mathrm{SUF}\rangle\,S\,\langle\mathrm{MID}\rangle$, after which the model autoregressively generates $M$ token by token. The cross-entropy loss is minimized over the tokens of $M$, formally:

$\mathcal{L}_{\mathrm{FIM}}(\theta) = -\,\mathbb{E}_{(P, M, S) \sim \mathcal{D}} \sum_{t=a+1}^{b} \log P_\theta(x_t \mid P, S, x_{a+1}, \dots, x_{t-1})$

where $\mathcal{D}$ is the corpus of $(P, M, S)$ triplets. Interleaving FIM-structured and ordinary left-to-right (L2R) training retains both autoregressive sequence modeling and infilling capabilities (Bavarian et al., 2022, Guo et al., 2024).
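The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular model's pipeline: the sentinel strings, the `<EOT>` end-of-middle marker, and the function names `make_fim_example` and `sample_fim_example` are assumptions introduced here, and the loss mask simply follows the statement that the loss is taken over $M$.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own special tokens.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def make_fim_example(tokens, a, b):
    """Split tokens at 1-based indices a < b and build a PSM training sequence.

    Returns the full sequence and a 0/1 loss mask that covers only the
    middle span (plus the assumed end-of-middle marker).
    """
    n = len(tokens)
    if not (1 <= a < b <= n):
        raise ValueError("need 1 <= a < b <= n")
    prefix, middle, suffix = tokens[:a], tokens[a:b], tokens[b:]
    prompt = [PRE, *prefix, SUF, *suffix, MID]
    target = [*middle, EOT]
    sequence = prompt + target
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return sequence, loss_mask

def sample_fim_example(tokens, rng):
    """Pick a random valid (a, b) pair and apply the transformation."""
    a = rng.randint(1, len(tokens) - 1)
    b = rng.randint(a + 1, len(tokens))
    return make_fim_example(tokens, a, b)
```

For example, `make_fim_example(["a", "b", "c", "d", "e"], 2, 4)` places `a b` in the prefix, `c d` in the middle, and `e` in the suffix, with the mask set only on `c d <EOT>`.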

2. Architectures, Prompt Formats, and Operational Regimes

FIM is natively implemented in decoder-only transformer architectures. Prompt engineering is critical for span demarcation and cache management:

Tokenization/Delimiters: FIM utilizes special tokens (e.g., <PRE>, <SUF>, <MID>) to mark prefix, suffix, and the start of the middle span (Guo et al., 2024, Bavarian et al., 2022).
Prompt Rearrangement: The dominant format is Prefix-Suffix-Middle (PSM), but Suffix-Prefix-Middle (SPM) is also used for inference/serving efficiency (Bavarian et al., 2022, Guo et al., 28 May 2025). A 50/50 PSM+SPM mix provides broad compatibility.
KV-cache Reuse: The EFIM prompt rearrangement enables maximal reuse of the key-value (KV) cache by placing only the user-updated increments after the static context. In addition, fragment-tokenization retraining addresses subtoken generation at arbitrary boundaries. Together these reduce latency by up to 52% and raise throughput by up to 98% without loss of infilling performance (Guo et al., 28 May 2025).
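Instruction Augmentation: The Instruction-Aware FIM (IFIM) framework extends the input with a structured instruction, so that each training example becomes a quadruple of prefix, suffix, instruction, and middle rather than a triplet. The instruction occupies its own segment of the prompt, delimited by dedicated special tokens, and the model is trained to incorporate developer intent (Sun et al., 29 Sep 2025).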
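The PSM and SPM arrangements and the 50/50 mix mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the sentinel strings and the function names are hypothetical, and the SPM layout shown is one plausible ordering, since the exact sentinel placement differs between models.

```python
import random

def format_psm(prefix, suffix, middle=""):
    """Prefix-Suffix-Middle layout: the suffix sits between prefix and middle."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

def format_spm(prefix, suffix, middle=""):
    """Suffix-Prefix-Middle layout (assumed ordering): the suffix comes first,
    so the prefix sits immediately before the generated middle."""
    return f"<SUF>{suffix}<PRE>{prefix}<MID>{middle}"

def format_mixed(prefix, suffix, middle, rng, spm_rate=0.5):
    """Choose SPM with probability spm_rate, otherwise PSM."""
    fmt = format_spm if rng.random() < spm_rate else format_psm
    return fmt(prefix, suffix, middle)
```

At inference time the same functions can be called with an empty `middle`, leaving the model to continue after `<MID>`.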

3. Specialized FIM Strategies and Domain Adaptations

FIM has evolved with structural and contextual enhancements across multiple tasks:

Structure-Aware FIM: Masking entire Abstract Syntax Tree (AST) subtrees, as opposed to random tokens or characters, aligns masked spans with semantically meaningful code constructs. This structurally coherent masking (AST-FIM) delivers Pass@1 gains of up to 7 points over random-character FIM on standard code infilling benchmarks, and matches human editing patterns (Gong et al., 30 May 2025).
Curriculum and Code Context: Incorporating repository context and hard-to-complete code patterns (curriculum learning) enhances FIM performance, especially for smaller models. Fine-tuning on curriculum and context-rich datasets improves Pass@1, Prefix Match, and edit similarity on multi-line infilling and CCEval (Sagtani et al., 2024).
Instruction-Conditioned FIM: IFIM achieves Pass@1 gains of roughly 9 to 12 percentage points on instruction-guided infilling (e.g., DeepSeek-Coder: 84.6% to 93.6% on IHumanEval), with no loss, and in some cases an improvement, in core FIM capabilities when instructions are absent. Physically separated instruction tokens (not comments) are critical for accurate instruction following (Sun et al., 29 Sep 2025).
Horizon Planning: By augmenting the next-token loss with a horizon-length regression objective (HLP), models internalize the “distance-to-suffix” at each infilling step, boosting alignment with infilling boundaries and improving repository-level and file-level pass rates by up to 24% relative, obviating the need for heuristic post-processing (Ding et al., 2024).
Byte-Level Decoding: Precise handling of mid-token boundaries in random-span infilling is resolved by exact byte-level marginalization over all tokenizations, yielding absolute pass rate gains of ~18% over token-level decoding (Phan et al., 2024).
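The structure-aware masking idea in this list can be sketched with Python's standard `ast` module. This is only an illustration of selecting a whole subtree as the middle span, not the AST-FIM authors' procedure: the function names are hypothetical, candidates are sampled uniformly, and the offset conversion assumes ASCII source (where the UTF-8 byte columns reported by `ast` equal character columns).

```python
import ast
import random

def line_col_to_offset(source, lineno, col):
    """Convert a 1-based line and 0-based column to a string offset.
    Assumes ASCII source, so byte columns equal character columns."""
    lines = source.splitlines(keepends=True)
    return sum(len(line) for line in lines[:lineno - 1]) + col

def ast_fim_candidates(source):
    """Return sorted (start, end) spans of every statement and expression subtree."""
    spans = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.stmt, ast.expr)):
            start = line_col_to_offset(source, node.lineno, node.col_offset)
            end = line_col_to_offset(source, node.end_lineno, node.end_col_offset)
            spans.add((start, end))
    return sorted(spans)

def ast_fim_split(source, rng):
    """Mask one randomly chosen subtree and return (prefix, middle, suffix)."""
    start, end = rng.choice(ast_fim_candidates(source))
    return source[:start], source[start:end], source[end:]
```

On the source `def f(x):\n    return x + 1\n`, the candidate middles include the whole function, the statement `return x + 1`, and the expression `x + 1`, so every masked span is a complete syntactic unit.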

4. Evaluation Protocols and Benchmarks

FIM evaluation metrics center on syntax, semantics, and boundary control:

Pass@k: Probability that at least one of k sampled fills passes all reference unit tests (code); for k = 1 this is the fraction of generated fills that pass (Gong et al., 2024).
Exact Match (EM): Token-wise or character-wise exact equality with ground truth (Ahmad et al., 24 May 2025).
Perplexity: Exponential of the average negative log-likelihood over the ground-truth middle span $M$ (Gong et al., 2024, Gong et al., 30 May 2025).
Specialized benchmarks:
- SAFIM: Syntax-aware, execution-based code infilling, including block, control-flow, and API call completion (Gong et al., 2024).
- Real-FIM-Eval: Derived from >30,000 GitHub commits across 12 languages, assessing real-world code editing (Gong et al., 30 May 2025).
- HumanEval-infilling and RepoMasterEval: Single/multi-line and real-world repo infilling for code (Sun et al., 29 Sep 2025).
- SEIFER: Secondary structure infilling for protein engineering (Lee et al., 2023).
- Others: CCEval (acceptance and persistence in IDEs), CrossCodeEval (context-aware completion), Multi-line Infilling from SWE-bench (Sagtani et al., 2024, Zhang et al., 19 Jan 2026).
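The three metrics listed at the top of this section can be computed as in the sketch below. The Pass@k function uses the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ pass; whether each cited benchmark uses exactly this estimator is an assumption, and the function names are illustrative.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which pass all tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def exact_match(prediction, reference):
    """Character-wise exact equality with the ground-truth middle."""
    return prediction == reference

def perplexity(token_logprobs):
    """Exponential of the mean negative log-likelihood over the middle tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

With `n = 10` samples and `c = 3` passing, `pass_at_k(10, 3, 1)` equals the plain pass fraction of 0.3.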

5. Empirical Findings and Best Practices

A cross-paper synthesis yields these high-level insights:

Infilling does not degrade L2R: FIM pretraining at moderate rates (around 0.5) does not harm left-to-right performance and is “free” in terms of perplexity and sample quality on L2R tasks (Bavarian et al., 2022, Guo et al., 2024, Gong et al., 30 May 2025).
Boundary Awareness Is Central: Post-processing of generated output (to remove extraneous lines or ensure alignment with suffix) is necessary for random-span infilling, but superfluous for line-aligned tasks when FIM is trained with explicit span boundaries (Ahmad et al., 24 May 2025, Ding et al., 2024).
FIM is critical for context-sensitive code completion: Models lacking FIM objectives underperform even when scaling up, and data quality in pretraining (syntax/alignment, AST-aware masks) outweighs raw parameter count (Gong et al., 2024, Gong et al., 30 May 2025).
AST-based masking and curriculum: Realistic structure masking converges faster and achieves higher accuracy than random spans; curriculum and context add synergy (Sagtani et al., 2024, Ren et al., 27 Aug 2025).
Instruction integration: IFIM, with explicit special-token instructions, closes the gap between code LLMs and natural developer workflows, far outperforming comment-based or inline instruction schemes (Sun et al., 29 Sep 2025).
Domain transfer: FIM supports protein design (recovering mid-chain amino acids; ProtFIM matches or outperforms larger CLM/PLM baselines; Lee et al., 2023), math reasoning step expansion (MathFimer lifts benchmark scores by up to 8 percentage points; Yan et al., 17 Feb 2025), and general text tasks (FiLM; Shen et al., 2023).

Model/Paper	Domain	FIM Variant	Notable Result(s)
DeepSeek-Coder (Guo et al., 2024)	Code	PSM (50%)	SOTA open-source infilling
IFIM (Sun et al., 29 Sep 2025)	Code	Instruction-aware FIM	+9 to +12 pp Pass@1
AST-FIM (Gong et al., 30 May 2025)	Code	AST-structure masking	+4–7 pts pass@1 over Rand-FIM
ProtFIM (Lee et al., 2023)	Protein	[PRE]/[SUF]/[MID] FIM	Outperforms 2–30x CLMs
FiLM (Shen et al., 2023)	Text	Any-order masked infilling	+5–14 ROUGE-PPL gap vs AR
EFIM (Guo et al., 28 May 2025)	Code	KV-cache-optimized FIM	–52% latency, +98% throughput
MathFimer (Yan et al., 17 Feb 2025)	Math reasoning	Step-infill in solution chain	Up to +8 pp on GSM8K/MATH
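The boundary post-processing discussed under "Boundary Awareness Is Central" is often a simple heuristic. The sketch below shows one such heuristic, cutting the generation at the point where it starts to reproduce the beginning of the suffix. The function name `truncate_at_suffix` and the `min_overlap` threshold are assumptions made for illustration and are not taken from the cited papers.

```python
def truncate_at_suffix(generated, suffix, min_overlap=8):
    """Cut the generation where it begins repeating the start of the suffix.

    The first min_overlap non-whitespace-leading characters of the suffix
    are used as a probe; if the suffix is too short to give a reliable
    probe, or the probe never appears, the generation is returned unchanged.
    """
    probe = suffix.lstrip()[:min_overlap]
    if len(probe) < min_overlap:
        return generated
    cut = generated.find(probe)
    return generated if cut == -1 else generated[:cut]
```

A short probe risks cutting legitimate output that happens to share text with the suffix, which is one reason the literature above favors models that learn the boundary themselves.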

6. Limitations, Extensions, and Future Directions

FIM is robust but has known boundaries and open research directions:

Contextual Repair: Standard FIM cannot correct errors in the conditioning context (prefix/suffix). Methods like SRI (Search-and-Replace Infilling) internalize editing/verification cycles, enabling bug-fixing in the context at FIM-level latency (Zhang et al., 19 Jan 2026).
Syntax Guarantee: Unconstrained decoders still admit syntax errors. Left/right quotient-based constrained decoding using context-sensitive grammars can raise syntactic correctness in Python FIM from 65% to 89.5%, with minor inference overhead (Melcer et al., 2024).
Subtoken and Byte Handling: Fragment-tokenization and byte-level marginalization remove pitfalls near token boundaries, markedly improving random-span fill (Guo et al., 28 May 2025, Phan et al., 2024).
Post-processing: Needed only for random/partial-line tasks; high-quality FIM + supervised fine-tuning yields models that learn exact output boundaries (Ahmad et al., 24 May 2025).
Scaling Laws: FiLM and AST-FIM show that the infilling–autoreg gap shrinks at scale and with code-structure alignment, suggesting further gains with increased compute or bidirectional generation (Shen et al., 2023, Gong et al., 30 May 2025).
Expanded Curriculum, Context, and Instructions: Combining structural, context-aware, and instruction-rich examples is essential for high infilling accuracy, persistence, and human alignment (Sagtani et al., 2024, Sun et al., 29 Sep 2025, Ren et al., 27 Aug 2025).

7. Impact, Applications, and Best Practices

FIM is now standard in foundation code LLMs, code assistants, and editing tools. Key best practices distilled from the literature include:

Use moderate FIM rates (around 50%), character-level span selection, and context-level masking (Bavarian et al., 2022, Guo et al., 2024).
For enhanced realism and efficiency, mask AST subtrees rather than random tokens (Gong et al., 30 May 2025).
For performance-critical applications, apply EFIM for cache reuse and fragment tokenization for subtoken robustness (Guo et al., 28 May 2025).
For instruction-guided flows, employ explicit structured instruction tokens (IFIM) rather than in-line comments (Sun et al., 29 Sep 2025).
Use HLP loss for robust boundary planning, especially as post-processing is phased out in evaluation (Ding et al., 2024).
To guarantee syntax, incorporate constrained decoding using left/right grammar quotients (Melcer et al., 2024).
For domain transfer (proteins, math, text), adapt prompt and masking formats to respect semantic units—secondary structure, reasoning steps, or paragraphs (Lee et al., 2023, Yan et al., 17 Feb 2025, Shen et al., 2023).
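The horizon-length objective recommended in this list can be sketched as an auxiliary regression term added to the next-token loss. This is a rough illustration under assumptions: the target is taken to be the fraction of middle tokens still to be generated at each step, the regression error is a mean absolute error, and the `weight` value and function names are placeholders rather than values from Ding et al.

```python
def horizon_targets(middle_len):
    """Fraction of middle tokens remaining after each of the middle_len steps."""
    return [(middle_len - t) / middle_len for t in range(1, middle_len + 1)]

def hlp_loss(predicted, targets):
    """Mean absolute error between predicted and target horizon fractions."""
    return sum(abs(p - t) for p, t in zip(predicted, targets)) / len(targets)

def combined_loss(next_token_loss, predicted, targets, weight=0.1):
    """Next-token loss plus the weighted horizon-length regression term."""
    return next_token_loss + weight * hlp_loss(predicted, targets)
```

For a four-token middle the targets are `[0.75, 0.5, 0.25, 0.0]`, so the target reaches zero exactly where the middle should end and the suffix should begin.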