Structured Fill-In-the-Middle (SFIM)
- Structured Fill-In-the-Middle (SFIM) is a sequence modeling technique that infills missing segments by leveraging syntactic, semantic, or user-defined structural boundaries.
- SFIM employs domain-specific strategies like AST extraction for code and residue windowing for proteins to ensure the generated infills are contextually consistent.
- The methodology enhances controllability and performance in tasks such as editing, reasoning, and code synthesis by integrating dual-loss frameworks and bidirectional modeling.
Structured Fill-In-the-Middle (SFIM) is a class of sequence modeling objectives and architectures in which a model is explicitly trained to generate or infill a central missing segment, or multiple internal masked sections, of a sequence with consideration for structural or semantic boundaries. SFIM stands apart from generic fill-in-the-middle (FIM) by its use of underlying structure—syntactic, semantic, or user-defined—to select, constrain, or guide both the span to be infilled and the manner of infilling. SFIM methodologies have been adopted across natural language, code, protein sequence, and tabular application domains, with implications for controllability, diversity, and real-world usability in editing, reasoning, and code synthesis tasks.
1. Structural Approaches to SFIM: Syntax, Trees, and Block-Level Masking
The key innovation in SFIM lies in moving beyond masking arbitrary spans to leveraging structural information. In code domains, SFIM implementations employ Abstract Syntax Tree (AST) analysis to select infill regions that correspond to complete, independently meaningful blocks—such as functions, control-flow statements, or logical groupings—rather than character-aligned or token-aligned random spans. Methodologies involve parsing code (e.g., via Tree-sitter), sampling a non-leaf internal node (representing an entire syntactic subunit), and ensuring that both the prefix and suffix present the global context on both sides of the omitted structure (Jiang et al., 17 Oct 2024, Gong et al., 30 May 2025, Ren et al., 27 Aug 2025).
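The AST-guided span selection described above can be sketched with Python's standard-library `ast` module (the cited works use Tree-sitter and richer node taxonomies; this simplified version treats a few statement-level node types as candidate spans):

```python
import ast
import random

def ast_fim_split(source: str, seed: int = 0) -> tuple[str, str, str]:
    """Split source into (prefix, middle, suffix), where the middle is a
    complete syntactic unit (a statement-level AST node), not a random span."""
    tree = ast.parse(source)
    # Candidate spans: structural nodes with known line ranges.
    candidates = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.If, ast.While, ast.For))
        and node.end_lineno is not None
    ]
    node = random.Random(seed).choice(candidates)
    lines = source.splitlines(keepends=True)
    prefix = "".join(lines[: node.lineno - 1])
    middle = "".join(lines[node.lineno - 1 : node.end_lineno])
    suffix = "".join(lines[node.end_lineno :])
    return prefix, middle, suffix

src = """def f(x):
    if x > 0:
        return x
    return -x
"""
prefix, middle, suffix = ast_fim_split(src)
```

Because the sampled middle is always a whole subtree, the prefix and suffix never cut through the interior of a statement, which is the property random character-span masking cannot guarantee.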
In natural language, SFIM can include specifying a middle word (as in middle-out decoders for captioning (Mehri et al., 2018)), or splitting documents at semantically coherent boundaries for structured editing tasks.
For protein engineering, SFIM uses biologically meaningful residue windowing, ensuring that both local and global sequence context around the mutable region supports function and structure (Lee et al., 2023).
Table: Span Selection Strategies in SFIM
| Domain | Span Selection Strategy | Structural Unit |
|---|---|---|
| Code | AST subtree extraction | Functions, if/while... |
| Protein design | Region respecting secondary structure | Residue window |
| Text | User- or classifier-specified token | Word, phrase |
SFIM's reliance on structural alignment ensures that infills are syntactically and semantically plausible, mimicking real-world editing operations and promoting syntactic validity; by contrast, models trained on random span masking may sever code elements or disrupt language flow (Gong et al., 30 May 2025).
2. Training Objectives, Data Transformations, and Losses
SFIM training objectives universally require reformatting the training data to enforce model exposure to bidirectional context accompanied by structure-aware masking. In code, this involves extracting syntactically complete spans and reordering training examples, commonly using the "Prefix-Suffix-Middle" (PSM) or "Suffix-Prefix-Middle" (SPM) formats with sentinel tokens marking each boundary (Bavarian et al., 2022, Jiang et al., 17 Oct 2024). SFIM loss is typically formulated as cross-entropy over the infilled region, but many models (e.g., aiXcoder-7B, AST-FIM) include dual loss formulations for PSM and SPM to enhance flexibility.
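The PSM and SPM reorderings can be sketched as pure string transformations. The sentinel token names below are illustrative; each model family reserves its own special tokens for these roles:

```python
# Illustrative sentinel tokens; real models define their own reserved tokens.
PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Prefix-Suffix-Middle: the model sees both contexts first, then the
    middle; cross-entropy loss is applied to tokens after the MID sentinel."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def to_spm(prefix: str, middle: str, suffix: str) -> str:
    """Suffix-Prefix-Middle: the alternate ordering used in dual-loss setups."""
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"
```

At inference time the same template is used with the middle omitted: the model is prompted up to and including the MID sentinel and completes the infill autoregressively.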
In mathematical reasoning, MathFimer employs a decomposition of solution chains into prefix, missing step, and suffix, with loss applied only to the reconstructed middle step, using supervised fine-tuning on expanded reasoning chains (Yan et al., 17 Feb 2025).
SFIM training increasingly incorporates auxiliary objectives for planning (e.g., horizon-length prediction (Ding et al., 4 Oct 2024)) or curriculum-based strategies, in which the complexity of masked regions is gradually increased according to program or reasoning structure (Sagtani et al., 21 Dec 2024, Ren et al., 27 Aug 2025).
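A curriculum over masked-region complexity can be approximated by widening the maximum span length as training progresses. The linear schedule below is a minimal sketch of the idea, not the exact scheme used in the cited works:

```python
def curriculum_span_len(step: int, total_steps: int,
                        min_len: int = 1, max_len: int = 32) -> int:
    """Linearly grow the maximum masked-span length with training progress,
    so early batches infill single statements and later batches whole blocks."""
    progress = min(step / total_steps, 1.0)
    return min_len + round(progress * (max_len - min_len))
```

In practice the schedule would be keyed to structural depth (e.g., leaf statements before nested blocks) rather than raw length alone.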
3. Evaluation Benchmarks and Quantitative Performance
SFIM methods are evaluated across domains with real-world and synthetic benchmarks designed to test the quality of infill and the preservation of structure:
- Code: Real-FIM-Eval (code deltas from real commits covering multiple languages (Gong et al., 30 May 2025)), HumanEval and MBPP for function completion (Ren et al., 27 Aug 2025), SAFIM for syntax-aware infilling, CCEval with cross-file context (Sagtani et al., 21 Dec 2024).
- Protein: SEIFER benchmark requires not only sequence matching but strict conservation of secondary structure in the infilled region (Lee et al., 2023).
- Mathematics: MathFimer-expanded datasets with inserted intermediate reasoning steps, evaluated on GSM8K, MATH, and MetaMathQA (Yan et al., 17 Feb 2025).
Performance gains reported for SFIM-trained models over baseline FIM/random-mask models are substantial: up to 5 points improvement on code FIM benchmarks (AST-FIM over Rand-FIM (Gong et al., 30 May 2025)), up to 24% relative gain in FIM tasks with horizon-length prediction (Ding et al., 4 Oct 2024), and 3–9 percentage points on mathematical reasoning benchmarks through chain expansion (Yan et al., 17 Feb 2025). SFIM also mitigates the loss of left-to-right generative capability associated with standard FIM; for example, joint FIM/L2R training yields FIM capability "for free" without degrading conventional perplexity metrics (Bavarian et al., 2022).
4. Architectural and Serving Considerations
Implementing SFIM at scale necessitates architectural adaptations for both training and inference stages:
- Dual Decoding and Attention: Middle-out decoding instantiates two decoders expanding in opposite directions from a central token, each with dual self-attention mechanisms to consolidate context from both generated sides (Mehri et al., 2018).
- Bidirectional Masked Language Modeling: Models like FiLM generalize SFIM capabilities to any mask position via a masked language modeling objective with adaptive mask scheduling (Beta-distributed mask probabilities) (Shen et al., 2023).
- KV Cache Efficiency: For efficient serving in interactive environments, EFIM proposes a prompt reordering (moving the infill to the end) and fragment tokenization training, enabling both prefix and suffix KV caches to be reused across requests, demonstrating up to 98% throughput gain and 52% latency reduction (Guo et al., 28 May 2025).
- Syntax-Constrained Decoding: Early-rejection incremental parsers with quotient grammars force syntactic validity during decoding, achieving near 90% syntax-correctness rate in Python FIM tasks (Melcer et al., 28 Feb 2024).
- Robust Subtoken/Character-Level Handling: FIM-SE eliminates the need for subtoken prediction at span boundaries by aligning infill units to line boundaries and applying strong character constraints, reducing perplexity and error rates in character-level SFIM tasks (Ren et al., 27 May 2024).
5. Real-World Applications and Broader Implications
SFIM has enabled new capabilities and performance gains across several real-world tasks:
- Code Completion and Editing: SFIM-trained models produce infills that fit within realistic code editing operations, supporting IDE-assisted patching, automated refactoring, and multi-line suggestion (Gong et al., 30 May 2025, Jiang et al., 17 Oct 2024, Ren et al., 27 Aug 2025).
- Form Filling and Structured Data Entry: Domain-agnostic form filling combines multi-faceted contextual inputs and structured output maps, supporting real-time autofill across arbitrary web forms (Aveni et al., 2023).
- Mathematical Reasoning Expansion: MathFimer demonstrates that FIM-based reasoning insertion can systematically improve both solution chain detail and final accuracy in LLM reasoning (Yan et al., 17 Feb 2025).
- Protein Engineering: By infilling the middle of proteins with context-aware mutations, SFIM-guided pLMs such as ProtFIM generate functional, structurally consistent designs for protein engineering (Lee et al., 2023).
Table: Domains and Notable SFIM Application Contexts
| Domain | SFIM Output Type | Benchmark/Deployment |
|---|---|---|
| Code | Block/statement infill | HumanEval, SAFIM, Real-FIM-Eval |
| Language | Phrase/paragraph insertion | WikiText-103, ROCStories |
| Protein | Sequence segment infill | SEIFER |
| Mathematical | Reasoning step insertion | GSM8K, MATH, MetaMathQA |
| Form filling | Field-by-field autofill | OmniFill user studies |
A plausible implication is that SFIM-mediated models facilitate controllable, semantically-aware editing and completion for a range of tasks where preservation of global structure is critical, and provide a principled route to fine-grained curriculum design, planning objectives (e.g., horizon-length), and cache-efficient deployment.
6. Limitations and Open Challenges
Despite its strengths, SFIM faces several ongoing challenges and limitations:
- Subtoken/Fragment Generation: Prompt formats exposing incomplete tokens at the boundary can degrade model performance unless special fragment tokenization procedures are employed (Guo et al., 28 May 2025, Ren et al., 27 May 2024).
- Boundary Awareness: Without auxiliary objectives (such as HLP), models may fail to terminate generation at the correct integration point, requiring dataset-specific post-processing for evaluation and sometimes for deployment (Ahmad et al., 24 May 2025, Ding et al., 4 Oct 2024).
- Domain Generalization: While SFIM generalizes well within code and block-structured domains, cross-language or cross-schema generalization relies on robust structural parsing and annotation pipelines, and may require adaptation for new programming languages or data schemas (Gong et al., 30 May 2025).
- Inference Efficiency Under High Concurrency: Although prompt reordering can boost KV cache reuse, resource constraints (e.g., limited GPU memory) still limit scaling under extreme concurrency (Guo et al., 28 May 2025).
- Evaluation Complexity: Meaningful evaluation of SFIM completions must reconcile diverse structural units, requiring benchmarks that reflect real-world editing patterns and standardized post-processing heuristics when "raw" outputs remain imperfect (Ahmad et al., 24 May 2025).
7. Future Directions and Research Trajectory
Recent SFIM research suggests several directions:
- Generalizing SFIM to Multimodal and Non-Code Domains: Strategies employing structure-aware masking (e.g., visual or tabular block masking) may extend SFIM to multimodal settings.
- Integration with Preference and Reward Optimization: SFIM paired with preference optimization (such as DPO) over granular structural units enables more data-efficient learning from limited but high-quality test cases (Ren et al., 27 Aug 2025).
- Planning and Lookahead: Explicit modeling of horizon or boundary (as in HLP) may be vital for applications requiring structured insertion or alignment beyond straightforward L2R objectives (Ding et al., 4 Oct 2024).
- Human-in-the-Loop and Transparency: SFIM deployment in user-facing systems (e.g., OmniFill) reveals the importance of context transparency and user-driven task specification, highlighting the need for attribution and prompt construction interfaces (Aveni et al., 2023).
- Standardization of Benchmarks and Post-Processing: The proliferation of SFIM tasks and specialized data formats (PSM, SPM, line/block-aligned splitting, AST masking) underscores the need for shared evaluation standards and tooling (Bavarian et al., 2022, Gong et al., 30 May 2025).
SFIM-based architectures and objectives are shaping a trajectory towards structure-aligned, contextually adaptable, and computationally efficient LLMs tailored for a wide range of infilling, editing, and reasoning tasks.