Search-and-Replace Infilling (SRI)
- Search-and-Replace Infilling (SRI) is a unified framework that generalizes fill-in-the-middle to dynamic, context-aware code editing via one-pass inference.
- It reformulates code completion as a diff-style search-and-replace operation, explicitly grounding edits through a two-block (SEARCH and REPLACE) generation process.
- Leveraging instruction-tuned Transformer models and the SRI-200K dataset, the approach enhances performance and robustness while preserving overall coding competencies.
Search-and-Replace Infilling (SRI) is a code infilling framework that generalizes the traditional fill-in-the-middle (FIM) paradigm to support dynamic, context-aware editing through a single-pass inference. Unlike FIM, which is restricted to static completion, SRI structurally integrates verification and editing—the hallmarks of agentic workflows—directly into model generation. This enables explicit grounding of edits and aligns with instruction-following priors of contemporary Chat LLMs, leading to enhanced performance, robustness, and inference efficiency in code completion and editing tasks (Zhang et al., 19 Jan 2026).
1. Formal Framework of SRI
SRI reformulates the code completion problem as a search-and-replace operation over a code context $C = (P, m, S)$, where $P$ denotes the prefix, the marker $m$ is a sentinel such as `/* MIDDLE CODE TO COMPLETE */`, and $S$ denotes the suffix. During inference, the model outputs a pair $(s, r)$:
- $s$: SEARCH block, a verbatim copy (“echo”) of the region containing the marker, grounding the edit location.
- $r$: REPLACE block, the substitution for the marker, representing the desired code completion or correction.
The model optimizes $\max_\theta \, p_\theta(s, r \mid C)$. A deterministic patch operator then applies the diff-style update, functionally analogous to a `git` patch, by replacing $s$ with $r$ in $C$.
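The deterministic patch step can be sketched as a strict, single-match string substitution; this is a minimal illustration under the assumptions above, and `apply_replace` is a hypothetical helper rather than the paper's released code:

```python
def apply_replace(ctx: str, search: str, replace: str) -> str:
    """Apply the diff-style update: substitute the SEARCH snippet
    with the REPLACE snippet inside the full code context.
    Fails loudly if the SEARCH text is missing or ambiguous."""
    count = ctx.count(search)
    if count != 1:
        raise ValueError(f"SEARCH block matched {count} times; expected exactly 1")
    return ctx.replace(search, replace, 1)
```

Requiring exactly one verbatim match is what grounds the edit: an ambiguous or missing SEARCH block surfaces as an error instead of silently patching the wrong location.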
2. Training Objective and Model Architecture
SRI models are instruction-tuned using token-level cross-entropy over the concatenated SEARCH and REPLACE sequence, with no auxiliary loss or regularization:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, C),$$

where $y = (s \,\Vert\, r)$ is the full target token sequence. Architecturally, SRI-Coder variants inherit exactly the same Transformer stack (SwiGLU activations, RMSNorm) and byte-pair encoding as their respective base models (Qwen2.5-Coder and Qwen3-Coder, spanning 0.5B–480B parameters). Only the marker and diff delimiters (`/* MIDDLE CODE TO COMPLETE */`, `<<<<<<< SEARCH`, `=======`, `>>>>>>> REPLACE`) are introduced as new tokens.
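As a concrete illustration, masked token-level cross-entropy can be sketched in pure Python; the per-token log-probabilities are assumed precomputed, and this is only a pedagogical sketch of the stated objective, not the paper's implementation:

```python
import math

def sri_loss(token_logprobs, target_mask):
    """Mean negative log-likelihood over target tokens only.

    token_logprobs: log-probabilities the model assigned to the
    reference tokens (context + SEARCH + REPLACE, one per token).
    target_mask: 1 for SEARCH/REPLACE target tokens, 0 for context
    tokens, which are excluded from the loss.
    """
    losses = [-lp for lp, m in zip(token_logprobs, target_mask) if m]
    return sum(losses) / len(losses)
```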
3. Dataset Construction: SRI-200K
The SRI-200K dataset was constructed from The Stack v2 using Tree-sitter parsing, yielding 200K “middle” code segments balanced across four types: function bodies, multi-line blocks (if/for), random spans, and single lines (ratio 2:1:1:1). A 20K subset, weighted toward high-quality repositories by GitHub stars, was used for instruction tuning; the remainder is reserved for extended research. Each sample contains a full file context (truncated to 32K tokens), a 10-line edit window around the marker, and is rendered as a diff-style block for supervision:
```
<<<<<<< SEARCH
… code with /* MIDDLE CODE TO COMPLETE */ …
=======
… same code but marker replaced with ground-truth …
>>>>>>> REPLACE
```
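Parsing a generated block of this shape can be sketched as follows; the exact whitespace handling is an assumption, and the sketch assumes the generated code itself does not contain the delimiter lines:

```python
def parse_diff(diff_block: str):
    """Split a generated SEARCH/REPLACE block into its two snippets.
    Assumes each of the three delimiters occurs exactly once."""
    header, sep, footer = "<<<<<<< SEARCH", "=======", ">>>>>>> REPLACE"
    body = diff_block.split(header, 1)[1]  # drop everything before SEARCH
    body = body.rsplit(footer, 1)[0]       # drop the REPLACE footer
    search, replace = body.split(sep, 1)   # split on the ======= divider
    return search.strip("\n"), replace.strip("\n")
```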
4. Training Pipeline and Hyperparameters
SRI-Coder models are fine-tuned using Megatron-LM across 16 NVIDIA A100 (80GB) GPUs, with a context length of 32,768 tokens. Each batch mixes 20K SRI examples, 60K general instructions (sourced from Glaive-Code-Assistant), plus 100 safety prompts. Optimization uses AdamW (weight decay 0.1, gradient clip 1.0), with linear warm-up (30 steps) to the peak learning rate, followed by decay to a minimum learning rate over 853 steps. Training employs BF16 precision with a global batch size of 256 and micro-batch size 1.
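The warm-up-then-decay schedule can be sketched as a step-indexed function; `peak_lr` and `floor` are placeholders, since the exact learning-rate values are not reproduced here, and the linear decay shape is an assumption:

```python
def lr_at_step(step: int, peak_lr: float, warmup_steps: int = 30,
               total_steps: int = 853, floor: float = 0.0) -> float:
    """Linear warm-up over `warmup_steps`, then linear decay toward
    `floor` over the remaining steps (a common schedule shape; the
    paper's exact decay curve is an assumption here)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + (peak_lr - floor) * max(0.0, 1.0 - frac)
```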
5. Algorithmic Workflow
A single SRI inference is executed as follows:
```python
def SRI_infill(code_file: str) -> str:
    # 1. Load the file; the completion marker must already be present
    ctx = load_file(code_file)
    assert "/* MIDDLE CODE TO COMPLETE */" in ctx
    # 2. Greedily generate the diff block
    diff_block = model.generate(ctx, prompt=SRI_PROMPT)
    # 3. Parse SEARCH / REPLACE regions
    search_snippet, replace_snippet = parse_diff(diff_block)
    # 4. Apply the patch
    return apply_replace(ctx, search_snippet, replace_snippet)
```
6. Empirical Evaluation and Results
SRI was benchmarked against Base FIM and Chat-FIM paradigms using similarity-based (CrossCodeEval EM, Edit Similarity) and execution-based (Pass@1 on ExecRepoBench) metrics. Selected results:
| Model | CrossCodeEval EM | Pass@1 (ExecRepoBench) |
|---|---|---|
| Qwen2.5-Coder-32B (FIM) | 57.1% | 25.7% |
| DeepSeek-V3-Base (FIM) | 61.9% | — |
| Claude-3.5-Haiku (Chat-FIM) | 23.9% | 35.6% |
| Claude-3.5-Haiku (SRI) | 44.5% | 61.8% (+26.2%) |
| SRI-Coder-32B (ours) | 57.6% (+46.3) | 61.6% (+37.1) |
SRI-Coder models fine-tuned on 20K examples matched or exceeded larger Base FIM models, and SRI tuning preserved inference latency within 1–2% of standard FIM. On MBPP, HumanEval, BigCodeBench, and LiveCodeBench, SRI-Coder exhibited negligible degradation (0–1 pt), contrasting with 3–7 pt average drops for natural-language FIM tuning. This suggests that SRI’s diff-style objective preserves general coding competencies in instruction-tuned Chat LLMs.
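For reference, the Edit Similarity metric can be sketched with a common character-level definition, 1 minus the Levenshtein distance normalized by the longer string's length; the benchmark's exact variant may differ:

```python
def edit_similarity(a: str, b: str) -> float:
    """Character-level edit similarity: 1 - Levenshtein(a, b) divided
    by the length of the longer string (a common definition; the
    benchmark's exact normalization is an assumption)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))
```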
7. Limitations, Adoptions, and Extensions
Current evaluations are restricted to offline benchmarks; practical IDE integration and user studies remain pending. Smaller models (<1B) show diminished SRI gains, indicating the need for knowledge distillation or curriculum strategies. Multi-file edits and richer agentic workflows built directly on the SRI format are plausible directions for future code-assistant development. The SRI-200K dataset and SRI-Coder checkpoints are available under open-source licenses to support broad adoption and further research (Zhang et al., 19 Jan 2026).