From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation? (2505.18789v2)

Published 24 May 2025 in cs.SE and cs.CL

Abstract: Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.

Authors (3)
  1. Wasi Uddin Ahmad (41 papers)
  2. Somshubra Majumdar (31 papers)
  3. Boris Ginsburg (111 papers)

Summary

This paper investigates the effectiveness of instruction-tuned LLMs for Fill-in-the-Middle (FIM) code generation and the necessity of post-processing their raw outputs. FIM is a crucial task in code completion where the model must generate a missing code segment conditioned on both the preceding (prefix) and succeeding (suffix) code context. A common challenge in evaluating FIM models is that raw outputs often contain extraneous code, requiring post-processing like truncation to align with evaluation criteria, which may not reflect real-world usage.
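
As a concrete illustration of the task, a small Python function split into the three FIM segments might look like the following; the function and split points are hypothetical, not drawn from the paper's benchmarks.

# The model is shown prefix and suffix and must generate the missing middle
# so that prefix + middle + suffix forms the complete function.
prefix = "def clamp(x, lo, hi):\n    if x < lo:\n        return lo\n"
suffix = "    return x\n"
middle = "    if x > hi:\n        return hi\n"  # target completion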

The authors' primary motivation is to determine if instruction-tuned LLMs, which are trained to follow instructions, are inherently better at FIM and if supervised fine-tuning (SFT) can improve their FIM capabilities to the point where raw outputs are directly usable for evaluation and real-world application without complex post-processing.

The methodology involved several steps:

  1. Initial Evaluation: Assessed the FIM performance of off-the-shelf instruction-tuned Qwen2.5-Coder-Instruct models (7B, 14B, 32B) on two benchmarks: HumanEval Infilling and SAFIM. These benchmarks cover different types of FIM tasks (single-line, multi-line, random span, algorithm block, control flow, API call). They applied standard, dataset-specific post-processing rules to the raw outputs for this initial evaluation.
  2. Instruction-Response Data Generation: Created a training dataset for SFT. This involved collecting Python functions from GitHub, filtering them, and then using a larger LLM (Mixtral-8x22B) to generate instruction-response pairs. The prompt instructed the LLM to split each function into prefix, middle, and suffix based on five different strategies (random span, algorithmic block, control-flow expression, API function call, assignment expression). The resulting dataset contained approximately 1 million instruction-response pairs.
  3. Supervised Fine-tuning: Fine-tuned both the base and instruction-tuned versions of the Qwen2.5-Coder models (7B, 14B, 32B) on the generated instruction-response dataset. Fine-tuning ran for 5000 steps on NVIDIA H100 GPUs with the AdamW optimizer, a batch size of 256, and a maximum sequence length of 4096 tokens, using a cosine-annealing learning-rate scheduler with 10% warmup (a hedged configuration sketch follows this list).
  4. Evaluation of Fine-tuned Models: Evaluated the fine-tuned models on the same HumanEval Infilling and SAFIM benchmarks, reporting results both with and without the standard post-processing rules, under the standard pass@1 metric (a brief sketch of the metric also follows this list).
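
To ground the hyperparameters in item 3, here is a minimal configuration sketch using Hugging Face transformers' TrainingArguments. The paper summary does not state which training framework was used, so the framework choice, output path, learning rate, and per-device batch split below are assumptions; only the aggregate settings come from the paper.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-coder-fim-sft",  # assumed output path
    max_steps=5000,                      # 5000 fine-tuning steps
    per_device_train_batch_size=4,       # 4 per device x 8 accumulation
    gradient_accumulation_steps=8,       #   x 8 GPUs (assumed) = batch size 256
    optim="adamw_torch",                 # AdamW optimizer
    learning_rate=1e-5,                  # assumed; not stated in this summary
    lr_scheduler_type="cosine",          # cosine annealing ...
    warmup_ratio=0.1,                    # ... with 10% warmup
    bf16=True,                           # typical precision on H100 GPUs
)
# The 4096-token maximum sequence length is enforced when tokenizing the
# instruction-response pairs, not via TrainingArguments.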

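Regarding the metric in item 4: with one sample per problem, pass@1 is simply the fraction of problems whose completion passes all unit tests. For reference, a minimal sketch of the general unbiased pass@k estimator from Chen et al. (2021), which reduces to that fraction when n = k = 1, is shown here; it is background, not code from the paper.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k),
    given n generated samples of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(1, 1, 1))  # 1.0: a single passing sample scores the problem as solved
print(pass_at_k(1, 0, 1))  # 0.0: a single failing sample scores it as unsolved
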
Key findings from the experiments provide practical guidance for implementing FIM capabilities with LLMs:

  • Off-the-shelf performance: Out-of-the-box instruction-tuned LLMs perform poorly on FIM tasks, even with post-processing. This suggests that general instruction following is not sufficient for effective FIM code generation.
  • Impact of SFT: Supervised fine-tuning significantly improves FIM performance for both base and instruction-tuned models. This demonstrates that lightweight, task-specific fine-tuning is crucial and can achieve substantial gains without requiring expensive pre-training from scratch.
  • Base vs. Instruct after SFT: Fine-tuning instruction-following models tends to yield slightly better average performance compared to fine-tuning base models. This suggests that the initial instruction-tuning provides a beneficial foundation for FIM.
  • Necessity of Post-processing: The need for post-processing depends critically on the type of FIM task:
    • For FIM tasks where the middle section comprises complete lines (e.g., HumanEval Single-line and Multi-line, and SAFIM tasks where the middle spans entire blocks or expressions), the raw outputs of the fine-tuned models perform as well as or better than outputs subjected to standard (often truncation-based) post-processing rules. This implies that for these task types, fine-tuned models can generate outputs that seamlessly fit the context, and forced truncation is detrimental.
    • For FIM tasks involving partial lines or random spans (e.g., HumanEval Random-span), post-processing is still necessary and improves performance. The required post-processing primarily involves removing overlapping code segments between the generated middle and the surrounding prefix/suffix.
  • SFT Data and Training: The authors found that generating approximately 1 million samples (one sample per source function) was sufficient, and generating more samples from the same function (5M total in an initial experiment) did not significantly improve performance. They also noted that training for more than roughly one epoch degraded performance, suggesting that for this type of FIM task and dataset, careful control over training duration is important.

Practical Implementation:

Based on these findings, practitioners implementing FIM code generation with LLMs should:

  1. Start with a fine-tuned model: Do not rely solely on off-the-shelf instruction-tuned models for FIM. Implement or utilize a model that has undergone specific supervised fine-tuning for FIM tasks.
  2. Consider SFT Data Generation: Synthetic data generation using a larger LLM, as described in the paper, is a practical approach for creating FIM training data. Focus on diverse source code rather than generating multiple FIM splits from the same function.
  3. Conditional Post-processing: Implement post-processing logic that is conditional on the type of FIM task. For tasks targeting complete lines or blocks, trust the raw output of a well-fine-tuned model. For tasks involving arbitrary spans or partial lines, apply post-processing to remove overlaps between the generated middle and the prefix/suffix.
  4. Overlap Removal Implementation: Overlap removal can be implemented by checking for suffixes of the prefix that match prefixes of the generated middle, and similarly for prefixes of the suffix that match suffixes of the generated middle. The paper provides Python examples of such post-processing functions for HumanEval tasks (Figure 5); an illustrative version follows below.

def remove_overlap_prefix_middle(prefix, middle):
    """Removes overlap between the end of prefix and start of middle."""
    prefix_len = len(prefix)
    middle_len = len(middle)
    for i in range(min(prefix_len, middle_len), 0, -1):
        if middle.startswith(prefix[-i:]):
            return middle[i:]
    return middle

def remove_overlap_middle_suffix(middle, suffix):
    """Removes overlap between the end of middle and start of suffix."""
    suffix_len = len(suffix)
    middle_len = len(middle)
    for i in range(min(middle_len, suffix_len), 0, -1):
        if middle.endswith(suffix[:i]):
            return middle[:-i]
    return middle

def apply_random_span_postprocessing(completion, prefix, suffix):
    """Applies overlap removal for random-span FIM: trims any text the model
    re-generated from the end of the prefix or the start of the suffix."""
    completion = remove_overlap_prefix_middle(prefix, completion)
    completion = remove_overlap_middle_suffix(completion, suffix)
    return completion
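
A short, hypothetical usage sketch tying these helpers to the task-conditional recommendation above; the task-type label and the example snippet are illustrative, not taken from the paper.

def postprocess_fim_output(completion, prefix, suffix, task_type):
    """Keep the raw output for line/block-level FIM; apply overlap removal
    only for random-span FIM, per the paper's findings."""
    if task_type == "random_span":
        return apply_random_span_postprocessing(completion, prefix, suffix)
    return completion

# Example: the model re-generated " h" from the start of the suffix.
prefix = "def area(w, h):\n    retu"
suffix = " h\n"
raw = "rn w * h"
clean = postprocess_fim_output(raw, prefix, suffix, task_type="random_span")
print(repr(clean))              # 'rn w *'
print(prefix + clean + suffix)  # reconstructs the complete function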

The paper's findings suggest that while FIM fine-tuning is essential, it can significantly reduce the need for complex, heuristic-based post-processing, particularly when the task involves completing full code lines. The remaining need for post-processing in random-span FIM points to a current limitation in models' ability to perfectly adhere to arbitrary span boundaries without explicit instruction or fine-tuning on that specific boundary type.

Limitations noted by the authors include the focus primarily on Python, the reliance on synthetically generated training data, and evaluation on specific benchmarks which may not encompass all real-world FIM complexities. Future work could explore cross-language generalization, alternative data sources (e.g., human edits), and evaluation on more diverse, realistic FIM scenarios.