PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code

Published 11 Dec 2025 in cs.SE and cs.AI | (2512.10713v1)

Abstract: LLM-based code assistants have emerged as a powerful application of generative AI, demonstrating impressive capabilities in code generation and comprehension. A key requirement for these systems is their ability to accurately follow user instructions. We present Precise Automatically Checked Instruction Following In Code (PACIFIC), a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities in LLMs, while allowing control over benchmark difficulty. PACIFIC produces benchmark variants with clearly defined expected outputs, enabling straightforward and reliable evaluation through simple output comparisons. In contrast to existing approaches that often rely on tool usage or agentic behavior, our work isolates and evaluates the LLM's intrinsic ability to reason through code behavior step-by-step without execution (dry running) and to follow instructions. Furthermore, our framework mitigates training data contamination by facilitating effortless generation of novel benchmark variations. We validate our framework by generating a suite of benchmarks spanning a range of difficulty levels and evaluating multiple state-of-the-art LLMs. Our results demonstrate that PACIFIC can produce increasingly challenging benchmarks that effectively differentiate instruction-following and dry running capabilities, even among advanced models. Overall, our framework offers a scalable, contamination-resilient methodology for assessing core competencies of LLMs in code-related tasks.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper proposes PACIFIC, a framework that deterministically generates contamination-resistant benchmarks to evaluate LLMs' compositional instruction following in code.
It employs a modular design with controllable instruction counts and output lengths to stress-test stepwise reasoning and strict format adherence.
Experimental results reveal significant accuracy degradation on complex benchmarks, highlighting gaps in current multi-step dry-run and reasoning capabilities.

PACIFIC: A Deterministic, Contamination-Resistant Benchmark Generation Framework for Code Instruction-Following Evaluation

Motivation and Core Design Principles

PACIFIC addresses fundamental deficiencies of existing benchmarks for evaluating LLM instruction-following in code: reliance on LLM-based evaluators, lack of deterministic evaluation, weak control over benchmark difficulty, and contamination risk. The framework is designed with four guiding principles: simple and deterministic evaluation, difficulty control, scalability and contamination resistance, and modular extensibility. All benchmarks are generated automatically, evaluated via straightforward output comparison against expected results, and can be scaled or adapted with new instructions, programming languages, or presentation formats.

PACIFIC does not require execution engines, external tools, or LLM-based judges. The evaluation focuses on the LLM's intrinsic capability to dry-run code and precisely follow sequenced instructions, providing a transparent methodology for isolating core reasoning skills from tool-augmented performance.

Figure 1: High-level architecture of PACIFIC, depicting end-to-end automated and deterministic evaluation pipeline from instruction pool and parameterized benchmark generation through to metric extraction.

Benchmark and Task Construction

The atomic element of a PACIFIC benchmark is a sample, which consists of an initial input (number or string) and a sequence of code instructions drawn from an extensible pool. Instructions are concatenated to form a compositional reasoning chain: each receives as input the previous instruction’s output, supporting four possible type transitions (number to number, number to string, string to string, string to number). Each instruction is formally defined, implemented, and validated in Python, Java, and C++, and expressed both as code and natural language.

To illustrate:

"next_perfect_square": Output the first perfect square greater than the input.
"shift_back": Subtract 1 from each character in the previous output, with wrap-around as in a Caesar cipher.
"abs_digit_name": Replace each digit of the absolute value of the previous number with an alphabetic name code, producing a string.

Samples may be presented as a single prompt (Prompt mode) or as a multi-turn dialogue (Chat mode). PACIFIC handles type consistency and output length constraints during sample construction, dynamically selecting instruction sequences to match the desired difficulty.

Figure 2: Structure of a benchmark sample—initial input and sequence of instructions forming a prompt for stepwise model processing.

Difficulty control is achieved through two orthogonal parameters:

Instruction count: As the chain length increases, error propagation and compositional reasoning demand escalate.
Target output length: Longer final outputs demand persistence of transformation consistency and increase format management complexity.

This two-dimensional parameterization allows systematic stress-testing of LLMs across the regime from trivial to extremely challenging compositional tasks, while modular design supports rapid expansion of instruction pools and supported languages.

Deterministic and Modular Evaluation Pipeline

Evaluation is rule-based and fully deterministic. For each instruction, the framework parses model outputs (using a strict tag-based format) and compares with reference outputs produced by executing the canonical code implementation. Prompt-Level Accuracy (PLA) measures the percentage of samples where all instructions are followed correctly; Instruction-Level Accuracy (ILA) reflects aggregate per-instruction correctness.

The framework categorizes errors into missing answers, type mismatches, and other format deviations. This structure reduces the evaluation cost to a minimum and employs no learned model in the evaluation loop, increasing transparency and reproducibility.

PACIFIC's contamination resistance is enabled via parameterized sample generation, representation diversity (code versus NL instructions, prompt versus chat, multiple programming languages), and random seed control. Thus, variants can be generated rapidly for new, uncontaminated evaluations whenever prior instance leakage is detected or suspected.

Empirical Evaluation and Findings

Difficulty Calibration and Model Degradation

A large-scale experiment includes 15 benchmarks over 5 instruction counts and 3 output lengths, with 99 samples per benchmark, in Python, Java, and C++. Evaluation of Claude-3.5-Haiku, Claude-4-Sonnet, GPT-OSS-120B, Llama-4-Maverick-17B-128E-Instruct, Qwen3-235B-A22B-Instruct, and Qwen3-Coder-480B-A35B-Instruct exposes pronounced accuracy degradation as task difficulty increases.

Figure 3: Prompt-Level Accuracy for all models across benchmark complexity parameters. Accuracy declines as instruction count and output length increase.

The loss in compositional instruction adherence is nonlinear, with many models dropping to 0% on the most challenging benchmarks. For example, on the hardest configuration, only one model attained 28% PLA, and the majority reached 0%. This characteristic distinguishes PACIFIC results from general instruction-following and code generation leaderboards, indicating alignment with a distinct reasoning capability.

Model Comparison

Aggregate model performance shows a clear stratification, yet collapses for all models at high difficulty. Notably, top performers in coding and traditional instruction-following (such as Qwen3-235B-A22B-Instruct) routinely underperform in PACIFIC, while gpt-oss-120b leads PACIFIC but lags on other leaderboards. This non-monotonic ranking highlights PACIFIC’s focus on compositional, step-level logical reasoning, rather than rote code generation or isolated instruction adherence.

Figure 4: Aggregate mean Prompt-Level Accuracy illustrating model stratification and overall low success rates on hard benchmarks.

Complementary error analysis shows that error rates, especially missing answers and format violations, increase rapidly with instruction count, yet output length exerts lesser influence on format errors. Models vary widely in their robustness to strict output formatting requirements.

Figure 5: Missing answer rates by instruction count, highlighting format and generation fragility with increased compositional load.

Instruction-Level Analysis

Analysis of individual instruction accuracy reveals that simple atomic instructions (e.g., "to_roman", "next_perfect_square") typically exceed 85% accuracy, possibly due to memorization effects. However, "abs_digit_name", "shift_back", and "replace_vowels_with_gh", although computationally trivial for humans, are challenging for LLMs, especially when composed. Even high instruction-level accuracy on isolated tasks fails to yield robust chain execution in complex benchmarks.

Figure 6: Instruction-level accuracy for representative “easy” and “hard” instructions, showing a contrast in atomic versus compositional performance.

Practical and Theoretical Implications

PACIFIC introduces a new standard for deterministic, contamination-resilient, difficulty-controllable evaluation of LLMs' instruction-following and code reasoning. The empirical finding that state-of-the-art LLMs degrade to near-zero performance on complex, but formal and deterministic, chains underscores a major gap in compositional skill acquisition in current architectures. This gap is masked by conventional code-generation and instruction-following tasks, which rarely impose stringent compositional structure or embargo agentic tool use.

By isolating dry-running (mental simulation of code) and stepwise instruction-following, PACIFIC offers a direct probe into the reasoning stack of LLMs without the confounders of black-box execution, LLM-based evaluation, or tool-augmented environments. The framework is robust against contamination and bias, extensible for new task classes and languages, and thus readily applicable to benchmark future foundation models as well as LLM-based agents with increasingly sophisticated compositional abilities.

The results suggest several research directions:

Advancements in compositional generalization—architectural and training interventions are necessary to address multi-step instruction-following skill deficiency.
Integration of PACIFIC into continual evaluation or red-teaming will improve traceability and transparency in model deployment for code assistant applications.
Analysis of systematic error modes in compositional tasks may illuminate factors contributing to error propagation and formatting brittleness.
Extension to additional data types, domains, and reasoning modalities will further assess the universality and limits of current LLMs’ dry-running capacities.

Conclusion

PACIFIC advances the methodology for evaluating LLMs’ core instruction-following and code simulation competencies via principled, deterministic benchmarks. It provides strong difficulty control, intrinsic contamination resistance, and modular extensibility, enabling rigorous isolation of compositional failures in contemporary models. Experimental results demonstrate a severe and persistent gap in dry-running and multi-step instruction adherence, motivating future algorithmic and architectural innovations. PACIFIC will serve as a critical evaluation resource for the development of robust, transparent, and trustworthy LLM-based software engineering assistants.

Reference:

PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code (2512.10713)

Markdown Report Issue