
Programming Problem Merging (PPM)

Updated 27 November 2025
  • Programming Problem Merging (PPM) is a method that applies systematic lambda-based transformations to seed problems, producing diverse and challenging benchmarks.
  • PPM employs both Pure Value and Type-Aware Value Transformations to modify outputs and types, ensuring enriched task semantics and type compatibility.
  • Experimental evaluations on HumanEval and MBPP datasets demonstrate PPM’s effectiveness by significantly reducing Pass@1 scores while maintaining high naturalness and diversity.

Programming Problem Merging (PPM) is a methodology for generating diverse, challenging, and natural programming problems by algorithmically transforming seed problems through systematic post-processing of their return values and task descriptions. Initially developed to benchmark Large Code Generation Models (LCGMs), PPM produces merged problems whose canonical solutions and interaction patterns differ semantically from their seeds, far surpassing baseline perturbation approaches in diversity and effective benchmarking power (Chen et al., 28 Jan 2024).

1. Formal Framework for Programming Problem Merging

A seed programming problem is formalized as $P = \langle p, S, T \rangle$, where $p = \langle f, t, d \rangle$ consists of the function signature $f$, a natural-language task description $t$, and a set of I/O demonstrations $d = \{(x_i, y_i)\}$. $S$ denotes the canonical solution and $T$ is the set of test inputs. PPM operates by composing $S$ with a function $\lambda$ selected from a curated library $\Lambda$ of "lambda programming problems", each paired as $(\varphi, \lambda)$, where $\varphi$ is a natural-language description of a value transformation and $\lambda$ is the corresponding code operator.

For a given problem $P$ whose canonical solution $S$ maps test inputs $x \in T$ to outputs $v = S(x)$, PPM constructs a merged problem $P_{\text{new}} = \langle (f,\, t' = t \Vert \varphi,\, d'),\, S',\, T \rangle$, where $S' = \lambda \circ S$ and the new demonstrations are $d' = \{(x_i, \lambda(S(x_i)))\}$. Constraints ensure type compatibility between $S$, $\lambda$, and $\varphi$, and operator randomness is enforced over a large domain to guarantee output uniqueness (Chen et al., 28 Jan 2024).
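A minimal sketch of this construction in Python (the names `SeedProblem` and `merge_problem` are illustrative, not from the paper; a single-argument solution is assumed for simplicity):

    from dataclasses import dataclass
    from typing import Any, Callable, List, Tuple

    @dataclass
    class SeedProblem:
        signature: str                    # f: function signature
        description: str                  # t: natural-language task
        demos: List[Tuple[Any, Any]]      # d: I/O demonstrations (x_i, y_i)
        solution: Callable[[Any], Any]    # S: canonical solution (one input assumed)
        test_inputs: List[Any]            # T: test inputs

    def merge_problem(seed: SeedProblem, phi: str,
                      lam: Callable[[Any], Any]) -> SeedProblem:
        """Compose a lambda problem (phi, lam) with a seed problem."""
        new_solution = lambda x: lam(seed.solution(x))           # S' = lambda ∘ S
        new_demos = [(x, new_solution(x)) for x, _ in seed.demos]
        return SeedProblem(
            signature=seed.signature,     # return type may change under PPM-T
            description=seed.description + " " + phi,            # t' = t ∥ phi
            demos=new_demos,
            solution=new_solution,
            test_inputs=seed.test_inputs, # T is reused unchanged
        )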

2. Metamorphic Operator Classes and the PPM Workflow

Two archetypes of post-processing are presented:

  • Pure Value Transformation (PPM-V): Operators that modify solution outputs while preserving their type, for instance by applying an integer offset or negating booleans. Example: $\lambda(v) = v + \theta$, $\theta \sim \text{Uniform}([-100, 100])$.
  • Type-Aware Value Transformation (PPM-T): Operators that map between types, such as int $\rightarrow$ string or string $\rightarrow$ boolean. The operator is coupled with an updated task description to reflect the new type and semantics, and randomization is similarly included (see the operator sketch after this list).
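For concreteness, hedged examples of both operator classes might look like the following (the $\varphi$ strings and the specific operators are illustrative):

    import random

    # PPM-V: type-preserving integer offset; the paper samples
    # theta ~ Uniform([-100, 100])
    theta = random.randint(-100, 100)
    phi_v = f"Then add {theta} to the result."
    lam_v = lambda v: v + theta              # int -> int

    # PPM-T: type-changing operators paired with matching description updates
    phi_str = "Then convert the result to a string."
    lam_str = lambda v: str(v)               # int -> str

    phi_bool = "Then return True if the result is even, else False."
    lam_bool = lambda v: v % 2 == 0          # int -> bool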

Both variants follow a unified three-step pipeline, sketched in code after the list:

  1. Return-Value Type Analysis: Execute $S$ on $T$ and recursively abstract the output tokens' types to identify all output types present.
  2. Metamorphic Operator Selection: Choose a type-compatible operator $\lambda$ and sample any necessary random parameters.
  3. Prompt and Solution Construction: Update the task description by concatenating $\varphi$, update the I/O demos using $\lambda$, and define the new solution as $S' = \lambda \circ S$.
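Building on the `SeedProblem` sketch above, the pipeline might be wired together as follows (the `(input_type, phi, lam)` library layout is an assumption for illustration):

    import random
    from typing import Any, Callable, List, Tuple

    # assumed library layout: (input_type, phi, lam) triples
    LambdaLibrary = List[Tuple[type, str, Callable[[Any], Any]]]

    def analyze_return_type(seed: SeedProblem) -> type:
        """Step 1: run S on T and collect the output type (the paper
        abstracts types recursively; a flat check is shown here)."""
        types = {type(seed.solution(x)) for x in seed.test_inputs}
        assert len(types) == 1, "seed must have a single output type"
        return types.pop()

    def select_operator(out_type: type,
                        library: LambdaLibrary) -> Tuple[str, Callable]:
        """Step 2: sample a type-compatible (phi, lambda) pair at random."""
        candidates = [(phi, lam) for t, phi, lam in library if t is out_type]
        return random.choice(candidates)

    def ppm_merge(seed: SeedProblem, library: LambdaLibrary) -> SeedProblem:
        """Step 3: build t', d', and S' via merge_problem above."""
        phi, lam = select_operator(analyze_return_type(seed), library)
        return merge_problem(seed, phi, lam)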

The computational cost is dominated by evaluating $S$ on $T$ ($O(|T| \cdot \text{cost}(S))$), while operator selection and prompt augmentation are $O(1)$ tasks (Chen et al., 28 Jan 2024).

3. Comparative Evaluation and Benchmarking Protocol

PPM was evaluated on two datasets, HumanEval (164 Python problems) and MBPP-Sanitized (427 Python problems), and compared against nine prompt-perturbation baselines (including demo addition or deletion, token/character mutation, function-name changes, and comment modifications). Eight LCGMs were tested, including the CodeGen, InCoder, SantaCoder, and PolyCoder series. For every problem-method pair, $n = 100$ candidates were sampled from each model and evaluated on three main axes: diversity, effectiveness, and naturalness (Chen et al., 28 Jan 2024).

A summary of the metrics:

| Metric | Role | Mathematical Definition / Criterion |
|---|---|---|
| BLEU-4 | Prompt diversity | Lower = more diverse than the original prompt |
| SemSim | Semantic similarity | Lower = greater semantic change in embedding space |
| DiffImp | Implementation change | Fraction of merged problems with a structurally new $S'$ |
| Pass@k | Generation challenge | Probability that at least one of $k$ candidates passes all tests |
| Perplexity | Naturalness | Lower = more natural under GPT-2 |
| IDE Warnings | Naturalness | Fewer = more natural in PyCharm |
| Human Score | Realism | Mean 0/0.5/1 rating over five judges |

PPM approaches uniquely produce DiffImp $= 1.0$, indicating that every generated problem's solution is structurally distinct from its seed, a property attained by none of the nine baselines. On HumanEval and MBPP-Sanitized, BLEU-4 and SemSim analyses confirm substantially greater diversity for PPM-T (BLEU-4 $\approx 0.66/0.54$ vs. TokenMut $\approx 0.82/0.76$; SemSim $\approx 0.90$ vs. TokenMut $\approx 0.96$).

Effectiveness is measured by Pass@k: PPM-V and PPM-T cause a $\sim$75–95% drop in Pass@1 across all eight code generation models, in stark contrast to the baselines, which typically cause a $\leq$15% decrease or even improve scores. Results are statistically significant ($p < 0.01$, paired $t$-test) (Chen et al., 28 Jan 2024).
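For reference, Pass@k is conventionally computed with the standard unbiased estimator used in HumanEval-style evaluation; a minimal sketch for $n$ sampled candidates of which $c$ pass:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Probability that at least one of k candidates drawn from
        n samples, c of which are correct, passes all tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 100 samples per problem (as above), 5 correct, k = 1 -> 0.05
    print(pass_at_k(100, 5, 1))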

4. Qualitative Characterization and Example

The effect of PPM transformations is illustrated by the following representative HumanEval case:

Seed:

  • Signature: def foo(nums: List[int]) -> int
  • Task: "Return the sum of odd elements in the list."
  • Demos: ([1,2,3], 4), ([10,5,7], 12)
  • Solution: return sum(x for x in nums if x%2==1)

With PPM-T (int $\rightarrow$ string with offset $\theta = 5$):

  • Augmented description: "Return the sum of odd elements in the list. Then convert that sum+5 to a string."
  • Signature: def foo(nums: List[int]) -> str
  • Demos: ([1,2,3], "9"), ([10,5,7], "17")
  • Solution:

    def foo(nums):
        s = sum(x for x in nums if x % 2 == 1)
        return str(s + 5)

The test cases $T$ are unchanged; correctness is validated against $\lambda(S(x))$ (Chen et al., 28 Jan 2024).
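A quick self-check of this example (the function names are illustrative):

    def seed_foo(nums):
        # seed solution S
        return sum(x for x in nums if x % 2 == 1)

    def merged_foo(nums):
        # merged solution S' = lambda ∘ S with lambda(v) = str(v + 5)
        return str(seed_foo(nums) + 5)

    # the demos follow from the composition: [1,2,3] -> "9", [10,5,7] -> "17"
    for x, expected in [([1, 2, 3], "9"), ([10, 5, 7], "17")]:
        assert merged_foo(x) == expected == str(seed_foo(x) + 5)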

5. Insights from Experimental Results

PPM produces merged problems that:

  • Achieve DiffImp $= 1.0$, indicating that every merged solution is semantically and structurally distinct from its seed,
  • Cause a consistent 75–95% reduction in Pass@1 for prominent LCGMs on both HumanEval and MBPP (e.g., CodeGen-2B Pass@1 decreases from $\approx 0.20$ to $\approx 0.01$),
  • Retain high naturalness (PPM-T Perplexity $\approx 18.5$ vs. TokenMut $\approx 35$),
  • Yield considerably fewer IDE warnings (typically 0–3 false positives) and higher human realism scores (PPM-T $\approx 0.92/0.90$) compared to baselines,
  • Exhibit stable performance across reruns (Pass@1 variance $< 0.003$ over five PPM-T runs).

No accidental leakage or redundancy was observed: after 100 random runs, ~70% of merged instances remained distinct, and increasing the offset ranges in the transformations further reduces duplicates (Chen et al., 28 Jan 2024).

6. Limitations, Threats to Validity, and Future Directions

Potential threats include the absence of an absolute ground truth for naturalness (mitigated with perplexity, IDE, and human-panel metrics) and the restriction to Python, as results may differ for languages with strong static typing. Random parameterization introduces stochasticity, though empirical distinctness is consistently high. The current operator library is hand-engineered and restricted to basic types.

Planned extensions include expanding $\Lambda$ with more expressive post-processing patterns, integrating automatic discovery of $\lambda$ fragments from corpora, supporting composite and custom types, co-varying test sets with the transformations, and combining PPM with broader robustness frameworks for multi-axis benchmarking (Chen et al., 28 Jan 2024).

7. Relation to Parallel Merging Algorithms

The term "Programming Problem Merging" is contextually distinct from parallel merging algorithms developed for high-performance computing. In earlier literature, PPM referred to simplified, stable parallel merging algorithms utilizing block partitioning, binary search-based cross-ranking, and stable sequential merges—developed for balanced, synchronization-efficient merging of sorted sequences (A,BA, B) into CC using pp processors. Notably, both the simplified parallel merge (Träff, 2012) and the optimal load-balanced parallel merge based on co-ranking (Siebert et al., 2013) achieve O(np+logn)O\bigl(\frac{n}{p}+\log n\bigr) time, perfect stability, and minimal synchronization. However, these methods are algorithmic solutions for array merging and do not address code generation or problem diversity (Träff, 2012, Siebert et al., 2013). The recent PPM framework in code generation benchmarking is unrelated in methodology and application, sharing only the name.
