
RefineBench: Multi-Domain Benchmark Suite

Updated 1 December 2025
  • RefineBench is a set of benchmark frameworks that rigorously evaluate language model refinement, industrial-scale optimization, and software artifact assessment.
  • It utilizes innovative methodologies such as guided and self-refinement protocols, delta-base modeling for realistic yield adjustments, and controlled hierarchical output degradation.
  • Empirical findings demonstrate rapid performance gains with guided feedback while highlighting challenges in autonomous error detection and complex nonconvex optimization.

RefineBench refers to multiple, independently developed benchmark frameworks, each targeting a distinct technical domain: (1) LLM refinement capability via checklist-guided multi-turn evaluation, (2) industrial-scale refinery–petrochemical production planning, and (3) fine-grained validation of LLM-based evaluators for code and artifact quality. Each RefineBench implementation advances rigorous, reproducible research in benchmarking and evaluation, offering unique methodologies tailored to its field. The following sections provide a detailed account of each RefineBench instantiation, their architectures, evaluation protocols, and empirical insights.

1. RefineBench for LLM Response Refinement (Checklist-Based Multi-Turn Evaluation)

RefineBench, as introduced by ["RefineBench: Evaluating Refinement Capability of LLMs via Checklists" (Lee et al., 27 Nov 2025)], is a multi-domain benchmark designed to quantify LLMs' capacity to refine their responses across iterative turns. The benchmark addresses a gap: users increasingly ask models to revise their outputs, either in response to explicit feedback (guided refinement) or through self-directed reflection (self-refinement), yet prior benchmarks do not systematically cover such scenarios, particularly for open-ended or multi-faceted tasks.

1.1 Scope and Structure

RefineBench comprises 1,000 rigorously curated problems spanning 11 domains: Mathematics, Statistics, Physics, Chemistry, Computer Science/AI, Biology/Medicine, Economics/Business, Engineering, Law, Humanities/Social Science, and Other. Each problem includes:

  • A prompt or passage (as needed)
  • A question
  • One or more reference answers
  • A human-verified checklist of binary evaluation items (average: 9.9 per problem)

Checklists decompose problem criteria into discrete, minimally overlapping facets, facilitating fine-grained analysis of which aspects of an answer are correct, and which persist as errors after refinement steps.
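
To make the format concrete, the following is a minimal sketch of how one problem record might look; the field names and example content are illustrative assumptions rather than the released schema.

```python
# A hypothetical representation of one RefineBench problem record.
# Field names are illustrative; the released dataset may use a different schema.
problem = {
    "domain": "Statistics",
    "passage": None,  # optional supporting passage
    "question": "Derive the MLE of the rate parameter of an exponential distribution.",
    "reference_answers": ["lambda_hat = n / sum(x_i)"],
    "checklist": [  # binary, minimally overlapping criteria
        "States the likelihood function correctly",
        "Takes the log-likelihood before differentiating",
        "Sets the derivative to zero and solves for lambda",
        "Verifies the stationary point is a maximum",
    ],
}
```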

1.2 Evaluation Metrics

Given a response $r$ and checklist $C = \{c_1, \ldots, c_N\}$, RefineBench defines:

  • Checklist score: $\mathrm{Score}(r) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{r \text{ satisfies } c_i\}$
  • Pass metric: $\mathrm{Pass}(r) = 1$ if $\mathrm{Score}(r) = 1$ (all checklist items satisfied); 0 otherwise
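
A minimal sketch of both metrics, assuming per-item checklist judgments are already available as booleans (how each item is judged, e.g. by an LLM grader, is outside this snippet).

```python
def checklist_score(judgments: list[bool]) -> float:
    """Score(r): fraction of checklist items the response satisfies."""
    return sum(judgments) / len(judgments)

def pass_metric(judgments: list[bool]) -> int:
    """Pass(r): 1 only if every checklist item is satisfied, else 0."""
    return int(all(judgments))

# Example: 8 of 10 items satisfied -> Score = 0.8, Pass = 0
judgments = [True] * 8 + [False] * 2
print(checklist_score(judgments), pass_metric(judgments))
```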

1.3 Refinement Protocols

  • Guided Refinement: At each turn, the model receives the prior answer, the original query, and natural language feedback specifying which checklist items failed. Iteration continues up to 5 turns or until the model self-reports "[TERMINATE]" (a loop sketch follows this list).
  • Self-Refinement: The model receives no explicit feedback but is prompted to introspect and revise if further improvement is possible, terminating as needed.
  • Partial Guided Refinement: Only a subset of failing checklist items is disclosed at each turn, simulating ambiguous or incomplete user feedback.
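
The guided protocol can be read as a simple feedback loop. The sketch below is a schematic under the assumptions above; `model`, `judge_checklist`, and the feedback wording are hypothetical stand-ins, not the benchmark's actual harness or prompts.

```python
MAX_TURNS = 5

def guided_refinement(model, problem, judge_checklist):
    """Run checklist-guided refinement for up to MAX_TURNS turns.

    `model(prompt)` returns a text answer and `judge_checklist(answer, problem)`
    returns one boolean per checklist item; both are placeholders for the
    benchmark's real harness and judging prompts.
    """
    answer = model(problem["question"])
    for _ in range(MAX_TURNS):
        judgments = judge_checklist(answer, problem)
        if all(judgments):
            break  # every checklist item satisfied; refinement complete
        failed = [c for c, ok in zip(problem["checklist"], judgments) if not ok]
        feedback = "Your answer does not yet satisfy:\n- " + "\n- ".join(failed)
        reply = model(
            f"{problem['question']}\n\nYour previous answer:\n{answer}\n\n{feedback}\n"
            "Revise your answer, or reply [TERMINATE] if you believe no further "
            "improvement is possible."
        )
        if reply.strip() == "[TERMINATE]":
            break  # model chose to stop; keep the last answer
        answer = reply
    return answer
```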

1.4 Empirical Findings

Empirical assessment of 34 leading LMs (including proprietary models such as GPT-4.1, GPT-5, Gemini-2.5-Pro, Claude-Opus-4.1, and large open-weight models like Qwen3, DeepSeek-R1, LLaMA-3.1 series) yielded the following:

  • Self-refinement delivers limited gain: e.g., Gemini-2.5-Pro climbs from a 29.5% to a 31.3% pass rate (+1.8 points) after five turns; DeepSeek-R1 sees marginal or even negative change. Most models plateau below 30–32% in this regime.
  • Guided refinement enables rapid convergence: Proprietary and large open-weight LMs (>70B) escalate pass rates from approximately 20–30% to above 90% in five turns (e.g., GPT-4.1 from 23.4% to 95.5%; Claude-Opus-4.1 from 18.7% to 98.4%).
  • Smaller open-weight LMs (<8B) improve under guidance but remain capped below 33%.

This suggests that the bottleneck in current LLMs is error identification, not error repair; when checklist failures are provided, models can rapidly close gaps. Without such feedback, models are unable to reliably self-locate missing elements or decide when to terminate.

1.5 Failure Modes and Analysis

  • LMs readily apply corrections when failing checklist items are explicitly identified, but struggle to autonomously discover checklist gaps.
  • Overconfidence and premature self-termination lead to suboptimal self-refinement; even with unresolved errors, models often halt after 3–4 steps.
  • In-depth analysis reveals a collapse in chain-of-thought rigor after initial self-refinement iterations, particularly in symbolic domains.

Key open challenges include automated mapping from output to error localization (proxying the checklist), joint training for self-critique, confidence-driven stopping criteria, and robust multi-turn learning to prevent drift and reasoning degradation.

2. RefineBench in Industrial Production Planning (Refinery–Petrochemical Complexes)

The "RefineBench" described in ["A production planning benchmark for real-world refinery-petrochemical complexes" (Du et al., 28 Mar 2025)] is an open-source, demand-driven benchmark for optimization in industrial-scale refinery–petrochemical complex planning. It specifically addresses the lack of transparent, realistic, and reproducible large-scale models in the literature, aiming to catalyze advances in LP, MILP, MINLP, and global optimization research.

2.1 Problem Formulation and Mathematical Model

  • Scope: Integrated multi-period, multi-product production planning across refinery and chemical processing networks.
  • Variables: Includes mass flows (FVI, FVO, FVM), tracked properties (FQ), adjusted yields for delta-base units ($\Gamma$), inventories (FVLI/FVLO, $L_{s,t}$), and binary inventory flags ($X_{s,t}$).
  • Constraints: Encodes material balances, flexible stream-port connectivity, swing-cut distillation, fixed and delta-base secondary processing, mixers, splitters, blenders, proportional blending, inventory and logical constraints, and capacity bounds.
  • Objective: Maximize profit over the planning horizon, trading off purchase costs, sales revenue, inventory costs, and penalty terms.

RefineBench features a port–stream hybrid superstructure that maintains both flexible modular connectivity and explicit spatial decomposition needed for efficient partitioning and algorithmic development.
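
For orientation, the sketch below writes a toy single-period, fixed-yield version of this model class in Pyomo: one purchased feed, one conversion unit, two saleable products, and a profit objective. All names, yields, prices, and capacities are invented for illustration; the benchmark's actual models are far larger and add the swing-cut, delta-base, pooling, and integer features listed above.

```python
# Toy LP sketch of the model class RefineBench targets (illustrative data only).
from pyomo.environ import (ConcreteModel, Var, Constraint, Objective,
                           NonNegativeReals, maximize)

m = ConcreteModel()
products = ["gasoline", "diesel"]
yields = {"gasoline": 0.45, "diesel": 0.35}   # illustrative fixed yields
prices = {"gasoline": 95.0, "diesel": 88.0}   # illustrative sale prices
crude_cost, crude_cap = 70.0, 100.0           # illustrative cost and capacity

m.crude = Var(domain=NonNegativeReals, bounds=(0, crude_cap))  # purchased feed
m.prod = Var(products, domain=NonNegativeReals)                # unit outputs

# Fixed-yield material balance for each product stream
m.balance = Constraint(products, rule=lambda m, p: m.prod[p] == yields[p] * m.crude)

# Profit = sales revenue - feed purchase cost
m.profit = Objective(
    expr=sum(prices[p] * m.prod[p] for p in products) - crude_cost * m.crude,
    sense=maximize,
)
# The model can then be passed to any LP/MINLP solver, e.g. SolverFactory("glpk").solve(m)
```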

2.2 Delta-Base Modeling

Delta-base units capture first-order yield sensitivity to feedstock properties based on linear regression from plant historical data. The model for delta-base yield adjustment is:

\Gamma_{u,m,s,t} - \gamma_{u,m,s} = \sum_{(s',q)} \frac{FQ_{s',q,t} - B_{u,m,q}}{\Delta_{u,m,q}} \delta_{u,m,s,q}

This enables realistic representation of process variability and ensures feasible solutions in downstream scheduling, as confirmed by ablation studies.
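
Read procedurally, the relation says the adjusted yield equals the base yield plus property-deviation corrections. The sketch below evaluates it for a single unit, mode, stream, and period; the property names and numbers are purely illustrative.

```python
def delta_base_yield(gamma_base, feed_props, base_props, delta_scale, delta_coef):
    """Adjusted yield Gamma = gamma + sum over (feed stream, property) of
    ((FQ - B) / Delta) * delta, following the delta-base relation above."""
    return gamma_base + sum(
        (feed_props[key] - base_props[key]) / delta_scale[key] * delta_coef[key]
        for key in delta_coef
    )

# Illustrative numbers: yield rises with feed API gravity, falls with sulfur
gamma_base  = 0.42
feed_props  = {"api_gravity": 33.0, "sulfur_wt": 1.2}      # FQ_{s',q,t}
base_props  = {"api_gravity": 31.0, "sulfur_wt": 1.0}      # B_{u,m,q}
delta_scale = {"api_gravity": 1.0,  "sulfur_wt": 0.1}      # Delta_{u,m,q}
delta_coef  = {"api_gravity": 0.004, "sulfur_wt": -0.002}  # delta_{u,m,s,q}
print(delta_base_yield(gamma_base, feed_props, base_props, delta_scale, delta_coef))
# -> 0.424
```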

2.3 Benchmark Cases and Scale

Three scenarios are supplied:

| Case | Network | Time Periods | Integer Vars | Nonlinear Terms | Model Size |
|------|---------|--------------|--------------|-----------------|------------|
| 1 | Stand-alone refinery | 1 | 0 | 2,082 | 3,573 vars, 3,428 cons. |
| 2 | Integrated refinery-chemical | 1 | 56 | 4,294 | 7,157 vars, 8,156 cons. |
| 3 | Integrated, multi-period | 3 | 168 | 12,882 | 21,469 vars, 24,466 cons. |

All model data, code, and documentation are released openly.

2.4 Computational Performance

Tests with advanced solvers (ANTIGONE, BARON) show:

  • Duality gaps are difficult to close even for the smallest case (Case 1: 9.9% with BARON, 17.8% with ANTIGONE).
  • Integer variables and nonlinear pooling/delta terms severely impair branch-and-bound progress.
  • Fixed-yield ablation demonstrates that omitting delta-base modeling results in less realistic solutions that violate downstream scheduling requirements.

2.5 Application Domains and Limitations

Potential research enabled by RefineBench includes:

  • Development and benchmarking of global MINLP algorithms, decomposition schemes, network-graph infeasibility diagnostics, and robust planning under uncertainty.
  • Limitations: substantial model scale and nonconvexity, confidentiality-driven data perturbations, and abstraction from detailed scheduling dynamics.

RefineBench provides the first open, reproducible benchmark suitable for realistic, industrial integrated refinery–petrochemical planning.

3. RefineBench for Fine-Grained LLM Evaluator Validation

In ["Automated Validation of LLM-based Evaluators for Software Engineering Artifacts" (Fandina et al., 4 Aug 2025)], "RefineBench" (REFINE\textbf{REFINE}) refers to an automated framework for benchmarking and validation of LLM-based evaluators ("LLM-as-a-judge") in software engineering, particularly targeting tasks requiring nuanced ranking or assessment of code artifacts.

3.1 Architectural Overview

REFINE comprises two principal modules:

  • Hierarchy Dataset Builder: For each input $x$, synthesizes a hierarchy $\{o_1, o_2, \ldots, o_k\}$ of outputs ordered by latent artifact quality, $s(o_1) > s(o_2) > \cdots > s(o_k)$. Techniques include:

    • Reduced-capacity generation (varied LLM sizes, e.g., llama-3-70B, 3B, 1B)
    • DeQrease Decoder: custom decoding restricting to top-$K$ tokens and reweighting for controlled degradation (parameters: prefix fraction $p$, top-$K$, temperature $t$). Precise generator pseudocode is provided in the source.
    • Domain-aware error injection under LLM control (e.g., off-by-one errors, API misuse).

    Two-way LLM filtering enforces correct quality ordering by averaging scores from both forward and reverse high-resolution evaluators, discarding inconsistent hierarchies.

  • Evaluator Tester: Quantifies the quality of a candidate evaluator $E$ by computing an Alignment Score:

\alpha_E(x) = \frac{1}{\binom{k}{2}} \sum_{1 \leq u < v \leq k} \mathbf{1}[s_E(x,o_u) > s_E(x,o_v)]

The benchmark alignment metric is $\mathrm{Alignment}(E) = \frac{1}{n} \sum_{x \in \mathcal{X}} \alpha_E(x)$, with optional reporting of Spearman $\rho$ and Kendall $\tau$ for supplementary insight.
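
A minimal sketch of the alignment computation, assuming the candidate evaluator's scores are listed from intended-best to intended-worst and that ties count as disagreements (one possible convention; the source defines the exact handling).

```python
from itertools import combinations

def alignment_score(evaluator_scores: list[float]) -> float:
    """alpha_E(x): fraction of hierarchy pairs (u < v) for which the evaluator
    ranks the higher-quality output strictly above the lower-quality one.
    `evaluator_scores` is ordered best-first by intended quality."""
    pairs = list(combinations(range(len(evaluator_scores)), 2))
    agree = sum(evaluator_scores[u] > evaluator_scores[v] for u, v in pairs)
    return agree / len(pairs)

def benchmark_alignment(per_input_scores: list[list[float]]) -> float:
    """Alignment(E): mean of alpha_E(x) over all benchmark inputs x."""
    return sum(alignment_score(s) for s in per_input_scores) / len(per_input_scores)

# Example: a 4-level hierarchy where the evaluator swaps the middle two levels
print(alignment_score([0.9, 0.6, 0.7, 0.2]))  # 5 of 6 pairs correct -> ~0.833
```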

3.2 Granularity Control

REFINE enables precise control of degradation granularity, defined as the score gap between adjacent hierarchy levels:

\Delta_i = \mathrm{avg\_score}(o_{i-1}) - \mathrm{avg\_score}(o_i)

By tuning $(p_i, K_i, t_i)$ per hierarchy level, users can target anything from coarse to fine degradation, systematically stress-testing evaluator sensitivity.
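
The per-level parameters can be read as knobs on a constrained decoder. The sketch below shows one plausible realization, assuming a top-K restriction plus temperature reweighting; it is an assumption-laden stand-in for the DeQrease decoder, whose exact pseudocode appears in the source.

```python
import numpy as np

# Illustrative per-level degradation settings (p_i, K_i, t_i); the real values
# and their interaction with the hierarchy builder are defined by REFINE itself.
LEVEL_PARAMS = [
    {"prefix_frac": 1.0, "top_k": 5,   "temperature": 0.7},   # level 1: mild degradation
    {"prefix_frac": 0.5, "top_k": 50,  "temperature": 1.2},   # level 2: moderate
    {"prefix_frac": 0.2, "top_k": 500, "temperature": 1.8},   # level 3: heavy
]

def degraded_sample(logits: np.ndarray, top_k: int, temperature: float,
                    rng: np.random.Generator | None = None) -> int:
    """Sample one token id restricted to the top-K logits after temperature scaling."""
    rng = rng or np.random.default_rng()
    top = np.argsort(logits)[-top_k:]        # indices of the K largest logits
    scaled = logits[top] / temperature
    probs = np.exp(scaled - scaled.max())    # softmax over the restricted set
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

Measuring the resulting gaps $\Delta_i$ between adjacent levels and re-tuning the per-level parameters until the desired granularity is reached is then a straightforward outer loop.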

3.3 Empirical Evaluation

Applied to IBM industrial data for COBOL-centric code translation, summarization, and code generation tasks, REFINE supports identification and tuning of high-quality LLM judges:

  • Baseline alignment scores for weak evaluators fell below 0.7; after one or two refinement cycles (prompt tuning, hierarchy adjustments), $\mathrm{Alignment}(E)$ exceeded 0.9 for best configurations.
  • Example: Summarization task improved from roughly 0.65 to roughly 0.92 between two refinement phases.

3.4 Workflow Integration and Use Cases

REFINE is directly integrated in IBM’s model development life cycle:

  1. Coarse-grained benchmark to filter poor evaluators.
  2. Finer, nuanced benchmarks for the top $K$ evaluators.
  3. Successive prompt/model tuning to maximize $\mathrm{Alignment}(E)$.
  4. Adoption of top-performing LLM-judge configurations in model release gating, automated code review, and regression testing.

By providing scalable, objective, and controllable evaluation, REFINE improves the rigor of LLM-based artifact assessment in production settings.

4. Comparative Table of RefineBench Instantiations

| RefineBench Variant | Domain | Benchmark Focus |
|---------------------|--------|-----------------|
| Checklist-Based LLM Refinement (Lee et al., 27 Nov 2025) | LLM (multi-domain) | Multi-turn refinement capability, guided/self-refinement, checklist evaluation |
| Refinery–Petrochemical Optimization (Du et al., 28 Mar 2025) | Industrial Planning | Realistic MINLP for network-scale production/inventory scheduling, delta-base modeling |
| LLM Evaluator Benchmark (REFINE) (Fandina et al., 4 Aug 2025) | Software Engineering, Code Evaluation | Fine-grained LLM-judge ranking alignment, artifact synthesis with controlled granularity |

5. Significance and Open Challenges

RefineBench, in its diverse instantiations, enables measurement and development of (i) LLMs' multi-turn and self-refinement capabilities, (ii) optimization algorithms for complex industrial systems with real-data realism, and (iii) rigorous, scalable evaluator validation for software engineering artifacts. Across all domains, key open problems include automated error localization (for LMs and artifacts), fine-structured evaluation that bridges synthesis and real-world deployment, and the development of algorithms and models capable of robust, scalable self-correction or real-time optimization under constraints.

Each RefineBench variant is publicly available, designed to act as a catalyst for research progress by providing challenging, transparent, and reproducible testbeds for academic and industrial benchmarking.
