FuncBenchGen Benchmarking Framework
- FuncBenchGen is a systematic framework for synthetic generation and evaluation of function-centric benchmark tasks with controllable complexity.
- It leverages dependency graph construction and parameterization to create reproducible benchmarks across domains like FaaS, LLMs, constraint programming, and numerical optimization.
- The framework enables precise performance profiling and diagnostic insights while isolating evaluations from data contamination and static benchmarking limitations.
FuncBenchGen refers to a family of frameworks and methodologies for synthetic generation and evaluation of function-centric benchmark tasks across various domains, including Function-as-a-Service (FaaS) cloud environments, variant-rich software evolution, constraint programming, numerical optimization, and tool-augmented LLM (TaLM) evaluation. In its most recent instantiation (Maekawa et al., 30 Sep 2025), FuncBenchGen is a contamination-free, controllable framework for evaluating multi-step function calling in LLMs, leveraging synthetic dependency graph construction to parameterize and isolate relevant benchmarking dimensions. Previous frameworks and naming conventions (e.g., AutoIG, GNBG) are tightly related via their shared objective: precise control over task structure, systematic coverage of complexity, and reliable benchmarking unattainable with static or crowd-curated datasets.
1. Architectural Principles and Domain Coverage
FuncBenchGen architecture varies by domain context. In FaaS benchmarking (Pellegrini et al., 2019), FuncBenchGen is realized as a proxy-based measurement framework with three interacting components: FaaSBench (Java workload generator and metric collector), Proxy Cloud Function (PCF, a JavaScript intermediary), and Target Cloud Function (TCF, instrumented or production-deployed measurement target). In LLM multi-step tool-use evaluation (Maekawa et al., 30 Sep 2025), FuncBenchGen synthesizes a set of external function schemas interconnected in a hidden directed acyclic graph (DAG), with each node representing a function whose output may serve as input to others.
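As a concrete illustration of the DAG-based construction, the following minimal Python sketch synthesizes a small hidden dependency graph over core and distractor function schemas. The schema fields, helper names, and edge-sampling rule are illustrative assumptions, not the published FuncBenchGen generator.

```python
import random
from dataclasses import dataclass, field

@dataclass
class FunctionSchema:
    """Synthetic function node: consumes outputs of its parent functions, emits one value."""
    name: str
    inputs: list = field(default_factory=list)   # names of upstream functions whose outputs are required
    output: str = ""                             # name of the value this node produces

def build_dependency_dag(n_core: int, n_distractors: int, edge_prob: float = 0.5, seed: int = 0):
    """Sample a hidden DAG: edges run only from lower- to higher-indexed core nodes (acyclic by construction)."""
    rng = random.Random(seed)
    nodes = {}
    for i in range(n_core):
        parents = [f"f{j}" for j in range(i) if rng.random() < edge_prob]
        nodes[f"f{i}"] = FunctionSchema(name=f"f{i}", inputs=parents, output=f"v{i}")
    for k in range(n_distractors):
        # Distractor functions look plausible but are never required for the task.
        nodes[f"d{k}"] = FunctionSchema(name=f"d{k}", output=f"w{k}")
    return nodes

dag = build_dependency_dag(n_core=5, n_distractors=3)
for fn in dag.values():
    print(fn.name, "<-", fn.inputs)
```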
Across constraint programming (Dang et al., 2022) and numerical optimization (Yazdani et al., 2023), FuncBenchGen-like frameworks instantiate benchmark instances via parameterized generators, controlling problem features (difficulty, multimodality, separability, conditioning, evolution patterns) and capturing relevant instance data and meta-information.
A cross-domain feature is the framework's ability to tailor benchmarks for specific research goals: isolating performance bottlenecks, tracing dependencies, or simulating evolution in software product lines (Derks et al., 2021).
2. Task Generation: Dependency Graphs, Variability, and Parameterization
Central to FuncBenchGen is the formalization of benchmark task structure through dependency graphs and parametric controls:
- In LLM function-calling tasks (Maekawa et al., 30 Sep 2025), the framework constructs a hidden function-dependency DAG $G = (V, E)$, where $V$ is the set of functions and $(u, v) \in E$ iff function $v$ can consume the output of function $u$. Parameters such as the dependency depth $d$, the number of core nodes $n$, and the number of distractor functions $m$ precisely modulate complexity.
- In constraint programming (Dang et al., 2022), AutoIG samples generator configurations in the instance parameter space (e.g., task number, density) using algorithmic tuning (irace) and essence modelling pipelines to produce benchmark instances that are graded or discriminating with respect to solver performance.
- For numerical optimization (Yazdani et al., 2023), GNBG defines composite search landscapes using a single parametric baseline function of the form
  $$f(\mathbf{x}) = \min_{i \in \{1,\dots,o\}} \left\{ \sigma_i + \left( T(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\, \mathbf{R}_i^{\top} \boldsymbol{\Lambda}_i \mathbf{R}_i\, T(\mathbf{x}-\boldsymbol{\mu}_i) \right)^{\lambda_i} \right\},$$
  where $T(\cdot)$ is a nonlinear transformation introducing multimodality and irregularity, $\mathbf{R}_i$ a rotation matrix encoding variable interactions, $\boldsymbol{\Lambda}_i$ a diagonal matrix controlling scaling/conditioning, and $\lambda_i$ controls basin linearity. Instance generation entails varying $d$ (dimension), $o$ (number of components/basins), and related parameters.
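To make this parameterization concrete, here is a minimal NumPy sketch of a GNBG-style composite landscape. The identity nonlinear transform, the parameter names, and the example basin values are simplifying assumptions rather than the exact published generator.

```python
import numpy as np

def gnbg_component(x, mu, R, lam_diag, sigma, lam_exp, T=lambda z: z):
    """One GNBG-style basin: sigma + (T(x-mu)^T R^T Lambda R T(x-mu))^lam_exp."""
    z = T(x - mu)                                  # nonlinear transform (identity here for simplicity)
    q = z @ R.T @ np.diag(lam_diag) @ R @ z        # rotated, conditioned quadratic form
    return sigma + q ** lam_exp

def gnbg(x, components):
    """Composite landscape: minimum over all parameterized components (basins)."""
    return min(gnbg_component(x, **c) for c in components)

d = 2                                              # dimension
rng = np.random.default_rng(0)
components = [                                     # o = 2 basins
    {"mu": rng.uniform(-5, 5, d), "R": np.eye(d),
     "lam_diag": np.array([1.0, 100.0]),           # ill-conditioning
     "sigma": -100.0, "lam_exp": 0.5},             # basin depth and linearity
    {"mu": rng.uniform(-5, 5, d), "R": np.eye(d),
     "lam_diag": np.ones(d), "sigma": -90.0, "lam_exp": 1.0},
]
print(gnbg(np.zeros(d), components))
```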
This systematic parameterization ensures reviewable benchmarks, supports “what-if” scenario analysis, and aligns with the need for reproducible performance profiling.
3. Evaluation Methodologies and Metrics
Benchmarking methodologies reflect domain-specific operation and the overarching goal of diagnostic depth:
- For FaaS (Pellegrini et al., 2019), metrics span transmission delays, header/body sizes, execution times, and instance duration, split across network and compute domains via two-step invocation (FaaSBench→PCF→TCF). Peak throughput, responsiveness (autoscaling delays), and timeout triggering expose platform limits and runtime overhead.
- In LLM tool use (Maekawa et al., 30 Sep 2025), evaluation comprises execution of multi-step tasks over synthetic APIs, with success measured by the agent’s ability to compose correct call sequences. Failures are classified (state propagation, incorrect value usage), and mitigation (explicit state reminders) is shown to yield quantitative success rate improvements (e.g., GPT-5: 62.5%→81.3%).
- Constraint programming frameworks (Dang et al., 2022) employ grading algorithms that penalize trivial or unsolvable instances, labelling an instance "good" only if solver runtime falls within a prescribed time window. Discriminating instances are identified using normalized score ratios across solver pairs (a simplified grading sketch appears after this list).
- GNBG-style optimizing benchmarks (Yazdani et al., 2023, Baronti et al., 23 Jun 2024) enable metric comparison of convergence rates, success over multimodal landscapes, and behaviour under controlled conditioning and separability, with instance and meta-information supplied for reproducibility.
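The grading and discrimination rules can be sketched as simple classifiers over solver results. The time bounds, score-ratio definition, and threshold below are illustrative assumptions, not AutoIG's exact configuration.

```python
def grade_instance(solve_time, solved, t_min=10.0, t_max=300.0):
    """Label an instance 'good' only when it is solved inside a prescribed time window."""
    if not solved:
        return "unsolvable"        # penalized: solver produced no result
    if solve_time < t_min:
        return "trivial"           # penalized: too easy to be informative
    if solve_time > t_max:
        return "too_hard"
    return "good"

def is_discriminating(score_a, score_b, ratio_threshold=2.0):
    """Flag an instance when normalized scores differ strongly between two solvers."""
    hi, lo = max(score_a, score_b), max(min(score_a, score_b), 1e-9)
    return hi / lo >= ratio_threshold

print(grade_instance(120.0, solved=True))   # -> good
print(is_discriminating(0.9, 0.3))          # -> True
```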
4. Data Exchange, Isolation, and Contamination Control
Contamination-free benchmarking is a hallmark of recent FuncBenchGen methodology (Maekawa et al., 30 Sep 2025). Synthetic on-the-fly function and task construction isolates evaluation from pretraining or test-time data overlap, in contrast to curated API benchmarks found in the literature. For cloud and software systems benchmarking (Pellegrini et al., 2019, Derks et al., 2021), experiments either use isolated testbeds (OpenFaaS on Debian VMs) or simulation over abstracted codebase “asset trees” and controlled evolution operators.
Exchange protocols, such as the JSON-based workload and response formats in FaaSBench, guarantee correlated traceability (e.g., unique uuids, start/stop times), facilitating reliable analytics and post-hoc diagnosis. Constraints on resource isolation (e.g., mitigation of noisy neighbor effects, handling of clock synchronization across regions) are explicitly acknowledged as limitations with recommended future research directions.
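A minimal sketch of such a traceable exchange record is shown below; the field names and the hypothetical target URL illustrate the uuid/timestamp correlation rather than FaaSBench's actual schema.

```python
import json
import time
import uuid

def make_workload_record(target_url: str, payload: dict) -> dict:
    """Open a JSON-serializable request record with a correlation id and start timestamp."""
    return {
        "uuid": str(uuid.uuid4()),     # correlates generator, proxy (PCF), and target (TCF) logs
        "target": target_url,
        "payload": payload,
        "start_ts": time.time(),
    }

def close_record(record: dict, response_body: dict) -> str:
    """Attach the response and stop timestamp, then serialize for post-hoc analysis."""
    record["stop_ts"] = time.time()
    record["response"] = response_body
    return json.dumps(record)

rec = make_workload_record("https://example.invalid/tcf", {"n": 42})
print(close_record(rec, {"result": "ok"}))
```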
5. Practical Implications, Optimization, and Tool Development
FuncBenchGen frameworks offer substantial utility for both platform developers and end-users:
- Optimization and Bottleneck Analysis: By decomposing request/response overhead and function call dependencies, users can identify critical points for resource allocation, runtime improvement, or implementation adjustment (Pellegrini et al., 2019).
- Resource Planning: Benchmark metrics guide cloud provisioning decisions and SLA compliance (Pellegrini et al., 2019).
- Algorithm and Solver Profiling: AutoIG and GNBG facilitate systematic comparison and portfolio construction in constraint programming and optimization, surfacing “weak spots” as well as pockets of unexpected strength (Dang et al., 2022).
- State Tracking in LLMs: Explicit variable restatement mitigates brittle context propagation, with empirical gain validated across model variants (Maekawa et al., 30 Sep 2025).
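A minimal sketch of the state-restatement idea, assuming a generic chat-style agent loop: before each model turn, previously resolved variable values are restated in a reminder message. The prompt wording and message format are assumptions for illustration, not the paper's exact mitigation.

```python
def with_state_reminder(messages: list, resolved_values: dict) -> list:
    """Append an explicit restatement of already-resolved variables before the next model turn."""
    if not resolved_values:
        return messages
    reminder = "Known values so far: " + ", ".join(
        f"{name} = {value!r}" for name, value in resolved_values.items()
    )
    # Restating prior call results keeps them salient without re-invoking any tool.
    return messages + [{"role": "system", "content": reminder}]

history = [{"role": "user", "content": "Compute f3, which needs the outputs of f1 and f2."}]
state = {"v1": 7, "v2": "2024-05-01"}
for msg in with_state_reminder(history, state):
    print(msg["role"], ":", msg["content"])
```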
Such frameworks can also inform tool development by clarifying the impact of distractors, ill-conditioning, and variant interactions.
6. Limitations, Challenges, and Future Directions
Several explicit limitations are documented:
- Scalability bound by hardware (memory, bandwidth), risk of performance isolation failure when resources are co-located, and lack of integration with native logging systems in production (Pellegrini et al., 2019).
- Data and logging limitations, clock synchronization, and absence of workload reuse impede deep-dive analytics (Pellegrini et al., 2019).
- In the context of LLM evaluation, degradation with increased dependency depth and distractor function connectivity signals unresolved model limitations (Maekawa et al., 30 Sep 2025).
- For evolving software benchmarks, ensuring compileability and extensibility across arbitrary languages is non-trivial, though asset tree abstraction and transaction wrapping partially address this (Derks et al., 2021).
Anticipated expansions include migration to new runtime environments, public release of source code, multi-language variant support, tailored workload design (including billing-timeout correlations), integration with native logging systems, and research into advanced state management and distractor filtering in LLMs.
7. Significance and Impact on Benchmarking Research
FuncBenchGen, across its instantiations, advances benchmarking practice by enabling fine-grained control, contamination resistance, and systematic instance generation unattainable with legacy static datasets or partially crowdsourced benchmarks. In cloud, AI, and optimization communities, these principles embody a shift towards evaluation frameworks that reflect real-world deployment challenges and ensure fair, reproducible comparison. By uncovering specific failure modes (e.g., state tracking, bottleneck propagation, software evolution regressions), FuncBenchGen directs future research at both the architectural and algorithmic level.
The framework’s utility is validated both qualitatively—through identification and mitigation of systemic weaknesses—and quantitatively—as seen in success rate improvements for leading models such as GPT-5 under controlled evaluation (Maekawa et al., 30 Sep 2025). Its adoption signals an increased expectation for reliability and transparency in the benchmarking of cloud functions, LLM tool-use, optimization algorithms, and beyond.