CogIP-Bench: LLM Code Interpreter Evaluation
- CogIP-Bench is an evaluation framework that measures LLMs’ ability to use built-in code interpreters for solving complex, multi-turn data science tasks.
- It combines process-oriented and output-oriented metrics, such as numeric accuracy, ROUGE-L, and SSIM, to assess performance comprehensively.
- Systematic comparisons across proprietary and open-source models using both end-to-end and oracle modes reveal actionable insights for improving tool-use agents.
CogIP-Bench (CIBench) is an evaluation framework for assessing the ability of LLMs to invoke and reason with a built-in code interpreter plugin while solving realistic, multi-step data-science tasks. CogIP-Bench defines a task suite and metrics for process and output quality, enables systematic comparison of proprietary and open-source LLMs, and provides actionable insights for advancing tool-use agents in data science workflows (Zhang et al., 15 Jul 2024).
1. Formal Framework and Core Metrics
Let $\mathcal{T}$ denote the set of multi-turn data-science tasks (notebooks), $\mathcal{M}$ the set of five evaluation metrics, and $\mathcal{E} = \{\text{end-to-end}, \text{oracle}\}$ the two evaluation modes. CogIP-Bench is thus defined as the triple $(\mathcal{T}, \mathcal{M}, \mathcal{E})$.
Output-Oriented Metrics:
- Numeric Accuracy: the fraction of numeric questions whose predicted value $\hat{y}_q$ matches the ground truth $y_q$ within a tolerance, $\text{Acc}_{\text{num}} = \frac{1}{|Q_{\text{num}}|}\sum_{q \in Q_{\text{num}}} \mathbb{1}\!\left[\,|\hat{y}_q - y_q| \le \epsilon\,\right]$.
- Text Score: ROUGE-L between the generated textual answer and the reference answer.
- Visualization Score: structural similarity (SSIM) between the generated figure and the ground-truth figure.
Process-Oriented Metrics:
- Tool Call Rate: the fraction of questions for which the model emits a well-formed interpreter invocation.
- Executable Rate: the fraction of generated code cells that execute without error.
Overall Score: an aggregate of the five metrics above, summarizing output and process quality in a single number (see the sketch at the end of this section).
These metrics enable multifaceted assessment of LLM tool-use beyond conventional code correctness, capturing not only what answer is produced, but also the interpretability of outputs and robustness of the code execution process.
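As a minimal illustration of how these metrics combine, the sketch below aggregates per-question evaluation records into the five CIBench-style scores. The record fields (`pred_number`, `ran_without_error`, etc.) and the unweighted overall average are illustrative assumptions, not the benchmark's actual schema or weighting; ROUGE-L and SSIM values are assumed to come from standard implementations (e.g., the `rouge-score` and `scikit-image` packages).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuestionRecord:
    """Per-question evaluation record (hypothetical field names)."""
    called_tool: bool                      # did the model emit an interpreter call?
    ran_without_error: bool                # did the final code cell execute cleanly?
    pred_number: Optional[float] = None    # numeric answer, if the question expects one
    gold_number: Optional[float] = None
    rouge_l: Optional[float] = None        # ROUGE-L F1 vs. reference text
    ssim: Optional[float] = None           # SSIM vs. reference figure

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs) if xs else 0.0

def cibench_style_scores(records, rel_tol=1e-2):
    """Aggregate the five CIBench-style metrics over a list of QuestionRecord."""
    numeric = [r for r in records if r.gold_number is not None]
    text    = [r for r in records if r.rouge_l is not None]
    vis     = [r for r in records if r.ssim is not None]

    scores = {
        # Output-oriented
        "numeric_accuracy": mean(
            (r.pred_number is not None
             and abs(r.pred_number - r.gold_number)
                 <= rel_tol * max(abs(r.gold_number), 1e-8))
            for r in numeric
        ),
        "text_score": mean(r.rouge_l for r in text),
        "visualization_score": mean(r.ssim for r in vis),
        # Process-oriented
        "tool_call_rate": mean(r.called_tool for r in records),
        "executable_rate": mean(r.ran_without_error for r in records),
    }
    # Unweighted average as a stand-in for the benchmark's overall score.
    scores["overall"] = mean(scores.values())
    return scores
```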
2. Dataset Construction and Scope
The CogIP-Bench dataset comprises 234 multi-step Jupyter-style notebooks with over 1,900 questions, encompassing seven task categories. Construction proceeds through the following phases:
- Module and Topic Selection: Ten core Python libraries—pandas, matplotlib, seaborn, scikit-learn, PyTorch, TensorFlow, LightGBM, nltk, opencv-python, scipy—are selected. For each, GPT-4 proposes approximately 50 representative topics.
- LLM–Human Cooperative Generation:
- GPT-4 generates notebooks (10–15 steps, explicit parameters, external dataset references).
- Human authors provide feedback, refine prompts (remove file-IO, ensure reproducibility).
- Recurring patterns are distilled into templates; new datasets are slotted in (GPT-4–synthesized/recently published data).
- Manual quality control ensures user-authentic prompts, step runtimes under 1 minute, dataset sizes under 50 MB, and precise ground-truth checks.
- Session Structure: Each notebook is a multi-turn, interactive IPython session. Downstream questions depend on earlier cell outputs, simulating realistic, stateful code-interpreter usage across a data science workflow.
This methodology yields a comprehensive, diverse set of multi-turn problems, emphasizing complex, authentic tool-use scenarios.
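For concreteness, a single multi-turn task can be pictured as an ordered list of interpreter steps, each carrying a prompt, an answer type, and a ground-truth target for the corresponding metric. The layout below is an illustrative sketch under assumed field names, not the benchmark's released file format; the dataset name and values are hypothetical.

```python
# Illustrative structure of one multi-turn notebook task (hypothetical schema).
# Each step depends on the interpreter state left behind by earlier steps.
example_task = {
    "libraries": ["pandas", "matplotlib", "scikit-learn"],
    "dataset": "sales_2023.csv",             # external dataset referenced by the prompts
    "steps": [
        {
            "prompt": "Load sales_2023.csv into a DataFrame and report its number of rows.",
            "answer_type": "numeric",        # scored with numeric accuracy
            "ground_truth": 10000,
        },
        {
            "prompt": "Plot monthly revenue as a line chart.",
            "answer_type": "visualization",  # scored with SSIM against a reference figure
            "ground_truth": "figures/step2_revenue.png",
        },
        {
            "prompt": "Fit a linear regression of revenue on advertising spend "
                      "and summarize the coefficient in one sentence.",
            "answer_type": "text",           # scored with ROUGE-L
            "ground_truth": "Revenue increases by roughly 3.2 units per unit of ad spend.",
        },
    ],
}
```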
3. Evaluation Modes and Protocols
CogIP-Bench implements the ReAct protocol, alternating between “Reasoning” (model thinks and plans) and “Action” (model submits code). Models are permitted up to three self-debug attempts per question.
- End-to-End Mode:
- The model receives only the prompt and its prior context.
- Must autonomously decide when to invoke the interpreter, interpret errors, debug, and proceed without external intervention.
- Oracle Mode:
- On failure (code error or wrong result), ground-truth code and reasoning are injected into the context as an in-context example.
- Simulates interactive human assistance or few-shot demonstration for downstream tasks.
These modes distinguish autonomous problem-solving from scenarios where corrective feedback is available, enabling analysis of both raw and guidance-augmented LLM performance.
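The two protocols can be pictured as one evaluation loop. In the sketch below, `llm_step` and `run_in_interpreter` are placeholder interfaces to the model and to a stateful IPython kernel, and `reference_code` is an assumed field holding the ground-truth solution; the retry and oracle logic mirrors the description above (up to three self-debug attempts, ground-truth injection only in oracle mode).

```python
MAX_DEBUG_ATTEMPTS = 3

def evaluate_task(task, llm_step, run_in_interpreter, oracle_mode=False):
    """Run one multi-turn task under a ReAct-style protocol.

    llm_step(context, prompt) -> (reasoning, code)   # placeholder model interface
    run_in_interpreter(code)  -> (ok, output)        # stateful kernel shared across steps
    """
    context = []          # running conversation: prompts, reasoning, code, outputs/errors
    results = []
    for step in task["steps"]:
        ok, output = False, None
        for _attempt in range(MAX_DEBUG_ATTEMPTS):
            reasoning, code = llm_step(context, step["prompt"])   # "Reasoning" then "Action"
            ok, output = run_in_interpreter(code)
            context += [step["prompt"], reasoning, code, str(output)]
            if ok:
                break      # executed cleanly; correctness is scored separately
        if not ok and oracle_mode:
            # Oracle mode: inject the ground-truth code as an in-context example so
            # that downstream steps start from a correct interpreter state.
            ok, output = run_in_interpreter(step["reference_code"])
            context += ["[oracle correction]", step["reference_code"], str(output)]
        results.append({"step": step["prompt"], "executed": ok, "output": output})
    return results
```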
4. Experimental Setup
Model Coverage:
| Model Type | Examples | Parameter Scales |
|---|---|---|
| Proprietary | GPT-4-1106-preview, GPT-4o | Not specified |
| Open-Source | Llama-2 (7B–70B), Llama-3 (8B, 70B), Vicuna (7B–13B), Qwen (7B–72B), InternLM2 (7B, 20B), Yi (6B, 34B) | 7B–72B |
| Code-oriented | Code Llama, Codex-style variants | |
| Others | DeepSeek, ChatGLM3, Baichuan 2 | See paper for full list |
Evaluation Platform and Environment:
- OpenCompass evaluation harness.
- Python 3.10, with explicit module versions: pandas 1.5.3, matplotlib 3.7.2, seaborn 0.13.0, scikit-learn 1.2.1, PyTorch 1.13.1, TensorFlow 2.14.0, LightGBM 4.1.0, nltk 3.8, opencv-python 4.8.1.78, scipy 1.11.2.
- Deterministic decoding, with up to three self-debug trials per query.
- Tasks vary in complexity: easy (<2 steps), medium (2–4), hard (>4).
This methodological rigor ensures reliable, reproducible assessment across diverse model architectures and parameterizations.
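One way to reproduce the evaluation environment is to pin the module versions listed above and verify them before running the harness. The snippet below is a quick sanity check, not part of the official setup; it simply restates the versions from the environment description.

```python
# Verify that installed packages match the versions used in the CIBench setup.
from importlib.metadata import version, PackageNotFoundError

PINNED = {
    "pandas": "1.5.3",
    "matplotlib": "3.7.2",
    "seaborn": "0.13.0",
    "scikit-learn": "1.2.1",
    "torch": "1.13.1",            # PyTorch
    "tensorflow": "2.14.0",
    "lightgbm": "4.1.0",
    "nltk": "3.8",
    "opencv-python": "4.8.1.78",
    "scipy": "1.11.2",
}

for pkg, wanted in PINNED.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: MISSING (expected {wanted})")
        continue
    status = "ok" if installed == wanted else f"MISMATCH (expected {wanted})"
    print(f"{pkg}: {installed} {status}")
```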
5. Key Results and Analytical Findings
Quantitative Performance
- Proprietary Models: GPT-4-1106-preview and GPT-4o attain the strongest results, leading the benchmark in both end-to-end and oracle modes (exact scores are reported in the paper).
- Top Open-Source: The best open-source models (Llama-3-70B-Instruct, Qwen-72B) come closest to proprietary performance but do not close the gap.
- Medium-size Models: Smaller models such as Llama-2-7B and Vicuna-7B score substantially lower.
Behavior and Robustness
- Process Metrics: Most models invoke the interpreter in over 90% of queries. Proprietary models achieve the highest executable rates, while open-source models span a wide range starting around 45%.
- Task Difficulty: On “hard” (>4-step) tasks, average scores decrease by 10–25 points relative to “easy” tasks, indicating decreased robustness in extended workflows.
- Module Performance: High scores on SciPy (mathematics/statistics) and simple visualization; near-zero on advanced modeling (PyTorch, TensorFlow).
- Debugging: Allowing two to three self-debug trials yields gains of roughly 5–10% across process and output metrics for many models, indicating partial self-repair capability.
- Cross-Benchmark Correlation: CIBench scores correlate strongly with GSM8K, HumanEval, and BBH results, suggesting that gains in general reasoning and coding ability transfer to tool-use scenarios.
- Language Sensitivity: Most models lose 2–5 points in overall score when inputs are translated to Chinese; some models (e.g., DeepSeek, Qwen) degrade more sharply, indicating cross-linguistic sensitivity.
6. Insights and Recommendations
Analysis of CogIP-Bench results yields several key recommendations:
- Error-Correction Training: Incorporate explicit error-parsing and bug-repair capabilities to enhance LLM autonomy following interpreter failures.
- Multi-Turn Reasoning: Emphasize architectural support for long-term memory and contextual state tracking in interactive, sequential tool-use.
- Module-Specific Proficiency: Target specialized fine-tuning or retrieval-augmented approaches for advanced frameworks (e.g., PyTorch, TensorFlow), as current LLMs—including state-of-the-art—underperform markedly.
- Metric Extensions: Develop output-side metrics capturing non-deterministic behaviors (e.g., stochastic model training outcomes) and increase robustness of visualization similarity measures.
- Language Robustness: Enhance multilingual pipeline fidelity to maintain performance in non-English data science workflows.
A plausible implication is that future advances in LLM tool-use proficiency hinge not only on scaling and general code competence, but also on dedicated architecture, training, and pre/post-processing tailored to the distinctive demands of interactive, multi-framework data science sessions.
7. Significance for LLM Agent Evaluation
CogIP-Bench provides a large-scale, multi-turn, real-world benchmark that exposes the performance gap between proprietary and open-source LLMs in realistic data science scenarios, especially for advanced modeling workflows. Its joint use of process- and output-oriented metrics, under two evaluation modes, offers precise analysis of both autonomous and guided tool-use proficiency. By making stateful, multi-turn problem-solving and comprehensive metrication central, CogIP-Bench sets a new standard for evaluation, facilitating both progress assessment and diagnosis in the development of code-interpreter-empowered agents (Zhang et al., 15 Jul 2024).