Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation

Published 6 Mar 2025 in cs.AI | (2503.04299v2)

Abstract: The literature and multiple experts point to many potential risks from LLMs, but there are still very few direct measurements of the actual harms posed. AI risk assessment has so far focused on measuring the models' capabilities, but the capabilities of models are only indicators of risk, not measures of risk. Better modeling and quantification of AI risk scenarios can help bridge this disconnect and link the capabilities of LLMs to tangible real-world harm. This paper makes an early contribution to this field by demonstrating how existing AI benchmarks can be used to facilitate the creation of risk estimates. We describe the results of a pilot study in which experts use information from Cybench, an AI benchmark, to generate probability estimates. We show that the methodology seems promising for this purpose, while noting improvements that can be made to further strengthen its application in quantitative AI risk assessment.

Authors (5)

Summary

The paper introduces a pilot study that uses Cybench benchmark data and expert elicitation to translate LLM performance into quantitative risk metrics.
The methodology combines the IDEA protocol with a cybersecurity risk scenario. Starting from a 25% baseline probability of successful malware development, experts estimate that current LLM assistance raises it to 30-35%.
The study highlights divergent expert opinions and calls for refined benchmarks and detailed risk scenarios to improve future AI risk modeling.

Introduction

The paper "Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation" (2503.04299) attempts to address the disconnect between the capabilities of LLMs and measurable real-world risks. While LLMs show potential in various sectors, the exact translation of these capabilities into quantifiable risk assessments remains obscure. This work introduces a pilot study using AI benchmark data from Cybench to inform risk models and facilitate expert elicitation in estimating probabilities of AI-induced harm. This allows for a measured approach linking AI performance with specific risks, such as cybersecurity threats.

Methodology

The study centers on a specific cybersecurity risk scenario built on established frameworks such as MITRE ATT&CK. Experts estimate the probability that a cybercrime group successfully develops and deploys malware with LLM assistance. These estimates are informed by Cybench benchmark data, notably the First Solve Time (FST) metric: the time the fastest human team took to solve a task, used as a proxy for task difficulty. The structured elicitation follows the IDEA protocol (Investigate, Discuss, Estimate, Aggregate), which collects individual estimates before and after group discussion and then aggregates them.

Figure 1: The risk scenario has six steps, starting with the existence of an actor and their attempts at executing the risk scenario, and ending with the economic damage ensuing from the successful completion of the attack. Between those two is a set of probabilities for each step, conditional on completing the prior steps.
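The chain structure described in Figure 1 can be made concrete with a short sketch. Assuming each step's probability is conditional on all earlier steps succeeding, the probability of the full scenario is the product of the step probabilities, and multiplying by the number of attempts and the damage per success gives an expected damage. The step names, attempt count, and damage figure below are hypothetical placeholders; only the 25% malware-development baseline is quoted in this summary, and none of this reproduces the paper's actual risk model.

```python
from math import prod

# Hypothetical conditional probabilities, each given success at all prior steps.
# Only the 0.25 malware-development baseline comes from the text; the other
# step names and values are illustrative placeholders.
STEP_PROBABILITIES = {
    "malware_development": 0.25,
    "initial_access": 0.50,
    "lateral_movement": 0.40,
    "exfiltration": 0.60,
}


def scenario_probability(step_probs):
    """Probability that every step succeeds, given conditional step probabilities."""
    for name, p in step_probs.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name}: probability {p} is outside [0, 1]")
    return prod(step_probs.values())


def expected_damage(attempts, step_probs, damage_per_success):
    """Expected damage = attempts x P(full chain succeeds) x damage per success."""
    return attempts * scenario_probability(step_probs) * damage_per_success


p_chain = scenario_probability(STEP_PROBABILITIES)
print(f"P(full scenario) = {p_chain:.3f}")
```

Because the steps multiply, a change in one step (for example, raising malware development from 0.25 to 0.30) scales the whole-scenario probability by the same ratio, which is why a per-step estimate is enough to update the overall risk figure.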

Experts are shown progressively harder Cybench tasks, chosen to reflect real-world malware development challenges. For each task, they estimate the probability of success assuming an LLM that can reliably solve tasks up to that difficulty, which maps benchmark performance onto a risk estimate. Initial estimates are revised after discussion and then aggregated.

Results

The findings suggest a nuanced relationship between LLM capabilities and increased cybercrime risk. Without AI assistance, the baseline probability of successful malware development is set at 25%. With current LLM assistance, experts estimate a rise to 30-35% (an uplift of 5 to 10 percentage points), attributed primarily to models' ability to solve tasks such as "Unbreakable" in Cybench.
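The uplift behind these figures can be expressed in two ways that are easy to confuse: as an absolute difference in percentage points, or as a relative increase over the baseline. The sketch below computes both from the 25% baseline and the 30-35% range quoted above; the function name `uplift` is an illustrative choice, not something defined in the paper.

```python
BASELINE = 0.25  # probability without LLM assistance, as quoted in the text


def uplift(p_with_llm, baseline=BASELINE):
    """Return (absolute uplift in percentage points, relative uplift as a ratio)."""
    absolute_pp = (p_with_llm - baseline) * 100
    relative = (p_with_llm - baseline) / baseline
    return absolute_pp, relative


for p in (0.30, 0.35):
    pp, rel = uplift(p)
    print(f"{p:.0%}: +{pp:.0f} percentage points, {rel:.0%} relative increase")
```

Under these numbers, a 5 to 10 percentage point uplift corresponds to a 20% to 40% relative increase in the chance of success, so it matters which of the two a reported "increase" refers to.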

Figure 2: Relationship between FST (First Solve Time) of the hardest task an LLM can solve in Cybench and the estimated probability of a cybercrime group successfully developing and deploying malware with that LLM's assistance. The baseline probability without LLM assistance is 25%. Reference points show the highest FST that current models (o1, Claude 3.5 Sonnet, and GPT-4o) can consistently solve.

Experts highlight divergence in opinion, notably influenced by the perceived operational complexity of tasks. Group A interpreted high FST tasks as indicative of substantial uplift, whereas Group B considered these performance metrics to have limited correlation with real-world malware creation.

Figure 3: Comparison of mean probability estimates between Groups A and B, showing how each group's interpretation of LLM capabilities led to different assessments. Group A estimated higher probabilities, viewing LLM capability at solving complex CTF tasks as a significant advantage for cybercrime groups. Group B estimated lower probabilities, considering CTF performance as only minimally indicative of real-world malware development capabilities.
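A comparison like the one in Figure 3 amounts to computing a mean per group and looking at the gap between them. The following sketch shows that calculation on made-up numbers: the group sizes and every estimate are hypothetical, chosen only so that Group A sits above Group B as described, and they are not data from the paper.

```python
from statistics import mean

# Hypothetical per-expert probabilities for a single hard Cybench task.
# The group sizes and every number are illustrative, not data from the paper.
GROUP_ESTIMATES = {
    "A": [0.45, 0.50, 0.55],
    "B": [0.28, 0.30, 0.32, 0.30],
}


def compare_groups(groups):
    """Return each group's mean, the pooled mean, and the gap between group means."""
    group_means = {name: mean(values) for name, values in groups.items()}
    all_values = [v for values in groups.values() for v in values]
    pooled = mean(all_values)
    gap = max(group_means.values()) - min(group_means.values())
    return group_means, pooled, gap


group_means, pooled, gap = compare_groups(GROUP_ESTIMATES)
for name, value in group_means.items():
    print(f"Group {name}: mean = {value:.2f}")
print(f"Pooled mean = {pooled:.2f}, gap between groups = {gap:.2f}")
```

A pooled mean hides this kind of split: two clusters far apart can average to a value that no expert actually holds, which is one reason to report the groups separately.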

Confidence in these estimates varies, underscoring the need for further refinement in both benchmarking and elicitation methods. The Bayesian interpolation visually indicates the general trend without over-specifying probabilities for given FST values, acknowledging the uncertainty inherent in expert assessments.
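To illustrate what a Bayesian interpolation of this kind involves, here is a minimal sketch using a random-walk Metropolis sampler, one simple member of the MCMC family. It is not the authors' model: the logistic curve in log(FST), the priors, the noise level, and all data points are assumptions made for illustration (only the roughly 7-minute and 5.5-hour endpoints echo the task range mentioned in this summary).

```python
import math
import random

# Hypothetical (FST in minutes, mean elicited probability) pairs. Only the
# 7-minute and 330-minute endpoints echo the text; every probability is made up.
DATA = [(7, 0.27), (30, 0.30), (90, 0.34), (180, 0.40), (330, 0.50)]
NOISE_SD = 0.05  # assumed spread of elicited means around the curve
PRIOR_SD = 5.0   # assumed weak Normal(0, PRIOR_SD) prior on both parameters


def curve(a, b, fst_minutes):
    """Logistic curve in log(FST in hours); always returns a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(a + b * math.log(fst_minutes / 60.0))))


def log_posterior(a, b):
    """Unnormalized log posterior: Gaussian prior plus Gaussian likelihood."""
    log_prior = -(a * a + b * b) / (2 * PRIOR_SD ** 2)
    log_lik = sum(
        -((y - curve(a, b, x)) ** 2) / (2 * NOISE_SD ** 2) for x, y in DATA
    )
    return log_prior + log_lik


def metropolis(n_samples=20000, burn_in=5000, step=0.1, seed=0):
    """Random-walk Metropolis over (a, b); returns post-burn-in samples."""
    rng = random.Random(seed)
    a, b = -1.0, 0.0
    current = log_posterior(a, b)
    samples = []
    for i in range(n_samples + burn_in):
        a_new = a + rng.gauss(0, step)
        b_new = b + rng.gauss(0, step)
        proposed = log_posterior(a_new, b_new)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            a, b, current = a_new, b_new, proposed
        if i >= burn_in:
            samples.append((a, b))
    return samples


def credible_band(samples, fst_minutes, lower=0.05, upper=0.95):
    """Return (lower quantile, median, upper quantile) of the curve at one FST."""
    values = sorted(curve(a, b, fst_minutes) for a, b in samples)
    n = len(values)
    return values[int(lower * n)], values[n // 2], values[int(upper * n)]


samples = metropolis()
for fst in (7, 120, 330):
    lo, mid, hi = credible_band(samples, fst)
    print(f"FST {fst:>3} min: median {mid:.2f}, 90% band [{lo:.2f}, {hi:.2f}]")
```

Reporting a band rather than a single curve is the point of the exercise: with only a handful of elicited points, the sampler yields many plausible curves, and the spread between them is the uncertainty shown around the trend.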

Discussion

Limitations

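Limitations of this study include a possible upward bias from presenting tasks in order of increasing difficulty, a small expert sample (seven experts completed all tasks), and limited deliberation time, all of which weaken the robustness of the conclusions. Experts also reported ambiguity arising from the broad scope of the risk scenario and the narrow specificity of the benchmark tasks.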

Recommendations

Future efforts should aim to minimize proxy variable reliance by developing benchmarks closely aligned with specific risk model steps. Providing more detailed risk scenarios can enhance the precision of expert elicitation.

Future Research

Expanding this methodology to include more participants and diverse risk scenarios across AI domains can sharpen AI risk modeling tools. Testing different estimation protocols or deliberation times may yield more accurate expert consensus, informing regulatory strategies and academic discourse.

Conclusion

The linkage of AI performance metrics with quantifiable risk assessments represents an important stride in AI risk modeling. Despite inherent uncertainty and the study's limited scope, this work suggests a streamlined methodological path for translating model capabilities into genuine risk estimates.

Figure 4: The performance of LLM benchmarks directly informs the probability estimates generated through expert elicitation. For example, the expert is informed that an LLM can solve the task 'Unbreakable' in Cybench and uses this information to increase the probability of success for a malware creation step by 5%.

Explain it Like I'm 14

Overview

This paper is about turning AI test scores into real-world risk numbers. Today, we have many tests that show what LLMs—like advanced chatbots—can do. But a good test score doesn’t directly tell us how likely the model is to help cause harm in the real world (for example, in cyberattacks). The authors show a simple way to connect AI “benchmarks” (standard tests) to estimates of real-world risk by asking cybersecurity experts to translate test performance into probabilities.

What questions did the paper ask?

The paper explores three easy-to-understand questions:

If an AI can pass certain hard cybersecurity puzzles, how much more likely does it make a cybercriminal succeed at creating and deploying malware?
Can we build a simple “map” that links AI test performance to a risk number (a probability)?
Will experts agree on those numbers, and what does disagreement tell us about how to improve future tests and risk models?

How did the researchers study it?

Think of this like connecting a student’s practice drills to their game-day performance. The team:

Chose a cybersecurity benchmark called Cybench. It contains “Capture the Flag” (CTF) puzzles—small, focused hacking challenges. Each puzzle has a difficulty score called First Solve Time (FST), which is how long the fastest human team took to solve it. Higher FST = harder puzzle.
Picked five relevant puzzles ranging from easier (about 7 minutes) to harder (about 5.5 hours).
Brought together cybersecurity experts for a structured workshop. This process is called “expert elicitation”—a careful way to ask experts for estimates. They used a well-known method called the IDEA protocol (Investigate, Discuss, Estimate, Aggregate), which helps experts think clearly, compare views, and then combine their answers.

Here’s the key step they asked the experts to do:

For each puzzle, imagine an LLM that can reliably solve puzzles up to that difficulty. Given that, what’s the probability that a cybercrime group succeeds in creating and deploying malware?

The experts started with a baseline: without any AI help, the chance of success was set at 25% (based on another study). For each puzzle, experts gave a new probability if the criminals had an LLM with that level of skill.
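The two-round structure of the elicitation can be sketched as a small data structure plus an aggregation step. In the IDEA protocol, each expert gives a private first estimate, the group discusses, each expert gives a private second estimate, and the second-round values are combined mathematically. The expert labels and all numbers below are hypothetical, and the choice of the mean as the aggregate is an assumption for illustration.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class ExpertEstimate:
    expert_id: str      # hypothetical anonymized label
    investigate: float  # first private estimate, before discussion
    estimate: float     # second private estimate, after discussion


# Hypothetical panel of seven experts answering one question:
# "probability of success given an LLM that solves tasks up to this FST".
PANEL = [
    ExpertEstimate("E1", 0.30, 0.30),
    ExpertEstimate("E2", 0.45, 0.40),
    ExpertEstimate("E3", 0.28, 0.30),
    ExpertEstimate("E4", 0.50, 0.45),
    ExpertEstimate("E5", 0.35, 0.35),
    ExpertEstimate("E6", 0.27, 0.30),
    ExpertEstimate("E7", 0.40, 0.35),
]


def aggregate(panel):
    """Summarize both rounds; the second-round mean is the aggregated estimate."""
    round1 = [e.investigate for e in panel]
    round2 = [e.estimate for e in panel]
    return {
        "round1_mean": mean(round1),
        "round2_mean": mean(round2),
        "round2_median": median(round2),
        "round1_range": max(round1) - min(round1),
        "round2_range": max(round2) - min(round2),
    }


summary = aggregate(PANEL)
print(f"Aggregated estimate (round 2 mean): {summary['round2_mean']:.2f}")
print(f"Range narrowed from {summary['round1_range']:.2f} "
      f"to {summary['round2_range']:.2f} after discussion")
```

Keeping both rounds makes it possible to see how much discussion moved the panel, separately from where the panel ended up.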

After collecting the estimates, the researchers used a simple statistical approach to draw a smooth curve that connects “puzzle difficulty the AI can handle” to “probability of malware success.” You can think of it like fitting a gentle line through a few points to show the overall trend, not exact predictions.

Quick definitions to help:

Benchmark: a standardized test for AI.
CTF task: a bite-sized hacking puzzle that tests specific skills.
FST (First Solve Time): a difficulty score; longer = harder.
Expert elicitation: a structured way to gather estimates from experts.
Uplift: how much AI raises the chance of success compared to no AI.

What did they find?

The main findings, explained simply:

Today’s LLMs likely give a small boost. Experts estimated that current models raise the chance of successful malware creation from 25% to around 30–35% (an increase of about 5–10 percentage points).
If future LLMs get much better—able to solve most of the hardest Cybench tasks—the uplift could be meaningful. Experts thought the success chance could rise to roughly 40–65%.
Experts disagreed a lot. One group believed that strong AI puzzle performance would help less-skilled criminals succeed more often. Another group argued that real-world malware work is “messy” and requires stitching many steps together, so solving isolated puzzles might not translate into big real-world gains. This disagreement created wide uncertainty bands in the results.

Why this matters:

The study shows a practical way to connect test scores to risk probabilities—turning “AI is good at X” into “there’s a Y% chance of harm.”
It also shows where we lack clarity: benchmarks often test narrow skills, while real attacks require many coordinated abilities. That gap can cause experts to disagree.

What are the limits of this study?

Small group, short time: Only seven experts completed all tasks, and discussion time was brief.
Possible order bias: Tasks were shown from easiest to hardest, which might have nudged estimates upward over time.
Benchmark-to-reality gap: CTF puzzles are clean and self-contained; real attacks are more complex. That makes translation from test performance to real-world outcomes tricky.

What could this change or improve?

This work suggests two big next steps:

Make risk scenarios more specific. Experts asked about details like defender strength, attacker skill, and time limits. Clearer scenarios should lead to sharper estimates.
Align tests with real-world risk models. If we design benchmarks that mirror the exact steps in a risk model (for example, “move through a large codebase without being detected”), then experts won’t have to guess how much a test matters—they’ll know.

The long-term impact:

Policymakers and regulators can get numbers they can use (probabilities) rather than just “capability scores.”
Researchers and safety teams can build better tests that measure the skills that actually matter for harm.
Developers can see where their models might increase real-world risk and design safeguards accordingly.

Bottom line

This paper is an early but important step toward turning AI test results into understandable risk numbers. It finds that current LLMs likely provide a small boost to cyber attackers, while future, stronger models could provide a larger one. Just as important, it shows how to build a bridge from lab tests to real-world risk—and where that bridge still needs reinforcing.

Glossary

Bayesian interpolation approach: A method that uses Bayesian statistics to smoothly infer a relationship from sparse or noisy data. "To create a continuous mapping between the likelihood of success at developing and deploying malware and the FST, we implement a Bayesian interpolation approach."
Buffer overflow: A software vulnerability where writing more data to a buffer than it can hold overwrites adjacent memory, often enabling exploitation. "Examination of 'main.rs' file to identify a buffer overflow vulnerability."
Capture the Flag (CTF): Competitive cybersecurity challenges where participants solve security problems or exploit vulnerabilities to capture “flags.” "First Solve Time (FST) of each Capture the Flag (CTF) task."
Chemical, Biological, Radiological, and Nuclear (CBRN): A category of high-risk threats involving hazardous agents and weapons across these domains. "AI-enabled CBRN weapon design scenarios"
Collection: In the MITRE ATT&CK taxonomy, the tactic of gathering data from a target prior to exfiltration or use. "what that taxonomy calls the tactics 'lateral movement', 'collection', 'command and control', and 'exfiltration'"
Command and control: In cyber operations, establishing channels to communicate with compromised systems to issue commands or exfiltrate data. "what that taxonomy calls the tactics 'lateral movement', 'collection', 'command and control', and 'exfiltration'"
Confidence interval: A statistical range that expresses the uncertainty around an estimated parameter. "The resulting trend presents a large confidence interval, especially at higher FSTs."
Cyber Kill Chain: A framework describing stages of a cyberattack from reconnaissance to actions on objectives. "Lockheed Martin’s Cyber Kill Chain"
CYBERSECEVAL 3: A benchmark suite evaluating cybersecurity risks and capabilities of LLMs. "it has several established benchmarks (e.g., CYBERSECEVAL 3 and CTIBench)"
Delphi procedure: A structured method for eliciting expert judgments via iterative rounds to refine opinions. "It is a modified Delphi procedure"
Deserialization (pickle): The process of reconstructing objects from a serialized format; unsafe deserialization can enable code execution. "Analysis of 'chall.py' and 'my_pickle.py' to identify a pickle deserialization vulnerability."
Endpoint Detection and Response (EDR): Security tools focused on monitoring and responding to threats on endpoints like laptops and servers. "But it would not help overcome other hurdles, like evasion from EDR/NDR."
Expert elicitation: Systematic collection of judgments from domain experts to quantify uncertain parameters. "we use expert elicitation: we show cybersecurity experts the hardest task that a hypothetical LLM can solve from Cybench"
Exfiltration: The unauthorized transfer of data from a target system to an external location. "what that taxonomy calls the tactics 'lateral movement', 'collection', 'command and control', and 'exfiltration'"
First Solve Time (FST): A metric capturing the time it took the fastest team to solve a CTF task, used as a difficulty proxy. "Cybench provides an appropriate benchmark for our study due to its quantitative difficulty metric - the First Solve Time (FST) of each Capture the Flag (CTF) task."
IDEA protocol: A structured expert judgment process with phases: Investigate, Discuss, Estimate, Aggregate. "The IDEA protocol consists of a four-step elicitation process (“Investigate”, “Discuss”, “Estimate”, and “Aggregate”)"
Jailbreaking: Techniques to bypass model safety restrictions to access otherwise prohibited capabilities. "through jailbreaking the model or having access to the model weights"
Lateral movement: The technique of moving through a network post-compromise to access additional systems and privileges. "the step of lateral movement through large codebases represents a key bottleneck"
Markov Chain Monte Carlo (MCMC): A class of algorithms for sampling from complex probability distributions to approximate Bayesian posteriors. "Markov Chain Monte Carlo (MCMC) sampling is one such Bayesian technique, which we employ to model the relationship."
MITRE ATT&CK: A comprehensive taxonomy of adversary tactics, techniques, and procedures used in cyber operations. "MITRE ATT&CK"
Model weights: The learned parameters of a machine learning model that determine its behavior and outputs. "through jailbreaking the model or having access to the model weights"
Network Detection and Response (NDR): Security systems for detecting and responding to threats by analyzing network traffic. "But it would not help overcome other hurdles, like evasion from EDR/NDR."
Quantitative AI risk assessment: The process of numerically estimating AI-related risks using data, models, and expert inputs. "while noting improvements that can be made to further strengthen its application in quantitative AI risk assessment."
Remote Code Execution (RCE): The ability for an attacker to execute arbitrary code on a remote target system. "Involves RCE, Overflow, and ROP concepts."
Return-Oriented Programming (ROP): An exploitation technique chaining existing code fragments (gadgets) to execute arbitrary behavior without injecting code. "Involves RCE, Overflow, and ROP concepts."
Risk modeling: Building structured representations of pathways to harm with measurable components to estimate overall risk. "Risk modeling could help address this gap by decomposing complex risk pathways into discrete, measurable steps"
Risk scorecard: A structured summary of risk indicators and categories used to communicate model risks. "the risk scorecard in OpenAI’s o1 system card shows that the model poses greater risks than its predecessor GPT4o"
Safety case: A structured, evidence-based argument demonstrating that a system is acceptably safe for a given context. "In the context of safety cases for AI, \citet{goemans2024} discuss expert input and benchmarks as sources of quantitative evidence in safety case nodes"
Scaffolding scheme: A method that orchestrates tools, prompts, and processes to enhance an LLM’s performance on complex tasks. "they compare the performance of five models ... using a custom scaffolding scheme."
Spear phishing: Targeted phishing attacks that use personalized information to deceive specific individuals. "A cybercrime group launches highly targeted spear phishing attacks"
System card: A document detailing a model’s capabilities, limitations, and risk profile to inform stakeholders. "the risk scorecard in OpenAI’s o1 system card"
Tool scaffolding: Integrating external tools and structured workflows around an LLM to improve task completion. "the actor can use the model in any fashion (e.g., chatbot, tool scaffolding, etc.)"
Uplift: The increase in success probability attributable to assistance from an AI model compared to a baseline. "a delta often referred to as the uplift provided by an LLM"
Web Application Firewall (WAF): A security system that filters and monitors HTTP traffic to protect web applications from attacks. "Bypassing a restrictive Web Application Firewall (WAF) to achieve remote code execution."
Zero-day vulnerabilities: Security flaws unknown to the vendor and defenders, lacking patches and thus exploitable by attackers. "benchmarks evaluating the ability of LLMs to discover zero-day vulnerabilities could directly measure how the models would perform against real-world zero-day vulnerabilities."
