CryptoAnalystBench Evaluation Suite

Updated 2 July 2026

CryptoAnalystBench is a rigorously designed suite of benchmarks that evaluates LLM and agentic systems’ analytic, reasoning, and verification capabilities in complex crypto environments.
It integrates multi-tool workflows including on-chain data synthesis, adversarial cryptanalytic code recovery, and zero-knowledge circuit benchmarking to emulate realistic analyst tasks.
The suite systematically exposes LLM limitations in temporal reasoning, data consistency, and cryptographic proof synthesis, guiding targeted improvements in high-stakes domains.

CryptoAnalystBench is a suite of rigorously designed, production-grade benchmarks for evaluating the analytic, reasoning, and verification capabilities of LLMs and agentic systems in high-complexity, high-stakes computational domains. Originating in the cryptocurrency and decentralized finance (DeFi) context, CryptoAnalystBench has grown to include multi-tool long-form analyst reasoning, adversarial cryptanalytic code recovery, agent-based cryptographic problem solving, zero-knowledge (ZK) circuit benchmarking, and machine-checked proof synthesis for industrial cryptographic code. These benchmarks systematically expose the practical and theoretical limitations of LLMs in realistic analyst and cryptography workflows, emphasizing multi-source heterogeneity, temporal volatility, structured data synthesis, and deep formal correctness.

1. Domain Motivation and Evaluation Gaps

CryptoAnalystBench addresses critical gaps in LLM evaluation by targeting scenarios with extreme data density, rapidly evolving information, and complex tool interdependencies, as found in professional crypto analysis, threat intelligence, privacy engineering, and low-level cryptographic verification (Eswaran et al., 11 Feb 2026). Unlike standard NLP or QA benchmarks, which are typically limited to short, self-contained queries or shallow retrieval, cryptocurrency and DeFi research involves:

Dozens of heterogeneous tool calls (blockchain explorers, market APIs, code execution, document retrieval)
Hundreds of thousands of tokens of structured and unstructured input and output per query
Live, volatile data demanding precise temporal grounding and risk-sensitive recommendations

These requirements surface distinctive failure modes and decision-critical reliability bottlenecks that generic LLM benchmarks fail to capture, motivating dedicated, production-aligned testbeds (Guo et al., 29 Nov 2025).

2. Benchmark Design and Categories

The CryptoAnalystBench suite comprises complementary benchmarks, each focused on a different axis of crypto-analytic and cryptographic AI evaluation:

a) Multi-tool Long-form Analyst Reasoning (Eswaran et al., 11 Feb 2026)

Composed of 198 production queries drawn from live analyst workflows, spanning 11 crypto research categories (e.g., protocol research, market data, on-chain flows, risk, security, governance). Queries are curated to require complex, multi-step reasoning, dynamic data synthesis, and reconciliation of conflicting tool outputs. An agentic harness orchestrates 10–20 real-world analytics tools per query.

b) Dynamic On-chain Data and Predictive Tasks (Guo et al., 29 Nov 2025)

Issues 50 newly parameterized, expert-authored queries monthly, ensuring time-sensitivity and adversarial resilience. Tasks are classified by a four-quadrant taxonomy (Simple/Complex × Retrieval/Prediction), probing both foundational data-fusion and advanced inference/forecasting competencies.

c) Cryptanalytic and Code Reasoning Challenges (Morales et al., 7 May 2026)

Benchmarks LLMs on adversarial secret recovery, simulating malware/hacker tactics with 336 JavaScript samples transformed through 12 levels of obfuscation and encryption (XOR, AES-256). Evaluation focuses on detection rates, exact extraction, false positives, and how these outcomes relate to code-level reasoning and symbolic execution.

d) Cryptographic Proof and CTF Tasks (Wang et al., 13 Jul 2025)

AICrypto integrates 135 MCQs, 150 CTF-style challenges, and 18 proof problems (drawn from real cryptography final exams), with expert baselines and agentic CTF evaluation. Domains include block ciphers, RSA, lattice cryptography, discrete logs, and classical ciphers.

e) ZK-Friendly Hash and SNARK System Benchmarks (Guo et al., 2024)

Compares performance metrics of ZK hash functions (Poseidon2, Neptune, GMiMC) and SNARK proofs (Groth16, PLONK, Fflonk) within circom/snarkjs on EVM-compatible chains. Captures circuit constraints, on-chain gas, batch processing via a sequencer, and privacy guarantees.

f) Low-level Assembly Proof Synthesis (Rao et al., 15 Mar 2026)

Presents 2,284 machine-checkable proof obligations from AWS’s s2n-bignum cryptographic assembly library in HOL Light. Synthesizing proofs requires bit-level, ISA-aware symbolic execution and links directly to industrial correctness guarantees.

3. Agentic Tool Integration and System Architecture

CryptoAnalystBench leverages agentic frameworks that combine LLM-driven tool selection, execution, and iterative evidence accumulation. Agents interact with:

On-chain block explorers (Etherscan, The Graph)
Market & DeFi APIs (CoinGecko, DefiLlama)
On-chain analytics platforms (Dune, Nansen)
Code execution environments
Browser/document retrieval

Inputs undergo semantic rewriting and named-entity recognition. The agent executes a ReAct-style loop: tool choice, result parsing, scratchpad update, and analysis synthesis with inline citations. This emulates human analyst workflows and highlights orchestration, sequencing, and integration errors (Eswaran et al., 11 Feb 2026).
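
The loop described above can be sketched in a few lines. This is a minimal illustration rather than the benchmark's harness: the tool names, the `choose_tool` policy (a stand-in for the LLM's tool selection), and the citation format are all hypothetical, and the tools return canned strings instead of calling real services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry: stand-ins that return canned text, not real API clients.
TOOLS: Dict[str, Callable[[str], str]] = {
    "market_api": lambda q: f"price data for {q}",
    "block_explorer": lambda q: f"on-chain transfers for {q}",
}


@dataclass
class Scratchpad:
    # Each note records which tool produced which observation.
    notes: List[Tuple[str, str]] = field(default_factory=list)


def choose_tool(query: str, pad: Scratchpad) -> Optional[str]:
    """Stand-in for the LLM policy: pick the first unused tool, or stop."""
    used = {tool for tool, _ in pad.notes}
    for name in TOOLS:
        if name not in used:
            return name
    return None


def run_agent(query: str, max_steps: int = 20) -> str:
    pad = Scratchpad()
    for _ in range(max_steps):
        tool = choose_tool(query, pad)           # 1. tool choice
        if tool is None:
            break
        observation = TOOLS[tool](query)         # 2. execute and parse the result
        pad.notes.append((tool, observation))    # 3. scratchpad update
    # 4. synthesis: every statement carries an inline citation to its tool
    cited = [f"{obs} [{i}: {tool}]" for i, (tool, obs) in enumerate(pad.notes, start=1)]
    return "; ".join(cited)


print(run_agent("ETH"))
```

Keeping the tool name next to each observation in the scratchpad is what makes inline citation possible at synthesis time; a real harness would additionally need error handling for failed or conflicting tool calls.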

4. Evaluation Pipelines, Metrics, and Error Taxonomies

CryptoAnalystBench implements layered, production-aligned evaluation:

Citation verification: Three-stage pipeline for atomic claim extraction, claim typing (exact, derived, fabricated), and data-source attribution. Metrics include Citation Precision and Completeness.
Judge rubrics: Four user-defined axes, namely Temporal Relevance (TR), Data Consistency (DC), Depth (D), and Relevance (R), each scored 1–10 per query. Aggregate dimension scores are computed as $S_d = \frac{1}{N} \sum_{i=1}^N s_{i,d}$, where $s_{i,d}$ is the score of query $i$ on dimension $d$ and $N$ is the number of queries.
Task-specific metrics:
- IoC recovery: Detection Rate (DR), Exact Accuracy (EA), Uncertainty Rate, Hallucination Rate (Morales et al., 7 May 2026)
- Proof synthesis: Kernel-checked binary completion under category-aware timeouts (Rao et al., 15 Mar 2026)
- Multiquadrant task success: Average Success Rate (ASR), with ±5% numerical tolerance (Guo et al., 29 Nov 2025)
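
As a rough illustration of two of the metrics listed above, the sketch below computes the per-dimension mean $S_d$ and a tolerance-based success rate. The function names and sample values are hypothetical, and the reading of "±5% numerical tolerance" as an error relative to the reference value is an assumption; the source does not spell out the exact rule.

```python
from statistics import mean

DIMENSIONS = ("TR", "DC", "D", "R")


def aggregate_scores(per_query):
    """S_d = (1/N) * sum_i s_{i,d}, computed for each rubric dimension d."""
    return {d: mean(q[d] for q in per_query) for d in DIMENSIONS}


def within_tolerance(predicted, reference, rel_tol=0.05):
    """Assumed rule: the answer counts if it is within 5% of the reference value."""
    if reference == 0:
        return predicted == 0
    return abs(predicted - reference) <= rel_tol * abs(reference)


def average_success_rate(pairs, rel_tol=0.05):
    """Fraction of (predicted, reference) pairs that fall within tolerance."""
    hits = sum(within_tolerance(p, r, rel_tol) for p, r in pairs)
    return hits / len(pairs)


# Made-up judge scores for two queries.
scores = [
    {"TR": 9, "DC": 6, "D": 7, "R": 9},
    {"TR": 8, "DC": 7, "D": 6, "R": 10},
]
print(aggregate_scores(scores))  # {'TR': 8.5, 'DC': 6.5, 'D': 6.5, 'R': 9.5}

# Made-up numeric answers: two inside the tolerance band, two outside.
answers = [(104.0, 100.0), (94.0, 100.0), (2.0, 2.0), (1.0, 2.0)]
print(average_success_rate(answers))  # 0.5
```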

Failure taxonomy (higher-order): Staleness/Missing time bounds, Inconsistency, Source Reconciliation Failures, Shallow Synthesis, Missing Risk/Mechanism, Overconfident Prediction, and Partial/Misframed Answers (Eswaran et al., 11 Feb 2026). These capture error modalities not detected by surface metrics (e.g., hallucination).

5. Empirical Findings and Comparative Analysis

Across multiple model cohorts (GPT-OSS-120B, Qwen-235B, Kimi-2.5, GLM-4.7, GPT-5.2), top models saturate basic Relevance and Temporal Relevance (∼8.5–9.6), but show persistent, substantial deficits in Depth and especially in Data Consistency. Citation Precision exceeds 85% in leading systems, but higher-order failure rates for temporal staleness and omission of risk context remain between 5% and 17% per category.

Under adversarial code obfuscation, LLM capabilities collapse for encrypted secrets (XOR, AES-256): zero secrets are recovered even though the keys and decrypt routines are present in the clear, a result attributed to reliance on token-pattern matching rather than symbolic execution. Plaintext with lightweight obfuscation is almost always extractable, underlining a sharp performance boundary at the introduction of cryptographic concealment (Morales et al., 7 May 2026).
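
The gap between pattern matching and execution can be shown with a toy case. The sketch below is an illustration only, written in Python rather than the benchmark's JavaScript; the sample string, the key, and the URL are hypothetical. The secret is XOR-encoded and hex-encoded, with the key left in the clear: a regex scan of the source finds nothing, while actually running the decode step recovers the indicator.

```python
import re

KEY = b"k3y"  # key left in the clear, mirroring the setup described above
SECRET = b"http://c2.example.invalid/beacon"  # hypothetical indicator of compromise


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR; applying it twice with the same key is the identity."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# A JavaScript-like sample holding the key and the hex-encoded ciphertext.
ciphertext_hex = xor_bytes(SECRET, KEY).hex()
sample = f'const k = "{KEY.decode()}"; const blob = "{ciphertext_hex}"; /* decode(blob, k) */'

URL_PATTERN = re.compile(r"https?://[^\s\"']+")


def recover_by_pattern(source: str):
    """Surface scan: only finds secrets that appear literally in the source."""
    return URL_PATTERN.findall(source)


def recover_by_execution(source: str):
    """Extract key and blob, run the decode routine, then scan the plaintext."""
    key = re.search(r'const k = "([^"]+)"', source).group(1).encode()
    blob = bytes.fromhex(re.search(r'const blob = "([0-9a-f]+)"', source).group(1))
    return URL_PATTERN.findall(xor_bytes(blob, key).decode())


print(recover_by_pattern(sample))    # []
print(recover_by_execution(sample))  # ['http://c2.example.invalid/beacon']
```

Everything needed for recovery is present in the sample, so the failure mode is not missing information but the absence of a step that computes the decode.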

In proof synthesis on industrial cryptographic code, LLMs achieve <6% success on kernel-checked obligations, with full functional-correctness proofs almost never synthesized (0/859 in high-effort mode for s2n-bignum). This performance is markedly inferior to results on abstract mathematics or Olympiad-style benchmarks (>80% solve rates), emphasizing the gap between competition math and real-world code verification (Rao et al., 15 Mar 2026).

Cryptocurrency analyst agents, even in agentic frameworks, display a "retrieval–prediction imbalance" (Editor's term). For example, GPT-5 (direct) reaches a Retrieval ASR of 41.2% but a Prediction ASR of only 6.25%. Forecasting, inference, and complex synthesis tasks consistently underperform data lookup (Guo et al., 29 Nov 2025).

ZK and SNARK benchmarks indicate that Poseidon2 with Groth16, combined with sequencer batching, provides up to a 73% on-chain cost reduction (ETH/BNB). The same work supplies formal security and liveness theorems, setting new standards for privacy-preserving transaction circuit evaluation (Guo et al., 2024).

6. Mitigation Strategies and Open Research Problems

Targeted interventions derived from the empirical study include:

Prioritizing structured API output over unstructured sources
Enhanced prompt design with explicit temporal context and evaluation date
Task-primed sub-prompts for mechanism/risk depth
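
A prompt builder combining these three interventions might look like the sketch below. The function name, parameters, and instruction wording are all hypothetical; the source names the interventions but does not give the prompts used.

```python
from datetime import date


def build_prompt(query: str, evaluation_date: date, focus=("mechanism", "risk")) -> str:
    """Assemble a prompt with explicit temporal context and depth-oriented sub-prompts."""
    lines = [
        # Explicit temporal grounding: the model is told what "now" means.
        f"Evaluation date: {evaluation_date.isoformat()}",
        "State the timestamp of every figure you cite and flag data that may be stale.",
        # Source prioritization: structured output wins over unstructured text.
        "Prefer structured API output over unstructured sources when they disagree.",
    ]
    # Task-primed sub-prompts, one per depth topic.
    lines += [f"Sub-task: explain the {topic} in depth." for topic in focus]
    lines.append(f"Query: {query}")
    return "\n".join(lines)


print(build_prompt("TVL trend for a lending protocol", date(2026, 2, 11)))
```

Passing the evaluation date as a parameter, rather than reading the system clock, keeps benchmark runs reproducible when queries are replayed later.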

Interventions offer measurable (+0.3–0.6) improvements in Depth/Relevance for high-capability models, but must be tuned per agent planner. Persistent challenges include LLM-as-judge score calibration (possibly requiring hybrid pipelines), failure taxonomy extension to finer-grained analyst behaviors, and adaptive mitigation selection based on trace and query complexity (Eswaran et al., 11 Feb 2026).

Key open problems:

Integrating symbolic/taint/code-execution engines for robust cryptanalytic reasoning (Morales et al., 7 May 2026)
Extending benchmarks to richer property classes (e.g., constant-time, side-channel, equivalence)
Agent framework development for complex exploit chaining and multi-agent compositionality (Wang et al., 13 Jul 2025)
Domain transfer to legal/scientific/enterprise analyst tasks at scale

7. Impact and Outlook

CryptoAnalystBench establishes a new state of the art for evaluating LLMs in domains with dense, volatile, and highly structured information requirements. It surfaces systemic inadequacies in temporal reasoning, multi-source data reconciliation, prediction under uncertainty, and low-level symbolic correctness. The released datasets, harnesses, failure taxonomies, and evaluation infrastructure define clear, reproducible targets for next-generation research in analyst agents, cryptanalytic AI, and industrial-grade formal verification (Eswaran et al., 11 Feb 2026, Guo et al., 29 Nov 2025, Morales et al., 7 May 2026, Wang et al., 13 Jul 2025, Guo et al., 2024, Rao et al., 15 Mar 2026).

References (6)

CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis (2026)

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency (2025)

Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption (2026)

AICrypto: A Comprehensive Benchmark For Evaluating Cryptography Capabilities of Large Language Models (2025)

Benchmarking ZK-Friendly Hash Functions and SNARK Proving Systems for EVM-compatible Blockchains (2024)

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs (2026)

CryptoAnalystBench Evaluation Suite

1. Domain Motivation and Evaluation Gaps

2. Benchmark Design and Categories

a) Multi-tool Long-form Analyst Reasoning (Eswaran et al., 11 Feb 2026)

b) Dynamic On-chain Data and Predictive Tasks (Guo et al., 29 Nov 2025)

c) Cryptanalytic and Code Reasoning Challenges (Morales et al., 7 May 2026)

d) Cryptographic Proof and CTF Tasks (Wang et al., 13 Jul 2025)

e) ZK-Friendly Hash and SNARK System Benchmarks (Guo et al., 2024)

f) Low-level Assembly Proof Synthesis (Rao et al., 15 Mar 2026)

3. Agentic Tool Integration and System Architecture

4. Evaluation Pipelines, Metrics, and Error Taxonomies

5. Empirical Findings and Comparative Analysis

6. Mitigation Strategies and Open Research Problems

7. Impact and Outlook
