CryptoAnalystBench Evaluation Suite
- CryptoAnalystBench is a rigorously-designed suite of benchmarks that evaluates LLM and agentic systems’ analytic, reasoning, and verification capabilities in complex crypto environments.
- It integrates multi-tool workflows including on-chain data synthesis, adversarial cryptanalytic code recovery, and zero-knowledge circuit benchmarking to emulate realistic analyst tasks.
- The suite systematically exposes LLM limitations in temporal reasoning, data consistency, and cryptographic proof synthesis, guiding targeted improvements in high-stakes domains.
CryptoAnalystBench is a suite of rigorously-designed, production-grade benchmarks for evaluating the analytic, reasoning, and verification capacities of LLMs and agentic systems in high-complexity, high-stakes computational domains. Originating in the cryptocurrency and decentralized finance (DeFi) context, CryptoAnalystBench has grown to include multi-tool long-form analyst reasoning, adversarial cryptanalytic code recovery, agent-based cryptographic problem solving, zero-knowledge (ZK) circuit benchmarking, and machine-checked proof synthesis for industrial cryptographic code. These benchmarks systematically expose the practical and theoretical limitations of LLMs in realistic analyst and cryptography workflows, emphasizing multi-source heterogeneity, temporal volatility, structured data synthesis, and deep formal correctness.
1. Domain Motivation and Evaluation Gaps
CryptoAnalystBench addresses critical gaps in LLM evaluation by targeting scenarios with extreme data density, rapidly evolving information, and complex tool interdependencies, as found in professional crypto analysis, threat intelligence, privacy engineering, and low-level cryptographic verification (Eswaran et al., 11 Feb 2026). Standard NLP and QA benchmarks are typically limited to short, self-contained queries or shallow retrieval. Cryptocurrency and DeFi research, by contrast, involves:
- Dozens of heterogeneous tool calls (blockchain explorers, market APIs, code execution, document retrieval)
- Hundreds of thousands of tokens of structured and unstructured input and output per query
- Live, volatile data demanding precise temporal grounding and risk-sensitive recommendations
These requirements surface distinctive failure modes and decision-critical reliability bottlenecks that generic LLM benchmarks fail to capture, motivating dedicated, production-aligned testbeds (Guo et al., 29 Nov 2025).
2. Benchmark Design and Categories
The CryptoAnalystBench suite comprises complementary benchmarks, each focused on a different axis of crypto-analytic and cryptographic AI evaluation:
a) Multi-tool Long-form Analyst Reasoning (Eswaran et al., 11 Feb 2026)
Composed of 198 production queries drawn from live analyst workflows, spanning 11 crypto research categories (e.g., protocol research, market data, on-chain flows, risk, security, governance). Queries are curated to require complex, multi-step reasoning, dynamic data synthesis, and reconciliation of conflicting tool outputs. An agentic harness orchestrates 10–20 real-world analytics tools per query.
b) Dynamic On-chain Data and Predictive Tasks (Guo et al., 29 Nov 2025)
Issues 50 newly-parameterized, expert-authored queries monthly, ensuring time-sensitivity and adversarial resilience. Tasks are classified by a four-quadrant taxonomy (Simple/Complex × Retrieval/Prediction), probing both foundational data-fusion and advanced inference/forecasting competencies.
c) Cryptanalytic and Code Reasoning Challenges (Morales et al., 7 May 2026)
Benchmarks LLMs on adversarial secret recovery, simulating malware/hacker tactics with 336 JavaScript samples transformed through 12 levels of obfuscation and encryption (XOR, AES-256). Evaluation focuses on detection rates, exact extraction, false positives, and the interplay with code-level reasoning and symbolic execution.
d) Cryptographic Proof and CTF Tasks (Wang et al., 13 Jul 2025)
AICrypto integrates 135 MCQs, 150 CTF-style challenges, and 18 proof problems (from real cryptography finals), with expert baselines and agentic CTF evaluation. Domains include block ciphers, RSA, lattice cryptography, discrete logs, and classical ciphers.
e) ZK-Friendly Hash and SNARK System Benchmarks (Guo et al., 2024)
Compares performance metrics of ZK-friendly hash functions (Poseidon2, Neptune, GMiMC) and SNARK proof systems (Groth16, PLONK, Fflonk) within circom/snarkjs on EVM-compatible chains. Captures circuit constraints, on-chain gas, batch processing via a sequencer, and privacy guarantees.
f) Low-level Assembly Proof Synthesis (Rao et al., 15 Mar 2026)
Presents 2,284 machine-checkable proof obligations from AWS’s s2n-bignum cryptographic assembly library in HOL Light. Synthesizing proofs requires bit-level, ISA-aware symbolic execution and links directly to industrial correctness guarantees.
3. Agentic Tool Integration and System Architecture
CryptoAnalystBench leverages agentic frameworks that combine LLM-driven tool selection, execution, and iterative evidence accumulation. Agents interact with:
- On-chain block explorers (Etherscan, The Graph)
- Market & DeFi APIs (CoinGecko, DefiLlama)
- On-chain analytics platforms (Dune, Nansen)
- Code execution environments
- Browser/document retrieval
Inputs undergo semantic rewriting and named-entity recognition. The agent executes a ReAct-style loop: tool choice, result parsing, scratchpad update, and analysis synthesis with inline citations. This emulates human analyst workflows and highlights orchestration, sequencing, and integration errors (Eswaran et al., 11 Feb 2026).
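The loop described above can be sketched roughly as follows. This is a minimal illustration, not the harness used in the benchmark: the tool functions, the `choose_tool` policy, and the `Evidence`/`Scratchpad` types are all hypothetical stand-ins, and a real agent would delegate tool choice and synthesis to an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Evidence:
    tool: str    # name of the tool that produced the observation
    query: str   # input passed to the tool
    result: str  # parsed tool output


@dataclass
class Scratchpad:
    entries: List[Evidence] = field(default_factory=list)


# Hypothetical stand-ins for real tools (market API, block explorer).
def fake_market_api(query: str) -> str:
    return f"price({query})=hypothetical"


def fake_block_explorer(query: str) -> str:
    return f"transfers({query})=hypothetical"


TOOLS: Dict[str, Callable[[str], str]] = {
    "market_api": fake_market_api,
    "block_explorer": fake_block_explorer,
}


def choose_tool(question: str, pad: Scratchpad) -> Optional[str]:
    """Placeholder policy: call each tool once. An LLM would decide here."""
    used = {e.tool for e in pad.entries}
    for name in TOOLS:
        if name not in used:
            return name
    return None  # nothing left to call, move on to synthesis


def run_agent(question: str, max_steps: int = 20) -> str:
    pad = Scratchpad()
    for _ in range(max_steps):
        tool = choose_tool(question, pad)            # 1. tool choice
        if tool is None:
            break
        result = TOOLS[tool](question)               # 2. execution and result parsing
        pad.entries.append(Evidence(tool, question, result))  # 3. scratchpad update
    # 4. synthesis with inline citations: each statement carries a [n] marker
    body = [f"{e.result} [{i}]" for i, e in enumerate(pad.entries, 1)]
    sources = [f"[{i}] {e.tool}" for i, e in enumerate(pad.entries, 1)]
    return "\n".join(body + sources)


print(run_agent("ETH"))
```

Keeping every observation in the scratchpad with its originating tool is what makes the later citation check possible: each synthesized statement can be traced back to a specific tool call.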
4. Evaluation Pipelines, Metrics, and Error Taxonomies
CryptoAnalystBench implements layered, production-aligned evaluation:
- Citation verification: Three-stage pipeline for atomic claim extraction, claim typing (exact, derived, fabricated), and data-source attribution. Metrics include Citation Precision and Completeness.
- Judge rubrics: Four user-defined axes, namely Temporal Relevance (TR), Data Consistency (DC), Depth (D), and Relevance (R), each scored 1–10 per query. Aggregate scores are computed per dimension across queries.
- Task-specific metrics:
  - IoC recovery: Detection Rate (DR), Exact Accuracy (EA), Uncertainty Rate, Hallucination Rate (Morales et al., 7 May 2026)
  - Proof synthesis: Kernel-checked binary completion under category-aware timeouts (Rao et al., 15 Mar 2026)
  - Four-quadrant task success: Average Success Rate (ASR), with ±5% numerical tolerance (Guo et al., 29 Nov 2025)
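Two of the metrics above lend themselves to a short sketch. The code below assumes that rubric scores are aggregated as a per-dimension mean over queries and that the ±5% tolerance is relative to the reference value; neither detail is spelled out here, so both are assumptions, and the function names are illustrative.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple

DIMENSIONS = ("TR", "DC", "D", "R")  # rubric axes named in the text


def aggregate_rubric(scores: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Per-dimension mean over queries (assumed aggregation rule)."""
    return {d: mean(s[d] for s in scores) for d in DIMENSIONS}


def within_tolerance(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """True if predicted is within +/-5% of reference (assumed to be relative)."""
    if reference == 0:
        return predicted == 0
    return abs(predicted - reference) <= rel_tol * abs(reference)


def average_success_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Share of (predicted, reference) pairs that fall within tolerance."""
    return sum(within_tolerance(p, r) for p, r in pairs) / len(pairs)


# Hypothetical judge scores for two queries.
judged = [
    {"TR": 9, "DC": 6, "D": 7, "R": 9},
    {"TR": 8, "DC": 8, "D": 7, "R": 9},
]
print(aggregate_rubric(judged))
print(average_success_rate([(104.0, 100.0), (106.0, 100.0)]))
```

A relative tolerance suits quantities such as prices or volumes that span several orders of magnitude; an absolute tolerance would be the alternative reading if the benchmark's answers were normalized.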
Failure taxonomy (higher-order): Staleness/Missing time bounds, Inconsistency, Source Reconciliation Failures, Shallow Synthesis, Missing Risk/Mechanism, Overconfident Prediction, and Partial/Misframed Answers (Eswaran et al., 11 Feb 2026). These capture error modalities not detected by surface metrics (e.g., hallucination rate).
5. Empirical Findings and Comparative Analysis
Across multiple model cohorts (GPT-OSS-120B, Qwen-235B, Kimi-2.5, GLM-4.7, GPT-5.2), top models saturate basic Relevance and Temporal Relevance (∼8.5–9.6) but show persistent, substantial deficits in Depth and especially in Data Consistency. Citation Precision exceeds 85% in leading systems, yet higher-order failure rates for temporal staleness and omission of risk context remain between 5% and 17% per category.
Under adversarial code obfuscation, LLM capabilities collapse for cryptographically-encrypted secrets (XOR, AES-256): zero secret recovery despite in-the-clear keys and decrypt routines, attributed to reliance on token-pattern matching rather than symbolic execution. Plaintext with lightweight obfuscation is almost always extractable, underlining a sharp performance boundary at the introduction of cryptographic concealment (Morales et al., 7 May 2026).
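The boundary described here can be illustrated with a toy case. The sketch below is not taken from the benchmark: Python stands in for the JavaScript samples, and the key, the secret, and the sample string are hypothetical. It shows why a surface scan finds nothing once a secret is XOR-encoded, while actually running the decrypt routine with the in-the-clear key recovers it.

```python
import re
from typing import List

KEY = b"k3y"                    # key stored in the clear (hypothetical)
SECRET = "api.example.invalid"  # hypothetical indicator to be recovered


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR; applying it twice with the same key is the identity."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


encoded_hex = xor_bytes(SECRET.encode(), KEY).hex()

# A toy "sample": the key and the encoded blob are both visible in the source.
sample = f'const key = "{KEY.decode()}"; const blob = "{encoded_hex}";'


def pattern_match(source: str) -> List[str]:
    """Surface scan for domain-like strings, i.e. no execution of the routine."""
    return re.findall(r"[a-z0-9-]+(?:\.[a-z0-9-]+)+", source)


def execute_decrypt(source: str) -> str:
    """Recover the secret by extracting key and blob and running the XOR."""
    key = re.search(r'key = "([^"]+)"', source).group(1).encode()
    blob = bytes.fromhex(re.search(r'blob = "([0-9a-f]+)"', source).group(1))
    return xor_bytes(blob, key).decode()


print(pattern_match(sample))    # the encoded blob contains no domain-like string
print(execute_decrypt(sample))  # executing the routine yields the secret
```

The scan and the executed decrypt operate on the same source text; only the second actually computes anything, which is the distinction between token-pattern matching and symbolic or concrete execution drawn above.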
In proof synthesis on industrial cryptographic code, LLMs achieve <6% success on kernel-checked obligations, with full functional-correctness proofs almost never synthesized (0/859 in high-effort mode for s2n-bignum). This performance is markedly inferior to results on abstract mathematics or Olympiad-style benchmarks (>80% solve rates), emphasizing the gap between competition math and real-world code verification (Rao et al., 15 Mar 2026).
Cryptocurrency analyst systems, even within agentic frameworks, display a "retrieval–prediction imbalance" (Editor's term). For example, GPT-5 (direct) reaches a Retrieval ASR of 41.2% but a Prediction ASR of only 6.25%. Forecasting, inference, and complex synthesis tasks consistently underperform data lookup (Guo et al., 29 Nov 2025).
ZK and SNARK benchmarks indicate that Poseidon2 with Groth16, combined with sequencer batching, yields up to a 73% on-chain cost reduction (ETH/BNB). Together with the accompanying formal security and liveness theorems, this sets a new standard for evaluating privacy-preserving transaction circuits (Guo et al., 2024).
6. Mitigation Strategies and Open Research Problems
Targeted interventions from empirical study include:
- Prioritizing structured API output over unstructured sources
- Enhanced prompt design with explicit temporal context and evaluation date
- Task-primed sub-prompts for mechanism/risk depth
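As a rough illustration of how the three interventions above might be combined in a single prompt template, consider the sketch below. The wording of the instructions, the function name `build_analyst_prompt`, and the default sub-prompts are assumptions made for illustration and are not the prompts used in the study.

```python
from datetime import date
from typing import Sequence

# Hypothetical task-primed sub-prompts targeting mechanism and risk depth.
DEFAULT_SUB_PROMPTS = (
    "Explain the underlying mechanism, not only the headline figure.",
    "State the main risks and the conditions under which the answer would change.",
)


def build_analyst_prompt(
    question: str,
    evaluation_date: date,
    sub_prompts: Sequence[str] = DEFAULT_SUB_PROMPTS,
) -> str:
    parts = [
        # Explicit temporal context and evaluation date.
        f"Evaluation date: {evaluation_date.isoformat()}",
        "Give the as-of date for every figure you cite and flag anything that may be stale.",
        # Preference for structured API output over unstructured sources.
        "When structured API output and unstructured text disagree, "
        "prefer the structured API output and note the conflict.",
        "",
        f"Question: {question}",
        "",
        "Before answering:",
    ]
    parts += [f"- {p}" for p in sub_prompts]
    return "\n".join(parts)


print(build_analyst_prompt("What drives recent TVL changes on protocol X?", date(2026, 1, 15)))
```

Passing the evaluation date in explicitly, rather than relying on the model's own sense of "now", gives the judge a fixed reference point against which staleness can be checked.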
Interventions offer measurable (+0.3–0.6) improvements in Depth/Relevance for high-capability models, but must be tuned per agent planner. Persistent challenges include LLM-as-judge score calibration (possibly requiring hybrid pipelines), extension of the failure taxonomy to finer-grained analyst behaviors, and adaptive mitigation selection based on trace and query complexity (Eswaran et al., 11 Feb 2026).
Key open problems:
- Integrating symbolic/taint/code-execution engines for robust cryptanalytic reasoning (Morales et al., 7 May 2026)
- Extending benchmarks to richer property classes (e.g., constant-time, side-channel, equivalence)
- Agent framework development for complex exploit chaining and multi-agent compositionality (Wang et al., 13 Jul 2025)
- Domain transfer to legal/scientific/enterprise analyst tasks at scale
7. Impact and Outlook
CryptoAnalystBench establishes a new state of the art for evaluating LLMs in domains with dense, volatile, and highly structured information requirements. It surfaces systemic inadequacies in temporal reasoning, multi-source data reconciliation, prediction under uncertainty, and low-level symbolic correctness. The released datasets, harnesses, failure taxonomies, and evaluation infrastructure define clear, reproducible targets for next-generation research in analyst agents, cryptanalytic AI, and industrial-grade formal verification (Eswaran et al., 11 Feb 2026, Guo et al., 29 Nov 2025, Morales et al., 7 May 2026, Wang et al., 13 Jul 2025, Guo et al., 2024, Rao et al., 15 Mar 2026).