
TeleAntiFraud-Bench: Telecom Fraud Benchmark

Updated 11 January 2026
  • TeleAntiFraud-Bench is a benchmark suite that evaluates telecom fraud detection using multimodal audio-text data from simulated and real-world call recordings.
  • It integrates scenario classification, binary fraud detection, and fraud type classification with detailed LLM-based annotation and chain-of-thought reasoning.
  • The framework supports rigorous evaluation of conventional and modern ML models while enforcing strong privacy controls through differential privacy techniques.

TeleAntiFraud-Bench is a standardized evaluation framework and benchmark suite for telecom fraud detection, specifically designed to assess automated systems on realistic, privacy-preserving, and multimodal datasets. It is distinguished by its integration of audio-text data, scenario-driven task structure, strict privacy controls, and alignment with both conventional and advanced machine learning pipelines in the anti-fraud domain. TeleAntiFraud-Bench is constructed from the TeleAntiFraud-28k dataset, and systematically supports research into scenario classification, fraud detection, and fraud type classification, while enabling rigorous, directly comparable evaluations across algorithms and modalities (Ma et al., 31 Mar 2025, Wang et al., 4 Jan 2026).

1. Benchmark Design and Dataset Construction

TeleAntiFraud-Bench is the held-out, proportionally sampled test subset of TeleAntiFraud-28k, an open-source corpus consisting of 28,511 anonymized audio-text pairs drawn from simulated and real-world telecom call recordings. The benchmark contains 7,021 samples (24.62% of TeleAntiFraud-28k), with the following class distributions:

| Task | Classes (#) | Distribution per Label | Sample Sizes |
|---|---|---|---|
| Scenario | 7 | e.g., Customer Service, Food Delivery, etc. | 4,632–91 per scenario |
| Binary Fraud | 2 | Fraud: 52.66%; Non-Fraud: 47.34% | 3,697; 3,324 |
| Fraud Type | 7 | e.g., Banking, Investment, Phishing, etc. | 2,408–35 per type |

All samples are annotated using an LLM-driven “slow-thinking” pipeline that integrates stepwise reasoning, scene/clue extraction, and output verification through a hybrid of regular expressions and secondary LLM consistency checks. Annotations for each sample include scenario, fraud presence, and, if fraudulent, fraud type, each supported by machine-readable reasoning justifications and associated confidences (Ma et al., 31 Mar 2025, Wang et al., 4 Jan 2026).
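The hybrid verification stage described above can be sketched as follows. The regex pattern, the scenario label set, and the `llm_judge` hook are illustrative assumptions for this sketch, not the released pipeline:

```python
import re

# Illustrative subset of the scenario labels (the benchmark defines 7 classes).
SCENARIOS = {"customer_service", "food_delivery", "transportation_inquiry"}

# Structural gate: the annotation must expose machine-readable label fields
# before it is passed to the more expensive secondary LLM consistency check.
LABEL_PATTERN = re.compile(
    r"scenario:\s*(?P<scenario>\w+).*?is_fraud:\s*(?P<is_fraud>true|false)",
    re.IGNORECASE | re.DOTALL,
)

def regex_gate(annotation_text: str):
    """First-pass structural check on an LLM annotation (illustrative)."""
    m = LABEL_PATTERN.search(annotation_text)
    if m is None:
        return None  # malformed -> reject / re-annotate
    return {
        "scenario": m.group("scenario").lower(),
        "is_fraud": m.group("is_fraud").lower() == "true",
    }

def verify(annotation_text: str, llm_judge=None) -> bool:
    """Hybrid verification: regex structure check, then an optional
    secondary-LLM consistency check (stubbed here as a callable)."""
    labels = regex_gate(annotation_text)
    if labels is None or labels["scenario"] not in SCENARIOS:
        return False
    if llm_judge is not None:
        # Ask a second model whether the reasoning and labels agree.
        return llm_judge(annotation_text, labels)
    return True
```

Annotations that fail the cheap regex gate never reach the secondary model, which keeps the two-stage check inexpensive at corpus scale.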

2. Task Protocols and Annotation Scheme

TeleAntiFraud-Bench supports three sequential supervised tasks:

  1. Scenario Classification: Assigns one of seven high-level customer-service scenarios per utterance (e.g., "Dining Service", "Transportation Inquiry").
  2. Fraud Detection: Binary labeling (fraud/non-fraud) of the utterance.
  3. Fraud Type Classification: For fraud-labeled samples, assigns one of seven fine-grained fraud typologies.

Each decision is annotated with both label and explicit chain-of-thought reasoning, validated for logical and procedural consistency. The “slow-thinking” annotation scheme is designed to surface both surface-level cues and deeper, multi-turn conversational fraud indicators (Ma et al., 31 Mar 2025, Wang et al., 4 Jan 2026).
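One way to represent a benchmark record carrying the annotation fields described above (the field names here are assumptions for illustration, not the published schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchRecord:
    """Illustrative schema for one TeleAntiFraud-Bench sample."""
    audio_path: str            # anonymized call recording
    transcript: str            # paired text
    scenario: str              # one of 7 customer-service scenarios
    is_fraud: bool             # binary fraud label
    fraud_type: Optional[str]  # one of 7 typologies, only when is_fraud is True
    reasoning: str             # chain-of-thought justification
    confidence: float          # annotation confidence

    def __post_init__(self):
        # Fraud type must be present iff the sample is labeled fraudulent,
        # mirroring the conditional third task.
        if self.is_fraud and self.fraud_type is None:
            raise ValueError("fraud-labeled sample missing fraud_type")
        if not self.is_fraud and self.fraud_type is not None:
            raise ValueError("non-fraud sample should not carry fraud_type")
```

The `__post_init__` check encodes the task hierarchy: fraud-type classification is only defined on samples the binary task labels as fraud.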

3. Evaluation Metrics and Scoring

Performance on TeleAntiFraud-Bench is quantified by task-specific weighted F₁-scores, composite averages, and explicit measures of reasoning quality:

  • Weighted F₁ (per task): for classes $i = 1, \dots, n$ with weights $w_i$ (proportional to true sample counts), precision $P_i$, and recall $R_i$,

$${\rm Weighted}\;F_1 = \sum_{i=1}^{n} w_i \cdot \frac{2 P_i R_i}{P_i + R_i}$$

  • Process Quality Score: Computed by rubric-based LLMs over three dimensions: logical rigor, practicality, and expression clarity, typically normalized on a scale (e.g., 0–5 per dimension).
  • Composite Score:

$${\rm Score}_{\text{total}} = 0.25 \times F_1^{\text{scene}} + 0.25 \times F_1^{\text{fraud}} + 0.25 \times F_1^{\text{type}} + 0.25 \times {\rm Score}_{\text{process}}$$

Additional metrics include thinking efficiency (TEM: F₁ divided by log reasoning-chain length), real-time inference latency, and dynamic risk assessment performance in stepwise audio analysis (Wang et al., 4 Jan 2026).
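The scoring above can be reproduced with a short script. The natural-log base for TEM and the common scaling of all composite components are assumptions of this sketch:

```python
import math
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: sum over classes of w_i * 2*P_i*R_i / (P_i + R_i),
    with w_i proportional to the true sample count of class i."""
    support = Counter(y_true)
    n = len(y_true)
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += (count / n) * f1
    return total

def composite_score(f1_scene, f1_fraud, f1_type, score_process):
    """Equal-weight composite: 0.25 times each component (assumed on one scale)."""
    return 0.25 * (f1_scene + f1_fraud + f1_type + score_process)

def thinking_efficiency(f1, chain_len_tokens):
    """TEM: F1 divided by the log of the reasoning-chain length
    (natural log assumed here)."""
    return f1 / math.log(chain_len_tokens)
```

`weighted_f1` matches the standard support-weighted average (e.g., scikit-learn's `average="weighted"` convention), so results are directly comparable to library implementations.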

4. Baseline Methods and State-of-the-Art Performance

TeleAntiFraud-Bench has been used as a testbed for evaluating both classical and modern machine learning frameworks, including multi-stage ASR → LLM pipelines and end-to-end multimodal large audio LLMs (LALMs). The most prominent baselines and their performance on the three main tasks are summarized below:

| Model/Pipeline | Scenario F₁ | Fraud F₁ | Type F₁ | AVG F₁ | Final Score |
|---|---|---|---|---|---|
| DeepSeek-R1 (ASR+LLM) | 83.60 | 79.25 | 85.16 | 82.67 | 62.17 |
| Qwen2.5-72B-Instruct | 78.31 | 51.44 | 81.24 | 70.33 | 52.87 |
| GPT-4o (multimodal) | 80.25 | 50.00 | 86.26 | 72.17 | — |
| AntiFraud-Qwen2Audio (SFT) | 81.31 | 84.78 | 82.91 | 83.00 | 62.36 |
| SAFE-QAQ (SAFE-LS, RL) | 84.64 | 90.20 | 87.25 | 87.49 | 65.76 |

The SAFE-QAQ framework (SAFE-LS variant) achieves absolute gains of +3.33 (scenario), +4.83 (fraud), +5.32 (fraud type) F₁ over the best SFT baseline, with a composite final score improvement of +3.40 points (Wang et al., 4 Jan 2026). These improvements derive from end-to-end audio processing immune to ASR degradation, explicit exploitation of paralinguistic and acoustic cues, and rule-based reinforcement learning targeting hierarchical reasoning depth.

5. Real-Time and Privacy-Preserving Benchmarking Extensions

TeleAntiFraud-Bench incorporates methodologies from privacy-preserving benchmarking to support secure evaluation of fraud detection algorithms on sensitive datasets. Adapting subsample-aggregate and synthetic-data approaches from the differential privacy literature (Goldberg et al., 30 Jul 2025), it can:

  • Partition the underlying (potentially graph-structured) data into $k$ subgraphs and aggregate per-partition metrics, reducing global query sensitivity and calibrating Laplace or Gaussian noise for $(\epsilon, \delta)$-DP guarantees.
  • Generate DP-synthetic graphs via stochastic block model (SBM) estimation with privacy-preserving noise injected into edge-count statistics; this supports offline evaluation without leaking raw data details.
  • Tailor DP mechanisms and partitioning rates using public surrogate graphs, and deploy noise-calibrated leaderboard and top-$m$ detector releases.

This apparatus enables TeleAntiFraud-Bench to support robust third-party detector benchmarking, mitigate re-identification or membership inference attacks, and facilitate challenge-leaderboard competitions in a privacy-responsible framework (Goldberg et al., 30 Jul 2025).
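A minimal sketch of the subsample-and-aggregate release described above, assuming per-partition metric values in [0, 1] so that the mean over k partitions has sensitivity 1/k; the partition counts and ε values are illustrative:

```python
import math
import random

def sample_laplace(scale, rng):
    """Draw one Laplace(0, scale) sample via the inverse CDF."""
    u = rng.random() - 0.5
    # Guard against log(0) at the distribution's boundary.
    u = min(max(u, -0.499999999), 0.499999999)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1 - 2 * abs(u))

def dp_aggregate_metric(per_partition_scores, epsilon, rng=None):
    """Subsample-and-aggregate with Laplace noise for epsilon-DP.

    Each partition's score is assumed to lie in [0, 1], so replacing one
    partition changes the mean over k partitions by at most 1/k -- the
    query sensitivity used to calibrate the noise scale."""
    rng = rng or random.Random()
    k = len(per_partition_scores)
    mean = sum(per_partition_scores) / k
    scale = (1.0 / k) / epsilon  # Laplace scale b = sensitivity / epsilon
    return mean + sample_laplace(scale, rng)
```

Larger k lowers the sensitivity and hence the noise needed at a fixed ε, at the cost of estimating each partition's metric on less data; the same calibration idea extends to Gaussian noise for (ε, δ)-DP.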

6. Applications, Limitations, and Ethical Considerations

TeleAntiFraud-Bench is deployed both as a public research benchmark and as a core element in live telecom fraud detection workflows (e.g., SAFE-QAQ deployment at 70,000 calls/day (Wang et al., 4 Jan 2026)). Applications include fine-tuning and benchmarking LALMs, analysis of slow-thinking reasoning pipelines, and dynamic real-time fraud alerting.

Limitations include the coverage constraints inherent to TeleAntiFraud-28k (currently the only large-scale, public, audio-text telecom-fraud corpus), difficulty adapting to novel scam tactics or extreme acoustic conditions, and the coarse granularity of the binary fraud labels. Annotation and evaluation protocols explicitly encourage ethical use; re-identification attempts and offensive misuse are prohibited by the dataset license, and all public releases are strictly anonymized or synthetic (Ma et al., 31 Mar 2025).

TeleAntiFraud-Bench is closely linked to adjacent benchmarks that target cloud resource abuse via telemetry data aggregations (e.g., privacy-friendly OpenStack telemetry-based fraud detection using Random Forests over nine rate features, 97.5% accuracy under 10-fold CV (Solanas et al., 2014)), and to node-level fraud detection on private call graphs with formal DP guarantees (Goldberg et al., 30 Jul 2025). Future extensions include integrating feature-rich temporal graph statistics, adaptive meta-classifier thresholds, and support for multi-speaker, overlapping, and other paralinguistic complexities.

