SAFE-QAQ: Slow-Thinking Fraud Detection
- SAFE-QAQ is a slow-thinking fraud detection framework that integrates reinforcement learning and multimodal audio-text analysis to identify fraudulent calls.
- Its architecture synergizes audio feature extraction, optional ASR logging, and a multimodal encoder to produce transparent, chain-of-thought outputs for incremental decision-making.
- Empirical results on TeleAntiFraud-Bench demonstrate high classification metrics and operational efficiency, reducing manual audits and financial losses.
SAFE-QAQ is an end-to-end, slow-thinking audio-text fraud detection framework that uses reinforcement learning and multimodal LLMs for robust, real-time detection in telephony scenarios. Designed to overcome the limitations of ASR-based pipelines—such as transcription errors and a lack of acoustic reasoning—SAFE-QAQ systematically integrates hierarchical reasoning, rule-based reinforcement learning, and phase-aware dynamic risk assessment to achieve high accuracy and operational throughput in detecting and classifying fraudulent behaviors in live calls (Wang et al., 4 Jan 2026).
1. System Architecture
SAFE-QAQ consists of a multilayered audio–text processing pipeline. The primary inputs are the raw audio waveform (e.g., telephony calls) and a task-specifying text prompt. Its core components include:
- Audio Feature Extraction: Log-Mel spectrograms, pitch, and energy are computed directly from the raw waveform.
- Optional ASR Decoder: A lightweight automatic speech recognition (ASR) module generates transcripts exclusively for logging purposes; the primary model's reasoning remains audio-centric, mitigating ASR-imposed error cascades.
- Multimodal Encoder: Using Qwen2-Audio-7B-Instruct as the backbone (termed “AntiFraud-Qwen2Audio”), both audio features and the prompt are tokenized and processed with a transformer architecture.
- Output Head: Produces an explicit chain-of-thought encapsulated in dedicated reasoning tags, as well as a final classification block wrapped in `<answer>…</answer>` tags. Outputs include a scenario label, a binary fraud decision, and a fine-grained fraud type.
- Chunked, Incremental Inference: Audio is segmented into conversational "turns" for incremental, real-time detection, allowing SAFE-QAQ to respond early within live conversational streams.
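The tagged output format described above can be consumed by a small parsing utility downstream. The sketch below is illustrative, not the released code: the exact tag layout and the `key: value` answer body are assumptions.

```python
import re

def parse_model_output(text: str) -> dict:
    """Split a SAFE-QAQ-style generation into its reasoning and answer parts.

    Assumes (illustratively) that free-form reasoning precedes an
    <answer>...</answer> block whose body lists predictions as 'key: value' lines.
    """
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> block found")
    reasoning = text[: match.start()].strip()
    fields = {}
    for line in match.group(1).strip().splitlines():
        key, _, value = line.partition(":")
        if value:
            fields[key.strip().lower()] = value.strip()
    return {"reasoning": reasoning, **fields}

demo = (
    "Caller pressure and urgency cues detected.\n"
    "<answer>\nscenario: phishing\nfraud: yes\ntype: impersonation\n</answer>"
)
print(parse_model_output(demo)["fraud"])  # prints "yes"
```

Keeping the answer in a machine-parsable block lets the free-form reasoning vary in length without breaking downstream classification logic.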
2. Hierarchical Slow-Thinking Reasoning
The framework operationalizes "slow thinking" (analogous to System 2 reasoning), requiring explicit multistep, hierarchical rationales rather than single-pass classification. Reasoning is organized in three levels:
- Scenario Classification: Distinguishes major threat genres (e.g., phishing, logistics scam).
- Fraud Detection: Binary determination (fraud/genuine).
- Fraud-Type Classification: Assigns one of seven fine-grained fraud types.
Each chain-of-thought is a structured sequence of reasoning snippets targeting specific audio-text cues (e.g., vocal tone, hesitation, environmental noise), which are incrementally assembled and leveraged by subsequent reasoning levels. This provides transparency and enables intermediate error tracing.
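The incremental assembly of reasoning snippets across the three levels can be sketched as follows. The keyword rules here are stand-ins for the model's learned reasoning, and all names are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningChain:
    """Incrementally assembled chain-of-thought (illustrative structure)."""
    snippets: list = field(default_factory=list)

    def add(self, cue: str, observation: str) -> None:
        # Each snippet ties an audio-text cue to an observation.
        self.snippets.append(f"[{cue}] {observation}")

def hierarchical_decision(chain: ReasoningChain) -> dict:
    """Toy three-level decision: each level reads the accumulated snippets."""
    text = " ".join(chain.snippets).lower()
    # Level 1: scenario classification (keyword rule stands in for the model).
    scenario = "phishing" if "account" in text else "other"
    # Level 2: binary fraud detection, conditioned on level 1.
    fraud = "urgency" in text or scenario == "phishing"
    # Level 3: fine-grained fraud type, conditioned on levels 1-2.
    fraud_type = "impersonation" if fraud and "bank" in text else ("unknown" if fraud else "n/a")
    return {"scenario": scenario, "fraud": fraud, "type": fraud_type}

chain = ReasoningChain()
chain.add("vocal tone", "caller applies urgency pressure")
chain.add("content", "requests bank account verification")
print(hierarchical_decision(chain))
```

Because each level only consumes the shared snippet list, a wrong final label can be traced back to the specific snippet or level that introduced the error.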
3. Reinforcement Learning Formulation
SAFE-QAQ models reasoning and prediction as a Markov decision process: at each timestep, the state comprises the current audio, the instruction, and the partial reasoning chain, and the policy autoregressively generates the output token sequence.
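This MDP view amounts to token-level decoding where each emitted token extends the state. A minimal sketch, with all field names assumed for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """MDP state at one decoding step (field names are illustrative)."""
    audio_id: str          # reference to the current audio segment
    instruction: str       # task-specifying prompt
    partial_chain: tuple   # reasoning tokens emitted so far

def step(state: State, action_token: str) -> State:
    """Appending one generated token (the action) yields the next state."""
    return State(state.audio_id, state.instruction,
                 state.partial_chain + (action_token,))

s0 = State("call_001", "classify fraud", ())
s1 = step(step(s0, "urgent"), "tone")
print(s1.partial_chain)  # prints ('urgent', 'tone')
```

The episode terminates when the policy closes the answer block, at which point the rule-based reward is computed on the full output sequence.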
Key reinforcement learning innovations include:
- Group Relative Policy Optimization (GRPO): Multiple candidate outputs are sampled per instance, each receiving a reward computed via explicit rule-based metrics.
- Reward Structure:
  - Accuracy: indicator reward for the correct label.
  - Format: proper use of the reasoning and `<answer>` tags.
  - Depth: scaled reward for deeper, more explicit chains of reasoning; shallow responses are penalized.
  - Phase: accuracy of early/late/final-phase recognition for risk localization in live calls.
- Objective: a GRPO surrogate objective combining intra-group advantage estimation with importance sampling to stabilize optimization.
SAFE-QAQ utilizes entirely rule-based rewards, with no learned reward critic, to enforce interpretable, auditable response structures.
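The two RL ingredients above can be sketched concretely: a rule-based composite reward with no learned critic, and GRPO-style advantages obtained by standardizing each candidate's reward within its sampled group. The reward weights and the depth cap below are illustrative assumptions, not the paper's values:

```python
import statistics

def rule_based_reward(pred: dict, gold: dict, chain_len: int,
                      format_ok: bool, phase_correct: bool) -> float:
    """Composite rule-based reward (weights are illustrative assumptions)."""
    r_acc = 1.0 if pred == gold else 0.0          # accuracy indicator
    r_fmt = 0.1 if format_ok else 0.0             # well-formed tags
    r_depth = 0.2 * min(chain_len / 8.0, 1.0)     # deeper chains, capped
    r_phase = 0.3 if phase_correct else 0.0       # correct phase recognition
    return r_acc + r_fmt + r_depth + r_phase

def group_advantages(rewards: list) -> list:
    """GRPO-style intra-group advantage: standardize rewards within a group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0       # avoid division by zero
    return [(r - mean) / std for r in rewards]

rewards = [1.5, 0.2, 1.1, 0.2]                    # rewards for one sampled group
advs = group_advantages(rewards)
print([round(a, 2) for a in advs])
```

Because the advantage is relative to the group's own mean, no value network is needed: candidates better than their siblings get positive advantage, worse ones negative, which is what allows the reward rules to stay fully auditable.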
4. Dynamic Risk Assessment and Real-Time Detection
A dynamic risk assessment module ("SAFE-Real") maintains phase awareness throughout live calls. Each conversational segment is tagged as early, late, or final, and appropriate outputs along with risk scores are generated at each stage. The reward structure incentivizes correct decisions as early as feasible, which is critical for real-time operational fraud mitigation. This phase-aware training enables both early warning and robust final detection, yielding substantial gains over static post-hoc analysis.
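A phase-aware streaming loop of this kind can be sketched as below. The phase thresholds, the running-maximum risk aggregation, and the alert threshold are all illustrative assumptions:

```python
def phase_of(turn_index: int, total_turns: int) -> str:
    """Tag a conversational turn as early / late / final (thresholds assumed)."""
    if turn_index == total_turns - 1:
        return "final"
    return "early" if turn_index < total_turns // 2 else "late"

def stream_risk(turn_scores: list, threshold: float = 0.8) -> tuple:
    """Return the (phase, turn index) at which cumulative risk first crosses
    the threshold; cumulative risk here is a running max of per-turn scores."""
    total = len(turn_scores)
    risk = 0.0
    for i, score in enumerate(turn_scores):
        risk = max(risk, score)
        if risk >= threshold:
            return phase_of(i, total), i   # early alert: stop scanning here
    return "final", total - 1              # no alert fired before call end

print(stream_risk([0.1, 0.3, 0.85, 0.9, 0.95]))  # prints ('late', 2)
```

Rewarding the phase label at which the alert fires is what pushes the model toward flagging fraud in the early segments rather than only at call end.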
5. Multistage Training and Inference Strategies
SAFE-QAQ employs a three-stage RL training protocol:
- SAFE-RL: Initial rule-based RL on complete transcript data, teaching the model slow-thinking chains.
- SAFE-RS and SAFE-LS: Apply rejection sampling and length-constrained RL to prefer the shortest correct reasoning chains among the sampled candidates, penalizing superfluous token usage with a logarithmic length penalty.
- SAFE-Real: Train on chunked conversational turns with phase recognition rewards; enables rapid, early-stage inference in live-streamed contexts.
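The rejection-sampling stage with a logarithmic length penalty can be sketched as scoring each candidate and keeping the best; the penalty coefficient `alpha` and the candidate record layout are assumptions for illustration:

```python
import math

def length_penalized_score(correct: bool, num_tokens: int,
                           alpha: float = 0.1) -> float:
    """Score a candidate chain: correctness minus a logarithmic length penalty.

    Incorrect chains are rejected outright (score of -inf).
    """
    if not correct:
        return float("-inf")
    return 1.0 - alpha * math.log(1 + num_tokens)

def select_candidate(candidates: list) -> dict:
    """Among sampled candidates, keep the best-scoring one; because the
    penalty grows with length, this is the shortest correct chain."""
    return max(candidates,
               key=lambda c: length_penalized_score(c["correct"], c["tokens"]))

cands = [
    {"id": "a", "correct": True, "tokens": 240},   # correct but verbose
    {"id": "b", "correct": True, "tokens": 90},    # correct and concise
    {"id": "c", "correct": False, "tokens": 30},   # short but wrong: rejected
]
print(select_candidate(cands)["id"])  # prints "b"
```

The logarithmic shape penalizes the jump from short to medium chains more than medium to long ones, so concise-but-complete reasoning is favored without collapsing chains to nothing.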
During inference, SAFE-QAQ demonstrates high efficiency (SAFE-LS: p50 = 916 ms latency, SAFE-Real: 8.98 s per segment, 81.4% faster than baseline) while preserving SOTA performance.
6. Empirical Performance and Ablation Results
Evaluated on the TeleAntiFraud-Bench, SAFE-QAQ achieves state-of-the-art classification metrics:
- SAFE-LS: F1 (Scenario) = 84.64, F1 (Fraud) = 89.61, F1 (Type) = 88.23, AVG F1 = 87.49, composite final score = 65.76.
- SAFE-Real: Real-time deployment, F1 (Scenario) = 91.40, F1 (Fraud) = 88.93, F1 (Type) = 77.56, AVG F1 = 85.96.
Ablation studies demonstrate performance and reasoning quality deteriorate when slow-thinking or depth rewards are omitted. Each RL stage adds measurable gains, especially in chain-of-thought conciseness and decision quality.
7. Deployment and Operational Impact
SAFE-QAQ, specifically its real-time optimized SAFE-Real variant, is deployed at scale within a major telecom operator, screening over 70,000 calls per day:
- Automation: Complex fraud cases are flagged prior to human analysis.
- Impact: ~50% reduction in manual audit workload; substantial decline in fraud-related financial losses.
This demonstrates the operational viability and scalability of slow-thinking, RL-driven multimodal detection in real-world, high-volume telephony fraud contexts (Wang et al., 4 Jan 2026).