FT-GRPO for Audio Deepfake Detection
- FT-GRPO is a two-stage training paradigm that integrates supervised fine-tuning with reinforcement learning under frequency-time constraints.
- The method leverages high-quality demonstrations and LoRA adapter techniques to provide structured, interpretable chain-of-thought rationales.
- It improves model accuracy and transparency by enforcing explicit frequency and time domain analyses in audio deepfake classifications.
Frequency Time-Group Relative Policy Optimization (FT-GRPO) is a structured two-stage training paradigm developed for interpretable all-type audio deepfake detection (ADD) using Audio LLMs (ALLMs). FT-GRPO addresses deficiencies of standard supervised and reinforcement fine-tuning approaches by training ALLMs to produce not only accurate real/fake classifications, but also frequency-time grounded, chain-of-thought (CoT) rationales. The method exploits high-quality demonstrations and enforces explicit rule-based frequency-time constraints during reinforcement learning, thereby mitigating common issues such as interpretability collapse and reward hacking in sparse supervision regimes (Xie et al., 6 Jan 2026).
1. Motivation and Limitations of Standard Methods
ALLMs applied to ADD face two primary methodological challenges:
- Supervised Fine-Tuning (SFT) Limitation: Conventional SFT on binary real/fake labels reduces model behavior to a black-box classifier, precluding transparent or granular interpretability of model decisions.
- Vanilla Reinforcement Fine-Tuning (RFT) Limitation: RFT under sparse reward structures—typically rewarding only superficial format compliance or final prediction—enables the model to generate ungrounded or hallucinated rationales simply to maximize reward, a phenomenon termed reward hacking.
These limitations undermine both the trustworthiness and actionable value of ALLM-based ADD systems, particularly for real-world deployment where interpretability and generalization across heterogeneous audio types (speech, singing, environmental sounds, music) are critical.
2. Two-Stage Training Framework
FT-GRPO is organized into two sequential stages:
- Supervised Fine-Tuning (Cold Start): ALLMs are initialized with high-quality, frequency-time structured chain-of-thought (FT-CoT) rationales, where demonstration data include explicit analyses in both the frequency and time domains for each audio clip. The model architecture comprises an audio encoder, aligner, and LLM, with parameter-efficient LoRA adapters (rank $r$, scaling factor $\alpha$) injected at all stages; a minimal adapter sketch is given after this list. Training is performed on the “think” dataset subset, optimizing the conditional log-likelihood over demonstration sequences.
- Frequency Time-Group Relative Policy Optimization (FT-GRPO): After SFT, group relative policy optimization is conducted under frequency-time constraints. FT-GRPO optimizes a composite reward structure, explicitly encouraging (i) prediction accuracy, (ii) output format compliance (wrapping rationales/answers in specified tags), and (iii) complete, grounded FT reasoning by requiring inclusion of both frequency and time domain arguments in each rationale. The process includes group-based sampling and advantage estimation for stable reinforcement learning dynamics.
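To make the LoRA injection in the cold-start stage concrete, the following is a minimal PyTorch sketch of an adapter wrapping a frozen linear layer. It is an illustration only: the rank, scaling, and layer dimensions are assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (illustrative values)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0,
                 dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapter starts as a zero update
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(self.dropout(x)))


# Usage: wrap one projection layer; only the low-rank factors receive gradients.
layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 1024))
```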
3. FT-GRPO Methodology
3.1 Markov Decision Process Formulation
FT-GRPO is framed as a single-step episodic Markov decision process (MDP):
- State $s$: the input audio $x$ and the prompt prefix.
- Action $a$: emission of the structured output $o = (c, \hat{y})$, where $c$ is the FT-CoT rationale and $\hat{y} \in \{\text{real}, \text{fake}\}$ is the predicted label.
- Transition: the episode terminates after the output is emitted, with no further state transitions.
- Policy $\pi_\theta$: parameterizes the output distribution $\pi_\theta(o \mid x)$.
- Reference policy $\pi_{\text{ref}}$: a fixed snapshot taken after SFT, enabling KL regularization.
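As a concrete illustration of this single-step episode, here is a small Python sketch of the state/action structure and of splitting an emitted action into rationale and label. The `Episode` container and `parse_action` helper are our own names, and the tag format follows the example output in Section 7.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Episode:
    """One single-step episode: the state is (audio, prompt); the action is one completion."""
    audio_path: str   # input audio (state)
    prompt: str       # instruction prefix (state)
    output: str       # sampled action: "<think>...</think><answer>...</answer>"


def parse_action(output: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a completion into (FT-CoT rationale c, predicted label y_hat)."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    rationale = think.group(1).strip() if think else None
    label = answer.group(1).strip().lower() if answer else None
    return rationale, label


# The episode ends after this single emission; no further transitions occur.
rationale, label = parse_action("<think><Frequency Domain> ... </think><answer>fake</answer>")
```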
3.2 Frequency-Time Constraints and Reward Composition
Two explicit meta-tags are defined: Frequency Domain and Time Domain. Each CoT rationale must include both tags, each followed by at least one grammatically complete sentence, enforced via a completeness check function $\mathcal{C}(\cdot)$. The composite reward for output $o$ on input $x$ with ground-truth label $y$ comprises:
- Accuracy reward: $r_{\text{acc}} = 1$ if $\hat{y} = y$, else $0$.
- Format reward: $r_{\text{fmt}} = \mathbb{1}[\mathrm{fmt}(o)]$, where $\mathrm{fmt}(\cdot)$ checks structural tag compliance.
- FT reasoning reward: $r_{\text{ft}} = \mathbb{1}[\mathcal{C}(c_{\text{freq}})]\,\mathbb{1}[\mathcal{C}(c_{\text{time}})]$, taking values in $\{0, 1\}$.
- Total reward: $r = r_{\text{acc}} + r_{\text{fmt}} + r_{\text{ft}}$, with $r \in [0, 3]$.
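The reward composition can be sketched as follows. This is an illustrative Python reading of the description above, not the authors' implementation: the sentence-completeness heuristic and the unweighted sum of the three terms are assumptions.

```python
import re


def completeness(segment: str) -> bool:
    """Stand-in for the completeness check C(.): at least one sentence ending in . ! or ?"""
    return bool(re.search(r"\w[^.!?]*[.!?]", segment))


def ft_grpo_reward(output: str, gold_label: str) -> float:
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(real|fake)</answer>", output)

    r_fmt = 1.0 if (think and answer) else 0.0                          # format compliance
    r_acc = 1.0 if (answer and answer.group(1) == gold_label) else 0.0  # prediction accuracy

    r_ft = 0.0                                                          # grounded FT reasoning
    if think:
        body = think.group(1)
        freq = re.search(r"<Frequency Domain>(.*?)(?=<Time Domain>|$)", body, re.DOTALL)
        time = re.search(r"<Time Domain>(.*)", body, re.DOTALL)
        if freq and time and completeness(freq.group(1)) and completeness(time.group(1)):
            r_ft = 1.0

    return r_acc + r_fmt + r_ft  # unweighted sum: an assumption, not a reported weighting


print(ft_grpo_reward(
    "<think><Frequency Domain> Harmonics above 8 kHz are unnaturally uniform. "
    "<Time Domain> A hard amplitude jump occurs at 1.2 s.</think><answer>fake</answer>",
    "fake",
))  # 3.0
```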
3.3 Group Relative Policy Optimization
For each audio instance, a group of $G$ outputs $\{o_1, \dots, o_G\}$ is sampled from the current policy, with rewards $r_1, \dots, r_G$:
- Group mean: $\mu = \frac{1}{G}\sum_{i=1}^{G} r_i$
- Group std: $\sigma = \sqrt{\frac{1}{G}\sum_{i=1}^{G}\left(r_i - \mu\right)^2}$
- Advantage per sample: $A_i = \frac{r_i - \mu}{\sigma + \epsilon}$

The optimization objective per batch takes the standard group-relative form

$$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\text{old}}}(o_i \mid x)}\, A_i\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),$$

where $\beta$ is the KL penalty strength (set via a warmup ratio of $0.05$).
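A schematic implementation of the group-relative advantage and a simplified policy loss is shown below. It omits token-level ratios and any clipping, uses a crude sampled KL surrogate, and the KL coefficient value is an illustrative assumption.

```python
import torch


def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), the rewards of the G outputs sampled for one audio clip."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)


def grpo_loss(logp_new: torch.Tensor,   # (G,) sequence log-probs under pi_theta
              logp_old: torch.Tensor,   # (G,) under the sampling (old) policy
              logp_ref: torch.Tensor,   # (G,) under the frozen post-SFT reference
              rewards: torch.Tensor,
              beta: float = 0.04) -> torch.Tensor:   # beta value is illustrative
    adv = group_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old.detach())   # importance ratio (no clipping here)
    policy_term = -(ratio * adv).mean()               # maximize advantage-weighted likelihood
    kl_term = (logp_new - logp_ref.detach()).mean()   # crude sampled KL surrogate
    return policy_term + beta * kl_term


# Example with a group of G = 8 completions for one clip:
G = 8
loss = grpo_loss(torch.randn(G, requires_grad=True), torch.randn(G),
                 torch.randn(G), torch.rand(G) * 3.0)
loss.backward()
```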
3.4 Algorithmic and Implementation Notes
- LoRA adapters: rank $r$, scaling factor $\alpha$, dropout $0.05$.
- Learning rate: the same value is used for both stages.
- Batch size: $16$ (SFT), $32$ (GRPO).
- Group size: $G$ sampled outputs per instance, sampling temperature $0.9$.
- Epochs: SFT ($2$–$3$), GRPO ($2$).
- Framework: ms-swift + DeepSpeed ZeRO-2, bfloat16 precision.
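For reference, the reported settings consolidate into the following plain-Python configuration sketch; entries set to `None` are not specified in this summary and are placeholders only, not the authors' values.

```python
# Hyperparameters gathered from the list above; None marks values not given here.
TRAIN_CONFIG = {
    "sft": {
        "epochs": "2-3",
        "batch_size": 16,
        "lora_dropout": 0.05,
        "lora_rank": None,        # not recoverable from this summary
        "lora_alpha": None,       # not recoverable from this summary
    },
    "ft_grpo": {
        "epochs": 2,
        "batch_size": 32,
        "group_size": None,       # G, not recoverable from this summary
        "sampling_temperature": 0.9,
        "kl_warmup_ratio": 0.05,
    },
    "learning_rate": None,        # shared across both stages; value not given here
    "precision": "bfloat16",      # ms-swift + DeepSpeed ZeRO-2
}
```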
4. Experimental Evaluation
FT-GRPO demonstrates consistent improvements over both SFT and prior state-of-the-art ADD techniques. For a speech-trained Qwen2.5-Omni-3B ALLM, the following accuracy (ACC) improvements were observed after FT-GRPO compared to SFT:
| Audio Type | SFT ACC (%) | FT-GRPO ACC (%) | Δ |
|---|---|---|---|
| Speech | 99.04 | 99.75 | +0.71 |
| Sound | 73.31 | 75.05 | +1.74 |
| Singing | 66.29 | 84.26 | +17.97 |
| Music | 91.16 | 90.44 | -0.72 |
| Average | 82.45 | 87.38 | +4.93 |
Additionally, FT-GRPO surpasses alternative models, including W2V2-AASIST (avg. 63.5%), WPT-W2V2-AASIST (69.2%), and ALLM4ADD (85.8%). In all-type co-training, FT-GRPO yields an average ACC of 90.10% (+5.15 over SFT) (Xie et al., 6 Jan 2026).
5. Interpretability via Frequency-Time Grounded Chain-of-Thought
By combining rule-based FT constraints with explicit reward structure during GRPO, FT-GRPO ensures that every output rationale contains:
- At least one frequency-domain analysis, explicitly tagged and evidenced (e.g., “The spectrogram shows elevated energy around 8–10 kHz with uniform harmonics—an indicator of neural vocoder aliasing”).
- At least one time-domain analysis, explicitly tagged and evidenced (e.g., “There is a sudden amplitude jump at 1.2 s without natural fade-in/out, revealing a generated segment boundary”).
This FT-grounded CoT structure renders the decision process transparent. Practitioners can verify rationale validity alongside the real/fake output, substantially increasing trust and verifiability for deployment in critical domains.
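A minimal sketch of such a rule-based check is shown below, under the assumption that "grammatically complete" is approximated by terminal punctuation; the tag spellings follow the example output in Section 7.

```python
import re

REQUIRED_TAGS = ("<Frequency Domain>", "<Time Domain>")


def satisfies_ft_constraints(rationale: str) -> bool:
    """Return True iff both required tags appear, each followed by >= 1 sentence."""
    for tag in REQUIRED_TAGS:
        idx = rationale.find(tag)
        if idx < 0:
            return False                      # missing tag: constraint violated
        tail = rationale[idx + len(tag):]
        for other in REQUIRED_TAGS:           # cut the segment at the next tag, if any
            cut = tail.find(other)
            if cut >= 0:
                tail = tail[:cut]
        if not re.search(r"\w[^.!?]*[.!?]", tail):
            return False                      # no sentence-like span after the tag
    return True


print(satisfies_ft_constraints(
    "<Frequency Domain> Energy above 8 kHz is unnaturally uniform. "
    "<Time Domain> An abrupt amplitude jump at 1.2 s has no natural fade."
))  # True
```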
6. Implications and Significance
FT-GRPO establishes a robust paradigm for interpretable, all-type ADD by integrating strong SFT initialization with structured, constraint-driven reinforcement learning. The methodology mitigates black-box behavior and reward hacking, and enforces output rationales traceable to observable acoustic phenomena in both frequency and time domains.
A plausible implication is that FT-GRPO's constraint-based rationale structuring may generalize to other domains requiring transparent multi-modal LLM predictions. This suggests broader adoption in interpretable AI for safety-critical applications.
7. Example FT-CoT Output
An indicative post-GRPO rationale for a fake audio instance:
    <think>
    <Frequency Domain> The spectrogram shows elevated energy around 8–10 kHz with uniform harmonics—an indicator of neural vocoder aliasing.
    <Time Domain> There is a sudden amplitude jump at 1.2 s without natural fade-in/out, revealing a generated segment boundary.
    </think>
    <answer>fake</answer>
This output exemplifies the completeness and interpretability guaranteed by FT-GRPO for each ADD decision (Xie et al., 6 Jan 2026).