FT-GRPO for Audio Deepfake Detection

Updated 13 January 2026
  • FT-GRPO is a two-stage training paradigm that integrates supervised fine-tuning with reinforcement learning under frequency-time constraints.
  • The method leverages high-quality demonstrations and LoRA adapter techniques to provide structured, interpretable chain-of-thought rationales.
  • It improves model accuracy and transparency by enforcing explicit frequency and time domain analyses in audio deepfake classifications.

Frequency Time-Group Relative Policy Optimization (FT-GRPO) is a structured two-stage training paradigm developed for interpretable all-type audio deepfake detection (ADD) using Audio LLMs (ALLMs). FT-GRPO addresses deficiencies of standard supervised and reinforcement fine-tuning approaches by training ALLMs to produce not only accurate real/fake classifications, but also frequency-time grounded, chain-of-thought (CoT) rationales. The method exploits high-quality demonstrations and enforces explicit rule-based frequency-time constraints during reinforcement learning, thereby mitigating common issues such as interpretability collapse and reward hacking in sparse supervision regimes (Xie et al., 6 Jan 2026).

1. Motivation and Limitations of Standard Methods

ALLMs applied to ADD face two primary methodological challenges:

  • Supervised Fine-Tuning (SFT) Limitation: Conventional SFT on binary real/fake labels reduces model behavior to a black-box classifier, precluding transparent or granular interpretability of model decisions.
  • Vanilla Reinforcement Fine-Tuning (RFT) Limitation: RFT under sparse reward structures—typically rewarding only superficial format compliance or final prediction—enables the model to generate ungrounded or hallucinated rationales simply to maximize reward, a phenomenon termed reward hacking.

These limitations undermine both the trustworthiness and actionable value of ALLM-based ADD systems, particularly for real-world deployment where interpretability and generalization across heterogeneous audio types (speech, singing, environmental sounds, music) are critical.

2. Two-Stage Training Framework

FT-GRPO is organized into two sequential stages:

  1. Supervised Fine-Tuning (Cold Start): ALLMs are initialized on high-quality, frequency-time structured chain-of-thought (FT-CoT) rationales, where the demonstration data include explicit analysis in both the frequency and time domains for each audio clip. The model architecture comprises an audio encoder, aligner, and LLM, with parameter-efficient LoRA adapters injected at all stages (rank $r=64$, $\alpha=16$). Training is performed on the “think” dataset subset, optimizing the conditional log-likelihood over demonstration sequences (a minimal sketch of this objective follows the list).
  2. Frequency Time-Group Relative Policy Optimization (FT-GRPO): After SFT, group relative policy optimization is conducted under frequency-time constraints. FT-GRPO optimizes a composite reward structure, explicitly encouraging (i) prediction accuracy, (ii) output format compliance (wrapping rationales/answers in specified tags), and (iii) complete, grounded FT reasoning by requiring inclusion of both frequency and time domain arguments in each rationale. The process includes group-based sampling and advantage estimation for stable reinforcement learning dynamics.
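The stage-1 cold start amounts to standard causal-LM cross-entropy over the FT-CoT demonstration tokens. The following is a minimal PyTorch sketch of one SFT step; the model interface, batch fields, and the use of label value -100 to mask prompt/audio positions are illustrative assumptions, not the paper's implementation.

import torch.nn.functional as F

def sft_step(model, batch):
    # batch["input_ids"]: tokenized prompt + FT-CoT demonstration
    # batch["labels"]: copy of input_ids with prompt positions set to -100
    # batch["audio_features"]: output of the (hypothetical) encoder/aligner
    logits = model(input_ids=batch["input_ids"],
                   audio_features=batch["audio_features"]).logits
    # Shift so that position t predicts token t+1 (standard causal-LM loss);
    # ignore_index=-100 excludes prompt/audio positions from the objective.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           batch["labels"][:, 1:].reshape(-1),
                           ignore_index=-100)
    return loss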

3. FT-GRPO Methodology

3.1 Markov Decision Process Formulation

FT-GRPO is framed as a single-step episodic Markov decision process (MDP); a minimal episode sketch follows the list below:

  • State $s$: the input audio $a$ and prompt prefix.
  • Action $a_t$: emission of the structured output $o = (\tau, y_\mathrm{pred})$, where $\tau$ is the FT-CoT rationale and $y_\mathrm{pred} \in \{\text{real}, \text{fake}\}$.
  • Transition: the episode terminates after the output is emitted; there are no further state transitions.
  • Policy $\pi_\theta$: parameterizes $P(o \mid a)$.
  • Reference policy $\pi_\mathrm{ref}$: a fixed post-SFT snapshot, enabling KL regularization.
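Because the MDP is single-step, a rollout reduces to one generation followed by one reward evaluation. The sketch below illustrates only this structure; `policy.generate`, `prompt`, and `reward_fn` are hypothetical names, not APIs from the paper.

from dataclasses import dataclass

@dataclass
class Episode:
    audio_path: str   # state: input audio a
    prompt: str       # state: prompt prefix
    output: str       # action: full tagged output o = (rationale, prediction)
    reward: float     # terminal reward r(o, y)

def run_episode(policy, audio_path, prompt, label, reward_fn):
    # Sample o ~ pi_theta(. | a); the episode ends immediately afterwards.
    output = policy.generate(audio_path, prompt)
    return Episode(audio_path, prompt, output, reward_fn(output, label))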

3.2 Frequency-Time Constraints and Reward Composition

Two explicit meta-tags are defined: ⟨Frequency Domain⟩ and ⟨Time Domain⟩. Each CoT rationale must include both tags, each followed by at least one grammatically complete sentence, enforced via a completeness-check function $g(a_i)$. The composite reward for output $o = (\tau, y_\mathrm{pred})$ on a labeled pair $(a, y)$ is (a minimal reward sketch follows the list):

  • Accuracy reward: $r_\mathrm{acc} = 1$ if $y_\mathrm{pred} = y$, else $0$.
  • Format reward: $r_\mathrm{fmt} = \mathbb{I}(\mathcal{F}(o))$, where $\mathcal{F}(o)$ checks structural tag compliance.
  • FT reasoning reward: $r_\mathrm{ft} = \tfrac{1}{2} \sum_{i \in \{\mathrm{FD},\,\mathrm{TD}\}} \mathbb{I}\bigl(a_i \in \tau \land g(a_i) = \text{True}\bigr)$, taking values in $\{0, 0.5, 1.0\}$.
  • Total reward: $r = r_\mathrm{acc} + \alpha\, r_\mathrm{fmt} + \beta\, r_\mathrm{ft}$, with $\alpha = \beta = 0.1$.
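A minimal sketch of this composite reward is given below. The ASCII tag spellings, the regular expressions, and the `is_complete_sentence` heuristic are simplifying assumptions standing in for the paper's $\mathcal{F}(o)$ and $g(a_i)$ checks.

import re

ALPHA = BETA = 0.1  # weights for the format and FT-reasoning terms

def is_complete_sentence(text: str) -> bool:
    # Crude stand-in for g(a_i): a few words ending in sentence punctuation.
    return len(text.split()) >= 4 and text.strip().endswith((".", "!", "?"))

def ft_grpo_reward(output: str, label: str) -> float:
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>\s*(real|fake)\s*</answer>", output)
    r_fmt = 1.0 if (think and answer) else 0.0                     # format reward
    r_acc = 1.0 if (answer and answer.group(1) == label) else 0.0  # accuracy reward
    r_ft = 0.0                                                     # FT reasoning reward
    if think:
        for tag in ("Frequency Domain", "Time Domain"):
            m = re.search(rf"<{tag}>\s*([^<]+)", think.group(1))
            if m and is_complete_sentence(m.group(1)):
                r_ft += 0.5   # yields values in {0, 0.5, 1.0}
    return r_acc + ALPHA * r_fmt + BETA * r_ft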

3.3 Group Relative Policy Optimization

For each audio instance, a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ is sampled:

  • Group mean: $\bar r = \frac{1}{G} \sum_{i} r_i$
  • Group standard deviation: $\sigma_r = \mathrm{std}(\{r_i\}) + \epsilon$
  • Advantage per sample: $A_i = \dfrac{r_i - \bar r}{\sigma_r}$

The optimization objective per batch is:

\mathcal{L}_{\mathrm{FT\text{-}GRPO}}(\theta) = -\,\mathbb{E}_{a}\Biggl[\sum_{i=1}^{G} A_i \log \pi_\theta(o_i \mid a)\Biggr] + \lambda\,\mathbb{E}_{a}\Bigl[D_{\mathrm{KL}}\bigl(\pi_{\mathrm{ref}}(\cdot \mid a)\,\big\|\,\pi_\theta(\cdot \mid a)\bigr)\Bigr]

where $\lambda$ is the KL penalty strength (set via a warmup ratio of $0.05$).
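The sketch below illustrates the group-relative advantage computation and a sample-based surrogate for the regularized objective. `logprob_of_output` is a hypothetical helper returning the summed token log-probability of one output under a given model, the KL term is a crude Monte Carlo estimate rather than the exact expectation, and the value of `lam` is illustrative, since the paper reports only a warmup ratio of 0.05 for the penalty schedule.

import torch

def grpo_loss(policy, ref_policy, audio, outputs, rewards,
              logprob_of_output, lam=0.01, eps=1e-4):
    # Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps).
    r = torch.tensor(rewards, dtype=torch.float32)
    adv = (r - r.mean()) / (r.std() + eps)

    # Summed token log-probs of each sampled output under the current and reference policies.
    logp = torch.stack([logprob_of_output(policy, audio, o) for o in outputs])
    with torch.no_grad():
        logp_ref = torch.stack([logprob_of_output(ref_policy, audio, o) for o in outputs])

    pg_term = -(adv * logp).sum()          # policy-gradient term
    kl_term = (logp - logp_ref).mean()     # sample-based KL surrogate
    return pg_term + lam * kl_term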

3.4 Algorithmic and Implementation Notes

  • LoRA adapters: rank $r=64$, $\alpha=16$, dropout $0.05$.
  • Learning rate: $1 \times 10^{-5}$ for both stages.
  • Batch size: $16$ (SFT), $32$ (GRPO).
  • Group size: $G=8$, sampling temperature $0.9$.
  • Epochs: SFT ($2$–$3$), GRPO ($2$).
  • Framework: ms-swift + DeepSpeed ZeRO-2, bfloat16 precision. A hedged configuration sketch follows this list.
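For reference, the reported hyperparameters can be collected into a configuration sketch. The example below uses Hugging Face peft for the adapter definition purely as an illustration; the paper trains with ms-swift, and the `target_modules` list is an assumption, not taken from the paper.

from peft import LoraConfig

lora_cfg = LoraConfig(
    r=64,                       # LoRA rank (paper)
    lora_alpha=16,              # scaling alpha (paper)
    lora_dropout=0.05,          # dropout (paper)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not from the paper
    task_type="CAUSAL_LM",
)

train_cfg = {
    "learning_rate": 1e-5,       # both stages
    "sft_batch_size": 16,
    "grpo_batch_size": 32,
    "group_size": 8,             # G
    "sampling_temperature": 0.9,
    "sft_epochs": 3,             # paper reports 2-3
    "grpo_epochs": 2,
    "precision": "bfloat16",
}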

4. Experimental Evaluation

FT-GRPO demonstrates consistent improvements over both SFT and prior state-of-the-art ADD techniques. For a speech-trained Qwen2.5-Omni-3B ALLM, the following accuracy (ACC) results were observed for FT-GRPO compared to SFT:

Audio Type    SFT ACC (%)    FT-GRPO ACC (%)    Δ (pp)
Speech        99.04          99.75              +0.71
Sound         73.31          75.05              +1.74
Singing       66.29          84.26              +17.97
Music         91.16          90.44              -0.72
Average       82.45          87.38              +4.93

Additionally, FT-GRPO surpasses alternative models, including W2V2-AASIST (avg. 63.5%), WPT-W2V2-AASIST (69.2%), and ALLM4ADD (85.8%). In all-type co-training, FT-GRPO yields an average ACC of 90.10% (+5.15 over SFT) (Xie et al., 6 Jan 2026).

5. Interpretability via Frequency-Time Grounded Chain-of-Thought

By combining rule-based FT constraints with explicit reward structure during GRPO, FT-GRPO ensures that every output rationale contains:

  • At least one frequency-domain analysis, explicitly tagged and evidenced (e.g., “The spectrogram shows elevated energy around 8–10 kHz with uniform harmonics—an indicator of neural vocoder aliasing”).
  • At least one time-domain analysis, explicitly tagged and evidenced (e.g., “There is a sudden amplitude jump at 1.2 s without natural fade-in/out, revealing a generated segment boundary”).

This FT-grounded CoT structure renders the decision process transparent: practitioners can verify the validity of the rationale alongside the real/fake output, substantially increasing trust and verifiability in critical application domains.
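As an illustration of such verification, the sketch below extracts the prediction and the two tagged analyses from a model output for manual inspection; the ASCII tag spellings are a simplifying assumption, mirroring the reward sketch above.

import re

def parse_ft_cot(output: str) -> dict:
    # Pull out the final answer and the two tagged rationale sentences.
    answer = re.search(r"<answer>\s*(real|fake)\s*</answer>", output)
    freq = re.search(r"<Frequency Domain>\s*([^<]+)", output)
    time_ = re.search(r"<Time Domain>\s*([^<]+)", output)
    return {
        "prediction": answer.group(1) if answer else None,
        "frequency_rationale": freq.group(1).strip() if freq else None,
        "time_rationale": time_.group(1).strip() if time_ else None,
    }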

6. Implications and Significance

FT-GRPO establishes a robust paradigm for interpretable, all-type ADD by integrating strong SFT initialization with structured, constraint-driven reinforcement learning. The methodology mitigates black-box behavior and reward hacking, and enforces output rationales traceable to observable acoustic phenomena in both frequency and time domains.

A plausible implication is that FT-GRPO's constraint-based rationale structuring may generalize to other domains requiring transparent multi-modal LLM predictions. This suggests broader adoption in interpretable AI for safety-critical applications.

7. Example FT-CoT Output

An indicative post-GRPO rationale for a fake audio instance:

⟨think⟩
⟨Frequency Domain⟩ The spectrogram shows elevated energy around 8–10 kHz with uniform harmonics—an indicator of neural vocoder aliasing.
⟨Time Domain⟩ There is a sudden amplitude jump at 1.2 s without natural fade-in/out, revealing a generated segment boundary.
⟨/think⟩
⟨answer⟩fake⟨/answer⟩

This output exemplifies the completeness and interpretability guaranteed by FT-GRPO for each ADD decision (Xie et al., 6 Jan 2026).
