FT-GRPO for Audio Deepfake Detection

Updated 13 January 2026
  • FT-GRPO is a two-stage training paradigm that integrates supervised fine-tuning with reinforcement learning under frequency-time constraints.
  • The method leverages high-quality demonstrations and LoRA adapter techniques to provide structured, interpretable chain-of-thought rationales.
  • It improves model accuracy and transparency by enforcing explicit frequency and time domain analyses in audio deepfake classifications.

Frequency Time-Group Relative Policy Optimization (FT-GRPO) is a structured two-stage training paradigm developed for interpretable all-type audio deepfake detection (ADD) using Audio LLMs (ALLMs). FT-GRPO addresses deficiencies of standard supervised and reinforcement fine-tuning approaches by training ALLMs to produce not only accurate real/fake classifications, but also frequency-time grounded, chain-of-thought (CoT) rationales. The method exploits high-quality demonstrations and enforces explicit rule-based frequency-time constraints during reinforcement learning, thereby mitigating common issues such as interpretability collapse and reward hacking in sparse supervision regimes (Xie et al., 6 Jan 2026).

1. Motivation and Limitations of Standard Methods

ALLMs applied to ADD face two primary methodological challenges:

  • Supervised Fine-Tuning (SFT) Limitation: Conventional SFT on binary real/fake labels reduces model behavior to a black-box classifier, precluding transparent or granular interpretability of model decisions.
  • Vanilla Reinforcement Fine-Tuning (RFT) Limitation: RFT under sparse reward structures—typically rewarding only superficial format compliance or final prediction—enables the model to generate ungrounded or hallucinated rationales simply to maximize reward, a phenomenon termed reward hacking.

These limitations undermine both the trustworthiness and actionable value of ALLM-based ADD systems, particularly for real-world deployment where interpretability and generalization across heterogeneous audio types (speech, singing, environmental sounds, music) are critical.

2. Two-Stage Training Framework

FT-GRPO is organized into two sequential stages:

  1. Supervised Fine-Tuning (Cold Start): ALLMs are initialized on high-quality, frequency-time structured chain-of-thought (FT-CoT) rationales, where the demonstration data include explicit analysis in both the frequency and time domains for each audio clip. The model architecture comprises an audio encoder, aligner, and LLM, with parameter-efficient LoRA adapters injected at all stages (rank $r=64$, $\alpha=16$). Training is performed on the “think” dataset subset, optimizing the conditional log-likelihood over demonstration sequences (a minimal sketch of this objective follows the list).
  2. Frequency Time-Group Relative Policy Optimization (FT-GRPO): After SFT, group relative policy optimization is conducted under frequency-time constraints. FT-GRPO optimizes a composite reward structure, explicitly encouraging (i) prediction accuracy, (ii) output format compliance (wrapping rationales/answers in specified tags), and (iii) complete, grounded FT reasoning by requiring inclusion of both frequency and time domain arguments in each rationale. The process includes group-based sampling and advantage estimation for stable reinforcement learning dynamics.
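The stage-1 cold start amounts to standard causal-LM cross-entropy over the FT-CoT demonstration tokens. The following is a minimal PyTorch sketch of one SFT step; the model interface, batch fields, and the use of label value -100 to mask prompt/audio positions are illustrative assumptions, not the paper's implementation.

import torch.nn.functional as F

def sft_step(model, batch):
    # batch["input_ids"]: tokenized prompt + FT-CoT demonstration
    # batch["labels"]: copy of input_ids with prompt positions set to -100
    # batch["audio_features"]: output of the (hypothetical) encoder/aligner
    logits = model(input_ids=batch["input_ids"],
                   audio_features=batch["audio_features"]).logits
    # Shift so that position t predicts token t+1 (standard causal-LM loss);
    # ignore_index=-100 excludes prompt/audio positions from the objective.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           batch["labels"][:, 1:].reshape(-1),
                           ignore_index=-100)
    return loss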

3. FT-GRPO Methodology

3.1 Markov Decision Process Formulation

FT-GRPO is framed as a single-step episodic Markov decision process (MDP); a minimal episode sketch follows the list below:

  • State $s$: the input audio $a$ and prompt prefix.
  • Action $a_t$: emission of the structured output $o = (\tau, y_\mathrm{pred})$, where $\tau$ is the FT-CoT rationale and $y_\mathrm{pred} \in \{\text{real}, \text{fake}\}$.
  • Transition: the episode terminates after the output is emitted; there are no further state transitions.
  • Policy $\pi_\theta$: parameterizes $P(o \mid a)$.
  • Reference policy $\pi_\mathrm{ref}$: a fixed post-SFT snapshot, enabling KL regularization.
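Because the MDP is single-step, a rollout reduces to one generation followed by one reward evaluation. The sketch below illustrates only this structure; `policy.generate`, `prompt`, and `reward_fn` are hypothetical names, not APIs from the paper.

from dataclasses import dataclass

@dataclass
class Episode:
    audio_path: str   # state: input audio a
    prompt: str       # state: prompt prefix
    output: str       # action: full tagged output o = (rationale, prediction)
    reward: float     # terminal reward r(o, y)

def run_episode(policy, audio_path, prompt, label, reward_fn):
    # Sample o ~ pi_theta(. | a); the episode ends immediately afterwards.
    output = policy.generate(audio_path, prompt)
    return Episode(audio_path, prompt, output, reward_fn(output, label))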

3.2 Frequency-Time Constraints and Reward Composition

Two explicit meta-tags are defined: ⟨Frequency Domain⟩ and ⟨Time Domain⟩. Each CoT rationale must include both tags, each followed by at least one grammatically complete sentence, enforced via a completeness-check function $g(a_i)$. The composite reward for output $o = (\tau, y_\mathrm{pred})$ on a labeled pair $(a, y)$ is (a minimal reward sketch follows the list):

  • Accuracy reward: $r_\mathrm{acc} = 1$ if $y_\mathrm{pred} = y$, else $0$.
  • Format reward: $r_\mathrm{fmt} = \mathbb{I}(\mathcal{F}(o))$, where $\mathcal{F}(o)$ checks structural tag compliance.
  • FT reasoning reward: $r_\mathrm{ft} = \tfrac{1}{2} \sum_{i \in \{\mathrm{FD},\,\mathrm{TD}\}} \mathbb{I}\bigl(a_i \in \tau \land g(a_i) = \text{True}\bigr)$, taking values in $\{0, 0.5, 1.0\}$.
  • Total reward: $r = r_\mathrm{acc} + \alpha\, r_\mathrm{fmt} + \beta\, r_\mathrm{ft}$, with $\alpha = \beta = 0.1$.
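A minimal sketch of this composite reward is given below. The ASCII tag spellings, the regular expressions, and the `is_complete_sentence` heuristic are simplifying assumptions standing in for the paper's $\mathcal{F}(o)$ and $g(a_i)$ checks.

import re

ALPHA = BETA = 0.1  # weights for the format and FT-reasoning terms

def is_complete_sentence(text: str) -> bool:
    # Crude stand-in for g(a_i): a few words ending in sentence punctuation.
    return len(text.split()) >= 4 and text.strip().endswith((".", "!", "?"))

def ft_grpo_reward(output: str, label: str) -> float:
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>\s*(real|fake)\s*</answer>", output)
    r_fmt = 1.0 if (think and answer) else 0.0                     # format reward
    r_acc = 1.0 if (answer and answer.group(1) == label) else 0.0  # accuracy reward
    r_ft = 0.0                                                     # FT reasoning reward
    if think:
        for tag in ("Frequency Domain", "Time Domain"):
            m = re.search(rf"<{tag}>\s*([^<]+)", think.group(1))
            if m and is_complete_sentence(m.group(1)):
                r_ft += 0.5   # yields values in {0, 0.5, 1.0}
    return r_acc + ALPHA * r_fmt + BETA * r_ft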

3.3 Group Relative Policy Optimization

For each audio instance, a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ is sampled:

  • Group mean: $\bar r = \frac{1}{G} \sum_{i} r_i$
  • Group standard deviation: $\sigma_r = \mathrm{std}(\{r_i\}) + \epsilon$
  • Advantage per sample: $A_i = \dfrac{r_i - \bar r}{\sigma_r}$

The optimization objective per batch is:

\mathcal{L}_{\mathrm{FT\text{-}GRPO}}(\theta) = -\,\mathbb{E}_{a}\Biggl[\sum_{i=1}^{G} A_i \log \pi_\theta(o_i \mid a)\Biggr] + \lambda\,\mathbb{E}_{a}\Bigl[D_{\mathrm{KL}}\bigl(\pi_{\mathrm{ref}}(\cdot \mid a)\,\big\|\,\pi_\theta(\cdot \mid a)\bigr)\Bigr]

where $\lambda$ is the KL penalty strength (set via a warmup ratio of $0.05$).
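The sketch below illustrates the group-relative advantage computation and a sample-based surrogate for the regularized objective. `logprob_of_output` is a hypothetical helper returning the summed token log-probability of one output under a given model, the KL term is a crude Monte Carlo estimate rather than the exact expectation, and the value of `lam` is illustrative, since the paper reports only a warmup ratio of 0.05 for the penalty schedule.

import torch

def grpo_loss(policy, ref_policy, audio, outputs, rewards,
              logprob_of_output, lam=0.01, eps=1e-4):
    # Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps).
    r = torch.tensor(rewards, dtype=torch.float32)
    adv = (r - r.mean()) / (r.std() + eps)

    # Summed token log-probs of each sampled output under the current and reference policies.
    logp = torch.stack([logprob_of_output(policy, audio, o) for o in outputs])
    with torch.no_grad():
        logp_ref = torch.stack([logprob_of_output(ref_policy, audio, o) for o in outputs])

    pg_term = -(adv * logp).sum()          # policy-gradient term
    kl_term = (logp - logp_ref).mean()     # sample-based KL surrogate
    return pg_term + lam * kl_term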

3.4 Algorithmic and Implementation Notes

  • LoRA adapters: rank $r=64$, $\alpha=16$, dropout $0.05$.
  • Learning rate: $1 \times 10^{-5}$ for both stages.
  • Batch size: $16$ (SFT), $32$ (GRPO).
  • Group size: $G=8$, sampling temperature $0.9$.
  • Epochs: SFT ($2$–$3$), GRPO ($2$).
  • Framework: ms-swift + DeepSpeed ZeRO-2, bfloat16 precision. A hedged configuration sketch follows this list.
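For reference, the reported hyperparameters can be collected into a configuration sketch. The example below uses Hugging Face peft for the adapter definition purely as an illustration; the paper trains with ms-swift, and the `target_modules` list is an assumption, not taken from the paper.

from peft import LoraConfig

lora_cfg = LoraConfig(
    r=64,                       # LoRA rank (paper)
    lora_alpha=16,              # scaling alpha (paper)
    lora_dropout=0.05,          # dropout (paper)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not from the paper
    task_type="CAUSAL_LM",
)

train_cfg = {
    "learning_rate": 1e-5,       # both stages
    "sft_batch_size": 16,
    "grpo_batch_size": 32,
    "group_size": 8,             # G
    "sampling_temperature": 0.9,
    "sft_epochs": 3,             # paper reports 2-3
    "grpo_epochs": 2,
    "precision": "bfloat16",
}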

4. Experimental Evaluation

FT-GRPO demonstrates consistent improvements over both SFT and prior state-of-the-art ADD techniques. For a speech-trained Qwen2.5-Omni-3B ALLM, the following accuracy (ACC) results were observed for FT-GRPO compared to SFT:

Audio Type    SFT ACC (%)    FT-GRPO ACC (%)    Δ (pp)
Speech        99.04          99.75              +0.71
Sound         73.31          75.05              +1.74
Singing       66.29          84.26              +17.97
Music         91.16          90.44              -0.72
Average       82.45          87.38              +4.93

Additionally, FT-GRPO surpasses alternative models, including W2V2-AASIST (avg. 63.5%), WPT-W2V2-AASIST (69.2%), and ALLM4ADD (85.8%). In all-type co-training, FT-GRPO yields an average ACC of 90.10% (+5.15 over SFT) (Xie et al., 6 Jan 2026).

5. Interpretability via Frequency-Time Grounded Chain-of-Thought

By combining rule-based FT constraints with explicit reward structure during GRPO, FT-GRPO ensures that every output rationale contains:

  • At least one frequency-domain analysis, explicitly tagged and evidenced (e.g., “The spectrogram shows elevated energy around 8–10 kHz with uniform harmonics—an indicator of neural vocoder aliasing”).
  • At least one time-domain analysis, explicitly tagged and evidenced (e.g., “There is a sudden amplitude jump at 1.2 s without natural fade-in/out, revealing a generated segment boundary”).

This FT-grounded CoT structure renders the decision process transparent: practitioners can verify the validity of the rationale alongside the real/fake output, substantially increasing trust and verifiability in critical application domains.
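As an illustration of such verification, the sketch below extracts the prediction and the two tagged analyses from a model output for manual inspection; the ASCII tag spellings are a simplifying assumption, mirroring the reward sketch above.

import re

def parse_ft_cot(output: str) -> dict:
    # Pull out the final answer and the two tagged rationale sentences.
    answer = re.search(r"<answer>\s*(real|fake)\s*</answer>", output)
    freq = re.search(r"<Frequency Domain>\s*([^<]+)", output)
    time_ = re.search(r"<Time Domain>\s*([^<]+)", output)
    return {
        "prediction": answer.group(1) if answer else None,
        "frequency_rationale": freq.group(1).strip() if freq else None,
        "time_rationale": time_.group(1).strip() if time_ else None,
    }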

6. Implications and Significance

FT-GRPO establishes a robust paradigm for interpretable, all-type ADD by integrating strong SFT initialization with structured, constraint-driven reinforcement learning. The methodology mitigates black-box behavior and reward hacking, and enforces output rationales traceable to observable acoustic phenomena in both frequency and time domains.

A plausible implication is that FT-GRPO's constraint-based rationale structuring may generalize to other domains requiring transparent multi-modal LLM predictions. This suggests broader adoption in interpretable AI for safety-critical applications.

7. Example FT-CoT Output

An indicative post-GRPO rationale for a fake audio instance:

⟨think⟩
⟨Frequency Domain⟩ The spectrogram shows elevated energy around 8–10 kHz with uniform harmonics—an indicator of neural vocoder aliasing.
⟨Time Domain⟩ There is a sudden amplitude jump at 1.2 s without natural fade-in/out, revealing a generated segment boundary.
⟨/think⟩
⟨answer⟩fake⟨/answer⟩

This output exemplifies the completeness and interpretability guaranteed by FT-GRPO for each ADD decision (Xie et al., 6 Jan 2026).
