Constant-Q Cepstral Coefficients (CQCC)
- CQCC is a feature representation that employs a constant-Q transform and logarithmic amplitude scaling to provide precise logarithmic frequency resolution for detecting spoofing artefacts.
- CQCC computation involves converting the CQT spectrogram to cepstral coefficients via a DCT, with optional delta and delta-delta derivatives to capture dynamic speech patterns.
- Empirical studies show that while CQCCs excel at revealing high-frequency artefacts in spoofing attacks, they may underperform when artefacts lie in low-frequency regions, which motivates fusing them with complementary features.
Constant-Q Cepstral Coefficients (CQCCs) are a feature representation designed for anti-spoofing in automatic speaker verification (ASV) systems. CQCCs leverage the Constant-Q Transform (CQT), which differs from the Short-Time Fourier Transform (STFT) by employing a frequency-dependent filterbank with a constant quality factor (Q). This provides logarithmic frequency resolution, making CQCCs particularly effective for exposing artefacts introduced by various forms of speech synthesis, voice conversion, and replay attacks. The transformation from the CQT spectrogram to cepstral coefficients is performed via logarithmic amplitude scaling and a Discrete Cosine Transform (DCT), optionally followed by the computation of delta (Δ) and delta-delta (ΔΔ) coefficients to capture speech dynamics.
1. Mathematical Foundation and Computation
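The Constant-Q Transform (CQT) provides the foundation for CQCCs. For a discretely sampled input signal $x(n)$, the CQT is defined for frequency bin $k$ and time center $n$ as:

$$X^{CQ}(k, n) = \sum_{j = n - \lfloor N_k/2 \rfloor}^{n + \lfloor N_k/2 \rfloor} x(j)\, a_k^{*}\left(j - n + \frac{N_k}{2}\right)$$

where $N_k$ is the $k$-th window length, and $a_k^{*}$ is the complex conjugate of the atom $a_k$, which consists of a window $w$ and center frequency $f_k$. The constant-Q property imposes $Q = f_k / \delta f_k$ for all $k$, with $\delta f_k$ as the filter bandwidth. Frequencies are geometrically spaced as $f_k = f_{\min} \cdot 2^{k/B}$ for minimum frequency $f_{\min}$ and bins per octave $B$.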
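The geometric spacing and the constant-Q condition can be illustrated with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `cqt_bin_parameters` is hypothetical, and the choices Q = 1/(2^(1/B) - 1) and window length N_k = Q·f_s/f_k are one common convention assumed here rather than taken from the text.

```python
def cqt_bin_parameters(f_min, bins_per_octave, n_bins, sample_rate):
    """Return Q and a list of (center_freq, bandwidth, window_length) per bin."""
    # Assumed convention: bandwidth equals the gap to the next bin,
    # which gives Q = 1 / (2^(1/B) - 1) for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    params = []
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)    # geometric spacing
        bandwidth = f_k / q                            # grows with f_k
        window_length = round(q * sample_rate / f_k)   # shrinks with f_k
        params.append((f_k, bandwidth, window_length))
    return q, params


# Illustrative values only: 55 Hz minimum, 12 bins per octave, 16 kHz audio.
q, params = cqt_bin_parameters(f_min=55.0, bins_per_octave=12,
                               n_bins=25, sample_rate=16000)
for f_k, bandwidth, window_length in params[::12]:
    print(f"f_k={f_k:8.2f} Hz  bandwidth={bandwidth:6.2f} Hz  N_k={window_length}")
```

Every bin shares the same ratio of center frequency to bandwidth, so low bins get long windows (fine spectral resolution) and high bins get short windows (fine temporal resolution).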
CQCC computation proceeds as follows:
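- Compute the CQT magnitude-squared: $|X^{CQ}(k, n)|^2$.
- Interpolate/Resample the CQT to a linearly spaced frequency grid indexed by $l = 1, \dots, L$.
- Compute log-amplitude: $\log |X^{CQ}(l, n)|^2$.
- Apply a type-II DCT along frequency:

$$\mathrm{CQCC}(p, n) = \sum_{l=1}^{L} \log |X^{CQ}(l, n)|^2 \cos\left[\frac{p\,(l - \tfrac{1}{2})\,\pi}{L}\right], \quad p = 0, 1, \dots, P - 1,$$

with $P$ typically 30 for static CQCCs, optionally augmented with first and second-order derivatives (Tak et al., 2020, Adiban et al., 2019).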
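The steps above can be sketched for a single frame as follows. This is an illustrative, standard-library-only sketch under stated assumptions: the CQT power values are taken as given (no CQT is computed here), the helper names `resample_to_linear`, `dct_ii`, and `cqcc_frame` are hypothetical, simple linear interpolation stands in for whatever resampler a real system uses, and the grid size of 64 points is arbitrary.

```python
import math


def resample_to_linear(freqs, values, n_points):
    """Linearly interpolate values given at increasing freqs onto a uniform grid."""
    f_lo, f_hi = freqs[0], freqs[-1]
    step = (f_hi - f_lo) / (n_points - 1)
    out, j = [], 0
    for i in range(n_points):
        f = min(f_lo + i * step, f_hi)
        # Move to the interval [freqs[j], freqs[j + 1]] that contains f.
        while j < len(freqs) - 2 and freqs[j + 1] < f:
            j += 1
        t = (f - freqs[j]) / (freqs[j + 1] - freqs[j])
        out.append(values[j] + t * (values[j + 1] - values[j]))
    return out


def dct_ii(x, n_coeffs):
    """Unnormalised type-II DCT. With zero-based l the cosine argument is
    p * (l + 1/2) * pi / L, which equals the one-based (l - 1/2) form."""
    L = len(x)
    return [sum(x[l] * math.cos(p * (l + 0.5) * math.pi / L) for l in range(L))
            for p in range(n_coeffs)]


def cqcc_frame(power, freqs, n_linear=64, n_coeffs=30, eps=1e-12):
    """Resample one CQT power frame, take the log, and apply the DCT."""
    linear_power = resample_to_linear(freqs, power, n_linear)
    log_power = [math.log(v + eps) for v in linear_power]  # eps avoids log(0)
    return dct_ii(log_power, n_coeffs)


# Toy input: 48 geometric bins with a flat power spectrum.
freqs = [55.0 * 2.0 ** (k / 12) for k in range(48)]
flat_power = [4.0] * len(freqs)
coeffs = cqcc_frame(flat_power, freqs)
```

For a flat spectrum all the energy lands in the zeroth coefficient and the remaining coefficients vanish, which is a quick sanity check on the DCT step.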
2. Comparison with Other Cepstral Features
Conventional cepstral features such as MFCCs and LFCCs are derived from STFT-based filterbanks with linear (LFCC) or Mel (MFCC) spacing. Their filters have constant or slowly varying bandwidths, meaning the Q factor (center frequency divided by bandwidth) increases with frequency. In contrast, CQCCs employ constant-Q filters with bandwidths proportional to their center frequency. This scheme provides high spectral resolution at low frequencies and high temporal resolution at high frequencies. Linear resampling of the CQT spectrum prior to the DCT emphasizes high-frequency information, as the DCT basis vectors sample these regions densely. MFCCs and LFCCs lack intrinsic spectral emphasis, aside from the shape of their filterbanks (Tak et al., 2020).
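A short worked example may make the contrast concrete. The numbers are illustrative assumptions, not values from the cited studies: a linear filterbank with a fixed bandwidth of 100 Hz, and a constant-Q filterbank with 12 bins per octave under the convention $Q = 1/(2^{1/B} - 1)$.

$$\text{Fixed bandwidth } \delta f = 100\ \text{Hz}: \quad Q(200\ \text{Hz}) = \frac{200}{100} = 2, \qquad Q(4000\ \text{Hz}) = \frac{4000}{100} = 40$$

$$\text{Constant } Q = \frac{1}{2^{1/12} - 1} \approx 16.82: \quad \delta f(200\ \text{Hz}) \approx 11.9\ \text{Hz}, \qquad \delta f(4000\ \text{Hz}) \approx 237.9\ \text{Hz}$$

In the first case Q grows twentyfold across the band while the bandwidth stays put; in the second, Q stays put and the bandwidth grows twentyfold.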
3. Empirical Performance and Evaluation
Extensive evaluation on ASVspoof 2015 and 2019 tasks reveals the following:
- On ASVspoof 2015, CQCCs delivered a substantial reduction in Equal Error Rate (EER) compared to LFCCs for certain attacks, e.g., unit selection TTS (CQCC: 1.06%, LFCC: 8.19%) (Tak et al., 2020).
- On ASVspoof 2019, performance varied by attack and front-end:
  - Neural TTS attacks (A07): CQCC linear EER = 0.00%, LFCC = 12.86%
  - Unit-selection (A16): CQCC = 15.16%, LFCC = 18.97%
  - Spectral filter attacks (A19): Both front-ends have similar EERs (~4.7%-4.9%)
  - For attacks with artefacts in low-frequency bands (A13, A14, A17), LFCCs outperformed CQCC linear (Tak et al., 2020).
This suggests that CQCCs are highly effective for attacks characterized by high-frequency artefacts but perform suboptimally when artefacts are present in lower frequencies.
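Since all of the comparisons above are reported as EER, a minimal sketch of how that metric can be approximated from detection scores may help. It assumes higher scores mean "more likely bona fide" and uses a plain threshold sweep; the function name `equal_error_rate` and the toy scores are hypothetical, and official evaluation tooling may interpolate between operating points differently.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold
    and picking the point where the two error rates are closest."""
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one bona fide trial (0.3) falls below one spoofed trial (0.6).
bona_fide = [0.9, 0.8, 0.7, 0.3]
spoofed = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(bona_fide, spoofed)
print(f"EER = {100 * eer:.2f}%")
```

An EER of 0.00%, as reported for A07, corresponds to a threshold that separates every bona fide trial from every spoofed one.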
4. Explainability via Sub-band and Failure Analysis
Explainability studies assessed sub-band contributions by band-limiting evaluation utterances and visualizing EER as a function of sub-band choice:
- For attacks where "CQCC linear" excels (A07, A16, A19), EER is near zero when evaluating only the top 400 Hz sub-band, indicating that artefacts reside in the highest frequencies.
- For VC/RNN/VAE attacks (A13, A14, A17), artefacts are found in very low-frequency sub-bands. These are detected more effectively by LFCCs, whose emphasis is uniform across frequency, whereas linear resampling dilutes them in CQCCs.
- No single feature set universally captures all artefacts; fusion of multiple representations is empirically beneficial.
High-frequency artefacts commonly arise from band-limiting or concatenation in synthesis, while low-frequency artefacts are linked to formant or spectral envelope mismatches (Tak et al., 2020).
5. Practical Applications and Enhancements
CQCCs have been applied as input features for a range of anti-spoofing back-ends, including GMM classifiers and neural models. In replay spoofing countermeasures, CQCCs have been used in conjunction with autoencoders to learn compact representations. For example, passing 90-dimensional CQCCs through an autoencoder with a 70-unit bottleneck and downstream Siamese CNN classifier achieved an EER of 0.62% (baseline: 11.04%) on the ASVspoof 2019 evaluation set (Adiban et al., 2019). The fine low-frequency resolution of CQCCs makes them well suited to replay attacks that distort low-frequency content. The CQCC vector length (30, 60, 90, 120) and the inclusion of Δ and ΔΔ coefficients are commonly tuned for downstream systems.
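One common way to arrive at a 90-dimensional vector is to stack 30 static coefficients with their Δ and ΔΔ counterparts. The sketch below shows that construction with a standard regression-based delta. Whether the cited system built its 90 dimensions exactly this way is an assumption, and the names `delta` and `add_dynamics` and the window width of 2 frames are illustrative choices.

```python
def delta(frames, width=2):
    """Regression-based delta over +/- width frames; edge frames are repeated."""
    n, dim = len(frames), len(frames[0])
    denom = 2 * sum(d * d for d in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for i in range(dim):
            num = sum(
                d * (frames[min(t + d, n - 1)][i] - frames[max(t - d, 0)][i])
                for d in range(1, width + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamics(static):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 10 frames of 30 static coefficients that rise by 1 per frame.
static = [[float(t)] * 30 for t in range(10)]
features = add_dynamics(static)
print(len(features), len(features[0]))
```

For this linear ramp the Δ block is 1 and the ΔΔ block is 0 away from the edges, as expected for a constant slope.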
6. Limitations, Recommendations, and Future Directions
CQCCs with linear or geometric resampling do not universally excel across all attack types; artefacts may be concentrated in frequency regions not well-emphasized by a given configuration. Fusion of front-ends (linear and geometric CQCCs, MFCCs, LFCCs) is recommended for broader spectral coverage (Tak et al., 2020). Sub-band attention or adaptive resampling strategies could further enhance detection by allowing models to focus on informative regions. Integration with phase/group delay features is proposed to target artefacts undetectable in the magnitude domain alone. Further detailed signal-level analysis of spoofing artefacts can drive the development of improved countermeasures. A plausible implication is that learned filterbanks and adaptive attention mechanisms are promising directions for future research.
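As a rough illustration of front-end fusion, the sketch below combines per-trial scores from two countermeasures at the score level. It is only one simple possibility: z-normalisation followed by an equally weighted average is an assumption made here, not the fusion scheme of the cited work, and the names `z_normalise`, `fuse_scores`, and the toy scores are hypothetical.

```python
import statistics


def z_normalise(scores):
    """Shift and scale scores to zero mean and unit (population) deviation."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores) or 1.0  # guard against constant scores
    return [(s - mean) / std for s in scores]


def fuse_scores(score_lists, weights=None):
    """Weighted average of z-normalised scores, one list per front-end."""
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    normalised = [z_normalise(scores) for scores in score_lists]
    return [
        sum(w * s for w, s in zip(weights, trial_scores))
        for trial_scores in zip(*normalised)
    ]


# Toy per-trial scores from two hypothetical systems on different scales.
cqcc_scores = [2.0, 1.0, -1.0, -2.0]
lfcc_scores = [10.0, 30.0, 20.0, 0.0]
fused = fuse_scores([cqcc_scores, lfcc_scores])
```

Normalising first keeps a front-end with a wide score range from dominating the average; in practice the weights would be tuned on development data.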
7. Summary Table: Comparison of CQCC, MFCC, and LFCC
| Feature | Filter Spacing | Bandwidth | Emphasis |
|---|---|---|---|
| CQCC | Logarithmic | Proportional to center frequency $f_k$ (constant-Q) | High spectral resolution at low frequencies, high temporal resolution at high frequencies |
| MFCC | Mel | Constant/Slow-vary | Uniform (Mel-weighted) |
| LFCC | Linear | Constant/Slow-vary | Uniform |
In summary, CQCCs are a distinct front-end for anti-spoofing, well suited to capturing frequency-localized artefacts through the constant-Q paradigm. Their practical effectiveness depends critically on where in frequency the spoofing artefacts are located, and it can be enhanced via system fusion or adaptive feature extraction strategies (Tak et al., 2020, Adiban et al., 2019).