Siamese Autoencoder Framework

Updated 17 November 2025

Siamese Autoencoder Framework is a deep learning approach that combines undercomplete autoencoder embedding with Siamese CNN-based pair discrimination for classification tasks.
It leverages reconstruction losses on CQCC-extracted features and shared-weight CNNs to differentiate between bona fide and replayed signals with high accuracy.
The autoencoder and the Siamese network are trained separately, and a majority vote over bona fide references produces the final decision; on ASVspoof 2019 Physical Access, the reported evaluation EER drops from 11.04% (CQCC+GMM baseline) to 0.62%.

A Siamese Autoencoder Framework encompasses architectures that integrate the representation learning strengths of autoencoders with the pairwise or triplet discrimination capabilities of Siamese networks. By sharing weights across multiple input branches, encoding samples into a common latent space, and using reconstruction losses regularized by distances or margin-based criteria, these frameworks are highly effective for pairing-based classification, verification, and retrieval tasks, as exemplified in replay spoofing detection in speaker verification systems (Adiban et al., 2019).

1. Architectural Principles and Workflow

The Siamese Autoencoder described by Adiban et al. (2019) discriminates between bona fide and replayed audio signals. It is built around two separately trained stages (an autoencoder and a Siamese CNN), preceded by feature extraction and followed by vote aggregation:

Feature Extraction: Raw speech is converted into framewise Constant-Q Cepstral Coefficient (CQCC) feature vectors. The CQCC pipeline involves windowing, the Constant-Q Transform, log-power spectrum analysis, the Discrete Cosine Transform, and aggregation into a $D$-dimensional feature vector for each frame.
Autoencoder Embedding: An undercomplete autoencoder, with a $D \rightarrow Y \rightarrow D$ topology and $Y<D$, is trained to reconstruct the CQCC features, so that its bottleneck embedding captures both spectral and replay-induced noise characteristics.
Siamese Network Classification: Two identical, weight-sharing convolutional neural networks (CNNs) each consist of three convolutional layers with max-feature-map (MFM) activations and max pooling, then two fully connected layers and a highway connection. They take as input (a) a fixed bona fide reference embedding and (b) the embedding of a test utterance. The pair is classified using a softmax output.
Aggregation and Decision: The output is aggregated by majority vote over 100 bona fide reference embeddings, producing the final bona fide vs. spoof classification.

A schematic of the Siamese Autoencoder pipeline:

Speech waveform
    ↓         [CQCC extraction]
CQCC features (D-dim per frame)
    ↓         [Autoencoder: D→Y→D]
Autoencoder bottleneck embedding (Y-dim)
    ↓         [Siamese CNN, two legs: reference & test]
Pairwise softmax similarity
    ↓         [Majority vote over reference set]
Replay/Bonafide Decision

2. Mathematical Formulation

Autoencoder

Encoder: $h_{AE} = f_{\mathrm{enc}}(x) = \sigma(W_1 x + b_1)$, with $x\in\mathbb{R}^D$ and $h_{AE}\in\mathbb{R}^Y$.
Decoder: $\hat{x} = f_{\mathrm{dec}}(h_{AE}) = W_2 h_{AE} + b_2$ (linear output).
Loss: $L_{\mathrm{rec}} = \frac{1}{m} \sum_{i=1}^m \|x_i - \hat{x}_i\|_2^2$, where $m$ is the number of training frames in the batch.
Regularized: $L_{AE} = L_{\mathrm{rec}} + \frac{\lambda}{2} \left( \|W_1\|_F^2 + \|W_2\|_F^2 \right)$, with typical $\lambda \approx 10^{-4}$.
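The encoder, decoder, and regularized loss above can be sketched in NumPy as follows. This is a minimal illustration of the forward pass and loss only (no training loop); the function names, the random initialization scale, and the synthetic input are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_autoencoder(D, Y, seed=0):
    # Hypothetical initialization: small Gaussian weights, zero biases.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(Y, D)),
        "b1": np.zeros(Y),
        "W2": rng.normal(0.0, 0.1, size=(D, Y)),
        "b2": np.zeros(D),
    }

def encode(params, x):
    # x has shape (m, D); returns the bottleneck code h_AE of shape (m, Y).
    return sigmoid(x @ params["W1"].T + params["b1"])

def decode(params, h):
    # Linear output layer, shape (m, D).
    return h @ params["W2"].T + params["b2"]

def autoencoder_loss(params, x, lam=1e-4):
    x_hat = decode(params, encode(params, x))
    # Mean over the batch of the squared L2 reconstruction error.
    l_rec = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # Frobenius-norm weight decay on both weight matrices.
    l2 = 0.5 * lam * (np.sum(params["W1"] ** 2) + np.sum(params["W2"] ** 2))
    return l_rec + l2

params = init_autoencoder(D=90, Y=70)
x = np.random.default_rng(1).normal(size=(200, 90))  # stand-in for CQCC frames
h = encode(params, x)
loss = autoencoder_loss(params, x)
```

Here `h` plays the role of the $Y$-dimensional embedding that is later passed to the Siamese network.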

Siamese CNN

Each "leg" of the Siamese network processes the $Y$-dimensional embedding as a small "image" (reshaped if needed). For an input pair $(h_g, h_t)$ (bona fide reference, test):

Three convolutional layers with filters $\{160, 200, 100\}$, each with MFM and $2 \times 2$ pooling.
Two fully connected layers (typically $300$ units).
The outputs of the two legs, $h_1$ and $h_2$, are concatenated (or differenced) and passed to a final classifier. For a binary decision, a two-class softmax is equivalent to a sigmoid over a single logit, which is the form written here:

$p = \operatorname{sigmoid}(w^\top [h_1; h_2] + b)$
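The MFM activation used in the convolutional layers is not defined in the text above. The sketch below assumes the common definition (split the channel axis into two halves and keep the elementwise maximum), which halves the channel count; the function name and the toy input are illustrative only.

```python
import numpy as np

def max_feature_map(x, axis=0):
    """Assumed MFM: split channels into two halves, keep the elementwise max."""
    if x.shape[axis] % 2 != 0:
        raise ValueError("channel count must be even for MFM")
    first_half, second_half = np.split(x, 2, axis=axis)
    return np.maximum(first_half, second_half)

# Toy input with 4 channels of 2x2 feature maps (channel axis first).
feature_maps = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
out = max_feature_map(feature_maps, axis=0)  # shape (2, 2, 2)
```

Unlike ReLU, this activation compares competing feature maps against each other rather than against a fixed threshold.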

Pairwise binary cross-entropy loss:

$L_{CE} = - [y \log p + (1-y) \log (1-p)]$

where $y=1$ when both inputs of the pair are bona fide and $y=0$ when the test input is spoofed.
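As a small worked illustration of the pair classifier and its loss, the following NumPy sketch computes $p$ and $L_{CE}$ for one pair. The vectors and the zero-initialized weights are made up for the example; in the framework, $h_1$ and $h_2$ would be the outputs of the two CNN legs.

```python
import numpy as np

def pair_probability(h1, h2, w, b):
    # Concatenate the two leg outputs and apply a sigmoid to a single logit.
    z = np.concatenate([h1, h2], axis=-1) @ w + b
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip to avoid log(0); y = 1 for a bona fide pair, 0 for a spoofed pair.
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

h1 = np.array([0.2, -0.1, 0.4])   # illustrative reference-leg output
h2 = np.array([0.1, 0.3, -0.2])   # illustrative test-leg output
w = np.zeros(6)                   # untrained classifier weights
b = 0.0

p = pair_probability(h1, h2, w, b)        # 0.5 for an all-zero classifier
loss_pos = binary_cross_entropy(1.0, p)   # equals log(2)
```

With all-zero weights the classifier is maximally uncertain, so the loss equals $\log 2$ for either label.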

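Note: While a classical Siamese network often employs a margin-based contrastive loss, Adiban et al. (2019) use a cross-entropy loss for pair classification. With $d(h_1, h_2)$ the Euclidean distance between the two leg outputs and $m$ a margin (unrelated to the batch count $m$ in $L_{\mathrm{rec}}$), the contrastive loss variant would be: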

$L_{\mathrm{contrastive}} = y\,d(h_1, h_2)^2 + (1-y)\,\max(0, m-d(h_1, h_2))^2$
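For comparison, the contrastive variant can be sketched as below, assuming a Euclidean distance for $d$ and a margin of 1.0 (both are choices made for the example, not values from the paper).

```python
import numpy as np

def contrastive_loss(h1, h2, y, margin=1.0):
    # d is the Euclidean distance between the two embeddings of each pair.
    d = np.linalg.norm(h1 - h2, axis=-1)
    # Similar pairs (y = 1) are pulled together; dissimilar pairs (y = 0)
    # are pushed apart until they are at least `margin` away.
    return y * d ** 2 + (1 - y) * np.maximum(0.0, margin - d) ** 2

h1 = np.array([[0.0, 0.0], [0.0, 0.0]])
h2 = np.array([[3.0, 4.0], [0.3, 0.4]])
y = np.array([1.0, 0.0])

losses = contrastive_loss(h1, h2, y, margin=1.0)  # array([25.0, 0.25])
```

The first pair is labeled similar but sits at distance 5, giving a loss of 25; the second is labeled dissimilar at distance 0.5, giving $(1 - 0.5)^2 = 0.25$.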

3. Training and Optimization Strategy

The autoencoder is trained independently using a mean squared error (MSE) reconstruction loss with weight decay. Typical hyperparameters are batch size $200$, learning rate $10^{-3}$, sigmoid hidden activations, and a linear output.
The Siamese CNN is trained after the AE converges. Batches are constructed with balanced bona fide/spoof pairs ($200$ per batch), with one leg fixed to a bona fide reference and the other fed a random sample.
Dropout is applied with $p\approx0.5$.
Training continues for roughly $20$ to $30$ epochs using SGD or Adam.
Modules are trained separately and not fine-tuned jointly, consistent with the paper’s protocol.
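The balanced pair construction described above might look like the following sketch. It assumes that "200 per batch" means 200 pairs in total (100 bona fide, 100 spoof) and uses synthetic embeddings; `make_pair_batch` and its arguments are hypothetical names.

```python
import numpy as np

def make_pair_batch(bona_fide, spoof, reference, batch_size=200, seed=0):
    """Build one batch: reference leg fixed, test leg half bona fide, half spoof."""
    rng = np.random.default_rng(seed)
    half = batch_size // 2
    pos = bona_fide[rng.integers(0, len(bona_fide), size=half)]
    neg = spoof[rng.integers(0, len(spoof), size=batch_size - half)]
    test_leg = np.concatenate([pos, neg], axis=0)
    labels = np.concatenate([np.ones(half), np.zeros(batch_size - half)])
    # The same bona fide reference embedding is repeated for every pair.
    ref_leg = np.broadcast_to(reference, test_leg.shape).copy()
    # Shuffle so that positives and negatives are interleaved.
    perm = rng.permutation(batch_size)
    return ref_leg[perm], test_leg[perm], labels[perm]

rng = np.random.default_rng(1)
bona_fide = rng.normal(size=(500, 70))  # stand-in for Y=70 AE embeddings
spoof = rng.normal(size=(500, 70))

ref_leg, test_leg, labels = make_pair_batch(bona_fide, spoof, bona_fide[0])
```

Keeping the classes balanced within each batch prevents the pair classifier from favoring whichever class is more frequent in the training set.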

4. Empirical Results, Ablations, and Performance

Evaluation leverages the ASVspoof 2019 Physical Access benchmark:

Model	Eval EER (%)	t-DCF
Baseline (CQCC+GMM)	11.04	0.2454
Siamese-AE (Config 3, D=90, Y=70)	0.62	0.0110

Absolute improvements over the baseline: $\Delta$EER $\approx$ 10.42 percentage points and $\Delta$t-DCF $\approx$ 0.2344, which correspond to relative reductions of about 94.4% and 95.5% respectively.
Ablation findings: Best CQCC dimension at $D=90$ (balances frequency resolution and redundancy); optimal AE bottleneck at $Y=70$ (sufficient replay-noise retention with minimal speech loss); best Siamese “Config 3” (three convolutional layers, 300-unit FC, max-pooling, MFM nonlinearity).
Robustness: With only $60\%$ of the training data, the system matches the baseline, indicating robustness to data scarcity.

5. Design Insights and Implementation Recommendations

Feature learning mechanics: The autoencoder’s bottleneck enforces the retention of both speech spectral cues and convolutional/noise artifacts from replay, which are essential for the Siamese classifier. The separation of training ensures that the AE “preconditions” the embedding space for discriminability, and the Siamese net accesses clean, informative features.
Hyperparameter recommendations: $D \approx 90$ and $Y \approx 70$ for the AE; three convolutional layers with MFM and max pooling plus $300$ hidden units for the Siamese net; batch size $200$.
Aggregation: Majority vote over multiple bona fide references in test-time scoring increases robustness to embedding outliers.

Implementation should adhere to the architectural separation (AE training prior to Siamese network training) and leverage majority voting for final decision aggregation.
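The majority-vote aggregation can be sketched as follows. The sketch assumes each of the 100 reference pairings yields a probability that the test utterance is bona fide, uses a 0.5 vote threshold, and resolves ties as "spoof"; the threshold and the tie rule are assumptions, not details from the paper.

```python
import numpy as np

def majority_vote(pair_probs, threshold=0.5):
    """Aggregate per-reference pair probabilities into one decision."""
    votes = np.asarray(pair_probs) >= threshold   # True = vote for bona fide
    n_bona_fide = int(votes.sum())
    # Strict majority required; ties fall back to "spoof" (assumed rule).
    return "bona fide" if n_bona_fide > votes.size / 2 else "spoof"

# Illustrative scores against 100 bona fide references: 60 agree, 40 disagree.
probs = np.concatenate([np.full(60, 0.9), np.full(40, 0.2)])
decision = majority_vote(probs)
```

Because each reference contributes a single vote, a few atypical reference embeddings cannot flip the final decision on their own.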

6. Broader Context and Significance

Siamese Autoencoder frameworks blend unsupervised feature extraction (autoencoding) with supervised pairwise discrimination (Siamese/triplet loss), enabling:

Enhanced countermeasure development in adversarial/attack scenarios (e.g., replay/impersonation attacks in speech).
Encoding of subtle nuisance factors (e.g., replay noise) directly into learnable embedding spaces.
Robustness to limited labeled data via pretraining.

In the studied application, this methodology produces significant advances in replay attack detection accuracy over established GMM and hand-engineered feature-based systems. The approach is extensible to other domains requiring fine-grained distinction between superficially similar signals, provided an analogous sequence of preprocessing → autoencoder (feature bottleneck) → Siamese pairwise classification is feasible.

A plausible implication is that further advances may be obtained by end-to-end training, exploration of other contrastive loss formulations, or incorporation of temporal context directly within the AE/Siamese branches, depending on the specifics of the application domain.

References

Adiban et al. (2019). Replay Spoofing Countermeasure Using Autoencoder and Siamese Network on ASVspoof 2019 Challenge.