Mel Patchify Block in Audio Tokenization
- Mel Patchify Block is a module that converts Mel-spectrograms into overlap-aware patch embeddings using convolutional patchification for deep learning.
- It uses a five-layer convolutional stem with overlapping receptive fields to preserve continuity and capture fine-grained spectral features in audio.
- The block integrates learnable positional encodings to form a compact token sequence, enabling efficient Transformer and GNN processing in underwater acoustics.
The Mel Patchify Block is a specialized architectural component designed to convert Mel-spectrogram representations of audio into sequences of compact, overlap-aware patch embeddings suitable for downstream deep learning models built on Transformers and Graph Neural Networks. This module is fundamental to the UATR-GTransformer framework for underwater acoustic target recognition, where it provides an efficient tokenization, with a convolutional inductive bias, of complex non-stationary and nonlinear underwater acoustic signals (Feng et al., 12 Dec 2025).
1. Signal Preprocessing and Mel-Spectrogram Construction
The Mel Patchify Block operates on preprocessed raw audio sampled at 16 kHz. The initial step applies a pre-emphasis filter $y[n] = x[n] - \alpha\, x[n-1]$ (with coefficient $\alpha$ close to 1) to accentuate high-frequency content. The pre-emphasized waveform is partitioned into overlapping frames of length $N = 400$ samples (25 ms) with hop size $H = 160$ samples (10 ms), employing a Hann window $w[n] = 0.5\left(1 - \cos\frac{2\pi n}{N-1}\right)$, $n = 0, \dots, N-1$.
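For concreteness, the following is a minimal NumPy sketch of this preprocessing step. The pre-emphasis coefficient of 0.97 is a common default used here for illustration, not a value taken from the paper.

```python
import numpy as np

SR = 16_000          # sampling rate (Hz)
FRAME_LEN = 400      # 25 ms at 16 kHz
HOP = 160            # 10 ms at 16 kHz

def preemphasize(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the first-order pre-emphasis filter y[n] = x[n] - alpha * x[n-1]."""
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

def frame_signal(x: np.ndarray) -> np.ndarray:
    """Slice the waveform into overlapping, Hann-windowed frames."""
    window = np.hanning(FRAME_LEN)                       # Hann window of length 400
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    frames = np.stack([x[t * HOP : t * HOP + FRAME_LEN] for t in range(n_frames)])
    return frames * window                               # shape: (n_frames, 400)

# Example: 5 s of audio -> roughly 500 windowed frames
waveform = np.random.randn(5 * SR).astype(np.float32)
frames = frame_signal(preemphasize(waveform))
print(frames.shape)  # (498, 400)
```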
The framewise Short-Time Fourier Transform (STFT) is computed as

$$X[t, k] = \sum_{n=0}^{N-1} y[n + tH]\, w[n]\, e^{-j 2\pi k n / N}$$

for frequency bins $k = 0, \dots, N/2$, followed by projection through a 128-band triangular Mel filterbank. The $m$-th Mel filter $H_m[k]$ is defined for frequency bin $k$ as

$$H_m[k] = \begin{cases} 0, & k < f[m-1] \\ \dfrac{k - f[m-1]}{f[m] - f[m-1]}, & f[m-1] \le k \le f[m] \\ \dfrac{f[m+1] - k}{f[m+1] - f[m]}, & f[m] \le k \le f[m+1] \\ 0, & k > f[m+1], \end{cases}$$

where $f[m]$ denotes the frequency bin of the $m$-th Mel-spaced center frequency.
The resulting log-Mel spectrogram is

$$S[m, t] = \log\!\left(\sum_{k} H_m[k]\, \big|X[t, k]\big|^{2} + \epsilon\right),$$

with $m = 1, \dots, 128$ and $\epsilon$ a small constant for numerical stability. For a 5 s waveform, the spectrogram is zero-padded or truncated to $T = 512$ frames, yielding a log-Mel matrix $\mathbf{S}$ with 128 Mel bands and 512 frames.
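A compact way to reproduce this front-end is with torchaudio's built-in Mel-spectrogram transform; the sketch below uses the parameters stated above (16 kHz, 25 ms/10 ms framing, Hann window, 128 Mel bands) and pads or truncates to 512 frames. The use of torchaudio and the small log-floor constant are implementation assumptions, not details from the paper.

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000,
    n_fft=400,            # 25 ms analysis window
    hop_length=160,       # 10 ms hop
    window_fn=torch.hann_window,
    n_mels=128,
    power=2.0,            # power spectrogram before Mel projection
)

def log_mel(waveform: torch.Tensor, target_frames: int = 512) -> torch.Tensor:
    """Compute a log-Mel spectrogram and pad/truncate it to a fixed number of frames."""
    mel = mel_transform(waveform)                 # shape: (128, T)
    log_mel = torch.log(mel + 1e-6)               # small floor for numerical stability
    T = log_mel.shape[-1]
    if T < target_frames:                         # zero-pad short inputs on the time axis
        log_mel = torch.nn.functional.pad(log_mel, (0, target_frames - T))
    return log_mel[..., :target_frames]           # truncate long inputs

S = log_mel(torch.randn(5 * 16_000))              # 5 s waveform -> (128, 512)
print(S.shape)
```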
2. Convolutional Patchification and Local Feature Encoding
To create patch embeddings, the Mel Patchify Block uses a five-layer convolutional “stem” rather than non-overlapping slicing. Starting from the log-Mel matrix $\mathbf{S}$, treated as a single-channel image $\mathbf{Z}^{(0)}$, the process applies five consecutive convolutions:

$$\mathbf{Z}^{(\ell)} = \mathrm{Conv2D}_{3 \times 3}\!\left(\mathbf{Z}^{(\ell-1)};\, C_\ell,\, s_\ell,\, p_\ell\right), \qquad \ell = 1, \dots, 5,$$

where $C_\ell$ is the number of output channels, $s_\ell$ is the stride, and $p_\ell$ denotes padding. Every layer uses a kernel of size 3, and the downsampling layers use stride 2 with padding 1.
Each convolutional layer with stride 2 halves the corresponding dimension, reducing the $512 \times 128$ time–frequency grid to a compact spatial grid of patch positions. Each spatial position in the output tensor indexes a patch covering $512/32 = 16$ time frames and $128/8 = 16$ Mel bands. The use of kernel size 3 with stride 2 leads to overlapping receptive fields: adjacent patches share one row or column of the input at each downsampling stage, ensuring continuity across patch borders.
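A minimal PyTorch sketch of such a stem is given below. The channel widths, the per-layer stride pattern, and the normalization/activation layers are illustrative assumptions rather than the paper's configuration; what the sketch preserves is the five-layer structure with kernel size 3 and stride-2 downsampling that produces overlapping receptive fields.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Five-layer convolutional stem with overlapping (kernel 3, stride 2) downsampling.

    Channel widths, stride pattern, and norm/activation choices are illustrative,
    not the paper's exact configuration.
    """

    def __init__(self, channels=(32, 64, 128, 256, 256)):
        super().__init__()
        layers, in_ch = [], 1                       # log-Mel input treated as a 1-channel image
        for i, out_ch in enumerate(channels):
            stride = 2 if i < 4 else 1              # assumed: four downsampling stages, one refining stage
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.GELU(),
            ]
            in_ch = out_ch
        self.stem = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, mel_bands, frames) -> (batch, C, mel_bands', frames')
        return self.stem(x)

stem = ConvStem()
z = stem(torch.randn(2, 1, 128, 512))               # toy batch of log-Mel inputs
print(z.shape)                                       # torch.Size([2, 256, 8, 32])
```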
3. Positional Encoding and Patch Embedding Sequence Construction
To encode spatial location, each patch vector $\mathbf{z}_{i,j}$ at grid position $(i, j)$ is augmented with an independent learnable 2D positional embedding $\mathbf{p}_{i,j}$:

$$\tilde{\mathbf{z}}_{i,j} = \mathbf{z}_{i,j} + \mathbf{p}_{i,j},$$

where one embedding is learned per grid position and $N_p$ is the number of patches. The final output is formed by flattening the spatial grid into a patch sequence:

$$\mathbf{X} = \big[\tilde{\mathbf{z}}_{1},\, \tilde{\mathbf{z}}_{2},\, \dots,\, \tilde{\mathbf{z}}_{N_p}\big] \in \mathbb{R}^{N_p \times C},$$
which directly feeds the following Transformer Encoder in the GTransformer block (Feng et al., 12 Dec 2025).
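Continuing the sketch above, the per-position learnable embeddings and the flattening into a token sequence can be written as follows. The grid size and embedding dimension simply follow from the illustrative stem, and the class name MelPatchifyBlock is used here for convenience rather than taken from the paper's code.

```python
import torch
import torch.nn as nn

class MelPatchifyBlock(nn.Module):
    """Sketch: conv stem + learnable 2D positional embeddings + flattening to tokens."""

    def __init__(self, stem: nn.Module, embed_dim: int = 256, grid_hw=(8, 32)):
        super().__init__()
        self.stem = stem
        h, w = grid_hw
        # one independent learnable embedding per (frequency, time) grid position
        self.pos_embed = nn.Parameter(torch.zeros(1, embed_dim, h, w))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.stem(x)                      # (B, C, H, W) patch grid
        z = z + self.pos_embed                # add a positional embedding at each grid position
        return z.flatten(2).transpose(1, 2)   # (B, N_p, C) token sequence, N_p = H * W

# ConvStem is the class from the previous sketch
patchify = MelPatchifyBlock(ConvStem())
tokens = patchify(torch.randn(2, 1, 128, 512))
print(tokens.shape)                           # torch.Size([2, 256, 256]) -> 256 tokens of dim 256
```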
4. Motivations and Design Rationale
The Mel Patchify Block’s architecture addresses several critical requirements for robust time–frequency representation learning:
- Local–global context fusion: The convolutional stem fuses local context within each region, enabling each patch embedding to encode “fine-grained” spectral texture.
- Continuity via overlapping patches: Overlapping receptive fields (from kernel size 3, stride 2 convolutions) mitigate block artifacts and preserve coherence for frequency components crossing patch boundaries—essential for spectro-temporal phenomena such as narrowband lines and chirps.
- Efficient sequence length: Progressive strided convolution rapidly reduces the input to a tractable grid, which balances feature granularity against computational considerations for subsequent multi-head self-attention.
- Explicit spatial information: Learnable 2D positional encodings provide absolute localization, supporting the Transformer's global attention in both time and frequency axes.
5. Integration in the UATR-GTransformer Pipeline
Within the UATR-GTransformer, the Mel Patchify Block serves as the front-end, transforming the Mel spectrogram into a sequence of overlap-aware embeddings. These embeddings become input tokens for the GTransformer block, where a Transformer Encoder models patchwise mutual information to generate Mel-graph embeddings. Subsequently, a Graph Neural Network (GNN) captures local patch neighborhood structure, followed by a feed-forward network for feature transformation. This layered approach enables the model to combine convolutional inductive biases with global self-attention and graph-based relational learning, tailored to the unique non-Euclidean, non-stationary characteristics of underwater acoustic data (Feng et al., 12 Dec 2025).
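As a rough structural illustration of this pipeline (not the authors' implementation), the stages can be composed as below. The Transformer Encoder uses PyTorch's stock module, the graph stage is stood in for by a generic message-passing layer over a 4-neighbour patch-grid adjacency, and hyperparameters such as the number of encoder layers, attention heads, and output classes are placeholders.

```python
import torch
import torch.nn as nn

# assumes ConvStem and MelPatchifyBlock from the previous sketches are in scope

class SimpleGraphLayer(nn.Module):
    """Stand-in GNN layer: mixes each token with its grid neighbours via a fixed adjacency."""

    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        # row-normalised adjacency over the patch grid (assumed 4-neighbour connectivity)
        self.register_buffer("adj", adj / adj.sum(dim=-1, keepdim=True))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(self.adj @ x)    # aggregate neighbours, then residual update

def grid_adjacency(h: int, w: int) -> torch.Tensor:
    """Adjacency (with self-loops) for an h x w patch grid with 4-neighbour edges."""
    adj = torch.eye(h * w)
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    adj[i * w + j, ni * w + nj] = 1.0
    return adj

class UATRGTransformerSketch(nn.Module):
    """Patchify -> Transformer Encoder -> graph layer -> feed-forward head (structural sketch only)."""

    def __init__(self, patchify: nn.Module, dim: int = 256, grid_hw=(8, 32), n_classes: int = 10):
        super().__init__()
        self.patchify = patchify
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.gnn = SimpleGraphLayer(dim, grid_adjacency(*grid_hw))
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patchify(x)             # (B, N_p, C) overlap-aware patch embeddings
        tokens = self.encoder(tokens)         # global patchwise interactions
        tokens = self.gnn(tokens)             # local neighbourhood refinement on the patch graph
        return self.head(tokens.mean(dim=1))  # pooled logits

model = UATRGTransformerSketch(MelPatchifyBlock(ConvStem()))
logits = model(torch.randn(2, 1, 128, 512))
print(logits.shape)                           # torch.Size([2, 10])
```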
6. Advantages in Feature Extraction and Recognition Tasks
The Mel Patchify Block enables several advantages for challenging acoustic tasks:
- Improved spectral feature continuity: Overlapping, learned patchification preserves the integrity of spectro-temporal features, reducing the risk of fragmentation across patch boundaries.
- Reduced computational burden: Downsampling compresses the long time–frequency input into a compact token grid, maintaining adequate context while ensuring tractability for self-attention mechanisms.
- Enhanced inductive bias: Convolutional filters inject prior knowledge about local time–frequency structure, which stabilizes training and improves generalization, especially for non-stationary or non-Gaussian signal classes.
A plausible implication is that analogous patchification strategies may benefit related domains where spectral continuity and local context are vital, such as general-purpose audio event detection or biomedical time–frequency signal analysis.