Mel Patchify Block in Audio Tokenization
- Mel Patchify Block is a module that converts Mel-spectrograms into overlap-aware patch embeddings using convolutional patchification for deep learning.
- It uses a five-layer convolutional stem with overlapping receptive fields to preserve continuity and capture fine-grained spectral features in audio.
- The block integrates learnable positional encodings to form a compact token sequence, enabling efficient Transformer and GNN processing in underwater acoustics.
The Mel Patchify Block is a specialized architectural component designed to convert Mel-spectrogram representations of audio into sequences of compact, overlap-aware patch embeddings suitable for downstream deep learning models built on Transformers and Graph Neural Networks. This module is fundamental to the UATR-GTransformer framework for underwater acoustic target recognition, where it provides an efficient tokenization, with a convolutional inductive bias, of complex non-stationary and nonlinear underwater acoustic signals (Feng et al., 12 Dec 2025).
1. Signal Preprocessing and Mel-Spectrogram Construction
The Mel Patchify Block operates on preprocessed raw audio sampled at 16 kHz. The initial step applies a pre-emphasis filter $y[n] = x[n] - \alpha\, x[n-1]$ (with coefficient $\alpha$ close to 1) to accentuate high-frequency content. The pre-emphasized waveform is partitioned into overlapping frames of length $N = 400$ samples (25 ms) with hop size $H = 160$ samples (10 ms), employing a Hann window $w[n] = 0.5\left(1 - \cos\frac{2\pi n}{N-1}\right)$, $n = 0, \dots, N-1$.
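For concreteness, the following is a minimal NumPy sketch of this preprocessing step. The pre-emphasis coefficient of 0.97 is a common default used here for illustration, not a value taken from the paper.

```python
import numpy as np

SR = 16_000          # sampling rate (Hz)
FRAME_LEN = 400      # 25 ms at 16 kHz
HOP = 160            # 10 ms at 16 kHz

def preemphasize(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the first-order pre-emphasis filter y[n] = x[n] - alpha * x[n-1]."""
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

def frame_signal(x: np.ndarray) -> np.ndarray:
    """Slice the waveform into overlapping, Hann-windowed frames."""
    window = np.hanning(FRAME_LEN)                       # Hann window of length 400
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    frames = np.stack([x[t * HOP : t * HOP + FRAME_LEN] for t in range(n_frames)])
    return frames * window                               # shape: (n_frames, 400)

# Example: 5 s of audio -> roughly 500 windowed frames
waveform = np.random.randn(5 * SR).astype(np.float32)
frames = frame_signal(preemphasize(waveform))
print(frames.shape)  # (498, 400)
```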
The framewise Short-Time Fourier Transform (STFT) is computed as

$$X[t, k] = \sum_{n=0}^{N-1} y[n + tH]\, w[n]\, e^{-j 2\pi k n / N}$$

for frequency bins $k = 0, \dots, N/2$, followed by projection through a 128-band triangular Mel filterbank. The $m$-th Mel filter $H_m[k]$ is defined for frequency bin $k$ as

$$H_m[k] = \begin{cases} 0, & k < f[m-1] \\ \dfrac{k - f[m-1]}{f[m] - f[m-1]}, & f[m-1] \le k \le f[m] \\ \dfrac{f[m+1] - k}{f[m+1] - f[m]}, & f[m] \le k \le f[m+1] \\ 0, & k > f[m+1], \end{cases}$$

where $f[m]$ denotes the frequency bin of the $m$-th Mel-spaced center frequency.
The resulting log-Mel spectrogram is

$$S[m, t] = \log\!\left(\sum_{k} H_m[k]\, \big|X[t, k]\big|^{2} + \epsilon\right),$$

with $m = 1, \dots, 128$ and $\epsilon$ a small constant for numerical stability. For a 5 s waveform, the spectrogram is zero-padded or truncated to $T = 512$ frames, yielding a log-Mel matrix $\mathbf{S}$ with 128 Mel bands and 512 frames.
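A compact way to reproduce this front-end is with torchaudio's built-in Mel-spectrogram transform; the sketch below uses the parameters stated above (16 kHz, 25 ms/10 ms framing, Hann window, 128 Mel bands) and pads or truncates to 512 frames. The use of torchaudio and the small log-floor constant are implementation assumptions, not details from the paper.

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000,
    n_fft=400,            # 25 ms analysis window
    hop_length=160,       # 10 ms hop
    window_fn=torch.hann_window,
    n_mels=128,
    power=2.0,            # power spectrogram before Mel projection
)

def log_mel(waveform: torch.Tensor, target_frames: int = 512) -> torch.Tensor:
    """Compute a log-Mel spectrogram and pad/truncate it to a fixed number of frames."""
    mel = mel_transform(waveform)                 # shape: (128, T)
    log_mel = torch.log(mel + 1e-6)               # small floor for numerical stability
    T = log_mel.shape[-1]
    if T < target_frames:                         # zero-pad short inputs on the time axis
        log_mel = torch.nn.functional.pad(log_mel, (0, target_frames - T))
    return log_mel[..., :target_frames]           # truncate long inputs

S = log_mel(torch.randn(5 * 16_000))              # 5 s waveform -> (128, 512)
print(S.shape)
```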
2. Convolutional Patchification and Local Feature Encoding
To create patch embeddings, the Mel Patchify Block uses a five-layer convolutional “stem” rather than non-overlapping slicing. Starting from the log-Mel matrix $\mathbf{S}$, treated as a single-channel image $\mathbf{Z}^{(0)}$, the process applies five consecutive convolutions:

$$\mathbf{Z}^{(\ell)} = \mathrm{Conv2D}_{3 \times 3}\!\left(\mathbf{Z}^{(\ell-1)};\, C_\ell,\, s_\ell,\, p_\ell\right), \qquad \ell = 1, \dots, 5,$$

where $C_\ell$ is the number of output channels, $s_\ell$ is the stride, and $p_\ell$ denotes padding. Every layer uses a kernel of size 3, and the downsampling layers use stride 2 with padding 1.
Each convolutional layer with stride 2 halves the corresponding dimension, reducing the $512 \times 128$ time–frequency grid to a compact spatial grid of patch positions. Each spatial position in the output tensor indexes a patch covering $512/32 = 16$ time frames and $128/8 = 16$ Mel bands. The use of kernel size 3 with stride 2 leads to overlapping receptive fields: adjacent patches share one row or column of the input at each downsampling stage, ensuring continuity across patch borders.
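A minimal PyTorch sketch of such a stem is given below. The channel widths, the per-layer stride pattern, and the normalization/activation layers are illustrative assumptions rather than the paper's configuration; what the sketch preserves is the five-layer structure with kernel size 3 and stride-2 downsampling that produces overlapping receptive fields.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Five-layer convolutional stem with overlapping (kernel 3, stride 2) downsampling.

    Channel widths, stride pattern, and norm/activation choices are illustrative,
    not the paper's exact configuration.
    """

    def __init__(self, channels=(32, 64, 128, 256, 256)):
        super().__init__()
        layers, in_ch = [], 1                       # log-Mel input treated as a 1-channel image
        for i, out_ch in enumerate(channels):
            stride = 2 if i < 4 else 1              # assumed: four downsampling stages, one refining stage
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.GELU(),
            ]
            in_ch = out_ch
        self.stem = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, mel_bands, frames) -> (batch, C, mel_bands', frames')
        return self.stem(x)

stem = ConvStem()
z = stem(torch.randn(2, 1, 128, 512))               # toy batch of log-Mel inputs
print(z.shape)                                       # torch.Size([2, 256, 8, 32])
```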
3. Positional Encoding and Patch Embedding Sequence Construction
To encode spatial location, each patch vector $\mathbf{z}_{i,j}$ at grid position $(i, j)$ is augmented with an independent learnable 2D positional embedding $\mathbf{p}_{i,j}$:

$$\tilde{\mathbf{z}}_{i,j} = \mathbf{z}_{i,j} + \mathbf{p}_{i,j},$$

where one embedding is learned per grid position and $N_p$ is the number of patches. The final output is formed by flattening the spatial grid into a patch sequence:

$$\mathbf{X} = \big[\tilde{\mathbf{z}}_{1},\, \tilde{\mathbf{z}}_{2},\, \dots,\, \tilde{\mathbf{z}}_{N_p}\big] \in \mathbb{R}^{N_p \times C},$$
which directly feeds the following Transformer Encoder in the GTransformer block (Feng et al., 12 Dec 2025).
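Continuing the sketch above, the per-position learnable embeddings and the flattening into a token sequence can be written as follows. The grid size and embedding dimension simply follow from the illustrative stem, and the class name MelPatchifyBlock is used here for convenience rather than taken from the paper's code.

```python
import torch
import torch.nn as nn

class MelPatchifyBlock(nn.Module):
    """Sketch: conv stem + learnable 2D positional embeddings + flattening to tokens."""

    def __init__(self, stem: nn.Module, embed_dim: int = 256, grid_hw=(8, 32)):
        super().__init__()
        self.stem = stem
        h, w = grid_hw
        # one independent learnable embedding per (frequency, time) grid position
        self.pos_embed = nn.Parameter(torch.zeros(1, embed_dim, h, w))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.stem(x)                      # (B, C, H, W) patch grid
        z = z + self.pos_embed                # add a positional embedding at each grid position
        return z.flatten(2).transpose(1, 2)   # (B, N_p, C) token sequence, N_p = H * W

# ConvStem is the class from the previous sketch
patchify = MelPatchifyBlock(ConvStem())
tokens = patchify(torch.randn(2, 1, 128, 512))
print(tokens.shape)                           # torch.Size([2, 256, 256]) -> 256 tokens of dim 256
```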
4. Motivations and Design Rationale
The Mel Patchify Block’s architecture addresses several critical requirements for robust time–frequency representation learning:
- Local–global context fusion: The convolutional stem fuses local context within each region, enabling each patch embedding to encode “fine-grained” spectral texture.
- Continuity via overlapping patches: Overlapping receptive fields (from kernel size 3, stride 2 convolutions) mitigate block artifacts and preserve coherence for frequency components crossing patch boundaries—essential for spectro-temporal phenomena such as narrowband lines and chirps.
- Efficient sequence length: Progressive strided convolution rapidly reduces the input to a tractable grid, which balances feature granularity against computational considerations for subsequent multi-head self-attention.
- Explicit spatial information: Learnable 2D positional encodings provide absolute localization, supporting the Transformer's global attention in both time and frequency axes.
5. Integration in the UATR-GTransformer Pipeline
Within the UATR-GTransformer, the Mel Patchify Block serves as the front-end, transforming the Mel spectrogram into a sequence of overlap-aware embeddings. These embeddings become input tokens for the GTransformer block, where a Transformer Encoder models patchwise mutual information to generate Mel-graph embeddings. Subsequently, a Graph Neural Network (GNN) captures local patch neighborhood structure, followed by a feed-forward network for feature transformation. This layered approach enables the model to combine convolutional inductive biases with global self-attention and graph-based relational learning, tailored to the unique non-Euclidean, non-stationary characteristics of underwater acoustic data (Feng et al., 12 Dec 2025).
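As a rough structural illustration of this pipeline (not the authors' implementation), the stages can be composed as below. The Transformer Encoder uses PyTorch's stock module, the graph stage is stood in for by a generic message-passing layer over a 4-neighbour patch-grid adjacency, and hyperparameters such as the number of encoder layers, attention heads, and output classes are placeholders.

```python
import torch
import torch.nn as nn

# assumes ConvStem and MelPatchifyBlock from the previous sketches are in scope

class SimpleGraphLayer(nn.Module):
    """Stand-in GNN layer: mixes each token with its grid neighbours via a fixed adjacency."""

    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        # row-normalised adjacency over the patch grid (assumed 4-neighbour connectivity)
        self.register_buffer("adj", adj / adj.sum(dim=-1, keepdim=True))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(self.adj @ x)    # aggregate neighbours, then residual update

def grid_adjacency(h: int, w: int) -> torch.Tensor:
    """Adjacency (with self-loops) for an h x w patch grid with 4-neighbour edges."""
    adj = torch.eye(h * w)
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    adj[i * w + j, ni * w + nj] = 1.0
    return adj

class UATRGTransformerSketch(nn.Module):
    """Patchify -> Transformer Encoder -> graph layer -> feed-forward head (structural sketch only)."""

    def __init__(self, patchify: nn.Module, dim: int = 256, grid_hw=(8, 32), n_classes: int = 10):
        super().__init__()
        self.patchify = patchify
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.gnn = SimpleGraphLayer(dim, grid_adjacency(*grid_hw))
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patchify(x)             # (B, N_p, C) overlap-aware patch embeddings
        tokens = self.encoder(tokens)         # global patchwise interactions
        tokens = self.gnn(tokens)             # local neighbourhood refinement on the patch graph
        return self.head(tokens.mean(dim=1))  # pooled logits

model = UATRGTransformerSketch(MelPatchifyBlock(ConvStem()))
logits = model(torch.randn(2, 1, 128, 512))
print(logits.shape)                           # torch.Size([2, 10])
```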
6. Advantages in Feature Extraction and Recognition Tasks
The Mel Patchify Block enables several advantages for challenging acoustic tasks:
- Improved spectral feature continuity: Overlapping, learned patchification preserves the integrity of spectro-temporal features, reducing the risk of fragmentation across patch boundaries.
- Reduced computational burden: Downsampling compresses the long time–frequency input into a compact token grid, maintaining adequate context while ensuring tractability for self-attention mechanisms.
- Enhanced inductive bias: Convolutional filters inject prior knowledge about local time–frequency structure, which stabilizes training and improves generalization, especially for non-stationary or non-Gaussian signal classes.
A plausible implication is that analogous patchification strategies may benefit related domains where spectral continuity and local context are vital, such as general-purpose audio event detection or biomedical time–frequency signal analysis.