
Peak-Aware CGAN for GC-MS Data Synthesis

Updated 5 February 2026
  • A peak-aware conditional generative model is a framework that synthesizes GC-MS data, replicating the spectral peaks and interference patterns essential for chemical analysis.
  • It employs a CGAN architecture with a novel peak-aware attention mechanism to accurately emulate high-intensity peaks under noise and chemical interference.
  • The model enhances chemical detection by augmenting training datasets, achieving high fidelity metrics such as cosine similarity > 0.94 and PCC > 0.94.

A peak-aware conditional generative model is an artificial intelligence framework tailored to synthesize gas chromatography-mass spectrometry (GC-MS) data under complex chemical interference conditions. Its central innovation is a peak-aware attention mechanism integrated within a conditional generative adversarial network (CGAN), designed to reliably reproduce the sharp spectral peaks and realistic interference patterns that characterize GC-MS measurements of chemical mixtures affected by nonspecific peaks, retention-time shifts, and background noise. The approach enables robust generation of synthetic spectra consistent with specified chemical and solvent conditions, improving the training of AI-based chemical discrimination models when experimental data are limited or costly to obtain (Yoon et al., 29 Jan 2026).

1. Conditional Generative Model Architecture

The model employs a CGAN in which the generator $G$ creates synthetic 1D GC-MS spectra $\hat{x} \in \mathbb{R}^{T}$ (with $T = 5347$ time-intensity bins), conditioned on a concatenated solvent-chemical encoding $c$. The main architectural components are:

  • Condition Encoding: Solvent label $c_s$ (one-hot, dimension 4) and target-chemical label $c_t$ (one-hot, dimension 6) are concatenated, then embedded via a learnable layer to produce $E_c \in \mathbb{R}^{100}$.
  • Generator: Takes input noise $z \in \mathbb{R}^{100}$ and $E_c$.
    • Stage 1: Multi-head self-attention (4 heads) fuses the conditional information: $H_1 = \text{MHA}(Q = E_c, K = E_c, V = E_c)$.
    • Stage 2: $H_1$ concatenated with $z$ is projected through 16 residual/linear blocks (hidden dimension 32) to produce $F_{\text{up}} \in \mathbb{R}^{T \times 100}$.
    • Stage 3: A second MHA layer outputs $H_2 \in \mathbb{R}^{T \times 100}$.
    • Stage 4: Peak-aware attention (see Section 2) reweights $H_2$.
    • Stage 5: A final 1D convolution or linear projection maps $H_2$ to $\hat{x}$.
  • Discriminator: Accepts $x$ or $\hat{x}$ together with $c$.
    • Features are extracted by multiple 1D convolutional layers (kernel sizes 3, 5, 7) with layer normalization and LeakyReLU activations.
    • The condition embedding $E_c$ is broadcast-added to intermediate features.
    • The output is a scalar $D(x, c)$ trained with a least-squares GAN objective (real label 1, fake label 0).
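The condition-encoding step can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the one-hot dimensions (4 and 6) and embedding dimension (100) come from the paper, while the random `W_embed` matrix stands in for the learnable embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SOLVENTS, N_CHEMICALS, EMBED_DIM = 4, 6, 100  # dimensions from the paper

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Learnable embedding layer, stood in here by a random weight matrix.
W_embed = rng.normal(0.0, 0.02, size=(N_SOLVENTS + N_CHEMICALS, EMBED_DIM))

def encode_condition(solvent_idx, chemical_idx):
    """Concatenate one-hot solvent/chemical labels and embed them as E_c."""
    c = np.concatenate([one_hot(solvent_idx, N_SOLVENTS),
                        one_hot(chemical_idx, N_CHEMICALS)])
    return c @ W_embed  # E_c in R^100

E_c = encode_condition(solvent_idx=2, chemical_idx=5)
print(E_c.shape)  # (100,)
```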

2. Peak-Aware Attention Mechanism

To accurately emulate the sharp local maxima constituting spectral peaks, the generator applies a slope-based, differentiable attention mechanism:

  • Slope Calculation: For a signal $x = [x_1, ..., x_T]$, the slope at time $t$ is $s_t = |x_t - x_{t-1}|$ for $t = 2, ..., T$.
  • Exponential Weighting and Normalization:

$\alpha_t = \exp(s_t) \big/ \sum_{j=2}^{T} \exp(s_j)$, so that $\sum_{t=2}^{T} \alpha_t = 1$.

  • Zero-Padding and Smoothing: $[0, \alpha_2, ..., \alpha_T]$ is smoothed by a 1D convolution and passed through a sigmoid: $\tilde{\alpha} = \sigma(\text{Conv1D}(\cdot)) \in \mathbb{R}^{T}$.
  • Feature Reweighting: For generator feature maps $H_2 \in \mathbb{R}^{T \times d}$, each element is scaled by $\tilde{\alpha}$: $\hat{X}_{t,i} = H_{2,t,i} \cdot \tilde{\alpha}_t$.

Larger local slopes result in greater attention allocation, prioritizing the accurate synthesis of high-intensity peaks over smooth baseline regions.
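A minimal NumPy sketch of this mechanism follows. It is illustrative, not the paper's code: the 5-tap averaging kernel stands in for the learned Conv1D smoother, and a max-shifted softmax is used for numerical stability (it yields the same normalized weights as the formula above).

```python
import numpy as np

def peak_aware_attention(x, H2, kernel=np.ones(5) / 5):
    """Slope-based attention: reweight feature rows by local slope magnitude.

    x  : (T,) spectrum from which slopes are computed
    H2 : (T, d) generator feature map to be reweighted
    """
    s = np.abs(np.diff(x))                  # s_t = |x_t - x_{t-1}|, length T-1
    alpha = np.exp(s - s.max())             # numerically stable softmax
    alpha = alpha / alpha.sum()             # normalize so the weights sum to 1
    alpha = np.concatenate([[0.0], alpha])  # zero-pad back to length T
    smoothed = np.convolve(alpha, kernel, mode="same")   # 1D smoothing
    tilde_alpha = 1.0 / (1.0 + np.exp(-smoothed))        # sigmoid activation
    return H2 * tilde_alpha[:, None]        # elementwise reweighting

T, d = 64, 8
rng = np.random.default_rng(1)
x = rng.random(T)
H2 = rng.normal(size=(T, d))
out = peak_aware_attention(x, H2)
print(out.shape)  # (64, 8)
```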

3. Loss Functions and Training Procedure

The model is trained under a min–max regime alternating between generator and discriminator updates, with the following objectives:

  • Adversarial Loss (LSGAN):

$L_D = \frac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}},\, c}\left[(D(x, c) - 1)^2\right] + \frac{1}{2}\, \mathbb{E}_{z \sim p_z,\, c}\left[D(G(z \mid c), c)^2\right]$

$L_G^{\text{adv}} = \mathbb{E}_{z \sim p_z,\, c}\left[(D(G(z \mid c), c) - 1)^2\right]$

  • Spectral Reconstruction Loss:

$L_G^{\text{rec}} = \lambda \cdot \mathbb{E}_{x \sim p_{\text{data}},\, z,\, c}\left[\left\| \text{STFT}(x) - \text{STFT}(G(z \mid c)) \right\|_2^2\right]$

  • Total Generator Loss:

$L_G = L_G^{\text{adv}} + L_G^{\text{rec}}$

  • Optimization: Alternating updates to $D$ (one gradient step of $L_D$) and $G$ (one gradient step of $L_G$) via Adam (generator learning rate $1 \times 10^{-4}$, discriminator $1 \times 10^{-5}$, $\beta_1 = 0.5$, $\beta_2 = 0.9$), batch size 128, for 100,000 iterations.
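These objectives can be written compactly in NumPy. This is a sketch under simplifying assumptions: discriminator outputs are given as precomputed scalars per sample, and `np.fft.rfft` stands in for the paper's STFT.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real outputs to 1, fake to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_adv_loss(d_fake):
    """Least-squares generator adversarial loss: push fake outputs to 1."""
    return np.mean((d_fake - 1.0) ** 2)

def spectral_rec_loss(x, x_hat, lam=1.0):
    """Frequency-domain reconstruction loss; rfft stands in for a full STFT."""
    return lam * np.mean(np.abs(np.fft.rfft(x) - np.fft.rfft(x_hat)) ** 2)

# Toy batch of discriminator outputs.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.2])
print(lsgan_d_loss(d_real, d_fake))   # 0.025
print(lsgan_g_adv_loss(d_fake))       # 0.725
```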

4. Data Generation and Interference Modeling

The framework is trained and evaluated on real GC-MS spectra produced by reacting known chemical surrogates with common interfering materials (e.g., brick, soil, grass, asphalt, kerosene, acetone) across four solvents (EtOH, MeOH, MC, THF). Interference effects such as retention time shifts, appearance of nonspecific peaks, and increased background noise are present within the training data.

During synthetic data generation, noise $z \sim \mathcal{N}(0, I)$ and randomly selected condition pairs $(c_s, c_t)$ are encoded and passed through the generator. No explicit procedural noise model is introduced; thus, emulated interference patterns, peak distortions, and background irregularities arise from the conditional and statistical modeling capacity of $G$ trained on real measurements.
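The sampling procedure can be sketched as follows. The `generator_stub` is a hypothetical placeholder for the trained $G$; only the noise distribution, dimensions, and condition ranges come from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
T, Z_DIM = 5347, 100   # spectrum length and noise dimension from the paper

def generator_stub(z, c):
    """Stand-in for the trained generator G(z | c); returns a dummy spectrum."""
    return np.abs(rng.normal(size=T))  # placeholder output in R^T

# Sample noise z ~ N(0, I) and a random (solvent, chemical) condition pair.
z = rng.normal(size=Z_DIM)
c = (rng.integers(0, 4), rng.integers(0, 6))   # (c_s, c_t) label indices
x_hat = generator_stub(z, c)
print(x_hat.shape)  # (5347,)
```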

5. Evaluation Metrics and Model Performance

The model is evaluated quantitatively at the spectrum level and for downstream detection efficacy:

  • Spectrum-Level Metrics:
    • Cosine Similarity: $\text{Cos}(x, \hat{x}) = \dfrac{x \cdot \hat{x}}{\|x\|_2 \, \|\hat{x}\|_2}$
    • Pearson Correlation Coefficient (PCC): $\text{PCC}(x, \hat{x}) = \dfrac{\sum_t (x_t - \bar{x})(\hat{x}_t - \bar{\hat{x}})}{\sqrt{\sum_t (x_t - \bar{x})^2}\, \sqrt{\sum_t (\hat{x}_t - \bar{\hat{x}})^2}}$
    • Peak Count Matching: the number of distinct peaks per spectrum is preserved within ±1.

Under all single-agent and multi-agent interference conditions, the model achieves $\text{Cos} > 0.94$ and $\text{PCC} > 0.94$, frequently exceeding 0.99, indicating high fidelity in the synthetic spectra. Overlaid chromatograms demonstrate alignment of major and minor peak features in complex chemical mixtures.
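The two spectrum-level metrics can be computed directly from their definitions; the short toy spectra below are illustrative, not data from the paper.

```python
import numpy as np

def cosine_similarity(x, x_hat):
    """Cos(x, x_hat) = (x . x_hat) / (||x||_2 * ||x_hat||_2)."""
    return np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))

def pcc(x, x_hat):
    """Pearson correlation coefficient between two spectra."""
    xc, yc = x - x.mean(), x_hat - x_hat.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

x     = np.array([0.00, 1.00, 0.20, 0.90, 0.10])   # toy "real" spectrum
x_hat = np.array([0.05, 0.95, 0.25, 0.85, 0.10])   # toy "synthetic" spectrum
print(round(cosine_similarity(x, x_hat), 3))
print(round(pcc(x, x_hat), 3))
```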

  • Downstream Classification: A transformer-based classifier trained on increasing proportions of synthetic data attains an F1-score improvement from approximately 0.33 (with 123 synthetic samples) to approximately 0.87 (with 922 synthetic samples), validating the utility of generated data for chemical substance discrimination in the presence of interference.

6. Implementation, Preprocessing, and Training Workflow

All model and data pipeline aspects are explicitly defined to ensure reproducibility:

  • Key Hyperparameters:
| Component          | Specification                  | Value              |
|--------------------|--------------------------------|--------------------|
| Solvent label      | One-hot dimension              | 4                  |
| Chemical label     | One-hot dimension              | 6                  |
| Embedding layer    | Output dimension               | 100                |
| Generator          | Depth (residual/linear blocks) | 16                 |
| Generator          | Hidden dimension               | 32                 |
| Spectrum length    | Output dimension $T$           | 5,347              |
| Training           | Batch size                     | 128                |
| Learning rate (G)  | $LR_G$                         | $1 \times 10^{-4}$ |
| Learning rate (D)  | $LR_D$                         | $1 \times 10^{-5}$ |
| Optimizer          | Adam $(\beta_1, \beta_2)$      | (0.5, 0.9)         |
| Training           | Total iterations               | 100,000            |
  • Preprocessing:

    • Raw chromatograms are baseline-corrected and resampled onto a fixed $T = 5347$ grid.
    • Peak intensities are min–max normalized prior to being passed into the network and slope attention module.
  • Algorithmic Outline: As stated in Algorithm 1 of the reference:
    • Sample real $(x, c_s, c_t)$ and compute $D(x, c)$.
    • Sample $z$, generate $\hat{x} = G(z \mid c)$, and compute $D(\hat{x}, c)$.
    • Calculate the adversarial losses ($L_D$, $L_G^{\text{adv}}$) and the spectral reconstruction loss ($L_G^{\text{rec}}$), and optimize $D$ and $G$ alternately.
    • Store the generated spectra and their conditions.
    • Train the downstream detector $M$ on combined real and synthetic data using peak-aware features.
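The preprocessing steps can be sketched as below. This is a minimal assumption-laden illustration: the paper does not specify its baseline-correction algorithm, so a simple linear baseline subtraction stands in for it, and the input chromatogram is a toy Gaussian peak on a drifting baseline.

```python
import numpy as np

T = 5347  # fixed output grid length from the paper

def preprocess(times, intensities):
    """Baseline-correct, resample onto a fixed T-point grid, min-max normalize."""
    # Linear baseline subtraction (stand-in for the paper's unspecified method).
    baseline = np.linspace(intensities[0], intensities[-1], len(intensities))
    corrected = intensities - baseline
    # Resample onto a uniform T-point retention-time grid.
    grid = np.linspace(times[0], times[-1], T)
    resampled = np.interp(grid, times, corrected)
    # Min-max normalize intensities into [0, 1].
    lo, hi = resampled.min(), resampled.max()
    return (resampled - lo) / (hi - lo + 1e-12)

times = np.linspace(0.0, 30.0, 1200)                        # toy 30-minute run
intensities = np.exp(-((times - 12.0) ** 2)) + 0.01 * times  # peak + drift
x = preprocess(times, intensities)
print(x.shape)  # (5347,)
```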

7. Significance and Application Scope

The peak-aware conditional generative model enables effective simulation of GC-MS measurements in scenarios characterized by substantial interference and limited labeled data. By incorporating a differentiable attention mechanism that emulates peak sharpness, the method preserves both global and local spectral features. Its use facilitates the generation of training datasets for downstream AI-based chemical detection models, ultimately reducing false alarms, improving detection accuracy, and matching physical measurement diversity without explicit noise modeling. Applications include robust chemical screening in forensics, environmental monitoring, industrial quality control, and scenarios where interference is inevitable or sample acquisition is constrained (Yoon et al., 29 Jan 2026).
