
Peak-Aware CGAN for GC-MS Data Synthesis

Updated 5 February 2026
  • A peak-aware conditional generative model is a framework that synthesizes GC-MS data, replicating the spectral peaks and interference patterns essential for chemical analysis.
  • It employs a CGAN architecture with a novel peak-aware attention mechanism to accurately emulate high-intensity peaks under noise and chemical interference.
  • The model enhances chemical detection by augmenting training datasets, achieving high fidelity metrics such as cosine similarity > 0.94 and PCC > 0.94.

A peak-aware conditional generative model is an artificial intelligence framework tailored to synthesize gas chromatography-mass spectrometry (GC-MS) data under complex chemical interference conditions. Its central innovation is a peak-aware attention mechanism integrated within a conditional generative adversarial network (CGAN), designed to reliably reproduce the sharp spectral peaks and realistic interference patterns that characterize GC-MS measurements of chemical mixtures affected by nonspecific peaks, retention-time shifts, and background noise. The approach enables robust generation of synthetic spectra consistent with specified chemical and solvent conditions, improving the training of AI-based chemical discrimination models when experimental data are limited or costly to obtain (Yoon et al., 29 Jan 2026).

1. Conditional Generative Model Architecture

The model employs a CGAN in which the generator $G$ creates synthetic 1D GC-MS spectra $\hat{x} \in \mathbb{R}^{T}$ (with $T = 5347$ time-intensity bins), conditioned on a concatenated solvent-chemical encoding $c$. The main architectural components are:

  • Condition Encoding: Solvent label $c_s$ (one-hot, dimension 4) and target-chemical label $c_t$ (one-hot, dimension 6) are concatenated, then embedded via a learnable layer to produce $E_c \in \mathbb{R}^{100}$.
  • Generator: Takes input noise $z \in \mathbb{R}^{100}$ and $E_c$.
    • Stage 1: Multi-head self-attention (4 heads) fuses the conditional information: $H_1 = \text{MHA}(Q = E_c, K = E_c, V = E_c)$.
    • Stage 2: $H_1$ concatenated with $z$ is projected through 16 residual/linear blocks (hidden dimension 32) to produce $F_{\text{up}} \in \mathbb{R}^{T \times 100}$.
    • Stage 3: A second MHA layer outputs $H_2 \in \mathbb{R}^{T \times 100}$.
    • Stage 4: Peak-aware attention (see Section 2) reweights $H_2$.
    • Stage 5: A final 1D convolution or linear projection maps $H_2$ to $\hat{x}$.
  • Discriminator: Accepts $x$ or $\hat{x}$ together with $c$.
    • Features are extracted by multiple 1D convolutional layers (kernel sizes 3, 5, 7) with layer normalization and LeakyReLU activations.
    • The condition embedding $E_c$ is broadcast-added to intermediate features.
    • The output is a scalar $D(x, c)$ trained with a least-squares GAN objective (real label 1, fake label 0).
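The condition-encoding step can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the one-hot dimensions (4 and 6) and embedding dimension (100) come from the paper, while the random `W_embed` matrix stands in for the learnable embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SOLVENTS, N_CHEMICALS, EMBED_DIM = 4, 6, 100  # dimensions from the paper

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Learnable embedding layer, stood in here by a random weight matrix.
W_embed = rng.normal(0.0, 0.02, size=(N_SOLVENTS + N_CHEMICALS, EMBED_DIM))

def encode_condition(solvent_idx, chemical_idx):
    """Concatenate one-hot solvent/chemical labels and embed them as E_c."""
    c = np.concatenate([one_hot(solvent_idx, N_SOLVENTS),
                        one_hot(chemical_idx, N_CHEMICALS)])
    return c @ W_embed  # E_c in R^100

E_c = encode_condition(solvent_idx=2, chemical_idx=5)
print(E_c.shape)  # (100,)
```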

2. Peak-Aware Attention Mechanism

To accurately emulate the sharp local maxima constituting spectral peaks, the generator applies a slope-based, differentiable attention mechanism:

  • Slope Calculation: For a signal $x = [x_1, ..., x_T]$, the slope at time $t$ is $s_t = |x_t - x_{t-1}|$ for $t = 2, ..., T$.
  • Exponential Weighting and Normalization:

$\alpha_t = \exp(s_t) \big/ \sum_{j=2}^{T} \exp(s_j)$, so that $\sum_{t=2}^{T} \alpha_t = 1$.

  • Zero-Padding and Smoothing: $[0, \alpha_2, ..., \alpha_T]$ is smoothed by a 1D convolution and passed through a sigmoid: $\tilde{\alpha} = \sigma(\text{Conv1D}(\cdot)) \in \mathbb{R}^{T}$.
  • Feature Reweighting: For generator feature maps $H_2 \in \mathbb{R}^{T \times d}$, each element is scaled by $\tilde{\alpha}$: $\hat{X}_{t,i} = H_{2,t,i} \cdot \tilde{\alpha}_t$.

Larger local slopes result in greater attention allocation, prioritizing the accurate synthesis of high-intensity peaks over smooth baseline regions.
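A minimal NumPy sketch of this mechanism follows. It is illustrative, not the paper's code: the 5-tap averaging kernel stands in for the learned Conv1D smoother, and a max-shifted softmax is used for numerical stability (it yields the same normalized weights as the formula above).

```python
import numpy as np

def peak_aware_attention(x, H2, kernel=np.ones(5) / 5):
    """Slope-based attention: reweight feature rows by local slope magnitude.

    x  : (T,) spectrum from which slopes are computed
    H2 : (T, d) generator feature map to be reweighted
    """
    s = np.abs(np.diff(x))                  # s_t = |x_t - x_{t-1}|, length T-1
    alpha = np.exp(s - s.max())             # numerically stable softmax
    alpha = alpha / alpha.sum()             # normalize so the weights sum to 1
    alpha = np.concatenate([[0.0], alpha])  # zero-pad back to length T
    smoothed = np.convolve(alpha, kernel, mode="same")   # 1D smoothing
    tilde_alpha = 1.0 / (1.0 + np.exp(-smoothed))        # sigmoid activation
    return H2 * tilde_alpha[:, None]        # elementwise reweighting

T, d = 64, 8
rng = np.random.default_rng(1)
x = rng.random(T)
H2 = rng.normal(size=(T, d))
out = peak_aware_attention(x, H2)
print(out.shape)  # (64, 8)
```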

3. Loss Functions and Training Procedure

The model is trained under a min–max regime alternating between generator and discriminator updates, with the following objectives:

  • Adversarial Loss (LSGAN):

$L_D = \frac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}},\, c}\left[(D(x, c) - 1)^2\right] + \frac{1}{2}\, \mathbb{E}_{z \sim p_z,\, c}\left[D(G(z \mid c), c)^2\right]$

$L_G^{\text{adv}} = \mathbb{E}_{z \sim p_z,\, c}\left[(D(G(z \mid c), c) - 1)^2\right]$

  • Spectral Reconstruction Loss:

$L_G^{\text{rec}} = \lambda \cdot \mathbb{E}_{x \sim p_{\text{data}},\, z,\, c}\left[\left\| \text{STFT}(x) - \text{STFT}(G(z \mid c)) \right\|_2^2\right]$

  • Total Generator Loss:

$L_G = L_G^{\text{adv}} + L_G^{\text{rec}}$

  • Optimization: Alternating updates to $D$ (one gradient step of $L_D$) and $G$ (one gradient step of $L_G$) via Adam (generator learning rate $1 \times 10^{-4}$, discriminator $1 \times 10^{-5}$, $\beta_1 = 0.5$, $\beta_2 = 0.9$), batch size 128, for 100,000 iterations.
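These objectives can be written compactly in NumPy. This is a sketch under simplifying assumptions: discriminator outputs are given as precomputed scalars per sample, and `np.fft.rfft` stands in for the paper's STFT.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real outputs to 1, fake to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_adv_loss(d_fake):
    """Least-squares generator adversarial loss: push fake outputs to 1."""
    return np.mean((d_fake - 1.0) ** 2)

def spectral_rec_loss(x, x_hat, lam=1.0):
    """Frequency-domain reconstruction loss; rfft stands in for a full STFT."""
    return lam * np.mean(np.abs(np.fft.rfft(x) - np.fft.rfft(x_hat)) ** 2)

# Toy batch of discriminator outputs.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.2])
print(lsgan_d_loss(d_real, d_fake))   # 0.025
print(lsgan_g_adv_loss(d_fake))       # 0.725
```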

4. Data Generation and Interference Modeling

The framework is trained and evaluated on real GC-MS spectra produced by reacting known chemical surrogates with common interfering materials (e.g., brick, soil, grass, asphalt, kerosene, acetone) across four solvents (EtOH, MeOH, MC, THF). Interference effects such as retention time shifts, appearance of nonspecific peaks, and increased background noise are present within the training data.

During synthetic data generation, noise $z \sim \mathcal{N}(0, I)$ and randomly selected condition pairs $(c_s, c_t)$ are encoded and passed through the generator. No explicit procedural noise model is introduced; thus, emulated interference patterns, peak distortions, and background irregularities arise from the conditional and statistical modeling capacity of $G$ trained on real measurements.
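The sampling procedure can be sketched as follows. The `generator_stub` is a hypothetical placeholder for the trained $G$; only the noise distribution, dimensions, and condition ranges come from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
T, Z_DIM = 5347, 100   # spectrum length and noise dimension from the paper

def generator_stub(z, c):
    """Stand-in for the trained generator G(z | c); returns a dummy spectrum."""
    return np.abs(rng.normal(size=T))  # placeholder output in R^T

# Sample noise z ~ N(0, I) and a random (solvent, chemical) condition pair.
z = rng.normal(size=Z_DIM)
c = (rng.integers(0, 4), rng.integers(0, 6))   # (c_s, c_t) label indices
x_hat = generator_stub(z, c)
print(x_hat.shape)  # (5347,)
```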

5. Evaluation Metrics and Model Performance

The model is evaluated quantitatively at the spectrum level and for downstream detection efficacy:

  • Spectrum-Level Metrics:
    • Cosine Similarity: $\text{Cos}(x, \hat{x}) = \dfrac{x \cdot \hat{x}}{\|x\|_2 \, \|\hat{x}\|_2}$
    • Pearson Correlation Coefficient (PCC): $\text{PCC}(x, \hat{x}) = \dfrac{\sum_t (x_t - \bar{x})(\hat{x}_t - \bar{\hat{x}})}{\sqrt{\sum_t (x_t - \bar{x})^2}\, \sqrt{\sum_t (\hat{x}_t - \bar{\hat{x}})^2}}$
    • Peak Count Matching: the number of distinct peaks per spectrum is preserved within ±1.

Under all single-agent and multi-agent interference conditions, the model achieves $\text{Cos} > 0.94$ and $\text{PCC} > 0.94$, frequently exceeding 0.99, indicating high fidelity in the synthetic spectra. Overlaid chromatograms demonstrate alignment of major and minor peak features in complex chemical mixtures.
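The two spectrum-level metrics can be computed directly from their definitions; the short toy spectra below are illustrative, not data from the paper.

```python
import numpy as np

def cosine_similarity(x, x_hat):
    """Cos(x, x_hat) = (x . x_hat) / (||x||_2 * ||x_hat||_2)."""
    return np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))

def pcc(x, x_hat):
    """Pearson correlation coefficient between two spectra."""
    xc, yc = x - x.mean(), x_hat - x_hat.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

x     = np.array([0.00, 1.00, 0.20, 0.90, 0.10])   # toy "real" spectrum
x_hat = np.array([0.05, 0.95, 0.25, 0.85, 0.10])   # toy "synthetic" spectrum
print(round(cosine_similarity(x, x_hat), 3))
print(round(pcc(x, x_hat), 3))
```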

  • Downstream Classification: A transformer-based classifier trained on increasing proportions of synthetic data attains an F1-score improvement from approximately 0.33 (with 123 synthetic samples) to approximately 0.87 (with 922 synthetic samples), validating the utility of generated data for chemical substance discrimination in the presence of interference.

6. Implementation, Preprocessing, and Training Workflow

All model and data pipeline aspects are explicitly defined to ensure reproducibility:

  • Key Hyperparameters:
| Component          | Specification                  | Value              |
|--------------------|--------------------------------|--------------------|
| Solvent label      | One-hot dimension              | 4                  |
| Chemical label     | One-hot dimension              | 6                  |
| Embedding layer    | Output dimension               | 100                |
| Generator          | Depth (residual/linear blocks) | 16                 |
| Generator          | Hidden dimension               | 32                 |
| Spectrum length    | Output dimension $T$           | 5,347              |
| Training           | Batch size                     | 128                |
| Learning rate (G)  | $LR_G$                         | $1 \times 10^{-4}$ |
| Learning rate (D)  | $LR_D$                         | $1 \times 10^{-5}$ |
| Optimizer          | Adam $(\beta_1, \beta_2)$      | (0.5, 0.9)         |
| Training           | Total iterations               | 100,000            |
  • Preprocessing:

    • Raw chromatograms are baseline-corrected and resampled onto a fixed $T = 5347$ grid.
    • Peak intensities are min–max normalized prior to being passed into the network and slope attention module.
  • Algorithmic Outline: As stated in Algorithm 1 of the reference:
    • Sample real $(x, c_s, c_t)$ and compute $D(x, c)$.
    • Sample $z$, generate $\hat{x} = G(z \mid c)$, and compute $D(\hat{x}, c)$.
    • Calculate the adversarial losses ($L_D$, $L_G^{\text{adv}}$) and the spectral reconstruction loss ($L_G^{\text{rec}}$), and optimize $D$ and $G$ alternately.
    • Store the generated spectra and their conditions.
    • Train the downstream detector $M$ on combined real and synthetic data using peak-aware features.
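The preprocessing steps can be sketched as below. This is a minimal assumption-laden illustration: the paper does not specify its baseline-correction algorithm, so a simple linear baseline subtraction stands in for it, and the input chromatogram is a toy Gaussian peak on a drifting baseline.

```python
import numpy as np

T = 5347  # fixed output grid length from the paper

def preprocess(times, intensities):
    """Baseline-correct, resample onto a fixed T-point grid, min-max normalize."""
    # Linear baseline subtraction (stand-in for the paper's unspecified method).
    baseline = np.linspace(intensities[0], intensities[-1], len(intensities))
    corrected = intensities - baseline
    # Resample onto a uniform T-point retention-time grid.
    grid = np.linspace(times[0], times[-1], T)
    resampled = np.interp(grid, times, corrected)
    # Min-max normalize intensities into [0, 1].
    lo, hi = resampled.min(), resampled.max()
    return (resampled - lo) / (hi - lo + 1e-12)

times = np.linspace(0.0, 30.0, 1200)                        # toy 30-minute run
intensities = np.exp(-((times - 12.0) ** 2)) + 0.01 * times  # peak + drift
x = preprocess(times, intensities)
print(x.shape)  # (5347,)
```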

7. Significance and Application Scope

The peak-aware conditional generative model enables effective simulation of GC-MS measurements in scenarios characterized by substantial interference and limited labeled data. By incorporating a differentiable attention mechanism that emulates peak sharpness, the method preserves both global and local spectral features. Its use facilitates the generation of training datasets for downstream AI-based chemical detection models, ultimately reducing false alarms, improving detection accuracy, and matching physical measurement diversity without explicit noise modeling. Applications include robust chemical screening in forensics, environmental monitoring, industrial quality control, and scenarios where interference is inevitable or sample acquisition is constrained (Yoon et al., 29 Jan 2026).
