
USAD: Unsupervised Data Augmentation Diffusion Network

Updated 6 July 2025
  • The paper introduces a novel unsupervised diffusion data augmentation strategy that synthesizes balanced sensor data using a conditioned denoising diffusion probabilistic model.
  • It details a multi-branch spatio-temporal network leveraging both temporal and spatial attention to extract fine and coarse-grained features for improved human activity recognition.
  • The approach employs adaptive multi-loss fusion and cross-branch feature fusion, achieving robust convergence and state-of-the-art accuracy on benchmark HAR datasets.

The Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network (USAD) is a neural architecture for human activity recognition (HAR). It targets three persistent problems: scarce labeled samples for rare activities, insufficient extraction of high-level features, and the difficulty of running models in real time on lightweight embedded systems. USAD combines four components: unsupervised, statistics-guided data augmentation based on a diffusion model; a multi-branch spatio-temporal network with spatial and temporal attention; cross-branch feature fusion; and a dynamically adaptive multi-loss training protocol. Together these are intended to improve robustness and generalization in both balanced and imbalanced training scenarios (2507.02827).

1. Statistics-Guided Unsupervised Diffusion Data Augmentation

USAD leverages a denoising diffusion probabilistic model (DDPM) conditioned on statistics calculated from the original training data to synthesize realistic and class-balanced sensor signal sequences without explicit supervision. The data augmentation process includes:

  • Computation of Global and Local Statistics: For each original time-series sensor signal $x_0 \in \mathbb{R}^L$, global statistics (mean $\mu$, standard deviation $\sigma$, and skewness $\gamma$) are extracted as:

$$\mu = \frac{1}{L} \sum_{i=1}^{L} x_{0,i}, \quad \sigma = \sqrt{\frac{1}{L} \sum_{i=1}^{L} (x_{0,i}-\mu)^2}, \quad \gamma = \frac{1}{L} \sum_{i=1}^{L} \left(\frac{x_{0,i}-\mu}{\sigma}\right)^3,$$

while local statistics are generated using Z-score normalization.

  • Conditioned Synthetic Generation: The extracted statistics, stacked into a feature vector $f \in \mathbb{R}^{4L}$ and augmented with activity-specific prototype features ($\mu_y = \mathbb{E}[f \mid y]$), supply a conditioning mechanism to guide the diffusion model.
  • Diffusion Process: The synthetic generation uses the standard DDPM forward process, injecting Gaussian noise:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$

with a cosine noise schedule for stability. The reverse (denoising) direction relies on a conditioned generator $G_\theta$ that uses adaptive group normalization and class/time embeddings, so that generated samples stay consistent with the conditioning features and have realistic statistics.
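The global and local statistics from the first step above can be sketched in a few lines. This is a minimal illustration that follows the population ($1/L$) formulas given in the text; the function names `global_stats` and `z_score` are hypothetical, and how the statistics are stacked into the conditioning vector $f$ is not reproduced here because the text does not spell it out.

```python
import math

def global_stats(x):
    """Return (mean, std, skewness) of a 1-D signal with 1/L normalization."""
    L = len(x)
    mu = sum(x) / L
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / L)
    if sigma == 0.0:
        raise ValueError("constant signal: skewness and Z-score are undefined")
    gamma = sum(((v - mu) / sigma) ** 3 for v in x) / L
    return mu, sigma, gamma

def z_score(x):
    """Local statistics: each sample expressed in standard deviations from the mean."""
    mu, sigma, _ = global_stats(x)
    return [(v - mu) / sigma for v in x]

# A short window with one large spike, so the skewness is positive.
signal = [1.0, 2.0, 3.0, 4.0, 10.0]
mu, sigma, gamma = global_stats(signal)
local = z_score(signal)
```

After Z-score normalization the window has zero mean and unit variance, so the local statistics describe the shape of the signal independently of its scale.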

Through this paradigm, the model produces high-fidelity, balanced synthetic samples, thus effectively mitigating both data scarcity and class imbalance issues in HAR datasets.
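The forward noising step from this section can be illustrated as follows. The sketch assumes the cosine schedule takes the common form $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and offset $s = 0.008$; the text only says "cosine noise schedule", so the exact form and the names `alpha_bar` and `q_sample` are assumptions. The conditioned reverse generator is not shown.

```python
import math
import random

def alpha_bar(t, T, s=0.008):
    """Cumulative signal rate under an assumed cosine schedule."""
    def f(u):
        return math.cos(((u / T + s) / (1 + s)) * math.pi / 2) ** 2
    return f(t) / f(0)

def q_sample(x0, t, T, rng):
    """Draw x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar = alpha_bar(t, T)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * v + math.sqrt(1.0 - abar) * e for v, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)          # fixed seed so the example is repeatable
x0 = [0.1, -0.4, 0.9, 0.3]      # toy sensor window
T = 1000
x_early, _ = q_sample(x0, 10, T, rng)      # still close to x0
x_late, eps_late = q_sample(x0, T, T, rng)  # essentially pure noise
```

At `t = 0` the signal is untouched, and at `t = T` the sample is indistinguishable from the injected noise, which is the property the reverse process relies on.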

2. Multi-Branch Spatio-Temporal Interaction Network

To process sequential sensor data, USAD introduces a multi-branch convolutional architecture with each branch designed to extract features at different spatial and temporal scales:

  • Parallel Residual Branches: The input feature map is split into $K$ groups, each further divided into $R$ subgroups, resulting in parallel processing paths utilizing convolutional kernels of sizes $3\times3$, $5\times5$, and $7\times7$ within each branch.
    • Fine-Grained (e.g., $3\times3$): Captures local, minute temporal and spatial variations.
    • Coarse-Grained (e.g., $7\times7$): Emphasizes broader temporal context and inter-sensor dependencies.
  • Residual Connections: Each branch includes residual blocks to facilitate information preservation and gradient flow.

This multi-scale design enables adaptive extraction of both local and global dynamics inherent in multichannel sensor time series, enhancing the learning of discriminative representations for diverse human activities.
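To make the effect of kernel size concrete, the sketch below runs three residual branches over a single 1-D signal. It is a deliberate simplification: it uses 1-D kernels instead of the $3\times3$, $5\times5$, and $7\times7$ kernels described above, omits the $K$-group and $R$-subgroup splitting, and uses fixed averaging weights as placeholders for learned ones. The names `conv1d_same` and `multi_scale_branches` are hypothetical.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding so the output length equals len(x)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(x))]

def multi_scale_branches(x, kernel_sizes=(3, 5, 7)):
    """Run one residual branch per kernel size and return all branch outputs."""
    outputs = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k      # placeholder weights; a trained model learns these
        y = conv1d_same(x, kernel)
        outputs.append([xi + yi for xi, yi in zip(x, y)])  # residual connection
    return outputs

# A single impulse shows how far each branch spreads information.
x = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
branches = multi_scale_branches(x)
support = [sum(1 for v in b if v != 0.0) for b in branches]
```

The impulse influences 3 output positions in the smallest branch and progressively more in the larger ones (clipped at the signal border), which is the fine-versus-coarse distinction the section describes.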

3. Temporal and Spatial Attention Mechanisms

USAD incorporates two core attention modules:

  • Temporal Attention: Aggregates spatial information by global pooling, passes the pooled descriptors through fully connected layers, and generates per-timestep weights. This lets the model prioritize temporally salient events within the activity sequence.
  • Spatial Attention: Employs both average and max pooling along sensor/channel dimensions to produce an attention map highlighting informative sensor channels or regions. This enhances inter-sensor correlation modeling and allows the network to accentuate spatial features important for activity discrimination.

The integration of these modules ensures that both intra-sequence temporal dependencies and inter-sensor spatial interactions are adaptively weighted throughout the network, strengthening the model's robustness to irrelevant or ambiguous segments in the data.
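The two attention modules can be sketched on a small feature map indexed as `F[channel][timestep][sensor]`. This is an illustration under several assumptions: a single scalar affine map followed by a sigmoid stands in for the fully connected layers, the pooled averages and maxima are mixed with two scalar weights rather than a learned convolution, and the two attention outputs are applied multiplicatively. None of these details are confirmed by the text, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def temporal_attention(F, w=1.0, b=0.0):
    """Per-timestep weights from a global average over channels and sensor axes."""
    C, T, S = len(F), len(F[0]), len(F[0][0])
    desc = [sum(F[c][t][s] for c in range(C) for s in range(S)) / (C * S)
            for t in range(T)]
    return [sigmoid(w * d + b) for d in desc]

def spatial_attention(F, w_avg=1.0, w_max=1.0, b=0.0):
    """T x S attention map from average and max pooling over the channel axis."""
    C, T, S = len(F), len(F[0]), len(F[0][0])
    amap = []
    for t in range(T):
        row = []
        for s in range(S):
            vals = [F[c][t][s] for c in range(C)]
            row.append(sigmoid(w_avg * sum(vals) / C + w_max * max(vals) + b))
        amap.append(row)
    return amap

def apply_attention(F, t_weights, s_map):
    """Rescale every activation by its timestep weight and its spatial weight."""
    return [[[F[c][t][s] * t_weights[t] * s_map[t][s]
              for s in range(len(F[0][0]))]
             for t in range(len(F[0]))]
            for c in range(len(F))]

# 2 channels, 3 timesteps, 2 sensor axes; timestep 1 carries the strongest response.
F = [
    [[0.1, 0.2], [2.0, 1.5], [0.0, 0.1]],
    [[0.0, 0.1], [1.0, 3.0], [0.2, 0.0]],
]
t_w = temporal_attention(F)
s_map = spatial_attention(F)
out = apply_attention(F, t_w, s_map)
```

In this toy input the middle timestep receives the largest temporal weight, and within it the second sensor axis receives the larger spatial weight, so the strongest activations are preserved while weaker ones are damped.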

4. Cross-Branch Feature Fusion and Adaptive Loss Design

  • Cross-Branch Feature Fusion Unit: After attention processing, the network concatenates the most representative features from each branch (as determined by a soft attention mechanism) along the channel dimension, further augmented with residual connections. This fusion strategy maximizes representative power while maintaining computational efficiency.
  • Adaptive Multi-Loss Function Fusion:

    • Combines cross-entropy loss (for multi-class prediction), focal loss

    $$\text{Loss}_{fl} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

    (to emphasize hard-to-classify, minority-class samples; here $\gamma$ is the focusing parameter, not the skewness from Section 1), and a label smoothing loss (to avoid overconfidence in predicted probabilities).
    • The composite loss is:

    $$\text{Loss}_{total} = \omega_0 \text{Loss}_{sl-nll} + \omega_1 \text{Loss}_{fl} + \omega_2 \text{Loss}_{ce}$$

    with weights $(\omega_0, \omega_1, \omega_2)$ adjusted adaptively during training according to live model performance indicators (e.g., accuracy), which balances gradient contributions and improves convergence.

This design not only promotes convergence stability but also further boosts the discriminative capacity of the learned features.
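The three loss terms and their weighted sum can be written out for a single sample as below. This is a sketch, not the authors' implementation: the defaults `alpha=0.25`, `gamma=2.0`, and `eps=0.1` are commonly used values chosen here for illustration, the label smoothing term uses the usual "(1 - eps) on the true class plus eps spread uniformly" target, and the rule that adapts the weights during training is not specified in the text, so the weights are simply passed in.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability of the true class."""
    return -math.log(probs[target])

def focal_loss(probs, target, alpha=0.25, gamma=2.0):
    """-alpha * (1 - p_t)^gamma * log(p_t); down-weights well-classified samples."""
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def label_smoothing_nll(probs, target, eps=0.1):
    """NLL against a smoothed target distribution."""
    n = len(probs)
    return -sum(((1.0 - eps) * (i == target) + eps / n) * math.log(p)
                for i, p in enumerate(probs))

def total_loss(probs, target, weights):
    """Weighted sum in the order (label smoothing NLL, focal, cross-entropy)."""
    w0, w1, w2 = weights
    return (w0 * label_smoothing_nll(probs, target)
            + w1 * focal_loss(probs, target)
            + w2 * cross_entropy(probs, target))

easy = [0.9, 0.05, 0.05]   # confident and correct for class 0
hard = [0.2, 0.5, 0.3]     # true class 0 receives low probability
ratio_easy = focal_loss(easy, 0) / cross_entropy(easy, 0)
ratio_hard = focal_loss(hard, 0) / cross_entropy(hard, 0)
combined = total_loss(hard, 0, (1.0, 1.0, 1.0))
```

Relative to plain cross-entropy, the focal term shrinks the easy sample's loss far more than the hard sample's, which is why it shifts gradient toward hard, minority-class examples.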

5. Empirical Evaluation on Public Benchmarks

USAD was evaluated on three public HAR benchmark datasets:

| Dataset     | Accuracy (%) |
|-------------|--------------|
| WISDM       | 98.84        |
| PAMAP2      | 93.81        |
| OPPORTUNITY | 80.92        |

Additional metrics—Precision, Recall, F1-score, G-mean, and AUC—consistently reflect substantial improvements relative to baselines such as traditional CNNs, LSTMs, ResNet50, and SE-Res2Net. Notably, USAD achieved marked G-mean and AUC gains, evidencing robustness against class imbalance and strong recognition for minority classes.
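A small example shows why G-mean is informative under class imbalance. The sketch assumes the common multi-class definition, the geometric mean of per-class recall; the text does not state which variant was used, and the helper names are hypothetical. The labels below are made-up toy data, not results from the paper.

```python
import math

def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    recalls = {}
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls[c] = sum(1 for i in idx if y_pred[i] == c) / len(idx)
    return recalls

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall; zero if any class is never recognized."""
    r = list(per_class_recall(y_true, y_pred).values())
    return math.prod(r) ** (1.0 / len(r))

def accuracy(y_true, y_pred):
    return sum(1 for a, b in zip(y_true, y_pred) if a == b) / len(y_true)

# Toy data: 8 majority-class samples and 2 minority-class samples.
y_true = [0] * 8 + [1] * 2
pred_majority = [0] * 10                       # ignores the minority class
pred_balanced = [0] * 6 + [1] * 2 + [1] * 2    # finds both minority samples

acc_a, g_a = accuracy(y_true, pred_majority), g_mean(y_true, pred_majority)
acc_b, g_b = accuracy(y_true, pred_balanced), g_mean(y_true, pred_balanced)
```

Both predictors reach the same accuracy, but only the second has a non-zero G-mean, so gains in G-mean indicate better recognition of minority classes rather than better fitting of the majority.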

Ablation experiments confirm that each architectural element—diffusion-based augmentation, multi-branch features, and attention modules—contributes materially to performance enhancement.

6. Deployment on Embedded Devices

The computational efficiency and compactness of USAD’s multi-branch design, along with its adaptive loss optimization, facilitate real-world application on embedded devices (e.g., Raspberry Pi 5, quad-core Cortex-A76, 8 GB memory). The system operates under strict inference latency requirements needed for wearable devices, outpacing many state-of-the-art methods in parameter efficiency and memory consumption. Practical tests on WISDM and OPPORTUNITY datasets demonstrate feasibility for live activity prediction in resource-constrained environments.

7. Broader Context and Significance

USAD exemplifies modern trends in HAR and spatio-temporal learning by systematically combining unsupervised generative modeling (via diffusion processes), advanced feature extraction (multi-scale convolutions), sophisticated attention mechanisms, and adaptive training losses. Its architecture reflects a convergence of advances from multiple subfields, including probabilistic generative modeling, deep convolutional sequence modeling, and attention-based learning.

While directly targeted at HAR, the architectural innovations of USAD—including statistics-guided diffusion augmentation, multi-scale attention interaction, and dynamic loss adaptation—are plausible candidates for adaptation to other domains characterized by sequential or spatial-temporal data, especially where data sparsity, imbalance, and real-time inference constraints are prevalent (2507.02827).

In summary, USAD establishes a highly effective paradigm for combining unsupervised augmentation, attention-driven sequence modeling, and efficient optimization to advance performance and deployability in sensor-based activity recognition.

References (1)