1D UNet: Efficient Sequential Segmentation
- 1D UNet is a neural network architecture that replaces 2D convolutions with 1D operations, maintaining U-Net’s multi-scale structure via skip connections.
- It employs an encoder–decoder design with PixelUnshuffle/PixelShuffle operations and residual blocks to significantly reduce parameters and boost computational efficiency.
- Applications include image segmentation and time-series event detection, achieving up to a 71% reduction in model size and substantial FLOP savings compared to traditional U-Net models.
A 1D UNet is an encoder–decoder neural network architecture adapted from the classic U-Net, in which one-dimensional (1D) convolutional layers replace the typical two-dimensional (2D) convolutional layers. Designed for data modalities with separable spatial or sequential structure—such as time series, audio, or even images via specific reshaping—the 1D UNet preserves the characteristic multi-scale topology and skip connections of its predecessor while achieving substantial improvements in computational efficiency and model compactness. Its recent manifestations include channel-wise 1D convolutional variants for image segmentation (Byun et al., 2024) and residual 1D UNet models for event segmentation in time-series data, such as electroencephalography (EEG) (Sengupta et al., 1 Jan 2026).
1. Architectural Principles of the 1D UNet
Fundamental to all UNet variants is the encoder–decoder paradigm, comprising sequential downsampling blocks followed by corresponding upsampling blocks connected via skip connections. In the 1D UNet, each block replaces 2D convolutions with 1D convolutions configured for the data’s shape.
OneNet (Byun et al., 2024) implements 1D convolutions channel-wise, specifically for image segmentation. The downsampling occurs through PixelUnshuffle operations, which transfer spatial windows into the channel dimension. Each encoder block consists of two channel-wise 1D convolutional layers that operate on a flattened spatial dimension, optionally interleaved with spatial mixing 1D convolutions. The decoder mirrors the encoder, employing PixelShuffle for upsampling and channel-wise 1D convolutions for feature generation. This strategy enables significant parameter reduction and efficient inference.
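A minimal PyTorch sketch of such an encoder block, assuming scale factor $r = 2$ and illustrative channel and kernel sizes (the exact configuration, and the optional spatial-mixing convolutions, may differ from OneNet's reference implementation):

```python
import torch
import torch.nn as nn

class ChannelWise1DEncoderBlock(nn.Module):
    """OneNet-style encoder block sketch: PixelUnshuffle folds each r x r
    spatial window into the channel dimension, then 1D convolutions run
    over the flattened spatial axis."""
    def __init__(self, in_ch: int, r: int = 2, k: int = 3):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(r)  # (B, C, H, W) -> (B, C*r^2, H/r, W/r)
        c = in_ch * r * r
        self.conv1 = nn.Conv1d(c, c, kernel_size=k, padding=k // 2)
        self.conv2 = nn.Conv1d(c, c, kernel_size=k, padding=k // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.unshuffle(x)
        b, c, h, w = x.shape
        x = x.flatten(2)                       # (B, C', H*W): spatial dims become a "time" axis
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        return x.view(b, c, h, w)              # restore the 2D layout for the next block
```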
For time-series data, such as EEG, AugUNet1D (Sengupta et al., 1 Jan 2026) employs strided 1D convolutions, residual blocks, and max-pooling for encoder downsampling. Each residual block includes two Conv1d layers (kernel size 3, padding 1) with batch normalization and ReLU, with an identity shortcut to enhance gradient flow and enable deeper architectures. The decoder stage uses ConvTranspose1d for upsampling, concatenates skip connections, and applies further residual blocks.
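Given these explicit layer specifications, a self-contained sketch of the residual block and one decoder stage follows (the 1×1 fusion convolution and the upsampling kernel size are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """Residual block as specified: two Conv1d layers (kernel 3, padding 1)
    with batch normalization and ReLU, plus an identity shortcut."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch), nn.ReLU(inplace=True),
            nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + x)      # identity shortcut aids gradient flow

class DecoderStage1D(nn.Module):
    """Decoder stage: ConvTranspose1d upsampling, skip concatenation,
    then residual refinement."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Conv1d(out_ch + skip_ch, out_ch, 1)  # merge skip features
        self.res = ResBlock1D(out_ch)

    def forward(self, x, skip):
        x = self.up(x)                          # double the temporal resolution
        x = torch.cat([x, skip], dim=1)         # concatenate encoder skip connection
        return self.res(self.fuse(x))
```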
2. Mathematical Operations and Data Transformations
The central mathematical operation in the 1D UNet is the 1D convolution, defined for an input $x$ and kernel $w$ (size $K$, padding $p$) as:

$$(x * w)[n] = \sum_{k=0}^{K-1} w[k]\, x[n + k - p]$$
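A quick numerical check (with arbitrary values) confirms that this definition matches PyTorch's `conv1d`, which uses the same cross-correlation convention:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1., 2., 3., 4., 5.])
w = torch.tensor([0.5, 1.0, -0.5])
K, p = w.numel(), 1

# Evaluate the sum directly, treating out-of-range indices as zero padding.
manual = torch.stack([
    sum((w[k] * x[n + k - p] if 0 <= n + k - p < x.numel() else torch.tensor(0.))
        for k in range(K))
    for n in range(x.numel())
])
lib = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1), padding=p).flatten()
assert torch.allclose(manual, lib)
```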
In OneNet, after the PixelUnshuffle transformation, the spatial dimensions $(H, W)$ of an input $X \in \mathbb{R}^{B \times C \times H \times W}$ are aggregated into the channel dimension, producing $X' \in \mathbb{R}^{B \times C r^2 \times (H/r) \times (W/r)}$ for a batch size $B$, where $r$ is the scale factor. Subsequent channel-wise 1D convolution is realized by reshaping to $(B,\ C r^2,\ L)$ with $L = (H/r)(W/r)$, and executing $Y = W_{\mathrm{1D}} * X'$, where $W_{\mathrm{1D}}$ is the convolutional weight matrix.
PixelUnshuffle and PixelShuffle, respectively, downsample and upsample by reordering elements between spatial and channel dimensions:

$$\mathrm{PixelUnshuffle}_r:\ (B, C, H, W) \mapsto (B, C r^2, H/r, W/r), \qquad \mathrm{PixelShuffle}_r:\ (B, C r^2, H/r, W/r) \mapsto (B, C, H, W)$$
In time-series applications, max-pooling and transposed convolution further control down- and up-sampling: a max-pooling layer with stride $s$ maps an input of length $L$ to length $\lfloor L/s \rfloor$, while a transposed convolution with kernel size $K$, stride $s$, and padding $p$ maps it to length $s(L - 1) + K - 2p$.
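These shape transformations can be verified directly in PyTorch (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)                    # (B, C, H, W)
down = nn.PixelUnshuffle(2)(x)                   # -> (1, 32, 16, 16): channels x r^2, spatial / r
up = nn.PixelShuffle(2)(down)                    # -> (1, 8, 32, 32): exact inverse
assert up.shape == x.shape

t = torch.randn(1, 16, 2000)                     # (B, C, T) time series, e.g. one EEG window
pooled = nn.MaxPool1d(2)(t)                      # -> (1, 16, 1000): halved temporal resolution
restored = nn.ConvTranspose1d(16, 16, 2, stride=2)(pooled)  # -> (1, 16, 2000)
```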
3. Computational Efficiency and Model Compactness
Substituting 2D convolutions with 1D convolutions yields substantial reductions in both parameter count and total FLOPs. In OneNet, an encoder block at a given scale requires only about 22% as many parameters as a typical 2D convolutional block.
Summed across four layers, OneNet's encoder-only network (OneNetₑ,₄) uses $16.39$M parameters (a $47\%$ reduction vs. $31.04$M for U-Net₄), while the fully 1D encoder–decoder variant (OneNet₍ₑd₎,₄) reduces this further to $9.08$M (a $71\%$ reduction).
In terms of FLOPs at the benchmark image resolution (reductions relative to U-Net₄; a quick arithmetic check follows the list):
- U-Net₄: $104.7$ GFLOPs
- OneNetₑ,₄: $78.4$ GFLOPs ($25\%$ reduction)
- OneNet₍ₑd₎,₄: $22.9$ GFLOPs ($78\%$ reduction)
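The quoted reduction percentages follow arithmetically from the reported totals:

```python
# Raw counts reported for OneNet vs. U-Net (Byun et al., 2024).
params = {"U-Net4": 31.04, "OneNet_e4": 16.39, "OneNet_ed4": 9.08}  # millions
flops = {"U-Net4": 104.7, "OneNet_e4": 78.4, "OneNet_ed4": 22.9}    # GFLOPs

for name in ("OneNet_e4", "OneNet_ed4"):
    p = 1 - params[name] / params["U-Net4"]
    f = 1 - flops[name] / flops["U-Net4"]
    print(f"{name}: {p:.0%} fewer parameters, {f:.0%} fewer FLOPs")
# OneNet_e4: 47% fewer parameters, 25% fewer FLOPs
# OneNet_ed4: 71% fewer parameters, 78% fewer FLOPs
```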
AugUNet1D (Sengupta et al., 1 Jan 2026) does not report FLOPs or parameter counts directly, but comparable compression and acceleration are plausible implications of the same 1D design.
4. Application Domains and Implementation Paradigms
The channel-wise 1D UNet paradigm is adaptable to a variety of settings:
- Image Segmentation: OneNet applies channel-wise 1D convs to images, leveraging pixel-repositioning for multi-scale spatial context extraction, and preserves segmentation accuracy relative to standard U-Net architectures (Byun et al., 2024).
- Time Series Event Segmentation: AugUNet1D detects spike wave discharges in continuous mouse EEG recordings through windowed 1D convolutions and residual topology, achieving state-of-the-art event-level segmentation (Sengupta et al., 1 Jan 2026).
- Generalization to Other Structured Data: The architecture is applicable to any modality with separable spatial or sequential structure, including audio spectrograms and other dense prediction tasks.
For implementation, OneNet provides PyTorch-style pseudocode for encoder blocks, demonstrating the integration of PixelUnshuffle and Conv1d layers to flatten spatial dimensions into a “time” axis, followed by channel-wise and optional spatial mixing convolutions.
5. Performance Benchmarks and Comparative Evaluation
Empirical results extracted from benchmark studies:
Semantic Segmentation (Image Domain) (Byun et al., 2024)
| Method | VOC mIoU | PET_F mIoU | PET_S mIoU | Heart mIoU | Brain mIoU | Lung mIoU |
|---|---|---|---|---|---|---|
| U-Net₄ | 0.182 | 0.316 | 0.713 | 0.063 | 0.001 | 0.009 |
| ResNet₃₄-U-Net | 0.332 | 0.597 | 0.801 | 0.065 | 0.079 | 0.009 |
| MobileNet-U-Net | 0.166 | 0.252 | 0.664 | 0.047 | 0.011 | 0.008 |
| OneNetₑ,₄ | 0.160 | 0.216 | 0.636 | 0.066 | 0.105 | 0.009 |
| OneNet₍ₑd₎,₄ | 0.149 | 0.172 | 0.535 | 0.062 | 0.099 | 0.008 |
Encoder-only OneNet achieves a 47% parameter reduction with a mIoU drop of ≤ 1% on the medical imaging tasks and 10–15% on the general-purpose benchmarks. The full encoder–decoder variant reduces model size by 71%, at the cost of a further 5–10% mIoU drop on more complex scenes.
SWD Event Detection in EEG (Time Series) (Sengupta et al., 1 Jan 2026)
- Point-wise F1-score:
- Vanilla 1D UNet: 0.59 ± 0.15
- 1D Residual UNet: 0.4268 (no augmentation)
- AugUNet1D: 0.90 ± 0.01
- Twin Peaks: 0.69 ± 0.00
- Event-level F1-score:
- AugUNet1D: 0.95 ± 0.04
- Twin Peaks: 0.71 ± 0.15
Ablation studies confirm amplitude scaling as the most effective augmentation in enhancing cross-subject generalization, with scaling-only achieving F1 = $0.8609$. Combining scaling, noise, and inversion achieves the best aggregate metric (F1 = $0.8848$).
6. Data Augmentation, Training Methodologies, and Robustness
AugUNet1D (Sengupta et al., 1 Jan 2026) uses targeted data augmentation (a code sketch of these transforms follows the list):
- Amplitude scaling: $x' = \alpha x$, with $\alpha$ drawn from a fixed range
- Gaussian noise: $x' = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$
- Signal inversion: $x' = -x$
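A sketch of these transforms applied to one $(C, T)$ window; the scaling range, noise variance, and inversion probability are placeholders, since the exact values are not reproduced in this summary:

```python
import torch

def augment(x: torch.Tensor,
            scale_range=(0.8, 1.2),   # placeholder bounds for the scaling factor
            noise_std=0.1,            # placeholder sigma
            p_invert=0.5) -> torch.Tensor:
    """Amplitude scaling, additive Gaussian noise, and random signal
    inversion for a (C, T) EEG window."""
    alpha = torch.empty(1).uniform_(*scale_range)
    x = alpha * x                              # amplitude scaling: x' = alpha * x
    x = x + noise_std * torch.randn_like(x)    # Gaussian noise: x' = x + eps
    if torch.rand(1).item() < p_invert:
        x = -x                                 # signal inversion: x' = -x
    return x
```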
The model is trained per-window ($T = 2000$ timepoints) with the Dice loss, in its standard soft formulation

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon},$$

where $p_i$ are predicted probabilities and $g_i$ the binary ground-truth labels, using Adam optimization, early stopping, and a cosine-annealing learning rate schedule. Validation is performed on held-out mouse recordings, with results averaged over three runs.
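A minimal implementation of this soft Dice loss, assuming a sigmoid-activated prediction mask:

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss between a predicted probability mask and a binary target."""
    pred, target = pred.flatten(), target.flatten()
    intersection = (pred * target).sum()
    return 1 - (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Example on one window of T = 2000 timepoints:
pred = torch.sigmoid(torch.randn(2000))
target = (torch.rand(2000) > 0.9).float()
loss = dice_loss(pred, target)
```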
Robustness to reduced training set size is empirically validated: AugUNet1D achieves F1 = $0.8192$ with 5% of labeled data and $0.8618$ with 90%, suggesting effective utilization of augmentation and strong generalization.
7. Significance, Applicability, and Future Implications
The transition from 2D to 1D convolutions in UNet architectures enables practical edge deployment of segmentation models, thanks to drastic reductions in parameters and computational demands—up to 71% model size reduction and 78% FLOPs decrease as shown in OneNet (Byun et al., 2024). In time-series event segmentation, the residual 1D UNet (AugUNet1D) attains state-of-the-art performance in SWD detection with precise temporal localization, outperforming both classical ML and time–frequency algorithms (Sengupta et al., 1 Jan 2026).
A plausible implication is that further exploration of 1D UNet variants, combining pixel-reorganization techniques with robust data augmentation schemes, may extend their utility to other structured modalities and downstream tasks, provided the architecture exploits separable spatial or temporal dependencies.