ADformer: Differential & Multi-Granularity Transformer
- ADformer names two distinct Transformer architectures, one built on differential attention and one on multi-granularity attention, each modeling complex spatio-temporal data in its own domain.
- In passenger demand forecasting, it unifies fine-grained local patterns with high-level spatial and temporal aggregations, achieving lower MAE, RMSE, and MAPE than GCN- and attention-based baselines.
- For EEG-based Alzheimer’s assessment, it captures both local neural oscillations and global channel interactions, improving binary classification F1 scores.
ADformer refers to two distinct Transformer-based neural architectures, each designed for complex, high-dimensional spatio-temporal data modeling but applied to different domains: (1) Aggregation Differential Transformer for passenger demand forecasting (Wang et al., 3 Jun 2025), and (2) Multi-Granularity Transformer for EEG-based Alzheimer’s disease assessment (Wang et al., 2024). Both architectures aim to overcome limitations of earlier attention models by capturing correlations at multiple scales and by carefully structuring model attention mechanisms to learn both local and global dependencies.
1. Architectures and Core Design Principles
ADFormer for Passenger Demand Forecasting
ADFormer for demand forecasting is a stack of identical "Spatial–Temporal Transformer Encoder" layers, topped with a linear prediction head. Each layer integrates four attention modules: two spatial, Spatial Differential Attention (SDA) and Spatial Cluster Attention (SCA, applied at each cluster aggregation level), and two temporal, Temporal Self-Attention (TSA) and Temporal Aggregation Attention (TAA). This design explicitly unifies original fine-grained correlations (e.g., per-region demand patterns) and high-level aggregated relations (e.g., functional urban zones).
Embedding Pipeline:
- Input: a demand tensor with axes (time, region, input dims)
- Additional features: time-of-day, day-of-week
- Embedding: the raw inputs and the additional features are embedded and projected into a latent space for each region and timestep.
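A minimal sketch of an embedding step of this kind, assuming a linear projection of the raw inputs plus learned lookup tables for time-of-day and day-of-week that are summed together. The summation, the tensor sizes, the 48 slots per day (which assumes 30-minute intervals), and every name below are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: timesteps, regions, input dims, latent width.
T, N, C, D = 12, 5, 2, 16
SLOTS_PER_DAY = 48  # assumption: 30-minute intervals

# Hypothetical parameters: one input projection and two lookup tables.
W_in = rng.normal(scale=0.1, size=(C, D))
tod_table = rng.normal(scale=0.1, size=(SLOTS_PER_DAY, D))
dow_table = rng.normal(scale=0.1, size=(7, D))


def embed(x, tod_idx, dow_idx):
    """x: (T, N, C); tod_idx, dow_idx: (T,) integer indices. Returns (T, N, D)."""
    value = x @ W_in                                     # (T, N, D)
    calendar = tod_table[tod_idx] + dow_table[dow_idx]   # (T, D)
    return value + calendar[:, None, :]                  # broadcast over regions


x = rng.normal(size=(T, N, C))
tod_idx = np.arange(T) % SLOTS_PER_DAY
dow_idx = np.zeros(T, dtype=int)
h = embed(x, tod_idx, dow_idx)
```

In this sketch the calendar embedding is shared by all regions at a given timestep, so two regions differ only through their projected demand values.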
Encoder Layer Fusion:
Each encoder layer concatenates the outputs of its four attention modules (SDA, SCA at every cluster level, TSA, and TAA),
followed by a two-layer MLP with skip connections and layer normalization.
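The fusion step (module outputs concatenated, then a two-layer MLP with skip connections and layer normalization) might look roughly like the sketch below. The output projection `W_o`, the ReLU activation, the post-norm ordering, and the sizes are assumptions made for illustration; random arrays stand in for the four attention modules.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D = 6, 4, 8  # assumed sizes


def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learned scale or shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def fuse_and_mlp(x, module_outputs, W_o, W1, W2):
    """Concatenate module outputs, project back to D, then apply a residual MLP."""
    fused = np.concatenate(module_outputs, axis=-1) @ W_o   # (T, N, D)
    h = layer_norm(x + fused)                               # skip connection 1
    ff = np.maximum(h @ W1, 0.0) @ W2                       # two-layer MLP (ReLU)
    return layer_norm(h + ff)                               # skip connection 2


x = rng.normal(size=(T, N, D))
# Stand-ins for the SDA, SCA, TSA, and TAA outputs.
outs = [rng.normal(size=(T, N, D)) for _ in range(4)]
W_o = rng.normal(scale=0.1, size=(4 * D, D))
W1 = rng.normal(scale=0.1, size=(D, 4 * D))
W2 = rng.normal(scale=0.1, size=(4 * D, D))
y = fuse_and_mlp(x, outs, W_o, W1, W2)
```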
ADformer for EEG-Based Alzheimer’s Disease Assessment
ADformer for EEG focuses on multi-granularity representation learning from multichannel time series. Its core is two parallel branches: one capturing temporal multi-scale structure ("patch branch"), the other modeling different spatial granularities ("channel branch"). Local (intra-granularity) and cross-scale (inter-granularity) dependencies are handled via a two-stage self-attention scheme.
Multi-Granularity Embedding:
- Patch branch: Segments EEG into patches of varying lengths, projects and embeds each, and adds per-granularity and positional encodings.
- Channel branch: Applies one-dimensional convolutions to subsets of EEG channels (e.g., grouping electrodes), followed by projection and granularity-encoding.
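As an illustration of the patch branch, the sketch below cuts one EEG window into non-overlapping patches at several lengths and embeds each granularity separately. The patch lengths, the 19-channel window, the sinusoidal positional encoding, and the helper names are all assumptions; the paper's actual choices may differ.

```python
import numpy as np

rng = np.random.default_rng(0)


def patchify(x, patch_len):
    """x: (T, C) -> (num_patches, patch_len * C), dropping any incomplete tail."""
    T, C = x.shape
    n = T // patch_len
    return x[: n * patch_len].reshape(n, patch_len * C)


def sinusoidal_positions(n, d):
    """Standard sine/cosine positional encoding of shape (n, d)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))


def embed_granularity(x, patch_len, W, gran_vec):
    """Project patches, then add positional and per-granularity encodings."""
    tokens = patchify(x, patch_len) @ W          # (n, D)
    n, d = tokens.shape
    return tokens + sinusoidal_positions(n, d) + gran_vec


T, C, D = 256, 19, 32          # assumed window length, channels, latent width
patch_lens = (4, 8, 16)        # assumed fine-to-coarse patch lengths
x = rng.normal(size=(T, C))

tokens = []
for p in patch_lens:
    W = rng.normal(scale=0.1, size=(p * C, D))    # hypothetical projection
    gran_vec = rng.normal(scale=0.1, size=D)      # hypothetical granularity encoding
    tokens.append(embed_granularity(x, p, W, gran_vec))
```

Shorter patches yield longer token sequences, so each granularity ends up with its own sequence length.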
Two-Stage Self-Attention:
- Intra-granularity: Handles local temporal or channel relationships within each granularity via self-attention over patch or channel embeddings.
- Inter-granularity: Aggregates global features across all granularities by exchanging information between learned router tokens from each scale.
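The two stages can be sketched as follows, assuming each granularity carries one router token in row 0 of its sequence. Single-head attention with identity projections is used purely to keep the example short; the real model would use learned projections, and the write-back of updated routers is an assumption about how the exchange is wired.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head attention with identity projections (a simplification)."""
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x


def two_stage_attention(seqs):
    """seqs: list of (1 + n_g, D) arrays; row 0 is that granularity's router."""
    # Stage 1: intra-granularity attention over router + tokens.
    seqs = [self_attention(s) for s in seqs]
    # Stage 2: inter-granularity attention among the routers only.
    routers = self_attention(np.stack([s[0] for s in seqs]))   # (G, D)
    # Write the updated routers back in front of their token sequences.
    return [np.vstack([r[None, :], s[1:]]) for r, s in zip(routers, seqs)]


D = 16
seqs = [rng.normal(size=(1 + n, D)) for n in (64, 32, 16)]
out = two_stage_attention(seqs)

# Number of query-key pairs scored with routing versus one joint attention.
lengths = [s.shape[0] for s in seqs]
routed_pairs = sum(n * n for n in lengths) + len(lengths) ** 2
full_pairs = sum(lengths) ** 2
```

Comparing `routed_pairs` with `full_pairs` shows why exchanging information only through routers is cheaper than attending over all tokens of all granularities at once.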
2. Differential and Multi-Granularity Attention Mechanisms
Differential Attention in Demand Forecasting
Spatial Differential Attention (SDA) replaces standard attention with a subtraction-based, denoising operator. For each encoder layer:
- Two spatial attention maps ($A_1$, $A_2$) are computed with separate query/key projections, targeting primary and spurious affinities, respectively.
- The final attention map is $A = A_1 - \lambda A_2$, with $\lambda$ a learned, reparameterized scalar.
- SDA output: $A V$, where $V$ is the value projection of the input.
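A minimal sketch of subtraction-based attention over regions at a single timestep, assuming softmax-normalized maps and a fixed scalar `lam` in place of the learned weight. The weight shapes and names are hypothetical, and the learned parameterization of the scalar is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """x: (N, D) region embeddings. Returns (output, attention map)."""
    d = Wq1.shape[1]
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))   # primary affinities
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))   # affinities to cancel
    attn = a1 - lam * a2                                 # differential map
    return attn @ (x @ Wv), attn


N, D, d_k = 6, 8, 4   # assumed sizes
x = rng.normal(size=(N, D))
Wq1, Wk1, Wq2, Wk2 = (rng.normal(scale=0.3, size=(D, d_k)) for _ in range(4))
Wv = rng.normal(scale=0.3, size=(D, D))
lam = 0.5             # fixed here; learned in the model

out, attn = differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam)
```

Because both maps are row-normalized, each row of the differential map sums to `1 - lam`, and entries can be negative; setting `lam = 0` recovers ordinary softmax attention.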
Spatial Cluster Attention (SCA) operates on input aggregated across region clusters, learning at multiple hierarchical levels to capture broader functional structure. Temporal Aggregation Attention (TAA) uses a set of latent slots to enable high-level temporal abstraction, with learned restoration masks injecting temporal and calendar signals into the cross-slot affinity computation.
Multi-Granularity Attention in EEG Assessment
Both branches apply self-attention locally (within each patch or aggregated channel group) and globally (across granularity levels via router tokens). Because cross-granularity exchange passes only through the router tokens, the routing mechanism stays computationally efficient and avoids attention cost that scales quadratically with the total input length.
3. Aggregation and Integration of High-Level Correlations
Spatial and Temporal Aggregation in Demand Forecasting
Multi-level clustering of input regions is achieved via DTW-based similarity and hierarchical clustering, yielding a cluster assignment matrix at each level. Aggregated demand is computed by applying the assignment matrix to the regional demand before embedding.
Cluster-level attention employs a learnable mask to project from clusters back to individual regions, while temporal aggregation uses a restoration mask constructed from time features.
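The clustering and aggregation step can be illustrated with a small sketch: a classic dynamic-time-warping distance between regional demand series, and a one-hot assignment matrix used to pool demand per cluster. The cluster labels are hard-coded as a hypothetical result of hierarchical clustering, and summation as the pooling rule is an assumption (a mean would be equally plausible).

```python
import numpy as np


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping, absolute-difference cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def assignment_matrix(labels, num_clusters):
    """One-hot (N, K) matrix: entry [i, k] is 1 when region i is in cluster k."""
    m = np.zeros((len(labels), num_clusters))
    m[np.arange(len(labels)), labels] = 1.0
    return m


def aggregate(x, assign):
    """x: (T, N) regional demand -> (T, K) demand summed per cluster."""
    return x @ assign


# Toy demand: 3 timesteps, 4 regions.
demand = np.array([[1.0, 2.0, 10.0, 0.0],
                   [2.0, 3.0, 12.0, 1.0],
                   [3.0, 4.0, 11.0, 0.0]])
N = demand.shape[1]

# Pairwise DTW distances that a hierarchical clustering routine would consume.
dist = np.array([[dtw_distance(demand[:, i], demand[:, j]) for j in range(N)]
                 for i in range(N)])

labels = np.array([0, 0, 1, 0])          # hypothetical clustering result
agg = aggregate(demand, assignment_matrix(labels, 2))
```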
The joint architecture concatenates outputs from SDA, SCA (all levels), TSA, and TAA at each layer, enabling simultaneous learning of local and holistic dependencies. Empirical analysis shows that integrating high-level spatial (e.g., functional zones) and temporal (e.g., periodic demand cycles) contexts substantially improves long-horizon forecasting accuracy (Wang et al., 3 Jun 2025).
Granular Fusion in EEG-Based AD Assessment
By design, ADformer fuses temporal and spatial structures in EEG through:
- Explicit modeling of both rapid and slow neural oscillations using patches of multiple lengths (fine-to-coarse time-scales)
- Channel grouping at different resolutions (e.g., frontal/parietal/occipital sets)
- Layerwise attention exchange via router tokens implementing inter-granularity information flow
This results in improved discrimination of subtle, scale-dependent EEG biomarkers.
4. Experimental Setups and Results
Passenger Demand Forecasting
Key datasets: NYC-Taxi and NYC-Bike (N=263, 30 min intervals, 2016/2023), Xi’an-Taxi (11000 hexagonal regions). The model was evaluated on short (30 min), medium (90 min), and long (3 h) horizon prediction windows.
Results (NYC-Taxi, 30 min):
- PDFormer: MAE=5.625, RMSE=11.597, MAPE=18.92%
- ADFormer: MAE=5.461, RMSE=11.342, MAPE=18.24%
For all datasets and horizons, ADFormer achieves the lowest MAE, RMSE, and MAPE among both GCN- and attention-based baselines. Ablation experiments confirm the necessity of both differential attention and multi-level aggregation; spatial aggregation is particularly impactful for taxi data (Wang et al., 3 Jun 2025).
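For reference, the three error metrics can be computed as in the sketch below, which also derives the relative gain implied by the NYC-Taxi figures quoted above. Masking low-demand cells in MAPE (the `min_true` threshold) is a common convention assumed here, not something the source states.

```python
import numpy as np


def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))


def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))


def mape(y, yhat, min_true=1.0):
    """MAPE in percent, ignoring cells whose true value is below min_true."""
    mask = y >= min_true
    return float(np.mean(np.abs((y[mask] - yhat[mask]) / y[mask])) * 100)


y = np.array([10.0, 20.0, 0.0, 40.0])
yhat = np.array([12.0, 18.0, 1.0, 44.0])
scores = (mae(y, yhat), rmse(y, yhat), mape(y, yhat))

# Relative improvement (percent) of ADFormer over PDFormer on NYC-Taxi, 30 min,
# using the (MAE, RMSE, MAPE) values reported in the text.
reported = {"PDFormer": (5.625, 11.597, 18.92), "ADFormer": (5.461, 11.342, 18.24)}
rel_gain = [(p - a) / p * 100
            for p, a in zip(reported["PDFormer"], reported["ADFormer"])]
```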
Alzheimer’s Disease Assessment
Five EEG datasets totaling 525 subjects and 207,851 samples were used, with both subject-dependent and subject-independent splits:
- ADSZ: 48 subjects, F1=92.96% (ADformer)
- APAVA: 23 subjects, F1=77.97%
- ADFD: 65 subjects, F1=75.19%
- CNBPM: 126 subjects, F1=93.58%
Compared to EEGNet, TCN, and Crossformer, ADformer provides a 3–5 point improvement in subject-independent F1 score for binary AD-vs-control discrimination. Leave-subjects-out experiments also demonstrate improved stability and accuracy (Wang et al., 2024).
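A subject-independent split keeps every sample from a given subject on one side of the train/test boundary. The sketch below shows one way to build such a split and to compute a binary F1 score; the function names, the 20% test fraction, and the use of positive-class F1 (rather than a macro average) are assumptions for illustration.

```python
import random


def subject_independent_split(subject_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no subject appears in both sets."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class; returns 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two samples per subject, five subjects.
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]
train_idx, test_idx = subject_independent_split(subject_ids)

train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}

f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

A subject-dependent split, by contrast, would shuffle individual samples and could place segments from the same recording in both sets.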
5. Analytical Results and Implementation Recommendations
Contribution Analysis
For demand forecasting:
- Both spatial and temporal components are essential; removing either severely degrades performance.
- Multi-level aggregation leads to especially significant gains for spatially heterogeneous datasets.
- The utility of SDA increases with the prominence of spatial dependencies.
For EEG:
- Multi-granularity representation allows modeling of both fine (high-frequency) and coarse (low-frequency) signal features without handcrafted filtering.
- Cross-granularity interactions provide complementary information not easily learned by CNNs or vanilla Transformers.
Implementation Tips
Passenger Demand Forecasting:
- Employ Flash-Attention for memory and speed efficiency.
- Hyperparameters to set: the number of encoder layers, spatial levels, and temporal slots, together with the AdamW learning rate and weight decay.
EEG Assessment:
- Adam optimizer with no learning-rate warmup.
- Data augmentations: temporal flip, channel shuffle, frequency masking, jitter, dropout.
- Standard 6-layer encoder, FFN hidden=256.
- SWA for robust subject-independent training.
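Four of the EEG augmentations listed above can be sketched on a single window of shape (time, channels). The noise scale, the masking ratio, and the choice to mask individual rFFT bins per channel are illustrative assumptions; the source does not give these details.

```python
import numpy as np


def temporal_flip(x):
    """Reverse the time axis of a (T, C) window."""
    return x[::-1].copy()


def channel_shuffle(x, rng):
    """Randomly permute the channel order."""
    return x[:, rng.permutation(x.shape[1])]


def jitter(x, rng, sigma=0.01):
    """Add small Gaussian noise to every sample."""
    return x + rng.normal(scale=sigma, size=x.shape)


def frequency_mask(x, rng, ratio=0.1):
    """Zero a random subset of frequency bins per channel, then invert the FFT."""
    spec = np.fft.rfft(x, axis=0)
    drop = rng.random(spec.shape) < ratio
    spec[drop] = 0
    return np.fft.irfft(spec, n=x.shape[0], axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(128, 19))   # assumed window: 128 samples, 19 channels

flipped = temporal_flip(x)
shuffled = channel_shuffle(x, rng)
noisy = jitter(x, rng)
masked = frequency_mask(x, rng)
```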
6. Limitations and Future Directions
For both ADformer instantiations, domain-specific limitations persist. In demand forecasting, scenarios dominated by temporal regularities reduce the marginal benefit of spatial modules. In EEG assessment, three-class (AD vs. MCI vs. FTD) performance and ERP-based fusion remain open challenges, while further advances may be realized through pre-training and domain-adaptive approaches (Wang et al., 2024).
A plausible implication is that the multi-granularity and differential attention paradigms can be generalized to other domains involving high-dimensional, multiscale spatio-temporal data, given appropriate adaptation of aggregation structure and attention routing.
Key References:
- "ADFormer: Aggregation Differential Transformer for Passenger Demand Forecasting" (Wang et al., 3 Jun 2025)
- "ADformer: A Multi-Granularity Transformer for EEG-Based Alzheimer's Disease Assessment" (Wang et al., 2024)