Distract Your Attention Network (DAN)
- DAN is a neural network architecture that uses distraction mechanisms—like context subtraction and attention priming—to mitigate redundant focus across input features.
- It applies explicit diversity-promoting techniques: context and attention subtraction in text summarization, and multi-head cross attention with partition loss in visual recognition.
- Empirical results demonstrate that distraction strategies in DAN significantly improve performance metrics, achieving notable gains over traditional attention models.
The Distract Your Attention Network (DAN) is a class of neural network architectures leveraging the principle of intentionally "distracting" model attention mechanisms to promote coverage diversity across model predictions. DAN frameworks have been developed for varied domains, most notably in abstractive document summarization (Chen et al., 2016) and facial expression recognition (Wen et al., 2021). In both cases, DAN augments classical attention mechanisms with regularizers or multi-head strategies that explicitly force the model to explore different content regions or feature subspaces, thereby counteracting the known tendency of standard attention to fixate redundantly on salient positions.
1. Foundational Concepts and Motivation
Standard neural attention mechanisms enable models to dynamically focus on parts of the input sequence or feature map, typically yielding effective local context utilization. However, for tasks involving long-form documents or spatially complex visual data, attention may collapse onto a limited subset of the input, leading to coverage redundancy and missed content. DAN, as introduced in "Distraction-Based Neural Networks for Document Summarization" (Chen et al., 2016), combats this by explicitly subtracting or penalizing repeated focus at both the attention and context-aggregation level. Subsequently, in the context of facial expression recognition, multi-head cross-attention architectures and partition losses are employed to diversify attention maps (Wen et al., 2021).
2. DAN for Document Summarization
The DAN summarization architecture is based on an encoder–decoder structure with bidirectional GRU input encoding and a two-level GRU decoder:
- Input Encoding: Documents are tokenized as sequences $x = (x_1, \dots, x_{T_x})$, mapped to $d$-dimensional embeddings, and encoded via bi-GRU layers, yielding hidden annotations $h_1, \dots, h_{T_x}$.
- Decoding: Generation step $t$ processes $y_{t-1}$ with GRU$_1$ for $s'_t = \mathrm{GRU}_1(s_{t-1}, y_{t-1})$, producing $s'_t$, which is fused with the "distracted" context $c_t$ in GRU$_2$ to yield $s_t = \mathrm{GRU}_2(s'_t, c_t)$. Output distribution: $p(y_t \mid y_{<t}, x) = \mathrm{softmax}(g(y_{t-1}, s_t, c_t))$.
- Standard Attention: At each step $t$, raw alignment scores $e_{t,i} = v_a^\top \tanh(W_a s'_t + U_a h_i)$ are computed, softmaxed to $\alpha_{t,i} = \exp(e_{t,i}) / \sum_k \exp(e_{t,k})$, and used to compute a provisional context $c'_t = \sum_i \alpha_{t,i} h_i$.
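The standard attention step above can be sketched in plain Python (a minimal sketch with toy dimensions; the additive scoring form is collapsed to a dot product for brevity, and all names here are illustrative, not from the paper):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, annotations):
    """One standard attention step: score each encoder annotation h_i
    against the decoder query, normalize, and mix a context vector."""
    # Additive scoring v^T tanh(W s' + U h_i) collapsed to a dot product.
    scores = [sum(q * h for q, h in zip(query, h_i)) for h_i in annotations]
    alphas = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h_i[d] for a, h_i in zip(alphas, annotations))
               for d in range(dim)]
    return alphas, context

# Toy example: 3 encoder positions with 2-dim hidden annotations.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, c = attend([1.0, 0.0], H)
assert abs(sum(alphas) - 1.0) < 1e-9  # weights form a distribution
```

Positions aligned with the query receive higher weight, so the context vector is dominated by their annotations; the distraction mechanisms below all operate on these `alphas` and contexts.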
DAN introduces three distraction mechanisms:
- M1 (Content Distraction): Instead of $c'_t$, the context vector is penalized using all past contexts,
  $$c_t = \tanh\Big(W_c c'_t - U_c \sum_{j=1}^{t-1} c_j\Big),$$
  reducing repeated inclusion of earlier context.
- M2 (Attention Distraction): The raw attention score is contextually primed:
  $$e_{t,i} = v_a^\top \tanh\Big(W_a s'_t + U_a h_i - b_a \sum_{j=1}^{t-1} \alpha_{j,i}\Big),$$
  directly penalizing attention locations with high cumulative prior focus.
- M3 (Decoding Distraction): During beam search, three diversity metrics on attention (KL divergence), context vectors (cosine), and decoder states (cosine) are integrated as additional scoring terms, with tuned weights $\gamma_1, \gamma_2, \gamma_3$.
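The history subtraction in M1 and M2 can be sketched as follows (a minimal sketch: the learned matrices $W_c$, $U_c$ and vector $b_a$ are replaced by scalar coefficients, and M2's subtraction is applied directly to raw scores rather than inside the tanh scoring function):

```python
import math

def distract_context(c_prov, past_contexts, w=1.0, u=0.5):
    """M1-style content distraction: subtract the running sum of past
    contexts from the provisional context c'_t before squashing."""
    if past_contexts:
        hist = [sum(col) for col in zip(*past_contexts)]
    else:
        hist = [0.0] * len(c_prov)
    return [math.tanh(w * cp - u * h) for cp, h in zip(c_prov, hist)]

def distract_scores(raw_scores, past_alphas, b=1.0):
    """M2-style attention distraction: penalize input positions that
    already received attention by subtracting cumulative attention mass."""
    n = len(raw_scores)
    cum = [sum(a[i] for a in past_alphas) for i in range(n)]
    return [e - b * c for e, c in zip(raw_scores, cum)]

# Position 0 was heavily attended at an earlier step; its score drops most.
penalized = distract_scores([2.0, 2.0], past_alphas=[[0.9, 0.1]])
assert penalized[0] < penalized[1]
```

After re-normalization, the penalized scores shift attention mass toward positions the decoder has not yet covered, which is the intended anti-redundancy effect.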
The loss function is the standard negative log-likelihood of gold summaries, with the distraction terms acting as internal regularizers rather than explicit losses. The model is optimized via mini-batch SGD with Adadelta (ρ=0.95, ε=10⁻⁶), and outperforms strong bi-GRU baselines on long document datasets, e.g., on CNN R-1=27.1 versus 21.3 for the baseline (Chen et al., 2016).
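M3's beam-search re-scoring can be sketched as a diversity bonus added to a hypothesis score (a sketch only: the reduction over history via min/max and the weight values are illustrative assumptions, not the paper's exact configuration):

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) between two attention distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_bonus(alpha_t, alpha_hist, c_t, c_hist, s_t, s_hist,
                    g1=0.1, g2=0.1, g3=0.1):
    """M3-style bonus: reward attention that diverges (KL) from history,
    penalize contexts and decoder states similar (cosine) to history.
    g1..g3 stand in for the tuned gamma weights."""
    bonus = g1 * min(kl_div(alpha_t, a) for a in alpha_hist)
    bonus -= g2 * max(cosine(c_t, c) for c in c_hist)
    bonus -= g3 * max(cosine(s_t, s) for s in s_hist)
    return bonus

# A beam repeating its own attention/context/state scores lower than
# a beam that moves on to new content.
same_beam = diversity_bonus([0.5, 0.5], [[0.5, 0.5]],
                            [1.0, 0.0], [[1.0, 0.0]],
                            [0.0, 1.0], [[0.0, 1.0]])
novel_beam = diversity_bonus([0.9, 0.1], [[0.5, 0.5]],
                             [1.0, 0.0], [[0.0, 1.0]],
                             [0.0, 1.0], [[1.0, 0.0]])
assert novel_beam > same_beam
```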
3. DAN for Facial Expression Recognition
For visual recognition, DAN adopts a hybrid ResNet-18 backbone, multi-head cross attention network, and class-separability-regularized training:
- Feature Extraction and FCN: Images are processed by a ResNet-18 backbone. Global Average Pooling yields feature vectors input to a Feature Clustering Network (FCN) with "affinity loss", which pulls features toward their class centers while maximizing inter-class separability:
  $$\mathcal{L}_{af} = \frac{1}{N \sigma_c^2} \sum_{i=1}^{N} \lVert x_i - c_{y_i} \rVert_2^2,$$
  with class centers $c_{y_i}$ and inter-center variance $\sigma_c^2$.
- Multi-Head Cross Attention (MAN): $K$ parallel attention heads are deployed. Each head applies spatial attention (via convolutional units) followed by channel attention (two-layer FC and sigmoid), producing an attended feature vector per head. No standard transformer-style $\mathrm{softmax}(QK^\top/\sqrt{d})V$ is employed; instead, attention is realized through convolutional gating and element-wise products.
- Attention Fusion Network (AFN) & Partition Loss: Each head's output is scaled by a log-softmax over heads. To enforce decorrelation, a "partition loss"
  $$\mathcal{L}_{pt} = \frac{1}{C}\sum_{j=1}^{C} \log\Big(1 + \frac{K}{\sigma_j^2}\Big)$$
  encourages variance $\sigma_j^2$ across the $K$ head outputs for each feature channel $j$, promoting attention diversity.
- Overall Loss and Training: The final loss integrates cross-entropy, affinity loss, and partition loss: $\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{af} + \mathcal{L}_{pt}$. Optimizers and learning rates are dataset-specific. Data augmentation includes flips, cutout, and color jitter.
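The two regularizers can be sketched on plain feature lists (a minimal sketch: the pooling of inter-center variance over dimensions and the epsilon stabilizer are illustrative choices, not the paper's exact formulation):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def affinity_loss(features, labels, centers):
    """Sketch of L_af: pull each feature toward its class center, scaled
    by the variance among centers so classes are also pushed apart."""
    sq = [sum((f - c) ** 2 for f, c in zip(feat, centers[y]))
          for feat, y in zip(features, labels)]
    # Inter-center variance, pooled over dimensions (illustrative choice).
    center_var = mean([variance(col) for col in zip(*centers)])
    return mean(sq) / center_var

def partition_loss(head_outputs, k=None):
    """Sketch of L_pt: small per-channel variance across the K head
    outputs yields a large log(1 + K / sigma^2) penalty."""
    k = k or len(head_outputs)
    per_channel_var = [variance(col) for col in zip(*head_outputs)]
    return mean([math.log(1.0 + k / (v + 1e-8)) for v in per_channel_var])

# Identical heads are penalized far more than diverse heads.
same = partition_loss([[1.0, 2.0], [1.0, 2.0]])
diverse = partition_loss([[1.0, 2.0], [-1.0, 0.0]])
assert same > diverse
```

The affinity term goes to zero when features sit exactly on well-separated centers, while the partition term shrinks as head outputs decorrelate, which is the behavior the combined loss rewards.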
On RAF-DB, DAN achieves 89.70% versus 86.25% for the base ResNet-18; on AffectNet-7, DAN attains 65.69% versus 56.97%. Increasing the number of heads $K$ up to 4 improves accuracy, after which returns saturate (Wen et al., 2021).
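The per-head channel gating and AFN-style fusion described above can be sketched on flat feature vectors (a sketch under strong simplifications: the convolutional spatial unit is omitted, fully-connected weights are toy values, and head scores are taken as mean activations; none of these names or values come from the paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_gate(feature, w1, w2):
    """Channel attention as a two-layer FC (ReLU then sigmoid) producing
    a gate that is applied element-wise to the feature vector."""
    hidden = [max(0.0, sum(w * f for w, f in zip(row, feature))) for row in w1]
    gate = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    return [g * f for g, f in zip(gate, feature)]

def fuse_heads(head_vectors):
    """AFN-style fusion: scale each head's output by the log-softmax of
    a per-head score (here, its mean activation), then sum over heads."""
    scores = [sum(h) / len(h) for h in head_vectors]
    m = max(scores)
    z = sum(math.exp(s - m) for s in scores)
    log_sm = [s - m - math.log(z) for s in scores]
    dim = len(head_vectors[0])
    return [sum(w * h[d] for w, h in zip(log_sm, head_vectors))
            for d in range(dim)]

# Toy 2-dim feature, identity-like FC weights for both gate layers.
gated = channel_gate([1.0, 2.0],
                     [[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 0.0], [0.0, 1.0]])
fused = fuse_heads([gated, [0.5, 0.5]])
```

Because the sigmoid gate lies in (0, 1), each channel is attenuated rather than amplified, and the log-softmax weighting keeps the fusion sensitive to relative head strength.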
4. Distinctive Methodological Elements
The following table summarizes the main components of each DAN variant:
| Domain | Distraction Mechanism | Regularization Type |
|---|---|---|
| Document Summarization | M1 (context subtraction), M2 (attention history priming), M3 (diversity on decoding) | Implicit via context/attention subtraction |
| Facial Expression Recognition | Multi-head cross spatial+channel attention, partition loss, affinity loss | Explicit via affinity and partition loss |
Both frameworks adopt "distraction" to penalize repetitive focus: in text, mathematically subtracting context/attention history; in vision, enforcing head-wise attention diversity via loss terms and parallel attention pathways.
5. Empirical Results and Performance Analysis
In abstractive summarization, DAN yields substantial ROUGE improvements on long documents (e.g., CNN R-1 +5.8 absolute over baseline), with relative improvements higher for longer documents (29.0% R-1 gain on long subset vs. 25.9% on short). On short-text datasets (LCSTS), distraction mechanisms offer negligible effect, indicating the value of distraction scales with input redundancy and length (Chen et al., 2016).
For facial expression recognition, DAN sets state-of-the-art on all tested benchmarks: RAF-DB (89.70%), AffectNet-8 (62.09%), AffectNet-7 (65.69%), and SFEW 2.0 (53.18%). Ablation studies indicate that partition loss and affinity loss each provide measurable gains; multi-head attention (up to K=4) further increases accuracy. Confusion matrices show strong performance on "Happy," "Neutral," and "Surprise," with challenges remaining for "Disgust" and "Fear" (Wen et al., 2021).
6. Practical Implementation Considerations
Summarization DAN uses a 25k-token vocabulary (CNN) with UNK replacement and a pointer mechanism during decoding. Embedding size is 120 (CNN) or 500 (LCSTS); the bidirectional hidden size reaches up to 1200. Optimal distraction weights are determined via grid search.
For visual DAN, ResNet-18 serves as a lightweight backbone. Model size (ResNet-18 plus four MAN heads) is 19.7 million parameters, 2.23G MACs. Batch size is 256 on a single P100 GPU across datasets.
7. Theoretical and Applied Implications
The introduction of distraction mechanisms into neural attention architectures demonstrates marked empirical benefits when input redundancy or complexity would otherwise yield collapsed or repetitive focus. In language, explicit mathematical subtraction mitigates attention collapse over long sequences, while in visual models, multi-head and partition regularizers enforce exploration of diverse spatial and channel-wise features. These approaches preserve or modestly extend the computational cost of their respective baselines while achieving consistent gains on length- or region-sensitive benchmarks.
A plausible implication is that similar distraction strategies could generalize to other domains afflicted by coverage redundancy, such as video modeling or multi-label classification, provided that analogous forms of attention regularization are developed.