Reciprocal Attention Value Mixing (RAVM)
- Reciprocal Attention Value Mixing (RAVM) is an augmentation of the standard attention mechanism that integrates a query–value interaction to produce query-aware values.
- It employs a learned gating mechanism to adaptively blend query-aware value representations with the original values, enhancing the semantic expressiveness of the output.
- Empirical evaluations show consistent performance gains across various models and tasks with only a modest computational overhead.
Reciprocal Attention Value Mixing (RAVM) is an augmentation of the standard attention mechanism in neural architectures, designed to leverage direct interactions between queries and values. Unlike conventional attention, which computes output as a weighted sum of value vectors exclusively modulated by query–key affinities, RAVM introduces a query–value interaction function to produce query-aware values, which are then combined via a learned gating mechanism. This enhances the semantic expressiveness of the output while maintaining computational efficiency and compatibility with existing attention-based pipelines (Wu et al., 2020).
1. Standard Attention Mechanism and Its Limitations
In standard scaled dot-product attention, each output vector is formed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$

where $Q$ is the query matrix, $K$ the key matrix, and $V$ the value matrix. The affinity between each query and key produces an attention weight, but the mechanism does not explicitly connect queries with the value content beyond selection. Despite substantial advances, this design overlooks inherent relationships between queries and values. As a result, standard attention may be suboptimal for tasks where query-dependent value adaptation is relevant (Wu et al., 2020).
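As a concrete reference point, the weighted-sum form above can be written in a few lines of NumPy. This is a minimal illustrative sketch (single head, no masking or projections), not the implementation from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Note that the values V enter the output only through the weighted sum."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (N, N) weights; each row sums to 1
    return A @ V

rng = np.random.default_rng(0)
N, d = 5, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape)  # (5, 8)
```

The last line of `standard_attention` is the only place values appear, which is exactly the limitation RAVM targets: queries influence *which* values are selected, but not their content.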
2. Query–Value Interaction Function
The central innovation in RAVM is the incorporation of a query–value interaction function, which learns to combine information from both a query $q$ and a value $v$.
2.1 Additive-Attention (Single-Query) Setting
For a single query $q \in \mathbb{R}^d$ and a sequence of values $v_1, \dots, v_N \in \mathbb{R}^d$:
- Project $q$: $\tilde{q} = W q$, with $W \in \mathbb{R}^{d \times d}$,
- Elementwise interaction: $u_i = \tilde{q} \odot v_i$ (Hadamard product),
- Learn a gate:
  $g_i = \sigma\!\left(w^\top [u_i ; v_i]\right)$
  where $w \in \mathbb{R}^{2d}$ is a learned parameter vector, $[\cdot\,;\cdot]$ denotes concatenation, and $\sigma$ is the sigmoid function,
- Output the query-aware value:
  $\tilde{v}_i = g_i\, u_i + (1 - g_i)\, v_i$
Thus, $\tilde{v}_i$ combines the query–value interaction and the original value via a learned convex mixture.
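The steps above can be sketched directly in NumPy. The shapes follow the definitions in this section, but the function name and random initialization are illustrative; in practice $W$ and $w$ would be trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ravm_single_query(q, V, W, w):
    """Query-aware values for one query q (d,) over values V (N, d).
    W: (d, d) query projection; w: (2d,) gate parameter vector."""
    q_tilde = W @ q                      # project the query
    U = q_tilde * V                      # Hadamard interaction, broadcast over rows
    g = sigmoid(np.concatenate([U, V], axis=1) @ w)  # one gate scalar per value
    g = g[:, None]                       # broadcast the gate over features
    return g * U + (1.0 - g) * V         # learned convex mixture

rng = np.random.default_rng(1)
N, d = 4, 6
q = rng.normal(size=d)
V = rng.normal(size=(N, d))
W = rng.normal(size=(d, d))
w = rng.normal(size=2 * d)
V_tilde = ravm_single_query(q, V, W, w)
print(V_tilde.shape)  # (4, 6)
```

A useful sanity check on the convex mixture: with $w = 0$ every gate is exactly $0.5$, so the output is the midpoint of the interaction term and the original value.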
2.2 Dot-Product Attention (Multi-Query) Setting
For full attention matrices $Q, K, V \in \mathbb{R}^{N \times d}$, RAVM avoids $O(N^2)$ pairwise query–value mixing by using a query summary per value:
- Reverse attention: $A' = \mathrm{softmax}\!\left(K Q^\top / \sqrt{d}\right)$,
- Query summary: $\bar{Q} = A' Q$,
- Elementwise interaction: $U = \bar{Q} \odot V$,
- Gate for each row:
  $g = \sigma\!\left([U ; V]\, w\right)$
  where $w \in \mathbb{R}^{2d}$ and $g \in (0, 1)^N$,
- Query-aware values:
  $\tilde{V} = g \odot U + (1 - g) \odot V$
Each row of $\tilde{V}$ mixes base and query-adapted values, where $g$ is row-wise broadcast.
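A minimal NumPy sketch of the multi-query variant, following the per-value query summary described above (illustrative names and random parameters, assuming no extra projection inside the interaction step):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ravm_values(Q, K, V, w):
    """Query-aware values for full matrices Q, K, V of shape (N, d).
    A per-position query summary replaces O(N^2) pairwise mixing."""
    d = Q.shape[-1]
    A_rev = softmax(K @ Q.T / np.sqrt(d))  # reverse attention: values attend to queries
    Q_bar = A_rev @ Q                      # query summary, one row per value
    U = Q_bar * V                          # elementwise interaction
    g = sigmoid(np.concatenate([U, V], axis=1) @ w)[:, None]  # row-wise gate, broadcast
    return g * U + (1.0 - g) * V           # convex mixture per row

rng = np.random.default_rng(2)
N, d = 6, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
w = rng.normal(size=2 * d)
print(ravm_values(Q, K, V, w).shape)  # (6, 4)
```

Note the design choice: the reverse attention $A'$ gives each value a single $d$-dimensional query summary, so the interaction and gate cost only $O(Nd)$ rather than $O(N^2 d)$ for explicit pairwise mixing.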
3. RAVM Pipeline Integration and Intuitive Rationale
With query-aware values $\tilde{V}$ computed, RAVM reinserts them into the canonical attention calculation, leaving query–key affinity computations intact:
- Compute attention weights as in standard attention, $A = \mathrm{softmax}\!\left(Q K^\top / \sqrt{d}\right)$,
- Compute output using query-adaptive values:
  $\mathrm{Output} = A \tilde{V}$
Here, the only difference from vanilla attention is substituting $V$ with $\tilde{V}$. The weight matrix $A$ is unaffected, ensuring compatibility with existing architectures and preserving model interpretability (Wu et al., 2020).
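Putting the pieces together, the whole pipeline differs from vanilla attention by one substitution. The sketch below is self-contained NumPy with illustrative names and random parameters, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ravm_attention(Q, K, V, w):
    """Attention with RAVM: identical to vanilla attention except that
    the value matrix V is replaced by query-aware values V_tilde."""
    d = Q.shape[-1]
    # Query-aware values: reverse attention + gated convex mixture.
    Q_bar = softmax(K @ Q.T / np.sqrt(d)) @ Q
    U = Q_bar * V
    g = sigmoid(np.concatenate([U, V], axis=1) @ w)[:, None]
    V_tilde = g * U + (1.0 - g) * V
    # Attention weights are computed exactly as in the vanilla mechanism.
    A = softmax(Q @ K.T / np.sqrt(d))
    return A @ V_tilde

rng = np.random.default_rng(3)
N, d = 6, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = ravm_attention(Q, K, V, rng.normal(size=2 * d))
print(out.shape)  # (6, 4)
```

Because $A$ is computed exactly as before, any attention maps extracted for analysis are unchanged; only the content being averaged differs.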
4. Algorithmic Complexity and Implementation Details
Let $N$ be the sequence length, $d$ the hidden dimension, and $B$ the batch size. The dominant per-example complexity terms are:
- Attention: $O(N^2 d)$,
- Value-mixing transforms: $O(N d)$ for the elementwise interaction and gating (the reverse-attention query summary reuses an attention-shaped $O(N^2 d)$ product).
The overhead from value mixing is therefore typically modest relative to the quadratic attention term. The gate $g$ is implemented as a row-wise projection and sigmoid over concatenated representations. The learned parameters (the projection $W$ and gate vector $w$ among them) are initialized with Xavier/Glorot initialization; dropout is applied throughout the projections.
Reported hyperparameters cover the hidden dimension, the number of attention heads (set separately for the additive and Transformer settings), the dropout rate, and the batch size. Adam is used for optimization, with distinct learning rates for the CNN/LSTM variants and for the Transformers (Wu et al., 2020).
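To make the asymptotics concrete, a back-of-the-envelope multiply count for one attention layer (the numbers below are hypothetical, not taken from the paper):

```python
# Rough multiply-count comparison for one attention layer (illustrative only).
N, d = 512, 64  # hypothetical sequence length and per-head hidden dimension

attention_cost = N * N * d  # Q K^T and A @ V each scale like N^2 * d
mixing_cost = N * d         # elementwise interaction + per-row gate
# (The reverse-attention query summary adds another attention-shaped
#  N^2 * d pass, roughly doubling the quadratic term.)

overhead = mixing_cost / attention_cost
print(f"mixing adds ~{overhead:.2%} of the attention cost")  # ~0.20%
```

At these shapes the elementwise mixing itself is a fraction of a percent of the attention cost; the reverse-attention pass is the larger contribution, and it grows no faster than attention itself.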
5. Empirical Results and Ablation Studies
RAVM was evaluated on four datasets:
- Text classification: AG’s News (4-way), Amazon Electronics (5-way sentiment)
- Chinese NER: SIGHAN Bakeoff-3 and -4
RAVM was tested as an augmentation to CNN-Att, LSTM-Att, HAN, Transformer, Transformer-CRF, and CNN+Transformer-CRF. Representative improvements (classification accuracy in %, NER F1 in points) are summarized in Table 1:
| Baseline | w/o RAVM | w/ RAVM | Δ |
|---|---|---|---|
| CNN-Att | 92.32% | 92.66% | +0.34 |
| LSTM-Att | 92.20% | 92.68% | +0.48 |
| HAN | 92.12% | 92.74% | +0.62 |
| Transformer | 93.11% | 93.40% | +0.29 |
| Transformer-CRF | 84.83 | 85.33 | +0.50 |
| CNN+Transf-CRF | 87.04 | 87.35 | +0.31 |
Ablation analysis on AG’s News with CNN-Att revealed:
- Baseline (original values $v_i$ only): 92.32%
- Adding the interaction feature alone: 92.40%
- Unweighted summation of interaction and original value: 92.55%
- Full RAVM gated mixture: 92.66%
This suggests incremental gains from naive feature addition through summation to gating, with the gating mechanism (full RAVM) yielding the best trade-off between the original value and the query–value interaction (Wu et al., 2020).
6. Significance and Model Integration
RAVM augments the standard attention mechanism by introducing a learnable, query-adaptive "value adapter" that mixes, via a sigmoid gate, between the original value and its query-dependent transformation. The model leaves attention weight computation untouched, preserving the interpretability and defensibility of attention-based models. Consistent improvements of 0.3–0.6 points across tasks indicate robustness at a modest computational overhead. A plausible implication is that RAVM’s explicit Q–V adaptation is particularly beneficial for tasks where query-specific semantic features in values are underexploited by naive attention (Wu et al., 2020).