
BSARec: Beyond Self-Attention for RecSys

Updated 26 December 2025
  • BSARec is a sequential recommendation architecture that integrates a frequency bias module using FFT to isolate and emphasize high-frequency user signals.
  • It fuses standard multi-head self-attention with a parallel frequency domain branch, effectively mitigating oversmoothing and capturing short-term interest shifts.
  • Empirical results on datasets like MovieLens-1M and LastFM show measurable improvements, with up to 28.5% increase in hit rate on certain benchmarks.

BSARec ("Beyond Self-Attention for Sequential Recommendation") is an architectural augmentation of Transformer-based sequential recommendation models. Its core motivation is to address the low-pass filtering (oversmoothing) characteristic of standard self-attention, which prioritizes long-term dependencies at the expense of modeling short-term, high-frequency interest shifts in user behavior. BSARec achieves this by introducing a frequency-domain bias module within each Transformer block, leveraging the discrete Fourier transform (DFT) to isolate and re-weight high-frequency signals in user interaction sequences.

1. Foundations: Self-Attention and Oversmoothing

Transformer-based sequential recommendation models such as SASRec employ multi-head self-attention to encode user histories $\mathbf{x} = [x_1, \ldots, x_n]$. Each item in the sequence is mapped to an embedding and combined with positional encodings; these are then processed through stacked self-attention and feed-forward layers with residual connections and LayerNorm.
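
As a minimal sketch of this input construction (the vocabulary size, maximum length, and hidden size below are arbitrary illustrative choices, and learned positional embeddings are assumed):

```python
import torch
import torch.nn as nn

# SASRec-style input layer: item embeddings plus learned positional encodings.
num_items, max_len, d = 10_000, 50, 64
item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)   # index 0 reserved for padding
pos_emb = nn.Embedding(max_len, d)

seq = torch.randint(1, num_items + 1, (8, max_len))        # a batch of user interaction sequences
positions = torch.arange(max_len).unsqueeze(0)             # (1, max_len), broadcast over the batch
h0 = item_emb(seq) + pos_emb(positions)                    # input to the stacked encoder blocks
```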

Empirical analysis shows that, in sequential recommendation, repeated application of self-attention causes token representations to become increasingly similar, a phenomenon termed "oversmoothing." Theoretically, this manifests as the dominance of low-frequency modes in the singular-value spectrum of the transformed sequence, as shown via spectral analysis of the softmax attention matrix. Under repeated self-attention operations, the representation matrix converges toward rank 1, with all positions collapsing onto a common vector, which suppresses high-frequency, short-term pattern information (Shin et al., 2023).
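
As a quick numerical illustration (not taken from the cited papers), repeatedly applying a fixed row-stochastic attention matrix to a set of token representations drives them toward a common direction; the matrix, dimensions, and layer count below are arbitrary:

```python
import torch

torch.manual_seed(0)

n, d = 10, 16                                     # sequence length, embedding dimension
x = torch.randn(n, d)                             # token representations
attn = torch.softmax(torch.randn(n, n), dim=-1)   # a fixed row-stochastic "attention" matrix

for layer in range(1, 9):
    x = attn @ x                                  # attention-only mixing (no FFN, no residual)
    sim = torch.nn.functional.cosine_similarity(x.unsqueeze(0), x.unsqueeze(1), dim=-1)
    print(f"layer {layer}: mean pairwise cosine similarity = {sim.mean().item():.4f}")
```

The similarity climbs toward 1 across layers, which is precisely the collapse of short-term, high-frequency structure that the frequency-bias branch described below is designed to counteract.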

2. BSARec Architecture: Frequency Bias in Transformer Encoders

BSARec extends the canonical Transformer stack with an explicit frequency-rescaling branch alongside the standard self-attention within each encoder block. The process operates as follows:

  • Input: The embedded user interaction sequence $\mathbf{E}[x_i] \in \mathbb{R}^d$ with added positional encoding $P[i]$.
  • Parallel Branching:
    • Self-Attention: Standard multi-head self-attention with causal masking computes $A$ and the output hidden states $h_n$, as in SASRec.
    • Frequency-Bias (BSA) Module:
      1. The fast Fourier transform (FFT) is applied to the sequence ($A$ or the input embeddings), yielding $F \in \mathbb{C}^{n \times d}$.
      2. Frequency masking is performed: frequencies $|\omega| < c$ are treated as low-pass, frequencies $|\omega| \geq c$ as high-pass.
      3. High-frequency components are rescaled by a learned parameter $\beta$, producing $F'(\omega)$.
      4. The inverse FFT is applied to reconstruct the filtered signal $B$ in the time domain.

The outputs of the self-attention branch ($A$) and the frequency-bias branch ($B$) are linearly fused:

\widetilde{A} = A + \alpha B

where $\alpha \in [0,1]$ is a learnable or tunable weighting factor determining the contribution of the frequency-bias path. Both outputs undergo normalization and are passed to the subsequent feed-forward network and the next encoder block (D'Ercoli et al., 17 Jun 2025).
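
A minimal PyTorch sketch of this branch is given below; the class name, tensor shapes, and the use of `torch.fft.rfft`/`irfft` are illustrative assumptions rather than the reference implementation:

```python
import torch
import torch.nn as nn

class FrequencyBias(nn.Module):
    """Sketch of the frequency-bias (BSA) branch: FFT -> low/high split at cutoff c
    -> rescale the high band by a learned beta -> inverse FFT -> fuse with alpha."""

    def __init__(self, hidden_dim: int, c: int = 5, alpha: float = 0.5):
        super().__init__()
        self.c = c                                        # number of low-frequency bins left unscaled
        self.alpha = alpha                                # fusion weight for the frequency branch
        self.beta = nn.Parameter(torch.ones(hidden_dim))  # per-dimension high-frequency rescaling

    def forward(self, A: torch.Tensor) -> torch.Tensor:
        # A: (batch, seq_len, hidden_dim), e.g. the self-attention output
        F = torch.fft.rfft(A, dim=1)                      # complex spectrum along the sequence axis
        low = F.clone()
        low[:, self.c:, :] = 0                            # L(w) F(w): keep only bins below the cutoff
        high = F - low                                    # H(w) F(w): bins at or above the cutoff
        B = torch.fft.irfft(low + self.beta * high, n=A.size(1), dim=1)  # filtered time-domain signal
        return A + self.alpha * B                         # fused output: A~ = A + alpha * B
```

In a full encoder block this module sits alongside the multi-head self-attention output, with layer normalization and the feed-forward network applied to the fused result as described above.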

3. Mathematical Formulation

The fundamental computational steps for a single BSARec layer are:

  • Self-Attention:

A = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O

Each head:

\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q W^Q_i (K W^K_i)^\top}{\sqrt{d_k}} + M\right) V W^V_i

where $M$ is a causal mask.

  • Frequency Module:
  1. DFT/FFT:

    F = \mathrm{FFT}(A)

  2. Masking:

    L(\omega) = \mathbf{1}_{|\omega| < c}, \qquad H(\omega) = \mathbf{1}_{|\omega| \geq c}

  3. High-Frequency Rescaling:

    F'(\omega) = L(\omega)\, F(\omega) + \beta\, H(\omega)\, F(\omega)

  4. Inverse FFT:

    B = \mathrm{Re}\left(\mathrm{IFFT}(F')\right)

  5. Output Fusion:

    \widetilde{A} = A + \alpha B

  • Prediction & Loss: Next-item probabilities are computed via dot-product and softmax. Cross-entropy loss is used for optimization:

s_{t,i} = h_t^\top E[i], \qquad P(i_{t+1} \mid h_t) = \frac{\exp(s_{t, i_{t+1}})}{\sum_j \exp(s_{t,j})}

\mathcal{L} = -\sum_{t=1}^{n-1} \log P(i_{t+1} \mid h_t)

This implementation establishes a lightweight, frequency-extractive inductive bias without disrupting the underlying SASRec flow (Shin et al., 2023, D'Ercoli et al., 17 Jun 2025, Hutter et al., 19 Dec 2025).
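
A compact sketch of the scoring and loss computation (tensor shapes and the shared item-embedding table `E` are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

num_items, d = 1000, 64
E = torch.randn(num_items, d)                    # item embedding table
h = torch.randn(8, 50, d)                        # hidden states h_t from the BSARec encoder
targets = torch.randint(0, num_items, (8, 50))   # ground-truth next items i_{t+1} per position

scores = h @ E.T                                 # s_{t,i} = h_t^T E[i], shape (8, 50, num_items)
# cross_entropy applies the softmax over items internally, matching the formulation above
loss = F.cross_entropy(scores.reshape(-1, num_items), targets.reshape(-1))
```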

4. Practical Implementation and Hyperparameterization

BSARec is instantiated by subclassing SASRec; each standard TransformerEncoderLayer is replaced by a custom BSARecEncoderLayer. Key implementation characteristics:

  • FFT/IFFT operations are vectorized over all batch entries and heads. β can be either a scalar or a vector per embedding dimension.
  • Hyperparameters include the frequency cutoff $c$ and the weighting factor $\alpha$; standard search spaces are $c \in \{1, 3, 5, 7, 9\}$ and $\alpha \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$.
  • Layer normalization is applied after both the self-attention and frequency branches. Model definition leverages PyTorch or compatible frameworks (e.g., EasyRec built on PyTorch Lightning) for reproducibility (D'Ercoli et al., 17 Jun 2025, Hutter et al., 19 Dec 2025).

Dataset-specific optimal settings (e.g., $\alpha = 0.7$, $c = 1$ for MovieLens-1M; $\alpha = 0.3$, $c = 1$ for Foursquare-NYC) are established via grid search.
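
A hedged sketch of this grid search is shown below; `train_and_evaluate` is a hypothetical placeholder for training BSARec with a given configuration and returning a validation metric such as NDCG@20:

```python
from itertools import product

def train_and_evaluate(c: int, alpha: float) -> float:
    # Hypothetical stand-in: train BSARec with this (c, alpha) and return a
    # validation metric such as NDCG@20. Replace with a real training loop.
    return 0.0

search_space = {"c": [1, 3, 5, 7, 9], "alpha": [0.1, 0.3, 0.5, 0.7, 0.9]}

best_cfg, best_score = None, float("-inf")
for c, alpha in product(search_space["c"], search_space["alpha"]):
    score = train_and_evaluate(c=c, alpha=alpha)
    if score > best_score:
        best_cfg, best_score = {"c": c, "alpha": alpha}, score

print("best configuration:", best_cfg, "validation score:", best_score)
```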

5. Empirical Evaluation and Component Ablations

BSARec demonstrates consistent, though dataset-dependent, improvements over baseline self-attention models (notably SASRec) on established benchmarks such as MovieLens-1M, Foursquare-NYC, Amazon Beauty/Sports/Toys, LastFM, and Yelp:

| Dataset | Metric | Best Baseline | BSARec | Improvement |
|---|---|---|---|---|
| LastFM | HR@10 | 0.0547 | 0.0703* | +28.5% |
| ML-1M | NDCG@20 | 0.3731 | 0.3814* | +2.2% |
| Foursquare | NDCG@5 | 0.23340 | 0.26680 | +14.3% |

(* statistically significant at $p < 0.05$.)

Ablation studies confirm:

  • The inclusion of the frequency-bias path ($\alpha > 0$) yields measurable gains over pure self-attention ($\alpha = 0$).
  • The best performance is achieved for intermediate values of $\alpha$; extremes degrade either short- or long-term preference modeling.
  • The optimal frequency cutoff $c$ is typically at the low end of the tested range (favoring strong high-pass filtering), with higher $c$ resulting in less short-term sensitivity (D'Ercoli et al., 17 Jun 2025, Shin et al., 2023, Hutter et al., 19 Dec 2025).

Comparison with alternative digital signal processing (DSP) branches, such as the discrete wavelet transform (DWT), shows marginal (<2%) and rarely significant improvement over the basic Fourier branch; simple residual addition matches the full DSP layer in some settings (Hutter et al., 19 Dec 2025).

6. User History Frequency, Padding, and Implementation Factors

The benefit of BSARec is conditional on user history characteristics. A "scaled DC" metric constructed from the DFT of user category sequences quantifies whether a user's history is dominated by low-frequency (repetitive) or high-frequency (divergent) signals. BSARec notably improves performance in user groups with high scaled DC (high-frequency) behavior.

Padding strategy—needed for fixed input length—has substantial impact. Zero-padding introduces artificial low-frequency content that can mask real user pattern frequencies, whereas nonconstant padding (reflect, cyclic, symmetric) better preserves true frequency profiles. Reflect or cyclic padding leads to +5–10% improvements on HR and NDCG metrics over zero-padding on datasets such as LastFM and Yelp (Hutter et al., 19 Dec 2025).
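
As a rough illustration of the padding effect (the tensors, lengths, and the choice to pad embedded sequences are assumptions, not the papers' exact preprocessing), PyTorch's `F.pad` supports constant, reflect, and circular modes on the sequence axis:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 12, 8)          # one short user history: (batch, seq_len, hidden_dim)
pad = 20 - x.size(1)               # positions needed to reach the model's fixed length

# F.pad pads the last dimension, so move the sequence axis to the end first.
x_t = x.transpose(1, 2)            # (batch, hidden_dim, seq_len)
zero_pad = F.pad(x_t, (pad, 0), mode="constant", value=0.0)
reflect_pad = F.pad(x_t, (pad, 0), mode="reflect")
circular_pad = F.pad(x_t, (pad, 0), mode="circular")

# Zero padding prepends an artificial constant segment that distorts the spectrum,
# whereas reflect/circular padding continue the existing pattern.
for name, padded in [("zero", zero_pad), ("reflect", reflect_pad), ("circular", circular_pad)]:
    spectrum = torch.fft.rfft(padded.transpose(1, 2), dim=1).abs()
    print(f"{name:8s} mean non-DC magnitude: {spectrum[:, 1:, :].mean().item():.4f}")
```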

Additionally, implementation choices such as normalization ordering and FFT library can cause measurable performance drift. Re-implementations with standardized backbone (EasyRec) and careful hyperparameter alignment are critical for replicability and fair comparison (D'Ercoli et al., 17 Jun 2025).

7. Limitations, Future Extensions, and Interpretative Notes

The principal limitations of BSARec and related studies are:

  • The improvement conferred by the explicit Fourier-based branch, while statistically significant on certain datasets, is sometimes matched by simple residual connections. This suggests that much of the benefit may derive from architectural branching rather than any specific DSP technique (Hutter et al., 19 Dec 2025).
  • The method currently employs a fixed DFT basis for inductive bias; there is scope to explore learnable spectral filters, data-driven basis construction, or alternative localized convolutions.
  • The re-scaling vector $\beta$ is not extensively tuned in the original studies. Varying $\beta$ and replacing scalar with vector rescaling may yield further performance insight (D'Ercoli et al., 17 Jun 2025, Shin et al., 2023).
  • Performance generalization to larger, sparser, and non-timestamped datasets remains an open area. Furthermore, joint optimization of the frequency-bias branch with regularizers (contrastive, negative sampling) may unlock additional gains.

Collectively, BSARec represents a reproducible, conceptually simple enhancement to Transformer-based sequential recommendation, offering a principled approach to integrating frequency-specific inductive bias in user sequence modeling (Shin et al., 2023, D'Ercoli et al., 17 Jun 2025, Hutter et al., 19 Dec 2025).
