Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation (2402.08921v1)

Published 14 Feb 2024 in cs.IR and cs.AI

Abstract: Session-based recommendation has gained increasing attention in recent years, with its aim to offer tailored suggestions based on users' historical behaviors within sessions. To advance this field, a variety of methods have been developed, with ID-based approaches typically demonstrating promising performance. However, these methods often face challenges with long-tail items and overlook other rich forms of information, notably valuable textual semantic information. To integrate text information, various methods have been introduced, mostly following a naive fusion framework. Surprisingly, we observe that fusing these two modalities does not consistently outperform the best single modality by following the naive fusion framework. Further investigation reveals a potential imbalance issue in naive fusion, where the ID modality dominates and the text modality is undertrained. This suggests that the unexpected observation may stem from naive fusion's failure to effectively balance the two modalities, often over-relying on the stronger ID modality. This insight suggests that naive fusion might not be as effective in combining ID and text as previously expected. To address this, we propose a novel alternative training strategy, AlterRec. It separates the training of ID and text, thereby avoiding the imbalance issue seen in naive fusion. Additionally, AlterRec designs a novel strategy to facilitate the interaction between the two modalities, enabling them to mutually learn from each other and integrate the text more effectively. Comprehensive experiments demonstrate the effectiveness of AlterRec in session-based recommendation. The implementation is available at https://github.com/Juanhui28/AlterRec.


Summary

  • The paper introduces AlterRec, which alternates training of separate ID and text networks to overcome the dominance of a single modality.
  • It employs hard negative sampling based on the partner network's predictions to fine-tune each modality effectively.
  • Experimental results demonstrate that AlterRec improves recommendation accuracy and enhances long-tail item performance compared to naive fusion methods.

This paper addresses the challenge of effectively combining item ID information and textual information (like titles and descriptions) for session-based recommendation (2402.08921). While ID-based methods are often effective, they struggle with sparse data (long-tail items) and ignore rich semantic signals in text. Methods integrating text commonly use a "naive fusion" approach: encode IDs and text separately, merge the embeddings (e.g., sum or concatenate), and train jointly.
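
For contrast, here is a rough PyTorch-style sketch of such a naive fusion model, assuming element-wise summation of the two embeddings, mean pooling as the sequence encoder, and inner-product scoring; all class and variable names are illustrative rather than taken from the paper's code.

import torch
import torch.nn as nn

class NaiveFusionSketch(nn.Module):
    """Naive fusion: merge ID and (projected) text embeddings, train one joint objective."""
    def __init__(self, num_items: int, dim: int, sbert_item_emb: torch.Tensor):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, dim)                # learned ID embeddings
        self.register_buffer("text_emb", sbert_item_emb)          # frozen per-item text features
        self.text_proj = nn.Linear(sbert_item_emb.size(1), dim)   # trainable projection

    def fused_item_table(self) -> torch.Tensor:
        # one fused representation per item: ID embedding + projected text embedding
        return self.id_emb.weight + self.text_proj(self.text_emb)

    def forward(self, session_item_ids: torch.Tensor) -> torch.Tensor:
        fused = self.fused_item_table()[session_item_ids]   # (batch, seq_len, dim)
        session_vec = fused.mean(dim=1)                      # mean-pooled session embedding
        return session_vec @ self.fused_item_table().T       # joint scores over all items

Because a single loss drives both embedding tables through the same fused representation, the gradient signal can be dominated by whichever modality is stronger early in training, which is exactly the imbalance the paper observes.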

The authors conduct a preliminary study and find, counterintuitively, that this naive fusion often performs similarly to or even worse than training solely on IDs. Further analysis (using their naive-fusion implementation NFRec and existing models such as UniSRec and FDSA) suggests an "imbalance issue": the ID modality, often being stronger initially, dominates the joint training process, leaving the text modality undertrained so that it contributes little to the final performance. This aligns with findings in broader multi-modal learning research indicating that fusion does not always beat the best single modality.

To overcome this imbalance and effectively leverage both modalities, the paper proposes AlterRec, a novel framework based on alternative training. Instead of joint fusion and training, AlterRec maintains two separate uni-modal networks:

  1. ID Uni-modal Network: Uses an ID embedding layer followed by a sequence encoder (mean pooling or a Transformer) to generate session representations and predict the next item based solely on IDs.
  2. Text Uni-modal Network: Uses a text encoder (Sentence-BERT, kept frozen to limit computational cost, followed by a trainable MLP) and a sequence encoder (mean pooling or a Transformer) to generate session representations and predict the next item based solely on text features. A minimal sketch of both networks follows this list.
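
A minimal PyTorch-style sketch of the two uni-modal networks, assuming mean pooling as the sequence encoder, inner-product scoring, and precomputed (frozen) Sentence-BERT item embeddings; the names are illustrative, not the authors' released code.

import torch
import torch.nn as nn

class IDNetwork(nn.Module):
    """ID uni-modal network: learned item embeddings, mean-pooled into a session vector."""
    def __init__(self, num_items: int, dim: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)

    def forward(self, session_item_ids: torch.Tensor) -> torch.Tensor:
        session_vec = self.item_emb(session_item_ids).mean(dim=1)   # (batch, dim)
        return session_vec @ self.item_emb.weight.T                 # scores over all items

class TextNetwork(nn.Module):
    """Text uni-modal network: frozen Sentence-BERT item embeddings plus a trainable MLP."""
    def __init__(self, sbert_item_emb: torch.Tensor, dim: int):
        super().__init__()
        self.register_buffer("sbert_emb", sbert_item_emb)            # fixed, never updated
        self.proj = nn.Sequential(                                   # trainable projection MLP
            nn.Linear(sbert_item_emb.size(1), dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, session_item_ids: torch.Tensor) -> torch.Tensor:
        item_vecs = self.proj(self.sbert_emb)                        # (num_items, dim)
        session_vec = item_vecs[session_item_ids].mean(dim=1)        # (batch, dim)
        return session_vec @ item_vecs.T                             # scores over all items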

The core innovation lies in the training strategy:

  • Separate Training: The ID and text networks are trained separately, avoiding the optimization imbalance seen in naive fusion.
  • Alternative Updates: The training alternates between the ID and text networks (e.g., train the ID network for $m_{gap}$ epochs, then the text network for $m_{gap}$ epochs, and repeat).
  • Prediction-based Interaction: Crucially, the networks learn from each other implicitly. When training one network (e.g., text network), predictions from the other network (e.g., ID network) are used to generate training signals:
    • Hard Negative Sampling: Instead of random negatives, items ranked highly by the ID network (but not the true target) are used as hard negatives for training the text network (ranks $k_1$ to $k_2$). This forces the text network to learn distinctions relevant to the ID network's perspective. The same process applies vice versa.
    • Positive Sample Augmentation (Optional): Items ranked in the top $p$ by the ID network can be used as additional positive targets (weighted by $\beta$) when training the text network, providing extra signal, especially for sparse items. This variant is called AlterRec_aug.

The loss function for each network is typically cross-entropy, calculated using the true target(s) and the hard negative samples derived from the other network's predictions.
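
The pseudocode below sketches one such update for the text network, driven by the ID network's predictions; helper names such as calculate_cross_entropy_loss are illustrative rather than taken from the released implementation.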

# One update of the text network, using the current ID network's predictions
# to supply hard negatives (and optionally extra positives).
id_scores = id_network.predict(sessions)            # scores over all items
id_rankings = argsort(id_scores, descending=True)   # items ranked by the ID network

hard_negatives = id_rankings[:, k1:k2]              # highly ranked non-target items
augmented_positives = id_rankings[:, 0:p]           # optional extra positive targets

text_scores = text_network.predict(sessions)

loss = calculate_cross_entropy_loss(
    text_scores,
    true_targets,
    hard_negatives,
    augmented_positives,  # optional
    beta                  # optional weight for augmented positives
)

optimizer_text.zero_grad()
loss.backward()
optimizer_text.step()

During inference, final prediction scores are a weighted sum of the scores from the converged ID and text networks:

$$y_{\mathbf{s},i} = \alpha \cdot y^{ID}_{\mathbf{s},i} + (1-\alpha) \cdot y^{text}_{\mathbf{s},i}$$

where $\alpha$ is a hyperparameter balancing the contribution of each modality.
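
A small, self-contained sketch of this inference-time fusion (function and variable names are illustrative):

import torch

def fuse_scores(id_scores: torch.Tensor, text_scores: torch.Tensor,
                alpha: float = 0.5, top_n: int = 10) -> torch.Tensor:
    """Weighted sum of the converged ID and text networks' scores, then rank items."""
    final_scores = alpha * id_scores + (1 - alpha) * text_scores
    return final_scores.topk(top_n, dim=-1).indices   # top-n recommended item indices

# Example with random scores for 2 sessions over a 1000-item catalog:
recommendations = fuse_scores(torch.randn(2, 1000), torch.randn(2, 1000), alpha=0.5)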

Implementation Considerations:

  • Text Encoder: Using pre-trained Sentence-BERT and keeping it fixed during training significantly reduces computational cost. An MLP is used to project SBERT embeddings to the desired dimension.
  • Sequence Encoders: The choice between mean pooling and a Transformer for the session embedding ($g_{mean}$ vs. $g_{Trans}$) depends on the dataset (empirical choice).
  • Hyperparameters: Key parameters include the negative sampling ranks ($k_1$, $k_2$), the augmentation rank ($p$), the augmentation weight ($\beta$), the alternation gap ($m_{gap}$), and the final score weighting ($\alpha$). Parameter analysis shows reasonable stability, with $\alpha = 0.5$ often performing well.
  • Training Stages: The paper uses an initial phase ($m_{random}$ epochs) where the networks train with random negatives before switching to the alternating hard-negative strategy (a short sketch of this schedule follows the list).
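
The schedule itself can be summarized by a small helper; the exact phase boundaries below are an assumption inferred from the paper's description (warm-up with random negatives, then alternation every $m_{gap}$ epochs), not the released code.

def phase_for_epoch(epoch: int, m_random: int, m_gap: int) -> str:
    """Which network is trained at a given epoch under the warm-up-then-alternate schedule."""
    if epoch < m_random:
        return "both networks, random negatives"
    block = (epoch - m_random) // m_gap                 # index of the current m_gap block
    active = "ID" if block % 2 == 0 else "text"
    return f"{active} network, hard negatives from the other network"

# e.g., with m_random = 2 and m_gap = 3, the first ten epochs look like:
for epoch in range(10):
    print(epoch, phase_for_epoch(epoch, m_random=2, m_gap=3))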

Experiments & Results:

  • Datasets: Homedepot (private) and Amazon-M2 (public, using Spanish, French, Italian subsets).
  • Baselines: Included ID-based (SASRec, BERT4Rec, SR-GNN, CORE, HG-GNN) and text-integrated (UniSRec, FDSA, S³-Rec, LLM2BERT4Rec variants).
  • Findings:
    • AlterRec and AlterRec_aug consistently outperformed strong baselines across datasets and metrics (Hits@N, NDCG@N).
    • Ablation studies confirmed the effectiveness of using hard negatives (vs. random negatives, which mimics independent training) and the contribution of both ID and text modalities within AlterRec.
    • Analysis showed AlterRec successfully trains both modalities without the imbalance issue observed in naive fusion (Figure 8).
    • AlterRec demonstrated significant improvements, particularly for long-tail items (Figure 9), indicating effective use of textual information where ID interactions are sparse.

Conclusion: The paper identifies a critical imbalance issue in naive fusion methods for session-based recommendation and proposes AlterRec, an effective alternative training strategy. By training ID and text networks separately but facilitating interaction through shared hard negative (and optionally positive) samples derived from each other's predictions, AlterRec achieves better integration of textual information, leading to state-of-the-art performance, especially on long-tail items. The implementation is available publicly.
