
Dual Item-Behavior Fusion Architecture

Updated 22 December 2025
  • Dual item-behavior fusion architecture is a modeling paradigm that integrates discrete behavior signals with item embeddings to capture multi-behavior dependencies.
  • It employs a dual-level fusion approach, using early fusion for input combination and intermediate fusion via behavior-aware self-attention to boost representation learning.
  • Behavior-level data augmentation combined with contrastive learning mitigates data sparsity and heterogeneity challenges, leading to improved recommendation accuracy.

A dual item-behavior fusion architecture is a neural modeling paradigm designed for multi-behavior sequential recommendation, where user interaction sequences record not only item identities but also diverse discrete behavior types (such as click, add-to-cart, or purchase). This architecture jointly integrates item and behavior information at both the input and intermediate processing stages, enabling flexible and robust preference modeling even under behavior-type heterogeneity and interaction data sparsity (Li et al., 15 Dec 2025). The following sections provide a detailed technical survey of the dual item-behavior fusion architecture as instantiated in state-of-the-art frameworks such as BLADE, situating it within the broader landscape of data augmentation, self-supervised learning, and representation fusion for sequential recommendation.

1. Motivation: Multi-Behavior Sequential Recommendation

Multi-behavior sequential recommendation extends conventional sequential recommendation by explicitly modeling multiple types of user–item interactions—such as page views, clicks, favorites, add-to-cart, and purchases—across time (Li et al., 15 Dec 2025). Two primary challenges arise:

  • Behavior Heterogeneity: Different behavior types convey distinct preference semantics and dependencies.
  • Data Sparsity: Certain types (notably high-value behaviors like purchases) are inherently sparse, causing cold-start and generalization issues.

Traditional architectures, which model either item sequences alone or naive concatenations of item–behavior pairs, fail to capture the fine-grained interdependencies between item and behavior signals.

2. Dual Fusion Modeling: Input and Intermediate Integration

A dual item-behavior fusion architecture implements fusion in two distinct stages (Li et al., 15 Dec 2025):

(a) Early Fusion (Input Level)

  • Each user history is encoded as a sequence of pairs $\{(v^1, b^1), \ldots, (v^L, b^L)\}$, where $v^l$ is an item and $b^l \in \{0,1\}^{|\mathcal{B}|}$ encodes the presence of each of the $|\mathcal{B}|$ behavior types at step $l$.
  • Item embedding: $e_{v^l} = V[v^l]$
  • Behavior embedding: $\beta^l = \mathrm{softmax}(f_u \odot b^l)\, G$
    • $f_u$: user-specific modulation vector, $b^l$: binary behavior vector, $G$: behavior embedding table
  • A fusion function $f(e_{v^l}, \beta^l)$ (sum or gating) combines item and behavior representations to form the sequence input $E'$.
  • The fused sequence $E'$ is passed through a Transformer stack to obtain the hidden sequence $F$ (see the sketch below).
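
A minimal PyTorch-style sketch of this early-fusion step follows. Module and variable names (e.g., `EarlyFusion`), tensor shapes, and the sigmoid-gated form of $f(\cdot,\cdot)$ are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Input-level item-behavior fusion sketch (shapes and names are assumptions)."""
    def __init__(self, num_items, num_behaviors, d_model):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)                        # table V
        self.behavior_emb = nn.Parameter(torch.randn(num_behaviors, d_model))   # table G
        self.gate = nn.Linear(2 * d_model, d_model)                             # gating variant of f

    def forward(self, items, behaviors, f_u):
        # items: (B, L) item ids; behaviors: (B, L, |B|) binary vectors; f_u: (B, |B|)
        behaviors = behaviors.float()
        e_v = self.item_emb(items)                                   # (B, L, d)
        logits = f_u.unsqueeze(1) * behaviors                        # f_u ⊙ b^l
        logits = logits.masked_fill(behaviors == 0, float("-inf"))   # mix only observed types
        beta = torch.softmax(logits, dim=-1) @ self.behavior_emb     # (B, L, d)
        beta = torch.nan_to_num(beta)                                # steps with no behavior
        g = torch.sigmoid(self.gate(torch.cat([e_v, beta], dim=-1))) # per-dimension gate
        return g * e_v + (1 - g) * beta                              # fused input E'
```

The fused sequence $E'$ would then feed a standard Transformer encoder to produce $F$.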

(b) Intermediate Fusion

  • Behavior-aware self-attention (BASA) injects behavior features into attention score computation, allowing behavior types to modulate sequence modeling.
  • Behavior-guided Mixture-of-Experts (BGMoE) modulates expert weights and routing based on behavior context.
  • The intermediate output $O$ is merged with the early-fusion backbone $F$ via a learned mixing parameter $\alpha$:

$$\check{U} = \alpha O + (1 - \alpha) F$$

  • Additional cross-attention uses the next-step behavior embedding to produce the per-step final output $U$ (see the sketch below).
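
The sketch below illustrates one plausible form of behavior-aware attention together with the $\alpha$-mixing step. The additive query bias and all names are assumptions; BLADE's exact BASA formulation and BGMoE routing are not reproduced here:

```python
import torch
import torch.nn as nn

class BehaviorAwareAttention(nn.Module):
    """Behavior-aware self-attention sketch: behavior vectors bias the queries
    so attention scores depend on behavior context (the bias form is assumed)."""
    def __init__(self, d_model, num_heads, num_behaviors):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.beh_proj = nn.Linear(num_behaviors, d_model)

    def forward(self, x, behaviors):
        # x: (B, L, d) hidden states; behaviors: (B, L, |B|) binary vectors
        q = x + self.beh_proj(behaviors.float())   # inject behavior into queries
        out, _ = self.attn(q, x, x)                # behavior-modulated attention scores
        return out

def dual_merge(O, F, alpha):
    """Merge intermediate output O with early-fusion backbone F (Section 2b)."""
    return alpha * O + (1 - alpha) * F             # U_check
```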

This dual-stage fusion enables item–behavior interactions to contribute both to the base encoding and to higher-level sequence modeling, facilitating robust, heterogeneity-aware representation learning (Li et al., 15 Dec 2025).

3. Behavior-Level Data Augmentation for Contrastive Learning

Behavior-level data augmentation, operating exclusively on behavior vectors blb^l, is tightly coupled with dual fusion for self-supervised contrastive learning (Li et al., 15 Dec 2025). This approach differs from item-level transformations by generating multiple randomized "views" of user behavior-type sequences while keeping item order fixed. The principal augmentation strategies are:

| Augmentation | Mechanism | Objective |
|---|---|---|
| Co-occurrence Addition | Add a new behavior type $b^+$ to $b^l$ via co-occurrence matrix $M$ | Simulate realistic multi-behavior events |
| Frequency-Based Masking | Randomly drop present behaviors ($\tilde b_i^l = 0$) with probability $q_i = \frac{m_i^c}{\sum_j m_j^c}$ | Promote long-tail behaviors, mitigate frequency bias |
| Auxiliary Flipping | Flip an auxiliary behavior bit: $b_a^l \leftarrow 1 - b_a^l$ | Reduce over-reliance on the high-frequency type |
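
A NumPy sketch of these three operations is shown below. The function name, the per-step operation sampling, and the exact probability forms are assumptions for illustration; only the three edit types mirror the paper:

```python
import numpy as np

def augment_behaviors(b, cooc, counts, rho=0.3, aux_idx=0, rng=None):
    """Behavior-level augmentation sketch (sampling details are assumptions).

    b:      (L, |B|) binary behavior matrix for one user sequence
    cooc:   (|B|, |B|) behavior co-occurrence matrix M
    counts: (|B|,) per-type frequency counts m^c
    rho:    operation rate (fraction of steps edited)
    """
    rng = rng or np.random.default_rng()
    b = b.copy()
    L, B = b.shape
    for l in np.where(rng.random(L) < rho)[0]:
        present = np.where(b[l] == 1)[0]
        if len(present) == 0:
            continue
        op = rng.integers(3)
        if op == 0:                               # co-occurrence addition
            probs = cooc[present].sum(0).astype(float)
            probs[present] = 0                    # only add currently absent types
            if probs.sum() > 0:
                b[l, rng.choice(B, p=probs / probs.sum())] = 1
        elif op == 1 and len(present) > 1:        # frequency-based masking
            q = counts[present] / counts[present].sum()
            b[l, rng.choice(present, p=q)] = 0    # frequent types dropped more often
        else:                                     # auxiliary flipping
            b[l, aux_idx] = 1 - b[l, aux_idx]
    return b
```

Note that item identities are untouched: augmentation edits only the behavior matrix, so item semantics are preserved across views.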

Two independently augmented views per user are encoded by the dual-fusion network, producing hidden representations $h_u^{(1)}, h_u^{(2)}$. The sequence-level contrastive loss

$$\mathcal{L}_{\mathrm{SeqCL}} = -\log \frac{\exp\!\big(\mathrm{sim}(h_u^{(1)}, h_u^{(2)}) / \tau\big)}{\sum_{v \neq u} \exp\!\big(\mathrm{sim}(h_u^{(1)}, h_v^{(2)}) / \tau\big)}$$

is then combined with the next-item prediction loss to optimize both augmentation invariance and downstream accuracy (see the sketch below).
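
A minimal PyTorch implementation of this objective might look as follows. Cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ is a conventional assumption, and the cross-entropy form includes the positive pair in the denominator, a common InfoNCE variant, whereas the formula above sums only over $v \neq u$:

```python
import torch
import torch.nn.functional as F

def seq_cl_loss(h1, h2, tau=0.1):
    """Sequence-level contrastive loss over two augmented views.
    h1, h2: (B, d) representations of the same users under different
    behavior-level augmentations; positives sit on the diagonal."""
    h1 = F.normalize(h1, dim=-1)                 # cosine similarity via dot product
    h2 = F.normalize(h2, dim=-1)
    logits = h1 @ h2.T / tau                     # (B, B) similarity matrix
    labels = torch.arange(h1.size(0), device=h1.device)
    return F.cross_entropy(logits, labels)

# Joint training objective (tau ≈ 0.1, lambda ≈ 0.1 per Section 4):
# loss = next_item_loss + 0.1 * seq_cl_loss(h1, h2, tau=0.1)
```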

4. Empirical Results and Architectural Impact

The dual item-behavior fusion architecture demonstrates substantial gains in multi-behavior sequential recommendation benchmarks (Li et al., 15 Dec 2025):

  • Ablation: Removing augmentations and contrastive loss yields a ≈20% relative drop in NDCG@5 (e.g., 0.0135→0.0107, KuaiSAR).
  • Augmentation effect: Each augmentation (especially co-occurrence addition) outperforms the no-augmentation baseline for operation rates $\rho \in [0.2, 0.5]$.
  • Long-tail behaviors: ≥25% relative HR@5 improvement is observed in the rare-behavior user cohort.
  • Parameter sensitivity: Optimal hyperparameters include temperature $\tau \approx 0.1$, contrastive loss weight $\lambda \approx 0.1$, and fusion mixing ratio $\alpha \in [0.4, 0.6]$.

These findings emphasize that dual fusion modeling, when paired with targeted behavior-level augmentation and contrastive learning, systematically enhances both item–behavior representation and recommendation performance, especially under heterogeneity and sparsity constraints.

5. Connections to Broader Research in Behavioral Data Augmentation

Dual fusion architectures extend behavior-level augmentation research by operating at both the architectural and self-supervised signal levels (Dang et al., 2024). In contrast, classical heuristics such as sliding-window, masking, cropping, and substitution (Song et al., 2022, Dang et al., 2024), as well as model-based augmenters (e.g., GenPAS (Lee et al., 17 Sep 2025), L2Aug (Wang et al., 2022)), focus primarily on sequence- or user-level stochastic sampling without explicit intermediate fusion. BLADE’s dual fusion is distinctive in segregating behavior-specific and item-specific contributions throughout the network stack, in contrast to the simpler concatenation or item-only fusion of earlier literature.

Behavior-level contrastive learning sits within a broader movement toward integrating augmentation, feature fusion, and representation learning, paralleling compositional data-augmentation themes in language modeling (Guo et al., 2020). These strategies consistently stress the necessity of preserving semantic integrity (by not corrupting item sequences) while stochastically exploring behavior-type patterns for maximum generalization (Li et al., 15 Dec 2025).

6. Limitations and Future Directions

Despite its demonstrated efficacy, the dual item-behavior fusion architecture exhibits several limitations (Li et al., 15 Dec 2025):

  • The fusion mechanism’s flexibility introduces additional hyperparameter complexity, notably in tuning α\alpha, λ\lambda, and selection of gating mechanisms.
  • The approach assumes reliable and sufficiently granular behavior-type labels; behavior annotation noise or missingness may reduce the contrastive signal.
  • Augmentation operations are limited to combinatorial edits on behavior vectors and do not synthesize novel item–behavior correlations beyond co-occurrence statistics.
  • While inference remains computationally efficient, training requires a double forward pass to encode both augmented views for the contrastive loss.

Potential extensions include the incorporation of graph-based co-occurrence regularizers to further expand the valid set of behavior combinations, adaptive fusion policies that allow for per-head behavior modulation in self-attention, and integration with generative augmentation modules to synthesize plausible but unseen multi-type behavior sequences.

7. Summary Table: Core Components of Dual Item-Behavior Fusion Architecture

| Component | Functionality | Distinction |
|---|---|---|
| Early Item-Behavior Fusion | Input-level combination of item and behavior vectors | Directs embeddings to capture co-dependent semantics |
| Intermediate Fusion | Injects behavior into attention/expert routing | Permits dynamic adaptation by behavior context |
| Behavior-Level Augmentation | Augments only the behavior vectors $b^l$ | Preserves item semantics while diversifying context |
| Contrastive Loss | Enforces invariance under augmentations (SeqCL) | Enhances representation robustness and generalization |

This architecture, particularly as realized in BLADE, represents the state-of-the-art for heterogeneity-aware, data-efficient multi-behavior recommendation in modern neural recommender frameworks (Li et al., 15 Dec 2025).
