BERT & CNN-Enhanced Neural Collaborative Filtering
- The paper introduces a hybrid architecture that fuses user/item embeddings with BERT-encoded text and CNN-extracted image features.
- It employs a multi-tower deep learning framework with a final MLP, integrating heterogeneous modalities through late fusion and dropout regularization.
- Empirical results on the MovieLens dataset show improved Recall@10 and HitRatio@10 compared to both plain and BERT-enhanced NCF baselines.
BERT and CNN-Integrated Neural Collaborative Filtering (NCF) defines a hybrid recommender system architecture that explicitly incorporates heterogeneous user and item representations via transformer-based contextualized text encoding and visual analysis of item images, in addition to classical collaborative filtering signals. The model fuses user and item ID embeddings, BERT-derived item metadata features, and CNN-based image features in a multi-tower deep learning framework. Empirical evaluation demonstrates quantitative improvements over both plain Neural Collaborative Filtering (NCF) and BERT-enhanced NCF baselines on the MovieLens dataset (Munem et al., 17 Dec 2025).
1. Model Architecture and Feature Integration
The BERT and CNN-integrated NCF (“HNCF”; Editor's term) architecture consists of four parallel “towers” processing distinct input modalities before joint fusion in a high-capacity multilayer perceptron (MLP):
- User-ID Tower: Receives user indices $u \in \{1, \dots, N_U\}$, mapping each to a learned embedding $e_u = E_U[u] \in \mathbb{R}^{768}$, where $E_U$ is the embedding matrix of shape $N_U \times 768$.
- Item-ID Tower: Analogously, maps item indices $i \in \{1, \dots, N_I\}$ to $e_i = E_I[i] \in \mathbb{R}^{768}$, with $E_I$ of shape $N_I \times 768$.
- Text Tower (BERT): Item metadata (title, genres, description, tags) is preprocessed (lowercased, cleaned, tokenized) and encoded with a transformer (BERT-base uncased: 12 layers, hidden size 768, 12 attention heads). The pooled [CLS] output $t_i \in \mathbb{R}^{768}$ provides a deep contextualized semantic representation of item features.
- Image Tower (CNN): The item image (poster) is resized to $224 \times 224 \times 3$, normalized, passed through a VGG16 network with frozen convolutional weights (pretrained on ImageNet; include_top=False), and projected through a dense layer ($128$ units, ReLU activation), yielding $v_i \in \mathbb{R}^{128}$.
The four outputs are concatenated,
$$z_{ui} = [\, e_u \,\|\, e_i \,\|\, t_i \,\|\, v_i \,],$$
and passed through an MLP with dropout (hidden layers of $128$ and $64$ units, ReLU activations, dropout rate $0.2$), culminating in a sigmoid unit that produces $\hat{y}_{ui}$, the predicted user–item interaction probability. This design enables integrated learning from numeric, categorical, and image-based item descriptors alongside standard collaborative filtering signals.
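The following is a minimal Keras/TensorFlow sketch of this four-tower fusion, not the authors' implementation: it assumes the BERT text feature is supplied as a pre-extracted 768-dimensional vector, applies global average pooling to the frozen VGG16 output, and uses placeholder vocabulary sizes.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_USERS, NUM_ITEMS, EMBED_DIM = 10_000, 5_000, 768  # placeholder vocabulary sizes

# User-ID and Item-ID towers: learned embedding lookups.
user_in = layers.Input(shape=(1,), dtype="int32", name="user_id")
item_in = layers.Input(shape=(1,), dtype="int32", name="item_id")
e_u = layers.Flatten()(layers.Embedding(NUM_USERS, EMBED_DIM)(user_in))
e_i = layers.Flatten()(layers.Embedding(NUM_ITEMS, EMBED_DIM)(item_in))

# Text tower: a pre-computed 768-d BERT [CLS] vector for the item metadata.
text_in = layers.Input(shape=(768,), name="bert_cls")

# Image tower: frozen VGG16 backbone (ImageNet weights), projected to 128 dims.
img_in = layers.Input(shape=(224, 224, 3), name="poster")
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False, pooling="avg")
vgg.trainable = False  # freeze convolutional weights
v_i = layers.Dense(128, activation="relu")(vgg(img_in))

# Late fusion: concatenate the four towers and score with the MLP head.
z = layers.Concatenate()([e_u, e_i, text_in, v_i])
h = layers.Dropout(0.2)(layers.Dense(128, activation="relu")(z))
h = layers.Dense(64, activation="relu")(h)
y_hat = layers.Dense(1, activation="sigmoid", name="interaction_prob")(h)

model = Model(inputs=[user_in, item_in, text_in, img_in], outputs=y_hat)
```

Feeding pre-extracted text vectors keeps the sketch self-contained; an end-to-end variant would instead place a BERT encoder inside the text tower.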
2. Mathematical Formalization
Embedding and Feature Extraction
Given a user $u$ and an item $i$:
- User embedding: $e_u = E_U[u] \in \mathbb{R}^{768}$
- Item embedding: $e_i = E_I[i] \in \mathbb{R}^{768}$
- BERT-extracted text feature: $t_i = \mathrm{BERT}_{[\mathrm{CLS}]}(\mathrm{text}_i) \in \mathbb{R}^{768}$ (see the encoding sketch after this list)
- CNN-extracted image feature: $v_i = \mathrm{ReLU}\big(W_v\,\mathrm{VGG16}(\mathrm{img}_i) + b_v\big) \in \mathbb{R}^{128}$
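A brief sketch of producing the pooled [CLS] text feature, assuming the Hugging Face transformers library with the TensorFlow backend and a maximum sequence length of 128 (the paper specifies only "BERT uncased"):

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

def encode_item_text(title, genres, description, tags, max_len=128):
    """Return a 768-d pooled [CLS] feature for one item's metadata (illustrative)."""
    text = " ".join([title, genres, description, tags]).lower()
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    out = bert(input_ids=enc["input_ids"],
               token_type_ids=enc["token_type_ids"],
               attention_mask=enc["attention_mask"])
    return out.pooler_output[0]  # pooled [CLS] representation, shape (768,)
```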
Prediction Layer
The concatenated feature vector $z_{ui} = [\, e_u \,\|\, e_i \,\|\, t_i \,\|\, v_i \,]$ is passed sequentially through:
- Hidden layer 1 ($128$ units): $h_1 = \mathrm{Dropout}_{0.2}\big(\mathrm{ReLU}(W_1 z_{ui} + b_1)\big)$
- Hidden layer 2 ($64$ units): $h_2 = \mathrm{ReLU}(W_2 h_1 + b_2)$
- Output: $\hat{y}_{ui} = \sigma(w^\top h_2 + b)$

Equivalently, $\hat{y}_{ui} = \sigma\big(\mathrm{MLP}([\, e_u \,\|\, e_i \,\|\, t_i \,\|\, v_i \,])\big)$.
Optimization and Loss
The training objective employs binary cross-entropy over the set of observed interactions $\mathcal{D}$ with labels $y_{ui} \in \{0, 1\}$:
$$\mathcal{L} = -\sum_{(u,i) \in \mathcal{D}} \Big[\, y_{ui} \log \hat{y}_{ui} + (1 - y_{ui}) \log\big(1 - \hat{y}_{ui}\big) \,\Big]$$
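Up to batch normalization of the sum, the same objective corresponds to Keras' built-in binary cross-entropy; the tensors below are purely illustrative:

```python
import tensorflow as tf

# y_true holds observed interaction labels y_ui; y_pred holds sigmoid outputs.
bce = tf.keras.losses.BinaryCrossentropy()
y_true = tf.constant([1.0, 0.0, 1.0])   # example labels
y_pred = tf.constant([0.9, 0.2, 0.6])   # example predictions
loss = bce(y_true, y_pred)              # mean binary cross-entropy over the batch
```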
3. Data Pipeline and Preprocessing
All input modalities are preprocessed according to their type for uniform downstream consumption:
- IDs: User and item IDs are cast to integer indices.
- Textual/Categorical: All text (titles, genres, tags, description) is lowercased, cleaned of special characters and stopwords, tokenized with BERT’s tokenizer, and packed into token IDs, segment IDs, and attention masks.
- Images: Posters are resized to $224 \times 224 \times 3$ arrays, converted to float32, and normalized by dividing pixel values by 255.
The feature pipeline supports missing modalities by design. Retrieval of posters is performed externally using APIs such as TMDB or IMDb.
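A sketch of the text and image preprocessing branches under these specifications; the regex-based cleaning, the placeholder stopword list, and the use of Pillow for image loading are illustrative assumptions rather than details given in the paper:

```python
import re
import numpy as np
from PIL import Image
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
STOPWORDS = {"the", "a", "an", "and", "of", "in"}  # placeholder stopword list

def preprocess_text(raw_text, max_len=128):
    """Lowercase, strip special characters and stopwords, then BERT-tokenize."""
    text = re.sub(r"[^a-z0-9\s]", " ", raw_text.lower())
    text = " ".join(w for w in text.split() if w not in STOPWORDS)
    enc = tokenizer(text, padding="max_length", truncation=True, max_length=max_len)
    return enc["input_ids"], enc["token_type_ids"], enc["attention_mask"]

def preprocess_poster(path):
    """Resize a poster to 224x224, cast to float32, and scale pixels to [0, 1]."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0
```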
4. Training Protocol and Hyperparameters
The model is trained and validated on a sample of the MovieLens-20M dataset, with associated posters crawled from TMDB/IMDb. Owing to resource constraints, 1% of all available user–item interactions is sampled, with a further subsample to 0.02%. The training/validation split allocates 80% of user–item pairs to training and 20% to validation. Leave-one-out evaluation is applied to the 799 users in the test set.
Key hyperparameters are as follows:
| Component | Specification | Value |
|---|---|---|
| User/item embed dim | learned ID embedding matrices | $768$ |
| BERT | base uncased (12 layers, hidden size 768, 12 attention heads) | pretrained, fixed config |
| CNN | VGG16 (ImageNet weights, frozen) + Dense(128) | output: $128$ |
| MLP | Layer1: 128 (ReLU, dropout 0.2); Layer2: 64 | output: sigmoid |
| Optimizer | Adam | lr = 0.001 |
| Batch size | - | $8$ |
| Epochs | - | 25 |
| Validation split | - | 20% |
| Regularization | - | Dropout (0.2), frozen VGG16 layers |
This scheme is designed to minimize overfitting (e.g., by freezing the VGG16 layers and applying dropout in the MLP) and to optimize heterogeneous representation learning.
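The reported settings map onto a standard Keras compile/fit call, sketched below; `model`, `train_inputs`, and `train_labels` are assumed names for the fused network from Section 1 and the preprocessed training data, not identifiers from the paper.

```python
import tensorflow as tf

# `train_inputs` is assumed to be the list [user_ids, item_ids, bert_features, posters];
# `train_labels` holds the 0/1 interaction targets.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")

history = model.fit(train_inputs, train_labels,
                    batch_size=8,
                    epochs=25,
                    validation_split=0.2)
```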
5. Quantitative Evaluation and Empirical Results
Performance is measured using Recall@10 and HitRatio@10, following standard recommender benchmarks:
- For user $u$ with true (held-out) item set $T_u$ and predicted top-$K$ items $R_u(K)$:
$$\mathrm{Recall@}K(u) = \frac{|R_u(K) \cap T_u|}{|T_u|}, \qquad \mathrm{HitRatio@}K(u) = \mathbb{1}\big[\, R_u(K) \cap T_u \neq \emptyset \,\big]$$
- Metrics are reported as averages across the evaluation cohort.
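These definitions correspond to the short plain-Python sketch below (illustrative; it assumes per-user ranked recommendation lists and held-out relevant item sets as inputs):

```python
def recall_at_k(recommended_k, true_items):
    """|top-K ∩ relevant| / |relevant| for one user."""
    hits = len(set(recommended_k) & set(true_items))
    return hits / len(true_items) if true_items else 0.0

def hit_ratio_at_k(recommended_k, true_items):
    """1 if any relevant item appears in the top-K list, else 0."""
    return 1.0 if set(recommended_k) & set(true_items) else 0.0

def evaluate(topk_lists, ground_truth, k=10):
    """Average both metrics over all evaluated users.

    topk_lists: {user: ranked item list}; ground_truth: {user: held-out items}.
    """
    users = list(ground_truth)
    recall = sum(recall_at_k(topk_lists[u][:k], ground_truth[u]) for u in users) / len(users)
    hit = sum(hit_ratio_at_k(topk_lists[u][:k], ground_truth[u]) for u in users) / len(users)
    return recall, hit
```

The reported values for the three models follow.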
| Model | Recall@10 | HitRatio@10 |
|---|---|---|
| Plain NCF | 0.449 | 0.161 |
| BERT-based NCF | 0.689 | 0.422 |
| BERT+CNN HNCF | 0.720 | 0.486 |
The hybrid model improves Recall@10 by roughly 3 percentage points (0.689 → 0.720) and HitRatio@10 by 6.4 percentage points (0.422 → 0.486) over the BERT-only NCF baseline on the 799 evaluated users. The gain is attributed to the model's capacity for integrating both contextualized textual metadata and visual image cues, yielding improved user–item affinity estimation through multimodal fusion (Munem et al., 17 Dec 2025).
6. Interpretations and Implications
Incorporating both BERT-based semantic representations and CNN-derived image features in a unified neural collaborative filtering framework empirically improves recommendation performance over models restricted to collaborative (ID-based) or text-enhanced signals alone. The model’s architecture allows it to leverage granular, heterogeneous content: visual semantics from item posters provide information orthogonal to textual metadata, capturing aspects such as movie mood or genre stylings visually present in posters. This suggests that hybrid models with parallel multimodal feature extraction, followed by late fusion via a joint MLP, can unlock further gains in domain-specific recommendation settings.
A plausible implication is that future recommender systems may benefit from extension to additional modalities, sequence-aware modeling of user behavior, or end-to-end trainable visual–text encoders tailored to recommender objectives. Empirical validation on MovieLens shows that such hybrid integration yields consistent improvements in Recall@10 and HitRatio@10, substantiating the value of multimodal deep feature fusion for collaborative filtering tasks (Munem et al., 17 Dec 2025).