BERT & CNN-Enhanced Neural Collaborative Filtering
- The paper introduces a hybrid architecture that fuses user/item embeddings with BERT-encoded text and CNN-extracted image features.
- It employs a multi-tower deep learning framework with a final MLP, integrating heterogeneous modalities through late fusion and dropout regularization.
- Empirical results on the MovieLens dataset show improved Recall@10 and HitRatio@10 compared to both plain and BERT-enhanced NCF baselines.
BERT and CNN-Integrated Neural Collaborative Filtering (NCF) defines a hybrid recommender system architecture that explicitly incorporates heterogeneous user and item representations via transformer-based contextualized text encoding and visual analysis of item images, in addition to classical collaborative filtering signals. The model fuses user and item ID embeddings, BERT-derived item metadata features, and CNN-based image features in a multi-tower deep learning framework. Empirical evaluation demonstrates quantitative improvements over both plain Neural Collaborative Filtering (NCF) and BERT-enhanced NCF baselines on the MovieLens dataset (Munem et al., 17 Dec 2025).
1. Model Architecture and Feature Integration
The BERT and CNN-integrated NCF (“HNCF”; Editor's term) architecture consists of four parallel “towers” processing distinct input modalities before joint fusion in a high-capacity multilayer perceptron (MLP):
- User-ID Tower: Receives user indices $u \in \{1, \dots, N_U\}$, mapping each to a learned embedding $e_u = E_U[u] \in \mathbb{R}^{768}$, where $E_U$ is the embedding matrix of shape $N_U \times 768$.
- Item-ID Tower: Analogously, maps item indices $i \in \{1, \dots, N_I\}$ to $e_i = E_I[i] \in \mathbb{R}^{768}$, with $E_I$ of shape $N_I \times 768$.
- Text Tower (BERT): Item metadata (title, genres, description, tags) is preprocessed (lowercased, cleaned, tokenized) and encoded with a transformer (BERT-base uncased: 12 layers, hidden size 768, 12 attention heads). The pooled [CLS] output $t_i \in \mathbb{R}^{768}$ provides a deep contextualized semantic representation of item features.
- Image Tower (CNN): The item image (poster) is resized to $224 \times 224 \times 3$, normalized, passed through a VGG16 network with frozen convolutional weights (pretrained on ImageNet; include_top=False), and projected through a dense layer ($128$ units, ReLU activation), yielding $v_i \in \mathbb{R}^{128}$.
The four outputs are concatenated,
$$z_{ui} = [\, e_u \,\|\, e_i \,\|\, t_i \,\|\, v_i \,],$$
and passed through an MLP with dropout (hidden layers of $128$ and $64$ units, ReLU activations, dropout rate $0.2$), culminating in a sigmoid unit that produces $\hat{y}_{ui}$, the predicted user–item interaction probability. This design enables integrated learning from numeric, categorical, and image-based item descriptors alongside standard collaborative filtering signals.
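The following is a minimal Keras/TensorFlow sketch of this four-tower fusion, not the authors' implementation: it assumes the BERT text feature is supplied as a pre-extracted 768-dimensional vector, applies global average pooling to the frozen VGG16 output, and uses placeholder vocabulary sizes.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_USERS, NUM_ITEMS, EMBED_DIM = 10_000, 5_000, 768  # placeholder vocabulary sizes

# User-ID and Item-ID towers: learned embedding lookups.
user_in = layers.Input(shape=(1,), dtype="int32", name="user_id")
item_in = layers.Input(shape=(1,), dtype="int32", name="item_id")
e_u = layers.Flatten()(layers.Embedding(NUM_USERS, EMBED_DIM)(user_in))
e_i = layers.Flatten()(layers.Embedding(NUM_ITEMS, EMBED_DIM)(item_in))

# Text tower: a pre-computed 768-d BERT [CLS] vector for the item metadata.
text_in = layers.Input(shape=(768,), name="bert_cls")

# Image tower: frozen VGG16 backbone (ImageNet weights), projected to 128 dims.
img_in = layers.Input(shape=(224, 224, 3), name="poster")
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False, pooling="avg")
vgg.trainable = False  # freeze convolutional weights
v_i = layers.Dense(128, activation="relu")(vgg(img_in))

# Late fusion: concatenate the four towers and score with the MLP head.
z = layers.Concatenate()([e_u, e_i, text_in, v_i])
h = layers.Dropout(0.2)(layers.Dense(128, activation="relu")(z))
h = layers.Dense(64, activation="relu")(h)
y_hat = layers.Dense(1, activation="sigmoid", name="interaction_prob")(h)

model = Model(inputs=[user_in, item_in, text_in, img_in], outputs=y_hat)
```

Feeding pre-extracted text vectors keeps the sketch self-contained; an end-to-end variant would instead place a BERT encoder inside the text tower.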
2. Mathematical Formalization
Embedding and Feature Extraction
Given a user $u$ and an item $i$:
- User embedding: $e_u = E_U[u] \in \mathbb{R}^{768}$
- Item embedding: $e_i = E_I[i] \in \mathbb{R}^{768}$
- BERT-extracted text feature: $t_i = \mathrm{BERT}_{[\mathrm{CLS}]}(\mathrm{text}_i) \in \mathbb{R}^{768}$ (see the encoding sketch after this list)
- CNN-extracted image feature: $v_i = \mathrm{ReLU}\big(W_v\,\mathrm{VGG16}(\mathrm{img}_i) + b_v\big) \in \mathbb{R}^{128}$
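A brief sketch of producing the pooled [CLS] text feature, assuming the Hugging Face transformers library with the TensorFlow backend and a maximum sequence length of 128 (the paper specifies only "BERT uncased"):

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

def encode_item_text(title, genres, description, tags, max_len=128):
    """Return a 768-d pooled [CLS] feature for one item's metadata (illustrative)."""
    text = " ".join([title, genres, description, tags]).lower()
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    out = bert(input_ids=enc["input_ids"],
               token_type_ids=enc["token_type_ids"],
               attention_mask=enc["attention_mask"])
    return out.pooler_output[0]  # pooled [CLS] representation, shape (768,)
```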
Prediction Layer
The concatenated feature vector $z_{ui} = [\, e_u \,\|\, e_i \,\|\, t_i \,\|\, v_i \,]$ is passed sequentially through:
- Hidden layer 1 ($128$ units): $h_1 = \mathrm{Dropout}_{0.2}\big(\mathrm{ReLU}(W_1 z_{ui} + b_1)\big)$
- Hidden layer 2 ($64$ units): $h_2 = \mathrm{ReLU}(W_2 h_1 + b_2)$
- Output: $\hat{y}_{ui} = \sigma(w^\top h_2 + b)$

Equivalently, $\hat{y}_{ui} = \sigma\big(\mathrm{MLP}([\, e_u \,\|\, e_i \,\|\, t_i \,\|\, v_i \,])\big)$.
Optimization and Loss
The training objective employs binary cross-entropy over the set of observed interactions $\mathcal{D}$ with labels $y_{ui} \in \{0, 1\}$:
$$\mathcal{L} = -\sum_{(u,i) \in \mathcal{D}} \Big[\, y_{ui} \log \hat{y}_{ui} + (1 - y_{ui}) \log\big(1 - \hat{y}_{ui}\big) \,\Big]$$
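Up to batch normalization of the sum, the same objective corresponds to Keras' built-in binary cross-entropy; the tensors below are purely illustrative:

```python
import tensorflow as tf

# y_true holds observed interaction labels y_ui; y_pred holds sigmoid outputs.
bce = tf.keras.losses.BinaryCrossentropy()
y_true = tf.constant([1.0, 0.0, 1.0])   # example labels
y_pred = tf.constant([0.9, 0.2, 0.6])   # example predictions
loss = bce(y_true, y_pred)              # mean binary cross-entropy over the batch
```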
3. Data Pipeline and Preprocessing
All input modalities are preprocessed according to their type for uniform downstream consumption:
- IDs: User and item IDs are cast to integer indices.
- Textual/Categorical: All text (titles, genres, tags, description) is lowercased, cleaned of special characters and stopwords, tokenized with BERT’s tokenizer, and packed into token IDs, segment IDs, and attention masks.
- Images: Posters are resized to $224 \times 224 \times 3$ arrays, converted to float32, and normalized by dividing pixel values by 255.
The feature pipeline supports missing modalities by design. Retrieval of posters is performed externally using APIs such as TMDB or IMDb.
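A sketch of the text and image preprocessing branches under these specifications; the regex-based cleaning, the placeholder stopword list, and the use of Pillow for image loading are illustrative assumptions rather than details given in the paper:

```python
import re
import numpy as np
from PIL import Image
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
STOPWORDS = {"the", "a", "an", "and", "of", "in"}  # placeholder stopword list

def preprocess_text(raw_text, max_len=128):
    """Lowercase, strip special characters and stopwords, then BERT-tokenize."""
    text = re.sub(r"[^a-z0-9\s]", " ", raw_text.lower())
    text = " ".join(w for w in text.split() if w not in STOPWORDS)
    enc = tokenizer(text, padding="max_length", truncation=True, max_length=max_len)
    return enc["input_ids"], enc["token_type_ids"], enc["attention_mask"]

def preprocess_poster(path):
    """Resize a poster to 224x224, cast to float32, and scale pixels to [0, 1]."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0
```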
4. Training Protocol and Hyperparameters
The model is trained and validated on a sample of the MovieLens-20M dataset, with associated posters crawled from TMDB/IMDb. Owing to resource constraints, 1% of all available user–item interactions is sampled, with a further subsample to 0.02%. The training/validation split allocates 80% of user–item pairs to training and 20% to validation. Leave-one-out evaluation is applied to the 799 users in the test set.
Key hyperparameters are as follows:
| Component | Specification | Value |
|---|---|---|
| User/item embed dim | learned ID embedding matrices | $768$ |
| BERT | base uncased (12 layers, hidden size 768, 12 attention heads) | pretrained, fixed config |
| CNN | VGG16 (ImageNet weights, frozen) + Dense(128) | output: $128$ |
| MLP | Layer1: 128 (ReLU, dropout 0.2); Layer2: 64 | output: sigmoid |
| Optimizer | Adam | lr = 0.001 |
| Batch size | - | $8$ |
| Epochs | - | 25 |
| Validation split | - | 20% |
| Regularization | - | Dropout (0.2), frozen VGG16 layers |
This scheme is designed to minimize overfitting (e.g., by freezing the VGG16 layers and applying dropout in the MLP) and to optimize heterogeneous representation learning.
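The reported settings map onto a standard Keras compile/fit call, sketched below; `model`, `train_inputs`, and `train_labels` are assumed names for the fused network from Section 1 and the preprocessed training data, not identifiers from the paper.

```python
import tensorflow as tf

# `train_inputs` is assumed to be the list [user_ids, item_ids, bert_features, posters];
# `train_labels` holds the 0/1 interaction targets.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")

history = model.fit(train_inputs, train_labels,
                    batch_size=8,
                    epochs=25,
                    validation_split=0.2)
```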
5. Quantitative Evaluation and Empirical Results
Performance is measured using Recall@10 and HitRatio@10, following standard recommender benchmarks:
- For user $u$ with true (held-out) item set $T_u$ and predicted top-$K$ items $R_u(K)$:
$$\mathrm{Recall@}K(u) = \frac{|R_u(K) \cap T_u|}{|T_u|}, \qquad \mathrm{HitRatio@}K(u) = \mathbb{1}\big[\, R_u(K) \cap T_u \neq \emptyset \,\big]$$
- Metrics are reported as averages across the evaluation cohort.
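These definitions correspond to the short plain-Python sketch below (illustrative; it assumes per-user ranked recommendation lists and held-out relevant item sets as inputs):

```python
def recall_at_k(recommended_k, true_items):
    """|top-K ∩ relevant| / |relevant| for one user."""
    hits = len(set(recommended_k) & set(true_items))
    return hits / len(true_items) if true_items else 0.0

def hit_ratio_at_k(recommended_k, true_items):
    """1 if any relevant item appears in the top-K list, else 0."""
    return 1.0 if set(recommended_k) & set(true_items) else 0.0

def evaluate(topk_lists, ground_truth, k=10):
    """Average both metrics over all evaluated users.

    topk_lists: {user: ranked item list}; ground_truth: {user: held-out items}.
    """
    users = list(ground_truth)
    recall = sum(recall_at_k(topk_lists[u][:k], ground_truth[u]) for u in users) / len(users)
    hit = sum(hit_ratio_at_k(topk_lists[u][:k], ground_truth[u]) for u in users) / len(users)
    return recall, hit
```

The reported values for the three models follow.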
| Model | Recall@10 | HitRatio@10 |
|---|---|---|
| Plain NCF | 0.449 | 0.161 |
| BERT-based NCF | 0.689 | 0.422 |
| BERT+CNN HNCF | 0.720 | 0.486 |
The hybrid model improves Recall@10 by roughly 3 percentage points (0.689 → 0.720) and HitRatio@10 by 6.4 percentage points (0.422 → 0.486) over the BERT-only NCF baseline on the 799 evaluated users. The gain is attributed to the model's capacity for integrating both contextualized textual metadata and visual image cues, yielding improved user–item affinity estimation through multimodal fusion (Munem et al., 17 Dec 2025).
6. Interpretations and Implications
Incorporating both BERT-based semantic representations and CNN-derived image features in a unified neural collaborative filtering framework empirically improves recommendation performance over models restricted to collaborative (ID-based) or text-enhanced signals alone. The model’s architecture allows it to leverage granular, heterogeneous content: visual semantics from item posters provide information orthogonal to textual metadata, capturing aspects such as movie mood or genre stylings visually present in posters. This suggests that hybrid models with parallel multimodal feature extraction, followed by late fusion via a joint MLP, can unlock further gains in domain-specific recommendation settings.
A plausible implication is that future recommender systems may benefit from extension to additional modalities, sequence-aware modeling of user behavior, or end-to-end trainable visual–text encoders tailored to recommender objectives. Empirical validation on MovieLens shows that such hybrid integration yields consistent improvements in Recall@10 and HitRatio@10, substantiating the value of multimodal deep feature fusion for collaborative filtering tasks (Munem et al., 17 Dec 2025).