A Topology-Aware Positive Sample Set Construction and Feature Optimization Method in Implicit Collaborative Filtering

Published 20 Feb 2026 in cs.IR | (2602.18288v1)

Abstract: Negative sampling strategies are widely used in implicit collaborative filtering to address issues like data sparsity and class imbalance. However, these methods often introduce false negatives, hindering the model's ability to accurately learn users' latent preferences. To mitigate this problem, existing methods adjust the negative sampling distribution based on statistical features from model training or the hardness of negative samples. Nevertheless, these methods face two key limitations: (1) over-reliance on the model's current representation capabilities; (2) failure to leverage the potential of false negatives as latent positive samples to guide model learning of user preferences more accurately. To address the above issues, we propose a Topology-aware Positive Sample Set Construction and Feature Optimization method (TPSC-FO). First, we design a simple topological community-aware false negative identification (FNI) method and observe that topological community structures in interaction networks can effectively identify false negatives. Motivated by this, we develop a topology-aware positive sample set construction module. This module employs a differential community detection strategy to capture topological community structures in implicit feedback, coupled with personalized noise filtration to reliably identify false negatives and convert them into positive samples. Additionally, we introduce a neighborhood-guided feature optimization module that refines positive sample features by incorporating neighborhood features in the embedding space, effectively mitigating noise in the positive samples. Extensive experiments on five real-world datasets and two synthetic datasets validate the effectiveness of TPSC-FO.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces TPSC-FO, integrating topology-based community detection and false negative conversion to leverage latent positives in implicit CF.
It employs personalized noise filtration and neighborhood-guided mixup for feature denoising, yielding significant gains in Recall@20 and NDCG@20.
Extensive experiments on both real-world and synthetic datasets demonstrate robust performance improvements and practical scalability across recommendation models.

Topology-Aware Positive Sample Set Construction and Feature Optimization in Implicit Collaborative Filtering

Motivation and Problem Setting

Negative sampling is central to implicit collaborative filtering (CF), facilitating efficient training and mitigating data sparsity and class imbalance. However, the prevailing approach—sampling negatives from uninteracted items—unavoidably introduces false negatives. These are items that reflect latent user preferences but were not exposed, leading to mislearning and bias in user embedding representations. Existing strategies address this either by tweaking the negative sampling distribution based on model-predicted features or by heuristically avoiding hard negatives with high positive sample similarity. These methods are limited by their dependence on model quality and their failure to harness false negatives as positive supervisory signals. This paper introduces a framework—Topology-aware Positive Sample Set Construction and Feature Optimization (TPSC-FO)—that leverages network community topology and embedding neighborhood information to systematically identify, convert, and denoise false negatives.

Figure 1: The impact of positive/negative sample similarity, revealing the tradeoff between gradient magnitude and negative sampling error.

Topological Community-Based False Negative Identification

The hypothesis is that false negatives are fundamentally unexposed positives highly correlated with true user preferences, and that topological community structure in the user-item network can reveal such correlations. By invoking the homophily principle, the authors propose the ComFNI module: Apply community detection algorithms (Leiden, Infomap) to partition the bipartite user-item graph and select candidate false negatives where a user and an item reside in the same community in both partitions. Empirical evaluation shows a false negative identification (FNI) ratio exceeding 80% in synthetic null experiments, confirming the utility of community structure.

Figure 2: ComFNI performance on Amazon-beauty and Epinions, quantifying FNI efficacy across different removal percentages.

Personalized noise filtration follows, employing Alternating Least Squares (ALS) embeddings to compute user-item similarity and dynamically threshold candidates, eliminating topologically similar but preference-divergent items. This refinement yields a topology-aware positive sample set for each user, effectively augmenting the positive supervisory signals used during training.

Figure 3: Illustrative workflow comparison—conventional negative sampling, drop-/reweight-based denoising leading to sparse positives, and TPSC-FO pipeline with topological community detection and neighborhood-guided denoising.

Neighborhood-Guided Feature Optimization

Expansion of the positive sample set increases the risk of embedding noise—originating from both user behavior artifacts and imperfect community detection. To address this, the FO module introduces neighborhood-guided feature optimization:

For each positive (user,item) interaction, select a random neighborhood of positives from the topology-aware sample set.
Aggregate their embeddings (mean pooling) to extract shared features, then apply mixup to combine this with the target item's embedding, thereby denoising while preserving expectation.
The variance analysis shows that this mechanism consistently reduces embedding noise without diluting user preference signals.
Figure 4: The impact of positive and negative sample similarity—model performance as a function of controlled candidate pool size.

Numerical Evaluation and Empirical Results

Extensive evaluations across five real-world datasets (Amazon-beauty, Amazon-home, Epinions, Gowalla, Tmall) and two synthetic datasets demonstrate the empirical superiority of TPSC-FO. For LightGCN, TPSC-FO consistently outperforms six negative sampling baselines (DNS, MixGCF, AHNS, etc.) and five denoising models (T-CE, R-CE, DeCA, DCF, PLD), with Recall@20 and NDCG@20 improvements ranging from 5% to upwards of 30% on dense data.

On synthetic datasets with artificially induced false negatives, the performance gap broadens further, supporting the claim that TPSC-FO identifies and leverages latent positives.
Training efficiency is high: community detection and positive set construction are pre-processing steps, and FO adds minimal overhead per epoch.
Figure 5: Recall@20 versus wall-clock time, indicating convergence efficiency.

Figure 6: Sensitivity analysis—the impact of quantile threshold $k$ and neighborhood size $n$ on performance.

Systematic parameter analysis reveals optimal configurations: quantile threshold $k=30$ and neighborhood size $n=10$ maximize the tradeoff between false negative recall and denoising. Robustness checks show that various community detection algorithms (Leiden, Infomap, Louvain, etc.) and MF-based embedding methods (ALS, SVD, NMF) consistently deliver gains; ALS is preferred on sparse, large-scale datasets.

Figure 7: The performance contributions of different community detection strategies in TPSC.

TPSC-FO generalizes well to classical matrix-factorization recommendation models, and substantially boosts performance when integrated with hard negative sampling variants (DNS+, MixGCF+, AHNS+), demonstrating broad applicability.

Figure 8: Integrating TFNS into three negative sampling methods leads to further improvements.

Case Study

A user-level case study on Amazon-beauty, with removal of 20% of interactions, validates that the community-based strategy identifies all flipped false negatives. Personalized noise filtration selects high-similarity items, and the introduction of FO further improves ranking accuracy for those items. When deployed, TPSC-FO elevates these items’ ranking in the recommendation list, verifying practical impact.

Figure 9: Community assignments and ranking improvement for user~3930 via personalized filtration and neighborhood optimization.

Implications and Future Directions

The TPSC-FO framework transforms negative sampling by explicitly modeling false negatives as latent positives using network topology, moving beyond model-dependent heuristics. Theoretical implications include rethinking supervision signal construction in implicit feedback systems: latent positives are best uncovered via structural community analysis, not via loss-based statistical features. Practically, this method is applicable to any recommender architecture and readily integrates with hard negative sampling and denoising pipelines, elevating both convergence and generalization in real-world deployments. The neighborhood mixup approach for feature optimization may generalize to other domains requiring robust embedding representations under noisy supervision.

Future research may explore:

Adaptive community granularity tuning and dynamic embedding thresholding for false negative filtration.
Extending topology-aware positive sample construction to temporal and multi-modal data.
Cross-domain community detection for collaborative filtering in heterogeneous interaction graphs.

Conclusion

This work presents a principled topology-aware solution to false negative identification and conversion in implicit collaborative filtering, complemented by neighborhood-guided denoising. Extensive experiments demonstrate that TPSC-FO robustly improves recommendation accuracy, leverages latent positives, and minimally increases computational overhead. The approach opens new avenues in negative sampling strategies, emphasizing network topology and embedding neighborhood signals to construct high-quality supervision.

Markdown Report Issue