Positive Sample Propagation (PSP)
- Positive Sample Propagation (PSP) is a method that selectively identifies and reinforces high-quality positive sample pairs to mitigate noise and sparsity in data.
- It integrates local empirical observations with global reconstruction or similarity measures, enriching supervision for improved feature discrimination.
- Empirical evidence shows that PSP enhances recommendation metrics in collaborative filtering and boosts accuracy in audio–visual event localization tasks.
Positive Sample Propagation (PSP) is a paradigm for selectively identifying, constructing, and propagating high-quality positive sample pairs within learning pipelines that rely on discriminative supervision. PSP has found distinct formulations and theoretical justifications in collaborative filtering, especially for implicit feedback datasets (Wu et al., 20 Feb 2026), and in cross-modal temporal event localization tasks, particularly in audio-visual reasoning (Zhou et al., 2022, Zhou et al., 2021). PSP systematically reinforces strong or likely-positive pairwise relationships to address the noise, sparsity, or ambiguity in raw data, yielding improved generalization or discriminative feature learning.
1. Foundational Concepts and Objectives
In implicit collaborative filtering (CF), PSP is defined as the process of constructing a refined positive sample set from observed user–item interactions, explicitly up-weighting pairs that exhibit both local (empirical) and global (reconstructed) support. The two central objectives are:
- Denoising false positives by emphasizing interactions supported by both user history and low-rank (SVD-based) global structure.
- Supervision enrichment by injecting additional likely-positive pairs, inferred from patterns beyond observed sparsity (Wu et al., 20 Feb 2026).
In cross-modal temporal localization, such as audio-visual event (AVE) detection, PSP refers to identifying highly correlated audio–visual pairs among all possible segment-wise combinations, pruning weak/noisy cross-modal links, and propagating feature information solely along high-similarity (“positive”) edges. This selective propagation ensures only semantically aligned modalities reinforce each other, leading to more robust event localization (Zhou et al., 2022, Zhou et al., 2021).
2. Methodologies: Graphs, Similarities, and Propagation
Implicit Collaborative Filtering
Given an implicit interaction dataset of observed user–item pairs, the methodology follows these steps (Wu et al., 20 Feb 2026):
- Graph construction: Build the user–item bipartite graph whose adjacency entries are nonzero iff the corresponding interaction is observed.
- Normalization: Apply symmetric degree normalization to the adjacency matrix, compensating for user activity and item popularity.
- Global reconstruction: Compute a low-rank approximation of the normalized interaction matrix via randomized SVD; for each user, select the top-scoring reconstructed entries to build a global graph.
- Fusion and weighting: Merge the empirical and reconstructed graphs; assign each edge a weight according to whether it appears in the local graph only, the global graph only, or both, with the largest weight reserved for high-confidence pairs present in both.
- Replication-based sampling: Include each high-confidence pair in the fused graph multiple times (by a fixed replication factor) and every other pair once.
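The steps above can be sketched in numpy on a toy binary interaction matrix. The function name, the SVD rank, the per-user top-n cutoff, and the replication factor m below are illustrative assumptions, not the paper's notation; a full SVD stands in for the randomized SVD used at scale.

```python
import numpy as np

def build_psp_positive_set(R, rank=2, top_n=2, m=3):
    """Sketch of PSP positive-set construction. R: binary user-item matrix."""
    # Symmetric degree normalization to compensate for popularity/activity.
    du = np.maximum(R.sum(axis=1, keepdims=True), 1.0)
    di = np.maximum(R.sum(axis=0, keepdims=True), 1.0)
    R_norm = R / np.sqrt(du) / np.sqrt(di)

    # Low-rank global reconstruction (full SVD here; randomized SVD at scale).
    U, s, Vt = np.linalg.svd(R_norm, full_matrices=False)
    R_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]

    # Per-user top-n reconstructed entries form the global graph.
    global_pos = set()
    for u in range(R.shape[0]):
        for i in np.argsort(-R_hat[u])[:top_n]:
            global_pos.add((u, int(i)))

    # Fuse local (observed) and global graphs: pairs supported by both are
    # high-confidence and replicated m times; all other pairs appear once.
    observed = {(int(u), int(i)) for u, i in zip(*np.nonzero(R))}
    samples = []
    for pair in observed | global_pos:
        reps = m if (pair in observed and pair in global_pos) else 1
        samples.extend([pair] * reps)
    return samples

R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
samples = build_psp_positive_set(R)
```

Replicated high-confidence pairs then receive multiple independent negative samples during training, which is where the supervision enrichment materializes.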
Cross-Modal Audio–Visual
Let T denote the number of temporal segments. Each segment is represented by visual and audio features encoded through Bi-LSTMs. PSP operates by (Zhou et al., 2022, Zhou et al., 2021):
- All-pair similarity: Compute similarity scores between all visual and audio segment pairs using projected dot products.
- Pruning: Apply ReLU, normalization, and a threshold to retain only strong connections, yielding sparse cross-modal attention matrices.
- Feature propagation: Each modality’s segment feature aggregates only the positively (strongly) correlated cross-modal features selected by the pruned attention matrices.
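These three steps admit a compact numpy sketch. The feature dimensions, random projection matrices, threshold value, and residual fusion below are illustrative assumptions rather than the papers' exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                      # temporal segments, feature dim (assumed)
V = rng.standard_normal((T, d))  # visual segment features (post Bi-LSTM)
A = rng.standard_normal((T, d))  # audio segment features (post Bi-LSTM)

# All-pair similarity via projected dot products.
Wv = rng.standard_normal((d, d)) / np.sqrt(d)
Wa = rng.standard_normal((d, d)) / np.sqrt(d)
S = (V @ Wv) @ (A @ Wa).T        # (T, T): visual segment i vs audio segment j

# Pruning: ReLU, row normalization, then a threshold tau keeps only
# strong ("positive") cross-modal links, yielding sparse attention.
tau = 0.1
S = np.maximum(S, 0.0)
S = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-8)
S_av = np.where(S >= tau, S, 0.0)      # visual attends to audio
S_va = np.where(S.T >= tau, S.T, 0.0)  # audio attends to visual (simplified)

# Propagation: each modality aggregates only its positively linked
# cross-modal features; a residual connection preserves the original signal.
V_out = V + S_av @ A
A_out = A + S_va @ V
```

Because sub-threshold entries are zeroed, weak or noisy cross-modal links contribute nothing to the propagated features, which is the core of the selective-propagation idea.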
In both domains, only strong (“positive”) pairwise relationships, as determined by similarity or structural agreement, are propagated and reinforced in learning.
3. Algorithmic Realizations and Loss Formulations
Implicit CF: PSP-NS Plugin
- Activity-aware user weighting: Assigns each user a weight inversely related to their interaction count, reducing the dominance of highly active users and up-weighting updates from inactive users.
- Pairwise ranking objective: Incorporates both replication (for high-confidence samples) and the per-user weights into the pairwise loss; in the BPR case, each sampled (user, positive, negative) triplet’s log-sigmoid margin term is scaled by the user’s weight.
This structure enables independent negative sampling across replicated positives, increasing margin diversity and coverage (Wu et al., 20 Feb 2026).
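A toy numpy sketch of one such weighted update illustrates the mechanics. The embedding sizes, learning rate, and the inverse-power form of the activity weight are all assumptions for demonstration, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, d = 3, 6, 4
P = rng.standard_normal((n_users, d)) * 0.1   # user embeddings
Q = rng.standard_normal((n_items, d)) * 0.1   # item embeddings

def user_weight(activity, alpha=0.5):
    # Illustrative activity-aware weight: down-weights very active users.
    return 1.0 / (activity ** alpha)

def weighted_bpr_step(u, i, activity, lr=0.05):
    # Draw an independent negative item j != i for this (replicated) positive.
    j = int(rng.integers(n_items))
    while j == i:
        j = int(rng.integers(n_items))
    x = P[u] @ (Q[i] - Q[j])                  # pairwise margin
    g = 1.0 / (1.0 + np.exp(x))               # gradient of -log sigmoid(x)
    w = user_weight(activity)
    grad_u = w * g * (Q[i] - Q[j])
    grad_i = w * g * P[u].copy()
    P[u] += lr * grad_u
    Q[i] += lr * grad_i
    Q[j] -= lr * grad_i
    return x

# A high-confidence positive replicated m=3 times yields three updates,
# each against a freshly sampled negative.
margins = [weighted_bpr_step(u=0, i=2, activity=5) for _ in range(3)]
```

Each replica drawing its own negative is what widens the margin coverage relative to a single unweighted BPR update on the same positive.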
Cross-Modal AVE: PSP Module
- Pair similarity loss (fully supervised): For synchronized segments, define a loss that penalizes the discrepancy between the normalized similarity of the audio–visual PSP features at each segment and the normalized ground-truth event vector.
- Unsupervised/Weakly-supervised extension: A segment-level weighting branch selects temporally relevant features when only video-level labels are present (Zhou et al., 2021).
The propagation and loss structure are crafted so that only the strongest cross-modal correspondences contribute to feature enhancement and optimization.
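As a concrete illustration of the pair-similarity objective, the sketch below regresses the L2-normalized per-segment similarity onto the L2-normalized ground-truth event vector. The MSE form and all names are assumptions for demonstration.

```python
import numpy as np

def pair_similarity_loss(sim, labels, eps=1e-8):
    """sim: per-segment audio-visual similarity, shape (T,);
    labels: binary event indicator per segment, shape (T,)."""
    s = sim / max(np.linalg.norm(sim), eps)        # normalized similarity
    y = labels / max(np.linalg.norm(labels), eps)  # normalized ground truth
    return float(np.mean((s - y) ** 2))

sim = np.array([0.9, 0.8, 0.1, 0.05])     # high similarity on event segments
labels = np.array([1.0, 1.0, 0.0, 0.0])   # event occupies first two segments
loss_aligned = pair_similarity_loss(sim, labels)
loss_shuffled = pair_similarity_loss(sim[::-1], labels)
```

Similarity profiles aligned with the event ground truth incur a smaller loss than misaligned ones, pushing the model to concentrate high audio–visual similarity on true event segments.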
4. Theoretical Rationale and Margin Improvement
PSP is mathematically analyzed from a margin-improvement perspective in implicit CF: the expected margin gain after each SGD update grows with both the accuracy and the coverage of the constructed positive set (Wu et al., 20 Feb 2026).
Accuracy and coverage are directly increased by the PSP construction, translating to larger expected margin gains and improved top-N ranking metrics. The inclusion of the activity-aware user weight in the loss further ensures proportionally stronger updates for underrepresented, inactive users.
In cross-modal AVE learning, a similar logic underpins the use of pairwise similarity losses and contrastive segment-level and video-level objectives: by propagating and reinforcing only the most semantically plausible associations, intra-class cohesion is tightened, and inter-class separation is enhanced, as evidenced by empirical improvements in AVE localization accuracy (Zhou et al., 2022, Zhou et al., 2021).
5. Integration in Broader Pipelines and Extensions
- Plug-and-play compatibility: In implicit CF, PSP-NS is agnostic to the backbone (matrix factorization, GCNs) and can enhance any negative sampling method. For example, DNS-PSP and MixGCF-PSP combinations provide further empirical gains (Wu et al., 20 Feb 2026).
- Contrastive augmentation: In audio–visual localization, PSP acts as the foundation for additional segment-level and video-level contrastive modules, enabling multi-scale discriminative feature learning. In the CPSP framework, the losses combine cross-entropy, pairwise similarity, and two levels of contrastive objectives (Zhou et al., 2022).
- Supervision modes: PSP formulations are adaptable to fully supervised, weakly supervised, and self-supervised settings, using auxiliary branches or losses as needed.
6. Empirical Evidence and Quantitative Impact
| Domain/Task | Metric(s) | Main Result | Citation |
|---|---|---|---|
| Implicit CF (Yelp) | Recall@30, Precision@30 | +32.11% R@30, +22.90% P@30 (vs strongest baseline) | (Wu et al., 20 Feb 2026) |
| Audio-Visual Event Loc. | AVE localization (full/weak) | 77.8% (full), 73.5% (weak) state-of-the-art accuracy | (Zhou et al., 2021) |
| Audio-Visual Event Loc. | New dataset (VGGSound-AVEL100k) | Improved generalization and feature discriminability | (Zhou et al., 2022) |
Extensive ablation studies confirm the necessity of PSP components: omitting propagation, using naïve attention, or removing margin objectives each empirically degrades performance. In CF, each step (SVD fusion, replication-based reweighting, activity weighting) contributes measurably to top-N recommendation metrics.
7. Practical Considerations and Hyperparameters
- Typical architectures: In audio–visual tasks, backbone extractors include VGG-19 (visual) and VGGish (audio), with Bi-LSTM encoding before PSP.
- Thresholding: A similarity threshold prunes weak cross-modal links in AVE tasks (Zhou et al., 2022, Zhou et al., 2021).
- Batching and optimization: Batch size, Adam optimizer settings, and loss weightings (e.g., for the pair-similarity term) are tuned in line with task-specific requirements.
- Graph parameters in CF: The SVD rank, replication factor, and user-activity sensitivity are tunable, with empirical tuning necessary for optimal results (Wu et al., 20 Feb 2026).
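Gathering these knobs in one place, a hedged configuration sketch might look as follows; every value is an illustrative placeholder, not a reported setting from either line of work.

```python
# Illustrative hyperparameter sketch for the two PSP settings.
# All numeric values are placeholders for demonstration only.
ave_config = {
    "visual_backbone": "VGG-19",
    "audio_backbone": "VGGish",
    "encoder": "Bi-LSTM",
    "similarity_threshold": 0.1,    # tau: prunes weak cross-modal links
    "optimizer": "Adam",
    "pair_similarity_weight": 1.0,  # weighting of the pair-similarity loss
}
cf_config = {
    "svd_rank": 64,           # rank of the randomized SVD reconstruction
    "top_n_per_user": 10,     # reconstructed entries kept per user
    "replication_factor": 3,  # copies of each high-confidence positive
    "activity_alpha": 0.5,    # user-activity sensitivity in the weight
}
```

In practice these interact: a higher SVD rank admits more global positives, which in turn makes the replication factor and threshold the main levers against noise.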
In summary, Positive Sample Propagation provides a rigorous, empirically validated methodology for enhancing positive supervision in both collaborative filtering and cross-modal alignment. By selectively propagating and reinforcing high-confidence sample pairs, PSP improves representational discrimination and learning efficacy, yielding superior performance across diverse tasks and supervision regimes (Wu et al., 20 Feb 2026, Zhou et al., 2022, Zhou et al., 2021).