This paper presents a novel method called CARE (Common Author relation-based REcommendation) for recommending scientific articles to researchers. The core idea is to leverage the observation that researchers often look for papers written by the same authors they have previously found relevant ("author-based search pattern"). However, the authors recognize that not all researchers exhibit this pattern. Therefore, CARE is designed to first identify suitable target researchers and then apply a specialized recommendation algorithm for them (Xia et al., 2020).
Problem Addressed:
- Existing article recommenders often use generic algorithms for all users, ignoring individual search behaviors like focusing on specific authors.
- Content-based methods can be complex and computationally expensive due to the large volume of text in articles.
- Standard collaborative filtering often ignores valuable information like authorship links between papers.
CARE Methodology:
The CARE method consists of two main components:
- Target Researcher Selection:
  - This module identifies researchers who are likely to have an "author-based search pattern" based on their historical reading preferences (articles saved in their library).
  - Two features are defined to quantify this pattern:
    - FE1: The ratio of article pairs within a researcher's library that share common authors to the total number of possible article pairs, $N(N-1)/2$, where $N$ is the number of articles in the library. A higher ratio suggests the researcher collects papers linked by authorship.
    - FE2: The ratio of articles written by the single most frequently occurring author in the researcher's library to the total number of articles in the library. A higher ratio indicates a focus on a specific author's work.
  - Researchers whose FE1 or FE2 values exceed predefined thresholds are considered suitable targets for the CARE ranking algorithm.
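The two features can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: a library is assumed to be a list of author-name sets (one per saved article), the function names are hypothetical, and the threshold values passed to `is_target` are placeholders, since the summary above does not state the values used in the paper.

```python
from collections import Counter
from itertools import combinations


def fe1(library):
    """Share of article pairs in the library that have a common author."""
    n = len(library)
    if n < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(library, 2) if a & b)
    return linked / (n * (n - 1) / 2)


def fe2(library):
    """Share of the library written by its single most frequent author."""
    counts = Counter(author for authors in library for author in authors)
    if not counts:
        return 0.0
    return max(counts.values()) / len(library)


def is_target(library, fe1_threshold, fe2_threshold):
    """A researcher qualifies if either feature exceeds its threshold."""
    return fe1(library) > fe1_threshold or fe2(library) > fe2_threshold


# Toy library: four saved articles, three of them by "Xia".
library = [{"Xia", "Liu"}, {"Xia", "Chen"}, {"Wang"}, {"Xia"}]
print(fe1(library))  # 3 of the 6 possible pairs share an author -> 0.5
print(fe2(library))  # "Xia" wrote 3 of the 4 articles -> 0.75
```

Because FE1 enumerates all pairs, its cost grows quadratically with library size, which is usually acceptable for per-researcher libraries.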
- Graph-based Article Ranking:
  - For the selected target researchers, a heterogeneous graph is constructed from:
    - $V_R$: Set of researcher nodes.
    - $V_A$: Set of article nodes.
    - $E_{RA}$: Edges representing reading history (researcher $r$ read article $a$).
    - $E_{AA}$: Edges representing common author relations (article $a_i$ and article $a_j$ share at least one author).
  - A Random Walk with Restart (RWR) algorithm is applied to this graph.
    - The walk starts at the target researcher node ($v_0$).
    - At each step, the walker moves to a neighboring node with probability $\alpha$ (choosing the neighbor according to the transition probabilities $T$) or restarts at $v_0$ with probability $1 - \alpha$.
    - Transition probabilities (researcher to article, article to researcher, and article to article) are calculated from the adjacency matrices representing reading relations ($W^{RA}$, of size $n \times m$) and common author relations ($W^{AA}$, of size $m \times m$). For instance, the probability of moving from article $a_i$ to article $a_j$ is:

      $$T(a_i, a_j) = \frac{W^{AA}_{ij}}{\sum_{r=1}^{n} W^{RA}_{ri} + \sum_{k=1}^{m} W^{AA}_{ik}}$$

      This normalizes the probability based on the total number of connections (to researchers or other articles) from article $a_i$.
    - The algorithm iteratively updates the scores of all nodes until convergence. The final scores of the article nodes (ScoreArticle) represent their relevance to the target researcher.
  - Top-N ranked articles not already in the researcher's library are recommended.
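The normalization described above can be sketched with NumPy. This is an illustrative sketch under stated assumptions, not the paper's code: `W_ra` and `W_aa` are hypothetical names for the reading and common-author adjacency matrices, edges are unweighted (0 or 1), and nodes are ordered with the `n` researchers first and the `m` articles after them, mirroring the indexing used in the pseudocode.

```python
import numpy as np


def build_transition_matrix(W_ra, W_aa):
    """Row-normalize the combined researcher/article adjacency matrix.

    W_ra: (n, m) array, W_ra[r, a] = 1 if researcher r saved article a.
    W_aa: (m, m) symmetric array, W_aa[i, j] = 1 if articles i and j
          share authors (zero diagonal).
    Returns an (n+m, n+m) matrix T with T[x, y] = P(step from x to y).
    """
    n, m = W_ra.shape
    A = np.zeros((n + m, n + m))
    A[:n, n:] = W_ra      # researcher -> article
    A[n:, :n] = W_ra.T    # article -> researcher
    A[n:, n:] = W_aa      # article -> article
    degree = A.sum(axis=1, keepdims=True)
    # Isolated nodes keep an all-zero row instead of dividing by zero.
    return np.divide(A, degree, out=np.zeros_like(A), where=degree > 0)


# Toy graph: 2 researchers, 3 articles; articles 0 and 2 share an author.
W_ra = np.array([[1, 1, 0],
                 [0, 1, 1]])
W_aa = np.array([[0, 0, 1],
                 [0, 0, 0],
                 [1, 0, 0]])
T = build_transition_matrix(W_ra, W_aa)
print(T[2, 4])  # article 0 -> article 2: one of two connections -> 0.5
```

In this toy graph, article 0 (node index 2) is connected to one researcher and one other article, so each outgoing move has probability 0.5. A dense matrix is used only for readability; a sparse representation would be the more likely choice at CiteULike scale.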
Implementation Considerations:
- Data Requirements: Requires researcher reading history (e.g., from CiteULike libraries) and author information for each article. The authors crawled CiteULike to obtain author data missing from the original dataset version.
- Graph Construction: Building the two adjacency matrices (reading relations and common author relations) is the first step. The common author matrix requires pairwise comparison of author lists for all articles, which can be computationally intensive for large datasets. Defining "common authors" more strictly (e.g., requiring at least two shared authors, as done in the paper) can mitigate noise from common names.
- RWR Parameters: The restart probability and the number of iterations (maxStep) need tuning. The paper reports a value of the restart parameter that worked well on its dataset.
- Scalability: RWR on large graphs can be computationally demanding. Techniques like graph partitioning or approximation methods might be needed for very large datasets.
- Feature Thresholds: The thresholds for FE1 and FE2 need to be determined, potentially via cross-validation on a hold-out set, to balance the trade-off between the number of targeted researchers and the performance gain.
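One way to avoid comparing every pair of articles during graph construction is to go through an inverted index from authors to articles, so that only pairs that actually share an author are ever touched. The sketch below is an assumption about how this could be done, not a description of the paper's pipeline; `common_author_edges` and its `min_shared` parameter are hypothetical names, with `min_shared` covering both the "at least one" and the "at least two" shared-author definitions mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def common_author_edges(article_authors, min_shared=1):
    """Return {(i, j): shared_count} for article pairs with enough shared authors.

    article_authors: list of author-name sets, indexed by article id.
    min_shared: minimum number of shared authors required for an edge.
    """
    # Inverted index: author -> ids of the articles they wrote.
    by_author = defaultdict(list)
    for article_id, authors in enumerate(article_authors):
        for author in authors:
            by_author[author].append(article_id)

    # Count shared authors only for pairs that co-occur under some author.
    shared = defaultdict(int)
    for articles in by_author.values():
        for i, j in combinations(sorted(articles), 2):
            shared[(i, j)] += 1

    return {pair: count for pair, count in shared.items() if count >= min_shared}


articles = [{"A", "B"}, {"A", "B", "C"}, {"C"}, {"D"}]
print(common_author_edges(articles))                # {(0, 1): 2, (1, 2): 1}
print(common_author_edges(articles, min_shared=2))  # {(0, 1): 2}
```

The cost is driven by the most prolific authors, since an author with k articles contributes k(k-1)/2 pairs. Matching on raw name strings also inherits the name-ambiguity noise noted above.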
Pseudocode for RWR (Algorithm 1 in paper):
```
Algorithm: Graph-based article ranking
Input: Graph G, walk probability α (the walker restarts with probability 1 - α),
       target researcher v0, max iterations maxStep, transition matrix T
Output: Ranking scores for articles ScoreArticle[1..m]

Initialize ScoreAll[1..n+m] = 0
ScoreAll[v0] = 1
for step = 0 to maxStep-1:
    Initialize tmpScore[1..n+m] = 0
    for each node vx in G:
        for each neighbor vy of vx:
            tmpScore[vy] = tmpScore[vy] + α * ScoreAll[vx] * T(vx, vy)
    # Add restart probability
    tmpScore[v0] = tmpScore[v0] + (1 - α)
    ScoreAll = tmpScore
ScoreArticle = ScoreAll[n+1 .. n+m]  # Extract scores for article nodes
Return ScoreArticle
```
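A runnable counterpart to the pseudocode might look like the following NumPy sketch. It is an illustration under assumptions rather than the paper's implementation: the nested loops are replaced by an equivalent matrix-vector product, `alpha` is the probability of following an edge (restart happens with probability `1 - alpha`, as in the update rule above), and the early-stopping tolerance `tol` and the helper `recommend` are additions of this sketch. The value 0.8 for `alpha` is a placeholder, not the paper's setting.

```python
import numpy as np


def rwr_scores(T, v0, alpha, max_step, tol=1e-10):
    """Random walk with restart over a row-stochastic transition matrix T.

    T[x, y] is the probability of stepping from node x to node y.
    """
    restart = np.zeros(T.shape[0])
    restart[v0] = 1.0
    scores = restart.copy()
    for _ in range(max_step):
        # Push mass along edges, then add the restart mass at v0.
        updated = alpha * (T.T @ scores) + (1.0 - alpha) * restart
        converged = np.abs(updated - scores).sum() < tol  # early stop (sketch only)
        scores = updated
        if converged:
            break
    return scores


def recommend(scores, n_researchers, library, top_n):
    """Rank article nodes by score, skipping articles already in the library."""
    article_scores = scores[n_researchers:]
    ranked = np.argsort(-article_scores, kind="stable")
    return [int(a) for a in ranked if int(a) not in library][:top_n]


# Toy chain: researcher r0 - article 0 - article 1 - article 2.
# Node order: [r0, article 0, article 1, article 2].
T = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
scores = rwr_scores(T, v0=0, alpha=0.8, max_step=100)
print(recommend(scores, n_researchers=1, library={0}, top_n=2))  # [1, 2]
```

In the toy chain, article 1 is one common-author hop from the researcher's saved article and article 2 is two hops away, so article 1 ranks first. Article 0 is excluded because it is already in the library.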
Evaluation and Results:
- The experiments were conducted on a CiteULike dataset.
- Key Finding: CARE significantly outperformed the Baseline (RWR without common author relations or researcher selection) only when applied to the researchers selected using FE1 and FE2. When applied to all researchers, its performance was similar to or slightly worse than the Baseline.
- This validates the paper's two main hypotheses: (1) incorporating common author relations helps for specific researchers, and (2) the features FE1 and FE2 effectively identify these researchers.
- Increasing the thresholds for FE1 and FE2 generally led to higher precision, recall, and F1 scores for CARE on the selected subset, further confirming the features' relevance.
- Two alternative features (FE3: absolute number of common author pairs; FE4: ratio of authors common to all articles) were tested and found ineffective.
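For reference, the three reported metrics can be computed per researcher as in the sketch below. It uses the standard definitions of precision, recall, and F1 for a top-N list. The summary does not spell out the paper's exact evaluation protocol, so the hold-out setup implied by `relevant` and the function name are assumptions of this sketch.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall, and F1 for one researcher's top-N list.

    recommended: ranked list of article ids returned by the recommender.
    relevant: set of held-out article ids the researcher actually saved.
    """
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Two of the four recommendations appear among the three held-out articles.
p, r, f1 = precision_recall_f1([10, 42, 7, 99], {42, 99, 5})
print(p, r, f1)
```

Here precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7. Averaging these values over the selected researchers gives the kind of aggregate that can be compared across FE1 and FE2 thresholds.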
Practical Implications:
This research provides a practical approach for enhancing scientific article recommendations by tailoring the algorithm to user behavior. Instead of a single complex model, it proposes a two-stage process: identify users who follow authors, then apply a graph-based method incorporating authorship links for those users. This hybrid strategy can lead to more relevant recommendations for a specific user segment without negatively impacting others, potentially improving user satisfaction on academic platforms.