FRONTIER-RevRec: Reviewer Recommendation Benchmark
- FRONTIER-RevRec is a large-scale, curated dataset containing over 478K papers and 177K reviewers across five domains, enabling comprehensive evaluation of reviewer recommendation algorithms.
- It employs a multi-stage cleaning and filtering methodology to ensure robust metadata profiling and mitigate cold-start effects in academic peer review.
- Experiments indicate that PLM-based semantic methods significantly outperform collaborative approaches, underscoring the importance of deep textual analysis in reviewer matching.
 
FRONTIER-RevRec is a large-scale, rigorously curated dataset designed to benchmark and advance reviewer recommendation methodologies in academic peer review. Comprising extensive records from the Frontiers open-access publishing platform between 2007 and 2025, it aims to enable comprehensive, comparative evaluation of algorithms across disciplines while revealing distinct structural and methodological challenges unique to academic recommendation tasks.
1. Dataset Construction and Content Scope
FRONTIER-RevRec was constructed from authentic peer-review records sourced directly from Frontiers through a meticulous multi-stage cleaning procedure. Domain filtering retained papers within five primary disciplinary domains: Engineering, Health, Humanities and Social Sciences, Science, and Sustainability. Metadata completeness assessments ensured robust profiling capability for both papers and reviewers. To mitigate cold-start effects, reviewer-paper interaction filtering excluded reviewers with fewer than two assignments, preserving dataset integrity while maintaining breadth.
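A minimal pandas sketch of this filtering pipeline is given below; the `assignments` DataFrame and its column names are hypothetical placeholders rather than the dataset's actual schema, and the two-assignment threshold is applied per reviewer as one plausible reading of the interaction-filtering rule.

```python
import pandas as pd

# Hypothetical schema: one row per reviewer-paper assignment with columns
# reviewer_id, paper_id, domain, title, abstract (not the dataset's real schema).
DOMAINS = {"Engineering", "Health", "Humanities and Social Sciences",
           "Science", "Sustainability"}

def filter_assignments(assignments: pd.DataFrame) -> pd.DataFrame:
    # 1) Domain filtering: keep papers in the five primary domains.
    df = assignments[assignments["domain"].isin(DOMAINS)]
    # 2) Metadata completeness: require the fields needed for profiling.
    df = df.dropna(subset=["title", "abstract", "reviewer_id"])
    # 3) Cold-start mitigation: keep reviewers with at least two assignments
    #    (assumed interpretation of the interaction-filtering rule).
    counts = df.groupby("reviewer_id")["paper_id"].transform("count")
    return df[counts >= 2]
```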
The final dataset encompasses 478,379 papers and 177,941 distinct reviewers, sourced from 209 journals and 1,736 specialized sections across multiple disciplines, including clinical medicine, biology/biochemistry, psychology, engineering, and social sciences. Each paper record contains granular metadata: titles, abstracts, DOIs, authors’ names/identifiers, journal/section information, publication dates, and associated reviewer names/identifiers. This allows the derivation of fine-grained reviewer expertise profiles from longitudinal review assignments.
| Metric | Value | Coverage | 
|---|---|---|
| Papers | 478,379 | 209 journals, 5 domains | 
| Distinct reviewers | 177,941 | 1,736 sections | 
| Timespan | 2007–2025 | Multidisciplinary | 
A plausible implication is that this scale and diversity directly address prior limitations in domain coverage and sample size that have hampered reproducible benchmarking in reviewer recommendation research.
2. Evaluation Methodologies and Comparative Performance
Experiments on FRONTIER-RevRec systematically compared three methodological families:
- Collaborative signal–based methods: e.g., LightGCN, GF-CF, leveraging reviewer-paper interaction graphs.
- Review-based methods: variations of NARRE and DeepCoNN, treating paper titles as input text.
- Pure content-based methods: ranging from TF-IDF to advanced PLMs such as BERT and LLaMA2 (a minimal TF-IDF example is sketched below).
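For concreteness, the following sketch illustrates the simplest content-based variant: reviewers are ranked for a new submission by TF-IDF cosine similarity between the submission text and each reviewer's concatenated past-paper text. All inputs are toy placeholders, not drawn from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy inputs: each reviewer profile is the concatenated text of papers they
# previously reviewed; the query is a new submission's title/abstract.
reviewer_profiles = [
    "graph neural networks for recommendation and link prediction",
    "cardiac MRI segmentation with deep convolutional networks",
]
submission = ["contrastive representation learning for reviewer assignment"]

vec = TfidfVectorizer(stop_words="english")
R = vec.fit_transform(reviewer_profiles)        # reviewers x vocab
q = vec.transform(submission)                   # 1 x vocab
scores = cosine_similarity(q, R).ravel()        # one relevance score per reviewer
ranking = scores.argsort()[::-1]                # reviewer indices, best first
```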
 
Results reveal that content-based methods, particularly those utilizing PLMs, substantially outperform collaborative filtering, achieving the highest scores in Precision@k, Recall@k, and NDCG@k:

$$\text{Precision@}k = \frac{|\mathcal{R}_p \cap \hat{\mathcal{R}}_p^{k}|}{k}, \qquad \text{Recall@}k = \frac{|\mathcal{R}_p \cap \hat{\mathcal{R}}_p^{k}|}{|\mathcal{R}_p|},$$

where $\mathcal{R}_p$ is the set of relevant reviewers for paper $p$, and $\hat{\mathcal{R}}_p^{k}$ the top-$k$ recommended reviewers. NDCG@k is calculated by

$$\text{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k},$$

normalized by the ideal DCG ($\text{IDCG@}k$), where $\mathrm{rel}_i$ indicates whether the reviewer at rank $i$ is relevant. LLaMA2 models yield benchmark-leading results, while LightGCN and GF-CF are least effective.
This suggests that deep semantic extraction from textual content is critical for reviewer matching in academic contexts.
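A minimal sketch of these ranking metrics for a single paper, assuming binary relevance and the standard log2 discount (the benchmark's exact implementation may differ):

```python
import numpy as np

def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and Recall@k for one paper.

    recommended: ranked list of reviewer ids (best first)
    relevant:    set of ground-truth reviewer ids for the paper
    """
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance and log2(i+1) discount."""
    relevant = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2)
              for i, r in enumerate(recommended[:k]) if r in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0

# Example: reviewers 7 and 3 are relevant; the system ranks [7, 5, 3, 9, 2].
print(precision_recall_at_k([7, 5, 3, 9, 2], {7, 3}, k=3))  # (0.667, 1.0)
print(ndcg_at_k([7, 5, 3, 9, 2], {7, 3}, k=3))
```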
3. Network Structure and Its Impact on Algorithmic Performance
A notable distinguishing characteristic emerges from network topology analysis. Unlike commercial recommendation datasets (Amazon, Yelp), which exhibit a single large connected component and short diameters (7–8 hops), the academic review graph is highly fragmented (6,137 connected components), with its largest component spanning 48 hops. Only about 5% of reviewer-paper pairs, whether positive or negative, are reachable within 30 hops; in commercial domains, by contrast, collaborative signals are highly discriminative for similar pairs.
| Network Property | Academic (FRONTIER-RevRec) | Commercial (Amazon/Yelp) | 
|---|---|---|
| Connected components | 6,137 | 1 | 
| Diameter (largest comp.) | 48 | 7–8 | 
| Reachable pairs within 30 hops | ~5% | Nearly all (single component) | 
The absence of strong collaborative linkage fundamentally limits the efficacy of graph-based CF in academic reviewer recommendation. A plausible implication is that domain specialization and sparse reviewer-paper assignment patterns diminish collaborative signal utility, necessitating semantic approaches.
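A small networkx sketch of this kind of topology analysis is shown below; it assumes a hypothetical iterable of (reviewer_id, paper_id) assignment pairs and uses an approximate diameter, since an exact computation over a 48-hop component is expensive.

```python
import networkx as nx
from networkx.algorithms import approximation as approx

def topology_summary(assignments):
    """assignments: hypothetical iterable of (reviewer_id, paper_id) pairs."""
    G = nx.Graph()
    for reviewer_id, paper_id in assignments:
        # bipartite reviewer-paper graph; tuples keep the two id spaces distinct
        G.add_edge(("reviewer", reviewer_id), ("paper", paper_id))

    components = list(nx.connected_components(G))
    largest = G.subgraph(max(components, key=len))
    return {
        "connected_components": len(components),
        # double-sweep lower bound; exact nx.diameter is costly on large graphs
        "largest_component_diameter_approx": approx.diameter(largest),
    }
```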
4. LLM–Based Semantic Alignment
The dataset motivates semantic matching via PLMs. LLaMA2 and BERT encode paper titles and abstracts into dense vectors, capturing intricate topical semantics. These embeddings effectively represent discipline-specific relationships: papers within a discipline exhibit higher cosine similarity than those across disciplines. Table 3 presents quantified intra-/inter-category similarities.
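As a minimal illustration of this encoding step, the sketch below uses bert-base-uncased with mean pooling as a stand-in encoder and compares toy titles; the benchmark also evaluates LLaMA2, and its exact pooling choices may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")
enc.eval()

def embed(texts):
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

titles = [
    "Deep learning for cardiac MRI segmentation",        # Health
    "Convolutional networks for tumor detection in CT",  # Health
    "Urban policy and social housing outcomes",          # Social sciences
]
vecs = torch.nn.functional.normalize(embed(titles), dim=-1)
print(vecs @ vecs.T)  # pairwise cosine similarities; intra-domain pairs score higher
```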
Fuzzy c-means clustering analyses illustrate that PLM-derived representations yield superior cluster quality (NFMI and FPC metrics) compared to graph-based features (see Figure 1). This suggests that well-trained PLMs can form a more coherent representation space for accurate reviewer-paper alignment, improving cluster cohesion and facilitating robust retrieval.
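A brief sketch of such a clustering analysis with scikit-fuzzy, reporting the fuzzy partition coefficient (FPC) on placeholder embeddings; the cluster count and fuzzifier are arbitrary choices for illustration, and NFMI is omitted here.

```python
import numpy as np
import skfuzzy as fuzz

# Placeholder embeddings standing in for PLM-derived paper vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))          # (n_papers, dim)

# scikit-fuzzy expects data shaped (features, samples), hence the transpose.
cntr, u, _, _, _, _, fpc = fuzz.cluster.cmeans(
    X.T, c=5, m=2.0, error=1e-4, maxiter=300, seed=0
)
print("fuzzy partition coefficient (FPC):", fpc)   # closer to 1 = crisper clusters
labels = u.argmax(axis=0)                          # hard cluster label per paper
```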
The significance of these findings is twofold: they empirically validate that semantic signals supersede collaborative signals in reviewer-paper matching, and they suggest the utility of PLMs to model expertise-context alignment in sparse, multidisciplinary graphs.
5. Aggregation Strategies in Reviewer Recommendation Pipelines
Two critical aggregation levels are identified:
- Word-to-paper aggregation: Methods evaluated include CNN, LSTM, GRU, average pooling, and self-attention for fusing word embeddings into paper-level vectors. For short, information-dense paper titles, average pooling (and self-attention) provided optimal performance, while sequential models like LSTM and GRU yielded no additional benefit.
- Paper-to-reviewer aggregation: Reviewer profiles synthesized from their reviewed paper embeddings used averaging, GRU, LSTM, and attention mechanisms. Sequential models (especially LSTM) excelled by emphasizing recency and learning the temporal dynamics of evolving expertise (see the sketch after this list).
 
A plausible implication is that selecting aggregation strategies attuned to the structure of input data (e.g., information density, chronological relevance) can materially impact recommendation quality.
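The following PyTorch sketch illustrates the two aggregation levels referenced above: average pooling over title word embeddings at the paper level, and an LSTM over chronologically ordered paper vectors at the reviewer level. Dimensions and module choices are illustrative assumptions, not the benchmark's exact configuration.

```python
import torch
import torch.nn as nn

class ReviewerProfileEncoder(nn.Module):
    """Word-to-paper: masked average pooling; paper-to-reviewer: LSTM whose
    final hidden state emphasizes the most recent papers."""

    def __init__(self, vocab_size: int, emb_dim: int = 128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.paper_rnn = nn.LSTM(emb_dim, emb_dim, batch_first=True)

    def paper_vector(self, title_ids: torch.Tensor) -> torch.Tensor:
        # title_ids: (B, T) word ids, 0 = padding
        w = self.word_emb(title_ids)                       # (B, T, E)
        mask = (title_ids != 0).unsqueeze(-1).float()      # (B, T, 1)
        return (w * mask).sum(1) / mask.sum(1).clamp(min=1)

    def reviewer_vector(self, paper_vecs: torch.Tensor) -> torch.Tensor:
        # paper_vecs: (B, P, E), ordered oldest -> newest
        _, (h, _) = self.paper_rnn(paper_vecs)
        return h[-1]                                       # (B, E) recency-aware profile

# A matching score can then be a dot product between a submission's paper
# vector and each reviewer's profile vector.
```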
6. Benchmarking, Significance, and Prospects
FRONTIER-RevRec constitutes a methodological benchmark for reviewer recommendation, eliminating prior bottlenecks in domain coverage and sample size. It supports reproducible comparison of algorithms and pipeline designs. The dataset's structural analysis guides researchers toward methods better suited to academic peer review, highlighting that LLM-based and semantically attentive methods are preferable to collaborative signal-heavy paradigms.
This supports the development of systems that more precisely align manuscript content with reviewer expertise, potentially streamlining peer-review assignments and increasing process efficiency.
7. Access and Research Facilitation
The dataset and related resources are publicly available in an open repository, which provides the full dataset and scripts for experimental benchmarking, facilitating further research and system development in reviewer recommendation.
In summary, FRONTIER-RevRec delivers authentic, diverse, and structurally informative data for reviewer recommendation, clarifies intrinsic differences from commercial domains, establishes semantic content matching as crucial, and identifies optimal aggregation strategies for pipeline development (Peng et al., 18 Oct 2025).