AutoRev: Graph-Based Peer Review
- The paper introduces AutoRev, an automatic peer review system that leverages a hierarchical and sequential graph to capture academic document structure.
- It employs a two-phase passage selection: dense passage retrieval followed by a graph attention network, enabling precise and context-aware extraction.
- Empirical results on ICLR-2024 indicate a 58.72% improvement over state-of-the-art baselines, highlighting the efficacy of its graph-based methodology.
AutoRev is an automatic peer review system for academic research papers that introduces a graph-based passage extraction framework tightly integrated with selective context provision for downstream LLM review generation. Its core innovation is the explicit modeling of academic document structure as a hierarchical and sequential graph, which supports more effective content selection for review tasks and is broadly adaptable to other long-context NLP tasks. Empirical results on ICLR-2024 peer review data show that AutoRev delivers a substantial performance margin over state-of-the-art baselines, attributed primarily to the graph-structured passage selection strategy rather than to LLM capacity alone. The system also illustrates both the benefits and the challenges of applying graph-based document representations to long-context NLP.
1. Graph-Based Document Representation
AutoRev formalizes academic papers as undirected, hierarchical graphs 𝒢 = (𝒱, ℰ), with nodes (𝒱) corresponding to document elements at multiple granularities:
- root node for the entire paper,
- heading nodes for top-level sections,
- subheading nodes nested under headings (reflecting arbitrary section depth),
- passage nodes (splits of text blocks using newline boundaries),
- sentence nodes obtained through sentence tokenization (e.g., via NLTK).
Edges (ℰ) are drawn from two distinct classes:
- hierarchical edges (ℰ_hier) reflecting parent-child relations down the document tree (e.g., section-to-passage, passage-to-sentence),
- sequential edges (ℰ_seq) linking consecutive elements at the same hierarchical level (e.g., adjacent passages, adjacent sentences).
This dual-edge construction encodes both the vertical structure (document hierarchy) and local horizontal flow (reading order). Such structure-aware graphs enable context-aware information propagation during passage selection and downstream review generation, in contrast to standard sequential or flat chunking approaches.
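To make the construction concrete, the following minimal Python sketch builds such a graph from an already-parsed paper. The helper signature, node naming, and use of networkx/NLTK are illustrative assumptions rather than AutoRev's implementation, and subheading nodes are omitted for brevity.

```python
# Minimal sketch of a hierarchical + sequential document graph (hypothetical
# helper; assumes the paper is already parsed into (heading, [passages]) pairs).
import networkx as nx
from nltk.tokenize import sent_tokenize  # requires the NLTK "punkt" data


def build_document_graph(sections):
    """sections: list of (heading_text, [passage_text, ...]) tuples."""
    g = nx.Graph()
    g.add_node("root", kind="root")

    prev_heading = prev_passage = prev_sentence = None
    for h_idx, (heading, passages) in enumerate(sections):
        h_node = f"heading/{h_idx}"
        g.add_node(h_node, kind="heading", text=heading)
        g.add_edge("root", h_node, kind="hier")            # hierarchical edge
        if prev_heading is not None:
            g.add_edge(prev_heading, h_node, kind="seq")   # sequential edge
        prev_heading = h_node

        for p_idx, passage in enumerate(passages):
            p_node = f"passage/{h_idx}/{p_idx}"
            g.add_node(p_node, kind="passage", text=passage)
            g.add_edge(h_node, p_node, kind="hier")
            if prev_passage is not None:
                g.add_edge(prev_passage, p_node, kind="seq")
            prev_passage = p_node

            for s_idx, sent in enumerate(sent_tokenize(passage)):
                s_node = f"sent/{h_idx}/{p_idx}/{s_idx}"
                g.add_node(s_node, kind="sentence", text=sent)
                g.add_edge(p_node, s_node, kind="hier")
                if prev_sentence is not None:
                    g.add_edge(prev_sentence, s_node, kind="seq")
                prev_sentence = s_node
    return g
```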
2. Passage Selection via DPR-GNN Cascade
The passage extraction stage leverages a two-phase method:
- First, a Dense Passage Retrieval (DPR) module computes deep semantic similarity between k-gram review pseudo-queries (formed from consecutive gold-standard review sentences) and all candidate paper passages, retrieving top-m likely review-relevant passages as reference targets.
- Second, these DPR hits provide a weakly-supervised training target for a Graph Neural Network (GNN), implemented as a Graph Attention Network (GAT) operating over the document graph 𝒢. The GAT leverages both hierarchical and sequential edge connections to propagate context and aggregate evidence for relevance estimation.
During inference, the trained GNN predicts the most salient document passages—those likely most critical to the review—purely from the document graph and without gold reviews. This extracted passage set serves as a condensed, content-focused input for the LLM-based review generator.
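A hedged sketch of the DPR weak-labelling step described above is shown below; the encoder choice, the window size k, and the cutoff m are illustrative stand-ins, not the paper's exact configuration.

```python
# Sketch of DPR-style weak supervision: passages retrieved for any k-sentence
# review window become positive training targets for the GNN passage scorer.
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dense retriever


def weak_passage_labels(review_sentences, passages, k=3, m=5):
    # Build k-gram pseudo-queries from consecutive gold review sentences.
    queries = [" ".join(review_sentences[i:i + k])
               for i in range(max(1, len(review_sentences) - k + 1))]

    q_emb = encoder.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
    p_emb = encoder.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

    labels = torch.zeros(len(passages))
    scores = q_emb @ p_emb.T                        # cosine similarity matrix
    for row in scores:
        top_m = torch.topk(row, k=min(m, len(passages))).indices
        labels[top_m] = 1.0                         # weak positive for the GNN
    return labels
```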
The GAT's node update step for node i is:

$$h_i' = \sigma\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j\right),$$

with attention coefficients

$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top}[W h_i \,\Vert\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top}[W h_i \,\Vert\, W h_k]\right)\right)},$$

where 𝒩(i) is the neighborhood of node i under both hierarchical and sequential edges, W is a learned weight matrix, a is the attention vector, ∥ denotes concatenation, and σ is a nonlinearity.
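The sketch below shows a passage scorer built from standard GAT layers via torch_geometric; the two-layer depth, layer sizes, and linear scoring head are assumptions for illustration, not AutoRev's exact architecture.

```python
# Minimal GAT-based passage scorer trained against the DPR weak labels.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv


class PassageScorer(torch.nn.Module):
    def __init__(self, in_dim=768, hidden_dim=256, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.gat2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.score = torch.nn.Linear(hidden_dim, 1)  # per-node relevance logit

    def forward(self, x, edge_index):
        # edge_index contains both hierarchical and sequential edges.
        h = F.elu(self.gat1(x, edge_index))
        h = F.elu(self.gat2(h, edge_index))
        return self.score(h).squeeze(-1)


# Training would use a binary objective on passage nodes only, e.g.:
# loss = F.binary_cross_entropy_with_logits(scores[passage_mask], labels)
```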
3. Downstream Review Generation with LLMs
The selected passage subset (typically 2,900–4,150 tokens, sharply reduced from an average full paper length of ~9,445 tokens) is then provided as input to a fine-tuned LLM, which is prompted to generate reviews using a structured template (summary, strengths, weaknesses, suggestions). By drastically condensing the context to critical content while retaining structural cues and logical dependencies (manifest in the graph), the system aligns the inference context more tightly with review needs and outperforms approaches that either naively truncate or arbitrarily select content.
Notably, the LLM is not prompted with externally written reviews; it operates solely on extracted passages. This design ensures scalability and avoids dependence on structured review data at inference time.
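A minimal illustration of how the selected passages might be assembled into a structured prompt follows; the template wording and truncation limit are assumptions, since the paper's exact prompt is not reproduced here.

```python
# Illustrative prompt assembly for the review-generation stage.
REVIEW_TEMPLATE = """You are reviewing a research paper. Based only on the
excerpts below, write a review with the sections: Summary, Strengths,
Weaknesses, Suggestions.

Excerpts:
{passages}

Review:"""


def build_review_prompt(selected_passages, max_chars=16000):
    body = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(selected_passages))
    return REVIEW_TEMPLATE.format(passages=body[:max_chars])
```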
4. Empirical Performance and Benchmarking
Automatic evaluations on the ICLR-2024 review dataset use ROUGE-1/2/L (both Recall and F1) and BERTScore-F1 as core metrics, quantifying n-gram and embedding-space semantic overlap with reference reviews. Additionally, qualitative evaluation leverages an LLM-as-a-judge pipeline to assess confidence, thoroughness, constructiveness, and helpfulness.
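As an illustration, these metrics can be computed with the Hugging Face `evaluate` wrappers; the sketch below reports F1 aggregates only and does not reproduce the paper's exact evaluation settings.

```python
# Hedged sketch of the automatic evaluation (ROUGE + BERTScore F1 aggregates).
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")


def score_reviews(generated, references):
    r = rouge.compute(predictions=generated, references=references,
                      use_stemmer=True)               # ROUGE-1/2/L (F1)
    b = bertscore.compute(predictions=generated, references=references, lang="en")
    return {"rouge1": r["rouge1"], "rouge2": r["rouge2"], "rougeL": r["rougeL"],
            "bertscore_f1": sum(b["f1"]) / len(b["f1"])}
```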
Experimental results reveal that the best AutoRev configuration (e.g., Llama (5, 5)) achieves an average 58.72% improvement in these metrics compared to SOTA baselines SEA-E and SEA-EA. The graph-based passage selection accounts for this margin; without it, the performance of strong LLMs is considerably degraded due to context dilution.
5. Generalization and Applicability to Other NLP Tasks
The document graph construction and GNN-based passage retrieval framework introduced by AutoRev are broadly applicable to other long-context, structure-rich NLP tasks, including:
- Question answering, where targeted retrieval of answer-relevant content from large documents is needed;
- Summarization, especially for extractive or hybrid extractive-abstractive pipelines where structural cues are critical for coherence;
- Document representation, where graph-structured encodings can be leveraged by downstream classifiers or retrieval systems.
This modular design admits adaptation to diverse document topologies and tasks where preserving macro- and micro-structure is as critical as capturing semantic content.
6. Technical and Practical Limitations
AutoRev’s graph construction assumes documents possess recognizable hierarchical structure (sections, passages) and that this structure can be parsed reliably. Documents with significant formatting irregularities, unorthodox sectioning, or poorly defined passage boundaries may require additional preprocessing and customization of the graph induction pipeline.
While the reduction in input length is beneficial for computational tractability and memory constraints, there is an attendant risk of omission: critical content (such as nuanced limitations or non-obvious caveats) may be missed if the passage extractor under-attends to subtle, review-relevant details not easily detected by semantic similarity or graph attention. Furthermore, inference speed and compute cost may increase for extremely large or complex document graphs, particularly in real-time applications.
7. Future Research Directions
Anticipated future work includes extension of AutoRev's framework to non-ML academic domains for review, refinement of graph construction (potentially incorporating advanced hierarchical discourse models or hypergraphs to capture finer document nuances), and richer hybrid passage selection strategies. Incorporation of additional document modalities (e.g., figures, tables, pseudo-code) is proposed to support holistic review and question answering. Advancements in self-supervised learning for graph-based retrieval—and its application to diverse large-document NLP scenarios—are also indicated as promising research avenues.
Summary Table: AutoRev System Components
| Component | Method/Model | Purpose |
|---|---|---|
| Document representation | Hierarchical & sequential graph | Encodes structure and local order |
| Passage retrieval | DPR (retrieval) + GNN (GAT) | Extracts salient, review-relevant passages |
| Review generation | Fine-tuned LLM | Generates structured reviews from key passages |
| Evaluation | ROUGE, BERTScore, LLM-as-judge | Quantifies lexical/semantic overlap and judged quality |
AutoRev's design—rooted in graph-based representation and attention-driven extraction—demonstrates a robust, scalable approach to review generation under input length constraints, and establishes a foundation for broader adaptation in document-centric NLP tasks (Chitale et al., 20 May 2025).