PeerArg: Transparent Peer Review Pipeline

Updated 20 May 2026

PeerArg is a pipeline that combines large language models and symbolic bipolar argumentation to transform peer reviews into interpretable acceptance verdicts.
It employs a three-stage architecture that extracts quantitative bipolar argumentation frameworks from reviews, aggregates them via argumentation and decision-vector paths, and applies symbolic processing for traceability.
Evaluated on ICLR, ACL, and medical datasets, PeerArg consistently outperforms end-to-end LLM baselines in acceptance prediction accuracy while providing enhanced transparency.

PeerArg is a pipeline for supporting and elucidating the peer review and decision-making process in scientific publication, integrating LLMs with symbolic knowledge representation in the form of quantitative bipolar argumentation frameworks. It processes sets of peer reviews to predict paper acceptance while keeping every intermediate reasoning step inspectable, which makes it more interpretable than end-to-end black-box LLM approaches. PeerArg is evaluated on datasets covering ICLR conference reviews, ACL reviews, and medical journal reviews, and its variants surpass a strong end-to-end LLM baseline in acceptance prediction accuracy (Sukpanichnant et al., 2024).

1. Pipeline Architecture

PeerArg consists of a three-stage architecture, combining neural and symbolic methods to convert raw reviews to an acceptance/rejection verdict.

Stage 1 – Review QBAF Extraction

Each review is converted into a quantitative bipolar argumentation framework (QBAF) $Q_i = \langle X_i, Att_i, Supp_i, \beta_i \rangle$, whose arguments comprise:

Text arguments $T_i$: one per sentence.
Aspect arguments $A = \{\mathit{APR, CLA, NOV, EMP, CMP, SUB, IMP}\}$ (seven review aspects).
Decision argument: representing the final accept/reject outcome.

Processing involves:

Aspect classification: Using a few-shot LLM, each sentence is tagged with one or more aspects.
Sentiment analysis: Each sentence is assigned a sentiment (positive/neutral/negative) and strength in $[0,1]$ via the CardiffNLP RoBERTa Twitter sentiment model.
Base-score setting: Arguments are assigned an initial score $\beta$ (default 0.5 or the sentence's sentiment strength); aspects and decision start with $\beta(a)=0.5$.
QBAF semantics computation: DF-QuAD or MLP-based semantics propagate argument strengths ($\sigma(a) \in [0,1]$) throughout the graph; aspect $\to$ Decision edges become attacks or supports depending on $\sigma(\mathrm{aspect})$.
Review “vote”: The computed $\sigma(\mathrm{Decision})$ gives a quantitative vote per review.
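The first steps of this list, up to base-score setting, can be sketched as a small data structure. This is a minimal illustration and not the authors' implementation: the `QBAF` container, the function name `build_review_qbaf`, the input tuple layout, and the rule that positive sentences support an aspect, negative ones attack it, and neutral ones are skipped are all assumptions made for the example.

```python
from dataclasses import dataclass, field

ASPECTS = ("APR", "CLA", "NOV", "EMP", "CMP", "SUB", "IMP")
DECISION = "Decision"


@dataclass
class QBAF:
    """Hypothetical container: base scores plus attack/support edge sets."""
    base: dict = field(default_factory=dict)     # argument -> base score in [0, 1]
    attacks: set = field(default_factory=set)    # (source, target) pairs
    supports: set = field(default_factory=set)   # (source, target) pairs


def build_review_qbaf(sentences, use_sentiment_strength=True):
    """Build the sentence-to-aspect layer of one review QBAF.

    `sentences` is a list of (aspects, polarity, strength) tuples, i.e. the
    already-computed output of aspect classification and sentiment analysis.
    Aspect -> Decision edges are not added here, because they are only typed
    as attack or support after the aspect strengths have been computed.
    """
    q = QBAF()
    for aspect in ASPECTS:
        q.base[aspect] = 0.5
    q.base[DECISION] = 0.5
    for i, (aspects, polarity, strength) in enumerate(sentences):
        if polarity == "neutral":
            continue  # assumption: neutral sentences contribute no argument
        text_arg = f"t{i}"
        q.base[text_arg] = strength if use_sentiment_strength else 0.5
        edges = q.supports if polarity == "positive" else q.attacks
        for aspect in aspects:
            edges.add((text_arg, aspect))
    return q


review = build_review_qbaf([
    (["CLA"], "positive", 0.9),
    (["NOV", "EMP"], "negative", 0.7),
    (["SUB"], "neutral", 0.4),
])
```

Passing `use_sentiment_strength=False` gives every text argument the default base score of 0.5 instead of its sentiment strength, which mirrors the two base-score settings described above.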

Stage 2 – Combination of Review QBAFs

Each review QBAF is trimmed to retain only the aspect arguments and Decision. The trimmed QBAFs are then merged into a pre-Multi-Party Argumentation Framework (pre-MPAF) with three components:

The argument set, consisting of the aspect arguments and Decision.
The edge set, which is the undirected union of all aspect $\to$ Decision edges across reviews.
The score collection, which gathers the review-specific scores for each argument.
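The trimming and merging step can be sketched as follows. The function names `trim` and `merge`, the dictionary layout, and the choice to give every aspect present in a review an edge to Decision are assumptions for illustration only.

```python
ASPECTS = ("APR", "CLA", "NOV", "EMP", "CMP", "SUB", "IMP")


def trim(strengths, aspects=ASPECTS, decision="Decision"):
    """Drop text arguments, keeping only aspect and Decision strengths."""
    keep = set(aspects) | {decision}
    return {arg: s for arg, s in strengths.items() if arg in keep}


def merge(trimmed_reviews, decision="Decision"):
    """Merge trimmed reviews into (arguments, undirected edges, score lists)."""
    arguments = sorted({arg for review in trimmed_reviews for arg in review})
    # Edges are undirected at this stage, so each one is stored as a frozenset.
    edges = {frozenset((arg, decision)) for arg in arguments if arg != decision}
    # One score per review in which the argument occurs.
    scores = {
        arg: [review[arg] for review in trimmed_reviews if arg in review]
        for arg in arguments
    }
    return arguments, edges, scores


r1 = trim({"t0": 0.9, "CLA": 0.8, "Decision": 0.7})
r2 = trim({"CLA": 0.4, "NOV": 0.6, "Decision": 0.5})
arguments, edges, scores = merge([r1, r2])
```

Keeping the edges undirected and untyped here reflects that attack or support is only decided later, once the per-review scores have been aggregated.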

Stage 3 – Pre-MPAF Aggregation

Two alternative routes yield the final verdict:

Argumentation path: Aggregate each aspect’s strength by mean. Build the combined MPAF using thresholds on the mean. Recompute strengths via DF-QuAD or MLP semantics. Decide “accept” if the resulting Decision strength reaches the acceptance threshold, else “reject.”
Decision-vector path: Discretize each review’s Decision strength to a categorical vote (binary or five-level). Aggregate via majority or all-accept rules for a final outcome.

2. Argumentation-Based Knowledge Representation

PeerArg deploys argumentation theory at its core, modeling review reasoning structures explicitly as graphs.

Bipolar Argumentation Frameworks (BAF):

$\langle X, Att, Supp \rangle$

$X$ is a finite set of arguments, with attack ($Att$) and support ($Supp$) relations ($Att, Supp \subseteq X \times X$).

Quantitative BAFs (QBAF):

$\langle X, Att, Supp, \beta \rangle$

where $\beta : X \to [0,1]$ assigns a base-score to each argument.

DF-QuAD Semantics:

Propagate strengths recursively:

$\sigma(a) = \mathcal{C}\big(\beta(a),\ \mathcal{F}(\{\sigma(b) : (b,a) \in Att\}),\ \mathcal{F}(\{\sigma(b) : (b,a) \in Supp\})\big)$

with $\mathcal{F}$ aggregating supporters/attackers and non-linear combination via $\mathcal{C}$.
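A minimal sketch of DF-QuAD on an acyclic QBAF, assuming the standard formulation: attacker strengths and supporter strengths are each aggregated with a probabilistic sum, and the base score is then moved toward 0 or toward 1 in proportion to the difference between the two aggregates. The function names and the edge-set representation are illustrative choices, not taken from PeerArg's code.

```python
def aggregate(values):
    """Probabilistic sum: 1 - prod(1 - v); returns 0.0 for no inputs."""
    remaining = 1.0
    for v in values:
        remaining *= 1.0 - v
    return 1.0 - remaining


def combine(base, attack, support):
    """Shift the base score down if attack dominates, up otherwise."""
    if attack >= support:
        return base - base * (attack - support)
    return base + (1.0 - base) * (support - attack)


def dfquad(base, attacks, supports):
    """Compute strengths for every argument of an acyclic QBAF.

    `base` maps argument -> base score; `attacks` and `supports` are sets of
    (source, target) pairs. Cycles are not handled and would recurse forever.
    """
    memo = {}

    def strength(arg):
        if arg not in memo:
            va = aggregate(strength(b) for (b, t) in attacks if t == arg)
            vs = aggregate(strength(b) for (b, t) in supports if t == arg)
            memo[arg] = combine(base[arg], va, vs)
        return memo[arg]

    return {arg: strength(arg) for arg in base}


sigma = dfquad(
    base={"t0": 0.8, "t1": 0.6, "CLA": 0.5},
    attacks={("t1", "CLA")},
    supports={("t0", "CLA")},
)
```

In this toy graph the supporter (0.8) outweighs the attacker (0.6), so the strength of `CLA` rises from its base score of 0.5 to 0.5 + 0.5 × 0.2 = 0.6.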

MLP-Based Semantics:

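Iterative updates, starting from $s_a^{(0)} = \beta(a)$ and using the logistic function $\varphi(z) = 1/(1+e^{-z})$:

$\alpha_a^{(i+1)} = \sum_{(b,a) \in Supp} s_b^{(i)} - \sum_{(b,a) \in Att} s_b^{(i)}$

$s_a^{(i+1)} = \varphi\left(\ln\frac{\beta(a)}{1-\beta(a)} + \alpha_a^{(i+1)}\right)$

until fixed point $\sigma(a) = \lim_{i \to \infty} s_a^{(i)}$.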
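A minimal sketch of the iteration, assuming the usual MLP-based formulation in which each argument sums its supporters' strengths, subtracts its attackers' strengths, adds the logit of its own base score, and passes the result through a logistic function. The function name, the convergence tolerance, and the restriction to base scores strictly between 0 and 1 are assumptions of this example.

```python
import math


def mlp_semantics(base, attacks, supports, max_iters=1000, tol=1e-9):
    """Iterate MLP-style updates until strengths stop changing.

    `base` maps argument -> base score in the open interval (0, 1);
    `attacks` and `supports` are sets of (source, target) pairs.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    current = dict(base)  # strengths start at the base scores
    for _ in range(max_iters):
        updated = {}
        for arg in base:
            energy = (
                sum(current[b] for (b, t) in supports if t == arg)
                - sum(current[b] for (b, t) in attacks if t == arg)
            )
            updated[arg] = logistic(logit(base[arg]) + energy)
        if max(abs(updated[a] - current[a]) for a in base) < tol:
            return updated
        current = updated
    return current  # not converged within max_iters; last iterate returned


sigma = mlp_semantics(
    base={"t0": 0.8, "CLA": 0.5},
    attacks=set(),
    supports={("t0", "CLA")},
)
```

An argument without parents keeps its base score, since the logistic function inverts the logit. Here `CLA` ends at the logistic of 0.8 because its only supporter has strength 0.8 and its own base score contributes a logit of 0.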

Review QBAFs are constructed for each review, connecting text sentences to aspects and aspects to the decision. After computation, sentence nodes are discarded to derive trimmed QBAFs, which constitute the input to the pre-MPAF aggregation stage.

3. LLM Integration and Symbolic Processing

PeerArg integrates LLMs in two capacities:

Few-shot LLM for Aspect Classification: 4-bit quantized Mistral-7B-v0.1, primed with aspect definitions and labeled examples, tags sentences by aspect.
Sentiment Analysis: CardiffNLP RoBERTa (Twitter sentiment), providing discrete sentiment class and confidence.
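The few-shot setup for aspect classification can be sketched as a prompt builder and an output parser. The prompt wording, the function names `build_aspect_prompt` and `parse_aspects`, and the short aspect glosses in the usage example are hypothetical; the source specifies only that the model is primed with aspect definitions and labeled examples.

```python
ASPECTS = ("APR", "CLA", "NOV", "EMP", "CMP", "SUB", "IMP")


def build_aspect_prompt(definitions, examples, sentence):
    """Assemble a few-shot prompt: definitions, labeled examples, then the query."""
    lines = ["Tag the review sentence with every aspect that applies.", "", "Aspects:"]
    lines += [f"- {code}: {text}" for code, text in definitions.items()]
    lines.append("")
    for ex_sentence, ex_aspects in examples:
        lines += [f"Sentence: {ex_sentence}", f"Aspects: {', '.join(ex_aspects)}", ""]
    lines += [f"Sentence: {sentence}", "Aspects:"]
    return "\n".join(lines)


def parse_aspects(completion):
    """Read aspect codes from the first line of a model completion."""
    stripped = completion.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    found = {token.strip().upper() for token in first_line.split(",")}
    return [aspect for aspect in ASPECTS if aspect in found]


prompt = build_aspect_prompt(
    definitions={"CLA": "clarity of the writing", "NOV": "novelty of the work"},
    examples=[("The paper is hard to follow.", ["CLA"])],
    sentence="The idea is new but the notation is confusing.",
)
tags = parse_aspects("cla, NOV\nSentence: ...")
```

Parsing only the first line and filtering against the known aspect codes guards against the model continuing the pattern with further invented sentences.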

Additionally, a baseline end-to-end LLM approach predicts paper acceptance solely from raw reviews using few-shot Mistral-7B, employing a templated prompt with labeled and unlabeled review sets. All other pipeline components—argument graph construction, semantics application, and aggregation—are purely symbolic, ensuring transparent processing.

4. Acceptance Prediction Algorithms

PeerArg supports two families of aggregation mechanisms to combine multi-review QBAF data into a single accept/reject decision:

Argumentation-based (MPAF)

Compute the aspect-wise mean strength from the review-specific strengths of each aspect.
Edges aspect $\to$ Decision are support or attack depending on that mean strength.
Base scores are set separately for the aspects and for the decision.
Apply argumentation semantics (DF-QuAD/MLP) for $\sigma(\mathrm{Decision})$, which is thresholded for the final verdict.
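A sketch of the argumentation-based route under explicit assumptions: the 0.5 threshold, the Decision base score of 0.5, and the base-score rule (supporting aspects keep their mean, attacking aspects use one minus their mean so that a lower mean yields a stronger attack) are all hypothetical choices made for this example, and DF-QuAD in its standard form stands in for the semantics.

```python
from statistics import mean


def aggregate(values):
    """DF-QuAD probabilistic sum: 1 - prod(1 - v)."""
    remaining = 1.0
    for v in values:
        remaining *= 1.0 - v
    return 1.0 - remaining


def combine(base, attack, support):
    """DF-QuAD combination of a base score with aggregated attack/support."""
    if attack >= support:
        return base - base * (attack - support)
    return base + (1.0 - base) * (support - attack)


def argumentation_verdict(per_review, threshold=0.5, decision_base=0.5):
    """per_review: list of dicts mapping aspect -> strength in one review."""
    aspects = sorted({a for review in per_review for a in review})
    avg = {a: mean(r[a] for r in per_review if a in r) for a in aspects}
    # Assumed edge typing: mean at or above the threshold supports Decision.
    supporters = [avg[a] for a in aspects if avg[a] >= threshold]
    # Assumed base score for attackers: 1 - mean (lower mean, stronger attack).
    attackers = [1.0 - avg[a] for a in aspects if avg[a] < threshold]
    strength = combine(decision_base, aggregate(attackers), aggregate(supporters))
    return ("accept" if strength >= threshold else "reject"), strength


good = argumentation_verdict([{"CLA": 0.9, "NOV": 0.8}, {"CLA": 0.7, "NOV": 0.6}])
bad = argumentation_verdict([{"CLA": 0.1}, {"CLA": 0.3}])
```

Because the aspect means and edge types are plain values, a user can inspect which aspects pushed the verdict in either direction before the final threshold is applied.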

Decision-vector Aggregation

Each review’s $\sigma(\mathrm{Decision})$ is discretized:
- Binary: strengths below a cutoff map to reject, strengths above it to accept.
- Five-level: strengths are binned into five ordered levels, from strong reject to strong accept.
Aggregation functions:
- Majority or all-accept in binary and five-level modes, mapping votes to numerical weights and summing.
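The decision-vector route can be sketched as below. The 0.5 binary cutoff, the equal-width five-level bins, the names of the three intermediate levels, the integer weights, and the reading of "majority" as a positive weight sum are all assumptions of this example; the source names only the strong reject and strong accept endpoints and the two rule families.

```python
FIVE_LEVELS = ("strong reject", "weak reject", "borderline", "weak accept", "strong accept")
WEIGHTS = {
    "reject": -1, "accept": 1,
    **dict(zip(FIVE_LEVELS, (-2, -1, 0, 1, 2))),
}


def to_binary(strength, cutoff=0.5):
    """Map a Decision strength to a binary vote (assumed cutoff)."""
    return "accept" if strength >= cutoff else "reject"


def to_five_level(strength):
    """Map a Decision strength to one of five equal-width bins (assumed)."""
    return FIVE_LEVELS[min(int(strength * 5), 4)]


def majority(votes):
    """Accept when the summed vote weights are positive."""
    return "accept" if sum(WEIGHTS[v] for v in votes) > 0 else "reject"


def all_accept(votes):
    """Accept only when every vote lies on the accept side."""
    return "accept" if all(WEIGHTS[v] > 0 for v in votes) else "reject"


strengths = [0.9, 0.7, 0.15]
binary_votes = [to_binary(s) for s in strengths]
five_votes = [to_five_level(s) for s in strengths]
```

With these three reviews, both vote encodings give "accept" under the majority rule and "reject" under the all-accept rule, which shows how much the chosen aggregation rule can matter for borderline papers.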

These two algorithms address the inherent subjectivity and potential bias of peer review by exposing intermediate representations and giving users control over aggregation logic.

5. Datasets and Quantitative Results

PeerArg was evaluated on three peer review datasets:

PRA (Peer-Review-Analyze): ICLR 2018 reviews, sentence-level aspect and sentiment labels.
PeerRead (ACL 2017 subset): Reviews with accept/reject decisions and reviewer aspect scores.
MOPRD (medical): Medical open journal reviews, including editorial decisions.

Baselines:

A strong end-to-end LLM (few-shot Mistral-7B) serves as the comparative baseline.

Experimental settings (selected):

Text argument base-scores: sentiment strength vs. default 0.5.
QBAF semantics: DF-QuAD vs. MLP.
Decision strength: binary vs. five-level.
Aggregation: argumentation (DF-QuAD/MLP on MPAF) vs. voting (majority/all-accept).

Key accuracy results:

Dataset	PeerArg Best Variant	Macro-F1 (%)	End-to-end LLM (%)
PRA	Sentiment+MLP+binary+majority	76.64	72.5
PeerRead (default aspect)	Sentiment+DF-QuAD+argumentation	66.29	58.0
PeerRead (reviewer aspect)	Default+DF-QuAD+5-level+all-accept	76.56	—
PRA (overall best)	Sentiment+DF-QuAD+5-level+majority	76.64	72.5
PeerRead (overall best)	Sentiment+DF-QuAD+5-level+majority	61.87	58.0
MOPRD (overall best)	Sentiment+DF-QuAD+5-level+majority	68.32	59.5

On every dataset with a reported baseline, the best PeerArg variant scores above the end-to-end LLM, by margins of roughly 4 to 9 points.
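The gaps between PeerArg and the baseline can be read directly off the table. The short computation below uses only the figures reported above; the row with no baseline value is omitted, and the dictionary layout is just a convenience for the example.

```python
# (best PeerArg variant, end-to-end LLM baseline), both taken from the table above
results = {
    "PRA": (76.64, 72.5),
    "PeerRead (default aspect)": (66.29, 58.0),
    "PeerRead (overall best)": (61.87, 58.0),
    "MOPRD (overall best)": (68.32, 59.5),
}

# Margin of PeerArg over the baseline, in points
margins = {name: round(best - base, 2) for name, (best, base) in results.items()}

for name, margin in margins.items():
    print(f"{name}: +{margin:.2f}")
```

The margins come out at 4.14, 8.29, 3.87, and 8.82 points respectively, so the smallest reported gain is on the PeerRead overall-best configuration and the largest on MOPRD.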

6. Interpretability and Trust Characteristics

PeerArg delivers high interpretability by explicitly modeling reviewer reasoning:

Intermediate artifacts: Each review is mapped to a QBAF, making explicit which sentences support or attack each aspect.
Aspect-level influence: The combined framework clarifies how each aspect impacts the final decision with transparent numerical strengths.
End-to-end traceability: Every step, from raw text to aspect assignment, sentiment detection, base-score setting, argument strength computation, and aggregation, is explicit and accessible.
Customizability: Thresholds and aggregation rules may be tuned or substituted, enabling control over the method’s transparency and outcome logic.

By contrast, the end-to-end LLM emits a black-box "accept"/"reject" output without rationale. PeerArg's symbolic reasoning artifacts enhance transparency and trustworthiness for users seeking interpretable, auditable peer review decision support (Sukpanichnant et al., 2024).


References

Sukpanichnant et al. (2024). PeerArg: Argumentative Peer Review with LLMs.
