ArgTersely: Benchmark for Counter-Argument Generation

Updated 26 March 2026

ArgTersely is a large-scale, human-annotated benchmark for sentence-level counter-argument generation, focusing on concise rebuttals.
It integrates specialized model architectures, including Arg-LlaMA and baselines like GPT-3, with rigorous evaluation metrics such as BLEU, ROUGE, and Arg-Judge scores.
The framework emphasizes targeted instruct-tuning, Chain-of-Thought prompting, and detailed error analysis to address logical fallacies and stance errors.

ArgTersely is a large-scale, human-annotated benchmark and framework for sentence-level counter-argument generation, oriented toward producing concise, direct responses that rebut or weaken argumentative statements. It provides a comprehensive dataset, specialized model architectures, and learned evaluation metrics, all aimed at advancing research in sentence-level, brevity-focused computational argumentation. The benchmark is rooted in data from the ChangeMyView (CMV) debate forum and addresses unique challenges in data annotation, generation strategies, and evaluation of diverse argumentative rebuttals (Lin et al., 2023).

1. Dataset Construction and Annotation Protocol

All data in ArgTersely originate from the ChangeMyView subreddit, where debate takes place via original posts paired with replies that quote specific sentences. The raw extraction yields 20,626 〈topic, quoted sentence, reply〉 triplets. Annotation produces argument–counter-argument (A–CA) pairs by segmenting replies into sentences, filtering for completeness and ethical safety, and correcting grammar. Each item is annotated by at least two annotators with debate experience and bachelor’s degrees; the roughly 30% of cases on which they disagree are resolved by a third annotator. The process required 24 annotators over 42 days. Annotators are instructed to accept only sentences that (i) directly rebut or weaken the referenced argument, (ii) express a discernible viewpoint, and (iii) are self-contained and ethically safe. Sentences that are off-topic, incomplete, purely polite (“safe replies”), or ethically dubious are omitted.
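The acceptance criteria and the third-annotator tie-break described above can be sketched as follows. This is a minimal illustration, not the authors' annotation tooling: the function names, the boolean inputs, and the callable standing in for the third annotator are all assumptions.

```python
from typing import Callable


def is_valid_counter_argument(rebuts: bool, has_viewpoint: bool,
                              self_contained: bool, ethically_safe: bool) -> bool:
    """Criteria (i)-(iii): a sentence is kept only if every check passes."""
    return rebuts and has_viewpoint and self_contained and ethically_safe


def resolve_label(first: bool, second: bool, third: Callable[[], bool]) -> bool:
    """Return the agreed label; consult the third annotator only on disagreement."""
    if first == second:
        return first
    return third()


# Hypothetical judgments from two annotators on one candidate sentence.
first = is_valid_counter_argument(True, True, True, True)
second = is_valid_counter_argument(True, False, True, True)
final = resolve_label(first, second, third=lambda: True)
print(final)
```

Passing the third annotator as a callable means the extra judgment is requested only for the cases where the first two disagree.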

Final dataset statistics are as follows:

Data Split	# A–CA Pairs	Avg. Words/Arg	Avg. Words/CA
Train	28,197	21.74	25.09
Valid	1,000	21.57	27.44
Test	2,000	19.96	34.92
Total	31,197	—	—
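As a quick consistency check on the table above, the split sizes can be summed and the per-split word counts averaged by split size. The dictionary layout and names are illustrative and not taken from the benchmark release.

```python
# Rows copied from the table: (pairs, avg words per argument, avg words per counter-argument).
splits = {
    "train": (28197, 21.74, 25.09),
    "valid": (1000, 21.57, 27.44),
    "test": (2000, 19.96, 34.92),
}

total_pairs = sum(row[0] for row in splits.values())


def weighted_mean(column: int) -> float:
    """Average a per-split mean, weighting each split by its number of pairs."""
    return sum(row[0] * row[column] for row in splits.values()) / total_pairs


avg_arg_words = weighted_mean(1)
avg_ca_words = weighted_mean(2)
print(total_pairs, round(avg_arg_words, 2), round(avg_ca_words, 2))
```

Because the training split dominates, the weighted averages stay close to the training-split values even though test counter-arguments are noticeably longer.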

Although argument “types” are not explicitly annotated, later prompts introduce error classes: Factual Error, Logical Fallacy, and Confirmation Bias.

2. Baseline and Comparative Models

ArgTersely standardizes evaluation by including several baselines:

BART: Pre-trained encoder–decoder, fine-tuned on A–CA pairs.
GPT-2: Decoder-only LM, fine-tuned similarly.
DialoGPT: Conversational GPT-2 variant, fine-tuned on the benchmark data.
LlaMA-7B: Decoder-only LM, applied zero-shot for counter-argumentation.
Alpaca-LoRA: LlaMA-7B augmented via LoRA on the Alpaca instruction set.
GPT-3 (davinci): Few-shot prompted, not fine-tuned.

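Each fine-tuned model is trained on the 28,197 training pairs, with validation and testing on 1,000 and 2,000 pairs, respectively; prompted models such as GPT-3 are evaluated on the same test pairs without training. Fine-tuning employs maximum likelihood with standard hyperparameters (learning rate ≈1e-4–3e-4, batch size ≈32–64).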

3. Arg-LlaMA Architecture and Pipeline

The Arg-LlaMA model builds on the LlaMA-7B backbone, employing targeted instruct-tuning with Low-Rank Adapters (LoRA) applied to all major projection matrices in self-attention layers (rank r=16, α=16). Arg-LlaMA’s instruction set aggregates 2,772 argumentation-focused instructions, generated from hand-crafted seeds and expanded via ChatGPT augmentation, and merges these with the Alpaca set. Generation incorporates Chain-of-Thought (CoT) templates targeting error classes (Factual Error, Logical Fallacy, Confirmation Bias) and post-generation filtering via a BERT-based classifier.
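To give a sense of scale for the LoRA setup, the sketch below counts trainable adapter parameters. The rank and α come from the text; the LLaMA-7B dimensions (hidden size 4096, 32 layers) and the reading of "all major projection matrices" as the query, key, value, and output projections are assumptions, so the totals are indicative only.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed, not stated in the text: LLaMA-7B shape and which matrices are adapted.
HIDDEN = 4096
LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 4  # q, k, v, o projections (assumption)

RANK = 16   # r, from the text
ALPHA = 16  # alpha, from the text

per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
trainable = per_matrix * ADAPTED_MATRICES_PER_LAYER * LAYERS
scaling = ALPHA / RANK  # standard LoRA factor applied to the low-rank update B @ A

print(per_matrix, trainable, scaling)
```

Under these assumptions the adapters add roughly 16.8 million trainable parameters, a small fraction of the 7B backbone, and with α equal to r the low-rank update is applied unscaled.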

Training objectives are as follows:

Autoregressive LM loss:

$\mathcal{L}_{\text{LM}} = -\sum_{t} \log p_\phi(y_t\mid y_{<t},\tau, x)$
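A minimal numeric sketch of this loss, assuming the per-token probabilities $p_\phi(y_t\mid y_{<t},\tau,x)$ have already been read off the model output. The probability values are hypothetical.

```python
import math


def lm_loss(token_probs: list[float]) -> float:
    """Negative log-likelihood of the target sequence: -sum_t log p(y_t | ...)."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical probabilities the model assigned to each gold token y_t.
probs = [0.5, 0.25, 0.8]
loss = lm_loss(probs)
print(loss)
```

The loss equals the negative log of the product of the token probabilities, so it shrinks toward zero as the model grows more confident in every gold token.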

Filter (BERT-based 4-way classifier) loss:

$\mathcal{L}_{\text{filter}} = -\log P_\theta(s_i\mid x, y_i)$
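The filter loss is a four-way cross-entropy, which can be sketched from raw class logits as below. The logits and the gold class index are hypothetical; in the described system they would come from the BERT classifier head.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_loss(logits: list[float], gold_class: int) -> float:
    """Cross-entropy -log P(s_i | x, y_i) for the gold ranking class."""
    return -math.log(softmax(logits)[gold_class])


# Hypothetical logits for the four classes of one (argument, candidate) pair.
example_loss = filter_loss([2.0, 0.5, -1.0, -1.5], gold_class=0)
print(example_loss)
```

An uninformative classifier with equal logits incurs a loss of log 4, which is a convenient baseline when monitoring training.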

LM tuning uses learning rate 3×10⁻⁴, batch=256, gradient accumulation=16, for 5 epochs (4×RTX3090, AdamW). The BERT filter is trained with learning rate 1×10⁻⁵, batch=64, for 2 epochs.

4. Arg-Judge Evaluator and Evaluation Protocols

Arg-Judge is a BERT-base classifier aligned with the filtering model in Arg-LlaMA. Given “[CLS] original argument [SEP] candidate counter-argument [SEP]” as input, it predicts one of four ranking classes and also outputs a continuous rescaled score $\hat s\in[0,4]$. Training uses Ranking Data (RD) from annotation logs: 20,000 samples, each with a ground-truth rebuttal, an alternative from the same thread, a “safe reply,” and an off-topic response. The training loss is the same cross-entropy used for the filter.
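One way to picture a Ranking Data sample and the classifier input is sketched below. The class name, field names, and the mapping of candidates to labels 0 to 3 (best to worst) are assumptions made for illustration. In practice a BERT tokenizer would insert the special tokens itself, and the literal strings here only mirror the template quoted above.

```python
from dataclasses import dataclass


def format_input(argument: str, candidate: str) -> str:
    """Mirror the '[CLS] argument [SEP] candidate [SEP]' template as plain text."""
    return f"[CLS] {argument} [SEP] {candidate} [SEP]"


@dataclass
class RankingSample:
    argument: str
    ground_truth: str
    same_thread_alternative: str
    safe_reply: str
    off_topic: str

    def labelled_pairs(self) -> list[tuple[str, int]]:
        """Pair each candidate with an assumed rank label (0 = best, 3 = worst)."""
        candidates = [self.ground_truth, self.same_thread_alternative,
                      self.safe_reply, self.off_topic]
        return [(format_input(self.argument, c), label)
                for label, c in enumerate(candidates)]


# Hypothetical sample; the texts are invented for illustration.
sample = RankingSample(
    argument="Remote work lowers productivity.",
    ground_truth="Output per hour rose in several firms after they went remote.",
    same_thread_alternative="Commuting time is lost time too.",
    safe_reply="That is an interesting point.",
    off_topic="I prefer tea over coffee.",
)
pairs = sample.labelled_pairs()
print(pairs[0])
```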

Evaluation combines several metrics:

BLEU-1, ROUGE-L, METEOR: Standard n-gram overlap.
ChatGPT Eval: Averages stance-opposition and completeness sub-scores:

$S_{\mathrm{gpt}} = \tfrac{1}{2}S_{\mathrm{st}} + \tfrac{1}{2}S_{\mathrm{com}}$

Arg-Judge Score:

$S_{aj} = \max\Bigl\{\tfrac{1}{9}-\tfrac{\hat s}{36},\;1-\tfrac{9}{4}\,\hat s\Bigr\} \in [0,1]$

Human Evaluation: Likert (1–5) on Grammaticality, Appropriateness, Content Richness, Logic, Persuasiveness, and Top-1 preference rate.
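The two composite scores above translate directly into code. The sketch below implements the formulas as written; the function names and the range check on $\hat s$ are illustrative additions.

```python
def chatgpt_eval(stance: float, completeness: float) -> float:
    """S_gpt: equal-weight average of the stance and completeness sub-scores."""
    return 0.5 * stance + 0.5 * completeness


def arg_judge_score(s_hat: float) -> float:
    """S_aj: piecewise-linear map from the rescaled score s_hat in [0, 4] to [0, 1]."""
    if not 0.0 <= s_hat <= 4.0:
        raise ValueError("s_hat must lie in [0, 4]")
    return max(1 / 9 - s_hat / 36, 1 - 9 / 4 * s_hat)


for s_hat in (0.0, 0.4, 2.0, 4.0):
    print(s_hat, arg_judge_score(s_hat))
```

The mapping is decreasing in $\hat s$: the steep branch $1-\tfrac{9}{4}\hat s$ applies up to $\hat s = 0.4$, where both branches equal 0.1, and the shallow branch then carries the score down to 0 at $\hat s = 4$. A lower $\hat s$ therefore corresponds to a better counter-argument.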

5. Empirical Results and Analysis

Arg-LlaMA achieves the best reported scores on the automatic and human metrics (a dash marks a value not listed here):

Model	BLEU-1	ROUGE-L	METEOR	ChatGPT Eval	Arg-Judge
Arg-LlaMA	18.60	22.41	19.29	50.48	55.78
Alpaca-LoRA	15.56	18.67	—	—	47.51
GPT-3	—	—	—	—	42.33

Human evaluation rates Arg-LlaMA highest across all five dimensions (mean ≈4.1–4.3), with Top-1 preference at 62% (versus 27% for Alpaca-LoRA, 10% for LlaMA). Wilcoxon tests confirm statistical significance ($p<0.05$).

Case analysis underscores Arg-LlaMA’s capacity for identifying implicit logical fallacies, e.g., rebutting by exposing invalid premises, in contrast to baseline models’ tendency to produce non-oppositional or “safe reply” content.

6. Error Typology and Ablation Studies

Non-instruct-tuned LMs commonly fail by extending the argument’s stance (stance error) or yielding substance-free “safe replies.” Even with CoT prompting, models may produce merely partial rebuttals if critical premises are overlooked.

Ablation reveals the contributions of system components:

Component Changed	Arg-Judge Score
Full Arg-LlaMA Pipeline	55.78
Remove argumentation-focused instructions	51.30
Remove all instruct-tuning (LlaMA only)	43.77
Replace CoT with generic prompt	48.13
Use only a single error template	51.90
Remove BERT-based filter	35.47

These results demonstrate the importance of argumentation-specific instructions, CoT prompting, multi-error coverage, and model-based filtering for both robustness and alignment with human preferences.
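The size of each ablation effect is easier to compare as a drop from the full pipeline. The sketch below recomputes those drops from the table values; the dictionary keys simply repeat the table rows.

```python
FULL_PIPELINE = 55.78

# Arg-Judge scores copied from the ablation table.
ablations = {
    "Remove argumentation-focused instructions": 51.30,
    "Remove all instruct-tuning (LlaMA only)": 43.77,
    "Replace CoT with generic prompt": 48.13,
    "Use only a single error template": 51.90,
    "Remove BERT-based filter": 35.47,
}

# Drop in Arg-Judge score relative to the full pipeline, rounded to two decimals.
drops = {name: round(FULL_PIPELINE - score, 2) for name, score in ablations.items()}
largest = max(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{drop:6.2f}  {name}")
```

On these numbers, removing the BERT-based filter produces the largest drop and restricting generation to a single error template the smallest.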

7. Extensions, Limitations, and Significance

Proposed future extensions include diversifying Arg-Judge’s human-preference data, scaling evaluators to larger LLM backbones, and exploring richer argument taxonomies (e.g., causal and probabilistic structures). Current limitations are that Arg-Judge has been validated only on CMV debates and only with BERT-scale evaluators, and that reliance on CMV-specific community norms may introduce subtle annotation biases.

ArgTersely establishes the first comprehensive, large-scale human-annotated benchmark for sentence-level counter-argument generation. Its data construction, instruct-tuned Arg-LlaMA model, multi-type error CoT prompting, and learned BERT-based evaluation together offer a rigorous empirical and methodological foundation for research on succinct, direct counter-argumentation at the sentence level (Lin et al., 2023).


References (1)

Lin et al. (2023). Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation.
