ContArgA Corpus Dataset
- ContArgA is a large-scale corpus of debates spanning a decade on debate.org, paired with comprehensive user profiles.
- The dataset provides detailed annotations, including vote-based winner labels, linguistic features, and extensive metadata for argument analysis.
- It supports interdisciplinary research at the intersection of computational linguistics, social psychology, and computational social science.
The ContArgA Corpus is a large-scale, richly annotated dataset designed to support research on user and language effects in online argumentation. Collected from debate.org and spanning a ten-year period, ContArgA provides extensive metadata on both debate content and participant profiles, facilitating analysis at the intersection of computational linguistics, social psychology, and computational social science (Durmus et al., 2019).
1. Corpus Composition and Scale
ContArgA comprises debates conducted on debate.org between October 2007 and November 2017, combining large-scale argumentation data with a specific emphasis on user characteristics.
- Debates: 78,376 two-sided (pro/con) debates
- Audience Comments: 606,102 comments
- Votes: 199,210 fine-grained vote annotations (per debater, per criterion)
- Unique Users: 45,348 (including both debaters and voters)
Debate Structure and Participation
- Debate Length Distribution:
- Rounds: 65% of debates have 3–5 rounds; 20% exceed 6 rounds; 15% are 2-round mini debates
- User Activity Distribution:
- Debates per user: 42% of users appear in only one debate; 28% in 2–5 debates; 20% in 6–20 debates; the top 10% account for 59% of participations
- Votes per user: 55% cast a single vote; 30% cast 2–10 votes; the remaining users are highly active
2. User Profile and Metadata Schema
ContArgA describes each of its 45,348 users with a comprehensive, flat profile schema that is directly reflected in the released CSV/JSON data.
Demographic Attributes
- user_id: Unique identifier (string)
- age: Self-reported age bracket (integer)
- gender: {Male, Female, Other, Unspecified}
- education_level: {High_School, Undergraduate, Graduate, …}
- ethnicity: {White, Black, Asian, Hispanic, Other, Unspecified}
- income_bracket: {<25K, 25–50K, 50–75K, …}
- occupation: Free-text field
Ideological and Belief Attributes
- political_ideology: {Left, Center-Left, Center, Center-Right, Right}
- religious_ideology: {Atheist, Agnostic, Christian, Muslim, Other}
- stance_[TOPIC]: Values in {Pro, Con, Undecided, Unspecified} for each of 48 major issues
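These per-topic stance fields support simple user–user agreement measures of the kind summarized later as audience_similarity_idx. The sketch below is a hypothetical Python reconstruction; the corpus may define its similarity index differently:

```python
def stance_similarity(user_a, user_b, topics):
    """Fraction of agreeing stances over topics where both users declared Pro or Con.
    Hypothetical reconstruction; the released audience_similarity_idx may differ."""
    declared = [t for t in topics
                if user_a.get(f"stance_{t}") in ("Pro", "Con")
                and user_b.get(f"stance_{t}") in ("Pro", "Con")]
    if not declared:
        return 0.0
    agree = sum(user_a[f"stance_{t}"] == user_b[f"stance_{t}"] for t in declared)
    return agree / len(declared)

# Hypothetical profile rows using the stance_[TOPIC] fields described above.
alice = {"stance_ABORTION": "Pro", "stance_GUN_CONTROL": "Con"}
bob = {"stance_ABORTION": "Pro", "stance_GUN_CONTROL": "Pro"}
print(stance_similarity(alice, bob, ["ABORTION", "GUN_CONTROL"]))  # 0.5
```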
Platform-Activity and Social-Network Attributes
- debates_participated, debates_won, debates_lost
- comments_posted, votes_cast, opinion_questions_asked, poll_votes_cast
- in_degree, out_degree: Graph degree in voter/commenter–debater bipartite network
- hub_score, authority_score: HITS scores computed from the platform’s user interaction graphs
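The hub and authority scores can be recomputed from the released interaction data. The following is a minimal sketch assuming a hypothetical edge list of (voter_id, target_debater_id) pairs drawn from the votes file, using the standard HITS implementation in networkx:

```python
import networkx as nx

# Hypothetical interactions: each vote induces a voter -> debater edge.
# In the released data these pairs come from votes/ (voter_id, target_debater_id).
edges = [
    ("voter_1", "debater_a"),
    ("voter_2", "debater_a"),
    ("voter_2", "debater_b"),
    ("voter_3", "debater_b"),
]
G = nx.DiGraph(edges)

# HITS: hubs point to good authorities; authorities are pointed to by good hubs.
# In this bipartite graph, voters accumulate hub scores, debaters authority scores.
hub_score, authority_score = nx.hits(G, max_iter=1000, normalized=True)

print(round(hub_score["voter_2"], 3), round(authority_score["debater_a"], 3))
```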
3. Debate Content, Annotation, and Feature Extraction
Vote-Based Winner Annotations
Each debate is annotated via community voting on four dimensions:
- Convincingness (3 points to winner)
- Reliability of sources (2 points)
- Conduct during debate (1 point)
- Spelling and grammar (1 point)
The winner is determined by aggregate points. Ties or debates with fewer than 5 voters are filtered from prediction analyses.
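To make the scoring concrete, the sketch below aggregates criterion points per debater and filters ties; the per-vote record format is hypothetical, but the criteria and weights follow the list above:

```python
from collections import defaultdict

# Point weights for the four voting criteria listed above.
WEIGHTS = {"convincingness": 3, "sources": 2, "conduct": 1, "spelling_grammar": 1}

def debate_winner(votes):
    """votes: list of per-criterion awards, e.g. {"debater": "pro", "criterion": "sources"}.
    The record format is hypothetical; criteria and weights follow the corpus description."""
    totals = defaultdict(int)
    for vote in votes:
        totals[vote["debater"]] += WEIGHTS[vote["criterion"]]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ties are filtered from prediction analyses
    return ranked[0][0]

votes = [
    {"debater": "pro", "criterion": "convincingness"},   # 3 points
    {"debater": "con", "criterion": "sources"},          # 2 points
    {"debater": "con", "criterion": "conduct"},          # 1 point
    {"debater": "pro", "criterion": "spelling_grammar"}, # 1 point
]
print(debate_winner(votes))  # pro (4 points) beats con (3 points)
```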
Linguistic and Structural Features
For each side-turn (all text contributed by the pro or the con side), the extracted features include:
- Token counts, type/token ratio
- Sentiment (positive/negative lexicons), subjectivity scores
- Argumentation lexicon matches (“therefore”, “hence”)
- Modality features (modal verbs, hedges)
- Evidence markers (numbers, citations, “according to”, URLs)
- Discourse marker counts (“but”, “however”, “moreover”)
- Politeness strategies [Danescu-Niculescu-Mizil et al. 2013]
- Swear-word counts, personal-pronoun counts
- tf–idf feature vectors (unigrams/bigrams)
- Conversational flow features [Zhang et al. 2016], e.g., overlap in key terms across rounds
ContArgA does not include classical argument mining annotations such as premise-claim structure.
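Several of these surface features are straightforward to recompute from raw turn text. The sketch below illustrates a few of them (token count, type/token ratio, hedge and discourse-marker counts, simple evidence markers); the marker lists are small illustrative stand-ins for the fuller lexicons used in the corpus:

```python
import re

# Small illustrative marker lists; the corpus uses fuller lexicons.
HEDGES = {"might", "may", "perhaps", "possibly", "seems", "suggests"}
DISCOURSE_MARKERS = {"but", "however", "moreover", "therefore", "hence"}

def surface_features(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # guard against empty turns
    return {
        "token_count": len(tokens),
        "type_token_ratio": len(set(tokens)) / n,
        "hedge_word_ratio": sum(t in HEDGES for t in tokens) / n,
        "discourse_marker_count": sum(t in DISCOURSE_MARKERS for t in tokens),
        "evidence_markers": text.count("http") + len(re.findall(r"\d+", text)),
    }

print(surface_features(
    "However, the 2016 data suggests this might be true; see http://example.org."))
```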
4. Statistical Distribution and Winner Prediction Models
Summary statistics (over 1,635 debates used for winner prediction experiments) are as follows:
| Feature | Mean (μ) | Variance (σ²) |
|---|---|---|
| debater_experience | 4.85 | 27.3 |
| debater_success_prior | 0.47 | 0.12 |
| audience_similarity_idx | 0.22 | 0.05 |
| social_hub_score | 0.013 | 1.1e–4 |
| text_length_sentences | 42.7 | 156.4 |
| subjectivity_score | 0.33 | 0.08 |
| hedge_word_ratio | 0.018 | 2.4e–4 |
A logistic regression model for predicting the probability that a debater wins is specified as
$$P(\text{win} \mid \mathbf{x}) = \sigma\!\left(\beta_0 + \sum_{i} \beta_i x_i\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$
where $\sigma$ denotes the logistic sigmoid, $\mathbf{x}$ is the feature vector for a debate side, and $\beta_i$ are the coefficients for the corresponding features.
- Accuracy:
- All user features: 68.43% (majority baseline 57.23%)
- All linguistic features: 60.28%
- Combined (user + linguistic): 71.35%
- Effect Sizes (standardized coefficients, user-only model):
- $\beta_{\text{experience}}$: +0.12
- $\beta_{\text{success\_prior}}$: +0.21
- $\beta_{\text{audience\_similarity}}$: +0.08
- $\beta_{\text{hub\_score}}$: +0.10
These results indicate that debater seniority, prior success, and audience similarity are statistically significant predictors of debate outcomes, even after controlling for language features.
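This setup corresponds to a standard regularized logistic regression over standardized features. The snippet below is a schematic reconstruction with placeholder data, not the authors' exact pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder design matrix: one row per debate, columns for user features
# (experience, success_prior, audience_similarity, hub_score) and linguistic
# features (text length, subjectivity, hedge ratio), as in the table above.
X = rng.normal(size=(1635, 7))
y = rng.integers(0, 2, size=1635)  # 1 = focal debater wins (placeholder labels)

# Standardizing features first makes the fitted coefficients readable as
# standardized effect sizes, matching how the results above are reported.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(f"CV accuracy: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```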
5. Data Structure, Distribution, and Licensing
ContArgA is accessible for non-commercial academic use under a Creative Commons Attribution-NonCommercial license at http://www.cs.cornell.edu/~esindurmus/. The dataset is organized in four primary directories:
- debates/: One JSON per debate, containing debate_id, topic_category, rounds (turn texts), timestamp
- users/: user_profiles.csv, containing demographic, belief, and activity columns as described above
- comments/: JSONL, one comment per line, fields include debate_id, commenter_id, round_id, comment_text, timestamp
- votes/: CSV, each row {debate_id, voter_id, target_debater_id, criteria_scores}
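Given this layout, a debate and its votes can be joined directly. The sketch below follows the directory layout and field names listed above; the concrete file names are placeholders:

```python
import csv
import json
from pathlib import Path

root = Path("ContArgA")  # hypothetical path to the unpacked corpus

# debates/: one JSON per debate (debate_id, topic_category, rounds, timestamp).
with open(root / "debates" / "example_debate.json") as f:  # placeholder file name
    debate = json.load(f)

# votes/: CSV rows keyed by debate_id (placeholder file name).
with open(root / "votes" / "votes.csv", newline="") as f:
    votes = [row for row in csv.DictReader(f)
             if row["debate_id"] == debate["debate_id"]]

print(debate["topic_category"], len(debate["rounds"]), "rounds,", len(votes), "votes")
```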
No IRB or additional data-use agreements are necessary for academic use; access is granted upon accepting the license terms.
6. Applications and Limitations
ContArgA’s design facilitates empirical inquiry into several open questions:
- Modeling persuasion strategies and audience adaptation (e.g., do debaters actively modify style for particular voters?)
- Longitudinal user studies, including prediction of opinion evolution or platform disengagement (“churn”)
- Evaluation of argument quality conditioned on reader demographics
A plausible implication is that the combination of demographic, ideological, and network positioning data with multi-round argumentative content offers unique leverage for studies across NLP, social psychology, and network science.
Limitations
- All demographic and ideological information is self-reported and may contain noise or missing values
- There are no human-annotated argumentation structures (such as explicit claim-premise links) beyond vote-based scoring
- Debate.org’s user population is skewed toward U.S. political issues, affecting generalizability
- Conversational flow features are limited to unigram divergence, with no advanced discourse parsing
Overall, ContArgA is the first corpus to densely couple user profile metadata with large-scale argumentative content, supporting modeling tasks for both language and user effects in computational argumentation (Durmus et al., 2019).