Socrates Dataset: Multi-Domain Benchmarks

Updated 14 June 2026

“Socrates Dataset” is a name shared by several independent benchmarks covering social science, LLM reasoning, mediation, explanation critiquing, and atmospheric photolysis.
Their construction variously involves LLM-driven data standardization, iterative code synthesis, and filtering aimed at high-quality, reproducible data.
Applications range from hypothesis screening and bias analysis to diagnostic probing, mediator evaluation, and atmospheric simulation studies.

The term “Socrates Dataset” can refer to several distinct, high-profile datasets and benchmarks widely cited in computational social science, machine learning, automated mediation, and atmospheric modeling. Each instance is independently defined in the literature. The following overview details the main Socrates datasets referenced in contemporary research, with an emphasis on their origins, construction methodologies, structure, metrics, and representative applications.

1. Socrates (SocSci210): Large-Scale Human Response Dataset for LLM Finetuning

Socrates (also termed SocSci210) is a comprehensive dataset of individual-level human responses aggregated from 210 peer-reviewed social science experiments, designed to support LLM finetuning for behavioral prediction in diverse social science domains. The dataset encompasses 2.9 million response records from 400,491 unique participants, sourced primarily from the National Science Foundation’s TESS repository, and covers disciplines such as political science, behavioral psychology, economics, and sociology (Kolluri et al., 6 Sep 2025).

Dataset Construction Pipeline

Acquisition and Standardization: Raw project files (CSV, SPSS, PDF codebooks) are downloaded from OSF/TESS. Data tables are standardized to CSV, and text extraction is applied for codebooks and DOCs; column names are normalized.
Reconstruction Agent: An LLM-driven pipeline ingests the entire study context (paper text, codebook, and tabular data), identifies conditions ( $c_1,\ldots,c_j$ ) and outcome questions ( $o_1,\ldots,o_i$ ), and executes iterative “generate-and-test” code synthesis, verified against held-out samples and participant-level mappings.
Filtering: Of 321 deduplicated TESS studies, 210 were reconstructed successfully. Failures were due to complex file structures impeding automated parsing.

Record Schema and Structure

Field	Type	Description
persona_id	string	Unique participant identifier
Demographics	JSON or flat	51 fields: age, gender, income, etc.
condition_id	string	Encodes experimental condition
condition_label	text	Description of the condition
outcome_id	string	Identifier of measured outcome
outcome_prompt	text	Full question/explanation text
response_value	int or binary	Encoded response
response_scale	object	{“r_min”, “r_max”}

Demographic metadata spans 51 variables, enabling granular bias and subgroup analysis (e.g., political party ID, urban/rural classification, education, marital status).
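
As an illustration of how a record with this schema might be consumed, the sketch below parses one JSONL line and rescales the response onto [0, 1] using `response_scale`. The field names follow the table above; the `ResponseRecord` class, the `parse_record` helper, and the example values are hypothetical and are not part of the released dataset or its tooling.

```python
import json
from dataclasses import dataclass


@dataclass
class ResponseRecord:
    """Subset of the record schema (hypothetical container, not an official API)."""
    persona_id: str
    condition_id: str
    outcome_id: str
    response_value: int
    r_min: int
    r_max: int

    def normalized(self) -> float:
        """Rescale the response to [0, 1] using the declared scale bounds."""
        span = self.r_max - self.r_min
        if span == 0:
            raise ValueError("response_scale has zero width")
        return (self.response_value - self.r_min) / span


def parse_record(line: str) -> ResponseRecord:
    """Parse one JSONL line; assumes response_scale holds r_min and r_max."""
    raw = json.loads(line)
    scale = raw["response_scale"]
    return ResponseRecord(
        persona_id=raw["persona_id"],
        condition_id=raw["condition_id"],
        outcome_id=raw["outcome_id"],
        response_value=int(raw["response_value"]),
        r_min=int(scale["r_min"]),
        r_max=int(scale["r_max"]),
    )


# Made-up example line: a response of 4 on a 1-7 scale.
example_line = json.dumps({
    "persona_id": "p-0001",
    "condition_id": "c1",
    "outcome_id": "o1",
    "response_value": 4,
    "response_scale": {"r_min": 1, "r_max": 7},
})
record = parse_record(example_line)
print(record.normalized())
```

Normalizing by the per-outcome scale is one way to compare responses across studies that use different Likert ranges; whether the original work does this is not stated here.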

File Formats and Access

CSV (one flat file per study)
JSONL (one record per line, full schema)
SQLite database (field-indexed)

The dataset is released under a permissive MIT-style license for non-commercial research purposes and is accessible from https://stanfordhci.github.io/socrates.

2. SOCRATES (ShOrtCut-fRee lATent rEaSoning): Multi-Hop Factual Reasoning Benchmark

SOCRATES is a benchmark for evaluating pretrained LLMs’ ability to perform shortcut-free latent multi-hop reasoning without recalling head-to-answer entity pairs from pretraining or resorting to relation-object frequency biases (Yang et al., 2024).

Construction and Filtering Protocol

Fact pair selection is based on structured Wikidata triples: $(e_1, r_1, e_2)$ and $(e_2, r_2, e_3)$ , with “bridge” entities (e.g., country, year).
Shortcut filtering via document-level co-occurrence heuristics in massive web-scale corpora (Dolma v1.5, OSCAR, etc.) ensures test cases lack any $e_1$ – $e_3$ co-occurrence.
“Guessable” queries are ablated and removed if LLMs can answer based on relation pattern frequency alone.
Test set size: 7,232 shortcut-free multi-hop queries.
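
The co-occurrence filter can be pictured with a toy example. The sketch below drops any query whose head entity $e_1$ and answer entity $e_3$ appear together in some document. It uses naive substring matching over an in-memory list, which is only a stand-in for the corpus-scale search described above; the function names, the dictionary keys, and the fictional entities are all assumptions made for illustration.

```python
def cooccurs(document: str, head: str, answer: str) -> bool:
    """True if both entity strings appear in the same document (case-insensitive)."""
    text = document.lower()
    return head.lower() in text and answer.lower() in text


def filter_shortcut_free(queries, corpus):
    """Keep only queries whose e1 and e3 never share a document in the corpus."""
    kept = []
    for query in queries:
        if not any(cooccurs(doc, query["e1"], query["e3"]) for doc in corpus):
            kept.append(query)
    return kept


# Fictional corpus and entities, chosen so no real-world fact is implied.
corpus = [
    "Zorvania borders the Lumen Sea. Its capital is Quillport.",
    "The singer Ada Vell was born in Zorvania.",
    "Bram Oster gave a concert in Quillport last year.",
]
queries = [
    {"e1": "Ada Vell", "e2": "Zorvania", "e3": "Quillport"},    # no shared document
    {"e1": "Bram Oster", "e2": "Zorvania", "e3": "Quillport"},  # shares the third document
]
shortcut_free = filter_shortcut_free(queries, corpus)
print([q["e1"] for q in shortcut_free])
```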

Evaluation Metrics

Latent Composability (LC):

$LC = \frac{|\{ t \in T : EM_1(t)=1, EM_2(t)=1, EM_m(t)=1 \}| }{|\{ t \in T : EM_1(t)=1, EM_2(t)=1 \}|}$

where $EM_1, EM_2$ refer to single-hop exact match and $EM_m$ to the multi-hop.
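
A minimal sketch of the LC computation follows, assuming each test case is stored as a dictionary with 0/1 exact-match flags. The key names `em1`, `em2`, and `em_multi` and the toy results are hypothetical.

```python
import math


def latent_composability(results):
    """Fraction of cases with both single hops correct that also get the multi-hop query right."""
    both_hops = [r for r in results if r["em1"] == 1 and r["em2"] == 1]
    if not both_hops:
        return math.nan  # LC is undefined when the denominator is empty
    return sum(r["em_multi"] for r in both_hops) / len(both_hops)


# Toy results: three cases pass both single hops, one of them passes the multi-hop query.
toy_results = [
    {"em1": 1, "em2": 1, "em_multi": 1},
    {"em1": 1, "em2": 1, "em_multi": 0},
    {"em1": 1, "em2": 1, "em_multi": 0},
    {"em1": 1, "em2": 0, "em_multi": 0},  # excluded: second hop is wrong
]
lc = latent_composability(toy_results)
print(f"LC = {lc:.3f}")
```

Conditioning on both single hops being correct separates failures of composition from failures of factual recall, which is the point of the denominator.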

Chain-of-Thought composability measures explicit multi-hop reasoning (step-wise) accuracy.

Key Findings

Overall LC for the best models (GPT-4o, Claude 3.5) is under 10%, but it varies strongly with bridge-entity type: LC $>$ 80% for “country” versus 5–7% for “year”.
Rigorous filtering reduces artifacts; unfiltered variants can inflate LC by 2–5 $\times$.
The dataset does not include a training/validation split; it is strictly for evaluation.

3. SoCRATES: Benchmark for Proactive LLM Mediation

SoCRATES is a large-scale, multi-domain benchmark for automated evaluation of LLM-based mediators across realistic conflict negotiation scenarios (Yun et al., 4 Jun 2026). It is constructed from agentically synthesized cases rooted in real disputes.

Construction Workflow

Domains: Transactional, Healthcare, Environmental, B2B, Public Policy, International, Legal, Intra-organizational (5 hard scenarios each, total 40).
Agentic pipeline: Web search via LLM (o4-mini-deep-research) generates structured seeds, which are rewritten by GPT-5.4 into a fixed schema. Consensus is tested in multi-agent simulations without mediation to isolate “hard” cases.
Socio-cognitive perturbations: Each scenario is mutated along five axes: Strategic Posture (Thomas–Kilmann mode), Party Composition, History Length, Emotional Reactivity (a scalar), and Cultural Identity (Hofstede’s six dimensions). This yields 15 conditions per base scenario and 600 scenario-condition pairs in total.
All dialogue, background, topic structure, party persona, and socio-cognitive metadata are stored in structured JSON.
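
The benchmark size follows from the counts stated above: 8 domains with 5 hard scenarios each give 40 base scenarios, and 15 conditions per scenario give 600 pairs. The sketch below enumerates them with placeholder indices; how the five perturbation axes map onto the 15 conditions is not specified here, so the condition index is only a stand-in.

```python
DOMAINS = [
    "Transactional", "Healthcare", "Environmental", "B2B",
    "Public Policy", "International", "Legal", "Intra-organizational",
]
SCENARIOS_PER_DOMAIN = 5      # "5 hard scenarios each"
CONDITIONS_PER_SCENARIO = 15  # stated total per base scenario

# Base scenarios are identified by (domain, scenario index).
base_scenarios = [
    (domain, index)
    for domain in DOMAINS
    for index in range(SCENARIOS_PER_DOMAIN)
]

# Scenario-condition pairs add a placeholder condition index.
scenario_condition_pairs = [
    (domain, index, condition)
    for (domain, index) in base_scenarios
    for condition in range(CONDITIONS_PER_SCENARIO)
]

print(len(base_scenarios), len(scenario_condition_pairs))
```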

Evaluation Protocol

Topic-localized evaluator computes per-topic consensus by tracking stance and agreement shifts only when the topic is locally discussed.
Key metrics:
- Intervention Timeliness: Fraction of significant consensus drops promptly countered by mediator intervention.
- Intervention Effectiveness: Short-term consensus gain post-intervention.
- Consensus Gain: Final consensus gain relative to unmediated baseline.
The evaluator’s scores correlate with human expert judgments (Pearson $r$), outperforming per-turn annotation and non-expert baselines.
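
To make two of these metrics concrete, the sketch below computes them on a toy consensus trajectory. The definitions used are assumptions, since the exact thresholds are not given here: a “significant drop” is taken to be a turn-to-turn decrease of at least `drop_threshold`, and an intervention counts as “prompt” if the mediator speaks within `window` turns after the drop. The function names and all numbers are hypothetical.

```python
import math


def intervention_timeliness(consensus, mediator_turns, drop_threshold=0.2, window=2):
    """Fraction of significant consensus drops followed by a mediator turn within `window` turns."""
    drops = [
        t for t in range(1, len(consensus))
        if consensus[t - 1] - consensus[t] >= drop_threshold
    ]
    if not drops:
        return math.nan  # undefined when the dialogue has no significant drop
    countered = sum(
        1 for t in drops
        if any(t < m <= t + window for m in mediator_turns)
    )
    return countered / len(drops)


def consensus_gain(mediated_final, unmediated_final):
    """Final consensus with mediation minus the unmediated baseline."""
    return mediated_final - unmediated_final


# Toy trajectory: drops occur at turn 1 and turn 4; the mediator speaks only at turn 2.
consensus = [0.6, 0.3, 0.35, 0.5, 0.2, 0.25]
mediator_turns = {2}
timeliness = intervention_timeliness(consensus, mediator_turns)
gain = consensus_gain(mediated_final=0.7, unmediated_final=0.4)
print(timeliness, round(gain, 2))
```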

Use and Access

600 scenario-condition JSONs and all unmediated baselines are provided for benchmarking.
There is no train/validation/test split; SoCRATES is strictly a test benchmark for E2E evaluation of new LLM mediators.

4. Digital Socrates Critique Bank: Automatic Explanation Critiquing

Digital Socrates Critique Bank is an explanation evaluation dataset for LLM-generated answers to multiple-choice QA, focusing on the ability to identify, localize, and categorize explanation flaws (Gu et al., 2023).

Annotation Protocol and Structure

Each entry is a 5-tuple: localization (floc), flaw dimension (fdim, eight categories), general suggestion (Sgen), specific suggestion (Sspec), and an explanation quality score.
Types of flaws: misunderstanding, lack_justification, incorrect_information, missing_information, incorrect_reasoning, incomplete_reasoning, inconsistent_answer, irrelevant.
Data: 26,478 annotated critiques, covering 4,091 questions, with silver (GPT-4), crowd-verified, and expert partitions for training/dev splits.

Statistical Properties

Flaw dimension occurrence: misunderstanding (20%), lack_justification (7%), incorrect_information (18%), missing_information (5%), incorrect_reasoning (21%), incomplete_reasoning (18%), inconsistent_answer (7%), irrelevant (4%).
57% of explanations in the dev partition are judged flawless by all critique models.

Usage

Model outputs (answer and explanation) can be automatically critiqued to benchmark explanation quality.
The dataset is distributed as JSONL; annotation reliability can be checked with Cohen’s $\kappa$.
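
For reference, Cohen’s kappa compares observed agreement $p_o$ with chance agreement $p_e$ as $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch below is a plain-Python implementation applied to two made-up annotators labeling four explanations with flaw dimensions; the labels are illustrative and are not drawn from the Critique Bank.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined when both annotators use a single shared label
    return (observed - expected) / (1.0 - expected)


annotator_a = ["irrelevant", "misunderstanding", "misunderstanding", "irrelevant"]
annotator_b = ["irrelevant", "misunderstanding", "irrelevant", "irrelevant"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa)
```

In this toy case the annotators agree on 3 of 4 items ($p_o = 0.75$) while chance agreement is $p_e = 0.5$, so $\kappa = 0.5$.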

5. Socrates Photolysis Dataset (v24.11): Atmospheric and Exoplanet Photochemistry

The Socrates photolysis dataset provides a comprehensive and updated collection of photolysis rates ($J$-values) for 31 key atmospheric reactions, relevant to both Earth and exoplanet atmospheres (Adams et al., 18 Feb 2026).

Computational Formalism

$J = \int \sigma(\lambda, T, p)\, \phi(\lambda, T, p)\, F(\lambda)\, d\lambda$
$\sigma(\lambda, T, p)$: Cross section as a function of wavelength, temperature, pressure.
$\phi(\lambda, T, p)$: Quantum yield.
$F(\lambda)$: Actinic flux, evaluated with a correlated-$k$ two-stream solver.
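
A photolysis rate is conventionally obtained by integrating the product of cross section, quantum yield, and actinic flux over wavelength. The sketch below does this with the trapezoidal rule on a tabulated grid. It is not the Socrates radiative-transfer code: the function names are hypothetical, the inputs are synthetic constants chosen to make the arithmetic easy to check, and temperature and pressure dependence is assumed to be already folded into the tabulated values.

```python
def trapezoid(y, x):
    """Trapezoidal integral of tabulated y over a (possibly non-uniform) grid x."""
    return sum(
        0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
        for i in range(len(x) - 1)
    )


def photolysis_rate(wavelength_nm, cross_section_cm2, quantum_yield, actinic_flux):
    """J in s^-1, given sigma in cm^2, dimensionless phi, and F in photons cm^-2 s^-1 nm^-1."""
    integrand = [
        sigma * phi * flux
        for sigma, phi, flux in zip(cross_section_cm2, quantum_yield, actinic_flux)
    ]
    return trapezoid(integrand, wavelength_nm)


# Synthetic inputs (not taken from the dataset): constant values over a 20 nm band.
wavelength_nm = [300.0, 310.0, 320.0]
cross_section_cm2 = [1e-19, 1e-19, 1e-19]
quantum_yield = [1.0, 1.0, 1.0]
actinic_flux = [1e14, 1e14, 1e14]

j_value = photolysis_rate(wavelength_nm, cross_section_cm2, quantum_yield, actinic_flux)
print(f"J = {j_value:.2e} s^-1")
```

With a constant integrand of $10^{-5}$ per nm over 20 nm, the result is $2 \times 10^{-4}\ \mathrm{s^{-1}}$, which makes the unit bookkeeping easy to verify by hand.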

Data Structure

Benchmark rates for 31 reactions (oxygen and nitrogen-oxide species, N$_2$O, H$_2$O, various organics, and exoplanet-relevant species).
J-values computed vs. pressure for both modern solar and M-dwarf (Proxima Centauri) stellar spectra.
Each reaction is backed by explicit cross section and quantum yield tables, including temperature and pressure dependencies.

Comparison and Applications

Agreement to within 5% with PhotoComp 2011 for many channels, but extended to higher resolution, updated cross-sections, and new exoplanet channels.
Datasets (tabular ASCII, CSV, driver scripts) on Zenodo: https://doi.org/10.5281/zenodo.15941222.
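
An agreement check of the kind quoted above (“within 5%”) amounts to a relative-difference test against a reference value. The sketch below shows one way to write it; the function names and the two example $J$-values are hypothetical and are not taken from the dataset or from PhotoComp 2011.

```python
def relative_difference(value, reference):
    """Absolute difference relative to the reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(value - reference) / abs(reference)


def agrees_within(value, reference, tolerance=0.05):
    """True if the value lies within the given fractional tolerance of the reference."""
    return relative_difference(value, reference) <= tolerance


# Hypothetical J-values in s^-1 for a single channel.
computed_j = 1.03e-2
reference_j = 1.00e-2
print(agrees_within(computed_j, reference_j))
```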

6. Licensing and Accessibility

Socrates (SocSci210), SOCRATES (latent reasoning), SoCRATES (mediation), and Digital Socrates are all released for non-commercial research under open-source or permissive licenses, but users are required to cite the original work and comply with corresponding data-use policies (Gu et al., 2023, Yang et al., 2024, Kolluri et al., 6 Sep 2025, Adams et al., 18 Feb 2026, Yun et al., 4 Jun 2026).

7. Representative Applications and Implications

Socrates (SocSci210) models enable synthetic simulation of policy, moral, or economic response distributions for hypothesis screening without field deployment, achieving up to 74% normalized individual accuracy and 26% distributional alignment gains over base LLMs on new studies (Kolluri et al., 6 Sep 2025).
SOCRATES exposes the distinction between latent and explicit multi-hop reasoning, quantifying the compositional limitations of modern LLMs and providing a basis for diagnostic probing of neural representations (Yang et al., 2024).
SoCRATES supports controlled benchmarking of LLM-driven mediators across axes of social adaptation, exposing model performance bounds in consensus-building under complex domain, party, emotion, and culture permutations (Yun et al., 4 Jun 2026).
The Socrates photolysis dataset enables state-of-the-art atmospheric modeling of photochemical processes, including exoplanet atmospheres under varied stellar irradiation (Adams et al., 18 Feb 2026).
Digital Socrates enables non-intrusive, high-throughput auditing of LLM-generated explanations, revealing error types not captured by end-task accuracy (Gu et al., 2023).

Each Socrates dataset represents a distinct line of inquiry, unified only by adherence to state-of-the-art data curation, broad domain coverage, and support for quantitative benchmarking across leading research challenges in both computational sciences and the humanities.


References (5)

Finetuning LLMs for Human Behavior Prediction in Social Science Experiments (2025)

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? (2024)

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations (2026)

Digital Socrates: Evaluating LLMs through Explanation Critiques (2023)

Benchmarking Photolysis Rates with Socrates (24.11): Species for Earth and Exoplanets (2026)
