Teacher-Student Chatroom Corpus (TSCC)
- TSCC is a dataset of authentic teacher-student chatroom interactions capturing natural, pedagogical dialogue in online language lessons.
- It features multi-layer annotations covering grammatical errors and discourse structure, with extension work adding argument-move and engagement labels that enable detailed analysis and argument mining.
- The corpus supports research in ESL error correction, dialogue system development, and educational analytics through documented annotation protocols and reproducible evaluation metrics.
The Teacher-Student Chatroom Corpus (TSCC) is a publicly available dataset comprising written, synchronous interactions between teachers and learners of English during one-to-one lessons in an online chatroom. It is distinguished by its naturalistic capture of pedagogically driven, immediate, and informal language use in real tutoring scenarios. The TSCC facilitates multi-dimensional annotation and analysis of teacher-student dialogue, supporting research in second language acquisition, error correction, dialogue system development, and engagement modeling.
1. Corpus Design, Collection, and Structure
The TSCC was created to record spontaneous, authentic exchanges in a controlled online chatroom setting designed specifically for language lessons (Caines et al., 2020). Data collection used a minimal, custom-built web interface (implemented with R Shiny), supporting real-time textual conversation while minimizing distraction and extraneous social interaction. The participant pool consisted of two teachers and eight students, who collectively produced more than 100 recorded lessons. The corpus includes over 13.5K conversational turns and 133K words, with each lesson transcript anonymized for privacy.
Each conversational turn in TSCC is annotated with metadata indicating the speaker (teacher/student), temporal order, and session. The pedagogical environment is asymmetrical: teachers control lesson flow and provide scaffolded exercises and personalized feedback, while students engage, respond, and make errors. This structure allows detailed longitudinal analysis of rhetorical strategies, error-correction dynamics, and learner development.
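A minimal sketch of working with this turn-level structure is shown below; the file name and column labels are assumptions for illustration, since the exact schema is defined in the released spreadsheets.

```python
import pandas as pd

# Sketch: load one TSCC lesson transcript. The file name and column names
# (turn_number, role, text) are assumed here for illustration; consult the
# released spreadsheets for the exact schema.
lesson = pd.read_csv("tscc_lesson_001.tsv", sep="\t")

# Reconstruct the dialogue in temporal order, keeping the speaker role so
# teacher and student contributions can be analysed separately.
lesson = lesson.sort_values("turn_number")
for _, row in lesson.iterrows():
    print(f"[{row['role']}] {row['text']}")

# Per-role turn counts, e.g. to inspect the teacher/student asymmetry.
print(lesson["role"].value_counts())
```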
2. Annotation Layers and Scheme
TSCC incorporates a multi-layer annotation scheme spanning grammatical errors, dialogue structure, and pedagogical phenomena. Student turns and, occasionally, teacher utterances are aligned with corrected versions, with error-type codes detailing lexical, syntactic, punctuation, and pragmatic errors. Additionally, annotations capture discourse organization, including sequences for lesson openings, repair and clarification turns, and closings. Quantitative analyses summarize lesson duration, error distribution, and proficiency-linked features, enabling cross-corpus comparison with other learner datasets.
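TSCC defines its own error-code inventory; as an illustration of how such original/corrected alignments can be reproduced automatically, the sketch below uses ERRANT, a widely used toolkit for extracting typed edits from sentence pairs (a stand-in for demonstration, not the scheme the TSCC annotators used).

```python
import errant  # pip install errant; also requires a spaCy English model

# Sketch: derive typed edits from an original/corrected turn pair.
annotator = errant.load("en")

orig = annotator.parse("She go to school yesterday .")
cor = annotator.parse("She went to school yesterday .")

for edit in annotator.annotate(orig, cor):
    # e.g. 'go' -> 'went', type R:VERB:TENSE
    print(edit.o_str, "->", edit.c_str, edit.type)
```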
A further annotation protocol, first established in "Annotating Student Talk in Text-based Classroom Discussions" (Lugini et al., 2019), has informed extension work in TSCC. This scheme segments student talk into "argument moves," labeled for:
- Argumentation: classified as claim, evidence, or warrant, with warrant annotations restricted to moves that follow claims and evidence.
- Specificity: rated on a three-point scale (low, medium, high) according to text referencing, qualification, vocabulary specificity, and chains of reasoning.
- Knowledge Domain: distinguished as disciplinary (text-referenced) versus experiential (personal interpretation).
These dimensions enable detailed discourse analysis and automated argument mining; a schematic representation of the scheme is sketched after Table 1.
Table 1: Example annotation (Lugini et al., 2019)
| Move | Argument | Specificity | Domain |
|---|---|---|---|
| 23 | claim | medium | disciplinary |
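The sketch below gives a hypothetical in-memory representation of this scheme; the class, field names, and validation helper are illustrative assumptions, not the authors' released format.

```python
from dataclasses import dataclass
from typing import List

# Label inventories from the scheme described above.
ARGUMENT = {"claim", "evidence", "warrant"}
SPECIFICITY = {"low", "medium", "high"}
DOMAIN = {"disciplinary", "experiential"}

@dataclass
class ArgumentMove:
    move_id: int
    argument: str      # claim / evidence / warrant
    specificity: str   # low / medium / high
    domain: str        # disciplinary / experiential

def validate(moves: List[ArgumentMove]) -> None:
    seen_claim_or_evidence = False
    for m in moves:
        assert m.argument in ARGUMENT
        assert m.specificity in SPECIFICITY
        assert m.domain in DOMAIN
        # The scheme restricts warrants to moves following claims/evidence.
        if m.argument == "warrant":
            assert seen_claim_or_evidence, \
                f"move {m.move_id}: warrant before any claim/evidence"
        else:
            seen_claim_or_evidence = True

# The move from Table 1, followed by a warrant that legally succeeds it.
validate([ArgumentMove(23, "claim", "medium", "disciplinary"),
          ArgumentMove(24, "warrant", "high", "disciplinary")])
```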
3. Reliability, Predictive Power, and Quality Evaluation
Annotation reliability was established with expert human raters using Cohen’s kappa for argumentation and domain, and quadratic-weighted kappa for specificity. Reported values for argumentation (0.61–0.8), specificity (0.81–1), and domain (up to 1) reflect substantial to near-perfect agreement, with confusion matrices detailing the residual disagreements (Lugini et al., 2019, Tables 2–3).
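Both statistics are available off the shelf; the sketch below computes them on toy labels with scikit-learn (the labels are invented for illustration).

```python
from sklearn.metrics import cohen_kappa_score

# Unweighted kappa, as used for the nominal argumentation/domain labels.
rater1 = ["claim", "evidence", "claim", "warrant"]
rater2 = ["claim", "evidence", "warrant", "warrant"]
print(cohen_kappa_score(rater1, rater2))

# Specificity is ordinal, so disagreements are penalized quadratically
# by distance (low=0, medium=1, high=2).
spec1 = [0, 1, 2, 1]
spec2 = [0, 2, 2, 1]
print(cohen_kappa_score(spec1, spec2, weights="quadratic"))
```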
Predictive validity was assessed via regression analyses comparing annotation dimensions with expert evaluations of discussion quality. Models integrating argumentation, specificity, and domain achieved an R² of 0.432, indicating that the multi-dimensional annotations carry substantial predictive signal for pedagogical quality.
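The shape of this analysis can be reproduced with an ordinary linear regression; in the sketch below the feature columns (e.g., proportion of claims, mean specificity, share of disciplinary moves) and all values are illustrative assumptions, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy per-discussion feature matrix: [prop. claims, mean specificity,
# share of disciplinary moves] -- invented for illustration.
X = np.array([[0.4, 1.2, 0.7],
              [0.2, 0.8, 0.5],
              [0.5, 1.6, 0.9],
              [0.3, 1.0, 0.6]])
y = np.array([3.5, 2.0, 4.2, 2.8])  # expert discussion-quality ratings

model = LinearRegression().fit(X, y)
print("R^2:", model.score(X, y))  # coefficient of determination
```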
For synthetic data generation tasks (Book2Dial; Wang et al., 5 Mar 2024), quality is further quantified by metrics such as BERTScore F1 (BF1), QuestEval, informativeness, groundedness, answerability, factual consistency (QFactScore), and specificity.
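BERTScore F1, for instance, can be computed with the bert-score package; the candidate and reference strings below are invented for illustration.

```python
from bert_score import score  # pip install bert-score

# Sketch: BERTScore F1 between a generated teacher turn and reference
# text from the source material.
cands = ["Photosynthesis converts light energy into chemical energy."]
refs = ["Photosynthesis turns light energy into chemical energy stored in glucose."]
P, R, F1 = score(cands, refs, lang="en")
print("BF1:", F1.mean().item())
```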
4. Extensions: Engagement, Interestingness, and Sequence-Level Annotation
Building on TSCC, IntrEx (Tan et al., 8 Sep 2025) introduced sequence-level annotations for engagement, providing ratings for both "interestingness" (attention or curiosity elicited by a message or sequence) and "expected interestingness" (anticipated engagement for subsequent dialogue). Annotation was performed by over 100 second-language learners using a comparison-based RLHF-inspired framework, with inter-annotator reliability measured by Gwet’s AC2 (with linear weight penalization).
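Gwet's AC2 with linear weights can be computed directly from its published formulas; the sketch below is a minimal two-rater implementation on toy 5-point ratings, not the IntrEx authors' released code.

```python
import numpy as np

def gwet_ac2(r1, r2, n_categories):
    """Gwet's AC2 for two raters on ordinal labels 0..K-1, linear weights.

    Sketch following Gwet's weighted agreement coefficient; for more
    raters or missing data, use a dedicated package.
    """
    K = n_categories
    r1, r2 = np.asarray(r1), np.asarray(r2)
    # Linear weights: full credit on the diagonal, partial credit nearby.
    w = 1.0 - np.abs(np.arange(K)[:, None] - np.arange(K)[None, :]) / (K - 1)

    # Weighted observed agreement across items.
    pa = w[r1, r2].mean()

    # Chance agreement from average marginal category proportions.
    pi = (np.bincount(r1, minlength=K) + np.bincount(r2, minlength=K)) / (2 * len(r1))
    pe = w.sum() / (K * (K - 1)) * np.sum(pi * (1 - pi))

    return (pa - pe) / (1 - pe)

# Toy 5-point interestingness ratings from two annotators.
print(gwet_ac2([4, 3, 2, 4, 1], [4, 2, 2, 3, 1], n_categories=5))
```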
Linguistic features such as concreteness (negatively correlated with interestingness), comprehensibility (exhibiting an inverted U-curve with engagement), and uptake (building upon previous turns) were statistically linked to engagement. For prediction, smaller instruction-tuned LLMs (Llama3-8B-Instruct, Mistral-7B-Instruct) fine-tuned on IntrEx outperformed larger proprietary models, indicating the value of domain-specific high-quality labels.
5. Synthetic Dialogue Generation and Dataset Expansion
Book2Dial (Wang et al., 5 Mar 2024) provides a framework for generating synthetic teacher-student interactions grounded in textbook content. Dialogues are modeled as sequences of question-answer pairs, with the student agent conditioned on a limited context C (formatting elements only) and the teacher agent equipped with the full textbook content S. Generation approaches include multi-turn QG-QA modeling (with T5 and Longformer), dialogue inpainting (span extraction with a reconstruction loss), and prompt-based persona role-playing via LLMs (e.g., GPT-3.5); a sketch of the latter follows.
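In this minimal sketch, two prompted personas alternate question and answer turns over a shared transcript; the prompts, model choice, and helper function are illustrative assumptions, not Book2Dial's released templates.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

def persona_turn(system_prompt: str, transcript: str) -> str:
    """One persona turn: condition a chat model on a role prompt plus
    the dialogue so far. Prompts are illustrative only."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": transcript or "Begin."}])
    return resp.choices[0].message.content

section_text = "..."   # full section text (teacher context S)
section_title = "..."  # formatting elements only (student context C)

transcript = ""
for _ in range(3):  # a few synthetic question-answer turns
    q = persona_turn(f"You are a curious student who has seen only the "
                     f"title '{section_title}'. Ask the teacher one question.",
                     transcript)
    transcript += f"Student: {q}\n"
    a = persona_turn(f"You are a teacher. Answer using only this section:\n"
                     f"{section_text}", transcript)
    transcript += f"Teacher: {a}\n"
```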
Quality assessment showed that persona-based prompting produced high relevance, coherence, answerability, and factual consistency, while inpainting methods led in informativeness and groundedness. Challenges such as hallucinated content and over-repetition persist, necessitating further investigation for robust synthetic corpus expansion.
6. Applications in Second Language Acquisition, NLP, and Tutoring Systems
TSCC and its extensions are instrumental in multiple fields:
- Second Language Acquisition: Empirical studies leverage error annotation and engagement metrics to analyze feedback strategies and learning progress (Caines et al., 2020, Tan et al., 8 Sep 2025).
- Automated Error Detection and Correction: Annotated transcripts drive the development and benchmarking of grammatical error correction algorithms and CALL systems.
- Dialogue Systems and Chatbots: Corpus data supports both pre-training and evaluation of conversational agents focused on pedagogically meaningful teacher-student interaction (Wang et al., 5 Mar 2024, McNichols et al., 11 Mar 2025).
- Educational Analytics: Sequence-level and behavioral annotation informs research on conversational engagement and its impact on learning outcomes, supporting intervention design in intelligent tutoring systems.
7. Technical and Reproducibility Considerations
TSCC is distributed as spreadsheet files, with each row representing a conversational turn. Metadata includes turn number, participant role, original and (optionally) corrected text, and comprehensive error codes (Caines et al., 2020). The accompanying documentation reports corpus summary tables (lesson count, turn count, word count) and gives formulaic definitions of the error-rate and agreement metrics.
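As an illustration, an error rate can be expressed as annotated edits per token (a plausible formulation; the corpus documentation gives the exact definition), alongside the standard form of Cohen's kappa used for agreement:

$$\mathrm{ER} = \frac{N_{\text{edits}}}{N_{\text{tokens}}}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement between raters and $p_e$ the agreement expected by chance.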
Synthetic and engagement-labeled datasets maintain reproducibility through open-source release of code, annotation protocols, and the confusion matrices and scripts needed to recompute agreement metrics such as Cohen’s kappa and Gwet’s AC2. Systematic regex-based PII filtering, annotation sanity checks, and reward-based quality assurance contribute to ethical data management and methodological rigor.
TSCC thus serves as a foundational resource for educational dialogue research, offering multi-dimensional annotation and well-documented reliability. It supports automated dialogue analysis, chatbot training, and engagement modeling, and ongoing developments continue to extend its utility through synthetic data generation and increasingly nuanced annotation protocols.