
TRAIMA: Automating Multimodal Classroom Interactions

Updated 26 January 2026
  • TRAIMA is a project that automates analysis of multimodal classroom interactions through integrated transcription and machine learning approaches.
  • It combines transcription conventions with ML techniques like HMMs, CRFs, and LSTMs to label verbal, paraverbal, and non-verbal cues in educational settings.
  • The project enhances pedagogical research by streamlining data annotation and offering robust frameworks to overcome manual transcription challenges.

The TRAIMA project (TRaitement Automatique des Interactions Multimodales en Apprentissage, "automatic processing of multimodal interactions in learning") investigates the methodological and technological underpinnings required to automate the analysis of multimodal interactions in classroom settings. Conducted from March 2019 to June 2020, its primary objective was to address the challenge posed by the manual annotation of verbal, paraverbal, and non-verbal components, specifically in explanatory and collaborative pedagogical episodes, where the complexity and scale of multimodal data make manual approaches inefficient and difficult to scale. Through a detailed survey of transcription conventions, empirical analysis of French-language classroom corpora, and infrastructure development (TechnéLAB), TRAIMA elaborates a methodology and framework for the future integration of machine learning in the automated processing and categorization of classroom interactions, especially explanatory discourse sequences in French as a Foreign Language (FLE) and French as a First Language (FLM) contexts (Rançon et al., 19 Jan 2026).

1. Theoretical Foundation: Explanatory Discourse and Multimodality

TRAIMA’s analytical core rests on a precise linguistic and interactional definition of the explanatory discourse sequence as a tripartite structure. Drawing from interactional didactics (Baker 1992; Barbieri et al. 1990), an “explanation” is modeled as a three-part sequence aimed at resolving a comprehension obstacle:

$$\text{ExpSeq} = \langle \text{Opening},\ \text{Core},\ \text{Closure} \rangle$$

  • Opening: Problematisation and formulation of the explanandum (subject to be explained)
  • Core (Noyau/Explanans): The explanation proper, where the comprehension obstacle is addressed
  • Closure: A mark of reception or ratification by learners

Macro-propositional variants elaborate the sequence as a series of phases:

$$\text{Macro-ExpSeq} = \bigl\langle \underbrace{\text{Phase 0: Schématisation initiale}}_{M_0},\ \underbrace{\text{Phase 1: Problème / Question}}_{M_1},\ \underbrace{\text{Phase 2: Explication / Réponse}}_{M_2},\ \underbrace{\text{Phase 3: Conclusion–Évaluation}}_{M_3} \bigr\rangle$$

TRAIMA distinguishes de dicto causality (semantic causal links that are not explicitly operational) from de re causality (causal relations that are actually manipulated). Each phase is marked multimodally: verbal cues, prosodic contouring, and characteristic gestures or postures per segment. No instance was observed where gesture or prosody alone constituted a full explanans; paraverbal and non-verbal modalities always accompanied verbal content.
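These structures lend themselves to a simple data model. A minimal sketch, assuming hypothetical type and field names (the report does not prescribe any implementation):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Phase(Enum):
    """Macro-propositional phases of an explanatory sequence (M0-M3)."""
    SCHEMATISATION = 0   # Phase 0: initial schematisation
    PROBLEME = 1         # Phase 1: problem / question (explanandum)
    EXPLICATION = 2      # Phase 2: explanation / answer (explanans)
    CONCLUSION = 3       # Phase 3: conclusion-evaluation


@dataclass
class Segment:
    """One multimodally annotated stretch of talk within a phase."""
    phase: Phase
    verbal: str                                      # transcribed speech
    prosodic_cues: List[str] = field(default_factory=list)
    gesture_cues: List[str] = field(default_factory=list)


@dataclass
class ExplanatorySequence:
    """Tripartite structure: ExpSeq = <Opening, Core, Closure>."""
    opening: List[Segment]   # problematisation of the explanandum
    core: List[Segment]      # the explanans proper
    closure: List[Segment]   # ratification by learners
```

Modeling phases as an enum makes the forward ordering M0 through M3 explicit, which matters for the sequence-labeling approaches discussed later.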

2. Transcription Conventions and Annotation Methodologies

TRAIMA delivers a detailed review of five leading transcription conventions, evaluating their expressive power and operational feasibility in multimodal didactic contexts:

| Convention | Multimodal Features Coded | Principal Strengths | Principal Limitations |
|---|---|---|---|
| ICOR | Verbal, paraverbal, gesture/posture, spatial | Explicit multimodal alignment; introductory | No fixed gesture taxonomy; labor-intensive |
| Mondada | Gesture trajectory, gaze, strict temporal alignment | Fine granularity; aligns gesture/verbal | High cognitive load; little prosodic support |
| GARS | Verbal, minimal non-verbal (notes) | Simple; computationally light | Non-verbal marginal; little prosody support |
| VALIBEL | Verbal (plus orality cues), limited non-verbal (notes) | Faithful to orality; Praat integration | Gesture/gaze largely absent |
| Ferré | Verbal, prosody, gesture (functional taxonomy) | Integrated analysis; gesture categorization | Complex; requires multiple tools |

TRAIMA’s framework adopts a hybrid approach: GARS/VALIBEL for verbal turns; Mondada for gesture/posture/gaze; Ferré for gesture function; ICOR for sequence boundaries. Minimal analytical units consist of word + prosodic event + gesture event + proxemic position integrated on a common time grid, managed via ELAN or EXMARaLDA, and exported to Praat for prosodic sub-tiers.
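The minimal analytical unit described above can be sketched as a record on a shared timeline. The following is a hypothetical illustration; the field names, tier labels, and overlap-based alignment are assumptions, not TRAIMA's implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AnnotationUnit:
    """Minimal analytical unit: one word aligned with co-occurring
    prosodic, gestural, and proxemic events on a common time grid."""
    start: float                      # seconds on the shared timeline
    end: float
    word: str
    prosody: Optional[str] = None     # e.g. a pitch-accent label from a Praat tier
    gesture: Optional[str] = None     # e.g. a Ferré functional category
    proxemics: Optional[str] = None   # e.g. a position label


def overlaps(a_start: float, a_end: float, b_start: float, b_end: float) -> bool:
    """True if the two half-open intervals share any time span."""
    return a_start < b_end and b_start < a_end


def align(words: List[AnnotationUnit],
          events: List[Tuple[float, float, str, str]]) -> List[AnnotationUnit]:
    """Attach each (start, end, tier, label) event to every word unit
    it temporally overlaps, mimicking tier integration in ELAN."""
    for w in words:
        for (s, e, tier, label) in events:
            if overlaps(w.start, w.end, s, e):
                setattr(w, tier, label)
    return words
```

In practice a tool like ELAN stores each modality on its own tier; the overlap join above is only a conceptual stand-in for that time-grid integration.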

3. Data Infrastructure and Empirical Corpora

The empirical base comprises two principal corpora:

  • INTER-EXPLIC corpus (Univ. Toulouse II, 2006): ≈30 hours of FLE/FLM classroom video recordings, >100 tripartite explanations in varied teacher–learner configurations, multi-camera video and synchronized lapel microphones, annotated in ELAN with ICOR conventions.
  • EXPLIC-LEXIC corpus (Univ. Poitiers, 2016): Lexically-focused explanatory dataset designed for automatic transcription, integrating video, audio, and digital whiteboards, and optimized for ASR adaptation.

Manual annotation adheres to ICOR for sequence segmentation, Mondada conventions for gesture alignment, and Ferré’s taxonomy for gesture function (iconique, déictique, métaphorique, emblème, battement). Inter-annotator agreement metrics such as Cohen's κ or Krippendorff's α are not reported.
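Were agreement to be reported, Cohen's κ over two annotators' gesture-function labels would follow the standard formula κ = (p_o − p_e) / (1 − p_e). A minimal sketch, not part of the TRAIMA report:

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators' categorical labels over the
    same items: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from the marginals."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1.0:           # degenerate case: both marginals fully agree
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Krippendorff's α would be preferable with more than two annotators or missing labels, at the cost of a more involved computation.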

4. Machine Learning and Automation Prospects

TRAIMA’s automation strategy is exploratory, detailing methodological candidates rather than formal implementations or results. The candidate techniques are sequence-labeling models such as Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and LSTM networks for tagging verbal, paraverbal, and non-verbal cues. The report does not specify concrete evaluation metrics, feature sets, or parameter choices. The strategy is underpinned by the recognition that transcription bottlenecks are both technical (tool-chain fragmentation, limited domain-adapted ASR) and theoretical (interpretative variability in gesture/function labeling).
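As a hypothetical illustration of this family of techniques, and not an implementation from TRAIMA, a tiny HMM with Viterbi decoding could label observed cue symbols with macro-phases M0-M3. All states, cue symbols, and probabilities below are invented:

```python
import math

STATES = ["M0", "M1", "M2", "M3"]  # macro-phases of an explanatory sequence


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most probable phase sequence for a list of cue symbols,
    using log-probabilities to avoid underflow."""
    V = [{s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in STATES}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = V[t - 1][prev] + math.log(trans_p[prev][s] * emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(STATES, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Invented toy parameters: phases tend to advance M0 -> M1 -> M2 -> M3,
# and each phase preferentially emits one characteristic cue symbol.
START = {"M0": 0.7, "M1": 0.1, "M2": 0.1, "M3": 0.1}
TRANS = {"M0": {"M0": 0.3, "M1": 0.5, "M2": 0.1, "M3": 0.1},
         "M1": {"M0": 0.05, "M1": 0.3, "M2": 0.55, "M3": 0.1},
         "M2": {"M0": 0.05, "M1": 0.1, "M2": 0.5, "M3": 0.35},
         "M3": {"M0": 0.1, "M1": 0.1, "M2": 0.2, "M3": 0.6}}
EMIT = {"M0": {"schema": 0.7, "question": 0.1, "explain": 0.1, "ratify": 0.1},
        "M1": {"schema": 0.1, "question": 0.7, "explain": 0.1, "ratify": 0.1},
        "M2": {"schema": 0.05, "question": 0.1, "explain": 0.75, "ratify": 0.1},
        "M3": {"schema": 0.1, "question": 0.1, "explain": 0.1, "ratify": 0.7}}
```

CRFs or LSTMs would replace the hand-set tables with learned feature weights, but the underlying task, mapping a time-ordered cue stream to phase labels, is the same.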

5. Functional Insights into Multimodal Explanation

Manual analysis in TRAIMA yields several confirmed functions for non-verbal modalities in explanatory discourse:

  • Communicative load: Approximately one third of the communicative load is carried verbally and two thirds non-verbally (citing Lazaraton 2004).
  • Kinesic and proxemic resources: Used to segment discourse (opening, core, closure), anchor key terms via deictic pointing, and reinforce causal connections through iconic metaphors.
  • Synchrony: Optimal comprehension occurs when gesture apexes are temporally aligned with prosodic emphasis on keywords. Misalignment (e.g., gesture delay) correlates with learner calls for clarification.

No evidence was found for purely gestural or prosodic explanations substituting for verbal contributions in classroom settings.
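The synchrony observation can be operationalised by measuring the offset between each gesture apex and the nearest prosodic peak. A hypothetical sketch with an arbitrary 200 ms tolerance (the report specifies no threshold):

```python
from typing import List, Tuple


def apex_offsets(gesture_apexes: List[float],
                 prosodic_peaks: List[float],
                 threshold: float = 0.2) -> List[Tuple[float, float, float, bool]]:
    """Pair each gesture apex (in seconds) with the nearest prosodic peak
    and flag pairs whose absolute offset exceeds the tolerance.
    A positive offset means the gesture lags the prosodic emphasis."""
    report = []
    for apex in gesture_apexes:
        nearest = min(prosodic_peaks, key=lambda p: abs(p - apex))
        offset = apex - nearest
        report.append((apex, nearest, offset, abs(offset) > threshold))
    return report
```

Flagged pairs would mark the misalignments (e.g. gesture delay) that the analysis found to correlate with learner calls for clarification.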

6. TechnéLAB Platform: Research Infrastructure

TRAIMA is integrated within TechnéLAB (Univ. Poitiers), which constitutes both a multimodal data capture facility and a testbed for annotation and automation. Key technological resources include:

  • Capture hardware: Multi-cam HD video, individual lapel and room microphones, eye-tracking, precise geolocation sensors, interactive digital whiteboards.
  • Synchronization and storage: Audio-video streams time-stamped and stored with redundant backup (NTFS).
  • Software and automation: Praat for prosody, ELAN/ANVIL for annotation; in-house scripts for ingestion/conversion; future roadmap toward chained processing (ASR, tier association, automated gesture detection).
  • Annotation support: Web-based session cataloging, annotation progress tracking, emerging dashboards for inter-annotator agreement, and API hooks for integration of ML classifiers.
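The roadmap's chained processing can be pictured as composable stages. A minimal sketch with placeholder stage functions; all names and payload keys here are invented:

```python
from typing import Callable, Dict

Stage = Callable[[Dict], Dict]  # each stage enriches a session record


def chain(*stages: Stage) -> Stage:
    """Compose processing stages into one pipeline, mirroring the
    roadmap's ASR -> tier association -> gesture detection chain."""
    def pipeline(session: Dict) -> Dict:
        for stage in stages:
            session = stage(session)
        return session
    return pipeline


# Placeholder stages standing in for real components (hypothetical):
def asr(session):       return {**session, "transcript": "..."}
def tiers(session):     return {**session, "tiers": ["verbal", "prosody"]}
def gestures(session):  return {**session, "gestures": []}


process = chain(asr, tiers, gestures)
```

Keeping each stage a pure function over a session record makes it straightforward to swap in an ML classifier behind the API hooks mentioned above.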

7. Methodological Recommendations and Future Research Trajectories

TRAIMA recommends a rigorously hybrid transcription convention that accommodates the strengths of GARS/VALIBEL (verbal, orality), Mondada (gesture, posture, gaze), Ferré (gesture function), and ICOR (sequence boundaries). Metadata and reflexive documentation of coding are explicitly advocated.

Challenges persisting in this domain include the subjectivity inherent in gesture interpretation, boundary cases for communicative vs. extra-communicative gestures, high annotation costs, fragmented tool ecosystems, and insufficient adaptation of ASR for classroom acoustics. Future priority areas comprise:

  • Domain-adapted ASR for noisy, overlapping classroom speech
  • Computer vision modules for gesture detection and classification
  • Rich multimodal synchrony analyses integrating eye-gaze and digital traces
  • Open science via CC-BY multimodal corpus publication
  • Standardization via XML schemas (TEI + EMELD) for multimodal annotation
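As a rough illustration of the standardization goal, the sketch below serializes time-aligned units to a simplified, TEI-inspired XML layout. The element names are illustrative only and do not follow a validated TEI or EMELD schema:

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def annotation_to_xml(units: List[Dict]) -> str:
    """Serialize time-aligned annotation units (dicts with start, end,
    word, and optional gesture keys) to a TEI-inspired XML string."""
    root = ET.Element("annotationBlock")
    for u in units:
        seg = ET.SubElement(root, "seg",
                            start=f'{u["start"]:.2f}', end=f'{u["end"]:.2f}')
        ET.SubElement(seg, "w").text = u["word"]          # the word token
        if u.get("gesture"):                              # optional gesture tier
            ET.SubElement(seg, "kinesic", type=u["gesture"])
    return ET.tostring(root, encoding="unicode")
```

A real TEI export would use the spoken-transcription module's vocabulary and an ODD-defined schema; the point here is only that tier-aligned units map naturally onto nested, time-attributed XML elements.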

TRAIMA thus establishes both the theoretical infrastructure for explanatory multimodal discourse analysis and the methodological groundwork for increasingly automated, reproducible, and scalable approaches in the field of multimodal pedagogical interaction analysis (Rançon et al., 19 Jan 2026).
