
TRAIMA: Automating Multimodal Classroom Interactions

Updated 26 January 2026
  • TRAIMA is a project that automates analysis of multimodal classroom interactions through integrated transcription and machine learning approaches.
  • It combines transcription conventions with ML techniques like HMMs, CRFs, and LSTMs to label verbal, paraverbal, and non-verbal cues in educational settings.
  • The project enhances pedagogical research by streamlining data annotation and offering robust frameworks to overcome manual transcription challenges.

The TRAIMA project (TRaitement Automatique des Interactions Multimodales en Apprentissage, "automatic processing of multimodal interactions in learning") investigates the methodological and technological underpinnings required to automate the analysis of multimodal interactions in classroom settings. Conducted from March 2019 to June 2020, its primary objective was to address the challenge posed by the manual annotation of verbal, paraverbal, and non-verbal components, specifically in explanatory and collaborative pedagogical episodes, where the complexity and scale of multimodal data make manual approaches inefficient and difficult to scale. Through a detailed survey of transcription conventions, empirical analysis of French-language classroom corpora, and infrastructure development (TechnéLAB), TRAIMA elaborates a methodology and framework for the future integration of machine learning in the automated processing and categorization of classroom interactions, especially explanatory discourse sequences in French as a Foreign Language (FLE) and French as a First Language (FLM) contexts (Rançon et al., 19 Jan 2026).

1. Theoretical Foundation: Explanatory Discourse and Multimodality

TRAIMA’s analytical core rests on a precise linguistic and interactional definition of the explanatory discourse sequence as a tripartite structure. Drawing from interactional didactics (Baker 1992; Barbieri et al. 1990), an “explanation” is modeled as a three-part sequence aimed at resolving a comprehension obstacle:

$$\text{ExpSeq} = \langle \text{Opening},\ \text{Core},\ \text{Closure} \rangle$$

  • Opening: Problematisation and formulation of the explanandum (subject to be explained)
  • Core (Noyau/Explanans): The explanation proper, where the comprehension obstacle is addressed
  • Closure: A mark of reception or ratification by learners

Macro-propositional variants elaborate the sequence as a series of phases:

$$\text{Macro-ExpSeq} = \bigl\langle \underbrace{\text{Phase 0: Schématisation initiale}}_{M_0},\ \underbrace{\text{Phase 1: Problème / Question}}_{M_1},\ \underbrace{\text{Phase 2: Explication / Réponse}}_{M_2},\ \underbrace{\text{Phase 3: Conclusion–Évaluation}}_{M_3} \bigr\rangle$$

TRAIMA distinguishes de dicto causality (semantic causal links that are not explicitly operational) from de re causality (causal relations that are actually manipulated). Each phase is marked multimodally: verbal cues, prosodic contouring, and characteristic gestures or postures per segment. No instance was observed where gesture or prosody alone constituted a full explanans; paraverbal and non-verbal modalities always accompanied verbal content.
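These structures lend themselves to a simple data model. A minimal sketch, assuming hypothetical type and field names (the report does not prescribe any implementation):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Phase(Enum):
    """Macro-propositional phases of an explanatory sequence (M0-M3)."""
    SCHEMATISATION = 0   # Phase 0: initial schematisation
    PROBLEME = 1         # Phase 1: problem / question (explanandum)
    EXPLICATION = 2      # Phase 2: explanation / answer (explanans)
    CONCLUSION = 3       # Phase 3: conclusion-evaluation


@dataclass
class Segment:
    """One multimodally annotated stretch of talk within a phase."""
    phase: Phase
    verbal: str                                      # transcribed speech
    prosodic_cues: List[str] = field(default_factory=list)
    gesture_cues: List[str] = field(default_factory=list)


@dataclass
class ExplanatorySequence:
    """Tripartite structure: ExpSeq = <Opening, Core, Closure>."""
    opening: List[Segment]   # problematisation of the explanandum
    core: List[Segment]      # the explanans proper
    closure: List[Segment]   # ratification by learners
```

Modeling phases as an enum makes the forward ordering M0 through M3 explicit, which matters for the sequence-labeling approaches discussed later.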

2. Transcription Conventions and Annotation Methodologies

TRAIMA delivers a detailed review of five leading transcription conventions, evaluating their expressive power and operational feasibility in multimodal didactic contexts:

| Convention | Multimodal Features Coded | Principal Strengths | Principal Limitations |
|---|---|---|---|
| ICOR | Verbal, paraverbal, gesture/posture, spatial | Explicit multimodal alignment; introductory | No fixed gesture taxonomy; labor-intensive |
| Mondada | Gesture trajectory, gaze, strict temporal alignment | Fine granularity; aligns gesture/verbal | High cognitive load; little prosodic support |
| GARS | Verbal, minimal non-verbal (notes) | Simple; computationally light | Non-verbal marginal; little prosody support |
| VALIBEL | Verbal (plus orality cues), limited non-verbal (notes) | Faithful to orality; Praat integration | Gesture/gaze largely absent |
| Ferré | Verbal, prosody, gesture (functional taxonomy) | Integrated analysis; gesture categorization | Complex; requires multiple tools |

TRAIMA’s framework adopts a hybrid approach: GARS/VALIBEL for verbal turns; Mondada for gesture/posture/gaze; Ferré for gesture function; ICOR for sequence boundaries. Minimal analytical units consist of word + prosodic event + gesture event + proxemic position integrated on a common time grid, managed via ELAN or EXMARaLDA, and exported to Praat for prosodic sub-tiers.
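The minimal analytical unit described above can be sketched as a record on a shared timeline. The following is a hypothetical illustration; the field names, tier labels, and overlap-based alignment are assumptions, not TRAIMA's implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AnnotationUnit:
    """Minimal analytical unit: one word aligned with co-occurring
    prosodic, gestural, and proxemic events on a common time grid."""
    start: float                      # seconds on the shared timeline
    end: float
    word: str
    prosody: Optional[str] = None     # e.g. a pitch-accent label from a Praat tier
    gesture: Optional[str] = None     # e.g. a Ferré functional category
    proxemics: Optional[str] = None   # e.g. a position label


def overlaps(a_start: float, a_end: float, b_start: float, b_end: float) -> bool:
    """True if the two half-open intervals share any time span."""
    return a_start < b_end and b_start < a_end


def align(words: List[AnnotationUnit],
          events: List[Tuple[float, float, str, str]]) -> List[AnnotationUnit]:
    """Attach each (start, end, tier, label) event to every word unit
    it temporally overlaps, mimicking tier integration in ELAN."""
    for w in words:
        for (s, e, tier, label) in events:
            if overlaps(w.start, w.end, s, e):
                setattr(w, tier, label)
    return words
```

In practice a tool like ELAN stores each modality on its own tier; the overlap join above is only a conceptual stand-in for that time-grid integration.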

3. Data Infrastructure and Empirical Corpora

The empirical base comprises two principal corpora:

  • INTER-EXPLIC corpus (Univ. Toulouse II, 2006): ≈30 hours of FLE/FLM classroom video recordings, >100 tripartite explanations in varied teacher–learner configurations, multi-camera video and synchronized lapel microphones, annotated in ELAN with ICOR conventions.
  • EXPLIC-LEXIC corpus (Univ. Poitiers, 2016): Lexically-focused explanatory dataset designed for automatic transcription, integrating video, audio, and digital whiteboards, and optimized for ASR adaptation.

Manual annotation adheres to ICOR for sequence segmentation, Mondada conventions for gesture alignment, and Ferré’s taxonomy for gesture function (iconique, déictique, métaphorique, emblème, battement). Inter-annotator agreement metrics such as Cohen's κ or Krippendorff's α are not reported.
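Were agreement to be reported, Cohen's κ over two annotators' gesture-function labels would follow the standard formula κ = (p_o − p_e) / (1 − p_e). A minimal sketch, not part of the TRAIMA report:

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators' categorical labels over the
    same items: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from the marginals."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1.0:           # degenerate case: both marginals fully agree
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Krippendorff's α would be preferable with more than two annotators or missing labels, at the cost of a more involved computation.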

4. Machine Learning and Automation Prospects

TRAIMA’s automation strategy is exploratory, detailing methodological candidates rather than formal implementations or results. The candidate techniques are sequence-labeling models such as Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and LSTM networks for tagging verbal, paraverbal, and non-verbal cues. The report does not specify concrete evaluation metrics, feature sets, or parameter choices. The strategy is underpinned by the recognition that transcription bottlenecks are both technical (tool-chain fragmentation, limited domain-adapted ASR) and theoretical (interpretative variability in gesture/function labeling).
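As a hypothetical illustration of this family of techniques, and not an implementation from TRAIMA, a tiny HMM with Viterbi decoding could label observed cue symbols with macro-phases M0-M3. All states, cue symbols, and probabilities below are invented:

```python
import math

STATES = ["M0", "M1", "M2", "M3"]  # macro-phases of an explanatory sequence


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most probable phase sequence for a list of cue symbols,
    using log-probabilities to avoid underflow."""
    V = [{s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in STATES}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = V[t - 1][prev] + math.log(trans_p[prev][s] * emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(STATES, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Invented toy parameters: phases tend to advance M0 -> M1 -> M2 -> M3,
# and each phase preferentially emits one characteristic cue symbol.
START = {"M0": 0.7, "M1": 0.1, "M2": 0.1, "M3": 0.1}
TRANS = {"M0": {"M0": 0.3, "M1": 0.5, "M2": 0.1, "M3": 0.1},
         "M1": {"M0": 0.05, "M1": 0.3, "M2": 0.55, "M3": 0.1},
         "M2": {"M0": 0.05, "M1": 0.1, "M2": 0.5, "M3": 0.35},
         "M3": {"M0": 0.1, "M1": 0.1, "M2": 0.2, "M3": 0.6}}
EMIT = {"M0": {"schema": 0.7, "question": 0.1, "explain": 0.1, "ratify": 0.1},
        "M1": {"schema": 0.1, "question": 0.7, "explain": 0.1, "ratify": 0.1},
        "M2": {"schema": 0.05, "question": 0.1, "explain": 0.75, "ratify": 0.1},
        "M3": {"schema": 0.1, "question": 0.1, "explain": 0.1, "ratify": 0.7}}
```

CRFs or LSTMs would replace the hand-set tables with learned feature weights, but the underlying task, mapping a time-ordered cue stream to phase labels, is the same.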

5. Functional Insights into Multimodal Explanation

Manual analysis in TRAIMA yields several confirmed functions for non-verbal modalities in explanatory discourse:

  • Communicative load: Approximately one third of the communicative load is carried verbally and two thirds non-verbally (citing Lazaraton 2004).
  • Kinesic and proxemic resources: Used to segment discourse (opening, core, closure), anchor key terms via deictic pointing, and reinforce causal connections through iconic metaphors.
  • Synchrony: Optimal comprehension occurs when gesture apexes are temporally aligned with prosodic emphasis on keywords. Misalignment (e.g., gesture delay) correlates with learner calls for clarification.

No evidence was found for purely gestural or prosodic explanations substituting for verbal contributions in classroom settings.
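The synchrony observation can be operationalised by measuring the offset between each gesture apex and the nearest prosodic peak. A hypothetical sketch with an arbitrary 200 ms tolerance (the report specifies no threshold):

```python
from typing import List, Tuple


def apex_offsets(gesture_apexes: List[float],
                 prosodic_peaks: List[float],
                 threshold: float = 0.2) -> List[Tuple[float, float, float, bool]]:
    """Pair each gesture apex (in seconds) with the nearest prosodic peak
    and flag pairs whose absolute offset exceeds the tolerance.
    A positive offset means the gesture lags the prosodic emphasis."""
    report = []
    for apex in gesture_apexes:
        nearest = min(prosodic_peaks, key=lambda p: abs(p - apex))
        offset = apex - nearest
        report.append((apex, nearest, offset, abs(offset) > threshold))
    return report
```

Flagged pairs would mark the misalignments (e.g. gesture delay) that the analysis found to correlate with learner calls for clarification.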

6. TechnéLAB Platform: Research Infrastructure

TRAIMA is integrated within TechnéLAB (Univ. Poitiers), which constitutes both a multimodal data capture facility and a testbed for annotation and automation. Key technological resources include:

  • Capture hardware: Multi-cam HD video, individual lapel and room microphones, eye-tracking, precise geolocation sensors, interactive digital whiteboards.
  • Synchronization and storage: Audio-video streams time-stamped and stored with redundant backup (NTFS).
  • Software and automation: Praat for prosody, ELAN/ANVIL for annotation; in-house scripts for ingestion/conversion; future roadmap toward chained processing (ASR, tier association, automated gesture detection).
  • Annotation support: Web-based session cataloging, annotation progress tracking, emerging dashboards for inter-annotator agreement, and API hooks for integration of ML classifiers.
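The roadmap's chained processing can be pictured as composable stages. A minimal sketch with placeholder stage functions; all names and payload keys here are invented:

```python
from typing import Callable, Dict

Stage = Callable[[Dict], Dict]  # each stage enriches a session record


def chain(*stages: Stage) -> Stage:
    """Compose processing stages into one pipeline, mirroring the
    roadmap's ASR -> tier association -> gesture detection chain."""
    def pipeline(session: Dict) -> Dict:
        for stage in stages:
            session = stage(session)
        return session
    return pipeline


# Placeholder stages standing in for real components (hypothetical):
def asr(session):       return {**session, "transcript": "..."}
def tiers(session):     return {**session, "tiers": ["verbal", "prosody"]}
def gestures(session):  return {**session, "gestures": []}


process = chain(asr, tiers, gestures)
```

Keeping each stage a pure function over a session record makes it straightforward to swap in an ML classifier behind the API hooks mentioned above.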

7. Methodological Recommendations and Future Research Trajectories

TRAIMA recommends a rigorously hybrid transcription convention that accommodates the strengths of GARS/VALIBEL (verbal, orality), Mondada (gesture, posture, gaze), Ferré (gesture function), and ICOR (sequence boundaries). Metadata and reflexive documentation of coding are explicitly advocated.

Challenges persisting in this domain include the subjectivity inherent in gesture interpretation, boundary cases for communicative vs. extra-communicative gestures, high annotation costs, fragmented tool ecosystems, and insufficient adaptation of ASR for classroom acoustics. Future priority areas comprise:

  • Domain-adapted ASR for noisy, overlapping classroom speech
  • Computer vision modules for gesture detection and classification
  • Rich multimodal synchrony analyses integrating eye-gaze and digital traces
  • Open science via CC-BY multimodal corpus publication
  • Standardization via XML schemas (TEI + EMELD) for multimodal annotation
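As a rough illustration of the standardization goal, the sketch below serializes time-aligned units to a simplified, TEI-inspired XML layout. The element names are illustrative only and do not follow a validated TEI or EMELD schema:

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def annotation_to_xml(units: List[Dict]) -> str:
    """Serialize time-aligned annotation units (dicts with start, end,
    word, and optional gesture keys) to a TEI-inspired XML string."""
    root = ET.Element("annotationBlock")
    for u in units:
        seg = ET.SubElement(root, "seg",
                            start=f'{u["start"]:.2f}', end=f'{u["end"]:.2f}')
        ET.SubElement(seg, "w").text = u["word"]          # the word token
        if u.get("gesture"):                              # optional gesture tier
            ET.SubElement(seg, "kinesic", type=u["gesture"])
    return ET.tostring(root, encoding="unicode")
```

A real TEI export would use the spoken-transcription module's vocabulary and an ODD-defined schema; the point here is only that tier-aligned units map naturally onto nested, time-attributed XML elements.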

TRAIMA thus establishes both the theoretical infrastructure for explanatory multimodal discourse analysis and the methodological groundwork for increasingly automated, reproducible, and scalable approaches in the field of multimodal pedagogical interaction analysis (Rançon et al., 19 Jan 2026).
