
MMS: MultiModal Signstream Framework

Updated 10 March 2026
  • MultiModal Signstream (MMS) is a formal framework for encoding sign language animations using parallel gloss channels and precise geometric parameters.
  • The framework employs both absolute and relative timing controls to support fine-grained synchronization in 3D animation pipelines like Blender.
  • Future extensions aim to incorporate facial and hip inflections to overcome current limits in achieving naturalistic avatar expression.

MultiModal Signstream (MMS) is a formal representation framework for sign language animation and processing that generalizes traditional gloss-based approaches to encode parallel channel activity, fine-grained timing, and geometric inflections. MMS is primarily realized as an open specification and as an animation input format for the MMS-Player software system, which enables parametric, data-driven animation of sign language avatars within the Blender 3D authoring ecosystem (Nunnari et al., 22 Jul 2025). MMS has emerged as a response to the limitations of single-channel gloss lists, supporting multi-articulator, temporally detailed, and inflection-rich content generation.

1. MMS Data Model and Structure

A MultiModal Signstream instance is a two-dimensional table $M$ with $N$ rows (interpreted as the fundamental temporal units, typically one per sign) and $C = 45$ columns, where each column captures either a gloss index, a timing marker, or a geometric parameter. Each row $r_k$ ($k = 1 \dots N$) has the general form:

$$r_k = (g_\mathrm{main},\ g_\mathrm{dom},\ g_\mathrm{ndom},\ t_\mathrm{start},\ t_\mathrm{end},\ \Delta t,\ \delta,\ F_\mathrm{inflection})$$

where:

  • $g_\mathrm{main} \in \Sigma$ is the main gloss (primary sign identification)
  • $g_\mathrm{dom},\ g_\mathrm{ndom} \in \Sigma \cup \{\bot\}$ are optional dominant- and non-dominant-hand override glosses
  • $t_\mathrm{start},\ t_\mathrm{end} \in \mathbb{R}_{+}$ are absolute timestamps (in seconds) delineating the sign interval on the animation timeline
  • $\Delta t \in \mathbb{R}_{+}$ is the transition delay following the previous sign
  • $\delta$ is the playback duration, either absolute ($\delta \in \mathbb{R}_{+}$, in seconds) or relative (a percentage of the nominal gloss duration)
  • $F_\mathrm{inflection}$ is a set of inflection parameters keyed to the affected body parts (e.g., coordinate displacements and rotations of the hands, shoulders, torso, and head)

The MMS schema includes position- and rotation-oriented deltas for both hands, the shoulders, torso, and head. The gloss channels $g_\mathrm{main}$, $g_\mathrm{dom}$, and $g_\mathrm{ndom}$ facilitate the encoding of simultaneous, parallel sign streams, supporting up to three independent motion planes (typically the full body and the two hands), with the special gloss `<HOLD>` serving as a frame-hold mechanism.

This table-based structure enables discrete, temporally aligned, and parameterized representation of complex sign sequences, extending expressivity well beyond linear gloss lists.
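The row structure above can be sketched in code. The following is a minimal illustration, not the MMS-Player implementation; the class and field names are hypothetical and simply mirror the formal tuple $r_k$:

```python
from dataclasses import dataclass, field
from typing import Optional

HOLD = "<HOLD>"  # special gloss: hold the articulator's current frame

@dataclass
class MMSRow:
    """One temporal unit (typically one sign) of an MMS table."""
    main_gloss: str                      # g_main: primary sign identification
    dom_gloss: Optional[str] = None      # g_dom: dominant-hand override gloss
    ndom_gloss: Optional[str] = None     # g_ndom: non-dominant-hand override
    t_start: Optional[float] = None      # absolute start time (seconds)
    t_end: Optional[float] = None        # absolute end time (seconds)
    transition: float = 0.0              # Δt: delay after the previous sign
    duration: Optional[str] = None       # δ: e.g. "1.2" (s) or "120%" (relative)
    inflections: dict = field(default_factory=dict)  # F_inflection deltas

# A sign whose dominant hand holds its pose while the body animates,
# stretched to 120% of the gloss's nominal duration:
row = MMSRow(main_gloss="HOUSE", dom_gloss=HOLD, transition=0.2,
             duration="120%", inflections={"domhandreloc_x": 0.1})
```

The optional channels default to empty ($\bot$), so a plain single-channel gloss list is just the degenerate case in which only `main_gloss` is populated.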

2. File Format, Synchronization, and Temporal Control

In practice, the MMS data table is implemented as a CSV or TSV file, with the first line declaring the full column set. Each sign occupies one row, populated with gloss and parameter values: numeric fields (e.g., `domhandreloc_x`, `headrot_z`) and structured triples for coordinate or rotational values.
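An illustrative two-row fragment might look as follows; the header shown here is a hypothetical subset of the full 45-column set, which is defined by the specification:

```csv
maingloss,domgloss,ndomgloss,framestart,frameend,transition,duration,domhandreloc_x,headrot_z
HOUSE,,,0.0,1.4,,,,
BIG,<HOLD>,,,,0.2,120%,0.10,15.0
```

The first row uses absolute timing (`framestart`/`frameend`), while the second is placed relative to its predecessor via `transition` and a percentage `duration`.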

Synchronization of the parallel gloss channels is managed by explicit placement of each sign’s keyframes on the Blender global timeline according to their $[t_\mathrm{start}, t_\mathrm{end}]$ intervals. If only relative timing (transition and duration) is specified, absolute times for each sign segment are determined by:

$$t_{\mathrm{start},k} = t_{\mathrm{end},k-1} + \Delta t_k$$

$$t_{\mathrm{end},k} = t_{\mathrm{start},k} + \delta_k$$

This supports both rigid interval sequencing and “stretch-and-squash” time-modification of sub-clips, aligning with motion capture (MoCap) sources of variable lengths.
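The relative-to-absolute timing resolution above can be sketched as a short routine. This is an illustrative reimplementation of the two formulas, not MMS-Player code; the function name and input layout are assumptions:

```python
def resolve_timeline(rows, nominal_durations):
    """Compute absolute [t_start, t_end] intervals from relative timing.

    rows: list of dicts with "gloss", "transition" (Δt, seconds), and
          "duration" (δ: absolute seconds, or a relative "NN%" string).
    nominal_durations: gloss -> nominal clip length in seconds.
    """
    timeline = []
    prev_end = 0.0
    for r in rows:
        dur = r["duration"]
        if isinstance(dur, str) and dur.endswith("%"):
            # Relative duration: percentage of the gloss's nominal length
            dur = nominal_durations[r["gloss"]] * float(dur[:-1]) / 100.0
        t_start = prev_end + r.get("transition", 0.0)  # t_start,k = t_end,k-1 + Δt_k
        t_end = t_start + dur                          # t_end,k   = t_start,k + δ_k
        timeline.append((t_start, t_end))
        prev_end = t_end
    return timeline

# Two signs: the second starts 0.25 s after the first ends and is
# stretched to 150% of its nominal MoCap length (0.8 s -> 1.2 s).
tl = resolve_timeline(
    [{"gloss": "HOUSE", "transition": 0.0, "duration": 1.0},
     {"gloss": "BIG", "transition": 0.25, "duration": "150%"}],
    {"HOUSE": 1.0, "BIG": 0.8},
)
```

Percentage durations are what give the format its "stretch-and-squash" behavior over MoCap clips of varying nominal length.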

Parallelism in the MMS model extends to articulator-specific override glosses, enabling, for instance, a “maingloss” for the body with separate “domgloss” and “ndomgloss” for single-arm sub-animations. Non-manual markers ("inflections") are numerically specified within the same MMS table, obviating external phonetic or HamNoSys-style descriptions.

3. MMS-Player Realization Pipeline

MMS-Player is a suite of Python scripts embedded in the Blender 3.x environment or operating headlessly for automated generation. The realization process, invoked either batch-wise (main.py) or via a Flask-powered HTTP API (Serve.py), parses MMS files and performs the following key steps for each sign row:

  1. Loads the relevant Blender action file for the gloss
  2. Resamples the action to target duration using uniform keyframe time remapping
  3. Applies geometric inflections by inserting inverse kinematics (IK) controllers, baking controller adjustments, and mapping them back to bone-level animation
  4. Inserts the processed action into the Non-Linear Animation (NLA) track at the computed frame slot
  5. Cleans up intermediate or temporary actions and controllers
  6. If specified, exports the sequence as .mp4 (via FFmpeg), .fbx, .blend, or JSON animation data
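Step 2 (resampling an action to its target duration) amounts to a uniform rescaling of keyframe times. A minimal standalone sketch, assuming keyframes as `(time, value)` pairs rather than Blender's `bpy` action data structures:

```python
def remap_keyframes(keyframes, target_duration):
    """Uniformly rescale keyframe times so the clip spans target_duration.

    keyframes: list of (t, value) pairs in seconds, first key at t = 0.
    """
    if not keyframes:
        return []
    src_duration = keyframes[-1][0]
    scale = target_duration / src_duration
    # Uniform time remapping: every key keeps its value, times are scaled
    return [(t * scale, v) for t, v in keyframes]

clip = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]   # a 1 s clip peaking mid-way
stretched = remap_keyframes(clip, 1.5)          # stretch to 1.5 s
# stretched == [(0.0, 0.0), (0.75, 1.0), (1.5, 0.0)]
```

Because the remapping is uniform, the relative shape of the motion is preserved; only its overall pace changes, matching the "stretch-and-squash" timing model described above.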

The HTTP API exposes a /realize endpoint accepting multipart form-data uploads of MMS files, Blender avatar rigs, and output format requests, providing asynchronous processing and progress polling.

Inflections are mapped to bone motion by strategy classes (e.g., TrajectoryTarget, LocalRotationTarget, RelativeLocRotTarget), with root-bone anchoring (e.g., `domhandreloc_xyz` mapped to Hand.L/Hand.R relative to Spine2). For each keyframe $t_i$, a rigid-body transform of the form $p'_i = R \cdot p_i + T$, $q'_i = q_i \oplus \Delta q$ is applied, enabling precise realization of spatial inflections.
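The positional part of that transform, $p'_i = R \cdot p_i + T$, can be illustrated with plain Python. This is a toy sketch (rotation restricted to the z-axis for brevity), not the strategy-class implementation:

```python
import math

def rotate_z(p, angle_deg):
    """Rotate a 3D point about the z-axis (one simple choice of R)."""
    a = math.radians(angle_deg)
    x, y, z = p
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def apply_inflection(positions, angle_deg, translation):
    """Apply the rigid transform p' = R·p + T to every keyframe position."""
    tx, ty, tz = translation
    out = []
    for p in positions:
        x, y, z = rotate_z(p, angle_deg)   # R·p
        out.append((x + tx, y + ty, z + tz))  # + T
    return out

# Displace a hand trajectory 0.1 units along x, with no rotation
# (akin to a domhandreloc_x inflection applied to every keyframe):
traj = [(0.0, 0.0, 1.0), (0.2, 0.1, 1.1)]
shifted = apply_inflection(traj, 0.0, (0.1, 0.0, 0.0))
```

The rotational part $q'_i = q_i \oplus \Delta q$ is the quaternion analogue: the same delta rotation composed onto each keyframe's orientation.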

4. Advantages and Extensions Beyond Gloss Lists

MMS provides several critical enhancements compared to traditional gloss systems:

  • Multi-articulator parallelism: The three gloss channels permit simultaneous, independently timed movement sequences for full body and both arms.
  • Fine-grained temporal control: Absolute (framestart, frameend) and relative (transition, duration) fields enable arbitrary temporal alignment, time scaling, and MoCap retiming.
  • Numerical inflection encoding: All non-manuals (head, torso, shoulders) are incorporated as numeric fields, eliminating external annotation dependencies and facilitating parametric or programmatic manipulation.
  • Immediate integration with 3D pipelines: The format lends itself directly to real-time or batch realization in Blender, with simple conversion to video, FBX, and other 3D formats.

A plausible implication is that MMS bridges the "phonetic-phonological gap" in sign language avatar synthesis, providing enough low-level parameterization for naturalistic, expressive generation without the need for discrete script translation stages.

5. Practical Constraints and Empirical Evaluation

MMS-Player implements several efficiency optimizations, such as batch creation of IK controllers and minimizing baking passes per animation row. In a user study (N=5), key practical challenges were identified:

  • Absence of facial animation resulted in incomplete perceptual realism, with participants refusing to rate signing correctness without facial cues.
  • Omission of hip motion led to exaggerated torso lean, while default Blender interpolation could introduce visible "jitter" between non-overlapping sign units.
  • Excessive time compression for sentence-level signing produced unnatural pacing.

These highlight the limits of current MMS parameterization (notably lacking explicit hip and facial inflections), as well as the challenges of synthesizing natural flow with keyframe-based retiming and default interpolation mechanisms.

Planned extensions include hip-channel support, ad-hoc transition routines, and explicit facial channel parameterization, aiming to address the documented limitations in inflectional expressivity and transition smoothness.

6. Applications and Ecosystem Integration

MMS-Player supports multiple output modalities, including:

  • Direct animation rendering as .mp4 via FFmpeg
  • Export as FBX for use in game engines (Unity/Unreal)
  • .blend project files for further manual editing or refinement
  • Raw structured animation data (JSON) for downstream computational pipelines

Software distribution is under GPL-3.0 and is available from DFKI-SignLanguage’s MMS-Player repository. The MMS representation is designed for broad adoption in automated, data-driven, parametric generation of sign language animation, facilitating experimentation in computer sign linguistics, communication accessibility, and human-computer interaction contexts.

7. Relation to Multi-expert Sign Language Translation Architectures

While MMS is specifically an animation and representation format for avatar synthesis, there is conceptual resonance with recent advances in automated sign language translation leveraging multi-stream and multi-expert approaches, as exemplified by MultiStream-LLM (Thomas et al., 20 Aug 2025). These translation systems employ parallel expert predictors for continuous sign, fingerspelling, and lipreading streams, fusing their outputs for robust sentence-level translation.

A plausible implication is that MMS, with its support for parallel temporal channels and fine-grained inflections, could provide a suitable formalism for representing or generating the type of multi-modal, temporally aligned annotation data required for training or evaluating such models, especially where fine discriminations between manual and non-manual components are critical.
