Articulatory Kinematics Overview
- Articulatory kinematics is the quantitative modeling of time-varying speech articulator positions, linking neural planning and acoustic output through empirical and computational methods.
- It employs precise measurement techniques such as EMA, rtMRI, and motion capture to record 2D/3D trajectories with high temporal resolution and sub-millimeter accuracy.
- Computational models ranging from task-dynamics to neural factorization enable accurate inversion, synthesis, and control of articulatory motions for applications in speech therapy, synthesis, and audiovisual animation.
Articulatory kinematics refers to the quantitative description and mathematical modeling of the time-varying positions, velocities, and accelerations of speech articulators—such as the tongue, lips, jaw, and velum—during speech. The domain encompasses both empirical measurement (using electromagnetic articulography, ultrasound, MRI, or motion capture) and computational frameworks that allow the prediction, control, and synthesis of kinematic trajectories underpinning spoken language. Articulatory kinematics provides the physical bridge between neural planning, biomechanical vocal tract realization, and the resulting acoustic output, and is fundamental for understanding, modeling, and engineering spoken communication.
1. Measurement and Signal Representation
Articulatory kinematics are typically derived from direct measurement modalities such as electromagnetic articulography (EMA), ultrasound, or real-time magnetic resonance imaging (rtMRI). The most common configuration involves tracking the 2D or 3D coordinates of coils or fleshpoints attached to key articulators (e.g., tongue tip, dorsum, blade, upper/lower lip, jaw), sampled at rates from 50 to 250 Hz (Steiner et al., 2013, Singh et al., 2019, Kirkham et al., 21 Oct 2025). Each time series $X \in \mathbb{R}^{T \times D}$, with frames $x_t \in \mathbb{R}^D$, encodes the kinematic state of the vocal tract at $T$ time points across $D$ spatial channels. These raw trajectories are pre-processed via low-pass filtering, normalization (typically z-score, by utterance), and sometimes temporal alignment or co-registration with anatomical scans (Steiner et al., 2013).
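The following is a minimal sketch of the preprocessing steps just described (low-pass filtering and per-utterance z-scoring), assuming a hypothetical utterance array of shape `(T, D)`; the sampling rate, filter order, and cutoff are illustrative, not values from the cited studies.

```python
# Minimal preprocessing sketch for fleshpoint trajectories: zero-phase low-pass
# filtering followed by per-utterance z-scoring. Array shapes and filter
# settings are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_utterance(x, fs=200.0, cutoff=20.0):
    """Low-pass filter and z-score one utterance of articulator trajectories.

    x : (T, D) array of 2D/3D coil coordinates over time
    fs : sampling rate in Hz (EMA systems are typically 50-250 Hz)
    cutoff : low-pass cutoff in Hz; articulator motion is strongly band-limited
    """
    b, a = butter(4, cutoff / (fs / 2), btype="low")   # 4th-order Butterworth
    x_smooth = filtfilt(b, a, x, axis=0)               # zero-phase filtering
    mu, sigma = x_smooth.mean(axis=0), x_smooth.std(axis=0)
    return (x_smooth - mu) / (sigma + 1e-8)            # per-utterance z-score

# Example: 1.5 s of synthetic 12-channel EMA-like data at 200 Hz
ema = np.random.randn(300, 12).cumsum(axis=0)
ema_norm = preprocess_utterance(ema)
```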
Recent approaches also include source-related signals such as fundamental frequency and loudness, increasing the dimensionality of the kinematic space to capture both articulatory configuration and vocal source properties (Cho et al., 18 Jun 2024). In rtMRI-based systems, contour points on jaw, tongue, lips, velum, and larynx can yield up to 400-dimensional spatial frames, which are factorized into articulator-specific descriptors (Lian et al., 2022).
2. Mathematical and Computational Models
2.1 Task-Dynamic and Dynamical Systems Approaches
A dominant formalism for modeling articulatory kinematics is the Task-Dynamic framework, representing each articulator as a mass–spring–damper system evolving according to second-order ordinary differential equations (Kirkham, 7 Apr 2025, Kirkham, 19 Nov 2024, Kirkham et al., 21 Oct 2025):

$$m\,\ddot{x}(t) + b\,\dot{x}(t) + k\,(x(t) - x_0) = 0$$

Here, $x(t)$ is the kinematic state (e.g., tongue or lip position), $m$ is effective mass, $b$ is damping, $k$ is stiffness, and $x_0$ is a spatial target. A critically or under-damped regime is typically observed (Kirkham, 7 Apr 2025). Nonlinear extensions introduce a cubic restoring force to better match empirically observed symmetric velocity profiles:

$$m\,\ddot{x}(t) + b\,\dot{x}(t) + k\,(x(t) - x_0) + d\,(x(t) - x_0)^3 = 0$$
Scaling laws for the cubic term provide parameter-parsimonious, physically interpretable ways to model movement amplitude and timing across gestures of varying range (Kirkham, 19 Nov 2024).
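As a concrete illustration, the sketch below integrates the linear and cubic task-dynamic equations with `scipy.integrate.solve_ivp`; the parameter values (`m`, `b`, `k`, `d`, `x0`) are chosen only to show the qualitative behaviour and are not taken from the cited papers.

```python
# Sketch: forward-simulate a single gesture under the linear and cubic
# task-dynamic equations. Parameter values are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

def task_dynamics(t, state, m, b, k, x0, d=0.0):
    """Mass-spring-damper gesture: m*x'' + b*x' + k*(x - x0) + d*(x - x0)**3 = 0."""
    x, v = state
    acc = -(b * v + k * (x - x0) + d * (x - x0) ** 3) / m
    return [v, acc]

m, b, k, x0 = 1.0, 12.0, 80.0, 1.0          # under-damped gesture (b < 2*sqrt(m*k))
t_eval = np.linspace(0.0, 0.5, 200)
linear = solve_ivp(task_dynamics, (0.0, 0.5), [0.0, 0.0], t_eval=t_eval,
                   args=(m, b, k, x0, 0.0))
cubic = solve_ivp(task_dynamics, (0.0, 0.5), [0.0, 0.0], t_eval=t_eval,
                  args=(m, b, k, x0, 40.0))  # cubic term reshapes the velocity profile
```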
Symbolic regression over empirical kinematic corpora, such as XRMB or EMA, confirms that a second-order (inertia + elasticity) law fits the vast majority of real speech gestures (as measured by variance-weighted $R^2$), with nonlinearity required in approximately one-third of cases (Kirkham, 7 Apr 2025). Parameters map directly to gestural "settings" underlying phonological organization.
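A much simpler stand-in for that analysis, shown below under stated assumptions, is to estimate velocity and acceleration by finite differences and fit the second-order law by ordinary least squares; this is not the symbolic-regression procedure of the cited work, only an illustration of how well an inertia-plus-elasticity model can explain a measured trajectory.

```python
# Sketch: fit x'' = -(b/m)*x' - (k/m)*(x - x0) to a single gesture by least
# squares on finite-difference derivatives. A simplified stand-in for the
# symbolic-regression analysis referenced above.
import numpy as np

def fit_second_order(x, fs):
    dt = 1.0 / fs
    v = np.gradient(x, dt)                       # velocity
    a = np.gradient(v, dt)                       # acceleration
    # Regress acceleration on [velocity, position, 1]; the intercept absorbs k*x0/m.
    A = np.column_stack([v, x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, a, rcond=None)
    pred = A @ coef
    r2 = 1.0 - np.sum((a - pred) ** 2) / np.sum((a - a.mean()) ** 2)
    return coef, r2

# Demo on an analytic critically damped reach toward a target at x0 = 1
fs = 200.0
t = np.arange(0.0, 0.5, 1 / fs)
x = 1.0 - np.exp(-20 * t) * (1 + 20 * t)
coef, r2 = fit_second_order(x, fs)               # r2 close to 1 for this trajectory
```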
2.2 Articulatory Representation Learning
Neural approaches decompose high-dimensional kinematics into interpretable, temporally localized "gestures" and sparse gestural scores using convolutional nonnegative matrix factorization (neural NMF) (Lian et al., 2022, Lian et al., 2022). Formally:
- The articulator-by-time matrix $X \in \mathbb{R}^{D \times T}$ is decomposed as $X \approx WH$, where $W$ contains gestural primitives (e.g., tongue raising, lip closure) and $H$ is a sparse activation matrix (a simplified sketch follows this list).
- Guided factor analysis on rtMRI or contour data imposes anatomical structure, yielding factors for each articulator and corresponding score trajectories for efficient, disentangled representation (Lian et al., 2022).
- These latent trajectories support phoneme recognition and generative synthesis, often achieving high sparsity with minimal information loss.
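The sketch below illustrates the $X \approx WH$ decomposition with standard sparse NMF from scikit-learn; the cited work uses a convolutional neural NMF with guided factor analysis, so this is only a simplified stand-in on hypothetical nonnegative data, with the component count and penalty weights chosen arbitrarily.

```python
# Simplified sketch: factorize an articulator-by-time matrix X ~= W @ H with
# sparse NMF. W collects candidate gestural primitives; H is the gestural
# score. The cited approach is a *convolutional* neural NMF; this is not it.
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.randn(24, 500))          # hypothetical 24-channel, 500-frame matrix
model = NMF(n_components=8, init="nndsvda", l1_ratio=1.0,
            alpha_W=0.0, alpha_H=0.1, max_iter=500)   # L1 penalty on H -> sparse score
W = model.fit_transform(X)                    # (24, 8) gestural primitives
H = model.components_                         # (8, 500) sparse gestural score
sparsity = np.mean(H < 1e-3)                  # fraction of near-zero activations
print(f"score sparsity: {sparsity:.2f}")
```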
3. Inference, Synthesis, and Control
3.1 Acoustic-to-Articulatory Inversion
Self-supervised representation learning models (e.g., HuBERT, WavLM) encode internal structures that, via simple linear probes, recover universal articulatory kinematics across languages, speakers, and dialects (Cho et al., 2023, Cho et al., 18 Jun 2024). Inversion accuracy reaches high layer-wise correlation coefficients for single-speaker probes and remains strong under zero-shot cross-language or cross-gender transfer, demonstrating the near-universality of articulatory abstraction in modern SSL models (Cho et al., 2023, Cho et al., 18 Jun 2024). SPARC (Cho et al., 18 Jun 2024) provides an end-to-end framework mapping audio to 12–14 kinematic and source features, enabling high-fidelity speech synthesis and accent-preserving voice conversion, with strong reported WER and MOS across unseen speakers.
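A minimal sketch of the linear-probe idea, assuming frame-aligned SSL features and EMA targets have been extracted upstream (all arrays below are hypothetical placeholders), is to ridge-regress kinematic channels from one layer's features and report per-channel correlation:

```python
# Sketch: linear probe from frame-aligned self-supervised speech features
# (e.g., one WavLM/HuBERT layer) to EMA channels. Feature extraction and time
# alignment are assumed done elsewhere; the arrays are placeholders.
import numpy as np
from sklearn.linear_model import Ridge

feats_train = np.random.randn(10000, 768)     # (frames, SSL feature dim)
ema_train = np.random.randn(10000, 12)        # (frames, kinematic channels)
feats_test = np.random.randn(2000, 768)
ema_test = np.random.randn(2000, 12)

probe = Ridge(alpha=1.0).fit(feats_train, ema_train)
pred = probe.predict(feats_test)

# Per-channel Pearson correlation, the usual layer-wise probing metric
corr = [np.corrcoef(pred[:, i], ema_test[:, i])[0, 1] for i in range(pred.shape[1])]
print(np.mean(corr))
```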
3.2 Text-to-Articulatory Synthesis
Joint text-to-speech and articulation models exploit multilinear tongue models and HMMs or neural networks to generate 3D tongue-surface kinematics directly from linguistic features (Steiner et al., 2016). Synthesized trajectories achieve sub-3 mm Euclidean error relative to reference EMA data, with accurate temporal alignment (phone-duration RMSE on the order of 27 ms). Such models can decouple anatomical variation (speaker space) from articulation (pose space), supporting speaker adaptation and personalized multimodal synthesis.
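For reference, the two evaluation quantities mentioned above can be computed as follows; the array layouts are assumptions for illustration, not the cited evaluation code.

```python
# Sketch of the two evaluation metrics mentioned above: mean Euclidean distance
# between synthesized and reference fleshpoint trajectories, and RMSE of phone
# durations. Array layouts are hypothetical.
import numpy as np

def mean_euclidean_error(pred, ref):
    """pred, ref: (T, P, 3) arrays of P tracked points in 3D over T frames; returns mm
    if the inputs are in mm."""
    return np.linalg.norm(pred - ref, axis=-1).mean()

def duration_rmse(pred_durs, ref_durs):
    """Phone-duration RMSE in the same units as the inputs (e.g., ms)."""
    pred_durs, ref_durs = np.asarray(pred_durs), np.asarray(ref_durs)
    return np.sqrt(np.mean((pred_durs - ref_durs) ** 2))
```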
3.3 Direct Motor Control and Reinforcement Learning
Articulatory control can be formalized as a multi-DOF robotic manipulation problem, with explicit motor control of tongue, lips, and jaw positions in the midsagittal plane (Anand et al., 7 Oct 2025). Reinforcement learning (PPO) agents trained to maximize acoustic similarity via articulatory-to-acoustic decoding can learn interpretable articulatory trajectories reproducing target syllables, with converged audio similarities >0.85 and high transcription accuracy. Trajectories exhibit canonical phonetic stages (e.g., lip closure for /p/, tongue-tip elevation for /l/) and maintain biomechanical plausibility throughout (Anand et al., 7 Oct 2025).
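The skeleton below shows how such a control problem can be posed as a gymnasium-style environment: actions are per-step midsagittal articulator positions, and the reward is acoustic similarity of the decoded trajectory to a target. The decoder and similarity function here are hypothetical stubs, not the models used in the cited work; a PPO agent (e.g., from stable-baselines3) would then be trained on this environment.

```python
# Skeleton of the RL formulation: actions = midsagittal articulator positions,
# reward = acoustic similarity of decoded audio to a target syllable.
# decode_to_audio and acoustic_similarity are hypothetical stubs.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

def decode_to_audio(traj):
    """Hypothetical stand-in for an articulatory-to-acoustic decoder."""
    return np.concatenate(traj)

def acoustic_similarity(audio):
    """Hypothetical stand-in reward in [0, 1]; a real system would compare the
    decoded audio to the target with a learned similarity measure."""
    return float(np.exp(-np.linalg.norm(audio) / audio.size))

class ArticulatorEnv(gym.Env):
    """Midsagittal articulator positions as actions; episodic similarity reward."""

    def __init__(self, n_articulators=6, horizon=50):
        super().__init__()
        self.horizon = horizon
        dim = 2 * n_articulators                      # (x, y) per tracked articulator
        self.action_space = spaces.Box(-1.0, 1.0, shape=(dim,), dtype=np.float32)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(dim,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.traj = 0, []
        return np.zeros(self.observation_space.shape, dtype=np.float32), {}

    def step(self, action):
        self.traj.append(action)
        self.t += 1
        terminated = self.t >= self.horizon
        reward = acoustic_similarity(decode_to_audio(self.traj)) if terminated else 0.0
        return action.astype(np.float32), reward, terminated, False, {}

# Training would then follow the usual pattern, e.g. with stable-baselines3:
#   from stable_baselines3 import PPO
#   PPO("MlpPolicy", ArticulatorEnv()).learn(total_timesteps=1_000_000)
```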
4. Data-Driven Kinematic Decomposition and Gesture Scores
Sparse matrix factorization—both deep neural and classical—enables empirical discovery of gestural primitives underlying speech. Kinematic signals are typically factorized as $X \approx WH$, with $W$ capturing phonologically relevant movements (e.g., tongue raising, lip rounding) and $H$ forming a low-dimensional, time-sparse gestural score (Lian et al., 2022, Lian et al., 2022). These representations achieve high sparsity while retaining phoneme recognition accuracy close to raw EMA (PER ≈ 14%). Factor analysis (guided by articulator-specific masks) enhances anatomical interpretability, guaranteeing that learned gestures correspond to jaw, tongue, lips, etc. (Lian et al., 2022). Such sparse gestural scores facilitate data-efficient multimodal TTS, robust cross-speaker generalization, and theoretical quantification of gesture overlap, coarticulation, and motor primitives.
5. Articulatory Kinematics in Synthesis and Animation
Articulatory kinematic data are essential for high-fidelity audiovisual speech synthesis and talking avatars. EMA motion capture workflows involve denoising, co-registering kinematic data with MRI/dental scans, and driving 3D mesh deformation through a kinematic chain or spline-IK rig (Steiner et al., 2013). Linear-blend skinning propagates measured coil motion to the digital mesh, yielding sub-millimeter RMSE between measured and reconstructed vertex positions, with high temporal correlation (>0.9) in per-frame movements. Such models underlie applications in computer-assisted pronunciation training (CAPT), speech therapy, and phonetic research requiring visual feedback of internal articulation.
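Linear-blend skinning itself reduces to a weighted blend of per-bone rigid transforms applied to each rest-pose vertex; the sketch below shows that computation with numpy, where the blend weights and per-coil transforms are hypothetical inputs produced by the rigging stage.

```python
# Sketch of linear-blend skinning: each mesh vertex is deformed by a weighted
# blend of per-bone (per-coil) rigid transforms derived from the tracked coils.
# Weights and transforms are hypothetical inputs from the rigging pipeline.
import numpy as np

def skin_vertices(vertices, weights, transforms):
    """vertices: (V, 3) rest-pose mesh; weights: (V, B) blend weights summing to 1
    per vertex; transforms: (B, 4, 4) homogeneous per-bone transforms."""
    V = vertices.shape[0]
    homog = np.hstack([vertices, np.ones((V, 1))])            # (V, 4) homogeneous coords
    per_bone = np.einsum("bij,vj->bvi", transforms, homog)    # each bone moves every vertex
    blended = np.einsum("vb,bvi->vi", weights, per_bone)      # weight-blend across bones
    return blended[:, :3]
```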
6. Behavioral and Clinical Implications
Articulatory kinematics has direct impact on speech disorder assessment and therapy. Speech-inversion neural networks recover tract variables (lip aperture, tongue-tip degree/location, tongue-body degree/location) from audio, differentiating not only categories of misarticulation (e.g., derhotic /r/, dentalized /s/) but also magnitude of deviation, as validated against expert perceptual ratings (Benway et al., 2 Jul 2025). Quantitative proximity metrics between inferred kinematics and correct targets align with perceptual severity scales (PERCEPT rating), supporting objective, gradient tracking of articulatory improvement during intervention. The ability to recover interpretable kinematics from audio alone facilitates telepractice and broad access to articulatory biofeedback without restrictive instrumentation.
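A quantitative proximity metric of the kind described can be as simple as a per-variable RMS deviation between the inverted tract-variable trajectories and a correct-production template, as sketched below; the exact metric used in the cited clinical work may differ, so this is purely illustrative.

```python
# Sketch of a proximity score between tract variables inverted from a speaker's
# production and a correct-production template (e.g., for /r/ or /s/).
# Illustrative only; not the metric definition from the cited study.
import numpy as np

def proximity_score(inferred, template):
    """inferred, template: (T, K) tract-variable trajectories on a common time base.
    Returns per-variable RMS deviation; smaller values mean closer to target."""
    return np.sqrt(np.mean((inferred - template) ** 2, axis=0))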
7. Contextual and Applied Perspectives
Articulatory kinematics bridges low-level biomechanical modeling and high-level linguistic structure. Attention-based encoder–decoder networks (e.g., AstNet) model both the duration and amplitude scaling of kinematic trajectories as speaking rate varies, outperforming affine and DNN baselines in dynamic time warping distance and correctly predicting rate-specific undershoot/hyperarticulation (Singh et al., 2020). Cognitive-motor constraints, such as the planning-time encoding of target distance and nonlinearity (as in cubic task-dynamics), are supported by cross-linguistic universality of kinematic abstraction in self-supervised models (Kirkham, 19 Nov 2024, Cho et al., 2023). Such findings suggest that articulatory kinematics constitutes a compact, cross-domain control space for interpretability, multilingual robustness, and physical grounding in both engineering and behavioral science contexts.
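Since dynamic time warping (DTW) distance is the evaluation measure cited for rate-dependent trajectory prediction, a minimal reference implementation is sketched below; it is a straightforward O(T1·T2) dynamic program over multichannel frames, not the evaluation code of the cited study.

```python
# Minimal dynamic time warping (DTW) distance between a predicted and a
# reference kinematic trajectory, the evaluation metric mentioned above.
import numpy as np

def dtw_distance(a, b):
    """a: (T1, D), b: (T2, D). Returns the accumulated DTW alignment cost."""
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])         # frame-wise distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]
```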