
Pose Triplet Tokenization in Sign Language Recognition

Updated 9 February 2026
  • The method constructs pose triplet tokens by discretizing left-hand, right-hand, and body keypoints from continuous pose signals.
  • It employs a coupled VQ-VAE alongside a graph convolutional network to convert and encode pose data for masked token pre-training with Transformers.
  • Empirical results demonstrate state-of-the-art performance on multiple sign language benchmarks by bridging continuous visual inputs with discrete token modeling.

Pose triplet tokenization is an approach for constructing discrete, token-based representations from low-level continuous pose signals—specifically aimed at the modeling of sign language recognition using Transformer-based neural architectures. The method is designed both to capture the compositional structure inherent in sign language motion data and to bridge the gap between the continuous world of vision-based pose estimation and the discrete token sequences expected by state-of-the-art LLM pre-training objectives. The core innovation lies in structuring each frame as a pose "triplet" (combining left hand, right hand, and body keypoints), then discretizing these vectors via coupled vector quantization, enabling masked language modeling (MLM)-style pre-training on pose data (Zhao et al., 2023). This framework has been foundational for pre-training sign language encoders using Transformer architectures.

1. Definition and Mathematical Structure of Pose Triplets

At each video frame $t$, a "pose triplet" comprises:

  • Upper-body joints: $J_{\text{body},t} \in \mathbb{R}^{N_b \times 2}$
  • Left-hand joints: $J_{\text{left},t} \in \mathbb{R}^{N_h \times 2}$
  • Right-hand joints: $J_{\text{right},t} \in \mathbb{R}^{N_h \times 2}$

All keypoints are flattened and concatenated into a single vector:

$$t_t = [\mathrm{vec}(J_{\text{left},t});\; \mathrm{vec}(J_{\text{right},t});\; \mathrm{vec}(J_{\text{body},t})] \in \mathbb{R}^d$$

with $d = 2(2N_h + N_b)$. Typically, the sequence $(t_1, \ldots, t_T)$ is passed through a small graph convolutional network (GCN), yielding a per-frame feature $f_{\text{sign},t} \in \mathbb{R}^D$ (Zhao et al., 2023).
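A minimal sketch of the triplet construction above, using NumPy and hypothetical keypoint counts ($N_h = 21$, $N_b = 9$; the actual counts depend on the pose estimator used):

```python
import numpy as np

N_H, N_B = 21, 9  # hypothetical per-hand and upper-body keypoint counts

def pose_triplet(j_left, j_right, j_body):
    """Flatten each (N, 2) keypoint array and concatenate into t_t."""
    assert j_left.shape == (N_H, 2) and j_right.shape == (N_H, 2)
    assert j_body.shape == (N_B, 2)
    return np.concatenate([j_left.ravel(), j_right.ravel(), j_body.ravel()])

t_t = pose_triplet(np.zeros((N_H, 2)), np.zeros((N_H, 2)), np.zeros((N_B, 2)))
# d = 2 * (2 * N_h + N_b) = 2 * (42 + 9) = 102
assert t_t.shape == (2 * (2 * N_H + N_B),)
```

The GCN encoder that maps this vector sequence to $f_{\text{sign},t}$ is omitted here; only the concatenation order (left hand, right hand, body) matters for the quantization stage that follows.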

2. Coupled Tokenization via a Discrete VQ-VAE

To align pose triplet representation with discrete token frameworks, a coupled vector-quantized variational autoencoder (VQ-VAE) is employed:

  • Encoder: Produces a latent code $z_t \in \mathbb{R}^{3D_{\text{part}}}$, partitioned into three subvectors for the left hand, right hand, and body.
  • Quantizers: Two codebooks are learned:
    • $V_{\text{hand}} = \{h_k\}_{k=1}^{M_1}$ (shared by both hands) and $V_{\text{body}} = \{d_k\}_{k=1}^{M_2}$ (for the body).
  • Each segment $z_t^l$, $z_t^r$, $z_t^b$ is quantized to its nearest codebook entry.
  • The quantized vector $z_{q,t} = [h_{k_t^l};\, h_{k_t^r};\, d_{k_t^b}]$ is reconstructed via a learned decoder (Zhao et al., 2023).

The loss sums L1 reconstruction terms on the hand and body keypoints with codebook and commitment penalties:

$$\mathcal{L}_{\text{d-VAE}} = \mathcal{L}_{\text{hand}} + \beta_1 \mathcal{L}_{\text{body}} + \beta_2 \,\|\mathrm{sg}[z_t] - z_{q,t}\|_2^2 + \beta_3 \,\|z_t - \mathrm{sg}[z_{q,t}]\|_2^2$$

Each frame is thus indexed as a triplet $(k_t^l, k_t^r, k_t^b)$, discretizing the framewise pose (Zhao et al., 2023).
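The quantization step can be sketched as a nearest-neighbor lookup against the two codebooks. This is an illustrative NumPy version with hypothetical sizes ($D_{\text{part}} = 8$, $M_1 = 16$, $M_2 = 8$) and randomly initialized codebooks; a real implementation would train the codebooks with the straight-through estimator implied by the sg[·] terms above:

```python
import numpy as np

rng = np.random.default_rng(0)
D_PART = 8          # hypothetical per-part latent width
M1, M2 = 16, 8      # hypothetical codebook sizes
V_hand = rng.normal(size=(M1, D_PART))  # shared left/right-hand codebook
V_body = rng.normal(size=(M2, D_PART))

def nearest(code, codebook):
    """Index of the nearest codebook entry under L2 distance."""
    return int(np.argmin(np.linalg.norm(codebook - code, axis=1)))

def quantize_triplet(z_t):
    """Split z_t into (left, right, body) parts and quantize each one."""
    z_l, z_r, z_b = np.split(z_t, 3)
    k_l, k_r = nearest(z_l, V_hand), nearest(z_r, V_hand)
    k_b = nearest(z_b, V_body)
    z_q = np.concatenate([V_hand[k_l], V_hand[k_r], V_body[k_b]])
    return (k_l, k_r, k_b), z_q

z_t = rng.normal(size=3 * D_PART)
(k_l, k_r, k_b), z_q = quantize_triplet(z_t)
# commitment term ||z_t - sg[z_q]||^2 (sg is the identity in the forward pass)
commit = np.sum((z_t - z_q) ** 2)
```

Note the coupling: both hand segments index the same codebook $V_{\text{hand}}$, so left- and right-hand shapes share a vocabulary while the body uses its own.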

3. Pre-Training with Masked Unit Modeling (MUM)

The discrete pose triplet tokens produced by VQ-VAE enable pre-training via Masked Unit Modeling (analogous to MLM/BERT):

  • Masking: A random subset $\mathcal{M}$ of frames is selected; for each triplet component (left hand, right hand, body) of a selected frame, the latent is replaced with a learned mask embedding with 50% probability.
  • The input sequence $F_0$ consists of the (partially masked) pose embeddings plus positional encodings.
  • A stack of Transformer encoder blocks is applied, yielding contextualized frame representations $F_N$.

A cross-entropy loss is used to predict original quantized codebook indices for the masked tokens only:

$$\mathcal{L}_{\text{pre}} = -\mathbb{E}_{\mathcal{V}_{\text{sign}},\,\mathcal{M}}\Big[ \sum_{t\in\mathcal{M}^l} \log p_{\text{hand}}(k_t^l \mid f_{\text{out},t}^l) + \sum_{t\in\mathcal{M}^r} \log p_{\text{hand}}(k_t^r \mid f_{\text{out},t}^r) + \sum_{t\in\mathcal{M}^b} \log p_{\text{body}}(k_t^b \mid f_{\text{out},t}^b) \Big]$$

This process enables the Transformer to learn temporal and structural relations between pose triplet units (Zhao et al., 2023).
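The per-part masking scheme can be sketched as follows. This is a simplified NumPy illustration with hypothetical dimensions; the mask embedding is random here, whereas in the method it is a learned parameter, and the Transformer itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D_PART = 12, 8                    # hypothetical sequence length / part width
mask_emb = rng.normal(size=D_PART)   # stand-in for the learned mask embedding

def mask_parts(parts, frame_ids, p=0.5):
    """For each selected frame, independently replace each part embedding
    (left hand, right hand, body) with the mask embedding with prob. p.
    Returns the masked inputs and the (frame, part) prediction targets."""
    masked = parts.copy()
    targets = []  # positions whose codebook index the Transformer must predict
    for t in frame_ids:
        for part in range(3):
            if rng.random() < p:
                masked[t, part] = mask_emb
                targets.append((t, part))
    return masked, targets

parts = rng.normal(size=(T, 3, D_PART))             # per-part latents per frame
frame_ids = rng.choice(T, size=T // 2, replace=False)  # random subset M
masked, targets = mask_parts(parts, frame_ids)
```

The cross-entropy loss is then evaluated only at the positions in `targets`, matching the three masked-index sums in the pre-training objective.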

4. Integration with Downstream Sign Language Recognition

After pre-training:

  • The masked modeling head is removed.
  • An MLP classifier is attached, taking as input the average-pooled Transformer output over frames, for isolated sign recognition.
  • Optionally, outputs can be fused with parallel RGB-based models for multimodal classification.

Fine-tuning updates both the GCN pose encoder and the Transformer layers, using standard cross-entropy loss over the gloss vocabulary (Zhao et al., 2023).
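The downstream head reduces to average pooling followed by a classifier over the gloss vocabulary. A minimal NumPy sketch with hypothetical sizes, using a single linear layer in place of the MLP for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, N_GLOSS = 12, 64, 100   # hypothetical frames, hidden width, vocab size

f_N = rng.normal(size=(T, D))             # contextualized frame representations
W = rng.normal(size=(D, N_GLOSS)) * 0.01  # classifier weights (linear, not MLP)
b = np.zeros(N_GLOSS)

pooled = f_N.mean(axis=0)                 # average-pool over frames
logits = pooled @ W + b
log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax

def cross_entropy(log_probs, gloss_id):
    """Standard cross-entropy against the ground-truth gloss index."""
    return -log_probs[gloss_id]

loss = cross_entropy(log_probs, gloss_id=3)  # gloss_id chosen arbitrarily
```

During fine-tuning, gradients from this loss flow through the Transformer layers and back into the GCN pose encoder, as described above.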

5. Empirical Performance and Significance

Pose triplet tokenization achieves state-of-the-art results on four standard sign language recognition benchmarks, as demonstrated in the referenced work (Zhao et al., 2023). The approach enables learning hierarchical correlations among hand/body configurations and is robust to the low-level, continuous, and ambiguous nature of pose input—contrasting with classical NLP token units, which are inherently semantic and discrete. This formulation bridges the representational gap, providing a direct analogue for linguistic tokenization strategies (e.g., BERT) in visual-manual languages.

6. Relation to Other Tokenization Approaches

Pose triplet tokenization shares formal similarities with recent advances in subword and character-level tokenization in language and vision, notably factorized vector quantization schemes such as FACTORIZER (Samuel et al., 2023) and product-of-codebook modeling. It differs fundamentally in the granularity and semantics of the fundamental unit—focusing specifically on the multimodal, fine-grained compositionality of sign language motion. Classical subword tokenizers (BPE, unigram, and their variants) address symbolic string segmentation, while pose triplet tokenization addresses discretization of continuous spatiotemporal visual data.

The method stands in contrast to the hash-based sparse character trigram representations used in frameworks like T-FREE (Deiseroth et al., 2024) for LLMs, where trigrams originate from textual sequences. In pose triplet tokenization, the triplet is defined structurally by domain-specific articulation groups (left hand, right hand, body), and quantization is based on VQ-VAE codebooks learned from pose statistics, not text.

7. Impact and Broader Implications

The pose triplet tokenization paradigm enables the application of sequence modeling and masked-token pre-training—dominant techniques in NLP—to continuous, multimodal human communication data. It achieves new state-of-the-art accuracy for sign language recognition tasks and demonstrates the capacity of discrete token-based paradigms to encode, compress, and generalize across low-level perceptual inputs. A plausible implication is the extensibility of coupled tokenization and masked modeling techniques to other structured low-level signal domains (e.g., kinematic motion, biosignals) where similar compositionality exists (Zhao et al., 2023).
