Canonical Semantic Form (CSF)
- CSF is a language-neutral semantic framework that employs a structured 9-tuple of discrete slots to decompose utterances for precise sign language generation.
- It features a detailed condition taxonomy with 35 classes across 8 categories to capture nuanced, context-dependent linguistic expressions.
- A lightweight Transformer-based slot extractor achieves over 99% accuracy, enabling parameter-efficient, real-time crosslingual translation.
Canonical Semantic Form (CSF) is a language-agnostic semantic representation designed to enable direct translation from any source language to sign languages without mediating through English. CSF represents utterances as a structured 9-tuple of discrete semantic slots, supporting precise, universal mapping suitable for multilingual sign language generation. It is constructed to facilitate parameter-efficient, real-time processing and provides explicit decomposition of utterances for nuanced linguistic phenomena, notably conditional expressions, across a broad typological spectrum (Bao, 5 Jan 2026).
1. Formal Specification and Semantic Slot Scheme
Let $u$ denote an input utterance in any natural language. The canonical representation is defined as

$$\mathrm{CSF}(u) = \big(e(u),\ i(u),\ t(u),\ c(u),\ a(u),\ o(u),\ l(u),\ p(u),\ m(u)\big),$$

where each function extracts the corresponding semantic slot (event, intent, time, condition, agent, object, location, purpose, modifier). More formally, there exist finite sets

$$E,\ I,\ T,\ C,\ A,\ O,\ L,\ P,\ M$$

and a mapping

$$\mathrm{CSF}\colon U \to E \times I \times T \times C \times A \times O \times L \times P \times M.$$
The slot inventory and their permitted values are summarized below:
| Slot | Value Set Size | Example Values |
|---|---|---|
| event | 7 | GO, STAY, BUY, WORK, MEET, EAT, LEARN |
| intent | 4 | NONE, PLAN, WANT, DECIDE |
| time | 5 | NONE, TODAY, TOMORROW, YESTERDAY, NOW |
| condition | 35 | See Section 2 |
| agent | 5 | ME, YOU, HE, SHE, THEY |
| object | 5 | NONE, FOOD, BOOK, MEDICINE, THING |
| location | 6 | NONE, HOME, SCHOOL, HOSPITAL, OFFICE, STORE |
| purpose | 2 | NONE, REST |
| modifier | 4 | NONE, FAST, SLOW, ALONE |
"NONE" denotes the absence of a value. Each slot captures a high-level primitive, enabling compositional meaning across languages.
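The slot inventory above can be encoded directly as a lookup table. The sketch below is illustrative, not the paper's implementation: the dict layout and the `validate_csf` helper are assumptions, and only a few of the 35 condition classes are shown for brevity.

```python
# Hypothetical encoding of the CSF slot inventory from the table above.
# Slot names and permitted values follow the paper; the layout is illustrative.
CSF_SLOTS = {
    "event":     {"GO", "STAY", "BUY", "WORK", "MEET", "EAT", "LEARN"},
    "intent":    {"NONE", "PLAN", "WANT", "DECIDE"},
    "time":      {"NONE", "TODAY", "TOMORROW", "YESTERDAY", "NOW"},
    # "condition" has 35 classes (Section 2); only a few are shown here.
    "condition": {"NONE", "IF_RAIN", "IF_BORED", "IF_HAVE_MONEY"},
    "agent":     {"ME", "YOU", "HE", "SHE", "THEY"},
    "object":    {"NONE", "FOOD", "BOOK", "MEDICINE", "THING"},
    "location":  {"NONE", "HOME", "SCHOOL", "HOSPITAL", "OFFICE", "STORE"},
    "purpose":   {"NONE", "REST"},
    "modifier":  {"NONE", "FAST", "SLOW", "ALONE"},
}

def validate_csf(slots: dict) -> bool:
    """Check that a 9-slot assignment uses exactly the CSF slots
    and only permitted values."""
    return (set(slots) == set(CSF_SLOTS)
            and all(v in CSF_SLOTS[k] for k, v in slots.items()))
```

For example, the assignment for "If it rains tomorrow, I stay home." (event=STAY, time=TOMORROW, condition=IF_RAIN, agent=ME, location=HOME, all other slots NONE) validates against this schema.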
2. Condition Taxonomy: Semantic Breadth and Systematicity
The condition slot ($c$) is distinguished by a fine-grained taxonomy of 35 classes: the 34 IF_* classes below, allocated to eight disjoint categories, plus NONE for unconditional utterances:
| Category | Condition Classes |
|---|---|
| Weather | IF_RAIN, IF_SUNNY, IF_COLD, IF_HOT, IF_WINDY |
| Time | IF_LATE, IF_EARLY, IF_WEEKEND, IF_NIGHT, IF_MORNING |
| Health | IF_SICK, IF_TIRED, IF_HUNGRY, IF_THIRSTY, IF_FULL |
| Schedule | IF_BUSY, IF_FREE, IF_HOLIDAY, IF_WORKING |
| Mood | IF_BORED, IF_HAPPY, IF_SAD, IF_STRESSED, IF_ANGRY |
| Social | IF_ALONE, IF_WITH_FRIENDS, IF_WITH_FAMILY |
| Activity | IF_FINISH_WORK, IF_FINISH_SCHOOL, IF_FINISH_EATING, IF_WATCH_MOVIE, IF_LISTEN_MUSIC |
| Financial | IF_HAVE_MONEY, IF_NO_MONEY |
This taxonomy enables explicit encoding of conditional expressions prevalent in natural language, such as weather contingencies, temporal and habitual conditions, agent-internal states, and socio-economic factors. Examples include:
- “If it rains, I stay home.” c = IF_RAIN
- “When I’m bored, I watch Netflix.” c = IF_BORED
- “If I have money, I go shopping.” c = IF_HAVE_MONEY
A plausible implication is that such detailed condition granularity supports nuanced, inferential translation for sign languages, which often encode conditional semantics syntactically rather than lexically.
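The eight-category structure lends itself to a reverse index from condition class to category, which downstream components (e.g., non-manual marker selection) might consult. This is an illustrative sketch; the `CONDITION_TAXONOMY` and `CATEGORY_OF` names are assumptions, with class names taken from the table above.

```python
# Illustrative reverse index from condition class to taxonomy category,
# built from the 8-category table above (class names as in the paper).
CONDITION_TAXONOMY = {
    "Weather":   ["IF_RAIN", "IF_SUNNY", "IF_COLD", "IF_HOT", "IF_WINDY"],
    "Time":      ["IF_LATE", "IF_EARLY", "IF_WEEKEND", "IF_NIGHT", "IF_MORNING"],
    "Health":    ["IF_SICK", "IF_TIRED", "IF_HUNGRY", "IF_THIRSTY", "IF_FULL"],
    "Schedule":  ["IF_BUSY", "IF_FREE", "IF_HOLIDAY", "IF_WORKING"],
    "Mood":      ["IF_BORED", "IF_HAPPY", "IF_SAD", "IF_STRESSED", "IF_ANGRY"],
    "Social":    ["IF_ALONE", "IF_WITH_FRIENDS", "IF_WITH_FAMILY"],
    "Activity":  ["IF_FINISH_WORK", "IF_FINISH_SCHOOL", "IF_FINISH_EATING",
                  "IF_WATCH_MOVIE", "IF_LISTEN_MUSIC"],
    "Financial": ["IF_HAVE_MONEY", "IF_NO_MONEY"],
}

# Map each class back to its category, e.g. CATEGORY_OF["IF_BORED"] == "Mood".
CATEGORY_OF = {cls: cat
               for cat, classes in CONDITION_TAXONOMY.items()
               for cls in classes}
```

The index contains the 34 IF_* classes; with NONE this completes the 35-class condition inventory.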
3. Transformer-Based Slot Extraction Architecture
CSF slot extraction is operationalized via a bespoke lightweight Transformer. The architecture includes:
- Subword tokenization with a custom BPE vocabulary
- An embedding layer mapping tokens to $d_{\text{model}}$-dimensional vectors, augmented with positional information
- Stacked Transformer encoder layers using Pre-LayerNorm, multi-head self-attention, and a two-layer FFN with GELU activation
- A global [CLS] token representation feeds nine slot-wise softmax classifiers
Per-slot prediction is given by

$$\hat{y}_s = \mathrm{softmax}\!\left(W_s\, h_{\text{[CLS]}} + b_s\right)$$

for slot $s \in \{1, \dots, 9\}$, with $W_s$ and $b_s$ as learned parameters. The total loss, summed over all slots, is

$$\mathcal{L} = \sum_{s=1}^{9} \mathrm{CE}\!\left(\hat{y}_s,\ y_s\right),$$

where $y_s$ is the one-hot ground truth for slot $s$.
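The nine-head readout can be sketched numerically. This is a minimal NumPy illustration of the per-slot softmax heads and summed cross-entropy, not the paper's code: the embedding width `d_model`, the random initialization, and the function names are assumptions; only the slot class counts come from Section 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64                                 # assumed width; not given in the paper
slot_sizes = [7, 4, 5, 35, 5, 5, 6, 2, 4]    # class counts per slot (Section 1)

def softmax(z):
    z = z - z.max()                          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# One classification head (W_s, b_s) per slot, all reading the same [CLS] vector.
heads = [(rng.normal(size=(k, d_model)) * 0.02, np.zeros(k)) for k in slot_sizes]

def slot_loss(h_cls, targets):
    """Sum of per-slot cross-entropy losses for one utterance.

    h_cls:   the shared [CLS] representation, shape (d_model,)
    targets: gold class index for each of the nine slots
    """
    total = 0.0
    for (W, b), y in zip(heads, targets):
        p = softmax(W @ h_cls + b)           # per-slot class distribution
        total += -np.log(p[y])               # cross-entropy against gold index y
    return total
```

With a zero [CLS] vector each head outputs a uniform distribution, so the loss equals $\sum_s \log |S_s|$, the maximum-entropy baseline the model improves upon during training.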
The model is highly compact: the ONNX export is $433.7$ KB and the complete deployment footprint is $0.74$ MB. Inference achieves $3.02$ ms per utterance on CPU, supporting real-time applications.
4. Data Regime, Training Procedure, and Empirical Performance
Training utilizes a multilingual dataset of utterances distributed across English, Vietnamese, Japanese, and French:
- Train/validation split in which all $35$ condition classes are represented
- Optimization: AdamW (weight decay $0.01$), OneCycleLR with cosine decay and warm-up, $15$ epochs on a single A100 GPU ($20$ min total)
Performance is reported as slot-level accuracy and averaged over all slots:
| Slot | Classes | Accuracy (%) |
|---|---|---|
| event | 7 | 97.8 |
| intent | 4 | 99.2 |
| time | 5 | 99.6 |
| condition | 35 | 99.4 |
| agent | 5 | 99.0 |
| object | 5 | 99.2 |
| location | 6 | 97.9 |
| purpose | 2 | 99.7 |
| modifier | 4 | 99.5 |
| Average | — | 99.03 |
Condition classification reaches $99.4\%$ accuracy across $35$ classes, indicating robust, fine-grained extraction even under high class cardinality. This suggests strong crosslingual generalization from a unified, parameter-efficient model.
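The reported average is the unweighted (macro) mean of the nine slot accuracies from the table, which can be checked directly:

```python
# Slot accuracies from the table above, in row order
# (event, intent, time, condition, agent, object, location, purpose, modifier).
accs = [97.8, 99.2, 99.6, 99.4, 99.0, 99.2, 97.9, 99.7, 99.5]
avg = round(sum(accs) / len(accs), 2)
print(avg)  # 99.03
```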
5. Deterministic Mapping to Signed Gloss Representations
Once extracted, the nine semantic slots are mapped deterministically to a gloss sequence emulating American Sign Language (ASL) topic–comment structure. The output ordering is modifier → time → condition → agent → location → object → event → purpose. The gloss string is constructed by concatenating, in that order, the label $\ell(s)$ of every slot $s$ whose value is not NONE.
For example, the utterance “If it rains tomorrow, I stay home.” is mapped as follows:
- CSF: (STAY, NONE, TOMORROW, IF_RAIN, ME, NONE, HOME, NONE, NONE)
- GLOSS output: TOMORROW IF_RAIN HOME STAY
The conversion is algorithmically defined:
```python
def CSF_to_GLOSS(slots):
    # Fixed topic-comment ordering; "intent" is not emitted in the gloss.
    order = ["modifier", "time", "condition", "agent",
             "location", "object", "event", "purpose"]
    gloss_seq = []
    for s in order:
        if slots[s] != "NONE":
            gloss_seq.append(slots[s])
    return gloss_seq
```
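To make the mapping concrete, here is a self-contained version with a worked example. The slot values are chosen for illustration, and the output follows the order list in the algorithm above literally (every non-NONE slot, including the agent, is emitted).

```python
SLOT_ORDER = ["modifier", "time", "condition", "agent",
              "location", "object", "event", "purpose"]

def csf_to_gloss(slots: dict) -> list:
    """Emit non-NONE slot labels in ASL topic-comment order."""
    return [slots[s] for s in SLOT_ORDER if slots.get(s, "NONE") != "NONE"]

# Illustrative assignment, e.g. "If it's sunny today, I go to the store."
example = {
    "event": "GO", "intent": "NONE", "time": "TODAY", "condition": "IF_SUNNY",
    "agent": "ME", "object": "NONE", "location": "STORE",
    "purpose": "NONE", "modifier": "NONE",
}
print(" ".join(csf_to_gloss(example)))  # TODAY IF_SUNNY ME STORE GO
```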
6. Practicality, Scope, and Significance
By employing fixed, language-neutral semantic primitives, CSF enables direct translation—removing reliance on English as a pivot and bypassing inherent bottlenecks in resource-lean languages. The framework demonstrates:
- Extreme parameter and compute efficiency ($0.74$ MB deployment, $3.02$ ms inference)
- Exceptional crosslingual generalization ($99.03\%$ average slot extraction accuracy, $99.4\%$ condition classification)
- The most comprehensive condition taxonomy yet published for sign language translation (35-class, 8-category schema)
- Applicability in browser-based environments for real-time multimodal accessibility
This design bridges typologically diverse spoken languages and signed languages with a unified, interpretable intermediate representation, directly addressing barriers faced by global Deaf communities in current translation systems that require English mediation (Bao, 5 Jan 2026).