
Canonical Semantic Form (CSF)

Updated 12 January 2026
  • CSF is a language-neutral semantic framework that employs a structured 9-tuple of discrete slots to decompose utterances for precise sign language generation.
  • It features a detailed condition taxonomy with 35 classes across 8 categories to capture nuanced, context-dependent linguistic expressions.
  • A lightweight Transformer-based slot extractor achieves over 99% accuracy, enabling parameter-efficient, real-time crosslingual translation.

Canonical Semantic Form (CSF) is a language-agnostic semantic representation designed to enable direct translation from any source language to sign languages without mediating through English. CSF represents utterances as a structured 9-tuple of discrete semantic slots, supporting precise, universal mapping suitable for multilingual sign language generation. It is constructed to facilitate parameter-efficient, real-time processing and provides explicit decomposition of utterances for nuanced linguistic phenomena, notably conditional expressions, across a broad typological spectrum (Bao, 5 Jan 2026).

1. Formal Specification and Semantic Slot Scheme

Let $x$ denote an input utterance in any natural language. The canonical representation is defined as:

$$\text{CSF}(x) = (e(x),\ i(x),\ t(x),\ c(x),\ a(x),\ o(x),\ l(x),\ p(x),\ m(x))$$

where each function extracts the corresponding semantic slot. More formally, there exist finite sets:

$$S_\text{event},\ S_\text{intent},\ S_\text{time},\ S_\text{condition},\ S_\text{agent},\ S_\text{object},\ S_\text{location},\ S_\text{purpose},\ S_\text{modifier}$$

and a mapping

$$f_\text{CSF}:\ \text{Vocabulary}^* \to S_\text{event} \times S_\text{intent} \times S_\text{time} \times S_\text{condition} \times S_\text{agent} \times S_\text{object} \times S_\text{location} \times S_\text{purpose} \times S_\text{modifier}$$

The slot inventory and their permitted values are summarized below:

| Slot | Value-set size | Example values |
| --- | --- | --- |
| event | 7 | GO, STAY, BUY, WORK, MEET, EAT, LEARN |
| intent | 4 | NONE, PLAN, WANT, DECIDE |
| time | 5 | NONE, TODAY, TOMORROW, YESTERDAY, NOW |
| condition | 35 | See Section 2 |
| agent | 5 | ME, YOU, HE, SHE, THEY |
| object | 5 | NONE, FOOD, BOOK, MEDICINE, THING |
| location | 6 | NONE, HOME, SCHOOL, HOSPITAL, OFFICE, STORE |
| purpose | 2 | NONE, REST |
| modifier | 4 | NONE, FAST, SLOW, ALONE |

"NONE" denotes the absence of a value. Each slot captures a high-level primitive, enabling compositional meaning across languages.
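For concreteness, the inventory above translates directly into code. The sketch below reproduces the table (the condition slot's 35 values appear in Section 2 and are omitted here for brevity); the validation helper is a hypothetical illustration, not part of the paper:

```python
# Slot inventory from the table above; each slot maps to its finite value set.
SLOT_VALUES = {
    "event":    {"GO", "STAY", "BUY", "WORK", "MEET", "EAT", "LEARN"},
    "intent":   {"NONE", "PLAN", "WANT", "DECIDE"},
    "time":     {"NONE", "TODAY", "TOMORROW", "YESTERDAY", "NOW"},
    # "condition": 35 values, listed in Section 2; omitted here for brevity
    "agent":    {"ME", "YOU", "HE", "SHE", "THEY"},
    "object":   {"NONE", "FOOD", "BOOK", "MEDICINE", "THING"},
    "location": {"NONE", "HOME", "SCHOOL", "HOSPITAL", "OFFICE", "STORE"},
    "purpose":  {"NONE", "REST"},
    "modifier": {"NONE", "FAST", "SLOW", "ALONE"},
}

def validate_csf(csf: dict) -> bool:
    """Hypothetical helper: check that each listed slot holds a permitted value."""
    return all(csf.get(slot) in values for slot, values in SLOT_VALUES.items())
```

Because every slot ranges over a small closed set, a CSF tuple can be checked for well-formedness in constant time, which is part of what makes the representation suitable for lightweight deployment.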

2. Condition Taxonomy: Semantic Breadth and Systematicity

The condition slot ($c \in S_\text{condition}$) is distinguished by a fine-grained taxonomy of 35 classes (the 34 IF_* classes below, plus the NONE value reported in the training data) allocated to eight disjoint categories:

| Category | Condition classes |
| --- | --- |
| Weather | IF_RAIN, IF_SUNNY, IF_COLD, IF_HOT, IF_WINDY |
| Time | IF_LATE, IF_EARLY, IF_WEEKEND, IF_NIGHT, IF_MORNING |
| Health | IF_SICK, IF_TIRED, IF_HUNGRY, IF_THIRSTY, IF_FULL |
| Schedule | IF_BUSY, IF_FREE, IF_HOLIDAY, IF_WORKING |
| Mood | IF_BORED, IF_HAPPY, IF_SAD, IF_STRESSED, IF_ANGRY |
| Social | IF_ALONE, IF_WITH_FRIENDS, IF_WITH_FAMILY |
| Activity | IF_FINISH_WORK, IF_FINISH_SCHOOL, IF_FINISH_EATING, IF_WATCH_MOVIE, IF_LISTEN_MUSIC |
| Financial | IF_HAVE_MONEY, IF_NO_MONEY |

This taxonomy enables explicit encoding of conditional expressions prevalent in natural language, such as weather contingencies, temporal and habitual conditions, agent internal state, and socio-economic factors. Examples include:

  • “If it rains, I stay home.” → c = IF_RAIN
  • “When I’m bored, I watch Netflix.” → c = IF_BORED
  • “If I have money, I go shopping.” → c = IF_HAVE_MONEY

A plausible implication is that such detailed condition granularity supports nuanced, inferential translation for sign languages, which often encode conditional semantics syntactically rather than lexically.
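The taxonomy is small enough to enumerate in code. Note that the table lists 34 IF_* classes; together with the NONE value reported for the training data (Section 4), the slot reaches the stated 35-class inventory:

```python
# Condition taxonomy from the table above: 8 categories of IF_* classes.
CONDITION_TAXONOMY = {
    "Weather":   ["IF_RAIN", "IF_SUNNY", "IF_COLD", "IF_HOT", "IF_WINDY"],
    "Time":      ["IF_LATE", "IF_EARLY", "IF_WEEKEND", "IF_NIGHT", "IF_MORNING"],
    "Health":    ["IF_SICK", "IF_TIRED", "IF_HUNGRY", "IF_THIRSTY", "IF_FULL"],
    "Schedule":  ["IF_BUSY", "IF_FREE", "IF_HOLIDAY", "IF_WORKING"],
    "Mood":      ["IF_BORED", "IF_HAPPY", "IF_SAD", "IF_STRESSED", "IF_ANGRY"],
    "Social":    ["IF_ALONE", "IF_WITH_FRIENDS", "IF_WITH_FAMILY"],
    "Activity":  ["IF_FINISH_WORK", "IF_FINISH_SCHOOL", "IF_FINISH_EATING",
                  "IF_WATCH_MOVIE", "IF_LISTEN_MUSIC"],
    "Financial": ["IF_HAVE_MONEY", "IF_NO_MONEY"],
}

N_IF_CLASSES = sum(len(v) for v in CONDITION_TAXONOMY.values())  # 34 IF_* classes
# Adding the NONE value yields the full 35-class condition slot.
```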

3. Transformer-Based Slot Extraction Architecture

CSF slot extraction is operationalized via a bespoke lightweight Transformer. The architecture includes:

  • Subword tokenization with a custom BPE vocabulary ($|V| = 8{,}000$)
  • An embedding layer mapping tokens to $\mathbb{R}^d$, with added positional information
  • $L$ stacked Transformer encoder layers using Pre-LayerNorm, $H$-headed self-attention, and a two-layer FFN (inner size $1{,}024$, GELU activation)
  • A global [CLS] token representation $h_\text{cls} \in \mathbb{R}^d$ feeding nine slot-wise softmax classifiers

Per-slot prediction is given by:

$$z_k = W_k h_\text{cls} + b_k \in \mathbb{R}^{|S_k|}$$

$$p_k = \operatorname{softmax}(z_k)$$

for slots $k = 1, \ldots, 9$, with $W_k$ and $b_k$ as learned parameters. The total loss, summed over all slots, is:

$$L(x;\theta) = \sum_{k=1}^{9} \operatorname{CE}(y_k, p_k)$$

where $y_k$ is the one-hot ground truth for slot $k$.
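Concretely, the nine heads and the summed loss fit in a few lines of plain Python. The sketch below uses toy dimensions and random stand-ins for the learned $W_k$, $b_k$ and for $h_\text{cls}$; nothing here reflects the paper's actual weights or hidden size:

```python
import math
import random

random.seed(0)

SLOT_SIZES = [7, 4, 5, 35, 5, 5, 6, 2, 4]  # |S_k| for the nine slots
d = 8                                       # toy hidden size (illustrative only)

def softmax(z):
    m = max(z)                              # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(y_index, p):
    # CE(y_k, p_k) with y_k one-hot at position y_index
    return -math.log(p[y_index])

# Random stand-ins for h_cls and the learned per-slot parameters W_k, b_k.
h_cls = [random.gauss(0, 1) for _ in range(d)]
heads = [([[random.gauss(0, 0.1) for _ in range(d)] for _ in range(n)],
          [0.0] * n) for n in SLOT_SIZES]

def slot_logits(W, b, h):
    # z_k = W_k h + b_k
    return [sum(w * x for w, x in zip(row, h)) + bj for row, bj in zip(W, b)]

# Total loss: sum of per-slot cross-entropies against toy ground-truth indices.
targets = [0] * len(SLOT_SIZES)
loss = sum(cross_entropy(t, softmax(slot_logits(W, b, h_cls)))
           for (W, b), t in zip(heads, targets))
```

Because the heads share a single $h_\text{cls}$, adding a slot costs only one small linear layer, which is consistent with the framework's parameter-efficiency goals.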

Model specifications include approximately $1.5 \times 10^6$ parameters, an ONNX export size of 433.7 KB, and a complete deployment footprint of 0.74 MB. Inference takes 3.02 ms per utterance on CPU, supporting real-time applications.

4. Data Regime, Training Procedure, and Empirical Performance

Training uses a dataset of 18,885 utterances distributed across English, Vietnamese, Japanese, and French:

  • Train/validation split: 16,996 / 1,889
  • All 35 condition classes are represented; NONE accounts for roughly 22.6%
  • Optimization: AdamW (weight decay 0.01), learning rate $2 \times 10^{-4}$, OneCycleLR with cosine decay, 3,990 steps over 15 epochs, 10% warm-up, on a single A100 GPU (~20 minutes total)
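The learning-rate schedule can be sketched as linear warm-up followed by cosine decay. This is a simplification of PyTorch's OneCycleLR (the initial and final division factors are omitted); the step counts and peak rate are taken from the settings above:

```python
import math

TOTAL_STEPS = 3990                  # optimizer steps, from the settings above
WARMUP = int(0.10 * TOTAL_STEPS)    # 10% warm-up -> 399 steps
PEAK_LR = 2e-4

def lr_at(step: int) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay toward zero
    (a simplified stand-in for OneCycleLR's annealing)."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The rate peaks at the end of warm-up and decays to approximately zero by the final step.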

Performance is reported as slot-level accuracy, with an average over all nine slots:

| Slot | Classes | Accuracy (%) |
| --- | --- | --- |
| event | 7 | 97.8 |
| intent | 4 | 99.2 |
| time | 5 | 99.6 |
| condition | 35 | 99.4 |
| agent | 5 | 99.0 |
| object | 5 | 99.2 |
| location | 6 | 97.9 |
| purpose | 2 | 99.7 |
| modifier | 4 | 99.5 |
| Average | — | 99.03 |

Condition classification reaches 99.4% across 35 classes, indicating robust, fine-grained extraction even under high class cardinality. This suggests strong crosslingual generalization from a unified, parameter-efficient model.
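The reported average is consistent with an unweighted (macro) mean of the nine slot accuracies:

```python
# Slot-level accuracies from the table above.
slot_acc = {
    "event": 97.8, "intent": 99.2, "time": 99.6, "condition": 99.4,
    "agent": 99.0, "object": 99.2, "location": 97.9, "purpose": 99.7,
    "modifier": 99.5,
}

macro_avg = sum(slot_acc.values()) / len(slot_acc)  # ~99.03
```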

5. Deterministic Mapping to Signed Gloss Representations

Once extracted, the nine semantic slots are mapped deterministically to a gloss sequence emulating American Sign Language (ASL) topic–comment structure. The output ordering $\pi$ is $(\text{modifier},\ \text{time},\ \text{condition},\ \text{agent},\ \text{location},\ \text{object},\ \text{event},\ \text{purpose})$. The gloss string construction is:

$$\text{GLOSS}(x) = [\, v_k \mid k \in \pi,\ v_k \neq \text{NONE} \,]$$

where $v_k$ is the label for slot $k$.

For example, the utterance “If it rains tomorrow, I stay home.” is mapped as follows:

  • CSF: $(e=\text{STAY},\ i=\text{NONE},\ t=\text{TOMORROW},\ c=\text{IF\_RAIN},\ a=\text{ME},\ o=\text{NONE},\ l=\text{HOME},\ p=\text{NONE},\ m=\text{NONE})$
  • GLOSS output: TOMORROW IF_RAIN HOME STAY

The conversion is algorithmically defined:

def CSF_to_GLOSS(slots):
    # Fixed ASL topic-comment output order (Section 5).
    order = ["modifier", "time", "condition", "agent",
             "location", "object", "event", "purpose"]
    gloss_seq = []
    for s in order:
        if slots[s] != "NONE":  # drop empty slots
            gloss_seq.append(slots[s])
    return gloss_seq

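Applied to the worked example above (the function is restated so the snippet runs standalone), a literal reading of the construction also emits the agent ME, whereas the paper's example gloss omits the first-person agent; this suggests an additional pro-drop convention beyond the formula:

```python
def csf_to_gloss(slots: dict) -> list:
    """Restated CSF-to-gloss conversion: fixed output order, NONE slots dropped."""
    order = ["modifier", "time", "condition", "agent",
             "location", "object", "event", "purpose"]
    return [slots[s] for s in order if slots[s] != "NONE"]

csf = {"event": "STAY", "intent": "NONE", "time": "TOMORROW",
       "condition": "IF_RAIN", "agent": "ME", "object": "NONE",
       "location": "HOME", "purpose": "NONE", "modifier": "NONE"}

gloss = csf_to_gloss(csf)
# -> ["TOMORROW", "IF_RAIN", "ME", "HOME", "STAY"]; the paper's example gloss
#    "TOMORROW IF_RAIN HOME STAY" additionally drops the first-person agent.
```
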
A plausible implication is that this explicit pipeline reduces ambiguity in mapping open-domain sentences to sign language production by decomposing them into non-overlapping primitives.

6. Practicality, Scope, and Significance

By employing fixed, language-neutral semantic primitives, CSF enables direct $L_1 \to \text{Sign}$ translation, removing reliance on English as a pivot and bypassing bottlenecks inherent to resource-lean languages. The framework demonstrates:

  • Extreme parameter and compute efficiency (0.74 MB deployment, 3.02 ms inference)
  • Strong crosslingual generalization (99.03% average slot-extraction accuracy, 99.4% condition classification)
  • The most comprehensive condition taxonomy yet published for sign language translation (35-class, 8-category schema)
  • Applicability in browser-based environments for real-time multimodal accessibility

This design bridges typologically diverse spoken languages and signed languages with a unified, interpretable intermediate representation, directly addressing barriers faced by global Deaf communities in current translation systems that require English mediation (Bao, 5 Jan 2026).
