GOAT-SLM: Paralinguistic-Aware Model

Updated 29 July 2025
  • GOAT-SLM is a spoken language model that integrates linguistic, paralinguistic, and speaker-specific features through a dual-modality design.
  • It employs a modular and staged training paradigm to align semantic, acoustic, and paralinguistic attributes for balanced performance across dialogue and emotion recognition tasks.
  • Evaluations on TELEVAL demonstrate its proficiency in nuanced speech synthesis and culturally competent interactions, positioning it as a breakthrough in socially aware conversational AI.

GOAT-SLM (“A Spoken LLM with Paralinguistic and Speaker Characteristic Awareness”) is a spoken LLM that explicitly models paralinguistic and speaker-specific features—such as dialect, age, emotion, and non-speech vocalizations—in addition to core linguistic content. Designed to address limitations of conventional SLMs that treat speech largely as a vector for transcribed text, GOAT-SLM uses a dual-modality architecture and modular, staged training strategy to achieve balanced proficiency across both linguistic and paralinguistic tasks. Experimental evaluations on TELEVAL show advances in emotion, dialect, and age-aware interactions, positioning GOAT-SLM as a foundation for socially aware spoken language systems (Chen et al., 24 Jul 2025).

1. Dual-Modality Architecture and Decoupling of Linguistic and Acoustic Processing

GOAT-SLM is constructed on a dual-modality head architecture that decouples semantic (linguistic) modeling from the acoustic realization pathway. The central “Think” module consists of the bottom N transformer layers of a shared pretrained LLM, responsible for abstracted semantic reasoning over both text and speech modalities. This core is bifurcated at the top K transformer layers into two distinct heads:

  • Write Module: Applies the upper layers of the LLM to generate textual output.
  • Speak Module: Shares the same transformer upper layers, but attaches a distinct prediction head to emit sequences of speech tokens.

Inbound speech is processed by the Listen module, which includes a Whisper-small encoder augmented with two CNN layers and two additional transformer layers to project acoustic features into the LLM embedding space. Downstream of the Speak module, a flow-matching component transcodes speech token sequences into acoustic feature representations, subsequently synthesized into waveforms via a neural vocoder.
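
To make the Listen pathway concrete, the sketch below implements a hypothetical projector of the kind described: two CNN layers followed by two transformer layers that map Whisper-small encoder features into the LLM embedding space. The LLM embedding width, kernel sizes, strides, layer ordering, and head count are illustrative assumptions, not values reported for GOAT-SLM.

```python
import torch
import torch.nn as nn

class ListenProjector(nn.Module):
    """Hedged sketch of the Listen-module projector: two 1-D convolutions
    followed by two transformer layers that map Whisper-small encoder
    features (768-dim) into an assumed LLM embedding space (2048-dim here)."""
    def __init__(self, enc_dim=768, llm_dim=2048, n_heads=8):
        super().__init__()
        # Two CNN layers over the time axis; stride-2 downsampling is an assumption.
        self.conv = nn.Sequential(
            nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(enc_dim, llm_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # Two additional transformer layers operating in the LLM embedding space.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, whisper_feats):  # whisper_feats: (B, T, enc_dim)
        x = self.conv(whisper_feats.transpose(1, 2)).transpose(1, 2)  # (B, T', llm_dim)
        return self.transformer(x)     # speech embeddings consumed by the LLM

# Usage with dummy Whisper-small features (real features come from the encoder).
feats = torch.randn(1, 200, 768)
speech_embeds = ListenProjector()(feats)
print(speech_embeds.shape)  # torch.Size([1, 50, 2048])
```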

Formally, for input $x$ (text or speech), the architecture computes:

  • Semantic embedding: $S = f_{\text{bottom}}(x)$
  • Text output: $y_{\text{text}} = f_{\text{write}}(S)$
  • Speech output: $y_{\text{speech}} = f_{\text{speak}}(S)$

The modular design enables isolating improvements in linguistic understanding or speech realization and facilitates efficient adaptation to new paralinguistic phenomena.
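
The relations above can be written as a compact, heavily simplified sketch of the dual-head design. The layer counts, hidden size, vocabulary sizes, and the use of plain encoder layers in place of the actual causal, pretrained LLM blocks are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_stack(d_model, n_layers, n_heads=8):
    """Stand-in for a run of pretrained LLM blocks (causal masking omitted for brevity)."""
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class DualModalityLM(nn.Module):
    def __init__(self, d_model=256, n_bottom=4, k_top=2,
                 text_vocab=32000, speech_vocab=4096):
        super().__init__()
        self.f_bottom = make_stack(d_model, n_bottom)        # "Think": shared semantic core
        self.f_top = make_stack(d_model, k_top)              # top-K layers shared by Write and Speak
        self.write_head = nn.Linear(d_model, text_vocab)     # Write: text-token logits
        self.speak_head = nn.Linear(d_model, speech_vocab)   # Speak: speech-token logits

    def forward(self, x):        # x: (B, T, d_model) text or projected speech embeddings
        s = self.f_bottom(x)     # S = f_bottom(x)
        h = self.f_top(s)        # shared upper layers
        return self.write_head(h), self.speak_head(h)        # y_text, y_speech

y_text, y_speech = DualModalityLM()(torch.randn(1, 16, 256))
```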

2. Modular and Staged Training Paradigm

GOAT-SLM training follows a three-stage modular and progressive protocol designed for incremental alignment across linguistic, paralinguistic, and speaker-level attributes; a sketch of the per-stage parameter updates follows the list below:

  • Stage 1: Instruction Tuning
    • Multi-turn dialogues are fine-tuned using prompts with explicit paralinguistic/speaker attribute descriptors (dialect, age, emotion, non-speech vocalizations).
    • Contrastive data provisioning (multiple versions of an instruction varying vocal cues) provides robust supervision for attribute-sensitive response generation.
  • Stage 2: Speech–Text Alignment
    • Stage 2-1 (Linguistic Alignment): The Listen module’s projector is exclusively updated on large-scale ASR data, with the core LLM frozen; “Repeat-and-Continue” prompting ensures efficient alignment between textual and speech-form inputs.
    • Stage 2-2 (Linguistic + Paralinguistic Alignment): Real/synthesized speech queries embed explicit attribute descriptors, instructing the model to instantiate correct interplay of semantics and paralinguistics. The full Listen module is fine-tuned in this phase.
  • Stage 3: Expressive Speech Generation
    • Stage 3-1 (Cold-Start): Training employs triplet-form data, e.g., ⟨Text Query, Text Response, Speech Response⟩, with long-form segments (up to 60 seconds) to ensure language continuity in speech outputs.
    • Stage 3-2 (Attribute-Aware Refinement): The Speak module is further refined using curated data spanning emotions (happiness, comfort, surprise, neutral) and age groups. GOAT-TTS is utilized to synthesize high-quality training data; only the Speak module is updated.
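
As a rough illustration of the staged protocol above, the per-stage updates can be expressed as selective parameter freezing. The attribute names (`listen`, `listen.projector`, `llm`, `speak`) are hypothetical handles, and the handling of stages whose trainable components are not explicitly described in the paper is an assumption.

```python
def set_trainable(module, flag: bool):
    """Toggle requires_grad on every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage: str):
    """Freeze everything, then unfreeze the components updated in a given stage.
    Attribute names are illustrative, not identifiers from the paper."""
    for m in (model.listen, model.llm, model.speak):
        set_trainable(m, False)
    if stage == "2-1":      # linguistic alignment: only the Listen projector, core LLM frozen
        set_trainable(model.listen.projector, True)
    elif stage == "2-2":    # linguistic + paralinguistic alignment: full Listen module
        set_trainable(model.listen, True)
    elif stage == "3-2":    # attribute-aware refinement: only the Speak module
        set_trainable(model.speak, True)
    else:                   # stage 1 / 3-1: trainable components not stated; assume the core LLM
        set_trainable(model.llm, True)
```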

An additional detail is the multi-token prediction (MTP) mechanism in the speech head, where output embeddings are fed forward and concatenated with subsequent inputs, stabilizing pronunciation and enhancing synthesis fidelity. Selective gradient masking is employed (conditioned on token prediction confidence) to further improve fluency.
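
The sketch below illustrates these two ideas in a hedged form: the embedding of the token just emitted is concatenated with the next hidden state before prediction, and low-confidence steps are masked out of the loss. The fusion layer, confidence threshold, greedy decoding, and masking rule are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPSpeechHead(nn.Module):
    """Illustrative speech-token head that feeds the previous output embedding
    forward into the next prediction step (multi-token-prediction style)."""
    def __init__(self, d_model=256, speech_vocab=4096):
        super().__init__()
        self.tok_emb = nn.Embedding(speech_vocab, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # fuse hidden state with previous output embedding
        self.out = nn.Linear(d_model, speech_vocab)

    def forward(self, hidden):                        # hidden: (B, T, d_model) from the Speak layers
        B, T, D = hidden.shape
        prev = torch.zeros(B, D, device=hidden.device)
        logits = []
        for t in range(T):
            step = self.out(self.fuse(torch.cat([hidden[:, t], prev], dim=-1)))
            prev = self.tok_emb(step.argmax(dim=-1))  # output embedding fed into the next step
            logits.append(step)
        return torch.stack(logits, dim=1)             # (B, T, speech_vocab)

# Selective gradient masking (assumed form): drop low-confidence steps from the loss.
def masked_ce(logits, targets, threshold=0.5):
    conf = logits.softmax(-1).amax(-1)                # per-step prediction confidence
    mask = (conf >= threshold).float()
    loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```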

3. Evaluation on TELEVAL: Semantic and Paralinguistic Performance

Evaluation was conducted on the TELEVAL benchmark, whose dimensions range from semantic intelligence (QA, dialogue) to paralinguistic and speaker-aware capabilities.

  • Semantic Intelligence:
    • On general Audio Question Answering tasks (Chinese/English), GOAT-SLM is competitive on multi-turn and knowledge-based tasks (e.g., 72.33% on LlamaQA-en, 48.43% on ChineseQuiz-zh) but slightly underperforms specialized open-source models such as Qwen2.5-Omni in purely semantic contexts. This trade-off reflects the parameter capacity devoted to modeling non-semantic features.
  • Paralinguistic and Speaker Characteristic Awareness:
    • On dialectal following and adaptation, GOAT-SLM achieves subjective consistency rates of 50–70% (and over 90% on certain controlled dialect tasks), outperforming comparable open-source models on dialect comprehension and adaptation.
    • In emotion-, age-, and NSV-aware interaction, the model attains 72.13% on age-aware tasks and 40.91% on tests involving non-speech vocalizations (NSVs), domains where competing open-source models lag.
    • Spoken output exhibits low Chinese Character Error Rate (CER: 1.57) and elevated Emotion Score (61.48), indicating high intelligibility and expressiveness.

These results indicate that explicit modular alignment and attribute-aware instruction tuning in GOAT-SLM enable the capture and reproduction of subtle paralinguistic and demographic features in both perception and generation.

4. Applications and Systems Integration

The explicit modeling of paralinguistic and speaker attributes in GOAT-SLM facilitates several practical and research directions:

  • Enhanced Dialogue Quality: Precise control and perception of emotional, dialectal, and age features support more context-aware and culturally competent conversational agents. For instance, adaptive customer service systems, intelligent tutors, or social robots can modulate tone, dialect, and affect in response to user cues.
  • Personalized and Adaptive Speech Synthesis: Modular training enables rapid adaptation to new speaker profiles or emotion distributions, supporting deployment in diverse operational settings.
  • Multi-Language and Socio-Cultural Adaptation: The approach can be extended to cover larger numbers of languages and dialects, increasing global accessibility.

This suggests a paradigm shift from pure semantic-to-speech models toward agents capable of nuanced human-like interaction that spans linguistic and paralinguistic spectra.

5. Future Directions

The architecture and procedures in GOAT-SLM indicate several key research trajectories:

  • Finer-Grained Paralinguistic Reasoning: Extending the attribute set to include prosodic variation, nuanced emotion gradients, and gesture-speech entrainment is a noted next step.
  • Streaming and Low-Latency Adaptation: Optimizing streaming responsiveness (e.g., for real-time interaction) and further developing multi-turn dialogue memory and context caching.
  • Broader Linguistic and Cultural Coverage: Scaling training data and attribute descriptors for minority languages/dialects and cultural context.
  • Downstream Task Integration: Embedding paralinguistic-aware SLMs within larger dialogue or emotional intelligence frameworks for richer conversational analysis.

A plausible implication is that future iterations of GOAT-SLM, or analogous frameworks, will be essential for achieving truly natural, robust, and socially adaptive human–AI spoken interfaces.

6. Technical Summary and Broader Significance

GOAT-SLM exemplifies a new class of spoken LLMs that systematize the modeling of speech as a confluence of linguistic and non-linguistic features. Its dual-modality head and staged alignment design yield a system with both robust language understanding (as measured by QA accuracy) and advanced paralinguistic expressiveness (as measured by dialect/age/emotion adaptation and expressive speech synthesis metrics). By advancing from text-centric to socially aware SLMs, GOAT-SLM marks a substantive development in adaptive, human-compatible conversational AI (Chen et al., 24 Jul 2025).

References (1)