SpeechWellness Challenge (SW1)

Updated 3 July 2025
  • SpeechWellness Challenge (SW1) is a benchmark initiative that uses multimodal speech signals and deep learning to detect suicide risk in adolescents.
  • It provides a privacy-preserving Mandarin speech dataset of 600 subjects, balanced between at-risk and non-risk groups and labeled via structured clinical interview.
  • Innovative fusion methods and LLM-driven analysis in SW1 yield robust performance and clinical interpretability for scalable mental health screening.

The SpeechWellness Challenge (SW1) is a benchmark initiative aimed at advancing automated speech-based suicide risk detection in adolescents. Designed to address shortcomings of traditional mental health risk assessment, such as reliance on self-report and on access to clinical resources, SW1 has rapidly become a critical testbed for evaluating multimodal signal processing, deep learning, and LLM methods applied to non-invasive, real-world mental health screening.

1. Scope and Motivation

SW1 focuses on detecting suicide risk among adolescents (ages 10–18) from speech, leveraging the hypothesis that vocal and linguistic patterns may reveal psychological states not easily accessed through direct questioning. Suicide is a leading cause of death among adolescents worldwide, and early identification of at-risk individuals is recognized as pivotal for prevention. Traditional methods (e.g., clinical interviews, self-report questionnaires) are resource-intensive and can suffer from underreporting or access barriers. Speech-based analysis offers a scalable, non-intrusive alternative, opening possibilities for remote, continuous, and stigma-free risk monitoring (2501.06474).

The challenge specifically addresses:

  • The creation and release of a large-scale, privacy-conscious speech dataset.
  • The evaluation of acoustic, linguistic, and multimodal algorithms for binary suicide risk classification.
  • The promotion of explainable and clinically actionable AI models suitable for deployment in health and e-health contexts.

2. Dataset Characteristics and Acquisition Protocol

The SW1 dataset comprises 600 anonymized Mandarin speech samples, equally split between at-risk and non-risk adolescents, balanced for gender and spanning ages 10–18. Participants were sourced from 47 schools in Guangdong, China, with risk status evaluated using the MINI-KID clinical diagnostic interview, an established gold standard for psychiatric assessment in youth (2501.06474, 2505.13069).

Data collection included three tasks per subject:

  • Emotional Regulation (ER): Responses to open-ended prompts on managing emotional distress.
  • Passage Reading (PR): Reading a standard story ("The North Wind and the Sun").
  • Expression Description (ED): Description of an image with a facial expression.

Recordings were conducted individually in sound-proof environments using standardized devices, with rigorous ethical protocols, including informed consent and multi-stage validation. To ensure privacy, vocal timbre was anonymized by neural voice conversion post-processing, with additional checks (e.g., minimum word count, intelligibility, and sensitivity screening via LLMs).

Table: SW1 Dataset Structure

| Attribute | Details |
|---|---|
| Subjects | 600 adolescents (300 at-risk / 300 non-risk) |
| Language | Mandarin Chinese |
| Tasks | ER, PR, ED |
| Average duration per task | ER: 36.4 s, PR: 47.4 s, ED: 23.5 s |
| Anonymization | Neural timbre conversion |
| Risk label | MINI-KID diagnosis |
| Data split | Train : Dev : Test = 4 : 1 : 1 |
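
The official train/dev/test partitions are distributed with the dataset; purely for illustration, the following is a minimal sketch of an equivalent stratified 4:1:1 split, assuming a hypothetical `sw1_metadata.csv` with one row per subject and a binary `risk` column:

```python
# Illustrative only: the challenge ships fixed splits; this merely shows
# how a stratified 4:1:1 partition (400/100/100 subjects) could be derived.
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("sw1_metadata.csv")  # hypothetical metadata file

# Hold out 2/6 of subjects, then halve the remainder into dev and test,
# stratifying on the risk label at each step.
train, rest = train_test_split(
    meta, test_size=2 / 6, stratify=meta["risk"], random_state=0)
dev, test = train_test_split(
    rest, test_size=0.5, stratify=rest["risk"], random_state=0)
```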

3. Methodological Innovations and Baseline Systems

The SW1 challenge established several key methodological baselines, against which participants evaluated novel models. These baselines spanned both classical and modern deep learning paradigms:

  • eGeMAPS + SVM Baseline: Utilized the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS, 88 features) capturing prosody, voice quality, spectral attributes, and paralinguistic cues. Support Vector Machine (SVM) classifiers (RBF or linear kernel) were applied per task, with ensembled predictions across tasks (2501.06474) (see the first sketch after this list).
  • Wav2Vec2 (W2V2) + BERT Baseline: Combined large-scale self-supervised audio embeddings (Wav2Vec2, 24 transformer layers, trained on 56k+ hours of multilingual speech) with Chinese BERT text embeddings, fused via concatenation into a classifier trained with cross-entropy loss (see the second sketch after this list).
  • Multimodal and Fusion Strategies: Subsequent studies adopted more advanced methods:
    • WhisperX for transcription,
    • WavLM for self-supervised audio embeddings,
    • Chinese RoBERTa for enriched text embeddings,
    • Inclusion of traditional acoustics (MFCCs, spectral contrast, pitch statistics),
    • Fusion architectures ranging from simple concatenation to weighted attention with mixup regularization (2505.13069).
  • LLM-centric Approaches: The leading submissions deployed LLMs (e.g., DeepSeek-R1, Gemma2, Qwen2.5) both for direct transcript-based classification in an in-context learning regime and for prompt-driven extraction of interpretable clinical indicators (e.g., self-harm, social support) (2505.20491, 2507.00693). Prompt construction and ablation were automated with DSPy to optimize few-shot and chain-of-thought strategies.
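
The first sketch below outlines the eGeMAPS + SVM baseline, assuming the `opensmile` and `scikit-learn` packages; the file paths and labels are placeholders rather than challenge artifacts:

```python
# Minimal sketch of the eGeMAPS + SVM baseline (one classifier per task,
# scores ensembled across tasks); paths and labels are placeholders.
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def egemaps(paths):
    # One 88-dimensional functional vector per recording.
    return np.vstack([smile.process_file(p).to_numpy() for p in paths])

train_paths, y_train = ["er_0001.wav", "er_0002.wav"], [0, 1]  # placeholders
dev_paths = ["er_0101.wav"]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(egemaps(train_paths), y_train)        # 0 = non-risk, 1 = at-risk
dev_scores = clf.predict_proba(egemaps(dev_paths))[:, 1]
# Per-task scores would then be averaged or voted across ER, PR, and ED.
```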
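The second sketch outlines the W2V2 + BERT fusion baseline with HuggingFace `transformers`; the exact checkpoints are assumptions (XLSR-53 matches the 24-layer, 56k-hour multilingual description, and `bert-base-chinese` stands in for the Chinese BERT):

```python
# Minimal sketch of concatenation fusion of audio and text embeddings
# feeding a softmax classifier; checkpoint names are assumptions.
import torch
import torch.nn as nn
from transformers import (AutoModel, AutoTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53")
text_enc = AutoModel.from_pretrained("bert-base-chinese")
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
head = nn.Linear(1024 + 768, 2)     # concatenated embeddings -> 2 classes

def risk_logits(waveform_16k, transcript):
    a = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    a_emb = audio_enc(**a).last_hidden_state.mean(dim=1)     # (1, 1024)
    t = tokenizer(transcript, return_tensors="pt", truncation=True)
    t_emb = text_enc(**t).last_hidden_state[:, 0]            # [CLS], (1, 768)
    return head(torch.cat([a_emb, t_emb], dim=-1))

# Training would minimize nn.CrossEntropyLoss over these logits.
```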

4. Feature Engineering and Model Architectures

Three main feature paradigms were established:

  1. Acoustic Features:
    • Extracted with pretrained neural models (HuBERT, Wav2Vec2, Whisper) as well as hand-crafted descriptors (MFCCs, spectral contrast, pitch). These features are sensitive to prosodic and affective cues linked to emotional state.
  2. Semantic (Textual/Linguistic) Features:
    • Responses are transcribed with WhisperX and then processed through models such as Chinese BERT or RoBERTa, yielding linguistic and contextual embeddings. These capture word selection, syntax, and content patterns.
  3. LLM-Interpretable Features:
    • DeepSeek-R1 and similar LLMs were prompted to extract the presence of clinically relevant themes (Self-harm Behavior, Pressure, Social Support, Unhealthy Outlets, Exercise) with quoted evidence from transcripts. Outputs were binary (indicator present/absent) per category, supplying model transparency to clinicians (2507.00693).
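
As a concrete illustration of the indicator-extraction step, the following is a minimal sketch of prompting an LLM for the five binary indicators with quoted evidence; the endpoint, deployment name, and JSON schema are assumptions, not the published prompt:

```python
# Minimal sketch of LLM-based clinical-indicator extraction; the server
# URL, model name, and output schema are hypothetical stand-ins.
import json
from openai import OpenAI

INDICATORS = ["Self-harm Behavior", "Pressure", "Social Support",
              "Unhealthy Outlets", "Exercise"]

PROMPT = (
    "For the transcript below, return a JSON list with one object per "
    "indicator: {\"indicator\": ..., \"present\": true|false, "
    "\"evidence\": \"verbatim quote or empty string\"}.\n"
    "Indicators: " + ", ".join(INDICATORS) + "\n\nTranscript:\n"
)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def extract_indicators(transcript: str) -> list:
    resp = client.chat.completions.create(
        model="deepseek-r1",   # placeholder deployment name
        messages=[{"role": "user", "content": PROMPT + transcript}],
        temperature=0.0,
    )
    # A production pipeline would validate and retry on malformed JSON.
    return json.loads(resp.choices[0].message.content)
```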

Fusion Mechanisms:

  • Simple concatenation of embeddings, independent modality-specific processing followed by attention-based fusion, and voting ensembles were all employed; the best test-time generalization was achieved by learned modality-specific weighting with mixup regularization (2505.13069), sketched below.
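
A minimal PyTorch sketch of this best-performing recipe, assuming precomputed per-modality embeddings; all names and dimensions are hypothetical:

```python
# Learned modality weighting + mixup regularization (sizes hypothetical).
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, dims=(1024, 768, 64), hidden=256, n_classes=2):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.w = nn.Parameter(torch.zeros(len(dims)))  # learned modality weights
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, feats):                      # list of (B, d_i) tensors
        w = torch.softmax(self.w, dim=0)           # normalized weights
        fused = sum(wi * p(f) for wi, p, f in zip(w, self.proj, feats))
        return self.head(fused)

def mixup(feats, y, alpha=0.2):
    # Convex-combine examples within a batch; labels are mixed via the loss.
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    idx = torch.randperm(y.size(0))
    return [lam * f + (1 - lam) * f[idx] for f in feats], y, y[idx], lam

model, ce = WeightedFusion(), nn.CrossEntropyLoss()
feats = [torch.randn(8, d) for d in (1024, 768, 64)]   # dummy batch
y = torch.randint(0, 2, (8,))
mixed, y_a, y_b, lam = mixup(feats, y)
logits = model(mixed)
loss = lam * ce(logits, y_a) + (1 - lam) * ce(logits, y_b)
```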

LLM In-context Learning:

  • Prompted LLMs supplied with multiple labeled examples outperformed classical feature-based pipelines. Four-shot Gemma2-9b achieved 0.68 accuracy (F1 = 0.70) using only transcripts, with accuracy improving significantly (p = 0.003) as the number of in-context examples increased (2505.20491). A sketch follows.
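
A minimal sketch of a transcript-only few-shot classifier in DSPy, assuming its LabeledFewShot optimizer; the backend string and demonstration examples are placeholders:

```python
# Few-shot in-context classification over transcripts with DSPy;
# model identifier and examples are placeholders, not SW1 data.
import dspy
from dspy.teleprompt import LabeledFewShot

dspy.configure(lm=dspy.LM("ollama_chat/gemma2:9b", temperature=0.0))

class RiskCall(dspy.Signature):
    """Decide whether a transcribed response indicates suicide risk."""
    transcript: str = dspy.InputField()
    risk: str = dspy.OutputField(desc="'at-risk' or 'non-risk'")

classify = dspy.ChainOfThought(RiskCall)

trainset = [  # placeholder labeled demos standing in for SW1 transcripts
    dspy.Example(transcript="...", risk="non-risk").with_inputs("transcript"),
    dspy.Example(transcript="...", risk="at-risk").with_inputs("transcript"),
]
four_shot = LabeledFewShot(k=4).compile(classify, trainset=trainset)
pred = four_shot(transcript="...")   # pred.risk holds the predicted label
```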

5. Outcomes, Performance, and Generalization

The test set results across SW1 are as follows:

| Approach | Test Accuracy | Test F1 | Notes |
|---|---|---|---|
| Baseline (W2V2 + BERT) | 0.61 | – | Provided baseline |
| Multimodal SOTA (WavLM + RoBERTa + acoustics) | 0.56 | 0.56 | Weighted attention + mixup (2505.13069) |
| LLM in-context only (transcript-based) | 0.68 | 0.70 | Gemma2-9b, DSPy, 3rd place (2505.20491) |
| LLM interpretable features (DeepSeek-R1) | 0.652 | 0.679 | ER task only (2507.00693) |
| Ensemble (acoustic + semantic + LLM) | 0.74 | 0.74 | 1st place (2507.00693) |

Key findings:

  • LLM-augmented frameworks surpassed traditional signal and fusion methods.
  • Pure text-based in-context learning was robust to audio anonymization, emphasizing the power of linguistic cues.
  • Overfitting to dev/validation sets remained a challenge in deep multimodal systems, mitigated by mixup regularization and modulation of feature weights.
  • The ensemble approach, heavily weighting LLM-extracted clinical indicators, provided both performance and interpretability.

6. Patterns, Markers, and Clinical Integration

Markers correlating with elevated suicide risk, as captured in the challenge, included:

  • Semantic features: Use of distinct content words and thematic shifts in transcribed responses.
  • Paralinguistic features: Jitter, shimmer, and F0 variability, aligned with stress and mental state (drawn from eGeMAPS and prior psychiatric voice analysis; see the sketch after this list).
  • LLM-extracted psychological/behavioral codes: Direct or indirect references to self-harm, explicit statements of psychosocial pressure, comments on lack of support or reliance on maladaptive coping strategies, with supporting language quotes.
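
As an illustration of how such paralinguistic markers can be computed, the following is a minimal sketch using Praat via the `parselmouth` package, with common Praat default thresholds:

```python
# Jitter, shimmer, and F0 variability via Praat/parselmouth;
# thresholds follow common Praat defaults.
import parselmouth
from parselmouth.praat import call

def voice_markers(wav_path: str) -> dict:
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=500.0)
    pp = call(snd, "To PointProcess (periodic, cc)", 75.0, 500.0)
    return {
        "f0_sd_hz": call(pitch, "Get standard deviation", 0, 0, "Hertz"),
        "jitter_local": call(pp, "Get jitter (local)",
                             0, 0, 1e-4, 0.02, 1.3),
        "shimmer_local": call([snd, pp], "Get shimmer (local)",
                              0, 0, 1e-4, 0.02, 1.3, 1.6),
    }
```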

The design enabled not only risk flagging but also gave clinicians visibility into the key spoken evidence on which decisions were based. This is particularly significant for building clinical trust and establishing actionable intervention paths.

7. Limitations and Future Directions

Several challenges and research directions are recognized:

  • The SW1 dataset’s demographic scope is presently limited to a single province and language, raising questions about cross-lingual or multicultural model generalization (2501.06474, 2505.13069).
  • The MINI-KID framework provides a robust but present-state label, not predictive of long-term risk or future attempts (2505.13069).
  • The voice-conversion anonymization preserved linguistic and rhythmic content but diminished the utility of fine-grained acoustic modeling.
  • Performance ceilings in deep and multimodal learning, especially on unseen test data, point to the continued need for:
    • Expansion to multiethnic, multilingual datasets
    • Regularization and debiasing strategies
    • Further interpretability for clinical acceptability
    • Longitudinal and multimodal multimarker fusion (e.g., integration with behavioral, physiological, or social data streams)

A plausible implication is that next-generation SW1-style challenges will move toward ongoing, context-aware suicide risk monitoring, employing LLM-extracted, interpretable markers as first-line surveillance for population-scale mental health intervention, while remaining cautious about overfitting, bias, and the temporal generalizability of present-state labels.


In summary, the SpeechWellness Challenge (SW1) exemplifies the convergence of speech technology, deep learning, and clinical psychiatry for adolescent mental health evaluation, highlighting the primacy of LLM-based interpretability, robust multimodal modeling, and the necessity of dataset expansion and clinical integration for future progress.