EMSLlama: Emergency Medical Symptom LLM
- EMSLlama is a domain-specialized large language model that extracts and normalizes emergency medical symptoms from error-prone speech-to-text transcripts.
- It utilizes low-rank adaptation on Llama-3-8B, training only 0.13% of parameters to achieve a 32-point increase in ExactMatch over GPT-4o.
- Designed for edge deployment in pre-arrival EMS workflows, it delivers symptom normalization with inference latencies under one second.
EMSLlama is a domain-specialized LLM designed for robust, real-time extraction and normalization of emergency medical symptoms from noisy, error-prone speech-to-text transcripts. Developed as a core module of the TeleEMS system for pre-arrival emergency medical services (EMS), EMSLlama leverages low-rank adaptation of Meta’s Llama-3-8B backbone to deliver deterministic, high-accuracy decision support in resource-constrained, latency-critical edge settings (Jin et al., 18 Nov 2025).
1. Model Architecture and Foundation
EMSLlama retains the high-level transformer structure of Llama-3-8B with approximately 8 billion parameters, 32 transformer layers, a hidden size of 4096, 32 attention heads, and a feed-forward inner dimension of approximately 11,000. Rather than undertaking full-parameter re-training, EMSLlama introduces trainable Low-Rank Adapters (LoRA, rank 16) into each attention and feed-forward block. For each original block weight $W_0 \in \mathbb{R}^{d \times k}$, it augments the weight as:

$$W = W_0 + \Delta W = W_0 + BA,$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r = 16 \ll \min(d, k)$. This structure updates only 0.13% of total parameters, preserving original language capabilities while enabling specialization to the EMS symptom extraction task. The underlying Llama-3-8B was pre-trained on web-scale corpora using the standard autoregressive cross-entropy objective:

$$\mathcal{L}_{\text{pre}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),$$

where $x = (x_1, \ldots, x_T)$ is a token sequence. EMSLlama's adaptation does not involve re-running this step but builds upon the pre-trained weights (Jin et al., 18 Nov 2025).
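For concreteness, the following minimal PyTorch sketch applies the rank-16 low-rank update to a single 4096×4096 projection. Only the rank is taken from the paper; the scaling factor, initialization, and choice of a single attention projection are illustrative assumptions rather than reported implementation details.

```python
# Minimal sketch of the LoRA update W = W0 + B A described above.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W0 plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)               # backbone weight stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B in R^{d x r}, zero-initialized
        self.scale = alpha / r                                # scaling factor is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W0 x + scale * (B A) x; the adapter starts as a no-op because B = 0
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# One 4096 -> 4096 attention projection of a Llama-3-8B block
proj = LoRALinear(nn.Linear(4096, 4096, bias=False), r=16)
trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
print(f"trainable adapter parameters in this block: {trainable:,}")  # 131,072
```

The overall 0.13% figure depends on which projections across the 32 layers receive adapters, which the summary above does not enumerate.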
2. Symptom Normalization Pipeline and Training Procedure
EMSLlama operates exclusively on text obtained from speech-to-text (STT) front ends (e.g., Whisper, Google Cloud), never ingesting raw audio features. The fine-tuning pipeline includes two major stages:
Stage 1: Data Augmentation
- 480 real-world pre-arrival EMS audio recordings are selected, representing 40 conversation templates, 4 speakers, and 3 microphones.
- From each, error-laden symptom phrases are extracted (e.g., “arts” for “ARDS”, “shown up of breath” for “shortness of breath”).
- Each noisy phrase is paired with its ground-truth normalized label using a canonical dispatcher–bystander prompt template, forming 384 training pairs and 96 validation pairs (a pair-construction sketch follows this list).
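A minimal sketch of this pair construction is shown below. The exact dispatcher–bystander prompt wording is not published, so `PROMPT_TEMPLATE` is a hypothetical stand-in; the 80/20 split simply mirrors the reported 384/96 train–validation counts.

```python
# Illustrative construction of (noisy phrase -> normalized label) training pairs.
import json
import random

PROMPT_TEMPLATE = (
    "Dispatcher-bystander call transcript fragment: \"{noisy}\"\n"
    "Normalize the reported symptom to its standard pre-arrival name:"
)

# Two example error-laden phrases from the text; the real set has 480 source recordings.
examples = [
    {"noisy": "arts",               "label": "ARDS"},
    {"noisy": "shown up of breath", "label": "shortness of breath"},
]

pairs = [
    {"prompt": PROMPT_TEMPLATE.format(noisy=ex["noisy"]), "completion": ex["label"]}
    for ex in examples
]

random.seed(0)
random.shuffle(pairs)
split = int(0.8 * len(pairs))            # paper uses a 384 / 96 train-validation split
train, val = pairs[:split], pairs[split:]

with open("ems_train.jsonl", "w") as f:
    for p in train:
        f.write(json.dumps(p) + "\n")
```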
Stage 2: LoRA Fine-Tuning
- All Llama-3-8B backbone weights are frozen.
- Only the newly introduced LoRA blocks are trained, using the standard cross-entropy loss over the (noisy phrase, normalized label) pairs (a minimal configuration sketch follows this list).
- No raw spectrogram, STFT, or mel-filterbank features are exposed to EMSLlama; all acoustic variability is handled upstream in the STT stage.
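A plausible configuration for this stage, using Hugging Face PEFT, is sketched below. The rank (16) and the frozen backbone follow the paper; the target modules, alpha, dropout, and dtype are assumptions, not reported values.

```python
# Sketch of LoRA fine-tuning setup with Hugging Face PEFT (assumed tooling).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Meta-Llama-3-8B"   # gated model; requires Hugging Face access
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=16,                      # rank stated in the paper
    lora_alpha=32,             # assumed scaling
    lora_dropout=0.05,         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)   # backbone weights stay frozen
model.print_trainable_parameters()        # reports the small trainable share

# Training then proceeds with the standard causal-LM cross-entropy loss over the
# (noisy phrase, normalized label) pairs, e.g. via transformers.Trainer.
```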
Normalization targets a custom set of “primary pre-arrival symptoms” derived from the NEMSIS ontology, covering only symptoms that can be observed and reported in pre-arrival contexts. Inference yields standardized English symptom names used by dispatchers and EMTs, supporting downstream structured decision-support workflows (Jin et al., 18 Nov 2025).
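Continuing the sketches above, inference on a single noisy phrase could look as follows. The deployed prompt wording and decoding settings are not specified in the source, so the inline prompt and greedy decoding here are assumptions; `model` and `tokenizer` are the adapted objects from the fine-tuning sketch.

```python
# Inference sketch: normalize one noisy STT phrase to a standard symptom name.
import torch

def normalize_symptom(noisy_phrase: str, model, tokenizer, max_new_tokens: int = 16) -> str:
    prompt = (
        f"Dispatcher-bystander call transcript fragment: \"{noisy_phrase}\"\n"
        "Normalize the reported symptom to its standard pre-arrival name:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]   # strip the echoed prompt
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# e.g. normalize_symptom("shown up of breath", model, tokenizer)
#      -> "shortness of breath" (per the paper's example pairing)
```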
3. Performance Evaluation and Benchmarking
EMSLlama is evaluated on 5,760 held-out noisy transcript inputs, spanning 12 STT systems and 480 audio samples. Comparison baselines include GPT-4o (best result over multiple runs) and five biomedical NER models (SciSpaCy small/medium/large, SciBERT, BC5CDR).
Key results on Whisper-medium transcripts:
| Model | ExactMatch (↑) | CER (↓) | BLEU (↑) |
|---|---|---|---|
| GPT-4o | 0.57 | 0.53 | 0.45 |
| EMSLlama | 0.89 | 0.12 | 0.92 |
ExactMatch is defined as the fraction of cases where the output exactly matches the gold normalized symptom. EMSLlama yields a +32-point (absolute) improvement in ExactMatch over GPT-4o, a gap assessed by paired bootstrap. Character Error Rate (CER) and BLEU similarly favor EMSLlama. The SciSpaCy-based NER models exhibited lower accuracy and greater confusion, particularly on highly noisy transcripts.
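The two string-level metrics can be computed as sketched below. ExactMatch follows the definition in the text; the CER formulation (character-level Levenshtein distance divided by reference length) and the case-insensitive comparison are standard choices the paper does not spell out, and BLEU would come from a standard toolkit such as sacrebleu.

```python
# Sketch of ExactMatch and CER over (prediction, gold) pairs.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def exact_match(preds, golds):
    # case-insensitive comparison is an assumption, not stated in the source
    return sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds)) / len(golds)

def cer(preds, golds):
    return sum(levenshtein(p, g) for p, g in zip(preds, golds)) / sum(len(g) for g in golds)

preds = ["shortness of breath", "chest pain"]
golds = ["shortness of breath", "chest pains"]
print(exact_match(preds, golds), round(cer(preds, golds), 3))
```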
Common failure modes include (i) rare or unseen symptoms in the augmentation set, and (ii) heavily truncated transcripts, which degrade context for normalization. Confusion matrices indicate most errors cluster among low-frequency classes (Jin et al., 18 Nov 2025).
4. Real-Time Inference and Edge Deployment
Deployed within the TeleEMS system, EMSLlama runs on an edge server equipped with three NVIDIA A30 GPUs:
- Inference latency on a single A30: ~120 ms per ~50-token conversation transcript; ~200 ms including tokenization and I/O (a measurement sketch follows this list).
- No extra CPU cores are required beyond those serving the WebRTC and NLP hosting roles.
- End-to-end system latency from speech input to symptom display on EMT smart glasses is under 1 second.
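A minimal way to time per-transcript generation on a single GPU is sketched below, assuming the `model` and `tokenizer` from the fine-tuning sketch and a CUDA device; the latency figures above are the paper's measurements, not outputs of this snippet.

```python
# Latency measurement sketch for a single ~50-token transcript on one GPU.
import time
import torch

transcript = "caller says the patient is shown up of breath and very dizzy"  # illustrative
inputs = tokenizer(transcript, return_tensors="pt").to(model.device)

# Warm-up pass so one-time CUDA overheads do not distort the timing
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=16, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=16, do_sample=False)
torch.cuda.synchronize()
print(f"generation latency: {(time.perf_counter() - start) * 1000:.1f} ms")
```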
Deployment workflow:
- Devices stream audio/video via EMSStream (Janus WebRTC) to the edge.
- Audio is transcribed by Whisper or Google STT.
- Transcripts are processed by EMSLlama for symptom normalization.
- Normalized outputs are forwarded to dispatch consoles and EMT wearables via EMSStream.
This ensures real-time, low-latency delivery of standardized symptom information, supporting rapid pre-arrival intervention recommendations (Jin et al., 18 Nov 2025).
5. Limitations and Extension Opportunities
EMSLlama’s primary limitations stem from front-end transcript quality: overlapping speakers and high background noise degrade the input, and these errors propagate into normalization. Out-of-vocabulary or rare symptoms (e.g., “animal bites”, which is absent from the training set) remain challenging, and the system is currently limited to English.
Proposed future directions include:
- Multilingual LoRA fine-tuning (e.g., Spanish, Mandarin) leveraging synthetic and open transcripts.
- Speaker diarization and noise-robust front-end preprocessing for more reliable conversation parsing.
- Enhanced multimodal fusion, integrating audio prosody and visual features from video via unified adapter frameworks.
- Patient-specific adapters for incorporating contextual information such as age or comorbidities into interpretation (Jin et al., 18 Nov 2025).
6. Significance and Distinctiveness
EMSLlama demonstrates the viability of large-scale, low-rank-adapted LLMs for deterministic, high-accuracy medical symptom extraction and normalization in safety-critical, resource-constrained edge environments. It achieves robust performance under substantial input noise, greatly exceeding standard and frontier LLM baselines, and supports real-world EMS workflows with low decision latency.
The architecture’s selective parameter updating strategy suggests a scalable path for rapid domain specialization of foundational LLMs without forfeiting original language understanding, and its integration in TeleEMS paves the way for next-generation intelligent edge analytics in emergency response (Jin et al., 18 Nov 2025).