NLPAI4Health 2025 Shared Task
- NLPAI4Health 2025 Shared Task is a challenge focused on multilingual clinical dialogue summarization and question answering, emphasizing translation-centric pipelines and multitask prompting with distilled models.
- The system employs a three-stage process that translates Indic dialogues to English, applies a multitask distilled model for summarization and answering, and re-translates outputs back to Indic languages.
- Robust performance metrics in mid-resource languages and noted challenges in low-resource settings highlight the practical efficacy and limitations of using quantized, instruction-driven models in clinical NLP.
The NLPAI4Health 2025 Shared Task focused on multilingual clinical dialogue summarization and question answering, with an emphasis on performance across Indic languages and English. A key contribution came from team Kl33n3x, which developed a highly efficient, translation-centric pipeline leveraging distilled LLMs for multilingual processing. Their system demonstrated that compact distilled models, when combined with robust translation and multitask prompting architectures, can provide strong performance in both mid- and low-resource medical dialogue understanding, bypassing the need for task-specific fine-tuning and enabling efficient deployment across diverse settings (Novoa et al., 14 Jan 2026).
1. System Architecture and Pipeline
The Kl33n3x submission implemented a three-stage pipeline optimized for multilingual dialogue processing:
- Forward Translation (Indic → English): Structured clinical dialogues, with explicit speaker and turn markers, were translated to English using the prajdabre/rotary-indictrans2-indic-en-dist-200M model (IndicTrans2 with rotary position embeddings). Translation was guided by a prepended language tag (e.g., \<2en>), and inputs were truncated if necessary to a 2,048-token limit.
- Multitask Text Generation (English): The English text was input to a 2.55B-parameter distilled, 4-bit quantized variant of Qwen3-4B (unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit) with a context window of up to 256 K tokens. This model performed three tasks (narrative summarization, structured key-value summarization, and question answering), controlled solely via instruction prompts. Inference employed greedy decoding with a maximum output of 3,000 tokens.
- Reverse Translation (English → Indic): Outputs were translated back to the source Indic language using the prajdabre/rotary-indictrans2-en-indic-dist-200M model, again guided by an appropriate language tag (e.g., \<2xx>) (Novoa et al., 14 Jan 2026).
This modular design enabled translation-based interoperability, allowing a single English-centric LLM to serve all target languages, even in low-resource settings.
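The three-stage control flow can be sketched as follows. This is a minimal illustration of the pipeline's structure, not the team's actual code: the `translate` and `generate` callables stand in for the IndicTrans2 and Qwen3 models, and the whitespace-based truncation is a crude stand-in for real subword tokenization.

```python
# Sketch of the three-stage pipeline: forward translation, multitask
# generation, reverse translation. All helper names are illustrative.

MAX_TRANSLATION_TOKENS = 2048  # IndicTrans2 input/output cap from the paper


def truncate(text: str, limit: int = MAX_TRANSLATION_TOKENS) -> str:
    """Crude whitespace-token truncation standing in for real tokenization."""
    tokens = text.split()
    return " ".join(tokens[:limit])


def run_pipeline(dialogue: str, src_tag: str, translate, generate) -> str:
    """Forward-translate, run the multitask generator, reverse-translate."""
    # Stage 1: Indic -> English, guided by a prepended language tag.
    english = translate(truncate("<2en> " + dialogue))
    # Stage 2: instruction-controlled multitask generation in English.
    output = generate("Summarize the following dialogue:\n" + english)
    # Stage 3: English -> source Indic language, tagged with the target.
    return translate(truncate(src_tag + " " + output))
```

With identity stubs for both models, the function simply threads the language tags and instruction through the three stages, which is the essential plumbing the real models slot into.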
2. Distillation and Model Compression
A central aspect of the approach was the use of knowledge distillation to train a lightweight yet performant generation model:
- Base Model: 32 transformer decoder layers, hidden size ≈ 4,096, multi-head self-attention, matching Qwen3-4B’s macroarchitecture.
- Weight Quantization: Compressed to 4-bit weights via the Unsloth framework, significantly reducing memory requirements (≈ 6 GB for inference).
- Distillation Setup:
- Teacher: Qwen3-4B-Instruct (≈ 4 B parameters).
- Student: 2.55B parameter, 4-bit quantized model.
- Loss Function: The distillation loss was based on soft targets with adjustable temperature $\tau$:

  $$\mathcal{L}_{\mathrm{KD}} = \tau^{2}\,\mathrm{KL}\!\left(p_{T}^{\tau}\,\|\,p_{S}^{\tau}\right),$$

  with $p_{T}^{\tau}$ and $p_{S}^{\tau}$ denoting the teacher and student output distributions softened at temperature $\tau$, respectively.
- Combined Objective: During training,

  $$\mathcal{L} = (1-\alpha)\,\mathcal{L}_{\mathrm{CE}} + \alpha\,\mathcal{L}_{\mathrm{KD}},$$

  where $\mathcal{L}_{\mathrm{CE}}$ is standard cross-entropy with the ground truth, and $\alpha$ controls the balance between task fidelity and teacher imitation.
No LoRA/adapters or task-specific parameter heads were used; multitask handling was entirely instruction-based (Novoa et al., 14 Jan 2026).
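The combined objective above can be made concrete with a small pure-Python sketch over toy per-token logits. Function names, the per-token framing, and the default temperature are illustrative assumptions; real training would compute this over batched tensors.

```python
# Toy illustration of the combined distillation objective:
# L = (1 - alpha) * CE + alpha * KD, with temperature-softened KL.
import math


def softmax(logits, tau=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / tau) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(teacher_logits, student_logits, tau=2.0):
    """Soft-target KL(teacher || student) at temperature tau, scaled by tau^2."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = sum(p * math.log(p / q) for p, q in zip(p_t, p_s))
    return tau * tau * kl


def ce_loss(student_logits, gold_index):
    """Standard cross-entropy against the ground-truth token index."""
    return -math.log(softmax(student_logits)[gold_index])


def combined_loss(teacher_logits, student_logits, gold_index, alpha=0.5):
    """Weighted sum balancing task fidelity (CE) and teacher imitation (KD)."""
    return (1 - alpha) * ce_loss(student_logits, gold_index) \
        + alpha * kd_loss(teacher_logits, student_logits)
```

At $\alpha = 0$ the objective reduces to plain cross-entropy, and the KD term vanishes whenever the student matches the teacher exactly, which makes the two limiting behaviors easy to sanity-check.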
3. Multitask Training and Prompting
A single decoder model managed all target tasks by leveraging natural-language prompts. For each training example, the input comprised a task instruction (e.g., “Summarize the following dialogue:”) concatenated with the English dialogue. The target output was either:
- A narrative summary,
- Structured key-value pairs,
- Answers to specified questions.
All tasks shared parameters, with no separate heads; task distinction was handled by prefix instructions. The model maximized the log probability of target outputs conditioned on the instruction and dialogue, i.e., $\log p_{\theta}(y \mid \text{instruction}, x)$, across all tasks. This multitask setup enabled strong zero- and few-shot generalization and simplified deployment (Novoa et al., 14 Jan 2026).
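Instruction-based task routing with a single model amounts to prompt templating. The sketch below assumes the paper's example instruction for narrative summarization; the wording of the other two instructions and the helper's interface are illustrative guesses, not the team's actual prompts.

```python
# One model, one template, no task-specific heads: the task is selected
# purely by the instruction prefix. Instruction strings for structured
# summarization and QnA are assumed for illustration.

INSTRUCTIONS = {
    "narrative_summary": "Summarize the following dialogue:",
    "structured_summary": "Extract key-value pairs from the following dialogue:",
    "qa": "Answer the questions below using the following dialogue:",
}


def build_prompt(task: str, dialogue: str, questions=None) -> str:
    """Concatenate the task instruction with the English dialogue."""
    parts = [INSTRUCTIONS[task], dialogue]
    if task == "qa" and questions:
        parts.append("Questions:\n" + "\n".join(questions))
    return "\n\n".join(parts)
```

Because every task goes through the same `build_prompt` path, adding a new task would only require a new instruction string, which is the deployment simplification the paper highlights.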
4. Data Preprocessing and Translation Details
The pipeline imposed rigorous formatting and preprocessing:
- Dialogue Formatting: Structured XML/JSON sources were converted to lines such as “Doctor: … Patient: …”, with extraneous whitespace removed, punctuation normalized, and Unicode encoding made consistent.
- Special Tokens: Language tags (e.g., \<2en>, \<2hi>) signaled the forward/reverse translation direction to the IndicTrans2 models.
- Tokenization: Text was tokenized using byte-level BPE for the generation model (inherited from Qwen) and SentencePiece for IndicTrans2.
- Limits: Translation inputs and outputs were capped at 2,048 tokens. Generated English was passed to the reverse translation stage without further cleaning, with post-translation formatting left to the reverse translation model.
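The formatting and normalization step can be sketched with Python's standard library. The dict shape of a structured turn is an assumption about the XML/JSON sources; the normalization choices (NFC, whitespace collapse) follow the bullets above.

```python
# Sketch of dialogue preprocessing: Unicode normalization, whitespace
# collapse, and "Speaker: utterance" line formatting. The turn dict
# shape is an illustrative assumption about the structured source data.
import re
import unicodedata


def normalize_text(text: str) -> str:
    """NFC-normalize Unicode and collapse extraneous whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def format_dialogue(turns: list) -> str:
    """Render structured turns as 'Doctor: ... / Patient: ...' lines."""
    return "\n".join(
        f"{turn['speaker']}: {normalize_text(turn['text'])}" for turn in turns
    )
```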
Greedy decoding was used for both generation and translation stages; batch sizes were adjusted dynamically to the available GPU memory (e.g., batches of 4–8 dialogues on a 24 GB device) (Novoa et al., 14 Jan 2026).
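Greedy decoding itself is simple enough to illustrate in a few lines: at each step, take the argmax token until an end-of-sequence token or the output cap (3,000 tokens for generation). The `step` callable standing in for a model forward pass, and the end-token convention, are assumptions for illustration.

```python
# Toy greedy decoder: argmax at every step, stop on EOS or the token cap.
# step(prefix_tokens) -> list of logits over the vocabulary (assumed API).

def greedy_decode(step, max_new_tokens=3000, eos=0):
    """Return the greedily chosen token ids, excluding the EOS token."""
    tokens = []
    for _ in range(max_new_tokens):
        logits = step(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos:
            break
        tokens.append(next_id)
    return tokens
```

Greedy decoding is deterministic, which also makes the pipeline's outputs reproducible across runs at fixed inputs.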
5. Evaluation and Performance Analysis
The Kl33n3x system’s performance was benchmarked by task-based win rates and standard NLP metrics across nine languages:
| Task / Metric | Best-Performing Languages | Reported Results |
|---|---|---|
| QnA Win Rate | Marathi, Tamil, Hindi | 86.7%, 86.7%, 80.0% |
| Narrative Summarization Win Rate | Hindi, Gujarati | 66.7%, 73.3% |
| Structured Summarization Win Rate | Telugu, Assamese | 66.7%, 60.0% |
| QnA F1 / BERTScore | – | 0.43–0.67 / 0.82–0.86 |
| Narrative Summarization F1 / BERTScore | – | 0.81–0.92 / 0.83–0.84 |
| Structured Summarization F1 / BERTScore | – | 0.35–0.43 / 0.90–0.92 |
Strong win rates for Marathi and Tamil QnA (86.7%) and robust Hindi QnA (80.0%) demonstrate the efficacy of the pipeline for mid-resource Indic languages, while underperformance in Telugu QnA (13.3%) highlights translation challenges for low-resource/nuanced cases. F1 and BERTScore metrics, as reported, further substantiate competitive performance in both question answering and summarization tasks (Novoa et al., 14 Jan 2026).
6. Strengths and Limitations
Strengths:
- Utilizes specialized Indic-to-English translation models to leverage the strengths of English-centric LLMs for downstream tasks.
- Long (256 K token) context enables preservation of full clinical dialogue, critical for coherence in medical scenarios.
- Single multitask, instruction-tuned decoder simplifies deployment and maintenance; the model’s size and quantization reduce inference resource requirements.
- Distilled 2.55B parameter model matches or exceeds much larger models in multilingual clinical settings with substantially lower compute costs.
Limitations:
- Pipeline latency is introduced due to sequential translation and generation stages.
- Translation errors, particularly with domain-specific or medical terminology, can propagate and affect downstream quality (as evidenced by Telugu QnA results).
- IndicTrans2’s 2,048-token limit risks the truncation of long dialogues, with potential loss of key medical information.
- Loss of cultural, idiomatic, or domain nuances in translation can impact quality, especially for low-resource languages (Novoa et al., 14 Jan 2026).
7. Significance and Implications
The NLPAI4Health 2025 Shared Task system introduced by Kl33n3x demonstrates that compact, distilled multilingual architectures can retain strong performance across multiple clinical NLP tasks and languages, even without specialized fine-tuning for each specific task or language. This suggests practical feasibility for efficient, instruction-driven LLM deployment in real-world low-resource and cross-lingual clinical environments. The approach’s reliance on translation pipelines, multitask conditioning, and model quantization frames a replicable methodology for future shared tasks in multilingual medical dialogue processing, particularly where computational resources or in-language supervision are limited (Novoa et al., 14 Jan 2026).