- The paper quantifies large discrepancies between radiologist and LLM judge ratings, showing near-zero agreement on translation quality.
- It demonstrates that LLM-generated translations are concise and fluent but often miss radiology-specific terminology and clinical nuances.
- The study suggests a hybrid workflow combining automated screening with selective expert review to ensure high-fidelity translations for clinical applications.
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports
Study Motivation and Context
The accurate translation of radiology reports is increasingly critical in cross-border clinical practice, medical education, and AI research. The recent proliferation of LLMs capable of highly fluent, context-aware neural machine translation (NMT) raises questions about translation adequacy in high-stakes, domain-specific settings such as radiology. Concomitantly, the LLM-as-a-judge paradigm—using LLMs themselves for automated translation quality assessment—has become a dominant methodological tool, yet its validity for high-fidelity medical content remains poorly established.
This study addresses both gaps: (1) quantifying the fidelity and educational suitability of LLM-generated translations of English chest CT reports into Japanese, and (2) critically examining the alignment (or lack thereof) between expert radiologist evaluation and LLM-based automated judging. The analysis uses DeepSeek-V3.2, a leading open-weight LLM, and the CT-RATE-JPN dataset, with evaluation triangulated across two blinded human radiologists and three top-tier LLM judges (DeepSeek-V3.2, Mistral Large 3, GPT-5).
Experimental Design
Dataset and Translation Procedures
The evaluation used 150 chest CT reports from the CT-RATE-JPN validation set. Each report was paired with two competing Japanese translations: (a) a human-edited version produced under multi-stage radiologist quality control, and (b) a DeepSeek-V3.2-generated translation prompted for domain fidelity. Sentence- and lexical-level statistics, including total length, sentence segmentation, and type-token ratio (TTR), were computed to characterize structural differences between the two translation methods.
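The surface statistics above are straightforward to reproduce. The sketch below computes character length, sentence count, and TTR for a report; the regex tokenizer is a stand-in assumption, since the paper does not specify its tokenizer and production Japanese text would normally pass through a morphological analyzer such as MeCab.

```python
# Minimal sketch of the surface statistics described above.
# The tokenizer is a placeholder assumption, not the study's method.
import re
from dataclasses import dataclass

@dataclass
class TextStats:
    n_chars: int
    n_sentences: int
    ttr: float  # unique tokens / total tokens

def simple_tokenize(text: str) -> list[str]:
    # Placeholder: individual kana/kanji characters plus Latin/digit runs.
    return re.findall(r"[一-龯ぁ-んァ-ヶー]|[A-Za-z0-9]+", text)

def text_stats(report: str) -> TextStats:
    sentences = [s for s in re.split(r"[。．.!?\n]+", report) if s.strip()]
    tokens = simple_tokenize(report)
    ttr = len(set(tokens)) / len(tokens) if tokens else 0.0
    return TextStats(len(report), len(sentences), ttr)

# Example: compare a human-edited and an LLM-generated translation.
human, llm = "両肺に結節を認める。", "両肺に結節あり。"
print(text_stats(human), text_stats(llm))
```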
Human and LLM-Based Evaluation Protocol
Blinded, pairwise evaluation was conducted on four criteria: terminology accuracy, readability/fluency, overall clinical suitability, and radiologist-style authenticity. Two radiologists (one attending, one resident) independently completed all assessments while blinded to translation source. Three LLMs served as independent automated judges over the same 150 report pairs, each given the English source and the A/B translation pair in randomized order. Agreement was quantified with quadratic weighted kappa (QWK), supplemented by confusion matrices.
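As a worked illustration of the agreement statistic, the following sketch computes QWK, raw agreement, and a confusion matrix over toy pairwise verdicts with scikit-learn. The ordinal label coding (0 = prefers human-edited, 1 = tie, 2 = prefers LLM) is our assumption, not the study's published schema.

```python
# QWK between two raters over ordinal pairwise verdicts (toy data).
from sklearn.metrics import cohen_kappa_score, confusion_matrix

rater1 = [2, 1, 1, 0, 2, 1, 0, 2]  # one verdict per report pair
rater2 = [1, 1, 2, 0, 1, 1, 2, 2]

qwk = cohen_kappa_score(rater1, rater2, weights="quadratic")
raw = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"QWK={qwk:.2f}, raw agreement={raw:.0%}")
print(confusion_matrix(rater1, rater2, labels=[0, 1, 2]))
```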
Key Quantitative Results
- Radiologist Inter-Rater Agreement: Agreement between the two radiologists was negligible across criteria (QWK 0.01–0.06; raw agreement 28–37%), consistently below conventional thresholds for even minimal annotation reliability.
- Radiologist Preferences: Radiologist 1 slightly favored LLM translations for readability and overall quality (51%) but judged terminology as generally equivalent (59% ties). Radiologist 2 more often found the translations equivalent for readability (75% ties) and preferred human-edited translations for overall quality (40% vs 21%).
- LLM Judge Behavior: All three LLM judges overwhelmingly favored the LLM-generated translation across criteria: terminology accuracy (79–91%), readability/fluency (70–95%), overall quality (83–95%), and, most strikingly, radiologist-style authenticity (>93%). (A toy tally of such pairwise verdicts is sketched after this list.)
- Radiologist vs LLM Judge Agreement: Human-LLM judge agreement was near zero or negative (QWK –0.04 to 0.15), confirming a substantive divergence in translation quality standards.
- Linguistic Discrepancies: LLM translations were systematically shorter (~94% of the human-edited length), contained more sentences (mean 20.3 vs 19.6), and showed higher TTR; content-word lexical diversity was equivalent, however, suggesting that surface-level concision did not entail loss of unique clinical vocabulary.
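To make the reported preference rates concrete, a toy aggregation of pairwise verdicts into per-criterion rates might look like the following; the field names and verdict labels are illustrative, not the study's schema.

```python
# Tally pairwise verdicts into per-criterion preference rates
# (share preferring the LLM translation, the human-edited one, or a tie).
from collections import Counter

CRITERIA = ["terminology", "readability", "overall", "style"]

def preference_rates(verdicts: list[dict[str, str]]) -> dict[str, dict[str, float]]:
    """verdicts: one dict per report pair, e.g. {"terminology": "llm", ...}."""
    rates = {}
    for c in CRITERIA:
        counts = Counter(v[c] for v in verdicts)
        n = sum(counts.values())
        rates[c] = {k: counts[k] / n for k in ("llm", "human", "tie")}
    return rates

toy = [
    {"terminology": "tie", "readability": "llm", "overall": "llm", "style": "llm"},
    {"terminology": "llm", "readability": "tie", "overall": "human", "style": "llm"},
]
print(preference_rates(toy))
```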
Interpretation and Implications
Systematic LLM Judge Bias
The study provides direct empirical evidence of a systematic, strongly directional bias in LLM-as-a-judge evaluation: all judge models overwhelmingly preferred LLM-generated output regardless of ground-truth adjudication, and often failed to penalize clinically non-idiomatic or inaccurate translations. Qualitative analysis showed that rationales emphasizing "conciseness" and "naturalness" predominated, particularly for readability and style, regardless of whether domain-specific clinical terminology was preserved or used in accordance with radiologist conventions.
This aligns with recent findings on self-preference and fluency bias in LLM-based evaluation (e.g., Wataoka et al., 2024; 2024.emnlp-main.474). Optimization for instruction-following and fluency during LLM training leads to systematic overvaluation of surface-level coherence over clinical authenticity or regulatory-grade accuracy. The judge models also showed poor sensitivity to the register specific to Japanese radiology, at odds with the preferences of experienced radiologists.
Human Expert Evaluation Inconsistency
The study underscores the high degree of subjectivity in human translation-quality judgments even among domain experts, which undermines the reproducibility and generalizability of "expert-validated" translation workflows. The two-radiologist panel exhibited divergent thresholds and stylistic expectations, especially in judging overall quality and the register expected of a native radiologist report. This variance cautions against drawing clinical-implementation conclusions from single-rater or non-adjudicated studies of machine translation in radiology.
Educational and Data Curation Recommendations
For low-consequence use cases (e.g., large-scale multilingual pretraining, low-stakes education corpora), LLM-generated translations—especially as implemented with DeepSeek-V3.2—are generally sufficient in breadth and fluency. For high-criticality tasks (e.g., creation of standardized learning materials, clinical AI model training datasets where preservation of domain-specific uncertainty and idiom is required), exclusive reliance on LLM-based assessment is inappropriate. Instead, a staged approach is advised: scale via LLMs for initial expansion and screening, followed by selective expert human review for gold-standard content, balancing throughput with diagnostic/educational rigor.
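A minimal sketch of such a staged pipeline follows; the two helper functions and the 0.8 triage threshold are hypothetical placeholders standing in for real API calls and a tuned cutoff, neither of which comes from the paper.

```python
# Staged curation: translate at scale with an LLM, screen automatically,
# and route only low-confidence cases to radiologist review.
from typing import Callable

def llm_translate(source_en: str) -> str:
    # Placeholder for a real LLM API call (e.g., to DeepSeek-V3.2).
    return f"[JA translation of: {source_en}]"

def llm_judge_score(source_en: str, translation_ja: str) -> float:
    # Placeholder for an automated screen returning a 0-1 quality score.
    return 0.5

def curate(reports_en: list[str],
           expert_review: Callable[[str, str], str],
           threshold: float = 0.8) -> list[str]:
    curated = []
    for src in reports_en:
        ja = llm_translate(src)
        if llm_judge_score(src, ja) < threshold:  # flag uncertain cases
            ja = expert_review(src, ja)           # selective expert vetting
        curated.append(ja)
    return curated
```

The threshold trades expert workload against residual error and would need calibration against radiologist adjudication before any clinical or educational deployment.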
Toward Radiologist-Aligned Automated Assessment
The persistent lack of agreement between LLM-judging and domain experts reveals the current limitations of scaling high-fidelity translation QA processes with generic LLM-based rubrics. The development of LLM judges explicitly aligned and externally validated against radiologist communities and task-specific rubrics, potentially incorporating real Japanese radiology training data and explicit clinical conventions, is a clear avenue for future research. Prompt optimization, rubric restructuring, and domain-adaptive instruction tuning remain open for investigation.
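One plausible shape for such a radiologist-aligned assessment is an explicit, criterion-anchored judging prompt rather than a generic quality question. The wording below is our illustration only; the paper does not publish a validated rubric.

```python
# Illustrative criterion-anchored judging prompt (not the study's rubric).
JUDGE_PROMPT = """You are evaluating a Japanese translation of an English chest CT report.
Score each criterion from 1 (unacceptable) to 5 (indistinguishable from a
Japanese radiologist's report). Do NOT reward brevity or fluency on their own.

1. Terminology: are findings rendered with standard Japanese radiology terms?
2. Hedging: is diagnostic uncertainty (e.g., "疑い", "否定できない") preserved?
3. Register: does sentence structure follow Japanese report conventions?
4. Completeness: is every finding and measurement from the source retained?

English source:
{source}

Japanese translation:
{translation}

Return JSON: {{"terminology": n, "hedging": n, "register": n, "completeness": n}}"""

prompt = JUDGE_PROMPT.format(source="Nodule in right upper lobe.",
                             translation="右上葉に結節を認める。")
```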
Limitations and Future Directions
The findings are limited to chest CT reports and Japanese translations; generalizability to other radiology subfields, document types, or languages is untested. Only two human raters were included, precluding robust consensus analysis. The LLM judgment protocol reflected single-prompt, single-rubric frameworks without systematic exploration of meta-evaluation parameters. Importantly, the study did not directly assess the impact on downstream educational outcomes.
Recommended future research includes outcome-based evaluation in educational use, inclusion of broader expert panels, and the explicit design of radiologist-aligned evaluation rubrics for both human and automated assessment. Expansion across medical imaging subdomains and source/target language pairs is needed to substantiate general claims about LLM translation in medical settings.
Conclusion
Translation of radiology reports by state-of-the-art LLMs such as DeepSeek-V3.2 produces linguistically fluent and structurally concise Japanese reports. However, both quantitative and qualitative analyses demonstrate that LLM-as-a-judge protocols are insufficient to ensure clinical and educational adequacy, given their marked preference for LLM outputs, insensitivity to domain-specific conventions, and lack of alignment with specialist human judgment. For high-stakes radiology educational and AI training applications, systematic expert review remains indispensable for quality assurance. Hybrid, scalable workflows combining LLM capacity and selective expert vetting represent the current optimal approach for curation of multilingual radiology textual resources.