TriBench-Ko: Korean Judicial LLM Benchmark

Updated 28 May 2026

TriBench-Ko is a risk-aware benchmark providing systematic evaluation of LLMs in Korean judicial workflows by mapping four judicial tasks to eight distinct risk types.
It applies a rigorous binary verification protocol on real-world Supreme and Constitutional Court decisions to expose failure modes such as omission and hallucination.
The benchmark guides deployment strategies by revealing operational risks and empirical performance metrics across judicial assistance tasks.

TriBench-Ko is a benchmark specifically designed for the evaluation of LLMs in Korean judicial workflows. It provides a comprehensive risk-aware testing framework that aligns directly with authentic tasks and decision structures encountered in courts, focusing on both task accuracy and operational risk. TriBench-Ko systematically assesses four core judicial assistance tasks against eight distinct deployment risk types by means of a rigorous, variant-rich binary verification protocol, and highlights critical LLM failure modes with detailed quantitative evidence (Lee et al., 5 May 2026).

1. Benchmark Motivation and Scope

TriBench-Ko arises from a key limitation observed in existing legal LLM benchmarks, such as LexGLUE, LegalBench, LawBench, LEXam, KBL, KCL, LBOX Open, AgentCourt, and Ready Jurist One. Prior benchmarks predominantly evaluate models using proxy tasks—bar examination-style questions, classification, multiple-choice, or short-answer formats—and report only aggregate accuracy or BLEU-style metrics. These approaches do not capture the complexity of real-world judicial workflows, which require document familiarization, legal research, issue framing, and evidentiary review, nor do they systematically expose failure modes manifesting under varied input conditions such as paraphrasing, repeated runs, or biased prompts.

Furthermore, previous benchmarks either ignore or insufficiently integrate risk dimensions such as hallucination, omission, demographic bias, statutory misapplication, prompt sensitivity, and adjudicative overreach. TriBench-Ko addresses these limitations by evaluating models as operational systems and by focusing directly on risk types arising in verified judicial tasks using authentic Korean legal documents.

2. Task Definitions and Dataset Composition

TriBench-Ko is grounded in full-text decisions from the Korean Supreme Court and Constitutional Court, selected to ensure doctrinal coverage across ten major legal domains. Its design maps four key adjudicative assistance tasks to a taxonomy of eight risk types in a Task × Risk matrix, resulting in 31 evaluated task-risk pairings and a total of 1,414 binary yes/no judgment items.

Task Definitions

Jurisprudence Summarization: The model must verify if a provided summary of a judicial decision correctly reflects statutory provisions, cited precedents, factual findings, and the logic leading to the holding.
Precedent Retrieval: The model verifies whether prior cases selected from a set of candidates actually support the input case, based on doctrinal and factual alignment.
Legal Issue Extraction: The task involves determining whether a candidate statement distills the legally determinative question from a factual scenario or decision text.
Evidence Analysis: The model must validate the correct mapping between evidence items and the fact propositions they are intended to support.

Data Annotation

Every item is constructed through a two-stage expert annotation cycle. First, legal-linguistically trained annotators generate input texts, a true statement, a false distractor (injecting a target risk), and rationale for each task-risk pair. Next, three certified Korean attorneys review all items for correctness, completeness, jurisdictional validity, and plausible distractors; consensus adjudication resolves any disputes.

Item Distribution by Task

Task	Source Items
Jurisprudence Summarization	80
Precedent Retrieval	67
Legal Issue Extraction	79
Evidence Analysis	39

Atomic binary judgments—including protocol-driven input and instruction variants—range from 130 to 875 per risk dimension.

3. Taxonomy of Risk and Evaluation Protocols

TriBench-Ko evaluates model outputs using a partitioned risk framework, operationalized through both direct item assessment and adversarial perturbations.

Risk Dimensions

Inaccuracy
- Hallucination: Acceptance or generation of non-existent facts, statutes, or cases.
- Omission: Failure to include required legal elements, with the output otherwise factual.
- Statutory Misapplication: Application or citation of inapplicable legal provisions.
Bias
- Demographic Bias: Different binary responses solely due to protected attribute swaps (e.g., gender, nationality).
- Overcompliance (Sycophancy): Unwarranted answer reversal under leading or biased prompts.
Inconsistency
- Prompt Sensitivity: Failures when the query is paraphrased (strict “both correct” criterion).
- Nondeterminism: Variation in repeated answers under identical prompts (zero temperature).
Adjudicative Overreach
- Use of language reserved for judicial decision-making rather than neutral assistance.

Evaluation Protocols and Metrics

Each item consists of an instruction and a candidate statement, to which the model must respond "yes" or "no." Four testing protocols are applied:

Single (one-shot),
Input text variant (for demographic bias),
Instruction variant (overcompliance and prompt sensitivity),
Repeat runs (nondeterminism).
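
The pairing of risk types with testing protocols described above can be written down as a small lookup table. The following is a minimal sketch, not the benchmark's own code: the identifiers (`RISK_TAXONOMY`, `PROTOCOL_FOR_RISK`, `protocol_for`) and the protocol strings are hypothetical, and the default of the single protocol for risks not explicitly tied to a variant protocol is an assumption.

```python
# Four risk categories and their eight risk types, as listed in the taxonomy.
RISK_TAXONOMY = {
    "Inaccuracy": ["Hallucination", "Omission", "Statutory Misapplication"],
    "Bias": ["Demographic Bias", "Overcompliance"],
    "Inconsistency": ["Prompt Sensitivity", "Nondeterminism"],
    "Adjudicative Overreach": ["Adjudicative Overreach"],
}

# Protocols that the text explicitly ties to a risk type.
PROTOCOL_FOR_RISK = {
    "Demographic Bias": "input_text_variant",
    "Overcompliance": "instruction_variant",
    "Prompt Sensitivity": "instruction_variant",
    "Nondeterminism": "repeat_runs",
}


def protocol_for(risk: str) -> str:
    """Return the testing protocol for a risk type.

    Assumption: risks without an explicit variant protocol use "single".
    """
    return PROTOCOL_FOR_RISK.get(risk, "single")


all_risks = [risk for risks in RISK_TAXONOMY.values() for risk in risks]
```

Keeping the taxonomy and the protocol mapping as separate tables makes it easy to check that every risk type resolves to exactly one protocol.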

The main metric is Macro-F₁, averaged across binary classes and risk dimensions, with standard precision, recall, and strict accuracy formulas applied. In retrieval-style settings, TriBench-Ko frames the task as binary verification over fixed candidates, scored by F₁.

Strict accuracy for variants and repeats requires perfect consistency (no partial credit). Soft metrics such as mean per-variant accuracy and pairwise agreement are also calculated for analysis.
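
The metrics above can be illustrated with a short sketch using the standard definitions of precision, recall, and F₁. The function names are hypothetical and this is not the benchmark's released scoring code. It also assumes that each variant or repeat group shares one gold label, which the source does not spell out.

```python
from itertools import combinations


def f1_for_class(gold, pred, positive):
    """F1 for one class, treating `positive` as the target label."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes=("yes", "no")):
    """Unweighted mean of per-class F1 over the binary labels."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


def strict_accuracy(groups):
    """Fraction of groups in which every variant or repeat is correct.

    `groups` is a list of (gold_label, predictions) pairs; no partial credit.
    """
    return sum(all(p == gold for p in preds) for gold, preds in groups) / len(groups)


def mean_variant_accuracy(groups):
    """Soft metric: accuracy over all individual variant answers."""
    flat = [p == gold for gold, preds in groups for p in preds]
    return sum(flat) / len(flat)


def pairwise_agreement(preds):
    """Soft metric: share of answer pairs within one group that agree."""
    pairs = list(combinations(preds, 2))
    if not pairs:
        return 1.0  # a single answer trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)


gold = ["yes", "no", "yes", "no"]
pred = ["yes", "no", "no", "no"]
print(round(macro_f1(gold, pred), 3))  # 0.733
```

For two groups `[("yes", ["yes", "yes"]), ("no", ["no", "yes"])]`, strict accuracy is 0.5 while mean per-variant accuracy is 0.75, which shows how the strict criterion penalizes a single inconsistent answer.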

4. Model Evaluation and Empirical Findings

Thirteen contemporary LLMs—including proprietary APIs (gpt-5.4, gpt-5.4-mini, gpt-4o), Korean-oriented models (kt-midm-2.0-base-instruct, A.X-3.1-Light, EXAONE-3.5-7.8B-it, kanana-1.5-8b-it), and open-weight instruction-tuned models (Qwen3.5-9B, Qwen3-8B, phi-4, Ministral-3-8B-it, gemma-3-12b-it, Llama-3.1-8B-it)—were evaluated.

Aggregate Risk Results

Macro-F₁ spans 0.835 (gpt-5.4) down to 0.342 (A.X-3.1-Light). Top models: gpt-5.4 (0.835), gpt-5.4-mini (0.781), Qwen3.5-9B (0.771).
Hallucination is well-controlled by top models (F₁ > 0.88); weaker models show marked decline.
Omission is the principal risk (mean F₁ ≈ 0.45), especially in Precedent Retrieval (mean F₁ = 0.293).
Statutory Misapplication challenges weaker models (F₁ < 0.50).
Bias risks are moderate (mean F₁ ≈ 0.70–0.75).
Prompt Sensitivity and Nondeterminism vary widely: top models exceed 0.90 F₁; weaker ones degrade to 0.35–0.60.
Adjudicative Overreach is infrequent (mean F₁ ≈ 0.80).

Task-Wise Performance

Task	Mean F₁
Jurisprudence Summarization	0.699
Precedent Retrieval	0.534
Legal Issue Extraction	0.68
Evidence Analysis	0.71

Omission remains the dominant failure across tasks, with notable drops in Precedent Retrieval. Hallucination is contained in top summarization models, but lower-tier models accept fabricated citations over 30% of the time (F₁ < 0.50). Adjudicative Overreach occurs sporadically even in leading commercial APIs (e.g., gpt-4o, strict accuracy ~0.66).

5. Critical Operational Insights

Several concrete conclusions regarding deployment and quality assurance in real-world judicial contexts are supported:

Any LLM output that depends on retaining detailed legal elements (omission risk) demands thorough human verification.
Precedent retrieval must be validated against official sources to catch missed or spurious authorities.
Summarization outputs require checklists for statutory, precedent, and factual completeness.
LLMs must not be regarded as autonomous legal decision-makers; overreach language should be systematically filtered.

In practice, human oversight is mandatory wherever legal detail, precedential accuracy, or a neutral stance is required.

6. Recommendations for LLM Deployment in Judicial Workflows

TriBench-Ko findings support several evidence-based recommendations for LLM deployment in legal settings:

Employ Retrieval-Augmented Generation (RAG) strategies with verifiable external indices to reduce hallucination and omission, particularly in Precedent Retrieval.
Use symbolic or checklist-based modules to enforce statutory correctness and mitigate misapplication.
For less robust models, use prompt-ensemble or repeated-query strategies to reduce exposure to nondeterminism and prompt sensitivity.
Incorporate standard red-team tests (demographic swaps, leading prompts) into all pre-deployment audits.
Implement UI-level controls that prevent output of adjudicative or normative language by the LLM.
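
The prompt-ensemble and repeated-query recommendation can be sketched as a majority vote over several yes/no answers. This is one illustrative aggregation rule, not a method prescribed by the benchmark. The `ask` callable is a hypothetical stand-in for whatever function sends a prompt to a model and returns "yes" or "no", and returning `None` on a tie so the item can be escalated to a human reviewer is an assumption.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer, or None if empty or tied."""
    if not answers:
        return None
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear majority, defer to human review
    return ranked[0][0]


def ensemble_query(ask, prompts):
    """Query the model once per prompt and aggregate by majority vote.

    `ask` is a hypothetical callable mapping a prompt string to "yes" or "no".
    Passing the same prompt several times gives a repeated-query ensemble;
    passing paraphrases gives a prompt ensemble.
    """
    return majority_answer([ask(p) for p in prompts])
```

An odd number of prompts avoids ties on binary answers. A unanimous vote can also be required instead of a simple majority when the strict consistency criterion matters more than coverage.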

Taken together, these findings suggest that high-stakes adoption of LLMs in judicial assistance must be accompanied by robust risk inspection, domain-specific validation, and clear demarcation of model authority boundaries.

7. TriBench-Ko in Context and Availability

TriBench-Ko constitutes a distinct advance by coupling workflow-specific legal tasks to a fine-grained risk taxonomy, executed over expertly annotated, real-world Korean judicial materials. It is made available as a public resource for benchmarking LLMs in authentic legal settings, along with dataset and code at https://github.com/holi-lab/TriBench-Ko (Lee et al., 5 May 2026).

A plausible implication is that analogous risk-centric, workflow-grounded benchmarks may be necessary across other legal systems and languages to fully inform both policy and practical deployment of LLMs in judicial and regulatory environments.


References

Lee et al. (2026). TriBench-Ko: Evaluating LLM Risks in Judicial Workflows.
