
PILOT-Bench: Legal Reasoning in Patents

Updated 15 January 2026
  • PILOT-Bench is a benchmark designed to assess structured legal reasoning in the patent domain using an IRAC-aligned framework.
  • It aligns PTAB ex parte appeal decisions with USPTO patent texts to facilitate tasks like issue identification, board authority mapping, and subdecision prediction.
  • Empirical evaluations reveal significant performance gaps between commercial and open-source LLMs, highlighting challenges in class imbalance and schema adherence.

PILOT-Bench is a benchmark specifically designed to measure structured legal reasoning in the patent domain, with a focus on ex parte appeals at the United States Patent Trial and Appeal Board (PTAB). By aligning PTAB decision data with USPTO patent text at the case level, PILOT-Bench introduces IRAC-aligned (Issue–Rule–Application–Conclusion) classification tasks and provides a systematic apparatus for evaluating LLMs on nuanced legal analyses unique to the patent context (Jang et al., 8 Jan 2026).

1. Motivation and Distinctiveness

Existing patent corpora—including WIPO-alpha, CLEF-IP, USPTO-2M, BIGPATENT, HUPD, IMPACT, and Patent-CR—primarily support technical-text tasks such as classification, summarization, and retrieval, lacking formal structure for statutory or decisional elements. Legal benchmarks, conversely, cover IRAC reasoning but almost entirely outside the patent appeals field (e.g., LegalBench, LexGLUE). Previous studies on PTAB data typically emphasize outcome prediction in procedures like Post-Grant Review or Inter Partes Review, without modeling structured intermediate reasoning.

PILOT-Bench addresses these lacunae by (a) case-level alignment of PTAB ex parte appeals and USPTO patent documentation and (b) formal instantiation of three IRAC-aligned classification tasks to explicitly measure issue identification, rule mapping, and conclusion prediction within PTAB appeals (Jang et al., 8 Jan 2026).

2. Dataset Construction and Preprocessing

The benchmark leverages three principal data sources:

  • PTAB Metadata: 170,000 records via the USPTO PTAB API (v2).
  • PTAB Decisions: 25,000 documents, OCR-processed.
  • USPTO Full Text Patents: XML bulk data spanning 2006–2024.

Initial filtering by OCR quality (cover-page presence) and decision date (≥ 2006) yields 18,738 ex parte appeal records. Section segmentation (STATEMENT OF THE CASE and ANALYSIS) and metadata mapping then produce 18,049 cases.

To mitigate label leakage, an LLM-assisted opinion split (Gemini-2.5-pro) decomposes input into four roles: appellant_arguments, examiner_findings, ptab_opinion, and facts. Only appellant_arguments and examiner_findings are retained as model inputs. On average, post-split sections are 297 and 307 words, respectively. Patent-case alignment is achieved by matching application numbers and selecting the publication temporally closest to the decision date, yielding 15,482 case-level aligned instances (the "Opinion Split Data") (Jang et al., 8 Jan 2026).
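
The patent-case alignment step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline; the record field names (`application_number`, `pub_date`, `decision_date`) are assumptions.

```python
from datetime import date

# Alignment sketch under assumed record shapes: match the case's
# application number, then pick the publication whose date is
# temporally closest to the PTAB decision date.
def align_case_to_patent(case, publications):
    candidates = [p for p in publications
                  if p["application_number"] == case["application_number"]]
    if not candidates:
        return None  # unaligned case: dropped from the benchmark
    return min(candidates,
               key=lambda p: abs((p["pub_date"] - case["decision_date"]).days))

case = {"application_number": "12/345678", "decision_date": date(2015, 6, 1)}
pubs = [
    {"application_number": "12/345678", "pub_date": date(2010, 1, 1)},
    {"application_number": "12/345678", "pub_date": date(2014, 9, 1)},
]
print(align_case_to_patent(case, pubs)["pub_date"])  # 2014-09-01
```

Cases with no matching publication simply fall out of the aligned set, which is consistent with the drop from 18,049 cases to 15,482 aligned instances.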

3. IRAC-Aligned Task Formalizations

Let X denote the input text (appellant_arguments ∥ examiner_findings). PILOT-Bench formulates three IRAC-aligned tasks:

  • Issue Type: Multi-label classification f_I : X → 2^{L_I}, with label set L_I = {101, 102, 103, 112, Others} representing statutory bases (e.g., novelty, non-obviousness).
  • Board Authorities: Multi-label classification f_R : X → 2^{L_R}, where L_R consists of the 8 most frequent 37 C.F.R. provisions plus "Others" (e.g., "§1.131", "§41.50(a)").
  • Subdecision: Single-label multiclass classification f_C : X → L_C, with |L_C^fine| = 23 subdecision types (e.g., Affirmed, Reversed in Part with New Ground) and |L_C^coarse| = 6 for an aggregated view.

The label distributions are highly imbalanced: for example, "102" constitutes approximately 40% of Issue Type, "41.50(a)" 55% of Board Authorities, and "Affirmed" 45% of fine-grained Subdecision labels (Jang et al., 8 Jan 2026).
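
The three task formalizations can be made concrete as schema checks over the label sets above. This is an illustrative sketch only; the Board Authorities set is abridged to the two provisions named in the text, and the fine-grained subdecision set to two of its 23 types.

```python
# Label sets per the task definitions (Board Authorities and the
# fine-grained subdecision set are abridged for illustration).
ISSUE_LABELS = {"101", "102", "103", "112", "Others"}
BOARD_LABELS = {"1.131", "41.50(a)", "Others"}            # 2 of 8 + Others
SUBDECISION_FINE = {"Affirmed", "Reversed in Part with New Ground"}  # 2 of 23

def is_valid_issue_pred(pred):
    # Multi-label task: any non-empty subset of the label set is admissible.
    return bool(pred) and pred <= ISSUE_LABELS

def is_valid_subdecision(pred):
    # Single-label multiclass task: exactly one label from the set.
    return pred in SUBDECISION_FINE

print(is_valid_issue_pred({"102", "103"}))  # True
print(is_valid_issue_pred({"novelty"}))     # False: out-of-schema label
```

Checks of this kind matter downstream: the paper reports frequent schema violations by open-source models.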

4. Evaluation Metrics

PILOT-Bench employs metrics standard for multi-label and multiclass tasks. For multi-label (Issue Type and Board Authorities):

  • Exact Match: binary indicator of whether the predicted label set exactly matches the gold set.
  • Micro-F1 / Macro-F1:

    • Micro-F1 aggregates TP, FP, and FN counts across labels:

      Precision_micro = Σ_c TP_c / Σ_c (TP_c + FP_c),   Recall_micro = Σ_c TP_c / Σ_c (TP_c + FN_c)

      F1_micro = (2 · Precision_micro · Recall_micro) / (Precision_micro + Recall_micro)

    • Macro-F1 computes F1 per class and averages: F1_macro = (1 / |L|) Σ_{c∈L} F1_c.

For Subdecision (single-label multiclass): Accuracy, Macro-F1, and Weighted-F1 are reported across 23 or 6 classes (Jang et al., 8 Jan 2026).
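
The multi-label metrics above can be implemented in a few lines; the following sketch pools TP/FP/FN across classes for Micro-F1 and averages per-class F1 for Macro-F1 (using the equivalent form F1 = 2·TP / (2·TP + FP + FN)).

```python
from collections import Counter

def multilabel_f1(golds, preds, labels):
    """Micro- and Macro-F1 over multi-label predictions (sets of labels).
    Micro pools TP/FP/FN across classes; macro averages per-class F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(golds, preds):
        for c in labels:
            tp[c] += (c in g) and (c in p)
            fp[c] += (c in p) and (c not in g)
            fn[c] += (c in g) and (c not in p)

    def f1(t, f_pos, f_neg):
        return 2 * t / (2 * t + f_pos + f_neg) if t else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

labels = ["101", "102", "103", "112", "Others"]
golds = [{"102", "103"}, {"101"}]
preds = [{"102"}, {"101", "112"}]
micro, macro = multilabel_f1(golds, preds, labels)
print(round(micro, 3), round(macro, 3))  # 0.667 0.4
```

The toy example also shows why imbalanced label distributions depress Macro-F1: classes that are never correctly predicted contribute zero to the macro average while barely moving the pooled micro counts.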

5. Benchmarking Protocol and Models

Evaluation spans both commercial (closed-source, zero-shot) and open-source LLMs: Claude-Sonnet-4, Gemini-2.5-pro, GPT-4o, GPT-o3, Solar-pro2, LLaMA-3.1-8B, Mistral-7B, Qwen-3-8B, and T5-2B.

All evaluations used NVIDIA RTX 4090 and H100 GPUs, with context windows between 2k and 4.9k tokens. Model input variants comprise Split (Base: separated appellant and examiner arguments), Merge (concatenated arguments), and Split+Claim (Split plus appended claim text). Prompt templates strictly enforce JSON output and label schema (Jang et al., 8 Jan 2026).
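
Schema-enforced evaluation of this kind typically requires parsing and validating the model's JSON reply. The benchmark's actual prompt templates are not reproduced here; the following is an illustrative sketch for the Issue Type task, where any invalid JSON or out-of-schema label counts as a schema violation.

```python
import json

ALLOWED = {"101", "102", "103", "112", "Others"}  # Issue Type label schema

def parse_reply(reply):
    """Return the predicted label set, or None on a schema violation
    (non-JSON output, missing key, or out-of-schema labels)."""
    try:
        obj = json.loads(reply)
        labels = set(obj["issue_types"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return labels if labels and labels <= ALLOWED else None

print(sorted(parse_reply('{"issue_types": ["103", "112"]}')))  # ['103', '112']
print(parse_reply("The rejection is under section 103."))      # None
```

Replies mapped to None here correspond to the schema violations (out-of-schema labels, natural-language responses) that Section 6 reports for the open-source models.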

6. Empirical Results

Performance on Issue Type (Split Base):

| Model | Exact Match | Macro-F1 | Micro-F1 |
|---|---|---|---|
| Claude-Sonnet-4 | 0.5871 | 0.5457 | 0.7905 |
| Gemini-2.5-pro | 0.5874 | 0.6630 | 0.7923 |
| GPT-4o | 0.5751 | 0.6519 | 0.7860 |
| GPT-o3 | 0.5955 | 0.6639 | 0.7968 |
| Solar-pro2 | 0.5583 | 0.5240 | 0.7707 |
| LLaMA-3.1-8B | 0.1826 | 0.1051 | 0.5793 |
| Mistral-7B | 0.3405 | 0.2111 | 0.6080 |
| Qwen-3-8B | 0.5561 | 0.5251 | 0.7741 |
| T5-2B | 0.0772 | 0.3845 | 0.4469 |

Commercial models uniformly attain Micro-F1 ≳ 0.79 on Issue Type, with open-source models trailing (Qwen-3-8B: Micro-F1 0.77, Macro-F1 0.53). For Board Authorities, Gemini-2.5-pro achieves a Micro-F1 of 0.69; the best open-source model (Qwen-3-8B) yields only 0.20, with significant label-schema violations observed. Fine-grained Subdecision accuracy for commercial models falls in the 0.56–0.59 range (Macro-F1 0.13–0.16); open models do not exceed 0.48 accuracy or 0.10 Macro-F1 (Jang et al., 8 Jan 2026).

Notable error patterns include:

  • Persistent class imbalance, reflected in low Macro-F1, especially for infrequent statutory and Board labels.
  • Schema violations by open models (e.g., out-of-schema labels, natural language responses).
  • Input augmentation (claims text) generally reduces performance (–2 to –4 points Micro-F1 on multi-label, –2 to –3 points Accuracy on Subdecision).

7. Implications, Limitations, and Future Directions

PILOT-Bench demonstrates that LLMs are effective in predicting frequent PTAB issue types and statutory authorities, but exhibit reduced performance on rare label classes and structured output constraints. Role-based input separation (Split) does not consistently improve performance relative to merged input, and addition of full claim text can degrade results due to information dilution.

Future research avenues include extension to IRAC Application tasks requiring generation and multi-step reasoning, instruction tuning or schema-constrained decoding for hallucination reduction, data balancing to improve label coverage, and extension to additional PTAB subprocedures (IPR and PGR) (Jang et al., 8 Jan 2026). The persistent gap between closed and open-source model performance (Micro-F1 ∼0.80 vs. ∼0.56 on Issue Type) quantifies the current frontier in patent-domain legal reasoning and dataset-driven LLM alignment.
