Two-Level Judgment Module for AI Evaluation
- Two-Level Judgment Module is an architectural paradigm that decomposes decision-making into local processing and global meta-evaluation stages.
- It employs techniques like transformers, RNNs, and BiGRU to encode data in chunks and aggregate them hierarchically for improved accuracy and interpretability.
- Applications span legal judgment prediction, LLM evaluation, and ensemble grading, providing robust filtering, consistency checks, and enhanced reliability.
A Two-Level Judgment Module is an architectural and methodological paradigm designed to enable multi-stage, hierarchical, or two-step reasoning, classification, or evaluation, typically in judgment prediction, evaluation benchmarking, or quality assurance settings. Across legal AI systems, LLM judge evaluation, ensemble grading alarm systems, and dialogue-agent assessment, the two-level judgment module operates by structurally decomposing judgment into at least two processing or decision layers, each targeting distinct semantic, structural, or statistical properties.
1. Core Structural Principles of the Two-Level Judgment Module
A two-level judgment module processes input data (e.g., text, model outputs, score vectors) by partitioning the inference or assessment workflow into two distinct but interconnected stages:
- First-Level Local or Chunk-wise Processing: The initial level ingests input in manageable segments ("chunks," subtasks, or rater-specific score sets). Modules at this stage typically apply an encoder (e.g., transformer, RNN, BERT variant) or perform per-judge or per-rater analysis (e.g., correlation filtering) to extract local representations or filter out ill-performing candidates.
- Second-Level Aggregation or Meta-Evaluation: The subsequent level aggregates or synthesizes the intermediate representations, predictions, or agreement patterns from all first-level units. Methods at this stage include hierarchical sequence modeling (e.g., BiGRU or transformer across chunk embeddings), groupwise agreement/statistical analysis, or logical consistency checks over observer ensembles.
This hierarchical structuring offers significant benefits in scalability (handling long or structurally complex documents), interpretability (isolation and tracing of decision sources), and reliability (via robust filtering or joint cross-checking across multiple sources) (Madambakam et al., 2023, Han et al., 10 Oct 2025, Prasad et al., 2023, Corrada-Emmanuel, 10 Sep 2025, Song et al., 2022).
2. Methodological Implementations and Formalism
Legal Document Judgment Systems
In legal AI, two-level modules address the challenge of processing voluminous judicial documents:
- SLJP Model (Madambakam et al., 2023):
- The input fact description is chunked into contiguous spans c_1, …, c_n.
- Each chunk c_i is encoded (e.g., via XLNet), producing token-level embeddings E_i.
- Level 1: Each E_i is passed to a BiGRU with max-pooling, yielding a chunk vector h_i.
- Level 2: The chunk vectors h_1, …, h_n are input to a second BiGRU + global max-pool, forming the global document vector d.
- Judgment is then predicted through a sigmoid-based multi-label classification head, ŷ = σ(Wd + b).
- The model employs binary cross-entropy loss and (optionally) class weighting.
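The chunk-then-aggregate flow above can be sketched in a few lines of plain Python. This is a minimal illustration only: the BiGRU encoders are omitted (identity mappings stand in for them), so just the two levels of max-pooling remain, and the 3-dimensional toy embeddings and chunk size are illustrative assumptions.

```python
# Two-level (chunk -> document) pooling sketch, pure Python.
# The SLJP model's BiGRU encoders are replaced by identity mappings
# so only the hierarchical max-pooling structure is shown.
from typing import List

Vector = List[float]

def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise max over a list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

def chunk(tokens: List[Vector], size: int) -> List[List[Vector]]:
    """Split a token-embedding sequence into contiguous chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def two_level_encode(tokens: List[Vector], chunk_size: int) -> Vector:
    # Level 1: pool token embeddings within each chunk -> chunk vectors.
    chunk_vecs = [max_pool(c) for c in chunk(tokens, chunk_size)]
    # Level 2: pool chunk vectors -> one global document vector.
    return max_pool(chunk_vecs)

# Toy document: 6 tokens with 3-dim embeddings, chunk size 2.
doc = [[0.1, 0.5, 0.2], [0.3, 0.1, 0.9],
       [0.7, 0.2, 0.1], [0.2, 0.8, 0.3],
       [0.4, 0.4, 0.4], [0.9, 0.1, 0.6]]
print(two_level_encode(doc, chunk_size=2))  # -> [0.9, 0.8, 0.9]
```

Replacing each `max_pool` call with a learned recurrent encoder recovers the two-BiGRU structure described above.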
- Semi-supervised Hierarchical Encoders (Prasad et al., 2023):
- Level 1: Domain-specific BERT extracts [CLS] chunk representations.
- Level 2: Hierarchical transformer encoder (1–2 layers) with optional BiLSTM aggregates chunk features.
- Unsupervised clustering (HDBSCAN) is applied to chunk features; resulting cluster labels are concatenated with high-level features for final judgment prediction.
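The side-information step can be sketched as follows, under simplifying assumptions: a fixed nearest-centroid assignment stands in for HDBSCAN, and the 2-dimensional centroids and features are invented for illustration. Only the structural idea — frozen cluster labels one-hot encoded and concatenated onto the chunk features — is taken from the source.

```python
# Sketch: frozen cluster labels as side-information for each chunk.
# A fixed nearest-centroid rule stands in for HDBSCAN; centroids and
# features are illustrative toy values.
from math import dist

centroids = [[0.0, 0.0], [1.0, 1.0]]  # assumed frozen cluster centers

def cluster_label(feat):
    """Index of the nearest centroid (stand-in for HDBSCAN labels)."""
    return min(range(len(centroids)), key=lambda k: dist(feat, centroids[k]))

def augment(feat):
    """Concatenate a one-hot cluster label onto the chunk feature."""
    one_hot = [0.0] * len(centroids)
    one_hot[cluster_label(feat)] = 1.0
    return feat + one_hot

print(augment([0.9, 0.8]))  # -> [0.9, 0.8, 0.0, 1.0]
```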
Sequential Multi-task & Decision Pipelines
- SMAJudge for Appeal Judgment (Song et al., 2022):
- First level: Lower-court module predicts law articles, charges, and sentences from base facts using BiLSTM and a chain of multi-task LSTM heads.
- Second level: Appellate-court module re-encodes new facts and appeal grounds, applying attention mechanisms over facts aligned to grounds, then predicts affirmation/overturn and appellate law articles.
- The final vector fed to appellate heads concatenates both lower-court and appellate context representations, with loss terms for all stages combined (joint loss).
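The joint objective can be sketched as a weighted sum of per-stage losses. The stage weights, the toy probabilities, and the function names below are illustrative assumptions, not values from the paper; only the idea of combining lower-court and appellate-court loss terms is from the source.

```python
# Sketch of joint optimisation over both judgment stages: per-head
# cross-entropy losses are combined into one weighted objective.
from math import log

def cross_entropy(p_true_class: float) -> float:
    """Negative log-likelihood of the gold label."""
    return -log(p_true_class)

def joint_loss(stage_probs, weights):
    """Weighted sum of the lower-court and appellate-court stage losses."""
    return sum(w * cross_entropy(p) for p, w in zip(stage_probs, weights))

# Probabilities assigned to the gold label by the law-article, charge,
# and sentence heads (lower court) and the affirm/overturn head
# (appellate court); all values are toy numbers.
probs = [0.8, 0.7, 0.6, 0.9]
loss = joint_loss(probs, weights=[1.0, 1.0, 1.0, 1.0])
print(round(loss, 4))
```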
LLM Judge and Ensemble Evaluation
- Judge’s Verdict Benchmark (Han et al., 10 Oct 2025):
- First level: Correlation filtering—calculate Pearson's r (or Spearman's ρ) between the model's and human consensus scores, requiring the correlation to exceed a preset threshold to pass.
- Second level: Upon passing, compute Cohen's κ for agreement analysis. Compose a “mixed panel” (LLM + 3 humans), compute pairwise κ values, the mean and variance over human-human pairs, then calculate the LLM's κ with each human. A z-score of the LLM's κ against the human-human distribution determines tier assignment: human-like, super-consistent, or reject, depending on where the z-score falls relative to the human band.
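The two statistical screens can be implemented directly. The functions below are generic textbook implementations of Pearson's r, Cohen's κ, and the z-score against the human-human κ distribution; the benchmark's actual thresholds and panel sizes are not reproduced here.

```python
# Level-1 correlation screen and level-2 agreement analysis, pure Python.
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def cohens_kappa(a, b):
    """Cohen's kappa between two categorical label lists."""
    labels = sorted(set(a) | set(b))
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

def z_score(judge_kappa, human_kappas):
    """Judge's kappa relative to the human-human kappa distribution."""
    return (judge_kappa - mean(human_kappas)) / pstdev(human_kappas)

print(cohens_kappa(["a", "a", "b", "b"], ["a", "b", "b", "b"]))  # -> 0.5
```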
- No-Knowledge Alarm for Misaligned LLMs-as-Judges (Corrada-Emmanuel, 10 Sep 2025):
- First level: An ensemble of R graders, each of which outputs marginal judgment counts over the Q graded tasks.
- Second level: Construct a linear program over all unknowns (the true-label counts and each grader's response-by-true-label tables), under constraints including grader-observation consistency, true-label consistency, and user-specified grading-ability thresholds.
- Infeasible LP solution triggers a no-knowledge alarm—at least one grader is necessarily misaligned. This approach guarantees zero false positives.
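A toy version of the feasibility test conveys the logic. Here a brute-force search stands in for the paper's linear program, the setting is reduced to binary pass/fail grading, and `Q`, the marginals, and `theta` are illustrative assumptions. The zero-false-positive property is visible in the construction: the alarm fires only when no consistent world exists at all.

```python
# Toy no-knowledge alarm: each of R graders marks m_i of Q binary items
# "pass". Is ANY true pass count T, with per-grader confusion counts,
# consistent with every grader having accuracy >= theta? If not, at
# least one grader is necessarily misaligned. Brute force stands in
# for the paper's linear program.

def alarm(marginals, Q, theta):
    """True (alarm) iff no consistent assignment of unknowns exists."""
    for T in range(Q + 1):                     # unknown true pass count
        consistent = True
        for m in marginals:
            # t = grader's true positives; bounded by T, m, and the
            # requirement that false positives fit among Q - T items.
            feasible = any(
                # accuracy = (true pos + true neg) / Q
                (t + (Q - T) - (m - t)) / Q >= theta
                for t in range(max(0, m - (Q - T)), min(T, m) + 1)
            )
            if not feasible:
                consistent = False
                break
        if consistent:
            return False                       # a consistent world exists
    return True                                # infeasible -> alarm

# Two graders on 10 items: one passes 9, the other passes 1. They
# cannot both be >= 80% accurate under any true labelling.
print(alarm([9, 1], Q=10, theta=0.8))   # -> True (alarm)
print(alarm([9, 8], Q=10, theta=0.8))   # -> False (no alarm)
```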
Dialogue System Judgment Under Framing and Social Pressure
- Framing and Robustness in LLM Judging (Rabbani et al., 14 Nov 2025):
- Level 1: For each example, judge using both direct factual and conversational (dialogue) framing.
- Level 2: If the model initially answers correctly, a “rebuttal” is issued; model is prompted for a second, pressured decision.
- The module collects initial and post-rebuttal answers in both settings, and quantifies the frame-induced performance shift and the robustness under pressure, to diagnose susceptibility to social framing and over-accommodation.
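The two diagnostics can be sketched as simple accuracy comparisons. The function names `delta_frame` and `flip_rate`, and the toy correctness records, are assumptions for illustration; the source defines the metrics only qualitatively here.

```python
# Sketch of the framing study's two metrics: the accuracy shift under
# conversational re-framing, and the share of initially correct
# answers that flip after a pressuring rebuttal.

def delta_frame(factual_correct, dialogue_correct):
    """Accuracy shift when the same items are re-framed as dialogue."""
    acc = lambda xs: sum(xs) / len(xs)
    return acc(dialogue_correct) - acc(factual_correct)

def flip_rate(initial_correct, post_rebuttal_correct):
    """Share of initially correct answers changed after the rebuttal."""
    flips = sum(1 for i, p in zip(initial_correct, post_rebuttal_correct)
                if i and not p)
    initially_right = sum(initial_correct)
    return flips / initially_right if initially_right else 0.0

factual  = [1, 1, 1, 0, 1]   # 80% accuracy, direct factual framing
dialogue = [1, 0, 1, 0, 1]   # 60% accuracy, conversational framing
after    = [1, 0, 0, 0, 1]   # answers after a pressuring rebuttal
print(delta_frame(factual, dialogue))   # approx. -0.2
print(flip_rate(factual, after))        # 2 of 4 flipped -> 0.5
```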
3. Theoretical and Practical Advantages
- Scalable Document Processing: Hierarchical chunking overcomes input-length constraints inherent in deep pre-trained transformers, enabling robust semantic extraction from complex or lengthy legal texts (Madambakam et al., 2023, Prasad et al., 2023).
- Divide-and-Conquer for Information Bottleneck: Per-chunk semantic condensation followed by inter-chunk aggregation prevents information dilution and mitigates vanishing gradient issues, crucial in multi-thousand-token scenarios.
- Multi-Task, Dependency-Aware Modeling: Sequential execution (modeling lower-court → appellate-court or base → meta-classification) encodes interdependencies between tasks, harmonizing joint optimization and providing richer, context-sensitive outputs (Song et al., 2022).
- Robustness and Reliability in Model Assessment: Two-level evaluation strategies (correlation + agreement, or LP-based logical consistency) provide rigorous, layered filtering—successive layers rule out weak or misaligned models, minimizing Type I errors (Han et al., 10 Oct 2025, Corrada-Emmanuel, 10 Sep 2025).
- Interpretability and Traceability: Intermediate representations (attention weights, cluster assignments, agreement statistics) allow detailed post-hoc analysis, model introspection, and rationale extraction for both neural predictions and LLM judge decisions.
4. Representative Architectures, Equations, and Modules
The following table summarizes common two-level module forms as instantiated in prominent research:
| Domain | First Level | Second Level |
|---|---|---|
| Legal doc. prediction (Madambakam et al., 2023) | Chunk transformer + BiGRU max-pool | BiGRU over chunk embeddings + global max-pool |
| Legal doc. prediction (Prasad et al., 2023) | BERT[CLS] on chunk | Transformer encoder over chunk embeddings |
| Appeal judgment (Song et al., 2022) | Lower-court BiLSTM + multi-head LSTM | Appellate-court BiLSTM w/ attention |
| LLM judge eval (Han et al., 10 Oct 2025) | Pearson correlation screening | Cohen's κ, groupwise z-score |
| Ensemble misalignment (Corrada-Emmanuel, 10 Sep 2025) | Per-grader marginals | LP feasibility check / alarm |
| Dialog framing (Rabbani et al., 14 Nov 2025) | Factual + conversational frame scoring | Rebuttal/pressure flip robustness |
Each instantiation is grounded in a structural first level (basic semantic, statistical, or score extraction) and a meta-analytic second level (aggregation, alignment, agreement, or logical consistency).
5. Evaluation Criteria and Empirical Performance
Empirical studies demonstrate clear advantages of two-level modules:
- On the ILDC legal dataset, domain-specific BERT → hierarchical encoder models yield test accuracy of 83.7% compared to 73.8% for single-level LEGAL-BERT and 60.5% for BERT-base. Addition of unsupervised HDBSCAN clustering provides up to 0.4–0.5 point additional gain on validation sets (Prasad et al., 2023).
- The SLJP model achieves superior F1 scores and reduced performance degradation over longer training epochs relative to base transformer models, confirming the resilience of divide-and-conquer two-level approaches (Madambakam et al., 2023).
- In LLM judge evaluation, only models passing the level-1 correlation threshold survived to more granular agreement analysis; of 54 LLMs, 27 achieved Tier 1 (human-like or super-consistent agreement) (Han et al., 10 Oct 2025).
- The LP-based alarm framework detects ensemble misalignment with zero false positives regardless of gold-label access, scaling to arbitrary grader and label cardinality (Corrada-Emmanuel, 10 Sep 2025).
- Dialogue system results show an average change of 9.24% in model performance under conversational re-framing, highlighting significant susceptibility to minimal dialogue context perturbations and justifying the two-level design for social robustness analysis (Rabbani et al., 14 Nov 2025).
6. Variants, Generalizations, and Application Guidelines
Two-level judgment modules admit several design and parameterization variants:
- Attention Mechanism: Some architectures supplement or replace second-level global pooling with learned attention, producing weighted summaries for interpretable rationales (Madambakam et al., 2023, Song et al., 2022).
- Agreement Metrics: Cohen's κ is standard for categorical agreement, with Krippendorff's α optional for multi-coder scenarios; thresholds (e.g., the z-score bandwidths in (Han et al., 10 Oct 2025)) can be tuned for application criticality.
- Clustering Side-Information: Unsupervised chunk clustering serves as a frozen feature source, concatenated at the inter-chunk aggregation stage (Prasad et al., 2023).
- Sequential or Parallel Pipelines: Architectures may be strictly serial (e.g., lower-court → appellate-court in (Song et al., 2022)), or parallel (multiple LLMs/raters in (Corrada-Emmanuel, 10 Sep 2025)), unified at the meta-level by statistical or logical aggregation.
- Framing and Pressure Evaluation: For social tasks, the two-level module generalizes to compare performance under direct and conversational task framings, with optional adversarial prompt perturbations (Rabbani et al., 14 Nov 2025).
A plausible implication is that the two-level judgment module paradigm will become a default design principle in tasks where both local, context-sensitive processing and meta-level consistency or agreement assurance are critical, especially in legal AI, social LLM evaluation, and trustworthy ensemble grading architectures.
7. Impact and Significance
The two-level judgment module paradigm has yielded demonstrable improvements in document understanding, model evaluation, and ensemble reliability across benchmarked tasks. Its structural separation of local extraction and global/meta synthesis addresses the inherent bottlenecks of single-stage models—namely, input length limitations, interpretability deficits, and vulnerability to both error propagation and alignment uncertainty. Furthermore, the methodology serves as a foundation for explainable AI in sensitive domains (e.g., legal, policy, scientific peer review), where both process transparency and accuracy are non-negotiable (Madambakam et al., 2023, Prasad et al., 2023, Han et al., 10 Oct 2025, Corrada-Emmanuel, 10 Sep 2025, Rabbani et al., 14 Nov 2025, Song et al., 2022).