LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection

Published 7 Apr 2026 in cs.AI | (2604.05371v1)

Abstract: The deployment of lightweight segmentation models on drones for autonomous power line inspection presents a critical challenge: maintaining reliable performance under real-world conditions that differ from training data. Although compact architectures such as U-Net enable real-time onboard inference, their segmentation outputs can degrade unpredictably in adverse environments, raising safety concerns. In this work, we study the feasibility of using an LLM as a semantic judge to assess the reliability of power line segmentation results produced by drone-mounted models. Rather than introducing a new inspection system, we formalize a watchdog scenario in which an offboard LLM evaluates segmentation overlays and examine whether such a judge can be trusted to behave consistently and perceptually coherently. To this end, we design two evaluation protocols that analyze the judge's repeatability and sensitivity. First, we assess repeatability by repeatedly querying the LLM with identical inputs and fixed prompts, measuring the stability of its quality scores and confidence estimates. Second, we evaluate perceptual sensitivity by introducing controlled visual corruptions (fog, rain, snow, shadow, and sunflare) and analyzing how the judge's outputs respond to progressive degradation in segmentation quality. Our results show that the LLM produces highly consistent categorical judgments under identical conditions while exhibiting appropriate declines in confidence as visual reliability deteriorates. Moreover, the judge remains responsive to perceptual cues such as missing or misidentified power lines, even under challenging conditions. These findings suggest that, when carefully constrained, an LLM can serve as a reliable semantic judge for monitoring segmentation quality in safety-critical aerial inspection tasks.

Authors (4)

Summary

The paper introduces an LLM-as-Judge framework that semantically evaluates UAV powerline segmentation by assessing geometric consistency and plausibility.
It employs dual-axis repeatability and sensitivity metrics to measure performance changes under various weather-induced image perturbations, ensuring reliable real-time monitoring.
The approach decouples lightweight onboard segmentation from compute-heavy offboard auditing, enhancing safety and scalability in autonomous visual inspections.

Problem Context and Motivations

Automated UAV-based powerline inspection leverages lightweight segmentation models (e.g., U-Net) for extracting thin structures remotely. However, such models often encounter significant robustness challenges after deployment because of domain and condition shifts not reflected in the training data, especially inclement weather and complex illumination. Conventional segmentation metrics (IoU, pixel accuracy) require ground-truth annotations that are rarely available in operational settings. This absence of principled, automated, real-time quality control poses safety and reliability risks in autonomous visual inspection.

To address this gap, the authors conceptualize an LLM-as-Judge paradigm in which a multimodal large language model (MLLM) acts as an offboard watchdog, semantically verifying segmentation overlays without access to ground truth. Unlike conventional performance monitoring, this approach emphasizes semantic plausibility, geometric consistency, and structural fidelity as proxies for reliability. The framework is modular: onboard segmentation runs on the UAV with minimal compute, while semantic judging and auditing occur at the ground station, isolating heavy computation and supporting both live and offline decision support.

Prior research established LLMs as effective judges for text generation, open-ended QA, and other language tasks, but also documented systematic biases, output stochasticity, and limited agreement with human experts, especially in complex or domain-specific settings. In vision tasks, applications of MLLMs as evaluators remain sparse, primarily targeting generic safety classification and benchmarking. Existing studies reveal that while models like GPT-4V can align with human preferences for pairwise judgments, their absolute scoring can be inconsistent or biased, particularly when domain errors are subtle or geometric (e.g., in segmentation versus more holistic scene understanding). No prior work systematically studies the LLM-as-Judge paradigm for thin-object segmentation in safety-critical UAV inspections, leaving open questions of stability, sensitivity, and suitability under real-world perturbation.

Methodology

The core methodological contribution is a dual-axis evaluation of the LLM-as-Judge along:

Repeatability: Quantifies numeric and semantic output stability with respect to fixed image–overlay and prompt inputs, assessed via multiple runs with identical inference parameters. Key metrics are:
- Score agreement ($A_s$): Proportion of images with identical quality scores across runs.
- Confidence agreement ($A_c$): Proportion with identical or near-identical confidence values ($\epsilon = 10^{-6}$).
- Combined stability ($A_{s,c}$): Joint metric requiring both outputs to be stable.
- ICC(1,1): One-way random-effects intraclass correlation for score reproducibility.
- Word-overlap: Semantic similarity in textual rationales.
Sensitivity: Measures the judge's perceptual responsiveness to controlled visual deterioration (rain, fog, snow, shadow, sunflare) at varying levels applied to RGB images prior to segmentation. Perturbed image–overlay pairs are rescored, and monotonicity of score/confidence drop is evaluated:
- $\Delta s_{t,k}$, $\Delta c_{t,k}$: Mean score/confidence loss under corruption type $t$ at severity $k$.
- Spearman rank correlation: Assesses monotonicity with severity.
- Paired statistical tests and standardized effect sizes ($d_z$): Significance of per-image perceptual changes relative to the clean baseline.
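
The three agreement metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes an image counts as score-stable when every run returns the same score, and confidence-stable when the spread (max minus min) across runs is at most $\epsilon$. The function name `agreement_metrics` and the nested-list layout are hypothetical.

```python
from typing import Sequence, Tuple


def agreement_metrics(
    scores: Sequence[Sequence[int]],
    confidences: Sequence[Sequence[float]],
    eps: float = 1e-6,
) -> Tuple[float, float, float]:
    """Return (A_s, A_c, A_sc) for repeated judge runs.

    scores[i][r] and confidences[i][r] hold the judge's outputs for
    image i on run r (assumed layout, not taken from the paper).
    """
    n = len(scores)
    if n == 0 or n != len(confidences):
        raise ValueError("need matching, non-empty score and confidence lists")

    # An image is score-stable if all runs agree exactly.
    score_stable = [len(set(runs)) == 1 for runs in scores]
    # An image is confidence-stable if the spread across runs is within eps.
    conf_stable = [max(runs) - min(runs) <= eps for runs in confidences]

    a_s = sum(score_stable) / n
    a_c = sum(conf_stable) / n
    # Combined stability requires both outputs to be stable on the same image.
    a_sc = sum(s and c for s, c in zip(score_stable, conf_stable)) / n
    return a_s, a_c, a_sc


# Toy data: 4 images, 3 runs each (values invented for illustration).
toy_scores = [[3, 3, 3], [2, 3, 2], [1, 1, 1], [4, 4, 4]]
toy_confs = [[0.9, 0.9, 0.9], [0.6, 0.7, 0.6], [0.95, 0.90, 0.95], [0.8, 0.8, 0.8]]
print(agreement_metrics(toy_scores, toy_confs))  # (0.75, 0.5, 0.5)
```

Because $A_{s,c}$ requires both conditions on the same image, it can never exceed the smaller of $A_s$ and $A_c$.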

The evaluation framework explicitly decouples perception (segmentation) from judgment, treating the LLM as an independent semantic auditor. All experiments employ GPT-4o as the offboard judge, and U-Net for segmentation, benchmarked using TTPLA.

Experimental Results

Repeatability

Across clean and corrupted TTPLA overlays, the LLM-as-Judge demonstrates high repeatability in discrete categorical scores. Perfect score agreement ($A_s$) ranges from 78% to 91% depending on corruption type, and ICC values are consistently high (0.858–0.917). Notably, extreme corruptions (e.g., fog, snow) can yield higher agreement than clean images, because segmentation often fails catastrophically (empty masks), which results in consistently low scores. For structurally ambiguous overlays, which are more common in cleaner images, judgments are less deterministic because the judge must reason about multiple line instances.
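
To see how ICC can stay high while exact agreement is imperfect, the following sketch computes ICC(1,1) from the standard one-way random-effects formula, $(MS_B - MS_W) / (MS_B + (k-1)\,MS_W)$, where $MS_B$ and $MS_W$ are the between-image and within-image mean squares and $k$ is the number of runs. The function name `icc_1_1` and the toy ratings are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def icc_1_1(ratings: Sequence[Sequence[float]]) -> float:
    """One-way random-effects ICC(1,1).

    ratings[i][r] is the score for image i on run r; every image
    must have the same number of runs k >= 2.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 images, each with the same k >= 2 runs")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-image mean square (n - 1 degrees of freedom).
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-image mean square (n * (k - 1) degrees of freedom).
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denom


# Toy data: one of four images disagrees on one run, so exact
# agreement is 75%, yet the ICC remains close to 1.
toy_ratings = [[1, 1, 1], [3, 3, 4], [5, 5, 5], [2, 2, 2]]
print(round(icc_1_1(toy_ratings), 3))  # 0.973
```

The toy case mirrors the pattern reported above: a single off-by-one disagreement lowers exact agreement noticeably but barely affects ICC, because the within-image variance remains small relative to the variance between images.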

Confidence stability ($A_c$) is more sensitive, especially under fog and snow, as the LLM appropriately modulates its uncertainty when visual cues are lost. Combined numeric stability ($A_{s,c}$) drops accordingly in these conditions. Importantly, the LLM does not overconfidently hallucinate quality in the absence of evidence.

Textual rationales show expected diversity, as natural language is underconstrained, but word-overlap remains substantial, supporting underlying semantic consistency.
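
The summary does not specify how word-overlap is computed. One common choice, assumed here purely for illustration, is the Jaccard index over lowercase word sets, averaged across all pairs of runs for the same image. The function names `jaccard` and `mean_pairwise_overlap` are hypothetical.

```python
import re
from itertools import combinations
from typing import Sequence


def _tokens(text: str) -> set:
    """Lowercase alphanumeric word set of a rationale."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two rationales."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0  # two empty rationales are treated as identical
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_overlap(rationales: Sequence[str]) -> float:
    """Average Jaccard overlap over all pairs of runs for one image."""
    pairs = list(combinations(rationales, 2))
    if not pairs:
        raise ValueError("need at least two rationales")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


print(jaccard("Lines are missing", "lines are broken"))  # 0.5
```

A set-based measure like this ignores word order and synonyms, so it is a lower bound on semantic agreement: two rationales can say the same thing in different words and still score low.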

Sensitivity

Sensitivity analysis reveals that the LLM-as-Judge exhibits strong, monotonic, and statistically significant degradations in both score and confidence in response to corruptions aligned with field conditions:

Fog: Induces the largest and most stable degradation, saturating as severity increases, which is plausible because the geometric cues for powerlines are essentially erased.
Rain and Snow: Show consistent, monotonic score drops (0.46 → 0.81 for rain, 0.70 → 0.96 for snow) and substantial standardized effect sizes ($d_z$), confirming that the judge detects incremental reliability loss.
Shadow and Sunflare: Produce lower-magnitude drops and moderate effect sizes, reflecting their more localized or ambiguous impact on line continuity.

Across all corruptions, the judge's score signal is more discriminative than its confidence signal, except for fog, where both collapse. The judge exhibits no pathological insensitivity: paired tests reject the null hypothesis of no change, and the effect sizes indicate that the shifts are not random noise. Qualitative inspection (as visualized in the figures) supports robust perceptual grounding of judgments, with missing or malformed overlays reliably flagged.

Implications and Future Directions

The findings substantiate the viability of MLLMs as semantic auditors in vision pipelines where ground-truth is unavailable and human-in-the-loop monitoring is impractical. When rigorously evaluated, such models provide both discrete and confidence-calibrated judgments responsive to real-world perturbations, supporting explainable, threshold-based interventions in safety-critical deployment. This framework enables separation of concerns: fast, onboard segmentation and compute-heavy, offboard verification, facilitating broad scalability for fleet and near-real-time monitoring.
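
A threshold-based intervention of the kind mentioned above might look like the following sketch. The score scale, the threshold values, and the action names are all hypothetical; the summary does not specify them, and any real deployment would need to calibrate them against operational data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judge output for a segmentation overlay (hypothetical schema)."""
    score: int          # assumed ordinal quality score, higher is better
    confidence: float   # assumed to lie in [0, 1]


def decide(verdict: Verdict, min_score: int = 3, min_confidence: float = 0.6) -> str:
    """Map a verdict to an action using fixed, illustrative thresholds."""
    if verdict.score < min_score:
        # The judge considers the overlay unreliable: flag it to the operator.
        return "escalate"
    if verdict.confidence < min_confidence:
        # Acceptable score, but the judge is unsure: queue for human review.
        return "review"
    return "accept"


print(decide(Verdict(score=4, confidence=0.9)))  # accept
print(decide(Verdict(score=4, confidence=0.4)))  # review
print(decide(Verdict(score=1, confidence=0.9)))  # escalate
```

Checking the score before the confidence reflects the finding that the score signal is the more discriminative of the two, so confidence acts only as a secondary filter in this sketch.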

Theoretically, this reframing from accuracy-centric to semantic safety monitoring motivates further exploration of MLLM-based evaluators in other structured-output tasks, including but not limited to industrial inspection, medical imaging QA, and environmental monitoring.

Future work should address:

Generalization to additional segmentation error types and more complex multi-object scenes.
Systematic handling of judge bias, drift, and adversarial robustness.
Closed-loop adaptive autonomy, enabling on-the-fly reconfiguration or escalation protocols based on semantic judge feedback.
Cross-model/judge ensemble strategies to further improve reliability.

Conclusion

This study demonstrates that when carefully evaluated, large multimodal LLMs can provide robust, interpretable, and sensitive semantic monitoring for thin-structure segmentation tasks in real-world, safety-critical UAV applications. The LLM-as-Judge paradigm, underpinned by formal repeatability and sensitivity metrics, advances the state of the art in autonomous vision system auditing and provides a practical template for future multimodal reliability frameworks.
