Verdi: Multi-Domain Computational Frameworks

Updated 3 July 2026

Verdi is a collection of computational frameworks that integrate formal verification, neural translation, vision-language reasoning, and LLM confidence estimation.
Each instantiation employs rigorous algorithmic design and evaluation techniques to deliver scalable and robust solutions across distributed systems, NLP, and autonomous driving.
These methods yield practical improvements in proof automation, quality prediction for bilingual corpora, driving safety, and trustworthy LLM evaluation.

Verdi refers to several advanced frameworks and systems across multiple domains in computational research:

a Coq-based platform for the mechanized verification of distributed systems and protocols under different network semantics;
a quality estimation algorithm for bilingual corpora using neural machine translation and cross-lingual features;
a framework for distilling vision-language reasoning into modular autonomous driving pipelines;
a method for post hoc confidence estimation in LLM verification judges.

Each instantiation of Verdi employs rigorous algorithmic design, formalization, and/or learning-based techniques and serves as a basis for research in system reliability, NLP resource curation, autonomy, or trustworthy evaluation.

1. Mechanized Verification of Distributed Protocols: Verdi in Coq

Verdi is a mechanized framework in Coq for expressing, prototyping, and proving the correctness of distributed systems, supporting a hierarchy of network semantics (e.g., reliable, lossy, or reordered message delivery). Its most prominent artifact is a mechanically verified implementation and proof of Raft state-machine replication, including linearizability and safety properties (Bayazıt et al., 26 Aug 2025).

Verdi’s workflow proceeds as follows:

Distributed protocols are initially verified under an idealized, crash-free semantics.
Using semantics transformers, the guarantees proven in this simplified setting are systematically transferred to more realistic, faulty environments (e.g., networks with message loss or duplication).
The codebase spans ~150,000 lines of Coq, with >300 proof scripts.
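
The transfer step can be illustrated outside Coq. The following is a toy Python analogue, not code from Verdi: it assumes a hypothetical handler written for a network that delivers each message exactly once, and wraps it with sequence numbers so that it computes the same result when deliveries are duplicated. The names `counter_handler`, `with_sequence_numbers`, and `run` are illustrative assumptions, and nothing here is mechanically proven.

```python
from typing import Callable, Dict, Set, Tuple

# Hypothetical application handler, written assuming exactly-once delivery.
def counter_handler(state: int, payload: int) -> int:
    """Add the payload to the running total."""
    return state + payload

Packet = Tuple[str, int, int]                # (sender, sequence number, payload)
WrappedState = Tuple[int, Dict[str, Set[int]]]  # (inner state, seen seq numbers)

def with_sequence_numbers(handler: Callable[[int, int], int]):
    """Wrap `handler` so that repeated (sender, seq) deliveries are ignored."""
    def wrapped(state: WrappedState, packet: Packet) -> WrappedState:
        inner, seen = state
        sender, seq, payload = packet
        if seq in seen.get(sender, set()):
            return state  # duplicate delivery: leave the state unchanged
        new_seen = {**seen, sender: seen.get(sender, set()) | {seq}}
        return handler(inner, payload), new_seen
    return wrapped

def run(handler, state, deliveries):
    """Feed each delivery to the handler in order and return the final state."""
    for delivery in deliveries:
        state = handler(state, delivery)
    return state

# Idealized network: each payload is delivered once.
ideal_total = run(counter_handler, 0, [5, 7])

# Duplicating network: every packet arrives twice, but the wrapper filters repeats.
dedup_handler = with_sequence_numbers(counter_handler)
duplicated = [("a", 0, 5), ("a", 0, 5), ("a", 1, 7), ("a", 1, 7)]
faulty_total, _ = run(dedup_handler, (0, {}), duplicated)
```

Both runs end with the same total, which is the kind of guarantee a semantics transformer is meant to carry over, here checked only by example rather than by proof.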

A recent case study systematically evaluated LLMs (e.g., GPT-4o, DeepSeek-Prover-V2) at generating unit lemmas (<30 lines) within Verdi. With full project context (all definitions and imports), LLMs achieved up to 50% automatic proof completions; in-file context alone yielded only 8–18%. LLM-generated proofs were frequently more concise: originals averaged 18 lines, whereas LLM solutions averaged 9–14. Generated scripts often fused multiple reasoning steps and employed classical Coq automation efficiently. Limitations include hallucination of lemma names, failure due to missing imports, and degradation on induction-heavy goals. Success rates improved significantly with prompt engineering and access to full dependency context. The findings indicate LLMs can serve as valuable proof-assistant collaborators for mechanized distributed protocol verification, particularly in boilerplate and local proof tasks (Bayazıt et al., 26 Aug 2025).

2. Quality Estimation for Bilingual Corpora: Verdi Algorithm for QE

Verdi, in the context of machine translation, is a quality estimation (QE) framework for bilingual corpora, designed for both word-level and sentence-level post-editing prediction without reference translations (Zhao et al., 2021).

Key technical features:

Dual-predictor architecture:
- A weight-shared, dual-direction Transformer NMT model, with conditional encoders alternating between source→target ("primal") and target→source ("dual") prediction.
- Mixture-of-experts approach, assigning each sentence pair to the highest-likelihood expert via hard EM.
- A pre-trained cross-lingual language model (XLM, e.g., XLM-R) provides deep, context-aware token-level embeddings, modeling cross-lingual alignment from concatenated inputs.
Dual model feature: a source-free hidden state encoding of target tokens, formed by elementwise products of encoder representations and token embeddings in the target-alone encoding path.
Features are concatenated and fed to a Bi-GRU estimator (with additional feed-forward layers for prediction).

Objective functions:

For the NMT predictor, a bidirectional cross-entropy loss is used:

$\mathcal{L}_{dual}(\theta) = \mathcal{L}_{src\to tgt}(\theta) + \mathcal{L}_{tgt\to src}(\theta)$
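
As a minimal numeric sketch of this objective, the Python below sums the two directional cross-entropy terms. It assumes each direction supplies the probability the model assigned to every gold token, and it sums rather than averages over tokens; both choices are assumptions made for illustration, as is the function naming.

```python
import math
from typing import Sequence

def cross_entropy(gold_token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a sequence from its gold-token probabilities."""
    return -sum(math.log(p) for p in gold_token_probs)

def dual_loss(p_src_to_tgt: Sequence[float], p_tgt_to_src: Sequence[float]) -> float:
    """L_dual = L_{src->tgt} + L_{tgt->src}, both computed with shared parameters."""
    return cross_entropy(p_src_to_tgt) + cross_entropy(p_tgt_to_src)

# Toy probabilities: two target tokens in the primal direction, one source token in the dual.
loss = dual_loss([0.5, 0.25], [0.5])  # equals 4 * ln(2)
```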

Word-level QE predicts $t_i \in \{OK, BAD\}$ for each target token $y_i$ and gaps; sentence-level QE regresses the post-editing effort (HTER).

Experimental results (WMT20 En–Zh):

Verdi outperformed prior strong baselines (e.g., XLM-PredEst, Bilingual Expert) on Pearson $r$ (0.6353 single, 0.6672 ensemble) and F1-BAD (0.7021 single, 0.7006 ensemble).
Filtering out the 1M lowest-quality pairs from a 6M-pair parallel corpus, ranked by Verdi-predicted HTER, slightly improved downstream MT BLEU (unfiltered: 20.39; filtered: 20.48) while reducing training cost.
The dual learning mechanism substantially boosts the robustness of word tagging and sentence regression, confirming benefit from symmetrical modeling.
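
The corpus-filtering step described above can be sketched as follows. This is a minimal illustration, assuming predicted HTER scores are already available as a list aligned with the corpus; the function name `drop_lowest_quality` and the toy data are hypothetical. Since HTER measures post-editing effort, the pairs with the highest predicted HTER are treated as the lowest quality.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def drop_lowest_quality(pairs: Sequence[Pair],
                        predicted_hter: Sequence[float],
                        n_drop: int) -> List[Pair]:
    """Remove the n_drop pairs with the highest predicted HTER, keeping corpus order."""
    if len(pairs) != len(predicted_hter):
        raise ValueError("pairs and predicted_hter must have the same length")
    if not 0 <= n_drop <= len(pairs):
        raise ValueError("n_drop must be between 0 and the corpus size")
    ranked = sorted(range(len(pairs)), key=lambda i: predicted_hter[i])
    keep = sorted(ranked[: len(pairs) - n_drop])  # restore original ordering
    return [pairs[i] for i in keep]

# Toy corpus with made-up scores; the pair scored 0.80 is dropped.
corpus = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
scores = [0.10, 0.80, 0.30, 0.55]
kept = drop_lowest_quality(corpus, scores, n_drop=1)
```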

3. VLM-Embedded Reasoning for Autonomous Driving: VERDI Framework

VERDI ("VLM-Embedded Reasoning for Autonomous Driving") is a methodology for training modular, differentiable end-to-end autonomous driving stacks with embedded commonsense reasoning distilled from vision-language models (VLMs) (Feng et al., 21 May 2025).

Architectural overview:

Built on VAD-Base, a 3-stage modular pipeline: Perception (multi-view camera to BEV), Prediction (BEV to agent trajectories), and Planning (map-grounded ego trajectory selection).
During training, a large VLM (Qwen-2.5-VL-72B) is queried via chain-of-thought (CoT) prompts at each submodule (perception, prediction, planning), yielding textual explanations.
Text outputs are embedded (all-mpnet-base-v2), projected into a shared latent space, and aligned with the corresponding submodule's hidden features using cosine similarity loss:

$L_f(f^{P}_i, f^{M}_i) = 1 - \cos(f^{P}_i, f^{M}_i)$

with $L_i = L_e(\theta_i) + \lambda_i L_f(f^{P}_i, f^{M}_i)$ .
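
A minimal sketch of these two formulas in plain Python follows. It treats $f^{P}_i$ and $f^{M}_i$ as already-projected vectors of equal dimension and takes the task loss $L_e$ as a given number; the function names and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import Sequence

def cosine_alignment_loss(f_p: Sequence[float], f_m: Sequence[float]) -> float:
    """L_f = 1 - cos(f_p, f_m): 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(f_p, f_m))
    norm_p = math.sqrt(sum(a * a for a in f_p))
    norm_m = math.sqrt(sum(b * b for b in f_m))
    if norm_p == 0.0 or norm_m == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_p * norm_m)

def submodule_loss(task_loss: float, f_p: Sequence[float],
                   f_m: Sequence[float], lam: float) -> float:
    """L_i = L_e + lambda_i * L_f for one submodule."""
    return task_loss + lam * cosine_alignment_loss(f_p, f_m)

aligned = cosine_alignment_loss([1.0, 2.0], [2.0, 4.0])     # parallel: close to 0
orthogonal = cosine_alignment_loss([1.0, 0.0], [0.0, 1.0])  # orthogonal: 1
total = submodule_loss(0.8, [1.0, 0.0], [0.0, 1.0], lam=0.5)
```

Because the loss depends only on the angle between the two vectors, it aligns the direction of the module features with the text embedding without constraining their magnitude.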

At inference, only the e2e model is used; the expensive VLM pipeline is dropped.

Quantitative performance (nuScenes):

VERDI runs at 4.5 FPS and achieves $\ell_2$ displacement errors of 0.36, 0.62, and 0.96 m at the 1, 2, and 3 s horizons (average 0.65 m), a 10% improvement over VAD-Base, and it outperforms all methods that do not use VLMs at inference.
Ablation: aligning all three submodules (perception, prediction, planning) produces the best average $\ell_2$ error.
With this training-time distillation, robustness increases under partial observability and complex interactions—e.g., VLM-derived features support reasoning about occluded agents and plausible human-centric driving decisions.

4. Confidence Estimation for LLM Judges: VERDI (VERification-Decomposed Inference)

VERDI for LLM trustworthiness is a post hoc, single-call confidence estimation method for verification-based LLM judges that integrates consistency and structural analysis of reasoning traces, avoiding reliance on token log-probabilities (Qi et al., 11 May 2026).

Methodological breakdown:

Decomposes a verification prompt into claim extraction, per-claim adjudication, and verdict aggregation in one LLM call; parses the returned reasoning trace.
From the analysis trace, VERDI extracts three structural signals:
- Step-Verdict Alignment (SVA): Proportion of local conclusions agreeing with global verdict,
$\mathrm{SVA} = \frac{|S_{\mathrm{aligned}}|}{|S_{\mathrm{total}}|}$
- Claim-Level Margin (CLM): Fraction of subclaims that concur with the majority verdict.
- Evidence Grounding Score (EGS): Length-weighted fraction of quoted spans actually grounded in evidence.
Additional features: trace length, hedging and negation counts, quoted span count.
A Platt-scaled logistic regression maps standardized features to calibrated confidence.
Supports either regex-based parsing or parsing with a 33M-parameter NLI model, for robustness across structured and unstructured traces.
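
The pipeline above can be sketched for a single feature. The Python below computes SVA from already-parsed step verdicts and maps a standardized feature vector to a confidence with a logistic function. It shows only the inference-time mapping; the weights, bias, means, and standard deviations are placeholder values that would come from fitting on labeled data, and all function names are hypothetical.

```python
import math
from typing import Sequence

def step_verdict_alignment(step_verdicts: Sequence[str], final_verdict: str) -> float:
    """SVA = |S_aligned| / |S_total|: share of local conclusions matching the verdict."""
    if not step_verdicts:
        raise ValueError("the trace must contain at least one step verdict")
    aligned = sum(1 for v in step_verdicts if v == final_verdict)
    return aligned / len(step_verdicts)

def confidence(features: Sequence[float], means: Sequence[float],
               stds: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Standardize the features, then apply a logistic map to obtain a value in (0, 1)."""
    z = [(x - m) / s for x, m, s in zip(features, means, stds)]
    logit = bias + sum(w * zi for w, zi in zip(weights, z))
    return 1.0 / (1.0 + math.exp(-logit))

# Toy trace: three of four local conclusions agree with the final verdict.
sva = step_verdict_alignment(["supported", "supported", "refuted", "supported"],
                             "supported")
# Placeholder parameters (assumed, not fitted): an average-looking trace maps to 0.5.
conf = confidence([sva], means=[0.75], stds=[0.1], weights=[1.0], bias=0.0)
```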

Empirical benchmarks:

Achieves AUROC in the range of 0.66–0.91 across GPT-4.1-mini, GPT-5.4-mini, and Qwen3.5 models on SummEval, FEVER, and SciFact; consistently outperforms answer-logprob-based confidence, which is either unavailable or anti-calibrated in these settings.
On internal evaluation rubrics, factual rubrics see AUROC up to 0.977; flagging 20% of cases can catch 71–88% of errors, enabling efficient human-in-the-loop routing.
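
The routing figure above (the share of errors caught when a fixed fraction of cases is flagged) can be computed as in the sketch below. It assumes the lowest-confidence cases are the ones sent for human review; the function name and the toy data are made up for illustration and do not reproduce the reported numbers.

```python
from typing import Sequence

def error_capture_rate(confidences: Sequence[float],
                       is_error: Sequence[bool],
                       budget: float) -> float:
    """Fraction of all errors found among the `budget` share of lowest-confidence cases."""
    n = len(confidences)
    if n == 0 or n != len(is_error):
        raise ValueError("inputs must be non-empty and of equal length")
    total_errors = sum(is_error)
    if total_errors == 0:
        raise ValueError("capture rate is undefined when there are no errors")
    n_flag = max(1, round(budget * n))
    flagged = sorted(range(n), key=lambda i: confidences[i])[:n_flag]
    caught = sum(1 for i in flagged if is_error[i])
    return caught / total_errors

# Toy data: 10 judgments, 3 of them wrong; flag the 2 least confident.
confs = [0.95, 0.20, 0.90, 0.85, 0.30, 0.99, 0.80, 0.70, 0.97, 0.60]
errors = [False, True, False, False, True, False, False, True, False, False]
rate = error_capture_rate(confs, errors, budget=0.2)  # 2 of 3 errors caught
```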

5. Impact, Limitations, and Broader Research Context

The frameworks that share the Verdi name have each advanced the methodology of their respective domain:

In mechanized verification, it enables scalable proofs for distributed protocols and demonstrates LLMs’ emerging competence, though limitations persist on induction-intensive or import-complex proofs (Bayazıt et al., 26 Aug 2025).
As a QE framework, it provides fine-tuned, robust, and interpretable scores for multilingual MT pipelines and corpus cleaning (Zhao et al., 2021).
In autonomy, it introduces a paradigm for training-time reasoning distillation, greatly improving efficiency and safety decomposition versus monolithic VLM stacks (Feng et al., 21 May 2025).
For LLM evaluation, it gives a path to robust trust estimation as logprob access is throttled, crucial for production deployments and alignment research (Qi et al., 11 May 2026).

Common limitations include performance sensitivity to available context in LLM-aided proof automation, reliance on the structure of reasoning traces, and degraded gains on style-based rubrics or when overall system accuracy is extremely high. The modular architectures and post hoc estimation strategies exemplified by Verdi provide generalizable templates for similar problems, suggesting future integration with hybrid symbolic-ML or retrieval-based approaches.