RFC Bench: Financial Misinformation & Protocol Conformance
- RFC Bench is a dual-purpose benchmark that evaluates reference-free financial misinformation detection and network protocol conformance.
- It employs adversarially perturbed financial news and traceable RFC parsing to challenge LLM-based reasoning and implementation verification.
- Metrics such as accuracy, precision, and MCC reveal key weaknesses in LLM belief-state reasoning and spec–implementation alignment.
RFC Bench refers to benchmarks or frameworks for evaluating reasoning, compliance, or efficiency in settings governed by RFCs (Request for Comments)—a standard, openly published specification format essential to the Internet and related systems. The term appears in two major technical contexts: (1) a task and evaluation framework for reference-free counterfactual financial misinformation detection; (2) an automated benchmarking suite for network protocol conformance and parser validation against RFC documents using LLMs and formal analysis pipelines. Both lines of work address deep model or system limitations in reference-free reasoning and spec–implementation alignment.
1. RFC Bench for Reference-Free Financial Misinformation Detection
RFC Bench, as defined in "All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection" (Jiang et al., 7 Jan 2026), is a dataset and evaluation protocol for LLMs, targeting their capacity to identify and diagnose minimally perturbed misinformation in financial news. The benchmark is constructed to probe fundamental weaknesses in model “belief-state” reasoning, especially under reference-free conditions typical of practical misinformation detection.
The benchmark operates at paragraph granularity, mirroring real-world financial news complexities, where manipulations are minimal and cues are highly dispersed. It defines two core tasks:
- Reference-Free Detection: Given a single paragraph p, the model must output a binary label y ∈ {0, 1} (0 for original, 1 for manipulated), formally f : p ↦ y.
- Comparative Diagnosis: Provided a factual–manipulated pair (p_orig, p_manip), the model determines the manipulation type t from a fixed set (Directional Flipping, Numerical Perturbation, Sentiment Amplification, Causal Distortion), via g : (p_orig, p_manip) ↦ t.
2. Dataset Construction, Manipulation Taxonomy, and Quality Measures
The dataset was synthesized from 1,404 unique Yahoo Finance news articles spanning 223 U.S.-listed stocks. From these, the authors generated 2,042 adversarial factual–perturbed paragraph pairs (1,845 retained after QA, a reported retention rate of 0.894). Each perturbation falls into one of four categories:
- Directional Flipping: Inverts stance or trend signals (e.g., "rose 5%"→"fell 5%").
- Numerical Perturbation: Alters cardinal numbers without changing sign ("8%"→"28%").
- Sentiment Amplification: Escalates the risk or positivity/negativity of an account ("may compress"→"warned of bankruptcy").
- Causal Distortion: Substitutes justifications or links incorrect causes and effects.
Generation employed category-specific GPT-4.1 prompt templates, strict length and decoding controls, and enforced minimality using token-ratio filters (0.9–1.3). Human QA involved two annotation passes; inter-annotator reliability and overall rewrite validity were assessed with percent agreement, Macro-F1, Cohen's κ, and Gwet's AC1.
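The token-ratio minimality filter can be illustrated as follows. Whitespace tokenization and the function name are illustrative assumptions; the paper's actual tokenizer is not specified here:

```python
def passes_minimality_filter(original: str, rewrite: str,
                             lo: float = 0.9, hi: float = 1.3) -> bool:
    """Keep a perturbed paragraph only if its token count stays within
    0.9x-1.3x of the original, enforcing minimal edits.

    Whitespace tokenization is an illustrative stand-in for the
    (unspecified) tokenizer used in the actual pipeline."""
    n_orig = len(original.split())
    n_rewrite = len(rewrite.split())
    if n_orig == 0:
        return False
    ratio = n_rewrite / n_orig
    return lo <= ratio <= hi
```

Rewrites that balloon or shrink the paragraph are rejected, so surviving pairs differ only by a small, localized edit.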
3. Evaluation Protocols and Metrics
Benchmarking targets the reference-free challenge: detection without external corroboration, as must be done in real-time news screening. Fourteen LLMs, spanning open-source (LLaMA, Qwen series) and closed-source (GPT-4.1, DeepSeek-chat) families, were scored under the following regimes:
- Reference-Free Detection (Task 1): Zero-shot and few-shot variants tested binary labeling accuracy.
- Comparative Diagnosis (Task 2): Models classified manipulation type in pairwise, 4-way zero-shot tasks.
- Invalid Output Rate: Fraction of outputs not mappable to a valid label.
- Core Metrics: Accuracy, Precision, Recall, F1-score, Macro-F1, Matthews Correlation Coefficient (MCC).
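The core binary-detection metrics can be computed from a confusion matrix; a self-contained sketch (equivalent to the standard definitions, not code from the paper):

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and MCC for binary labels (1 = manipulated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC near zero indicates chance-level performance, the failure mode
    # reported for reference-free detection.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "mcc": mcc}
```

MCC is the most informative of these under class imbalance, which is why a near-zero MCC is the headline symptom of unreliable belief-state reasoning.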
A summary of performance:
| Model | Ref-Free Acc | Ref-Free Macro-F1 | Comp Acc | Comp Macro-F1 |
|---|---|---|---|---|
| Qwen3-8B (Non-Thinking) | 0.530 | 0.528 | 0.850 | 0.790 |
| GPT-4.1 | 0.527 | 0.507 | 0.969 | 0.965 |
| DeepSeek-reasoner | 0.536 | 0.528 | 0.936 | 0.937 |
Invalid output rates peaked with smaller models (up to 1,099 invalid Task 1 outputs for LLaMA 8B) but were negligible for strong LLMs and for pairwise diagnosis. Few-shot settings (up to 8-shot) marginally improved performance, but never to robust levels.
4. Key Findings and Error Analyses
Reference-free settings revealed a consistent failure pattern: models accepted or rejected inputs based not on internal fact coherence, but on superficial stylistic cues and model priors—termed “Accommodation-first” errors. Models falsely rejected plausible but forward-looking projections, or accepted high-fluency fabrications resembling trusted news sources. Instability and schema non-conformance (invalid outputs) were common with smaller models; MCC was near zero in these settings.
By contrast, comparative diagnosis gave models an explicit anchor, dramatically improving reliability (up to 0.969 accuracy for GPT-4.1). The task then reduces to identifying minimal edits, at which LLMs were highly effective. Residual errors in Task 2 largely stemmed from mixed-manipulation cues, e.g., confusing polarity flips with numerical changes.
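Why the gold-context anchor helps is easy to see: given the pair, the surface edit is mechanically localizable. A rough sketch with Python's `difflib` (a toy diff, not the benchmark's method, which asks models rather than diff tools to classify the edit):

```python
import difflib

def locate_minimal_edit(original: str, manipulated: str):
    """Return (removed, inserted) token spans that differ between a
    factual-manipulated pair. Illustrative only: it shows how the paired
    setting exposes the surface edit that reference-free detection lacks."""
    a, b = original.split(), manipulated.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits

# Example: a Directional Flipping perturbation
edits = locate_minimal_edit("Shares rose 5% after earnings.",
                            "Shares fell 5% after earnings.")
```

Classifying the manipulation type from the localized span (`("rose", "fell")` here) is a far easier problem than judging a lone paragraph's internal coherence.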
The central result is that existing LLMs lack robust reference-free reasoning under minimal, plausible adversarial manipulation in the financial domain, but are highly effective at surface-level cue localization with gold-context comparison.
5. Implications for Model Development and Future Benchmarking
RFC Bench (financial) exposes a key vulnerability for high-stakes model deployment: without external grounding or paired reference, even strong LLMs have unstable and unreliable beliefs in the presence of subtle semantic manipulations. Recommendations include pursuing models with explicit belief-state representations, internal self-consistency checks, and adversarial or counterfactual training protocols. Extensions to multilingual, multi-paragraph, and multimodal (e.g., tables/figures) scenarios are indicated as future directions.
By providing a paired-task structure, high-quality minimally perturbed data, and precise adjudication metrics, RFC Bench offers a rigorous, practically grounded testbed for LLMs targeting financial misinformation detection under both isolated and comparative review (Jiang et al., 7 Jan 2026).
6. RFC Bench in Network Protocol Conformance: Parser and Implementation Validation
In the domain of network protocols, “RFC Bench” also refers to rigorous frameworks for validating the conformance of protocol parsers and implementations against RFC formal specifications. The key contributions are embodied in "Validating Network Protocol Parsers with Traceable RFC Document Interpretation" (Zheng et al., 25 Apr 2025) and "An LLM Agent for Functional Bug Detection in Network Protocols" (Zheng et al., 31 May 2025). These works define automated, traceable pipelines that combine LLM-powered RFC interpretation, symbolic analysis, and in-depth program retrieval.
Primary Components:
- Document Tree (DocTree) Extraction: Parses RFC contents into a hierarchical representation, prompting LLMs to extract grammar and behavioral constraints by section, and correcting via a mini-DSL checker.
- Format Graph Construction: Aggregates local specs into a protocol-wide format graph (DAG), supporting path enumeration for syntactic and semantic property evaluation.
- Test Input Generation & Oracle Checking: Symbolic constraint solving (via Z3) generates both specification-conformant and negative test packets; parser outputs are compared to oracle predictions, tracing violations to specific RFC sections via DocTree mapping.
- Interactive Traceability and Refinement: Inconsistencies are categorized as implementation or specification (LLM hallucination) errors, enabling iterative spec refinement and bug report generation fully traceable to RFC text.
- Metrics: Precision, recall, and identification accuracy are directly linked to coverage, property mutation, and fault localization.
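The oracle-checking step above can be illustrated with a toy one-byte length-prefixed format. All names here are hypothetical, and a hand-built oracle stands in for the Z3 constraint solving used in the actual pipeline:

```python
def oracle_parse(packet: bytes):
    """Spec oracle for a toy format: [length: 1 byte][payload: length bytes].
    Returns the payload, or None when the packet violates the spec."""
    if len(packet) < 1:
        return None
    length = packet[0]
    if len(packet) != 1 + length:
        return None  # length field must match actual payload size
    return packet[1:]

def buggy_parser(packet: bytes):
    """Hypothetical implementation under test: forgets to check the
    declared length against the actual packet size."""
    if len(packet) < 1:
        return None
    return packet[1:1 + packet[0]]  # silently accepts oversized packets

def conformance_check(packets):
    """Report packets where implementation and oracle disagree; in the real
    pipeline each disagreement is traced back to an RFC section via DocTree."""
    return [p for p in packets if buggy_parser(p) != oracle_parse(p)]

# One spec-conformant packet and one negative (length-mismatch) packet
tests = [bytes([3, 0x61, 0x62, 0x63]), bytes([2, 0x61, 0x62, 0x63])]
violations = conformance_check(tests)
```

The negative packet is what exposes the bug: both parsers agree on well-formed input, and only the property-level negative test reveals the missing length check, mirroring the ablation result that negative testing is essential.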
Experimental application to diverse codebases (C, Python, Go; protocols: Babel, BFD, DHCP, etc.) resulted in the detection of 69 unique bugs (36 confirmed/fixed). Key ablations confirmed that traceability and property-level negative testing are essential, more than doubling detection performance relative to differential-only or non-traceable baselines (Zheng et al., 25 Apr 2025, Zheng et al., 31 May 2025).
7. RFC-Compliance Differential Analysis Across Updates
In "Uncovering Gaps Between RFC Updates and TCP/IP Implementations" (Wu et al., 28 Oct 2025), RFC Bench is extended to track RFC evolution across protocol versions and associated implementations. The methodology introduces formal alignment graphs between changing RFC “functional entries” and evolving code, leverages intermediate representations (IRs) for both, and uses LLMs to both identify specification/code mismatches and infer downstream vulnerabilities. Evaluation on 144 TCP-related RFCs and seven major OS kernels yielded significant accuracy (GPT-4o: 91.1%), surfacing concrete security-relevant inconsistencies (e.g., secret key re-generation, challenge-ACK handling). This work establishes an automated, scalable regime for continuous compliance checking as RFCs and implementations evolve (Wu et al., 28 Oct 2025).
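The differential idea can be sketched at a toy level: diff functional entries across two RFC versions and flag entries whose specification changed while the mapped code did not. Entry names, the IR representation, and the `updated` flag are illustrative assumptions, not the paper's actual data model:

```python
def changed_but_unpatched(old_spec, new_spec, code_ir):
    """Toy differential compliance check: flag functional entries whose
    specification text changed between RFC versions while the mapped
    implementation IR did not. A stand-in for the alignment-graph and
    LLM-based mismatch inference in the actual methodology."""
    flags = []
    for entry, new_text in new_spec.items():
        spec_changed = old_spec.get(entry) != new_text
        code_changed = code_ir.get(entry, {}).get("updated", False)
        if spec_changed and not code_changed:
            flags.append(entry)
    return flags

# Illustrative entries loosely modeled on the challenge-ACK example
old = {"challenge_ack": "send ACK on out-of-window SYN"}
new = {"challenge_ack": "rate-limit challenge ACKs (RFC 5961)"}
ir = {"challenge_ack": {"updated": False}}
gaps = changed_but_unpatched(old, new, ir)  # entries needing review
```

Running such a check continuously as RFCs and kernels evolve is the "continuous compliance" regime the work advocates.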
In summary, RFC Bench, in both its financial and protocol-validation manifestations, serves as a suite of structured challenges and pipelines leveraging LLM and symbolic methods for rigorous diagnosis of semantic manipulation (in text) and compliance failures (in system code). It is defined by its traceability, adversarial construction, and high relevance for evaluating and advancing trustworthy automated reasoning under minimal external grounding (Jiang et al., 7 Jan 2026, Zheng et al., 25 Apr 2025, Zheng et al., 31 May 2025, Wu et al., 28 Oct 2025).