NoisyBench: Evaluating Noise Robustness

Updated 14 January 2026
  • NoisyBench refers to a family of benchmarking frameworks designed to assess noise robustness in computation, learning, and reasoning systems.
  • It employs detailed noise taxonomies and controlled injection protocols across various domains including federated learning, LLM reasoning, and quantum computing.
  • Its standardized tasks and metrics enable reproducible comparisons of defense strategies, guiding improvements in resilient machine learning and system designs.

NoisyBench is a generalized term for benchmarking frameworks and datasets systematically designed to evaluate robustness, detection, and mitigation strategies for noise in computation, learning, and reasoning systems. NoisyBench instances exist for federated learning, reasoning with distractor contexts, quantum computing under device noise, text classification and sequence labeling with label noise, and graph neural networks under noisy supervision. These benchmarks provide standardized tasks, metrics, noise taxonomies, and empirical protocols to facilitate fair, reproducible comparison of noise-robust methodologies across domains.

1. Conceptual Foundations and Scope

NoisyBench encapsulates the principle that benchmarking under "clean" or sanitized conditions substantially overestimates system performance. In practice, data and context are frequently perturbed by real-world noise: label errors, distractor contexts, device non-idealities, semantic hallucinations, temporal correlations, and environmental fluctuations. The NoisyBench paradigm is to construct controlled yet challenging evaluation protocols in which such noise is injected intentionally and systematically, enabling precise measurement of both degradation and the efficacy of defense mechanisms across modalities and architectures. Implementations include classical timing benchmarks (Chen et al., 2016), federated noisy-label learning (Liang et al., 2023), contextual distractors for LLMs (Lee et al., 12 Jan 2026), instance-dependent label noise in NLP (Rączkowska et al., 2024, Merdjanovska et al., 2024), graph neural networks under label noise (Wang et al., 2024), and quantum device- and circuit-level noise evaluations (Resch et al., 2019).

2. Noise Taxonomies and Injection Protocols

Across domains, NoisyBench employs rich noise taxonomies and rigorous injection procedures:

  • Federated Learning: Noise can be global (homogeneous flipping with fixed rate and symmetric/asymmetric class transitions), localized (heterogeneous per-client transition matrices and rates), or instance-dependent (Liang et al., 2023). Transition matrices $T_{ij} = P(\hat{y}=j \mid y=i)$ govern corruption, with partitions spanning IID and multiple classes of non-IID assignment; a minimal injection sketch appears after this list.
  • Reasoning and LLMs: Distractors include random documents (RD), synthetic hard negatives (HN), and irrelevant multi-turn chat histories (RC), injected as context preceding the query (Lee et al., 12 Jan 2026).
  • Embodied QA: NoisyBench for EQA defines latent hallucination (referencing non-existent entities), memory noise (attribute recall errors), perception noise (scene mislabeling via injected perturbations), and semantic noise (substitution by related entities) (Wu et al., 2024).
  • NER / Text classification: Real noise stems from expert annotation errors, crowdsourcing artifacts, automatic distant supervision, weak sources, and LLM-generated labels. Empirical error breakdowns (missing mention, spurious non-entity, type/boundary confusion) are computed at both token and entity level (Rączkowska et al., 2024, Merdjanovska et al., 2024).
  • Graph Label Noise: Standard symmetric (uniform class flipping) and asymmetric (pairwise) noise regimes are injected onto node labels per known class structure (Wang et al., 2024).
  • Timing/Hardware Benchmarks: Environmental timing noise, OS jitter, clock error, and context switch delays are modeled as additive delays, with robust estimation via min-statistics and repetition parameter optimization (Chen et al., 2016).
  • Quantum Devices: Gate noise models include depolarizing (Pauli), amplitude damping, phase noise, and coherent over-rotation. Benchmarks inject these via Kraus operators after each cycle in simulations and hardware runs (Resch et al., 2019).
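
A minimal sketch of the label-flipping protocol from the federated-learning bullet above, assuming the standard symmetric transition-matrix construction in plain NumPy; it is illustrative only and not code from any specific NoisyBench release.

```python
import numpy as np

def symmetric_transition_matrix(num_classes: int, noise_rate: float) -> np.ndarray:
    """Build T with T[i, j] = P(noisy label = j | true label = i):
    probability (1 - noise_rate) of keeping the label, uniform flips otherwise."""
    T = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T

def inject_label_noise(labels: np.ndarray, T: np.ndarray, seed: int = 0) -> np.ndarray:
    """Corrupt each label by sampling its noisy class from the row of T for its true class."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])

# Example: 10% symmetric noise on a 3-class label vector.
clean = np.array([0, 1, 2, 1, 0, 2, 2, 1])
noisy = inject_label_noise(clean, symmetric_transition_matrix(3, 0.1))
```

Localized (per-client) noise would use a different transition matrix per client partition, and asymmetric noise a non-uniform off-diagonal structure.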

3. Evaluation Protocols and Metrics

NoisyBench frameworks specify robust, domain-specific metrics and protocols:

| Domain | Key Metrics | Evaluation Protocol |
| --- | --- | --- |
| Federated / label noise | Accuracy, macro-F1, drop ratio, sensitivity, gradient norm | Federated averaging with noise-injected partitioning (Liang et al., 2023) |
| LLM reasoning | pass@k, ΔAccuracy, sensitivity S, noise-curve AUC | Context prepending, multi-context trials (Lee et al., 12 Jan 2026) |
| Embodied QA | LLM-match accuracy, detection rate, correction rate | Human verification, self-correction prompts (Wu et al., 2024) |
| NER / text classification | Token- and entity-level F1, memorization metrics | Clean/test splits, error breakdown, memorization tracking (Rączkowska et al., 2024, Merdjanovska et al., 2024) |
| Graph | Test accuracy, node-type-specific error rates | Semi-supervised node classification, noise-injected train/val splits (Wang et al., 2024) |
| Timing | min(wall-time), unimodality, regression detection | Repeated trials, automatic selection of repetition count n, min estimator (Chen et al., 2016) |
| Quantum | Process fidelity, diamond norm, Hellinger distance, trace distance | Gate/circuit-level simulation and hardware testing, randomized compiling (Resch et al., 2019) |
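
As a rough illustration of the LLM-reasoning metrics in the table above, the sketch below assumes that ΔAccuracy is the clean-minus-fully-noisy accuracy gap and that the noise-curve AUC is the normalized area under an accuracy-versus-noise-level curve; these concrete definitions and the numbers are assumptions for illustration, not the benchmark's published formulas or results.

```python
import numpy as np

# Hypothetical accuracy measured at increasing distractor fractions (0.0 = clean context).
noise_levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
accuracy     = np.array([0.82, 0.74, 0.61, 0.45, 0.33])

# Drop from clean to fully noisy context (assumed definition of ΔAccuracy).
delta_accuracy = accuracy[0] - accuracy[-1]

# Trapezoidal area under the accuracy-vs-noise curve, normalized by the noise range.
auc = np.sum((accuracy[1:] + accuracy[:-1]) / 2 * np.diff(noise_levels))
noise_curve_auc = auc / (noise_levels[-1] - noise_levels[0])

print(f"ΔAccuracy = {delta_accuracy:.2f}, noise-curve AUC = {noise_curve_auc:.2f}")
```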

In noisy label detection, NoisyBench tasks fix the cleaned fraction at the dataset's true noise rate and measure the false negative rate at this operating point (FNR*) to standardize method comparison (Pickler et al., 17 Oct 2025).
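
A minimal sketch of this fixed-operating-point metric, assuming per-example noisiness scores (higher meaning more likely mislabeled) from some arbitrary detector; the scoring is a placeholder and the benchmark's exact tie-breaking may differ.

```python
import numpy as np

def fnr_at_true_noise_rate(noise_scores: np.ndarray, is_noisy: np.ndarray) -> float:
    """FNR*: flag the k highest-scoring examples, where k equals the true number
    of noisy labels, and report the fraction of truly noisy examples that were missed."""
    k = int(is_noisy.sum())                       # cleaned fraction = true noise rate
    flagged = np.zeros_like(is_noisy, dtype=bool)
    flagged[np.argsort(-noise_scores)[:k]] = True
    missed = is_noisy.astype(bool) & ~flagged
    return missed.sum() / max(k, 1)

# Hypothetical detector scores for 8 examples, 3 of which are actually mislabeled.
scores   = np.array([0.9, 0.1, 0.8, 0.2, 0.4, 0.7, 0.3, 0.6])
is_noisy = np.array([1,   0,   1,   0,   0,   0,   0,   1  ])
print(f"FNR* = {fnr_at_true_noise_rate(scores, is_noisy):.2f}")  # misses example 7 -> 0.33
```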

4. Baseline Methods and Defense Strategies

NoisyBench instances typically provide reference implementations and comparisons of mainline and noise-robust approaches for their domain.

Empirically, many methods show negligible gains on real noise; soft-label regularization helps with class-conditional noise, while instance-dependent noise remains challenging (Rączkowska et al., 2024, Merdjanovska et al., 2024). For context-distractor benchmarks, RARE substantially improves accuracy under noise (+55% relative gain for reasoning models) compared to outcome-only RL (Lee et al., 12 Jan 2026).
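
As one concrete instance of soft-label regularization, the sketch below uses standard label smoothing in PyTorch; this illustrates the general idea rather than the exact regularizers evaluated in the cited benchmarks.

```python
import torch
import torch.nn as nn

# Label smoothing spreads a small amount of probability mass (here 0.1) over
# non-target classes, softening the loss incurred by (class-conditional) label noise.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 5, requires_grad=True)  # batch of 4 examples, 5 classes
labels = torch.tensor([0, 2, 1, 4])             # possibly noisy hard labels
loss = criterion(logits, labels)
loss.backward()
```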

5. Key Empirical Insights and Comparative Analysis

NoisyBench exposes phenomena masked on sanitized benchmarks:

  • On federated learning, label-skew partitions and localized noise yield severe degradation, especially in Non-IID regimes; globalized/symmetric noise is easier only in centralized settings (Liang et al., 2023).
  • In LLM/contextual reasoning, random and hard distractors cause catastrophic drops (up to 80%), showing that clean accuracy does not predict noise resilience; agentic tool-use workflows amplify errors via context re-injection and emergent misalignment (Lee et al., 12 Jan 2026).
  • In embodied QA, latent hallucination and memory noise cause severe accuracy loss; self-correction prompt mechanisms improve detection and answer quality, but all agents remain far from human-level robustness (Wu et al., 2024).
  • For NER/text classification, simulated noise substantially underestimates the challenge posed by real annotation errors; memorization is immediate and persistent with real noise, as opposed to delayed with synthetic (Merdjanovska et al., 2024, Rączkowska et al., 2024).
  • On graphs, specialized GLN methods outperform naive robust losses under label noise, but no universally superior model exists; pair-type asymmetric noise is especially disruptive due to its misleading effect on network prediction (Wang et al., 2024).
  • In timing, traditional statistics (mean, trimmed mean) produce biased and multi-modal estimates in noisy environments; the minimum wall-time per trial yields robust, reproducible estimates without kernel-level access (Chen et al., 2016); a minimal sketch of this estimator follows this list.
  • Quantum benchmarks demonstrate that coherent noise types cause far larger process infidelities than stochastic Pauli noise, mitigated by randomized compiling; choice of metric critically affects interpretation (Resch et al., 2019).
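
A minimal sketch of the min-estimator idea from the timing bullet above, written as a plain Python loop rather than the cited benchmark tooling; trial and repetition counts are arbitrary placeholders.

```python
import time

def min_wall_time(fn, *, trials: int = 100, inner_reps: int = 10) -> float:
    """Estimate fn's cost as the minimum wall-time over repeated trials.
    Additive noise (OS jitter, context switches) only inflates timings,
    so the minimum is a robust, low-bias estimate of the true runtime."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        for _ in range(inner_reps):
            fn()
        elapsed = (time.perf_counter() - start) / inner_reps
        best = min(best, elapsed)
    return best

# Example: time a small pure-Python workload.
estimate = min_wall_time(lambda: sum(i * i for i in range(1000)))
print(f"~{estimate * 1e6:.1f} µs per call")
```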

6. Frameworks, Tools, and Accessibility

NoisyBench implementations span PyTorch/FedLab (FedNoisy), BenchmarkTools.jl (Julia timing), GitHub releases (AlleNoise, NoisyGL), and domain-specific open repositories (Quantum NoisyBench). All target reproducibility via open splits, metric definitions, and standardized simulation or data-collection pipelines (Liang et al., 2023, Wang et al., 2024, Rączkowska et al., 2024). In quantum computing, integration with OpenPulse and cross-platform calibration is required to bridge hardware specifics (Garmon et al., 2019).

NoisyBench protocols are designed to be plug-and-play: wrapping inference loops, using four lines of code for context distractors (Lee et al., 12 Jan 2026), or parameterizing repetition/measurement budgets for timing (Chen et al., 2016). Open datasets include human-verified clean labels for diagnostic clarity, enabling per-class confusion and noise-rate analysis (Rączkowska et al., 2024).
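
As a hypothetical equivalent of such a wrapper (not the benchmark's actual API), the sketch below shows how randomly sampled distractor documents might be prepended to a query before an existing inference call; generate_answer and the distractor pool are placeholders.

```python
import random

def with_distractors(query: str, distractor_pool: list[str], k: int = 3, seed: int = 0) -> str:
    """Prepend k randomly sampled distractor documents to the query as extra context."""
    rng = random.Random(seed)
    context = "\n\n".join(rng.sample(distractor_pool, k))
    return f"{context}\n\nQuestion: {query}"

# Hypothetical usage around an existing inference call:
# answer = generate_answer(with_distractors(query, random_documents, k=3))
```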

7. Open Challenges and Future Directions

Several issues persist across all NoisyBench frameworks:

  • Incorporating instance/context-dependent noise models, adversarial perturbations, and open-set errors into standard benchmarks.
  • Expanding benchmarks to cover more languages, medical or scientific modalities, and multi-modal scene grounding tasks.
  • Developing robust aggregation algorithms, personalized federated methods, secure/private computation under label noise (differential privacy, homomorphic encryption).
  • Extending evaluation to complex reasoning pipelines, multi-step agentic workflows, and adaptive learning using small curated clean sets.
  • Rigorous theoretical analysis, e.g. convergence bounds for federated algorithms under heterogeneous noise, and characterization of non-Markovian noise memory in quantum devices.
  • Improving hybrid strategies combining human-in-the-loop, meta-learning, and automated pattern discovery to approach clean-subset oracle upper bounds.

NoisyBench frameworks represent a critical standard in the experimental evaluation of robust learning and inference. By systematically cataloging and quantifying the impact of real-world noise across domains, they reveal authentic failure modes and enable directed progress towards resilient computational systems.
