FlakyGuard: Test Repair & FL Robustness
- FlakyGuard is a dual-framework system addressing both non-deterministic test failures in CI and negative benefits in federated learning through automated, adaptive remediation.
- Its software engineering implementation uses dynamic call graphs and LLM-guided context traversal to isolate minimal yet critical code segments, achieving a repair success rate of 65.8%.
- In federated learning, FL-GUARD employs per-client performance tracking and robust statistical smoothing to detect and mitigate negative federated learning, restoring positive client gains.
FlakyGuard denotes two distinct frameworks in the literature, each addressing reliability and robustness in large-scale software or distributed machine learning systems via automated, adaptive mechanisms. In the context of continuous-integration-driven software engineering, FlakyGuard (Li et al., 18 Nov 2025) is a system for the automatic diagnosis and repair of non-deterministic ("flaky") tests. In federated learning, FL-GUARD (also FlakyGuard) (Lin et al., 7 Mar 2024) is a holistic runtime framework for detecting and mitigating "Negative Federated Learning" (NFL), where client participation yields negative benefit. Both systems deploy algorithmic monitoring and targeted remediation at industrial scale, but they operate in different technical domains.
1. Automated Repair of Flaky Tests in Large Software Repositories
FlakyGuard (Li et al., 18 Nov 2025) targets the challenge of non-deterministic ("flaky") test failures, a major operational burden in modern continuous integration (CI) environments. Such test cases intermittently pass or fail across codebase states that are nominally unchanged. In enterprise monorepos (e.g., Uber's, with ≈100M lines of Go code), flaky tests consume CI resources, trigger manual triage, and degrade developer trust in regression safety nets.
The principal technical hurdle, termed the context problem, is the difficulty of supplying an LLM with the correct scope of contextual code for effective diagnosis. Prior repair systems provide either too little context (e.g., only the test code) or too much (e.g., all production source), leading to poor root-cause inference and low-quality patches.
FlakyGuard resolves this by (a) representing executed code as a dynamic call graph, and (b) traversing this graph in a selective, LLM-guided manner to isolate just the most relevant context for LLM-based repair.
2. Dynamic Call Graph Construction and LLM-Guided Traversal
FlakyGuard models the code under test as a directed graph G = (V, E), where V is the set of executed methods (uniquely identified) and E is the set of observed invocation edges, including special handling for concurrent execution (e.g., adding edges for goroutine launches). Dynamic call graph extraction is realized by instrumenting methods during test runs, logging only the paths exercised by the given flaky test. This yields a precise, execution-specific (rather than static) context representation, pruned of unrelated code segments that would obscure root-cause analysis.
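Such an execution-specific call graph can be sketched as a small Go data structure. All type names, fields, and method identifiers below are illustrative assumptions, not FlakyGuard's actual implementation:

```go
package main

import "fmt"

// Node is one executed method, uniquely identified by its fully
// qualified name (package + receiver + method).
type Node struct {
	ID      string   // e.g. "pkg/cache.(*Store).Get" (hypothetical)
	Snippet string   // short source excerpt later shown to the LLM
	Callees []string // IDs of methods this node was observed to invoke
}

// CallGraph holds only the methods exercised by the flaky test run.
type CallGraph struct {
	Roots []string // failed test methods
	Nodes map[string]*Node
}

// AddEdge records an observed invocation; the same call also covers
// synthetic edges for goroutine launches (caller -> goroutine entry).
func (g *CallGraph) AddEdge(caller, callee string) {
	if n, ok := g.Nodes[caller]; ok {
		n.Callees = append(n.Callees, callee)
	}
}

func main() {
	g := &CallGraph{
		Roots: []string{"TestFlaky"},
		Nodes: map[string]*Node{
			"TestFlaky":   {ID: "TestFlaky"},
			"doWork":      {ID: "doWork"},
			"spawnWorker": {ID: "spawnWorker"},
		},
	}
	g.AddEdge("TestFlaky", "doWork")
	g.AddEdge("doWork", "spawnWorker") // edge for a goroutine launch
	fmt.Println(g.Nodes["TestFlaky"].Callees) // prints [doWork]
}
```

Because only instrumented, actually-executed methods enter `Nodes`, unexercised production code never reaches the later LLM prompt.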
A selective graph traversal algorithm then operates as follows:
- Starting at root nodes (failed test methods), FlakyGuard conducts a breadth-first search.
- At each visited node, candidate children (direct callees) are presented to the LLM in summarized form, alongside short code snippets.
- The LLM returns up to k children judged most likely to be causally related to the flakiness, concentrating the search on plausible failure loci.
- Traversal iterates, constructing a context window that remains both deep (can follow long, causal chains) and focused (never floods the LLM prompt window).
- A final global filter, again LLM-driven, ranks all traversed nodes, retaining only the top-ranked, most relevant context methods.
This process is formalized in Algorithm 1 of (Li et al., 18 Nov 2025), with built-in LLM integration points for context curation and relevance ranking.
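A minimal sketch of this LLM-guided breadth-first traversal, with the LLM abstracted as a pluggable function value (the stub in `main` simply keeps the first k candidates; all names and signatures are illustrative, not the paper's Algorithm 1 verbatim):

```go
package main

import "fmt"

// pickFn stands in for the LLM call that, given a node and its
// candidate callees, returns at most k children judged most likely
// to be causally related to the flakiness.
type pickFn func(node string, candidates []string, k int) []string

// guidedBFS walks the dynamic call graph breadth-first from the
// failed test methods, asking pick at each node to prune the frontier
// so the collected context stays deep yet focused.
func guidedBFS(edges map[string][]string, roots []string, k, limit int, pick pickFn) []string {
	visited := map[string]bool{}
	queue := append([]string{}, roots...)
	var context []string
	for len(queue) > 0 && len(context) < limit {
		n := queue[0]
		queue = queue[1:]
		if visited[n] {
			continue
		}
		visited[n] = true
		context = append(context, n)
		// The "LLM" keeps only the k most plausible callees.
		queue = append(queue, pick(n, edges[n], k)...)
	}
	return context
}

func main() {
	edges := map[string][]string{
		"TestFlaky": {"setup", "doWork", "teardown"},
		"doWork":    {"fetch", "parse"},
	}
	// Stub "LLM": keep the first k candidates.
	first := func(_ string, cand []string, k int) []string {
		if len(cand) > k {
			cand = cand[:k]
		}
		return cand
	}
	fmt.Println(guidedBFS(edges, []string{"TestFlaky"}, 2, 10, first))
	// prints [TestFlaky setup doWork fetch parse]
}
```

In the real system a second, global LLM ranking pass would then filter the returned context list; here that step is omitted for brevity.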
3. LLM-Powered Test Fix Synthesis and Iterative Validation
Following context selection, FlakyGuard constructs a prompt for an autonomous LLM session. This prompt includes directives (modify only test code), error diagnostics (messages, stack traces, reproduction rates), and the selected context code. In Go codebases, widespread use of table-driven tests necessitates test function simplification: the failed test case is isolated, passed to the LLM, and fixes are subsequently transplanted back into the original test function via an AST-based patch merge.
Test fixes undergo automatic validation: after patch application, tests are rerun (with up to 1000 repeats), and only if they pass is a pull request submitted. If failures persist, FlakyGuard enters a multi-level iterative repair loop—with up to 18 attempts combining retries over context selection, LLM prompting ("thoughts"), and candidate patch generation—addressing LLM non-determinism and context uncertainty.
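The validate-and-retry loop described above can be sketched as follows, with the LLM repair attempt and the repeated test rerun abstracted as stub callbacks (the function names and the toy pass condition in `main` are assumptions for illustration only):

```go
package main

import "fmt"

// repairLoop models FlakyGuard's multi-level iteration: each attempt
// produces a candidate patch (attemptFix abstracts context selection
// plus LLM prompting), and only a patch that survives every rerun
// (rerun abstracts up to maxRuns repeated executions) yields a PR.
func repairLoop(maxAttempts, maxRuns int,
	attemptFix func(attempt int) string,
	rerun func(patch string, runs int) bool) (string, bool) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		patch := attemptFix(attempt)
		if rerun(patch, maxRuns) {
			return patch, true // validated: submit pull request
		}
		// Otherwise retry: vary context selection, prompting, and
		// candidate generation to absorb LLM non-determinism.
	}
	return "", false // give up after maxAttempts
}

func main() {
	// Toy stand-ins: the third attempt produces a passing patch.
	fix := func(a int) string { return fmt.Sprintf("patch-%d", a) }
	pass := func(p string, _ int) bool { return p == "patch-3" }
	patch, ok := repairLoop(18, 1000, fix, pass)
	fmt.Println(patch, ok) // prints "patch-3 true"
}
```

The bounds 18 (attempts) and 1000 (reruns) mirror the figures quoted in the text; everything else is a simplification.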
4. Empirical Results and Qualitative Evaluation
Evaluation on Uber's Go monorepo (6000+ engineers, 1115 reproducible flaky tests over six months) demonstrates the following metrics:
| Approach | Fixed / Reproducible | Success % |
|---|---|---|
| FlakyDoctor | 134 / 295 | 45.4 |
| Agentless + RepoGraph | 159 / 295 | 53.9 |
| AutoCodeRover | 158 / 295 | 53.6 |
| FlakyGuard (selective DCG) | 194 / 295 | 65.8 |
Over the full evaluation, FlakyGuard achieves a repair success rate of 47.6% (380/798) with a developer acceptance rate of 51.8% for patches, outperforming the strongest state-of-the-art baseline by at least 22% in relative terms (Li et al., 18 Nov 2025). Root causes span schedule randomness (37%), unordered-collection iteration (33%), timestamp discrepancies (12%), state pollution (8%), time-window flakiness (7%), and others (3%). All surveyed developers found the generated root cause explanations useful.
Key feedback metrics:
- Fix quality rating (mean 4.42/5)
- Estimated time savings: 57.9% of surveyed developers reported saving at least one day per incident.
5. System Limitations and Open Research Directions
FlakyGuard's architecture, while effective, exposes several principal limitations:
- LLM non-determinism remains a challenge despite multi-pass retries.
- The implementation targets Go with Bazel; generalization requires adaptation of instrumentation and AST patching routines for other languages, although Java prototypes exist.
- The global filtering step may prune essential nodes; adjusting the retention threshold or ensemble-ranking candidates could improve recall.
- Patch acceptance is sensitive to project-specific policies; ~48.2% of patches are rejected due to nonconformance with local helper APIs or style, suggesting that incorporating repository-wide code searches or code embedding-based retrieval could raise acceptance rates.
Proposed future work includes hybrid static-dynamic context modeling, graph-neural relevance pre-selection, knowledge-transfer across projects, and interactive developer-in-the-loop refinements.
6. FlakyGuard in Federated Learning: FL-GUARD for NFL Detection and Recovery
A separate instantiation of the term appears in federated learning research as FL-GUARD (Lin et al., 7 Mar 2024). Here, the framework is a runtime, plug-and-play wrapper for any FL system, designed to detect and mitigate negative federated learning (NFL). FL-GUARD operates by tracking per-client "performance gain" (local improvement over private baseline) in every round, aggregating these with robust statistical smoothing (median, moving average), and triggering local adaptation only when NFL is detected for a threshold number of rounds.
Key mechanisms:
- Per-client gain computation: g_c = a_c − b_c, where a_c is the client's local validation accuracy under the FL model and b_c that of its locally trained baseline.
- Server-side NFL detection by monitoring the aggregate gain (a weighted sum of per-client gains) and switching to a two-model adaptation scheme when necessary:
- Each client maintains (a) the global model w (standard FL), and (b) a personalized adapted model w_c, jointly optimized for local loss and quadratic proximity to w.
- The adaptation's regularization parameter λ is dynamically set as a sigmoidal function of model divergence and gradient alignment, eliminating the need for explicit hyperparameter tuning.
- Experiments with non-IID, poisoned, and differentially private settings (CIFAR-10, Shakespeare) show FL-GUARD efficiently detects NFL within 10–40 rounds, restores client benefit, and maintains positive gain where vanilla FL collapses (Lin et al., 7 Mar 2024).
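Assuming a simple median smoother, a consecutive-round patience window, and a plain sigmoid for the regularization weight (the paper's exact statistics and functional forms may differ), the detection mechanism above can be sketched as:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// gain is one client's round benefit: validation accuracy of the
// current FL model minus that client's standalone local baseline.
func gain(flAcc, baselineAcc float64) float64 {
	return flAcc - baselineAcc
}

// smoothedAggregate applies robust smoothing (here: the median) to
// the per-client gains before the server's NFL test.
func smoothedAggregate(gains []float64) float64 {
	s := append([]float64{}, gains...)
	sort.Float64s(s)
	if n := len(s); n%2 == 1 {
		return s[n/2]
	} else {
		return (s[n/2-1] + s[n/2]) / 2
	}
}

// nflDetected flags negative federated learning once the smoothed
// aggregate gain has stayed below zero for patience consecutive rounds.
func nflDetected(history []float64, patience int) bool {
	if len(history) < patience {
		return false
	}
	for _, g := range history[len(history)-patience:] {
		if g >= 0 {
			return false
		}
	}
	return true
}

// adaptiveLambda sets the proximal regularization weight as a sigmoid
// of model divergence; the exact form here is purely illustrative.
func adaptiveLambda(divergence float64) float64 {
	return 1.0 / (1.0 + math.Exp(divergence))
}

func main() {
	gains := []float64{gain(0.61, 0.70), gain(0.58, 0.66), gain(0.72, 0.69)}
	history := []float64{-0.02, -0.05, smoothedAggregate(gains)}
	fmt.Println("NFL:", nflDetected(history, 3)) // prints "NFL: true"
	fmt.Printf("lambda %.2f\n", adaptiveLambda(0))
}
```

Because the smoothing and patience check run entirely server-side on scalar gains, the adaptation machinery (and its overhead) stays dormant until NFL is actually observed, matching the plug-and-play framing of FL-GUARD.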
Compatibility with robust aggregation (FedProx, TrimmedMean, multi-Krum), avoidance of overhead until NFL is observed, and resilience to uncooperative or limited-capacity clients (via partial deployment) are central strengths.
7. Summary and Impact
FlakyGuard, whether as an industrial code-repair system (Li et al., 18 Nov 2025) or as FL-GUARD for federated learning robustness (Lin et al., 7 Mar 2024), advances automated, scalable solutions for reliability in interactive, distributed environments. In both domains, core principles are:
- Continuous monitoring for latent, non-deterministic failure patterns.
- Targeted, context-sensitive remedial action using automated reasoning (LLM or adaptive optimization) based on succinctly curated context.
- Validation at scale, resulting in significant resource savings and improved operational trust.
These systems illustrate the centrality of context selection, algorithmic efficiency, and adaptive recovery in modern software and learning system reliability.