LLM-Assisted FP Filtering

Updated 26 March 2026
  • LLM-Assisted FP Filtering is an approach that uses large language models to refine automated pipelines by reducing false positives in diverse domains.
  • It employs methods such as chunk-level semantic filtering with dynamic thresholding and agentic alert triage; the largest reported FPR reduction, from about 92% to 6.3%, comes from SAST alert triage.
  • The technique enhances interpretability and precision by integrating LLM-based scoring with agentic frameworks and behavioral filters for improved data quality and robust decision-making.

LLM-Assisted FP Filtering refers to the application of LLMs to reduce or eliminate false positives (FPs) in automated pipelines across information retrieval, data quality management, security, program analysis, and intelligent code completion. These techniques leverage the semantic reasoning and adaptive judgment of LLMs to enhance the specificity, accuracy, and interpretability of filtering mechanisms that previously relied on heuristics or task-specific ML models.

1. Key Concepts and Problem Formulation

A false positive (FP) in automated decision-making refers to a non-relevant, inapplicable, or benign item erroneously classified as relevant, malicious, or otherwise of interest. LLM-assisted FP filtering utilizes LLMs to make finer-grained relevance or suitability judgments, yielding binary (accept/reject), continuous, or probabilistic outputs that refine or replace upstream filtering steps.

Evaluated metrics typically include:

  • False Positive Rate (FPR): $\mathrm{FPR} = \frac{FP}{FP+TN}$
  • Recall / True Positive Rate (TPR): $\mathrm{TPR} = \frac{TP}{TP+FN}$
  • FP Identification Rate (FPIR): $\mathrm{FPIR} = \frac{FP_{\text{initial}} - FP_{\text{residual}}}{FP_{\text{initial}}}$
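
As a minimal sketch, the three metrics above can be computed directly from confusion counts. The `Confusion` container and the function names are illustrative assumptions, not taken from any of the cited works.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts for a binary filter (hypothetical helper type)."""
    tp: int
    fp: int
    tn: int
    fn: int


def fpr(c: Confusion) -> float:
    """False positive rate: FP / (FP + TN)."""
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else 0.0


def tpr(c: Confusion) -> float:
    """Recall / true positive rate: TP / (TP + FN)."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def fpir(fp_initial: int, fp_residual: int) -> float:
    """Share of the initial false positives removed by the filter."""
    if fp_initial <= 0:
        raise ValueError("fp_initial must be positive")
    return (fp_initial - fp_residual) / fp_initial
```

Reporting FPIR next to TPR makes the trade-off visible: a filter that removes most false positives but also lowers TPR is suppressing real findings.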

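Domains of application include information retrieval, data quality management, security, program analysis, and intelligent code completion.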

2. Methodological Frameworks

2.1 Chunk-level and Passage-level Semantic Filtering

Methods such as ChunkRAG segment documents into semantically coherent units (chunks), with downstream filtering at the chunk rather than document level. A typical chunking pipeline computes sentence embeddings, groups sentences by inter-sentential cosine similarity, and regulates chunk length via a character threshold $L_{\text{max}}$, e.g., 500 (Singh et al., 2024). LLMs then assign a relevance score $R_j = f_\phi(Q, C_j)$ to each chunk $C_j$ given a query $Q$, using highly structured prompts (either zero- or few-shot). Chunks are retained or discarded based on dynamic thresholding, with empirical tuning of $\theta$ (similarity threshold), $L_{\text{max}}$, and the filtering cutoff $\tau = \mu_R + \alpha \cdot \sigma_R$, where $\mu_R$ and $\sigma_R$ are read here as the mean and standard deviation of the chunk scores.
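
The chunking and cutoff steps can be sketched as follows. This is an illustration under stated assumptions, not ChunkRAG's implementation: sentence embeddings and LLM relevance scores are supplied by the caller, only adjacent sentences are compared, and the population standard deviation is used for $\sigma_R$. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def chunk_sentences(sentences: List[str],
                    embeddings: List[Sequence[float]],
                    theta: float = 0.7,
                    l_max: int = 500) -> List[str]:
    """Group consecutive sentences while adjacent similarity >= theta
    and the chunk stays within l_max characters."""
    chunks: List[str] = []
    current: List[str] = []
    for i, sent in enumerate(sentences):
        if current:
            similar = cosine(embeddings[i - 1], embeddings[i]) >= theta
            fits = len(" ".join(current)) + 1 + len(sent) <= l_max
            if not (similar and fits):
                chunks.append(" ".join(current))
                current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks


def dynamic_threshold(scores: Sequence[float], alpha: float = 0.0) -> float:
    """tau = mu_R + alpha * sigma_R (population std; an assumption)."""
    return mean(scores) + alpha * pstdev(scores)


def filter_chunks(chunks: List[str],
                  scores: Sequence[float],
                  alpha: float = 0.0) -> List[str]:
    """Keep chunks whose relevance score reaches the dynamic cutoff."""
    tau = dynamic_threshold(scores, alpha)
    return [c for c, r in zip(chunks, scores) if r >= tau]
```

Because the cutoff is derived from the score distribution of the current query, a query with uniformly low scores still keeps its best chunks, whereas a fixed cutoff could discard all of them.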

2.2 LLM-Grounded Judgement in Security and RAG

In security use cases, LLM-assisted FP filtering couples semantic feature extraction (intent, tone, entities) with RAG to ground LLM scam-likelihood assessments in retrieved, curated evidence (Chan et al., 27 Jan 2026). The pipeline typically includes:

  • LLM feature extraction from incoming messages.
  • Retrieval of top-$k$ evidence from labeled corpora.
  • Construction of a prompt concatenating the message with retrieved evidence passages.
  • LLM scoring, optionally fused with average retrieval similarity: $S_{\mathrm{fraud}}(m) = \alpha\,\overline{\mathrm{sim}(m)} + (1-\alpha)\,s_{\mathrm{LLM}}(m)$.
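
The prompt construction and score fusion steps above can be sketched as below. The prompt wording, the function names, and the default $\alpha = 0.5$ are assumptions for illustration; the retrieval similarities and the LLM score are assumed to lie in [0, 1] and to come from components not shown here.

```python
from typing import List, Sequence


def build_prompt(message: str, evidence: List[str]) -> str:
    """Concatenate the message with retrieved evidence passages
    (hypothetical prompt layout)."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(evidence))
    return (
        "Evidence from labeled examples:\n"
        f"{numbered}\n\n"
        f"Message:\n{message}\n\n"
        "Rate the scam likelihood of the message from 0 to 1."
    )


def fraud_score(similarities: Sequence[float],
                s_llm: float,
                alpha: float = 0.5) -> float:
    """S_fraud = alpha * mean(sim) + (1 - alpha) * s_LLM."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if not similarities:
        raise ValueError("at least one retrieval similarity is required")
    mean_sim = sum(similarities) / len(similarities)
    return alpha * mean_sim + (1.0 - alpha) * s_llm
```

Setting `alpha` to 0 recovers the pure LLM judgment, and setting it to 1 recovers a retrieval-only score, so the weight can be tuned on held-out data against the target FPR.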

2.3 LLM Agents in SAST False Positive Reduction

LLM agent frameworks (e.g., Aider, OpenHands, SWE-agent) are deployed for vulnerability alert triage (Xiong et al., 30 Jan 2026). Agentic models utilize multi-turn reason-act loops and can interact with codebases using various developer tools. Compared with vanilla prompting, agentic designs provide markedly improved FP suppression on strong LLM backbones, achieving FPR reductions from approximately 92% to as low as 6.3% in best configurations (OWASP Benchmark, SWE-agent, Claude Sonnet 4). However, aggressive FP filtering may also suppress true positives, particularly in domain-sensitive CWEs (e.g., weak cryptography), resulting in non-trivial trade-offs.

2.4 Behavioral and Line-level Filtering

For LLM code suggestion systems, pre-invocation behavioral filters predict acceptance likelihood based solely on telemetry (typing speed, edit history, help usage, prior acceptances) (Awad et al., 24 Nov 2025). In LLM pretraining data curation, line-level FP filtering is supervised by LLM annotations, later scaled via a DeBERTa-v3 classifier to billions of tokens; lines are labeled clean/non-clean with binary thresholds on classifier outputs (Henriksson et al., 13 Jan 2025).
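
Line-level filtering with a binary threshold can be sketched as follows. The scorer is passed in as a callable that returns the probability that a line is clean; in the cited work that role is played by a trained classifier, whereas the keyword-based `toy_scorer` below is only a stand-in for demonstration, and all names are hypothetical.

```python
from typing import Callable


def filter_lines(text: str,
                 p_clean: Callable[[str], float],
                 threshold: float = 0.5) -> str:
    """Keep only lines whose clean-probability reaches the threshold."""
    kept = [ln for ln in text.splitlines() if p_clean(ln) >= threshold]
    return "\n".join(kept)


def toy_scorer(line: str) -> float:
    """Stand-in scorer (NOT a real classifier): flags obvious web residue."""
    residue = ("cookie", "subscribe", "sign in")
    return 0.1 if any(w in line.lower() for w in residue) else 0.9


if __name__ == "__main__":
    doc = "Water boils at 100 C.\nAccept all cookies\nIce melts at 0 C."
    print(filter_lines(doc, toy_scorer))
```

Raising the threshold removes more residue at the cost of dropping more clean lines, which is the line-level form of the FPR and recall trade-off.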

2.5 Pseudo-Relevance Feedback with LLM Denoising

Hybrid pipelines for information retrieval integrate LLM-based document vetting as a pre-filter to classical pseudo-relevance feedback (e.g., RM3), where only documents accepted by an LLM as relevant are used for expansion (Otero et al., 16 Jan 2026). LLMs are prompted for binary "true/false" relevance decisions. Empirical ablations show that mid-range filter thresholds admit optimal feedback diversity, avoiding topic drift.
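
A minimal sketch of the filter-then-expand idea follows. The `judge` callable stands in for the LLM's true/false relevance decision, and the expansion step is a plain term-frequency count used as a simplified substitute for RM3, not RM3 itself. Function and parameter names are assumptions.

```python
from collections import Counter
from typing import Callable, List


def expand_query(query: str,
                 ranked_docs: List[str],
                 judge: Callable[[str, str], bool],
                 k: int = 10,
                 n_terms: int = 5) -> List[str]:
    """Expand the query with frequent terms from LLM-accepted feedback docs.

    Only the top-k documents that the judge accepts contribute terms, so
    every expansion term is grounded in the corpus rather than generated.
    """
    feedback = [d for d in ranked_docs[:k] if judge(query, d)]
    query_terms = query.lower().split()
    seen = set(query_terms)
    counts: Counter = Counter()
    for doc in feedback:
        counts.update(t for t in doc.lower().split() if t not in seen)
    return query_terms + [t for t, _ in counts.most_common(n_terms)]
```

If the judge rejects every document, the function returns the original query terms unchanged, which avoids expanding from noisy feedback.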

3. Empirical Results and Comparisons

| Task/Domain | Upstream/previous FPR | Post-filter FPR | F1/accuracy gain | Backbone impact | Cost trend |
|---|---|---|---|---|---|
| ChunkRAG (PopQA) | RAG baseline: 50.5% | 64.9% accuracy | +10 pp vs best prior | LLM-based chunk scoring | Latency via batching |
| Security RAG (fraud) | 17.2% (no RAG) | 3.5% (RAG+LLM) | FPR ↓80% | GPT-4, ensembled sim | Linear with retrieval size |
| SAST FP filtering | 92.1% | 6.3% (best) | Recall: ≤93.3% (FP) | Claude, GPT-5 > DeepSeek | Agentic → more compute |
| Web pretrain data filter | – | – | HellaSwag +0.10 | GPT-4o-mini guidance | Scaling via classifier |

Analysis demonstrates LLM-assisted FP filtering achieves substantial reductions in FPR and/or marked accuracy improvements relative to heuristic, document-level, or "blind" baselines. In RAG contexts, hallucination and off-target content are reduced. SAST false positive identification rates exceed 90% in top configurations; however, full automation may incur true positive loss in cryptographic CWEs.

4. Operational Guidelines and Trade-Offs

  • Thresholding and Ranking: Empirical selection of filter thresholds ($\tau$) impacts precision-recall trade-off and context utilization. Dynamic thresholds adapt better than fixed $\tau$ in RAG and retrieval settings.
  • Agent Robustness and Model Choice: Agentic methods yield higher gains with strong backbones (Claude 4, GPT-5). For weak LLMs, vanilla prompting or hybrid baselines can suffice. Agentic workflows are also more computationally intensive.
  • Cost–Benefit Analysis: Filtering can be staged: inexpensive one-shot agents (e.g., Aider) for bulk, agentic or human review for ambiguous or policy-sensitive cases (Xiong et al., 30 Jan 2026).
  • Data and Feature Choice: For line-level data QC, LLM-generated fine-grained labels are collapsed to coarse groups and upscaled via classifier, with operational FPR governed by calibrated probability thresholds (Henriksson et al., 13 Jan 2025). In code suggestion, all features are privacy-preserving and strictly behavioral.
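
The staged filtering described in the cost-benefit item can be sketched as a small routing function. This is an assumption-laden illustration: the `Verdict` type, the confidence field, the 0.8 cutoff, and the choice of CWE-327 (use of a broken or risky cryptographic algorithm) as the policy-sensitive category are all hypothetical, and the `cheap` and `deep` callables stand in for a one-shot triager and a multi-turn agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass
class Verdict:
    """Triage outcome from a model (hypothetical shape)."""
    is_false_positive: bool
    confidence: float  # assumed to lie in [0, 1]


Triager = Callable[[Dict[str, str]], Verdict]


def triage(alert: Dict[str, str],
           cheap: Triager,
           deep: Triager,
           min_confidence: float = 0.8,
           sensitive_cwes: FrozenSet[str] = frozenset({"CWE-327"})) -> str:
    """Route an alert to 'suppress', 'keep', or 'human_review'."""
    # Policy-sensitive categories bypass automated suppression entirely.
    if alert["cwe"] in sensitive_cwes:
        return "human_review"
    verdict = cheap(alert)
    if verdict.confidence < min_confidence:
        # Escalate ambiguous cases to the more expensive agentic pass.
        verdict = deep(alert)
        if verdict.confidence < min_confidence:
            return "human_review"
    return "suppress" if verdict.is_false_positive else "keep"
```

Routing sensitive categories straight to review trades some automation for protection against the true positive loss noted for cryptographic CWEs.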

5. Interpretability, Limitations, and Extensions

Interpretability is a major advantage of LLM-based FP filtering relative to purely neural-generative strategies:

  • In IR/PRF pipelines, only corpus-grounded terms are allowed; LLM filters never generate, only assess, which mitigates hallucination (Otero et al., 16 Jan 2026).
  • Decisions in RAG and security are explainable via model rationale or output scores.
  • For SAST, agentic frameworks provide step-wise rationale and tool invocation traces.

Documented limitations include model miscalibration (both over- and under-suppression of FPs/TNs), category-dependent performance (notably in cryptographic vulnerabilities and policy-driven alerts), susceptibility to LLM biases in data labeling, and increased computational cost for interactive/static analysis agent frameworks. Future directions call for human-in-the-loop stages for ambiguous cases, real-time corpus updates, and adaptation to non-English or low-resource domains.

6. Representative Implementations and Reproducibility

Key architectural motifs are outlined in the cited works:

  • ChunkRAG: NLTK sentence tokenization, transformer embedding, sequential chunking by cosine similarity ($\theta = 0.7$), LLM/fallback embedding scoring, dynamic thresholding, strict context handoff to the generator (Singh et al., 2024).
  • Fraud Pipeline: LLM-based feature extraction, vector retrieval over segmented corpora, grounded LLM judgment, hybrid scoring, significance testing for FPR reduction (Chan et al., 27 Jan 2026).
  • Web Data Filtering: LLM-guided annotation, cluster-based label mapping, DeBERTa-v3 classifier, Platt scaling for threshold tuning, evaluation by clean data uplift on HellaSwag (Henriksson et al., 13 Jan 2025).
  • SAST Agents: Characterization of agentic frameworks by reasoning loop depth, tool access, and interaction pattern. Cost/accuracy trade-offs empirically tabulated (Xiong et al., 30 Jan 2026).
  • Code Suggestion Pre-Filter: Client-side CatBoost classifier on aggregated behavioral telemetry, millisecond-scale latency, strict privacy/no code inspection (Awad et al., 24 Nov 2025).
  • PRF with LLM Filtering: Prompt-driven binary (true/false) document acceptance, RM3 estimation on filtered set, empirical ablations for filter threshold and $k$ (Otero et al., 16 Jan 2026).
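
For the threshold-tuning step mentioned under Web Data Filtering, Platt scaling maps raw classifier scores to calibrated probabilities through a fitted sigmoid. The sketch below fits $p = \sigma(a s + b)$ by plain gradient descent on log-loss; this sign convention, the optimizer, and its hyperparameters are assumptions chosen for brevity rather than details from the cited work.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores: Sequence[float],
              labels: Sequence[int],
              lr: float = 0.1,
              steps: int = 2000) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by gradient descent on log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # d(log-loss)/dz for one sample
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrated_prob(score: float, a: float, b: float) -> float:
    """Calibrated probability for a raw classifier score."""
    return sigmoid(a * score + b)
```

Once scores are calibrated, an operating threshold can be chosen on held-out data to meet a target FPR instead of being set on the raw score scale.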

A plausible implication is that LLM-assisted FP filtering is transforming precision-critical automation in IR, security, data quality, and developer tooling, but operational deployment requires nuanced calibration of thresholds, agent configuration, and reviewer composition, tailored to workload and domain.
