
LogSieve: Optimized Log Reduction & Sparse Convolution

Updated 29 January 2026
  • LogSieve is a dual-framework system that reduces CI logs and optimizes sparse convolution by eliminating noise while preserving key information.
  • It uses a two-stage process with regex boilerplate removal followed by embedding classification to maintain diagnostic fidelity.
  • In sparse algorithmics, it applies number theory to achieve near-optimal hash collision rates for efficient convolution and pattern matching.

LogSieve refers to two distinct technical frameworks introduced in recent literature: one for semantics-preserving reduction of Continuous Integration (CI) logs for sustainable LLM analysis (Barnes et al., 28 Jan 2026), and another as an analytic number theory–inspired algorithmic toolkit for efficient sparse convolution and related combinatorial problems (Jin et al., 2024). Both share a core objective: eliminating “noise” or “redundancy” to accelerate downstream computation while maximizing information retention. They deploy different methodologies, however, and target divergent domains.

1. Formal Problem Statement and Optimization Objectives

In the CI log analysis context (Barnes et al., 28 Jan 2026), the input is an unstructured log $L = \{\ell_1, \ell_2, \ldots, \ell_N\}$ generated by automated builds. The goal is to derive a relevance mask $R: L \rightarrow \{0,1\}$ and an associated reduction $f(L) = \{\ell_i \in L \mid R(\ell_i) = 1\}$, satisfying $|f(L)| \ll |L|$ while preserving "diagnostic information." The challenge is to optimize the tradeoff between aggressiveness of reduction (measured in lines/tokens removed) and semantic fidelity for Root-Cause Analysis (RCA) tasks.

The LLM inference cost is linearly proportional to the number of input tokens, so reducing input log size directly translates to lower computational cost and energy consumption:

$$\Delta \mathrm{Energy} \approx \delta\,C_{\text{in}}\,T_{\text{in}}; \qquad \Delta\mathrm{CO}_2 \approx \delta\,C_{\text{in}}\,T_{\text{in}} \times \mathrm{CI}_{\mathrm{grid}}$$

where $\delta$ is the fraction of removed tokens, $C_{\text{in}}$ denotes per-token energy, $T_{\text{in}}$ is the original input token count, and $\mathrm{CI}_{\mathrm{grid}}$ is the grid emission factor.

In the analytic number theory–driven LogSieve for algorithms (Jin et al., 2024), the aim is to hash integer sets via modular arithmetic while minimizing collision rates in "buckets" for efficient sparse convolution, Hamming distance, and similar tasks. The key optimization reduces expected hash collisions from $O(\log N / Q)$ (for naive random-prime hashing) to $O(1/Q)$, approaching the lower bound.
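The random-prime hashing scheme behind these bounds is easy to simulate. The sketch below hashes a sparse integer set modulo a random prime $p \in [Q/2, Q]$ and inspects the resulting bucket loads; the parameter choices and helper names are illustrative, not taken from the paper:

```python
import random
from collections import Counter

def primes_in(lo, hi):
    """Sieve of Eratosthenes, restricted to the range [lo, hi]."""
    sieve = [True] * (hi + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(hi ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [q for q in range(lo, hi + 1) if sieve[q]]

def hash_buckets(xs, Q, rng):
    """Hash each x to x mod p for a uniformly random prime p in [Q/2, Q]."""
    p = rng.choice(primes_in(Q // 2, Q))
    return Counter(x % p for x in xs), p

rng = random.Random(0)
xs = rng.sample(range(10 ** 6), 200)   # a sparse integer support set
buckets, p = hash_buckets(xs, Q=1000, rng=rng)
# With ~p buckets for 200 keys, the maximum bucket load stays small.
```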

2. Methodology and Architectural Components

2.1 LogSieve for CI Log Reduction

This instance of LogSieve employs a two-stage pipeline:

  • Stage 1: Heuristic Boilerplate Removal
    • Removes lines matching low-information patterns via regexes, eliminating timestamps, progress bars, and environment dumps.
  • Stage 2: Embedding-Based Relevance Classification
    • For each remaining line $\ell_i$, compute an embedding $e_i = \mathrm{Embed}(\ell_i)$ (supported models include BERT, TF-IDF, and LLaMA3).
    • Apply a logistic regression classifier to yield a relevance score $s_i = \sigma(w^\top e_i + b)$. A threshold $\theta$ determines retention: $R(\ell_i) = 1$ iff $s_i \geq \theta$.
  • Algorithmic Structure:

def LogSieve_Filter(Log_L, theta):
    # Stage 1: drop boilerplate lines (timestamps, progress bars, env dumps)
    Stage1 = [l for l in Log_L if not matchesBoilerplate(l)]
    # Stage 2: keep lines whose relevance score sigma(w^T e + b) clears theta
    Reduced = [l for l in Stage1 if sigmoid(w.T @ Embed(l) + b) >= theta]
    return Reduced
Embedding models and classifier parameters are tunable.
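As a concrete, hypothetical instantiation of Stage 2, the toy below trains a hand-rolled logistic regression on bag-of-words line embeddings. The vocabulary, training lines, and hyperparameters are invented for illustration and are far simpler than the BERT/TF-IDF/LLaMA3 embeddings the paper evaluates:

```python
import numpy as np

def embed(line, vocab):
    """Bag-of-words embedding: token counts over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    for tok in line.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

def train_logreg(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression; returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

lines = ["error nullpointerexception at build step",
         "downloading dependency progress 42 percent",
         "fatal error compilation failed",
         "info fetching cache progress done"]
labels = np.array([1.0, 0.0, 1.0, 0.0])   # 1 = diagnostically relevant
vocab = {t: i for i, t in enumerate(sorted({t for l in lines for t in l.split()}))}
X = np.array([embed(l, vocab) for l in lines])
w, b = train_logreg(X, labels)
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # relevance scores s_i
```

Retention then reduces to thresholding `scores` against $\theta$, exactly as in the pipeline above.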

2.2 LogSieve in Sparse Algorithmics (Large Sieve Approach)

  • Collision Control via Large Sieve Inequality:
    • Leverages the large sieve from analytic number theory:

    $$\sum_{\alpha \in \mathcal{X}} \left|\sum_{n=0}^{N-1} a_n e(\alpha n)\right|^2 \leq (\delta^{-1} + N)\sum_{n=0}^{N-1}|a_n|^2$$

    • Achieves $O(1/Q)$ collision probability by selecting primes $p \in [Q/2, Q]$ and hashing $x \bmod p$; ensures bucket sizes of $O(\ln\ln N)$ given $Q \approx a/\ln\ln N$ for support size $a$.

  • Sparse Convolution via Peeling and Prony’s Method:

    • Iteratively recovers nonzero outputs by isolating "light" buckets (low-collision) using carefully selected moduli.
    • Recovers polynomial coefficients via Prony's method with moment-matrix sparsity tests and lifting via precomputed tables.
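Prony's method itself can be demonstrated in isolation. Given the first $2t$ moments $p_k = \sum_j c_j z_j^k$ of a $t$-sparse signal, an annihilating-polynomial solve recovers the supports and a Vandermonde solve recovers the coefficients. This plain floating-point sketch omits the paper's exact-arithmetic refinements (moment-matrix sparsity tests, table-based lifting):

```python
import numpy as np

def prony(p, t):
    """Recover supports z_j and coefficients c_j from the moments
    p_k = sum_j c_j * z_j**k, k = 0 .. 2t-1 (exactly t-sparse case)."""
    # Annihilating polynomial: sum_{m=0}^{t-1} a_m p_{k+m} = -p_{k+t}
    H = np.array([[p[i + j] for j in range(t)] for i in range(t)])
    a = np.linalg.solve(H, -np.array(p[t : 2 * t]))
    # Roots of x^t + a_{t-1} x^{t-1} + ... + a_0 are the supports z_j
    supports = np.roots(np.concatenate(([1.0], a[::-1])))
    # Transposed Vandermonde solve recovers the coefficients c_j
    V = np.vander(supports, N=2 * t, increasing=True).T
    coeffs, *_ = np.linalg.lstsq(V, np.array(p), rcond=None)
    return supports, coeffs

# Example: p_k = 2**k + 3**k is 2-sparse with supports {2, 3}
supports, coeffs = prony([2.0, 5.0, 13.0, 35.0], t=2)
```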

3. Quantitative Performance and Empirical Results

CI Log Reduction

On 20 open-source Android repositories (14,646 lines, labeled with agreement $\kappa = 0.80$), LogSieve demonstrated:

  • Line reduction: $42\%$ (from 732 to 350 lines on average)
  • Token reduction: $40\%$ (from 26,543 to 13,355 tokens)
  • Semantic fidelity:
    • Cosine similarity: $0.93$
    • GPTScore: $0.93$
    • Exact-match accuracy (failure categorization): $80\%$
  • Classifier accuracy: up to $97\%$; weighted F1: $0.97$
  • Efficiency: directly proportional reduction in LLM inference energy; e.g., removing $13{,}188$ tokens saves $\approx 1.32\,\mathrm{J}$ per run.
  • Comparative Results (vs. LogZip, Random Removal):
    Approach   CosSim   GPTScore   Exact-Match
    LogSieve   0.93     0.93       0.80
    Random     0.90     0.86       0.70
    LogZip     0.70     0.41       0.20

Statistical significance: paired $t$-test ($p < 0.01$); McNemar's test on categorization ($\chi^2 = 4.8$, $p \approx 0.03$) (Barnes et al., 28 Jan 2026).

Sparse Convolution and Pattern Matching

The LogSieve framework in combinatorial algorithms produced:

  • Sparse Nonnegative Convolution: Las Vegas $O(t \log t)$ time, recovering all $t$ nonzeros with high probability.
  • Text-to-Pattern Hamming: deterministic $O(n \sqrt{m \ln\ln m})$ time for length-$m$ patterns in length-$n$ texts.
  • General Sparse Convolution: Monte Carlo $O(t \log t)$ time for $N \leq t^{1.99}$ (Jin et al., 2024).

4. Technical Metrics and Fidelity Measures

  • Line and Token Reduction:

$$\mathrm{LineRed} = 1 - \frac{|f(L)|}{|L|}, \qquad \mathrm{TokenRed} = 1 - \frac{T_{\mathrm{kept}}}{T_{\mathrm{orig}}}$$

  • Semantic similarity: cosine similarity in embedding space and GPTScore (both normalized to $[0,1]$).
  • Categorization (Exact-match):

$$\mathrm{EM} = \frac{\#\{\text{label}_{\mathrm{full}} = \text{label}_{\mathrm{red}}\}}{\text{total cases}}$$

  • Energy/CO$_2$ impact:

$$\Delta C_{\mathrm{in}} = p_{\mathrm{in}} \frac{\Delta T_{\mathrm{in}}}{1000}, \qquad \Delta \mathrm{CO}_2 = \Delta E \times \mathrm{CI}_{\mathrm{grid}}$$
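A quick sanity check on these formulas, using the averages reported in Section 3. The per-1000-token energy $p_{\mathrm{in}}$ and the grid factor are assumed values, chosen so the result matches the $\approx 1.32\,\mathrm{J}$ figure above; they are not constants from the paper:

```python
# Energy model: Delta_C_in = p_in * Delta_T_in / 1000
p_in = 0.1                        # J per 1000 input tokens (assumed value)
delta_T = 13188                   # tokens removed per run (reported average)
delta_E = p_in * delta_T / 1000   # energy saved per run, in joules

# CO2 model: Delta_CO2 = Delta_E * CI_grid
ci_grid = 400 / 3.6e6             # gCO2 per joule, assuming a 400 gCO2/kWh grid
delta_co2 = delta_E * ci_grid     # grams of CO2 avoided per run
```

Per run the savings are tiny, but they scale linearly with the number of CI builds analyzed.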

5. Integration, Deployment, and Case Studies

LogSieve is natively integrable as a GitHub Action or CLI step, compatible with CI infrastructures. A canonical YAML deployment for GitHub Actions enables sequential build, logging, log reduction using LogSieve (with user-selectable classifier and threshold), and subsequent LLM-based analysis. The system loads embedding models and classifier artifacts, streams input logs line-wise, applies boilerplate filtering, computes embeddings for remaining lines in batches, classifies, and writes filtered output (Barnes et al., 28 Jan 2026).
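A workflow along the lines described might look as follows; the `logsieve` CLI name, its flags, and the analysis step are hypothetical placeholders, since the paper's exact action interface is not reproduced here:

```yaml
# Hypothetical GitHub Actions sketch; command names and flags are illustrative.
on: [push]
jobs:
  build-and-analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and capture log
        run: ./gradlew assembleDebug 2>&1 | tee build.log
      - name: Reduce log with LogSieve
        run: logsieve --input build.log --output reduced.log --classifier tfidf-logreg --threshold 0.5
      - name: LLM-based failure analysis
        run: analyze-llm --log reduced.log --task root-cause
```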

6. Discussion, Limitations, and Future Directions

LogSieve’s strengths include high semantic fidelity (up to $93\%$ similarity), substantial token/line reductions, and superior performance over baseline reduction strategies. However, context loss may occur in rare scenarios, especially when relevant multi-line structures (e.g., split stack traces) cross filter thresholds. Embedding classifiers exhibit $\sim 3\%$ error (false negatives/positives). Potential mitigations include confidence-based flagging of ambiguous lines ($0.4 \leq s_i \leq 0.6$).
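The confidence-based flagging mitigation can be sketched directly; the $[0.4, 0.6]$ band comes from the text, while the function name and return shape are illustrative:

```python
def triage(lines, scores, lo=0.4, hi=0.6):
    """Split lines into kept / human-review / dropped by classifier score."""
    kept    = [l for l, s in zip(lines, scores) if s > hi]
    review  = [l for l, s in zip(lines, scores) if lo <= s <= hi]
    dropped = [l for l, s in zip(lines, scores) if s < lo]
    return kept, review, dropped

kept, review, dropped = triage(
    ["FATAL: build failed", "Step 3/9: downloading", "Exception in thread main"],
    [0.95, 0.12, 0.55],
)
# Lines in the ambiguous band are routed to a human instead of being discarded.
```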

Future enhancements may involve adaptive thresholds tailored to failure types or CI stages, fusion with structural log parsers, evaluation across broader CI ecosystems, and comprehensive measurement of energy/carbon impact in real-world pipelines. Human-in-the-loop interpretability assessment and full production deployment monitoring are active areas for further investigation.

7. Significance Across Domains

LogSieve, in both CI log reduction and algorithmic applications, systematically refines information streams for downstream reasoning or computation. In CI, it facilitates deeper LLM-based analysis at reduced energy and cost, promoting sustainability at scale. In algorithmics, it tightly integrates analytic number theory—through the large sieve inequality—with randomized and deterministic hashing schemes, providing theoretical guarantees and practical speedups for sparse convolution and pattern matching problems. Both frameworks underscore the cross-domain utility of "sieving" processes to optimize for relevance, efficiency, and task-specific fidelity.


References:

  • "LogSieve: Task-Aware CI Log Reduction for Sustainable LLM-Based Analysis" (Barnes et al., 28 Jan 2026)
  • "Shaving Logs via Large Sieve Inequality: Faster Algorithms for Sparse Convolution and More" (Jin et al., 2024)
