
Greedy Inference Methods

Updated 3 February 2026
  • Greedy Inference Methods are algorithms that construct solutions by iteratively selecting the locally optimal action based on a problem-specific gain function.
  • They achieve provable approximation guarantees, such as the (1-1/e) bound in submodular maximization, ensuring efficient performance in tasks like structure learning and sparse recovery.
  • Their computational efficiency and adaptability enable practical applications across combinatorial optimization, Bayesian inference, decision tree construction, and subword tokenization in NLP.

Greedy Inference Methods are a class of algorithms characterized by constructing a solution in an iterative, myopic fashion—selecting at each step the locally optimal action according to a problem-specific criterion, with no backtracking or global optimization. They have established theoretical and practical efficacy across a wide range of inference problems, including combinatorial optimization, structure learning in graphical models, sparse estimation, approximate inference in probabilistic models, decision-tree construction, Bayesian inference via transport-map compositions, and subword tokenization in NLP. Greedy inference is often computationally efficient and amenable to provable approximation guarantees, sometimes matching or exceeding the sample-complexity and statistical efficiency of more expensive convex or global methods.

1. Core Principles and Variants

At the heart of greedy inference is iterative local optimization: at each step, the algorithm augments (or occasionally prunes) the current partial solution by the action that appears to yield the greatest marginal progress with respect to a monotonic or submodular score function.
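This selection loop can be sketched generically; `candidates`, `gain`, and `budget` are illustrative names, and the coverage demo below is toy data, not taken from any cited paper.

```python
def greedy_inference(candidates, gain, budget):
    """Generic greedy loop: repeatedly add the candidate with the
    largest marginal gain, stopping once nothing improves the objective."""
    solution = set()
    for _ in range(budget):
        remaining = [c for c in candidates if c not in solution]
        if not remaining:
            break
        best = max(remaining, key=lambda c: gain(solution, c))
        if gain(solution, best) <= 0:  # no candidate makes progress
            break
        solution.add(best)
    return solution

# Toy instantiation: maximum coverage (a monotone submodular objective).
sets = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
def covered(sol):
    return set().union(*(sets[s] for s in sol)) if sol else set()
def gain(sol, c):
    return len(sets[c] - covered(sol))
```

With a budget of 2 this picks "a" and then "b", since "b" still adds one uncovered element after "a" is chosen.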

  • Selection Rule: The choice at each iteration is dictated by a problem-specific gain function, often expressible as a marginal improvement in a well-defined objective (e.g., submodular maximization, decrease in loss, increase in coverage).
  • Variants: Greedy inference methods appear in several algorithmic forms:
    • Forward greedy: iteratively add the best candidate (classic in submodular and sparse structure selection).
    • Backward greedy: iteratively remove the least useful element (pruning).
    • Double greedy: simultaneous consideration of forward and backward moves, especially in unconstrained submodular maximization (Hemmi et al., 2022).
    • Block or parallel greedy: update multiple elements at each step for efficiency/speedup (e.g., block variants in Kaczmarz/greedy sparse regression (Zhang et al., 2020, Sancetta, 2016)).
    • Hybrid or breadth/depth phased: interleaving multiple types of greedy moves to avoid poor local optima, e.g., alternating between edge and turn-phases in causal structure learning (Linusson et al., 2021).

Approximation guarantees of greedy methods usually rely on the monotonicity and (often) submodularity of the objective function, which yield classic bounds such as the (1-1/e)-approximation for monotone submodular maximization.
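The bound can be checked numerically on a small maximum-coverage instance (hypothetical data); coverage is monotone and submodular, so the cardinality-constrained greedy solution must come within a (1-1/e) factor of the brute-force optimum.

```python
from itertools import combinations
import math

sets = {
    "s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6},
    "s4": {1, 6}, "s5": {2, 5, 7},
}

def coverage(chosen):
    return len(set().union(*(sets[s] for s in chosen))) if chosen else 0

def greedy(k):
    picked = []
    for _ in range(k):
        rest = [s for s in sets if s not in picked]
        picked.append(max(rest, key=lambda s: coverage(picked + [s])))
    return picked

k = 2
opt = max(coverage(c) for c in combinations(sets, k))  # brute-force optimum
got = coverage(greedy(k))
# The (1 - 1/e) guarantee: got >= (1 - 1/e) * opt.
```

On this tiny instance greedy in fact attains the optimum, but the inequality is what holds in the worst case.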

2. Greedy Inference in Structure Learning and Graphical Models

  • Causal Discovery: Greedy edge-walks (GES, GIES, MMHC) can be understood as simplex-type algorithms walking on the edge-graph of a convex polytope representing Markov equivalence classes (the characteristic imset polytope, \operatorname{CIM}_p) (Linusson et al., 2021). Each move (addition, deletion, or reversal of an edge) corresponds to moving to an adjacent vertex with a higher score (e.g., BIC). Greedy CIM and skeletal greedy CIM, leveraging a complete local characterization of edge and turn pairs, provably generalize previous algorithms, with the skeletal greedy CIM showing superior empirical recovery of the true Markov equivalence class in synthetic and real-data settings.
  • Permutation-based Causal Inference: Greedy permutation search on the DAG-associahedron contracts the search space to a polytope whose vertices encode minimal I-maps consistent with permutations. Local flips (covered edge reversals) yield provably consistent and computationally scalable algorithms in both low- and high-dimensional regimes, especially under faithfulness or BIC scoring (Solus et al., 2017).
  • Sparse Inverse Covariance Estimation: Forward-backward greedy algorithms for Gaussian graphical model recovery achieve sparsistency with sample complexity O(d\log p) under much weaker conditions than \ell_1-regularized methods (graphical lasso), and exhibit improved tolerance to strong correlations (Johnson et al., 2011).
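The forward stage of such algorithms can be sketched in orthogonal-matching-pursuit style (a close relative of the forward-backward methods cited above; the backward pruning step is omitted, and the data below are synthetic):

```python
import numpy as np

def forward_greedy(X, y, k):
    """Forward greedy support selection: each step adds the feature most
    correlated with the current residual, then refits by least squares."""
    support, residual = [], y.copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)
        scores[support] = -np.inf          # never re-pick a chosen feature
        support.append(int(np.argmax(scores)))
        beta, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ beta
    return sorted(support)

# Synthetic noiseless sparse regression with true support {0, 3}.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3]
```

With 100 samples and a well-conditioned design, the two true features dominate the correlation scores and the support is recovered exactly.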

3. Greedy Algorithms for Combinatorial and Bayesian Inference

  • Decision Tree Construction from Decision Rule Systems: A greedy polynomial-time algorithm exists for simulating a decision tree from a set of decision rules. The method recursively covers the maximal-length rules by solving a greedy set-cover subproblem at each step (selecting attributes with maximal rule overlap), yielding a decision tree of provable depth and polynomial runtime (Durdymyradov et al., 2024).
  • MAP Inference in Determinantal Point Processes: Greedy algorithms yield high-quality approximations in submodular MAP problems such as DPP MAP inference (Chen et al., 2017, Hemmi et al., 2022, Han et al., 2017). Key computational innovations include:
    • Incremental Cholesky updates: Allowing O(N^2 M) or O(M^3) implementations for long sequences, with extensions for sliding-window diversity constraints.
    • Lazy and fast combinations: Exploiting submodularity with priority-queue–based marginal gain evaluation to dramatically reduce evaluations, and combining with fast rank-one updates for further acceleration (Hemmi et al., 2022).
    • Block and batch updates: Approximate the most beneficial sets in few steps via stochastic trace estimates or Taylor expansions (Han et al., 2017).
  • Greedy Motzkin-Kaczmarz Methods: For solving large linear systems, greedy selection of directions (rows with maximal residual) yields iterative methods with nearly optimal convergence factors compared to randomized and classic Kaczmarz algorithms. Block greedy variants further accelerate convergence on overdetermined or sparse systems (Zhang et al., 2020).
  • Bayesian Inference via Compositions of Greedy Lazy Maps: In high-dimensional posterior approximation, greedy identification of active subspaces via score-covariance (KL-bound-driven selection) and composition of shallow transport maps yields improved convergence and accelerated training versus base flows, with empirical gains in MCMC preconditioning and accuracy (Brennan et al., 2019).
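The lazy-greedy acceleration mentioned above can be sketched with a priority queue: under submodularity a candidate's marginal gain only shrinks as the solution grows, so a popped entry whose bound was computed in the current round is already the exact argmax and no other candidate needs re-evaluation (toy coverage objective, illustrative only):

```python
import heapq

def lazy_greedy(items, marginal_gain, k):
    solution = []
    # Max-heap of (negated gain upper bound, item, round it was evaluated in).
    heap = [(-marginal_gain(solution, x), x, 0) for x in items]
    heapq.heapify(heap)
    rnd = 0
    while len(solution) < k and heap:
        neg_gain, x, stamp = heapq.heappop(heap)
        if stamp == rnd:          # bound is fresh, so x is the true argmax
            solution.append(x)
            rnd += 1
        else:                     # stale bound: re-evaluate and push back
            heapq.heappush(heap, (-marginal_gain(solution, x), x, rnd))
    return solution

# Toy coverage objective for demonstration.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
def covered(sol):
    return set().union(*(sets[s] for s in sol)) if sol else set()
def mg(sol, x):
    return len(sets[x] - covered(sol))
```

The speedup comes from skipping re-evaluations: most candidates never reach the top of the heap once their stale bound falls below the current leader.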

4. Greedy Methods in Model Selection, Variable Selection, and Tokenization

  • High-dimensional Prediction and Sparse Regression: Greedy algorithms (pure, orthogonal, relaxed, constrained, and projection-type) for prediction/variable selection achieve minimax or near-minimax adaptation rates under weak dependence and unbounded regressors (Sancetta, 2016). The Frank-Wolfe and constrained-greedy variants are especially efficient, yielding fast Lasso-type solutions without inner convex optimization.
  • Subword Tokenization in NLP: Greedy inference algorithms, particularly longest-prefix and longest-token variants, are the de facto decoding protocols for subword tokenizers such as BPE, WordPiece, and UnigramLM. Controlled evaluations show that greedy segmentation achieves state-of-the-art morphological alignment, encodes human cognitive plausibility in predicted complexity, and attains near-optimal information-theoretic efficiency relative to dynamic-programming–based or merge-order methods (Uzan et al., 2024). Empirically, longest-prefix is especially effective in aligning token boundaries to gold-standard morphemes, suggesting that the memoryless property of greedy decoding is well matched to the morphological structure of most languages.
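A minimal sketch of longest-prefix greedy decoding over a toy vocabulary (the vocabulary here is hand-picked for illustration; real BPE/WordPiece vocabularies are learned from corpora):

```python
def greedy_longest_prefix(word, vocab, unk="<unk>"):
    """At each position, emit the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try longest candidates first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:                               # nothing matched at position i
            tokens.append(unk)
            i += 1
    return tokens

vocab = {"un", "relat", "related", "ed", "able", "ness"}
```

On "unrelatedness" this yields ["un", "related", "ness"], preferring the longer "related" over "relat" at each step — the memoryless, left-to-right behavior described above.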

5. Theoretical Guarantees and Complexity

  • Approximation Factors: For submodular maximization, greedy algorithms often guarantee (1-1/e)-optimality for the cardinality-constrained case, and 1/2 for unconstrained double-greedy (Hemmi et al., 2022). In DPP MAP inference, further error due to log-det approximation can be bounded explicitly (Han et al., 2017).
  • Complexity: Greedy inference is typically polynomial in the ambient dimension or sample size under favorable problem structure (e.g., O(p \cdot 2^{p-1}) per move for causal edge-walks on \operatorname{CIM}_p; O(N^2 M) for Cholesky-based DPP greedy MAP), and admits further reductions via lazy evaluation, fast matrix updates, and block processing (Hemmi et al., 2022, Chen et al., 2017).
  • Sample Complexity: In high-dimensional graphical structure estimation, greedy methods achieve O(d \log p) sample sizes under restricted eigenvalue and smoothness conditions significantly milder than the irrepresentable conditions required by \ell_1-methods (Johnson et al., 2011).
  • Consistency Results: For permutation-based and polytope-based greedy causal discovery, pointwise and high-dimensional uniform consistency are achievable under faithfulness and sparsity, with rigorous error and convergence analyses (Solus et al., 2017, Linusson et al., 2021).

6. Empirical Performance and Application Domains

Empirical studies consistently find greedy inference highly competitive or superior, especially in regimes where problem size, sparsity, or underlying graphical constraints favor local incremental improvements:

  • Causal Structure Learning: Skeletal greedy CIM outperforms hybrid and constraint-based algorithms in recovering the true Markov equivalence class; breadth/depth-phase alternations can further match or outperform classical GIES (Linusson et al., 2021).
  • MAP Inference with DPPs: Fast greedy methods enable millisecond-latency, high-diversity real-time recommendation at scale (e.g., in the Netflix and MillionSong datasets) with small or negligible accuracy loss (Chen et al., 2017, Hemmi et al., 2022).
  • Sparse Network Recovery: Greedy support selection achieves successful graph recovery at much lower sample sizes than convex \ell_1 benchmarks, with empirical phase transition curves matching theory (Johnson et al., 2011).
  • Tokenization: Greedy segmentation protocols outstrip dynamic programming and merge-consistent inference for subword boundary alignment without added computational cost (Uzan et al., 2024).
  • Bayesian Posterior Approximation: Greedy lazy-map composition yields lower KL divergence and higher effective sample size in high- and ultra-high-dimensional inverse and regression problems (Brennan et al., 2019).
  • Monte Carlo Inference: Greedy importance sampling, coupling local search with IS, effectively targets high-contribution regions for substantial variance reduction, outperforming vanilla IS, rejection sampling, and Gibbs/MCMC on a variety of synthetic and real distributions (Schuurmans et al., 2013).

7. Limitations, Extensions, and Future Directions

  • Worst-case Performance: Some greedy methods can be trapped in local optima or exhibit exponential worst-case cost, especially in the absence of submodularity or strong convexity. For instance, the exponential size of the DAG or causal structure polytopes limits scaling, though practical implementations with bounded depth/restarts remain feasible for large problems (Solus et al., 2017).
  • Extensions: Advanced greedy strategies employ breadth-first, phased, or hybrid selection rules to escape poor basins, integrate block updates, or exploit problem-specific structural decompositions (e.g., active subspaces, conditional independence constraints).
  • Theoretical Open Directions: Ongoing research seeks to further characterize the edge-structure of relevant polytopes in graphical learning, to generalize greedy transport maps to non-Gaussian and non-linear Bayesian inference, and to leverage greedy-block methods in broader nonparametric settings.

Greedy inference continues to play a foundational and evolving role across statistical learning, Bayesian inference, combinatorial optimization, and language processing, combining algorithmic tractability with strong theoretical and empirical performance guarantees.
