Info-Theoretic Recovery Thresholds
- Information-Theoretic Recovery Thresholds define the minimal sample sizes, SNR levels, or measurement conditions needed for accurate recovery of structured objects from noisy or incomplete data.
- They employ statistical tools such as union bounds, large deviation techniques, and Fano's inequality to set sufficiency and necessity conditions across models like sparse recovery, community detection, and graph alignment.
- Practical implications include guiding sensor design for optimal data quality, clarifying the statistical vs. computational gaps, and improving inference in high-dimensional settings.
Information-theoretic recovery thresholds delineate the minimal requirements (sample sizes, measurement regimes, or SNR conditions) under which structured objects such as supports, communities, permutations, or network structures can be reliably recovered from noisy or incomplete data, regardless of computational complexity. These thresholds are central to signal processing, statistical learning, network science, and high-dimensional statistics, because they mark the fundamental phase transitions of inference problems. This article surveys canonical models, formal threshold criteria, core methods for proving sufficiency and necessity, and key regimes where information-theoretic and algorithmic recovery diverge. It emphasizes sparse signal recovery from heterogeneous data in the mixed-quality setting (Chaabouni et al., 11 May 2026), alongside representative theorems across structured models.
1. Canonical Models and Problem Settings
Information-theoretic thresholds arise in problems where an unknown combinatorial structure generates observable data corrupted by randomness, noise, or partial observation. Prototypical examples include:
- Sparse vector recovery: Identifying the support of a $k$-sparse vector $\beta \in \mathbb{R}^p$ from $n$ linear measurements $y = X\beta + w$, often under i.i.d. Gaussian noise $w \sim \mathcal{N}(0, \sigma^2 I_n)$. Variants include mixed-quality settings, where measurements have heterogeneous noise variances (Chaabouni et al., 11 May 2026), [0702301].
- Community detection in Stochastic Block Models (SBM): Partitioning nodes into communities given observed edge weights or adjacency matrices, with information-theoretic thresholds in terms of Rényi divergence (Jog et al., 2015).
- Graph alignment: Recovering a hidden permutation between two correlated Erdős–Rényi graphs under specified correlation structure (Cullina et al., 2017, Huang et al., 2024).
- Rank minimization: Recovering a low-rank matrix from linear projections, with thresholds for nuclear norm minimization established via geometric and information-theoretic criteria (Oymak et al., 2010).
- Recovery in structured networks: Identification of parent sets in diffusion networks (Park et al., 2016), edge recovery in dynamic models (Du et al., 7 Apr 2025), and community detection in higher-order (hypergraph) SBM (Liang et al., 2021).
Each model defines a hypothesis class (e.g., all supports of a given size, all community assignments, all permutations) and an information channel that maps the hidden structure to the observed data. The threshold is the minimal value of a problem parameter (samples, SNR, etc.) above which recovery succeeds with probability tending to one as the problem size grows.
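To make the sparse, mixed-quality setting concrete, the following is a minimal simulation sketch. It assumes a Gaussian design matrix, unit-magnitude nonzero entries, and exactly two noise levels; the function name `sample_mixed_quality` and all parameter names are illustrative choices, not notation taken from the cited works.

```python
import numpy as np

def sample_mixed_quality(p, k, n_high, n_low, sigma_high, sigma_low, rng):
    """Draw one instance of a two-quality sparse linear model (illustrative)."""
    # Hidden structure: a uniformly random support of size k.
    support = np.sort(rng.choice(p, size=k, replace=False))
    beta = np.zeros(p)
    beta[support] = 1.0  # unit-magnitude nonzeros: an arbitrary assumption

    n = n_high + n_low
    X = rng.standard_normal((n, p))  # i.i.d. Gaussian design (assumption)

    # The first n_high rows are high quality, the remaining rows low quality.
    sigmas = np.concatenate([np.full(n_high, sigma_high),
                             np.full(n_low, sigma_low)])
    y = X @ beta + sigmas * rng.standard_normal(n)
    return X, y, support, sigmas

rng = np.random.default_rng(0)
X, y, support, sigmas = sample_mixed_quality(
    p=20, k=3, n_high=10, n_low=30, sigma_high=0.1, sigma_low=1.0, rng=rng)
```

Here the hypothesis class is the set of all size-`k` subsets of `range(p)`, and the channel is the map from `support` to `(X, y)`.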
2. Formal Threshold Criteria: Sufficient and Necessary Conditions
Thresholds are typically characterized in terms of two-sided bounds:
- Sufficiency: There exists a decoder (possibly of unbounded complexity) that achieves recovery with failure probability vanishing as the problem size grows, provided the resources (e.g., samples, SNR) exceed a critical value.
- Necessity: For any decoder, failure probability is bounded away from zero (or tends to one, i.e., strong impossibility) if resources are below the critical value.
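A decoder "of unbounded complexity" can be pictured as an exhaustive search over the whole hypothesis class. The sketch below scores every size-`k` support by its unweighted least-squares residual. This is one natural choice that ignores noise variances; it is not claimed to be the exact decoder analyzed in the cited works, and it is only feasible for very small `p`.

```python
from itertools import combinations
import numpy as np

def exhaustive_support_decoder(X, y, k):
    """Return the size-k support with the smallest least-squares residual."""
    best_support, best_rss = None, np.inf
    for cand in combinations(range(X.shape[1]), k):
        cols = X[:, cand]
        # Best fit of y using only the candidate columns.
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        rss = float(np.sum((y - cols @ coef) ** 2))
        if rss < best_rss:
            best_support, best_rss = cand, rss
    return best_support

# Noiseless toy instance: the true support fits y exactly.
rng = np.random.default_rng(1)
p, k, n = 10, 2, 8
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[2, 7]] = 1.0
y = X @ beta
estimate = exhaustive_support_decoder(X, y, k)
```

The loop visits all `C(p, k)` candidates, which is what separates information-theoretic sufficiency from computational feasibility.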
For sparse support recovery (mixed-quality measurements), key results from (Chaabouni et al., 11 May 2026) are:
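- Let $\beta \in \mathbb{R}^p$ have exactly $k$ nonzeros, observed through $n_1$ high-quality measurements (noise variance $\sigma_1^2$) and $n_2$ low-quality measurements (noise variance $\sigma_2^2$).
- Critical (homogeneous) sample threshold: of order $k \log(p/k)$ in the sparse regime ($k = o(p)$).
- Agnostic (unknown variances) sufficient condition: for any fixed slack and a suitable constant, a variance-dependent weighted combination of $n_1$ and $n_2$ must exceed the critical threshold.
- Informed (known variances, maximum-likelihood decoding) condition: an analogous weighted combination of $n_1$ and $n_2$, with weights that exploit the known variances.
The thresholds for exact support recovery in the canonical homogeneous case are $\Theta(k \log(p/k))$ at high SNR, with refined scaling at finite SNR [0702301].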
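For SBM (Jog et al., 2015), the sharp threshold is set by the order-1/2 Rényi divergence $I$ between within- and between-community edge distributions: exact recovery undergoes a sharp transition when $nI$ crosses a constant multiple of $\log n$, where $n$ is the number of nodes and the constant is set by the number of communities. In graph alignment (Cullina et al., 2017), exact recovery in sparsely correlated ER graphs requires the intersection marginal $p_{11}$, the probability that an edge is present in both graphs, to satisfy $p_{11} \ge (\log n + \omega(1))/n$.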
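For intuition about the SBM criterion, the order-1/2 Rényi divergence has a closed form for unweighted (Bernoulli) edges. The sketch below evaluates it and the normalized quantity `n * I / log(n)`. The Bernoulli edge model and the logarithmic-degree parametrization `a * log(n) / n`, `b * log(n) / n` are assumptions made for illustration; the cited work covers general weighted edges.

```python
import math

def renyi_half_bernoulli(p, q):
    """Order-1/2 Renyi divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    # Bhattacharyya coefficient: sum over outcomes of sqrt(P(x) * Q(x)).
    bc = math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))
    return -2.0 * math.log(bc)

n = 10_000
a, b = 9.0, 1.0                      # hypothetical within/between intensities
p_in = a * math.log(n) / n           # within-community edge probability
p_out = b * math.log(n) / n          # between-community edge probability

I = renyi_half_bernoulli(p_in, p_out)
ratio = n * I / math.log(n)          # approaches (sqrt(a) - sqrt(b))**2 as n grows
```

In this sparse regime the divergence is close to `(sqrt(p_in) - sqrt(p_out))**2`, so `ratio` depends on `a` and `b` only through `(sqrt(a) - sqrt(b))**2`, which is the quantity compared against the threshold constant.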
3. Quantitative Expressions: Price of Quality and Sample Trade-offs
A salient feature of mixed-quality measurement regimes is a linear trade-off between high- and low-quality samples, formalized as the "Price of Quality": the recovery conditions take the form of a weighted sum of the two sample counts, with coefficients that depend on the noise variances and on the regime (agnostic or informed, see above). The price of quality is the number of low-quality samples needed to replace one high-quality sample, that is, the ratio of the two coefficients:
- Agnostic (unknown variance): even very noisy measurements retain significant value (at most a factor-2 gap).
- Informed (variance known): high-quality measurements can asymptotically substitute for arbitrarily many low-quality measurements.
4. Proof Techniques: Union Bounds, Large Deviations, and Fano's Inequality
Information-theoretic sufficiency is typically established via a union bound over all incorrect hypotheses, coupled with Chernoff-type or moment-generating function (MGF) large deviation bounds for log-likelihood differences or loss gaps. The error probability of each incorrect model is suppressed exponentially in the number of measurements, and this must outweigh the exponentially large number of candidates. In (Chaabouni et al., 11 May 2026), for each incorrect support at a given Hamming distance from the truth, the probability that it appears preferable to the truth is bounded by a term that decays exponentially in the numbers of high- and low-quality measurements, with explicit SNR-dependent rate constants.
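The counting side of this argument can be sketched numerically. The number of size-`k` supports obtained by swapping out exactly `d` true indices is `C(k, d) * C(p - k, d)`, which is a standard combinatorial fact. The per-candidate error bound `exp(-n * rate * d)` used below is a placeholder assumption with a hypothetical constant `rate`; the actual exponents in the cited works depend on the SNR and the noise variances.

```python
import math

def count_supports_at_distance(p, k, d):
    """Number of size-k supports that differ from the true one in exactly d indices."""
    return math.comb(k, d) * math.comb(p - k, d)

def union_bound(p, k, n, rate):
    """Union bound on the error probability under an assumed exponent n * rate * d."""
    total = 0.0
    for d in range(1, min(k, p - k) + 1):
        # (number of candidates at distance d) x (assumed per-candidate error bound)
        total += count_supports_at_distance(p, k, d) * math.exp(-n * rate * d)
    return min(1.0, total)  # a probability bound above 1 is vacuous

# With the hypothetical rate 0.5, the bound is vacuous at n = 10
# and becomes small once n is large enough to beat the candidate count.
few_samples = union_bound(p=100, k=5, n=10, rate=0.5)
many_samples = union_bound(p=100, k=5, n=40, rate=0.5)
```

The bound becomes nontrivial once `n * rate` exceeds roughly the log of the number of nearest competitors, which is how `log`-type sample thresholds emerge from this style of proof.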
Necessity (impossibility) is typically established via Fano's inequality: one bounds the mutual information between the structured object and the data, and shows that the channel capacity is insufficient to distinguish among all hypotheses. In the mixed-quality and related settings, the channel capacity per sample is replaced by expressions involving $\tfrac{1}{2}\log(1 + \mathrm{SNR})$ or generalizations for weighted or nonscalar data [0702301], (Jog et al., 2015).
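A Fano-type necessary condition can be turned into a quick numerical lower bound. The sketch below assumes a uniform prior over all `C(p, k)` supports and assumes that the mutual information between the support and the data is at most `n * 0.5 * log(1 + snr)`, where `snr` is a hypothetical per-measurement signal-to-noise ratio. Both assumptions, and the function name `fano_min_samples`, are illustrative rather than taken from the cited works.

```python
import math

def fano_min_samples(p, k, snr, max_error=0.5):
    """Smallest n not ruled out by Fano's inequality under the stated assumptions."""
    log_m = math.log(math.comb(p, k))   # log of the number of hypotheses
    capacity = 0.5 * math.log1p(snr)    # assumed per-sample information, in nats
    # Fano: P_e >= 1 - (I + log 2) / log M with I <= n * capacity,
    # so P_e <= max_error requires n * capacity >= (1 - max_error) * log M - log 2.
    needed = ((1.0 - max_error) * log_m - math.log(2)) / capacity
    return max(0, math.ceil(needed))

n_min = fano_min_samples(p=1000, k=10, snr=10.0)
```

Since `log C(p, k)` is of order `k * log(p / k)`, this calculation reproduces the familiar `k * log(p / k) / log(1 + snr)` shape of necessary conditions, up to constants.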
5. Regimes and Phenomena: SNR, Partial Recovery, and Algorithmic Gaps
- SNR Regimes: At high SNR, all models collapse to the information-theoretic minimal sample threshold, i.e., $n = \Theta(k \log(p/k))$ for sparsity $k$ in dimension $p$. At low SNR or in highly heterogeneous regimes, informed decoders optimally drop low-quality data; agnostic decoders cannot, yet they retain significant efficiency.
- Partial Recovery: Allowing a fraction $\alpha$ of incorrect entries reduces the sample complexity, with sharp thresholds still typically governed by $k \log(p/k)$-type scaling but with smaller leading constants (Scarlett et al., 2015, Reeves et al., 2010, Truong et al., 2019, Huang et al., 2024).
- Statistical vs. Computational Gaps: In certain models, notably planted dense cycles or submatrix localization, the information-theoretic threshold can be strictly below what is achievable efficiently—typically detected via low-degree polynomial barrier arguments (Mao et al., 2024, Pandey et al., 2024, Liang et al., 2021).
- Adaptive Sensing: In group testing and 1-bit compressive sensing, adaptivity cannot reduce sample complexity, while in sublinear sparsity regimes, mild performance gains are possible (Aksoylar et al., 2014).
6. Applications and Practical Implications
- Budget allocation in experiments: The derived price-of-quality criteria provide concrete design guidelines for allocating costly high-quality vs. cheap low-quality sensors, with rigorous backing for when dropping noisy data is information-theoretically justified (Chaabouni et al., 11 May 2026).
- Network structure learning: For diffusion models or network dynamics, thresholds clarify how many observed cascades or temporal trajectories are needed for reliable parent-set determination (Park et al., 2016, Du et al., 7 Apr 2025).
- Matrix recovery: Null space analysis and phase diagrams of nuclear norm minimization reveal explicit oversampling factors compared to degrees of freedom, quantifying the sharp transition to recoverability (Oymak et al., 2010).
- Quantum information: In quantum error correction, information-theoretic recovery thresholds are characterized via mutual trace distance and coherent information criteria, with optimality of Petz and Schumacher–Westmoreland decoders established (Kim, 4 Mar 2026, Beigi et al., 2015).
7. Outlook and Open Directions
Contemporary research focuses on sharpening constant factors in threshold conditions, exploring finer-grained partial recovery phases, characterizing tight statistical-computational gaps in models with complex dependency structures (e.g., geometric random graphs, hypergraphs), and establishing robust analogs of these thresholds under additional or more general constraints (e.g., missing data, heterogeneous graphons, quantum channels) (Huang et al., 2024, Panagiotou et al., 2022, Liang et al., 2021). In many cases, algorithmic thresholds (for tractable decoding) exhibit order-wise or constant factor gaps from their information-theoretic counterparts, motivating ongoing research on efficient algorithms that close these gaps or prove them intrinsic.