On additive averaging kernels for finite Markov chains

Published 14 Apr 2026 in math.PR, cs.IT, math.CO, math.OC, and stat.CO | (2604.12334v1)

Abstract: We study additive mixtures of Markov kernels of the form $A_\alpha = \alpha P + (1-\alpha)G$, where $\alpha \in [0,1]$, $P$ is a baseline sampler and $G$ is a Gibbs kernel induced by a partition of the state space. We first motivate the study of $A_\alpha$, which can be interpreted as the projection of a lifted Markov chain. We then consider the minimisation of distance to stationarity under two objectives: the squared Frobenius norm and the Kullback-Leibler (KL) divergence. For the Frobenius objective, we derive explicit trace formulas and identify a Cheeger-type functional that characterises optimal two-block partitions. This yields a structured combinatorial optimisation problem admitting a difference-of-submodular decomposition, enabling efficient approximation via majorisation-minimisation. We also obtain geometric decay rates governed by the absolute spectral gap of $P$. For the KL divergence, we establish convexity-based bounds showing that the divergence of $A_\alpha$ is controlled by those of both $P$ and $G$, thereby reducing partition selection to the Gibbs component. Numerical experiments on the Curie-Weiss model demonstrate that suitable choice of both the partition and the parameter $\alpha$ can significantly accelerate convergence in total variation distance. We observe a consistent trade-off between local exploration and global averaging, with intermediate values of $\alpha$ achieving the best performance across regimes.


Authors (2)

Summary

The paper proposes additive averaging kernels that combine a local baseline kernel (P) and a global Gibbs kernel (G) to accelerate convergence.
It optimizes the trade-off parameter α and partition structure under Frobenius and KL divergence objectives using combinatorial and submodular methods.
Empirical studies on the Curie–Weiss model show that tuning α in additive mixtures significantly improves worst-case total variation convergence.

Additive Averaging Kernels for Finite Markov Chains: Theory, Optimization, and Empirical Analysis

Introduction and Motivation

The paper "On additive averaging kernels for finite Markov chains" (2604.12334) investigates Markov kernel mixtures of the form $A_\alpha = \alpha P + (1-\alpha)G$. Here, $P$ is a $\pi$-stationary "baseline" kernel, and $G$ is a Gibbs kernel induced by a partition of the finite state space. The parameter $\alpha \in [0,1]$ governs the trade-off between local (via $P$) and partition-based global (via $G$) dynamics. Motivated by the recent study of group-averaged and composition-based kernels, the authors seek to determine whether computationally leaner additive mixtures can achieve similar acceleration in convergence to stationarity.

The main results include the characterization of optimal partitions and $\alpha$ under Frobenius and Kullback-Leibler (KL) objectives, combinatorial and submodular optimization strategies, and decay bounds in terms of the spectral gap. Empirical investigation is provided on the Curie–Weiss model, showing nontrivial optimal $\alpha$ and partition structures leading to significant accelerations in worst-case total variation (TV) convergence.

Theoretical Framework and Kernel Construction

The kernel $A_\alpha$ is constructed as a convex combination, blending $P$ and $G$. The $G$ kernel averages within blocks ("orbits") of a partition, while $P$ provides standard local exploration. Analysis proceeds by interpreting $A_\alpha$ as a marginal of a lifted chain on an augmented space, where the auxiliary variable selects between $P$ and $G$. This formalism justifies the randomization inherent in $A_\alpha$ without sacrificing reversibility when $P$ is reversible.
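The construction can be sketched on a toy chain. The example below is only an illustration under stated assumptions: the four-state target `pi`, the Metropolis path walk used as the baseline $P$, the partition `blocks`, and all helper names are hypothetical and not taken from the paper. The `step` function shows one natural reading of the lifted view, in which a coin with bias $\alpha$ plays the role of the auxiliary variable.

```python
import random

def metropolis_path(pi):
    """Hypothetical baseline P: Metropolis walk on a path, reversible w.r.t. pi."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in (x - 1, x + 1):
            if 0 <= y < n:
                P[x][y] = 0.5 * min(1.0, pi[y] / pi[x])
        P[x][x] = 1.0 - sum(P[x])  # rejected or out-of-range proposals stay put
    return P

def gibbs_kernel(pi, blocks):
    """G(x, y) = pi(y) / pi(block of x) if y lies in x's block, else 0."""
    n = len(pi)
    G = [[0.0] * n for _ in range(n)]
    for block in blocks:
        mass = sum(pi[x] for x in block)
        for x in block:
            for y in block:
                G[x][y] = pi[y] / mass
    return G

def mix(P, G, alpha):
    """A_alpha = alpha * P + (1 - alpha) * G, entry by entry."""
    n = len(P)
    return [[alpha * P[x][y] + (1 - alpha) * G[x][y] for y in range(n)]
            for x in range(n)]

def is_stationary(K, pi, tol=1e-12):
    n = len(pi)
    return all(abs(sum(pi[x] * K[x][y] for x in range(n)) - pi[y]) < tol
               for y in range(n))

def is_reversible(K, pi, tol=1e-12):
    n = len(pi)
    return all(abs(pi[x] * K[x][y] - pi[y] * K[y][x]) < tol
               for x in range(n) for y in range(n))

def step(x, P, G, alpha, rng):
    """One move of A_alpha: an alpha-coin picks P or G, then that kernel moves."""
    row = P[x] if rng.random() < alpha else G[x]
    return rng.choices(range(len(row)), weights=row)[0]

pi = [0.1, 0.2, 0.3, 0.4]
blocks = [[0, 3], [1, 2]]  # hypothetical two-block partition
P = metropolis_path(pi)
G = gibbs_kernel(pi, blocks)
A = mix(P, G, alpha=0.5)
print("stationary:", is_stationary(A, pi), "reversible:", is_reversible(A, pi))
```

Because both $P$ and $G$ satisfy detailed balance with respect to $\pi$ in this toy setup, any convex combination of them does as well, which is what the two checks confirm numerically.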

Frobenius Norm and Combinatorial Partition Optimization

The authors target both the squared Frobenius norm of the difference between $A_\alpha$ and the rank-one stationary kernel, and the KL divergence from stationarity. For the Frobenius setting, they derive explicit trace formulas. The core result is that minimization over two-block partitions reduces to maximizing a Cheeger-type functional of the stationary flow between the two blocks, suitably normalized.

This functional appears in both combinatorial optimization and spectral theory, connecting to edge expansion (Cheeger's constant). Optimizing the Cheeger-type functional over partitions thus becomes a difference-of-submodular problem, which is intractable in general but amenable to majorization-minimization (MM) heuristics.

Furthermore, the authors prove geometric decay rates for the Frobenius distance, governed by the absolute spectral gap of $P$, so the distance shrinks geometrically in the number of steps.
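For very small state spaces, the two-block partition search described above can be checked by exhaustive enumeration. The sketch below does not implement the paper's trace formulas or its Cheeger-type functional; it evaluates the objective directly. It assumes an unweighted Frobenius norm (the paper's exact weighting may differ), a hypothetical five-state target `pi`, and a hypothetical Metropolis path walk as the baseline.

```python
from itertools import combinations

def metropolis_path(pi):
    """Hypothetical baseline P: Metropolis walk on a path, reversible w.r.t. pi."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in (x - 1, x + 1):
            if 0 <= y < n:
                P[x][y] = 0.5 * min(1.0, pi[y] / pi[x])
        P[x][x] = 1.0 - sum(P[x])
    return P

def gibbs_kernel(pi, blocks):
    """G(x, y) = pi(y) / pi(block of x) if y lies in x's block, else 0."""
    n = len(pi)
    G = [[0.0] * n for _ in range(n)]
    for block in blocks:
        mass = sum(pi[x] for x in block)
        for x in block:
            for y in block:
                G[x][y] = pi[y] / mass
    return G

def frobenius_sq(P, G, alpha, pi):
    """Unweighted squared Frobenius distance between A_alpha and the
    rank-one kernel whose rows all equal pi (an assumed convention)."""
    n = len(pi)
    return sum((alpha * P[x][y] + (1 - alpha) * G[x][y] - pi[y]) ** 2
               for x in range(n) for y in range(n))

def two_block_partitions(n):
    """Yield each partition {S, complement} once, fixing state 0 inside S."""
    for k in range(n - 1):  # S has size k + 1, so the complement is non-empty
        for extra in combinations(range(1, n), k):
            S = [0, *extra]
            yield S, [x for x in range(n) if x not in S]

pi = [0.05, 0.15, 0.2, 0.25, 0.35]
alpha = 0.5
P = metropolis_path(pi)
scores = {}
for S, Sc in two_block_partitions(len(pi)):
    scores[tuple(S)] = frobenius_sq(P, gibbs_kernel(pi, [S, Sc]), alpha, pi)
best_S = min(scores, key=scores.get)
print("best block containing state 0:", best_S, "objective:", scores[best_S])
```

The enumeration visits $2^{n-1} - 1$ partitions, which is why structured surrogates become necessary once the state space grows.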

Figure 1: Worst-case total variation distance for different samplers, demonstrating the impact of group-averaging and additive mixtures over the baseline.

KL Divergence: Convexity-Based Bounds and Optimal Partitions

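For the KL objective, convexity implies that the KL divergence of $A_\alpha$ to stationarity is upper bounded by a convex combination of the divergences of $P$ and $G$. Writing $D_{\mathrm{KL}}(K)$ for the KL divergence of a kernel $K$ to stationarity:

$$D_{\mathrm{KL}}(A_\alpha) \le \alpha\, D_{\mathrm{KL}}(P) + (1-\alpha)\, D_{\mathrm{KL}}(G)$$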

Critically, the KL divergence of $G$ to stationarity coincides with the Shannon entropy of the block structure. Thus, optimal partition selection for KL minimization reduces to entropy minimization, and the optimal partition collects the least probable states into singleton or small blocks, with the remainder forming a large block.
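Both statements can be checked numerically on a toy chain. The sketch below assumes one common convention for the KL divergence of a kernel $K$ to stationarity, namely $\sum_x \pi(x) \sum_y K(x,y) \log\bigl(K(x,y)/\pi(y)\bigr)$; the paper's exact definition may differ. The target `pi`, the Metropolis baseline, and the two partitions are hypothetical.

```python
import math

def metropolis_path(pi):
    """Hypothetical baseline P: Metropolis walk on a path, reversible w.r.t. pi."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in (x - 1, x + 1):
            if 0 <= y < n:
                P[x][y] = 0.5 * min(1.0, pi[y] / pi[x])
        P[x][x] = 1.0 - sum(P[x])
    return P

def gibbs_kernel(pi, blocks):
    """G(x, y) = pi(y) / pi(block of x) if y lies in x's block, else 0."""
    n = len(pi)
    G = [[0.0] * n for _ in range(n)]
    for block in blocks:
        mass = sum(pi[x] for x in block)
        for x in block:
            for y in block:
                G[x][y] = pi[y] / mass
    return G

def kl_to_stationarity(K, pi):
    """Assumed convention: pi-weighted row-wise KL, with 0 * log 0 = 0."""
    total = 0.0
    for x, row in enumerate(K):
        for y, k in enumerate(row):
            if k > 0.0:
                total += pi[x] * k * math.log(k / pi[y])
    return total

def block_entropy(pi, blocks):
    """Shannon entropy (in nats) of the stationary masses of the blocks."""
    masses = [sum(pi[x] for x in block) for block in blocks]
    return -sum(m * math.log(m) for m in masses)

pi = [0.1, 0.2, 0.3, 0.4]
blocks = [[0], [1, 2, 3]]      # least probable state isolated as a singleton
balanced = [[0, 3], [1, 2]]    # two blocks of equal stationary mass
alpha = 0.5

P = metropolis_path(pi)
G = gibbs_kernel(pi, blocks)
A = [[alpha * p + (1 - alpha) * g for p, g in zip(rp, rg)] for rp, rg in zip(P, G)]

kl_P, kl_G, kl_A = (kl_to_stationarity(K, pi) for K in (P, G, A))
print("KL(G):", kl_G, "block entropy:", block_entropy(pi, blocks))
print("KL(A):", kl_A, "bound:", alpha * kl_P + (1 - alpha) * kl_G)
```

Under this convention the identity holds because every row of $G$ in a block $B$ contributes $-\log \pi(B)$, and weighting by $\pi$ sums these to the entropy of the block masses. Isolating the least probable state gives a lower entropy here than the balanced split.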

Figure 2: Worst-case total variation distance for different samplers, each optimized over its own Frobenius-optimal partition.

Submodular Optimization and MM Algorithm

The combinatorial optimization of the Frobenius objective can be decomposed into a difference of submodular functions. Building on recent methods for minimizing differences of submodular functions, the authors present MM surrogates, which iteratively majorize the non-convex objective by easier-to-optimize surrogate functions. This yields practical approximations to the partition optimization problem when the state space is large.

Single-site ("singleton") approximations are also considered. In this regime, the optimal subset is shown to be a singleton selected by maximizing a state-dependent quantity, with an additive approximation guarantee for the Frobenius objective.
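The MM idea for a difference of submodular functions can be sketched generically. The example below is not the paper's objective or algorithm: `f` and `g` are hypothetical toy submodular functions (concave functions of non-negative modular sums), and the inner minimization is done by brute force because the ground set is tiny. It follows the standard recipe of replacing $g$ with a modular lower bound that is tight at the current set, so that the surrogate $f - m$ majorizes $f - g$.

```python
import math
from itertools import combinations

GROUND = list(range(5))
W_F = [4.0, 1.0, 3.0, 2.0, 5.0]  # hypothetical weights defining f
W_G = [1.0, 6.0, 2.0, 7.0, 1.5]  # hypothetical weights defining g

def f(S):
    """Submodular: square root of a non-negative modular sum."""
    return math.sqrt(sum(W_F[i] for i in S))

def g(S):
    """Submodular, with g(empty set) = 0."""
    return 2.0 * math.sqrt(sum(W_G[i] for i in S))

def objective(S):
    return f(S) - g(S)

def modular_lower_bound(S):
    """Weights m with m(X) <= g(X) for all X and m(S) = g(S).

    Built from marginal gains of g along an ordering that lists S first."""
    order = sorted(S) + [i for i in GROUND if i not in S]
    m, prefix, prev = {}, [], 0.0
    for i in order:
        prefix.append(i)
        cur = g(prefix)
        m[i] = cur - prev
        prev = cur
    return m

def all_subsets():
    for k in range(len(GROUND) + 1):
        for c in combinations(GROUND, k):
            yield frozenset(c)

def mm_minimize(S0, max_iter=50):
    S = frozenset(S0)
    history = [objective(S)]
    for _ in range(max_iter):
        m = modular_lower_bound(S)
        # Surrogate f(X) - m(X) upper-bounds f(X) - g(X) and is tight at X = S.
        nxt = min(all_subsets(), key=lambda X: f(X) - sum(m[i] for i in X))
        if objective(nxt) >= objective(S) - 1e-12:
            break  # no further decrease: a fixed point of the MM iteration
        S = nxt
        history.append(objective(S))
    return S, history

S_final, history = mm_minimize([0])
print("final set:", sorted(S_final), "objective path:", history)
```

Each iteration can only decrease the objective, but the fixed point it reaches need not be a global minimizer. In a realistic setting the brute-force inner step would be replaced by a submodular minimization routine.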

Spectral and Cheeger Analyses

Explicit formulas relate the structure of $A_\alpha$ and its convergence to the projection chain on the orbit space, Cheeger-type inequalities for expansion, and spectral properties of $P$. The analysis reveals that group-averaged and additive kernels perform best when their partitions cut bottlenecks, in some cases selecting highly unbalanced cuts for the (multiplicative) group-averaged kernels and more balanced cuts for additive mixtures.

Numerical Experiments: Trade-off of Exploration and Averaging

Comprehensive experiments are conducted on the Curie–Weiss model in both high- and low-temperature regimes, with and without external field bias. The baseline Glauber dynamics ($P$), the additive kernels ($A_\alpha$), and the multiplicative group-averaged kernels are compared.

Key empirical findings:

Group-averaged multiplicative samplers uniformly achieve the fastest mixing, followed by $A_\alpha$ (for moderate $\alpha$), then the baseline $P$.
Partitions optimizing the group-averaged kernels tend to be highly unbalanced in low-temperature or metastable regimes, while those for $A_\alpha$ are more balanced in nearly symmetric settings.
Tuning $\alpha$ is crucial; both endpoints ($\alpha = 0$ or $\alpha = 1$) result in suboptimal mixing, while intermediate values of $\alpha$ yield the best TV contraction.
Singleton approximations can provide efficient partition selection with provable guarantees and little sacrifice in mixing performance.
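A small-scale version of this kind of experiment can be sketched as follows. Everything specific here is an assumption for illustration: the Hamiltonian convention $\pi(\sigma) \propto \exp\bigl(\tfrac{\beta}{2n}(\sum_i \sigma_i)^2 + h \sum_i \sigma_i\bigr)$, the heat-bath form of Glauber dynamics, the values `n=4` and `beta=1.5`, and the partition by sign of the magnetization (which is not the Frobenius-optimal partition from the paper). The script only tabulates worst-case TV curves and makes no claim about which $\alpha$ wins.

```python
import math
from itertools import product

def curie_weiss(n, beta, h=0.0):
    """All spin configurations and their (assumed-convention) Gibbs weights."""
    states = list(product((-1, 1), repeat=n))
    w = [math.exp(beta / (2 * n) * sum(s) ** 2 + h * sum(s)) for s in states]
    Z = sum(w)
    return states, [v / Z for v in w]

def glauber(states, pi):
    """Heat-bath Glauber dynamics: pick a site uniformly, resample its spin."""
    index = {s: i for i, s in enumerate(states)}
    n, N = len(states[0]), len(states)
    P = [[0.0] * N for _ in range(N)]
    for x, s in enumerate(states):
        for i in range(n):
            up = index[s[:i] + (1,) + s[i + 1:]]
            down = index[s[:i] + (-1,) + s[i + 1:]]
            total = pi[up] + pi[down]
            P[x][up] += pi[up] / (n * total)
            P[x][down] += pi[down] / (n * total)
    return P

def gibbs_kernel(pi, blocks):
    """G(x, y) = pi(y) / pi(block of x) if y lies in x's block, else 0."""
    N = len(pi)
    G = [[0.0] * N for _ in range(N)]
    for block in blocks:
        mass = sum(pi[x] for x in block)
        for x in block:
            for y in block:
                G[x][y] = pi[y] / mass
    return G

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def worst_tv(K, pi):
    """Worst-case (over starting states) total variation distance to pi."""
    return max(0.5 * sum(abs(k - p) for k, p in zip(row, pi)) for row in K)

def tv_curve(A, pi, t_max):
    """Worst-case TV of A^t for t = 1, ..., t_max."""
    out, K = [], A
    for _ in range(t_max):
        out.append(worst_tv(K, pi))
        K = matmul(K, A)
    return out

states, pi = curie_weiss(n=4, beta=1.5)
P = glauber(states, pi)
blocks = [[i for i, s in enumerate(states) if sum(s) > 0],
          [i for i, s in enumerate(states) if sum(s) <= 0]]  # hypothetical cut
G = gibbs_kernel(pi, blocks)

curves = {}
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    A = [[alpha * p + (1 - alpha) * g for p, g in zip(rp, rg)]
         for rp, rg in zip(P, G)]
    curves[alpha] = tv_curve(A, pi, t_max=10)

for alpha, curve in curves.items():
    print(f"alpha={alpha:.2f}  TV(t=1)={curve[0]:.4f}  TV(t=10)={curve[-1]:.4f}")
```

At $\alpha = 0$ the chain is $G$ alone, which is idempotent and cannot leave the starting block, so its curve stays flat; at $\alpha = 1$ it is the baseline alone. Swapping in other partitions or parameter values is a matter of changing `blocks`, `beta`, and `h`.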

Figure 3: Magnetization profile of the Frobenius-optimal cut, showing concentration on a small portion of the state space.

Figure 4: Worst-case total variation distance for $A_\alpha$ at varying $\alpha$ and fixed time, illustrating trade-offs in convergence speed.

Figure 5: Dependence of worst-case total variation distance of $A_\alpha$ on $\alpha$ for selected time horizons, typifying the U-shaped effect of the trade-off parameter.

Implications and Future Directions

The investigation substantiates the utility of additive mixtures for MCMC acceleration with strong theoretical underpinnings. The construction enables efficient, structure-informed randomization, avoids the cost of full composition-based group averages, and leverages aggressive partitioning without compromising reversibility.

Practical implications:

The methodology is particularly germane for models with known bottlenecks or group symmetries, where block-structured partitions can be chosen efficiently.
Trade-off tuning of $\alpha$ is central, and the theory provides explicit decay rates and guidance for practical parameter selection.
Submodular techniques offer scalable heuristics even for combinatorially large state spaces.

Theoretical extensions:

The lifted Markov chain construction admits further generalization, potentially facilitating non-reversible or higher-order additive mixtures.
Difference-of-submodular optimization for other objectives or over larger partition classes remains an open direction, as do connections to advanced isoperimetric and spectral techniques in Markov chain geometry.

Conclusion

This work rigorously describes, analyzes, and empirically validates additive averaging kernel methods for finite-state Markov chains, bridging gaps between theory and computation. The partition optimization problem, formerly a combinatorial challenge, is rendered tractable via structural and submodular insights, and the explicit trade-offs between local exploration and global averaging are both characterized and exploited. The findings inform both principled MCMC design and broader inquiries into state space geometry and functional optimization in stochastic processes.
