Learning-Rate-Free Blockwise Optimization
- Learning-rate-free blockwise optimization is a paradigm that avoids global learning-rate tuning by leveraging block-specific structure and closed-form subproblem solutions.
- It employs adaptive geometric regularization, curvature-based scaling, and Bregman divergences to update blocks, ensuring stability in nonconvex and non-Lipschitz settings.
- Empirical studies in matrix factorization, imaging inverse problems, and manifold optimization demonstrate accelerated convergence and reduced hyperparameter tuning.
Learning-rate-free blockwise optimization encompasses a class of algorithms for large-scale and nonconvex optimization that systematically avoid explicit learning-rate (stepsize) selection by leveraging blockwise problem structure, adaptive geometric regularization, closed-form subproblem solutions, and intrinsic scaling via curvature or reference functionals. Such techniques dispense with outer hyperparameter tuning, accelerate convergence in challenging settings (including non-Lipschitz and manifold-valued problems), and unify diverse algorithmic paradigms across first- and second-order, primal-dual, and non-Euclidean optimization.
1. Motivation and Problem Classes
Classical iterative optimization schemes such as proximal gradient, block coordinate descent (BCD), Newton-type methods, and primal-dual splitting commonly require global stepsize parameters tuned to upper bounds on Lipschitz or curvature constants. In many large-scale or structured problems, global conditions such as $L$-Lipschitz smoothness, Euclidean curvature uniformity, or parameter separability fail: prominent cases include nonnegative matrix factorization (NMF), Poisson likelihood models, tensor decompositions, manifold-constrained semidefinite programs, and spatially inhomogeneous imaging problems. A mis-specified stepsize causes instability or slow convergence, or forces a costly inner line search with backtracking.
Learning-rate-free blockwise optimization methods circumvent these obstacles via intrinsic local scaling or exact subproblem minimization, partitioning variables into blocks indexed by problem geometry, spectral properties, spatial locations, or model structure. The optimization proceeds by updating each block (or selected subset) using local curvature information, adaptive Bregman divergences, or closed-form solutions, so that no hand-tuned global learning rate is present or required (Gao et al., 2019, Jahani et al., 2020, Valkonen, 2016, Peng et al., 2023).
2. Core Methodologies
Blockwise Two-Reference Bregman Proximal Gradient (B2B)
Blockwise two-reference Bregman proximal gradient (B2B) addresses composite, potentially nonconvex objectives that are not globally Lipschitz smooth by employing two convex, block-specific reference functions, $h_1$ and $h_2$, to induce Bregman divergences in the update (Gao et al., 2019). Each block update comprises:
- A direction-finding step, which minimizes a block-specific subproblem.
- A Bregman projection step.
Choosing the reference functions to exploit the curvature and structural properties of the objective ensures exact or closed-form updates, with no explicit stepsize parameter. Greedy, randomized, or cyclic block selection rules are supported.
Symmetric Blockwise Truncated Optimization (SONIA)
SONIA partitions the variable space into a low-dimensional block (subspace), obtained for example via curvature sketches, and its orthogonal complement. At each iteration, the update is built from a basis spanning the subspace, the partial Hessian restricted to it, and a small truncation parameter. No global stepsize appears; blockwise scaling is “absorbed” in the spectral truncation and in the selection of a scaling parameter, which can be fixed or adaptively bounded by eigenvalues (Jahani et al., 2020).
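The following is a rough sketch of a subspace-truncated Newton-type step in the spirit of the description above; it is not the update published by Jahani et al. (2020). The function name `truncated_block_step` and the parameters `omega` (truncation floor) and `rho` (scaling on the orthogonal complement) are hypothetical, and the sketch assumes that `V` has orthonormal columns and that the full Hessian is available.

```python
import numpy as np

def truncated_block_step(grad, hess, V, omega=1e-3, rho=1.0):
    """Return a step: truncated Newton in span(V), scaled gradient elsewhere.

    Assumptions (illustrative only): V has orthonormal columns, `omega` floors
    the magnitude of the subspace eigenvalues, and `rho` scales the gradient
    component in the orthogonal complement.
    """
    # Partial Hessian restricted to the subspace, symmetrized for safety.
    H_sub = V.T @ hess @ V
    H_sub = 0.5 * (H_sub + H_sub.T)
    lam, Q = np.linalg.eigh(H_sub)
    # Truncate: use |lambda|, but never smaller than omega.
    lam_trunc = np.maximum(np.abs(lam), omega)
    g_sub = V.T @ grad
    step_sub = V @ (Q @ ((Q.T @ g_sub) / lam_trunc))
    # Gradient component outside the subspace.
    g_perp = grad - V @ g_sub
    return -step_sub - rho * g_perp

# Toy quadratic f(x) = 0.5 * x^T H x, so the gradient is H @ x.
H = np.diag([10.0, 5.0, 1.0, 0.5])
x = np.array([1.0, -2.0, 3.0, 4.0])
V = np.eye(4)[:, :2]  # subspace spanned by the first two coordinates
x_new = x + truncated_block_step(H @ x, H, V)
print(x_new)
```

On this toy quadratic the subspace components are solved exactly in one step, while the complement only receives a plain scaled gradient step.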
Block-proximal Primal-Dual Methods with Spatially Adapted Acceleration
Block-proximal primal-dual hybrid gradient (PDHG, or Chambolle-Pock) methods are generalized to block coordinates with per-block step-length operators, which are assembled from blockwise projectors. The step lengths are recalculated at each iteration by closed-form blockwise algebraic formulas linked to blockwise curvature, compatibility, and over-relaxation, thus obviating any globally tuned stepsize (Valkonen, 2016).
Learning-Rate-Free Block Coordinate Descent on Manifolds
In the general Riemannian block coordinate descent (BCD) framework, each variable block lies on its own smooth closed manifold. Each iteration exactly (or nearly exactly) minimizes the overall objective with respect to a single block, holding all others fixed, thereby eliminating the need for a global stepsize. The only requirements are blockwise smoothness and compactness assumptions; convergence holds with rates matching those achievable by optimally step-size-tuned full-gradient methods (Peng et al., 2023).
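As an illustration of exact block minimization on manifolds, the sketch below treats each block as a point on a unit sphere and minimizes a quadratic coupling cost of the Burer-Monteiro type. The cost, the symmetric matrix `C`, and the names `objective` and `sphere_bcd` are assumptions chosen for the example, not taken from Peng et al. (2023). With all other blocks fixed, the cost is linear in block `i` up to a constant, so its minimizer on the sphere is the negated, normalized linear coefficient and no stepsize is involved.

```python
import numpy as np

def objective(C, X):
    # f(X) = sum_{i,j} C[i, j] * <x_i, x_j>, where x_i is row i of X.
    return float(np.sum(C * (X @ X.T)))

def sphere_bcd(C, X0, sweeps=20):
    """Cyclic exact block minimization; every row of X stays on the unit sphere.

    Assumes C is symmetric, so the diagonal term C[i, i] * ||x_i||^2 is constant.
    """
    X = X0.copy()
    history = [objective(C, X)]
    for _ in range(sweeps):
        for i in range(X.shape[0]):
            # Coefficient of the linear term in x_i with other blocks fixed.
            g = C[i] @ X - C[i, i] * X[i]
            norm = np.linalg.norm(g)
            if norm > 1e-12:
                # Exact minimizer of <g, x> over the unit sphere.
                X[i] = -g / norm
        history.append(objective(C, X))
    return X, history

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
C = 0.5 * (A + A.T)
X0 = rng.standard_normal((8, 3))
X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
X, history = sphere_bcd(C, X0)
print(history[0], history[-1])
```

Because every block update is an exact minimization, the recorded objective values cannot increase from one sweep to the next.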
3. Block Partitioning, Reference Functions, and Curvature Adaptivity
A fundamental aspect of learning-rate-free blockwise methods is the design of the block structure and reference functionals:
- Block partitioning can correspond to columns in matrix factorization, spectral eigenspaces, spatial domains (e.g., pixels in imaging), or geometric subspaces (manifold coordinates, low-rank factors).
- Reference functions (as in B2B) encode the local curvature or domain geometry (Euclidean, quadratic, entropy-like, etc.), which allows the induced Bregman divergences to both regularize the subproblems and absorb the effective learning rate.
- Curvature adaptivity in SONIA (and similar schemes) arises from direct computation or estimation of partial Hessian spectra in selected subspaces, automatically scaling steps according to local conditioning and truncating small eigenmodes to ensure stability. In block-proximal PDHG, blockwise step-lengths are updated so as to satisfy a generalized compatibility condition, analogous to an implicit blockwise CFL condition, computed fully algebraically.
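To make the role of reference functions concrete, the sketch below evaluates the Bregman divergence $D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle$ for two standard choices of $h$. The helper names are hypothetical. The squared Euclidean norm recovers half the squared distance, while the negative entropy (on strictly positive vectors) recovers the generalized Kullback-Leibler divergence.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    # D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

def sq_norm(x):
    return 0.5 * np.dot(x, x)

def sq_norm_grad(x):
    return x

def neg_entropy(x):
    # Defined for strictly positive x.
    return np.sum(x * np.log(x))

def neg_entropy_grad(x):
    return np.log(x) + 1.0

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])
print(bregman(sq_norm, sq_norm_grad, x, y))          # Euclidean geometry
print(bregman(neg_entropy, neg_entropy_grad, x, y))  # entropy-like geometry
```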
4. Algorithmic Schemes and Convergence Properties
Implementation Patterns
The learning-rate-free blockwise paradigm encompasses:
- Closed-form block updates, such as Bregman projections or exact block minimizations, requiring no line-search or backtracking.
- Adaptive block selection schemes: cyclic, greedy (Gauss-Southwell), and randomized sampling, each with associated convergence rate guarantees.
- Pseudocode structures that explicitly avoid any global stepsize or schedule, computing all scaling terms from blockwise second-order, convexity, or algebraic properties intrinsic to the local problem.
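The patterns above can be sketched on the simplest possible case: exact single-coordinate minimization of a strictly convex quadratic $\tfrac12 x^\top A x - b^\top x$, with the three block selection rules. This is a generic textbook illustration under the assumption that `A` is symmetric positive definite, not an implementation of any of the cited methods; the function name and arguments are hypothetical. The scaling `1 / A[i, i]` comes out of the exact one-dimensional minimization rather than from a tuned learning rate.

```python
import numpy as np

def exact_coordinate_descent(A, b, rule="cyclic", iters=2000, seed=0):
    """Minimize 0.5 * x^T A x - b^T x by exact one-coordinate updates.

    Assumes A is symmetric positive definite (so every A[i, i] > 0).
    """
    rng = np.random.default_rng(seed)
    n = b.size
    x = np.zeros(n)
    for k in range(iters):
        grad = A @ x - b
        if rule == "cyclic":
            i = k % n
        elif rule == "greedy":  # Gauss-Southwell: largest gradient entry
            i = int(np.argmax(np.abs(grad)))
        elif rule == "random":
            i = int(rng.integers(n))
        else:
            raise ValueError(f"unknown rule: {rule}")
        # Exact minimizer along coordinate i; no global stepsize is used.
        x[i] -= grad[i] / A[i, i]
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)
b = rng.standard_normal(5)
for rule in ("cyclic", "greedy", "random"):
    x = exact_coordinate_descent(A, b, rule=rule)
    print(rule, np.linalg.norm(A @ x - b))
```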
Convergence Results
Under blockwise relative smoothness, strong convexity, and/or compactness (depending on method), learning-rate-free blockwise optimizers achieve:
- Sublinear convergence rates for projected-gradient or primal-dual gaps, with accelerated ergodic rates possible for strongly convex blocks (Gao et al., 2019, Valkonen, 2016).
- Exact linear convergence when partial strong convexity holds, as in truncated block Newton or SONIA-type updates (Jahani et al., 2020).
- Convergence to critical points or stationary solutions in the nonconvex or manifold setting, with explicit rate bounds under mild assumptions (Peng et al., 2023).
A summary table of key algorithmic features:
| Method/Class | Stepsize Tuning Required | Block Adaptivity | Closed-form Block Updates |
|---|---|---|---|
| B2B (Gao et al., 2019) | No | Yes (h₁, h₂) | Often |
| SONIA (Jahani et al., 2020) | No | Yes (curvature) | Yes |
| Block-PDHG (Valkonen, 2016) | No | Yes (per block) | Yes |
| Manifold BCD (Peng et al., 2023) | No | Yes (geometry) | Yes |
5. Practical Applications and Empirical Observations
Learning-rate-free blockwise algorithms have been demonstrated on a range of high-dimensional and challenging problems:
- Nonnegative matrix factorization (NMF): B2B yields stable, fast, and tuning-free solutions even for non-Lipschitz or highly anisotropic cases (Gao et al., 2019).
- Empirical risk minimization: SONIA achieves convergence and accuracy comparable or superior to well-tuned first- and second-order methods, with no external stepsize tuning needed (Jahani et al., 2020).
- Imaging inverse problems: Block-proximal methods with spatially adapted acceleration enable pixelwise step-length adjustments, achieving large reductions in iteration counts and CPU time for denoising, deblurring, and inpainting (Valkonen, 2016).
- Manifold-structured inference and estimation: Blockwise exact subproblem solves appear in semidefinite programming (Burer-Monteiro), geometric estimation (essential matrix, absolute pose), and robust estimation via IRLS, all requiring no stepsize or learning-rate hyperparameters (Peng et al., 2023).
The empirical evidence indicates that these approaches not only provide robust convergence but also eliminate the major practical bottleneck of stepsize tuning, which is particularly acute in the presence of non-uniform curvature or nonconvexity.
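For the NMF case, a minimal sketch of a tuning-free block scheme is given below. It is a HALS-style exact column and row update for the Frobenius-norm objective, used here only as a simple stand-in for closed-form block updates; it is not the B2B algorithm of Gao et al. (2019). The function name `hals_sweep` and the guard `eps` are hypothetical.

```python
import numpy as np

def hals_sweep(X, W, H, eps=1e-12):
    """One sweep of exact nonnegative block updates for ||X - W H||_F^2.

    Each column of W and each row of H is a block; W and H are updated in place.
    """
    r = W.shape[1]
    for k in range(r):
        hk = H[k]
        denom = hk @ hk
        if denom > eps:
            # (X - sum_{j != k} W[:, j] H[j]) @ hk, written without forming the residual.
            num = X @ hk - W @ (H @ hk) + W[:, k] * denom
            W[:, k] = np.maximum(num / denom, 0.0)
    for k in range(r):
        wk = W[:, k]
        denom = wk @ wk
        if denom > eps:
            num = wk @ X - (wk @ W) @ H + denom * H[k]
            H[k] = np.maximum(num / denom, 0.0)
    return W, H

rng = np.random.default_rng(0)
X = rng.random((20, 15))
W = rng.random((20, 4))
H = rng.random((4, 15))
errors = [np.linalg.norm(X - W @ H)]
for _ in range(30):
    W, H = hals_sweep(X, W, H)
    errors.append(np.linalg.norm(X - W @ H))
print(errors[0], errors[-1])
```

Each block update solves its one-block nonnegative least-squares problem exactly, so the reconstruction error is non-increasing across sweeps without any stepsize or line search.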
6. Comparison with Classical and Contemporary Optimization
Learning-rate-free blockwise optimization contrasts sharply with conventional full-gradient, classical BCD, or line-searched proximal/dual methods:
- No global learning-rate parameter: All curvature information and local scalings are absorbed by reference divergences, spectral decompositions, or per-block algebra (Gao et al., 2019, Jahani et al., 2020, Valkonen, 2016, Peng et al., 2023).
- No inner loop line-search/backtracking: Once reference functions or block-specific parameters are fixed, the algorithm requires no further adaptation.
- Parameter-free in practice: The only remaining “parameters” (e.g., block structure, regularization strength) are determined by problem setup, not iterative tuning.
- Broad applicability: The paradigm extends beyond Euclidean convexity to nonconvex, non-Lipschitz, and manifold-constrained settings.
A plausible implication is that for a wide class of structured high-dimensional problems, such learning-rate-free blockwise algorithms should be the default, especially where global Lipschitz moduli are unavailable or step-size sensitivity is extreme.
7. Extensions, Stochastic Variants, and Future Directions
Important extensions and ongoing research include:
- Stochastic blockwise algorithms: Learning-rate-free principles extend to stochastic gradient and stochastic Hessian-vector variants, with convergence guarantees depending only on blockwise variance and curvature bounds (Jahani et al., 2020, Valkonen, 2016).
- Distributed and asynchronous blockwise updates: Particularly relevant for large-scale, multi-worker, or GPU cluster settings, where per-block adaptivity enables high parallelization and near-linear speedup (Valkonen, 2016).
- Adaptive block partitioning: Dynamic block structure selection based on empirical curvature, variable activity, or other criteria can further enhance performance, especially in nonstationary learning contexts (Jahani et al., 2020).
- Generalization to non-Euclidean and generalized entropy geometries: New reference function choices allow learning-rate-free principles to be applied directly in information geometry, optimal transport, and quantum settings (Gao et al., 2019).
On both theoretical and practical dimensions, learning-rate-free blockwise optimization offers an attractive alternative to parameter-heavy or brittle classical schemes, with proven efficacy across a spectrum of contemporary machine learning and scientific computing tasks.