Ray Regression Predictor: Convergence Insights
- Ray Regression Predictor (RRP) is a geometric and analytic framework that defines a unique affine ray detailing parameter convergence in logistic regression.
- It partitions data into strongly convex and separable subsets to compute a maximum-margin direction and an optimal offset that control implicit bias.
- RRP offers provable risk and parameter convergence rates, with directional convergence at O(ln ln t/ln t) and offset convergence at O((ln t)²/√t) under gradient descent.
The Ray Regression Predictor (RRP) is a geometric and analytic framework characterizing the asymptotic trajectory of parameter iterates when logistic regression is trained with first-order methods, particularly gradient descent, on arbitrary data. RRP formalizes the exact subspace, maximum-margin direction, and strongly convex offset controlling the implicit bias and parameter convergence rates under logistic (or exponential) risk minimization. The RRP is defined as a unique affine ray in parameter space determined by a data-dependent decomposition and possesses provable parameter and risk convergence properties that hold for general linearly separable and nonseparable regimes (Ji et al., 2018).
1. Geometric Structure and Definition
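RRP arises from a structural decomposition of the design matrix $A \in \mathbb{R}^{n \times d}$, defined for labeled instances $(x_i, y_i)_{i=1}^n$ with $y_i \in \{-1, +1\}$, as the matrix whose $i$-th row is $-y_i x_i^\top$. The rows of $A$ are partitioned into:
- Strongly Convex Part, $A_c$: the remaining rows, forming the maximal subset on which the risk function is strongly convex.
- Separable Part, $A_s$: consists of all rows $i$ for which there exists $u$ with $Au \le 0$ and $(Au)_i < 0$.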
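Define $S := \operatorname{span}(A_c^\top)$ and let $S^\perp$ be its orthogonal complement, yielding an orthogonal decomposition of parameter space. $A_\perp$ is proven to be linearly separable within $S^\perp$. The corresponding strict margin is
$$\gamma := -\min_{u \in S^\perp,\ \|u\| = 1}\ \max_i\ (A_\perp u)_i > 0,$$
where $A_\perp := A_s \Pi_\perp$ is $A_s$ projected onto $S^\perp$. The unique maximum-margin separator within this subspace is
$$\bar u = -\frac{A_\perp^\top \bar q}{\gamma}$$
for any dual optimum $\bar q$. On $S$, the unique strongly convex risk minimizer is
$$\bar v := \arg\min_{v \in S} R_c(v),$$
where $R_c$ denotes the empirical risk restricted to the rows of $A_c$.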
Ray Regression Predictor (RRP): The RRP is defined as
$$\{\bar v + r\,\bar u \;:\; r \ge 0\}.$$
Gradient descent iterates $w_t$ satisfy, for large $t$,
$$w_t \approx \bar v + \bar u\,\frac{\ln t}{\gamma},$$
thus tracking the RRP ray in direction and offset.
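As a small worked illustration (a hypothetical four-row dataset chosen for this note, not taken from the source), the decomposition can be carried out by hand. It assumes the rows of $A$ are $-y_i x_i^\top$ and the loss is logistic, $\ell(z) = \ln(1 + e^z)$:

$$
\begin{aligned}
A &= \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ -1 & 0 \\ 0.6 & -0.8 \end{pmatrix}, \qquad
Au \le 0 \ \Rightarrow\ u_1 \le 0 \text{ and } -u_1 \le 0 \ \Rightarrow\ u_1 = 0, \\
&\text{so } (Au)_i = 0 \text{ for } i = 1, 2, 3 \ (\text{rows of } A_c), \qquad u = (0, 1) \text{ gives } (Au)_4 = -0.8 < 0 \ (\text{row of } A_s), \\
S &= \operatorname{span}\{e_1\}, \qquad S^\perp = \operatorname{span}\{e_2\}, \qquad A_\perp = \begin{pmatrix} 0 & -0.8 \end{pmatrix}, \qquad \gamma = 0.8, \qquad \bar u = e_2, \\
R_c(v) &= \tfrac{1}{4}\big(2\ln(1 + e^{v_1}) + \ln(1 + e^{-v_1})\big), \qquad
\frac{\partial R_c}{\partial v_1} = \frac{1}{4}\cdot\frac{2e^{v_1} - 1}{1 + e^{v_1}} = 0 \ \Rightarrow\ \bar v = (-\ln 2,\ 0), \\
\text{RRP} &= \{(-\ln 2,\ r) \;:\; r \ge 0\}.
\end{aligned}
$$

The ray is vertical and misses the origin: the offset $-\ln 2$ comes entirely from the non-separable rows, while the direction $e_2$ comes entirely from the separable row.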
2. Risk Convergence Foundations
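For empirical logistic or exponential risk
$$R(w) := \frac{1}{n}\sum_{i=1}^{n} \ell\big((Aw)_i\big), \qquad \ell(z) \in \{\ln(1 + e^z),\ e^z\},$$
with $\|x_i\| \le 1$ and gradient descent $w_{j+1} := w_j - \eta_j \nabla R(w_j)$, risk convergence is established using: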
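- Magic Smooth-Descent Lemma: For $\beta$-smoothness, if $\eta_j \le 1/\beta$, then for any $z \in \mathbb{R}^d$,
$$2\Big(\sum_{j<t}\eta_j\Big)\big(R(w_t) - R(z)\big) \le \|w_0 - z\|^2 - \|w_t - z\|^2.$$
- Fixed-Direction Rate Lemma: Setting $z := \bar v + \bar u\,\ln(t)/\gamma$,
$$R(z) \le \inf_w R(w) + \frac{\exp(\|\bar v\|)}{t}, \qquad \|z\|^2 = \|\bar v\|^2 + \frac{\ln(t)^2}{\gamma^2}.$$
Combining these, risk convergence for $w_0 = 0$ yields
$$R(w_t) - \inf_w R(w) \le \frac{\exp(\|\bar v\|)}{t} + \frac{\|\bar v\|^2 + \ln(t)^2/\gamma^2}{2\sum_{j<t}\eta_j},$$
which is $O(\ln(t)^2/t)$ for constant step sizes and $O(\ln(t)^2/\sqrt{t})$ for $\eta_j = 1/\sqrt{j+1}$.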
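The combined bound can be checked numerically. The following is a minimal sketch, assuming the bound takes the form $R(w_t) - \inf R \le \exp(\|\bar v\|)/t + (\|\bar v\|^2 + \ln(t)^2/\gamma^2)/(2\sum_{j<t}\eta_j)$ and using a hypothetical fully separable dataset (so $\bar v = 0$ and $\inf R = 0$) whose margin $\gamma = 0.8$ is known by construction. The helper names are illustrative only.

```python
import numpy as np

def logistic_risk(A, w):
    # R(w) = (1/n) * sum_i ln(1 + exp((A w)_i)); rows of A are -y_i x_i^T
    return np.mean(np.logaddexp(0.0, A @ w))

def logistic_grad(A, w):
    # Gradient of R: (1/n) * A^T sigmoid(A w)
    s = 1.0 / (1.0 + np.exp(-(A @ w)))
    return A.T @ s / A.shape[0]

# Hypothetical separable data (assumption): row norms <= 1, and the first two
# rows are symmetric about the direction (1, 0), so the margin is gamma = 0.8.
A = np.array([[-0.8, -0.6],
              [-0.8,  0.6],
              [-0.9, -0.3]])
gamma = 0.8

w = np.zeros(2)                          # w_0 = 0
gaps, bounds = [], []
for t in range(1, 2001):
    w = w - 1.0 * logistic_grad(A, w)    # constant step size eta_j = 1
    gaps.append(logistic_risk(A, w))     # risk gap, since inf R = 0 here
    # exp(|v_bar|)/t + (|v_bar|^2 + ln(t)^2/gamma^2) / (2 * sum eta_j), v_bar = 0
    bounds.append(1.0 / t + np.log(t) ** 2 / (gamma ** 2 * 2.0 * t))

print(gaps[-1], bounds[-1])
```

The step size 1 is admissible here because the logistic risk with row norms at most 1 is smooth with constant at most 1/4.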
3. Parameter Convergence Theorems
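3.1 Offset Convergence on $S$
For the $S$-component, let $\lambda > 0$ be the modulus of strong convexity of $R_c$ over the bounded region of $S$ visited by the projected iterates $\Pi_S w_t$. It follows that for $t \ge 1$ and arbitrary step-size sequence,
$$\|\Pi_S w_t - \bar v\|^2 \le \frac{2}{\lambda}\Big(R(w_t) - \inf_w R(w)\Big).$$
For $\eta_j = 1/\sqrt{j+1}$, this yields the offset-convergence theorem:
$$\|\Pi_S w_t - \bar v\|^2 = O\!\left(\frac{\ln(t)^2}{\sqrt{t}}\right).$$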
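Offset recovery can be observed on a small example. The sketch below is illustrative only: it assumes a hypothetical four-row dataset (rows already in the form $-y_i x_i^\top$) whose first three rows are non-separable and span $S = \operatorname{span}\{e_1\}$, so that for the logistic loss the offset works out by hand to $\bar v = (-\ln 2, 0)$. It runs gradient descent from $w_0 = 0$ with $\eta_j = 1/\sqrt{j+1}$ and records the squared distance between $\Pi_S w_t$ and $\bar v$.

```python
import numpy as np

def logistic_grad(A, w):
    # Gradient of R(w) = (1/n) * sum_i ln(1 + exp((A w)_i))
    s = 1.0 / (1.0 + np.exp(-(A @ w)))
    return A.T @ s / A.shape[0]

# Hypothetical data (assumption): rows 1-3 cannot be separated, row 4 can.
A = np.array([[ 1.0,  0.0],
              [ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.6, -0.8]])
P_S = np.array([[1.0, 0.0],
                [0.0, 0.0]])                 # orthogonal projector onto S = span{e_1}
v_bar = np.array([-np.log(2.0), 0.0])        # minimizer of the risk over rows 1-3

w = np.zeros(2)                              # w_0 = 0
offset_err = {}
for t in range(1, 20001):
    w = w - logistic_grad(A, w) / np.sqrt(t) # eta_j = 1/sqrt(j+1) with j = t-1
    if t in (100, 1000, 20000):
        offset_err[t] = np.linalg.norm(P_S @ w - v_bar) ** 2

print(offset_err)
```

Early iterates overshoot $\bar v$ because the separable row still contributes a non-negligible gradient along $S$. That contribution fades as the component of $w_t$ in $S^\perp$ grows.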
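3.2 Directional Convergence on $S^\perp$
For the $S^\perp$ component, norm growth is established: $\|\Pi_\perp w_t\|$ grows without bound, at a rate logarithmic in $t$. Using a Fenchel–Young argument, for fully separable $A$ (i.e., $A_c = \emptyset$, so that $S^\perp = \mathbb{R}^d$) and appropriate step sizes,
$$1 - \frac{\langle w_t, \bar u\rangle}{\|w_t\|} = O\!\left(\frac{\ln\ln t}{\ln t}\right),$$
i.e.,
$$\left\|\frac{w_t}{\|w_t\|} - \bar u\right\|^2 = O\!\left(\frac{\ln\ln t}{\ln t}\right).$$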
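The slow directional convergence is easy to see numerically. This is a minimal sketch on a hypothetical fully separable dataset (rows in the form $-y_i x_i^\top$): the first two rows are symmetric support vectors, so the maximum-margin direction is $\bar u = (1, 0)$, while the third row pulls early iterates off that direction. The constant step size 1 is an assumption made for the illustration.

```python
import numpy as np

# Hypothetical separable data (assumption); max-margin direction is (1, 0).
A = np.array([[-0.8, -0.6],
              [-0.8,  0.6],
              [-0.9, -0.3]])
u_bar = np.array([1.0, 0.0])

w = np.zeros(2)                                  # w_0 = 0
dir_err = {}
for t in range(1, 10001):
    s = 1.0 / (1.0 + np.exp(-(A @ w)))           # sigmoid of each row's score
    w = w - A.T @ s / A.shape[0]                 # gradient step, eta_j = 1
    if t in (10, 100, 1000, 10000):
        direction = w / np.linalg.norm(w)
        dir_err[t] = np.linalg.norm(direction - u_bar) ** 2

for t, err in dir_err.items():
    # Reference scale ln ln t / ln t, shown for comparison only
    print(t, err, np.log(np.log(t)) / np.log(t))
```

Even after $10^4$ steps the direction is only approximately aligned with $\bar u$, which is consistent with a rate that is logarithmic in $t$ and not polynomial.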
4. Practical Construction of the RRP
The construction of the Ray Regression Predictor in practice may be summarized by the following workflow:
- Separable Subset Identification: Employ a greedy separability test on each example to construct the separable subset $A_s$, corresponding to the partition $(A_s, A_c)$.
- Offset Computation: Compute $\bar v = \arg\min_{v \in S} R_c(v)$ using any standard solver for convex minimization.
- Maximum-Margin Direction: Compute
$$\bar u = \arg\min_{u \in S^\perp,\ \|u\| = 1}\ \max_i\ (A_\perp u)_i$$
for the maximum-margin separator within $S^\perp$.
- Final RRP: The RRP is then the ray $\{\bar v + r\,\bar u : r \ge 0\}$. Gradient descent (with constant or decaying steps) on the empirical risk automatically yields iterates tracking the RRP: the offset $\bar v$ is recovered first at rate $O(\ln(t)^2/\sqrt{t})$, with directional convergence to $\bar u$ at rate $O(\ln\ln t/\ln t)$.
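The first step of this workflow can be sketched with one linear program per row. This is an assumption about how the separability test might be implemented, not the source's own procedure: row $i$ is placed in the separable part when some $u$ satisfies $Au \le 0$ and $(Au)_i \le -1$, which is equivalent to $(Au)_i < 0$ after rescaling $u$. The function names are hypothetical. The calls used are `scipy.optimize.linprog` and `numpy.linalg.pinv`.

```python
import numpy as np
from scipy.optimize import linprog

def separable_rows(A):
    """Boolean mask: True where row i admits u with A u <= 0 and (A u)_i < 0."""
    n, d = A.shape
    mask = np.zeros(n, dtype=bool)
    for i in range(n):
        # Feasibility LP with zero objective: A u <= 0 and a_i . u <= -1.
        res = linprog(c=np.zeros(d),
                      A_ub=np.vstack([A, A[i:i + 1]]),
                      b_ub=np.append(np.zeros(n), -1.0),
                      bounds=[(None, None)] * d)   # u is unconstrained in sign
        mask[i] = (res.status == 0)                # status 0: feasible point found
    return mask

def row_space_projector(M, d):
    """Orthogonal projector onto the span of the rows of M (zero if M is empty)."""
    if M.shape[0] == 0:
        return np.zeros((d, d))
    return np.linalg.pinv(M) @ M

# Hypothetical data (assumption): rows already in the form -y_i x_i^T.
A = np.array([[ 1.0,  0.0],
              [ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.6, -0.8]])
mask = separable_rows(A)
A_s, A_c = A[mask], A[~mask]
P_S = row_space_projector(A_c, A.shape[1])         # projector onto S
A_perp = A_s @ (np.eye(A.shape[1]) - P_S)          # rows of A_s projected onto S-perp
print(mask, A_perp)
```

With the partition in hand, the offset and direction steps reduce to a convex minimization over $S$ and a hard-margin problem on `A_perp`, for which any standard solver can be used.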
5. Summary of Key Rates and Theoretical Guarantees
The table summarizes the principal rates for risk and parameter convergence:
| Quantity | Convergence Rate | Conditions |
|---|---|---|
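| $R(w_t) - \inf_w R(w)$ | $O(\ln(t)^2/\sqrt{t})$ | $\eta_j = 1/\sqrt{j+1}$ |
| $\lVert \Pi_S w_t - \bar v \rVert^2$ | $O(\ln(t)^2/\sqrt{t})$ | Strongly convex $S$ |
| $\lVert w_t/\lVert w_t \rVert - \bar u \rVert^2$ | $O(\ln\ln t/\ln t)$ | Fully separable $A$ |
| $w_t$ | $\approx \bar v + \bar u \ln(t)/\gamma$ | Asymptotic, under above |
These results hold for gradient descent initialization at $w_0 = 0$, with either constant or inverse-root step size as specified, and for the empirical logistic or exponential loss. The RRP fully explicates the implicit bias and convergence path of iterates in high-dimensional, possibly partially or fully separable logistic regression tasks (Ji et al., 2018).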