Feasible Learning (2501.14912v1)

Published 24 Jan 2025 in cs.LG and cs.AI

Abstract: We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance on every individual data point. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in LLMs, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.

Summary

  • The paper proposes Feasible Learning (FL), a sample-centric paradigm that frames machine learning as a feasibility problem solved via primal-dual optimization to satisfy per-sample loss constraints.
  • FL and its relaxed version, Resilient Feasible Learning (RFL), achieve comparable average performance to ERM while demonstrating improved tail behavior by reducing the number of samples with high losses.
  • Primal-dual algorithms are introduced for solving FL and RFL, yielding sample weights (Lagrange multipliers) that correlate with training difficulty and provide insights into data importance.

The paper introduces Feasible Learning (FL), a sample-centric learning paradigm that formulates learning as a feasibility problem. Instead of optimizing for average performance like Empirical Risk Minimization (ERM), FL seeks a predictor that satisfies a bounded loss constraint for each training sample. The paper also introduces Resilient Feasible Learning (RFL), a relaxation of FL that incorporates slack variables of minimal norm to address the challenge of setting a meaningful threshold and to handle potential infeasibility in practical applications.

The key aspects of the FL paradigm are:

  • It is sample-centric, requiring satisfactory performance across all training samples, unlike ERM which optimizes average performance.
  • Any model that meets the loss constraints is a valid solution; FL does not inherently favor one such model over another, so the optimization dynamics shape which solution is found.
  • It induces functional regularization, preventing overfitting by not demanding loss reduction beyond a specified threshold.

The authors adopt a primal-dual optimization approach to solve FL problems, which dynamically re-weights each sample's importance during training; the weight of each sample corresponds to the Lagrange multiplier associated with its constraint. RFL, in turn, is shown to be equivalent to a non-convex, strongly-concave min-max problem. Primal-dual algorithms are provided for solving both FL and RFL problems.

The paper makes the following contributions:

  • It proposes FL, which frames learning as a constraint satisfaction problem.
  • It introduces RFL, a relaxation of FL that addresses potential infeasibility issues, and proves that RFL is equivalent to a non-convex, strongly-concave min-max problem.
  • It provides primal-dual algorithms for solving FL and RFL problems.
  • It empirically explores FL and RFL problems and their solutions.

In Section 2, the Feasible Learning paradigm advocates learning by solving a feasibility problem. Specifically, FL considers an optimization problem with a trivial, constant objective while enforcing a loss constraint on each training sample:

$$\min_{\theta \in \Theta} \; 0 \quad \text{s.t.} \quad g(\theta) \le \boldsymbol{\epsilon},$$

where $\boldsymbol{\epsilon} = \epsilon \mathbb{1}$ is the constraint level.

  • $\theta$ represents the parameters of the predictor.
  • $\Theta$ is the parameter space.
  • $g(\theta) = [g_1(\theta), \ldots, g_n(\theta)]^\top$ is the vector of per-sample losses, where $g_i(\theta)$ is the loss incurred on the $i$-th data point.
  • $\epsilon$ is the maximum allowed per-sample loss.
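
To make the criterion concrete, here is a minimal NumPy sketch of the per-sample feasibility check; the loss values and threshold are illustrative, not taken from the paper:

```python
import numpy as np

def is_feasible(losses: np.ndarray, epsilon: float) -> bool:
    """Check the FL constraint g_i(theta) <= epsilon for every sample."""
    return bool(np.all(losses <= epsilon))

# Any model meeting the threshold on all samples is a valid FL solution.
losses = np.array([0.12, 0.05, 0.31])    # hypothetical g(theta) on 3 samples
print(is_feasible(losses, epsilon=0.3))  # False: the third sample violates it
```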

In Section 2.1, the authors solve FL problems by leveraging Lagrangian duality. The min-max Lagrangian game associated with the problem above is

$$\min_{\theta \in \Theta} \, \max_{\lambda \ge 0} \; L(\theta, \lambda) = \lambda^\top \big( g(\theta) - \boldsymbol{\epsilon} \big),$$

where $\lambda \ge 0$ is the vector of Lagrange multipliers associated with the constraints.

  • $\lambda$ represents the dual variables (Lagrange multipliers).
  • $L(\theta, \lambda)$ is the Lagrangian function.

Alternating Gradient Descent Ascent (GDA) updates yield:

$$\lambda_{t+1} \leftarrow \Big[ \lambda_t + \eta_\lambda \underbrace{\big( g(\theta_t) - \boldsymbol{\epsilon} \big)}_{\nabla_{\lambda} L(\theta_t, \lambda_t)} \Big]_+$$

$$\theta_{t+1} \leftarrow \theta_t - \eta_\theta \bigg[ \underbrace{\sum_{i=1}^n \lambda_{t+1}^{(i)} \, \nabla_{\theta} g_i(\theta_t)}_{\nabla_{\theta} L(\theta_t, \lambda_{t+1})} \bigg],$$

where $[\,\cdot\,]_+$ denotes a projection onto $\mathbb{R}_{\ge 0}^n$ to enforce $\lambda \ge 0$, and $\eta_\theta, \eta_\lambda$ are step sizes.

  • $\lambda_{t+1}$ is the updated dual variable at time $t+1$.
  • $\lambda_t$ is the dual variable at time $t$.
  • $\eta_\lambda$ is the step size for updating the dual variable.
  • $g(\theta_t)$ is the loss vector evaluated at $\theta_t$.
  • $\boldsymbol{\epsilon}$ is the constraint level.
  • $\theta_{t+1}$ is the updated primal variable at time $t+1$.
  • $\theta_t$ is the primal variable at time $t$.
  • $\eta_\theta$ is the step size for updating the primal variable.
  • $\lambda_{t+1}^{(i)}$ is the $i$-th component of $\lambda_{t+1}$.
  • $\nabla_{\theta} g_i(\theta_t)$ is the gradient of the loss for the $i$-th data point with respect to $\theta$, evaluated at $\theta_t$.
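
These updates translate directly into a training loop. The following is a minimal NumPy sketch on a toy least-squares problem; the data, the loss $g_i(\theta) = (x_i^\top \theta - y_i)^2$, and the step sizes are illustrative stand-ins, not the paper's experimental setup:

```python
import numpy as np

# Toy feasible problem: y comes from a ground-truth linear model plus small
# noise, so the per-sample losses g_i(theta) can be driven below epsilon.
rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.05 * rng.normal(size=n)

theta = np.zeros(d)
lam = np.zeros(n)                        # one Lagrange multiplier per sample
epsilon, eta_theta, eta_lam = 0.1, 1e-3, 1e-2

for t in range(2000):
    residuals = X @ theta - y
    g = residuals ** 2                   # per-sample losses g_i(theta)
    # Dual ascent, projected onto the nonnegative orthant [.]_+:
    lam = np.maximum(lam + eta_lam * (g - epsilon), 0.0)
    # Primal descent on the multiplier-weighted sum of loss gradients:
    grad_g = 2 * residuals[:, None] * X  # row i holds grad_theta g_i(theta)
    theta = theta - eta_theta * (lam @ grad_g)

# Expected to fall to ~epsilon or below on this feasible toy problem:
print("max per-sample loss:", ((X @ theta - y) ** 2).max())
```

Note how samples violating their constraint accumulate larger multipliers, and hence larger gradient weight, which is the dynamic re-weighting described in Section 2.1.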

In Section 3, the paper addresses potential misspecification of FL problems by relaxing the constraints with slack variables, denoted $u$. Given $\alpha > 0$, the following constrained optimization problem is considered:

$$\min_{\theta \in \Theta, \, u \ge 0} \; \frac{\alpha}{2} \|u\|^2 \quad \text{s.t.} \quad g(\theta) \le \boldsymbol{\epsilon} + u.$$

  • $u$ represents the slack variables.
  • $\alpha$ is a parameter that determines the cost of relaxing the constraints.

In Section 3.1, the authors solve RFL problems using the Lagrangian approach. The min-max Lagrangian game associated with the problem above is

$$\min_{\theta \in \Theta, \, u \ge 0} \, \max_{\lambda \ge 0} \; L(\theta, u, \lambda) = \frac{\alpha}{2} \|u\|^2 + \lambda^\top \big( g(\theta) - \boldsymbol{\epsilon} - u \big).$$

It is shown that the Lagrangian problem for RFL can be solved via a quadratically-regularized version of the FL Lagrangian:

$$\min_{\theta \in \Theta} \, \max_{\lambda \ge 0} \; L_{\alpha}(\theta, \lambda) = L(\theta, \lambda) - \frac{1}{2\alpha} \|\lambda\|^2,$$

where $L_{\alpha}$ is strongly concave in $\lambda$.
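
Operationally, the only change relative to the FL dual step is the extra $-\lambda_t / \alpha$ term from the gradient of the regularizer, which caps each multiplier at $\alpha [g_i(\theta) - \epsilon]_+$. A self-contained sketch of this modified update on a fixed, illustrative loss vector:

```python
import numpy as np

# RFL dual ascent: grad_lam L_alpha(theta, lam) = g - epsilon - lam / alpha.
# The shrinkage term keeps multipliers bounded even for unsatisfiable samples.
alpha, eta_lam, epsilon = 10.0, 1e-2, 0.1
g = np.array([0.30, 0.05, 0.80])   # fixed per-sample losses (illustrative)
lam = np.zeros_like(g)
for _ in range(5000):
    lam = np.maximum(lam + eta_lam * (g - epsilon - lam / alpha), 0.0)

print(lam)                                   # ~[2.0, 0.0, 7.0]
print(alpha * np.maximum(g - epsilon, 0.0))  # analytic cap alpha * [g - eps]_+
```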

The paper shows that:

$$\min_{\theta \in \Theta, \, u \ge 0} \, \max_{\lambda \ge 0} \; L(\theta, u, \lambda) = \min_{\theta \in \Theta} \; \frac{\alpha}{2} \left\| \left[ g(\theta) - \boldsymbol{\epsilon} \right]_+ \right\|^2$$
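
For a fixed $\theta$, the inner maximization is a concave quadratic in $\lambda$ with maximizer $\lambda^* = \alpha [g(\theta) - \boldsymbol{\epsilon}]_+$, which produces the penalty form above. A quick NumPy check of this closed form on an arbitrary loss vector (all numbers illustrative):

```python
import numpy as np

g = np.array([0.8, 0.2, 1.5])    # fixed per-sample losses at some theta
epsilon, alpha = 0.5, 2.0

# Saddle value via the maximizer of lam.(g - eps) - ||lam||^2 / (2 alpha):
lam_star = alpha * np.maximum(g - epsilon, 0.0)
value_dual = lam_star @ (g - epsilon) - (lam_star @ lam_star) / (2 * alpha)

# Penalty form: (alpha / 2) * ||[g - eps]_+||^2
value_penalty = 0.5 * alpha * (np.maximum(g - epsilon, 0.0) ** 2).sum()

print(np.isclose(value_dual, value_penalty))  # True
```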

The paper evaluates the Feasible Learning framework empirically and demonstrates that FL and RFL present a compelling alternative to the widely used ERM framework. Models trained via FL can learn from data, and in problems where infeasibility leads to poor optimization dynamics for FL, RFL alleviates the issue and achieves good performance. FL also produces a more concentrated loss distribution across training and test samples, resulting in fewer instances with excessively high losses.

The paper conducts experiments on:

  • CIFAR10 image classification using ResNet-18 models.
  • UTKFace age regression using ResNet-18 models.
  • Preference optimization in LLMs, fine-tuning an 8-billion-parameter Llama-3.1 model on a cleaned version of the Intel Orca DPO pairs dataset.
  • Two-Moons classification using Multi-Layer Perceptrons.

The experimental results support the claim that FL achieves comparable average performance to ERM while providing improved tail behavior; specifically, FL yields a less heavy-tailed loss distribution than ERM. Moreover, the Lagrange multipliers correlate with the difficulty of fitting each sample, offering insight into the importance of individual data points.
