Feasible Learning (2501.14912v1)

Published 24 Jan 2025 in cs.LG and cs.AI

Abstract: We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance on every individual data point. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in LLMs, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.

Summary

  • The paper proposes Feasible Learning (FL), a sample-centric paradigm that frames machine learning as a feasibility problem solved via primal-dual optimization to satisfy per-sample loss constraints.
  • FL and its relaxed version, Resilient Feasible Learning (RFL), achieve comparable average performance to ERM while demonstrating improved tail behavior by reducing the number of samples with high losses.
  • Primal-dual algorithms are introduced for solving FL and RFL, yielding sample weights (Lagrange multipliers) that correlate with training difficulty and provide insights into data importance.

The paper introduces Feasible Learning (FL), a sample-centric learning paradigm that formulates learning as a feasibility problem. Instead of optimizing for average performance like Empirical Risk Minimization (ERM), FL seeks a predictor that satisfies a bounded loss constraint for each training sample. The paper also introduces Resilient Feasible Learning (RFL), a relaxation of FL that incorporates slack variables of minimal norm to address the challenge of setting a meaningful threshold and to handle potential infeasibility in practical applications.

The key aspects of the FL paradigm are:

  • It is sample-centric, requiring satisfactory performance across all training samples, unlike ERM which optimizes average performance.
  • Any model that meets the loss constraints is a valid solution; FL does not inherently favor one such model over another, so the optimization dynamics shape which solution is found.
  • It induces functional regularization, preventing overfitting by not demanding loss reduction beyond a specified threshold.

The authors adopt a primal-dual optimization approach to solve FL problems, which dynamically re-weights each sample's importance during training; the weight of each sample corresponds to the Lagrange multiplier associated with its constraint. RFL, in turn, is shown to be equivalent to a non-convex, strongly-concave min-max problem. Primal-dual algorithms are provided for solving both FL and RFL problems.

The paper makes the following contributions:

  • It proposes FL, which frames learning as a constraint satisfaction problem.
  • It introduces RFL, a relaxation of FL that addresses potential infeasibility issues, and proves that RFL is equivalent to a non-convex, strongly-concave min-max problem.
  • It provides primal-dual algorithms for solving FL and RFL problems.
  • It empirically explores FL and RFL problems and their solutions.

In Section 2, the Feasible Learning paradigm advocates learning by solving a feasibility problem. Specifically, FL considers an optimization problem with a trivial, constant objective while enforcing a loss constraint on each training sample:

$$\min_{\theta \in \Theta} \; 0 \quad \text{s.t.} \quad g(\theta) \le \boldsymbol{\epsilon},$$

where $\boldsymbol{\epsilon} = \epsilon \mathbb{1}$ is the constraint level.

  • $\theta$ represents the parameters of the predictor.
  • $\Theta$ is the parameter space.
  • $g(\theta) = [g_1(\theta), \ldots, g_n(\theta)]^\top$ is the vector of per-sample losses, where $g_i(\theta)$ is the loss incurred on the $i$-th data point.
  • $\epsilon$ is the maximum allowed per-sample loss.
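
To make the criterion concrete, here is a minimal NumPy sketch of the per-sample feasibility check; the loss values and threshold are illustrative, not taken from the paper:

```python
import numpy as np

def is_feasible(losses: np.ndarray, epsilon: float) -> bool:
    """Check the FL constraint g_i(theta) <= epsilon for every sample."""
    return bool(np.all(losses <= epsilon))

# Any model meeting the threshold on all samples is a valid FL solution.
losses = np.array([0.12, 0.05, 0.31])    # hypothetical g(theta) on 3 samples
print(is_feasible(losses, epsilon=0.3))  # False: the third sample violates it
```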

In Section 2.1, the authors solve FL problems by leveraging Lagrangian duality. The min-max Lagrangian game associated with the problem above is

$$\min_{\theta \in \Theta} \, \max_{\lambda \ge 0} \; L(\theta, \lambda) = \lambda^\top \big( g(\theta) - \boldsymbol{\epsilon} \big),$$

where $\lambda \ge 0$ is the vector of Lagrange multipliers associated with the constraints.

  • $\lambda$ represents the dual variables (Lagrange multipliers).
  • $L(\theta, \lambda)$ is the Lagrangian function.

Alternating Gradient Descent Ascent (GDA) updates yield:

$$\lambda_{t+1} \leftarrow \Big[ \lambda_t + \eta_\lambda \underbrace{\big( g(\theta_t) - \boldsymbol{\epsilon} \big)}_{\nabla_{\lambda} L(\theta_t, \lambda_t)} \Big]_+$$

$$\theta_{t+1} \leftarrow \theta_t - \eta_\theta \bigg[ \underbrace{\sum_{i=1}^n \lambda_{t+1}^{(i)} \, \nabla_{\theta} g_i(\theta_t)}_{\nabla_{\theta} L(\theta_t, \lambda_{t+1})} \bigg],$$

where $[\,\cdot\,]_+$ denotes a projection onto $\mathbb{R}_{\ge 0}^n$ to enforce $\lambda \ge 0$, and $\eta_\theta, \eta_\lambda$ are step sizes.

  • $\lambda_{t+1}$ is the updated dual variable at time $t+1$.
  • $\lambda_t$ is the dual variable at time $t$.
  • $\eta_\lambda$ is the step size for updating the dual variable.
  • $g(\theta_t)$ is the loss vector evaluated at $\theta_t$.
  • $\boldsymbol{\epsilon}$ is the constraint level.
  • $\theta_{t+1}$ is the updated primal variable at time $t+1$.
  • $\theta_t$ is the primal variable at time $t$.
  • $\eta_\theta$ is the step size for updating the primal variable.
  • $\lambda_{t+1}^{(i)}$ is the $i$-th component of $\lambda_{t+1}$.
  • $\nabla_{\theta} g_i(\theta_t)$ is the gradient of the loss for the $i$-th data point with respect to $\theta$, evaluated at $\theta_t$.
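
These updates translate directly into a training loop. The following is a minimal NumPy sketch on a toy least-squares problem; the data, the loss $g_i(\theta) = (x_i^\top \theta - y_i)^2$, and the step sizes are illustrative stand-ins, not the paper's experimental setup:

```python
import numpy as np

# Toy feasible problem: y comes from a ground-truth linear model plus small
# noise, so the per-sample losses g_i(theta) can be driven below epsilon.
rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.05 * rng.normal(size=n)

theta = np.zeros(d)
lam = np.zeros(n)                        # one Lagrange multiplier per sample
epsilon, eta_theta, eta_lam = 0.1, 1e-3, 1e-2

for t in range(2000):
    residuals = X @ theta - y
    g = residuals ** 2                   # per-sample losses g_i(theta)
    # Dual ascent, projected onto the nonnegative orthant [.]_+:
    lam = np.maximum(lam + eta_lam * (g - epsilon), 0.0)
    # Primal descent on the multiplier-weighted sum of loss gradients:
    grad_g = 2 * residuals[:, None] * X  # row i holds grad_theta g_i(theta)
    theta = theta - eta_theta * (lam @ grad_g)

# Expected to fall to ~epsilon or below on this feasible toy problem:
print("max per-sample loss:", ((X @ theta - y) ** 2).max())
```

Note how samples violating their constraint accumulate larger multipliers, and hence larger gradient weight, which is the dynamic re-weighting described in Section 2.1.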

In Section 3, the paper addresses potential misspecification of FL problems by relaxing the constraints with slack variables, denoted $u$. Given $\alpha > 0$, the following constrained optimization problem is considered:

$$\min_{\theta \in \Theta, \, u \ge 0} \; \frac{\alpha}{2} \|u\|^2 \quad \text{s.t.} \quad g(\theta) \le \boldsymbol{\epsilon} + u.$$

  • $u$ represents the slack variables.
  • $\alpha$ is a parameter that determines the cost of relaxing the constraints.

In Section 3.1, the authors solve RFL problems using the Lagrangian approach. The min-max Lagrangian game associated with the problem above is

$$\min_{\theta \in \Theta, \, u \ge 0} \, \max_{\lambda \ge 0} \; L(\theta, u, \lambda) = \frac{\alpha}{2} \|u\|^2 + \lambda^\top \big( g(\theta) - \boldsymbol{\epsilon} - u \big).$$

It is shown that the Lagrangian problem for RFL can be solved via a quadratically-regularized version of the FL Lagrangian:

$$\min_{\theta \in \Theta} \, \max_{\lambda \ge 0} \; L_{\alpha}(\theta, \lambda) = L(\theta, \lambda) - \frac{1}{2\alpha} \|\lambda\|^2,$$

where $L_{\alpha}$ is strongly concave in $\lambda$.
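
Operationally, the only change relative to the FL dual step is the extra $-\lambda_t / \alpha$ term from the gradient of the regularizer, which caps each multiplier at $\alpha [g_i(\theta) - \epsilon]_+$. A self-contained sketch of this modified update on a fixed, illustrative loss vector:

```python
import numpy as np

# RFL dual ascent: grad_lam L_alpha(theta, lam) = g - epsilon - lam / alpha.
# The shrinkage term keeps multipliers bounded even for unsatisfiable samples.
alpha, eta_lam, epsilon = 10.0, 1e-2, 0.1
g = np.array([0.30, 0.05, 0.80])   # fixed per-sample losses (illustrative)
lam = np.zeros_like(g)
for _ in range(5000):
    lam = np.maximum(lam + eta_lam * (g - epsilon - lam / alpha), 0.0)

print(lam)                                   # ~[2.0, 0.0, 7.0]
print(alpha * np.maximum(g - epsilon, 0.0))  # analytic cap alpha * [g - eps]_+
```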

The paper shows that:

$$\min_{\theta \in \Theta, \, u \ge 0} \, \max_{\lambda \ge 0} \; L(\theta, u, \lambda) = \min_{\theta \in \Theta} \; \frac{\alpha}{2} \left\| \left[ g(\theta) - \boldsymbol{\epsilon} \right]_+ \right\|^2$$
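
For a fixed $\theta$, the inner maximization is a concave quadratic in $\lambda$ with maximizer $\lambda^* = \alpha [g(\theta) - \boldsymbol{\epsilon}]_+$, which produces the penalty form above. A quick NumPy check of this closed form on an arbitrary loss vector (all numbers illustrative):

```python
import numpy as np

g = np.array([0.8, 0.2, 1.5])    # fixed per-sample losses at some theta
epsilon, alpha = 0.5, 2.0

# Saddle value via the maximizer of lam.(g - eps) - ||lam||^2 / (2 alpha):
lam_star = alpha * np.maximum(g - epsilon, 0.0)
value_dual = lam_star @ (g - epsilon) - (lam_star @ lam_star) / (2 * alpha)

# Penalty form: (alpha / 2) * ||[g - eps]_+||^2
value_penalty = 0.5 * alpha * (np.maximum(g - epsilon, 0.0) ** 2).sum()

print(np.isclose(value_dual, value_penalty))  # True
```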

The paper evaluates the Feasible Learning framework empirically and demonstrates that FL and RFL present a compelling alternative to the widely used ERM framework. Models trained via FL can learn from data, and in problems where infeasibility leads to poor optimization dynamics for FL, RFL alleviates the issue and achieves good performance. FL also produces a more concentrated loss distribution across training and test samples, resulting in fewer instances with excessively high losses.

The paper conducts experiments on:

  • CIFAR10 image classification using ResNet-18 models.
  • UTKFace age regression using ResNet-18 models.
  • Preference optimization in LLMs, fine-tuning an 8-billion-parameter Llama-3.1 model on a cleaned version of the Intel Orca DPO pairs dataset.
  • Two-Moons classification using Multi-Layer Perceptrons.

The experimental results support the claim that FL achieves comparable average performance to ERM while providing improved tail behavior; specifically, FL yields a less heavy-tailed loss distribution than ERM. Moreover, the Lagrange multipliers correlate with the difficulty of fitting each sample, offering insight into the importance of individual data points.
