- The paper introduces EvoMAL, a framework that automatically evolves interpretable symbolic loss functions by combining genetic programming for structure search with gradient-based parameter tuning.
- It demonstrates superior or comparable performance to handcrafted and parametric loss methods across various regression and classification tasks.
- The study reveals that meta-learned loss functions can implicitly tune training dynamics and induce regularization effects, improving convergence and generalization.
This paper, "Learning Symbolic Model-Agnostic Loss Functions via Meta-Learning" (2022), introduces Evolved Model-Agnostic Loss (EvoMAL), a novel meta-learning framework designed to automatically discover high-performing and interpretable symbolic loss functions. The core idea is to combine the strengths of evolutionary computation (specifically Genetic Programming, or GP) for searching the symbolic structure of the loss function with gradient-based optimization for tuning its parameters. This hybrid approach aims to overcome the limitations of previous methods, which either rely on predefined parametric forms or are computationally intractable.
The traditional approach to machine learning relies heavily on handcrafted loss functions like Mean Squared Error or Cross-Entropy. While these are general-purpose, the No Free Lunch Theorems suggest that specializing components to specific task subclasses can improve performance. Loss function learning, a subfield of meta-learning, seeks to learn task-specific or domain-specific loss functions directly from data. Previous methods have typically fallen into two categories: gradient-based approaches that learn parameters of a fixed parametric form (like a neural network), which are often black-box and limited by the chosen architecture; and evolution-based approaches (like GP or CMA-ES) that can learn symbolic structures but have faced computational tractability issues, especially when combined with parameter optimization.
EvoMAL addresses these challenges through a two-phase meta-training process structured as a bilevel optimization problem:
- Symbolic Search (Outer Loop - GP): Genetic Programming searches over a space of mathematical operations and terminals (the model prediction fθ(x), the true label y, and constants) to discover candidate symbolic loss function structures, represented as expression trees. The paper proposes a function set (addition, subtraction, multiplication, protected division (AQ), min, max, sign, square, absolute value, protected log, protected square root, tanh) and a terminal set designed to be task- and model-agnostic and to ensure closure (avoiding undefined results such as NaN or Inf). To handle structural constraints (a loss function must take both fθ(x) and y as arguments), a correction strategy is employed. Optionally, a non-negative output constraint can be enforced via an activation function such as Softplus.
- Parameter Optimization (Inner Loop - Unrolled Differentiation): Once GP proposes a symbolic structure, it is transformed into a trainable "meta-loss network" by transposing the expression tree and parameterizing its edges. The weights ϕ of this meta-loss network are then optimized via unrolled differentiation: several gradient steps of base-model (fθ(x)) training with the current meta-loss function are simulated, and the meta-gradient of a task-specific performance metric (LT) is computed with respect to ϕ. This process effectively finds the optimal parameters ϕ for the given symbolic structure.
- Evaluation: The fitness of a candidate symbolic loss function, with its optimized parameters, is evaluated by training a base model on a task using the learned loss for a predetermined number of steps (S_testing). The fitness is the final performance-metric score (e.g., error rate).
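To make the inner loop concrete, here is a minimal sketch (not the paper's code) of unrolled differentiation on a one-parameter linear model f(x) = θ·x with a learned loss that has a single meta-parameter ϕ, i.e. ϕ·(f(x) − y)². One inner SGD step is unrolled, and the meta-gradient of the task loss with respect to ϕ is obtained by the chain rule and cross-checked with finite differences. All names (`phi`, `alpha`, etc.) are illustrative assumptions.

```python
def unrolled_meta_grad(theta, phi, x, y, alpha):
    """Analytic meta-gradient dL_T/dphi after ONE unrolled SGD step,
    for model f(x) = theta * x and learned loss phi * (f(x) - y)^2."""
    r = theta * x - y                       # residual before the inner step
    grad_theta = phi * 2.0 * r * x          # d(learned loss)/d(theta)
    theta_new = theta - alpha * grad_theta  # unrolled inner SGD step
    r_new = theta_new * x - y               # residual after the step
    # chain rule through the update: L_T = r_new^2,
    # d(theta_new)/d(phi) = -alpha * 2 * r * x
    return 2.0 * r_new * x * (-alpha * 2.0 * r * x)

def finite_diff_meta_grad(theta, phi, x, y, alpha, eps=1e-6):
    """Central finite-difference check of the same meta-gradient."""
    def task_loss(phi_):
        theta_new = theta - alpha * phi_ * 2.0 * (theta * x - y) * x
        return (theta_new * x - y) ** 2
    return (task_loss(phi + eps) - task_loss(phi - eps)) / (2.0 * eps)
```

In EvoMAL the differentiation is done automatically through several unrolled steps of neural-network training, but the mechanics are the same: the task metric is differentiated through the inner updates down to the meta-loss parameters ϕ.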
Several time-saving measures (filters) are incorporated into the EvoMAL framework to make the meta-training computationally feasible:
- Symbolic Equivalence Filter: Prevents re-evaluating loss functions with identical symbolic structures.
- Pre-Evaluation Filter (Poor Training Dynamics): A quick check based on optimizing predictions directly (without training the base network) is used to identify unpromising loss functions early and assign them worst-case fitness, avoiding costly full evaluations. This is based on checking the correlation between minimizing the candidate loss and minimizing the desired performance metric.
- Pre-Evaluation Filter (Gradient Equivalence): Identifies loss functions with near-identical gradient behavior based on gradient norms and caches their fitness.
- Partial Training Sessions: Fitness evaluation uses a truncated number of base training steps (S_testing = 500), based on the observation that early training performance correlates well with final performance.
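The symbolic equivalence filter can be sketched as a canonicalize-then-cache step: each candidate tree is reduced to a canonical form (e.g., by sorting the children of commutative operators) and used as a cache key, so structurally identical losses pay the expensive fitness evaluation only once. The nested-tuple tree encoding and function names below are assumptions for illustration, not the paper's implementation.

```python
def canonical(tree):
    """Recursively canonicalize an expression tree given as nested tuples
    ('op', child, ...), sorting the children of commutative operators."""
    if not isinstance(tree, tuple):
        return tree  # terminal: 'fx' (prediction), 'y' (label), or a constant
    op, *args = tree
    args = [canonical(a) for a in args]
    if op in ('add', 'mul', 'min', 'max'):  # commutative operators
        args = sorted(args, key=repr)
    return (op, *args)

fitness_cache = {}

def evaluate_with_filter(tree, expensive_eval):
    """Only pay for genuinely new structures; cache everything else."""
    key = canonical(tree)
    if key not in fitness_cache:
        fitness_cache[key] = expensive_eval(tree)
    return fitness_cache[key]
```

The gradient-equivalence filter works analogously, but keys the cache on observed gradient behavior rather than on symbolic structure.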
The authors conducted extensive experiments on diverse regression (Diabetes, Boston Housing, California Housing) and classification (MNIST, CIFAR-10, CIFAR-100, SVHN) tasks using various neural network architectures (MLP, Logistic Regression, LeNet-5, AlexNet, VGG-16, AllCNN-C, ResNet-18, PreResNet, WideResNet, SqueezeNet, PyramidNet). They compared EvoMAL against a baseline (handcrafted loss), ML3 (gradient-based, parametric), TaylorGLO (evolution-based, parametric Taylor polynomial), and GP-LFL (evolution-based GP without local search).
Key Results:
- Meta-Testing Performance: EvoMAL consistently achieved superior or comparable performance (lower error rate/MSE) compared to the baseline and other state-of-the-art loss function learning methods across most tasks and models in the direct learning setting. This demonstrates the effectiveness of the hybrid approach.
- Loss Function Transfer: EvoMAL and GP-LFL showed better meta-generalization capabilities when transferring learned loss functions from CIFAR-10 to CIFAR-100 compared to the parametric ML3 and TaylorGLO. This supports the hypothesis that symbolic representations generalize better than black-box parametric forms. However, transfer performance wasn't always better than the baseline, suggesting potential meta-overfitting to the source task and highlighting the need for meta-regularization in transfer settings.
- Meta-Training Performance & Run-Time: EvoMAL's hybrid search significantly improved search efficiency and effectiveness over GP-LFL, finding better solutions faster. While more computationally expensive than ML3 and GP-LFL, and slightly more expensive than TaylorGLO, EvoMAL is substantially more efficient than its purely evolution-based predecessor (GLO), making it feasible on commodity hardware. The time-saving filters played a crucial role in reducing computation, chiefly by skipping or caching evaluations of symbolically equivalent, functionally equivalent, or unpromising candidates.
Analysis of Meta-Learned Loss Functions:
- Analyzed learned symbolic structures revealed common patterns. For classification, variants resembling cross-entropy, focal loss (prioritizing hard samples), and functions inducing implicit label smoothing were observed. For regression, patterns suggesting robustness to outliers (e.g., using square root or thresholding) were common.
- Loss Landscapes: Contrary to prior findings, the analysis showed that EvoMAL can produce both flatter and sharper loss landscapes depending on the model, suggesting that landscape flatness alone may not fully explain the performance gains.
- Implicit Learning Rate Tuning: Experiments indicated that learned loss functions can implicitly scale the gradients, effectively tuning the learning rate. While this contributes to faster convergence (and better fitness in partial training), the performance gains over a well-tuned baseline loss function persist even when the baseline's learning rate is optimized, suggesting that learned losses provide a benefit beyond just implicit tuning.
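The implicit learning-rate effect follows directly from the chain rule: scaling a loss by a constant c scales every gradient by c, which is indistinguishable from multiplying the SGD step size by c. A tiny numeric illustration (not from the paper; names and values are arbitrary):

```python
def sgd(grad, theta, eta, steps):
    """Plain gradient descent on a scalar parameter."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# f(theta) = (theta - 3)^2 has gradient 2*(theta - 3);
# the "learned" loss c*f simply rescales that gradient by c.
c = 5.0
grad_f = lambda t: 2.0 * (t - 3.0)
grad_cf = lambda t: c * grad_f(t)

a = sgd(grad_cf, theta=0.0, eta=0.02, steps=10)      # scaled loss, base step size
b = sgd(grad_f,  theta=0.0, eta=c * 0.02, steps=10)  # base loss, rescaled step size
```

Both runs trace (numerically) the same trajectory, which is why the paper controls for this effect by re-tuning the baseline's learning rate before attributing gains to the learned loss itself.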
In conclusion, EvoMAL successfully combines symbolic search with gradient-based optimization to meta-learn interpretable and performant loss functions. It empirically outperforms existing methods and is computationally tractable for practical applications. Future work could explore applying the framework to learn other learning components (optimizers, activation functions) or learning loss functions jointly with neural network architectures to exploit co-adaptation. The analysis suggests that learned loss functions can induce regularization effects like label smoothing and implicitly tune training dynamics, offering insights into why they improve performance.