- The paper introduces a joint learning framework for energy-based models and log-partition functions that reformulates maximum likelihood estimation as a min-min optimization problem.
- It parameterizes both the energy and the log-partition as neural networks trained with stochastic gradient descent, bypassing the MCMC sampling usually needed to cope with the intractable partition function, and improves performance on tasks such as multilabel classification.
- Experimental results show that the approach recovers the MLE solution and enhances model generalization through a novel regularization effect.
Joint Learning of Energy-based Models and Their Partition Function
This paper investigates a method to jointly learn probabilistic energy-based models (EBMs) and their partition functions, providing theoretical results and practical implementations that scale to combinatorially large discrete output spaces. Developed by researchers at Google DeepMind, the approach addresses the intractability of exact maximum likelihood estimation (MLE) for EBMs by recasting it as a tractable optimization problem solvable with stochastic gradient descent (SGD), without relying on Markov chain Monte Carlo (MCMC) methods.
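For readers unfamiliar with the setting, the block below recalls the standard conditional EBM formulation and the MLE objective in which the intractable log-partition appears; this is generic background, and the notation may differ from the paper's.

```latex
% Standard conditional EBM setup (background; notation may differ from the paper's).
% The model scores each candidate output y with f_theta(x, y); normalizing requires
% the log-partition A_theta(x), a sum over the whole output space \mathcal{Y}.
p_\theta(y \mid x) = \exp\bigl( f_\theta(x, y) - A_\theta(x) \bigr),
\qquad
A_\theta(x) = \log \sum_{y' \in \mathcal{Y}} \exp f_\theta(x, y').

% Maximum likelihood minimizes the expected negative log-likelihood; the term
% A_theta(x) is what makes the objective intractable when \mathcal{Y} is combinatorial.
\min_\theta \; \mathbb{E}_{(x, y)} \bigl[ A_\theta(x) - f_\theta(x, y) \bigr]
```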
Key Contributions and Methodology
The primary innovation is to jointly learn an EBM's energy function and its log-partition, both parameterized as neural networks. The method rests on a novel min-min formulation of maximum likelihood in which a separate function, playing the role of a Lagrange multiplier, is introduced to handle the normalization constraint inherent to EBMs. The resulting loss couples the energy network and the log-partition network, and minimizing it jointly over both drives the model distribution toward the observed data; a schematic sketch of this idea follows.
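The paper's exact objective is not reproduced here, but one classical way to obtain such a min-min problem uses the identity log z = min_c { c + z * exp(-c) - 1 }, which is tight at c = log z: substituting the partition sum for z and learning c as a network turns MLE into a joint minimization trainable with SGD on samples from a simple proposal, with no MCMC. The sketch below illustrates that idea only; all names (energy_net, logz_net, the shapes, the uniform proposal) are hypothetical and not taken from the paper.

```python
import jax
import jax.numpy as jnp

# Toy instantiation (hypothetical): outputs y are binary vectors in {0,1}^k,
# the energy network is bilinear, and the log-partition network is linear in x.
def energy_net(theta, x, y):
    return y @ (theta @ x)                     # theta: (k, d), x: (d,), y: (k,)

def logz_net(phi, x):
    return phi @ x                             # phi: (d,)

def joint_loss(theta, phi, x, y, y_samples):
    """Surrogate upper bound on the negative log-likelihood of one example (x, y)."""
    c = logz_net(phi, x)                       # learned estimate of log Z_theta(x)
    f_data = energy_net(theta, x, y)           # score of the observed output
    f_samp = jax.vmap(lambda ys: energy_net(theta, x, ys))(y_samples)
    z_hat = jnp.mean(jnp.exp(f_samp))          # Monte Carlo estimate of the partition sum
                                               # (uniform proposal, up to a constant factor)
    # log z <= c + z * exp(-c) - 1 for every c, with equality at c = log z,
    # so minimizing jointly over (theta, phi) targets the same solution as exact MLE.
    return -f_data + c + z_hat * jnp.exp(-c) - 1.0

grad_fn = jax.grad(joint_loss, argnums=(0, 1))  # one SGD step updates both networks

# Example call with random toy data; no MCMC chain is needed anywhere.
theta = jax.random.normal(jax.random.PRNGKey(0), (3, 4))
phi = jnp.zeros(4)
x = jnp.ones(4)
y = jnp.array([1.0, 0.0, 1.0])
y_samples = jax.random.bernoulli(jax.random.PRNGKey(1), 0.5, (32, 3)).astype(jnp.float32)
g_theta, g_phi = grad_fn(theta, phi, x, y, y_samples)
```

In this sketch the log-partition network sees only the input x, so at the optimum it must match log Z_theta(x) for every input, which is exactly the normalization constraint that the Lagrange-multiplier view enforces.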
On the theoretical side, the authors show that the joint problem recovers the exact MLE solution when optimization is carried out over the space of continuous functions. The method also extends to the family of Fenchel-Young losses, enabling efficient optimization of the sparsemax loss, an advance over previous approaches that required k-best oracles.
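As background on that extension, the Fenchel-Young loss family is defined as follows; these are standard definitions from the Fenchel-Young loss literature, not necessarily the paper's notation.

```latex
% Given a regularizer \Omega on the probability simplex \Delta over outputs and a
% score vector s (e.g. the energies over \mathcal{Y}), the Fenchel-Young loss of a
% target y is defined via the convex conjugate \Omega^*:
L_\Omega(s; y) = \Omega^*(s) + \Omega(y) - \langle s, y \rangle,
\qquad
\Omega^*(s) = \max_{q \in \Delta} \; \langle q, s \rangle - \Omega(q).

% The negative Shannon entropy \Omega(q) = \sum_i q_i \log q_i recovers the usual
% log-loss (with \Omega^* equal to log-sum-exp, i.e. the log-partition), while
% \Omega(q) = \tfrac{1}{2} \lVert q \rVert_2^2 yields the sparsemax loss, whose
% prediction is a Euclidean projection onto \Delta and therefore has sparse support.
```

The sparsemax case is attractive for combinatorial outputs precisely because of that sparse support, but computing it naively over all of the output space previously relied on k-best oracles, which is the limitation the joint-learning approach relaxes.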
Experiments and Results
The authors validated their approach through experiments on multilabel classification and label ranking, using several model architectures including linear models, multilayer perceptrons (MLPs), and residual networks. Numerical evaluations showed that the proposed approach converges toward the exact MLE solution and predicts multimodal outputs accurately and efficiently.
Results indicated that the method delivers significant gains, especially in multilabel classification, where the flexibility of jointly learning the partition function led to better estimates on unseen data. The authors also observed an intriguing regularization effect: drawing fewer samples from the prior during training yielded higher F1 scores, suggesting a new avenue for improving generalization.
Implications and Future Directions
The paper's findings extend the applicability of EBMs in complex discrete spaces, paving the way for practical deployments in various AI fields such as structured prediction and unsupervised learning. From a theoretical standpoint, this approach encourages the exploration of joint learning frameworks in alternative probabilistic modeling tasks.
Future research could extend the method to higher-dimensional output spaces and other probabilistic frameworks, broadening its utility in real-world applications such as natural language processing and automated reasoning. Moreover, the MCMC-free formulation invites further investigation into training neural models for probabilistic inference without traditional sampling techniques.
In summary, this paper presents a methodologically sophisticated framework for jointly learning EBMs and their partition functions, marking a significant step toward more efficient, accurate, and tractable probabilistic model training. The demonstration of its efficacy across diverse tasks underscores its potential impact across the machine learning community and aligns with ongoing advancements in probabilistic inference and neural network optimization.