Probability Distribution Learning and Its Application in Deep Learning

Published 9 Jun 2024 in cs.LG, cs.IR, and stat.ML | (2406.05666v11)

Abstract: This paper aims to elucidate the theoretical mechanisms underlying deep learning from a probability distribution estimation perspective, with Fenchel-Young Loss serving as the loss function. In our approach, the learning error, which measures the discrepancy between the model's predicted distribution and the posterior expectation of the true unknown distribution given the sampled data, is formulated as the primary optimization objective. The learning error can therefore be regarded as the posterior expectation of the expected risk. Because many important loss functions, such as Softmax Cross-Entropy Loss and Mean Squared Error Loss, are specific instances of Fenchel-Young Losses, the paper further demonstrates theoretically that Fenchel-Young Loss is a natural choice for machine learning tasks, which ensures the broad applicability of its conclusions. Under Fenchel-Young Loss, the paper proves that the model's fitting error is controlled by the gradient norm and the structural error, providing new insights into the mechanisms of non-convex optimization and into techniques employed in model training, such as over-parameterization and skip connections. Furthermore, it establishes model-independent bounds on the learning error, demonstrating that the correlation between features and labels (equivalent to mutual information) controls the upper bound of the model's generalization error. Finally, the paper validates its key conclusions through empirical results, demonstrating the practical effectiveness of the proposed method.
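
The claim that Softmax Cross-Entropy Loss and Mean Squared Error Loss are instances of Fenchel-Young Losses can be checked numerically. The sketch below is an illustration and is not taken from the paper: it assumes the standard definition L(theta; y) = Omega*(theta) + Omega(y) - <theta, y>, where Omega* is the convex conjugate of the regularizer Omega. All function names are hypothetical. It uses two well-known regularizers: negative Shannon entropy on the probability simplex, whose conjugate is log-sum-exp, and half the squared Euclidean norm, which is its own conjugate. The second case yields squared error only up to the factor 1/2 and without averaging over coordinates.

```python
import math


def logsumexp(theta):
    # Conjugate of negative Shannon entropy restricted to the simplex.
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def neg_entropy(p):
    # Omega(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0.
    return sum(pi * math.log(pi) for pi in p if pi > 0.0)


def half_sq_norm(v):
    # Omega(v) = 0.5 * ||v||^2, which is its own convex conjugate.
    return 0.5 * sum(x * x for x in v)


def fenchel_young_loss(theta, y, omega, omega_conj):
    # L(theta; y) = Omega*(theta) + Omega(y) - <theta, y>
    dot = sum(t * yi for t, yi in zip(theta, y))
    return omega_conj(theta) + omega(y) - dot


def softmax_cross_entropy(theta, y):
    # -sum_i y_i log softmax(theta)_i, computed from log-probabilities.
    lse = logsumexp(theta)
    return -sum(yi * (t - lse) for t, yi in zip(theta, y))


def half_squared_error(theta, y):
    # 0.5 * ||theta - y||^2
    return 0.5 * sum((t - yi) ** 2 for t, yi in zip(theta, y))


if __name__ == "__main__":
    theta = [2.0, -1.0, 0.5]  # arbitrary scores (logits), chosen for illustration
    y = [0.0, 1.0, 0.0]       # one-hot target

    print(fenchel_young_loss(theta, y, neg_entropy, logsumexp),
          softmax_cross_entropy(theta, y))
    print(fenchel_young_loss(theta, y, half_sq_norm, half_sq_norm),
          half_squared_error(theta, y))
```

Each printed pair should agree up to floating-point error. For a target that is not one-hot, the entropy-based Fenchel-Young loss equals the KL divergence from the softmax output to the target, so it differs from cross-entropy by the target's entropy, a term that does not depend on theta.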