The Two-Pass Softmax Algorithm

Published 13 Jan 2020 in cs.PF and cs.LG | arXiv:2001.04438v1

Abstract: The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors performance of all three passes is limited by memory bandwidth. We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the "mantissa" and another representing the "exponent". Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in AVX512 implementation, and by up to 18% in AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors. To foster reproducibility, we released an open-source implementation of the new Two-Pass Softmax algorithm and other experiments in this paper as a part of XNNPACK library at GitHub.com/google/XNNPACK.


Summary

  • The paper introduces a Two-Pass softmax algorithm that leverages dual floating-point representation to eliminate an extra normalization pass.
  • It achieves up to 28% performance gains on AVX512 and 18% on AVX2 implementations by significantly reducing memory operations.
  • This approach offers practical benefits in high-performance environments and has potential applications in GPU and hardware accelerator optimizations.

Analysis of "The Two-Pass Softmax Algorithm"

The paper presents a refined approach to computing the softmax function, a critical element in machine learning, through the introduction of a Two-Pass algorithm. This is set against the backdrop of the traditional Three-Pass algorithm, which is conventionally utilized to mitigate numerical overflow issues.

Softmax Computation and Its Challenges

Softmax functions are pivotal in converting raw model outputs into probabilities, especially in classification tasks over large label sets. The conventional Three-Pass algorithm guards against overflow by making three sweeps over the data: the first pass scans the input for its maximum value, the second computes the exponentials of the max-shifted inputs while accumulating their sum (the normalization constant), and the third divides each exponential by that sum. Despite its robustness, each pass is memory-bound, so overall performance is constrained by the bandwidth limitations inherent to HPC-class processors.
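The three passes can be sketched in plain Python as follows. This is a minimal scalar illustration of the conventional scheme, not the vectorized implementation benchmarked in the paper, and the function name is chosen here for clarity:

```python
import math

def softmax_three_pass(xs):
    """Numerically stable softmax via the conventional three-pass scheme."""
    # Pass 1: scan the input for its maximum (the overflow-avoiding shift).
    m = max(xs)
    # Pass 2: write exp(x - m) to the output buffer and accumulate the sum.
    out = [math.exp(x - m) for x in xs]
    s = sum(out)
    # Pass 3: read the output back and normalize by the sum.
    return [y / s for y in out]
```

Because exp(x - m) is at most 1, the intermediate values never overflow even for very large inputs; the cost is the extra first pass over memory, which the paper identifies as the bandwidth bottleneck.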

Two-Pass Softmax Algorithm: A Novel Approach

This research introduces a Two-Pass algorithm aimed at enhancing computational efficiency. The key innovation lies in its use of dual floating-point numbers to represent each intermediate value—capturing both the mantissa and the exponent. This allows the process to eschew the additional normalization pass of the traditional method while maintaining precision and avoiding numerical overflow.
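A minimal Python sketch of this idea follows. It is an illustration under the assumptions stated in the comments, not the XNNPACK implementation: each exp(x) is held as a pair (m, n) with exp(x) = m * 2**n, so the exponent never overflows a float, and the running sum is accumulated in the same paired representation. The function and variable names are chosen here for exposition:

```python
import math

LN2 = math.log(2.0)

def exp_extended(x):
    """Return (m, n) with exp(x) == m * 2**n.
    Splitting x = n*ln(2) + r keeps |r| <= ln(2)/2, so math.exp(r)
    never overflows regardless of how large x is."""
    n = round(x / LN2)
    r = x - n * LN2
    return math.exp(r), n

def softmax_two_pass(xs):
    """Softmax in two passes over the data, without a separate max pass."""
    # Pass 1: compute each exponential in (mantissa, exponent) form and
    # accumulate their sum in the same representation (acc_m * 2**acc_e).
    pairs = []
    acc_m, acc_e = 0.0, -math.inf
    for x in xs:
        m, e = exp_extended(x)
        pairs.append((m, e))
        if e > acc_e:
            # Rescale the accumulator to the larger exponent before adding.
            acc_m = acc_m * 2.0 ** (acc_e - e) + m
            acc_e = e
        else:
            acc_m += m * 2.0 ** (e - acc_e)
    # Pass 2: scale each stored pair by the accumulated sum.
    return [m * 2.0 ** (e - acc_e) / acc_m for m, e in pairs]
```

In the paper's setting the pairs are written to the output buffer during the first pass, so the input is read only twice instead of three times; this sketch keeps them in a Python list purely for readability.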

Performance Evaluation

The implementation of the Two-Pass algorithm demonstrated superior performance on various x86-64 processors, including Intel's Skylake-X and Broadwell and AMD's Zen 2. On inputs that did not fit in the processor cache, the AVX512 implementation showed up to a 28% performance gain over the Three-Pass algorithm, and the AVX2 implementation up to 18%. These gains follow from the reduced memory traffic, affirming the algorithm's efficiency.

Practical and Theoretical Implications

From a practical standpoint, the simplification of the softmax process to two passes significantly reduces memory operations, appealing to environments where computational resource optimization is crucial. Theoretically, this approach enhances our understanding of how alternative numerical representations can facilitate more efficient algorithmic constructs.

Future Prospects

There is considerable potential for applying this algorithm in GPU contexts and hardware accelerators, where memory access poses even greater costs. Future research may explore adapting the algorithm for such platforms, possibly extending its benefits even further.

Conclusion

This paper highlights an intriguing paradigm shift in softmax computation, challenging traditional approaches with a method that harmonizes computational performance and memory efficiency. The open-source release of this algorithm within the XNNPACK library underscores the authors' commitment to reproducibility and furtherance of industrial and academic research. The Two-Pass algorithm offers a compelling alternative for softmax calculations, promising improvements conducive to contemporary large-scale machine learning tasks.
