FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems (2312.07743v1)
Abstract: Word2Vec remains one of the most impactful innovations in NLP: it encodes latent grammatical and syntactic information in human text as dense, low-dimensional vectors. Word2Vec is computationally expensive because of the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory-system performance, they struggle to achieve high throughput on powerful GPUs. We identify memory access latency as the primary bottleneck in prior GPU work, which prevents even highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to lower levels of the memory hierarchy and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, yielding performance improvements that scale across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state of the art by 5.72X on V100 cards at the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared-memory caching, together with high-throughput shared-memory reductions, significantly improves arithmetic intensity. FULL-W2V can benefit many applications in NLP and other domains.
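To make the data-reuse idea concrete, below is a minimal CUDA sketch, not the authors' released FULL-W2V code. It assumes skip-gram with negative sampling (SGNS); names such as `sgns_warp_kernel`, `EMB_DIM`, and `NEG` are illustrative. The sketch keeps the context-word vector in registers so it is fetched from global memory once and reused for the positive target and every negative sample, and it reduces dot products with warp shuffles; the paper's full design additionally reuses negative samples across context windows and uses shared-memory caching and reductions.

```cuda
// Hedged sketch of warp-per-sample SGNS with register-level data reuse.
// Not the paper's implementation; constants and names are assumptions.
#include <cuda_runtime.h>

constexpr int EMB_DIM  = 128;             // embedding dimension (assumed)
constexpr int NEG      = 5;               // negatives per target (assumed)
constexpr int LANES    = 32;              // one warp trains one sample tuple
constexpr int PER_LANE = EMB_DIM / LANES; // register slots per thread

// Warp-wide sum via shuffles: no shared-memory or global-memory round trips.
__device__ float warp_sum(float v) {
    for (int off = LANES / 2; off > 0; off >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, off);
    return __shfl_sync(0xffffffffu, v, 0);   // broadcast result to all lanes
}

// One warp handles one (context, target, negatives) tuple.
// Win/Wout: input/output embedding matrices; lr: learning rate.
__global__ void sgns_warp_kernel(float *Win, float *Wout,
                                 const int *ctx, const int *tgt,
                                 const int *negs, int n_pairs, float lr) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / LANES;
    int lane = threadIdx.x % LANES;
    if (warp >= n_pairs) return;             // whole warp exits together

    // Cache the context vector in registers: one global-memory read,
    // reused for the positive sample and all NEG negatives.
    float c[PER_LANE], cgrad[PER_LANE] = {0.f};
    float *cvec = Win + (size_t)ctx[warp] * EMB_DIM;
    for (int k = 0; k < PER_LANE; ++k)
        c[k] = cvec[lane + k * LANES];

    // The positive target (s == 0) and the negatives share one update loop.
    for (int s = 0; s < 1 + NEG; ++s) {
        int   word  = (s == 0) ? tgt[warp] : negs[warp * NEG + (s - 1)];
        float label = (s == 0) ? 1.f : 0.f;
        float *ovec = Wout + (size_t)word * EMB_DIM;

        // Dot product: partial sums in registers, reduced with warp shuffles.
        float dot = 0.f;
        float o[PER_LANE];
        for (int k = 0; k < PER_LANE; ++k) {
            o[k] = ovec[lane + k * LANES];
            dot += c[k] * o[k];
        }
        dot = warp_sum(dot);

        float g = lr * (label - 1.f / (1.f + expf(-dot)));  // sigmoid gradient

        // Accumulate the context gradient in registers; update the output
        // vector in place (Hogwild-style, unsynchronized across warps).
        for (int k = 0; k < PER_LANE; ++k) {
            cgrad[k] += g * o[k];
            ovec[lane + k * LANES] += g * c[k];
        }
    }

    // Single write-back of the context vector after all samples are consumed.
    for (int k = 0; k < PER_LANE; ++k)
        cvec[lane + k * LANES] += cgrad[k];
}
```

The kernel assumes a block size that is a multiple of 32 so each tuple maps to a full warp, and it tolerates unsynchronized concurrent updates to embeddings, the same lock-free (Hogwild-style) convention the original word2vec relies on. The point of the sketch is the access pattern: per tuple, the context vector touches global memory twice (one read, one write) regardless of how many samples reuse it.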