- The paper presents a novel WTA hashing method to approximate the output layer's matrix product, enabling scalability for millions of classes.
- It achieves significant speed-ups in training and inference on ImageNet 21K, a Wikipedia Skipgram task, and Sports-1M while maintaining competitive accuracy.
- The approach reduces computational complexity using locality sensitive hashing, paving the way for practical applications in large-scale classification tasks.
Analysis of "Deep Networks with Large Output Spaces"
The paper "Deep Networks with Large Output Spaces" addresses a significant bottleneck in the scalability of deep neural networks for classification tasks involving a large number of output classes. Deep neural networks are typically limited by computational inefficiencies when applied to large-scale problems with output dimensions extending to millions of classes. This paper presents a novel approach leveraging Locality Sensitive Hashing (LSH) to overcome these limitations.
Methodological Innovation
The primary contribution of this research is the integration of Winner-Take-All (WTA) hashing as a mechanism to approximate the matrix product xᵀW in the final layer of deep networks, allowing the system to scale efficiently without a substantial loss of accuracy. The approach computes hash codes for the parameter vectors of the output layer, organizes these codes in hash tables, and retrieves only the relevant output nodes during inference and training. This markedly reduces computational cost, specifically the cost of evaluating and updating models with extensive output spaces.
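As a rough illustration of this mechanism, the sketch below implements plain WTA hashing together with a simple banded hash-table lookup over output weight columns. It is a sketch under assumed settings: the band count, window size, and function names are illustrative choices, not the authors' configuration.

```python
import numpy as np
from collections import defaultdict

def wta_codes(v, perms, window=8):
    """Winner-Take-All hashing: for each random permutation, keep the first
    `window` permuted entries and record the index of the largest one.
    Vectors with similar rank ordering tend to agree on many codes."""
    return [int(np.argmax(v[p[:window]])) for p in perms]

def band_keys(codes, hashes_per_band=4):
    """Group consecutive WTA codes into bands; each band forms one hash-table
    key, so two vectors must match an entire band to collide."""
    return [tuple(codes[i:i + hashes_per_band])
            for i in range(0, len(codes), hashes_per_band)]

rng = np.random.default_rng(0)
D, C = 256, 10_000                                  # toy sizes; the paper targets millions of classes
W = rng.standard_normal((D, C)).astype(np.float32)  # output-layer weight columns
perms = [rng.permutation(D) for _ in range(32)]     # 32 WTA codes -> 8 bands of 4

# Index every class's weight column in one hash table per band.
tables = [defaultdict(list) for _ in range(8)]
for c in range(C):
    for band, key in enumerate(band_keys(wta_codes(W[:, c], perms))):
        tables[band][key].append(c)

# At inference, hash the activation and score only the retrieved candidates
# instead of computing the full x @ W product over all C classes.
x = rng.standard_normal(D).astype(np.float32)
candidates = set()
for band, key in enumerate(band_keys(wta_codes(x, perms))):
    candidates.update(tables[band][key])
scores = {c: float(x @ W[:, c]) for c in candidates}
```

During training, the same retrieval step selects the small set of output nodes whose weights are updated for a given example, which is where the reported speed-ups come from.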
Experimental Evaluation
The authors validate their methodology across three large-scale datasets: ImageNet 21K, a Skipgram word-embedding task derived from a Wikipedia text corpus, and the Sports-1M video dataset. In each scenario, the WTA-based softmax is compared to traditional softmax and hierarchical softmax methods. Results consistently show that the WTA approach achieves faster training and inference while maintaining competitive accuracy. Notably:
- ImageNet 21K: The WTA approach achieves substantial speed-ups, particularly with smaller batch sizes, demonstrating its efficiency for problems with numerous classes.
- Skipgram: Despite processing fewer examples, the WTA method achieves better predictive performance than the hierarchical softmax approach.
- Sports-1M: The WTA model surpasses both baseline methods in accuracy and training speed. The pronounced gap is attributed to the low within-class variance of video frames, indicating that the method is especially effective when examples within a class are highly similar.
Implications and Future Directions
The paper offers promising implications for the deployment of deep networks in applications characterized by extensive output spaces. By reducing computational demands, the methodology can facilitate the practical application of deep networks in domains like image recognition, content identification, and video recommendations. Additionally, the hashing technique can potentially be extended to intermediate layers within the network to further enhance scalability by imposing sparsity constraints.
Future work could explore the applicability of such hashing techniques to hierarchical learning systems. Applying similar methods to intermediate layers could increase the number of filters convolutional networks can support, potentially accommodating tens of thousands of active filters concurrently.
Conclusion
Overall, the paper presents a compelling methodology for overcoming existing limitations in training and applying deep networks with large output spaces. The experimental results demonstrate the advantages of incorporating LSH into the output layer, and the approach should interest both researchers and practitioners seeking to extend neural networks to high-dimensional output settings. The proposed approach holds promise for further scalability improvements in deep learning applications, facilitating broader adoption in complex real-world scenarios.