- The paper demonstrates that matrix square-root normalization, combined with elementwise square-root and ℓ2 normalization, significantly improves bilinear pooling accuracy in CNNs by 2-3% on recognition tasks.
- It shows that using Lyapunov equation-derived gradients is more numerically stable and precise than SVD for training networks with matrix functions, while Newton iterations efficiently approximate matrix square roots.
- The findings suggest that improved normalization layers can raise CNN accuracy on recognition tasks, potentially simplifying models without sacrificing performance, and point to unrolled iterations as a route to faster gradient computation.
Overview of Improved Bilinear Pooling with CNNs
This paper explores advancements in bilinear pooling methods applied to CNN features, specifically investigating normalization strategies that enhance the representational capacity of bilinear pooled features. Bilinear pooling, a technique that aggregates second-order statistics of convolutional features, is already recognized for its efficacy across tasks including fine-grained recognition, scene categorization, and texture recognition, among others. The authors show that further gains are available by normalizing the covariance matrices produced by bilinear pooling, yielding a marked improvement in model accuracy.
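To make the pooling step concrete, here is a minimal NumPy sketch of bilinear (second-order) pooling over a single convolutional feature map; the array shapes and the function name are illustrative assumptions rather than the paper's reference code.

```python
import numpy as np

def bilinear_pool(features):
    """Aggregate second-order statistics of a conv feature map.

    features: array of shape (C, H, W) from the last convolutional layer
    (shape chosen for illustration). Returns a C x C matrix of averaged
    outer products, which serves as the image descriptor before normalization.
    """
    C, H, W = features.shape
    X = features.reshape(C, H * W)   # one column per spatial location
    return (X @ X.T) / (H * W)       # average of the outer products x x^T

# Example: pool a random VGG-style feature map (512 channels on a 28x28 grid).
pooled = bilinear_pool(np.random.randn(512, 28, 28))
print(pooled.shape)  # (512, 512)
```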
Key Findings and Numerical Results
The central contribution of the paper is demonstrating the efficacy of matrix square-root normalization over alternative schemes such as the matrix logarithm. When combined with elementwise square-root and ℓ2 normalization, this method consistently improves accuracy by 2-3% across several fine-grained recognition datasets, improving on prior benchmarks for those tasks. Notably, the gain is achieved while preserving the translational invariance of the representation, marking a significant advance in feature normalization techniques.
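As a rough sketch of the full normalization pipeline described above, the following code applies the matrix square root (computed here via an eigendecomposition of the symmetric pooled matrix), followed by the elementwise signed square root and ℓ2 normalization; the epsilon value and function name are assumptions for illustration.

```python
import numpy as np

def improved_bilinear_features(pooled, eps=1e-10):
    """Matrix square-root, elementwise signed square-root, and l2 normalization.

    pooled: symmetric C x C bilinear-pooled matrix (e.g. from bilinear_pool).
    Returns a flattened, normalized feature vector.
    """
    # Matrix square root A^(1/2) via eigendecomposition; eigenvalues are
    # clamped at zero since the pooled matrix is positive semi-definite
    # up to round-off error.
    w, V = np.linalg.eigh(pooled)
    w = np.maximum(w, 0.0)
    Z = (V * np.sqrt(w)) @ V.T

    # Elementwise signed square root, then l2 normalization of the vector.
    y = np.sign(Z) * np.sqrt(np.abs(Z))
    y = y.reshape(-1)
    return y / (np.linalg.norm(y) + eps)
```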
Theoretical Insights and Methodological Refinements
The authors rigorously explore the computation of matrix function gradients needed for end-to-end training. They compare gradient estimation methods, contrasting gradients derived from the singular value decomposition (SVD) with those obtained by solving a Lyapunov equation. SVD-based gradients become numerically unstable when singular values are nearly equal, whereas the Lyapunov-derived gradients sidestep this issue and yield more accurate models. Additionally, Newton-style iterations for approximating the matrix square root match the accuracy of SVD-based computation while being significantly faster, especially on GPU architectures.
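To illustrate these two ingredients, the sketch below approximates the forward matrix square root with Newton-Schulz iterations (which use only matrix products and therefore map well to GPUs) and computes the backward pass by solving a Lyapunov equation with SciPy; the scaling, iteration count, and use of scipy.linalg.solve_continuous_lyapunov are my assumptions, not the authors' exact recipe.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def sqrtm_newton_schulz(A, num_iters=10):
    """Approximate the square root of an SPD matrix with Newton-Schulz iterations.

    Only matrix multiplications are needed, which is why iterative schemes
    are attractive on GPUs. A is scaled by its Frobenius norm so that the
    iteration converges.
    """
    n = A.shape[0]
    norm_A = np.linalg.norm(A, 'fro')
    Y, Z = A / norm_A, np.eye(n)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return np.sqrt(norm_A) * Y          # Y converges to (A / norm_A)^(1/2)

def sqrtm_backward(Z, grad_Z):
    """Gradient of the loss w.r.t. A, given Z = A^(1/2) and dL/dZ.

    Solves the Lyapunov equation Z X + X Z = dL/dZ for X = dL/dA, avoiding
    the 1/(s_i - s_j) terms that make SVD-based gradients unstable when
    singular values are nearly equal.
    """
    return solve_continuous_lyapunov(Z, grad_Z)
```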
Implications and Future Directions in AI
The paper's findings have substantial implications for optimizing CNN architectures, suggesting that adding improved normalization layers can raise recognition accuracy regardless of the underlying network's complexity. These advancements in bilinear pooling help refine feature extraction, potentially simplifying models without conceding accuracy or increasing computational demands.
Looking forward, the researchers express interest in further exploring unrolled iterations for gradient computation, which promise faster training. Such an approach could allow subsequent network layers to adapt to the errors introduced by the iterative approximation, a promising avenue for further architecture optimization.
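One way to read "unrolled iterations" here is to run the same Newton-Schulz updates inside an autodiff framework so that backpropagation flows through every iteration, letting later layers compensate for the approximation error; the PyTorch sketch below is an illustration of that idea under those assumptions, not the authors' implementation.

```python
import torch

def sqrtm_unrolled(A, num_iters=5):
    """Newton-Schulz square root whose iterations stay in the autograd graph.

    Every update is a differentiable matrix product, so backpropagating a
    loss built on the output differentiates through the unrolled iterations
    themselves rather than through the exact matrix square root.
    """
    n = A.shape[0]
    norm_A = A.norm(p='fro')
    Y, Z = A / norm_A, torch.eye(n, dtype=A.dtype)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * torch.eye(n, dtype=A.dtype) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return norm_A.sqrt() * Y

# Toy check: gradients propagate through the unrolled iterations.
A = torch.randn(8, 8)
A = (A @ A.t() + 1e-3 * torch.eye(8)).requires_grad_(True)  # SPD input
loss = sqrtm_unrolled(A).sum()
loss.backward()
print(A.grad.shape)  # torch.Size([8, 8])
```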
In conclusion, this paper presents a meticulous exploration of matrix normalization in bilinear pooling, spotlighting viable methods to bolster CNN feature extraction processes. It contributes a practical and theoretically enriched framework for improved recognition performance in AI models, paving the way for future refinements in neural network architectures.