Exploring Architectures for CNN-Based Word Spotting (1806.10866v2)
Abstract: The goal in word spotting is to retrieve those parts of document images that are relevant with respect to a certain user-defined query. In recent years, attribute-based Convolutional Neural Networks have taken over this field of research. As in other areas of computer vision, the CNNs used for this task are already considerably deep. This raises the question: How complex does a CNN have to be for word spotting? Do increasingly deeper models give increasingly better results, or does performance saturate for these architectures? Conversely, can similar results be obtained with a much smaller CNN? The goal of this paper is to answer these questions. To this end, the recently successful TPP-PHOCNet is compared empirically to a Residual Network, a Densely Connected Convolutional Network, and a LeNet architecture. As the evaluation shows, a complex model can be beneficial for word spotting on harder tasks such as the IAM Offline Database, but offers no advantage on easier benchmarks such as the George Washington Database.
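The attribute representation underlying these CNNs is the Pyramidal Histogram of Characters (PHOC) introduced by Almazán et al.: a binary vector that encodes which characters occur in which horizontal region of the word, at several pyramid levels. A minimal sketch of this construction is shown below; the alphabet and pyramid levels are illustrative defaults, not necessarily the exact configuration used in the paper.

```python
def build_phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz0123456789",
               levels=(1, 2, 3, 4)):
    """Sketch of a binary PHOC vector for a word string.

    At pyramid level L the word is split into L regions; a character is
    assigned to a region if at least half of its normalized extent
    [i/n, (i+1)/n] overlaps that region (the criterion of Almazán et al.).
    """
    n = len(word)
    phoc = []
    for level in levels:
        for region in range(level):
            r_lo, r_hi = region / level, (region + 1) / level
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word.lower()):
                if ch not in alphabet:
                    continue  # skip characters outside the chosen alphabet
                c_lo, c_hi = i / n, (i + 1) / n
                overlap = min(c_hi, r_hi) - max(c_lo, r_lo)
                if overlap / (c_hi - c_lo) >= 0.5:
                    bits[alphabet.index(ch)] = 1
            phoc.extend(bits)
    return phoc
```

With this 36-character alphabet and levels (1, 2, 3, 4), the vector has 36 × (1 + 2 + 3 + 4) = 360 dimensions; an attribute CNN is then trained to predict this vector from a word image, and retrieval reduces to nearest-neighbor search in PHOC space.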
- R. Manmatha, C. Han, and E. Riseman, “Word spotting: a new approach to indexing handwriting,” in Proc. of the IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, 1996.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. of the Int. Conf. on Learning Representations, 2015.
- S. Sudholt and G. Fink, “Evaluating word string embeddings and loss functions for CNN-based word spotting,” in Proc. of the ICDAR, Kyoto, Japan, 2017.
- J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Word spotting and recognition with embedded attributes,” TPAMI, vol. 36, no. 12, pp. 2552–2566, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proc. of the IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- T. Rath and R. Manmatha, “Word spotting for historical documents,” Int. Journal on Document Analysis and Recognition, vol. 9, no. 2–4, 2007.
- M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós, “Efficient segmentation-free keyword spotting in historical document collections,” Pattern Recognition, vol. 48, no. 2, pp. 545–555, 2015.
- D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. Journal of Computer Vision, vol. 60, 2004.
- F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in Proc. European Conf. on Computer Vision, K. Daniilidis, P. Maragos, and N. Paragios, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 143–156.
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, Dec 1989.
- X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Gordon, D. Dunson, and M. Dudík, Eds., vol. 15. Fort Lauderdale, FL, USA: PMLR, 11–13 Apr 2011, pp. 315–323.
- R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” arXiv preprint arXiv:1507.06228, 2015.
- K. He and J. Sun, “Convolutional neural networks at constrained time cost,” CoRR, vol. abs/1412.1710, 2014. [Online]. Available: http://arxiv.org/abs/1412.1710
- S. Sudholt and G. A. Fink, “Attribute CNNs for Word Spotting in Handwritten Documents,” International Journal on Document Analysis and Recognition, 2018.
- K. Fukushima, “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” in Proc. of the ACM Conference on Multimedia, 2014, pp. 675–678.
- S. Sudholt and G. A. Fink, “PHOCNet: A deep convolutional neural network for word spotting in handwritten documents,” in Proc. of the ICFHR, Shenzhen, China, 2016, pp. 277–282.
- M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in Proc. of the European Conference on Computer Vision, 2014, pp. 818–833.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” in Proc. of the Int. Conf. on Learning Representations, 2014.
- P. Krishnan, K. Dutta, and C. Jawahar, “Deep feature embedding for accurate recognition and retrieval of handwritten text,” in Proc. of the ICFHR, 2016, pp. 289–294.
- T. Wilkinson and A. Brun, “Semantic and verbatim word spotting using deep neural networks,” in Proc. of the ICFHR, 2016, pp. 307–312.
- D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. of the Int. Conf. on Learning Representations, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in Proc. of the Int. Conf. on Computer Vision, 2015, pp. 1026–1034.