- The paper proposes a hybrid approach for texture analysis, combining a Convolutional Neural Network (CNN) for invariant feature extraction with a Support Vector Machine (SVM) for accurate classification.
- Experiments on standard datasets like Brodatz and Kylberg demonstrate that the fusion of CNN and SVM leads to improved classification rates by leveraging the uncorrelated errors of the individual classifiers.
- The hybrid model achieves high prediction accuracy, reaching 99.21% in tests, providing a robust method for texture classification applicable in areas such as remote sensing and image retrieval.
The paper "A Hybrid Deep Learning Approach for Texture Analysis" explores texture classification using a combination of Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs). Texture classification has applications in remote sensing and forest species recognition. The authors posit that combining a CNN with an SVM creates a robust system, where the CNN acts as an invariant feature extractor and the SVM provides accurate classification. The fusion of these methods aims to improve classification rates across diverse datasets.
The paper discusses how textures are slowly varying, almost periodically repeating patterns within an image. It acknowledges the difficulty in fully understanding human visual perception of textures. The paper notes that texture analysis uses mathematical models to describe spatial variations in images, making it easier to describe textures via statistical patterns rather than geometrical edges. Texture analysis is used in fire smoke detection, remote sensing, and content-based image retrieval.
Texture feature extraction methods fall into three categories: statistical, structural/geometrical, and digital signal processing. Statistical methods rely on the spatial distribution statistics of gray level values, such as co-occurrence and autocorrelation functions. Geometrical methods decompose textures into geometrical primitives, often using edge detection techniques. Signal processing methods analyze the frequency domain of spatial information, with Fourier analysis being effective for textures exhibiting high periodicity.
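To make the statistical category concrete, here is a minimal numpy sketch of a gray-level co-occurrence matrix (GLCM) and one Haralick-style statistic (contrast). This is an illustrative toy, not the paper's implementation; the 8-level quantization mirrors the GLCM setup described later in the paper.

```python
import numpy as np

def glcm(img, dx, dy, levels=8):
    """Gray-level co-occurrence matrix for offset (dx, dy):
    counts how often gray level i co-occurs with gray level j."""
    h, w = img.shape
    m = np.zeros((levels, levels), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                m[img[y, x], img[y2, x2]] += 1
    total = m.sum()
    return m / total if total else m

def contrast(p):
    """Haralick contrast: sum over (i, j) of (i - j)^2 * p(i, j)."""
    i, j = np.indices(p.shape)
    return float(((i - j) ** 2 * p).sum())

# Toy 4x4 image already quantized to 8 gray levels.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 3, 3],
                [2, 2, 3, 3]])
p = glcm(img, dx=1, dy=0)   # horizontal neighbor, distance 1
print(round(contrast(p), 3))
```

Repeating this for several offsets (distances and directions) and collecting a set of such statistics produces the kind of feature vector the statistical methods rely on.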
The paper notes that CNNs have become popular in computer vision, which has reduced the popularity of SVMs. The paper argues that SVMs have demonstrated near state-of-the-art results in image classification experiments and can be complementary to CNNs.
Prior work in texture analysis includes:
- Complete Local Binary Pattern (CLBP), where an image gray level local region is represented by its center pixel, and global thresholding is done before binary coding to generate rotational invariant features [5].
- The use of Local Binary Patterns (LBP) for feature extraction with K nearest-neighbor (KNN) for classification [6].
- The use of co-occurrence matrices (GLCM) and Gabor filters [7].
- The use of color-based features and GLCM [8], and a mixture of feature extractors LBP, CLBP, GLCM and color features [9].
- Color Local Gabor Binary Co-occurrence Pattern (CLGBOCP), which finds the spatial relationship of a pixel's neighborhoods using co-occurrence local binary edges and integrates LBP with co-occurrence matrix features [10].
- Linear Dominant Local Binary Patterns (DLBP), which calculate the frequency of occurrence of rotational invariant patterns [11].
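Several of the cited methods build on Local Binary Patterns; as a concrete illustration, here is a minimal 8-neighbor LBP sketch in numpy (basic LBP only, without the rotational-invariance or completeness extensions of CLBP/DLBP):

```python
import numpy as np

def lbp8(img):
    """Basic 8-neighbor Local Binary Pattern: threshold the ring of
    neighbors against the center pixel and pack the bits into a code."""
    # Neighbor offsets, clockwise starting at the top-left corner.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            code = 0
            for bit, (dy, dx) in enumerate(offs):
                if img[y + dy, x + dx] >= img[y, x]:
                    code |= 1 << bit
            out[y - 1, x - 1] = code
    return out

img = np.array([[5, 4, 3],
                [6, 4, 2],
                [7, 8, 1]], dtype=np.int32)
print(lbp8(img)[0, 0])
```

A histogram of these codes over an image patch is the texture descriptor; the CLBP and DLBP variants above refine which codes are kept and how invariance is achieved.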
The paper points out that deep learning has been increasingly adopted to solve the problem. However, results have not been consistently better than previously reported results [12]. Variants of the proposed methodology have been used for object categorization [13] and hand gesture recognition [14].
The paper describes the CNN architecture, which consists of Convolutional, Sub-sampling, ReLU, and fully connected layers. Neurons are connected to a subset of the neurons of the previous layer, called the receptive field. Mean subtraction is applied to the training data during preprocessing, and initial weights are drawn at random and scaled by 0.01 to break symmetry. The first layer employs 64 filters with a depth of one and a filter size of 5x5. The pooling layer reduces the number of computational parameters as the depth increases and provides non-linearity.
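The first layer described above can be sketched directly in numpy: mean subtraction, 64 random 5x5 filters scaled by 0.01, ReLU, then 2x2 max pooling. This is a hypothetical illustration of the layer's shape arithmetic, not the paper's trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(img, kernel):
    """Valid 2-D convolution (cross-correlation) of a single channel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)

def maxpool2x2(x):
    """2x2 max pooling (sub-sampling) with stride 2."""
    h, w = x.shape
    h, w = h - h % 2, w - w % 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = rng.standard_normal((64, 64))                 # one 64x64 texture patch
img -= img.mean()                                   # mean subtraction
filters = 0.01 * rng.standard_normal((64, 5, 5))    # small random init breaks symmetry
maps = np.stack([maxpool2x2(relu(conv2d_valid(img, f))) for f in filters])
print(maps.shape)   # 64 feature maps of size 30x30
```

A 5x5 valid convolution shrinks 64x64 to 60x60, and pooling halves that to 30x30, which shows why pooling cuts the parameter count of the layers that follow.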
The paper states that SVMs aim to find the maximum-margin separating hyperplane between feature vectors and are less computationally intensive and less sensitive to noisy data than neural networks. SVM optimization relies on two hyperparameters: Gamma, which controls how far the influence of a single training example reaches, and C, which determines the tolerance for misclassification errors. Feature extraction for the SVM is performed using GLCM with eight gray levels, three distances, and four directions. Thirteen features are calculated for each distance and direction, along with the mean of the four directions at each distance. The SVM is trained with C = 2048 and gamma = 0.0313.
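The role of gamma is visible in the RBF kernel itself, a short numpy sketch (the kernel form is the standard RBF; gamma = 0.0313 is the paper's value, and the feature-vector arithmetic in the comment follows from the counts above, assuming 13 features per distance-direction pair plus a per-distance mean, i.e. 13 x 3 x (4 + 1) = 195 features):

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    Larger gamma -> a training sample's influence decays faster with distance."""
    d2 = np.sum((a - b) ** 2)
    return np.exp(-gamma * d2)

# With 13 GLCM features x 3 distances x (4 directions + 1 mean),
# each sample would be a 195-dimensional vector; a 2-D toy suffices here.
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])   # squared distance ||x - y||^2 = 2
print(rbf_kernel(x, y, gamma=0.0313))
```

With the paper's small gamma, distant samples still receive substantial kernel weight, which matches the smooth decision boundaries that grid-searched texture classifiers typically favor.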
The paper uses the Brodatz textures dataset, a standard benchmark for texture segmentation and classification, and the Kylberg image dataset. The Brodatz32 subset contains 32 texture classes, each with 64 samples of size 64x64. The Kylberg dataset has 28 classes, each with 160 samples of size 567x567, resized to 64x64.
The CNN-SVM fusion leverages the invariant features learning of CNN and the accurate separation of feature vectors with SVM. The training set is used for training both classifiers, while the validation set is used only by CNN to prevent overfitting. The CNN was trained for 150 epochs for the Brodatz dataset and 100 epochs for the Kylberg dataset. A binary mapping approach is used to assign classes to the appropriate classifier based on performance statistics. The Misclassification Rate and Accuracy Rate are used to measure confidence in the classifiers. The confidence I(i) is measured as:
I(i) = (Accuracy Rate_CNN − Misclassification Rate_CNN) > (Accuracy Rate_SVM − Misclassification Rate_SVM)
Where:
- I(i) is the confidence comparison for test sample i: when it holds, the CNN's prediction is trusted, otherwise the SVM's.
- Misclassification Rate_CNN is the percentage of samples the CNN predicted as a class A that actually belong to another class.
- Accuracy Rate_CNN is the percentage of samples the CNN predicted as class A that actually belong to class A.
- Misclassification Rate_SVM is the percentage of samples the SVM predicted as a class A that actually belong to another class.
- Accuracy Rate_SVM is the percentage of samples the SVM predicted as class A that actually belong to class A.
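The binary mapping can be sketched as follows: for each class, compare each classifier's accuracy-minus-misclassification margin on the fusion split and route predictions accordingly. This is a hypothetical reconstruction of the routing logic from the description above (function names and the toy rates are illustrative, not from the paper).

```python
import numpy as np

def build_mapping(acc_cnn, mis_cnn, acc_svm, mis_svm):
    """Binary mapping over classes: True means the CNN is trusted for that
    class, False means the SVM is, based on (Accuracy - Misclassification)."""
    conf_cnn = np.asarray(acc_cnn) - np.asarray(mis_cnn)
    conf_svm = np.asarray(acc_svm) - np.asarray(mis_svm)
    return conf_cnn > conf_svm

def fuse(pred_cnn, pred_svm, mapping):
    """Keep the CNN's label when the mapping trusts the CNN for the class
    the CNN predicted; otherwise fall back to the SVM's label."""
    return [c if mapping[c] else s for c, s in zip(pred_cnn, pred_svm)]

# Toy per-class rates (fractions) measured on the fusion split.
mapping = build_mapping(acc_cnn=[0.99, 0.90], mis_cnn=[0.01, 0.10],
                        acc_svm=[0.95, 0.97], mis_svm=[0.05, 0.03])
print(mapping.tolist())                  # [True, False]
print(fuse([0, 1, 1], [1, 1, 0], mapping))
```

Because the two classifiers tend to misclassify different samples, this per-class routing is what lets the fusion recover from errors that either classifier makes alone.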
Experiments were conducted to estimate the best allocation of data for each dataset. For the Brodatz dataset, training with 60% of the data yielded reasonable performance; for the Kylberg dataset, the SVM results were more stable and accurate than the CNN's. The adopted distribution for both datasets was 60% of the data for training, 10% for validation, 20% for fusion, and 10% for testing.
For the Kylberg dataset, the SVM performed superiorly to CNN due to the lack of rotated or scaled objects. For the Brodatz dataset, the misclassified samples were not the same for both classifiers, showing the potential improvement based on the fusion method. The fusion algorithm reached 99.21% accuracy. Random initial weights play an important factor in achieving optimal results, and SVM is relatively stable due to its training mechanism.
In conclusion, the fusion of CNN and SVM leads to improved classification rates due to uncorrelated errors and the ability to recover from mistakes of either classifier, achieving high prediction accuracy when both classifiers are well-trained.