- The paper demonstrates that enforcing top‑k activations in autoencoders approximates sparse coding by recovering the true support under specific dictionary incoherence conditions.
- Experimental results on MNIST, NORB, and CIFAR‑10 show that choosing the sparsity level k appropriately, and scheduling it during training, improves feature extraction and classification accuracy.
- The k‑sparse autoencoder also serves as an effective pre-training module for deep networks, achieving competitive error rates in both unsupervised and fine-tuned supervised settings.
The paper presents a method for learning sparse representations by leveraging a novel autoencoder architecture in which sparsity is imposed by retaining only the top‑k activations in the hidden layer. This “k‑sparse autoencoder” employs linear activations with tied weights and enforces sparsity by selecting the indices corresponding to the k largest values of the activation vector during the feedforward pass. The only nonlinearity originates from this hard selection, thus simplifying the network compared with other sparsity-inducing methods that mix activations, sampling, and penalty terms.
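To make this feedforward pass concrete, here is a minimal NumPy sketch, assuming tied weights W, an encoder bias b, and a decoder bias b′; the variable names and the use of magnitude for the top‑k selection are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def k_sparse_forward(x, W, b, b_prime, k):
    """Encode x, keep only the k largest activations, and reconstruct.

    x: input (d,), W: tied weight matrix (d, m), b: hidden bias (m,),
    b_prime: output bias (d,), k: number of active hidden units.
    """
    z = W.T @ x + b                       # linear pre-activations
    support = np.argsort(np.abs(z))[-k:]  # indices of the k largest activations
    z_sparse = np.zeros_like(z)
    z_sparse[support] = z[support]        # hard top-k selection: the only nonlinearity
    x_hat = W @ z_sparse + b_prime        # linear reconstruction with tied weights
    return z_sparse, x_hat
```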
The work draws connections between the proposed autoencoder and classical sparse coding algorithms. In particular, the algorithm is interpreted as an approximation of a sparse coding procedure based on iterative hard thresholding variants, specifically the iterative thresholding with inversion (ITI) algorithm. The process involves:
- Support Estimation: Using the operator suppk(W⊤x), the algorithm approximates the support of the true sparse code.
- Inversion Step: Instead of computing exact pseudoinverses (as in conventional sparse coding), a single step of gradient descent approximates the update, so that dictionary learning and sparse inference are performed simultaneously; a sketch of one such training step follows this list.
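The following is a hedged sketch of one training iteration under this interpretation, assuming a squared reconstruction loss and plain per-example gradient descent; the learning rate and variable names are illustrative. The support selected in the forward pass plays the role of the ITI support estimate, and the single gradient step stands in for the exact inversion.

```python
import numpy as np

def k_sparse_train_step(x, W, b, b_prime, k, lr=0.01):
    """One gradient step on the squared reconstruction error of a k-sparse
    autoencoder with tied weights (a sketch, not the paper's exact code)."""
    # Forward pass: support estimation via the k largest (in magnitude) activations
    z = W.T @ x + b
    support = np.argsort(np.abs(z))[-k:]
    mask = np.zeros_like(z)
    mask[support] = 1.0
    z_sparse = z * mask

    # Reconstruction and residual
    x_hat = W @ z_sparse + b_prime
    err = x_hat - x

    # Backpropagation: gradients flow only through the selected support
    grad_z = (W.T @ err) * mask                               # encoder path, masked
    grad_W = np.outer(err, z_sparse) + np.outer(x, grad_z)    # decoder + encoder (tied weights)
    grad_b = grad_z
    grad_b_prime = err

    # Single gradient step: joint dictionary update and approximate inversion
    W -= lr * grad_W
    b -= lr * grad_b
    b_prime -= lr * grad_b_prime
    return 0.5 * float(err @ err)          # reconstruction loss for monitoring
```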
A theoretical result (Theorem 3.1) is provided showing that under appropriate incoherence conditions of the learned dictionary, using the top‑k activations guarantees the recovery of the true support of the sparse code. Specifically, if the condition
kμ(W) ≤ zk / (2z1)
is satisfied—where μ(W) is the mutual coherence of the dictionary, z1 is the largest coefficient, and zk is the k‑th largest coefficient—then the support estimated by suppk(W⊤x) is correct. This establishes a clear link between dictionary incoherence and the robustness of the sparse recovery.
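This condition can be checked numerically. The sketch below uses an illustrative random Gaussian dictionary, which is incoherent with high probability; the sizes and coefficient values are assumptions chosen for demonstration, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative random dictionary with unit-norm columns
d, m, k = 1024, 2048, 3
W = rng.standard_normal((d, m))
W /= np.linalg.norm(W, axis=0)

# Mutual coherence mu(W): largest |inner product| between distinct columns
G = np.abs(W.T @ W)
np.fill_diagonal(G, 0.0)
mu = G.max()

# A k-sparse code with nonzero entries in [1, 2]
true_support = rng.choice(m, size=k, replace=False)
z = np.zeros(m)
z[true_support] = 1.0 + rng.random(k)
z1, zk = np.abs(z[true_support]).max(), np.abs(z[true_support]).min()

# Support estimation as in the theorem: the k largest entries of W^T x
x = W @ z
estimated_support = np.argsort(np.abs(W.T @ x))[-k:]

print("condition k*mu <= zk / (2*z1) holds:", k * mu <= zk / (2 * z1))
print("support recovered:", set(estimated_support) == set(true_support))
```

When the printed condition holds, the theorem guarantees recovery; recovery may of course still succeed when the condition fails, since it is only sufficient.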
The method is evaluated on MNIST, NORB, and image patches extracted from CIFAR-10. Key experimental insights include:
- Effect of Sparsity Level: The parameter k controls the degree of sparsity and hence the granularity of the learned features. Large values of k yield highly local filters, while moderate values produce more global, part-like features that improve classification performance. If k is too small, however, the features become overly global and no longer capture a part-based decomposition of the input.
- Sparsity Scheduling: Training begins with a relatively high value of k in the early epochs so that all hidden units receive gradient updates; k is then gradually reduced to its target value. This schedule prevents the “dead” units that can result from overly aggressive sparsification early in training (a minimal scheduling sketch follows this list).
- Performance Comparisons: Without any additional nonlinearities or regularizers, the k‑sparse autoencoder achieves lower classification error rates than methods such as denoising autoencoders, dropout autoencoders, and restricted Boltzmann machines (RBMs). For instance, on MNIST with 1000 hidden units, k = 25, and a softening factor α = 3 at test time, the method reaches a classification error of 1.35%, outperforming the other unsupervised feature learning techniques considered. Similar improvements are reported on NORB.
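For the sparsity-scheduling item above, a simple annealing of k suffices; the sketch below is a minimal illustration in which the starting value, warm-up length, and linear shape are assumptions rather than the paper's exact schedule. The test-time helper reflects the αk softening mentioned above.

```python
def k_schedule(epoch, k_target, k_start=100, warmup_epochs=10):
    """Linearly anneal the sparsity level from k_start down to k_target over the
    first warmup_epochs, then hold it fixed (shape and defaults are illustrative)."""
    if epoch >= warmup_epochs:
        return k_target
    frac = epoch / warmup_epochs
    return int(round(k_start + frac * (k_target - k_start)))

def k_at_test_time(k_target, alpha=3):
    """At test time a softer code with alpha * k active units is used
    (alpha = 3 in the MNIST setting reported above)."""
    return alpha * k_target

# Example: schedule used while training toward the MNIST setting k = 25
ks = [k_schedule(epoch, k_target=25) for epoch in range(15)]
```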
Furthermore, the paper discusses how the k‑sparse autoencoder can be used as a pre-training module for both shallow and deep neural networks. When its weights initialize a supervised model that is subsequently fine-tuned, performance is competitive with state‑of‑the‑art approaches, and deep architectures reach error rates below 1.0% on MNIST after fine-tuning.
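One way such pre-training is typically wired up is sketched below, under the assumption that the trained encoder weights W and bias b are reused directly; the helper name and the αk softening default are illustrative. The resulting features can feed a shallow classifier, or W can initialize the first layer of a deeper network that is then fine-tuned end to end.

```python
import numpy as np

def extract_features(X, W, b, k, alpha=3):
    """Encode a data matrix X (n, d) with a trained k-sparse autoencoder, keeping
    alpha * k activations per example as at test time (alpha = 3 above)."""
    Z = X @ W + b                                              # (n, m) linear activations
    keep = alpha * k
    thresh = np.partition(np.abs(Z), -keep, axis=1)[:, -keep]  # per-row cutoff value
    return Z * (np.abs(Z) >= thresh[:, None])                  # zero out everything below it

# The features can be fed to a shallow classifier, or W can be used to initialize
# the first layer of a deep network before supervised fine-tuning.
```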
In summary, the paper introduces an autoencoder that attains sparsity through a simple but effective hard-selection mechanism and provides a rigorous analysis connecting it to sparse coding theory. It relates dictionary incoherence, support recovery, and the granularity of the learned features, and supports these connections with empirical results showing consistent gains in both unsupervised and supervised settings.