Sparse Coding: Autoencoder Interpretation

In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. 

[[File:STL_SparseAE.png | 240px]]

Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis  for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.

Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1
</math>

(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)

The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. 

However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, 
<math>A_j^TA_j \le 1</math>. Our problem is thus:

:<math>
\begin{array}{rcl}
     {\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\
     {\rm s.t.}     &    A_j^TA_j \le 1 \; \forall j \\
\end{array} 
</math>

Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.

However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:

:<math>
J_{weight}(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2
</math>

(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)

In practice, this objective is easier to optimize for, and works relatively well.