Linear Decoders
== Sparse Autoencoder Recap ==

In the sparse autoencoder, we had 3 layers of neurons: an input layer, a hidden layer and an output layer. In our previous description of autoencoders (and of neural networks), every neuron in the neural network used the same activation function. In these notes, we describe a modified version of the autoencoder in which some of the neurons use a different activation function. This will result in a model that is sometimes simpler to apply, and can also be more robust to variations in the parameters.

Recall that each neuron (in the output layer) computed the following:
<math>
\begin{align}
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\
a^{(3)} &= f(z^{(3)})
\end{align}
</math>
where <math>a^{(3)}</math> is the output. In the autoencoder, <math>a^{(3)}</math> is our approximate reconstruction of the input <math>x = a^{(1)}</math>.

Because we used a sigmoid activation function for <math>f(z^{(3)})</math>, we needed to constrain or scale the inputs to be in the range <math>[0,1]</math>, since the sigmoid function outputs numbers in the range <math>[0,1]</math>. While some datasets like MNIST fit well with this scaling of the output, this can sometimes be awkward to satisfy. For example, if one uses PCA whitening, the input is no longer constrained to <math>[0,1]</math> and it's not clear what the best way is to scale the data to ensure it fits into the constrained range.
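To make this concrete, here is a minimal numpy sketch of the forward pass just described, with a sigmoid activation at the output layer. The variable names (<code>W1</code>, <code>b1</code>, <code>W2</code>, <code>b2</code>) and the layer sizes are illustrative assumptions, not something fixed by these notes.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_input, n_hidden = 64, 25                               # e.g. 8x8 image patches, 25 hidden units (assumed sizes)
W1 = rng.normal(scale=0.01, size=(n_hidden, n_input))    # W^(1), b^(1): input -> hidden
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_input, n_hidden))    # W^(2), b^(2): hidden -> output
b2 = np.zeros(n_input)

x = rng.uniform(0.0, 1.0, size=n_input)   # input scaled to [0,1], as the sigmoid output requires

a2 = sigmoid(W1 @ x + b1)                 # hidden layer: a^(2) = sigmoid(W^(1) x + b^(1))
z3 = W2 @ a2 + b2                         # z^(3) = W^(2) a^(2) + b^(2)
a3 = sigmoid(z3)                          # a^(3) = f(z^(3)) with f = sigmoid

# a3 always lies in (0,1), so it can only reconstruct inputs that are themselves
# scaled to [0,1]; whitened inputs outside that range cannot be matched exactly.
print(a3.min(), a3.max())
</pre>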
== Linear Decoder ==
One easy fix for this problem is to set <math>a^{(3)} = z^{(3)}</math>. Formally, this is achieved by having the output nodes use an activation function that's the identity function <math>f(z) = z</math>, so that <math>a^{(3)} = f(z^{(3)}) = z^{(3)}</math>.
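In code, only the output activation changes. The sketch below repeats the forward pass from the recap (same assumed variable names and sizes) with the identity function at the output, applied this time to a whitened-style input that is not confined to <math>[0,1]</math>.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_input, n_hidden = 64, 25
W1 = rng.normal(scale=0.01, size=(n_hidden, n_input))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_input, n_hidden))
b2 = np.zeros(n_input)

x = rng.normal(size=n_input)     # e.g. a PCA-whitened input, no longer confined to [0,1]

a2 = sigmoid(W1 @ x + b1)        # hidden layer still uses the sigmoid activation
z3 = W2 @ a2 + b2                # z^(3) = W^(2) a^(2) + b^(2)
a3 = z3                          # identity output activation: a^(3) = f(z^(3)) = z^(3)

# a3 is no longer squashed into (0,1), so the reconstruction can take
# values of any scale or sign, matching unconstrained inputs.
print(a3.min(), a3.max())
</pre>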
This activation function <math>f(\cdot)</math> is called the '''linear activation function''' (though perhaps "identity activation function" would have been a better name). Note however that in the ''hidden'' layer of the network, we still use a sigmoid (or tanh) activation function, so that the hidden units are (say) <math>a^{(2)} = \sigma(W^{(1)}x + b^{(1)})</math>, where <math>\sigma(\cdot)</math> is the sigmoid function,