# Fine-tuning Stacked AEs

Latest revision as of 04:04, 8 April 2013 by Kandeng (page created 06:54, 21 April 2011 by Watsuen).

=== Introduction ===
Fine-tuning is a strategy commonly used in deep learning, and it can greatly improve the performance of a stacked autoencoder. From a high-level perspective, fine-tuning treats all layers of a stacked autoencoder as a single model, so that each iteration improves all the weights in the stacked autoencoder.

=== General Strategy ===
Fortunately, we already have all the tools necessary to implement fine-tuning for stacked autoencoders! To compute the gradients for all the layers of the stacked autoencoder in each iteration, we use the [[Backpropagation Algorithm]], as discussed in the sparse autoencoder section. Because the backpropagation algorithm extends to an arbitrary number of layers, we can use it on a stacked autoencoder of arbitrary depth.

=== Fine-tuning with Backpropagation ===
For your convenience, the backpropagation algorithm is summarized below in element-wise notation, where $\bullet$ denotes the element-wise product:

: 1. Perform a feedforward pass, computing the activations for layers $\textstyle L_2$, $\textstyle L_3$, up to the output layer $\textstyle L_{n_l}$, using the equations defining the forward propagation steps.
: 2. For the output layer (layer $\textstyle n_l$), set
::\begin{align}
\delta^{(n_l)}
= - (\nabla_{a^{(n_l)}}J) \bullet f'(z^{(n_l)})
\end{align}
::(When using softmax regression, the softmax layer has $\nabla J = \theta^T(I-P)$, where $I$ is the input labels and $P$ is the vector of conditional probabilities.)
: 3. For $\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2$
::Set
:::\begin{align}
\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})
\end{align}
: 4. Compute the desired partial derivatives:
::\begin{align}
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.
\end{align}

The overall cost is the average of the per-example costs over the $m$ training examples:

:\begin{align}
J(W,b)
&= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right]
\end{align}

{{Quote|
Note: While one could consider the softmax classifier as an additional layer, the derivation above does not. Specifically, we consider the "last layer" of the network to be the features that go into the softmax classifier. Therefore, the derivatives (in Step 2) are computed using $\delta^{(n_l)} = - (\nabla_{a^{(n_l)}}J) \bullet f'(z^{(n_l)})$, where $\nabla J = \theta^T(I-P)$.
}}
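The backpropagation steps above can be sketched in plain Python for a single training example. This is only an illustrative sketch, not the tutorial's reference implementation, and it rests on several assumptions that the text does not state: the activation $f$ is the sigmoid (so $f'(z) = a(1-a)$), the per-example cost is the negative log-probability that the softmax classifier assigns to the true class, $I$ is a one-hot encoding of the label, and the update is plain gradient descent. All function names (`forward`, `backprop`, `fine_tune_step`) and the learning rate `lr` are hypothetical.

```python
import math

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(s):
    top = max(s)  # subtract the max for numerical stability
    e = [math.exp(v - top) for v in s]
    total = sum(e)
    return [v / total for v in e]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def forward(Ws, bs, theta, x):
    """Step 1: feedforward pass through every layer, then the softmax classifier."""
    acts = [x]  # acts[0] is the input a^(1); acts[-1] is a^(n_l)
    for W, b in zip(Ws, bs):
        z = [u + c for u, c in zip(matvec(W, acts[-1]), b)]
        acts.append(sigmoid(z))
    return acts, softmax(matvec(theta, acts[-1]))

def cost(Ws, bs, theta, x, y):
    """Assumed per-example cost: negative log-probability of the true class y."""
    _, P = forward(Ws, bs, theta, x)
    return -math.log(P[y])

def backprop(Ws, bs, theta, x, y):
    """Steps 2-4 for one example; returns gradients for every W, every b, and theta."""
    acts, P = forward(Ws, bs, theta, x)
    I = [1.0 if k == y else 0.0 for k in range(len(P))]  # one-hot label (assumption)
    diff = [i - p for i, p in zip(I, P)]                 # I - P
    grad_theta = [[-d * a for a in acts[-1]] for d in diff]
    # Step 2: delta^(n_l) = -(theta^T (I - P)) . f'(z^(n_l)), with f'(z) = a (1 - a)
    back = matvec(transpose(theta), diff)
    delta = [-g * a * (1.0 - a) for g, a in zip(back, acts[-1])]
    grads_W, grads_b = [], []
    for l in range(len(Ws) - 1, -1, -1):
        # Step 4: grad W = delta^(l+1) (a^(l))^T and grad b = delta^(l+1)
        grads_W.insert(0, [[d * a for a in acts[l]] for d in delta])
        grads_b.insert(0, list(delta))
        if l > 0:
            # Step 3: delta^(l) = ((W^(l))^T delta^(l+1)) . f'(z^(l))
            back = matvec(transpose(Ws[l]), delta)
            delta = [g * a * (1.0 - a) for g, a in zip(back, acts[l])]
    return grads_W, grads_b, grad_theta

def fine_tune_step(Ws, bs, theta, x, y, lr=0.1):
    """One gradient-descent update of all layers and the classifier at once (in place)."""
    grads_W, grads_b, grad_theta = backprop(Ws, bs, theta, x, y)
    for M, G in zip(Ws + [theta], grads_W + [grad_theta]):
        for row, grad_row in zip(M, G):
            for j, g in enumerate(grad_row):
                row[j] -= lr * g
    for b, grad_b in zip(bs, grads_b):
        for j, g in enumerate(grad_b):
            b[j] -= lr * g
```

Here `Ws` and `bs` would hold the weights and biases obtained from greedy layer-wise training, and `theta` the softmax parameters. To match the averaged cost $J(W,b)$, the per-example gradients would be averaged over all $m$ examples before each update. A finite-difference gradient check is a reasonable way to validate a sketch like this before trusting it.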