== Introduction ==

In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).

First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:
<ol>
<li>For <math>l = n_l, n_l-1, n_l-2, \ldots, 2</math>
:For each node <math>i</math> in layer <math>l</math>, set
::<math>
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)
</math>
<li>Compute the desired partial derivatives,
:<math>
\begin{align}
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T
\end{align}
</math>
</ol>

Quick notation recap:
<ul>
<li><math>n_l</math> is the number of layers in the neural network
<li><math>s_l</math> is the number of neurons in the <math>l</math>th layer
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math>
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer
</ul>

Notice that we don't consider an objective function in this case, and we allow each layer to have a different activation function <math>f^{(l)}</math>. This will be useful in allowing us to compute the gradients of functions of matrices.

== The method ==

To compute the gradient with respect to some matrix <math>X</math> of a complicated function of matrices, it may be helpful to consider the function as a multi-layer neural network, if possible.

Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient of <math>F</math> with respect to <math>X</math>, that is <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea.

To do this, we will set our "objective function" to be the function <math>J(z)</math> that, when applied to the outputs of the neurons in the last layer, yields the value <math>F(X)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end.

Using this method, we can easily compute derivatives with respect to the inputs <math>X</math>, as well as derivatives with respect to any of the weights in the network, as we shall see later.

== Examples ==

To illustrate the use of the backpropagation idea to compute derivatives with respect to the inputs, we will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] in examples 1 and 2. In example 3, we use a function from [[Independent Component Analysis | independent component analysis]] to illustrate the use of this idea to compute derivatives with respect to weights, and in this specific case, what to do in the case of tied or repeated weights.

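Whenever we derive a gradient in this way, it is cheap to verify the result numerically before relying on it. The checker below is our own addition (a minimal sketch assuming Python with NumPy, not part of the original notes); the examples that follow reuse it.

<source lang="python">
import numpy as np

def numerical_gradient(F, X, eps=1e-5):
    """Centered-difference estimate of nabla_X F for a scalar-valued F.
    Temporarily perturbs the entries of X in place."""
    grad = np.zeros_like(X)
    it = np.nditer(X, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        saved = X[idx]
        X[idx] = saved + eps
        f_plus = F(X)
        X[idx] = saved - eps
        f_minus = F(X)
        X[idx] = saved                      # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad
</source>
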
=== Example 1: Objective for weight matrix in sparse coding ===

Recall from [[Sparse Coding: Autoencoder Interpretation | sparse coding]] the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:
:<math>F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math>

We would like to find the gradient of <math>F</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A F(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of the gradients of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead.

The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of a neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the list and diagram below:

<ol>
<li>Apply <math>A</math> as the weights from the first layer to the second layer.
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.
<li>Sum all the activations of the third layer.
</ol>

[[File:Backpropagation Method Example 1.png | 400px]]

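As a concrete sketch of the four steps above (our own illustration, assuming Python with NumPy; the shapes are arbitrary), the forward pass and the backpropagated gradient of the first term can be written as follows. Backpropagating through this network works out to <math>\nabla_A \lVert As - x \rVert_2^2 = 2(As - x)s^T</math>, which we can compare against the numerical checker from the method section.

<source lang="python">
import numpy as np

# Forward pass: the four steps of the network above.
def first_term(A, s, x):
    z2 = A @ s              # 1. apply A as layer-1 -> layer-2 weights
    a2 = z2 - x             # 2. subtract x (identity activation)
    a3 = a2 ** 2            # 3. identity weights, square activation
    return float(a3.sum())  # 4. sum the activations of the third layer

# Backward pass: the delta at layer 2 is 2(As - x); nabla_A = delta s^T.
def first_term_grad(A, s, x):
    return 2 * (A @ s - x) @ s.T

# Check against numerical_gradient, defined in the method section above.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # weight matrix
s = rng.normal(size=(3, 4))   # feature matrix
x = rng.normal(size=(5, 4))   # data
assert np.allclose(first_term_grad(A, s, x),
                   numerical_gradient(lambda A_: first_term(A_, s, x), A))
</source>
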
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===

Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in [[Sparse Coding: Autoencoder Interpretation | sparse coding]]:
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math>
where <math>V</math> is the grouping matrix, <math>s</math> is the feature matrix and <math>\epsilon</math> is a constant.

We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:

[[File:Backpropagation Method Example 2.png | 600px]]

Working out the deltas as in the previous example then yields the gradient:
:<math>
\begin{align}
\nabla_s \sum{ \sqrt{Vss^T + \epsilon} } & = V^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\
& = V^T (Vss^T + \epsilon)^{-\frac{1}{2}} \bullet s
\end{align}
</math>

=== Example 3: ICA reconstruction cost ===

Recall the [[Independent Component Analysis | independent component analysis (ICA)]] reconstruction cost term:
:<math>\lVert W^TWx - x \rVert_2^2</math>
where <math>W</math> is the weight matrix and <math>x</math> is the input.

We would like to find <math>\nabla_W \lVert W^TWx - x \rVert_2^2</math>, the derivative of the term with respect to the '''weight matrix''', rather than the '''input''' as in the earlier two examples. We will still proceed similarly though, seeing this term as an instantiation of a neural network:

[[File:Backpropagation Method Example 3.png | 400px]]

The weights and activation functions of this network are as follows:

<table align="center">
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr>
<tr>
<td>1</td>
<td><math>W</math></td>
<td><math>f(z_i) = z_i</math></td>
</tr>
<tr>
<td>2</td>
<td><math>W^T</math></td>
<td><math>f(z_i) = z_i</math></td>
</tr>
<tr>
<td>3</td>
<td><math>I</math></td>
<td><math>f(z_i) = z_i - x_i</math></td>
</tr>
<tr>
<td>4</td>
<td>N/A</td>
<td><math>f(z_i) = z_i^2</math></td>
</tr>
</table>

To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>.

Now that we can see <math>F</math> as a neural network, we can try to compute the gradient <math>\nabla_W F</math>. However, we now face the difficulty that <math>W</math> appears twice in the network. Fortunately, it turns out that if <math>W</math> appears multiple times in the network, the gradient with respect to <math>W</math> is simply the sum of gradients for each instance of <math>W</math> in the network (you may wish to work out a formal proof of this fact to convince yourself). With this in mind, we will proceed to work out the deltas first:

<table align="center">
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr>
<tr>
<td>4</td>
<td><math>f'(z_i) = 2z_i</math></td>
<td><math>f'(z_i) = 2z_i</math></td>
<td><math>(W^TWx - x)</math></td>
</tr>
<tr>
<td>3</td>
<td><math>f'(z_i) = 1</math></td>
<td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td>
<td><math>W^TWx</math></td>
</tr>
<tr>
<td>2</td>
<td><math>f'(z_i) = 1</math></td>
<td><math>\left( (W^T)^T \delta^{(3)} \right) \bullet 1</math></td>
<td><math>Wx</math></td>
</tr>
<tr>
<td>1</td>
<td><math>f'(z_i) = 1</math></td>
<td><math>\left( W^T \delta^{(2)} \right) \bullet 1</math></td>
<td><math>x</math></td>
</tr>
</table>

To find the gradients with respect to <math>W</math>, first we find the gradients with respect to each instance of <math>W</math> in the network.

With respect to <math>W^T</math>:
:<math>
\begin{align}
\nabla_{W^T} F & = \delta^{(3)} a^{(2)T} \\
& = 2(W^TWx - x) (Wx)^T
\end{align}
</math>

With respect to <math>W</math>:
:<math>
\begin{align}
\nabla_{W} F & = \delta^{(2)} a^{(1)T} \\
& = W(2(W^TWx - x)) x^T
\end{align}
</math>

Taking sums, noting that we need to transpose the gradient with respect to <math>W^T</math> to get the gradient with respect to <math>W</math>, yields the final gradient with respect to <math>W</math> (pardon the slight abuse of notation here):

:<math>
\begin{align}
\nabla_{W} F & = \nabla_{W} F + (\nabla_{W^T} F)^T \\
& = W(2(W^TWx - x)) x^T + 2(Wx)(W^TWx - x)^T
\end{align}
</math>

{{Languages|用反向传导思想求导|中文}}
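The tied-weight rule is easy to sanity-check numerically. The sketch below is our own illustration (again assuming Python with NumPy, and reusing <code>numerical_gradient</code> from the method section), not part of the original notes.

<source lang="python">
import numpy as np

def ica_cost(W, x):
    return float(np.sum((W.T @ W @ x - x) ** 2))   # ||W^T W x - x||_2^2

# Sum of the per-instance gradients: the layer-1 instance contributes
# W (2r) x^T and the transposed layer-2 instance contributes 2 (Wx) r^T,
# where r = W^T W x - x is the residual.
def ica_grad(W, x):
    r = W.T @ W @ x - x
    return 2 * W @ r @ x.T + 2 * (W @ x) @ r.T

# Check against numerical_gradient, defined in the method section above.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))    # k x n weight matrix
x = rng.normal(size=(6, 1))    # n x 1 input
assert np.allclose(ica_grad(W, x),
                   numerical_gradient(lambda W_: ica_cost(W_, x), W))
</source>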