Neural Networks

Here are plots of the sigmoid and <math>\tanh</math> functions:

[[Image:Sigmoid_Function.png|400px|center|Sigmoid activation function.]]

[[Image:Tanh_Function.png|400px|center|Tanh activation function.]]

The <math>\tanh(z)</math> function is a rescaled version of the sigmoid, and its output range is <math>[-1,1]</math> instead of <math>[0,1]</math>.
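
The rescaling can be stated exactly: <math>\tanh(z) = 2f(2z) - 1</math>, where <math>f</math> is the sigmoid. A quick numerical check of this (a minimal Python sketch of our own, not part of the tutorial):

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh is the sigmoid rescaled from [0, 1] to [-1, 1]:
# tanh(z) = 2 * sigmoid(2z) - 1.
z = np.linspace(-5.0, 5.0, 11)
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
</pre>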
Note that unlike CS221 and (parts of) CS229, we are not using the convention here of <math>x_0=1</math>.  Instead, the intercept term is handled separately by the parameter <math>b</math>.

Finally, one identity that'll be useful later: If <math>f(z) = 1/(1+\exp(-z))</math> is the sigmoid function, then its derivative is given by <math>f'(z) = f(z) (1-f(z))</math>.  (If <math>f</math> is the tanh function, then its derivative is given by <math>f'(z) = 1- (f(z))^2</math>.)  You can derive this yourself using the definition of the sigmoid (or tanh) function.
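
These two derivative identities are easy to sanity-check numerically. The snippet below is a minimal sketch (Python with NumPy; the helper names are our own), comparing each identity against a centered finite difference:

<pre>
import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Identity: f'(z) = f(z) * (1 - f(z))
    fz = sigmoid(z)
    return fz * (1.0 - fz)

def tanh_prime(z):
    # Identity: f'(z) = 1 - (f(z))^2
    return 1.0 - np.tanh(z) ** 2

# Both printed pairs should agree to several decimal places.
z, eps = 0.7, 1e-6
print(sigmoid_prime(z), (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps))
print(tanh_prime(z), (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps))
</pre>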
== Neural Network formulation ==
A neural network is put together by hooking together many of our simple "neurons," so that the output of a neuron can be the input of another.  For example, here is a small neural network:

[[Image:Network331.png|400px|center]]

In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called '''bias units''', and correspond to the intercept term. The leftmost layer of the network is called the '''input layer''', and the rightmost layer the '''output layer''' (which, in this example, has only one node).  The middle layer of nodes is called the '''hidden layer''', because its values are not observed in the training set.  We also say that our example neural network has 3 '''input units''' (not counting the bias unit), 3 '''hidden units''', and 1 '''output unit'''.

We will let <math>n_l</math> denote the number of layers in our network; thus <math>n_l=3</math> in our example.  We label layer <math>l</math> as <math>L_l</math>, so layer <math>L_1</math> is the input layer, and layer <math>L_{n_l}</math> the output layer. Our neural network has parameters <math>(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})</math>, where we write <math>W^{(l)}_{ij}</math> to denote the parameter (or weight) associated with the connection between unit <math>j</math> in layer <math>l</math>, and unit <math>i</math> in layer <math>l+1</math>.  (Note the order of the indices.) Also, <math>b^{(l)}_i</math> is the bias associated with unit <math>i</math> in layer <math>l+1</math>. Thus, in our example, we have <math>W^{(1)} \in \Re^{3\times 3}</math>, and <math>W^{(2)} \in \Re^{1\times 3}</math>. Note that bias units don't have inputs or connections going into them, since they always output the value +1.  We also let <math>s_l</math> denote the number of nodes in layer <math>l</math> (not counting the bias unit).
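
To make the indexing convention concrete, the parameters of this particular 3-3-1 network could be stored as follows (a hypothetical layout in Python with NumPy; the variable names are ours, not the tutorial's):

<pre>
import numpy as np

# W[i, j] is the weight on the connection from unit j in layer l
# to unit i in layer l+1: rows index the next layer, columns the current one.
W1 = np.zeros((3, 3))  # W^{(1)}: 3 inputs -> 3 hidden units
b1 = np.zeros(3)       # b^{(1)}_i: bias into unit i of layer 2
W2 = np.zeros((1, 3))  # W^{(2)}: 3 hidden units -> 1 output unit
b2 = np.zeros(1)       # b^{(2)}_1: bias into the output unit
</pre>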
We will write <math>a^{(l)}_i</math> to denote the '''activation''' (meaning output value) of unit <math>i</math> in layer <math>l</math>.  For <math>l=1</math>, we also use <math>a^{(1)}_i = x_i</math> to denote the <math>i</math>-th input. Given a fixed setting of the parameters <math>W,b</math>, our neural network defines a hypothesis <math>h_{W,b}(x)</math> that outputs a real number.  Specifically, the computation that this neural network represents is given by:

:<math>
\begin{align}
a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})  \\
a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})  \\
a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})  \\
h_{W,b}(x) &= a_1^{(3)} =  f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
\end{align}
</math>

In the sequel, we also let <math>z^{(l)}_i</math> denote the total weighted sum of inputs to unit <math>i</math> in layer <math>l</math>, including the bias term (e.g., <math>z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i</math>), so that <math>a^{(l)}_i = f(z^{(l)}_i)</math>.
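
Concretely, the computation above can be carried out in a few lines of code. The sketch below is our own illustration (Python with NumPy; the names <code>W1</code>, <code>b1</code>, <code>W2</code>, <code>b2</code>, and <code>hypothesis</code> are hypothetical, not part of the tutorial), computing <math>h_{W,b}(x)</math> for the 3-3-1 network shown earlier with a sigmoid <math>f</math>:

<pre>
import numpy as np

def f(z):
    # Sigmoid activation, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(W1, b1, W2, b2, x):
    # z^{(2)}_i = sum_j W^{(1)}_{ij} x_j + b^{(1)}_i, then a^{(2)} = f(z^{(2)}).
    a2 = f(W1 @ x + b1)
    # h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)} a^{(2)} + b^{(2)}).
    return f(W2 @ a2 + b2)

# Example: small random parameters and a 3-dimensional input.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
print(hypothesis(W1, b1, W2, b2, np.array([1.0, 2.0, 3.0])))
</pre>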
