Neural Network Vectorization
In this section, we derive a vectorized version of our neural network. In our earlier description of [[Neural Networks]], we had already given a partially vectorized implementation that is quite efficient if we are working with only a single example at a time. We now describe how to implement the algorithm so that it simultaneously processes multiple training examples. Specifically, we will do this for the forward propagation and backpropagation steps, as well as for learning a sparse set of features.

== Forward propagation ==

Consider a 3 layer neural network (with one input, one hidden, and one output layer), and suppose <tt>x</tt> is a column vector containing a single training example <math>x^{(i)} \in \Re^{n}</math>. Then the forward propagation step is given by:

:<math>\begin{align}
z^{(2)} &= W^{(1)} x + b^{(1)} \\
a^{(2)} &= f(z^{(2)}) \\
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\
h_{W,b}(x) &= a^{(3)} = f(z^{(3)})
\end{align}</math>

This is a fairly efficient implementation for a single example. If we have <math>m</math> examples, then we would wrap a <tt>for</tt> loop around this. Concretely, following the [[Logistic Regression Vectorization Example]], let the Matlab/Octave variable <tt>x</tt> be a matrix containing the training inputs, so that <tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example. We can then implement forward propagation as:

<syntaxhighlight>
% Initial implementation
for i=1:m,
  z2 = W1 * x(:,i) + b1;
  a2 = f(z2);
  z3 = W2 * a2 + b2;
  h(:,i) = f(z3);
end;
</syntaxhighlight>

Can we get rid of the <tt>for</tt> loop? For many algorithms, we will represent intermediate stages of computation via vectors. For example, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> here are all column vectors that are used to compute the activations of the hidden and output layers. In order to take better advantage of parallelism and efficient matrix operations, we would like to ''have our algorithm operate simultaneously on many training examples''. Let us temporarily ignore <tt>b1</tt> and <tt>b2</tt> (say, set them to zero for now). We can then implement the following:

<syntaxhighlight>
% Vectorized implementation (ignoring b1, b2)
z2 = W1 * x;
a2 = f(z2);
z3 = W2 * a2;
h = f(z3);
</syntaxhighlight>

In this implementation, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> are all matrices, with one column per training example. A common design pattern in vectorizing across training examples is this: whereas previously we had one column vector (such as <tt>z2</tt>) per training example, we can often instead compute a single matrix in which all of these column vectors are stacked together. Concretely, in this example, <tt>a2</tt> becomes an <math>s_2</math> by <math>m</math> matrix (where <math>s_2</math> is the number of units in layer 2 of the network, and <math>m</math> is the number of training examples). The <math>i</math>-th column of <tt>a2</tt> then contains the activations of the hidden units (layer 2) when the <math>i</math>-th training example <tt>x(:,i)</tt> is input to the network.

In the implementation above, we have assumed that the activation function <tt>f(z)</tt> takes as input a matrix <tt>z</tt>, and applies the activation function component-wise to the input. Note that your implementation of <tt>f(z)</tt> should use Matlab/Octave's matrix operations as much as possible, to avoid <tt>for</tt> loops as well.
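Before moving on, a quick sanity check can confirm that the vectorized pass computes the same thing as the original loop. The following is a sketch of our own (it is not part of the tutorial's code), and it ignores <tt>b1</tt> and <tt>b2</tt> to match the partial implementation above:

<syntaxhighlight>
% Sanity check (a sketch): the vectorized pass should agree with the loop.
% Uses the same W1, W2, x, f, and m as above; b1 and b2 are ignored here.
h_loop = zeros(size(W2,1), m);
for i=1:m,
  h_loop(:,i) = f(W2 * f(W1 * x(:,i)));
end;
h_vec = f(W2 * f(W1 * x));
disp(max(max(abs(h_vec - h_loop))));  % should print zero (or nearly so)
</syntaxhighlight>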
To illustrate, here are an unvectorized and a vectorized implementation of the sigmoid activation function:

<syntaxhighlight>
% Unvectorized implementation of the activation function
function output = unvectorized_f(z)
  output = zeros(size(z));
  for i=1:size(z,1),
    for j=1:size(z,2),
      output(i,j) = 1/(1+exp(-z(i,j)));
    end;
  end;
end

% Efficient, vectorized implementation of the activation function
function output = vectorized_f(z)
  output = 1./(1+exp(-z));  % "./" is Matlab/Octave's element-wise division operator.
end
</syntaxhighlight>

Finally, our vectorized implementation of forward propagation above ignored <tt>b1</tt> and <tt>b2</tt>. To incorporate those back in, we will use Matlab/Octave's built-in <tt>repmat</tt> function. We have:

<syntaxhighlight>
% Vectorized implementation of forward propagation
z2 = W1 * x + repmat(b1,1,m);
a2 = f(z2);
z3 = W2 * a2 + repmat(b2,1,m);
h = f(z3);
</syntaxhighlight>

The result of <tt>repmat(b1,1,m)</tt> is a matrix formed by taking the column vector <tt>b1</tt> and stacking <math>m</math> copies of it side by side, as follows:

:<math>
\begin{bmatrix}
| & | & & | \\
b1 & b1 & \cdots & b1 \\
| & | & & |
\end{bmatrix}
</math>

Thus, the result of adding this to <tt>W1 * x</tt> is that each column of the matrix gets <tt>b1</tt> added to it, as desired. See Matlab/Octave's documentation (type "<tt>help repmat</tt>") for more information. As a Matlab/Octave built-in function, <tt>repmat</tt> is very efficient, much more so than implementing a <tt>for</tt> loop yourself.

== Backpropagation ==

Suppose the Matlab/Octave variable <tt>y</tt> is a ''row'' vector of the labels in the training set, so that the variable <tt>y(i)</tt> is <math>\textstyle y^{(i)} \in \{0,1\}</math>. (Here we differ from the CS229 notation. Specifically, in the matrix-valued <tt>x</tt> we stack the training inputs in columns rather than in rows; and <tt>y</tt><math>\in \Re^{1\times m}</math> is a row vector rather than a column vector.)
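As with forward propagation, backpropagation can be vectorized by stacking the per-example error terms <math>\delta^{(l)}</math> into matrices with one column per training example. Below is a minimal sketch, assuming a squared-error cost and sigmoid activations (so that <math>f'(z^{(l)}) = a^{(l)} \cdot (1 - a^{(l)})</math> element-wise) and a single output unit, so that <tt>h</tt> is <math>1 \times m</math> like <tt>y</tt>; the variable names <tt>delta2</tt>, <tt>delta3</tt>, and the gradient variables are our own, not the tutorial's.

<syntaxhighlight>
% Sketch: vectorized backpropagation (assumes squared-error cost and
% sigmoid activations; h, a2, x are the matrices from the forward pass,
% and y is the 1-by-m row vector of labels).
delta3 = -(y - h) .* h .* (1 - h);          % output error terms, one column per example
delta2 = (W2' * delta3) .* a2 .* (1 - a2);  % hidden error terms, an s2-by-m matrix
W2grad = delta3 * a2' / m;                  % gradient of W2, averaged over all m examples
W1grad = delta2 * x'  / m;                  % gradient of W1
b2grad = sum(delta3, 2) / m;                % summing across columns replaces the per-example loop
b1grad = sum(delta2, 2) / m;
</syntaxhighlight>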