In this section, we derive a vectorized version of our neural network. In our earlier description of [[Neural Networks]], we gave a partially vectorized implementation, which is quite efficient when we process only a single example at a time. We now describe how to implement the algorithm so that it processes multiple training examples simultaneously. Specifically, we will do this for the forward propagation and backpropagation steps, as well as for learning a sparse set of features.
== Forward propagation ==
Consider a 3-layer neural network (with one input, one hidden, and one output layer), and suppose <tt>x</tt> is a column vector containing a single training example <math>x^{(i)} \in \Re^{n}</math>. Then the forward propagation step is given by:
:<math>\begin{align}
z^{(2)} &= W^{(1)} x + b^{(1)} \\
a^{(2)} &= f(z^{(2)}) \\
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\
h_{W,b}(x) &= a^{(3)} = f(z^{(3)})
\end{align}</math>
This is a fairly efficient implementation for a single example. If we have <tt>m</tt> examples, then we would wrap a for loop around this.
Concretely, following the [[Logistic Regression Vectorization Example]], let the Matlab/Octave variable <tt>x</tt> be a matrix containing the training inputs, so that <tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example.  We can then implement forward propagation as:
% Unvectorized implementation
for i=1:m,
  z2 = W1 * x(:,i) + b1;
  a2 = f(z2);
  z3 = W2 * a2 + b2;
  h(:,i) = f(z3);
end;
Can we get rid of the <tt>for</tt> loop?  For many algorithms, we will represent intermediate stages of computation via vectors.  For example, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> here are all column vectors that are used to compute the activations of the hidden and output layers.  In order to take better advantage of parallelism and efficient matrix operations, we would like to ''have our algorithm operate simultaneously on many training examples''.  Let us temporarily ignore <tt>b1</tt> and <tt>b2</tt> (say, set them to zero for now).  We can then implement the following:
% Vectorized implementation (ignoring b1, b2)
z2 = W1 * x;
a2 = f(z2);
z3 = W2 * a2;
h = f(z3);
In this implementation, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> are all matrices, with one column per training example.  A common design pattern in vectorizing across training examples is that where we previously had one column vector (such as <tt>z2</tt>) per training example, we can often instead compute a single matrix in which all of these column vectors are stacked side by side.  Concretely, in this example, <tt>a2</tt> becomes an <math>s_2</math> by <math>m</math> matrix (where <math>s_2</math> is the number of units in layer 2 of the network, and <math>m</math> is the number of training examples).  The <math>i</math>-th column of <tt>a2</tt> contains the activations of the hidden units (layer 2 of the network) when the <math>i</math>-th training example <tt>x(:,i)</tt> is input to the network.
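The column-stacking pattern can be checked numerically. The following is only a minimal sketch: the layer sizes, the random weights and inputs, and the names <tt>h_loop</tt> and <tt>a2_loop</tt> are assumptions made for illustration, and the biases are ignored as in the text above.

```octave
% Hypothetical sizes and random data, chosen only for illustration.
n = 4; s2 = 3; s3 = 2; m = 5;
f = @(z) 1 ./ (1 + exp(-z));     % sigmoid, applied element-wise
W1 = randn(s2, n);
W2 = randn(s3, s2);
x = randn(n, m);                 % one training example per column

% Loop version: one column vector per training example.
a2_loop = zeros(s2, m);
h_loop = zeros(s3, m);
for i = 1:m
  a2_i = f(W1 * x(:, i));
  a2_loop(:, i) = a2_i;
  h_loop(:, i) = f(W2 * a2_i);
end

% Vectorized version: all columns are computed at once.
a2 = f(W1 * x);
h = f(W2 * a2);

disp(size(a2));                             % rows = s2, columns = m
disp(max(abs(h(:) - h_loop(:))));           % difference between the two versions
```

If the two versions agree up to floating-point rounding, each column of <tt>a2</tt> and <tt>h</tt> matches what the loop produced for the corresponding training example.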
In the implementation above, we have assumed that the activation function <tt>f(z)</tt> takes as input a matrix <tt>z</tt>, and applies the activation function component-wise to the input.  Note that your implementation of <tt>f(z)</tt> should use Matlab/Octave's matrix operations as much as possible, and avoid <tt>for</tt> loops as well.  We illustrate this below, assuming that <tt>f(z)</tt> is the sigmoid activation function:
% Inefficient, unvectorized implementation of the activation function
function output = unvectorized_f(z)
output = zeros(size(z));
for i=1:size(z,1),
  for j=1:size(z,2),
    output(i,j) = 1/(1+exp(-z(i,j)));
  end;
end;
end

% Efficient, vectorized implementation of the activation function
function output = vectorized_f(z)
output = 1./(1+exp(-z));    % "./" is Matlab/Octave's element-wise division operator.
end
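To get a rough feel for the difference, one could time both approaches on the same input. This is a sketch under assumptions: the matrix size is arbitrary, the loop and the vectorized expression are written inline rather than as the two functions above, and the timings will vary by machine and interpreter.

```octave
z = randn(200, 300);             % arbitrary test input (assumed size)

% Element-by-element sigmoid with nested loops.
tic;
out_loop = zeros(size(z));
for i = 1:size(z, 1)
  for j = 1:size(z, 2)
    out_loop(i, j) = 1 / (1 + exp(-z(i, j)));
  end
end
t_loop = toc;

% Same computation as a single element-wise expression.
tic;
out_vec = 1 ./ (1 + exp(-z));
t_vec = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', t_loop, t_vec);
fprintf('max abs difference: %g\n', max(abs(out_loop(:) - out_vec(:))));
```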
Finally, our vectorized implementation of forward propagation above had ignored <tt>b1</tt> and <tt>b2</tt>.  To incorporate those back in, we will use Matlab/Octave's built-in <tt>repmat</tt> function.  We have:
% Vectorized implementation of forward propagation
z2 = W1 * x + repmat(b1,1,m);
a2 = f(z2);
z3 = W2 * a2 + repmat(b2,1,m);
h = f(z3);
The result of <tt>repmat(b1,1,m)</tt> is a matrix formed by taking the column vector <tt>b1</tt> and stacking <math>m</math> copies of it in columns as follows:
:<math>
\begin{bmatrix}
| & | &  & |  \\
{\rm b1}  & {\rm b1}  & \cdots & {\rm b1} \\
| & | &  & |
\end{bmatrix}.
</math>
This forms an <math>s_2</math> by <math>m</math> matrix.
Thus, the result of adding this to <tt>W1 * x</tt> is that each column of the matrix gets <tt>b1</tt> added to it, as desired.
See Matlab/Octave's documentation (type "<tt>help repmat</tt>") for more information.  As a Matlab/Octave built-in function, <tt>repmat</tt> is very efficient as well, and runs much faster than if you were to implement the same thing yourself using a <tt>for</tt> loop.
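As a small check of the <tt>repmat</tt> step, the sketch below compares it against a per-column loop. The sizes and random values are assumptions for illustration. It also shows <tt>bsxfun(@plus, ...)</tt>, another built-in that performs the same column-wise addition without forming the replicated matrix; this is an alternative not used in the text above.

```octave
% Hypothetical sizes and random data, chosen only for illustration.
s2 = 3; n = 4; m = 5;
W1 = randn(s2, n);
b1 = randn(s2, 1);
x = randn(n, m);

B = repmat(b1, 1, m);                     % s2-by-m, every column equals b1
z2_repmat = W1 * x + B;

% Alternative: let bsxfun expand b1 across the columns.
z2_bsxfun = bsxfun(@plus, W1 * x, b1);

% Reference: add b1 to one column at a time.
z2_loop = zeros(s2, m);
for i = 1:m
  z2_loop(:, i) = W1 * x(:, i) + b1;
end

disp(size(B));
disp(max(abs(z2_repmat(:) - z2_loop(:))));
disp(max(abs(z2_bsxfun(:) - z2_loop(:))));
```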
== Backpropagation ==
We now describe the main ideas behind vectorizing backpropagation.  Before reading this section, we strongly encourage you to carefully step through all the forward propagation code examples above to make sure you fully understand them.  In this text, we'll only sketch the details of how to vectorize backpropagation, and leave you to derive the details in the [[Exercise:Vectorization|Vectorization exercise]].
We are in a supervised learning setting, so that we have a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> training examples.  (For the autoencoder, we simply set <math>y^{(i)} = x^{(i)}</math>, but our derivation here will consider this more general setting.)
Suppose we have <math>s_3</math>-dimensional outputs, so that our target labels are <math>y^{(i)} \in \Re^{s_3}</math>.  In our Matlab/Octave data structure, we will stack these in columns to form a Matlab/Octave variable <tt>y</tt>, so that the <math>i</math>-th column <tt>y(:,i)</tt> is <math>y^{(i)}</math>.
We now want to compute the gradient terms <math>\nabla_{W^{(l)}} J(W,b)</math> and <math>\nabla_{b^{(l)}} J(W,b)</math>.  Consider the first of these terms.  Following our earlier description of the [[Backpropagation Algorithm]], for a single training example <math>(x,y)</math>, we can compute the derivatives as
:<math>\begin{align}
\delta^{(3)} &= - (y - a^{(3)}) \bullet f'(z^{(3)}), \\
\delta^{(2)} &= ((W^{(2)})^T\delta^{(3)}) \bullet f'(z^{(2)}), \\
\nabla_{W^{(2)}} J(W,b;x,y) &= \delta^{(3)} (a^{(2)})^T, \\
\nabla_{W^{(1)}} J(W,b;x,y) &= \delta^{(2)} (a^{(1)})^T.
\end{align}</math>
Here, <math>\bullet</math> denotes the element-wise product.  For simplicity, our description here will ignore the derivatives with respect to <math>b^{(l)}</math>, though your implementation of backpropagation will have to compute those derivatives too.
Suppose we have already implemented the vectorized forward propagation method, so that the matrix-valued <tt>z2</tt>, <tt>a2</tt>,  <tt>z3</tt> and <tt>h</tt> are computed as described above. We can then implement an ''unvectorized'' version of backpropagation as follows:
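gradW1 = zeros(size(W1));
gradW2 = zeros(size(W2));
for i=1:m,
  % delta3 and delta2 are column vectors for the i-th example
  delta3 = -(y(:,i) - h(:,i)) .* fprime(z3(:,i));
  delta2 = (W2'*delta3) .* fprime(z2(:,i));
  gradW2 = gradW2 + delta3*a2(:,i)';
  gradW1 = gradW1 + delta2*a1(:,i)';   % a1 holds the layer-1 activations, i.e. the inputs x
end;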
This implementation has a <tt>for</tt> loop.  We would like to come up with an implementation that simultaneously performs backpropagation on all the examples, and eliminates this <tt>for</tt> loop.
To do so, we will replace the vectors <tt>delta3</tt> and <tt>delta2</tt> with matrices, where one column of each matrix corresponds to each training example.  We will also implement a function <tt>fprime(z)</tt> that takes as input a matrix <tt>z</tt>, and applies <math>f'(\cdot)</math> element-wise.  Each of the four lines of Matlab in the <tt>for</tt> loop above can then be vectorized and replaced with a single line of Matlab code (without a surrounding <tt>for</tt> loop).
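For the sigmoid activation, an element-wise <tt>fprime</tt> might look like the sketch below. It assumes <tt>f</tt> is the sigmoid, whose derivative is <math>f'(z) = f(z)(1 - f(z))</math>; the anonymous-function form, the test matrix, and the finite-difference comparison are illustrative choices and not part of the original exercise code.

```octave
f = @(z) 1 ./ (1 + exp(-z));          % sigmoid, element-wise
fprime = @(z) f(z) .* (1 - f(z));     % sigmoid derivative, element-wise

z = randn(3, 5);                      % arbitrary matrix input (assumed size)
d = fprime(z);                        % same size as z

% Sanity check against a central finite difference.
epsilon = 1e-6;
d_num = (f(z + epsilon) - f(z - epsilon)) / (2 * epsilon);

disp(size(d));
disp(max(abs(d(:) - d_num(:))));
```

Because every operation in <tt>fprime</tt> is element-wise, the same function works unchanged whether <tt>z</tt> is a single column or a matrix with one column per training example.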
In the [[Exercise:Vectorization|Vectorization exercise]], we ask you to derive the vectorized version of this algorithm by yourself.  If you are able to do it from this description, we strongly encourage you to do so.  Here also are some [[Backpropagation vectorization hints]]; however, we encourage you to try to carry out the vectorization yourself without looking at the hints.
== Sparse autoencoder ==
The [[Autoencoders_and_Sparsity|sparse autoencoder]] neural network has an additional sparsity penalty that constrains neurons' average firing rate to be close to some target activation <math>\rho</math>.  When performing backpropagation on a single training example, we took the sparsity penalty into account by computing the following:
:<math>\begin{align}
\delta^{(2)}_i =
  \left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)
+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .
\end{align}</math>
In the ''unvectorized'' case, this was computed as:
% Sparsity penalty delta
sparsity_delta = - rho ./ rho_hat + (1 - rho) ./ (1 - rho_hat);
for i=1:m,
  delta2 = (W2'*delta3(:,i) + beta*sparsity_delta).* fprime(z2(:,i));
end;
The code above still had a <tt>for</tt> loop over the training set, and <tt>delta2</tt> was a column vector.
In contrast, recall that in the vectorized case, <tt>delta2</tt> is a matrix with <math>m</math> columns corresponding to the <math>m</math> training examples.  Notice that the <tt>sparsity_delta</tt> term is the same regardless of which training example we are processing.  Thus, to vectorize the computation above, we can add the same <tt>beta*sparsity_delta</tt> column (e.g., replicated with <tt>repmat</tt>) to every column of <tt>W2'*delta3</tt> when constructing the <tt>delta2</tt> matrix, before the element-wise multiplication by <tt>fprime(z2)</tt>.
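One possible way to carry this out is sketched below, with the result compared against the per-example loop. Several things here are assumptions for illustration only: the sizes, the values of <tt>rho</tt> and <tt>beta</tt>, the random stand-in for <tt>delta3</tt>, the sigmoid activation, the name <tt>delta2_loop</tt>, and computing <tt>rho_hat</tt> as the mean hidden activation over the training set.

```octave
% Hypothetical sizes and hyperparameters, chosen only for illustration.
s2 = 3; s3 = 4; m = 5;
rho = 0.05; beta = 3;
f = @(z) 1 ./ (1 + exp(-z));          % sigmoid (assumed activation)
fprime = @(z) f(z) .* (1 - f(z));

W2 = randn(s3, s2);
z2 = randn(s2, m);
a2 = f(z2);
delta3 = randn(s3, m);                % stand-in values, not from a real backward pass

% Assumed: rho_hat is the average activation of each hidden unit over all examples.
rho_hat = mean(a2, 2);                % s2-by-1
sparsity_delta = -rho ./ rho_hat + (1 - rho) ./ (1 - rho_hat);

% Reference: one column of delta2 per loop iteration.
delta2_loop = zeros(s2, m);
for i = 1:m
  delta2_loop(:, i) = (W2' * delta3(:, i) + beta * sparsity_delta) .* fprime(z2(:, i));
end

% Vectorized: the same penalty column is added to every column before multiplying.
delta2 = (W2' * delta3 + repmat(beta * sparsity_delta, 1, m)) .* fprime(z2);

disp(max(abs(delta2(:) - delta2_loop(:))));
```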

