Softmax回归

Softmax回归(Softmax Regression)

'''初译''':@knighterzjy

'''一审''':@GuitarFang

== 介绍 Introduction ==

'''原文''':

In these notes, we describe the '''Softmax regression''' model.  This model generalizes logistic regression to
classification problems where the class label <math>y</math> can take on more than two possible values.
This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different
numerical digits.  Softmax regression is a supervised learning algorithm, but we will later be
using it in conjunction with our deep learning/unsupervised feature learning methods.


'''译文''':

在本节中，我们介绍Softmax回归模型。该模型是logistic回归模型在多分类问题上的推广，在多分类问题中，类标签 <math>y</math> 可以取两个以上的值。Softmax回归模型对 MNIST 手写数字分类等问题很有用，该问题的目标是区分 10 个不同的数字。Softmax回归是一种有监督学习算法，不过我们之后会将它与深度学习/无监督特征学习方法结合使用。
（译者注： MNIST 是一个手写数字识别库，由 NYU 的Yann LeCun 等人维护。 http://yann.lecun.com/exdb/mnist/ ）

'''一审''':

在本章中，我们介绍Softmax回归模型。该模型将logistic回归模型一般化，以用来解决类型标签y的可能取值多于两种的分类问题。Softmax回归模型对于诸如MNIST手写数字分类等问题是十分有用的，该问题的目的是辨识10个不同的单个数字。Softmax回归是一种有监督学习算法，但是我们接下来要将它与我们的深度学习/无监督特征学习方法结合起来使用。
（译者注：MNIST是一个手写数字识别库，由NYU的Yann LeCun等人维护。http://yann.lecun.com/exdb/mnist/）

'''原文''':
Recall that in logistic regression, we had a training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.  
(In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be
<math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.) 
With logistic regression, we were in the binary classification setting, so the labels 
were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form:

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>


'''译文''':
回顾一下 logistic 回归，我们的训练集为 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，其中 <math>m</math> 为已标记样本的个数，输入特征 <math>x^{(i)} \in \Re^{n+1}</math>。（在本节中，我们约定特征向量 <math>x</math> 为 <math>n+1</math> 维，其中 <math>x_0 = 1</math> 对应截距项。）由于 logistic 回归针对的是二分类问题，因此类标 <math>y^{(i)} \in \{0,1\}</math>。假设函数如下：

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>

'''一审''':
回想一下，在 logistic 回归中，我们拥有一个包含 <math>m</math> 个已标记样本的训练集 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，其中输入特征 <math>x^{(i)} \in \Re^{n+1}</math>。（在本章中，我们对符号作如下约定：特征向量 <math>x</math> 的维度为 <math>n+1</math>，其中 <math>x_0 = 1</math> 对应截距项。）在 logistic 回归中，我们要解决的是二元分类问题，因此类型标记 <math>y^{(i)} \in \{0,1\}</math>。估值函数如下：

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>
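
The logistic hypothesis above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `logistic_hypothesis` and the example numbers are hypothetical, and no guard against overflow for very negative <math>\theta^T x</math> is included.

```python
import math

def logistic_hypothesis(theta, x):
    """Return h_theta(x) = 1 / (1 + exp(-theta^T x)).

    theta and x are equal-length lists of floats; by the convention in the
    text, x[0] is assumed to be 1 so that theta[0] acts as the intercept.
    """
    z = sum(t * xi for t, xi in zip(theta, x))  # inner product theta^T x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example with n = 2 features plus the intercept entry x_0 = 1.
theta = [0.0, 1.0, -1.0]
x = [1.0, 2.0, 2.0]
p = logistic_hypothesis(theta, x)  # theta^T x = 0*1 + 1*2 - 1*2 = 0, so p = 0.5
```

The returned value is read as the estimated probability that <math>y = 1</math>; the probability of <math>y = 0</math> is one minus that value.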


'''原文''':

and the model parameters <math>\theta</math> were trained to minimize
the cost function
<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>


'''译文''':
我们训练模型参数 <math>\theta</math>，使其最小化如下损失函数：
<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>


'''一审''':
我们将训练模型参数 <math>\theta</math> ，使其能够最小化代价函数 ：

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>

'''原文''':
In the softmax regression setting, we are interested in multi-class
classification (as opposed to only binary classification), and so the label
<math>y</math> can take on <math>k</math> different values, rather than only
two.  Thus, in our training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>,
we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>.  (Note that
our convention will be to index the classes starting from 1, rather than from 0.)  For example,
in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

'''译文''':
在 softmax回归中，我们解决的是多分类问题（相对于 logistic 回归解决的二分类问题），类标 y 可以取 k个不同的值（而不是 2 个）。因此，对于训练集 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，我们有 <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>。（注意此处的类别下标从 1 开始，而不是 0）。例如，在 MNIST 数字识别任务中，我们有 k=10 个不同的类别。

'''一审''':
在 softmax回归中，我们感兴趣的是多元分类（相对于只能辨识两种类型的二元分类）， 所以类型标记y可以取k个不同的值（而不只限于2个）。 于是，对于我们的 训练集<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> 便有<math>y^{(i)} \in \{1, 2, \ldots, k\}</math>。（注意，我们约定类别下标从 1 开始，而不是 0）。例如，在 MNIST 数字识别任务中，我们有 k=10 个不同的类别。


'''原文''':
Given a test input <math>x</math>, we want our hypothesis to estimate
the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.
I.e., we want to estimate the probability of the class label taking
on each of the <math>k</math> different possible values.  Thus, our hypothesis
will output a <math>k</math> dimensional vector (whose elements sum to 1) giving
us our <math>k</math> estimated probabilities.  Concretely, our hypothesis
<math>h_{\theta}(x)</math> takes the form:

<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

'''译文''':
给定一个测试样本 <math>x</math>，我们想让假设函数对每一个类别 <math>j = 1, \ldots, k</math> 估计概率 <math>p(y=j | x)</math>。也就是说，我们想要估计类标取 k 个不同可能值中每一个的概率。因此，我们的假设函数会输出一个 k 维的向量（向量元素的和为1）来表示这 k 个估计出的概率值。具体地说，我们的假设函数<math>h_{\theta}(x)</math> 形式如下：

<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

'''一审''':
对于给定的测试输入 <math>x</math>，我们想让估值函数针对每一个类别 <math>j = 1, \ldots, k</math> 估算出概率值<math>p(y=j | x)</math> 。也就是说，我们想估计出分类结果在每一个分类标记值上出现的概率 (一审注：而不是估算出具体是取哪一个值，这一点和基本神经网络估值函数输出最终值是有区别的) 。因此，我们的估值函数将要输出一个 k 维的向量（向量元素的和为1）来表示这 k 个被估计出的概率值。具体地说，我们的估值函数<math>h_{\theta}(x)</math> 形式如下：

<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
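
As a rough sketch of the hypothesis above, the following Python computes the <math>k</math> class probabilities from the per-class parameter vectors. The name `softmax_hypothesis` and the example values are hypothetical. One step is an addition not described in this section: the largest score is subtracted before exponentiating to avoid overflow. The common factor cancels between numerator and denominator, so the output is unchanged.

```python
import math

def softmax_hypothesis(thetas, x):
    """Return [p(y=1|x), ..., p(y=k|x)].

    thetas is a list of k parameter vectors; thetas[j - 1] plays the role of
    theta_j because Python lists are 0-based while the text counts from 1.
    """
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in thetas]
    # Added safeguard (an assumption of this sketch, not part of the text):
    # shifting every score by the same constant leaves the ratios unchanged.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)  # normalizing constant
    return [e / total for e in exps]

# Hypothetical example: k = 3 classes, one feature plus the intercept x_0 = 1.
thetas = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
probs = softmax_hypothesis(thetas, [1.0, 1.0])  # scores are 1, 2, 3
```

The entries of `probs` are positive and sum to 1, and the class with the largest score <math>\theta_j^T x</math> receives the largest probability.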

'''原文''':
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the
parameters of our model.  
Notice that
the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math>
normalizes the distribution, so that it sums to one. 

'''译文''':
其中 <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> 均为模型参数。请注意，<math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math> 这一项是归一化因子，它对概率分布进行归一化，使得所有概率之和为 1 。

'''一审''':
其中  <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math>是我们模型的参数。请注意<math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math>，这一项对概率分布进行归一化，使得所有概率之和为 1 。


'''原文''':
For convenience, we will also write 
<math>\theta</math> to denote all the
parameters of our model.  When you implement softmax regression, it is usually
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>

'''译文''':
为了简便，我们也用 <math>\theta</math> 来表示全部的模型参数。在实现Softmax回归的时候，通常用一个 <math>k \times (n+1)</math> 的矩阵来表示 <math>\theta</math> 比较方便。该矩阵由 <math>\theta_1, \theta_2, \ldots, \theta_k</math> 按行堆叠得到：
<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>

'''一审''':
为了方便起见，我们同样使用符号 <math>\theta</math> 来表示全部的模型参数。在实现Softmax回归时，你通常会发现，将 <math>\theta</math> 用一个 <math>k \times (n+1)</math> 的矩阵来表示会十分便利，该矩阵是将 <math>\theta_1, \theta_2, \ldots, \theta_k</math> 按行罗列起来得到的，如下所示：
<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
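
One way to picture the row-stacked layout is the sketch below, which stores <math>\theta</math> as a list of <math>k</math> rows of length <math>n+1</math> and obtains all <math>k</math> scores <math>\theta_j^T x</math> with a single matrix-vector product. The helper names `stack_parameters` and `class_scores` are hypothetical, and plain Python lists stand in for whatever matrix type an actual implementation would use.

```python
def stack_parameters(theta_vectors):
    """Stack theta_1, ..., theta_k as the rows of a k-by-(n+1) list of lists."""
    width = len(theta_vectors[0])
    if any(len(row) != width for row in theta_vectors):
        raise ValueError("every theta_j must have the same length n + 1")
    return [list(row) for row in theta_vectors]

def class_scores(theta, x):
    """Matrix-vector product theta x; entry j - 1 equals theta_j^T x."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in theta]

# Hypothetical example: k = 2 classes, n = 2 features, so theta is 2-by-3.
theta = stack_parameters([[0.5, 1.0, 0.0], [0.0, -1.0, 2.0]])
x = [1.0, 3.0, 4.0]              # x_0 = 1 is the intercept entry
scores = class_scores(theta, x)  # [3.5, 5.0]
```

With this layout, row `theta[j - 1]` is <math>\theta_j^T</math>, so the number of rows is the number of classes and the number of columns is the feature dimension including the intercept.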

== 代价函数 Cost Function ==

'''原文''':

We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is
the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.
For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>

'''译文''':
在本节中，我们定义 softmax回归的损失函数。在下面的公式中，<math>1\{\cdot\}</math>是一个标识函数，1{值为真的表达式}=1，1{值为假的表达式}=0。例如，表达式 1{2+2=4}的值为1 ，1{1+1=5}的值为 0。我们的损失函数为：
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>

'''一审''':
现在我们来介绍用于softmax回归算法的代价函数。在下面的公式中，<math>1\{\cdot\}</math>是示性函数，其取值规则为：1{值为真的表达式}=1，1{值为假的表达式}=0。举例来说，表达式1{2+2=4}的值为1 ，1{1+1=5}的值为 0。我们的代价函数为：
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
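
The double sum in the cost function can be written out directly, as in the sketch below. It assumes the row-stacked <math>k \times (n+1)</math> representation of <math>\theta</math> and labels in <math>\{1, \ldots, k\}</math>; the function names are hypothetical, and the max-shift inside `softmax_probs` is an added numerical safeguard rather than something this section prescribes.

```python
import math

def softmax_probs(theta, x):
    """Class probabilities for one input; theta is a k-by-(n+1) list of rows."""
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # added safeguard; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cost(theta, xs, ys):
    """J(theta) = -(1/m) * sum_i sum_j 1{y_i = j} * log p(y_i = j | x_i; theta).

    xs is a list of m feature vectors, ys a list of m labels in {1, ..., k}.
    """
    m = len(xs)
    k = len(theta)
    total = 0.0
    for x, y in zip(xs, ys):
        probs = softmax_probs(theta, x)
        for j in range(1, k + 1):
            indicator = 1.0 if y == j else 0.0  # the indicator 1{y_i = j}
            total += indicator * math.log(probs[j - 1])
    return -total / m

# Hypothetical check: with all parameters zero, every class has probability
# 1/k, so the cost equals log(k) regardless of the data.
theta0 = [[0.0, 0.0] for _ in range(3)]
cost0 = softmax_cost(theta0, [[1.0, 2.0], [1.0, -1.0]], [1, 3])
```

Because the indicator is nonzero for exactly one <math>j</math> per example, the inner sum simply picks out the log-probability assigned to the true class.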


'''原文''':

Notice that this generalizes the logistic regression cost function, which could also have been written:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''译文''':

值得注意的是，上述公式是logistic回归损失函数的一个泛化版。 logistic回归损失函数可以改写如下：
 
<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''一审''':
值得注意的是，上述公式是logistic回归代价函数的一个泛化版。 logistic回归代价函数 可以改写如下：

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
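
The equality of the two forms of the logistic cost can be checked numerically. The sketch below evaluates both expressions on a small made-up data set; all names and numbers are hypothetical, and <math>p(y=1|x;\theta) = h_\theta(x)</math>, <math>p(y=0|x;\theta) = 1 - h_\theta(x)</math> as in logistic regression.

```python
import math

def h(theta, x):
    """Logistic hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def logistic_cost(theta, xs, ys):
    """First form: -(1/m) sum_i [(1 - y) log(1 - h) + y log h]."""
    m = len(xs)
    total = sum((1 - y) * math.log(1.0 - h(theta, x)) + y * math.log(h(theta, x))
                for x, y in zip(xs, ys))
    return -total / m

def logistic_cost_indicator(theta, xs, ys):
    """Second form: -(1/m) sum_i sum_{j in {0,1}} 1{y = j} log p(y = j | x)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        p = {0: 1.0 - h(theta, x), 1: h(theta, x)}  # p(y = j | x; theta)
        for j in (0, 1):
            total += (1.0 if y == j else 0.0) * math.log(p[j])
    return -total / m

# Hypothetical data: three examples with labels in {0, 1}.
theta = [0.1, -0.3]
xs = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
ys = [0, 1, 1]
difference = logistic_cost(theta, xs, ys) - logistic_cost_indicator(theta, xs, ys)
```

For labels restricted to 0 and 1, the factors <math>y</math> and <math>1-y</math> are exactly the indicators <math>1\{y=1\}</math> and <math>1\{y=0\}</math>, which is why `difference` is zero up to rounding.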

'''原文''':

The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values
of the class label.  Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
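</math>.

'''译文''':

可以看到，Softmax损失函数与logistic损失函数在形式上非常类似，只是Softmax损失函数对类标的 k 个可能取值进行了累加。另外，在Softmax回归中有
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>。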

'''一审''':

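Softmax回归的代价函数与上式十分相似，只是我们现在对类别标记的 k 个可能取值求和。还要注意，在Softmax回归中有：
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>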

'''原文''':

There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative
optimization algorithm such as gradient descent or L-BFGS.  Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>

'''译文''':
目前还没有已知的闭式解法可以直接求出 <math>J(\theta)</math> 的最小值，因此我们照例使用迭代的优化算法（例如梯度下降法或 L-BFGS）。经过求导，可以得到梯度公式如下：

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>

'''一审''':
目前还没有已知的闭式方法能够求出 <math>J(\theta)</math> 的最小值，因此我们和往常一样，借助迭代的优化算法（例如梯度下降法或 L-BFGS）来求解。经过求导，可以证明梯度公式如下：

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>
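
The gradient formula translates almost term by term into code. The sketch below returns one gradient row per class, in the same <math>k \times (n+1)</math> layout as <math>\theta</math>. The function names are hypothetical, and the max-shift in `softmax_probs` is an added safeguard that does not change the probabilities.

```python
import math

def softmax_probs(theta, x):
    """Class probabilities for one input; theta is a k-by-(n+1) list of rows."""
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # added safeguard; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_gradient(theta, xs, ys):
    """Return a k-by-(n+1) list of lists; row j - 1 is grad_{theta_j} J(theta).

    Implements -(1/m) * sum_i x_i * (1{y_i = j} - p(y_i = j | x_i; theta)).
    Labels in ys are 1-based, matching the text.
    """
    m = len(xs)
    k = len(theta)
    width = len(theta[0])
    grad = [[0.0] * width for _ in range(k)]
    for x, y in zip(xs, ys):
        probs = softmax_probs(theta, x)
        for j in range(1, k + 1):
            coeff = (1.0 if y == j else 0.0) - probs[j - 1]
            for l in range(width):
                grad[j - 1][l] += x[l] * coeff
    return [[-g / m for g in row] for row in grad]

# Hypothetical example: one training example of class 1, all parameters zero.
grad = softmax_gradient([[0.0, 0.0] for _ in range(3)], [[1.0, 2.0]], [1])
```

Since the indicators and the probabilities each sum to 1 over <math>j</math>, the gradient rows always sum to the zero vector, which is a convenient sanity check.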


'''原文''':

Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation.  In particular, <math>\nabla_{\theta_j} J(\theta)</math>
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.


'''译文''':
让我们来回顾一下 "<math>\nabla_{\theta_j}</math>" 的含义， <math>\nabla_{\theta_j} J(\theta)</math>是一个向量，因此，它的第 l个元素<math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>是<math>J(\theta)</math>对<math>\theta_j</math>的第l个元素求偏导后的值。

'''一审''':
让我们来回顾一下 符号 "<math>\nabla_{\theta_j}</math>" 的含义。特别地， <math>\nabla_{\theta_j} J(\theta)</math>本身是一个向量，因此它的第 l个元素<math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>是<math>J(\theta)</math>对<math>\theta_j</math>的第l个分量的偏导数。
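
To make the element-wise reading of <math>\nabla_{\theta_j} J(\theta)</math> concrete, the sketch below compares a single entry <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> from the gradient formula with a central finite difference of the cost. The finite-difference check is an added illustration, not something this section describes; all names and data are hypothetical. Here <math>j</math> is 1-based as in the text, while <math>l</math> is 0-based so that <math>l = 0</math> refers to the intercept entry.

```python
import math

def softmax_probs(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # added safeguard; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cost(theta, xs, ys):
    """J(theta): average negative log-probability of the true class."""
    return -sum(math.log(softmax_probs(theta, x)[y - 1])
                for x, y in zip(xs, ys)) / len(xs)

def analytic_partial(theta, xs, ys, j, l):
    """Entry l of grad_{theta_j} J(theta), taken from the gradient formula."""
    total = 0.0
    for x, y in zip(xs, ys):
        p_j = softmax_probs(theta, x)[j - 1]
        total += x[l] * ((1.0 if y == j else 0.0) - p_j)
    return -total / len(xs)

def numeric_partial(theta, xs, ys, j, l, eps=1e-6):
    """Central difference of J with respect to the single parameter theta_{jl}."""
    plus = [row[:] for row in theta]
    minus = [row[:] for row in theta]
    plus[j - 1][l] += eps
    minus[j - 1][l] -= eps
    return (softmax_cost(plus, xs, ys) - softmax_cost(minus, xs, ys)) / (2 * eps)

# Hypothetical data: k = 3 classes, one feature plus the intercept.
theta = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.05]]
xs = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
ys = [1, 3, 2]
gap = abs(analytic_partial(theta, xs, ys, 2, 1) - numeric_partial(theta, xs, ys, 2, 1))
```

A small `gap` for every pair <math>(j, l)</math> suggests the formula and the cost agree; a large one usually points to an indexing or sign mistake in an implementation.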

'''原文''':

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it
minimize <math>J(\theta)</math>.  For example, with the standard implementation of gradient descent, on each iteration
we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).

When implementing softmax regression, we will typically use a modified version of the cost function described above;
specifically, one that incorporates weight decay.  We describe the motivation and details below.

'''译文''':
有了上面的偏导公式以后，我们可以将它代入梯度下降法等算法中来最小化 <math>J(\theta)</math>。例如，使用标准的梯度下降法，在每一次迭代过程中，我们对每个 <math>j=1,\ldots,k</math> 进行更新： <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math>。

在实际的 softmax 实现过程中，我们通常使用上述损失函数的一个改进版本（加入了权重衰减项），其动机和细节会在下面详细讲到。

'''一审''':
有了上面的偏导数公式以后，我们就可以将它代入到梯度下降法等算法中，来使<math>J(\theta)</math>最小化。例如，在梯度下降法标准实现的每一次迭代中，我们需要进行如下更新：<math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math>（对于每一个 <math>j=1,\ldots,k</math>）。

当实现 softmax 回归算法时，我们通常会使用上述代价函数的一个改进版本。具体来说，就是加入权重衰减。我们接下来会描述使用它的动机和细节。
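
The update rule <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> can be sketched as a plain batch gradient descent loop. This is a minimal illustration of the unmodified cost only, without the weight decay term mentioned above. The step size `alpha`, the iteration count, the toy data set, and all function names are hypothetical choices made for this example.

```python
import math

def softmax_probs(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # added safeguard; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cost(theta, xs, ys):
    return -sum(math.log(softmax_probs(theta, x)[y - 1])
                for x, y in zip(xs, ys)) / len(xs)

def softmax_gradient(theta, xs, ys):
    m, k, width = len(xs), len(theta), len(theta[0])
    grad = [[0.0] * width for _ in range(k)]
    for x, y in zip(xs, ys):
        probs = softmax_probs(theta, x)
        for j in range(1, k + 1):
            coeff = (1.0 if y == j else 0.0) - probs[j - 1]
            for l in range(width):
                grad[j - 1][l] -= x[l] * coeff / m
    return grad

def gradient_descent(theta, xs, ys, alpha=0.5, iterations=200):
    """Repeat theta_j := theta_j - alpha * grad_{theta_j} J for every class j."""
    for _ in range(iterations):
        grad = softmax_gradient(theta, xs, ys)
        theta = [[t - alpha * g for t, g in zip(row, grad_row)]
                 for row, grad_row in zip(theta, grad)]
    return theta

# Hypothetical toy problem: k = 3 classes separated along a single feature.
xs = [[1.0, -2.0], [1.0, -1.5], [1.0, 0.0], [1.0, 0.2], [1.0, 2.0], [1.0, 1.8]]
ys = [1, 1, 2, 2, 3, 3]
theta_start = [[0.0, 0.0] for _ in range(3)]
theta_fit = gradient_descent(theta_start, xs, ys)
cost_before = softmax_cost(theta_start, xs, ys)
cost_after = softmax_cost(theta_fit, xs, ys)
```

All <math>k</math> rows are updated from the same gradient evaluation, so the step corresponds to one simultaneous update of the full parameter matrix. On this toy problem the cost after fitting is lower than the starting value of <math>\log 3</math>.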