Softmax回归

Softmax回归(Softmax Regression)

'''初译''':@knighterzjy

'''一审''':@GuitarFang

== Introduction介绍 ==

'''原文''':

In these notes, we describe the '''Softmax regression''' model.  This model generalizes logistic regression to
classification problems where the class label <math>y</math> can take on more than two possible values.
This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different
numerical digits.  Softmax regression is a supervised learning algorithm, but we will later be
using it in conjunction with our deep learning/unsupervised feature learning methods.


'''译文''':

在本节中，我们介绍Softmax回归模型，该模型是logistic回归模型在多分类问题上的泛化，在多分类问题中，类标签y可以取两个以上的值。 Softmax回归模型可以直接应用于 MNIST 手写数字分类问题等多分类问题。Softmax回归是有监督的，不过我们接下来也会介绍它与深度学习/无监督学习方法的结合。
（译者注： MNIST 是一个手写数字识别库，由 NYU 的Yann LeCun 等人维护。 http://yann.lecun.com/exdb/mnist/ ）

'''一审''':

在本章中，我们介绍Softmax回归模型。该模型将logistic回归模型一般化，以用来解决类型标签y的可能取值多于两种的分类问题。Softmax回归模型对于诸如MNIST手写数字分类等问题是十分有用的，该问题的目的是辨识10个不同的单个数字。Softmax回归是一种有监督学习算法，但是我们接下来要将它与我们的深度学习/无监督特征学习方法结合起来使用。
（译者注：MNIST是一个手写数字识别库，由NYU的Yann LeCun等人维护。http://yann.lecun.com/exdb/mnist/）

'''原文''':
Recall that in logistic regression, we had a training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.  
(In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be
<math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.) 
With logistic regression, we were in the binary classification setting, so the labels 
were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form:

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>


'''译文''':
回顾一下 logistic 回归，我们的训练集为<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
，其中 m为样本数，<math>x^{(i)} \in \Re^{n+1}</math>为特征。
由于 logistic 回归是针对二分类问题的，因此类标 <math>y^{(i)} \in \{0,1\}</math>。假设函数如下：
<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>

'''一审''':
回想一下在 logistic 回归中，我们拥有一个包含 <math>m</math> 个被标记的样本的训练集 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，其中输入特征值 <math>x^{(i)} \in \Re^{n+1}</math>。（在本章中，我们对出现的符号进行如下约定：特征向量 <math>x</math> 的维度为 <math>n+1</math>，其中 <math>x_0 = 1</math> 对应截距项。）因为在 logistic 回归中，我们要解决的是二元分类问题，因此类型标记 <math>y^{(i)} \in \{0,1\}</math>。估值函数如下：
<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>
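
The logistic hypothesis above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original notes: the function name `logistic_hypothesis` and the sample values are hypothetical, and vectors are assumed to be plain Python lists with `x[0] = 1` as the intercept feature.

```python
import math


def logistic_hypothesis(theta, x):
    """Return h_theta(x) = 1 / (1 + exp(-theta^T x)).

    theta and x are equal-length sequences of floats. Following the
    convention in the text, x[0] is assumed to be the intercept feature 1.
    """
    z = sum(t * v for t, v in zip(theta, x))  # inner product theta^T x
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values, chosen so that theta^T x = 0.
theta = [0.0, 1.0, -1.0]
x = [1.0, 2.0, 2.0]
p = logistic_hypothesis(theta, x)  # probability assigned to y = 1
```

The direct formula is kept for readability; for very negative values of `theta^T x`, `math.exp` can overflow, so a production implementation would likely guard against that.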


'''原文''':

and the model parameters <math>\theta</math> were trained to minimize
the cost function
<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>


'''译文''':
模型参数 <math>\theta</math> 通过训练来最小化如下损失函数：
<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>


'''一审''':
我们将训练模型参数 θ ，使其能够最小化 代价函数 ：

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>
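
As a sanity check on the cost function above, here is a small Python sketch that evaluates <math>J(\theta)</math> on a toy training set. The names `sigmoid` and `logistic_cost` and the data are assumptions made for illustration; labels are taken to be 0 or 1, as in the text.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_cost(theta, X, y):
    """Evaluate J(theta) for examples X (list of feature lists) and labels y.

    Each row of X is assumed to start with the intercept feature 1, and
    each label in y is assumed to be 0 or 1.
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m


# Hypothetical toy data: two examples, one per class.
X = [[1.0, 0.5], [1.0, -1.5]]
y = [1, 0]
cost_at_zero = logistic_cost([0.0, 0.0], X, y)
```

With all parameters set to zero, every example gets probability 0.5, so the cost reduces to <math>\log 2</math> regardless of the data. This makes a convenient first check for an implementation.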

'''原文''':
In the softmax regression setting, we are interested in multi-class
classification (as opposed to only binary classification), and so the label
<math>y</math> can take on <math>k</math> different values, rather than only
two.  Thus, in our training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>,
we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>.  (Note that
our convention will be to index the classes starting from 1, rather than from 0.)  For example,
in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

'''译文''':
在 softmax回归中，我们解决的是多分类问题（相对于 logistic 回归解决的二分类问题），类标 y 可以取 k个不同的值（而不是 2 个）。因此，对于训练集 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，我们有 <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>。（注意此处的类别下标从 1 开始，而不是 0）。例如，在 MNIST 数字识别任务中，我们有 k=10 个不同的类别。

'''一审''':
在 softmax 回归中，我们感兴趣的是多元分类（相对于只能辨识两种类型的二元分类），所以类型标记 <math>y</math> 可以取 <math>k</math> 个不同的值（而不只限于 2 个）。于是，对于我们的训练集 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，便有 <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>。（注意，我们约定类别下标从 1 开始，而不是 0。）例如，在 MNIST 数字识别任务中，我们有 <math>k=10</math> 个不同的类别。


'''原文''':
Given a test input <math>x</math>, we want our hypothesis to estimate
the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.
I.e., we want to estimate the probability of the class label taking
on each of the <math>k</math> different possible values.  Thus, our hypothesis
will output a <math>k</math> dimensional vector (whose elements sum to 1) giving
us our <math>k</math> estimated probabilities.  Concretely, our hypothesis
<math>h_{\theta}(x)</math> takes the form:

<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

'''译文''':
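给定一个测试样本 <math>x</math>，我们想让假设函数估计该样本属于每一个类别 <math>j</math> 的概率 <math>p(y=j | x)</math>，其中 <math>j = 1, \ldots, k</math>。也就是说，我们想要估计类标取 <math>k</math> 个不同可能值中每一个值的概率。因此，我们的假设函数会输出一个 <math>k</math> 维的向量（向量元素的和为 1）来表示这 <math>k</math> 个估计的概率值。具体地说，我们的假设函数 <math>h_{\theta}(x)</math> 形式如下：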
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

'''一审''':
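对于给定的测试输入 <math>x</math>，我们想让估值函数针对每一个类别 <math>j</math>（<math>j = 1, \ldots, k</math>）估算出概率值 <math>p(y=j | x)</math>。也就是说，我们想估计出类型标记取 <math>k</math> 个可能值中每一个值的概率（一审注：而不是估算出具体是取哪一个值，这一点和基本神经网络估值函数输出最终值是有区别的）。因此，我们的估值函数将输出一个 <math>k</math> 维的向量（向量元素的和为 1）来表示这 <math>k</math> 个被估计出的概率值。具体地说，我们的估值函数 <math>h_{\theta}(x)</math> 形式如下：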
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
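
The softmax hypothesis above can be sketched in Python as follows. This is an illustrative sketch, not the notes' own implementation: the function name `softmax_hypothesis` and the sample parameters are hypothetical. Subtracting the largest score before exponentiating is an added numerical-stability step that is not in the text; it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import math


def softmax_hypothesis(thetas, x):
    """Return the k class probabilities [p(y=1|x), ..., p(y=k|x)].

    thetas is a list of k parameter vectors theta_1..theta_k, each with the
    same length as x. x[0] is assumed to be the intercept feature 1.
    """
    scores = [sum(t * v for t, v in zip(theta_j, x)) for theta_j in thetas]
    shift = max(scores)  # stability step (an addition, not from the text)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)    # normalizer: sum over j of exp(theta_j^T x)
    return [e / total for e in exps]


# Hypothetical parameters with k = 3 classes and all scores equal.
probs = softmax_hypothesis([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [1.0, 2.0])

# Hypothetical parameters with k = 2 classes and unequal scores (2 and -2).
probs2 = softmax_hypothesis([[0.0, 1.0], [0.0, -1.0]], [1.0, 2.0])
# Classes are indexed from 1 in the text, so add 1 to the list index.
predicted = max(range(len(probs2)), key=probs2.__getitem__) + 1
```

Because Python lists are zero-based while the text indexes classes from 1, the probability of class <math>j</math> sits at position `j - 1` of the returned list.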

'''原文''':
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the
parameters of our model.  
Notice that
the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math>
normalizes the distribution, so that it sums to one. 

'''译文''':
其中 <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> 均为模型参数，<math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math> 这一项是模型的归一化因子，使得向量元素的和为 1。

'''一审''':
其中  <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math>是我们模型的参数。请注意<math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math>，这一项对概率分布进行归一化，使得所有概率之和为 1 。
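
The normalization claim can be verified in one line by summing the <math>k</math> entries of <math>h_\theta(x^{(i)})</math>. The summation index <math>l</math> is introduced here only to keep it distinct from the index <math>j</math> in the normalizer:

<math>
\begin{align}
\sum_{l=1}^{k} p(y^{(i)} = l | x^{(i)}; \theta)
= \sum_{l=1}^{k} \frac{ e^{ \theta_l^T x^{(i)} } }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
= \frac{ \sum_{l=1}^{k} e^{ \theta_l^T x^{(i)} } }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
= 1
\end{align}
</math>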


'''原文''':
For convenience, we will also write 
<math>\theta</math> to denote all the
parameters of our model.  When you implement softmax regression, it is usually
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>

'''译文''':
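为了简便，我们使用 <math>\theta</math> 来表示全部模型参数。在实现 softmax 回归的时候，往往使用一个 <math>k \times (n+1)</math> 的矩阵来表示 <math>\theta</math>。我们将 <math>\theta_1, \theta_2, \ldots, \theta_k</math> 按行堆叠，得到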
<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>

'''一审''':
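为了方便起见，我们同样使用符号 <math>\theta</math> 来表示全部的模型参数。在实现 softmax 回归时，你通常会发现，将 <math>\theta</math> 用一个 <math>k \times (n+1)</math> 的矩阵来表示会十分便利，该矩阵是将 <math>\theta_1, \theta_2, \ldots, \theta_k</math> 按行罗列起来得到的，如下所示：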
<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
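
One reason the row-stacked layout is convenient: a single matrix-vector product <math>\theta x</math> yields all <math>k</math> scores <math>\theta_j^T x</math> at once. The Python sketch below illustrates this with nested lists. The helper names `stack_parameters` and `class_scores` and the numbers are hypothetical, and a real implementation would more likely use a matrix library.

```python
def stack_parameters(theta_vectors):
    """Stack theta_1..theta_k as rows of a k-by-(n+1) matrix (list of rows)."""
    width = len(theta_vectors[0])
    if any(len(row) != width for row in theta_vectors):
        raise ValueError("all parameter vectors must have length n + 1")
    return [list(row) for row in theta_vectors]


def class_scores(theta, x):
    """Return the matrix-vector product theta x, one score per class."""
    return [sum(t * v for t, v in zip(row, x)) for row in theta]


# Hypothetical parameters: k = 2 classes, n = 2 features plus intercept.
theta = stack_parameters([[0.1, 0.2, 0.3], [0.0, -1.0, 1.0]])
scores = class_scores(theta, [1.0, 2.0, 3.0])  # x[0] = 1 is the intercept
```

Row `j - 1` of `theta` holds <math>\theta_j^T</math>, so `scores[j - 1]` is the exponent that appears in the <math>j</math>-th entry of the softmax hypothesis.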

== 2 ==