Softmax Regression
From Ufldl
== Cost Function ==
+ | |||
+ | '''原文''': | ||
+ | |||
+ | We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is | ||
+ | the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>. | ||
+ | For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be: | ||
+ | |||
+ | <math> | ||
+ | \begin{align} | ||
+ | J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right] | ||
+ | \end{align} | ||
+ | </math> | ||
+ | |||
+ | '''译文''': | ||
+ | 在本节中,我们定义 softmax回归的损失函数。在下面的公式中,<math>1\{\cdot\}</math>是一个标识函数,1{值为真的表达式}=1,1{值为假的表达式}=0。例如,表达式 1{2+2=4}的值为1 ,1{1+1=5}的值为 0。我们的损失函数为: | ||
+ | <math> | ||
+ | \begin{align} | ||
+ | J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right] | ||
+ | \end{align} | ||
+ | </math> | ||
+ | |||
+ | '''一审''': | ||
+ | 现在我们来介绍用于softmax回归算法的代价函数。在下面的公式中,<math>1\{\cdot\}</math>是示性函数,其取值规则为:1{值为真的表达式}=1,1{值为假的表达式}=0。举例来说,表达式1{2+2=4}的值为1 ,1{1+1=5}的值为 0。我们的代价函数为: | ||
+ | <math> | ||
+ | \begin{align} | ||
+ | J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right] | ||
+ | \end{align} | ||
+ | </math> | ||
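
The cost above maps almost line for line onto code. The following is a minimal NumPy sketch, added here for illustration and not part of the original tutorial; the names <code>theta</code>, <code>X</code>, <code>y</code>, the array shapes, and the 0-based class labels are assumptions. It evaluates <math>J(\theta)</math> by building the indicator matrix <math>1\{y^{(i)} = j\}</math> explicitly:

<pre>
import numpy as np

def softmax_cost(theta, X, y, k):
    """J(theta) as defined above.

    theta : (k, n) array, row j holds theta_j.
    X     : (m, n) array of inputs x^{(i)}.
    y     : (m,) integer labels in {0, ..., k-1}.
    """
    m = X.shape[0]
    scores = X @ theta.T                          # entry [i, j] = theta_j^T x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)   # shift by a constant for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^{(i)} = j | x^{(i)}; theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0              # 1{y^{(i)} = j}
    return -np.mean(np.sum(indicator * np.log(probs), axis=1))
</pre>

Only the log-probability of each example's correct class contributes to the inner sum, which is exactly what the indicator function selects.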

'''Original''':

Notice that this generalizes the logistic regression cost function, which could also have been written:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''Translation''':

It is worth noting that the formula above is a generalization of the logistic regression loss function. The logistic regression loss function can be rewritten as follows:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''First review''':
It is worth noting that the formula above is a generalization of the logistic regression cost function. The logistic regression cost function can be rewritten as follows:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
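
As a quick sanity check of the claim above (a worked step added here, not part of the original passage): for <math>k = 2</math> the softmax probabilities collapse to the logistic sigmoid, since dividing numerator and denominator by <math>e^{\theta_1^T x^{(i)}}</math> gives

<math>
\begin{align}
p(y^{(i)} = 1 | x^{(i)} ; \theta) = \frac{e^{\theta_1^T x^{(i)}}}{e^{\theta_1^T x^{(i)}} + e^{\theta_2^T x^{(i)}}} = \frac{1}{1 + e^{-(\theta_1 - \theta_2)^T x^{(i)}}}
\end{align}
</math>

which is the logistic hypothesis with parameter vector <math>\theta_1 - \theta_2</math>. So, up to a relabeling of the classes, the two-class softmax cost is the logistic regression cost.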

'''Original''':

The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label. Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.

'''Translation''':

As can be seen, the softmax loss function is very similar in form to the logistic loss function, except that the softmax loss function sums over the <math>k</math> possible values of the class label. In addition, in softmax regression we have the probability given above.

'''First review''':

Except that we now sum over the <math>k</math> possible values of the class label, the softmax regression cost function is very similar to the one above. Note also that in softmax regression the probability is:
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>
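
In code, these class-conditional probabilities are the same quantity computed inside the cost function sketch above; a self-contained version (same assumed names and shapes) would be:

<pre>
import numpy as np

def softmax_probs(theta, X):
    """Return the (m, k) matrix with entries p(y^{(i)} = j | x^{(i)}; theta).

    theta : (k, n) array of parameter vectors theta_j; X : (m, n) inputs.
    Each row of the result sums to 1.
    """
    scores = X @ theta.T                          # theta_j^T x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)   # a constant shift leaves the ratios unchanged
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
</pre>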

'''Original''':

There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>

'''Translation''':
There is currently no closed-form method for solving for the minimum of <math>J(\theta)</math>, so we use an iterative optimization algorithm (such as gradient descent or L-BFGS) to solve for it. Taking derivatives, we obtain the following gradient formula:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>

'''First review''':
There is currently no closed-form method for solving for the minimum of <math>J(\theta)</math>, so we use an iterative optimization algorithm (such as gradient descent or L-BFGS) to solve for it. Taking derivatives, we obtain the following gradient formula:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>
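
The gradient formula likewise vectorizes directly. The sketch below (again an illustrative assumption about names, shapes, and 0-based labels, not code from the original text) returns a <code>(k, n)</code> array whose row <math>j</math> is <math>\nabla_{\theta_j} J(\theta)</math>:

<pre>
import numpy as np

def softmax_grad(theta, X, y, k):
    """Row j of the result is nabla_{theta_j} J(theta)."""
    m = X.shape[0]
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^{(i)} = j | x^{(i)}; theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0              # 1{y^{(i)} = j}
    # -(1/m) * sum_i x^{(i)} (1{y^{(i)} = j} - p(y^{(i)} = j | x^{(i)}; theta)), for all j at once
    return -(indicator - probs).T @ X / m
</pre>

Comparing this result against a numerical estimate of the derivatives of the cost is a simple way to check such an implementation.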

'''Original''':

Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>, the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

'''Translation''':
Let us recall the meaning of "<math>\nabla_{\theta_j}</math>": <math>\nabla_{\theta_j} J(\theta)</math> is a vector, so its <math>l</math>-th element <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> is the value of the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

'''First review''':
Let us recall the meaning of the notation "<math>\nabla_{\theta_j}</math>". In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so its <math>l</math>-th element <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> is the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th component of <math>\theta_j</math>.

'''Original''':

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>. For example, with the standard implementation of gradient descent, on each iteration we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).

When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.

'''Translation''':
With the partial derivative formula above, we can plug it into an algorithm to minimize <math>J(\theta)</math>. For example, using standard gradient descent, on each iteration we perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math>.
In an actual softmax implementation, we usually use a modified version of the loss function (one with weight decay added), which will be explained in detail below.

'''First review''':
With the partial derivative formula above, we can plug it into an algorithm such as gradient descent to minimize <math>J(\theta)</math>. For example, on each iteration of the standard implementation of gradient descent, we perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
When implementing the softmax regression algorithm, we will typically use a modified version of the cost function described above; specifically, one used together with weight decay. We describe the motivation and details below.
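
Putting the pieces together, the update rule from the last paragraph could be driven by a plain batch gradient descent loop such as the sketch below (illustrative only; the learning rate <code>alpha</code> and iteration count are arbitrary choices, and the weight-decay variant mentioned above is left to the next section):

<pre>
import numpy as np

def gradient_descent(theta, X, y, k, alpha=0.1, num_iters=500):
    """Repeatedly apply theta_j := theta_j - alpha * nabla_{theta_j} J(theta) for all j."""
    m = X.shape[0]
    for _ in range(num_iters):
        scores = X @ theta.T
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        indicator = np.zeros((m, k))
        indicator[np.arange(m), y] = 1.0
        grad = -(indicator - probs).T @ X / m     # one gradient row per class j
        theta = theta - alpha * grad              # simultaneous update of every theta_j
    return theta
</pre>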