Softmax Regression
From Ufldl
== Cost Function ==
+ | |||
+ | '''原文''': | ||
+ | |||
+ | We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is | ||
+ | the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>. | ||
+ | For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be: | ||
+ | |||
+ | <math> | ||
+ | \begin{align} | ||
+ | J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right] | ||
+ | \end{align} | ||
+ | </math> | ||
+ | |||
+ | '''译文''': | ||
+ | 在本节中,我们定义 softmax回归的损失函数。在下面的公式中,<math>1\{\cdot\}</math>是一个标识函数,1{值为真的表达式}=1,1{值为假的表达式}=0。例如,表达式 1{2+2=4}的值为1 ,1{1+1=5}的值为 0。我们的损失函数为: | ||
+ | <math> | ||
+ | \begin{align} | ||
+ | J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right] | ||
+ | \end{align} | ||
+ | </math> | ||
+ | |||
+ | '''一审''': | ||
+ | 现在我们来介绍用于softmax回归算法的代价函数。在下面的公式中,<math>1\{\cdot\}</math>是示性函数,其取值规则为:1{值为真的表达式}=1,1{值为假的表达式}=0。举例来说,表达式1{2+2=4}的值为1 ,1{1+1=5}的值为 0。我们的代价函数为: | ||
+ | <math> | ||
+ | \begin{align} | ||
+ | J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right] | ||
+ | \end{align} | ||
+ | </math> | ||
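
The cost above maps almost line for line onto code. The following is a minimal NumPy sketch, added here for illustration and not part of the original tutorial; the names <code>theta</code>, <code>X</code>, <code>y</code>, the array shapes, and the 0-based class labels are assumptions. It evaluates <math>J(\theta)</math> by building the indicator matrix <math>1\{y^{(i)} = j\}</math> explicitly:

<pre>
import numpy as np

def softmax_cost(theta, X, y, k):
    """J(theta) as defined above.

    theta : (k, n) array, row j holds theta_j.
    X     : (m, n) array of inputs x^{(i)}.
    y     : (m,) integer labels in {0, ..., k-1}.
    """
    m = X.shape[0]
    scores = X @ theta.T                          # entry [i, j] = theta_j^T x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)   # shift by a constant for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^{(i)} = j | x^{(i)}; theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0              # 1{y^{(i)} = j}
    return -np.mean(np.sum(indicator * np.log(probs), axis=1))
</pre>

Only the log-probability of each example's correct class contributes to the inner sum, which is exactly what the indicator function selects.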

'''Original''':

Notice that this generalizes the logistic regression cost function, which could also have been written:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''Translation''':

It is worth noting that the formula above is a generalization of the logistic regression loss function. The logistic regression loss function can be rewritten as follows:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''First review''':
It is worth noting that the formula above is a generalization of the logistic regression cost function. The logistic regression cost function can be rewritten as follows:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
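
As a quick sanity check of the claim above (a worked step added here, not part of the original passage): for <math>k = 2</math> the softmax probabilities collapse to the logistic sigmoid, since dividing numerator and denominator by <math>e^{\theta_1^T x^{(i)}}</math> gives

<math>
\begin{align}
p(y^{(i)} = 1 | x^{(i)} ; \theta) = \frac{e^{\theta_1^T x^{(i)}}}{e^{\theta_1^T x^{(i)}} + e^{\theta_2^T x^{(i)}}} = \frac{1}{1 + e^{-(\theta_1 - \theta_2)^T x^{(i)}}}
\end{align}
</math>

which is the logistic hypothesis with parameter vector <math>\theta_1 - \theta_2</math>. So, up to a relabeling of the classes, the two-class softmax cost is the logistic regression cost.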

'''Original''':

The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label. Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.

'''Translation''':

As can be seen, the softmax loss function is very similar in form to the logistic loss function, except that the softmax loss function sums over the <math>k</math> possible values of the class label. In addition, in softmax regression we have the probability given above.

'''First review''':

Except that we now sum over the <math>k</math> possible values of the class label, the softmax regression cost function is very similar to the one above. Note also that in softmax regression the probability is:
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>
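
In code, these class-conditional probabilities are the same quantity computed inside the cost function sketch above; a self-contained version (same assumed names and shapes) would be:

<pre>
import numpy as np

def softmax_probs(theta, X):
    """Return the (m, k) matrix with entries p(y^{(i)} = j | x^{(i)}; theta).

    theta : (k, n) array of parameter vectors theta_j; X : (m, n) inputs.
    Each row of the result sums to 1.
    """
    scores = X @ theta.T                          # theta_j^T x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)   # a constant shift leaves the ratios unchanged
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
</pre>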

'''Original''':

There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>

'''Translation''':
There is currently no closed-form method for solving for the minimum of <math>J(\theta)</math>, so we use an iterative optimization algorithm (such as gradient descent or L-BFGS) to solve for it. Taking derivatives, we obtain the following gradient formula:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>

'''First review''':
There is currently no closed-form method for solving for the minimum of <math>J(\theta)</math>, so we use an iterative optimization algorithm (such as gradient descent or L-BFGS) to solve for it. Taking derivatives, we obtain the following gradient formula:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>
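
The gradient formula likewise vectorizes directly. The sketch below (again an illustrative assumption about names, shapes, and 0-based labels, not code from the original text) returns a <code>(k, n)</code> array whose row <math>j</math> is <math>\nabla_{\theta_j} J(\theta)</math>:

<pre>
import numpy as np

def softmax_grad(theta, X, y, k):
    """Row j of the result is nabla_{theta_j} J(theta)."""
    m = X.shape[0]
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^{(i)} = j | x^{(i)}; theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0              # 1{y^{(i)} = j}
    # -(1/m) * sum_i x^{(i)} (1{y^{(i)} = j} - p(y^{(i)} = j | x^{(i)}; theta)), for all j at once
    return -(indicator - probs).T @ X / m
</pre>

Comparing this result against a numerical estimate of the derivatives of the cost is a simple way to check such an implementation.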

'''Original''':

Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>, the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

'''Translation''':
Let us recall the meaning of "<math>\nabla_{\theta_j}</math>": <math>\nabla_{\theta_j} J(\theta)</math> is a vector, so its <math>l</math>-th element <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> is the value of the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

'''First review''':
Let us recall the meaning of the notation "<math>\nabla_{\theta_j}</math>". In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so its <math>l</math>-th element <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> is the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th component of <math>\theta_j</math>.

'''Original''':

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>. For example, with the standard implementation of gradient descent, on each iteration we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).

When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.

'''Translation''':
With the partial derivative formula above, we can plug it into an algorithm to minimize <math>J(\theta)</math>. For example, using standard gradient descent, on each iteration we perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math>.
In an actual softmax implementation, we usually use a modified version of the loss function (one with weight decay added), which will be explained in detail below.

'''First review''':
With the partial derivative formula above, we can plug it into an algorithm such as gradient descent to minimize <math>J(\theta)</math>. For example, on each iteration of the standard implementation of gradient descent, we perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
When implementing the softmax regression algorithm, we will typically use a modified version of the cost function described above; specifically, one used together with weight decay. We describe the motivation and details below.
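
Putting the pieces together, the update rule from the last paragraph could be driven by a plain batch gradient descent loop such as the sketch below (illustrative only; the learning rate <code>alpha</code> and iteration count are arbitrary choices, and the weight-decay variant mentioned above is left to the next section):

<pre>
import numpy as np

def gradient_descent(theta, X, y, k, alpha=0.1, num_iters=500):
    """Repeatedly apply theta_j := theta_j - alpha * nabla_{theta_j} J(theta) for all j."""
    m = X.shape[0]
    for _ in range(num_iters):
        scores = X @ theta.T
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        indicator = np.zeros((m, k))
        indicator[np.arange(m), y] = 1.0
        grad = -(indicator - probs).T @ X / m     # one gradient row per class j
        theta = theta - alpha * grad              # simultaneous update of every theta_j
    return theta
</pre>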