Hinge loss

Figure: The hinge loss (blue) and the zero-one loss (green) for a fixed target t = 1, plotted against the prediction y. The hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

[math]\displaystyle{ \ell(y) = \max(0, 1-t \cdot y) }[/math]

Note that [math]\displaystyle{ y }[/math] should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, [math]\displaystyle{ y = \mathbf{w} \cdot \mathbf{x} + b }[/math], where [math]\displaystyle{ (\mathbf{w},b) }[/math] are the parameters of the hyperplane and [math]\displaystyle{ \mathbf{x} }[/math] is the input point to be classified.

When t and y have the same sign (meaning y predicts the right class) and [math]\displaystyle{ |y| \ge 1 }[/math], the hinge loss [math]\displaystyle{ \ell(y) = 0 }[/math]. When they have opposite signs, [math]\displaystyle{ \ell(y) }[/math] increases linearly with [math]\displaystyle{ |y| }[/math]; the loss is likewise positive when [math]\displaystyle{ |y| \lt 1 }[/math], even if [math]\displaystyle{ t }[/math] and [math]\displaystyle{ y }[/math] have the same sign (a correct prediction, but not by enough of a margin).
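
As a concrete illustration, here is a minimal NumPy sketch of the binary hinge loss; the function name and the toy scores are only illustrative.

    import numpy as np

    def hinge_loss(y_score, t):
        """Binary hinge loss max(0, 1 - t*y) for raw scores y_score and labels t in {-1, +1}."""
        return np.maximum(0.0, 1.0 - t * y_score)

    # A confident correct prediction incurs zero loss; a correct but low-margin
    # prediction and a wrong prediction are both penalized.
    print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))  # [0.  0.5 2. ]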

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

[math]\displaystyle{ \ell(y) = \max(0, 1 + \max_{y \ne t} \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}) }[/math],

where [math]\displaystyle{ t }[/math] is the target label, and [math]\displaystyle{ \mathbf{w}_t }[/math] and [math]\displaystyle{ \mathbf{w}_y }[/math] are the model's weight vectors for classes [math]\displaystyle{ t }[/math] and [math]\displaystyle{ y }[/math], respectively.

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

[math]\displaystyle{ \ell(y) = \sum_{y \ne t} \max(0, 1 + \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}) }[/math].
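
Both multiclass variants are straightforward to evaluate once the per-class scores [math]\displaystyle{ \mathbf{w}_y \mathbf{x} }[/math] are available. The following NumPy sketch (function names and toy scores are illustrative only) computes each definition for a single example:

    import numpy as np

    def crammer_singer_loss(scores, t):
        """Crammer-Singer multiclass hinge: max(0, 1 + max_{y != t} s_y - s_t)."""
        other = np.delete(scores, t)
        return max(0.0, 1.0 + other.max() - scores[t])

    def weston_watkins_loss(scores, t):
        """Weston-Watkins multiclass hinge: sum_{y != t} max(0, 1 + s_y - s_t)."""
        other = np.delete(scores, t)
        return np.maximum(0.0, 1.0 + other - scores[t]).sum()

    scores = np.array([2.0, 1.5, -0.3])       # hypothetical per-class scores w_y . x
    print(crammer_singer_loss(scores, t=0))   # 0.5: the best competitor is within the margin
    print(weston_watkins_loss(scores, t=0))   # 0.5: only class 1 violates the margin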

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, y the SVM's predictions, φ the joint feature function, and Δ the Hamming loss:

[math]\displaystyle{ \begin{align} \ell(\mathbf{y}) & = \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\ & = \max(0, \max_{y \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \end{align} }[/math].
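
For a small output space that can be enumerated explicitly, the margin-rescaled loss can be computed directly. In the sketch below, phi, the Hamming distance, and the candidate set are illustrative stand-ins for a real joint feature map, Δ, and the structured output space:

    import numpy as np
    from itertools import product

    def structured_hinge(w, x, t, phi, delta, candidates):
        """Margin-rescaled structured hinge:
        max(0, max_y (delta(y, t) + <w, phi(x, y)>) - <w, phi(x, t)>)."""
        worst = max(delta(y, t) + w @ phi(x, y) for y in candidates)
        return max(0.0, worst - w @ phi(x, t))

    # Toy sequence-labelling setup: outputs are length-3 binary label sequences.
    def phi(x, y):                     # illustrative joint feature map
        y = np.asarray(y, dtype=float)
        return np.concatenate([x * y.sum(), y])

    def hamming(y, t):                 # Delta: number of mismatched positions
        return sum(a != b for a, b in zip(y, t))

    x = np.array([0.5, -1.0])
    t = (1, 0, 1)
    w = np.random.default_rng(0).normal(size=phi(x, t).shape)
    candidates = list(product([0, 1], repeat=3))   # the full output space Y
    print(structured_hinge(w, x, t, phi, hamming, candidates))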

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters w of a linear SVM with score function [math]\displaystyle{ y = \mathbf{w} \cdot \mathbf{x} }[/math] that is given by

[math]\displaystyle{ \frac{\partial\ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y \lt 1, \\ 0 & \text{otherwise}. \end{cases} }[/math]
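
This subgradient is enough for a simple subgradient-descent trainer. The sketch below fits an unregularized linear SVM (no bias term) on a toy data set; the learning rate, epoch count, and data are arbitrary illustrations, not a recommended setup:

    import numpy as np

    def train_linear_svm(X, t, lr=0.1, epochs=100):
        """Subgradient descent on the average hinge loss of a linear SVM y = w . x
        (bias omitted, no regularization, fixed step size -- illustration only)."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            y = X @ w                          # raw scores
            active = (t * y) < 1               # margin violations contribute -t_i * x_i
            grad = -(t[active][:, None] * X[active]).sum(axis=0) / len(X)
            w -= lr * grad
        return w

    # Linearly separable toy data with labels in {-1, +1}.
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    t = np.array([1, 1, -1, -1])
    w = train_linear_svm(X, t)
    print(np.sign(X @ w))   # [ 1.  1. -1. -1.]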
Figure: Three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red). The vertical axis is the loss; the horizontal axis is z = ty.

However, since the derivative of the hinge loss at [math]\displaystyle{ ty = 1 }[/math] is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

[math]\displaystyle{ \ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if} ~~ ty \le 0, \\ \frac{1}{2} (1 - ty)^2 & \text{if} ~~ 0 \lt ty \lt 1, \\ 0 & \text{if} ~~ 1 \le ty \end{cases} }[/math]

or the quadratically smoothed

[math]\displaystyle{ \ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if} ~~ ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise} \end{cases} }[/math]

suggested by Zhang.[8] The modified Huber loss [math]\displaystyle{ L }[/math] is a special case of this loss function with [math]\displaystyle{ \gamma = 2 }[/math], specifically [math]\displaystyle{ L(t,y) = 4 \ell_2(y) }[/math].
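
Both smoothed variants are simple to implement directly, and the stated relationship to the modified Huber loss can be checked numerically; the function names below are illustrative only.

    import numpy as np

    def smooth_hinge_rennie_srebro(z):
        """Rennie & Srebro's piecewise-smooth hinge, as a function of z = t*y."""
        return np.where(z <= 0, 0.5 - z,
               np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0))

    def smooth_hinge_zhang(z, gamma):
        """Zhang's quadratically smoothed hinge, as a function of z = t*y."""
        return np.where(z >= 1.0 - gamma,
                        np.maximum(0.0, 1.0 - z) ** 2 / (2.0 * gamma),
                        1.0 - gamma / 2.0 - z)

    z = np.linspace(-2, 2, 9)
    # Modified Huber loss L(t, y) = 4 * ell_2(y), i.e. 4 * smooth_hinge_zhang(z, gamma=2):
    modified_huber = 4.0 * smooth_hinge_zhang(z, gamma=2.0)
    print(modified_huber)   # equals max(0, 1 - z)^2 for z >= -1 and -4z for z < -1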


References

  1. Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?". Neural Computation 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510. http://web.mit.edu/lrosasco/www/publications/loss.pdf. 
  2. Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems. LNCS. 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7. http://www.keerthis.com/multiclass_mcs_kaibo_05.pdf. 
  3. 3.0 3.1 Doğan, Ürün; Glasmachers, Tobias; Igel, Christian (2016). "A Unified View on Multi-class Support Vector Classification". Journal of Machine Learning Research 17: 1–32. http://www.jmlr.org/papers/volume17/11-229/11-229.pdf. 
  4. Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines". Journal of Machine Learning Research 2: 265–292. http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf. 
  5. Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models". http://www.ttic.edu/sigml/symposium2011/papers/Moore+DeNero_Regularization.pdf. 
  6. Weston, Jason; Watkins, Chris (1999). "Support Vector Machines for Multi-Class Pattern Recognition". https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf. 
  7. Rennie, Jason D. M.; Srebro, Nathan (2005). "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels". Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling. http://ttic.uchicago.edu/~nati/Publications/RennieSrebroIJCAI05.pdf. 
  8. Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML. http://tongzhang-ml.org/papers/icml04-stograd.pdf.