Huber loss

Short description: Loss function used in robust regression

In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss. A variant for classification is also sometimes used.

Definition

The Huber loss function describes the penalty incurred by an estimation procedure $f$. Huber (1964) defines the loss function piecewise by^[1]

$L_{δ} (a) = \begin{cases} \frac{1}{2} a^{2} & \text{for } | a | \leq δ, \\ δ \cdot \left(| a | - \frac{1}{2} δ\right), & \text{otherwise.} \end{cases}$

This function is quadratic for small values of $a$, and linear for large values, with equal values and slopes of the different sections at the two points where $| a | = δ$. The variable $a$ often refers to the residuals, that is, to the difference between the observed and predicted values, $a = y - f (x)$, so the former can be expanded to^[2]

$L_{δ} (y, f (x)) = \begin{cases} \frac{1}{2} {(y - f (x))}^{2} & \text{for } | y - f (x) | \leq δ, \\ δ \cdot \left(| y - f (x) | - \frac{1}{2} δ\right), & \text{otherwise.} \end{cases}$
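The piecewise definition translates directly into code. The following is a minimal sketch for a single residual, assuming $δ > 0$; the function names `huber_loss`, `huber_grad` and `huber_loss_residual` are illustrative and not taken from any library. The derivative is included because it makes the matching slopes at $| a | = δ$ explicit: it is the residual clipped to the interval $[- δ, δ]$.

```python
def huber_loss(a, delta=1.0):
    """Huber loss L_delta(a) for a single residual a, assuming delta > 0."""
    abs_a = abs(a)
    if abs_a <= delta:
        return 0.5 * a * a                # quadratic branch, |a| <= delta
    return delta * (abs_a - 0.5 * delta)  # linear branch, |a| > delta


def huber_grad(a, delta=1.0):
    """Derivative of the Huber loss: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, a))


def huber_loss_residual(y, prediction, delta=1.0):
    """Huber loss of an observed value y against a prediction f(x)."""
    return huber_loss(y - prediction, delta)
```

Both branches give the same value $\frac{1}{2} δ^{2}$ at $| a | = δ$, which is the continuity property described above.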

The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it smooths out the corner of the absolute value function at the origin.
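The convolution statement can be checked by a short calculation. The normalisation below is an assumption made for illustration: the rectangular function is taken as $r_{δ} (t) = \frac{1}{2 δ}$ for $| t | \leq δ$ and $0$ otherwise, so that it integrates to one.

$(| \cdot | * r_{δ}) (a) = \frac{1}{2 δ} \int_{- δ}^{δ} | a - t | \, dt = \begin{cases} \frac{a^{2} + δ^{2}}{2 δ} & \text{for } | a | \leq δ, \\ | a | & \text{otherwise.} \end{cases}$

Scaling by $δ$ and translating by $- \frac{1}{2} δ^{2}$ then recovers the definition:

$δ \cdot (| \cdot | * r_{δ}) (a) - \frac{1}{2} δ^{2} = \begin{cases} \frac{1}{2} a^{2} & \text{for } | a | \leq δ, \\ δ \cdot \left(| a | - \frac{1}{2} δ\right), & \text{otherwise.} \end{cases}$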

Motivation

Two very commonly used loss functions are the squared loss, $L (a) = a^{2}$, and the absolute loss, $L (a) = | a |$. The squared loss function results in an arithmetic mean-unbiased estimator, and the absolute-value loss function results in a median-unbiased estimator (in the one-dimensional case, and a geometric median-unbiased estimator for the multi-dimensional case). The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of residuals $a_{i}$ (as in $\sum_{i = 1}^{n} L (a_{i})$), the sample mean is influenced too much by a few particularly large values of $a_{i}$ when the distribution is heavy-tailed. In terms of estimation theory, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions.^[3]

As defined above, the Huber loss function is strongly convex in a uniform neighborhood of its minimum $a = 0$; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points $a = - δ$ and $a = δ$. These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) and the robustness of the median-unbiased estimator (using the absolute value function).
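The trade-off between the mean and the median can be illustrated by estimating a location parameter, that is, by minimising $\sum_{i} L_{δ} (x_{i} - \mu)$ over $\mu$. The sketch below assumes iteratively reweighted averaging as the solver, which is a common technique but is not described in this article; the function name `huber_location` and the sample values are made up for illustration.

```python
from statistics import mean, median


def huber_location(samples, delta=1.0, iterations=100):
    """Estimate mu minimising sum_i L_delta(x_i - mu) by reweighted averaging.

    Assumes delta > 0. Each pass computes a weighted mean in which points
    in the quadratic zone get weight 1 and points in the linear zone get
    weight delta / |residual|, so distant points count for less.
    """
    mu = median(samples)  # robust starting point
    for _ in range(iterations):
        weights = [
            1.0 if abs(x - mu) <= delta else delta / abs(x - mu)
            for x in samples
        ]
        mu = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return mu


# Illustrative data: four nearby values and one gross outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
estimates = {
    "mean": mean(data),
    "median": median(data),
    "huber": huber_location(data, delta=3.0),
}
```

For these sample values with $δ = 3$, the mean is 22 and the median is 3, while the Huber estimate converges to about 3.25: it averages the four nearby values in the ordinary way and lets the outlier contribute only a bounded pull.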

Pseudo-Huber loss function

The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function. It combines the best properties of L2 squared loss and L1 absolute loss by being strongly convex when close to the target/minimum and less steep for extreme values. The parameter $δ$ controls both the scale at which the Pseudo-Huber loss function transitions from L2 loss for values close to the minimum to L1 loss for extreme values, and the steepness at extreme values. Unlike the Huber loss, the Pseudo-Huber loss function has continuous derivatives of all orders. It is defined as^[4]^[5]

$L_{δ} (a) = δ^{2} (\sqrt{1 + (a / δ)^{2}} - 1) .$

As such, this function approximates $a^{2} / 2$ for small values of $a$, and approximates a straight line with slope $δ$ for large values of $a$.
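Both limiting behaviours can be checked numerically. The following is a minimal sketch assuming $δ > 0$; the names `pseudo_huber` and `pseudo_huber_grad` are illustrative, and the derivative is obtained by differentiating the formula above.

```python
import math


def pseudo_huber(a, delta=1.0):
    """Pseudo-Huber loss: delta^2 * (sqrt(1 + (a / delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (a / delta) ** 2) - 1.0)


def pseudo_huber_grad(a, delta=1.0):
    """Derivative with respect to a: a / sqrt(1 + (a / delta)^2)."""
    return a / math.sqrt(1.0 + (a / delta) ** 2)


# Near zero the loss is close to a^2 / 2.
small_gap = abs(pseudo_huber(1e-3) - 0.5 * 1e-3 ** 2)

# Far from zero the slope approaches delta (here delta = 2).
large_slope = pseudo_huber_grad(1e6, delta=2.0)
```

The derivative is smooth everywhere and bounded in magnitude by $δ$, which is the counterpart of the clipped residual that appears in the derivative of the Huber loss.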

While the above is the most common form, other smooth approximations of the Huber loss function also exist.^[6]

Variant for classification

For classification purposes, a variant of the Huber loss called modified Huber is sometimes used. Given a prediction $f (x)$ (a real-valued classifier score) and a true binary class label $y \in \{+ 1, - 1\}$, the modified Huber loss is defined as^[7]

$L (y, f (x)) = \begin{cases} \max (0, 1 - y f (x))^{2} & \text{for } y f (x) > - 1, \\ - 4 y f (x) & \text{otherwise.} \end{cases}$

The term $\max (0, 1 - y f (x))$ is the hinge loss used by support vector machines; the quadratically smoothed hinge loss is a generalization of $L$.^[7]
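As a minimal sketch of the definition above, the modified Huber loss can be written in terms of the margin $y f (x)$. The function name `modified_huber` is illustrative, and the label `y` is assumed to be either $+ 1$ or $- 1$.

```python
def modified_huber(y, score):
    """Modified Huber loss for a label y in {+1, -1} and a real-valued score."""
    margin = y * score
    if margin > -1.0:
        # Squared hinge: zero once the margin reaches 1.
        return max(0.0, 1.0 - margin) ** 2
    # Linear branch for badly misclassified points.
    return -4.0 * margin
```

At the switch point $y f (x) = - 1$ both branches evaluate to 4, so the loss is continuous, and for more negative margins it grows only linearly, which limits the influence of badly misclassified points.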

Applications

The Huber loss function is used in robust statistics, M-estimation and additive modelling.^[8]

References

[1] Huber, Peter J. (1964). "Robust Estimation of a Location Parameter". Annals of Mathematical Statistics 35 (1): 73–101. doi:10.1214/aoms/1177703732.
[2] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). The Elements of Statistical Learning. p. 349. http://statweb.stanford.edu/~tibs/ElemStatLearn/. Compared to Hastie et al., the loss is scaled by a factor of 1/2, to be consistent with Huber's original definition given earlier. Note that $δ$ cannot be chosen effectively without reference to the scale of the residuals, which is typically estimated from the data.
[3] Géron, Aurélien (2023). Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.). O'Reilly (published October 20, 2023). pp. 314, 412. ISBN 978-1-098-12597-4.
[4] Charbonnier, P.; Blanc-Féraud, L.; Aubert, G.; Barlaud, M. (1997). "Deterministic edge-preserving regularization in computed imaging". IEEE Trans. Image Process. 6 (2): 298–311. doi:10.1109/83.551699. PMID 18282924. Bibcode: 1997ITIP....6..298C.
[5] Hartley, R.; Zisserman, A. (2003). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press. p. 619. ISBN 978-0-521-54051-3. https://archive.org/details/multipleviewgeom00hart_833.
[6] Lange, K. (1990). "Convergence of Image Reconstruction Algorithms with Gibbs Smoothing". IEEE Trans. Med. Imaging 9 (4): 439–446. doi:10.1109/42.61759. PMID 18222791.
[7] Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML. https://dl.acm.org/citation.cfm?id=1015332.
[8] Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine". Annals of Statistics 29 (5): 1189–1232. doi:10.1214/aos/1013203451.

Original source: https://en.wikipedia.org/wiki/Huber loss.
