LogSumExp


The LogSumExp (LSE) (also called RealSoftMax[1] or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.[2] It is defined as the logarithm of the sum of the exponentials of the arguments:

[math]\displaystyle{ \mathrm{LSE}(x_1, \dots, x_n) = \log\left( \exp(x_1) + \cdots + \exp(x_n) \right). }[/math]

Properties

The LogSumExp function domain is [math]\displaystyle{ \R^n }[/math], the real coordinate space, and its codomain is [math]\displaystyle{ \R }[/math], the real line. It is an approximation to the maximum [math]\displaystyle{ \max_i x_i }[/math] with the following bounds [math]\displaystyle{ \max{\{x_1, \dots, x_n\}} \leq \mathrm{LSE}(x_1, \dots, x_n) \leq \max{\{x_1, \dots, x_n\}} + \log(n). }[/math] The first inequality is strict unless [math]\displaystyle{ n = 1 }[/math]. The second inequality is strict unless all arguments are equal. (Proof: Let [math]\displaystyle{ m = \max_i x_i }[/math]. Then [math]\displaystyle{ \exp(m) \leq \sum_{i=1}^n \exp(x_i) \leq n \exp(m) }[/math]. Applying the logarithm to the inequality gives the result.)

In addition, we can scale the function to make the bounds tighter. Consider the function [math]\displaystyle{ \frac 1 t \mathrm{LSE}(tx_1, \dots, tx_n) }[/math] for [math]\displaystyle{ t \gt 0 }[/math]. Then [math]\displaystyle{ \max{\{x_1, \dots, x_n\}} \lt \frac 1 t \mathrm{LSE}(tx_1, \dots, tx_n) \leq \max{\{x_1, \dots, x_n\}} + \frac{\log(n)}{t}, }[/math] where the first inequality is strict provided [math]\displaystyle{ n \gt 1 }[/math]. (Proof: Replace each [math]\displaystyle{ x_i }[/math] with [math]\displaystyle{ tx_i }[/math] in the inequalities above, to give [math]\displaystyle{ \max{\{tx_1, \dots, tx_n\}} \lt \mathrm{LSE}(tx_1, \dots, tx_n)\leq \max{\{tx_1, \dots, tx_n\}} + \log(n). }[/math] Since [math]\displaystyle{ \max{\{tx_1, \dots, tx_n\}} = t \max{\{x_1, \dots, x_n\}} }[/math] for [math]\displaystyle{ t\gt 0 }[/math], this becomes [math]\displaystyle{ t \max{\{x_1, \dots, x_n\}} \lt \mathrm{LSE}(tx_1, \dots, tx_n)\leq t\max{\{x_1, \dots, x_n\}} + \log(n). }[/math] Dividing by [math]\displaystyle{ t }[/math] gives the result.)
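
Both sets of bounds are easy to check numerically. A minimal sketch (the helper `lse` and the sample values are illustrative, not from any particular library):

```python
import math

def lse(xs):
    # Naive log-sum-exp; adequate for small, moderate-magnitude inputs.
    return math.log(sum(math.exp(x) for x in xs))

xs = [1.0, 2.0, 3.0]
m, n = max(xs), len(xs)

# max(x) <= LSE(x) <= max(x) + log(n)
assert m <= lse(xs) <= m + math.log(n)

# Scaling by t > 0 tightens the gap to at most log(n)/t.
for t in (1.0, 2.0, 10.0):
    approx = lse([t * x for x in xs]) / t
    assert m < approx <= m + math.log(n) / t
```

Note that for very large `t` the smaller terms underflow to zero in floating point and the strict inequality collapses to equality, which is why `t` is kept moderate here.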

Similarly, scaling by a negative parameter instead yields a comparison to the [math]\displaystyle{ \min }[/math] function: [math]\displaystyle{ \min{\{x_1, \dots, x_n\}} - \frac{\log(n)}{t} \leq \frac 1 {-t} \mathrm{LSE}(-tx) \lt \min{\{x_1, \dots, x_n\}}. }[/math]

The LogSumExp function is convex, and is strictly increasing everywhere in its domain[3] (but not strictly convex everywhere[4]).

Writing [math]\displaystyle{ \mathbf{x} = (x_1, \dots, x_n), }[/math] the partial derivatives are: [math]\displaystyle{ \frac{\partial}{\partial x_i}{\mathrm{LSE}(\mathbf{x})} = \frac{\exp x_i}{\sum_j \exp {x_j}}, }[/math] which means the gradient of LogSumExp is the softmax function.
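
The identity between the gradient of LSE and the softmax function can be verified with a finite-difference check. A sketch with hand-rolled helpers (names are illustrative):

```python
import math

def lse(xs):
    return math.log(sum(math.exp(x) for x in xs))

def softmax(xs):
    # softmax_i(x) = exp(x_i) / sum_j exp(x_j)
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

xs = [0.5, 1.5, -0.2]
h = 1e-6
for i in range(len(xs)):
    bumped = list(xs)
    bumped[i] += h
    fd = (lse(bumped) - lse(xs)) / h  # forward-difference estimate of d/dx_i
    assert abs(fd - softmax(xs)[i]) < 1e-5
```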

The convex conjugate of LogSumExp is the negative entropy, restricted to the probability simplex.

Log-sum-exp trick for log-domain calculations

The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.[5]

Just as multiplication in the linear scale becomes simple addition in the log scale, addition in the linear scale becomes the LSE in the log scale:

[math]\displaystyle{ \mathrm{LSE}(\log(x_1), \dots, \log(x_n)) = \log(x_1 + \dots + x_n). }[/math]

A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating-point numbers.[6]
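
For instance, summing probabilities that are stored as log-probabilities amounts to one LSE call. A minimal sketch (the helper `lse` is illustrative):

```python
import math

def lse(xs):
    return math.log(sum(math.exp(x) for x in xs))

probs = [0.2, 0.3, 0.5]
logs = [math.log(p) for p in probs]

# Adding in the log domain: LSE(log p_1, ..., log p_n) = log(p_1 + ... + p_n)
assert abs(lse(logs) - math.log(sum(probs))) < 1e-12
```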

Unfortunately, using LSE directly in this way can itself cause overflow or underflow. The following mathematically equivalent formula must therefore be used instead; many math libraries, such as IT++, provide a default LSE routine that uses it internally.

[math]\displaystyle{ \mathrm{LSE}(x_1, \dots, x_n) = x^* + \log\left( \exp(x_1-x^*)+ \cdots + \exp(x_n-x^*) \right) }[/math] where [math]\displaystyle{ x^* = \max{\{x_1, \dots, x_n\}} }[/math]
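
The shifted formula translates directly into a numerically stable routine. A minimal sketch (the name `logsumexp` is illustrative; production code would typically use a library routine such as SciPy's `scipy.special.logsumexp`):

```python
import math

def logsumexp(xs):
    # Shift by the maximum so the largest exponentiated term is exp(0) = 1.
    # Overflow is then impossible, and any term that underflows to 0 is
    # exactly one that is negligible next to exp(0).
    x_star = max(xs)
    return x_star + math.log(sum(math.exp(x - x_star) for x in xs))

# Naive evaluation fails here: math.exp(1000.0) raises OverflowError.
try:
    math.exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed

# The shifted form handles the same inputs without difficulty.
assert abs(logsumexp([1000.0, 1000.0]) - (1000.0 + math.log(2))) < 1e-12
```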

A strictly convex log-sum-exp type function

LSE is convex but not strictly convex. We can define a strictly convex log-sum-exp type function[7] by adding an extra argument set to zero:

[math]\displaystyle{ \mathrm{LSE}_0^+(x_1,...,x_n) = \mathrm{LSE}(0,x_1,...,x_n). }[/math]

This function is a proper Bregman generator (strictly convex and differentiable). It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
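
The contrast between the two functions can be seen with a midpoint test: plain LSE is affine along the diagonal direction, so the convexity inequality holds with equality there, while [math]\displaystyle{ \mathrm{LSE}_0^+ }[/math] gives a strict gap. A sketch with hand-rolled helpers (names are illustrative):

```python
import math

def lse(xs):
    return math.log(sum(math.exp(x) for x in xs))

def lse0(xs):
    # Prepend a zero argument: LSE_0^+(x) = LSE(0, x_1, ..., x_n).
    return lse([0.0] + list(xs))

a, b = [0.0, 0.0], [2.0, 2.0]
mid = [(u + v) / 2 for u, v in zip(a, b)]

# Plain LSE: equality at the midpoint along the diagonal (not strictly convex).
assert abs(lse(mid) - (lse(a) + lse(b)) / 2) < 1e-12

# LSE_0^+: strict inequality at the same midpoint (strictly convex).
assert lse0(mid) < (lse0(a) + lse0(b)) / 2 - 1e-6
```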

In tropical analysis, this is the sum in the log semiring.

References

  1. Zhang, Aston; Lipton, Zack; Li, Mu; Smola, Alex. "Dive into Deep Learning, Chapter 3 Exercises". https://www.d2l.ai/chapter_linear-networks/softmax-regression.html#exercises. Retrieved 27 June 2020.
  2. Nielsen, Frank; Sun, Ke (2016). "Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities". Entropy 18 (12): 442. doi:10.3390/e18120442. Bibcode: 2016Entrp..18..442N.
  3. El Ghaoui, Laurent (2017). Optimization Models and Applications. http://livebooklabs.com/keeppies/c5a5868ce26b8125.
  4. "About the strict convexity of the log-sum-exp function". Mathematics Stack Exchange. https://math.stackexchange.com/q/1189638.
  5. McElreath, Richard. Statistical Rethinking. OCLC 1107423386. http://worldcat.org/oclc/1107423386.
  6. "Practical issues: Numeric stability". https://cs231n.github.io/linear-classify/#softmax-classifier.
  7. Nielsen, Frank; Hadjeres, Gaetan (2018). Monte Carlo Information Geometry: The dually flat case. Bibcode: 2018arXiv180307225N.