LogSumExp
{{Short description|Smooth approximation to the maximum function}}
{{refimprove|date=August 2015}}
The LogSumExp (LSE) (also called RealSoftMax{{cite web |last1=Zhang |first1=Aston |last2=Lipton |first2=Zack |last3=Li |first3=Mu |last4=Smola |first4=Alex |title=Dive into Deep Learning, Chapter 3 Exercises |url=https://www.d2l.ai/chapter_linear-networks/softmax-regression.html#exercises |website=www.d2l.ai |accessdate=27 June 2020 |ref=d2lch3ex}} or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.{{cite journal
| first1 = Frank | last1 = Nielsen
| first2 = Ke | last2 = Sun
| year = 2016
| title = Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities
| arxiv = 1606.05850
| doi=10.3390/e18120442
| volume=18
| journal=Entropy
| issue = 12
| page=442
| bibcode=2016Entrp..18..442N
| s2cid = 17259055
| doi-access = free
}} It is defined as the logarithm of the sum of the exponentials of the arguments:
:<math>\mathrm{LSE}(x_1, \dots, x_n) = \log\left(\exp(x_1) + \cdots + \exp(x_n)\right).</math>
== Properties ==
The LogSumExp function domain is <math>\R^n</math>, the real coordinate space, and its codomain is <math>\R</math>, the real line.
It is an approximation to the maximum <math>\max_i x_i</math> with the following bounds
:<math>\max{\{x_1, \dots, x_n\}} \leq \mathrm{LSE}(x_1, \dots, x_n) \leq \max{\{x_1, \dots, x_n\}} + \log(n).</math>
The first inequality is strict unless <math>n = 1</math>. The second inequality is strict unless all arguments are equal.
(Proof: Let <math>m = \max_i x_i</math>. Then <math>\exp(m) \leq \sum_{i=1}^n \exp(x_i) \leq n \exp(m)</math>. Applying the logarithm to the inequality gives the result.)
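These bounds can be checked numerically. The following is a minimal sketch in Python (NumPy assumed; the helper name <code>lse_naive</code> is illustrative):
<syntaxhighlight lang="python">
import numpy as np

def lse_naive(x):
    """Direct evaluation of LSE(x) = log(sum(exp(x)))."""
    return np.log(np.sum(np.exp(x)))

x = np.array([0.5, 1.0, 2.0, 2.0])
val = lse_naive(x)

# max(x) <= LSE(x) <= max(x) + log(n)
assert x.max() <= val <= x.max() + np.log(len(x))
print(x.max(), val, x.max() + np.log(len(x)))  # 2.0  ~2.952  ~3.386
</syntaxhighlight>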
In addition, we can scale the function to make the bounds tighter. Consider the function <math>\frac{1}{t} \mathrm{LSE}(tx_1, \dots, tx_n)</math> for <math>t > 0</math>. Then
:<math>\max{\{x_1, \dots, x_n\}} \leq \frac{1}{t} \mathrm{LSE}(tx_1, \dots, tx_n) \leq \max{\{x_1, \dots, x_n\}} + \frac{\log(n)}{t}.</math>
(Proof: Replace each <math>x_i</math> with <math>tx_i</math> for some <math>t > 0</math> in the inequalities above, to give
:<math>\max{\{tx_1, \dots, tx_n\}} \leq \mathrm{LSE}(tx_1, \dots, tx_n) \leq \max{\{tx_1, \dots, tx_n\}} + \log(n)</math>
and, since <math>\max{\{tx_1, \dots, tx_n\}} = t \max{\{x_1, \dots, x_n\}}</math>,
:<math>t \max{\{x_1, \dots, x_n\}} \leq \mathrm{LSE}(tx_1, \dots, tx_n) \leq t \max{\{x_1, \dots, x_n\}} + \log(n);</math>
finally, dividing by <math>t</math> gives the result.)
Also, if we multiply by a negative number instead, we of course find a comparison to the <math>\min</math> function:
:<math>\min{\{x_1, \dots, x_n\}} - \frac{\log(n)}{t} \leq \frac{1}{-t} \mathrm{LSE}(-tx_1, \dots, -tx_n) \leq \min{\{x_1, \dots, x_n\}}.</math>
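The tightening effect of the scaling can be illustrated as follows (a minimal sketch in Python; the helper name <code>lse_scaled</code> is ours):
<syntaxhighlight lang="python">
import numpy as np

def lse_scaled(x, t):
    """(1/t) * LSE(t*x): the bound gap shrinks like log(n)/t."""
    return np.log(np.sum(np.exp(t * x))) / t

x = np.array([0.5, 1.0, 2.0, 2.0])
for t in (1.0, 10.0, 100.0):
    print(t, lse_scaled(x, t))  # tends to max(x) = 2.0 as t grows
</syntaxhighlight>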
The LogSumExp function is convex, and is strictly increasing everywhere in its domain.{{cite book|url=http://livebooklabs.com/keeppies/c5a5868ce26b8125|title=Optimization Models and Applications|first=Laurent|last=El Ghaoui|year=2017}} It is not strictly convex, since it is affine (linear plus a constant) on the diagonal and parallel lines:{{cite web|url=https://math.stackexchange.com/q/1189638 |title=convex analysis - About the strictly convexity of log-sum-exp function - Mathematics Stack Exchange|work=stackexchange.com}}
:<math>\mathrm{LSE}(x_1 + c, \dots, x_n + c) = \mathrm{LSE}(x_1, \dots, x_n) + c.</math>
Other than this direction, it is strictly convex (the Hessian has rank {{tmath|n - 1}}), so for example restricting to a hyperplane that is transverse to the diagonal results in a strictly convex function. See <math>\mathrm{LSE}_0^+</math>, below.
Writing <math>\mathbf{x} = (x_1, \dots, x_n),</math> the partial derivatives are:
:<math>\frac{\partial}{\partial x_i} \mathrm{LSE}(\mathbf{x}) = \frac{\exp x_i}{\sum_j \exp x_j},</math>
which means the gradient of LogSumExp is the softmax function.
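This identity can be checked against a finite-difference approximation of the gradient; a minimal sketch in Python (NumPy assumed, helper names ours):
<syntaxhighlight lang="python">
import numpy as np

def lse(x):
    return np.log(np.sum(np.exp(x)))

def softmax(x):
    e = np.exp(x - np.max(x))  # shifted for numerical stability
    return e / np.sum(e)

x = np.array([0.5, 1.0, 2.0])
h = 1e-6
eye = np.eye(len(x))
grad_fd = np.array([(lse(x + h * eye[i]) - lse(x - h * eye[i])) / (2 * h)
                    for i in range(len(x))])
print(np.allclose(grad_fd, softmax(x), atol=1e-6))  # True
</syntaxhighlight>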
The convex conjugate of LogSumExp is the negative entropy.
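Explicitly, a standard computation in convex analysis gives (with the convention <math>0 \log 0 = 0</math>)
:<math>\mathrm{LSE}^*(y) = \begin{cases} \sum_{i=1}^n y_i \log y_i & \text{if } y_i \geq 0 \text{ for all } i \text{ and } \sum_{i=1}^n y_i = 1, \\ +\infty & \text{otherwise}, \end{cases}</math>
that is, the negative entropy restricted to the probability simplex.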
== log-sum-exp trick for log-domain calculations ==
The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.{{Cite book|last=McElreath|first=Richard|url=http://worldcat.org/oclc/1107423386|title=Statistical Rethinking|oclc=1107423386}}
Just as multiplication in linear scale becomes simple addition in log scale, addition in linear scale becomes the LSE in log scale:
:<math>\mathrm{LSE}(\log(x_1), \dots, \log(x_n)) = \log(x_1 + \cdots + x_n).</math>
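For instance, summing two quantities stored as log-values reduces to an LSE (a minimal sketch in Python; the numbers are illustrative):
<syntaxhighlight lang="python">
import numpy as np

# Two quantities stored as log-values
log_a, log_b = np.log(0.001), np.log(0.002)

linear_sum = np.log(0.001 + 0.002)                       # addition in linear scale
log_domain_sum = np.log(np.exp(log_a) + np.exp(log_b))   # LSE in log scale
print(np.isclose(linear_sum, log_domain_sum))  # True
</syntaxhighlight>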
A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems
when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision
floating point numbers.{{Cite web|title=Practical issues: Numeric stability.|url=https://cs231n.github.io/linear-classify/#softmax-classifier|website=CS231n Convolutional Neural Networks for Visual Recognition}}
Unfortunately, the use of LSE directly in this case can again cause overflow/underflow problems. Therefore, the following equivalent must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient):
:<math>\mathrm{LSE}(x_1, \dots, x_n) = x^* + \log\left(\exp(x_1 - x^*) + \cdots + \exp(x_n - x^*)\right),</math>
where <math>x^* = \max{\{x_1, \dots, x_n\}}.</math>
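A minimal Python sketch of this shifted evaluation (NumPy and SciPy assumed; <code>scipy.special.logsumexp</code> serves as a reference):
<syntaxhighlight lang="python">
import numpy as np
from scipy.special import logsumexp  # reference implementation

def lse_stable(x):
    """Shifted evaluation of LSE: mathematically exact, overflow-safe."""
    x = np.asarray(x, dtype=float)
    x_star = np.max(x)
    return x_star + np.log(np.sum(np.exp(x - x_star)))

x = np.array([1000.0, 1000.5, 999.0])  # exp(1000) overflows float64
# np.log(np.sum(np.exp(x))) would return inf with an overflow warning
print(lse_stable(x))                            # ~1001.104
print(np.isclose(lse_stable(x), logsumexp(x)))  # True
</syntaxhighlight>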
Many math libraries, such as IT++, provide a routine for computing LSE and use this formula internally.
== A strictly convex log-sum-exp type function ==
LSE is convex but not strictly convex.
We can define a strictly convex log-sum-exp type function{{cite arXiv
| first1 = Frank | last1 = Nielsen
| first2 = Gaetan | last2 = Hadjeres
| year = 2018
| title = Monte Carlo Information Geometry: The dually flat case
| class = cs.LG
| eprint = 1803.07225}} by adding an extra argument set to zero:
:<math>\mathrm{LSE}_0^+(x_1, \dots, x_n) = \mathrm{LSE}(0, x_1, \dots, x_n).</math>
This function is a proper Bregman generator (strictly convex and differentiable).
It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
In tropical analysis, this is the sum in the log semiring.
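A minimal sketch of this variant in Python, built on the shifted evaluation above (the name <code>lse0_plus</code> is ours):
<syntaxhighlight lang="python">
import numpy as np

def lse0_plus(x):
    """LSE0+(x) = LSE(0, x_1, ..., x_n), via the shifted evaluation."""
    x = np.concatenate(([0.0], np.asarray(x, dtype=float)))
    x_star = np.max(x)
    return x_star + np.log(np.sum(np.exp(x - x_star)))

print(lse0_plus(np.array([0.0])))  # LSE(0, 0) = log 2 ~ 0.693
</syntaxhighlight>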