LogSumExp

{{Short description|Smooth approximation to the maximum function}}

{{refimprove|date=August 2015}}

The LogSumExp (LSE) (also called RealSoftMax{{cite web |last1=Zhang |first1=Aston |last2=Lipton |first2=Zack |last3=Li |first3=Mu |last4=Smola |first4=Alex |title=Dive into Deep Learning, Chapter 3 Exercises |url=https://www.d2l.ai/chapter_linear-networks/softmax-regression.html#exercises |website=www.d2l.ai |accessdate=27 June 2020 |ref=d2lch3ex}} or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.{{cite journal

| first1 = Frank | last1 = Nielsen

| first2 = Ke | last2 = Sun

| year = 2016

| title = Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities

| arxiv = 1606.05850

| doi=10.3390/e18120442

| volume=18

| journal=Entropy

| issue = 12

| page=442

| bibcode=2016Entrp..18..442N

| s2cid = 17259055

| doi-access = free

}} It is defined as the logarithm of the sum of the exponentials of the arguments:

\mathrm{LSE}(x_1, \dots, x_n) = \log\left( \exp(x_1) + \cdots + \exp(x_n) \right).
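For illustration, a direct transcription of this definition into code might look like the following sketch (Python with NumPy is assumed; the name lse_naive is illustrative, not a standard routine). As discussed later in the article, this naive form can overflow for large arguments.

<syntaxhighlight lang="python">
import numpy as np

def lse_naive(x):
    """LogSumExp computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)))

# LSE(0, 1, 2) = log(e^0 + e^1 + e^2) ≈ 2.4076
print(lse_naive([0.0, 1.0, 2.0]))
</syntaxhighlight>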

Properties

The LogSumExp function domain is \R^n, the real coordinate space, and its codomain is \R, the real line.

It is an approximation to the maximum \max_i x_i, with the following bounds:

\max{\{x_1, \dots, x_n\}} \leq \mathrm{LSE}(x_1, \dots, x_n) \leq \max{\{x_1, \dots, x_n\}} + \log(n).

The first inequality is strict unless n = 1. The second inequality is strict unless all arguments are equal.

(Proof: Let m = \max_i x_i. Then \exp(m) \leq \sum_{i=1}^n \exp(x_i) \leq n \exp(m). Applying the logarithm to the inequality gives the result.)
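These bounds are easy to check numerically; the following sketch (Python with NumPy, variable names illustrative) compares the three quantities for a sample vector:

<syntaxhighlight lang="python">
import numpy as np

x = np.array([1.0, 3.0, 5.0])
lse = np.log(np.sum(np.exp(x)))

# max(x) <= LSE(x) <= max(x) + log(n)
print(x.max())                   # 5.0
print(lse)                       # ≈ 5.143
print(x.max() + np.log(len(x)))  # ≈ 6.099
</syntaxhighlight>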

In addition, we can scale the function to make the bounds tighter: for t > 0, consider the function \frac 1 t \mathrm{LSE}(tx_1, \dots, tx_n). Then

\max{\{x_1, \dots, x_n\}} < \frac 1 t \mathrm{LSE}(tx_1, \dots, tx_n) \leq \max{\{x_1, \dots, x_n\}} + \frac{\log(n)}{t}.

(Proof: Replacing each x_i with tx_i for some t>0 in the inequalities above gives

\max{\{tx_1, \dots, tx_n\}} < \mathrm{LSE}(tx_1, \dots, tx_n)\leq \max{\{tx_1, \dots, tx_n\}} + \log(n).

and, since t>0,

t \max{\{x_1, \dots, x_n\}} < \mathrm{LSE}(tx_1, \dots, tx_n)\leq t\max{\{x_1, \dots, x_n\}} + \log(n).

Finally, dividing by t gives the result.)
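A small numerical sketch of this tightening effect (the helper name lse_scaled is illustrative):

<syntaxhighlight lang="python">
import numpy as np

def lse_scaled(x, t):
    """(1/t) * LSE(t * x): approaches max(x) from above as t grows."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(t * x))) / t

x = [1.0, 3.0, 5.0]
for t in (1.0, 10.0, 100.0):
    print(t, lse_scaled(x, t))  # tends to max(x) = 5.0 as t increases
</syntaxhighlight>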

Similarly, multiplying each argument by a negative number instead gives a corresponding comparison with the \min function:

\min{\{x_1, \dots, x_n\}} - \frac{\log(n)}{t} \leq \frac 1 {-t} \mathrm{LSE}(-tx_1, \dots, -tx_n) < \min{\{x_1, \dots, x_n\}}.

The LogSumExp function is convex, and is strictly increasing everywhere in its domain.{{cite book|url=http://livebooklabs.com/keeppies/c5a5868ce26b8125|title=Optimization Models and Applications|first=Laurent|last=El Ghaoui|year=2017}} It is not strictly convex, since it is affine (linear plus a constant) along the diagonal and along lines parallel to it:{{cite web|url=https://math.stackexchange.com/q/1189638 |title=convex analysis - About the strictly convexity of log-sum-exp function - Mathematics Stack Exchange|work=stackexchange.com}}

:\mathrm{LSE}(x_1 + c, \dots, x_n + c) =\mathrm{LSE}(x_1, \dots, x_n) + c.

Apart from this direction, it is strictly convex (the Hessian has rank {{tmath|n - 1}}), so, for example, restricting it to a hyperplane that is transverse to the diagonal results in a strictly convex function; see \mathrm{LSE}_0^+ below.

Writing \mathbf{x} = (x_1, \dots, x_n), the partial derivatives are:

\frac{\partial}{\partial x_i}{\mathrm{LSE}(\mathbf{x})} = \frac{\exp x_i}{\sum_j \exp {x_j}},

which means the gradient of LogSumExp is the softmax function.
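This can be verified numerically by comparing a finite-difference gradient of LSE with a directly computed softmax (a sketch; the helper names are illustrative, and the direct LSE form is adequate here because the inputs are small):

<syntaxhighlight lang="python">
import numpy as np

def lse(x):
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)))  # direct form; fine for small inputs

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())           # shift for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
eps = 1e-6
grad = np.array([(lse(x + eps * e_i) - lse(x - eps * e_i)) / (2 * eps)
                 for e_i in np.eye(len(x))])

print(grad)        # numerical gradient of LSE
print(softmax(x))  # agrees up to finite-difference error
</syntaxhighlight>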

The convex conjugate of LogSumExp is the negative entropy, restricted to the probability simplex.

Log-sum-exp trick for log-domain calculations

The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.{{Cite book|last=McElreath|first=Richard|url=http://worldcat.org/oclc/1107423386|title=Statistical Rethinking|oclc=1107423386}}

Just as multiplication in the linear domain becomes simple addition in the log domain, addition in the linear domain becomes the LSE in the log domain:

\mathrm{LSE}(\log(x_1), \dots, \log(x_n)) = \log(x_1 + \cdots + x_n)
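For example, adding two probabilities that are stored as log-probabilities amounts to an LSE of the two log-values; NumPy's logaddexp implements exactly this two-argument case (a minimal sketch):

<syntaxhighlight lang="python">
import numpy as np

log_p = np.log(1e-3)   # probabilities kept in the log domain
log_q = np.log(2e-3)

# log(p + q), computed without leaving the log domain
log_sum = np.logaddexp(log_p, log_q)

print(np.exp(log_sum))  # ≈ 0.003
</syntaxhighlight>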

A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating-point numbers.{{Cite web|title=Practical issues: Numeric stability.|url=https://cs231n.github.io/linear-classify/#softmax-classifier|website=CS231n Convolutional Neural Networks for Visual Recognition}}

However, evaluating LSE directly in this way can itself cause overflow or underflow. Therefore, the following equivalent form must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient):

\mathrm{LSE}(x_1, \dots, x_n) = x^* + \log\left( \exp(x_1-x^*)+ \cdots + \exp(x_n-x^*) \right)

where x^* = \max{\{x_1, \dots, x_n\}}.
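A sketch of this shifted computation (the name lse_stable is illustrative): subtracting x^* makes the largest argument of the exponential equal to zero, so the sum can neither overflow nor round entirely to zero. Library routines such as SciPy's scipy.special.logsumexp perform the same shift internally.

<syntaxhighlight lang="python">
import numpy as np

def lse_stable(x):
    """LogSumExp computed with the max-shift trick."""
    x = np.asarray(x, dtype=float)
    x_star = x.max()
    # exp(x - x_star) <= 1, so the sum cannot overflow,
    # and the largest term equals exp(0) = 1, so it cannot vanish.
    return x_star + np.log(np.sum(np.exp(x - x_star)))

x = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(x))))  # naive evaluation overflows to inf
print(lse_stable(x))              # ≈ 1002.4076
</syntaxhighlight>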

Many math libraries such as IT++ provide a default routine for computing LSE and use this formula internally.

A strictly convex log-sum-exp type function

LSE is convex but not strictly convex.

We can define a strictly convex log-sum-exp type function{{cite arXiv

| first1 = Frank | last1 = Nielsen

| first2 = Gaetan | last2 = Hadjeres

| year = 2018

| title = Monte Carlo Information Geometry: The dually flat case

| class = cs.LG

| eprint = 1803.07225}} by adding an extra argument set to zero:

\mathrm{LSE}_0^+(x_1, \dots, x_n) = \mathrm{LSE}(0, x_1, \dots, x_n)

This function is a proper Bregman generator (strictly convex and differentiable).

It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
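As a concrete sketch (reusing the stable helper from the log-sum-exp trick above; the name lse0_plus is illustrative), note that with a single argument it reduces to \log(1 + e^x), the softplus function:

<syntaxhighlight lang="python">
import numpy as np

def lse_stable(x):
    x = np.asarray(x, dtype=float)
    x_star = x.max()
    return x_star + np.log(np.sum(np.exp(x - x_star)))

def lse0_plus(x):
    """Strictly convex variant: LSE with an extra argument fixed at 0."""
    return lse_stable(np.append(0.0, x))

print(lse0_plus([2.0]))            # log(1 + e^2) ≈ 2.1269 (softplus of 2)
print(lse0_plus([1.0, 2.0, 3.0]))  # equals LSE(0, 1, 2, 3)
</syntaxhighlight>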

In tropical analysis, this is the sum in the log semiring.


References

{{Reflist}}


Category:Logarithms