LogSumExp

{{Short description|Smooth approximation to the maximum function}}

{{refimprove|date=August 2015}}

The LogSumExp (LSE) (also called RealSoftMax{{cite web |last1=Zhang |first1=Aston |last2=Lipton |first2=Zack |last3=Li |first3=Mu |last4=Smola |first4=Alex |title=Dive into Deep Learning, Chapter 3 Exercises |url=https://www.d2l.ai/chapter_linear-networks/softmax-regression.html#exercises |website=www.d2l.ai |accessdate=27 June 2020 |ref=d2lch3ex}} or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.{{cite journal

| first1 = Frank | last1 = Nielsen

| first2 = Ke | last2 = Sun

| year = 2016

| title = Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities

| arxiv = 1606.05850

| doi=10.3390/e18120442

| volume=18

| journal=Entropy

| issue = 12

| page=442

| bibcode=2016Entrp..18..442N

| s2cid = 17259055

| doi-access = free

}} It is defined as the logarithm of the sum of the exponentials of the arguments:

\mathrm{LSE}(x_1, \dots, x_n) = \log\left( \exp(x_1) + \cdots + \exp(x_n) \right).
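For illustration, a direct transcription of this definition into code might look like the following sketch (Python with NumPy is assumed; the name lse_naive is illustrative, not a standard routine). As discussed later in the article, this naive form can overflow for large arguments.

<syntaxhighlight lang="python">
import numpy as np

def lse_naive(x):
    """LogSumExp computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)))

# LSE(0, 1, 2) = log(e^0 + e^1 + e^2) ≈ 2.4076
print(lse_naive([0.0, 1.0, 2.0]))
</syntaxhighlight>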

Properties

The LogSumExp function domain is \R^n, the real coordinate space, and its codomain is \R, the real line.

It is an approximation to the maximum \max_i x_i, with the following bounds:

\max{\{x_1, \dots, x_n\}} \leq \mathrm{LSE}(x_1, \dots, x_n) \leq \max{\{x_1, \dots, x_n\}} + \log(n).

The first inequality is strict unless n = 1. The second inequality is strict unless all arguments are equal.

(Proof: Let m = \max_i x_i. Then \exp(m) \leq \sum_{i=1}^n \exp(x_i) \leq n \exp(m). Applying the logarithm to the inequality gives the result.)
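These bounds are easy to check numerically; the following sketch (Python with NumPy, variable names illustrative) compares the three quantities for a sample vector:

<syntaxhighlight lang="python">
import numpy as np

x = np.array([1.0, 3.0, 5.0])
lse = np.log(np.sum(np.exp(x)))

# max(x) <= LSE(x) <= max(x) + log(n)
print(x.max())                   # 5.0
print(lse)                       # ≈ 5.143
print(x.max() + np.log(len(x)))  # ≈ 6.099
</syntaxhighlight>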

In addition, we can scale the function to make the bounds tighter: for t > 0, consider the function \frac 1 t \mathrm{LSE}(tx_1, \dots, tx_n). Then

\max{\{x_1, \dots, x_n\}} < \frac 1 t \mathrm{LSE}(tx_1, \dots, tx_n) \leq \max{\{x_1, \dots, x_n\}} + \frac{\log(n)}{t}.

(Proof: Replacing each x_i with tx_i for some t>0 in the inequalities above gives

\max{\{tx_1, \dots, tx_n\}} < \mathrm{LSE}(tx_1, \dots, tx_n)\leq \max{\{tx_1, \dots, tx_n\}} + \log(n).

and, since t>0,

t \max{\{x_1, \dots, x_n\}} < \mathrm{LSE}(tx_1, \dots, tx_n)\leq t\max{\{x_1, \dots, x_n\}} + \log(n).

Finally, dividing by t gives the result.)
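A small numerical sketch of this tightening effect (the helper name lse_scaled is illustrative):

<syntaxhighlight lang="python">
import numpy as np

def lse_scaled(x, t):
    """(1/t) * LSE(t * x): approaches max(x) from above as t grows."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(t * x))) / t

x = [1.0, 3.0, 5.0]
for t in (1.0, 10.0, 100.0):
    print(t, lse_scaled(x, t))  # tends to max(x) = 5.0 as t increases
</syntaxhighlight>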

Similarly, multiplying each argument by a negative number instead gives a corresponding comparison with the \min function:

\min{\{x_1, \dots, x_n\}} - \frac{\log(n)}{t} \leq \frac 1 {-t} \mathrm{LSE}(-tx_1, \dots, -tx_n) < \min{\{x_1, \dots, x_n\}}.

The LogSumExp function is convex, and is strictly increasing everywhere in its domain.{{cite book|url=http://livebooklabs.com/keeppies/c5a5868ce26b8125|title=Optimization Models and Applications|first=Laurent|last=El Ghaoui|year=2017}} It is not strictly convex, since it is affine (linear plus a constant) along the diagonal and along lines parallel to it:{{cite web|url=https://math.stackexchange.com/q/1189638 |title=convex analysis - About the strictly convexity of log-sum-exp function - Mathematics Stack Exchange|work=stackexchange.com}}

:\mathrm{LSE}(x_1 + c, \dots, x_n + c) =\mathrm{LSE}(x_1, \dots, x_n) + c.

Apart from this direction, it is strictly convex (the Hessian has rank {{tmath|n - 1}}), so, for example, restricting it to a hyperplane that is transverse to the diagonal results in a strictly convex function; see \mathrm{LSE}_0^+ below.

Writing \mathbf{x} = (x_1, \dots, x_n), the partial derivatives are:

\frac{\partial}{\partial x_i}{\mathrm{LSE}(\mathbf{x})} = \frac{\exp x_i}{\sum_j \exp {x_j}},

which means the gradient of LogSumExp is the softmax function.
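This can be verified numerically by comparing a finite-difference gradient of LSE with a directly computed softmax (a sketch; the helper names are illustrative, and the direct LSE form is adequate here because the inputs are small):

<syntaxhighlight lang="python">
import numpy as np

def lse(x):
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)))  # direct form; fine for small inputs

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())           # shift for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
eps = 1e-6
grad = np.array([(lse(x + eps * e_i) - lse(x - eps * e_i)) / (2 * eps)
                 for e_i in np.eye(len(x))])

print(grad)        # numerical gradient of LSE
print(softmax(x))  # agrees up to finite-difference error
</syntaxhighlight>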

The convex conjugate of LogSumExp is the negative entropy, restricted to the probability simplex.

Log-sum-exp trick for log-domain calculations

The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.{{Cite book|last=McElreath|first=Richard|url=http://worldcat.org/oclc/1107423386|title=Statistical Rethinking|oclc=1107423386}}

Just as multiplication in the linear domain becomes simple addition in the log domain, addition in the linear domain becomes the LSE in the log domain:

\mathrm{LSE}(\log(x_1), \dots, \log(x_n)) = \log(x_1 + \cdots + x_n)
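For example, adding two probabilities that are stored as log-probabilities amounts to an LSE of the two log-values; NumPy's logaddexp implements exactly this two-argument case (a minimal sketch):

<syntaxhighlight lang="python">
import numpy as np

log_p = np.log(1e-3)   # probabilities kept in the log domain
log_q = np.log(2e-3)

# log(p + q), computed without leaving the log domain
log_sum = np.logaddexp(log_p, log_q)

print(np.exp(log_sum))  # ≈ 0.003
</syntaxhighlight>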

A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating-point numbers.{{Cite web|title=Practical issues: Numeric stability.|url=https://cs231n.github.io/linear-classify/#softmax-classifier|website=CS231n Convolutional Neural Networks for Visual Recognition}}

However, evaluating LSE directly in this way can itself cause overflow or underflow. Therefore, the following equivalent form must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient):

\mathrm{LSE}(x_1, \dots, x_n) = x^* + \log\left( \exp(x_1-x^*)+ \cdots + \exp(x_n-x^*) \right)

where x^* = \max{\{x_1, \dots, x_n\}}.
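A sketch of this shifted computation (the name lse_stable is illustrative): subtracting x^* makes the largest argument of the exponential equal to zero, so the sum can neither overflow nor round entirely to zero. Library routines such as SciPy's scipy.special.logsumexp perform the same shift internally.

<syntaxhighlight lang="python">
import numpy as np

def lse_stable(x):
    """LogSumExp computed with the max-shift trick."""
    x = np.asarray(x, dtype=float)
    x_star = x.max()
    # exp(x - x_star) <= 1, so the sum cannot overflow,
    # and the largest term equals exp(0) = 1, so it cannot vanish.
    return x_star + np.log(np.sum(np.exp(x - x_star)))

x = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(x))))  # naive evaluation overflows to inf
print(lse_stable(x))              # ≈ 1002.4076
</syntaxhighlight>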

Many math libraries such as IT++ provide a default routine for computing LSE and use this formula internally.

A strictly convex log-sum-exp type function

LSE is convex but not strictly convex.

We can define a strictly convex log-sum-exp type function{{cite arXiv

| first1 = Frank | last1 = Nielsen

| first2 = Gaetan | last2 = Hadjeres

| year = 2018

| title = Monte Carlo Information Geometry: The dually flat case

| class = cs.LG

| eprint = 1803.07225}} by adding an extra argument set to zero:

\mathrm{LSE}_0^+(x_1, \dots, x_n) = \mathrm{LSE}(0, x_1, \dots, x_n)

This function is a proper Bregman generator (strictly convex and differentiable).

It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
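As a concrete sketch (reusing the stable helper from the log-sum-exp trick above; the name lse0_plus is illustrative), note that with a single argument it reduces to \log(1 + e^x), the softplus function:

<syntaxhighlight lang="python">
import numpy as np

def lse_stable(x):
    x = np.asarray(x, dtype=float)
    x_star = x.max()
    return x_star + np.log(np.sum(np.exp(x - x_star)))

def lse0_plus(x):
    """Strictly convex variant: LSE with an extra argument fixed at 0."""
    return lse_stable(np.append(0.0, x))

print(lse0_plus([2.0]))            # log(1 + e^2) ≈ 2.1269 (softplus of 2)
print(lse0_plus([1.0, 2.0, 3.0]))  # equals LSE(0, 1, 2, 3)
</syntaxhighlight>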

In tropical analysis, this is the sum in the log semiring.


References

{{Reflist}}


Category:Logarithms