Smooth maximum

{{Short description|Mathematical approximation}}

In mathematics, a smooth maximum of an indexed family x_1, \ldots, x_n of numbers is a smooth approximation to the maximum function \max(x_1,\ldots,x_n), meaning a parametric family of functions m_\alpha(x_1,\ldots,x_n) such that for every {{mvar|α}}, the function {{tmath|m_\alpha}} is smooth, and the family converges to the maximum function {{tmath|m_\alpha \to \max}} as {{tmath|\alpha\to\infty}}. The concept of smooth minimum is similarly defined. In many cases, a single family approximates both: the maximum as the parameter goes to positive infinity, the minimum as the parameter goes to negative infinity; in symbols, {{tmath|m_\alpha \to \max}} as {{tmath|\alpha \to \infty}} and {{tmath|m_\alpha \to \min}} as {{tmath|\alpha \to -\infty}}. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.

Examples

= Boltzmann operator =

[[File:Smoothmax.png|thumb]]

For large positive values of the parameter \alpha > 0, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.

:\mathcal{S}_\alpha (x_1,\ldots,x_n) = \frac{\sum_{i=1}^n x_i e^{\alpha x_i}}{\sum_{i=1}^n e^{\alpha x_i}}

\mathcal{S}_\alpha has the following properties:

  1. \mathcal{S}_\alpha\to \max as \alpha\to\infty
  2. \mathcal{S}_0 is the arithmetic mean of its inputs
  3. \mathcal{S}_\alpha\to \min as \alpha\to -\infty

The gradient of \mathcal{S}_{\alpha} is closely related to softmax and is given by

:\nabla_{x_i}\mathcal{S}_\alpha (x_1,\ldots,x_n) = \frac{e^{\alpha x_i}}{\sum_{j=1}^n e^{\alpha x_j}} [1 + \alpha(x_i - \mathcal{S}_\alpha (x_1,\ldots,x_n))].

Because this gradient is expressed in terms of the softmax function, the smooth maximum is convenient for optimization techniques that use gradient descent.

This operator is sometimes called the Boltzmann operator,{{cite journal |last1=Asadi |first1=Kavosh |last2=Littman |first2=Michael L. |author-link2=Michael L. Littman |date=2017 |title=An Alternative Softmax Operator for Reinforcement Learning |url=https://proceedings.mlr.press/v70/asadi17a.html |journal=PMLR |volume=70 |pages=243–252 |arxiv=1612.05628 |access-date=January 6, 2023}} after the Boltzmann distribution.
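The following is a minimal numerical sketch of the Boltzmann operator and the gradient formula above, written in Python with NumPy for illustration; the function names <code>boltzmann_operator</code> and <code>boltzmann_gradient</code> are not standard, and the shift by the largest input is only a numerical-stability convenience.

<syntaxhighlight lang="python">
import numpy as np

def boltzmann_operator(x, alpha):
    """Smooth maximum S_alpha(x): a softmax-weighted average of the inputs."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    z = z - z.max()          # shifting cancels in the ratio; prevents overflow
    w = np.exp(z)
    return np.sum(w * x) / np.sum(w)

def boltzmann_gradient(x, alpha):
    """Partial derivatives of S_alpha: softmax_i * (1 + alpha * (x_i - S_alpha))."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    z = z - z.max()
    w = np.exp(z)
    softmax = w / w.sum()
    s = np.sum(softmax * x)
    return softmax * (1.0 + alpha * (x - s))
</syntaxhighlight>

For inputs (1, 2, 3), <code>boltzmann_operator</code> returns the arithmetic mean 2 at \alpha = 0 and values approaching 3 as \alpha grows, in line with the properties listed above.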

= LogSumExp =

{{main|LogSumExp}}

Another smooth maximum is LogSumExp:

:\mathrm{LSE}_\alpha(x_1, \ldots, x_n) = \frac{1}{\alpha} \log \sum_{i=1}^n \exp \alpha x_i

This can also be normalized if the x_i are all non-negative, yielding a function with domain [0,\infty)^n and range [0, \infty):

:g(x_1, \ldots, x_n) = \log \left( \sum_{i=1}^n \exp x_i - (n-1) \right)

The (n - 1) term corrects for the fact that \exp(0) = 1: it cancels all but one of the zero exponentials, so that g(0, \ldots, 0) = \log 1 = 0.
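As an illustrative sketch (not taken from the sources above), LogSumExp and the normalized variant g can be computed as follows in Python; the usual max-shift is applied so that the exponentials cannot overflow, and \alpha is assumed to be nonzero.

<syntaxhighlight lang="python">
import numpy as np

def lse(x, alpha=1.0):
    """LogSumExp smooth maximum: (1/alpha) * log(sum_i exp(alpha * x_i)), alpha != 0."""
    z = alpha * np.asarray(x, dtype=float)
    m = z.max()
    # Factor exp(m) out of the sum so that no exponential overflows.
    return (m + np.log(np.sum(np.exp(z - m)))) / alpha

def lse_normalized(x):
    """Normalized variant g for non-negative inputs: log(sum_i exp(x_i) - (n - 1))."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)) - (len(x) - 1))
</syntaxhighlight>

For \alpha > 0, LogSumExp overestimates the true maximum, lying between \max(x_1, \ldots, x_n) and \max(x_1, \ldots, x_n) + \tfrac{\log n}{\alpha}, while <code>lse_normalized</code> returns exactly 0 when all inputs are 0.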

= Mellowmax =

The mellowmax operator is defined as follows:

:\mathrm{mm}_\alpha(x) = \frac{1}{\alpha} \log \frac{1}{n} \sum_{i=1}^n \exp \alpha x_i

It is a non-expansive operator. As \alpha \to \infty, it acts like a maximum. As \alpha \to 0, it acts like an arithmetic mean. As \alpha \to -\infty, it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information-theoretic principles as a way of regularizing policies with a cost function defined by the KL divergence. The operator has previously been used in other areas, such as power engineering.{{cite journal |last1=Safak |first1=Aysel |date=February 1993 |title=Statistical analysis of the power sum of multiple correlated log-normal components |url=https://ieeexplore.ieee.org/document/192387 |journal=IEEE Transactions on Vehicular Technology |volume=42 |issue=1 |pages=58–61 |doi=10.1109/25.192387 |access-date=January 6, 2023}}

= p-Norm =

{{main|P-norm}}

Another smooth maximum is the p-norm:

:\| (x_1, \ldots, x_n) \|_p = \left( \sum_{i=1}^n |x_i|^p \right)^\frac{1}{p}

which converges to \| (x_1, \ldots, x_n) \|_\infty = \max_{1\leq i\leq n} |x_i| as p \to \infty.

An advantage of the p-norm is that it is a norm. As such it is scale invariant (homogeneous): \| (\lambda x_1, \ldots, \lambda x_n) \|_p = |\lambda| \cdot \| (x_1, \ldots, x_n) \|_p , and it satisfies the triangle inequality.
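The convergence of the p-norm to the maximum absolute value can be checked numerically with the following Python sketch (illustrative only); dividing by the largest entry keeps the powers from overflowing for large p.

<syntaxhighlight lang="python">
import numpy as np

def p_norm(x, p):
    """p-norm of a vector: (sum_i |x_i|^p)^(1/p)."""
    x = np.abs(np.asarray(x, dtype=float))
    m = x.max()
    if m == 0.0:
        return 0.0
    # Factor out the largest entry so that x/m <= 1 and large p cannot overflow.
    return m * np.sum((x / m) ** p) ** (1.0 / p)

# p_norm([1, -2, 3], 2)   -> about 3.74 (Euclidean norm)
# p_norm([1, -2, 3], 100) -> about 3.00, close to max |x_i| = 3
</syntaxhighlight>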

= Smooth maximum unit =

The following binary operator is called the Smooth Maximum Unit (SMU):{{Cite arXiv|eprint = 2111.04682|last1 = Biswas|first1 = Koushik|last2 = Kumar|first2 = Sandeep|last3 = Banerjee|first3 = Shilpak|author4 = Ashish Kumar Pandey|title = SMU: Smooth activation function for deep networks using smoothing maximum technique|year = 2021| class=cs.LG }}

:\begin{align}
\textstyle\max_\varepsilon(a, b) &= \frac{a + b + |a - b|_\varepsilon}{2} \\
&= \frac{a + b + \sqrt{(a - b)^2 + \varepsilon}}{2}
\end{align}

where \varepsilon \geq 0 is a parameter. As \varepsilon \to 0, |\cdot|_\varepsilon \to |\cdot| and thus \textstyle\max_\varepsilon \to \max.
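A direct Python sketch of the smooth maximum unit (illustrative; not taken from the cited paper's implementation):

<syntaxhighlight lang="python">
import math

def smu(a, b, eps=1e-2):
    """Smooth Maximum Unit: (a + b + sqrt((a - b)**2 + eps)) / 2.

    eps = 0 gives exactly max(a, b); a small eps > 0 smooths the kink at a == b."""
    return 0.5 * (a + b + math.sqrt((a - b) ** 2 + eps))

# smu(1.0, 2.0, eps=0.0)  -> 2.0   (exact maximum)
# smu(1.0, 1.0, eps=0.01) -> 1.05  (slightly above the maximum near a == b)
</syntaxhighlight>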


References

{{Reflist}}

Category:Mathematical notation

Category:Basic concepts in set theory

John D. Cook, "Soft maximum", https://www.johndcook.com/soft_maximum.pdf

M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization", in Proc. ESANN, April 2014, pp. 271–276. https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf