Smooth maximum

{{Short description|Mathematical approximation}}

In mathematics, a smooth maximum of an indexed family x_1, \ldots, x_n of numbers is a smooth approximation to the maximum function \max(x_1,\ldots,x_n), meaning a parametric family of functions m_\alpha(x_1,\ldots,x_n) such that for every {{mvar|α}}, the function {{tmath|m_\alpha}} is smooth, and the family converges to the maximum function {{tmath|m_\alpha \to \max}} as {{tmath|\alpha\to\infty}}. The concept of smooth minimum is similarly defined. In many cases, a single family approximates both: the maximum as the parameter goes to positive infinity, the minimum as the parameter goes to negative infinity; in symbols, {{tmath|m_\alpha \to \max}} as {{tmath|\alpha \to \infty}} and {{tmath|m_\alpha \to \min}} as {{tmath|\alpha \to -\infty}}. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.

Examples

= Boltzmann operator =

[[File:Smoothmax.png|thumb]]

For large positive values of the parameter \alpha > 0, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.

:\mathcal{S}_\alpha (x_1,\ldots,x_n) = \frac{\sum_{i=1}^n x_i e^{\alpha x_i}}{\sum_{i=1}^n e^{\alpha x_i}}

\mathcal{S}_\alpha has the following properties:

  1. \mathcal{S}_\alpha\to \max as \alpha\to\infty
  2. \mathcal{S}_0 is the arithmetic mean of its inputs
  3. \mathcal{S}_\alpha\to \min as \alpha\to -\infty

The gradient of \mathcal{S}_{\alpha} is closely related to softmax and is given by

:\nabla_{x_i}\mathcal{S}_\alpha (x_1,\ldots,x_n) = \frac{e^{\alpha x_i}}{\sum_{j=1}^n e^{\alpha x_j}} [1 + \alpha(x_i - \mathcal{S}_\alpha (x_1,\ldots,x_n))].

Because this gradient is expressed in terms of the softmax function, the smooth maximum is convenient for optimization techniques that use gradient descent.

This operator is sometimes called the Boltzmann operator,{{cite journal |last1=Asadi |first1=Kavosh |last2=Littman |first2=Michael L. |author-link2=Michael L. Littman |date=2017 |title=An Alternative Softmax Operator for Reinforcement Learning |url=https://proceedings.mlr.press/v70/asadi17a.html |journal=PMLR |volume=70 |pages=243–252 |arxiv=1612.05628 |access-date=January 6, 2023}} after the Boltzmann distribution.
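The following is a minimal numerical sketch of the Boltzmann operator and the gradient formula above, written in Python with NumPy for illustration; the function names <code>boltzmann_operator</code> and <code>boltzmann_gradient</code> are not standard, and the shift by the largest input is only a numerical-stability convenience.

<syntaxhighlight lang="python">
import numpy as np

def boltzmann_operator(x, alpha):
    """Smooth maximum S_alpha(x): a softmax-weighted average of the inputs."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    z = z - z.max()          # shifting cancels in the ratio; prevents overflow
    w = np.exp(z)
    return np.sum(w * x) / np.sum(w)

def boltzmann_gradient(x, alpha):
    """Partial derivatives of S_alpha: softmax_i * (1 + alpha * (x_i - S_alpha))."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    z = z - z.max()
    w = np.exp(z)
    softmax = w / w.sum()
    s = np.sum(softmax * x)
    return softmax * (1.0 + alpha * (x - s))
</syntaxhighlight>

For inputs (1, 2, 3), <code>boltzmann_operator</code> returns the arithmetic mean 2 at \alpha = 0 and values approaching 3 as \alpha grows, in line with the properties listed above.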

= LogSumExp =

{{main|LogSumExp}}

Another smooth maximum is LogSumExp:

:\mathrm{LSE}_\alpha(x_1, \ldots, x_n) = \frac{1}{\alpha} \log \sum_{i=1}^n \exp \alpha x_i

This can also be normalized if the x_i are all non-negative, yielding a function with domain [0,\infty)^n and range [0, \infty):

:g(x_1, \ldots, x_n) = \log \left( \sum_{i=1}^n \exp x_i - (n-1) \right)

The (n - 1) term corrects for the fact that \exp(0) = 1: it cancels all but one of the zero exponentials, so that g(0, \ldots, 0) = \log 1 = 0.
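As an illustrative sketch (not taken from the sources above), LogSumExp and the normalized variant g can be computed as follows in Python; the usual max-shift is applied so that the exponentials cannot overflow, and \alpha is assumed to be nonzero.

<syntaxhighlight lang="python">
import numpy as np

def lse(x, alpha=1.0):
    """LogSumExp smooth maximum: (1/alpha) * log(sum_i exp(alpha * x_i)), alpha != 0."""
    z = alpha * np.asarray(x, dtype=float)
    m = z.max()
    # Factor exp(m) out of the sum so that no exponential overflows.
    return (m + np.log(np.sum(np.exp(z - m)))) / alpha

def lse_normalized(x):
    """Normalized variant g for non-negative inputs: log(sum_i exp(x_i) - (n - 1))."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)) - (len(x) - 1))
</syntaxhighlight>

For \alpha > 0, LogSumExp overestimates the true maximum, lying between \max(x_1, \ldots, x_n) and \max(x_1, \ldots, x_n) + \tfrac{\log n}{\alpha}, while <code>lse_normalized</code> returns exactly 0 when all inputs are 0.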

= Mellowmax =

The mellowmax operator is defined as follows:

:\mathrm{mm}_\alpha(x) = \frac{1}{\alpha} \log \frac{1}{n} \sum_{i=1}^n \exp \alpha x_i

It is a non-expansive operator. As \alpha \to \infty, it acts like a maximum. As \alpha \to 0, it acts like an arithmetic mean. As \alpha \to -\infty, it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information-theoretic principles as a way of regularizing policies with a cost function defined by the KL divergence. The operator has previously been used in other areas, such as power engineering.{{cite journal |last1=Safak |first1=Aysel |date=February 1993 |title=Statistical analysis of the power sum of multiple correlated log-normal components |url=https://ieeexplore.ieee.org/document/192387 |journal=IEEE Transactions on Vehicular Technology |volume=42 |issue=1 |pages=58–61 |doi=10.1109/25.192387 |access-date=January 6, 2023}}

= p-Norm =

{{main|P-norm}}

Another smooth maximum is the p-norm:

:\| (x_1, \ldots, x_n) \|_p = \left( \sum_{i=1}^n |x_i|^p \right)^\frac{1}{p}

which converges to \| (x_1, \ldots, x_n) \|_\infty = \max_{1\leq i\leq n} |x_i| as p \to \infty.

An advantage of the p-norm is that it is a norm. As such it is scale invariant (homogeneous): \| (\lambda x_1, \ldots, \lambda x_n) \|_p = |\lambda| \cdot \| (x_1, \ldots, x_n) \|_p , and it satisfies the triangle inequality.
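The convergence of the p-norm to the maximum absolute value can be checked numerically with the following Python sketch (illustrative only); dividing by the largest entry keeps the powers from overflowing for large p.

<syntaxhighlight lang="python">
import numpy as np

def p_norm(x, p):
    """p-norm of a vector: (sum_i |x_i|^p)^(1/p)."""
    x = np.abs(np.asarray(x, dtype=float))
    m = x.max()
    if m == 0.0:
        return 0.0
    # Factor out the largest entry so that x/m <= 1 and large p cannot overflow.
    return m * np.sum((x / m) ** p) ** (1.0 / p)

# p_norm([1, -2, 3], 2)   -> about 3.74 (Euclidean norm)
# p_norm([1, -2, 3], 100) -> about 3.00, close to max |x_i| = 3
</syntaxhighlight>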

= Smooth maximum unit =

The following binary operator is called the Smooth Maximum Unit (SMU):{{Cite arXiv|eprint = 2111.04682|last1 = Biswas|first1 = Koushik|last2 = Kumar|first2 = Sandeep|last3 = Banerjee|first3 = Shilpak|author4 = Ashish Kumar Pandey|title = SMU: Smooth activation function for deep networks using smoothing maximum technique|year = 2021| class=cs.LG }}

:\begin{align}
\textstyle\max_\varepsilon(a, b) &= \frac{a + b + |a - b|_\varepsilon}{2} \\
&= \frac{a + b + \sqrt{(a - b)^2 + \varepsilon}}{2}
\end{align}

where \varepsilon \geq 0 is a parameter. As \varepsilon \to 0, |\cdot|_\varepsilon \to |\cdot| and thus \textstyle\max_\varepsilon \to \max.
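A direct Python sketch of the smooth maximum unit (illustrative; not taken from the cited paper's implementation):

<syntaxhighlight lang="python">
import math

def smu(a, b, eps=1e-2):
    """Smooth Maximum Unit: (a + b + sqrt((a - b)**2 + eps)) / 2.

    eps = 0 gives exactly max(a, b); a small eps > 0 smooths the kink at a == b."""
    return 0.5 * (a + b + math.sqrt((a - b) ** 2 + eps))

# smu(1.0, 2.0, eps=0.0)  -> 2.0   (exact maximum)
# smu(1.0, 1.0, eps=0.01) -> 1.05  (slightly above the maximum near a == b)
</syntaxhighlight>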


References

{{Reflist}}

Category:Mathematical notation

Category:Basic concepts in set theory

John D. Cook, "Soft maximum", https://www.johndcook.com/soft_maximum.pdf

M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization", in Proc. ESANN, April 2014, pp. 271–276. https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf