Sub-Gaussian distribution

{{Short description|Probability distribution whose tails decay at least as fast as a Gaussian's}}

In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.

Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, and then analyze each separately. In probability, this division usually goes like "everything interesting happens near the center; the tail event is so rare that it may safely be ignored." Subgaussian distributions are worth studying because the gaussian distribution is well understood, which allows sharp bounds on the rarity of the tail event. Subexponential distributions are studied for similar reasons.

Formally, the probability distribution of a random variable X is called subgaussian if there is a positive constant C such that for every t \geq 0,

:\operatorname{P}(|X| \geq t) \leq 2 \exp{(-t^2/C^2)}.

There are many equivalent definitions. For example, a random variable X is sub-Gaussian iff its distribution function is bounded from above (up to a constant) by the distribution function of a Gaussian:

:P(|X| \geq t) \leq cP(|Z| \geq t) \quad \forall t > 0

where c \ge 0 is a constant and Z is a mean-zero Gaussian random variable.{{r|Wainwright2019|at=Theorem 2.6}}

Definitions

= Subgaussian norm =

The subgaussian norm of X, denoted as \Vert X \Vert_{\psi_2}, is
:\Vert X \Vert_{\psi_2} = \inf\left\{ c>0 : \operatorname{E}\left[\exp{\left(\frac{X^2}{c^2}\right)}\right] \leq 2 \right\}.
In other words, it is the Orlicz norm of X generated by the Orlicz function \Phi(u)=e^{u^2}-1. By condition (2) below, subgaussian random variables can be characterized as those random variables with finite subgaussian norm.
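
For distributions without a closed-form Orlicz norm, \Vert X \Vert_{\psi_2} can be approximated numerically. The following sketch (assuming NumPy and SciPy are available; the helper name orlicz_expectation is only illustrative) solves \operatorname{E}[\exp(X^2/c^2)] = 2 for the uniform distribution U(0,1) by root-finding:

<syntaxhighlight lang="python">
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def orlicz_expectation(c):
    """E[exp(X^2 / c^2)] for X ~ Uniform(0, 1)."""
    value, _ = quad(lambda x: np.exp(x**2 / c**2), 0.0, 1.0)
    return value

# E[exp(X^2/c^2)] decreases in c, so the norm is the root of E[...] - 2 = 0.
psi2_norm = brentq(lambda c: orlicz_expectation(c) - 2.0, 0.5, 2.0)
print(psi2_norm)  # about 0.7727
</syntaxhighlight>

The value agrees with the uniform row in the table of examples later in the article; the symmetric uniform distribution on [-1, 1] has the same norm, since the norm depends only on the distribution of |X|.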

= Variance proxy =

If there exists some s^2 such that \operatorname{E} [e^{(X-\operatorname{E}[X])t}] \leq e^{\frac{s^2t^2}{2}} for all t, then s^2 is called a variance proxy, and the smallest such s^2 is called the optimal variance proxy and denoted by \Vert X\Vert_{\mathrm{vp}}^2.

Since \operatorname{E} [e^{(X-\operatorname{E}[X])t}] = e^{\frac{\sigma^2 t^2}{2}} when X \sim \mathcal{N}(\mu, \sigma^2) is Gaussian, we then have \Vert X\Vert_{\mathrm{vp}}^2 = \sigma^2, as expected.
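
When the cumulant generating function is not quadratic, the optimal variance proxy can be estimated numerically as \sup_{t \neq 0} 2\ln \operatorname{E}[e^{t(X-\operatorname{E}[X])}]/t^2. The sketch below (assuming NumPy; the grid search only gives a lower bound) does this for a centered Bernoulli distribution and compares it with the closed form quoted in the examples later in the article:

<syntaxhighlight lang="python">
import numpy as np

# Centered Bernoulli: X = q with probability p, and X = -p with probability q = 1 - p.
p = 0.3
q = 1.0 - p

def log_mgf(t):
    """ln E[exp(t X)] for the centered Bernoulli distribution above."""
    return np.log(p * np.exp(t * q) + q * np.exp(-t * p))

# Grid supremum of 2 ln E[exp(tX)] / t^2 (the supremum is attained at t > 0 here).
ts = np.linspace(0.05, 30.0, 20000)
grid_sup = np.max(2.0 * log_mgf(ts) / ts**2)

closed_form = (p - q) / (2.0 * (np.log(p) - np.log(q)))
print(grid_sup, closed_form)  # both are approximately 0.236
print(p * q)                  # the variance 0.21 is strictly smaller
</syntaxhighlight>

Since the optimal variance proxy exceeds the variance pq, this distribution is subgaussian but not strictly subgaussian in the sense defined below.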

= Equivalent definitions =

Let X be a random variable. Let K_1, K_2, K_3, \dots be positive constants. The following conditions are equivalent: (Proposition 2.5.2 {{Cite book |last=Vershynin |first=R. |title=High-dimensional probability: An introduction with applications in data science |publisher=Cambridge University Press |year=2018 |location=Cambridge}})

  1. Tail probability bound: \operatorname{P}(|X| \geq t) \leq 2 \exp{(-t^2/K_1^2)} for all t \geq 0;
  2. Finite subgaussian norm: \Vert X \Vert_{\psi_2} = K_2 < \infty;
  3. Moment: \operatorname{E} |X|^p \leq 2K_3^p \Gamma\left(\frac{p}{2}+1\right) for all p \geq 1, where \Gamma is the Gamma function;
  4. Moment: \operatorname{E}|X|^p\leq K_4^p p^{p/2} for all p \geq 1;
  5. Moment-generating function (of X), or variance proxy{{R|kahane}}{{R|buldygin}}: \operatorname{E} [e^{(X-\operatorname{E}[X])t}] \leq e^{\frac{K_5^2t^2}{2}} for all t;
  6. Moment-generating function (of X^2): \operatorname{E}[e^{X^2t^2}] \leq e^{K_6^2t^2} for all t \in [-1/K_6, +1/K_6];
  7. Union bound: for some c > 0, \operatorname{E}[\max\{|X_1 - \operatorname{E}[X]|,\ldots,|X_n - \operatorname{E}[X]|\}] \leq c \sqrt{\log n} for all n > c, where X_1, \ldots, X_n are i.i.d. copies of X;
  8. Subexponential: X^2 has a subexponential distribution.

Furthermore, the constant K is the same in the definitions (1) to (5), up to an absolute constant. So for example, given a random variable satisfying (1) and (2), the minimal constants K_1, K_2 in the two definitions satisfy K_1 \leq cK_2, K_2 \leq c' K_1, where c, c' are constants independent of the random variable.

= Proof of equivalence =

As an example, the proof below shows that the first four definitions are equivalent.

Proof. (1)\implies(3): By the layer cake representation,
:\operatorname{E} |X|^p = \int_0^\infty \operatorname{P}(|X|^p \geq t) dt = \int_0^\infty pt^{p-1}\operatorname{P}(|X| \geq t) dt \leq 2\int_0^\infty pt^{p-1}\exp\left(-\frac{t^2}{K_1^2}\right) dt.
After a change of variables u=t^2/K_1^2, we find that
:\operatorname{E} |X|^p \leq 2K_1^p \frac{p}{2}\int_0^\infty u^{\frac{p}{2}-1}e^{-u} du = 2K_1^p \frac{p}{2}\Gamma\left(\frac{p}{2}\right) = 2K_1^p \Gamma\left(\frac{p}{2}+1\right).

(3)\implies(2): By the Taylor series e^x = 1 + \sum_{p=1}^\infty \frac{x^p}{p!},
:\operatorname{E}[\exp{(\lambda X^2)}] = 1 + \sum_{p=1}^\infty \frac{\lambda^p \operatorname{E}{[X^{2p}]}}{p!} \leq 1 + \sum_{p=1}^\infty \frac{2\lambda^p K_3^{2p} \Gamma\left(p+1\right)}{p!} = 1 + 2 \sum_{p=1}^\infty \lambda^p K_3^{2p} = 2 \sum_{p=0}^\infty \lambda^p K_3^{2p}-1 = \frac{2}{1-\lambda K_3^2}-1 \quad\text{for }\lambda K_3^2 <1,
which is less than or equal to 2 for \lambda \leq \frac{1}{3K_3^2}. Thus for any K_2 \geq 3^{\frac{1}{2}}K_3, we have \operatorname{E}[\exp{(X^2/K_2^2)}] \leq 2.

(2)\implies(1): By Markov's inequality,
:\operatorname{P}(|X|\geq t) = \operatorname{P}\left( \exp\left(\frac{X^2}{K_2^2}\right) \geq \exp\left(\frac{t^2}{K_2^2}\right) \right) \leq \frac{\operatorname{E}[\exp{(X^2/K_2^2)}]}{\exp\left(\frac{t^2}{K_2^2}\right)} \leq 2 \exp\left(-\frac{t^2}{K_2^2}\right).

(3)\iff(4): By the asymptotic formula for the gamma function, \Gamma(p/2 + 1 ) \sim \sqrt{\pi p} \left(\frac{p}{2e} \right)^{p/2}.

From the proof, we can extract a cycle of three inequalities:

  • If \operatorname{P}(|X| \geq t) \leq 2 \exp{(-t^2/K^2)}, then \operatorname{E} |X|^p \leq 2K^p \Gamma\left(\frac{p}{2}+1\right) for all p \geq 1.
  • If \operatorname{E} |X|^p \leq 2K^p \Gamma\left(\frac{p}{2}+1\right) for all p \geq 1, then \|X \|_{\psi_2} \leq 3^{\frac{1}{2}}K.
  • If \|X \|_{\psi_2} \leq K, then \operatorname{P}(|X| \geq t) \leq 2 \exp{(-t^2/K^2)}.

In particular, the constants K provided by these definitions agree up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of X.

Similarly, because \Gamma(p/2 + 1) = p^{p/2} \left((2e)^{-1/2}p^{1/(2p)}\right)^p up to a positive multiplicative constant for all p \geq 1, and the factor (2e)^{-1/2}p^{1/(2p)} is bounded above and below by positive constants, the definitions (3) and (4) are also equivalent up to a constant.
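
As a concrete instance of the last implication (a numerical sketch assuming SciPy is available), the standard normal distribution has \|X\|_{\psi_2} = \sqrt{8/3} (see the table of examples below), so its tails must lie below 2\exp(-3t^2/8):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

K_squared = 8.0 / 3.0  # squared subgaussian norm of N(0, 1)
for t in (1.0, 2.0, 3.0, 4.0):
    exact_tail = 2.0 * norm.sf(t)            # exact P(|X| >= t)
    bound = 2.0 * np.exp(-t**2 / K_squared)  # 2 exp(-t^2 / K^2)
    print(t, exact_tail, bound)              # the exact tail stays below the bound
</syntaxhighlight>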

Basic properties

{{Math theorem

| name = Basic properties

| math_statement =
  • If X is subgaussian and k > 0, then \|kX\|_{\psi_2} = k \|X\|_{\psi_2} and \|kX\|_{vp} = k \|X\|_{vp}.
  • (Triangle inequality) If X, Y are subgaussian, then \|X+Y\|_{vp}^2 \leq (\|X\|_{vp} + \|Y\|_{vp})^2.
  • (Chernoff bound) If X is subgaussian, then \Pr(X - \operatorname{E}[X] \geq t) \leq e^{-\frac{t^2}{2\|X\|_{vp}^2}} for all t \geq 0.

}}

X \lesssim X' means that X \leq CX', where the positive constant C is independent of X and X'.

{{Math theorem

| math_statement = If X is subgaussian, then \|X - E[X]\|_{\psi_2} \lesssim \|X\|_{\psi_2}

| name = Subgaussian deviation bound

}}

{{Math proof|title=Proof|proof= By the triangle inequality, \|X - E[X]\|_{\psi_2} \leq \|X\|_{\psi_2} + \|E[X]\|_{\psi_2}. Since E[X] is a constant, \|E[X]\|_{\psi_2} = \frac{|E[X]|}{\sqrt{\ln 2}} \leq \frac{E[|X|]}{\sqrt{\ln 2}} \sim E[|X|]. By the equivalence of definitions (2) and (4) of subgaussianity, we have E[|X|] \lesssim \|X\|_{\psi_2}.}}

{{Math theorem

| math_statement = If X, Y are subgaussian and independent, then \|X+Y\|_{vp}^2 \leq \|X\|_{vp}^2 + \|Y\|_{vp}^2

| name = Independent subgaussian sum bound

}}

{{Math proof|title=Proof|proof= If X and Y are independent, use the fact that the cumulant generating function of independent random variables is additive. That is, \ln \operatorname{E}[e^{t(X+Y)}] = \ln \operatorname{E}[e^{tX}] + \ln \operatorname{E}[e^{tY}].

If X and Y are not independent (this case yields the triangle inequality above), then by Hölder's inequality, for any 1/p + 1/q = 1 we have
:E[e^{t(X+Y)}] = \|e^{t(X+Y)}\|_1 \leq e^{\frac 12 t^2 (p \|X\|_{vp}^2 + q \|Y\|_{vp}^2)}.
Minimizing p \|X\|_{vp}^2 + q \|Y\|_{vp}^2 over all p, q with 1/p + 1/q = 1 gives the value (\|X\|_{vp} + \|Y\|_{vp})^2, and we obtain the result.}}

{{Math theorem

| math_statement = Linear sums of subgaussian random variables are subgaussian.

| name = Corollary

}}

{{Math theorem

| math_statement = If E[X] = 0, E[X^2] =1, and -\ln Pr(X \geq t) \geq \frac 12 at^2 for all t >0, then \ln E[e^{tX}] \leq C_at^2 where C_a > 0 depends on a only.

| name = Partial converse

| note = Matoušek 2008, Lemma 2.4

}}

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=Proof}}

{{Math proof|title=Proof|proof=

Let F(x) be the CDF of X. We split the integral defining the MGF into two halves, one over tX \leq 1 and one over tX > 1, and bound each separately:
:E[e^{tX}] = \int_\R e^{tx} dF(x) = \int_{-\infty}^{1/t} e^{tx} dF(x) + \int_{1/t}^{+\infty} e^{tx} dF(x).

Since e^x \leq 1+x+x^2 for x \leq 1,
:\int_{-\infty}^{1/t} e^{tx} dF(x) \leq \int_{-\infty}^{1/t} (1+tx + t^2x^2) dF(x) \leq \int_{\R} (1+tx + t^2x^2) dF(x) = 1 + tE[X] + t^2 E[X^2] = 1 + t^2 \leq e^{t^2}.

For the second term, upper bound the integral by a sum:
:\int_{1/t}^{+\infty} e^{tx} dF(x) \leq e^2 Pr(X \in [1/t, 2/t]) + e^3 Pr(X \in [2/t, 3/t]) + \dots \leq \sum_{k=1}^\infty e^{k+1} Pr(X \geq k/t) \leq \sum_{k=1}^\infty e^{k(2-\frac 12 ak/t^2)}.

When t\le \sqrt{a/8}, we have 2k - \frac{a k^2}{2t^2} \le -\frac{ak}{4t^2} for every k\ge1, so
:\sum_{k=1}^\infty e^{k(2-\frac 12 ak/t^2)} \leq \frac{1}{e^{\frac{a}{4t^2}} - 1} \leq \frac 4a t^2.

When t> \sqrt{a/8}, comparing the sum with the integral and the maximum of the unimodal function f(x) = e^{-\frac{a}{2t^2}x^2 + 2x} gives
:\sum_{k=1}^\infty e^{k(2-\frac 12 ak/t^2)} \leq \int_\R f(x)dx + 2 \max_x f(x) = e^{\frac{2 t^2}{a}}\left(\sqrt{\frac{2 \pi t^2}{a}}+2\right) < 10 \sqrt{t^2/a}\, e^{\frac{2 t^2}{a}}.

It remains to verify that \ln 10 + \frac 12 \ln(t^2/a) + \frac{2}{a}t^2 < C_a t^2, where C_a depends on a only.

}}{{hidden end}}

{{Math theorem

| name = Corollary

| note = Matoušek 2008, Lemma 2.2

| math_statement = Let X_1, \dots, X_n be independent random variables with the same upper subgaussian tail:

-\ln Pr(X_i \geq t) \geq \frac 12 at^2

for all t>0, and with E[X_i] = 0, E[X_i^2] = 1. Then for any unit vector v\in \R^n, the linear sum \sum_i v_i X_i has a subgaussian tail:

-\ln Pr\left(\sum_i v_i X_i \geq t \right) \geq C_a t^2

where C_a > 0 depends only on a.

}}

Concentration

{{Math theorem|name=Gaussian concentration inequality for Lipschitz functions|note=Tao 2012, Theorem 2.1.12.|math_statement=

If f: \R^n \to \R is L-Lipschitz, and X \sim N(0, I) is a standard gaussian vector, then f(X) concentrates around its expectation at a rate

Pr(f(X) - E[f(X)] \geq t)\leq e^{-\frac{2}{\pi^2}\frac{t^2}{L^2}}

and similarly for the other tail.

}}

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=Proof}}

{{Math proof|title=Proof|proof=

By shifting and scaling, it suffices to prove the case where L = 1, and E[f(X)] = 0.

Since every 1-Lipschitz function is uniformly approximable by 1-Lipschitz smooth functions (by convolving with a mollifier), it suffices to prove it for 1-Lipschitz smooth functions.

Now it remains to bound the cumulant generating function.

To exploit the Lipschitzness, we introduce Y, an independent copy of X. Then by Jensen's inequality (using E[f(Y)] = E[f(X)] = 0),

E[e^{t(f(X)-f(Y))}] = E[e^{tf(X)}]E[e^{-tf(Y)}] \geq E[e^{tf(X)}]e^{-tE[f(Y)]} = E[e^{tf(X)}]

By the circular symmetry of gaussian variables, we introduce X_\theta := Y\cos\theta + X\sin\theta. This has the benefit that its derivative X_\theta' = -Y\sin\theta + X\cos\theta is independent of X_\theta.

\begin{aligned}e^{t(f(X) - f(Y))}

&=e^{t(f(X_{\pi/2}) - f(X_0))} \\

&= e^{t\int_0^{\pi/2} \nabla f(X_\theta) \cdot X_\theta'd\theta} \\

&= e^{\pi t/2 \int_0^{\pi/2} \nabla f(X_\theta) \cdot X_\theta'\frac{d\theta}{\pi/2}} \\

&\leq \int_0^{\pi/2} e^{\pi t/2 \nabla f(X_\theta) \cdot X_\theta'}\frac{d\theta}{\pi/2} \\

\end{aligned}

Now take its expectation,

E[e^{t(f(X) - f(Y))}] \leq \int_0^{\pi/2} E[e^{\pi t/2 \nabla f(X_\theta) \cdot X_\theta'}]\frac{d\theta}{\pi/2}

The expectation within the integral is over the joint distribution of X, Y, but since the joint distribution of X_\theta, X_\theta' is exactly the same, we have

= E_X[E_Y[e^{\pi t/2 \nabla f(X) \cdot Y}]]

Conditional on X, the quantity \nabla f(X) \cdot Y is normally distributed, with variance \leq 1, so

\leq e^{\frac 12 (\pi t/2)^2} = e^{\frac{\pi^2}{8} t^2}

Thus, we have

\ln E[e^{tf(X)}] \leq \frac{\pi^2}{8}t^2

}}{{hidden end}}
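
The inequality can be illustrated by simulation (a sketch assuming NumPy; the sample mean stands in for \operatorname{E}[f(X)]), using the 1-Lipschitz function f(x) = \|x\|_2:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000
X = rng.standard_normal((trials, n))   # rows are standard gaussian vectors
f = np.linalg.norm(X, axis=1)          # f(x) = ||x||_2 is 1-Lipschitz
mean_f = f.mean()                      # estimate of E[f(X)]

for t in (0.5, 1.0, 1.5, 2.0):
    empirical = np.mean(f - mean_f >= t)
    bound = np.exp(-2.0 * t**2 / np.pi**2)
    print(t, empirical, bound)         # the empirical tail sits below the bound
</syntaxhighlight>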

Strictly subgaussian

Expanding the cumulant generating function,
:\frac 12 s^2 t^2 \geq \ln \operatorname{E}[e^{tX}] = \frac 12 \mathrm{Var}[X] t^2 + \kappa_3 t^3 + \cdots,
we find that \mathrm{Var}[X] \leq \|X\|_{\mathrm{vp}}^2. At the edge of possibility, a random variable X satisfying \mathrm{Var}[X]=\|X\|_{\mathrm{vp}}^2 is called strictly subgaussian.
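
For example, the Rademacher distribution (values \pm 1 with probability 1/2 each) has cumulant generating function \ln\cosh(t) \leq t^2/2, so its optimal variance proxy equals its variance 1. This can also be checked numerically (a sketch assuming NumPy):

<syntaxhighlight lang="python">
import numpy as np

# sup_t 2 ln cosh(t) / t^2 approaches Var[X] = 1 (attained in the limit t -> 0),
# so the Rademacher distribution is strictly subgaussian.
ts = np.linspace(0.01, 10.0, 100_000)
print(np.max(2.0 * np.log(np.cosh(ts)) / ts**2))  # just below 1
</syntaxhighlight>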

= Properties =

Theorem.{{cite arXiv |last1=Bobkov |first1=S. G. |title=Strictly subgaussian probability distributions |date=2023-08-03 |eprint=2308.01749 |last2=Chistyakov |first2=G. P. |last3=Götze |first3=F.|class=math.PR }} Let X be a subgaussian random variable with mean zero. If all zeros of its characteristic function are real, then X is strictly subgaussian.

Corollary. If X_1, \dots, X_n are independent and strictly subgaussian, then any linear sum of them is strictly subgaussian.

= Examples =

By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: symmetric uniform distribution, symmetric Bernoulli distribution.

Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian.

Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric Binomial distribution is strictly subgaussian.

Examples

class="wikitable"

|+

!

!\|X\|_{\psi_2}

!\|X\|_{vp}^2

!strictly subgaussian?

gaussian distribution \mathcal N (0, 1)

|\sqrt{8/3}

|1

|Yes

mean-zero Bernoulli distribution p\delta_q + q \delta_{-p}

|solution to pe^{(q/t)^2} + qe^{(p/t)^2} = 2

|\frac{p-q}{2(\log p-\log q)}

|Iff p=0, 1/2, 1

symmetric Bernoulli distribution \frac 12 \delta_{1/2} + \frac 12 \delta_{-1/2}

|\frac{1}{\sqrt{\ln 2}}

|1

|Yes

uniform distribution U(0, 1)

|solution to \int_0^1 e^{x^2/t^2}dx = 2, approximately 0.7727

|1/3

|Yes

arbitrary distribution on interval [a, b]

|

|\leq \left(\frac{b-a}{2}\right)^2

|

The optimal variance proxy \Vert X\Vert_{\mathrm{vp}}^2 is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet{{R|marchal2017}}, Kumaraswamy, triangular{{R|arbel2020}}, truncated Gaussian, and truncated exponential.{{R|barreto2024}}

= Bernoulli distribution =

Let p and q be positive numbers with p + q = 1. Let X follow the centered Bernoulli distribution p\delta_q + q \delta_{-p}, which has mean zero. Then \Vert X\Vert_{\mathrm{vp}}^2 = \frac{p-q}{2(\log p-\log q)}. Its subgaussian norm is t, where t is the unique positive solution to pe^{(q/t)^2} + qe^{(p/t)^2} = 2.

Let X be a random variable with symmetric Bernoulli distribution (or Rademacher distribution); that is, X takes values -1 and 1 with probability 1/2 each. Since X^2=1, it follows that
:\Vert X \Vert_{\psi_2} = \inf\left\{ c>0 : \operatorname{E}\left[\exp{\left(\frac{X^2}{c^2}\right)}\right] \leq 2 \right\} = \inf\left\{ c>0 : \exp{\left(\frac{1}{c^2}\right)} \leq 2 \right\}=\frac{1}{\sqrt{\ln 2}},
and hence X is a subgaussian random variable.

= Bounded distributions =

File:Bounded probability distributions, compared with the normal distribution.svg

Bounded distributions have no tail at all, so clearly they are subgaussian.

If X is bounded within the interval [a, b], Hoeffding's lemma states that \Vert X\Vert_{\mathrm{vp}}^2 \leq \left(\frac{b-a}{2}\right)^2 . Hoeffding's inequality is the Chernoff bound obtained using this fact.
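
Hoeffding's lemma can be checked numerically for any particular bounded distribution. The sketch below (assuming NumPy and SciPy; the Beta(2, 5) distribution on [0, 1] is an arbitrary choice) verifies \operatorname{E}[e^{t(X-\operatorname{E}X)}] \leq e^{t^2/8} for a few values of t:

<syntaxhighlight lang="python">
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

dist = beta(2, 5)   # supported on [0, 1], so ((b - a)/2)^2 = 1/4
mu = dist.mean()

def centered_mgf(t):
    """E[exp(t (X - E[X]))] computed by numerical integration."""
    value, _ = quad(lambda x: np.exp(t * (x - mu)) * dist.pdf(x), 0.0, 1.0)
    return value

for t in (-8.0, -2.0, 2.0, 8.0):
    print(t, centered_mgf(t), np.exp(t**2 * (1.0 / 4.0) / 2.0))  # MGF <= exp(t^2/8)
</syntaxhighlight>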

= Convolutions =

File:Gaussian-mixture-example.svg

Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.

= Mixtures =

Given subgaussian distributions X_1, X_2, \dots, X_n, we can construct an additive mixture X as follows: first pick an index i \in \{1, 2, \dots, n\} with probabilities p_1, \dots, p_n, then draw a sample from X_i.

Since \operatorname{E}\left[\exp{\left(\frac{X^2}{c^2}\right)}\right] = \sum_i p_i \operatorname{E}\left[\exp{\left(\frac{X_i^2}{c^2}\right)}\right] we have \|X\|_{\psi_2} \leq \max_i \|X_i\|_{\psi_2}, and so the mixture is subgaussian.

In particular, any gaussian mixture is subgaussian.

More generally, the mixture of infinitely many subgaussian distributions is also subgaussian, if the subgaussian norm has a finite supremum: \|X\|_{\psi_2} \leq \sup_i \|X_i\|_{\psi_2}.

Subgaussian random vectors

So far, we have discussed subgaussianity for real-valued random variables. We can also define subgaussianity for random vectors. Since the purpose of subgaussianity is fast tail decay, we generalize accordingly: a subgaussian random vector is a random vector whose tails decay fast in every direction.

Let X be a random vector taking values in \R^n.

Define.

  • \|X\|_{\psi_2} := \sup_{v \in S^{n-1}}\|v^T X\|_{\psi_2}, where S^{n-1} is the unit sphere in \R^n. Similarly for the variance proxy \|X\|_{vp} := \sup_{v \in S^{n-1}}\|v^T X\|_{vp}
  • X is subgaussian iff \|X\|_{\psi_2} < \infty.

Theorem. (Vershynin 2018, Theorem 3.4.6) For any positive integer n, the uniformly distributed random vector X \sim U(\sqrt{n} S^{n-1}) is subgaussian, with \|X\|_{\psi_2} \lesssim{} 1.

This is not so surprising, because as n \to \infty, the projection of U(\sqrt{n} S^{n-1}) to the first coordinate converges in distribution to the standard normal distribution.
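
This convergence can be illustrated by simulation (a sketch assuming NumPy and SciPy; a uniform point on the sphere is obtained by normalizing a Gaussian vector):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials = 50, 100_000
G = rng.standard_normal((trials, n))
X = np.sqrt(n) * G / np.linalg.norm(G, axis=1, keepdims=True)  # uniform on sqrt(n) S^{n-1}
x1 = X[:, 0]   # projection onto a fixed unit vector

for t in (1.0, 2.0, 3.0):
    print(t, np.mean(np.abs(x1) >= t), 2.0 * norm.sf(t))
# the empirical tails are already close to the standard normal tails
</syntaxhighlight>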

Maximum inequalities

{{Math theorem

| math_statement =

If X_1, \dots, X_n are mean-zero subgaussians, with \|X_i \|_{vp}^2 \leq \sigma^2, then for any \delta > 0, we have \max(X_1, \dots, X_n) \leq \sigma\sqrt{2 \ln \frac{n}{\delta}} with probability \geq 1-\delta.

}}

{{Math proof|title=Proof|proof= By the Chernoff bound, \Pr(X_i \geq \sigma \sqrt{2 \ln(n/\delta)}) \leq \delta/n. Now apply the union bound. }}
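
A simulation illustrates the bound (a sketch assuming NumPy; independent N(0, \sigma^2) variables serve as a convenient example of mean-zero subgaussians with variance proxy \sigma^2):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
sigma, n, delta, trials = 2.0, 1000, 0.05, 10_000
X = sigma * rng.standard_normal((trials, n))
threshold = sigma * np.sqrt(2.0 * np.log(n / delta))
failure_rate = np.mean(X.max(axis=1) > threshold)
print(failure_rate, delta)  # the empirical failure rate stays below delta
</syntaxhighlight>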

{{Math theorem

| note = Vershynin 2018, Exercise 2.5.10

| math_statement = If X_1, X_2, \dots are subgaussians, with \|X_i \|_{\psi_2} \leq K, then E\left[\sup_n \frac{|X_n|}{\sqrt{1+\ln n}}\right] \lesssim K, \quad E\left[\max_{1 \leq n \leq N} |X_n|\right] \lesssim K \sqrt{\ln N}. Further, the bound is sharp: when X_1, X_2, \dots are i.i.d. samples of \mathcal N(0, 1), we have E\left[\max_{1 \leq n \leq N} |X_n|\right] \gtrsim \sqrt{\ln N}.Kamath, Gautam. "[http://www.gautamkamath.com/writings/gaussian_max.pdf Bounds on the expectation of the maximum of samples from a gaussian]." (2015)

}}

{{Math theorem

| math_statement = If X_1, \dots, X_n are subgaussian, with \|X_i \|_{vp}^2 \leq \sigma^2, then
:\begin{aligned}
E[\max_i (X_i - E[X_i])] \leq \sigma\sqrt{ 2\ln n}, &\quad \Pr(\max_i (X_i- E[X_i]) > t) \leq n e^{-\frac{t^2}{2\sigma^2}}, \\
E[\max_i |X_i - E[X_i]|] \leq \sigma\sqrt{ 2\ln (2n)}, &\quad \Pr(\max_i |X_i- E[X_i]| > t) \leq 2 n e^{-\frac{t^2}{2\sigma^2}}
\end{aligned}

| note = over a finite set{{Cite web |title=MIT 18.S997 {{!}} Spring 2015 {{!}} High-Dimensional Statistics, Chapter 1. Sub-Gaussian Random Variables |url=https://ocw.mit.edu/courses/18-s997-high-dimensional-statistics-spring-2015/a69e2f53bb2eeb9464520f3027fc61e6_MIT18_S997S15_Chapter1.pdf |access-date=2024-04-03 |website=MIT OpenCourseWare |language=en}}

}}

{{Math proof|title=Proof|proof= For any t>0:\begin{aligned}

E\!\bigl[\max_{1\le i\le n}(X_i-E[X_i])\bigr]

&=\frac1t\,E\!\Bigl[\ln\max_{i}e^{\,t(X_i-E[X_i])}\Bigr]\\

&\le\frac1t\ln E\!\Bigl[\max_{i}e^{\,t(X_i-E[X_i])}\Bigr] \quad \text{by Jensen}\\

&\le\frac1t\ln\sum_{i=1}^{n}Ee^{t(X_i-E[X_i])}\\

&\le\frac1t\ln\sum_{i=1}^{n}e^{\sigma^{2}t^{2}/2}\quad \text{by def of }\|\cdot\|_{vp}\\

&=\frac{\ln n}{t}+\frac{\sigma^{2}t}{2} \\

&\overset{t=\sqrt{2\ln n}/\sigma}{=}\;\sigma\sqrt{2\ln n},

\end{aligned}This is a standard proof structure for proving Chernoff-like bounds for sub-Gaussian variables. For the second inequality, it suffices to prove the case of a single mean-zero variable and then apply the union bound. By Markov's inequality, \Pr(X > t) \leq \Pr(e^{sX} > e^{st}) \leq e^{-st} E[e^{sX}]; then, by the definition of the variance proxy, this is \leq e^{-st} e^{\sigma^2 s^2/2}; optimizing over s at s = t/\sigma^2 gives the bound e^{-\frac{t^2}{2\sigma^2}}.

}}

{{Math theorem

| math_statement = Fix a finite set of vectors v_1, \dots, v_n. If X is a random vector such that each \| v_i^T X \|_{vp}^2 \leq \sigma^2, then the above four inequalities hold with \max_{v \in \mathrm{conv}(v_1, \dots, v_n)}(v^T X - E[v^T X]) replacing \max_i (X_i - E[X_i]). Here, \mathrm{conv}(v_1, \dots, v_n) is the convex hull of the vectors v_1, \dots, v_n.

| note = over a convex polytope

| name=Corollary

}}

{{Math theorem

| math_statement = If X is a random vector in \R^d, such that \|v^T X\|_{vp}^2 \leq \sigma^2 for all v on the unit sphere S, then E[\max_{v \in S} v^T X] = E[\max_{v \in S} |v^T X|] \leq 4\sigma \sqrt{d} For any \delta > 0, with probability at least 1-\delta,\max_{v \in S} v^T X = \max_{v \in S} | v^T X | \leq 4 \sigma \sqrt{d}+2 \sigma \sqrt{2 \log (1 / \delta)}

| note = subgaussian random vectors

}}

Inequalities

Theorem. (Vershynin 2018, Theorem 2.6.1) There exists a positive constant C such that given any number of independent mean-zero subgaussian random variables X_1, \dots,X_n,
:\left\|\sum_{i=1}^n X_i\right\|_{\psi_2}^2 \leq C \sum_{i=1}^n\left\|X_i\right\|_{\psi_2}^2.

Theorem. (Hoeffding's inequality; Vershynin 2018, Theorem 2.6.3) There exists a positive constant c such that given any number of independent mean-zero subgaussian random variables X_1, \dots,X_N,
:\mathbb{P}\left(\left|\sum_{i=1}^N X_i\right| \geq t\right) \leq 2 \exp \left(-\frac{c t^2}{\sum_{i=1}^N\left\|X_i\right\|_{\psi_2}^2}\right) \quad \forall t > 0.

Theorem. (Bernstein's inequality; Vershynin 2018, Theorem 2.8.1) There exists a positive constant c such that given any number of independent mean-zero subexponential random variables X_1, \dots,X_N,
:\mathbb{P}\left(\left|\sum_{i=1}^N X_i\right| \geq t\right) \leq 2 \exp \left(-c \min \left(\frac{t^2}{\sum_{i=1}^N\left\|X_i\right\|_{\psi_1}^2}, \frac{t}{\max _i\left\|X_i\right\|_{\psi_1}}\right)\right).

Theorem. (Khinchine inequality; Vershynin 2018, Exercise 2.6.5) There exists a positive constant C such that given any number of independent mean-zero, variance-one subgaussian random variables X_1, \dots,X_N with \max_i \|X_i\|_{\psi_2} \leq K, any p \geq 2, and any a_1, \dots, a_N \in \R,
:\left(\sum_{i=1}^N a_i^2\right)^{1 / 2} \leq\left\|\sum_{i=1}^N a_i X_i\right\|_{L^p} \leq C K \sqrt{p}\left(\sum_{i=1}^N a_i^2\right)^{1 / 2}.
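
As an illustration of Hoeffding's inequality above (a sketch assuming NumPy), it can be checked in its explicit variance-proxy form: Rademacher signs have variance proxy 1, variance proxies of independent variables add, and the Chernoff bound (applied to both tails) then gives \Pr(|\sum_{i=1}^N \varepsilon_i| \geq t) \leq 2\exp(-t^2/2N):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 100_000
eps = rng.integers(0, 2, size=(trials, N)) * 2 - 1   # i.i.d. +/-1 signs
S = eps.sum(axis=1)

for t in (10, 20, 30):
    print(t, np.mean(np.abs(S) >= t), 2.0 * np.exp(-t**2 / (2.0 * N)))
# the empirical tail probabilities stay below the Hoeffding bound
</syntaxhighlight>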

Hanson-Wright inequality

The Hanson-Wright inequality states that if a random vector X is subgaussian in a certain sense, then any quadratic form X^TAX of this vector is also subgaussian or subexponential. Further, the upper bound on the tail of X^TAX is uniform.

A weak version of the following theorem was proved in (Hanson, Wright, 1971).{{Cite journal |last1=Hanson |first1=D. L. |last2=Wright |first2=F. T. |date=1971 |title=A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables |journal=The Annals of Mathematical Statistics |volume=42 |issue=3 |pages=1079–1083 |doi=10.1214/aoms/1177693335 |jstor=2240253 |issn=0003-4851|doi-access=free }} There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose, than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.

Theorem.{{Cite journal |last1=Rudelson |first1=Mark |last2=Vershynin |first2=Roman |date=January 2013 |title=Hanson-Wright inequality and sub-gaussian concentration |url=https://projecteuclid.org/journals/electronic-communications-in-probability/volume-18/issue-none/Hanson-Wright-inequality-and-sub-gaussian-concentration/10.1214/ECP.v18-2865.full |journal=Electronic Communications in Probability |volume=18 |issue=none |pages=1–9 |doi=10.1214/ECP.v18-2865 |issn=1083-589X|arxiv=1306.2872 }}{{Cite book |last=Vershynin |first=Roman |url=https://www.cambridge.org/core/books/highdimensional-probability/797C466DA29743D2C8213493BD2D2102 |title=High-Dimensional Probability: An Introduction with Applications in Data Science |date=2018 |publisher=Cambridge University Press |isbn=978-1-108-41519-4 |series=Cambridge Series in Statistical and Probabilistic Mathematics |location=Cambridge |chapter=6. Quadratic Forms, Symmetrization, and Contraction |pages=127–146 |doi=10.1017/9781108231596.009 |chapter-url=https://doi.org/10.1017/9781108231596.009}} There exists a constant c, such that:

Let n be a positive integer. Let X_1, ..., X_n be independent random variables, such that each satisfies E[X_i] = 0. Combine them into a random vector X = (X_1, \dots, X_n). For any n\times n matrix A, we have
:P(|X^T AX - E[X^TAX]| > t ) \leq \max\left( 2 e^{-\frac{ct^2}{K^4\|A\|_F^2}}, 2 e^{-\frac{ct}{K^2\|A\|}} \right) = 2 \exp \left[-c \min \left(\frac{t^2}{K^4\|A\|_F^2}, \frac{t}{K^2\|A\|}\right)\right]

where K = \max_i \|X_i\|_{\psi_2}, and \|A\|_F = \sqrt{\sum_{ij} A_{ij}^2} is the Frobenius norm of the matrix, and \|A\| = \max_{\|x\|_2=1} \|Ax\|_2 is the operator norm of the matrix.

In words, the quadratic form X^TAX has its tail uniformly bounded by an exponential, or a gaussian, whichever is larger.

In the statement of the theorem, the constant c is an "absolute constant", meaning that it has no dependence on n, X_1, \dots, X_n, or A; it is a fixed mathematical constant, much like π and e.
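
A simulation conveys the content of the theorem (a sketch assuming NumPy; Rademacher coordinates and a fixed Gaussian matrix A are arbitrary choices, and since the constant c is unspecified, only the scale and shape of the decay are illustrated):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 20_000
A = rng.standard_normal((n, n))                       # a fixed matrix
X = rng.integers(0, 2, size=(trials, n)) * 2.0 - 1.0  # rows of i.i.d. +/-1 coordinates
quad_forms = np.sum((X @ A) * X, axis=1)              # X^T A X for each row

deviation = quad_forms - np.trace(A)                  # E[X^T A X] = trace(A) here
frobenius = np.linalg.norm(A, 'fro')
for t in (1.0, 2.0, 3.0):
    print(t, np.mean(np.abs(deviation) >= t * frobenius))
# the tail decays rapidly on the scale of the Frobenius norm, as the bound predicts
</syntaxhighlight>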

= Consequences =

Theorem (subgaussian concentration). There exists a constant c, such that:

Let n, m be positive integers. Let X_1, ..., X_n be independent random variables, such that each satisfies E[X_i] = 0, E[X_i^2] = 1. Combine them into a random vector X = (X_1, \dots, X_n). For any m\times n matrix A, we have
:P( | \| AX\|_2 - \|A\|_F | > t ) \leq 2 e^{-\frac{ct^2}{K^4\|A\|^2}}.
In words, the random vector AX is concentrated on a spherical shell of radius \|A \|_F, such that \| AX\|_2 - \|A \|_F is subgaussian, with subgaussian norm \leq \sqrt{3/c} \|A\| K^2.
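
The concentration of \|AX\|_2 around \|A\|_F can likewise be seen by simulation (a sketch assuming NumPy; the matrix is again an arbitrary choice):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 300, 200, 20_000
A = rng.standard_normal((m, n)) / np.sqrt(m)          # a fixed m x n matrix
X = rng.integers(0, 2, size=(trials, n)) * 2.0 - 1.0  # mean-zero, unit-variance coordinates
norms = np.linalg.norm(X @ A.T, axis=1)               # ||AX||_2 for each sample

print(np.linalg.norm(A, 'fro'), norms.mean())  # ||A||_F and the typical value of ||AX||_2
print(norms.std(), np.linalg.norm(A, 2))       # fluctuations are of the order of ||A||
</syntaxhighlight>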

See also

Notes

{{reflist|refs=

<ref name=Wainwright2019>Wainwright MJ. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press; 2019. {{doi|10.1017/9781108627771}}, {{ISBN|9781108627771}}.</ref>

<ref name=kahane>{{cite journal | doi=10.4064/sm-19-1-1-25 | title=Propriétés locales des fonctions à séries de Fourier aléatoires | date=1960 | last1=Kahane | first1=J. | journal=Studia Mathematica | volume=19 | pages=1–25 }}</ref>

<ref name=buldygin>{{cite journal | doi=10.1007/BF01087176 | title=Sub-Gaussian random variables | date=1980 | last1=Buldygin | first1=V. V. | last2=Kozachenko | first2=Yu. V. | journal=Ukrainian Mathematical Journal | volume=32 | issue=6 | pages=483–489 }}</ref>

<ref name=marchal2017>{{cite journal | doi=10.1214/17-ECP92 | title=On the sub-Gaussianity of the Beta and Dirichlet distributions | date=2017 | last1=Marchal | first1=Olivier | last2=Arbel | first2=Julyan | journal=Electronic Communications in Probability | volume=22 | arxiv=1705.00048 }}</ref>

<ref name=arbel2020>{{cite journal | doi=10.1051/ps/2019018 | title=On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables | date=2020 | last1=Arbel | first1=Julyan | last2=Marchal | first2=Olivier | last3=Nguyen | first3=Hien D. | journal=Esaim: Probability and Statistics | volume=24 | pages=39–55 | arxiv=1901.09188 }}</ref>

<ref name=barreto2024>{{cite arXiv | eprint=2403.08628 | last1=Barreto | first1=Mathias | last2=Marchal | first2=Olivier | last3=Arbel | first3=Julyan | title=Optimal sub-Gaussian variance proxy for truncated Gaussian and exponential random variables | date=2024 | class=math.ST }}</ref>

}}

References

  • {{cite journal |first1=J.P. |last1=Kahane |title=Propriétés locales des fonctions à séries de Fourier aléatoires |journal=Studia Mathematica |volume=19 |pages=1–25 |year=1960 |doi=10.4064/sm-19-1-1-25 |doi-access=free}}

  • {{Cite book |last=Tao |first=Terence |title=Topics in random matrix theory |date=2012 |publisher=American Mathematical Society |isbn=978-0-8218-7430-1 |series=Graduate studies in mathematics |location=Providence, R.I}}
  • {{Cite journal |last=Matoušek |first=Jiří |date=September 2008 |title=On variants of the Johnson–Lindenstrauss lemma |url=https://onlinelibrary.wiley.com/doi/10.1002/rsa.20218 |journal=Random Structures & Algorithms |language=en |volume=33 |issue=2 |pages=142–156 |doi=10.1002/rsa.20218 |issn=1042-9832|url-access=subscription }}
  • {{cite journal |first1=V.V. |last1=Buldygin |first2=Yu.V. |last2=Kozachenko |title=Sub-Gaussian random variables |journal=Ukrainian Mathematical Journal |volume=32 |issue=6 |pages=483–489 |year=1980 |doi=10.1007/BF01087176}}

  • {{cite book |last1=Ledoux |first1=Michel |last2=Talagrand |first2=Michel |title=Probability in Banach Spaces |year=1991 |publisher=Springer-Verlag}}

  • {{cite book |last1=Stromberg |first1=K.R. |title=Probability for Analysts |year=1994 |publisher=Chapman & Hall/CRC}}

  • {{cite journal |first1=A.E. |last1=Litvak |first2=A. |last2=Pajor |first3=M. |last3=Rudelson |first4=N. |last4=Tomczak-Jaegermann |title=Smallest singular value of random matrices and geometry of random polytopes |journal=Advances in Mathematics |volume=195 |issue=2 |pages=491–523 |year=2005 |doi=10.1016/j.aim.2004.08.004 |doi-access=free |url=http://www.math.ualberta.ca/~alexandr/OrganizedPapers/lprtlastlast.pdf}}

  • {{cite conference |first1=Mark |last1=Rudelson |first2=Roman |last2=Vershynin |title=Non-asymptotic theory of random matrices: extreme singular values |book-title=Proceedings of the International Congress of Mathematicians 2010 |pages=1576–1602 |year=2010 |arxiv=1003.2990 |doi=10.1142/9789814324359_0111}}

  • {{cite news |first1=O. |last1=Rivasplata |title=Subgaussian random variables: An expository note |journal=Unpublished |year=2012 |url=http://www.stat.cmu.edu/~arinaldo/36788/subgaussians.pdf}}

  • Vershynin, R. (2018). [https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.pdf "High-dimensional probability: An introduction with applications in data science"] (PDF). Volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
  • Zajkowski, K. (2020). "On norms in some class of exponential type Orlicz spaces of random variables". Positivity. 24(5): 1231–1240. {{arxiv|1709.02970}}. {{doi|10.1007/s11117-019-00729-6}}.

Category:Continuous distributions