Bernoulli distribution

{{Short description|Probability distribution modeling a coin toss which need not be fair}}

{{Use American English|date = January 2019}}

{{Probability distribution

|name =Bernoulli distribution

|type =mass

|pdf_image =File:Bernoulli Distribution.PNG

Three examples of Bernoulli distribution:

{{legend|7F0000|2=P(x=0) = 0{.}2 and P(x=1) = 0{.}8}}

{{legend|00007F|2=P(x=0) = 0{.}8 and P(x=1) = 0{.}2}}

{{legend|007F00|2=P(x=0) = 0{.}5 and P(x=1) = 0{.}5}}

|cdf_image =

|parameters =0 \leq p \leq 1

q = 1 - p

|support =k \in \{0,1\}

|pdf =\begin{cases}

q=1-p & \text{if }k=0 \\

p & \text{if }k=1

\end{cases}

|cdf =\begin{cases}

0 & \text{if } k < 0 \\

1 - p & \text{if } 0 \leq k < 1 \\

1 & \text{if } k \geq 1

\end{cases}

|mean = p

|median =\begin{cases}

0 & \text{if } p < 1/2\\

\left[0, 1\right] & \text{if } p = 1/2\\

1 & \text{if } p > 1/2

\end{cases}

|mode =\begin{cases}

0 & \text{if } p < 1/2\\

0, 1 & \text{if } p = 1/2\\

1 & \text{if } p > 1/2

\end{cases}

|variance =p(1-p) = pq

|mad =2p(1-p) = 2pq

|skewness =\frac{q - p}{\sqrt{pq}}

|kurtosis =\frac{1 - 6pq}{pq}

|entropy =-q\ln q - p\ln p

|mgf =q+pe^t

|char =q+pe^{it}

|pgf =q+pz

|fisher =\frac{1}{pq}

}}

{{Probability fundamentals}}

In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli,{{cite book |first=James Victor |last=Uspensky |title=Introduction to Mathematical Probability |publisher=McGraw-Hill |location=New York |year=1937 |page=45 |oclc=996937 }} is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1-p. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are Boolean-valued: a single bit whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q. It can be used to represent a (possibly biased) coin toss where 1 and 0 would represent "heads" and "tails", respectively, and p would be the probability of the coin landing on heads (or vice versa where 1 would represent tails and p would be the probability of tails). In particular, unfair coins would have p \neq 1/2.

The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so n would be 1 for such a binomial distribution). It is also a special case of the two-point distribution, for which the possible outcomes need not be 0 and 1.{{cite book |last1=Dekking |first1=Frederik |last2=Kraaikamp |first2=Cornelis |last3=Lopuhaä |first3=Hendrik |last4=Meester |first4=Ludolf |title=A Modern Introduction to Probability and Statistics |date=9 October 2010 |publisher=Springer London |isbn=9781849969529 |pages=43–48 |edition=1}}

Properties

If X is a random variable with a Bernoulli distribution, then:

:\Pr(X=1) = p, \Pr(X=0) = q =1 - p.

The probability mass function f of this distribution, over possible outcomes k, is

: f(k;p) = \begin{cases}

p & \text{if }k=1, \\

q = 1-p & \text {if } k = 0.

\end{cases}{{Cite book|title=Introduction to Probability|last=Bertsekas|author-link=Dimitri Bertsekas|first=Dimitri P.|date=2002|publisher=Athena Scientific|others=Tsitsiklis, John N., Τσιτσικλής, Γιάννης Ν.|isbn=188652940X|location=Belmont, Mass.|oclc=51441829}}

This can also be expressed as

:f(k;p) = p^k (1-p)^{1-k} \quad \text{for } k\in\{0,1\}

or as

:f(k;p)=pk+(1-p)(1-k) \quad \text{for } k\in\{0,1\}.

The Bernoulli distribution is a special case of the binomial distribution with n = 1.{{cite book | last = McCullagh | first = Peter | author-link= Peter McCullagh |author2=Nelder, John |author-link2=John Nelder | title = Generalized Linear Models, Second Edition | publisher = Boca Raton: Chapman and Hall/CRC | year = 1989 | isbn = 0-412-31760-5 |ref=McCullagh1989 |at=Section 4.2.2 }}
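
The equivalence of the piecewise, product, and linear forms of the probability mass function can be checked directly. The following is a minimal Python sketch (illustrative only; the function names are ad hoc, not from any library):

<syntaxhighlight lang="python">
# Evaluate the Bernoulli pmf in its three equivalent forms.
def pmf_piecewise(k, p):
    return p if k == 1 else 1 - p

def pmf_power(k, p):
    return p ** k * (1 - p) ** (1 - k)

def pmf_linear(k, p):
    return p * k + (1 - p) * (1 - k)

p = 0.3
for k in (0, 1):
    print(k, pmf_piecewise(k, p), pmf_power(k, p), pmf_linear(k, p))
# all three forms give 0.7 for k = 0 and 0.3 for k = 1
</syntaxhighlight>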

The excess kurtosis becomes arbitrarily large as p approaches 0 or 1, but for p=1/2 the two-point distributions, including the Bernoulli distribution, have a lower excess kurtosis, namely −2, than any other probability distribution.
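
As a numerical illustration (a hedged sketch, not part of the cited sources), the excess kurtosis (1-6pq)/(pq) from the summary box can be evaluated at a few values of p:

<syntaxhighlight lang="python">
# Excess kurtosis (1 - 6pq)/(pq): large near p = 0 or 1, equal to -2 at p = 1/2.
def excess_kurtosis(p):
    q = 1 - p
    return (1 - 6 * p * q) / (p * q)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, excess_kurtosis(p))
# p = 0.5 prints -2.0; values of p near 0 or 1 print large positive numbers
</syntaxhighlight>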

The Bernoulli distributions for 0 \le p \le 1 form an exponential family.

The maximum likelihood estimator of p based on a random sample is the sample mean.
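
That the sample mean maximizes the likelihood can be checked numerically. The sketch below (the sample and variable names are illustrative) compares a grid search over the log-likelihood with the sample mean of a small synthetic sample:

<syntaxhighlight lang="python">
import math

# The log-likelihood of i.i.d. Bernoulli observations is maximized at the sample mean.
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # synthetic data: 7 ones out of 10

def log_likelihood(p, data):
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

grid = [i / 1000 for i in range(1, 1000)]            # candidate values of p in (0, 1)
p_hat = max(grid, key=lambda p: log_likelihood(p, sample))
print(p_hat, sum(sample) / len(sample))              # both print 0.7
</syntaxhighlight>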

File:PMF and CDF of a bernouli distribution.png

Mean

The expected value of a Bernoulli random variable X is

:\operatorname{E}[X]=p

This is because for a Bernoulli distributed random variable X with \Pr(X=1)=p and \Pr(X=0)=q we find

:\operatorname{E}[X] = \Pr(X=1)\cdot 1 + \Pr(X=0)\cdot 0

= p \cdot 1 + q\cdot 0 = p.
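
A quick Monte Carlo sketch (illustrative only, using only the Python standard library) shows the empirical mean of Bernoulli draws approaching p:

<syntaxhighlight lang="python">
import random

# The empirical mean of many Bernoulli(p) draws approaches E[X] = p.
random.seed(0)
p = 0.3
draws = [1 if random.random() < p else 0 for _ in range(100_000)]
print(sum(draws) / len(draws))   # close to 0.3
</syntaxhighlight>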

Variance

The variance of a Bernoulli distributed X is

:\operatorname{Var}[X] = pq = p(1-p)

We first find

:\operatorname{E}[X^2] = \Pr(X=1)\cdot 1^2 + \Pr(X=0)\cdot 0^2

: = p \cdot 1^2 + q\cdot 0^2 = p = \operatorname{E}[X]

From this follows

:\operatorname{Var}[X] = \operatorname{E}[X^2]-\operatorname{E}[X]^2 = \operatorname{E}[X]-\operatorname{E}[X]^2

: = p-p^2 = p(1-p) = pq

From this result it follows that, for any Bernoulli distribution, the variance lies in [0,1/4], with the maximum value 1/4 attained at p = 1/2.
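
Both the formula \operatorname{Var}[X] = p(1-p) and the bound 1/4 can be checked directly; the following sketch (illustrative only) scans a grid of p values:

<syntaxhighlight lang="python">
# Var[X] = E[X^2] - E[X]^2 = p(1 - p), which never exceeds 1/4.
def bernoulli_variance(p):
    e_x = p    # E[X]
    e_x2 = p   # E[X^2], since 1^2 = 1 and 0^2 = 0
    return e_x2 - e_x ** 2

values = [bernoulli_variance(i / 100) for i in range(101)]
print(max(values))                # 0.25, attained at p = 0.5
print(bernoulli_variance(0.25))   # 0.1875 = 0.25 * 0.75
</syntaxhighlight>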

Skewness

The skewness is \frac{q-p}{\sqrt{pq}}=\frac{1-2p}{\sqrt{pq}}. When we take the standardized Bernoulli distributed random variable \frac{X-\operatorname{E}[X]}{\sqrt{\operatorname{Var}[X]}} we find that this random variable attains \frac{q}{\sqrt{pq}} with probability p and attains -\frac{p}{\sqrt{pq}} with probability q. Thus we get

:\begin{align}

\gamma_1 &= \operatorname{E} \left[\left(\frac{X-\operatorname{E}[X]}{\sqrt{\operatorname{Var}[X]}}\right)^3\right] \\

&= p \cdot \left(\frac{q}{\sqrt{pq}}\right)^3 + q \cdot \left(-\frac{p}{\sqrt{pq}}\right)^3 \\

&= \frac{1}{\sqrt{pq}^3} \left(pq^3-qp^3\right) \\

&= \frac{pq}{\sqrt{pq}^3} (q^2-p^2) \\

&= \frac{(1-p)^2-p^2}{\sqrt{pq}} \\

&= \frac{1-2p}{\sqrt{pq}} = \frac{q-p}{\sqrt{pq}}.

\end{align}
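
The closed form can be compared with the third moment of the standardized variable computed directly from the two-point support. A minimal Python sketch (illustrative only):

<syntaxhighlight lang="python">
import math

# Third moment of the standardized Bernoulli variable versus the closed form (1 - 2p)/sqrt(pq).
def skewness_from_definition(p):
    q = 1 - p
    s = math.sqrt(p * q)                        # standard deviation
    return p * (q / s) ** 3 + q * (-p / s) ** 3

def skewness_closed_form(p):
    q = 1 - p
    return (1 - 2 * p) / math.sqrt(p * q)

for p in (0.1, 0.5, 0.9):
    print(p, skewness_from_definition(p), skewness_closed_form(p))
# the two columns agree; the skewness is 0 at p = 0.5 and changes sign there
</syntaxhighlight>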

Higher moments and cumulants

For k \ge 1, the raw moments are all equal, because 1^k=1 and 0^k=0.

:\operatorname{E}[X^k] = \Pr(X=1)\cdot 1^k + \Pr(X=0)\cdot 0^k = p \cdot 1 + q\cdot 0 = p = \operatorname{E}[X].

The central moment of order k is given by

:

\mu_k =(1-p)(-p)^k +p(1-p)^k.

The first six central moments are

:\begin{align}

\mu_1 &= 0, \\

\mu_2 &= p(1-p), \\

\mu_3 &= p(1-p)(1-2p), \\

\mu_4 &= p(1-p)(1-3p(1-p)), \\

\mu_5 &= p(1-p)(1-2p)(1-2p(1-p)), \\

\mu_6 &= p(1-p)(1-5p(1-p)(1-p(1-p))).

\end{align}

The higher central moments can be expressed more compactly in terms of \mu_2 and \mu_3

:\begin{align}

\mu_4 &= \mu_2 (1-3\mu_2 ), \\

\mu_5 &= \mu_3 (1-2\mu_2 ), \\

\mu_6 &= \mu_2 (1-5\mu_2 (1-\mu_2 )).

\end{align}
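
The general formula \mu_k =(1-p)(-p)^k +p(1-p)^k can be checked against these compact expressions numerically. A minimal sketch (illustrative only):

<syntaxhighlight lang="python">
# Central moments mu_k = (1-p)(-p)^k + p(1-p)^k versus the compact closed forms.
def mu(k, p):
    q = 1 - p
    return q * (-p) ** k + p * q ** k

p = 0.3
mu2, mu3 = mu(2, p), mu(3, p)
print(mu2, p * (1 - p))                            # mu_2 = p(1-p)
print(mu(4, p), mu2 * (1 - 3 * mu2))               # mu_4 = mu_2 (1 - 3 mu_2)
print(mu(5, p), mu3 * (1 - 2 * mu2))               # mu_5 = mu_3 (1 - 2 mu_2)
print(mu(6, p), mu2 * (1 - 5 * mu2 * (1 - mu2)))   # mu_6 = mu_2 (1 - 5 mu_2 (1 - mu_2))
# each pair agrees up to floating-point rounding
</syntaxhighlight>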

The first six cumulants are

:\begin{align}

\kappa_1 &= p, \\

\kappa_2 &= \mu_2 , \\

\kappa_3 &= \mu_3 , \\

\kappa_4 &= \mu_2 (1-6\mu_2 ), \\

\kappa_5 &= \mu_3 (1-12\mu_2 ), \\

\kappa_6 &= \mu_2 (1-30\mu_2 (1-4\mu_2 )).

\end{align}
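
Because the cumulants are the coefficients in the Taylor expansion of the cumulant generating function K(t) = \ln(q + pe^t), the logarithm of the moment-generating function in the summary box, the closed forms above can be verified symbolically. A sketch assuming SymPy is available:

<syntaxhighlight lang="python">
import sympy as sp

# Cumulants as n! times the Taylor coefficients of K(t) = ln(q + p*exp(t)).
p, t = sp.symbols('p t', positive=True)
q = 1 - p
K = sp.ln(q + p * sp.exp(t))
expansion = K.series(t, 0, 7).removeO()
kappa = [sp.factorial(n) * expansion.coeff(t, n) for n in range(1, 7)]

mu2 = p * q
mu3 = p * q * (1 - 2 * p)

# each difference simplifies to 0, confirming the listed closed forms
print(sp.simplify(kappa[0] - p))                                     # kappa_1
print(sp.simplify(kappa[1] - mu2))                                   # kappa_2
print(sp.simplify(kappa[2] - mu3))                                   # kappa_3
print(sp.simplify(kappa[3] - mu2 * (1 - 6 * mu2)))                   # kappa_4
print(sp.simplify(kappa[4] - mu3 * (1 - 12 * mu2)))                  # kappa_5
print(sp.simplify(kappa[5] - mu2 * (1 - 30 * mu2 * (1 - 4 * mu2))))  # kappa_6
</syntaxhighlight>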

Entropy and Fisher information

=Entropy=

Entropy is a measure of uncertainty or randomness in a probability distribution. For a Bernoulli random variable X with success probability p and failure probability q = 1 - p, the entropy H(X) is defined as:

:\begin{align}

H(X) &= \mathbb{E}_p\left[\ln \frac{1}{P(X)}\right] = - [P(X = 0) \ln P(X = 0) + P(X = 1) \ln P(X = 1)] \\

H(X) &= - (q \ln q + p \ln p) , \quad q = P(X = 0), p = P(X = 1)

\end{align}

The entropy is maximized when p = 0.5, indicating the highest level of uncertainty when both outcomes are equally likely. The entropy is zero when p = 0 or p = 1, where one outcome is certain.
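
A small Python sketch (illustrative only) evaluating the binary entropy, with the usual convention 0 \ln 0 = 0:

<syntaxhighlight lang="python">
import math

# Binary entropy H(p) = -q ln q - p ln p in nats, with the convention 0 ln 0 = 0.
def bernoulli_entropy(p):
    q = 1 - p
    return sum(-x * math.log(x) for x in (p, q) if x > 0)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, bernoulli_entropy(p))
# the maximum ln 2 (about 0.693) occurs at p = 0.5; the entropy is 0 at p = 0 and p = 1
</syntaxhighlight>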

=Fisher information=

Fisher information measures the amount of information that an observable random variable X carries about an unknown parameter p upon which the probability of X depends. For the Bernoulli distribution, the Fisher information with respect to the parameter p is given by:

:\begin{align}

I(p) = \frac{1}{pq}

\end{align}

Proof:

  • The likelihood function for a Bernoulli random variable X is:

:\begin{align}

L(p; X) = p^X (1 - p)^{1 - X}

\end{align}

This represents the probability of observing X given the parameter p.

  • The log-likelihood function is:

:\begin{align}

\ln L(p; X) = X \ln p + (1 - X) \ln (1 - p)

\end{align}

  • The score function (the first derivative of the log-likelihood with respect to p) is:

:\begin{align}

\frac{\partial}{\partial p} \ln L(p; X) = \frac{X}{p} - \frac{1 - X}{1 - p}

\end{align}

  • The second derivative of the log-likelihood function is:

:\begin{align}

\frac{\partial^2}{\partial p^2} \ln L(p; X) = -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}

\end{align}

  • Fisher information is calculated as the negative expected value of the second derivative of the log-likelihood:

:\begin{align}

I(p) = -E\left[\frac{\partial^2}{\partial p^2} \ln L(p; X)\right] = -\left(-\frac{p}{p^2} - \frac{1 - p}{(1 - p)^2}\right) = \frac{1}{p(1-p)} = \frac{1}{pq}

\end{align}

It is minimized when p = 0.5, reflecting maximum uncertainty and thus minimum information about the parameter p, and it grows without bound as p approaches 0 or 1.
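
The identity can be confirmed numerically by taking the expectation of the negative second derivative over the two support points; a minimal sketch (illustrative only):

<syntaxhighlight lang="python">
# Fisher information as the negative expected second derivative of the log-likelihood,
# compared with the closed form 1/(pq).
def fisher_information(p):
    q = 1 - p
    # -d^2/dp^2 ln L equals X/p^2 + (1 - X)/(1 - p)^2; take its expectation over X ~ Bernoulli(p)
    return p * (1 / p ** 2) + q * (1 / q ** 2)

for p in (0.1, 0.5, 0.9):
    q = 1 - p
    print(p, fisher_information(p), 1 / (p * q))
# the two columns agree; the minimum value 4 occurs at p = 0.5
</syntaxhighlight>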

Related distributions

  • If X_1, \dots, X_n are independent, identically distributed random variables, all Bernoulli trials with success probability p, then their sum is distributed according to a binomial distribution with parameters n and p. The Bernoulli distribution is simply \operatorname{B}(1, p), also written as \mathrm{Bernoulli}(p); a short simulation sketch of this relation and of the Rademacher relation below follows the list.
  • The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant number of discrete values.
  • The Beta distribution is the conjugate prior of the Bernoulli distribution.{{Cite web |last1=Orloff |first1=Jeremy |last2=Bloom |first2=Jonathan |date= |title=Conjugate priors: Beta and normal |url=https://math.mit.edu/~dav/05.dir/class15-prep.pdf |access-date=October 20, 2023 |website=math.mit.edu}}
  • The geometric distribution models the number of independent and identical Bernoulli trials needed to get one success.
  • If Y \sim \mathrm{Bernoulli}\left(\frac{1}{2}\right), then 2Y - 1 has a Rademacher distribution.
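
The binomial-sum and Rademacher relations above can be illustrated by simulation. A minimal sketch using only the Python standard library (illustrative, not normative):

<syntaxhighlight lang="python">
import random

# Simulation sketch of two relations: a sum of Bernoulli(p) draws behaves like a
# Binomial(n, p) variable, and 2Y - 1 with Y ~ Bernoulli(1/2) is Rademacher distributed.
random.seed(1)
p, n = 0.3, 5

def bernoulli(prob):
    return 1 if random.random() < prob else 0

sums = [sum(bernoulli(p) for _ in range(n)) for _ in range(100_000)]
print(sum(sums) / len(sums))                    # close to the binomial mean n * p = 1.5

rademacher = [2 * bernoulli(0.5) - 1 for _ in range(100_000)]
print(rademacher.count(1) / len(rademacher))    # close to 0.5; the values are only -1 and +1
</syntaxhighlight>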

See also

References

{{Reflist}}

Further reading

  • {{cite book |last1=Johnson |first1=N. L. |last2=Kotz |first2=S. |last3=Kemp |first3=A. |year=1993 |title=Univariate Discrete Distributions |edition=2nd |publisher=Wiley |isbn=0-471-54897-9 }}
  • {{cite book |first=John G. |last=Peatman |title=Introduction to Applied Statistics |location=New York |publisher=Harper & Row |year=1963 |pages=162–171 }}