Chain rule (probability)
{{Short description|Probability theory concept}}
{{distinguish|text=the chain rule in calculus}}
In probability theory, the chain rule{{cite book|first=René L.|last=Schilling|title=Measure, Integral, Probability & Processes - Probab(ilistical)ly the Theoretical Minimum |location=Technische Universität Dresden, Germany |year=2021|isbn=979-8-5991-0488-9|page=136ff}} (also called the general product rule{{cite book|first=David A.|last=Schum|title=The Evidential Foundations of Probabilistic Reasoning|year=1994|publisher=Northwestern University Press|isbn=978-0-8101-1821-8|page=49}}{{cite book|first=Henry E.|last=Klugh|title=Statistics: The Essentials for Research|year=2013|publisher=Psychology Press|isbn=978-1-134-92862-0|page=149|edition=3rd}}) describes how to calculate the probability of the intersection of events, which need not be independent, or, correspondingly, the joint distribution of random variables, using conditional probabilities. This rule allows one to express a joint probability in terms of only conditional probabilities.{{Cite web |last=Virtue |first=Pat |title=10-606: Mathematical Foundations for Machine Learning |url=https://www.cs.cmu.edu/~10606-f21/recitations/10606_Fa21_Recitation_4_worksheet_solutions.pdf}} The rule is notably used in the context of discrete stochastic processes and in applications, e.g. the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.
== Chain rule for events ==

=== Two events ===
For two events <math>A</math> and <math>B</math>, the chain rule states that
:<math>\mathbb P(A \cap B) = \mathbb P(B \mid A) \mathbb P(A)</math>,
where <math>\mathbb P(B \mid A)</math> denotes the conditional probability of <math>B</math> given <math>A</math>.
==== Example ====
Urn A contains 1 black ball and 2 white balls, and urn B contains 1 black ball and 3 white balls. Suppose we pick an urn at random and then select a ball from that urn. Let event <math>A</math> be choosing the first urn, i.e. <math>\mathbb P(A) = \mathbb P(\overline A) = 1/2</math>, where <math>\overline A</math> is the complementary event of <math>A</math>. Let event <math>B</math> be choosing a white ball. The chance of choosing a white ball, given that we have chosen the first urn, is <math>\mathbb P(B \mid A) = 2/3.</math> The intersection <math>A \cap B</math> then describes choosing the first urn and a white ball from it. The probability can be calculated by the chain rule as follows:
:<math>\mathbb P(A \cap B) = \mathbb P(B \mid A) \mathbb P(A) = \frac 23 \cdot \frac 12 = \frac 13.</math>
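This result can also be checked numerically. The following sketch (an illustration, not part of the original example) estimates <math>\mathbb P(A \cap B)</math> by Monte Carlo simulation, with the urn contents encoded as lists:

<syntaxhighlight lang="python">
import random

def trial():
    # Event A: pick urn A (probability 1/2); otherwise pick urn B.
    if random.random() < 0.5:
        urn, in_A = ["black", "white", "white"], True            # urn A: 1 black, 2 white
    else:
        urn, in_A = ["black", "white", "white", "white"], False  # urn B: 1 black, 3 white
    # Event B: the ball drawn from the chosen urn is white.
    return in_A and random.choice(urn) == "white"

n = 1_000_000
print(sum(trial() for _ in range(n)) / n)  # ≈ 1/3 = P(B | A) P(A)
</syntaxhighlight>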
=== Finitely many events ===
For events <math>A_1, \ldots, A_n</math> whose intersection does not have probability zero, the chain rule states
:<math>\begin{align}
\mathbb P\left(A_1 \cap A_2 \cap \ldots \cap A_n\right)
&= \mathbb P\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right) \mathbb P\left(A_1 \cap \ldots \cap A_{n-1}\right) \\
&= \mathbb P\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right) \mathbb P\left(A_{n-1} \mid A_1 \cap \ldots \cap A_{n-2}\right) \mathbb P\left(A_1 \cap \ldots \cap A_{n-2}\right) \\
&= \mathbb P\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right) \mathbb P\left(A_{n-1} \mid A_1 \cap \ldots \cap A_{n-2}\right) \cdot \ldots \cdot \mathbb P(A_3 \mid A_1 \cap A_2) \mathbb P(A_2 \mid A_1) \mathbb P(A_1)\\
&= \mathbb P(A_1) \mathbb P(A_2 \mid A_1) \mathbb P(A_3 \mid A_1 \cap A_2) \cdot \ldots \cdot \mathbb P(A_n \mid A_1 \cap \dots \cap A_{n-1})\\
&= \prod_{k=1}^n \mathbb P(A_k \mid A_1 \cap \dots \cap A_{k-1})\\
&= \prod_{k=1}^n \mathbb P\left(A_k \,\Bigg|\, \bigcap_{j=1}^{k-1} A_j\right).
\end{align}</math>
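The identity can be verified mechanically on any finite sample space by computing both sides. The following sketch does this for three events over three fair coin flips; the sample space and the events are hypothetical choices made only for illustration:

<syntaxhighlight lang="python">
from itertools import product

# Toy sample space: three fair coin flips, all outcomes equally likely.
omega = list(product("HT", repeat=3))
prob = {w: 1 / len(omega) for w in omega}

def P(event):
    """Probability of a set of outcomes."""
    return sum(prob[w] for w in event)

# Three events A_1, A_2, A_3, chosen arbitrarily for illustration.
A = [
    {w for w in omega if w[0] == "H"},        # A_1: first flip is heads
    {w for w in omega if w.count("H") >= 2},  # A_2: at least two heads
    {w for w in omega if w[2] == "H"},        # A_3: third flip is heads
]

# Left-hand side: P(A_1 ∩ A_2 ∩ A_3) computed directly.
lhs = P(set.intersection(*A))

# Right-hand side: the chain-rule product P(A_1) P(A_2 | A_1) P(A_3 | A_1 ∩ A_2).
rhs, running = 1.0, set(omega)
for Ak in A:
    rhs *= P(running & Ak) / P(running)  # P(A_k | A_1 ∩ ... ∩ A_{k-1})
    running &= Ak

print(lhs, rhs)  # both equal 0.25 (up to floating-point rounding)
</syntaxhighlight>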
==== Example 1 ====
For <math>n = 4</math>, i.e. four events, the chain rule reads
:<math>\begin{align}
\mathbb P(A_1 \cap A_2 \cap A_3 \cap A_4) &= \mathbb P(A_4 \mid A_3 \cap A_2 \cap A_1)\mathbb P(A_3 \cap A_2 \cap A_1) \\
&= \mathbb P(A_4 \mid A_3 \cap A_2 \cap A_1)\mathbb P(A_3 \mid A_2 \cap A_1)\mathbb P(A_2 \cap A_1) \\
&= \mathbb P(A_4 \mid A_3 \cap A_2 \cap A_1)\mathbb P(A_3 \mid A_2 \cap A_1)\mathbb P(A_2 \mid A_1)\mathbb P(A_1)
\end{align}</math>.
==== Example 2 ====
We randomly draw 4 cards (one at a time) without replacement from a deck of 52 cards. What is the probability that we have picked 4 aces?
First, we set <math>A_n := \{\text{draw an ace in the } n^{\text{th}} \text{ draw}\}</math>. Obviously, we get the following probabilities
:<math>\mathbb P(A_1) = \frac 4{52},
\qquad
\mathbb P(A_2 \mid A_1) = \frac 3{51},
\qquad
\mathbb P(A_3 \mid A_1 \cap A_2) = \frac 2{50},
\qquad
\mathbb P(A_4 \mid A_1 \cap A_2 \cap A_3) = \frac 1{49}.</math>
Applying the chain rule,
:<math>\mathbb P(A_1 \cap A_2 \cap A_3 \cap A_4) = \frac 4{52} \cdot \frac 3{51} \cdot \frac 2{50} \cdot \frac 1{49} = \frac{24}{6497400} = \frac 1{270725}.</math>
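As a cross-check (not part of the original example), exactly one of the <math>\binom{52}{4}</math> equally likely unordered 4-card hands consists of all four aces, so the probability must equal <math>1/\binom{52}{4}</math>. A short computation confirms that the chain-rule product reduces to the same value:

<syntaxhighlight lang="python">
from fractions import Fraction
import math

# Chain-rule product for drawing 4 aces in 4 draws without replacement.
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)
print(p)  # 1/270725

# Combinatorial cross-check: one all-ace hand among the C(52, 4) possible hands.
print(Fraction(1, math.comb(52, 4)))  # 1/270725
</syntaxhighlight>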
=== Statement of the theorem and proof ===
Let <math>(\Omega, \mathcal A, \mathbb P)</math> be a probability space. Recall that the conditional probability of an event <math>A \in \mathcal A</math> given <math>B \in \mathcal A</math> is defined as
:<math>
\mathbb P(A \mid B) :=
\begin{cases} \frac{\mathbb P(A \cap B)}{\mathbb P(B)}, & \mathbb P(B) > 0,\\ 0, & \mathbb P(B) = 0. \end{cases}
</math>
Then we have the following theorem.{{math theorem|name = Chain rule| Let <math>(\Omega, \mathcal A, \mathbb P)</math> be a probability space. Let <math>A_1, \ldots, A_n \in \mathcal A</math>. Then
:<math>\begin{align}
\mathbb P\left(A_1 \cap A_2 \cap \ldots \cap A_n\right)
&= \mathbb P(A_1) \mathbb P(A_2 \mid A_1) \mathbb P(A_3 \mid A_1 \cap A_2) \cdot \ldots \cdot \mathbb P(A_n \mid A_1 \cap \dots \cap A_{n-1})\\
&= \mathbb P(A_1) \prod_{j=2}^n \mathbb P(A_j \mid A_1 \cap \dots \cap A_{j-1}).
\end{align}</math>}}
{{Math proof|The formula follows immediately by recursion
:<math>
\begin{align}
(1) && &\mathbb P(A_1) \mathbb P(A_2 \mid A_1) &=&\qquad \mathbb P(A_1 \cap A_2) \\
(2) && &\mathbb P(A_1) \mathbb P(A_2 \mid A_1) \mathbb P(A_3 \mid A_1 \cap A_2) &=&\qquad \mathbb P(A_1 \cap A_2) \mathbb P(A_3 \mid A_1 \cap A_2) \\
&&&&=&\qquad \mathbb P(A_1 \cap A_2 \cap A_3),
\end{align}</math>
where we used the definition of the conditional probability in the first step. The general statement follows by repeating this argument, i.e. by induction on <math>n</math>.}}
== Chain rule for discrete random variables ==

=== Two random variables ===
For two discrete random variables <math>X, Y</math>, we use the events <math>A := \{X = x\}</math> and <math>B := \{Y = y\}</math> in the definition above, and find the joint distribution as
:<math>\mathbb P(X = x, Y = y) = \mathbb P(X = x \mid Y = y) \mathbb P(Y = y),</math>
or
:<math>\mathbb P_{(X,Y)}(x, y) = \mathbb P_{X \mid Y}(x \mid y) \mathbb P_Y(y),</math>
where <math>\mathbb P_X(x)</math> is the probability distribution of <math>X</math> and <math>\mathbb P_{X \mid Y}(x \mid y)</math> the conditional probability distribution of <math>X</math> given <math>Y</math>.
=== Finitely many random variables ===
Let <math>X_1, \ldots, X_n</math> be random variables and <math>x_1, \dots, x_n \in \mathbb R</math>. By the definition of the conditional probability,
:<math>\mathbb P\left(X_n = x_n, \ldots, X_1 = x_1\right) = \mathbb P\left(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\right) \mathbb P\left(X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\right)</math>
and using the chain rule, where we set <math>A_k := \{X_k = x_k\}</math>, we can find the joint distribution as
:<math>\begin{align}
\mathbb P\left(X_1 = x_1, \ldots, X_n = x_n\right)
&= \mathbb P\left(X_1 = x_1 \mid X_2 = x_2, \ldots, X_n = x_n\right) \mathbb P\left(X_2 = x_2, \ldots, X_n = x_n\right) \\
&= \mathbb P(X_1 = x_1) \mathbb P(X_2 = x_2 \mid X_1 = x_1) \mathbb P(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2) \cdot \ldots \\
&\qquad \cdot \mathbb P(X_n = x_n \mid X_1 = x_1, \dots, X_{n-1} = x_{n-1}).
\end{align}</math>
==== Example ====
For <math>n = 3</math>, i.e. for three random variables, the chain rule reads
:<math>\begin{align}
\mathbb P_{(X_1,X_2,X_3)}(x_1,x_2,x_3)
&= \mathbb P(X_1=x_1, X_2 = x_2, X_3 = x_3)\\
&= \mathbb P(X_3=x_3 \mid X_2 = x_2, X_1 = x_1) \mathbb P(X_2 = x_2, X_1 = x_1) \\
&= \mathbb P(X_3=x_3 \mid X_2 = x_2, X_1 = x_1) \mathbb P(X_2 = x_2 \mid X_1 = x_1) \mathbb P(X_1 = x_1) \\
&= \mathbb P_{X_3\mid X_2, X_1}(x_3 \mid x_2, x_1) \mathbb P_{X_2\mid X_1}(x_2 \mid x_1) \mathbb P_{X_1}(x_1).
\end{align}</math>
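In practice, such a factorization lets one assemble a joint distribution from conditional tables. The following is a minimal sketch for three binary random variables; the numbers are hypothetical and chosen only for illustration:

<syntaxhighlight lang="python">
# Hypothetical conditional tables for binary X1, X2, X3 (values 0 or 1).
p_x1 = {0: 0.6, 1: 0.4}                                  # P(X1 = x1)
p_x2_given_x1 = {(0, 0): 0.7, (0, 1): 0.3,               # P(X2 = x2 | X1 = x1),
                 (1, 0): 0.2, (1, 1): 0.8}               # keyed as (x1, x2)
p_x3_given_x1_x2 = {(x1, x2, x3): 0.5                    # P(X3 = x3 | X1 = x1, X2 = x2);
                    for x1 in (0, 1)                     # here X3 is a fair coin,
                    for x2 in (0, 1)                     # independent of the past
                    for x3 in (0, 1)}

def joint(x1, x2, x3):
    """P(X1 = x1, X2 = x2, X3 = x3), assembled via the chain rule."""
    return p_x1[x1] * p_x2_given_x1[(x1, x2)] * p_x3_given_x1_x2[(x1, x2, x3)]

# Sanity check: the joint pmf sums to 1 over all eight outcomes.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(0, 1, 1), total)  # ≈ 0.09 and 1.0
</syntaxhighlight>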
== Bibliography ==
- {{citation|author=René L. Schilling|date=2021|edition=1|isbn=979-8-5991-0488-9|location=Technische Universität Dresden, Germany|title=Measure, Integral, Probability & Processes - Probab(ilistical)ly the Theoretical Minimum}}
- {{citation|author=William Feller|date=1968|edition=3|isbn=978-0-471-25708-0|location=New York / London / Sydney|publisher=Wiley|title=An Introduction to Probability Theory and Its Applications|volume=I}}
- {{Russell Norvig 2003}}, p. 496.