Conditional Probability


A conditional probability is the probability of an event given the assumption that another event has occurred.

An example: two aces. Suppose we have a deck of cards and we consider the probability of drawing an ace. If we take one card out, the probability of an ace is $4/52 = 1/13$. (There are $4$ aces in a pack of $52$ cards.) Now, given that the first card was an ace, we take a second card out of the pack; notice that the probability of the second card being an ace has changed. Specifically, there are now $3$ aces and $51$ cards in the pack, so the probability is $3/51 = 1/17$. (Also observe that this probability is different from the one we would get had the first card not been an ace, in which case the probability would be $4/51$.) Thus our condition on the first card has affected the probability for the second card. This is an example of conditional probability, which we now develop more generally.

Motivating a definition of conditional probability. Recall that we thought of the probability of an event $A$ as the long-run proportion of experiments in which $A$ occurs:

$$\mathbb P(A) \approx \frac{\#\{\text{experiments where } A \text{ occurs}\}}{\#\{\text{experiments}\}}.$$

Similarly, we think of conditional probability as the proportion of time that an event occurs among those experiments in which another event is known to have occurred. In this case, an analogous statement would be that

$$\mathbb P(B|A) \approx \frac{\#\{\text{experiments where } A \text{ and } B \text{ both occur}\}}{\#\{\text{experiments where } A \text{ occurs}\}},$$

where here we use $\mathbb P(B|A)$ to denote the probability of event $B$ conditional on event $A$.

Observe that we can express this conditional frequency in terms of ordinary probabilities. In particular,

$$\frac{\#\{A \text{ and } B \text{ both occur}\}}{\#\{A \text{ occurs}\}} = \frac{\#\{A \text{ and } B \text{ both occur}\}/\#\{\text{experiments}\}}{\#\{A \text{ occurs}\}/\#\{\text{experiments}\}} \approx \frac{\mathbb P(A \cap B)}{\mathbb P(A)}.$$

Here we divide the numerator and denominator of the left-hand side by the number of experiments, and then apply the frequency interpretation of probability to both expressions.

This motivates the following definition.

Definition of conditional probability. Given the above we have:

Definition [Conditional Probability] For events $A$ and $B$, the conditional probability of $B$ given $A$ is

$$\mathbb P(B|A) = \frac{\mathbb P(A \cap B)}{\mathbb P(A)}$$

when $\mathbb P(A)>0$. By convention, if $\mathbb P(A) =0$ then we define $\mathbb P(B|A) = 0$.
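As a small illustrative check (not from the notes), the definition can be verified by counting on a finite sample space; the events chosen here are my own examples.

```python
from fractions import Fraction

# Sample space: all ordered outcomes of rolling a fair die twice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """Probability of an event (a subset of omega) under equal likelihood."""
    return Fraction(len(event), len(omega))

A = [w for w in omega if w[0] + w[1] >= 10]  # total is at least 10
B = [w for w in omega if w[0] == 6]          # first roll is a six

A_and_B = [w for w in A if w in B]

# Definition: P(B|A) = P(A ∩ B) / P(A).
p_B_given_A = prob(A_and_B) / prob(A)
print(p_B_given_A)  # 1/2
```

Of the six equally likely outcomes with total at least $10$, three have a six as the first roll, matching the ratio $\mathbb P(A \cap B)/\mathbb P(A)$.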

A couple of examples.

Example. The results of a survey of 250 different 75 year-olds is given in the following Venn diagram.

For a randomly selected participant in the survey:

1) Calculate the probability of living 10 more years given that they eat fruit.

2) Calculate the probability of living 10 more years given that they don’t eat fruit.

So you should eat fruit…

Example. I have two siblings. Given that I have a brother, what is the probability that I also have a sister?

Answer. Note that our sample space is $\Omega = \{BB,BS,SB,SS\}$; e.g. here $BS$ means the eldest sibling is a brother and the youngest is a sister. Each outcome has equal probability $1/4$. So

$$\mathbb P(\text{sister} \,|\, \text{brother}) = \frac{\mathbb P(\{BS, SB\})}{\mathbb P(\{BB, BS, SB\})} = \frac{2/4}{3/4} = \frac{2}{3}.$$

For some, this is an example of conditional probability being a bit counter-intuitive, as one’s gut reaction is that the answer is a half. Notice that if I had specified that my eldest sibling was a brother, then the answer would indeed be a half. This is just one small example of the subtle art of manipulating conditional probabilities.
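The counting in the siblings example can be checked by enumerating the four equally likely outcomes directly (a sketch of my own, not part of the notes):

```python
from fractions import Fraction

# Equally likely sibling outcomes, eldest listed first.
omega = ["BB", "BS", "SB", "SS"]

have_brother = [w for w in omega if "B" in w]               # BB, BS, SB
brother_and_sister = [w for w in have_brother if "S" in w]  # BS, SB

p = Fraction(len(brother_and_sister), len(have_brother))
print(p)  # 2/3

# If instead the *eldest* is known to be a brother, the answer is a half:
eldest_brother = [w for w in omega if w[0] == "B"]          # BB, BS
with_sister = [w for w in eldest_brother if "S" in w]       # BS
p_eldest = Fraction(len(with_sister), len(eldest_brother))
print(p_eldest)  # 1/2
```

The two print statements reproduce both answers discussed above: $2/3$ when conditioning on "some brother", and $1/2$ when conditioning on "the eldest is a brother".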

Independence

We are interested in the setting where knowing that an event has occurred does not affect another event. E.g. if I roll a die twice, in principle the outcome of the first roll should not influence the second roll. If event $A$ does not influence event $B$ then it should be that

$$\mathbb P(B|A) = \mathbb P(B).$$

I.e. conditioning on $A$ having happened does not change the probability of $B$ occurring. Since $\mathbb P (B | A) = \mathbb P (A \cap B)/\mathbb P( A)$, we can slightly more symmetrically express the above equality as

$$\mathbb P(A \cap B) = \mathbb P(A)\,\mathbb P(B).$$

This is what we call independence of two events.

Definition [Independence] We say that events $A$ and $B$ are independent if

$$\mathbb P(A \cap B) = \mathbb P(A)\,\mathbb P(B).$$

So if knowing that $A$ has happened does not affect the probability of $B$ then we multiply the probabilities together.
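As an illustration of the definition (my own example, using the two dice rolls mentioned above), independence can be verified by exact counting:

```python
from fractions import Fraction

# Sample space for two rolls of a fair die.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    return Fraction(len(event), len(omega))

A = [w for w in omega if w[0] == 6]  # first roll is a six
B = [w for w in omega if w[1] == 6]  # second roll is a six
A_and_B = [w for w in A if w in B]

# Independent: P(A ∩ B) = P(A) P(B).
print(prob(A_and_B) == prob(A) * prob(B))  # True

# A mutually exclusive pair, by contrast, is NOT independent:
C = [w for w in omega if w[0] != 6]  # first roll is not a six
A_and_C = [w for w in A if w in C]   # empty set
print(prob(A_and_C) == prob(A) * prob(C))  # False
```

The second check previews the warning below: $A$ and $C$ are mutually exclusive, so $\mathbb P(A \cap C) = 0$ while $\mathbb P(A)\mathbb P(C) > 0$.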

Warning! The following is sometimes confused by students. We say that two events are “mutually exclusive” if $A \cap B = \emptyset$. In that case we know from Lemma [lem:op1] that we add the probabilities together. This is not the same as independence, where we multiply the probabilities together.

As discussed, independence says that knowing an outcome is in $A$ does not affect the probability of $B$. However, if events $A$ and $B$ are mutually exclusive, then knowing the outcome is in $A$ does affect the probability of $B$. Specifically, if we know an outcome is in $A$ then it definitely is not in $B$.

Example. What is the probability of getting $10$ heads in a row from an unbiased coin?

Answer. Since the coin tosses are independent, we multiply the probabilities together:

$$\mathbb P(10 \text{ heads in a row}) = \left(\frac{1}{2}\right)^{10} = \frac{1}{1024}.$$

This is actually behind a “magic trick”. Notice there are $1440$ minutes in a day, so there is enough time to have a reasonable chance of getting $10$ heads in a row over the course of a day. There are TV magicians that have performed this as a trick (by cutting out the roughly $1023$ previous camera takes).
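A rough version of this reasoning can be computed directly. The modelling here is my own assumption, not from the notes: treat the day as $1440$ independent $10$-flip attempts, one per minute.

```python
# Probability of 10 heads in a row from a fair coin.
p_run = 0.5 ** 10
print(p_run)  # 0.0009765625, i.e. 1/1024

# Hypothetical version of the TV trick: one independent 10-flip
# attempt per minute, for the 1440 minutes of a day.
p_at_least_one = 1 - (1 - p_run) ** 1440
print(round(p_at_least_one, 3))  # ≈ 0.755
```

So under this (simplified) model the magician succeeds at some point during the day about three times out of four.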

Rules for Conditional Probabilities

Here are a few useful formulas for conditional probabilities. (As with operations on sets, the proofs are not strictly necessary to know for exams.)

Lemma 1.

$$\mathbb P(A \cap B) = \mathbb P(A)\,\mathbb P(B|A).$$

Proof. Follows immediately from the definition of $\mathbb P(B|A)$. $\square$.

This is useful as it can be easier to find $\mathbb P(B|A)$ first. E.g. as with the earlier two aces example, we can easily find the probability of drawing an ace from the deck given that the previous card was an ace, and from that calculate the probability of two aces.

Lemma 2.

$$\mathbb P(B) = \mathbb P(B|A)\,\mathbb P(A) + \mathbb P(B|A^c)\,\mathbb P(A^c).$$

Proof. Since $B = (A \cap B) \cup (A^c \cap B)$, and these two sets are disjoint,

$$\mathbb P(B) = \mathbb P(A \cap B) + \mathbb P(A^c \cap B),$$

and then applying Lemma 1 to each term

$$\mathbb P(B) = \mathbb P(B|A)\,\mathbb P(A) + \mathbb P(B|A^c)\,\mathbb P(A^c)$$

gives the result, as required. $\square$

Note that the above result can be applied to any number of sets $A_1,\dots,A_n$. So long as $\cup_i A_i = \Omega$ and $A_i \cap A_j = \emptyset$ for $i \neq j$, it holds that

$$\mathbb P(B) = \sum_{i=1}^n \mathbb P(B|A_i)\,\mathbb P(A_i).$$

Lemma 3 [Bayes’ Rule]

$$\mathbb P(A|B) = \frac{\mathbb P(B|A)\,\mathbb P(A)}{\mathbb P(B)}.$$

The result is sometimes called Bayes’ Theorem, as well.

Bayes’ Rule reverses the order of the conditional probability. There is a whole branch of statistics devoted to this, which we will very briefly touch upon shortly.

Example [Two aces] We take two cards out of a well-shuffled deck. What is the probability that both cards are aces? What is the probability that the 2nd card is an ace?

Answer. We know that $\mathbb P(A_1) = {4}/{52}$. Also we know that $\mathbb P (A_2 | A_1 ) = {3}/{51}$, because after one ace is dealt there are $3$ aces and $51$ cards. Thus using Lemma 1,

$$\mathbb P(A_1 \cap A_2) = \mathbb P(A_2 | A_1)\,\mathbb P(A_1) = \frac{3}{51} \times \frac{4}{52} = \frac{1}{221}.$$

For the 2nd part, we can apply Lemma 2. Here

$$\mathbb P(A_2) = \mathbb P(A_2|A_1)\,\mathbb P(A_1) + \mathbb P(A_2|A_1^c)\,\mathbb P(A_1^c) = \frac{3}{51}\times\frac{4}{52} + \frac{4}{51}\times\frac{48}{52} = \frac{4}{52} = \frac{1}{13}.$$

This should come as no surprise, as the probability that the 2nd card is an ace should be the same as the probability that the 1st card is an ace. (Imagine taking the first card out the pack and putting it to the back, and then taking the 2nd card out and looking at it.)
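Both answers can be confirmed by exact enumeration of all ordered pairs of cards (a check of my own, using a deck abstracted to aces and non-aces):

```python
from fractions import Fraction
from itertools import permutations

# A deck abstracted to ranks: 4 aces ("A") and 48 other cards ("x").
deck = ["A"] * 4 + ["x"] * 48

# All ordered draws of two distinct cards: 52 * 51 equally likely pairs.
pairs = list(permutations(range(52), 2))

both_aces = sum(1 for i, j in pairs if deck[i] == "A" and deck[j] == "A")
second_ace = sum(1 for i, j in pairs if deck[j] == "A")

print(Fraction(both_aces, len(pairs)))   # 1/221
print(Fraction(second_ace, len(pairs)))  # 1/13
```

The second count confirms the symmetry argument: exactly $1/13$ of all ordered pairs have an ace in the second position, the same fraction as in the first.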

Example [Frequentist vs Bayesian Statistics] I have a biased coin, it is biased so the probability of heads, $\theta$, is either $3/4$ or $1/4$, but you don’t know which. So you throw the coin three times and get three heads. From this data determine if $\theta = 3/4$ or $\theta = 1/4$.

Answer. This is clearly not a well-defined question, because we cannot determine the value of $\theta$ with certainty. Given its subjective nature (there is no single right answer), we give two approaches: a frequentist approach, which is the more classical statistical approach, and a Bayesian statistical approach.

The Frequentist Answer. The likelihood of three heads for the two choices of $\theta$ is

$$\mathbb P(HHH \,|\, \theta = 3/4) = \left(\frac{3}{4}\right)^3 = \frac{27}{64}, \qquad \mathbb P(HHH \,|\, \theta = 1/4) = \left(\frac{1}{4}\right)^3 = \frac{1}{64}.$$

The parameter that gives the highest probability is $\hat \theta= {3}/{4}$. So our answer for this problem is the estimator $\hat \theta = 3/4$.

The Bayesian Answer. Since, prior to throwing the coin, we don’t know which of the two possibilities holds, we could give each possibility equal likelihood; that is,

$$\mathbb P(\theta = 3/4) = \mathbb P(\theta = 1/4) = \frac{1}{2}.$$

This is called the prior distribution. After throwing the coin and getting three heads, we want to update our estimate to find

$$\mathbb P(\theta = 3/4 \,|\, HHH).$$

This is called the posterior distribution. We can use Bayes’ Rule to calculate the posterior:

$$\mathbb P(\theta = 3/4 \,|\, HHH) = \frac{\mathbb P(HHH \,|\, \theta = 3/4)\,\mathbb P(\theta = 3/4)}{\mathbb P(HHH)}.$$

To calculate $\mathbb P (HHH)$ we can apply Lemma 2:

$$\mathbb P(HHH) = \mathbb P(HHH \,|\, \theta = 3/4)\,\mathbb P(\theta = 3/4) + \mathbb P(HHH \,|\, \theta = 1/4)\,\mathbb P(\theta = 1/4) = \frac{27}{64}\cdot\frac{1}{2} + \frac{1}{64}\cdot\frac{1}{2} = \frac{7}{32},$$

which, combined with the prior above, then gives

$$\mathbb P(\theta = 3/4 \,|\, HHH) = \frac{\frac{27}{64}\cdot\frac{1}{2}}{\frac{7}{32}} = \frac{27}{28},$$

and thus $\mathbb P ( \theta = 1/4 | HHH) = \frac{1}{28}$. Thus in the Bayesian approach we say that we think $\theta = 3/4$ with probability $27/28$.
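The Bayesian update above can be carried out with exact arithmetic (a sketch of my own, mirroring the prior–likelihood–posterior steps):

```python
from fractions import Fraction

# Prior: each value of theta equally likely.
prior = {Fraction(3, 4): Fraction(1, 2), Fraction(1, 4): Fraction(1, 2)}

# Likelihood of three heads given theta: theta^3.
likelihood = {theta: theta ** 3 for theta in prior}

# Normalizing constant P(HHH), by Lemma 2 (law of total probability).
p_hhh = sum(likelihood[t] * prior[t] for t in prior)
print(p_hhh)  # 7/32

# Posterior, by Bayes' Rule.
posterior = {t: likelihood[t] * prior[t] / p_hhh for t in prior}
print(posterior[Fraction(3, 4)])  # 27/28
print(posterior[Fraction(1, 4)])  # 1/28
```

Note how the normalizing constant $\mathbb P(HHH)$ is just the sum of prior-weighted likelihoods; this is the term that becomes expensive when the parameter ranges over many values.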

Under reasonable assumptions and with enough data, both the Bayesian and frequentist approaches will converge on the correct parameter. The choice of prior in the Bayesian approach is quite subjective. When the range of parameters gets large (or continuous), we need to solve an optimization problem in the frequentist approach, while in the Bayesian approach we need to sum over a large number of terms to find the normalizing constant, which was the $\mathbb P(HHH)$ term in the above example.