Discrete Probability Distributions

There are some probability distributions that occur frequently, either because they have a particularly natural or simple construction, or because they arise as the limit of some simpler distribution. Here we cover

  • Bernoulli random variables
  • Binomial distribution
  • Geometric distribution
  • Poisson distribution.

(This is a section in the notes here.)

Of course there are many other important distributions. As with counting, when first learning about probability it is easy to think of different probability distributions as being a main destination for probability. However, it is perhaps better to think of the probability distributions that we cover now as simple building blocks that can later be used to construct more expressive probabilistic and statistical models.

Again we focus on probability distributions that take discrete values, though shortly we will begin to discuss their continuous counterparts.

Notation. There are many different distributions which often have different parameters. For instance, shortly we will define the Binomial Distribution, which has two parameters n and p and which we denote by \text{Bin}(n,p). If a random variable X has a specific distribution then we use “\sim ” to denote this. E.g. if X is a Binomial random variable with parameters n=4 and p=0.2 then we write

X \sim \text{Bin}(4, 0.2).

Bernoulli Distribution.

We start with the simplest discrete probability distribution.

Definition [Bernoulli Distribution] A random variable that is either zero or one is a Bernoulli random variable. That is, we write X \sim \text{Bern}(p) if

\mathbb P (X = 1) = p \qquad \text{and} \qquad \mathbb P(X=0) = 1-p.

It is a straight-forward calculation to show that

\mathbb E [X] = p \qquad \text{and} \qquad \text{Var}(X) = p(1-p)

for X \sim \text{Bern}(p).
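These identities are easy to check numerically. Below is a simulation sketch in Python (the seed, sample size and p = 0.3 are arbitrary choices) comparing the empirical mean and variance of Bernoulli samples with p and p(1-p).

```python
import random

random.seed(0)

def bernoulli(p):
    """One Bern(p) draw: 1 with probability p, otherwise 0."""
    return 1 if random.random() < p else 0

p, n = 0.3, 100_000
samples = [bernoulli(p) for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n
print(mean, var)  # close to p = 0.3 and p(1-p) = 0.21
```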

Binomial Distribution.

If we take X_1,X_2,...,X_n to be independent Bernoulli random variables with parameter p, and we add them together,

X = X_1 + X_2 + \cdots + X_n,

then we get a Binomial distribution with parameters n and p.

So if we consider an experiment with probability of success p, repeat the experiment n times independently, and count up the number of successes, then the resulting probability distribution is a Binomial distribution.

Let’s briefly consider the probability that X=k. One event where \{ X=k \} occurs is when the first k experiments end in success and the rest fail, \{X_1=1, ..., X_k=1, X_{k+1} = 0 ,..., X_n = 0 \} . Note that by independence

\mathbb P (X_1=1, ..., X_k=1, X_{k+1} = 0 ,..., X_n = 0) = \mathbb P(X_1 = 1) \cdots \mathbb P(X_k =1) \, \mathbb P(X_{k+1} = 0) \cdots \mathbb P(X_n = 0) = p^k (1-p)^{n-k}.

Indeed the probability of any individual sequence X_1,...,X_n with \sum_i X_i = k is p^k(1-p)^{n-k}. So how many such sequences are there? Well we have seen this before. It is the number of ways we can label n points with k ones (see additional remarks on combinations in Section [sec:Counting]), that is the combination C^n_k. Thus we have

\mathbb P (X = k) = C^n_k \, p^k (1-p)^{n-k}.

This motivates the following definition.

Definition [Binomial Distribution] A random variable X has a binomial distribution with parameters n and p, if it has probability mass function

\mathbb P (X = k) = C^n_k \, p^k (1-p)^{n-k}

for k=0,1,...,n, and we write X \sim \text{Bin}(n,p).
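As a sanity check on the mass function, the following Python sketch computes the \text{Bin}(4, 0.2) probabilities directly from the formula (using the standard library's math.comb for C^n_k) and verifies that they sum to one and have mean np.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 4, 0.2
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                              # numerically 1
mean = sum(k * q for k, q in enumerate(pmf))  # n * p = 0.8
print(total, mean)
```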

Here are some results on Binomial distributions that might be handy.

Lemma 1. If X \sim \text{Bin}(n,p), then

\mathbb E [X] = np

\text{Var}(X) = np(1-p)

Lemma 2. If X \sim \text{Bin}(n,p) and Y \sim \text{Bin}(m,p) and are independent then

X + Y \sim \text{Bin}(n+m, p).

Proof. We can write X= \sum_{i=1}^n X_i and Y= \sum_{i=n+1}^{n+m} X_i where X_1,...,X_{n+m} are independent and X_i \sim \text{Bern}(p). So X+Y = \sum_{i=1}^{n+m} X_i, which is thus Binomial with parameters n+m and p. \square
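Lemma 2 can also be checked numerically: convolving the mass functions of \text{Bin}(n,p) and \text{Bin}(m,p) should reproduce the \text{Bin}(n+m,p) mass function. A short Python sketch, with parameters chosen arbitrarily:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, m, p = 3, 5, 0.4

# P(X + Y = k) = sum over j of P(X = j) * P(Y = k - j), by independence.
conv = [
    sum(binom_pmf(j, n, p) * binom_pmf(k - j, m, p)
        for j in range(max(0, k - m), min(n, k) + 1))
    for k in range(n + m + 1)
]
# Lemma 2 says this matches the Bin(n + m, p) mass function.
direct = [binom_pmf(k, n + m, p) for k in range(n + m + 1)]

assert all(abs(a - b) < 1e-12 for a, b in zip(conv, direct))
print("convolution matches Bin(n + m, p)")
```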

Lemma 3. If X \sim Bin(n,p) and Y_1,...,Y_n are independent Bernoulli random variables with parameter q then

\sum_{i=1}^{X} Y_i \sim \text{Bin}(n, pq).

Proof. We can write X= \sum_{i=1}^n X_i for X_i \sim \text{Bern}(p) independent. Note that an equivalent way to represent the above random variable is \sum_{i=1}^{X} Y_i = \sum_{i=1}^n X_i Y_i. Since X_i Y_i \sim \text{Bern}(pq), the above random variable must be \text{Bin}(n,pq). \square

Geometric Distribution

Suppose we throw a biased coin until the first time that it lands on heads. The distribution of the number of throws is a geometric distribution. For instance, the probability that it takes X=5 coin throws is the same as the probability of 4 tails in a row and then one heads which is

(1-p)^4 \, p,

where p is the probability of heads. In general, the probability that we need k throws is

(1-p)^{k-1} \, p.

This gives the geometric distribution.

Definition [Geometric distribution] The geometric distribution with success probability p is the distribution with probability mass function

\mathbb P (X = k) = (1-p)^{k-1} \, p

for k=1,2,..., and we write X \sim \text{Geo}(p).
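A quick simulation sketch of this construction: repeatedly flip a p-coin until the first head and compare the average number of flips with 1/p (the seed, sample size and p = 0.25 are arbitrary choices).

```python
import random

random.seed(1)

def geometric(p):
    """Flips of a p-coin until the first head; support 1, 2, 3, ..."""
    flips = 1
    while random.random() >= p:  # tails: keep flipping
        flips += 1
    return flips

p = 0.25
samples = [geometric(p) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # close to 1/p = 4
```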

The following lemma is useful for geometric distributions, but also for various applications such as compound interest.

Lemma 4. [Geometric Series] For |x| <1,

\sum_{k=0}^\infty x^k = \frac{1}{1-x}.

Proof. Write S_n = 1 + x + x^2 + \cdots + x^n, so that x S_n = x + x^2 + \cdots + x^{n+1}. Now subtracting gives

(1-x) S_n = 1 - x^{n+1}, \qquad \text{so} \qquad S_n = \frac{1-x^{n+1}}{1-x} \longrightarrow \frac{1}{1-x}

as n \rightarrow \infty, since |x|<1. Differentiating the above with respect to x gives

\sum_{k=1}^\infty k x^{k-1} = \frac{1}{(1-x)^2}.

Differentiating again gives

\sum_{k=2}^\infty k(k-1) x^{k-2} = \frac{2}{(1-x)^3}. \qquad \square
Lemma 4. If X \sim \text{Geo}(p) then

\mathbb E [X] = \frac{1}{p},

\text{Var}(X) = \frac{1-p}{p^2},

and, for k = 0,1,2,...,

\mathbb P (X > k) = (1-p)^k.

If we throw a coin and get 8 tails in a row, and we ask how long we should wait until we next get a heads, then (even though it might feel like we are now due a heads) it is the same as the time we would have expected when we first started throwing the coin. This is a key property of the geometric distribution and it is called the memoryless property.

Lemma 5. [Memoryless Property] If T \sim Geo(p) then, conditional on \{ T > t \}, the distribution of T-t is geometrically distributed with parameter p. In other words, (T-t \,|\, T > t) \sim Geo(p).



Example [Waiting for a bus] At a bus stop, the probability that a bus arrives at any given minute is p and is independent from one minute to the next.

  1. What is the expected gap in time between any two consecutive buses?
  2. You arrive at the bus stop and there is no bus there. What is the expected gap between the last time a bus arrived and the next bus to arrive?

Answer. 1. The time from one bus to the next is geometric with parameter p, so the expected wait is 1/p.

2. Given you arrive at a time with no bus, the time since the last bus arrived is geometrically distributed with parameter p, and so is the time until the next bus arrives. The gap between these two bus arrivals is thus the sum of two geometric distributions, and so the expected time is 2/p.

This is sometimes called the waiting time paradox. Here we see that when we turn up at the bus stop the gap between the buses is twice as long as the mean time between the buses. This is because, when we turn up and there is no bus there, we are more likely to have chosen a time within a bigger gap between the buses.
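This size-biasing effect can be seen in a simulation. The sketch below (seed and parameters arbitrary) simulates a bus arriving each minute with probability p = 0.1, so the mean gap is 1/p = 10 minutes, and then measures the gap containing a randomly chosen bus-free minute, which averages close to 2/p = 20.

```python
import bisect
import random

random.seed(2)

p, minutes = 0.1, 500_000
# One Bernoulli(p) trial per minute: does a bus arrive this minute?
has_bus = [random.random() < p for _ in range(minutes)]
arrivals = [t for t, b in enumerate(has_bus) if b]

# Mean gap between consecutive buses: close to 1/p = 10.
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
mean_gap = sum(gaps) / len(gaps)

# Gap containing a randomly chosen bus-free minute: close to 2/p = 20,
# because a bus-free minute is more likely to land in a long gap.
free = [t for t in range(arrivals[0] + 1, arrivals[-1]) if not has_bus[t]]
covering = []
for t in random.sample(free, 20_000):
    i = bisect.bisect_left(arrivals, t)  # index of next bus after t
    covering.append(arrivals[i] - arrivals[i - 1])
mean_covering = sum(covering) / len(covering)

print(mean_gap, mean_covering)
```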

Poisson Distribution.

The Poisson distribution arises when we count the number of successes of an unlikely event over a large population. This occurs in all manner of settings, from nuclear decay, to insurance, to calls over a telephone line.

We present a definition first and then we will motivate the Poisson distribution.

Definition [Poisson distribution] For a parameter \lambda >0, the Poisson distribution has probability mass function

\mathbb P (X = k) = \frac{\lambda^k}{k!} e^{-\lambda}

for k=0,1,2,..., and we write X \sim Po(\lambda).

Motivation for Poisson Distribution. If we take a Binomial distribution where the number of trials n is large but the probability of success in each trial is small, specifically p=\lambda /n, then the Binomial distribution is well approximated by a Poisson distribution.

This is the reason the Poisson distribution is a reasonable distribution to represent phenomena like nuclear decay. In nuclear decay, there are a large number of atoms in a radioactive substance, and, in any given time interval, there is a very small probability of any one of these atoms undergoing nuclear decay and emitting a particle (e.g. a gamma-ray). For this reason the distribution of the number of observed gamma-rays over a time interval is well approximated by a Poisson distribution.

The following theorem sets out how the Poisson distribution approximates the Binomial distribution (again, students primarily interested in assessment can skip this argument).

Theorem 1 [Binomial to Poisson Limit] Consider a sequence of Binomial random variables X^{(n)} \sim \text{Bin}(n , \frac{\lambda}{n} ) for n \in \mathbb N, and let X \sim Po(\lambda). Then

\lim_{n \rightarrow \infty} \mathbb P \big( X^{(n)} = k \big) = \mathbb P ( X = k ) \qquad \text{for each } k = 0,1,2,....

That is, as n gets large, the probability that X^{(n)}=k approaches the probability that X=k, for each k.

Proof. For fixed k,

\mathbb P \big( X^{(n)} = k \big) = C^n_k \Big( \frac{\lambda}{n} \Big)^k \Big( 1 - \frac{\lambda}{n} \Big)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!} \cdot \Big( 1 - \frac{\lambda}{n} \Big)^{n} \Big( 1 - \frac{\lambda}{n} \Big)^{-k}.

As n \rightarrow \infty, the first factor and the last factor both converge to 1, while \big( 1 - \frac{\lambda}{n} \big)^{n} \rightarrow e^{-\lambda}. Thus

\mathbb P \big( X^{(n)} = k \big) \longrightarrow \frac{\lambda^k}{k!} e^{-\lambda} = \mathbb P ( X = k ). \qquad \square
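The rate of this convergence can be seen numerically. The sketch below compares the \text{Bin}(n, \lambda/n) mass function at a fixed k with the Po(\lambda) limit as n grows (the choices λ = 2 and k = 3 are arbitrary).

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, k = 2.0, 3
limit = poisson_pmf(k, lam)
for n in (10, 100, 1000, 10_000):
    approx = binom_pmf(k, n, lam / n)
    print(n, approx, limit)  # approx approaches limit as n grows
```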

Now for some more standard facts about the Poisson distribution.

If X \sim Po(\lambda) then

\mathbb E [X] = \lambda \qquad \text{and} \qquad \text{Var}(X) = \lambda.

Lemma 6 [Poisson Summation Property] If X \sim Po(\lambda), Y \sim Po (\mu), and X and Y are independent, then

X + Y \sim Po(\lambda + \mu).

Lemma 7 [Poisson Thinning Property] If N \sim Po(\lambda) and, independent of N, we let X_1,X_2,... be independent Bernoulli random variables with parameter p, then

\sum_{i=1}^{N} X_i \sim Po(\lambda p).

In Lemma 6, we can begin to see how we can think of a Poisson distribution as part of a process that evolves in time. For instance, if the number of calls on a set of telephone lines in each minute is Poisson distributed with mean 4, then the number of calls per hour is Poisson with mean 4\times 60 = 240.

In Lemma 7, we can see that if we exclude points according to an independent random variable then the resulting random variable is still Poisson. This is useful for instance in insurance. Here the number of claims an insurance company receives in a given day might be Poisson with mean 20. The company might split the claims into big and small claims (say on average half the claims are big and half small). Since there is some fixed probability that each claim is, say, big then the resulting number of big claims is Poisson mean 10. This is useful for an insurance company as they can divide up, reinsure or resell some of their risk.
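The insurance example can be simulated directly. The sketch below (seed and sample sizes arbitrary) draws Po(20) claim counts using the classic Knuth product-of-uniforms sampler, keeps each claim independently with probability 1/2, and checks that the thinned counts have mean and variance close to 10, as Lemma 7 predicts for a Po(10) random variable.

```python
import math
import random

random.seed(3)

def poisson(lam):
    """Knuth's product-of-uniforms sampler for Po(lam)."""
    threshold = math.exp(-lam)
    count, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod < threshold:
            return count
        count += 1

lam, p, trials = 20.0, 0.5, 20_000
thinned = []
for _ in range(trials):
    claims = poisson(lam)
    # Keep each claim independently with probability p (the "big" claims).
    thinned.append(sum(1 for _ in range(claims) if random.random() < p))

mean = sum(thinned) / trials
var = sum((x - mean) ** 2 for x in thinned) / trials
print(mean, var)  # both close to lam * p = 10 for a Poisson
```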

Both lemmas can be proved directly by summing over the joint distributions, but this is a bit of a messy calculation. Intuitively, the above lemmas hold because equivalent results, Lemma 2 and Lemma 3, hold for Binomial random variables, so both properties persist when we take the limit to a Poisson random variable (as in Theorem 1). The cleanest proof (using moment generating functions) is beyond the scope of this course, so we omit the proofs for now.
