
Example of the EM algorithm: estimating coin biases

Here's a nice example of using the EM algorithm.

Suppose we have two coins numbered 0 and 1, each with unknown bias, so the probability of seeing a head when each one is tossed is $p_0$ and $p_1$ respectively.

Suppose we have the results of 5 experiments: in each experiment, one of the two coins is chosen at random (coin 0 with probability $\lambda$) and tossed 10 times, and we record the results of the tosses.

We are told the results of the tosses, but not which coin was tossed in each experiment.

So our observed data $z$ is the results of all the coin tosses. This can be condensed into the number of heads thrown in each experiment

$$ z = [z_0, z_1, z_2, z_3, z_4], 0 \le z_i \le 10$$

Our hidden data $x$ is which coin was tossed in which experiment

$$ x = [x_0, x_1, x_2, x_3, x_4], x_i \in \{0, 1\}$$

And our $\theta$ is a vector consisting of all the unknown parameters - both coin biases and the probability of choosing coin 0

$$\theta = [p_0, p_1, \lambda]$$

We're going to use the EM algorithm to estimate these parameters. At each stage we'll have a current best guess of the parameters

$$\theta^{(t)} = [p_0^{(t)}, p_1^{(t)}, \lambda^{(t)}]$$

and we'll use the EM algorithm formulas above to make a new guess

$$\theta^{(t+1)} = [p_0^{(t+1)}, p_1^{(t+1)}, \lambda^{(t+1)}]$$

Step 1: calculate $P(x | z, \theta^{(t)})$

Let's calculate $P(x | z, \theta^{(t)})$. This is the probability of each of our $x_i$, given the observed data and our current best guess of the parameters.

I'm going to stop writing the dependence on $\theta^{(t)}$ explicitly for a moment, to make the notation a bit easier to read. Instead I'll write

$$P_t( \dots )$$

to mean

$$P( \dots |\ \theta^{(t)} )$$

We can consider each experiment to be independent, so we have

$$ P(x | z, \theta^{(t)}) = P_t(x | z) = \prod_{i=0}^{4} P_t(x_i | z_i) $$

and

\begin{align*} P_t(x_i | z_i) = \frac{ P_t(x_i, z_i)}{P_t(z_i)} \\ \end{align*}

by applying Bayes' rule (keeping everything conditional on $\theta^{(t)}$). Each $x_i$ can either be 0 (if coin 0 was chosen for experiment $i$) or 1 (if coin 1 was chosen).

Since we only have two coins, the expression above is simple to break down further.

\begin{align*} \frac{ P_t(x_i, z_i)}{P_t(z_i)} & = \frac{ P_t(x_i, z_i)}{\sum_{j=0,1} P_t(x_i = j,\ z_i)} \\ & = \frac{ P_t(z_i |\ x_i) P_t( x_i ) }{ \sum_{j=0,1} P_t(z_i |\ x_i = j)\, P_t(x_i = j) } \end{align*}

We can now start to write down what these probabilities actually are. We know that, given our current best guess of the parameters $\theta^{(t)}$:

For any $i$ \begin{align*} P_t(x_i = 0) & = \lambda^{(t)} \\ P_t(x_i = 1) & = 1 - \lambda^{(t)} \end{align*}

Each $P_t(z_i |\ x_i = j)$ term is simply a likelihood from a binomial distribution, because the coin tossed, 0 or 1, is known (since the probability is conditional on the value of $x_i$).

If $x_i = 0$, coin 0 was chosen, and we use $p_0^{(t)}$ in our binomial pmf. If $x_i = 1$, coin 1 was chosen, and we use $p_1^{(t)}$. In either case, we observed $z_i$ heads in experiment $i$, and the total number of coin tosses is $n = 10$. So the pmf is

$$ P_t(z_i |\ x_i = j) = \binom{10}{z_i} \left(p_{j}^{(t)} \right)^{z_i} \left( 1 - p_{j}^{(t)} \right)^{10 - z_i} $$

All of the above working is actually a fairly simple application of Bayes' rule, with everything being conditional on the current best guess of the parameters, $\theta^{(t)}$

Putting this all together, we can write down our expression for $P_t(x_i |\ z_i)$.

\begin{align*} P_t(x_i = 0 | z_i) & = \frac{ \lambda^{(t)} \binom{10}{z_i} \left(p_0^{(t)} \right)^{z_i} \left( 1 - p_0^{(t)} \right)^{10 - z_i} }{ \lambda^{(t)} \binom{10}{z_i} \left(p_0^{(t)} \right)^{z_i} \left( 1 - p_0^{(t)} \right)^{10 - z_i} + (1 - \lambda^{(t)}) \binom{10}{z_i} \left(p_1^{(t)} \right)^{z_i} \left( 1 - p_1^{(t)} \right)^{10 - z_i} } \end{align*}

and we know $$P_t(x_i = 1 | z_i) = 1 - P_t(x_i = 0 | z_i)$$

The big fraction above looks a bit complicated, but you can see the same elements repeated all over the place. In code we can work this out trivially using a function to return the Binomial pmf. Note that all the elements of the equation are known values based on the observed data $z$ and our current best guess of the parameters $\theta^{(t)}$.

In words, $P_t(x_i = 0 | z_i)$ is the probability that the coin used in experiment $i$ is coin 0, based on the observed data (the result of experiment $i$, $z_i$) and the current best guess of the parameters.

We're dealing with a lot of probabilities here, so to try to make the notation a little clearer, let's relabel these probabilities as "weights".

Let's relabel $P_t(x_i = 0 | z_i)$ as $w_{i,0}^{(t)}$ and $P_t(x_i = 1 | z_i)$ as $w_{i,1}^{(t)}$
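To make this concrete, here is a minimal sketch of the weight calculation for a single experiment in Python, using scipy's binomial pmf. The numbers and variable names are made up purely for illustration; they are not the data from the experiments discussed here.

```python
from scipy.stats import binom

# Hypothetical values, purely for illustration (not the actual experiment data)
z_i = 8                  # heads observed in experiment i (out of 10 tosses)
p0_t, p1_t = 0.6, 0.5    # current guesses of the two coin biases
lam_t = 0.5              # current guess of the probability of choosing coin 0

# w_{i,0}^{(t)} = P_t(x_i = 0 | z_i), by Bayes' rule
numerator = lam_t * binom.pmf(z_i, 10, p0_t)
denominator = numerator + (1 - lam_t) * binom.pmf(z_i, 10, p1_t)
w_i0 = numerator / denominator
w_i1 = 1 - w_i0
```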

Step 2: maximise $h_t(\theta)$

We now move on to the maximisation step - we will derive updated estimates of all the parameters by maximising the function $h_t$ from the EM algorithm formulas above.

$$h_t(\theta) = \sum_x P(x | z, \theta^{(t)}) \log P(x, z |\ \theta)$$

Now we must write down our expression for $\log P(x, z |\ \theta)$.

We'll work in the same way as we did above when deriving the conditional probability $P(x | z, \theta^{(t)})$, applying Bayes' rule to derive a manageable formula for the probability.

Each experiment is independent, so we have a product of the individual likelihoods, which becomes a sum of the individual log likelihoods when we take the logarithm

\begin{align*} \log P(z, x | \theta) & = \log \left( \prod_i P(z_i, \ x_i | \theta) \right) \\ & = \log \left( \prod_i P(z_i |\ x_i, \theta) P(x_i |\ \theta) \right) \\ & = \sum_i \log \left( P(z_i |\ x_i, \theta) P(x_i |\ \theta) \right) \\ \end{align*}

And for each experiment, given $\theta = (p_0, p_1, \lambda)$, we know that

\begin{align*} P(x_i = 0) & = \lambda \\ P(x_i = 1) & = 1 - \lambda \end{align*}

and

$$ P(z_i |\ x_i = j) = \binom{10}{z_i} p_{j}^{z_i} \left( 1 - p_{j} \right)^{10 - z_i} $$

The final expression for $h_t(\theta)$ is

$$h_t(\theta) = \sum_x P(x | z, \theta^{(t)}) \log P(z, x |\ \theta)$$

And the sum over all possible values of $x$ reduces to a sum, for each experiment $i = 0, \dots, 4$, over the two possibilities $x_i = 0$ and $x_i = 1$, weighted by $w_{i,0}^{(t)}$ and $w_{i,1}^{(t)}$ respectively.

\begin{align*} & = \sum_i w_{i,0}^{(t)} \log \left( \binom{10}{z_i} p_{0}^{z_i} \left( 1 - p_{0} \right)^{10 - z_i} \lambda \right) + \sum_i w_{i,1}^{(t)} \log \left( \binom{10}{z_i} p_{1}^{z_i} \left( 1 - p_{1} \right)^{10 - z_i} (1 - \lambda) \right) \\ & = \sum_i w_{i,0}^{(t)} \left( \log \binom{10}{z_i} + z_i \log p_{0} + (10 - z_i) \log \left( 1 - p_{0} \right) + \log \lambda \right) \\ & + \sum_i w_{i,1}^{(t)} \left( \log \binom{10}{z_i} + z_i \log p_{1} + (10 - z_i) \log \left( 1 - p_{1} \right) + \log (1 - \lambda) \right) \\ \end{align*}

To find the maximum of this expression, we must differentiate it w.r.t. each of the individual elements of $\theta$.

In fact, we should partially differentiate the expression w.r.t. each variable, and solve the resulting equations simultaneously, but that won't be necessary here since there are no "cross-terms".
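As a side note, $h_t$ is also easy to evaluate numerically, which gives a handy sanity check on the closed-form updates we're about to derive. Here is a rough sketch (the function name and arguments are my own choices) that computes $h_t(\theta)$ for given weights, observed head counts, and candidate parameters:

```python
import numpy as np
from scipy.stats import binom

def h_t(theta, z, w0, n=10):
    """Evaluate h_t(theta) given observed head counts z and the weights w0,
    where w0[i] = P(x_i = 0 | z_i, theta^(t)); the coin-1 weight is 1 - w0[i]."""
    p0, p1, lam = theta
    z = np.asarray(z)
    w0 = np.asarray(w0)
    w1 = 1 - w0
    # binom.logpmf gives log( C(n, z) p^z (1-p)^(n-z) ), matching the expansion above
    term0 = w0 * (binom.logpmf(z, n, p0) + np.log(lam))
    term1 = w1 * (binom.logpmf(z, n, p1) + np.log(1 - lam))
    return np.sum(term0 + term1)
```

One could, for example, check that perturbing the parameters returned by the update formulas derived below never increases $h_t$. With that in hand, back to maximising $h_t$ analytically.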

We find

\begin{align*} \frac{\partial h_t}{\partial p_0} = \sum_i w_{i,0}^{(t)} \left( \frac{z_i}{p_0} + \frac{-(10 - z_i)}{ 1 - p_{0} } \right) \\ \end{align*}

This equals 0 when

\begin{align*} 0 & = \sum_i w_{i,0}^{(t)} \left( \frac{z_i}{p_0} + \frac{-(10 - z_i)}{ 1 - p_{0} } \right) \\ \\ \implies \sum_i w_{i,0}^{(t)} \frac{z_i}{p_0} & = \sum_i w_{i,0}^{(t)} \frac{10 - z_i}{ 1 - p_{0} } \\ \\ \implies \frac{1}{p_0} \sum_i w_{i,0}^{(t)} z_i & = \frac{1}{1 - p_0 } \sum_i w_{i,0}^{(t)} (10 - z_i) \\ \\ \implies \frac{p_0}{1 - p_0 } & = \frac{\sum_i w_{i,0}^{(t)} z_i}{ \sum_i w_{i,0}^{(t)} (10 - z_i) }\\ \\ \implies \frac{1 - p_0 }{p_0} & = \frac{ \sum_i w_{i,0}^{(t)} (10 - z_i) }{\sum_i w_{i,0}^{(t)} z_i}\\ \\ \implies \frac{1}{p_0} - 1 & = \frac{ \sum_i w_{i,0}^{(t)} (10 - z_i) }{\sum_i w_{i,0}^{(t)} z_i}\\ \\ \implies \frac{1}{p_0} - 1 & = \frac{ \sum_i w_{i,0}^{(t)} 10 }{\sum_i w_{i,0}^{(t)} z_i} - 1 \\ \\ \implies p_0 & = \frac{\sum_i w_{i,0}^{(t)} z_i}{ \sum_i w_{i,0}^{(t)} 10 } \\ \end{align*}

So we see that the EM-algorithm next-step estimate for $p_0$ is the weighted sum of the heads observed in each experiment, divided by the weighted total number of tosses:

$$ p_0^{(t+1)} = \frac{\sum_i w_{i,0}^{(t)} z_i}{ \sum_i w_{i,0}^{(t)} \cdot 10 } $$

Looking at the above expression for $h_t$, by symmetry we can write down the update rule for $p_1$:

$$ p_1^{(t+1)} = \frac{\sum_i w_{i,1}^{(t)} z_i}{ \sum_i w_{i,1}^{(t)} \cdot 10 } $$

Similarly, the value of $\lambda$ that maximises $h_t$ is found by differentiating w.r.t. $\lambda$.

$$\frac{\partial h_t}{\partial \lambda} = \sum_i w_{i,0}^{(t)} \frac{1}{ \lambda } + \sum_i w_{i,1}^{(t)} \frac{-1}{ 1 - \lambda } $$

And setting the derivative equal to zero, we find

\begin{align*} 0 & = \sum_i w_{i,0}^{(t)} \frac{1}{ \lambda } + \sum_i w_{i,1}^{(t)} \frac{-1}{ 1 - \lambda } \\ \\ \implies \sum_i w_{i,0}^{(t)} \frac{1}{ \lambda } & = \sum_i w_{i,1}^{(t)} \frac{1}{ 1 - \lambda } \\ \\ \implies \frac{ 1 - \lambda }{ \lambda } & = \frac{\sum_i w_{i,1}^{(t)} }{ \sum_i w_{i,0}^{(t)}} \\ \\ \implies \frac{ 1 }{ \lambda } - 1 & = \frac{\sum_i w_{i,1}^{(t)} }{ \sum_i w_{i,0}^{(t)}} \\ \\ \implies \frac{ 1 }{ \lambda } & = \frac{\sum_i w_{i,1}^{(t)} }{ \sum_i w_{i,0}^{(t)}} + 1 \\ \implies \frac{ 1 }{ \lambda } & = \frac{\sum_i w_{i,1}^{(t)} + \sum_i w_{i,0}^{(t)}}{ \sum_i w_{i,0}^{(t)}} \\ \\ \implies \lambda & = \frac{ \sum_i w_{i,0}^{(t)}}{\sum_i w_{i,1}^{(t)} + \sum_i w_{i,0}^{(t)}} \\ \\ \implies \lambda & = \frac{ \sum_i w_{i,0}^{(t)}}{ \sum_i 1} \\ \\ \implies \lambda & =\frac{ \sum_i w_{i,0}^{(t)}}{ 5 } \\ \end{align*}

So we see that the next-step estimate $\lambda^{(t+1)}$ is the sum of the weights $w_{i,0}^{(t)}$ (the probabilities that coin 0 was used in each experiment), divided by 5 (the number of experiments run).

I hope you'll agree that this intuitively makes a lot of sense as an update.

Seeing the formulas above in action

Some of the formulas and derivations above might seem rather long and complicated.

But the approach isn't as hard as it might seem, and often seeing things done in code makes it clearer. Here we'll implement some functions for applying the formulas above, allowing us to apply the EM algorithm to the coin flip problem described.
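As a minimal sketch (the function names and signatures here are my own choices), we need a function returning the binomial pmf and a function that carries out Step 1, computing the weight $w_{i,0}^{(t)}$ for every experiment:

```python
import numpy as np
from scipy.stats import binom

N_TOSSES = 10  # each experiment consists of 10 tosses of the chosen coin

def binomial_pmf(z, n, p):
    """Probability of observing z heads in n tosses of a coin with bias p."""
    return binom.pmf(z, n, p)

def e_step(z, p0, p1, lam):
    """Step 1: compute w_{i,0} = P(x_i = 0 | z_i, theta) for every experiment,
    i.e. the posterior probability that coin 0 was used, given the current
    parameter guesses (p0, p1, lam). The coin-1 weight is just 1 - w_{i,0}."""
    z = np.asarray(z)
    num = lam * binomial_pmf(z, N_TOSSES, p0)
    den = num + (1 - lam) * binomial_pmf(z, N_TOSSES, p1)
    return num / den
```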

That's it - that's all the functions we'll need to apply the EM algorithm in this situation.

We just need to implement the update formulas, which are stated in terms of the "weights" $$w_{i,0}^{(t)} = P(x_i = 0 | z_i, \theta^{(t)})$$ and $$w_{i,1}^{(t)} = P(x_i = 1 | z_i, \theta^{(t)})$$
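A sketch of how the update formulas (Step 2) and the overall EM loop might look, building on the `e_step` function above; again the names are my own, and the data in the commented usage line is a placeholder rather than the actual experimental results:

```python
import numpy as np

def m_step(z, w0, n=10):
    """Step 2: apply the update formulas, returning new estimates of
    (p0, p1, lambda) from the observed head counts z and the weights w0."""
    z = np.asarray(z)
    w1 = 1 - w0
    p0 = np.sum(w0 * z) / np.sum(w0 * n)
    p1 = np.sum(w1 * z) / np.sum(w1 * n)
    lam = np.sum(w0) / len(z)
    return p0, p1, lam

def run_em(z, p0, p1, lam, n_iter=100):
    """Alternate Step 1 and Step 2 from an initial guess of the parameters."""
    for _ in range(n_iter):
        w0 = e_step(z, p0, p1, lam)
        p0, p1, lam = m_step(z, w0)
    return p0, p1, lam

# Placeholder usage (made-up head counts, not the experiments discussed below):
# p0_hat, p1_hat, lam_hat = run_em(z=[3, 8, 2, 9, 7], p0=0.3, p1=0.7, lam=0.5)
```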

Discussion

If we look at these numbers, they make good intuitive sense.

The algorithm has estimated that

Looking at our experiments, we have

So it makes sense that the estimate of $\lambda$ is around 0.4, because it looks like 2 out of our 5 experiments were performed with coin 0.