Here's a nice example of using the EM algorithm.
Suppose we have two coins numbered 0 and 1, each with unknown bias, so the probability of seeing a head when each one is tossed is $p_0$ and $p_1$ respectively.
Suppose we have the results of 5 experiments: in each experiment, one of the two coins is chosen at random and then tossed 10 times.
We are told the results of the tosses, but not which coin was tossed in each experiment.
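To make the setup concrete, here's a minimal sketch of the data-generating process (purely illustrative - the "true" bias values and the probability of choosing coin 0 used here are made up, and nothing below relies on them):

import random

# Illustrative "true" values for this sketch only.
TRUE_P0, TRUE_P1 = 0.25, 0.75  # biases of coin 0 and coin 1
TRUE_LAMBDA = 0.4              # probability of choosing coin 0 in an experiment

def run_experiment(num_tosses=10):
    # Pick a coin at random, toss it num_tosses times, report the head count.
    bias = TRUE_P0 if random.random() < TRUE_LAMBDA else TRUE_P1
    return sum(random.random() < bias for _ in range(num_tosses))

# Five experiments: all we get to see are these head counts,
# not which coin produced each one.
z_simulated = [run_experiment() for _ in range(5)]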
So our observed data $z$ is the results of all the coin tosses. This can be condensed into the number of heads thrown in each experiment
$$ z = [z_0, z_1, z_2, z_3, z_4], 0 \le z_i \le 10$$Our hidden data $x$ is which coin was tossed in which experiment
$$ x = [x_0, x_1, x_2, x_3, x_4], x_i \in \{0, 1\}$$And our $\theta$ is a vector consisting of all the unknown parameters - both coin biases and the probability of choosing coin 0
$$\theta = [p_0, p_1, \lambda]$$We're going to use the EM algorithm to estimate these parameters. At each stage we'll have a current best guess of the parameters
$$\theta^{(t)} = [p_0^{(t)}, p_1^{(t)}, \lambda^{(t)}]$$and we'll use the EM algorithm formulas above to make a new guess
$$\theta^{(t+1)} = [p_0^{(t+1)}, p_1^{(t+1)}, \lambda^{(t+1)}]$$Let's calculate $P(x | z, \theta^{(t)})$. This is the probability of each of our $x_i$, given the observed data and our current best guess of the parameters.
I'm going to stop writing the dependence on $\theta^{(t)}$ explicitly for a moment, to make the notation a bit easier to read. Instead I'll write
$$P_t( \dots )$$to mean
$$P( \dots |\ \theta^{(t)} )$$We can consider each experiment to be independent, so we have
$$ P(x | z, \theta^{(t)}) = P_t(x | z) = \prod_{i=1}^5 P_t(x_i | z_i) $$and
\begin{align*} P_t(x_i | z_i) = \frac{ P_t(x_i, z_i)}{P_t(z_i)} \\ \end{align*}by applying Bayes' rule (keeping everything conditional on $\theta^{(t)}$). Each $x_i$ can either be 0 (if coin 0 was chosen for experiment $i$) or 1 (if coin 1 was chosen).
Since we only have two coins, the expression above is simple to break down further.
\begin{align*} \frac{ P_t(x_i, z_i)}{P_t(z_i)} & = \frac{ P_t(x_i, z_i)}{\sum_{j=0,1} P_t(x_j, z_i)} \\ & = \frac{ P_t(z_i | x_i) P_t( x_i ) }{ \sum_{j=0,1} P_t(z_i | x_j) P_t(x_j) } \end{align*}We can now start to write down what these probabilities actually are. We know that, given our current best guess of the parameters $\theta^{(t)}$
For any $i$ \begin{align*} P_t(x_i = 0) & = \lambda^{(t)} \\ P_t(x_i = 1) & = 1 - \lambda^{(t)} \end{align*}
Each $P_t(z_i | x_j)$ term is simply a likelihood from a binomial distribution, assuming the coin thrown, 0 or 1, is known (since the probability is conditional on $x_j$).
If $x_j = 0$, coin 0 was chosen, and we use $p_0$ in our binomial pmf. If $x_j = 1$, coin 1 was chosen, and we use $p_1$ in our binomial pmf. In either case, we observed $z_i$ heads in experiment $i$, and the total number of coin tosses is $n = 10$. So the pmf is
$$ P_t(z_i | x_j) = \binom{10}{z_i} \left(p_{x_j}^{(t)} \right)^{z_i} \left( 1 - p_{x_j}^{(t)} \right)^{10 - z_i} $$All of the above working is actually a fairly simple application of Bayes' rule, with everything being conditional on the current best guess of the parameters, $\theta^{(t)}$
Putting this all together, we can write down our expression for $P_t(x_i | z_i)$.
\begin{align*} P_t(x_i = 0 | z_i) & = \frac{ \binom{10}{z_i} \left(p_0^{(t)} \right)^{z_i} \left( 1 - p_0^{(t)} \right)^{10 - z_i} \lambda^{(t)} }{ \lambda^{(t)} \binom{10}{z_i} \left(p_0^{(t)} \right)^{z_i} \left( 1 - p_0^{(t)} \right)^{10 - z_i} + (1 - \lambda^{(t)}) \binom{10}{z_i} \left(p_1^{(t)} \right)^{z_i} \left( 1 - p_1^{(t)} \right)^{10 - z_i} } \end{align*}and we know $$P_t(x_i = 1 | z_i) = 1 - P_t(x_i = 0 | z_i)$$
The big fraction above looks a bit complicated, but you can see the same elements repeated all over the place. In code we can work this out trivially using a function to return the Binomial pmf. Note that all the elements of the equation are known values based on the observed data $z$ and our current best guess of the parameters $\theta^{(t)}$.
In words, $P_t(x_i = 0 | z_i)$ is the probability that the coin used in experiment $i$ is coin 0, based on the observed data (the result of experiment $i$, $z_i$) and the current best guess of the parameters.
We're dealing with a lot of probabilities here, so to try to make the notation a little clearer, let's relabel these probabilities as "weights".
Let's relabel $P_t(x_i = 0 | z_i)$ as $w_{i,0}^{(t)}$ and $P_t(x_i = 1 | z_i)$ as $w_{i,1}^{(t)}$.
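As a quick illustration (a small sketch of my own, anticipating the fuller implementation later in this post), here is how one of these weights could be computed with scipy's binomial pmf; the observation and parameter guesses are just example values:

from scipy.stats import binom

def weight_coin0(z_i, p0, p1, lam, n=10):
    # w_{i,0}: posterior probability that coin 0 was used in experiment i,
    # given z_i observed heads and the current parameter guesses.
    coin0_term = lam * binom.pmf(z_i, n, p0)
    coin1_term = (1 - lam) * binom.pmf(z_i, n, p1)
    return coin0_term / (coin0_term + coin1_term)

# Example: 3 heads out of 10 tosses, current guesses p0=0.45, p1=0.55, lambda=0.5.
# The result is roughly 0.69, so coin 0 is the more likely explanation here.
w_example = weight_coin0(3, 0.45, 0.55, 0.5)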
We now move on to the maximisation step of the EM algorithm - we will derive new estimates of all the parameters by maximising the function $h_t$, using the weights we just computed.
$$h_t(\theta) = \sum_x P(x | z, \theta^{(t)}) \log P(x, z |\ \theta)$$Now we must write down our expression for $\log P(x, z |\ \theta)$.
We'll work in the same way as we did above when deriving the conditional probability $P(x | z, \theta^{(t)})$, applying Bayes' rule to derive a manageable formula for the probability.
Each experiment is independent, so we have a product of the individual likelihoods, which becomes a sum of the individual log likelihoods when we take the logarithm
\begin{align*} \log P(z, x | \theta) & = \log \left( \prod_i P(z_i, \ x_i | \theta) \right) \\ & = \log \left( \prod_i P(z_i |\ x_i, \theta) P(x_i |\ \theta) \right) \\ & = \sum_i \log \left( P(z_i |\ x_i, \theta) P(x_i |\ \theta) \right) \\ \end{align*}And for each experiment, given $\theta = (p_0, p_1, \lambda)$, we know that
$$P(x_i = 0) = \lambda$$$$P(x_i = 1) = 1 - \lambda$$and
$$ P(z_i | x_j) = \binom{10}{z_i} p_{x_j}^{z_i} \left( 1 - p_{x_j} \right)^{10 - z_i} $$The final expression for $h_t(\theta)$ is
$$h_t(\theta) = \sum_x P(x | z, \theta^{(t)}) \log P(z, x |\ \theta)$$Although the sum over $x$ formally runs over all $2^5$ possible assignments of coins to experiments, the log-likelihood splits into a sum of per-experiment terms, each depending only on $x_i$. So the expectation reduces to a sum, for each experiment $i = 1, \dots, 5$, over the two possibilities $x_i = 0$ and $x_i = 1$, weighted by the posterior probabilities $w_{i,0}^{(t)}$ and $w_{i,1}^{(t)}$. This gives
\begin{align*} h_t(\theta) & = \sum_i w_{i,0}^{(t)} \log \left( \binom{10}{z_i} p_{0}^{z_i} \left( 1 - p_{0} \right)^{10 - z_i} \lambda \right) + \sum_i w_{i,1}^{(t)} \log \left( \binom{10}{z_i} p_{1}^{z_i} \left( 1 - p_{1} \right)^{10 - z_i} (1 - \lambda) \right) \\ & = \sum_i w_{i,0}^{(t)} \left( \log \binom{10}{z_i} + z_i \log p_{0} + (10 - z_i) \log \left( 1 - p_{0} \right) + \log \lambda \right) \\ & + \sum_i w_{i,1}^{(t)} \left( \log \binom{10}{z_i} + z_i \log p_{1} + (10 - z_i) \log \left( 1 - p_{1} \right) + \log (1 - \lambda) \right) \\ \end{align*}To find the maximum of this expression w.r.t. each of the individual elements of $\theta$, we must differentiate it w.r.t. each variable.
In fact, we should partially differentiate the expression w.r.t. each variable and solve the resulting equations simultaneously, but that won't be necessary here: there are no cross-terms, since $p_0$, $p_1$ and $\lambda$ each appear in separate groups of terms in $h_t$, so we can maximise over each parameter independently.
We find
\begin{align*} \frac{d h_t}{d p_0} = \sum_i w_{i,0}^{(t)} \left( \frac{z_i}{p_0} + \frac{-(10 - z_i)}{ 1 - p_{0} } \right) \\ \end{align*}This equals 0 when
\begin{align*} 0 & = \sum_i w_{i,0}^{(t)} \left( \frac{z_i}{p_0} + \frac{-(10 - z_i)}{ 1 - p_{0} } \right) \\ \\ \implies \sum_i w_{i,0}^{(t)} \frac{z_i}{p_0} & = \sum_i w_{i,0}^{(t)} \frac{10 - z_i}{ 1 - p_{0} } \\ \\ \implies \frac{1}{p_0} \sum_i w_{i,0}^{(t)} z_i & = \frac{1}{1 - p_0 } \sum_i w_{i,0}^{(t)} (10 - z_i) \\ \\ \implies \frac{p_0}{1 - p_0 } & = \frac{\sum_i w_{i,0}^{(t)} z_i}{ \sum_i w_{i,0}^{(t)} (10 - z_i) }\\ \\ \implies \frac{1 - p_0 }{p_0} & = \frac{ \sum_i w_{i,0}^{(t)} (10 - z_i) }{\sum_i w_{i,0}^{(t)} z_i}\\ \\ \implies \frac{1}{p_0} - 1 & = \frac{ \sum_i w_{i,0}^{(t)} (10 - z_i) }{\sum_i w_{i,0}^{(t)} z_i}\\ \\ \implies \frac{1}{p_0} - 1 & = \frac{ \sum_i w_{i,0}^{(t)} 10 }{\sum_i w_{i,0}^{(t)} z_i} - 1 \\ \\ \implies p_0 & = \frac{\sum_i w_{i,0}^{(t)} z_i}{ \sum_i w_{i,0}^{(t)} 10 } \\ \end{align*}So we see that the EM-algorithm next-step estimate for $p_0$ is $$ p_0^{(t+1)} = \frac{\sum_i w_{i,0}^{(t)} z_i}{ \sum_i w_{i,0}^{(t)} \cdot 10 } $$that is, the weighted number of heads observed, divided by the weighted number of tosses, where each experiment's contribution is weighted by the probability that it used coin 0.
Looking at the above expression for $h_t$, by symmetry we can write down the update rule for $p_1$:
$$ p_1^{(t+1)} = \frac{\sum_i w_{i,1}^{(t)} z_i}{ \sum_i w_{i,1}^{(t)} \cdot 10 } $$Similarly, the value of $\lambda$ that maximises $h_t$ is found by differentiating w.r.t. $\lambda$.
$$\frac{d h_t}{d \lambda} = \sum_i w_{i,0}^{(t)} \frac{1}{ \lambda } + \sum_i w_{i,1}^{(t)} \frac{-1}{ 1 - \lambda } $$And setting the derivative equal to zero, we find
\begin{align*} 0 & = \sum_i w_{i,0}^{(t)} \frac{1}{ \lambda } + \sum_i w_{i,1}^{(t)} \frac{-1}{ 1 - \lambda } \\ \\ \implies \sum_i w_{i,0}^{(t)} \frac{1}{ \lambda } & = \sum_i w_{i,1}^{(t)} \frac{1}{ 1 - \lambda } \\ \\ \implies \frac{ 1 - \lambda }{ \lambda } & = \frac{\sum_i w_{i,1}^{(t)} }{ \sum_i w_{i,0}^{(t)}} \\ \\ \implies \frac{ 1 }{ \lambda } - 1 & = \frac{\sum_i w_{i,1}^{(t)} }{ \sum_i w_{i,0}^{(t)}} \\ \\ \implies \frac{ 1 }{ \lambda } & = \frac{\sum_i w_{i,1}^{(t)} }{ \sum_i w_{i,0}^{(t)}} + 1 \\ \implies \frac{ 1 }{ \lambda } & = \frac{\sum_i w_{i,1}^{(t)} + \sum_i w_{i,0}^{(t)}}{ \sum_i w_{i,0}^{(t)}} \\ \\ \implies \lambda & = \frac{ \sum_i w_{i,0}^{(t)}}{\sum_i w_{i,1}^{(t)} + \sum_i w_{i,0}^{(t)}} \\ \\ \implies \lambda & = \frac{ \sum_i w_{i,0}^{(t)}}{ \sum_i 1} \\ \\ \implies \lambda & =\frac{ \sum_i w_{i,0}^{(t)}}{ 5 } \\ \end{align*}(the step replacing $\sum_i w_{i,1}^{(t)} + \sum_i w_{i,0}^{(t)}$ with $\sum_i 1$ uses the fact that $w_{i,0}^{(t)} + w_{i,1}^{(t)} = 1$ for every $i$). So we see that the next-step estimate $\lambda^{(t+1)}$ is the sum of the posterior probabilities that coin 0 was used in each experiment, divided by 5 (the number of experiments run).
I hope you'll agree that this intuitively makes a lot of sense as an update.
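One way to see why (a check I'm adding here, not part of the derivation above): if the weights were all exactly 0 or 1 - that is, if we somehow knew which coin was used in each experiment - the updates would reduce to the obvious maximum likelihood estimates
$$ p_0^{(t+1)} = \frac{\sum_{i : x_i = 0} z_i}{10 \cdot \#\{i : x_i = 0\}}, \qquad \lambda^{(t+1)} = \frac{\#\{i : x_i = 0\}}{5} $$
i.e. the fraction of heads among the tosses made with coin 0, and the fraction of experiments that used coin 0. The EM updates are the "soft" versions of these, with each experiment contributing in proportion to the probability that coin 0 was used in it.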
Some of the formulas and derivations above might seem rather long and complicated.
But the approach isn't as hard as it might seem, and often seeing things done in code clarifies it. Here we'll implement some functions for applying the formulas above, allowing us to apply the EM algorithm to the coin flip problem described.
from scipy.stats import binom
NUM_EXPERIMENTS = 5
NUM_TOSSES = 10
# The results: the number of heads observed in each experiment
z = [7, 8, 3, 7, 2]
# Our parameters: (p0, p1, lambda)
theta = (0.45, 0.55, 0.5)
# Note that it's important to break symmetry when
# setting up our initial guesses for parameters.
#
# If we started with both p0 and p1 as the same
# values, the EM algorithm wouldn't be able to
# split them when applying its formulas (since
# likelihoods for coin 0 and coin 1 would be
# identical throughout) and even after updating
# we would never get different p0 and p1 values.
#
# Often in practice we will initialise our guesses
# to random values to avoid this issue.
def binomial_pmf(n, z, p):
return binom.pmf(z, n, p)
def P_x_i_given_z_i_and_theta(x_i, z_i, theta):
"""
P(x_i | z_i, theta)
This is the probability of the i'th coin
being `x_i`, given the observation `z_i`
and parameters `theta`.
This is really just a simple application of
Bayes' rule.
"""
assert x_i in (0, 1) # x_i indicates which coin was used
numer = P_z_i_given_x_i_and_theta(z_i, x_i, theta) * P_x_i_given_theta(x_i, theta)
denom = (
P_z_i_given_x_i_and_theta(z_i, 0, theta) * P_x_i_given_theta(0, theta) +
P_z_i_given_x_i_and_theta(z_i, 1, theta) * P_x_i_given_theta(1, theta)
)
p = numer / denom
return p
def P_z_i_given_x_i_and_theta(z_i, x_i, theta):
"""
P(z_i | x_i, theta)
This is the probability of the i'th coin
showing `z_i` heads, given that it is coin
`x_i` and parameters `theta`.
"""
assert x_i in (0, 1) # x_i indicates which coin was used
# x_i tells us which coin we're using
# and therefore which probability to
# pick out from our parameters vector.
p0, p1, lamb = theta
p = p0 if x_i == 0 else p1
return binomial_pmf(NUM_TOSSES, z_i, p)
def P_x_i_given_theta(x_i, theta):
"""
P(x_i | theta)
This is the marginal probability of the i'th
coin being coin `x_i` (0 or 1), without taking
into account any observed data. So it is
determined by our current lambda value.
"""
assert x_i in (0, 1)
p0, p1, lamb = theta
p = lamb if x_i == 0 else 1 - lamb
return p
That's it - that's all the functions we'll need to apply the EM algorithm in this situation.
We just need to implement the update formulas, which are stated in terms of the "weights" $$w_{i,0}^{(t)} = P(x_i = 0 | z_i, \theta^{(t)})$$ and $$w_{i,1}^{(t)} = P(x_i = 1 | z_i, \theta^{(t)})$$
for t in range(10):
print(f't={t}, theta=({theta[0]:.5f}, {theta[1]:.5f}, {theta[2]:.5f})')
# Step one: calculate our weights.
# These are based on the observations and the current
# best estimate of theta.
w_i_0_t = [
P_x_i_given_z_i_and_theta(0, z[i], theta)
for i in range(NUM_EXPERIMENTS)
]
w_i_1_t = [
P_x_i_given_z_i_and_theta(1, z[i], theta)
for i in range(NUM_EXPERIMENTS)
]
# Step two: apply our EM update formulas.
# p0 = P(head when coin 0 is tossed)
numer = sum(w_i_0_t[i] * z[i] for i in range(NUM_EXPERIMENTS))
denom = sum(w_i_0_t[i] * NUM_TOSSES for i in range(NUM_EXPERIMENTS))
new_p0 = numer / denom
# p1 = P(head when coin 1 is tossed)
numer = sum(w_i_1_t[i] * z[i] for i in range(NUM_EXPERIMENTS))
denom = sum(w_i_1_t[i] * NUM_TOSSES for i in range(NUM_EXPERIMENTS))
new_p1 = numer / denom
# lambda = P(coin 0 chosen for each experiment)
new_lambda = sum(w_i_0_t[i] for i in range(NUM_EXPERIMENTS)) / NUM_EXPERIMENTS
# Update our parameters ready for the next iteration
theta = (new_p0, new_p1, new_lambda)
print()
print(f't={t+1}, theta=({theta[0]:.5f}, {theta[1]:.5f}, {theta[2]:.5f})')
t=0, theta=(0.45000, 0.55000, 0.50000)
t=1, theta=(0.42385, 0.63970, 0.46189)
t=2, theta=(0.33388, 0.70400, 0.44309)
t=3, theta=(0.27241, 0.72693, 0.41127)
t=4, theta=(0.25603, 0.72888, 0.39945)
t=5, theta=(0.25376, 0.72870, 0.39732)
t=6, theta=(0.25349, 0.72862, 0.39698)
t=7, theta=(0.25345, 0.72860, 0.39693)
t=8, theta=(0.25345, 0.72859, 0.39692)
t=9, theta=(0.25345, 0.72859, 0.39692)

t=10, theta=(0.25345, 0.72859, 0.39692)
If we look at these numbers, they make good intuitive sense.
The algorithm has estimated that $p_0 \approx 0.25$, $p_1 \approx 0.73$ and $\lambda \approx 0.40$.
Looking at our experiments, we have $z = [7, 8, 3, 7, 2]$: three experiments produced 7 or 8 heads out of 10 (consistent with the higher-bias coin 1), while two produced only 2 or 3 heads (consistent with the lower-bias coin 0).
So it makes sense that the estimate of $\lambda$ is around 0.4, because it looks like 2 out of our 5 experiments were performed with coin 0.
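As a final back-of-the-envelope check (my own comparison, not part of the run above): if we simply attributed the two low-head-count experiments (3 and 2 heads) to coin 0 and the other three to coin 1, we would estimate
$$ p_0 \approx \frac{3 + 2}{20} = 0.25, \qquad p_1 \approx \frac{7 + 8 + 7}{30} \approx 0.733, \qquad \lambda \approx \frac{2}{5} = 0.4 $$
which is very close to the EM estimates $(0.25345, 0.72859, 0.39692)$. The small differences arise because the EM weights are soft probabilities rather than hard 0/1 assignments.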