Expectation maximisation is a really clever technique for estimating the most likely values of some parameters, in a situation where we have both observed and unobserved (hidden) data.
This notebook aims to lay out the mathematics behind the algorithm as clearly as possible.
Suppose we have
I like this labelling of variables because the word "observed" contains a "z" sound. It makes it a bit easier to remember what is what in the equations.
A sensible estimate of the parameters $\hat{\theta}$ would be the parameters that maximise the log likelihood of the observed data $x$.
That is, we would like find the parameters that maximise
$$\log P(z |\ \theta)$$Note that for any distribution $Q(x)$ on $x$, the unobserved variables, and any particular value of $\theta$, $\theta_p$ say, we have
\begin{align} & \log P(z |\ \theta_p) \\ \nonumber & = \log \left( \sum_x P(z, x\ |\ \theta_p) \right) \\ & = \log \left( \sum_x Q(x) \frac{ P(z, x\ |\ \theta_p)}{Q(x)} \right) \\ & \ge \sum_x Q(x) \log \left( \frac{ P(z, x\ |\ \theta_p)}{Q(x)} \right) \\ \end{align}The last inequality here holds via Jensen's inequality, since the penultimate line is the logarithm of an average (where the average is over the distribution of the unobserved variables $Q(x)$). Since log is a concave function, the average of the logarithm'ed quantity is less than or equal to the logarithm of the average of the quantity.
In particular, the above inequality holds when the distribution $Q(x)$ is the distribution of $x$ given the observed data $z$ and any parameters $\theta$, $P(x | z, \theta)$.
And in fact, it holds with equality in the case when $\theta = \theta_p$, since the expression in the last line above (on the right hand side of the inequality) is
\begin{align*} & \sum_x P(x | z, \theta_p) \log \left( \frac{ P(z, x|\ \theta_p)}{P(x | z, \theta_p)} \right) \\ & = \sum_x P(x |\ z, \theta_p) \log P(z|\ \theta_p) \\ & = \log P(z|\ \theta_p) \sum_x P(x | z, \theta_p) \\ & = \log P(z|\ \theta_p) \end{align*}To move between the first and second lines above, we have simply applied Bayes' theorem: $$P(z, x|\ \theta_p) = P(x | z, \theta_p) P(z | \theta_p)$$
Since the inequality above holds for any particular value of our parameters, $\theta_p$, we can also say that it holds in general for $\log P(x |\ \theta)$ as a function of $\theta$.
If we define a function $f(\theta)$ to be
$$ f(\theta) = \sum_x Q(x) \log \left( \frac{ P(z, x\ |\ \theta)}{Q(x)} \right) $$then that function is a lower bound for the function
$$ \log P(z |\ \theta) $$which is the log likelihood we'd like to maximise.
The above all seems a bit abstract right now, but I hope it is fairly clear - we haven't used any advanced mathematics to derive the results above.
Based on what we've shown, we can now define an iterative update procedure for finding a best estimate of $\theta$.
Suppose we have a current set of parameters, our current best guess, $\theta^{(t)}$. We wish to form a better guess from these parameters.
We can do so as follows: let
$$ \theta^{(t + 1)} = \arg\!\max_{\theta}\ g_t(\theta) $$where $$ g_t(\theta) = \sum_x P(x | z, \theta^{(t)}) \log \left( \frac{ P(z, x |\ \theta)}{P(x | z, \theta^{(t)})} \right) \\ $$
This looks a bit complicated. Let's break it down in words.
For our next value of $\theta$, we choose the value of $\theta$ which maximises the expression shown for $g_t(\theta)$
When calculating the expression, the numerator in the fraction depends on $\theta$ (i.e. the parameters we are maximising over) but the other terms use only the current value of $\theta^{(t)}$
That second bullet point is quite important. The terms involving $\theta^{(t)}$ don't vary as we vary $\theta$ when we perform the maximisation. They're fixed because they're calculated from $\theta^{(t)}$, our current best guess (and the observed data $z$, and the hidden data $x$ (although the final expression doesn't depend on $x$, because we sum over all possible values of $x$)).
Then from the definition of the update law, we have
$$ g_t(\theta^{(t)}) \ge g_t(\theta^{(t+1)}) $$since by definition $\theta^{(t+1)}$ is the value that maximises $g_t(\theta)$.
$\theta^{(t)}$ is one possible value of $\theta$, but we maximise over all possible values. So this guarantees that $$ g_t(\theta^{(t)}) \ge g_t(\theta^{(t+1)})$$
Also note that if we put $\theta^{(t)}$ into our function $g$, then we get $\log P(z |\ \theta^{(t)})$.
$$g(\theta^{(t)} = P(z\ |\ \theta^{(t)}$$Also note that by the argument in the Preliminaries section, $g(\theta)$ is a lower bound on $P(z\ |\ \theta)$.
$$g(\theta^{(t+1)} \le P(z\ |\ \theta^{(t+1)}$$So the update rule we have defined allows us to move from a set of parameters $\theta^{(t)}$ to a new set of parameters $\theta^{(t+1)}$.
Using the inequalities above, we have
$$P(z\ |\ \theta^{(t)}) = g_t(\theta^{(t)}) \le g_t(\theta^{(t+1)}) \le P(z\ |\ \theta^{(t+1)})$$So our new set of parameters $\theta^{(t+1)}$ give us a better log likelihood than the previous set of parameters $\theta^{(t)}$. This is great news! We can keep applying the algorithm recursively and we will reach a maximum of the log-likelihood.
This might, in unfortunate cases, be a local maximum rather than a global one. It's hard to solve problems with unobserved data. But this is still a great way to iterate towards a sensible solution, and in many cases the technique shown here will allow us to find a global maximum likelihood estimate for $\theta$.
Note that in the update rule above we are choosing a value of $\theta$ that maximises $g_t(\theta)$. The numerator in the logarithm in the definition of $g_t(\theta)$ allowed us to use the results from the Preliminaries section to prove that the iterative solution will improve our log-likelihood.
But in fact the numerator doesn't affect the value of $\theta^{(t+1)}$. We can remove it with some algebra, using the properties of logs:
\begin{align*} g_t(\theta) & = \sum_x P(x | z, \theta^{(t)}) \log \left( \frac{ P(z, x |\ \theta)}{P(x | z, \theta^{(t)})} \right) \\ & = \sum_x P(x | z, \theta^{(t)}) \left( \log P(z, x |\ \theta) - \log P(x | z, \theta^{(t)}) \right) \\ & = \sum_x P(x | z, \theta^{(t)}) \log P(z, x |\ \theta) - \sum_x P(x | z, \theta^{(t)}) \log P(x | z, \theta^{(t)}) \\ \end{align*}Note that the second sum here does not depend on $\theta$, which is the thing we're varying when looking for the argmax, so we can forget about it and use the update rule
$$ \theta^{(t + 1)} = \arg\!\max_{\theta} h_t(\theta) $$where $$ h_t(\theta) = \sum_x P(x | z, \theta^{(t)}) \log P(z, x |\ \theta) $$
This expression is actually just an average, or an expectation. It is the expectation of the log likelihood of the full data (including both observed variables $z$ and hidden variables $x$) where the average is taken over the conditional distribution of the hidden variables $x$ given the observed variables $z$ and the current best guess of the parameters $\theta^{(t)}$.