Expectation-Maximisation algorithm and its applications in finite mixture models
Missing data and latent variables are frequently encountered in various machine learning and statistical inference applications. A common example is the finite mixture model, which includes Gaussian mixture and multinomial mixture models. Due to the inherent nature of missing data or latent variables, calculating the likelihood of these models requires marginalisation over the latent variable distribution. This, in turn, complicates the process of maximum likelihood estimation (MLE).
The expectation-maximisation (EM) algorithm, introduced in (Dempster, Laird, and Rubin 1977), offers a general technique for handling latent variable models. The fundamental concept behind the EM algorithm is to iterate between two steps: the E-step (expectation step) and the M-step (maximisation step). In the E-step, the posterior distribution of the latent variables (or missing data) is estimated under the current parameters. This estimated information is then used in the M-step to compute the MLE as if the data were complete. It has been proven that this iterative process never decreases the likelihood function; as a consequence, the EM algorithm converges to a stationary point of the likelihood (typically a local maximum, although it may also be a saddle point).
While the EM algorithm is a powerful tool, the brief description above may not be as clear as desired. Consequently, this post aims to provide a more accessible explanation of the EM algorithm. Additionally, some readers may question the choice of EM over stochastic gradient descent (SGD), a prevalent optimisation method. This post will, therefore, explore the key differences between these two approaches. Finally, the applications of the EM algorithm in the context of finite mixture modelling are demonstrated, focusing on the parameter estimation problems in Gaussian mixture models and multinomial mixture models.
1 Notations
Before diving into the explanation and formulation, it is important to define the notations used in this post as follows:
Notation | Description |
---|---|
\mathbf{x} \in \mathbb{R}^{D} | observable data |
\mathbf{z} \in \mathbb{R}^{K} | latent variable or missing data |
\theta \in \Theta | the parameter of interest in MLE |
2 EM algorithm
The formulation presented in this post follows a probabilistic approach. In probabilistic modelling, there are two processes: data generation (also known as a forward problem) and parameter inference (also known as an inverse problem).
2.1 Data generation
The data is generated as follows:
- draw the parameter \pi from its prior: \pi \sim \Pr(\pi),
- draw the parameter \theta from its prior: \theta \sim \Pr(\theta),
- draw a hidden sample \mathbf{z} from a prior distribution: \mathbf{z} \sim \Pr(\mathbf{z} | \pi), and
- draw an observable sample \mathbf{x} given \mathbf{z} as follows: \mathbf{x} \sim \Pr(\mathbf{x} | \mathbf{z}, \theta),
where \pi and \theta are the parameters of the model of interest.
In many tutorials of EM, the parameter \pi of the prior of the latent variable \mathbf{z} is often defined implicitly. In this post, it is defined explicitly to make the explanation easier to follow.
Such a data generation process is often visualised by the graphical model shown below
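As a concrete illustration, the forward (data generation) process can be sketched in a few lines of Python. This is only a minimal sketch: it assumes, as in the mixture models considered later, that the latent variable is categorical, and the three callables (`sample_pi_prior`, `sample_theta_prior`, `sample_x_given_z`) are hypothetical placeholders for the model-specific priors and likelihood, not functions from any library.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_dataset(N, K, sample_pi_prior, sample_theta_prior, sample_x_given_z):
    """Forward sampling of the generative process above (sketch).

    The three callables are hypothetical placeholders for the model-specific
    priors Pr(pi), Pr(theta) and the likelihood Pr(x | z, theta).
    """
    pi = sample_pi_prior()                      # pi ~ Pr(pi)
    theta = sample_theta_prior()                # theta ~ Pr(theta)
    z = rng.choice(K, size=N, p=pi)             # z_i ~ Pr(z | pi), here categorical
    x = np.stack([sample_x_given_z(z_i, theta) for z_i in z])  # x_i ~ Pr(x | z_i, theta)
    return x, z, pi, theta
```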
2.2 Parameter inference
Given a set of observed i.i.d. data \mathcal{D} = \{\mathbf{x}_{i}\}_{i = 1}^{N}, the general objective is to infer the posterior \Pr(\pi, \theta | \mathcal{D}) of the parameters \pi and \theta. Instead of inferring the exact posterior, which may be difficult in many cases, one can perform a point estimate, such as MLE or maximum a posteriori (MAP), which can be written as follows:
\begin{aligned} \max_{\pi, \theta} \ln \Pr(\pi, \theta | \{\mathbf{x}_{i}\}_{i = 1}^{N}) & = \max_{\pi, \theta} \sum_{i = 1}^{N} \underbrace{\ln \Pr(\mathbf{x}_{i} | \pi, \theta)}_{\text{incomplete log-likelihood}} + \ln \Pr(\pi) + \ln \Pr(\theta) \\ & = \max_{\pi, \theta} \sum_{i = 1}^{N} \ln \left[ \sum_{\mathbf{z}_{i}} \Pr(\mathbf{x}_{i}, \mathbf{z}_{i} | \pi, \theta) \right] + \ln \Pr(\pi) + \ln \Pr(\theta). \end{aligned} \tag{1}
Due to the sum over the latent variable \mathbf{z} sitting inside the logarithm, the incomplete log-likelihood cannot be evaluated directly from the joint distribution (especially when \mathbf{z} is continuous and the sum becomes an integral), making the optimisation difficult.
Fortunately, according to the data generation process presented in Section 2.1, the complete log-likelihood \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) can be evaluated easily:
\ln \Pr(\mathbf{x}, \mathbf{z}| \pi, \theta) = \ln \Pr(\mathbf{x} | \mathbf{z}, \theta) + \ln \Pr(\mathbf{z} | \pi).
Such an assumption allows EM to get around the difficulty of evaluating the expression in Equation 1 by repeating the following three sub-steps:
- find a lower bound of the objective function in Equation 1,
- tighten the lower bound, and
- maximise the tightest lower bound.
The first two sub-steps combined are often known as the Expectation step (or E-step for short), while the last step is known as the Maximisation step (or M-step for short). These steps are then presented in the following sub-sub-sections.
2.2.1 Evidence lower bound (ELBO)
To find a lower bound of the objective function in Equation 1, one can follow the variational inference approach to obtain the ELBO. In particular, let q(\mathbf{z}) > 0 be an arbitrary distribution of the latent variable \mathbf{z}. The incomplete log-likelihood in Equation 1 can be re-written as follows: \begin{aligned} \ln \Pr(\mathbf{x} | \pi, \theta) & = \mathbb{E}_{q(\mathbf{z})} \left[ \ln \Pr(\mathbf{x} | \pi, \theta) \right] \\ & = \mathbb{E}_{q(\mathbf{z})} \left[ \ln \Pr(\mathbf{x} | \pi, \theta) + \ln \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) - \ln \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) + \ln q(\mathbf{z}) - \ln q(\mathbf{z}) \right] \\ & = \mathbb{E}_{q(\mathbf{z})} \left[ \ln \Pr(\mathbf{x} | \pi, \theta) + \ln \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) - \ln q(\mathbf{z}) \right] + \mathbb{E}_{q(\mathbf{z})}\left[ \ln q(\mathbf{z}) - \ln \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) \right] \\ & = \mathbb{E}_{q(\mathbf{z})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) - \ln q(\mathbf{z}) \right] + \operatorname{KL} \left[ q(\mathbf{z}) \| \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) \right], \end{aligned} where \operatorname{KL}[ q \| p ] is the Kullback-Leibler divergence (KL divergence for short) between probability distributions q and p.
Since \operatorname{KL}[ q \| p ] \ge 0 and \operatorname{KL}[ q \| p ] = 0 iff q = p, the log-likelihood of interest can be lower-bounded as: \ln \Pr(\mathbf{x} | \pi, \theta) \ge \mathbb{E}_{q(\mathbf{z})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) - \ln q(\mathbf{z}) \right], and the equality occurs iff q(\mathbf{z}) = \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta), which is the posterior of the latent variable \mathbf{z} after observing the data \mathbf{x}.
2.2.2 Tightening the ELBO
To obtain the tightest lower bound, one must perform the following optimisation:
q^{*} = \operatorname*{argmax}_{q} \mathbb{E}_{q(\mathbf{z})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) - \ln q(\mathbf{z}) \right]. \tag{2}
As mentioned above, the tightest bound is attained when q^{*}(\mathbf{z}) = \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta), i.e., the “variational” posterior matches the true posterior of the latent variable \mathbf{z}. Such a true posterior can be obtained in certain simple cases, but is intractable when the model becomes more complex. In those cases, only a locally optimal “variational” posterior q(\mathbf{z}) is calculated (Bernardo et al. 2003).
Such an observation explains why, in vanilla EM, the E-step is often stated as computing the true posterior of the latent variable, \Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)}). The superscript t denotes the parameter values at the t-th iteration; they are held fixed in the E-step so that they are not treated as optimisation variables when maximising the expected complete log-likelihood in the M-step. Instead of following that convention, the notation q^{*} is used here to avoid confusion.
2.2.3 Maximising the possibly-tightest lower bound
Finally, the possibly-tightest lower bound is maximised with respect to the parameters \pi and \theta. Since the entropy term involving \ln q^{*} does not depend on \pi and \theta, it can be dropped from the objective:
\pi^{(t + 1)}, \theta^{(t + 1)} \gets \operatorname*{argmax}_{\pi, \theta} \sum_{i = 1}^{N} \mathbb{E}_{q^{*}(\mathbf{z}_{i})} \left[ \ln \Pr(\mathbf{x}_{i}, \mathbf{z}_{i} | \pi, \theta) - \cancel{\ln q^{*}(\mathbf{z}_{i})} \right] + \ln \Pr(\pi) + \ln \Pr(\theta). \tag{3}
In summary, instead of maximising the difficult-to-calculate objective function in Equation 1, the EM algorithm performs the following alternating optimisation:
\max_{\pi, \theta} \max_{\{q_{i}\}} \sum_{i = 1}^{N} \mathbb{E}_{q_{i}(\mathbf{z}_{i})} \left[ \ln \Pr(\mathbf{x}_{i}, \mathbf{z}_{i} | \pi, \theta) - \ln q_{i}(\mathbf{z}_{i}) \right] + \ln \Pr(\pi) + \ln \Pr(\theta).
The whole procedure is summarised in Algorithm 1.
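In code, this alternating optimisation is a simple loop. Below is a minimal, model-agnostic sketch: `e_step`, `m_step` and `log_likelihood` are placeholder callables (not from any library) to be filled in for a concrete model, as is done for the mixture models in Section 3.

```python
import numpy as np

def em(X, init_params, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch). The callables are model-specific placeholders."""
    params = init_params
    prev_ll = -np.inf
    for _ in range(max_iter):
        q = e_step(X, params)            # E-step: q* = Pr(z | x, current params)
        params = m_step(X, q)            # M-step: maximise E_q[ln Pr(x, z | params)] + log-priors
        ll = log_likelihood(X, params)   # monitor the incomplete log-likelihood
        if ll - prev_ll < tol:           # non-decreasing by Theorem 1 (Section 2.3)
            break
        prev_ll = ll
    return params
```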
2.3 Convergence of the EM algorithm
The following theorem proves that the EM algorithm never decreases the log-likelihood across iterations. For simplicity, the priors \Pr(\pi) and \Pr(\theta) are omitted from the proof below, but extending it to include these prior terms is trivial.
Theorem 1 Assume that q^{*}(\mathbf{z}) = \Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)}); then after each EM iteration, the log-likelihood \ln \Pr(\mathbf{x} | \pi, \theta) is non-decreasing. Mathematically, it can be written as follows: \ln \Pr(\mathbf{x} | \pi^{(t + 1)}, \theta^{(t + 1)}) \ge \ln \Pr(\mathbf{x} | \pi^{(t)}, \theta^{(t)}), where the superscript denotes the result obtained after that iteration.
Proof. The log-likelihood of interest can be written as: \begin{aligned} \ln \Pr(\mathbf{x} | \pi, \theta) & = \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x} | \pi, \theta) \right] \\ & = \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) - \ln \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) \right]. \end{aligned} \tag{4}
Since it holds for any (\pi, \theta), substituting \pi = \pi^{(t)} and \theta = \theta^{(t)} gives: \ln \Pr(\mathbf{x} | \pi^{(t)}, \theta^{(t)}) = \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi^{(t)}, \theta^{(t)}) - \ln \Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)}) \right]. \tag{5}
Subtracting Equation 5 from Equation 4 side by side gives the following: \begin{aligned} & \ln \Pr(\mathbf{x} | \pi, \theta) - \ln \Pr(\mathbf{x} | \pi^{(t)}, \theta^{(t)}) \\ & = \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) \right.\\ & \quad \left. - \ln \Pr(\mathbf{x}, \mathbf{z} | \pi^{(t)}, \theta^{(t)}) + \ln \Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)}) - \ln \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) \right] \\ & = \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) \right.\\ & \quad \left. - \ln \Pr(\mathbf{x}, \mathbf{z} | \pi^{(t)}, \theta^{(t)}) \right] + \operatorname{KL} \left[ \Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)}) \| \Pr(\mathbf{z} | \mathbf{x}, \pi, \theta) \right]. \end{aligned}
Since KL divergence is non-negative, one can imply that: \begin{aligned} & \ln \Pr(\mathbf{x} | \pi, \theta) - \ln \Pr(\mathbf{x} | \pi^{(t)}, \theta^{(t)}) \\ & \quad \ge \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) - \ln \Pr(\mathbf{x}, \mathbf{z} | \pi^{(t)}, \theta^{(t)}) \right]. \end{aligned} \tag{6}
In the M-step, the parameters (\pi^{(t + 1)}, \theta^{(t + 1)}) are obtained by maximising the first term in the right hand side: \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi, \theta) \right] w.r.t. (\pi, \theta). Thus, according to the definition of the maximisation: \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi^{(t + 1)}, \theta^{(t + 1)}) \right] \ge \mathbb{E}_{\Pr(\mathbf{z} | \mathbf{x}, \pi^{(t)}, \theta^{(t)})} \left[ \ln \Pr(\mathbf{x}, \mathbf{z} | \pi^{(t)}, \theta^{(t)}) \right].
Hence, one can conclude that: \ln \Pr(\mathbf{x} | \pi^{(t + 1)}, \theta^{(t + 1)}) \ge \ln \Pr(\mathbf{x} | \pi^{(t)}, \theta^{(t)}).
3 Applications of EM in finite mixture models
One of the typical applications of the EM algorithm is to perform maximum likelihood (or MAP) estimation for finite mixture models. This section is, therefore, dedicated to discussing the application of EM to Gaussian and multinomial mixture models.
3.1 Gaussian mixture models
The Gaussian mixture distribution can be written as a convex combination of K Gaussian components: \Pr(\mathbf{x} | \pi, \mu, \Sigma) = \sum_{k = 1}^{K} \pi_{k} \, \mathcal{N}(\mathbf{x}; \mu_{k}, \Sigma_{k}), where: \pi_{k} \in [0, 1] and \pmb{\pi}^{\top} \pmb{1} = 1.
3.1.1 Data generation
A data-point of the above Gaussian mixture distribution can be generated as follows:
- sample a probability vector \pi from a Dirichlet prior: \pi \sim \Pr(\pi | \alpha) = \operatorname{Dir}(\pi | \alpha),
- sample K sets of parameters (\mu_{k}, \Sigma_{k}) from a normal-inverse-Wishart prior: (\mu_{k}, \Sigma_{k}) \sim \Pr(\mu, \Sigma | m, \lambda, \Psi, \nu) = \operatorname{NIW}(\mu, \Sigma | m, \lambda, \Psi, \nu),
- sample the index of a Gaussian component: \mathbf{z} \sim \Pr(\mathbf{z} | \pi) = \operatorname{Categorical}(\mathbf{z} | \pmb{\pi}), then
- sample a data-point from the corresponding Gaussian component: \mathbf{x} \sim \Pr(\mathbf{x} | \mathbf{z}, \mu, \Sigma) = \mathcal{N}(\mathbf{x}| \mu_{k}, \Sigma_{k}), where z_{k} = 1.
The data generation process can also be visualised in the graphical model shown below.
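For illustration, the generative process above can be simulated with NumPy and SciPy. This is a sketch only: the hyper-parameter values (`alpha`, `m0`, `lam`, `Psi`, `nu`, with `m0` playing the role of the prior mean m) are arbitrary choices, not values prescribed by the post.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
K, D, N = 3, 2, 500

# Hyper-parameters of the priors (arbitrary illustrative values)
alpha = np.full(K, 2.0)          # Dirichlet concentration for pi
m0, lam = np.zeros(D), 0.5       # NIW prior mean m and scaling lambda
Psi, nu = np.eye(D), D + 2       # NIW scale matrix and degrees of freedom

pi = rng.dirichlet(alpha)                                              # pi ~ Dir(alpha)
Sigma = invwishart.rvs(df=nu, scale=Psi, size=K)                       # Sigma_k ~ inverse-Wishart(Psi, nu)
mu = np.stack([rng.multivariate_normal(m0, S / lam) for S in Sigma])   # mu_k ~ N(m0, Sigma_k / lam)

z = rng.choice(K, size=N, p=pi)                                        # z_i ~ Categorical(pi)
X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])    # x_i ~ N(mu_{z_i}, Sigma_{z_i})
```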
3.1.2 Objective
Given a set of data-points \{\mathbf{x}_{i}\}_{i = 1}^{N} sampled from the Gaussian mixture distribution, the aim is to infer a point estimate, in particular the MAP estimate, of (\pi, \mu, \Sigma). Such an objective can be written as follows:
\begin{aligned} & \max_{\pi, \mu, \Sigma} \ln \Pr(\pi, \mu, \Sigma | \{\mathbf{x}_{i}\}_{i = 1}^{N}, \alpha, m, \lambda, \Psi, \nu) \\ &= \max_{\pi, \mu, \Sigma} \sum_{i = 1}^{N} \ln \Pr(\mathbf{x}_{i} | \pi, \mu, \Sigma) + \ln \operatorname{Dir}(\pi | \alpha) + \sum_{k = 1}^{K} \ln \operatorname{NIW}(\mu_{k}, \Sigma_{k} | m, \lambda, \Psi, \nu). \end{aligned}
3.1.3 Parameter inference
In this case, one can simply follow the EM algorithm presented in Section 2.2. Note that the likelihood of the N i.i.d. data-points can be written as:
\prod_{i = 1}^{N} \Pr(\mathbf{x}_{i} | \pi, \mu, \Sigma) = \prod_{i = 1}^{N} \sum_{k = 1}^{K} \Pr(\mathbf{x}_{i} | \mathbf{z}_{ik} = 1, \mu, \Sigma) \, \Pr(\mathbf{z}_{ik} = 1 | \pi).
E-step: optimises the lower bound with respect to the “variational” posterior. As shown in Section 2.2, q^{*} = \Pr(\mathbf{z} | \mathbf{x}, \pi, \mu, \Sigma) results in the tightest bound. Fortunately, in this case of Gaussian mixture models, the true posterior \Pr(\mathbf{z} | \mathbf{x}, \pi, \mu, \Sigma) can be calculated in closed-form as follows:
\boxed{ \begin{aligned} q^{*}(\mathbf{z}_{ik} = 1) & = \Pr(\mathbf{z}_{ik} = 1 | \mathbf{x}_{i}, \pi^{(t)}, \mu^{(t)}, \Sigma^{(t)}) \\ & = \frac{\Pr(\mathbf{x}_{i} | \mathbf{z}_{ik} = 1, \mu^{(t)}, \Sigma^{(t)}) \, \Pr(\mathbf{z}_{ik} = 1 | \pi^{(t)})}{\sum_{j = 1}^{K} \Pr(\mathbf{x}_{i} | \mathbf{z}_{ij} = 1, \mu^{(t)}, \Sigma^{(t)}) \, \Pr(\mathbf{z}_{ij} = 1 | \pi^{(t)})} \\ & \quad (\text{Bayes' rule}) \\ & = \frac{\pi_{k}^{(t)} \, \mathcal{N}(\mathbf{x}_{i}; \mu_{k}^{(t)}, \Sigma_{k}^{(t)})}{\sum_{j = 1}^{K} \pi_{j}^{(t)} \, \mathcal{N}(\mathbf{x}_{i}; \mu_{j}^{(t)}, \Sigma_{j}^{(t)})}. \end{aligned} } \tag{7}
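Equation 7 translates directly into a few lines of NumPy/SciPy. The snippet below is a sketch that computes the N x K responsibility matrix in log-space for numerical stability; `pi`, `mu`, `Sigma` denote the current parameter estimates (variable names are assumptions of this sketch).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step_gmm(X, pi, mu, Sigma):
    """Responsibilities q*(z_ik = 1) of Equation 7, computed in log-space."""
    N, K = X.shape[0], len(pi)
    log_r = np.empty((N, K))
    for k in range(K):
        log_r[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
    log_r -= log_r.max(axis=1, keepdims=True)    # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)      # normalise over the K components
```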
M-step: maximises the tightest lower bound w.r.t. the model parameters (\pi, \mu, \Sigma): \begin{aligned} & \operatorname*{argmax}_{\pi, \mu, \Sigma} \sum_{i = 1}^{N} \mathbb{E}_{q^{*}(\mathbf{z}_{i})} [ \ln \Pr(\mathbf{x}_{i} | \mathbf{z}_{i}, \mu, \Sigma) + \ln \Pr(\mathbf{z}_{i} | \pi) ] + \ln \Pr(\pi | \alpha) + \ln \Pr(\mu, \Sigma | m, \lambda, \Psi, \nu) \\ & = \operatorname*{argmax}_{\pi, \mu, \Sigma} \sum_{i = 1}^{N} \sum_{k = 1}^{K} q^{*}(\mathbf{z}_{ik} = 1) \left[ \ln \Pr(\mathbf{x}_{i} | \mathbf{z}_{ik} = 1, \mu, \Sigma) + \ln \Pr(\mathbf{z}_{ik} = 1 | \pi) \right] \\ & \quad + \ln \operatorname{Dir}(\pi | \alpha) + \sum_{k = 1}^{K} \ln \operatorname{NIW}(\mu_{k}, \Sigma_{k} | m, \lambda, \Psi, \nu) \\ & = \operatorname*{argmax}_{\pi, \mu, \Sigma} \sum_{i = 1}^{N} \sum_{k = 1}^{K} q^{*}(\mathbf{z}_{ik} = 1) \left[ \ln \mathcal{N}(\mathbf{x}_{i}; \mu_{k}, \Sigma_{k}) + \ln \pi_{k} \right] \\ & \quad + \sum_{k = 1}^{K} \left[ (\alpha_{k} - 1) \ln \pi_{k} + \ln \mathcal{N} \left( \mu_{k} \left| m, \frac{1}{\lambda} \Sigma_{k} \right. \right) + \ln \mathcal{W}^{-1} \left( \Sigma_{k} | \Psi, \nu \right) \right] \\ & = \operatorname*{argmax}_{\pi, \mu, \Sigma} \sum_{i = 1}^{N} \sum_{k = 1}^{K} q^{*}(\mathbf{z}_{ik} = 1) \left[ \ln \pi_{k} - \frac{1}{2} \ln \left| \Sigma_{k} \right| - \frac{1}{2} (\mathbf{x}_{i} - \mu_{k})^{\top} \Sigma_{k}^{-1} (\mathbf{x}_{i} - \mu_{k}) \right] \\ & \quad + \sum_{k = 1}^{K} \left[ (\alpha_{k} - 1) \ln \pi_{k} - \frac{\nu + D + 2}{2} \ln |\Sigma_{k}| - \frac{1}{2} \operatorname{Tr} \left( \Psi \Sigma_{k}^{-1} \right) - \frac{\lambda}{2} (\mu_{k} - m)^{\top} \Sigma_{k}^{-1} (\mu_{k} - m) \right], \end{aligned} where additive constants that do not depend on (\pi, \mu, \Sigma) have been dropped.
Taking derivative with respect to \mu_{k} and setting it to zero give:
\begin{aligned} & \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \Sigma_{k}^{-1} (\mathbf{x}_{i} - \mu_{k}) - \lambda \Sigma_{k}^{-1} (\mu_{k} - m) = 0 \\ & \Leftrightarrow \left[ \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) + \lambda \right] \mu_{k} = \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \, \mathbf{x}_{i} + \lambda m. \end{aligned}
Or:
\boxed{ \mu_{k} = \frac{\sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \mathbf{x}_{i} + \lambda m}{\sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) + \lambda}. } \tag{8}
Similarly, taking the derivative with respect to \Sigma_{k} and setting it to zero gives:
\begin{aligned} & -\frac{1}{2} \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \left[ \Sigma_{k}^{-1} - \Sigma_{k}^{-1} (\mathbf{x}_{i} - \mu_{k}) (\mathbf{x}_{i} - \mu_{k})^{\top} \Sigma_{k}^{-1} \right] \\ & \quad + \frac{1}{2} \Sigma_{k}^{-1} \Psi \Sigma_{k}^{-1} - \frac{\nu + D + 2}{2} \Sigma_{k}^{-1} + \frac{\lambda}{2} \Sigma_{k}^{-1} (\mu_{k} - m) (\mu_{k} - m)^{\top} \Sigma_{k}^{-1} = 0. \end{aligned}
To solve for \Sigma_{k}, both sides are left- and right-multiplied by \Sigma_{k} to obtain:
\begin{aligned} & -\frac{1}{2} \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \left[ \Sigma_{k} - (\mathbf{x}_{i} - \mu_{k}) (\mathbf{x}_{i} - \mu_{k})^{\top} \right] \\ & \quad + \frac{1}{2} \Psi - \frac{\nu + D + 2}{2} \Sigma_{k} + \frac{\lambda}{2} (\mu_{k} - m) (\mu_{k} - m)^{\top} = 0 \\ & \Leftrightarrow \left[ \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) + \nu + D + 2 \right] \Sigma_{k} \\ & \quad = \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) (\mathbf{x}_{i} - \mu_{k}) (\mathbf{x}_{i} - \mu_{k})^{\top} + \Psi + \lambda (\mu_{k} - m) (\mu_{k} - m)^{\top}. \end{aligned}
Or:
\boxed{ \Sigma_{k} = \frac{\sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) (\mathbf{x}_{i} - \mu_{k}) (\mathbf{x}_{i} - \mu_{k})^{\top} + \Psi + \lambda (\mu_{k} - m) (\mu_{k} - m)^{\top}}{\sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) + \nu + D + 2}. } \tag{9}
One can further substitute \mu_{k} from Equation 8 into Equation 9 to obtain an expression for \Sigma_{k} that depends only on the observed data, the responsibilities q^{*}, and the prior parameters.
Finally, one can obtain the optimal value for the mixture coefficient \pi_{k} in a similar way, except it is now a constrained optimisation. Such an optimisation can be written as follows:
\begin{aligned} & \max_{\pi} \sum_{k = 1}^{K} \left[\sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) + \alpha_{k} - 1 \right] \ln \pi_{k} \\ & \text{subject to: } \sum_{k = 1}^{K} \pi_{k} = 1. \end{aligned}
The constrained optimisation above can simply be solved with a Lagrange multiplier. The result for \pi_{k} can then be expressed as:
\boxed{ \pi_{k} = \frac{\sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) + \alpha_{k} - 1}{N - K + \sum_{k = 1}^{K} \alpha_{k}}. }
One can also refer to Chapter 10.2 in (Bishop 2006) for a similar derivation and result.
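The boxed MAP updates (Equations 8 and 9 and the expression for \pi_{k}) can be implemented as below. This is a sketch under the same assumed hyper-parameter names as the sampling snippet above (`alpha` as a length-K vector, prior mean `m0`, scaling `lam`, scale matrix `Psi`, degrees of freedom `nu`); `r` is the responsibility matrix returned by `e_step_gmm`.

```python
import numpy as np

def m_step_gmm(X, r, alpha, m0, lam, Psi, nu):
    """MAP M-step for a Gaussian mixture with Dirichlet/NIW priors (sketch)."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0)                                   # effective counts per component

    # Equation 8: weighted mean shrunk towards the prior mean m0
    mu = (r.T @ X + lam * m0) / (Nk + lam)[:, None]

    # Equation 9: regularised covariance
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        S = (r[:, k, None] * diff).T @ diff              # sum_i q_ik (x_i - mu_k)(x_i - mu_k)^T
        dm = (mu[k] - m0)[:, None]
        Sigma[k] = (S + Psi + lam * dm @ dm.T) / (Nk[k] + nu + D + 2)

    # Mixture weights under the Dirichlet prior
    pi = (Nk + alpha - 1) / (N - K + alpha.sum())
    return pi, mu, Sigma
```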
3.2 Multinomial mixture models
Similar to the Gaussian mixture models, a multinomial mixture model can also be written as:
\Pr(\mathbf{x} | \pi, m, \rho) = \sum_{\mathbf{z}} \Pr(\mathbf{z} | \pi) \Pr(\mathbf{x} | \mathbf{z}, m, \rho) = \sum_{k = 1}^{K} \pi_{k} \mathrm{Mult}(\mathbf{x}; m, \rho_{k}).
Only the case where all the multinomial components have the same parameter m (the number of trials) is considered. The reason is that optimising over an integer-valued m is beyond the scope of this post.
3.2.1 Data generation
A data-point of the multinomial mixture model can be generated as follows:
- sample a probability \pi from a Dirichlet prior: \pi \sim \Pr(\pi | \alpha) = \operatorname{Dir}(\pi | \alpha),
- sample K probability vectors \{ \rho_{k} \}_{k = 1}^{K} from a Dirichlet prior: \rho_{k} \sim \Pr(\rho | \beta) = \operatorname{Dir}(\rho | \beta),
- sample the index of a multinomial component: \mathbf{z} \sim \Pr(\mathbf{z} | \pi) = \operatorname{Categorical}(\mathbf{z} | \pmb{\pi}), then
- sample a data-point from the corresponding multinomial component: \mathbf{x} \sim \Pr(\mathbf{x} | \mathbf{z}, m, \rho) = \operatorname{Multinomial}(\mathbf{x} | m, \rho_{k}), where z_{k} = 1.
The data generation process can also be visualised in the graphical model shown below.
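The corresponding forward sampler is a small variation of the Gaussian one. Again a sketch: the values of `alpha`, `beta`, and the number of trials `m_trials` below are arbitrary illustrative choices, not values given in the post.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, m_trials = 3, 5, 500, 20

alpha = np.full(K, 2.0)                    # Dirichlet prior on pi
beta = np.full(D, 2.0)                     # Dirichlet prior on each rho_k

pi = rng.dirichlet(alpha)                  # pi ~ Dir(alpha)
rho = rng.dirichlet(beta, size=K)          # rho_k ~ Dir(beta), one row per component
z = rng.choice(K, size=N, p=pi)            # component indices z_i ~ Categorical(pi)
X = np.stack([rng.multinomial(m_trials, rho[k]) for k in z])   # x_i ~ Mult(m, rho_{z_i})
```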
3.2.2 Objective
Given a set of data-points \{\mathbf{x}_{i}\}_{i = 1}^{N} sampled from a multinomial mixture distribution, the aim is to infer a point estimate, in particular the MAP estimate, of (\pi, \rho) as follows:
\max_{\pi, \rho} \ln \Pr(\pi, \rho | \{\mathbf{x}_{i}\}_{i = 1}^{N}, \alpha, m, \beta) = \max_{\pi, \rho} \sum_{i = 1}^{N} \ln \Pr(\mathbf{x}_{i} | \pi, m, \rho) + \ln \operatorname{Dir}(\pi | \alpha) + \sum_{k = 1}^{K} \ln \operatorname{Dir}(\rho_{k} | \beta).
3.2.3 Parameter inference with EM
E-step calculates the posterior of the latent variable \mathbf{z}_{i} given the data \mathbf{x}_{i}: \begin{aligned} q^{*}(\mathbf{z}_{ik} = 1) & = \Pr(\mathbf{z}_{ik} = 1 | \mathbf{x}_{i}, \pi^{(t)}, \rho^{(t)}) \\ & = \frac{\Pr(\mathbf{x}_{i} | \mathbf{z}_{ik} = 1, \rho^{(t)}) \, \Pr(\mathbf{z}_{ik} = 1 | \pi^{(t)})}{\sum_{j = 1}^{K} \Pr(\mathbf{x}_{i} | \mathbf{z}_{ij} = 1, \rho^{(t)}) \, \Pr(\mathbf{z}_{ij} = 1 | \pi^{(t)})} \\ & = \frac{\pi_{k}^{(t)} \, \mathrm{Mult}(\mathbf{x}_{i}; m, \rho_{k}^{(t)})}{\sum_{j = 1}^{K} \pi_{j}^{(t)} \, \mathrm{Mult}(\mathbf{x}_{i}; m, \rho_{j}^{(t)})}. \end{aligned} \tag{10}
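As in the Gaussian case, Equation 10 can be computed in log-space. The sketch below assumes `pi` and `rho` hold the current parameter estimates; the multinomial coefficient depends only on \mathbf{x}_{i}, so it cancels between the numerator and denominator and is omitted.

```python
import numpy as np

def e_step_multinomial(X, pi, rho):
    """Responsibilities of Equation 10 (multinomial coefficient cancels)."""
    log_r = np.log(pi)[None, :] + X @ np.log(rho).T      # (N, K): ln pi_k + sum_d x_id ln rho_kd
    log_r -= log_r.max(axis=1, keepdims=True)            # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```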
M-step maximises the following expected complete log-likelihood w.r.t. \pi and \rho:
\begin{aligned} & \operatorname*{argmax}_{\pi, \rho} \sum_{i = 1}^{N} \mathbb{E}_{q^{*}(\mathbf{z}_{i})} \left[ \ln \Pr(\mathbf{x}_{i} | \mathbf{z}_{i}, m, \rho) + \ln \Pr(\mathbf{z}_{i} | \pi) \right] + \ln \Pr(\pi | \alpha) + \ln \Pr(\rho | \beta) \\ & = \operatorname*{argmax}_{\pi, \rho} \sum_{i = 1}^{N} \mathbb{E}_{q^{*}(\mathbf{z}_{i})} \left[ \sum_{k = 1}^{K} \mathbf{z}_{ik} \ln \operatorname{Mult}(\mathbf{x}_{i} | m, \rho_{k}) + \ln \operatorname{Categorical}(\mathbf{z}_{i} | \pi) \right] \\ & \quad + \ln \operatorname{Dir}(\pi | \alpha) + \sum_{k = 1}^{K} \ln \operatorname{Dir}(\rho_{k} | \beta) \\ & = \operatorname*{argmax}_{\pi, \rho} \sum_{i = 1}^{N} \mathbb{E}_{q^{*}(\mathbf{z}_{i})} \left[ \sum_{k = 1}^{K} \mathbf{z}_{ik} \left( \sum_{d = 1}^{D} \mathbf{x}_{id} \ln \rho_{kd} \right) + \mathbf{z}_{ik} \ln \pi_{k} \right] \\ & \quad + \sum_{k = 1}^{K} \left[ (\alpha - 1) \ln \pi_{k} + (\beta - 1) \sum_{d = 1}^{D} \ln \rho_{kd} \right] \\ & = \operatorname*{argmax}_{\pi, \rho} \sum_{i = 1}^{N} \sum_{k = 1}^{K} q^{*}(\mathbf{z}_{ik} = 1) \left[ \ln \pi_{k} + \sum_{d = 1}^{D} \mathbf{x}_{id} \ln \rho_{kd} \right] + \sum_{k = 1}^{K} \left[ (\alpha - 1) \ln \pi_{k} + (\beta - 1) \sum_{d = 1}^{D} \ln \rho_{kd} \right], \end{aligned} where additive constants (the multinomial coefficients and the Dirichlet normalisers) have been dropped.
Due to the nature of a multinomial mixture model, both the parameters \pi and \rho are probability vectors.
The Lagrangian for \pi can be written as: \mathsf{L}_{\pi} = \sum_{k = 1}^{K} \left[ \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) + \alpha - 1 \right] \ln \pi_{k} - \lambda \left( \sum_{k = 1}^{K} \pi_{k} - 1 \right), where \lambda is the Lagrange multiplier.
Taking derivative of the Lagrangian w.r.t. \pi_{k} gives: \frac{\partial \mathsf{L}_{\pi}}{\partial \pi_{k}} = \frac{1}{\pi_{k}} \left[ \alpha - 1 + \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \right] - \lambda.
Setting the derivative to zero and solving for \pi_{k} gives: \pi_{k} = \frac{1}{\lambda} \left[ \alpha - 1 + \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \right].
And since \sum_{k = 1}^{K} \pi_{k} = 1, one can substitute and find that \lambda = N + K (\alpha - 1). Thus: \boxed{ \pi_{k}^{(t + 1)} = \frac{\alpha - 1 + \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1)}{N + K (\alpha - 1)}. }
Similarly, the Lagrangian for \rho can be expressed as: \mathsf{L}_{\rho} = \sum_{k = 1}^{K} \sum_{d = 1}^{D} \left[ \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \mathbf{x}_{id} + \beta - 1 \right] \ln \rho_{kd} - \sum_{k = 1}^{K} \eta_{k} \left( \sum_{d = 1}^{D} \rho_{kd} - 1 \right), where \eta_{k} is the Lagrange multiplier for the k-th component. Taking the derivative w.r.t. \rho_{kd} gives: \frac{\partial \mathsf{L}_{\rho}}{\partial \rho_{kd}} = \frac{1}{\rho_{kd}} \left[ \beta - 1 + \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \mathbf{x}_{id} \right] - \eta_{k}. Setting the derivative to zero and solving for \rho_{kd} gives: \rho_{kd} = \frac{1}{\eta_{k}} \left[ \beta - 1 + \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \mathbf{x}_{id} \right]. The constraint that \rho_{k} is a probability vector (together with \sum_{d = 1}^{D} \mathbf{x}_{id} = m) leads to \eta_{k} = D (\beta - 1) + m \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1). Thus: \boxed{ \rho_{kd}^{(t + 1)} = \frac{\beta - 1 + \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1) \mathbf{x}_{id}}{D (\beta - 1) + m \sum_{i = 1}^{N} q^{*}(\mathbf{z}_{ik} = 1)}. }
One can also refer to (Elmore and Wang 2003) for a similar derivation and result.
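The two boxed updates can be implemented as below. This sketch assumes symmetric scalar concentrations `alpha` and `beta` (matching the derivation above); the normalisation over d is performed numerically, which coincides with the boxed denominator since each row of X sums to m.

```python
import numpy as np

def m_step_multinomial(X, r, alpha, beta):
    """MAP M-step for the multinomial mixture with Dirichlet priors (sketch).
    `r` is the N x K responsibility matrix from the E-step."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0)

    # pi_k = (alpha - 1 + sum_i q_ik) / (N + K (alpha - 1))
    pi = (Nk + alpha - 1) / (N + K * (alpha - 1))

    # rho_kd proportional to beta - 1 + sum_i q_ik x_id, normalised over d
    num = r.T @ X + beta - 1                              # shape (K, D)
    rho = num / num.sum(axis=1, keepdims=True)            # denominator = D(beta - 1) + m * Nk
    return pi, rho
```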
4 References