Cuong Nguyen

Stochastic gradient and Hamiltonian Monte Carlo

Cuong Nguyen — Sun, 19 Nov 2023 00:00:00 GMT

This post introduces the formulation of stochastic gradient descent as a Monte Carlo sampling method that approximates the posterior of the variables of interest.

1 Motivation of Monte Carlo sampling

According to (MacKay 2003, chap. 29), Monte Carlo based methods make use of random numbers (or in particular, random variables) to solve one or both of the following problems.

Problem 1 - generate samples

Generate samples $\{\theta^{(r)}\}_{r=1}^{R}$ from a given probability distribution $p(\theta)$.

Problem 2 - estimate an expected value

Estimate the expectation of a given function $\phi(\theta)$ under a given distribution $p(\theta)$:
$$\Phi = \mathbb{E}_{p(\theta)} \left[ \phi(\theta) \right] = \int \phi(\theta) \, p(\theta) \, \mathrm{d}\theta,$$
where $\theta$ is assumed to be an $N$-dimensional vector with real components $\theta_{n}$.

It is assumed that $p(\theta)$ is sufficiently complex that we can neither (i) sample from it by conventional techniques nor (ii) evaluate those expectations by exact methods. That motivates the study of Monte Carlo approximation methods.

The majority of studies on Monte Carlo methods focus on the first problem (sampling) because, once the first problem is solved, the second problem can be solved by using the Monte Carlo approximation to estimate the expectation:
$$\hat{\Phi} = \frac{1}{R} \sum_{r = 1}^{R} \phi(\theta^{(r)}),$$
where: $\{\theta^{(r)}\}_{r=1}^{R}$ are generated from $p(\theta)$.

Under this approximation, $\hat{\Phi}$ is an unbiased estimator of the exact expectation $\Phi$.
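The Monte Carlo approximation above can be illustrated with a small sketch. The toy target (a Gaussian that can be sampled directly), the function `phi` and the helper name `monte_carlo_expectation` are assumptions made only for this example; they are not part of the original formulation.

```python
import numpy as np


def monte_carlo_expectation(phi, sampler, num_samples, rng):
    """Estimate E_p[phi(theta)] by averaging phi over samples drawn from p."""
    samples = sampler(rng, num_samples)
    return np.mean(phi(samples), axis=0)


rng = np.random.default_rng(0)

# Assumed toy target: p(theta) = N(1, 2^2) and phi(theta) = theta^2.
# The exact expectation is mean^2 + variance = 1 + 4 = 5.
estimate = monte_carlo_expectation(
    phi=lambda theta: theta**2,
    sampler=lambda rng, n: rng.normal(loc=1.0, scale=2.0, size=n),
    num_samples=200_000,
    rng=rng,
)
print(f"Monte Carlo estimate: {estimate:.3f} (exact value: 5)")
```

The estimate fluctuates around the exact value with a standard error that shrinks as the number of samples grows, which is why the sampling problem receives most of the attention.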

Why is sampling from $p(\theta)$ hard?

We will assume that the density from which we wish to draw samples, $p(\theta)$, can be evaluated, at least to within a multiplicative constant. In other words, we can evaluate a function $p^{*}(\theta)$ such that:
$$p(\theta) = \frac{p^{*}(\theta)}{Z}, \tag{1}$$
where $Z$ is the normalising constant (that we do not know):
$$Z = \int p^{*}(\theta) \, \mathrm{d}\theta.$$
Thus, it is hard to draw samples from $p(\theta)$ since $Z$ is typically unknown. Even if $Z$ is known, drawing samples from $p(\theta)$ is still a challenging problem, especially in high-dimensional spaces, because there is no obvious way to sample from $p(\theta)$ without enumerating all of the possible states.

There are various sampling techniques to generate samples from a given distribution, such as importance sampling, rejection sampling or the Metropolis - Hastings method. Here, we focus on a specific method, known as Hamiltonian Monte Carlo, which belongs to the family of Metropolis - Hastings methods.

2 The Metropolis - Hastings method

The Metropolis - Hastings algorithm uses a proposal density $q(\theta; \theta^{(t)})$ which depends on the current state $\theta^{(t)}$. For example, $q(\theta; \theta^{(t)})$ might be a simple Gaussian distribution centred on the current $\theta^{(t)}$. The proposal density can be any fixed probability distribution from which we can easily sample.

As before, it is assumed that the un-normalised probability $p^{*}(\theta)$ can be evaluated for any $\theta$. One can generate the next state $\theta^{\prime}$ from the proposal distribution $q(\theta; \theta^{(t)})$. To decide whether to accept the new state, a quantity $a$ (also known as the Metropolis - Hastings score) is calculated. Depending on the value of the score, the next state is (i) always accepted if $a \ge 1$, or (ii) accepted with probability $a$ otherwise.

If the step is accepted, then $\theta^{(t + 1)} = \theta^{\prime}$.
Otherwise, the previous state is kept: $\theta^{(t + 1)} = \theta^{(t)}$.

The details of the Metropolis - Hastings algorithm can be seen in Algorithm 1.

\begin{algorithm} \caption{The Metropolis - Hastings sampling method} \begin{algorithmic} \Procedure{Metropolis-Hastings}{$p^{*}(\theta), q(\theta; \theta^{(t)})$} \State initialise $\theta^{(0)}$ \While{$t = 0, 1, \dots, T, \dots, T_{\mathrm{end}}$} \State $\theta^{\prime} \gets$ \Call{sample-from-proposal-distribution}{$q(\theta; \theta^{(t)})$} \Comment{generate a new state} \State $a \gets \displaystyle \frac{p^{*}(\theta^{\prime})}{p^{*}(\theta^{(t)})} \frac{q(\theta^{(t)}; \theta^{\prime})}{q(\theta^{\prime}; \theta^{(t)})}$ \Comment{calculate Metropolis - Hastings score} \State sample: $u \sim \mathrm{uniform}(0, 1)$ \If{$u < a$} \Comment{always true if $a \ge 1$, true with probability $a$ otherwise} \State $\theta^{(t + 1)} \gets \theta^{\prime}$ \Comment{accept the new state} \Else \State $\theta^{(t + 1)} \gets \theta^{(t)}$ \Comment{reject the new state} \EndIf \EndWhile \State return $\{\theta^{(t)}\}_{t = T}^{T_{\mathrm{end}}}$ \EndProcedure \end{algorithmic} \end{algorithm}
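A minimal sketch of the Metropolis - Hastings method in NumPy is given below. It assumes a symmetric Gaussian random-walk proposal, so the ratio of proposal densities in the score cancels, and it works with log-densities for numerical stability. The function name `metropolis_hastings`, the argument `log_p_star` and the toy target $\mathcal{N}(3, 1)$ are choices made for this illustration only.

```python
import numpy as np


def metropolis_hastings(log_p_star, theta_init, num_steps, step_size, rng):
    """Random-walk Metropolis - Hastings with a symmetric Gaussian proposal.

    log_p_star returns the log of the un-normalised target density.
    The proposal is symmetric, hence the q-ratio in the score equals 1.
    """
    theta = np.atleast_1d(np.asarray(theta_init, dtype=float))
    log_p = log_p_star(theta)
    chain = np.empty((num_steps, theta.size))
    num_accepted = 0

    for t in range(num_steps):
        proposal = theta + step_size * rng.standard_normal(theta.size)
        log_p_new = log_p_star(proposal)
        log_a = log_p_new - log_p  # log of the Metropolis - Hastings score
        # accept with probability min(1, a)
        if np.log(rng.uniform()) < log_a:
            theta, log_p = proposal, log_p_new
            num_accepted += 1
        chain[t] = theta  # a rejection writes the current state again

    return chain, num_accepted / num_steps


rng = np.random.default_rng(0)
# Assumed toy target: un-normalised Gaussian with mean 3 and unit variance.
chain, acceptance_rate = metropolis_hastings(
    log_p_star=lambda theta: -0.5 * np.sum((theta - 3.0) ** 2),
    theta_init=0.0,
    num_steps=20_000,
    step_size=2.0,
    rng=rng,
)
samples = chain[2_000:, 0]  # discard the first states as burn-in
print(f"mean={samples.mean():.2f}, var={samples.var():.2f}, "
      f"acceptance rate={acceptance_rate:.2f}")
```

Note that only the un-normalised density is needed: the normalising constant cancels in the score.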

Different from rejection sampling

In rejection sampling, rejected points are discarded and have no influence on the list of samples that are collected to represent the distribution $p(\theta)$. In the Metropolis - Hastings method, rejected points are also discarded, but a rejection causes the current state to be written again onto the list.

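Convergence of the Metropolis - Hastings method

It has been shown that for any positive proposal distribution, i.e., $q(\theta^{\prime}; \theta) > 0$ for all $\theta$ and $\theta^{\prime}$, as $t \to \infty$, the probability distribution of $\theta^{(t)}$ converges to its true distribution $p(\theta)$ defined in Equation 1.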

Dependency of samples generated from the Metropolis - Hastings method

The Metropolis - Hastings method is an example of a Markov chain Monte Carlo method (abbreviated MCMC). In MCMC methods, a Markov process is employed to generate a sequence of states $\{\theta^{(t)}\}$, where each sample $\theta^{(t)}$ has a probability distribution that depends on the previous state, $\theta^{(t - 1)}$. Because successive samples are dependent, the Markov chain may need to be run for a considerable amount of time to generate samples that are effectively independent samples from the target distribution $p(\theta)$.

3 The Hamiltonian Monte Carlo method

The Hamiltonian Monte Carlo method is an instance of the Metropolis - Hastings method that is applicable to continuous domains. It makes use of gradient information to reduce random-walk behaviour, potentially resulting in a more efficient MCMC method. In particular, it replaces the proposal distribution by an implicit distribution in the form of a differential equation.

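Similar to the Metropolis - Hastings method, we assume that the density of interest, here the posterior $p(\theta \mid \mathcal{D})$ given a dataset $\mathcal{D}$, is known up to a normalising constant and written in terms of the potential energy as follows:
$$p(\theta \mid \mathcal{D}) \propto \exp \left( -U(\theta) \right).$$

The potential energy, $U(\theta)$, is defined as:
$$U(\theta) = -\sum_{x \in \mathcal{D}} \ln p(x \mid \theta) - \ln p(\theta),$$
where $p(x \mid \theta)$ is a likelihood function, and $p(\theta)$ is the prior distribution of $\theta$.

The Hamiltonian Monte Carlo method augments the variable of interest, $\theta$, by an $N$-dimensional vector of momentum variables $\rho$. A common analogy is that $\theta$ is the position, while $\rho$ is the momentum (mass times velocity) of an object of interest. In that case, the kinetic energy is defined as follows:
$$K(\rho) = \frac{1}{2} \rho^{\top} M^{-1} \rho,$$
where $M$ is a symmetric positive definite matrix known as the mass matrix.

The Hamiltonian (total energy) of the whole system can then be defined as:
$$H(\theta, \rho) = U(\theta) + K(\rho).$$

One can then define the joint probability density as:
$$\pi(\theta, \rho) \propto \exp \left( -H(\theta, \rho) \right) = \exp \left( -U(\theta) \right) \exp \left( -K(\rho) \right). \tag{5}$$

Since the joint distribution $\pi(\theta, \rho)$ is separable, the marginal distribution of $\theta$ is the desired distribution $p(\theta \mid \mathcal{D})$. Thus, simply discarding the momentum variables yields a sequence of samples $\{\theta^{(t)}\}$ that asymptotically come from $p(\theta \mid \mathcal{D})$.

The Hamiltonian dynamics can be written as:
$$\begin{cases} \mathrm{d}\theta = M^{-1} \rho \, \mathrm{d}t \\ \mathrm{d}\rho = -\nabla_{\theta} U(\theta) \, \mathrm{d}t. \end{cases} \tag{6}$$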

2D analogy of the Hamiltonian dynamics (Chen et al. 2014)

To analogise the Hamiltonian dynamics, one can imagine a hockey puck sliding over a frictionless ice surface of varying height. The potential energy is proportional to the height of the surface at the current position, $\theta$, of the puck, while the kinetic energy is determined by the momentum, $\rho$, and the mass, $M$, of the hockey puck.

If the surface is flat: $\nabla_{\theta} U(\theta) = 0$, then the hockey puck will move at a constant speed.

If it is going uphill (positive slope: $\nabla_{\theta} U(\theta) > 0$), the kinetic energy decreases as the potential energy increases until the kinetic energy reaches 0 (equivalently, $\rho = 0$). The hockey puck stops for an instant and begins to slide back down the hill, which increases the kinetic energy and decreases the potential energy.

Equation 6 defines the transformation of the two variables $(\theta, \rho)$ from time $t$ to time $t + s$. This transformation is reversible. Moreover, the Hamiltonian is invariant (also known as the preservation of the Hamiltonian $H$):
$$\frac{\mathrm{d} H(\theta, \rho)}{\mathrm{d}t} = \nabla_{\theta} U(\theta)^{\top} \frac{\mathrm{d}\theta}{\mathrm{d}t} + \left( M^{-1} \rho \right)^{\top} \frac{\mathrm{d}\rho}{\mathrm{d}t} = \nabla_{\theta} U(\theta)^{\top} M^{-1} \rho - \rho^{\top} M^{-1} \nabla_{\theta} U(\theta) = 0.$$

This makes any proposal obtained from such a perfect simulation always acceptable. If the simulation is imperfect, for example due to the finite step size used when performing the integration, then some of the dynamical proposals will be rejected. The rejection rule makes use of the change in $H(\theta, \rho)$, which is zero if the simulation is perfect. Please refer to Algorithm 2 for further details of the Hamiltonian Monte Carlo method.

\begin{algorithm} \caption{Hamiltonian Monte Carlo method} \begin{algorithmic} \Procedure{Hamiltonian-MC}{$U(.), M, \varepsilon$} \State initialise $\theta^{(1)}$ \While{$t = 1, 2, \dots, T, \dots, T_{\mathrm{end}}$} \State sample momentum: $\rho^{(t)} \sim \mathcal{N}(0, M)$ \State evaluate total energy: $H \gets U(\theta^{(t)}) + K(\rho^{(t)})$ \State $\theta^{(t, 1)} \gets \theta^{(t)}$ \State $\rho^{(t, 1)} \gets \rho^{(t)}$ \For{$i = 1, 2, \dots, \tau$} \Comment{Simulate for next state} \State $\rho^{(t, i + \frac{1}{2})} \gets \rho^{(t, i)} - \frac{1}{2} \varepsilon \nabla_{\theta} U(\theta^{(t, i)})$ \Comment{make a half-step in $\rho$} \State $\theta^{(t, i + 1)} \gets \theta^{(t, i)} + \varepsilon M^{-1} \rho^{(t, i + \frac{1}{2})}$ \Comment{make a step in $\theta$} \State $\rho^{(t, i + 1)} \gets \rho^{(t, i + \frac{1}{2})} - \frac{1}{2} \varepsilon \nabla_{\theta} U(\theta^{(t, i + 1)})$ \Comment{make another half-step in $\rho$ at the new $\theta$} \EndFor \State $\theta^{\prime} \gets \theta^{(t, \tau + 1)}$ \Comment{new state of $\theta$} \State $\rho^{\prime} \gets \rho^{(t, \tau + 1)}$ \Comment{new state of momentum} \State evaluate total energy with the new state: $H_{\mathrm{new}} \gets U(\theta^{\prime}) + K(\rho^{\prime})$ \State calculate: $\operatorname{d}H \gets H_{\mathrm{new}} - H$ \State sample: $u \sim \mathrm{uniform}(0, 1)$ \If{$u < \exp(-\operatorname{d}H)$} \Comment{Metropolis - Hastings step} \State $\theta^{(t + 1)} \gets \theta^{\prime}$ \Comment{accept the new state} \Else \State $\theta^{(t + 1)} \gets \theta^{(t)}$ \Comment{reject the new state} \EndIf \EndWhile \State return $\{\theta^{(t)}\}_{t = T}^{T_{\mathrm{end}}}$ \EndProcedure \end{algorithmic} \end{algorithm}
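The Hamiltonian Monte Carlo procedure can be sketched in NumPy as follows. The sketch merges consecutive half-steps in the momentum into full steps, which is equivalent to the leapfrog scheme, and samples the momentum from $\mathcal{N}(0, M)$, the distribution implied by the kinetic energy $K(\rho) = \frac{1}{2} \rho^{\top} M^{-1} \rho$. The function names `leapfrog` and `hmc`, the correlated 2-D Gaussian target and all hyper-parameter values are assumptions made for illustration.

```python
import numpy as np


def leapfrog(theta, rho, grad_u, step_size, num_steps, m_inv):
    """Simulate the Hamiltonian dynamics with the leapfrog integrator."""
    theta, rho = theta.copy(), rho.copy()
    rho -= 0.5 * step_size * grad_u(theta)  # initial half-step in momentum
    for i in range(num_steps):
        theta += step_size * (m_inv @ rho)  # full step in position
        if i < num_steps - 1:
            rho -= step_size * grad_u(theta)  # two merged half-steps
    rho -= 0.5 * step_size * grad_u(theta)  # final half-step in momentum
    return theta, rho


def hmc(u, grad_u, theta_init, num_samples, step_size, num_leapfrog, mass, rng):
    """Hamiltonian Monte Carlo with a Metropolis - Hastings correction."""
    m_inv = np.linalg.inv(mass)
    m_chol = np.linalg.cholesky(mass)
    theta = np.asarray(theta_init, dtype=float)
    chain = np.empty((num_samples, theta.size))

    def kinetic(rho):
        return 0.5 * rho @ m_inv @ rho

    for t in range(num_samples):
        rho = m_chol @ rng.standard_normal(theta.size)  # rho ~ N(0, M)
        h_old = u(theta) + kinetic(rho)
        theta_new, rho_new = leapfrog(
            theta, rho, grad_u, step_size, num_leapfrog, m_inv
        )
        d_h = u(theta_new) + kinetic(rho_new) - h_old
        # accept with probability min(1, exp(-dH))
        if np.log(rng.uniform()) < -d_h:
            theta = theta_new
        chain[t] = theta
    return chain


# Assumed toy target: zero-mean 2-D Gaussian with correlated components.
target_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
precision = np.linalg.inv(target_cov)
rng = np.random.default_rng(0)
chain = hmc(
    u=lambda theta: 0.5 * theta @ precision @ theta,
    grad_u=lambda theta: precision @ theta,
    theta_init=np.zeros(2),
    num_samples=4_000,
    step_size=0.15,
    num_leapfrog=12,
    mass=np.eye(2),
    rng=rng,
)
samples = chain[500:]  # discard burn-in
print("sample covariance:\n", np.cov(samples.T))
```

Each proposal costs `num_leapfrog` gradient evaluations plus two evaluations of the potential energy, all on the full dataset when $U$ is a posterior energy.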

Despite its efficiency, the Hamiltonian Monte Carlo method still requires a pass through the entire dataset to evaluate the gradient $\nabla_{\theta} U(\theta)$ used in the integration, as well as to perform the Metropolis - Hastings step that decides whether to accept or reject the new state generated from the Hamiltonian dynamics. Hence, through the lens of machine learning, it is impractical, especially for large-scale datasets. This motivates further studies and development to make the method practical.

4 Stochastic gradient Hamiltonian Monte Carlo

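To reduce the cost of calculating $\nabla_{\theta} U(\theta)$ on the entire dataset $\mathcal{D}$, stochastic versions of Hamiltonian Monte Carlo are proposed in (Welling and Teh 2011; Chen et al. 2014). In this case, the whole-batch gradient, $\nabla_{\theta} U(\theta)$, is estimated by a noisy estimator, $\nabla_{\theta} \tilde{U}(\theta)$, which is based on a single mini-batch, $\tilde{\mathcal{D}}$, of data. Such a noisy estimator can be written as follows:
$$\nabla_{\theta} \tilde{U}(\theta) = -\frac{|\mathcal{D}|}{|\tilde{\mathcal{D}}|} \sum_{x \in \tilde{\mathcal{D}}} \nabla_{\theta} \ln p(x \mid \theta) - \nabla_{\theta} \ln p(\theta).$$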

If there are many mini-batches, we can apply the Central Limit Theorem to approximate the noisy gradient of the potential energy as follows: where is the covariance matrix of the stochastic gradient noise (Welling and Teh 2011, Eq. (6)): and denotes the matrix such that (e.g., Cholesky decomposition).
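To make the mini-batch estimator concrete, the sketch below compares the full-batch gradient of the potential energy with its rescaled mini-batch estimate. The model (a Gaussian likelihood with unit variance and a zero-mean Gaussian prior on a scalar $\theta$), the dataset and the function names `grad_u_full` and `grad_u_minibatch` are assumptions chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)  # assumed toy dataset
prior_var = 100.0  # assumed variance of the zero-mean Gaussian prior


def grad_u_full(theta, data):
    """Gradient of U(theta) = -sum_x ln p(x | theta) - ln p(theta)."""
    return np.sum(theta - data) + theta / prior_var


def grad_u_minibatch(theta, data, batch_size, rng):
    """Noisy gradient: likelihood term rescaled by |D| / |mini-batch|."""
    batch = rng.choice(data, size=batch_size, replace=False)
    scale = data.size / batch_size
    return scale * np.sum(theta - batch) + theta / prior_var


theta = 0.5
full_gradient = grad_u_full(theta, data)
noisy_gradients = np.array(
    [grad_u_minibatch(theta, data, 100, rng) for _ in range(2_000)]
)
print(f"full-batch gradient:       {full_gradient:.1f}")
print(f"mean of noisy gradients:   {noisy_gradients.mean():.1f}")
print(f"std of one noisy gradient: {noisy_gradients.std():.1f}")
```

The noisy gradients average out to the full-batch gradient, but each individual estimate carries substantial noise, which is the quantity modelled by the Gaussian approximation above.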

4.1 Naive stochastic gradient Hamiltonian Monte Carlo

A naive way is to directly substitute the noisy estimator in Equation 7 into the Hamiltonian dynamics in Equation 6:

In this case, the Hamiltonian is not guaranteed to be invariant:

When using a larger mini-batch size: , the variance is smaller: , resulting in At the limit, the total energy is preserved, which is the full-batch Hamiltonian Monte Carlo mentioned above.

When using a much smaller mini-batch size: , the noise induced by the mini-batch, , is large (e.g., in terms of matrix norm), resulting in Consequently, the Hamiltonian is no longer invariant.

To correct the error due to the effect of mini-batches, one needs to perform a Metropolis - Hastings step to either accept or reject the new state. Whether one runs a short or a long simulation (corresponding to a small or large $\tau$ in Algorithm 2), the cost of a Metropolis - Hastings step is still extremely large, and it is wasted if the sample is rejected. One workaround is to run the Metropolis - Hastings step on a subset of data instead of the entire dataset (Korattikara et al. 2014; Bardenet et al. 2014). There are, of course, some trade-offs when using such approaches.

Hockey puck on ice surface with random wind

To continue with the same analogy of a hockey puck, the environment is now different with random wind blowing over the ice surface. That random wind may push the hockey puck further away in some random direction.

Indeed, whether the joint distribution $\pi(\theta, \rho)$ is stationary can be determined by analysing the corresponding Fokker - Planck equation, as shown in the appendix on stochastic gradient with mini-batches (Section 7.1). In this case, $\pi(\theta, \rho)$ is proved to be non-stationary.

In (Chaudhari and Soatto 2018), the joint distribution in Equation 5 is assumed to be stationary under the stochastic dynamics in Equation 8. This is equivalent to assuming that the left-hand side of the Fokker - Planck equation is zero. The authors then analyse and show that the stationary distribution does not converge to the desired posterior distribution in general (Chaudhari and Soatto 2018). This is, however, only meaningful if the stationary distribution exists. In this case, we show that it does not (the distribution is non-stationary, as shown in Section 7.1).

4.2 Stochastic gradient Hamiltonian Monte Carlo with “friction”

One way to overcome the stochastic estimation for the gradient of the potential energy, , is to introduce a “friction” term to the momentum update: where: denotes friction coefficient matrix. One requirement for is that: (see the section on stationary SGD with injected noise for further details).

Hockey puck on a friction surface with random wind

To continue with the same analogy, the hockey puck is now sliding not on a frictionless ice surface, but on a street surface whose asphalt induces friction. There is still a random wind blowing. However, the friction of the surface prevents the hockey puck from drifting too far away from its expected position.

In this case, one can prove that the joint distribution $\pi(\theta, \rho)$ is stationary.

To link this sampling to the stochastic gradient descent, one can sample and apply one leapfrog step as follows:

It can be simplified by substituting into the expression of to obtain: which has a similar form as the Stochastic Gradient Langevin Dynamics (Welling and Teh 2011).
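The discretised update with friction can be sketched as below, following the practical update described in (Chen et al. 2014) with an identity mass matrix, a scalar friction coefficient and the noise estimate set to zero. These simplifications, the function name `sghmc` and the toy target (a standard normal whose gradient is corrupted by artificial Gaussian noise) are assumptions of this example. No Metropolis - Hastings step is used, so the samples carry a small bias that depends on the step size.

```python
import numpy as np


def sghmc(noisy_grad_u, theta_init, num_steps, step_size, friction, rng):
    """Stochastic gradient HMC with friction (M = I, noise estimate = 0)."""
    theta = np.array(theta_init, dtype=float)
    rho = np.zeros_like(theta)
    chain = np.empty((num_steps,) + theta.shape)
    # standard deviation of the injected noise N(0, 2 * friction * step_size)
    noise_scale = np.sqrt(2.0 * friction * step_size)

    for t in range(num_steps):
        theta = theta + step_size * rho
        rho = (
            rho
            - step_size * noisy_grad_u(theta, rng)  # noisy gradient step
            - step_size * friction * rho  # friction term
            + noise_scale * rng.standard_normal(theta.shape)  # injected noise
        )
        chain[t] = theta
    return chain


def noisy_grad_u(theta, rng):
    """Gradient of U(theta) = theta^2 / 2 plus assumed mini-batch-like noise."""
    return theta + rng.standard_normal(theta.shape)


rng = np.random.default_rng(0)
chain = sghmc(
    noisy_grad_u,
    theta_init=np.zeros(1),
    num_steps=50_000,
    step_size=0.05,
    friction=1.0,
    rng=rng,
)
samples = chain[5_000:, 0]  # discard burn-in
print(f"mean={samples.mean():.2f}, var={samples.var():.2f} (target: 0 and 1)")
```

Setting `friction=0.0` in this sketch removes both the friction term and the injected noise, which recovers the naive stochastic dynamics in which the gradient noise keeps adding energy to the system.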

5 Conclusion

This post reviews some seminal studies on stochastic gradient and Monte Carlo sampling. Many subsequent studies have explored and extended these ideas further, mostly building on top of these studies to achieve better performance. However, it is important to understand the basics before moving on to more advanced methods. Hopefully, this post is useful in one way or another.

Appendices

6 Fokker - Planck equation

The Fokker - Planck equation is used to analyse the evolution of the distribution of the variables in stochastic differential equation: where is some function (e.g., loss function), is a diffusion matrix and is the Brownian motion, and is a temperature.

Lemma 1 The distribution of the variable in Equation 9 evolves following the Fokker - Planck equation: where: denotes the divergence, and the divergence operator is applied column-wise to matrices.

Thus, one can prove that the distribution of the solution in the stochastic equation Equation 9 is invariant by simply proving that .

7 Stationary distribution of parameters obtained from SGD

The main focus of this section is to investigate the stationary distribution obtained through the stochastic gradient Hamiltonian Monte Carlo. Two types of noises are considered: (i) noise due to mini-batch effect and (ii) injected noise as in (Welling and Teh 2011). The main tool is the Fokker - Planck equation presented in the section about the Fokker - Planck equation. To use the Fokker - Planck equation, the two variables of interest are coupled into a single vector:

7.1 Stochastic gradient with mini-batches

The dynamics in Equation 8 can be rewritten as: where: .

The corresponding Fokker - Planck equation can be written as:

Note that: (assuming the temperature: ), then . Thus, we can rewrite the Fokker - Planck equation as follows:

For the last equality, we use the fact that:

The result in Equation 12 does not guarantee that In other words, there is not enough evidence to prove that is stationary.

In practice, when we perform SGD, the covariance matrix becomes smaller and smaller. In such case, we can assume that , and hence, the distribution is stationary.

7.2 Stochastic gradient with friction

7.2.1 Known covariance matrix

According to (Chen et al. 2014), if the covariance matrix induced by the mini-batch effect is known, then one can introduce a friction force to the system as follows:

This can be rewritten in the form of vectors and matrices as follows:

Following the notations defined in Equation 11, the system dynamics can be rewritten as:

The corresponding Fokker - Planck equation is then written as:

The third equality is due to the Identity 1.11.16 in Tensor calculus note.

The last equality holds due to the fact that . This can easily be proved by using the definition of divergence and the structure of (noise is added to although it depends on ).

In summary, injecting a noise corresponding to a friction force results in a stationary distribution .

7.2.2 Practical stochastic gradient Hamiltonian Monte Carlo with unknown covariance matrix

In practice, we might not know the covariance matrix . In such a situation, one might introduce a friction matrix that satisfies: . In other words, is positive definite. In this case, the system is over-damped and the total energy will gradually decrease to 0.

Note

Although, in certain situations, one can prove that stochastic gradient Hamiltonian Monte Carlo results in a stationary distribution, this does not mean that the stationary distribution is the true posterior of interest (the one without any noise).

References

Bardenet, Rémi, Arnaud Doucet, and Chris Holmes. 2014. “Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach.” International Conference on Machine Learning, 405–13.

Chaudhari, Pratik, and Stefano Soatto. 2018. “Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks.” International Conference on Learning Representations.

Chen, Tianqi, Emily Fox, and Carlos Guestrin. 2014. “Stochastic Gradient Hamiltonian Monte Carlo.” International Conference on Machine Learning, 1683–91.

Korattikara, Anoop, Yutian Chen, and Max Welling. 2014. “Austerity in MCMC Land: Cutting the Metropolis - Hastings Budget.” International Conference on Machine Learning, 181–89.

MacKay, David JC. 2003. Information Theory, Inference and Learning Algorithms. Cambridge university press.

Welling, Max, and Yee W Teh. 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics.” International Conference on Machine Learning, 681–88.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{nguyen2023,
  author = {Nguyen, Cuong},
  title = {Stochastic Gradient and {Hamiltonian} {Monte} {Carlo}},
  date = {2023-11-19},
  url = {https://cnguyen10.github.io/posts/stochastic_grad_hamiltonian_monte_carlo/},
  langid = {en}
}

For attribution, please cite this work as:

Nguyen, Cuong. 2023. “Stochastic Gradient and Hamiltonian Monte Carlo.” November 19. https://cnguyen10.github.io/posts/stochastic_grad_hamiltonian_monte_carlo/.

Expectation - Maximisation algorithm and its applications in finite mixture models

Cuong Nguyen — Sun, 17 Jul 2022 00:00:00 GMT

Missing data and latent variables are frequently encountered in various machine learning and statistical inference applications. A common example is the finite mixture model, which includes Gaussian mixture and multinomial mixture models. Due to the inherent nature of missing data or latent variables, calculating the likelihood of these models requires marginalisation over the latent variable distribution. This, in turn, complicates the process of maximum likelihood estimation (MLE).

The expectation-maximisation (EM) algorithm, introduced in (Dempster et al. 1977), offers a general technique for handling latent variable models. The fundamental concept behind the EM algorithm is to iterate between two steps: the E-step (expectation step) and the M-step (maximisation step). In the E-step, the posterior distribution of the latent variables (or missing data) is estimated. This estimated information is then used in the M-step to compute the MLE as if the data were complete. It has been proven that this iterative process guarantees a non-decreasing likelihood function. In simpler terms, under mild regularity conditions, the EM algorithm converges to a stationary point of the likelihood, which is typically a local maximum, although a saddle point is possible.

While the EM algorithm is a powerful tool, this explanation may not be as clear as desired. Consequently, this post aims to provide a more accessible explanation of the EM algorithm. Additionally, some readers may question the choice of EM over stochastic gradient descent (SGD), a prevalent optimisation method. This post will, therefore, explore the key differences between these two approaches. Finally, the applications of the EM algorithm in the context of finite mixture modelling, specifically focusing on the MLE problems in Gaussian mixture models and multinomial mixture models, are also demonstrated.

1 Notations

Before diving into the explanation and formulation, it is important to define the notations used in this post as follows:

Notations used in the formulation of the EM algorithm.
Notation	Description
$\mathbf{x}$	observable data
$\mathbf{z}$	latent variable or missing data
$\theta$	the parameter of interest in MLE

2 EM algorithm

The formulation presented in this post follows a probabilistic approach. In probabilistic modelling, there are two processes: data generation (also known as a forward problem) and parameter inference (also known as an inverse problem).

2.1 Data generation

The data is generated as follows:

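draw the parameter $\pi$ from its prior: $\pi \sim p(\pi)$,
draw the parameter $\theta$ from its prior: $\theta \sim p(\theta)$,
draw a hidden sample $\mathbf{z}$ from a prior distribution: $\mathbf{z} \sim p(\mathbf{z} \mid \pi)$, and
draw an observable sample $\mathbf{x}$ given $\mathbf{z}$ as follows: $\mathbf{x} \sim p(\mathbf{x} \mid \mathbf{z}, \theta)$,

where $\pi$ and $\theta$ are the parameters of the model of interest.

Parameter $\pi$

In many tutorials on EM, the parameter $\pi$ of the prior of the latent variable $\mathbf{z}$ is often defined implicitly. In this post, it is defined explicitly to make the explanation easier to follow.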

Such a data generation process is often visualised by the graphical model shown below

%%{
    init: {
        'theme': 'base',
        'themeVariables': {
            'primaryColor': '#ffffff'
        }
    }
}%%
flowchart LR
    subgraph data["data"]
        z((z)):::nonfilled-->x((x)):::filled;
    end
    pi((π)):::nonfilled-->z;
    theta((θ)):::nonfilled-->x;

    linkStyle default stroke: black;
    classDef nonfilled fill: none;
    style data fill: none;

2.2 Parameter inference

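Given a set of observed i.i.d. data $\mathbf{x}$, the general objective is to infer the posterior of the parameters $\pi$ and $\theta$. Instead of inferring the exact posterior $p(\pi, \theta \mid \mathbf{x})$, which may be difficult in many cases, one can perform a point estimate, such as MLE or maximum a posteriori (MAP), which can be written as follows:
$$\max_{\pi, \theta} \ln p(\mathbf{x} \mid \pi, \theta) + \ln p(\pi) + \ln p(\theta) = \max_{\pi, \theta} \ln \left[ \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) \right] + \ln p(\pi) + \ln p(\theta). \tag{1}$$

Due to the presence of the sum over the latent variable $\mathbf{z}$, the incomplete log-likelihood $\ln p(\mathbf{x} \mid \pi, \theta)$ cannot easily be evaluated from the joint distribution $p(\mathbf{x}, \mathbf{z} \mid \pi, \theta)$ (especially when $\mathbf{z}$ is continuous and the sum becomes an integral), making the optimisation difficult.

Fortunately, according to the data generation presented in Section 2.1, the complete log-likelihood can be evaluated easily:
$$\ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) = \ln p(\mathbf{z} \mid \pi) + \ln p(\mathbf{x} \mid \mathbf{z}, \theta).$$

Such an assumption allows EM to get around the difficulty when evaluating the expression in Equation 1.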

Main idea behind EM

find a lower bound of the objective function in Equation 1,
tighten the lower bound, and
maximise the tightest lower bound.

The first two sub-steps combined are often known as the Expectation step (or E-step for short), while the last step is known as the Maximisation step (or M-step for short). These steps are then presented in the following sub-sub-sections.

2.2.1 Evidence lower bound (ELBO)

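To find a lower bound of the objective function in Equation 1, one can follow the variational inference approach to obtain the ELBO. In particular, let $q(\mathbf{z})$ be an arbitrary distribution of the latent variable $\mathbf{z}$. The incomplete log-likelihood in Equation 1 can be re-written as follows:
$$\ln p(\mathbf{x} \mid \pi, \theta) = \mathbb{E}_{q(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) - \ln q(\mathbf{z}) \right] + \mathrm{KL} \left[ q(\mathbf{z}) \, \| \, p(\mathbf{z} \mid \mathbf{x}, \pi, \theta) \right],$$
where: $\mathrm{KL} \left[ q \, \| \, p \right] = \mathbb{E}_{q} \left[ \ln q - \ln p \right]$ is the Kullback-Leibler divergence (KL divergence for short) between probability distributions $q$ and $p$.

Since $\mathrm{KL} \left[ q \, \| \, p \right] \ge 0$ and $\mathrm{KL} \left[ q \, \| \, p \right] = 0$ iff $q = p$, the log-likelihood of interest can be lower-bounded as:
$$\ln p(\mathbf{x} \mid \pi, \theta) \ge \mathbb{E}_{q(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) - \ln q(\mathbf{z}) \right],$$
and the equality occurs iff $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \pi, \theta)$, which is the posterior of the latent variable $\mathbf{z}$ after observing the data $\mathbf{x}$.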
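The decomposition of the log-likelihood into the ELBO and a KL divergence can be checked numerically on a tiny discrete example. The joint probabilities below are made-up numbers for a single fixed observation and a latent variable with three states; the function names `elbo` and `kl` are chosen for this sketch only.

```python
import numpy as np

# Assumed values of the joint p(x, z = k) for one fixed observation x.
joint = np.array([0.10, 0.25, 0.05])
log_evidence = np.log(joint.sum())  # ln p(x) = ln sum_z p(x, z)
posterior = joint / joint.sum()  # p(z | x)


def elbo(q):
    """E_q[ln p(x, z) - ln q(z)] for a discrete latent variable."""
    return np.sum(q * (np.log(joint) - np.log(q)))


def kl(q, p):
    """KL[q || p] between two discrete distributions with full support."""
    return np.sum(q * (np.log(q) - np.log(p)))


q = np.array([0.6, 0.3, 0.1])  # an arbitrary "variational" distribution
print(f"ln p(x)            = {log_evidence:.4f}")
print(f"ELBO(q) + KL       = {elbo(q) + kl(q, posterior):.4f}")
print(f"ELBO(q)            = {elbo(q):.4f}")
print(f"ELBO(p(z | x))     = {elbo(posterior):.4f}")
```

The ELBO never exceeds the log-likelihood, and the gap closes exactly when the variational distribution equals the true posterior.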

2.2.2 Tightening the ELBO

To obtain the tightest lower bound, one must perform the following optimisation:
$$\max_{q} \mathbb{E}_{q(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) - \ln q(\mathbf{z}) \right].$$

As mentioned above, the tightest bound is attained when $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \pi, \theta)$, that is, when the “variational” posterior matches the true posterior of the latent variable $\mathbf{z}$. Such a true posterior can be obtained in certain simple cases, but is intractable when the modelling becomes more complex. In those cases, only a locally optimal “variational” posterior is calculated (Bernardo et al. 2003).

True posterior in the E-step

Such an observation explains why in the vanilla EM, it is often stated that the E-step is to calculate the true posterior of the latent variable, $p(\mathbf{z} \mid \mathbf{x}, \pi^{(t)}, \theta^{(t)})$. The superscript $(t)$ denotes the parameters at the $t$-th iteration. This is to avoid taking them into account when maximising the complete log-likelihood in the M-step. Instead of following that convention, $q(\mathbf{z})$ is used to avoid the confusion.

2.2.3 Maximising the possibly-tightest lower bound

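Finally, the possibly-tightest lower bound is then maximised with respect to the parameters $\pi$ and $\theta$ as follows:
$$\max_{\pi, \theta} \mathbb{E}_{q(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) \right] + \ln p(\pi) + \ln p(\theta).$$

In summary, instead of maximising the difficult-to-calculate objective function in Equation 1, the EM algorithm executes the alternative optimisation written as follows:
$$\max_{\pi, \theta} \max_{q} \mathbb{E}_{q(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) - \ln q(\mathbf{z}) \right] + \ln p(\pi) + \ln p(\theta).$$

The whole EM algorithm is summarised in Algorithm 1.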

\begin{algorithm} \caption{Expectation - Maximisation algorithm} \begin{algorithmic} \Procedure{EM}{$\mathbf{x}$} \State initialise mixture coefficient $\pi$ \State initialise $\theta$ \While{not converged} \State calculate the ELBO: $Q \gets \operatorname{E-step}(\mathbf{x}, \pi, \theta)$ \State maximise the ELBO: $\pi, \theta \gets \operatorname{M-step}(Q, \pi, \theta)$ \EndWhile \State return $\pi, \theta$ \EndProcedure \end{algorithmic} \end{algorithm}

2.3 Convergence of the EM algorithm

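The following theorem shows that the EM algorithm never decreases the log-likelihood after an iteration. For simplicity, the priors $p(\pi)$ and $p(\theta)$ are ignored in the proof below, but extending it to include these prior terms is trivial.

Theorem 1 Assume that $q^{(t)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \pi^{(t)}, \theta^{(t)})$, then after each EM iteration, the log-likelihood is non-decreasing. Mathematically, it can be written as follows:
$$\ln p(\mathbf{x} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) \ge \ln p(\mathbf{x} \mid \pi^{(t)}, \theta^{(t)}),$$
where the superscript denotes the result obtained after that iteration.

Proof. The log-likelihood of interest can be written as:
$$\ln p(\mathbf{x} \mid \pi, \theta) = \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) \right] - \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{z} \mid \mathbf{x}, \pi, \theta) \right].$$

Since it holds for any $(\pi, \theta)$, substituting $(\pi^{(t)}, \theta^{(t)})$ and $(\pi^{(t + 1)}, \theta^{(t + 1)})$ gives:
$$\ln p(\mathbf{x} \mid \pi^{(t)}, \theta^{(t)}) = \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t)}, \theta^{(t)}) \right] - \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{z} \mid \mathbf{x}, \pi^{(t)}, \theta^{(t)}) \right],$$
$$\ln p(\mathbf{x} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) = \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) \right] - \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{z} \mid \mathbf{x}, \pi^{(t + 1)}, \theta^{(t + 1)}) \right].$$

Subtracting the first of these two equations from the second, and using the assumption $q^{(t)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \pi^{(t)}, \theta^{(t)})$, gives the following:
$$\ln p(\mathbf{x} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) - \ln p(\mathbf{x} \mid \pi^{(t)}, \theta^{(t)}) = \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) \right] - \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t)}, \theta^{(t)}) \right] + \mathrm{KL} \left[ q^{(t)}(\mathbf{z}) \, \| \, p(\mathbf{z} \mid \mathbf{x}, \pi^{(t + 1)}, \theta^{(t + 1)}) \right].$$

Since the KL divergence is non-negative, one can conclude that:
$$\ln p(\mathbf{x} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) - \ln p(\mathbf{x} \mid \pi^{(t)}, \theta^{(t)}) \ge \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) \right] - \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t)}, \theta^{(t)}) \right].$$

In the M-step, the parameters $(\pi^{(t + 1)}, \theta^{(t + 1)})$ are obtained by maximising the first term on the right-hand side, $\mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi, \theta) \right]$, w.r.t. $(\pi, \theta)$. Thus, according to the definition of the maximisation:
$$\mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) \right] \ge \mathbb{E}_{q^{(t)}(\mathbf{z})} \left[ \ln p(\mathbf{x}, \mathbf{z} \mid \pi^{(t)}, \theta^{(t)}) \right].$$

Hence, one can conclude that:
$$\ln p(\mathbf{x} \mid \pi^{(t + 1)}, \theta^{(t + 1)}) \ge \ln p(\mathbf{x} \mid \pi^{(t)}, \theta^{(t)}).$$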

3 Applications of EM in finite mixture models

One of the typical applications of the EM algorithm is to perform maximum likelihood estimation for finite mixture models. This section is, therefore, dedicated to discussing the application of EM to Gaussian and multinomial mixture models.

3.1 Gaussian mixture models

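The Gaussian mixture distribution can be written as a convex combination of $K$ Gaussian components:
$$p(x \mid \pi, \mu, \Sigma) = \sum_{k = 1}^{K} \pi_{k} \, \mathcal{N}(x; \mu_{k}, \Sigma_{k}),$$
where: $\pi_{k} \ge 0$ and $\sum_{k = 1}^{K} \pi_{k} = 1$.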

3.1.1 Data generation

A data-point of the above Gaussian mixture distribution can be generated as follows:

sample a probability from a Dirichlet prior: ,
sample sets of parameters from an normal-inverse-Wishart prior: ,
sample the index of a Gaussian component: , then
sample a data-point from the corresponding Gaussian component: , where .

The data generation process can also be visualised in the graphical model shown below.

%%{
    init: {
        'theme': 'base',
        'themeVariables': {
            'primaryColor': '#ffffff'
        }
    }
}%%
flowchart LR
    subgraph data["data"]
        direction LR
        z((z)):::rv --> x((x)):::rv
    end
    alpha((α)):::notfilled --> pi((π)):::params --> z
    sigma((Σ)):::params --> mu
    psi((Ψ)):::notfilled --> sigma
    nu((ν)):::notfilled --> sigma
    sigma --> x
    mu0((m)):::notfilled --> mu((μ)):::params --> x
    lambda((λ)):::notfilled --> mu

    style z fill: none
    classDef params stroke: #000, fill: none
    classDef rv stroke: #000
    classDef notfilled fill: none
    linkStyle default stroke: #000
    style data fill: none

3.1.2 Objective

Given a set of data-points sampled from the Gaussian mixture distribution, the aim is to infer a point estimate, in particular the MAP estimate, of the parameters $\pi$, $\mu$ and $\Sigma$. Such an objective can be written as follows:

3.1.3 Parameter inference

In this case, one can simply follow the EM algorithm presented in Section 2.2. Note that the likelihood on i.i.d. data-points can be written as:

E-step: optimises the lower bound with respect to the “variational” posterior. As shown in Section 2.2, results in the tightest bound. Fortunately, in this case of Gaussian mixture models, the true posterior can be calculated in closed-form as follows:

M-step: maximises the “tightest” lower bound w.r.t. the model parameters:

Taking derivative with respect to and setting it to zero give:

Or:

Similarly for :

To solve for , the covariance matrix itself is used to left- and right-multiply to obtain:

Or:

One can further substitute in Equation 8 into Equation 9 to obtain an expression for that only depends on observed data and prior parameters.

Finally, one can obtain the optimal value for the mixture coefficient in a similar way, except it is now a constrained optimisation. Such an optimisation can be written as follows:

The constrained optimisation above can simply be solved with a Lagrange multiplier. The result for the mixture coefficient can then be expressed as:

One can also refer to Chapter 10.2 in (Bishop 2006) for a similar derivation and result.
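The E-step and M-step for a Gaussian mixture model can be sketched in NumPy as below. To keep the example short, the sketch drops the Dirichlet and normal-inverse-Wishart priors, so it computes the maximum likelihood estimate rather than the MAP estimate derived above. The function names `log_gaussian` and `em_gmm`, the initialisation and the synthetic two-cluster dataset are assumptions made for illustration.

```python
import numpy as np


def log_gaussian(x, mean, cov):
    """Log-density of N(mean, cov) evaluated at each row of x."""
    d = x.shape[1]
    chol = np.linalg.cholesky(cov)
    sol = np.linalg.solve(chol, (x - mean).T)  # shape: (d, n)
    maha = np.sum(sol**2, axis=0)  # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)


def em_gmm(x, num_components, num_iters, rng):
    """EM for a Gaussian mixture (maximum likelihood, priors ignored)."""
    n, d = x.shape
    pi = np.full(num_components, 1.0 / num_components)
    mu = x[rng.choice(n, size=num_components, replace=False)]
    sigma = np.stack([np.cov(x.T) + 1e-6 * np.eye(d)] * num_components)
    log_liks = []

    for _ in range(num_iters):
        # E-step: posterior of the component index for each data-point
        log_joint = np.stack(
            [np.log(pi[k]) + log_gaussian(x, mu[k], sigma[k])
             for k in range(num_components)],
            axis=1,
        )
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        gamma = np.exp(log_joint - log_norm[:, None])
        log_liks.append(log_norm.sum())  # log-likelihood before the update

        # M-step: closed-form updates weighted by the posterior
        n_k = gamma.sum(axis=0)
        pi = n_k / n
        mu = (gamma.T @ x) / n_k[:, None]
        for k in range(num_components):
            diff = x - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / n_k[k]
            sigma[k] += 1e-6 * np.eye(d)  # small jitter for stability

    return pi, mu, sigma, np.array(log_liks)


rng = np.random.default_rng(0)
# Assumed synthetic data: two Gaussian clusters in 2-D.
x = np.concatenate([
    rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(300, 2)),
    rng.normal(loc=[3.0, 1.0], scale=0.7, size=(300, 2)),
])
pi, mu, sigma, log_liks = em_gmm(x, num_components=2, num_iters=50, rng=rng)
print("mixture coefficients:", np.round(pi, 3))
print("means:\n", np.round(mu, 3))
print("log-likelihood increased monotonically:",
      bool(np.all(np.diff(log_liks) > -1e-6)))
```

The recorded log-likelihood values give a direct empirical check of the convergence theorem: up to numerical error, they never decrease from one iteration to the next.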

3.2 Multinomial mixture models

Similar to the Gaussian mixture models, a multinomial mixture model can also be written as:

is given

Only the case where all the multinomial components have the same number of trials is considered. The reason is that optimising for an integer number is beyond the scope of this post.

3.2.1 Data generation

A data-point of the multinomial mixture model can be generated as follows:

sample a probability from a Dirichlet prior: ,
sample probability vectors, , from a Dirichlet prior: ,
sample the index of a multinomial component: , then
sample a data-point from the corresponding multinomial component: , where .

The data generation process can also be visualised in the graphical model shown below.

%%{
    init: {
        'theme': 'base',
        'themeVariables': {
            'primaryColor': '#ffffff'
        }
    }
}%%
flowchart LR
    subgraph data["data"]
        direction LR
        z((z)):::rv --> x((x)):::rv
    end
    alpha((α)):::notfilled --> pi((π)):::params --> z;
    beta((β)):::params --> rho((ρ)):::params;
    rho --> x;

    style z fill: none
    classDef params stroke: #000, fill: none
    classDef rv stroke: #000
    classDef notfilled fill: none
    linkStyle default stroke: #000
    style data fill: none

3.2.2 Objective

Given a set of data-points sampled from a multinomial mixture distribution, the aim is to infer a point estimate, in particular the MAP estimate, of the parameters $\pi$ and $\rho$ as follows:

3.2.3 Parameter inference with EM

E-step calculates the posterior of the latent variable given the data :

M-step In the M-step, we maximise the following expected completed log-likelihood w.r.t. and :

Probability constraints on $\pi$ and $\rho$

Due to the nature of a multinomial mixture model, both the parameters $\pi$ and $\rho$ are probability vectors.

The Lagrangian for can be written as: where is the Lagrange multiplier.

Taking derivative of the Lagrangian w.r.t. gives:

Setting the derivative to zero and solving for gives:

And since , one can substitute and find that . Thus:

Similarly, the Lagrangian of can be expressed as: where is the Lagrange multiplier. Taking the derivative w.r.t. gives: Setting the derivative to zero and solving for gives: The constraint on as a probability vector leads to . Thus:

One can also refer to (Elmore and Wang 2003) for a similar derivation and result.
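To make the E-step and M-step concrete, here is a small sketch in plain Python. It assumes the standard MAP result for Dirichlet priors, namely that each update is a normalised sum of responsibility-weighted counts plus the pseudo-counts `alpha[k] - 1` and `beta[d] - 1`; the function names, the toy data and the initial values are all hypothetical. The multinomial coefficient is omitted because it cancels in the responsibilities.

```python
import math

def e_step(data, pi, rho):
    """Responsibilities: gamma[n][k] proportional to pi[k] * prod_d rho[k][d] ** x[n][d]."""
    gamma = []
    for x in data:
        log_w = [math.log(pi[k]) + sum(c * math.log(r) for c, r in zip(x, rho[k]))
                 for k in range(len(pi))]
        top = max(log_w)  # subtract the maximum before exponentiating, for stability
        w = [math.exp(v - top) for v in log_w]
        gamma.append([v / sum(w) for v in w])
    return gamma

def m_step(data, gamma, alpha, beta):
    """MAP updates: weighted counts plus Dirichlet pseudo-counts, then normalise."""
    n_k = [sum(g[k] for g in gamma) + alpha[k] - 1.0 for k in range(len(alpha))]
    pi = [v / sum(n_k) for v in n_k]
    rho = []
    for k in range(len(alpha)):
        c = [sum(g[k] * x[d] for g, x in zip(gamma, data)) + beta[d] - 1.0
             for d in range(len(beta))]
        rho.append([v / sum(c) for v in c])
    return pi, rho

def log_posterior(data, pi, rho, alpha, beta):
    """Unnormalised log-posterior of (pi, rho), up to additive constants."""
    total = sum((a - 1.0) * math.log(p) for a, p in zip(alpha, pi))
    total += sum((b - 1.0) * math.log(r) for row in rho for b, r in zip(beta, row))
    for x in data:
        total += math.log(sum(
            pi[k] * math.exp(sum(c * math.log(r) for c, r in zip(x, rho[k])))
            for k in range(len(pi))))
    return total

# Toy count vectors (10 trials each) forming two visible groups.
data = [[8, 1, 1], [7, 2, 1], [9, 0, 1], [1, 1, 8], [0, 2, 8], [1, 0, 9]]
alpha, beta = [2.0, 2.0], [2.0, 2.0, 2.0]
pi, rho = [0.5, 0.5], [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]]
history = []
for _ in range(20):
    gamma = e_step(data, pi, rho)
    pi, rho = m_step(data, gamma, alpha, beta)
    history.append(log_posterior(data, pi, rho, alpha, beta))
```

Tracking `history` is a cheap sanity check: for MAP estimation with EM the log-posterior should not decrease between iterations. Concentration values above one keep every entry of `rho` strictly positive, which avoids taking the logarithm of zero.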

4 References

Bernardo, JM, MJ Bayarri, JO Berger, et al. 2003. “The Variational Bayesian EM Algorithm for Incomplete Data: With Application to Scoring Graphical Model Structures.” Bayesian Statistics 7 (453-464): 210.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Vol. 4. Springer.

Dempster, Arthur P, Nan M Laird, and Donald B Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22.

Elmore, Ryan T, and Shaoli Wang. 2003. Identifiability and Estimation in Finite Mixture Models with Multinomial Components. Technical Report 03-04, Pennsylvania State University.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{nguyen2022,
  author = {Nguyen, Cuong},
  title = {Expectation - {Maximisation} Algorithm and Its Applications
    in Finite Mixture Models},
  date = {2022-07-17},
  url = {https://cnguyen10.github.io/posts/mixture-models/},
  langid = {en}
}

For attribution, please cite this work as:

Nguyen, Cuong. 2022. “Expectation - Maximisation Algorithm and Its Applications in Finite Mixture Models.” July 17. https://cnguyen10.github.io/posts/mixture-models/.

Bias - variance decomposition

Cuong Nguyen — Tue, 03 May 2022 00:00:00 GMT

Bias and variance decomposition is one of the key tools to understand machine learning. However, conventional discussion about bias - variance decomposition revolves around the square loss (also known as mean square error). It is unclear whether such decomposition is still valid for some common loss functions, such as 0-1 loss or cross-entropy loss used in classification. This post is to present the decomposition for those losses following the unified framework of bias and variance decomposition from (Domingos 2000), its extended study on Bregman divergence with un-bounded support from (Pfau 2025) and the special case about Kullback-Leibler (KL) divergence (Heskes 1998).

1 Notations

The notations are similar to the ones in (Domingos 2000), but for -class classification.

Notations used in the bias-variance decomposition.
Notation	Description
	an input instance in
	the -dimensional simplex
	the -dimensional simplex
	a label instance: , for example: (i) one-hot vector if is a categorical distribution, or (ii) soft-label if is a Dirichlet or logistic normal distribution
	loss function , e.g. 0-1 loss or cross-entropy loss
	predicted label distribution:
	the set of training sets

2 Terminologies

Definition 1 The optimal prediction of a target is defined as follows:

Definition 2 The main model prediction for a loss function, , and the set of training sets, , is defined as:

Remark. The definitions of optimal and main model predictions above assume that the loss function is symmetric in its input arguments. For an asymmetric loss function, such as a Bregman divergence or cross-entropy, the definitions of such predictions might change slightly in the order of the input arguments.

Given the definitions of and , the bias, variance and noise can be defined following the unified framework proposed in (Domingos 2000) as follows:

Definition 3 The bias of a learner on an example is defined as: .

Definition 4 The variance of a learner on an example is defined as: .

Definition 5 The noise of an example is defined as: .

The definitions of bias and variance above are quite intuitive compared to other definitions in the literature. As is the main model prediction, the bias measures the systematic deviation (loss) from the optimal (or true) label , while the variance measures the loss induced by the fluctuations of each model prediction on different training datasets around the main prediction . In addition, as the loss is non-negative, both the bias and variance are also non-negative.

Given the definitions of bias, variance and noise above, the unified decomposition proposed in (Domingos 2000) can be expressed as: where and are two scalars. For example, in MSE, .

Of course, not all losses would satisfy the decomposition in Equation 1. However, as shown in (Domingos 2000 - Theorem 7), such a decomposition can be used to bound the expected loss as long as the loss is a metric. Nevertheless, in this post, we discuss the decomposition for some common loss functions, such as 0-1 loss and Bregman divergence, which includes MSE and Kullback-Leibler (KL) divergence.

3 Square loss

To warm up, we discuss a well-known bias-variance decomposition in the literature, which applies to the MSE or square loss. Here, we use vector notation instead of the scalar notation often seen in conventional analyses. We will derive a general decomposition for Bregman divergence, of which MSE is a particular case, in a later section.

Theorem 1 When the loss is the square loss: , then the expected loss on several training sets can be decomposed into:

Please refer to the detailed proof here
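As a quick numerical sanity check of the square-loss decomposition, the sketch below evaluates noise, bias and variance on a tiny hand-made example. It relies on the standard fact that, for the square loss, both the optimal prediction and the main model prediction are means. The label vectors and model predictions are made-up numbers, and labels and predictions are assumed to be independent and uniformly weighted.

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]

def sq_loss(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Hypothetical noisy labels observed for one input.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
# Hypothetical predictions of models trained on three different training sets.
predictions = [[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]

y_star = mean_vec(targets)       # optimal prediction under the square loss
y_main = mean_vec(predictions)   # main model prediction under the square loss

noise = sum(sq_loss(t, y_star) for t in targets) / len(targets)
bias = sq_loss(y_star, y_main)
variance = sum(sq_loss(y_main, y) for y in predictions) / len(predictions)

# Expected loss over every (label, training set) pair.
expected_loss = sum(sq_loss(t, y) for t in targets for y in predictions) / (
    len(targets) * len(predictions))
```

In this setting `expected_loss` matches `noise + bias + variance` up to floating-point error, which is the content of Theorem 1 with both scalars equal to one.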

From hyper-parameter optimisation to meta-learning

Cuong Nguyen — Mon, 22 Nov 2021 00:00:00 GMT

Meta-learning, also known as learning-to-learn, has been studied since the 1980s (Schmidhuber 1987; Naik and Mammone 1992), and has recently attracted much attention from the research community. Meta-learning is a technique in transfer learning, a learning paradigm that utilises knowledge gained from past experience to facilitate learning in the future. Because it is often defined only implicitly, meta-learning is frequently confused with other transfer learning techniques, e.g. fine-tuning, multi-task learning, domain adaptation and continual learning. The purpose of this post is to formulate meta-learning explicitly via empirical Bayes, and in particular hyper-parameter optimisation, to differentiate meta-learning from those common transfer learning approaches.

This post is structured as follows: First, we define some terminologies used in general transfer learning and review hyper-parameter optimisation in single-task setting. We then formulate meta-learning as an extension of hyper-parameter optimisation in multi-task setting. Finally, we show the differences between meta-learning and other transfer-learning approaches.

1 Background

1.1 Data generation model of a task

A data point of a task indexed by consists of an input and a corresponding label with . For simplicity, only two families of tasks, regression and classification, are considered in this post. As a result, the label is defined as for regression and as for classification, where is the number of classes.

Each data point in a task can be generated in 2 steps:

generate the input by sampling from some probability distribution ,
determine the label , where is the correct labelling function.

Both the probability distribution and the labelling function are unknown to the learning agent during training, and the aim of the supervised learning is to use the generated data to infer such labelling function .

For simplicity, we denote as the data generation model of task -th.

1.2 Task instance

Definition 1 (Hospedales et al. 2021)

A task or a task instance consists of an unknown associated data generation model , and a loss function , denoted as:

Remark. The loss function is defined abstractly, and can be either:

negative log-likelihood (NLL): , corresponding to maximum likelihood estimation. This type of loss is quite common in practice, for example:
- mean squared error (MSE) in regression
- cross-entropy in classification
variational-free energy (negative evidence lower-bound) — corresponding to the objective function in variational inference.

To solve a task , one needs to obtain an optimal task-specific model , parameterised by , which minimises a loss function on the data of that task:

In practice, since both and are unknown, the data generation model is replaced by a dataset consisting of a finite number of data-points generated according to the data generation model , denoted as . The objective to solve that task is often known as empirical risk minimisation:

Since the loss function used is the same for each task family, e.g. is NLL or variational-free energy, the subscript on the loss function is dropped, and the loss is denoted as throughout this post. Furthermore, given the commonality of the loss function across all tasks, a task can be simply represented by either its data generation model or the associated dataset .

1.3 Hyper-parameter optimisation

In single-task setting, the common way to tune or optimise a hyper-parameter is to split a given dataset into two disjoint subsets: where:

is the training (or support) subset,
is the validation (or query) subset.

Note that with this definition, , and and are not necessarily identical.

The subset is used to train the model parameter of interest , while the subset is used to validate the hyper-parameter, denoted by (we provide examples of the hyper-parameter in Section Formulation of meta-learning). Mathematically, hyper-parameter optimisation in the single-task setting can be written as the following bi-level optimisation:

We can extend the hyper-parameter optimisation from the two data subsets and to the general data generation model as the following: where and are the probability distributions of training and validation input data, respectively, and they are not necessarily identical.
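A toy instance of this bi-level structure is ridge regression with a scalar weight, where the hyper-parameter is the regularisation strength. In the sketch below the lower level has a closed-form solution on the training subset, and the upper level is a plain grid search on the validation subset. The data, the grid and the choice of ridge regression are assumptions for illustration; they are not taken from the post.

```python
def fit_ridge(train, lam):
    """Lower level: closed-form minimiser of sum (w * x - y)^2 + lam * w^2."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)

def mse(data, w):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

train = [(1.0, 2.3), (2.0, 3.6), (3.0, 6.5)]   # training (support) subset
valid = [(1.5, 3.0), (2.5, 5.0)]               # validation (query) subset
grid = [0.0, 0.1, 0.5, 1.0, 2.0]               # candidate hyper-parameter values

# Upper level: pick the hyper-parameter whose fitted weight has the lowest validation loss.
best_lam = min(grid, key=lambda lam: mse(valid, fit_ridge(train, lam)))
best_w = fit_ridge(train, best_lam)
```

The weight never sees the validation subset directly; it depends on it only through the chosen hyper-parameter, which is what makes the problem bi-level.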

Formulation of meta-learning

The setting of the meta-learning problem considered in this post follows the task environment (Baxter 2000) that describes the unknown distribution over a family of tasks. Each task is sampled from this task environment and can be represented as , where and are the probability distributions of training and validation input data, respectively, and are not necessarily identical. The aim of meta-learning is to use training tasks to train a meta-learning model that can be fine-tuned to perform well on an unseen task sampled from the same task environment.

Such meta-learning methods use meta-parameters to model the common latent structure of the task distribution . In this post, we consider meta-learning as an extension of hyper-parameter optimisation in single-task learning, where the hyper-parameter of interest (often called the meta-parameter) is shared across many tasks. Similar to the hyper-parameter optimisation presented in Section 1.3, the objective of meta-learning is also a bi-level optimisation:

The difference between meta-learning and hyper-parameter optimisation is that the meta-parameter (also known as hyper-parameter) is shared across all tasks sampled from the task environment as highlighted in red colour in Equation 2.

In practice, the meta-parameter (or shared hyper-parameter) can be chosen as one of the following:

learning rate of gradient-based optimisation used to minimise the lower level objective function in Equation 2 to learn (Li et al. 2017),
initialisation of model parameter (Finn et al. 2017),
data representation or feature extractor (Vinyals et al. 2016; Snell et al. 2017),
optimiser used to optimise the lower-level in Equation 2.

In this post, the meta-parameter is assumed to be the initialisation of model parameters. Formulation, derivation and analysis in the subsequent sections will, therefore, revolve around this assumption. Note that the analysis can be straightforwardly extended to other types of meta-parameters with slight modifications.

In general, the objective function of meta-learning in Equation 2 can be solved by gradient-based optimisation, such as gradient descent. Due to the nature of the bi-level optimisation, the optimisation is often carried out in two steps. The first step is to adapt (or fine-tune) the meta-parameter to the task-specific parameter . This corresponds to the optimisation in the lower-level, and can be written as: where is a hyper-parameter denoting the learning rate for task . For simplicity, the adaptation step in Equation 3 is carried out with only one gradient descent update.

The second step is to minimise the validation loss induced by the locally-optimal task-specific parameter evaluated on the validation subset w.r.t. the meta-parameter . This corresponds to the upper-level optimisation, and can be expressed as: where is another hyper-parameter representing the learning rate to learn .

The general algorithm of meta-learning using gradient-based optimisation is shown in Algorithm 1.

\begin{algorithm}
\caption{Training procedure of meta-learning in general}
\begin{algorithmic}
\Procedure{Training}{task environment $p(\mathcal{D}, f)$, learning rates $\gamma$ and $\alpha$}
    \State initialise meta-parameter $\theta$
    \While{$\theta$ not converged}
        \State sample a mini-batch of $T$ tasks from task environment $p\left( \mathcal{D}, f \right)$
        \For{each task $\mathcal{T}_{i}, i \in \{1, \ldots, T\}$}
            \State sample two data subsets $S_{i}^{(t)}$ and $S_{i}^{(v)}$ from task $\mathcal{T}_{i} = (\mathcal{D}_{i}^{(t)}, \mathcal{D}_{i}^{(v)}, f_{i})$
            \State adapt meta-parameter to task $\mathcal{T}_{i}$: $\mathbf{w}_{i}^{*} \left( \theta \right) = \theta - \frac{\alpha}{m_{i}^{(t)}} \sum_{j = 1}^{m_{i}^{(t)}} \nabla_{\theta} \left[ \ell \left( \mathbf{x}_{ij}^{(t)}, y_{ij}^{(t)}; \theta \right)\right]$
        \EndFor
        \State update meta-parameter: $\theta \gets \theta - \frac{\gamma}{T} \sum_{i=1}^{T} \frac{1}{m_{i}^{(v)}} \sum_{k=1}^{m_{i}^{(v)}} \nabla_{\theta} \left[\ell \left( \mathbf{x}_{ik}^{(v)}, y_{ik}^{(v)}; \mathbf{w}_{i}^{*} \left( \theta \right) \right) \right]$
    \EndWhile
    \State \textbf{return} the trained meta-parameter $\theta$
\EndProcedure
\end{algorithmic}
\end{algorithm}
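Algorithm 1 can be sketched on a deliberately tiny problem. The code below assumes a scalar model `y = w * x` with the squared loss, so that gradients and second derivatives can be written by hand, and a hypothetical task environment in which each task has its own slope drawn uniformly from [1, 3]. None of these choices come from the post; they only make the two-step update runnable without an automatic differentiation library.

```python
import random

def grad(data, w):
    """Gradient of the mean squared error of the scalar model y = w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in data) / len(data)

def curvature(data):
    """Second derivative of the same loss (constant in w for this model)."""
    return sum(2.0 * x * x for x, _ in data) / len(data)

def sample_task(rng, m=5):
    """Hypothetical task environment: y = a * x with a task-specific slope a."""
    a = rng.uniform(1.0, 3.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        return [(x, a * x) for x in xs]

    return draw(), draw()   # training (support) and validation (query) subsets

def train(num_steps=500, num_tasks=4, alpha=0.1, gamma=0.05, seed=0):
    rng = random.Random(seed)
    theta = 0.0                                   # initialise the meta-parameter
    for _ in range(num_steps):
        meta_grad = 0.0
        for _ in range(num_tasks):
            s_train, s_valid = sample_task(rng)
            w = theta - alpha * grad(s_train, theta)   # lower level: one adaptation step
            # Chain rule through the adaptation step: dw/dtheta = 1 - alpha * curvature.
            meta_grad += (1.0 - alpha * curvature(s_train)) * grad(s_valid, w)
        theta -= gamma * meta_grad / num_tasks    # upper level: meta-update
    return theta

theta = train()
```

For this toy environment the learned initialisation settles near the average slope, which is the point from which a single adaptation step does best on average.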

1.4 Second-order meta-learning

As shown in Equation 4, the optimisation for the meta-parameter requires the gradient of the validation loss averaged across tasks. Given that each task-specific parameter is a function of due to the lower-level optimisation in Equation 3, the gradient of interest can be expanded as: where the first equality is due to chain rule, and the second equality is the result that differentiates the gradient update in Equation 3. Note that in the second equality, we remove the transpose notation since the corresponding matrix is symmetric.

Thus, naively implementing such a gradient would require calculating the Hessian matrix, resulting in an intractable procedure for large models, such as deep neural networks. To obtain a more efficient implementation, one can utilise the Hessian-vector product (Pearlmutter 1994) between the gradient vector and the Hessian matrix to efficiently calculate the gradient of the validation loss w.r.t. .

Another way to calculate the gradient of the validation loss w.r.t. the meta-parameter is to use implicit differentiation (Domke 2012; Rajeswaran et al. 2019; Lorraine et al. 2020). This approach is advantageous since it does not need to store the computational graph and take the gradient via the chain rule. Such an implicit differentiation technique reduces the memory usage and, therefore, makes it possible to work with large-scale models. However, the trade-off is the increased computational time needed to calculate the gradient of interest.

Nevertheless, the implementations that compute the exact gradient of the validation loss w.r.t. without approximation are often referred to as second-order meta-learning.
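The Hessian-vector product route can be illustrated with a two-parameter linear model and the squared loss, for which the product has a simple closed form and the Hessian matrix is never built. The sketch assumes a single adaptation step, so the exact gradient is the validation gradient minus the learning rate times the training Hessian applied to that gradient. The data, `theta` and `alpha` are made-up values.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(data, w):
    """Mean squared error of the linear model y = w . x."""
    return sum((dot(w, x) - y) ** 2 for x, y in data) / len(data)

def grad(data, w):
    """Gradient of the mean squared error w.r.t. w."""
    g = [0.0] * len(w)
    for x, y in data:
        r = 2.0 * (dot(w, x) - y) / len(data)
        g = [gi + r * xi for gi, xi in zip(g, x)]
    return g

def hvp(data, v):
    """Hessian-vector product (2/m) * sum_j x_j (x_j . v), without forming the Hessian."""
    out = [0.0] * len(v)
    for x, _ in data:
        s = 2.0 * dot(x, v) / len(data)
        out = [oi + s * xi for oi, xi in zip(out, x)]
    return out

def adapt(train, theta, alpha):
    """One gradient descent step from the meta-parameter."""
    return [t - alpha * g for t, g in zip(theta, grad(train, theta))]

def meta_grad(train, valid, theta, alpha):
    """Exact gradient of the validation loss of the adapted parameter w.r.t. theta."""
    g_val = grad(valid, adapt(train, theta, alpha))
    return [g - alpha * h for g, h in zip(g_val, hvp(train, g_val))]

train = [([1.0, 0.5], 1.0), ([0.2, -1.0], -0.5), ([-0.7, 0.3], 0.2)]
valid = [([0.5, 0.5], 0.7), ([-1.0, 0.4], -0.3)]
theta, alpha = [0.1, -0.2], 0.1
g = meta_grad(train, valid, theta, alpha)
```

A finite-difference check of `loss(valid, adapt(train, theta, alpha))` with respect to `theta` is a convenient way to confirm that `g` is the exact gradient rather than an approximation.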

1.5 First-order meta-learning

In practice, the Hessian matrix is often omitted from the calculation to simplify the update for the meta-parameter (Finn et al. 2017). The resulting gradient consists of only the gradient of validation loss , which is more efficient to calculate with a single forward-pass if automatic differentiation is used. This approximation is often referred to as first-order meta-learning, and the gradient of interest can be presented as:

REPTILE (Nichol et al. 2018), a variant of first-order meta-learning, further approximates the gradient of validation loss by the difference , resulting in a much simpler approximation:
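The three update directions can be compared side by side on a scalar model `y = w * x` with the squared loss. This is only a sketch under assumptions: one adaptation step, hand-written derivatives, made-up data, and the REPTILE direction taken as the meta-parameter minus the adapted parameter, which matches the description above but ignores any scaling a particular implementation may apply.

```python
def grad(data, w):
    """Gradient of the mean squared error of the scalar model y = w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in data) / len(data)

def curvature(data):
    """Second derivative of the same loss (the scalar 'Hessian')."""
    return sum(2.0 * x * x for x, _ in data) / len(data)

def meta_gradients(train, valid, theta, alpha):
    w = theta - alpha * grad(train, theta)          # adapted task-specific parameter
    first_order = grad(valid, w)                    # Hessian term dropped
    second_order = (1.0 - alpha * curvature(train)) * first_order   # exact gradient
    reptile = theta - w                             # difference used by REPTILE
    return second_order, first_order, reptile

train_set = [(1.0, 2.0), (2.0, 4.0)]
valid_set = [(1.5, 3.0)]
second_order, first_order, reptile = meta_gradients(train_set, valid_set, theta=0.0, alpha=0.1)
```

On this example all three directions share the same sign and differ only in magnitude, which is the intuition for why the cheaper approximations can still make progress.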

2 Differentiation from other transfer learning approaches

In this section, some popular transfer learning methods are described with their objective functions to distinguish them from meta-learning.

2.1 Fine-tuning

Fine-tuning is the most common technique in neural network based transfer learning (Pratt et al. 1991; Yosinski et al. 2014) where the last or a couple of last layers in a neural network pre-trained on a source task are replaced and fine-tuned on a target task. Formally, if is denoted as the forward function of the shared layers with shared parameters , where and are the parameters of the remaining layers specifically trained on source and target tasks, respectively, then the objective of fine-tuning can be expressed as:

where and are the data sampled from the source task and target task , respectively.

Although the objective of fine-tuning shown in Equation 5 is still a bi-level optimisation, it is easier to solve than the one in meta-learning due to the following reasons:

The objective in fine-tuning has only one constraint corresponding to one source task, while meta-learning has several constraints corresponding to multiple training tasks.
In fine-tuning, and are inferred separately, while in meta-learning, the task-specific parameter is a function of the meta-parameter, resulting in a more complicated correlation.

The downside of fine-tuning is the requirement of a reasonable number of training examples on the target task to fine-tune . In contrast, meta-learning leverages the knowledge extracted from several training tasks to quickly adapt to a new task with only a few training examples.

2.2 Domain adaptation and generalisation

Domain adaptation or domain-shift refers to the case when the joint data-label distributions on source and target are different, denoted as , or simply (Heckman 1979; Shimodaira 2000; Japkowicz and Stephen 2002; Daume III and Marcu 2006; Ben-David et al. 2007). The aim of domain adaptation is to adapt the model trained on the source domain using the available data in the target domain, so that the adapted model can perform reasonably well on the target domain. In other words, domain adaptation relies on a data transformation that produces a domain-invariant latent space. Mathematically, the transformation is obtained by minimising a divergence between the two transformed data distributions:

After obtaining the transformation , one can simply train a model using the transformed data of the source domain, and then use that model to make predictions on the target domain.

Given the optimisation in Equation 6, domain adaptation is different from meta-learning due to the following reasons:

Domain adaptation assumes a shift in the task environments that generate source and target tasks, while meta-learning is based on the assumption of same task generation.
Domain adaptation utilises information of data from target domain, while meta-learning does not have such access.

In general, meta-learning learns a shared prior or hyper-parameters to generalise for unseen tasks, while domain adaptation produces a model to solve a particular task in a specified target domain. Recently, a variant of domain adaptation, named domain generalisation, has emerged, where the aim is to learn a domain-invariant model without any information about the target domain. In this view, domain generalisation is very similar to meta-learning, and there are some works that employ meta-learning algorithms for domain generalisation (Li et al. 2018; Li et al. 2019).

2.3 Multi-task learning

Multi-task learning learns several related auxiliary tasks and a target task simultaneously to exploit the diversity of task representation to regularise and improve the performance on the target task (Caruana 1997). If the input is assumed to be the same across extra tasks and the target task , then the objective of multi-task learning can be expressed as: where and are the label, loss function and the classifier for task , respectively, and is the shared feature extractor for tasks.

Multi-task learning is often confused with meta-learning due to their similar nature of extracting information from many tasks. However, the objective function of multi-task learning in Equation 7 is a single-level optimisation for the shared parameter and multiple task-specific classifiers . It is, therefore, not as complicated as the bi-level optimisation seen in meta-learning as shown in Equation 2. Furthermore, multi-task learning aims to solve a number of specific tasks known during training (referred to as target tasks), while meta-learning targets the generalisation for unseen tasks in the future.

2.4 Continual learning

Continual or life-long learning refers to a situation where a learning agent has access to a continuous stream of tasks available over time, and the number of tasks to be learnt is not pre-defined (Chen and Liu 2018; Parisi et al. 2019). The aim is to accommodate the knowledge extracted from one-time observed tasks to accelerate the learning of new tasks without catastrophically forgetting old tasks (French 1999). In this sense, continual learning is very similar to meta-learning. However, continual learning mostly focuses on systematic design to acquire new knowledge in a way that prevents interference with the existing knowledge, while meta-learning is more about algorithmic design to learn the new knowledge more efficiently. Thus, we cannot mathematically distinguish their differences as done in Sections 2.1, 2.2 and 2.3. Nevertheless, continual learning criteria, especially catastrophic forgetting, can be encoded into the meta-learning objective to further advance continual learning performance (Al-Shedivat et al. 2018; Nagabandi et al. 2019).

3 Summary

In general, meta-learning is an extension of hyper-parameter optimisation in multi-task setting. The objective function of meta-learning is, therefore, a bi-level optimisation, where the lower-level is to adapt the meta-parameter to a task, while the upper-level is to evaluate how well the meta-parameter performs across tasks. Given such mathematical formulation, we can easily distinguish meta-learning from some common transfer learning approaches, such as fine-tuning, multi-task learning, domain adaptation and continual learning.

Hope that this post would give another perspective of meta-learning. I’ll see you in the next post about probabilistic methods in meta-learning.

4 References

Al-Shedivat, Maruan, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. 2018. “Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments.” International Conference on Learning Representations.

Baxter, Jonathan. 2000. “A Model of Inductive Bias Learning.” Journal of Artificial Intelligence Research 12: 149–98.

Ben-David, Shai, John Blitzer, Koby Crammer, Fernando Pereira, et al. 2007. “Analysis of Representations for Domain Adaptation.” Advances in Neural Information Processing Systems 19: 137.

Caruana, Rich. 1997. “Multitask Learning.” Machine Learning 28 (1): 41–75.

Chen, Zhiyuan, and Bing Liu. 2018. “Lifelong Machine Learning.” Synthesis Lectures on Artificial Intelligence and Machine Learning 12 (3): 1–207.

Daume III, Hal, and Daniel Marcu. 2006. “Domain Adaptation for Statistical Classifiers.” Journal of Artificial Intelligence Research 26: 101–26.

Domke, Justin. 2012. “Generic Methods for Optimization-Based Modeling.” Artificial Intelligence and Statistics, 318–26.

Finn, Chelsea, Pieter Abbeel, and Sergey Levine. 2017. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” International Conference on Machine Learning, 1126–35.

French, Robert M. 1999. “Catastrophic Forgetting in Connectionist Networks.” Trends in Cognitive Sciences 3 (4): 128–35.

Heckman, James J. 1979. “Sample Selection Bias as a Specification Error.” Econometrica: Journal of the Econometric Society, 153–61.

Hospedales, Timothy M, Antreas Antoniou, Paul Micaelli, and Amos J Storkey. 2021. “Meta-Learning in Neural Networks: A Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Japkowicz, Nathalie, and Shaju Stephen. 2002. “The Class Imbalance Problem: A Systematic Study.” Intelligent Data Analysis 6 (5): 429–49.

Li, Da, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. 2018. “Learning to Generalize: Meta-Learning for Domain Generalization.” Thirty-Second AAAI Conference on Artificial Intelligence.

Li, Yiying, Yongxin Yang, Wei Zhou, and Timothy Hospedales. 2019. “Feature-Critic Networks for Heterogeneous Domain Generalization.” International Conference on Machine Learning, 3915–24.

Li, Zhenguo, Fengwei Zhou, Fei Chen, and Hang Li. 2017. “Meta-SGD: Learning to Learn Quickly for Few-Shot Learning.” arXiv Preprint arXiv:1707.09835.

Lorraine, Jonathan, Paul Vicol, and David Duvenaud. 2020. “Optimizing Millions of Hyperparameters by Implicit Differentiation.” International Conference on Artificial Intelligence and Statistics, 1540–52.

Nagabandi, Anusha, Ignasi Clavera, Simin Liu, et al. 2019. “Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning.” International Conference on Learning Representations.

Naik, Devang K, and RJ Mammone. 1992. “Meta-Neural Networks That Learn by Learning.” International Joint Conference on Neural Networks 1: 437–42.

Nichol, Alex, Joshua Achiam, and John Schulman. 2018. “On First-Order Meta-Learning Algorithms.” CoRR abs/1803.02999. http://arxiv.org/abs/1803.02999.

Parisi, German I, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. 2019. “Continual Lifelong Learning with Neural Networks: A Review.” Neural Networks 113: 54–71.

Pearlmutter, Barak A. 1994. “Fast Exact Multiplication by the Hessian.” Neural Computation 6: 147–60.

Pratt, Lorien Y, Jack Mostow, Candace A Kamm, and Ace A Kamm. 1991. “Direct Transfer of Learned Information Among Neural Networks.” AAAI 91: 584–89.

Rajeswaran, Aravind, Chelsea Finn, Sham Kakade, and Sergey Levine. 2019. Meta-Learning with Implicit Gradients.

Schmidhuber, Jürgen. 1987. “Evolutionary Principles in Self-Referential Learning (on Learning How to Learn: The Meta-Meta-... Hook).” Diploma thesis, Technische Universität München.

Shimodaira, Hidetoshi. 2000. “Improving Predictive Inference Under Covariate Shift by Weighting the Log-Likelihood Function.” Journal of Statistical Planning and Inference 90 (2): 227–44.

Snell, Jake, Kevin Swersky, and Richard Zemel. 2017. “Prototypical Networks for Few-Shot Learning.” Advances in Neural Information Processing Systems, 4077–87.

Vinyals, Oriol, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. “Matching Networks for One Shot Learning.” Advances in Neural Information Processing Systems 29: 3630–38.

Yosinski, Jason, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. “How Transferable Are Features in Deep Neural Networks?” Advances in Neural Information Processing Systems.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{nguyen2021,
  author = {Nguyen, Cuong},
  title = {From Hyper-Parameter Optimisation to Meta-Learning},
  date = {2021-11-22},
  url = {https://cnguyen10.github.io/posts/meta-learning/},
  langid = {en}
}

For attribution, please cite this work as:

Nguyen, Cuong. 2021. “From Hyper-Parameter Optimisation to Meta-Learning.” November 22. https://cnguyen10.github.io/posts/meta-learning/.

Outer product approximation of Hessian matrix

Cuong Nguyen — Mon, 12 Apr 2021 00:00:00 GMT

The Hessian matrix is heavily studied in the optimisation community, where the second-order derivative is used to optimise a function of interest (as in Newton’s method). In machine learning, especially Bayesian inference, the Hessian matrix appears in some applications, such as Laplace’s method, which approximates a distribution by a Gaussian distribution. Although the Hessian matrix provides additional information that improves the convergence rate in optimisation or reduces a complicated distribution to a Gaussian distribution, calculating it often increases the computational complexity. In neural networks, where the number of model parameters is very large, the Hessian matrix is often intractable due to limited computation and memory.

Many efficient approximations of the Hessian matrix have been developed to either reduce the running time complexity or decompose the Hessian matrix to reduce the amount of memory storage. Hessian-free approaches, which utilise the Hessian-vector product, have also attracted much research interest. This post will present an approximation of the Hessian matrix using the outer product. Note that this approximation represents an approximated Hessian matrix by a set of matrices whose sizes are reasonable to store in GPU memory. The trade-off is that the running time complexity to obtain the Hessian matrix is still quadratic. This approximation is also known as the Gauss-Newton matrix.

1 Notations

Before going into details, let’s define some notations used:

is the input and label of data-point -th,
is the parameter of the model of interest, or the weight of a neural network,
is the loss function, e.g. MSE or cross-entropy,
is the pre-nonlinearity output of the neural network at the final layer that has hidden units,
is the activation output at the final layer. For example, in regression, is the identity function, or in logistic regression, is the sigmoid function, while in multi-class classification, is the softmax function.

The loss function of interest is defined as the sum of losses over each data point: Note that in the following, we will omit the notation of the label from the loss to make the notation uncluttered.

2 Derivation of the approximated Hessian matrix

An element of the Hessian matrix can then be written as:

Applying the chain rule for the first term gives:

Rearranging gives:

Near the optimum, the scalar would be very close to its target . Hence, the derivative of the loss w.r.t. is very small, and we can approximate the Hessian as:

Rewriting this with matrix notation yields a much simpler formulation: where:

Note that the Hessian matrix can be manually calculated.

Remark. Instead of storing the Hessian matrix with size which needs a large amount of memory, we can store the two matrices . This will reduce the amount of memory required. Of course, the trade-off is the increased computation when performing the multiplication to obtain the Hessian matrix .
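The approximation can be checked numerically on a very small model. The sketch below assumes a hypothetical two-parameter network with a scalar output, `z = w[1] * tanh(w[0] * x)`, and the loss `0.5 * (z - t)^2`, whose second derivative w.r.t. the output is one, so the approximated Hessian reduces to a sum of outer products of the output Jacobian. The model, inputs and parameter values are illustrative choices only.

```python
import math

def output(w, x):
    """Scalar output of the toy model."""
    return w[1] * math.tanh(w[0] * x)

def jacobian(w, x):
    """Derivative of the output w.r.t. the two parameters."""
    h = math.tanh(w[0] * x)
    return [w[1] * x * (1.0 - h * h), h]

def gauss_newton(w, inputs):
    """Sum over data points of J^T H_z J, with H_z = 1 for the loss 0.5 * (z - t)^2."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x in inputs:
        J = jacobian(w, x)
        for i in range(2):
            for j in range(2):
                H[i][j] += J[i] * J[j]
    return H

def loss_grad(w, inputs, targets):
    """Exact gradient of the summed loss, used here only for checking."""
    g = [0.0, 0.0]
    for x, t in zip(inputs, targets):
        r = output(w, x) - t
        g = [gi + r * Ji for gi, Ji in zip(g, jacobian(w, x))]
    return g

w = [0.8, 1.5]
inputs = [-1.0, -0.3, 0.4, 1.2]
targets = [output(w, x) for x in inputs]   # residuals are exactly zero at w
H = gauss_newton(w, inputs)
```

Because the targets are generated by the model itself, the residuals vanish at `w` and the approximation coincides with the exact Hessian there. Away from such a point the neglected term, which is weighted by the residuals, is no longer zero.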

The following section will present how to calculate the matrix for some commonly-used losses.

3 Derivation for

3.1 Mean square error in regression

In the regression:

is the identity function
.

Hence, , resulting in: which agrees with the results in (Bishop and Nasrabadi 2006 - Eq.(5.84)).

3.2 Logistic regression

In this case:

is the sigmoid function
.

The first derivative is expressed as:

The second derivative is therefore:

Hence: which agrees with the result derived in the literature (Bishop and Nasrabadi 2006 - Eq. (5.85)).

3.3 Cross entropy loss in classification

In this case:

is the softmax function,
.

According to the definition of the softmax function:

Hence, the derivative can be written as: and

An element of the Jacobian vector of the loss w.r.t. can be written as:

Hence, the Jacobian vector can be expressed as:

The Hessian matrix is given as:

Or, in the explicit matrix form:
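For the softmax with cross-entropy loss, the standard result is that the gradient w.r.t. the pre-nonlinearity output is the softmax output minus the label, and the Hessian is the diagonal matrix of the softmax output minus its outer product with itself. The sketch below writes both out in plain Python; the variable names and the example vector are assumptions, chosen only so that the result can be checked against finite differences.

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    e = [math.exp(v - top) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_grad(z, y):
    """Gradient of the cross-entropy loss w.r.t. the pre-nonlinearity output z."""
    return [p - t for p, t in zip(softmax(z), y)]

def ce_hessian(z):
    """Hessian w.r.t. z: diag(p) - p p^T, independent of the label."""
    p = softmax(z)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(len(p))]
            for i in range(len(p))]

z = [1.0, -0.5, 0.3]
y = [0.0, 1.0, 0.0]    # one-hot label
H_z = ce_hessian(z)
```

Each row of `H_z` sums to zero, which reflects the fact that adding the same constant to every entry of `z` leaves the softmax output unchanged.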

4 Conclusion

In this post, we derive an approximation of the Hessian matrix. The Gauss-Newton matrix is a good approximation since it is positive semi-definite and more efficient to store in the form of a set of smaller matrices. Of course, we have not got away from the curse of dimensionality since the running time complexity to obtain the Hessian matrix is still quadratic w.r.t. the number of model parameters. One final note is that one should use the approximated Hessian matrix with care since the approximation is assumed to be near the minimal value of the considered loss function.

5 References

Bishop, Christopher M, and Nasser M Nasrabadi. 2006. Pattern Recognition and Machine Learning. Vol. 4. Springer.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{nguyen2021,
  author = {Nguyen, Cuong},
  title = {Outer Product Approximation of {Hessian} Matrix},
  date = {2021-04-12},
  url = {https://cnguyen10.github.io/posts/Gauss-Newton-matrix/},
  langid = {en}
}

For attribution, please cite this work as:

Nguyen, Cuong. 2021. “Outer Product Approximation of Hessian Matrix.” April 12. https://cnguyen10.github.io/posts/Gauss-Newton-matrix/.

PAC-Bayes bounds for generalisation error

Cuong Nguyen — Sat, 26 Dec 2020 00:00:00 GMT

Probably approximately correct (PAC) learning is a part of statistical machine learning and a fundamental topic in most graduate programs in machine learning. Its main idea is to upper-bound the true risk (or generalisation error) by the empirical risk with a certain confidence level. In other words, it is often written in the following form:

$$\Pr \left( \text{true risk} \le \text{empirical risk} + R(m, \delta) \right) \ge 1 - \delta,$$

where $\Pr(A)$ is the probability of event $A$, $\delta \in (0, 1]$ is the confidence parameter, and $R(m, \delta)$, a function of the sample size $m$ and the confidence parameter $\delta$, is the regularisation term that satisfies:

$$\lim_{m \to +\infty} R(m, \delta) = 0.$$

The PAC-Bayes upper generalisation bound is a kind of PAC learning. It was first proposed by McAllester (1999) and has attracted much research interest. There have been many subsequent improvements that tighten this classic PAC-Bayes bound further or extend it to more general loss functions. However, the classic PAC-Bayes theorem is still the backbone. In this post, I will show how to prove this interesting theorem.

1 Auxiliary lemmas

To prove the classic PAC-Bayes theorem, we need the two auxiliary lemmas shown below.

1.1 Change of measure inequality for Kullback-Leibler divergence

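Lemma 1 (Banerjee 2006 - Lemma 1) For any measurable function $\phi(h)$ on a set of predictors under consideration $\mathcal{H}$, and any distributions $P$ and $Q$ on $\mathcal{H}$, the following inequality holds:

$$\mathbb{E}_{Q} [\phi(h)] \le \mathrm{KL}[Q \Vert P] + \ln \mathbb{E}_{P} \left[ \exp(\phi(h)) \right].$$

Further,

$$\sup_{\phi} \, \mathbb{E}_{Q} [\phi(h)] - \ln \mathbb{E}_{P} \left[ \exp(\phi(h)) \right] = \mathrm{KL}[Q \Vert P].$$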

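Proof. For any measurable function $\phi(h)$, the following holds:

$$\begin{aligned} \mathbb{E}_{Q} [\phi(h)] &= \mathbb{E}_{Q} \left[ \ln \left( \frac{dQ(h)}{dP(h)} \cdot \frac{dP(h)}{dQ(h)} \, e^{\phi(h)} \right) \right] \\ &= \mathrm{KL}[Q \Vert P] + \mathbb{E}_{Q} \left[ \ln \left( \frac{dP(h)}{dQ(h)} \, e^{\phi(h)} \right) \right] \\ &\le \mathrm{KL}[Q \Vert P] + \ln \mathbb{E}_{Q} \left[ \frac{dP(h)}{dQ(h)} \, e^{\phi(h)} \right] \quad \text{(Jensen's inequality)} \\ &= \mathrm{KL}[Q \Vert P] + \ln \mathbb{E}_{P} \left[ e^{\phi(h)} \right]. \end{aligned}$$

For the second part of the lemma, we need to examine the equality condition of Jensen’s inequality. Since $\ln(x)$ is a strictly concave function for $x > 0$, it follows that the equality holds when its argument is constant, for instance:

$$\frac{dP(h)}{dQ(h)} \, e^{\phi(h)} = 1 \iff \phi(h) = \ln \frac{dQ(h)}{dP(h)}.$$

With this choice of $\phi(h)$, we can verify that the equality does hold.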
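The change of measure inequality can be checked numerically on a small discrete example. The following Python sketch assumes the lemma has the standard form E_Q[phi] <= KL[Q || P] + ln E_P[exp(phi)] from Banerjee (2006); the two distributions `P` and `Q` over three hypotheses and all function names are made up for illustration.

```python
import math
import random


def kl_divergence(q, p):
    """KL[Q || P] for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def lemma_lhs(phi, q):
    """E_Q[phi(h)]."""
    return sum(qi * f for qi, f in zip(q, phi))


def lemma_rhs(phi, q, p):
    """KL[Q || P] + ln E_P[exp(phi(h))]."""
    log_moment = math.log(sum(pi * math.exp(f) for pi, f in zip(p, phi)))
    return kl_divergence(q, p) + log_moment


P = [0.5, 0.3, 0.2]  # assumed prior over three hypotheses
Q = [0.1, 0.6, 0.3]  # assumed posterior over the same hypotheses

# The gap RHS - LHS should be non-negative for any function phi.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    phi = [rng.uniform(-5.0, 5.0) for _ in P]
    gaps.append(lemma_rhs(phi, Q, P) - lemma_lhs(phi, Q))

# The gap should vanish for phi(h) = ln(Q(h) / P(h)).
phi_star = [math.log(qi / pi) for qi, pi in zip(Q, P)]
gap_star = lemma_rhs(phi_star, Q, P) - lemma_lhs(phi_star, Q)
```

The random functions only probe the inequality; the supremum itself is attained at `phi_star`, which matches the second part of the lemma.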

This completes the proof.

1.2 Concentration inequality

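Lemma 2 (Shalev-Shwartz and Ben-David 2014 - Exercise 31.1) Let $X$ be a random variable that satisfies: $\Pr(X \ge \epsilon) \le e^{-2m\epsilon^{2}}$. Prove that

$$\mathbb{E} \left[ e^{2(m - 1) X^{2}} \right] \le m.$$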

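Proof. Since the assumption is expressed in terms of a probability, while the conclusion is written in the form of an expectation, what we need to do first is to express the expectation in terms of probabilities.

For simplicity, let $Y = e^{2(m - 1) X^{2}}$. Since $X^{2} \ge 0$, then $Y \ge 1$ and $Y$ can be presented as:

$$Y = \int_{0}^{+\infty} \mathbb{1}(Y \ge t) \, dt,$$

where $\mathbb{1}(A)$ is the indicator function of event $A$. Note that the integral above is the area of a rectangle with height 1 and width $Y$.

One important property of the indicator function is that:

$$\mathbb{E} \left[ \mathbb{1}(A) \right] = \Pr(A).$$

This allows us to express the expectation of interest as:

$$\mathbb{E}[Y] = \mathbb{E} \left[ \int_{0}^{+\infty} \mathbb{1}(Y \ge t) \, dt \right] = \int_{0}^{+\infty} \Pr(Y \ge t) \, dt.$$

Or, since $\Pr(Y \ge t) = 1$ for $t \le 1$:

$$\mathbb{E}[Y] = 1 + \int_{1}^{+\infty} \Pr(Y \ge t) \, dt.$$

We then make a change of variable from $t$ to $\epsilon$ to utilise the given inequality in the assumption. Let’s define:

$$t = e^{2(m - 1) \epsilon^{2}}.$$

Since $\epsilon$ is assumed to be non-negative, we can express it as:

$$\epsilon = \sqrt{\frac{\ln t}{2(m - 1)}},$$

and:

$$dt = 4(m - 1) \, \epsilon \, e^{2(m - 1) \epsilon^{2}} \, d\epsilon.$$

For a non-negative $X$, the event $Y \ge t$ is the same as the event $X \ge \epsilon$.

The expectation of interest can, therefore, be written as:

$$\begin{aligned} \mathbb{E}[Y] &= 1 + \int_{0}^{+\infty} \Pr(X \ge \epsilon) \, 4(m - 1) \, \epsilon \, e^{2(m - 1) \epsilon^{2}} \, d\epsilon \\ &\le 1 + 4(m - 1) \int_{0}^{+\infty} \epsilon \, e^{2(m - 1) \epsilon^{2} - 2m\epsilon^{2}} \, d\epsilon \\ &= 1 + 4(m - 1) \int_{0}^{+\infty} \epsilon \, e^{-2\epsilon^{2}} \, d\epsilon \\ &= 1 + (m - 1) = m. \end{aligned}$$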
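The last step of this proof can be checked numerically. The sketch below assumes the tail bound of Shalev-Shwartz and Ben-David's Exercise 31.1, Pr(X >= eps) <= exp(-2 m eps^2), so that after the change of variable t = exp(2 (m - 1) eps^2) the upper bound becomes 1 plus the integral of 4 (m - 1) eps exp(-2 eps^2) over eps >= 0. The function name, the truncation point `upper` and the number of steps are arbitrary choices made for this illustration.

```python
import math


def lemma2_upper_bound(m, upper=10.0, steps=200_000):
    """Midpoint-rule estimate of 1 + int_0^upper 4 (m - 1) eps exp(-2 eps^2) d eps."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        eps = (k + 0.5) * h
        total += 4.0 * (m - 1) * eps * math.exp(-2.0 * eps * eps)
    return 1.0 + h * total


# The estimate should be close to m for every sample size m > 1.
bounds = {m: lemma2_upper_bound(m) for m in (2, 50, 1000)}
```

Truncating the integral at `upper = 10` is harmless here because the integrand decays like exp(-2 eps^2).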

2 PAC-Bayes bound

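Theorem 1 Let $\mathcal{D}$ be an arbitrary distribution over an example domain $\mathcal{Z}$. Let $\mathcal{H}$ be a hypothesis class, $\ell: \mathcal{H} \times \mathcal{Z} \to [0, 1]$ be a loss function, $P$ be a prior distribution over $\mathcal{H}$, and $\delta \in (0, 1]$. If $S = \{z_{1}, \ldots, z_{m}\}$ is an i.i.d. training set sampled according to $\mathcal{D}$, then for any “posterior” $Q$ over $\mathcal{H}$, the following holds:

$$\Pr \left( \mathbb{E}_{h \sim Q} \mathbb{E}_{z \sim \mathcal{D}} \left[ \ell(h, z) \right] \le \mathbb{E}_{h \sim Q} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_{i}) \right] + \sqrt{\frac{\mathrm{KL}[Q \Vert P] + \ln \frac{m}{\delta}}{2(m - 1)}} \right) \ge 1 - \delta.$$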

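Proof. We define some notations to ease the proof:

- $L(h) = \mathbb{E}_{z \sim \mathcal{D}} \left[ \ell(h, z) \right]$: the true risk of hypothesis $h$,
- $\hat{L}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_{i})$: the empirical risk of $h$ on $S$,
- $\Delta(h) = L(h) - \hat{L}(h)$: the generalisation gap of $h$.

Applying Lemma 1 with $\phi(h) = 2(m - 1) \Delta(h)^{2}$ and the distributions $P$ and $Q$ gives:

$$2(m - 1) \, \mathbb{E}_{Q} \left[ \Delta(h)^{2} \right] \le \mathrm{KL}[Q \Vert P] + \ln \mathbb{E}_{P} \left[ e^{2(m - 1) \Delta(h)^{2}} \right]. \tag{1}$$

We upper-bound the last term on the RHS, $\mathbb{E}_{P} \left[ e^{2(m - 1) \Delta(h)^{2}} \right]$, by Lemma 2. To do that, we consider the loss on each observed data point as a random variable in $[0, 1]$ with true and empirical means $L(h)$ and $\hat{L}(h)$, respectively. Hoeffding’s inequality gives:

$$\Pr \left( \Delta(h) \ge \epsilon \right) \le e^{-2m\epsilon^{2}}.$$

According to Lemma 2, this implies:

$$\mathbb{E}_{S} \left[ e^{2(m - 1) \Delta(h)^{2}} \right] \le m.$$

Taking the expectation w.r.t. $h \sim P$ on both sides and applying Fubini’s theorem (to interchange the 2 expectations) gives:

$$\mathbb{E}_{S} \mathbb{E}_{P} \left[ e^{2(m - 1) \Delta(h)^{2}} \right] \le m.$$

We then apply Markov’s inequality to the term $\mathbb{E}_{P} \left[ e^{2(m - 1) \Delta(h)^{2}} \right]$, which is a non-negative random variable through its dependence on $S$. For any $\eta > 0$:

$$\Pr \left( \mathbb{E}_{P} \left[ e^{2(m - 1) \Delta(h)^{2}} \right] \ge \eta \right) \le \frac{\mathbb{E}_{S} \mathbb{E}_{P} \left[ e^{2(m - 1) \Delta(h)^{2}} \right]}{\eta} \le \frac{m}{\eta}.$$

This implies:

$$\Pr \left( \ln \mathbb{E}_{P} \left[ e^{2(m - 1) \Delta(h)^{2}} \right] \le \ln \eta \right) \ge 1 - \frac{m}{\eta}. \tag{2}$$

Combining the results in Equation 1 and Equation 2 gives:

$$\Pr \left( 2(m - 1) \, \mathbb{E}_{Q} \left[ \Delta(h)^{2} \right] \le \mathrm{KL}[Q \Vert P] + \ln \eta \right) \ge 1 - \frac{m}{\eta}.$$

This is equivalent to:

$$\Pr \left( \mathbb{E}_{Q} \left[ \Delta(h)^{2} \right] \le \frac{\mathrm{KL}[Q \Vert P] + \ln \eta}{2(m - 1)} \right) \ge 1 - \frac{m}{\eta}. \tag{3}$$

Note that the squared function is a strictly convex function, so Jensen’s inequality results in:

$$\left( \mathbb{E}_{Q} \left[ \Delta(h) \right] \right)^{2} \le \mathbb{E}_{Q} \left[ \Delta(h)^{2} \right].$$

Hence, Equation 3 can be written as:

$$\Pr \left( \mathbb{E}_{Q} \left[ \Delta(h) \right] \le \sqrt{\frac{\mathrm{KL}[Q \Vert P] + \ln \eta}{2(m - 1)}} \right) \ge 1 - \frac{m}{\eta}.$$

Setting $\eta = m / \delta$, and expanding $\Delta(h)$ according to its definition complete the proof.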
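To get a feeling for the numbers, the following Python sketch evaluates the right-hand side of the bound, assuming it has the classic form of Shalev-Shwartz and Ben-David's Theorem 31.1: empirical risk plus sqrt((KL[Q || P] + ln(m / delta)) / (2 (m - 1))). The function name and the example values of the empirical risk, the KL divergence and the confidence parameter are made up for illustration.

```python
import math


def pac_bayes_bound(empirical_risk, kl, m, delta):
    """Upper bound on the true risk that holds with probability at least 1 - delta."""
    if m < 2:
        raise ValueError("the sample size m must be at least 2")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    slack = math.sqrt((kl + math.log(m / delta)) / (2.0 * (m - 1)))
    return empirical_risk + slack


# Assumed example: 10% empirical risk, KL[Q || P] = 5 nats, 95% confidence.
bound_small = pac_bayes_bound(0.10, kl=5.0, m=1_000, delta=0.05)
bound_large = pac_bayes_bound(0.10, kl=5.0, m=10_000, delta=0.05)
```

With these assumed values, the slack term shrinks as the sample size grows, while a posterior that moves further away from the prior (larger KL) loosens the bound.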

3 Discussion

AFAIK, the result in Theorem 1 is a seminal PAC-Bayes bound in the literature of PAC learning. Readers can refer to subsequent work for tighter PAC-Bayes bounds developed later.

4 References

Banerjee, Arindam. 2006. “On Bayesian Bounds.” International Conference on Machine Learning, 81–88.

McAllester, David A. 1999. “PAC-Bayesian Model Averaging.” Conference on Computational Learning Theory, 164–70.

Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{nguyen2020,
  author = {Nguyen, Cuong},
  title = {PAC-Bayes Bounds for Generalisation Error},
  date = {2020-12-26},
  url = {https://cnguyen10.github.io/posts/PAC-Bayes-bounds/},
  langid = {en}
}

For attribution, please cite this work as:

Nguyen, Cuong. 2020. “PAC-Bayes Bounds for Generalisation Error.” December 26. https://cnguyen10.github.io/posts/PAC-Bayes-bounds/.

VAE: normalising constant matters

Cuong Nguyen — Tue, 24 Nov 2020 00:00:00 GMT

Variational auto-encoder (VAE) is one of the most popular generative models in machine learning nowadays. However, the rapid development of the field has made many machine learning practitioners (or, maybe only me) focus too much on deep learning without paying much attention to some fundamentals, such as linear regression. That causes much confusion due to the discrepancy between the derivation and the practical implementation, in which the regularization of the loss, or specifically the Kullback-Leibler (KL) divergence, is weighted by some factor $\beta$. I experienced this myself and struggled with it at the beginning of my research. Even though weighting the KL divergence term by a factor $\beta \ll 1$ could temporarily resolve the issue, I kept questioning why balancing the reconstruction and the KL divergence is necessary. Eventually, the answer is quite simple: the normalising constant in the reconstruction loss (or negative log-likelihood) is often ignored. This omission is the main cause of the imbalance between the two losses.

1 Variational auto-encoder

Given data points $\{x_{n}\}_{n=1}^{N}$, the model of a VAE assumes that there is a corresponding latent variable $z_{n}$ that generates data $x_{n}$. In short, the objective function of a VAE is to minimise the variational-free energy (VFE) given as:

$$\mathsf{L} = \sum_{n=1}^{N} \mathbb{E}_{q(z_{n})} \left[ -\ln p(x_{n} | z_{n}) \right] + \beta \, \mathrm{KL} \left[ q(z_{n}) \Vert p(z_{n}) \right], \tag{vfe}$$

where $q(z_{n})$ is the variational distribution of the latent variable, and $\beta$ is the weighting factor.

In practice, people often “specify” the reconstruction loss as mean squared error (MSE) or binary cross-entropy loss and use gradient descent to minimise VFE. With $\beta = 1$ in (vfe), the reconstructions of different images look like the same image (see Figure 1 (top)), whereas setting $\beta \ll 1$ results in much better reconstructed images (see Figure 1 (bottom)).

Figure 1. The reconstructed images from VAE with β = 1 (top) and β ≪ 1 (bottom). Source: stats.stackexchange.com

This does not satisfy me, although some justifications for setting $\beta$ to some small value have been made. For example:

- Setting leads to an even “further lower-bound”. Hence, maximizing this “further lower-bound” is still mathematically reasonable. However, this bound is very loose. Can we do something better?
- One can cast the problem as a constrained optimisation as in the β-VAE paper. However, β in that case is the Lagrange multiplier, and should be obtained through the optimisation. Is it mathematically correct to consider β as a hyper-parameter? I doubt that.

Later on, I figured out that the main reason for the imbalance between the two losses is the “specification” of the reconstruction loss. Simply specifying the type of the loss as MSE or binary cross-entropy ignores the normalising constant, resulting in an incorrect reconstruction loss. The correct way is to specify the modelling assumption of the likelihood $p(x_{n} | z_{n})$, which, in the case of VAE, goes back to linear regression.

In the following sections, $f(z_{n}; \theta)$ denotes the output of the decoder parameterized by a neural network with weight $\theta$. Usually, $f(z_{n}; \theta)$ is assumed to be the reconstructed image, but this might not always be true, depending on the assumption used.

2 Reconstruction likelihood with Gaussian assumption

This corresponds to linear regression with Gaussian noise assumption.

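The variable of interest $x_{n} \in \mathbb{R}^{D}$ (an image with $D$ pixels) is assumed to be a deterministic function with additional Gaussian noise, so that:

$$x_{n} = f(z_{n}; \theta) + \epsilon,$$

where: $\epsilon \sim \mathcal{N}(0, \lambda^{-1} I)$ and $\lambda$ is the noise precision. Thus, the reconstruction likelihood can be written as:

$$p(x_{n} | z_{n}) = \mathcal{N} \left( x_{n}; f(z_{n}; \theta), \lambda^{-1} I \right).$$

Hence, the negative log-likelihood, or the reconstruction loss in the VAE, can be expressed as:

$$-\ln p(x_{n} | z_{n}) = -\frac{D}{2} \ln \frac{\lambda}{2\pi} + \frac{\lambda}{2} \left\Vert x_{n} - f(z_{n}; \theta) \right\Vert_{2}^{2}.$$

Note that current practice uses only MSE, which ignores the first term and the scaling factor relating to the noise precision $\lambda$.

Under this modelling approach, the decoder would consist of 2 networks: one for the mean $f(z_{n}; \theta)$ and the other for the noise precision $\lambda$. Of course, one can consider $\lambda$ as a hyper-parameter to simplify the implementation further.

The “full” loss function of a VAE is, therefore, presented as:

$$\mathsf{L} = \sum_{n=1}^{N} \mathbb{E}_{q(z_{n})} \left[ -\frac{D}{2} \ln \frac{\lambda}{2\pi} + \frac{\lambda}{2} \left\Vert x_{n} - f(z_{n}; \theta) \right\Vert_{2}^{2} \right] + \mathrm{KL} \left[ q(z_{n}) \Vert p(z_{n}) \right].$$

After training, one can pass an image to the encoder and decoder to get the predicted mean and precision. The reconstructed image can then be obtained as:

$$\hat{x}_{n} \sim \mathcal{N} \left( f(z_{n}; \theta), \lambda^{-1} I \right).$$

Although this approach is easy to understand, one drawback is the unbounded support of the Gaussian distribution, resulting in reconstructed pixel intensity values outside the desired range $[0, 1]$. Consequently, when visualizing, the pixels that are out of that range will be clipped to 0 or 1, potentially making the reconstructed images blurrier.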
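The role of the noise precision can be made concrete with a small Python sketch. It assumes an isotropic Gaussian likelihood with a scalar precision shared by all pixels, so that the negative log-likelihood of a D-pixel image is -(D / 2) ln(precision / (2 pi)) + (precision / 2) times the squared error. The function names and the toy pixel values are assumptions made for this illustration.

```python
import math


def squared_error(x, mean):
    """Sum of squared differences between target pixels and predicted means."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mean))


def gaussian_nll(x, mean, precision):
    """Negative log-likelihood of x under N(mean, I / precision)."""
    D = len(x)
    log_normaliser = -0.5 * D * math.log(precision / (2.0 * math.pi))
    return log_normaliser + 0.5 * precision * squared_error(x, mean)


# Assumed toy image with 4 pixels and a decoder output that is slightly off.
x = [0.2, 0.9, 0.5, 0.0]
mean = [0.25, 0.8, 0.55, 0.1]

error = squared_error(x, mean)              # identical for both precisions
nll_low = gaussian_nll(x, mean, 1.0)        # precision 1: MSE-like scaling
nll_high = gaussian_nll(x, mean, 100.0)     # precision 100: confident decoder
```

The squared error is the same in both calls; only the precision changes how strongly it is weighted against the KL divergence term, which is the job that a hand-tuned weighting factor does when the precision is left out.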

3 Reconstruction likelihood with continuous Bernoulli assumption

This corresponds to linear regression in $[0, 1]$ (not $\{0, 1\}$ as in logistic regression), and hence, the words “continuous Bernoulli”.

This modelling approach is not as intuitive as the one with Gaussian assumption, but please bear with me for a moment.

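The likelihood of interest, $p(x_{n} | z_{n})$, is assumed to be a continuous Bernoulli distribution:

$$p(x_{n} | z_{n}) = \prod_{d=1}^{D} C(f_{nd}) \, f_{nd}^{x_{nd}} \, (1 - f_{nd})^{1 - x_{nd}}, \qquad C(f) = \begin{cases} \dfrac{2 \tanh^{-1}(1 - 2f)}{1 - 2f} & \text{if } f \neq 0.5, \\ 2 & \text{otherwise}, \end{cases}$$

where $x_{nd}$ and $f_{nd}$ are the $d$-th elements of $x_{n}$ and $f(z_{n}; \theta)$, respectively, $D$ is the number of pixels, and $f(z_{n}; \theta) \in (0, 1)^{D}, \forall n \in \{1, \ldots, N\}$.

Note that:

- the continuous Bernoulli distribution is used because VAE tries to regress the pixel intensity, which falls in $[0, 1]$, not $\{0, 1\}$ as in classification,
- the pdf of a continuous Bernoulli distribution differs from that of a Bernoulli distribution in the normalising constant term,
- the output of the decoder is now not the mean of the reconstructed pixel intensity as in the case of the Gaussian distribution,
- due to the assumption of the continuous Bernoulli distribution, the last layer of the decoder must be activated by a sigmoid function to ensure that the output falls in $[0, 1]$.

The negative log-likelihood, or reconstruction loss, can be easily derived as:

$$-\ln p(x_{n} | z_{n}) = -\sum_{d=1}^{D} \left[ \ln C(f_{nd}) + x_{nd} \ln f_{nd} + (1 - x_{nd}) \ln (1 - f_{nd}) \right]. \tag{nll-CB}$$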
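The difference between the two losses can be sketched in a few lines of Python. The sketch assumes the normalising constant given by Loaiza-Ganem and Cunningham (2019), C(lam) = 2 atanh(1 - 2 lam) / (1 - 2 lam) for lam != 0.5 and C(0.5) = 2, and works on a single pixel; the function names, the tolerance around 0.5 and the integration grid are choices made for this illustration.

```python
import math


def cb_log_normaliser(lam, tol=1e-6):
    """ln C(lam) of the continuous Bernoulli distribution, lam in (0, 1)."""
    u = 1.0 - 2.0 * lam
    if abs(u) < tol:
        return math.log(2.0)  # limit of C(lam) as lam approaches 0.5
    return math.log(2.0 * math.atanh(u) / u)


def bernoulli_nll(x, lam):
    """Binary cross-entropy: the loss used in current practice."""
    return -(x * math.log(lam) + (1.0 - x) * math.log(1.0 - lam))


def continuous_bernoulli_nll(x, lam):
    """Binary cross-entropy minus the log normalising constant."""
    return bernoulli_nll(x, lam) - cb_log_normaliser(lam)


def integrate_density(lam, steps=20_000):
    """Midpoint-rule integral of the continuous Bernoulli pdf over [0, 1]."""
    h = 1.0 / steps
    return h * sum(math.exp(-continuous_bernoulli_nll((k + 0.5) * h, lam))
                   for k in range(steps))


# The pdf should integrate to one for any parameter in (0, 1).
totals = {lam: integrate_density(lam) for lam in (0.1, 0.5, 0.8)}
```

Without the normalising constant, the same integral would be smaller than one, which is the missing term that the binary cross-entropy loss silently drops.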

Current practice uses binary cross-entropy loss only, corresponding to Bernoulli distribution. To me, that practice is not correct, since the learning is to infer the parameter of the Bernoulli distribution, which is the probability when the outcome is 1. In that case, the pixel intensity is in $\{0, 1\}$, not $[0, 1]$. This explains why VAE using binary cross-entropy loss often works well for grey-scale, but not colour, images.

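Substituting (nll-CB) into (vfe) gives the “full” objective function for VAE:

$$\mathsf{L} = \sum_{n=1}^{N} \mathbb{E}_{q(z_{n})} \left[ -\sum_{d=1}^{D} \left( \ln C(f_{nd}) + x_{nd} \ln f_{nd} + (1 - x_{nd}) \ln (1 - f_{nd}) \right) \right] + \mathrm{KL} \left[ q(z_{n}) \Vert p(z_{n}) \right],$$

where $f_{nd}$ is the $d$-th element of the decoder output $f(z_{n}; \theta)$ and $C(\cdot)$ is the normalising constant of the continuous Bernoulli distribution.

Note that after training, directly plotting $f(z_{n}; \theta)$ as the pixel intensity might result in an incorrect reconstructed image, since the mean of the continuous Bernoulli distribution is not equal to its parameter. To reconstruct an image $x_{n}$, one needs to pass that image through the encoder and decoder, and then calculate the mean of each pixel:

$$\hat{x}_{nd} = \begin{cases} \dfrac{f_{nd}}{2 f_{nd} - 1} + \dfrac{1}{2 \tanh^{-1}(1 - 2 f_{nd})} & \text{if } f_{nd} \neq 0.5, \\ 0.5 & \text{otherwise}, \end{cases}$$

and plot $\hat{x}_{n}$ to visualize the reconstructed image.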
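The conversion from the decoder output to the pixel intensity can be sketched as follows. The sketch assumes the mean of the continuous Bernoulli distribution given by Loaiza-Ganem and Cunningham (2019), lam / (2 lam - 1) + 1 / (2 atanh(1 - 2 lam)) for lam != 0.5 and 0.5 otherwise. The function name, the first-order expansion used close to 0.5 and the example decoder outputs are assumptions made for this illustration.

```python
import math


def continuous_bernoulli_mean(lam, tol=1e-4):
    """Mean of a continuous Bernoulli distribution with parameter lam in (0, 1)."""
    if abs(lam - 0.5) < tol:
        # The closed form is 0/0-like near 0.5, so use its first-order expansion.
        return 0.5 + (lam - 0.5) / 3.0
    return lam / (2.0 * lam - 1.0) + 1.0 / (2.0 * math.atanh(1.0 - 2.0 * lam))


# Assumed decoder outputs for four pixels, after the sigmoid activation.
decoder_output = [0.05, 0.5, 0.8, 0.99]
reconstruction = [continuous_bernoulli_mean(lam) for lam in decoder_output]
```

The mean is pulled towards 0.5 compared with the parameter itself, so plotting the raw decoder output would exaggerate the contrast of the reconstructed image.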

4 Conclusion

VAE is often considered a basic generative model. However, many machine learning practitioners learn the “type” of reconstruction loss by memorization. This leads to the weighting trick in the implementation. Understanding the reconstruction loss as the negative log-likelihood in linear regression allows one to obtain the “full” objective function without applying any weighting tricks. Hopefully, this post saves some time for those who are starting to practise machine learning.

5 References

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S. and Lerchner, A., 2016. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.
Loaiza-Ganem, G. and Cunningham, J.P., 2019. The continuous Bernoulli: fixing a pervasive error in variational autoencoders. In Advances in Neural Information Processing Systems (pp. 13287-13297).

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{nguyen2020,
  author = {Nguyen, Cuong},
  title = {VAE: Normalising Constant Matters},
  date = {2020-11-24},
  url = {https://cnguyen10.github.io/posts/vae-normalising-constant-matters/},
  langid = {en}
}

For attribution, please cite this work as:

Nguyen, Cuong. 2020. “VAE: Normalising Constant Matters.” November 24. https://cnguyen10.github.io/posts/vae-normalising-constant-matters/.