Metropolis-Hastings algorithms are a class of Markov chains which are commonly used to perform large scale calculations and simulations in Physics and Statistics. The button below opens a separate browser window containing a demonstration of some of the most common chains used for this purpose. The window is resizable, and you may need to adjust its dimensions depending on your system configuration.

To view this applet, please install the Java Plugin.

MCMC methods in a nutshell

Two common problems which are approached via M-H algorithms are simulation and numerical integration.

The former problem consists of generating a random variable with a prescribed distribution, \pi say, which is typically known only to within a constant factor. The MCMC answer is to construct a Markov chain, that is a random process X_t, in such a way that its equilibrium distribution is the "target" \pi. It can be shown that by generating successive values X_t, X_{t+1}, X_{t+2} etc., the distribution of X_t becomes close to \pi when t is large. We may therefore say that for T sufficiently large, the random variable X_T is approximately the variable we are looking for.

Moreover after that, X_{T+1}, X_{T+2} etc. are all distributed close to \pi also.

The latter problem consists of calculating the value of a (typically complicated) integral which we write as

\int f(x) \pi(x)dx.

It is very rare that integrals which arise in real problems can be evaluated explicitly in closed form, and the best one can hope for in general is an approximate numerical value generated by a computer. The MCMC approach consists of defining, as before, a Markov chain X_t with equilibrium distribution \pi, and using the theorem that

\frac{1}{T} \sum_{t=1}^T f(X_t) \approx \int f(x)\pi(x)dx.

In other words, when T is large, the average value of f(X_t) up to time T is approximately equal to the value of the integral we were looking for.
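
As a concrete illustration (not part of the applet), here is a minimal Python sketch of this averaging idea, using a simple random walk chain whose target is an unnormalised Gaussian density; the function names are my own:

```python
import math, random

random.seed(1)

def pi_unnorm(x):
    """Target known only up to a constant factor: here exp(-x^2/2)."""
    return math.exp(-x * x / 2.0)

def f(x):
    return x * x   # we estimate the integral of x^2 pi(x) dx, which is 1

x, total, T = 0.0, 0.0, 50000
for t in range(T):
    y = x + random.gauss(0.0, 1.0)                 # propose a nearby point
    if random.random() < pi_unnorm(y) / pi_unnorm(x):
        x = y                                      # accept, otherwise stay put
    total += f(x)

print(total / T)  # close to 1
```

Note that pi_unnorm is never normalised: only the ratio of its values at two points enters the algorithm.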

There are many practical problems associated with the above ideas, whose discussion is beyond the scope of this document. Suffice it to say that the main question, deciding how large to take T for any given problem, is far from solved, although many individual results are known.

You can read more about MCMC methods in Bayesian statistics in the following papers (or in any of the growing number of books on the subject; see your local library for details):

  • Tierney, L. (1994) Markov chains for exploring posterior distributions. The Annals of Statistics, Vol. 22, No. 4, 1701-1762.
  • Smith, A.F.M. and Roberts, G.O. (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Stat. Soc. B 55, 1, 3-23.

For up-to-date information and the latest techniques, check out the MCMC preprint server, the McDiag Home Page and David Wilson's Perfect Simulation Home Page.

Description of the applet

The applet you are seeing consists of a window which is subdivided into a number of regions. The left half of the display consists of some animated graphs, while the right half consists of some input fields and choice lists, which allow you to influence what you see on the left.

Choosing the target distribution

The top third of the screen consists of a graph of some distribution \pi defined on the interval [0,1]. The first list box on the right allows you to choose the type of distribution \pi. By default, \pi is Gaussian. As you make a selection, some input fields with various parameters will be displayed just below, and you can change these too.

One of the choices in the list is the "user defined" distribution. When you select this, the box with the graph of \pi will be initially blank. You should see four buttons on the right, marked "New", "Accept", "Load" and "Save". Press the "New" button, and the graph should turn orange. Click somewhere in the orange box, and start dragging the mouse to draw a density. Dragging to the left erases part of the graph. Once you are happy with the distribution, press the button "Accept". The drawing screen should turn white again, and the graph you have just drawn is rescaled and displayed. To draw another, press "New" again etc. The "Load" and "Save" buttons allow you to load and save distributions which you have drawn, but for Java security reasons, these will probably not function properly over the web.

Choosing the algorithm

The middle third of the screen consists of a moving red line on the left just below the graph of \pi, and some controls on the right. The red line represents the path of the Markov chain X_t, while the controls on the right allow you to choose the type of chain X_t that is running.

By clicking anywhere in the region where the red line is, you can place the chain at a new starting point and see how it evolves from there. Try selecting the Trimodal distribution in the top list box, and placing the Markov chain in each of the three bell curves.

Each of the algorithms featured in this applet functions in basically the same way. Suppose that the random variable X_t is known, and generate X_{t+1} as follows:

  • propose a move to a new location Y according to some prescribed rule (we must be able to write down a probability density function q(x,y) of going from X_t to Y);
  • accept or reject the proposed move with probability

\alpha(X_t, Y) = \min\Bigl( 1, \frac{\pi(Y)q(Y, X_t)}{\pi(X_t)q(X_t, Y)} \Bigr) 
which preserves the equilibrium distribution \pi.
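
In code, one sweep of this accept/reject recipe might look like the following Python sketch (the function and argument names are my own, not the applet's):

```python
import math, random

def mh_step(x, pi, propose, q):
    """One Metropolis-Hastings transition.
    pi:      unnormalised target density
    propose: draws a candidate Y from the current state x
    q:       proposal density q(x, y) of going from x to y"""
    y = propose(x)
    ratio = (pi(y) * q(y, x)) / (pi(x) * q(x, y))
    if random.random() < min(1.0, ratio):
        return y        # proposal accepted
    return x            # proposal rejected: the chain stays put

# For a symmetric proposal, only the ratio q(y,x)/q(x,y) = 1 matters,
# so a constant stand-in is enough here.
pi = lambda x: math.exp(-x * x / 2.0)
q = lambda x, y: 1.0
propose = lambda x: x + random.gauss(0.0, 1.0)

random.seed(2)
x = mh_step(3.0, pi, propose, q)
```

A useful sanity check: an "uphill" proposal, with \pi(Y) > \pi(X_t) and a symmetric q, is always accepted.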

The algorithms below differ only in the proposal rule. We now give brief descriptions of all the algorithms you can run.

Gaussian Random Walk Metropolis

Given X_t, this algorithm proposes a move to Y = X_t + \sigma W, where W is a standard Gaussian random variable; \sigma therefore regulates the size of the proposed jump.
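
The effect of \sigma can be seen numerically: a tiny \sigma makes almost every proposal acceptable but moves the chain very slowly, while a huge \sigma gets most proposals rejected. A rough Python sketch (my own code, not the applet's):

```python
import math, random

def rwm_acceptance(sigma, T=20000, seed=0):
    """Acceptance rate of a Gaussian random walk Metropolis chain
    whose target is an unnormalised standard Gaussian."""
    rng = random.Random(seed)
    pi = lambda x: math.exp(-x * x / 2.0)
    x, accepted = 0.0, 0
    for _ in range(T):
        y = x + sigma * rng.gauss(0.0, 1.0)
        if rng.random() < pi(y) / pi(x):  # symmetric proposal: q cancels
            x, accepted = y, accepted + 1
    return accepted / T

print(rwm_acceptance(0.1))   # near 1: tiny steps, nearly always accepted
print(rwm_acceptance(10.0))  # low: huge jumps, mostly rejected
```

The Gelman, Roberts and Gilks paper cited below discusses how to tune \sigma; an acceptance rate of roughly a quarter is often quoted as a good target for random walk chains in high dimensions.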

Much is known about this algorithm. See for example the following papers:

  • Mengersen, K.L. and Tweedie, R.L. (1996) Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24, No. 1, 101-121.
  • Gelman, A., Roberts, G.O. and Gilks, W.R. (1996) Efficient Metropolis jumping rules. Bayesian statistics, 5 (Alicante, 1994), 599--607, Oxford Sci. Publ., Oxford Univ. Press, New York.

Uniform Random Walk Metropolis

Given X_t, this algorithm proposes a move to Y = X_t + \sigma U, where U is uniformly distributed over the range [-1,1].


Langevin Algorithm

Given X_t, the proposed move is to Y = X_t + \sigma W + \frac12\sigma^2 \nabla\log\pi(X_t), where W is again a standard Gaussian. Unlike the Random Walk Metropolis, this algorithm uses some information about the shape of \pi to make better proposals, which have a higher chance of being accepted.
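
As a sketch, here is one Metropolis-adjusted Langevin step in Python, using the common (\sigma^2/2)\nabla\log\pi drift; the names are mine and this is not the applet's source:

```python
import math, random

def mala_step(x, log_pi, grad_log_pi, sigma, rng):
    """One Metropolis-adjusted Langevin step: propose
    Y = x + (sigma^2/2) grad log pi(x) + sigma W, then correct with the
    usual acceptance rule.  The Gaussian proposal density q no longer
    cancels, because its mean depends on the starting point."""
    drift = lambda z: z + 0.5 * sigma * sigma * grad_log_pi(z)
    def log_q(z, w):  # log q(z, w) up to a constant (cancels in the ratio)
        d = w - drift(z)
        return -d * d / (2.0 * sigma * sigma)
    y = drift(x) + sigma * rng.gauss(0.0, 1.0)
    log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return y
    return x

# Standard Gaussian target: grad log pi(x) = -x; check the second moment.
rng = random.Random(3)
x, total, T = 0.0, 0.0, 20000
for _ in range(T):
    x = mala_step(x, lambda z: -z * z / 2.0, lambda z: -z, 1.0, rng)
    total += x * x
print(total / T)  # close to 1
```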

Some of the things known about this algorithm can be found in the following papers:

  • Roberts, G.O. and Tweedie, R.L. (1996) Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2, no. 4, 341--363.
  • Roberts, G.O. and Rosenthal, J.S. (1995) Optimal Scaling of Discrete Approximations to Langevin Diffusions. University of Cambridge Research Report.

Lévy Flight Metropolis

This is a variant of the Gaussian random walk Metropolis algorithm, where the proposal is still Y = X_t + \sigma W, but now the random variable W has a symmetric \alpha-stable distribution. The great French probabilist Paul Lévy pioneered the study of these random variables. When \alpha = 2, the variable W is simply a standard Gaussian. If \alpha = 1, W has a Cauchy distribution. The smaller \alpha > 0 is, the heavier the tails of the distribution of W.
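
One standard way to generate a symmetric \alpha-stable W is the Chambers-Mallows-Stuck construction; the sketch below is my own and is not necessarily what the applet uses:

```python
import math, random

def symmetric_stable(alpha, rng):
    """Chambers-Mallows-Stuck draw of a symmetric alpha-stable variable,
    0 < alpha <= 2.  At alpha = 1 this reduces to tan(U), a Cauchy."""
    u = (rng.random() - 0.5) * math.pi       # uniform on (-pi/2, pi/2)
    e = -math.log(1.0 - rng.random())        # standard exponential
    if alpha == 1.0:
        return math.tan(u)
    s = math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
    return s * (math.cos(u - alpha * u) / e) ** ((1.0 - alpha) / alpha)

# Sanity check: at alpha = 2 this construction gives a Gaussian,
# with variance 2 under the usual stable parameterisation.
rng = random.Random(7)
draws = [symmetric_stable(2.0, rng) for _ in range(20000)]
m = sum(draws) / len(draws)
var = sum((d - m) ** 2 for d in draws) / len(draws)
print(var)
```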

Student t random walk

Like the Lévy Flight Metropolis algorithm, but this time the proposal increment W has a Student t distribution with n = 1, 2, 3, etc. degrees of freedom.

U[0,1] Independence Sampler

The proposal Y is uniformly distributed over the interval [0,1]. This is independent of the current position X_t of the algorithm, and that is why it is called an Independence Sampler.
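
Since q(x,y) here does not depend on x (it is the U[0,1] density, identically 1 on [0,1]), the acceptance probability reduces to min(1, \pi(Y)/\pi(X_t)). A small Python sketch (my own names):

```python
import random

def indep_sampler(pi, T, rng):
    """U[0,1] independence sampler: Y ~ U[0,1] independently of X, and
    since q is identically 1 on [0,1] the acceptance probability is
    just min(1, pi(Y)/pi(X))."""
    x, out = 0.5, []
    for _ in range(T):
        y = rng.random()
        if rng.random() < pi(y) / pi(x):
            x = y
        out.append(x)
    return out

# Unnormalised Beta(2,2)-shaped target on [0,1]; its mean is 1/2.
pi = lambda x: x * (1.0 - x)
rng = random.Random(4)
xs = indep_sampler(pi, 20000, rng)
print(sum(xs) / len(xs))  # close to 0.5
```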

Some of the known properties of this type of algorithm can be found in the following paper:

  • Mengersen, K.L. and Tweedie, R.L. (1996) Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24, No. 1, 101-121.

General Independence Sampler

As above, Y is independent of the current value X_t. However this time, you can draw the distribution of Y yourself!

Gareth's diffusion algorithm

This was suggested by Gareth Roberts as a "mode hopping" algorithm. The proposal is Y = X_t + \sigma\bigl(\pi(X_t)/\pi_{\text{max}}\bigr)^{-\frac12} W, where W is a Gaussian random variable and \pi_{\text{max}} is the maximum value of \pi (since this appears in a ratio, we only need to know the function \pi to within a constant factor).

The idea is that whenever the chain is in a valley with low density, it should increase the variance of the jump so as to find the next mode more quickly.
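
Because the proposal variance depends on the current state, q(x,y) and q(y,x) differ, and both must appear in the acceptance ratio. A hedged sketch in Python (my own reconstruction, not the applet's source):

```python
import math, random

def gareth_step(x, log_pi, log_pi_max, sigma, rng):
    """One step of the mode-hopping proposal: the jump size is scaled
    by (pi(x)/pi_max)^(-1/2), so low-density valleys give big jumps.
    The state-dependent variance means q(x,y) != q(y,x), so both
    densities appear in the acceptance ratio."""
    def scale(z):
        return sigma * math.exp(-0.5 * (log_pi(z) - log_pi_max))
    def log_q(z, w):  # log Gaussian density of proposing w from z
        s = scale(z)
        return -((w - z) ** 2) / (2.0 * s * s) - math.log(s)
    y = x + scale(x) * rng.gauss(0.0, 1.0)
    if log_pi(y) - log_pi_max < -50.0:
        return x  # numerical guard: such a proposal is rejected anyway
    log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return y
    return x

# Example: standard Gaussian target, log pi(x) = -x^2/2, maximised at 0.
rng = random.Random(5)
x, total, T = 2.0, 0.0, 20000
for _ in range(T):
    x = gareth_step(x, lambda z: -z * z / 2.0, 0.0, 0.5, rng)
    total += x
print(total / T)  # close to 0, the mean of the target
```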

Gaussian Autoregressive(1)

The proposal is Y = a + b(X_t - a) + \sigma Z, where Z is a standard Gaussian. This is a generalization of both the Independence Sampler and the Random Walk Metropolis chain. If 0 < b < 1 this strategy shrinks the current state toward the point a before adding the increment Z. If b = -1 then the current state is first reflected about the point a.
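
The AR(1) proposal can be written in a couple of lines; the helper name below is hypothetical:

```python
def ar1_propose(x, a, b, sigma, z):
    """Propose Y = a + b*(x - a) + sigma*z, with z a standard Gaussian
    draw.  b = 1 recovers the random walk, b = 0 the independence
    sampler, and b = -1 (with sigma = 0) reflects x about a.  Unless
    b = +/-1, the proposal is not symmetric, so the q terms must be
    kept in the Metropolis-Hastings acceptance ratio."""
    return a + b * (x - a) + sigma * z

print(ar1_propose(3.0, 1.0, -1.0, 0.0, 0.7))  # -1.0: reflection about a = 1
print(ar1_propose(3.0, 1.0, 1.0, 2.0, 0.5))   # 4.0: random walk x + sigma*z
```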

Hamiltonian (Hybrid) Monte Carlo

This algorithm is an example of the use of auxiliary variables, and operates a little differently from the other algorithms above. First, we augment the number of variables which \pi depends on, by taking the product of \pi with a standard Gaussian density. This gives a two dimensional state space, consisting of "position" and "momentum" variables. The algorithm proceeds from a value (X_t, P_t) in phase space to a new value (X_{t+1}, P_{t+1}) by first randomizing the momentum according to the autoregressive formula P' = \alpha P_t + (1 - \alpha^2)^{1/2} W, where W is a standard Gaussian. After that, we use an Euler-type "leapfrog" scheme to move from the value (X_t, P') to a new value (X'', P'') along the contour curves of the Hamiltonian H = p^2/2 - \log\pi(x). This last value is accepted or rejected according to the usual Metropolis-Hastings rule.

The orange and yellow bars which you can see scrolling in step with the red path of the chain represent the current values of kinetic energy (orange) and potential energy (yellow) of the chain in phase space.

When \alpha = 1, there is no randomization of momentum. In that case, the total energy will vary very little (only through errors in the leapfrog scheme) and the random variables X_t do not converge to \pi. When \alpha = 0, the algorithm behaves like the Langevin algorithm.
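
The leapfrog scheme itself, and the near-conservation of the Hamiltonian that makes the momentum randomization necessary, can be checked in a few lines of Python (a sketch assuming a standard Gaussian target, so that -\log\pi(x) = x^2/2 up to a constant):

```python
def leapfrog(x, p, eps, n_steps, grad_u):
    """Leapfrog integration of H = p^2/2 + U(x), where U(x) = -log pi(x)
    and grad_u is U's derivative."""
    p -= 0.5 * eps * grad_u(x)      # initial half step in momentum
    for _ in range(n_steps - 1):
        x += eps * p                # full step in position
        p -= eps * grad_u(x)        # full step in momentum
    x += eps * p
    p -= 0.5 * eps * grad_u(x)      # final half step in momentum
    return x, p

# Standard Gaussian target: U(x) = x^2/2, so grad_u(x) = x.
grad_u = lambda x: x
h = lambda x, p: 0.5 * p * p + 0.5 * x * x

x0, p0 = 1.0, 0.5
x1, p1 = leapfrog(x0, p0, 0.01, 1000, grad_u)
print(abs(h(x1, p1) - h(x0, p0)))  # tiny: the leapfrog nearly conserves H
```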


In the lower third of the applet window, you will see on the left a box with a continuously updated histogram. On the right, you should see a plot of the estimated autocorrelation function (ACF) for the position of the chain. Just above that, there is a reset button, and two input fields which allow you to increase the number of bins in the histogram and the maximum lag used for computing the ACF.

The histogram

On the left side of the window, you can see a continually updated histogram of the position of the chain. As described earlier, for any function f, the time averaged value of f(X_t) should converge to the value \int f(x)\pi(x)dx. If we pick a point x_0 and let f(x) = 1 for all x in a narrow bin around the value x_0, but set f(x) = 0 otherwise, we can get an estimate of \pi(x_0) as the bin gets ever narrower. The theoretical value of this estimate is shown by the height of the grey bins, whereas the blue bins represent the current computed average. As the number of iterations increases, the blue bins converge to the corresponding grey bins.

A measure of the distance of the blue histogram from its theoretical limit (grey) is given by the total variation (TV) norm, a number between 0 and 2. Note that the number computed is the distance between the two histograms, not the total variation distance of the law of X_t from \pi.
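
The TV norm between the two histograms is simply the sum of the absolute differences of the probability-normalised bin heights, which explains the range: it is 0 when the histograms coincide and 2 when they do not overlap at all. In Python (a sketch, my own names):

```python
def tv_norm(p, q):
    """TV norm between two probability histograms (bin masses summing
    to 1): sum of |p_i - q_i|, from 0 (identical) to 2 (disjoint)."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(tv_norm([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # 0.0: same histogram
print(tv_norm([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0: no overlap at all
```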

On the histogram display, you can also see the proportion of all moves of the chain for which the proposal was accepted. By setting the iteration delay on the right side to zero, the chain can be sped up a little. Please remember that Java is an interpreted language, so these algorithms would run much faster if implemented in a language such as C.

ACF plot

On top of the histogram, you can also see two vertical lines, one green and one red; the red one moves. These represent, respectively, the location of the mean of \pi and its estimate via the time-averaged position of the chain. Eventually, the two lines should coincide.

It is possible to measure the correlation between the successive values of the position X_t used when estimating the mean. This is done by the autocorrelation function (ACF) plotted to the right of the histogram. When X_{t+1} tends to stay very close to X_t, the ACF decays very slowly.

It is possible to estimate the "efficiency" of the Markov chain X_t in estimating the value \int f(x)\pi(x)dx for any integrable function f. This is done for the position function f(x) = x, which estimates the mean of \pi (green vertical line). A value close to 1 means that the chain is "efficient" for that function f. The estimated efficiency in this applet is computed as follows, where T denotes the current number of iterations, and \text{maxLag} is the maximum autocorrelation lag plotted on screen:

\Gamma_i = \frac1T \sum_{n=0}^{T-i}(X_n - \mu)(X_{n+i} - \mu),\quad i = 0, 1, 2, \dots, \text{maxLag}
and then \text{efficiency} = \Bigl[ 1 + 2 \sum_{i=1}^{\text{maxLag}} \Gamma_i/\Gamma_0 \Bigr]^{-1}.
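
These formulas translate directly into code: for an i.i.d. sequence the autocorrelations vanish and the efficiency is close to 1, while a slowly mixing chain gives a value well below 1. A Python sketch of the estimator above (my own implementation, not the applet's):

```python
import random

def efficiency(xs, max_lag):
    """Estimate Gamma_i as above, then return
    [1 + 2 * sum_{i=1}^{maxLag} Gamma_i / Gamma_0]^(-1)."""
    T = len(xs)
    mu = sum(xs) / T
    def gamma(i):
        return sum((xs[n] - mu) * (xs[n + i] - mu) for n in range(T - i)) / T
    g0 = gamma(0)
    return 1.0 / (1.0 + 2.0 * sum(gamma(i) / g0 for i in range(1, max_lag + 1)))

# An i.i.d. sequence has no autocorrelation, so the efficiency is near 1.
rng = random.Random(6)
iid = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(efficiency(iid, 20))
```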