Neon HPC Tutorial

MCMC

One final common task in statistical computing that is (for the most part) easy to parallelize is the estimation of posterior distributions using Markov chain Monte Carlo (MCMC) simulation. Although an individual chain is difficult to parallelize, one can carry out the estimation by running multiple chains, and this is easily accomplished in parallel.

There are several programs (BUGS, JAGS, Stan) that accept a specified model and carry out sampling from the posterior. We will focus here on JAGS, simply because that is the one I am most familiar with. JAGS is installed on Neon at /Shared/Tukey/jags; to run it, add the following to your ~/.bash_profile file:

File:
~/.bash_profile

<...more...>
export PATH=~/bin:/Shared/Tukey/jags/bin:$PATH:
export PKG_CONFIG_PATH=/Shared/Tukey/jags/lib/pkgconfig:$PKG_CONFIG_PATH:
export LD_LIBRARY_PATH=/Shared/Tukey/jags/lib:$LD_LIBRARY_PATH:
module load gcc/4.8.2
<...more...>

We will use data from a two-factor study of pilot response to loss of aircraft control carried out in a flight simulator (raw data here). The two factors are type of training that the pilot received and the scenario that was simulated. A reasonable model for this data is the following:

File:
model.txt

model {
  ## Likelihood
  for (i in 1:n) {
    y[i] ~ dbern(theta[i])
    logit(theta[i]) <- mu + a[aid[i]] + b[uid[i]]
  }
  
  ## Priors
  mu ~ dnorm(0, 0.0001)
  for (j in 1:J) {
    a[j] ~ dnorm(0, sigma[1]^(-2))
  }
  for (k in 1:K) {
    b[k] ~ dnorm(0, sigma[2]^(-2))
  }
  for (i in 1:2) {
    sigma[i] ~ dunif(0, 5)
  }
}

JAGS (and BUGS and Stan) can be run from the command line, and thus you could write your own scripts to call and run them in parallel. However, this is a common enough task that people have already written packages to do this from within R; one such package is runjags. Here is how we would fit the above model in R (either interactively or not), here specifying that we want 10 chains, each with 10,000 draws (the default in runjags is a burn-in of 4,000, so we’re drawing 14,000 samples from each chain, but only keeping the last 10,000).

Data <- read.delim("http://myweb.uiowa.edu/pbreheny/neon/misc/flight.txt")
jData <- list(aid = as.numeric(Data$Scenario),
              uid = as.numeric(Data$Group),
              J = length(levels(Data$Scenario)),
              K = length(levels(Data$Group)),
              y = Data$Recovered,
              n = nrow(Data))
monitor <- c("mu", "a", "b", "sigma")
FUN <- function(chain) {list(mu=rnorm(1), sigma=rexp(2), a=rnorm(max(jData$aid)), b=rnorm(max(jData$uid)))}
inits <- mapply(FUN, 1:10, SIMPLIFY=FALSE)
require(runjags)
runjags.options(rng.warning=FALSE) ## I'm happy to let runjags worry about the RNG details;
                                   ## I prefer not to be warned about it
fit <- run.jags("model.txt", monitor=monitor, data=jData, inits=inits, n.chains=length(inits), method="parallel", sample=10000)

## To convert from runjags format to other popular formats:
require(coda)
MCMC <- as.mcmc.list(fit) ## Convert to coda format
require(R2jags)
Fit <- R2jags:::mcmc2bugs(MCMC, program="jags") ## Convert to R2jags format

## Posterior summaries:
Fit
plot(Fit)

The above code can be run either interactively through qlogin or with qsub. Either way, you will want to specify the -pe smp 10 flag to ensure that you have 10 processors available (if you only have one processor available, you’ll have to run the 10 chains sequentially, which will of course be about 10 times slower). For example:

qsub -pe smp 10 -q TUKEY ~/bin/rbatch mcmc.R

Note that in the end, we obtain 100,000 draws from the posterior distribution. As the summaries inform us, this data indicates that there is little reason to think that type of training had any impact on pilot’s ability to recover aircraft control, although scenario certainly had an effect (some simulated scenarios were much more difficult than others).