MCMC
One final common task in statistical computing that is (for the most part) easy to parallelize is the estimation of posterior distributions using Markov chain Monte Carlo (MCMC) simulation. Although an individual chain is difficult to parallelize, one can carry out the estimation by running multiple chains, and this is easily accomplished in parallel.
There are several programs (BUGS, JAGS, Stan) that accept a specified model and
carry out sampling from the posterior. We will focus here on JAGS, simply
because that is the one I am most familiar with. JAGS is installed on Neon at
/Shared/Tukey/jags
; to run it, add the following to your ~/.bash_profile
file:
~/.bash_profile
<...more...> export PATH=~/bin:/Shared/Tukey/jags/bin:$PATH: export PKG_CONFIG_PATH=/Shared/Tukey/jags/lib/pkgconfig:$PKG_CONFIG_PATH: export LD_LIBRARY_PATH=/Shared/Tukey/jags/lib:$LD_LIBRARY_PATH: module load gcc/4.8.2 <...more...>
We will use data from a two-factor study of pilot response to loss of aircraft control carried out in a flight simulator (raw data here). The two factors are type of training that the pilot received and the scenario that was simulated. A reasonable model for this data is the following:
model.txt
model { ## Likelihood for (i in 1:n) { y[i] ~ dbern(theta[i]) logit(theta[i]) <- mu + a[aid[i]] + b[uid[i]] } ## Priors mu ~ dnorm(0, 0.0001) for (j in 1:J) { a[j] ~ dnorm(0, sigma[1]^(-2)) } for (k in 1:K) { b[k] ~ dnorm(0, sigma[2]^(-2)) } for (i in 1:2) { sigma[i] ~ dunif(0, 5) } }
JAGS (and BUGS and Stan) can be run from the command line, and thus you could
write your own scripts to call and run them in parallel. However, this is a
common enough task that people have already written packages to do this from
within R; one such package is runjags
. Here is how we would fit the above
model in R (either interactively or not), here specifying that we want 10
chains, each with 10,000 draws (the default in runjags
is a burn-in of 4,000,
so we’re drawing 14,000 samples from each chain, but only keeping the last
10,000).
Data <- read.delim("http://myweb.uiowa.edu/pbreheny/neon/misc/flight.txt") jData <- list(aid = as.numeric(Data$Scenario), uid = as.numeric(Data$Group), J = length(levels(Data$Scenario)), K = length(levels(Data$Group)), y = Data$Recovered, n = nrow(Data)) monitor <- c("mu", "a", "b", "sigma") FUN <- function(chain) {list(mu=rnorm(1), sigma=rexp(2), a=rnorm(max(jData$aid)), b=rnorm(max(jData$uid)))} inits <- mapply(FUN, 1:10, SIMPLIFY=FALSE) require(runjags) runjags.options(rng.warning=FALSE) ## I'm happy to let runjags worry about the RNG details; ## I prefer not to be warned about it fit <- run.jags("model.txt", monitor=monitor, data=jData, inits=inits, n.chains=length(inits), method="parallel", sample=10000) ## To convert from runjags format to other popular formats: require(coda) MCMC <- as.mcmc.list(fit) ## Convert to coda format require(R2jags) Fit <- R2jags:::mcmc2bugs(MCMC, program="jags") ## Convert to R2jags format ## Posterior summaries: Fit plot(Fit)
The above code can be run either interactively through qlogin
or with qsub
.
Either way, you will want to specify the -pe smp 10
flag to ensure that you
have 10 processors available (if you only have one processor available, you’ll
have to run the 10 chains sequentially, which will of course be about 10 times
slower). For example:
qsub -pe smp 10 -q TUKEY ~/bin/rbatch mcmc.R
Note that in the end, we obtain 100,000 draws from the posterior distribution. As the summaries inform us, this data indicates that there is little reason to think that type of training had any impact on pilot’s ability to recover aircraft control, although scenario certainly had an effect (some simulated scenarios were much more difficult than others).