Customization

One great thing about the Linux command line is how easy it is to customize. For example, submitting an array of R jobs like the one on the previous page is something I do all the time, so I simply added the following commands to ~/bin/:

File:
rbatch
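#!/bin/bash
# Run the R script $1 in batch mode, passing the SGE task ID and last task ID
# as arguments; output goes to the hidden file .$1.Rout
R CMD BATCH --no-save --no-restore "--args $SGE_TASK_ID $SGE_TASK_LAST" "$1" ".$1.Rout"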
File:
qbatch
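#!/bin/bash
# Submit the R script $1 as an array job with tasks 1 through $2 on the TUKEY queue;
# stderr goes to ~/err and stdout to ~/out
qsub -cwd -V -e ~/err -o ~/out -q TUKEY -t 1-"$2" ~/bin/rbatch "$1"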

Now we can submit arrays of R jobs straight from the command line, as we will do shortly, without writing any extra scripts like sim and batch-sim.
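Because qbatch takes two positional arguments (the R script and the number of tasks), it can be handy to preview the command before handing it to the scheduler. The following is only a minimal sketch: the name qbatch_cmd and the argument checks are hypothetical additions for illustration, while the qsub flags are copied from qbatch above. It prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical helper (not one of the scripts above): print the qsub command
# that qbatch would issue, without submitting anything.
# Usage: qbatch_cmd SCRIPT.R NTASKS
qbatch_cmd() {
    script=${1:-}
    ntasks=${2:-}
    # Accept only a positive integer with no leading zero as the task count.
    case $ntasks in
        ''|*[!0-9]*|0*) ntasks= ;;
    esac
    if [ -z "$script" ] || [ -z "$ntasks" ]; then
        echo "usage: qbatch_cmd SCRIPT.R NTASKS" >&2
        return 1
    fi
    # Same flags as qbatch; ~ is inside double quotes, so it is printed literally.
    echo "qsub -cwd -V -e ~/err -o ~/out -q TUKEY -t 1-$ntasks ~/bin/rbatch $script"
}
```

For example, `qbatch_cmd sim.R 10` prints `qsub -cwd -V -e ~/err -o ~/out -q TUKEY -t 1-10 ~/bin/rbatch sim.R`, while a missing or non-numeric task count prints a usage message and returns a nonzero status.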

Furthermore, it’s extremely useful to have modular, versatile code. In particular, I don’t like writing R code, then rewriting R code to run on the cluster, then re-re-writing it if I want to run it again on my machine. All of these rewrites are (a) annoying and (b) an opportunity to make a mistake. For example, comparing the versions of sim.R here and here, you’ll note that one of them works in an interactive session but not a batch session, while the other works in a batch session, but won’t run in an interactive session. To avoid switching between the two, I use a simple R function I called Bsave for “batch save” (source code here and here for its cousin, Bgather; add source("Bsave.R") to your ~/.Rprofile file to make this available in your R sessions). To see why this is useful, let’s watch it in action. Let’s rewrite sim.R one last time:

File:
sim.R
N <- 10000
p <- numeric(N)
n <- 10
for (i in 1:N) {
  x <- rnorm(n)
  y <- rnorm(n, sd=3)
  p[i] <- t.test(x, y, var.equal=TRUE)$p.value
}
Bsave(p)

Finally, I also have a command, gather, that simply opens R, calls Bgather(), and then closes R:

File:
gather
#!/bin/bash
Rscript -e 'Bgather()'

Now let’s see what these functions do. If we run sim.R in an interactive R session, p is saved in a file named with today’s date. On the cluster, we instead run it non-interactively:

>
qbatch sim.R 10

The results are saved in tmp1.RData, tmp2.RData, and so on, one file per task. To combine them, we run:

>
gather

Now all the tmp files are gone and we are left with a file 2014-02-21.RData (or whatever the date is). If you load it, you’ll see that it contains all 100,000 results (10 tasks of 10,000 replications each). In this particular example, each individual result was a scalar, so p is a vector, but Bsave/Bgather work for an array of any dimensions, provided that the first dimension is the one we’re merging on. This is about as low a barrier as you can hope for, short of running jobs on multiple processors from within R itself: no need to modify any code or to write any scripts, just run qbatch and then gather (or Bgather() from within R) when you’re done.
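Before gathering, it can be worth confirming that every task actually wrote its file, since how Bgather behaves when one is missing isn't covered here. The following is a minimal sketch, assuming the tmp1.RData, tmp2.RData, ... naming described above and that the files sit in the current directory; missing_tasks is a hypothetical helper, not part of Bsave/Bgather.

```shell
#!/bin/bash
# Hypothetical helper: print the task numbers (one per line) whose tmpN.RData
# file is absent from the current directory. Prints nothing if all are present.
# Usage: missing_tasks NTASKS
missing_tasks() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        # Report task i only when its output file does not exist.
        [ -f "tmp$i.RData" ] || echo "$i"
        i=$((i + 1))
    done
}
```

After `qbatch sim.R 10` has finished, `missing_tasks 10` should print nothing; any numbers it does print are the tasks to investigate (for instance, in ~/err) before running gather.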

This has just been an illustration of some personal things I’ve done to customize Neon and make transitioning code to and from Neon go smoothly – feel free to use these customizations, modify them, or ignore them as you see fit.