\documentclass{article}
\usepackage{amsmath}
\usepackage{fullpage}
\usepackage{url}
\usepackage{xspace}
\newcommand{\R}{\texttt{R}\xspace}
\title{Introduction to \R}
\author{Patrick Breheny}
\date{\today}

\begin{document}
\maketitle

Our goal for today is to introduce \R, an open-source computing language designed to allow for fluid, interactive data manipulation, analysis, and visualization. \R is installed on all the computers in the lab, but if you are interested in installing it at home, go to \url{www.r-project.org}. You can run \R directly, but it is often more convenient to run \R through an integrated development environment such as RStudio (\url{www.rstudio.com}), which is also installed on the lab machines.

Much of the material here is adapted from {\em S Programming}, by Venables and Ripley (2000), an excellent book on the \texttt{S} and \texttt{R} languages that goes into far more detail than I do here.

\section{\R objects}

Commands in \R are either {\em expressions}, which are evaluated and printed, or {\em assignments}, which store the result of an evaluation as an object. Arithmetic operations for the most part work much as they do on a calculator (note that \verb|#| marks the rest of the line as a comment):

<<>>=
(5^2)*(10-8)/3 + 1        ## An expression
x <- (5^2)*(10-8)/3 + 1   ## An assignment
x
@

Note that the value of the expression is now stored in an {\em object} called \texttt{x}. This allows us to use it again in further calculations:

<<>>=
x+1
n <- 50
x/n
@

Objects can be named using any combination of upper- and lower-case letters, digits 0--9 and the underscore (provided they are not in the initial position), and the period. Note that \R is case sensitive (\texttt{x} and \texttt{X} refer to two different objects).

All objects in \R have a {\em class}, which describes the kind of thing that is stored in the object. For instance,

<<>>=
class(x)
@

tells us that \texttt{x} is storing a numeric object at the moment.
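As a brief illustration of naming and classes, the following sketch (the object names \texttt{my.value} and \texttt{My.Value} are made up for this example) shows that names differing only in case refer to different objects, and that \texttt{class} gives a different answer depending on what is stored:

<<>>=
my.value <- 10       ## Hypothetical name: letters and a period
My.Value <- "ten"    ## A different object, since R is case sensitive
class(my.value)
class(My.Value)
class(my.value > 5)  ## The result of a comparison is a logical value
class(mean)          ## Functions are objects too
@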
\subsection{Functions}

\R is said to be a functional language, meaning that it is built around calling functions to accomplish tasks:

<<>>=
x <- 1:9   ## Creates a vector of numbers 1, 2, ..., 9
mean(x)
median(x)
sd(x)
min(x)
sum(x)
x^2
sum(x^2)
@

To get more information about any function in \R, just type \texttt{help('sd')} or, more compactly, \texttt{?sd}. To search the help files for pages mentioning, say, regression, type \texttt{help.search("regression")} or \texttt{??regression}.

Over time in this course, we'll see a number of functions in \R and how they are used. Functions typically have a number of optional arguments, which may either be specified or left at their default values:

<<>>=
x <- rnorm(1000)
y <- rnorm(1000, mean=3)
hist(c(x, y))
hist(c(x, y), col="gray", border="white", breaks=40)
@

\subsection{Vectors and matrices}

The simplest way to create a vector is with \texttt{c} (for concatenate). Vectors are typically either numeric, character, or logical:

<<>>=
x1 <- c(-0.1, 0.3, -0.9, 0.2, -0.6, -2.3)
x2 <- c("red", "blue", "green", "purple")
x3 <- x1 > 0
x3
@

The elements of a vector may be named and then accessed by name:

<<>>=
names(x1) <- c("Justine", "Rachel", "Meng", "Xiuhua", "Whitney", "Paul")
x1
names(x1)
x1["Justine"]
@

If a vector is arranged into a regular series of rows and columns, it becomes a matrix:

<<>>=
X <- matrix(x1, nrow=2)
X
@

By default, matrices are filled in by column, but this can be changed:

<<>>=
matrix(x1, nrow=2, byrow=TRUE)
@

The rows and columns of a matrix can be named using \texttt{rownames} and \texttt{colnames}. Individual elements of matrices can be accessed with \texttt{X[i, j]}. Arrays further extend this concept, and can have an arbitrary number of additional dimensions (\texttt{A[i, j, k]}). The dimensions of a matrix or array can be found using the \texttt{dim} function.
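To make the naming, indexing, and \texttt{dim} conventions concrete, here is a small sketch; the objects \texttt{M} and \texttt{arr} and their row and column labels are invented purely for illustration:

<<>>=
M <- matrix(1:6, nrow=2)            ## Filled by column
rownames(M) <- c("a", "b")
colnames(M) <- c("u", "v", "w")
M
M[2, 3]                             ## Row 2, column 3
M["b", "w"]                         ## The same element, accessed by name
dim(M)
arr <- array(1:24, dim=c(2, 3, 4))  ## A three-dimensional array
arr[2, 3, 4]
dim(arr)
@

Once names have been assigned, a matrix can be indexed either by position or by name, whichever is more convenient.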
The functions \texttt{cbind} and \texttt{rbind} can be used to join together vectors or matrices column-wise or row-wise:

<<>>=
X1 <- matrix(rnorm(6), ncol=3)
X2 <- matrix(rnorm(6), ncol=3)
rbind(X1, X2)
cbind(X1, X2)
@

\subsection{Lists}

Lists are used in \R to collect together items of different types. Items in a list may be accessed by number, such as \texttt{L[[1]]}, or by name, as in the following example:

<<>>=
tt <- t.test(1:7, 5:10)
tt
tt$conf.int
tt$p.value
@

\subsection{Data frames}

A {\em data frame} is typically how data sets are stored in \R. Like a matrix, a data frame has a regular grid pattern of rows and columns. However, unlike a matrix (and like a list), a data frame is not restricted to contain only numeric or only character values. Each column of a data frame has its own type, as is typical for real data (some variables are continuous, others are categorical).

Data frames can be constructed directly, but in this class, it will be more typical to read them in from raw, tab-delimited files:

<<>>=
## stringsAsFactors=TRUE stores text columns as factors
## (this is no longer the default as of R 4.0)
tips <- read.delim("http://web.as.uky.edu/statistics/users/pbreheny/760/data/tips.txt",
                   stringsAsFactors=TRUE)
head(tips)
class(tips$TotBill)
class(tips$Sex)
@

Local file paths can be used as well, either relative to the current directory (\texttt{getwd}) or as an absolute path. Data frames can have row names in general, although the one above does not. The functions \texttt{rbind} and \texttt{cbind} can also be used on data frames.

It is often cumbersome to type \verb|tips$| repeatedly to access the elements of the data frame. There are two ways around this: \texttt{attach} and \texttt{with}.
The former is persistent (although it can be undone with \texttt{detach}) and thus can sometimes lead to unintended side effects, while the latter acts only temporarily:

<<>>=
with(tips, mean(TotBill))
mean(TotBill)   ## Error: TotBill is not visible outside of with()
attach(tips)
mean(TotBill)   ## Works now that tips is attached
detach(tips)
@

Note that we can add columns to the data frame after it has been created:

<<>>=
tips$Rate <- with(tips, Tip/TotBill)
@

\subsection{Factors}

A factor is a special type of vector used to encode levels of a categorical variable (such as \verb|tips$Sex| above).

<<>>=
table(tips$Sex)
levels(tips$Sex)
barplot(table(tips$Day))
@

\section{Indexing}

Elements of a vector can be accessed in five distinct ways. It is often easier to use one method in one circumstance and a different method in other circumstances, so the flexibility \R provides in this regard is quite convenient:
\begin{itemize}
\item A logical vector: Specifies, for each element, whether or not to include it
\item A vector of positive integers: Lists the elements to include
\item A vector of negative integers: Lists the elements to exclude
\item A vector of names: Lists the elements to include by name
\item Empty: Select all components
\end{itemize}

<<>>=
x1[x1 > 0]                 ## Logical vector
x1[1:3]                    ## Positive integers
x1[-(5:6)]                 ## Negative integers
x1[c("Whitney", "Meng")]   ## Names
x1[]                       ## Empty
@

The last option may seem pointless, but it is needed, among other places, when accessing portions of a matrix or data frame (note that we are specifying a subset of rows, but all the columns):

<<>>=
tips[tips$Tip >= 7, ]
@

\section{Operations}

\subsection{Arithmetic}

As mentioned above, arithmetic in \R generally works as you would expect.
One wrinkle worth discussing, however, is {\em recycling}, as illustrated below:

<<>>=
x <- sample(1:10, 3)
y <- sample(1:10, 6)
x
x + 2   ## Adds 2 to each element of x
x + x   ## Adds x to itself elementwise
x + y   ## Adds x to y elementwise, 'recycling' x
@

If the length of \texttt{y} had not been a multiple of the length of \texttt{x}, fractional recycling would have occurred, with a warning.

Two important logical operators to know about are \verb=&= (and) and \verb=|= (or):

<<>>=
x <- TRUE
y <- FALSE
x & y
x | y
@

There are dozens of useful functions for arithmetic in \R, most of which have obvious names or can easily be found using \texttt{help.search}: \texttt{log, exp, sign, sqrt, round, sum, prod, range, sort, intersect}, and many more. Two functions that are perhaps worth mentioning specifically are \texttt{seq}, which generates equally spaced sequences of numbers, and \texttt{rep}, which is used to repeat an object in various ways:

<<>>=
seq(0, 1, .2)
seq(0, 1, length.out=5)
rep(1, 5)
rep(1:2, 2)
rep(1:2, c(2,2))
@

Finally, it is worth remarking that \R has a specific value to represent missing data: \texttt{NA}.
For example, take a look at the data set \texttt{airquality} and note that \texttt{Ozone} has a number of missing values:

<<>>=
sum(is.na(airquality$Ozone))
mean(airquality$Ozone)               ## NA, because of the missing values
mean(airquality$Ozone, na.rm=TRUE)
@

Note that \texttt{NA} is not the same thing as infinity, and it is not the same thing as ``undefined''; those mathematical concepts have their own representations in \R:

<<>>=
1/0   ## Inf
0/0   ## NaN
@

\subsection{Matrix and array arithmetic}

Arrays may be used in ordinary arithmetic the same way as vectors:

<<>>=
A <- matrix(rpois(12, 3), nrow=4)
B <- matrix(rpois(12, 3), nrow=4)
A*B + 2*A
@

Matrix multiplication is a separate operation:

<<>>=
t(A) %*% B
crossprod(A, B)   ## Same result as t(A) %*% B
A %*% B           ## Error: A and B are both 4 x 3, so they are non-conformable
@

Finally, it is worth knowing that \texttt{solve} returns the inverse of a matrix (provided, of course, that it is invertible):

<<>>=
X <- crossprod(A)
X.inv <- solve(X)
X %*% X.inv
@

\subsection{Character operations}

\R also has a number of functions for computing on strings of characters. Perhaps the most important is \texttt{paste}, which pastes together strings:

<<>>=
paste(LETTERS[1:4], 1:4)
paste(LETTERS[1:4], 1:4, sep="")
a <- paste(LETTERS[1:4], 1:4, sep="-")
nchar(a)
substr(a, 1, 2)
@

There are also very powerful search, replace, and match functions like \texttt{grep}, \texttt{gsub}, and \texttt{match}, although a detailed discussion of them is beyond the scope of this lecture.

\subsection{Vectorized calculations}

\R also offers a number of convenient functions for applying a function to each element of a list, or to each row or column of a matrix: \texttt{apply, tapply, sapply, lapply}.

<<>>=
apply(airquality, 2, mean)
apply(airquality, 2, mean, na.rm=TRUE)
@

\subsection{Probability distributions}

In statistics, we are often interested in working with random numbers and probability distributions.
To carry out ``draw $n$ balls from an urn'' types of sampling, there is the function \texttt{sample}, which can sample with or without replacement:

<<>>=
sample(1:10, 5)
sample(LETTERS[1:10], 5)
sample(LETTERS[1:5], 10, replace=TRUE)
@

\R provides a wide array of functions for working with specific probability distributions as well, and they are organized in the following systematic manner (using the normal distribution as an example):
\begin{itemize}
\item \texttt{dnorm}: The density ({\em i.e.}, the pdf); for discrete distributions such as the binomial (\texttt{dbinom}), this returns the mass function (pmf)
\item \texttt{pnorm}: The CDF
\item \texttt{qnorm}: The quantile function, or inverse CDF
\item \texttt{rnorm}: Generates random numbers from the distribution
\end{itemize}
The arguments that each distribution takes are, of course, different, but the organization is the same: there are \texttt{dpois}, \texttt{ppois}, \texttt{qpois}, and \texttt{rpois} that allow you to work with the Poisson distribution, and so on for all common distributions.

\section{Writing your own functions}

One of the best things about \R (perhaps {\em the} best thing about \R) is how easy it is to write your own functions. For example, let's write a function to solve for the least squares regression coefficients (this function is redundant, of course, as a perfectly good function to do this already exists in \R, but it will be a useful exercise regardless).

<<>>=
ols <- function(XX, yy) {
  ## Drop any observation with a missing predictor or response
  missing.data <- apply(is.na(XX), 1, any) | is.na(yy)
  X <- cbind(Intercept=1, as.matrix(XX[!missing.data,]))
  y <- yy[!missing.data]
  solve(crossprod(X)) %*% t(X) %*% y
}
ols(airquality[,-1], airquality$Ozone)
## Check against existing R function:
lm(Ozone~., airquality)
@

The ability to write your own functions greatly enhances your ability to customize and extend \R.
In particular, there are many thousands of extra functions that people have written which are available on the Comprehensive \R Archive Network (\url{cran.r-project.org}). These functions are bundled together into {\em packages} and may be installed with \texttt{install.packages} and loaded with \texttt{require} or \texttt{library}. For example, the \texttt{lattice} package (installed by default, but not loaded) provides a number of nice tools for plotting:

<<>>=
require(lattice)
xyplot(Tip~TotBill|Smoker, data=tips)
@

\section{Control structures}

One last topic: {\em control structures} are commands that control whether or not to execute other commands, or whether to repeatedly execute blocks of commands. The \texttt{if} statement is used as follows:

<<>>=
x <- airquality$Ozone   ## Numeric
x <- tips$Sex           ## Overwrites x with a non-numeric variable
if (is.numeric(x)) sqrt(x) else stop("Can't take sqrt of things that aren't numbers")
@

The other basic control structure is \texttt{for}, which executes loops:

<<>>=
total <- 0
for (i in 1:10) {
  total <- total + i
}
total
@

\section{Example: Running a simulation}

To get a sense of how loops are used, let's carry out a brief simulation study investigating the robustness of Student's $t$-test when the variances of the two groups are unequal.

<<>>=
N <- 1000    ## Number of simulated data sets per setting
n <- 10      ## Sample size in each group
SD <- 1:10   ## Standard deviation of the second group
pW <- pS <- matrix(NA, nrow=N, ncol=length(SD))
for (i in seq_along(SD)) {
  print(i)
  for (j in 1:N) {
    s1 <- rnorm(n)
    s2 <- rnorm(n, sd=SD[i])
    pW[j,i] <- t.test(s1, s2, var.equal=FALSE)$p.value
    pS[j,i] <- t.test(s1, s2, var.equal=TRUE)$p.value
  }
}
plot(SD, apply(pW < .05, 2, mean), type="l", lwd=3, col="red",
     ylab="Type I error rate", xlab="Ratio of standard deviations",
     ylim=c(0,0.1))
lines(SD, apply(pS < .05, 2, mean), lwd=3, col="blue")
legend("topleft", col=c("red", "blue"), lwd=3, legend=c("Welch","Student"))
abline(h=.05, col="gray")
@

As with any programming language, the only way to learn \R is to use \R, and certainly, you'll grow more familiar with \R and learn many more functions as the semester progresses.
Hopefully, this document serves as a useful introduction and reference.

\end{document}