R Review:

Read in the Tips dataset.

How do you access just the Tip column?
How do you access just the Tip column but only for the smokers?
How do you access just the Tip column but only when the total bill is less than 15?
(For checking purposes, the mean of each of these three results is shown below, in the same order as the questions.)

## [1] 2.998279
## [1] 3.00871
## [1] 2.05025
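
As a rough sketch of the three kinds of access asked about above, the code below uses a small made-up data frame called toy rather than the real Tips data. The column name Smoker and its "Yes"/"No" coding are assumptions; check names(tips) on the actual file before adapting this.

```r
# Made-up stand-in for the Tips data (values are invented for illustration).
# The Smoker column name and its "Yes"/"No" coding are assumptions.
toy <- data.frame(
  Tip     = c(1.00, 3.50, 2.00, 5.00, 1.50),
  TotBill = c(10.00, 21.00, 12.50, 30.00, 14.00),
  Smoker  = c("No", "Yes", "No", "Yes", "Yes")
)

all_tips    <- toy$Tip                       # one column, as a numeric vector
smoker_tips <- toy$Tip[toy$Smoker == "Yes"]  # keep entries where the condition is TRUE
small_tips  <- toy$Tip[toy$TotBill < 15]     # same idea with a numeric condition

mean(all_tips)
mean(smoker_tips)
mean(small_tips)
```

The pattern is the same each time: pick the column with $, then put a logical condition inside square brackets to keep only the matching entries.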

Making models

In R, the function we will use is called ‘lm’, which stands for “linear model”. Regression is useful because it can be generalized to all kinds of settings through the notion of a model, as alluded to by its name in R.

To create a model in R, the code looks like this:

mod<-lm(Tip~TotBill,data=tips)
(mod)
## 
## Call:
## lm(formula = Tip ~ TotBill, data = tips)
## 
## Coefficients:
## (Intercept)      TotBill  
##      0.9203       0.1050

Note: The name you give the model (here, mod) is arbitrary. Inside lm(), the formula should look like (y variable) ~ (x variable), followed by data = (the dataset).

Printing out the model itself gives you the intercept and slope.
If you want to add the line to a plot of the data, you can use the abline() function, and just put the name of your model inside the parentheses. In this case, it would be mod. (We will practice this below.)
The output tells us that for every additional dollar on the total bill, the waiter can expect about 10.5 cents more on the tip. Note that we could have calculated this slope from the correlation that cor() gives us, together with the two standard deviations:

cor(tips$Tip,tips$TotBill)*sd(tips$Tip)/sd(tips$TotBill)
## [1] 0.1050245

Perhaps our prediction of the waiter’s tip should depend on whether or not the table is in the smoking section. One approach would be to fit regression lines separately and make separate predictions for the two groups (which I would encourage you to do for practice). However, for a lot of reasons, making predictions based on multiple variables gets complicated. This is the kind of thing you would explore in the next course, should you decide to take it.
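
The separate-lines approach mentioned above might look something like the sketch below. It uses a small made-up data frame called toy, and the column name Smoker with "Yes"/"No" values is an assumption, so adjust it to match the real Tips data. The subset argument of lm() restricts the fit to the rows where the condition is TRUE.

```r
# Made-up stand-in for the Tips data (values are invented for illustration).
# The Smoker column name and its "Yes"/"No" coding are assumptions.
toy <- data.frame(
  Tip     = c(1.0, 3.5, 2.0, 5.0, 1.5, 2.5),
  TotBill = c(10.0, 21.0, 12.5, 30.0, 14.0, 18.0),
  Smoker  = c("No", "Yes", "No", "Yes", "Yes", "No")
)

# Fit one regression line per group by restricting the rows used in each fit
mod_smoke    <- lm(Tip ~ TotBill, data = toy, subset = Smoker == "Yes")
mod_nonsmoke <- lm(Tip ~ TotBill, data = toy, subset = Smoker == "No")

coef(mod_smoke)     # intercept and slope for the smoking tables
coef(mod_nonsmoke)  # intercept and slope for the non-smoking tables
```

Each model is then used only for predictions about tables in its own group.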

Manipulating whole columns

You can perform operations on whole columns of data at a time. For instance:

tipPctgs<-tips$Tip/tips$TotBill
summary(tipPctgs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03564 0.12910 0.15480 0.16080 0.19150 0.71030
tipPctgs2<-tipPctgs*100
summary(tipPctgs2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.564  12.910  15.480  16.080  19.150  71.030

Practice Problems:

  1. Make a scatter plot of tip amount vs total bill. (We’ve done this before.)
  2. Now add the regression line to the plot.
  3. Suppose a table's total bill is 1 standard deviation above average. How many standard deviations above average would you expect its tip to be?
  4. Suppose a table's tip is $2 above average. How many dollars above average would you expect its total bill to be?
  5. Suppose a table's total bill is $10 above average. What would we expect the tip to be?
# Problem 1
plot(tips$TotBill,tips$Tip)
# Problem 2
abline(mod)

# Problem 3
cor(tips$Tip,tips$TotBill)
# Problem 4
2/sd(tips$Tip)*cor(tips$Tip,tips$TotBill)*sd(tips$TotBill)
# Problem 5
mean(tips$Tip)+10/sd(tips$TotBill)*cor(tips$Tip,tips$TotBill)*sd(tips$Tip)
# Alternatively,
mod$coefficients[1]+mod$coefficients[2]*(mean(tips$TotBill)+10)
## [1] 0.6757341
## [1] 8.695428
## [1] 4.048524
## (Intercept) 
##    4.048524
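
Problem 5 can also be done with R's predict() function, which evaluates a fitted model at new x values. The sketch below uses a small made-up data frame called toy (not the real Tips data) and checks that three routes to the prediction agree: predict(), the intercept-plus-slope formula, and "mean tip plus 10 times the slope". The last one works because the regression line always passes through the point of averages.

```r
# Made-up stand-in for the Tips data (values are invented for illustration)
toy <- data.frame(
  Tip     = c(1.0, 3.5, 2.0, 5.0, 1.5, 2.5),
  TotBill = c(10.0, 21.0, 12.5, 30.0, 14.0, 18.0)
)
toy_mod <- lm(Tip ~ TotBill, data = toy)

# A hypothetical table whose total bill is $10 above average.
# The column name must match the x variable used in the formula.
new_table <- data.frame(TotBill = mean(toy$TotBill) + 10)

by_predict <- predict(toy_mod, newdata = new_table)
by_hand    <- coef(toy_mod)[1] + coef(toy_mod)[2] * new_table$TotBill
by_slope   <- mean(toy$Tip) + 10 * coef(toy_mod)[2]

# All three should print the same number (the labels attached to it may differ)
by_predict
by_hand
by_slope
```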