Read in the Tips dataset.
How do you access just the Tip column?
How do you access just the Tip column but only for the smokers?
How do you access just the Tip column but only when the total bill is less than 15?
(For checking purposes, the mean of each of these is given below.)
## [1] 2.998279
## [1] 3.00871
## [1] 2.05025
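The general pattern for these three questions can be sketched on a tiny made-up data frame. Everything here is hypothetical: the data frame `toy`, its values, and in particular the column name `Smoker` with the codes "Yes" and "No" are assumptions for illustration, so check the actual column names in your copy of the data (for example with names(tips)) before adapting it.

```r
# Hypothetical miniature stand-in for the tips data (values are made up)
toy <- data.frame(
  Tip     = c(1, 2, 3, 4),
  TotBill = c(10, 14, 20, 30),
  Smoker  = c("Yes", "No", "Yes", "No")  # assumed column name and coding
)

# The whole Tip column
tip_all <- toy$Tip

# Tip values only for rows where the (assumed) Smoker column is "Yes"
tip_smokers <- toy$Tip[toy$Smoker == "Yes"]

# Tip values only for rows where the total bill is less than 15
tip_small <- toy$Tip[toy$TotBill < 15]

# Means of each selection, for checking
mean(tip_all)
mean(tip_smokers)
mean(tip_small)
```

The idea is that the expression inside the square brackets is a vector of TRUE/FALSE values, one per row, and only the TRUE rows are kept.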
In R, the function we will use is called ‘lm’, which stands for “linear model”. Regression is useful because it can be generalized to all kinds of settings through the notion of a model, as alluded to by its name in R.
To create a model in R, the code looks like this:
mod <- lm(Tip ~ TotBill, data = tips)
(mod)
##
## Call:
## lm(formula = Tip ~ TotBill, data = tips)
##
## Coefficients:
## (Intercept)      TotBill
##      0.9203       0.1050
Note: The name of the model (here, mod) is arbitrary. Inside lm(), the arguments should follow the pattern (y variable) ~ (x variable), data = (the dataset).
Printing out the model itself gives you the intercept and slope.
If you want to add the line to an existing plot of the data, you can use the abline() function and put the name of your model inside the parentheses. In this case, it would be abline(mod). (We will practice this below.)
The output tells us that for every additional dollar that a meal costs, the waiter can expect, on average, about 10.5 cents more in tip. Note that we could also have calculated this slope from the output of cor(), because the slope equals the correlation multiplied by the ratio of the standard deviations (y over x):
cor(tips$Tip,tips$TotBill)*sd(tips$Tip)/sd(tips$TotBill)
## [1] 0.1050245
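As a quick check of this relationship, here is a minimal sketch on a made-up data frame. The data frame `toy` and its values are hypothetical; only lm(), coef(), cor(), sd(), and mean() are standard R functions. It compares the slope and intercept reported by lm() with the values computed by hand.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  TotBill = c(10, 14, 20, 30),
  Tip     = c(1.5, 2.0, 3.5, 4.0)
)

toy_mod <- lm(Tip ~ TotBill, data = toy)

# Slope as reported by lm()
slope_from_lm <- unname(coef(toy_mod)["TotBill"])

# Slope from the correlation and the two standard deviations
slope_from_cor <- cor(toy$Tip, toy$TotBill) * sd(toy$Tip) / sd(toy$TotBill)

# Intercept by hand: the line passes through (mean of x, mean of y)
intercept_by_hand <- mean(toy$Tip) - slope_from_cor * mean(toy$TotBill)

all.equal(slope_from_lm, slope_from_cor)
all.equal(unname(coef(toy_mod)["(Intercept)"]), intercept_by_hand)
```

Both comparisons should report agreement up to rounding error, whatever data you use.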
Perhaps our prediction of the waiter’s tip should depend on whether or not the table is in the smoking section. One approach would be to fit regression lines separately and make separate predictions for the two groups (which I would encourage you to do for practice). However, for a lot of reasons, making predictions based on multiple variables gets complicated. This is the kind of thing you would explore in the next course, should you decide to take it.
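One possible way to fit the two separate lines is the subset argument of lm(). The sketch below uses a made-up data frame; the name `toy`, its values, and the column `Smoker` with codes "Yes" and "No" are all assumptions, so adjust them to match the real dataset.

```r
# Hypothetical toy data (values and the Smoker column are made up)
toy <- data.frame(
  TotBill = c(10, 15, 20, 25, 12, 18, 24, 30),
  Tip     = c(1.5, 2.0, 3.0, 3.5, 2.0, 2.5, 3.0, 4.5),
  Smoker  = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No")
)

# One regression line per group, using only the rows that match the condition
mod_smoker    <- lm(Tip ~ TotBill, data = toy, subset = Smoker == "Yes")
mod_nonsmoker <- lm(Tip ~ TotBill, data = toy, subset = Smoker == "No")

# Compare the intercepts and slopes of the two fits
coef(mod_smoker)
coef(mod_nonsmoker)
```

Each model is fit only to its own group, so the two slopes and intercepts can differ.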
You can perform operations on whole columns of data at a time. For instance:
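# Tip as a fraction of the total bill, computed for every row at once
tipPctgs <- tips$Tip / tips$TotBill
summary(tipPctgs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.03564 0.12910 0.15480 0.16080 0.19150 0.71030
# Multiply every value by 100 to express the fractions as percentages
tipPctgs2 <- tipPctgs * 100
summary(tipPctgs2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.564  12.910  15.480  16.080  19.150  71.030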
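# Problem 1
plot(tips$TotBill, tips$Tip)
# Problem 2
abline(mod)
# Problem 3
cor(tips$Tip, tips$TotBill)
# Problem 4
# cor * sd(TotBill) / sd(Tip) is the slope for predicting TotBill from Tip; scale it by 2
2 / sd(tips$Tip) * cor(tips$Tip, tips$TotBill) * sd(tips$TotBill)
# Problem 5
# Start at the mean tip and add 10 times the slope for predicting Tip from TotBill
mean(tips$Tip) + 10 / sd(tips$TotBill) * cor(tips$Tip, tips$TotBill) * sd(tips$Tip)
# Alternatively, plug (mean bill + 10) into the fitted line: intercept + slope * x
mod$coefficients[1] + mod$coefficients[2] * (mean(tips$TotBill) + 10)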
## [1] 0.6757341
## [1] 8.695428
## [1] 4.048524
## (Intercept)
##    4.048524
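The last result carries the label (Intercept) only because R keeps the name of the first coefficient used in the arithmetic. A third option, sketched here on a made-up data frame, is the standard predict() function with a newdata argument, which returns the same prediction. The data frame `toy`, its values, and the helper names `new_bill`, `by_hand`, and `by_predict` are hypothetical.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  TotBill = c(10, 14, 20, 30),
  Tip     = c(1.5, 2.0, 3.5, 4.0)
)

toy_mod <- lm(Tip ~ TotBill, data = toy)

# A new bill that is 10 above the mean bill; the column name must match the model
new_bill <- data.frame(TotBill = mean(toy$TotBill) + 10)

# Prediction by hand from the coefficients (unname() drops the "(Intercept)" label)
by_hand <- unname(coef(toy_mod)[1] + coef(toy_mod)[2] * new_bill$TotBill)

# The same prediction from predict()
by_predict <- unname(predict(toy_mod, newdata = new_bill))

all.equal(by_hand, by_predict)
```

With the real data, the equivalent call would use mod and tips in place of toy_mod and toy.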