Objectives

In today’s lab we will:

  1. Manipulate and subset datasets
  2. Learn how to construct a scatter plot
  3. Explore linear regression and correlation functions

Let’s start by loading the tips dataset from the course website. The HTML documentation for the dataset can be found here: https://myweb.uiowa.edu/pbreheny/data/tips.html

Manipulating Columns

Using the $ operator, you can add a new variable to a dataset by assigning to a new name with the ‘data$new_variable’ syntax. The new column can come from a vector you create or from other variables already in the dataset.

Suppose we want an index variable (i.e., one labeling the observations 1, 2, … etc.). We define a vector for this new column ranging from 1 to the total number of observations:

# One way is to look up the total number of observations (244) in our Environment and type it in:
tips$index <- 1:244

# Another way is using 'nrow,' which gives us the total number of rows in a dataset
tips$index <- 1:nrow(tips)
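
As a small, self-contained sketch of the same idea, the code below builds a made-up data frame (‘toy’ and its values are assumptions for illustration, not part of the tips data) and adds an index column. It uses seq_len(nrow(...)), which gives the same result as 1:nrow(...) whenever the dataset has at least one row, and also behaves sensibly when it has none.

```r
# Hypothetical four-row stand-in for the tips data (values are made up)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))

# seq_len(n) counts 1, 2, ..., n
toy$index <- seq_len(nrow(toy))
print(toy$index)

# With zero rows, 1:nrow() counts backwards (1, 0); seq_len() returns nothing
empty <- toy[0, ]
print(1:nrow(empty))
print(seq_len(nrow(empty)))
```

For a dataset like tips, which is never empty, either form works; seq_len() mainly matters in code that might be handed an empty subset.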

You can also add columns by performing operations on other columns of the dataset. Recall from a previous lab that operations can be applied to entire vectors at once. For instance, to make a variable showing what proportion of the total bill the tip was (assuming that “total bill” in this case includes the tip), we divide one column by the other:

tips$Tip_Prop <- tips$Tip/tips$TotBill
summary(tips$Tip_Prop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03564 0.12913 0.15477 0.16080 0.19148 0.71034
# Converting it to a percent:
tips$Tip_Perc <- tips$Tip_Prop*100
summary(tips$Tip_Perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.564  12.913  15.477  16.080  19.148  71.034

Scatter Plots

As discussed in lecture, when we are interested in visualizing the connection between two continuous variables, we can use a scatter plot. In R, this is done with the plot() function:

plot(x = tips$TotBill, y = tips$Tip)

Looking at this plot, we can see a positive association between the variables: as the bill increases, the tip tends to increase as well. There is also a good amount of variation around that trend.

Maybe we’re only curious about the total bill compared to the tip for individuals who don’t smoke. Recall that we can place certain restrictions on which data we want to include by using the indexing brackets [] with the “==” operator. Here’s how we could plot that:

# For the sake of simplicity, we separate just the non-smokers into a new dataset:
nonsmokers <- tips[tips$Smoker == "No",]
      
plot(nonsmokers$TotBill, nonsmokers$Tip, 
     xlab = "Total Bill Non-Smokers", 
     ylab = "Tip Non-Smokers")
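
To see what the brackets are doing, it can help to look at the logical vector on its own. The sketch below uses a made-up data frame (‘toy’ and its values are assumptions for illustration, not the tips data) and also shows subset(), a base R function that performs the same filtering.

```r
# Hypothetical stand-in for the tips data (values are made up)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6),
                  Smoker  = c("No", "Yes", "No", "Yes"))

# The comparison returns one TRUE/FALSE value per row
keep <- toy$Smoker == "No"
print(keep)

# Rows go before the comma; leaving the column slot empty keeps every column
toy_nonsmokers <- toy[keep, ]
print(toy_nonsmokers)

# subset() is another way to write the same filter
toy_nonsmokers2 <- subset(toy, Smoker == "No")
print(nrow(toy_nonsmokers2))
```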

Constructing Linear Models

To fit a regression line in R, we use the function ‘lm’, which stands for “linear model”. Regression is useful because it can be generalized to all kinds of settings through the notion of a model, as alluded to by its name in R.

To create a model in R, the code looks like this:

model <- lm(Tip ~ TotBill, data = tips)
model
## 
## Call:
## lm(formula = Tip ~ TotBill, data = tips)
## 
## Coefficients:
## (Intercept)      TotBill  
##      0.9203       0.1050
# You can omit the 'data =' argument if you prefer by referring to the columns directly with the $ operator. This will produce the same coefficients:

model <- lm(tips$Tip ~ tips$TotBill)

Note: The general form of the call is lm(y variable ~ x variable, data = the dataset).

Printing out the model itself gives you the intercept and slope. The “TotBill” coefficient, i.e., the slope, tells us that for every additional dollar that a meal costs, the waiter can expect about 10.5 cents more in tip. The intercept theoretically tells us that for a bill of $0, the waiter should expect 92 cents in tip. (Note that a $0 bill is not realistic, so interpreting the model at 0 doesn’t make much sense.)
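
The slope interpretation can be checked numerically. The sketch below fits a line to a made-up four-row data frame (‘toy’ and its values are assumptions for illustration, not the tips data) and confirms that raising the bill by one dollar changes the predicted tip by exactly the slope.

```r
# Hypothetical stand-in for the tips data (values are made up)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))
toy_model <- lm(Tip ~ TotBill, data = toy)

# coef() returns the intercept and slope as a named vector
b <- coef(toy_model)
intercept <- unname(b["(Intercept)"])
slope     <- unname(b["TotBill"])

# Predicted tips for a $25 bill and a $26 bill
pred_25 <- intercept + slope * 25
pred_26 <- intercept + slope * 26

# The two predictions differ by the slope
print(slope)
print(pred_26 - pred_25)
```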

The slope can also be calculated from the output of the ‘cor()’ function. By itself, the cor() function calculates the correlation coefficient between two vectors (shown below using the tip and total bill amounts). When we multiply that output by the ratio of the standard deviations of those two variables, we end up with the slope of the model.

# Getting our correlation between tip and total bill: 
tip_bill_corr <- cor(tips$Tip, tips$TotBill)
tip_bill_corr
## [1] 0.6757341
# Note that the ratio takes the sd of our y variable as the numerator, and the sd of our x variable as the denominator
tip_bill_corr * sd(tips$Tip) / sd(tips$TotBill)
## [1] 0.1050245
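
The intercept can be recovered from summary statistics in a similar way, because the least-squares line always passes through the point of averages. The sketch below checks both formulas against lm() on a made-up data frame (‘toy’ and its values are assumptions for illustration, not the tips data).

```r
# Hypothetical stand-in for the tips data (values are made up)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))

# Slope: correlation times sd(y) / sd(x)
slope_by_hand <- cor(toy$Tip, toy$TotBill) * sd(toy$Tip) / sd(toy$TotBill)

# Intercept: chosen so the line passes through (mean(x), mean(y))
intercept_by_hand <- mean(toy$Tip) - slope_by_hand * mean(toy$TotBill)

# Compare with the coefficients from lm()
toy_model <- lm(Tip ~ TotBill, data = toy)
print(c(intercept_by_hand, slope_by_hand))
print(coef(toy_model))
```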

If you are interested in adding this regression line to a plot of the data, you can use the abline() function, and just put the name of your model inside the parentheses.

plot(tips$TotBill, tips$Tip)
abline(model,
       col = "red",
       lwd = 2)

Recall from class that although switching the x and y variables results in the same correlation, this is not true for the regression line. In other words, we can use this model to predict tip from total bill, but not to predict total bill from tip. To predict total bill from tip, we need to fit a new model with the roles of the two variables swapped. This is illustrated below:

# Note that the correlation from switching the variables is the same:
cor(tips$Tip, tips$TotBill)
## [1] 0.6757341
cor(tips$TotBill, tips$Tip)
## [1] 0.6757341
# Creating our models
model <- lm(Tip ~ TotBill, data = tips)
model_inverted <- lm(TotBill ~ Tip, data = tips)

par(mfrow = c(1,2))

# Original model
plot(tips$TotBill, tips$Tip,
       xlab = "Total Bill",
       ylab = "Tips")
abline(model,
       col = "red",
       lwd = 2)

# Inverted model: note that the axes are swapped
plot(tips$Tip, tips$TotBill,
       xlab = "Tips",
       ylab = "Total Bill")
abline(model_inverted,
       col = "blue",
       lwd = 2)

# Return the plots window to normal
par(mfrow = c(1,1))
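
A common mistake is to assume the inverted line is just the original line solved for x, which would make its slope 1/slope. The sketch below, on a made-up data frame (‘toy’ and its values are assumptions for illustration, not the tips data), shows this is not the case: the two slopes multiply to the squared correlation rather than to 1.

```r
# Hypothetical stand-in for the tips data (values are made up)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))

slope_tip_on_bill <- unname(coef(lm(Tip ~ TotBill, data = toy))["TotBill"])
slope_bill_on_tip <- unname(coef(lm(TotBill ~ Tip, data = toy))["Tip"])

# Solving the first line for TotBill would give a slope of 1 / slope_tip_on_bill
print(1 / slope_tip_on_bill)
# The regression of TotBill on Tip gives a different slope
print(slope_bill_on_tip)

# The product of the two slopes equals the squared correlation
print(slope_tip_on_bill * slope_bill_on_tip)
print(cor(toy$Tip, toy$TotBill)^2)
```

The two lines would coincide only if the correlation were exactly 1 or -1.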

Example Problems:

A. Add a column to the tips dataset, called Tip_per_person, which is the amount of tip per party member (Tip / Size).

B. Make a new dataset that only has the individuals who dined at night.

C. Make a scatter plot of Tip_per_person (y) vs. total bill (x) for the individuals who dined at night.

D. Now add a regression line to the plot.

Now let’s shift back to looking at the entire dataset.

E. Suppose a table is 1 standard deviation above average in terms of total bill. How many dollars above average in terms of tip would you expect it to be?

  • Hint: Refer to slide 10 on the Regression lecture notes

F. Suppose a table is $2 above the average tip. How many dollars above the average total bill would you expect it to be?

  • Hint: This is the same process as found on slide 6 of the Regression lecture notes.

G. Suppose a table is $10 above the average total bill. What would we expect the tip to be? Find a solution to this with and without using the “model” object we created above.

Solutions:

# Part A
tips$Tip_per_person <- tips$Tip / tips$Size

# Part B
night_owls <- tips[tips$Time == "Night",]

# Part C
plot(x = night_owls$TotBill, y = night_owls$Tip_per_person,
     xlab = "Total Bill Amount", ylab = "Tip per Person", 
     main = "People Dining at Night Scatterplot")

# Part D
night_model <- lm(night_owls$Tip_per_person ~ night_owls$TotBill)
abline(night_model)

# Part E
cor(tips$TotBill, tips$Tip) * sd(tips$Tip)
## [1] 0.9349715
# Part F
Zx <- 2 / sd(tips$Tip)
Zy <- Zx * cor(tips$Tip,tips$TotBill)
Zy * sd(tips$TotBill)
## [1] 8.695428
# Part G (Three different methods)
  # By hand:
Zx <- 10 / sd(tips$TotBill)
Zy <- Zx * cor(tips$TotBill,tips$Tip)
(y <- mean(tips$Tip) + Zy * sd(tips$Tip))
## [1] 4.048524
  # Using the model:
model$coefficients[1] + model$coefficients[2]*(mean(tips$TotBill)+10)
## (Intercept) 
##    4.048524
  # Using the rounded coefficients from the printed model (the rounding explains the small difference):
0.9203 + 0.1050 * (mean(tips$TotBill) + 10)
## [1] 4.047824
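
A fourth option, not used above, is base R's predict() function, which applies the fitted coefficients to new data for you. The sketch below uses a made-up data frame (‘toy’ and its values are assumptions for illustration, not the tips data). Note that predict() can only match new values to the model when the model was fit with the ‘data =’ form of lm(), not with the $ form.

```r
# Hypothetical stand-in for the tips data (values are made up)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))
toy_model <- lm(Tip ~ TotBill, data = toy)

# A table whose total bill is $10 above the average
new_table <- data.frame(TotBill = mean(toy$TotBill) + 10)

# The column name in 'newdata' must match the predictor name in the formula
pred <- unname(predict(toy_model, newdata = new_table))
print(pred)

# Same prediction from the coefficients directly
by_hand <- unname(coef(toy_model)[1] + coef(toy_model)[2] * new_table$TotBill)
print(by_hand)
```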