Objectives

In today’s lab we will:

  1. Manipulate datasets
  2. Learn how to construct a scatter plot
  3. Explore linear regression and correlation functions
  4. Practice performing exploratory data analysis

Let’s start by loading the tips dataset from the course website. The HTML documentation for the dataset can be found here: https://iowabiostat.github.io/data-sets/tips/tips.html

Manipulating Columns

Using the $ operator, you can add a new variable to the dataset by assigning to a new name with the ‘data$new_variable’ syntax. The new values can come from a vector you supply or from other variables already in the dataset.
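
As a minimal sketch of both approaches, the example below uses a small made-up data frame. The name ‘bills’, its values, and the new columns ‘Server’ and ‘PreTip’ are hypothetical and are not part of the tips dataset.

```r
# Hypothetical toy data frame standing in for a dataset such as tips
bills <- data.frame(TotBill = c(10.00, 25.50, 40.00),
                    Tip     = c(1.50, 4.00, 6.00))

# Add a new variable from a vector (one value per row, in row order)
bills$Server <- c("A", "B", "A")

# Add a new variable computed from existing columns
bills$PreTip <- bills$TotBill - bills$Tip

bills
```

The vector you assign needs one value per row (or a length that recycles evenly), otherwise R will stop with an error.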

Recall from a previous lab that operations can be applied to entire vectors at once, so a new column can be computed directly from existing columns. In the United States, tips are usually based on a percentage of the total bill amount. We can calculate the percent tipped for each bill in the following way.

(We will assume that “total bill” in this case includes the tip.)

tips$tip_perc <- (tips$Tip/tips$TotBill)*100
summary(tips$tip_perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.564  12.913  15.477  16.080  19.148  71.034

Scatter Plots

As discussed in lecture, when we are interested in visualizing the connection between two continuous variables, we can use a scatter plot. This is done using the following function:

plot(x = tips$TotBill, y = tips$Tip)

Looking at this plot, we can see a positive association between the variables, along with a good amount of variation. In other words, it seems that as the bill increases, the tip does as well. We can use a linear model to find the equation of the straight line that best fits the points in this scatterplot. We can then use that model to add the line directly onto our scatterplot.

Constructing Linear Models

In R, the function we will use is called ‘lm’, which stands for “linear model”. Regression is useful because it can be generalized to all kinds of settings through the notion of a model, as alluded to by its name in R.

We can make a general linear model in R by doing: lm(y_variable ~ x_variable, data = our_dataset).

For this example, we will create a linear model corresponding to the scatterplot we made. Our model will be using the tip amount as the outcome of interest (y) and TotBill as a predictor (x).

model <- lm(Tip ~ TotBill, data = tips)
model
## 
## Call:
## lm(formula = Tip ~ TotBill, data = tips)
## 
## Coefficients:
## (Intercept)      TotBill  
##      0.9203       0.1050
# You can omit the 'data =' argument if you prefer by feeding the data directly into the vectors using the $ operator. This will produce the same result:

model <- lm(tips$Tip ~ tips$TotBill)
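
The two forms give the same coefficients, but they behave differently if you later want predictions at new x values: predict() matches the columns of ‘newdata’ to the variable names in the formula, which generally works only for the ‘data =’ form. The sketch below illustrates this with R’s built-in ‘cars’ dataset rather than tips; the object names are hypothetical.

```r
# Built-in 'cars' data: speed (x) and stopping distance (y)
fit_data   <- lm(dist ~ speed, data = cars)   # 'data =' form
fit_dollar <- lm(cars$dist ~ cars$speed)      # '$' form

# Same estimates either way (only the coefficient labels differ)
all.equal(unname(coef(fit_data)), unname(coef(fit_dollar)))

# Prediction at a new x value, using the 'data =' form
pred <- predict(fit_data, newdata = data.frame(speed = 10))
pred
```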

Interpretation

Printing out the model itself gives you the intercept and slope. The “TotBill” coefficient, i.e., the slope, tells us that for every additional dollar that a meal costs, the waiter can expect about 10.5 cents more in tip. The intercept theoretically tells us that for a bill of $0, the waiter should expect 92 cents in tip. (Note that a $0 bill is not possible, so evaluating the line at 0 doesn’t make much sense.)
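
To make the interpretation concrete, here is a small sketch that plugs the rounded coefficients from the printed output into the equation of the line. The helper name ‘predicted_tip’ is hypothetical.

```r
# Coefficients copied from the printed model output above (rounded)
intercept <- 0.9203
slope     <- 0.1050

# Hypothetical helper: predicted tip (in dollars) for a given total bill
predicted_tip <- function(total_bill) intercept + slope * total_bill

predicted_tip(20)                      # predicted tip for a $20 bill
predicted_tip(21) - predicted_tip(20)  # one extra dollar adds the slope
```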

Correlation and Slope Relationship

We can also calculate the slope using the output from the ‘cor()’ function. By itself, the cor() function calculates the correlation coefficient between two vectors (shown below using the tip and total bill amounts). When we multiply that output by the ratio of the standard deviations of those two variables, we end up with the slope of the model.

# Getting our correlation between tip and total bill: 
tip_bill_corr <- cor(tips$Tip, tips$TotBill)
tip_bill_corr
## [1] 0.6757341
# Note that the ratio takes the sd of our y variable as the numerator, and the sd of our x variable as the denominator
tip_bill_corr * sd(tips$Tip) / sd(tips$TotBill)
## [1] 0.1050245

Notice how this is exactly equal to the slope that we found from the linear model above.
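
This relationship is not specific to the tips data. As a quick check, the sketch below repeats the calculation on R’s built-in ‘cars’ dataset (an assumption made only so the example runs on its own) and also recovers the intercept, using the fact that the least-squares line passes through the point of means.

```r
# Built-in 'cars' data: speed (x) and stopping distance (y)
fit <- lm(dist ~ speed, data = cars)

slope_lm  <- unname(coef(fit)["speed"])
slope_cor <- cor(cars$speed, cars$dist) * sd(cars$dist) / sd(cars$speed)

# The line passes through (mean(x), mean(y)), which pins down the intercept
intercept_by_hand <- mean(cars$dist) - slope_cor * mean(cars$speed)

all.equal(slope_lm, slope_cor)
all.equal(unname(coef(fit)["(Intercept)"]), intercept_by_hand)
```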

It can be helpful to add your regression line directly to the scatterplot. You can use the abline() function, and just put the name of your model inside the parentheses.

plot(tips$TotBill, tips$Tip)
abline(model,
       col = "red",
       lwd = 2)

Recall from class that although switching the x and y variables results in the same correlation, this is not true for the regression line. In other words, we can use this model to predict tip from total bill, but we can’t use it to predict total bill from tip. If we wanted to use tip to predict total bill, we would need to fit a new model with the roles of the two variables swapped. This is illustrated below:

# Note that the correlation from switching the variables is the same:
cor(tips$Tip, tips$TotBill)
## [1] 0.6757341
cor(tips$TotBill, tips$Tip)
## [1] 0.6757341
# Creating our models
model <- lm(Tip ~ TotBill, data = tips) # original model y ~ x
model_inverted <- lm(TotBill ~ Tip, data = tips) # switching to x ~ y

par(mfrow = c(1,2))

# Original model
plot(tips$TotBill, tips$Tip,
       xlab = "Total Bill",
       ylab = "Tips")
abline(model,
       col = "red",
       lwd = 2)

# Inverted model- note that the axes are swapped
plot(tips$Tip, tips$TotBill,
       xlab = "Tips",
       ylab = "Total Bill")
abline(model_inverted,
       col = "blue",
       lwd = 2)

# Return the plots window to normal
par(mfrow = c(1,1))
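
To see numerically why the two lines differ, the sketch below uses R’s built-in ‘cars’ dataset (chosen only so the example is self-contained; the object names are hypothetical). If the second regression were just the first line solved for x, its slope would be the reciprocal of the first slope. Instead, the two slopes multiply to the squared correlation.

```r
fit_yx <- lm(dist ~ speed, data = cars)   # y ~ x
fit_xy <- lm(speed ~ dist, data = cars)   # x ~ y

b_yx <- unname(coef(fit_yx)["speed"])
b_xy <- unname(coef(fit_xy)["dist"])

# Algebraically inverting the first line would give slope 1 / b_yx,
# which is not what the second regression estimates
c(inverted_line = 1 / b_yx, regression_x_on_y = b_xy)

# The two regression slopes multiply to the squared correlation
all.equal(b_yx * b_xy, cor(cars$speed, cars$dist)^2)
```

The two slopes would be reciprocals of each other only if the correlation were exactly 1 or -1.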

Exploring Possible Relationships

We will now try to answer more interesting questions about tipping behavior, since it is no surprise that the tip amount increases with the total bill. To make comparisons between groups fairer, we will use the tip_perc variable created earlier, which gives the tip as a percentage of the bill.

Example 1

The first question we will try to answer is whether the total bill affects the percentage that people tip. Do those with smaller bills tip proportionally more or less than those with larger ones?

cor(tips$TotBill, tips$tip_perc)
## [1] -0.3386241
model2 <- lm(tip_perc ~ TotBill, data = tips)
model2
## 
## Call:
## lm(formula = tip_perc ~ TotBill, data = tips)
## 
## Coefficients:
## (Intercept)      TotBill  
##     20.6766      -0.2323
plot(tips$TotBill, tips$tip_perc,
     xlab = "Total Bill",
     ylab = "Tip Rate")
abline(model2,
       col = "red",
       lwd = 2)

Based on these results, what would you conclude about the strength of the association between the amount of the total bill and the tip rate?

Answer

There appears to be a weak negative association between tip percentage and total bill, as the correlation coefficient is -0.34. The regression coefficient also shows that for every $1 increase in the total bill, the tip percentage is expected to decrease by about 0.23 percentage points.
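
To get a feel for the size of this effect, the sketch below plugs two bill amounts into the fitted line, using the rounded coefficients printed above. The helper name ‘predicted_perc’ and the two bill amounts are hypothetical choices for illustration.

```r
# Coefficients copied from the printed model2 output above (rounded)
intercept <- 20.6766
slope     <- -0.2323

# Hypothetical helper: predicted tip percentage for a given total bill
predicted_perc <- function(total_bill) intercept + slope * total_bill

# Predicted tip percentage for a $10 bill and a $40 bill
predicted_perc(c(10, 40))
```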

Example 2

Another interesting question we could explore is the role of gender. Suppose that an equal number of men and women dine at the restaurant. Are men more likely to pick up the check than women? Does this depend on whether the meal is lunch or dinner?

We can see if men are more likely than women to pick up the check by simply comparing the proportions of men and women who paid the bill. We can look at a table that breaks down the proportion of males and females by using the table() and prop.table() functions we learned from Lab 2.

gender_tab <- table(tips$Sex) 
gender_tab # number of men/women
## 
##   F   M 
##  87 157
prop.table(gender_tab) # proportion of men and women
## 
##         F         M 
## 0.3565574 0.6434426

We can see that men cover the bill about 2/3 of the time and women about 1/3 of the time. Now, we are wondering how this behavior might differ based on the time of day the meal happens. You could think about some hypotheses about how the time of day could change who pays:

For example, will men pick up the bill more frequently for dinner because of “conventional dating norms”? We can look at this question by making a contingency table that compares “Sex” to “Time” and we can further visualize the difference using a stacked barplot.

gender_vs_time <- table(tips$Sex, tips$Time)
prop.table(gender_vs_time, 2) # proportion of female/male given time of day
##    
##           Day     Night
##   F 0.5147059 0.2954545
##   M 0.4852941 0.7045455
barplot(gender_vs_time,
        col = c("cyan4", "coral3"),
        main ="Distribution of Bill Payment by Gender and Time of Day",
        ylab = "Frequency")
legend(x = "topleft", 
       legend = c("Male", "Female"), 
       title = "Gender", 
       fill = c("coral3", "cyan4"),
       cex = 0.8)

Based on these results, what would you conclude about our primary questions of interest? Are men more likely to pick up the check than women? Does this depend on whether the meal is lunch or dinner?

Answer

In our analysis, men picked up the bill about 64% of the time and women about 36% of the time when all meals are pooled together. When we investigated further, we saw that the time of day is related to who pays. If the meal was eaten during the day, it is close to a 50-50 split between men and women picking up the bill. However, at night, men are more likely to pick up the check. In our dataset, men picked up the bill more than 70% of the time when the meal was at night.
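
The second argument to prop.table() controls which totals the proportions are taken over, and it is easy to mix up. The sketch below shows the options side by side. Since it cannot load the tips data, it rebuilds the table from counts back-calculated from the proportions and totals printed above; treat those counts as an assumption rather than as values read from the data file.

```r
# Counts back-calculated from the printed proportions (an assumption)
gender_vs_time <- matrix(c(35, 33, 52, 124), nrow = 2,
                         dimnames = list(Sex  = c("F", "M"),
                                         Time = c("Day", "Night")))

prop.table(gender_vs_time, 2)  # each column sums to 1: who pays, given time of day
prop.table(gender_vs_time, 1)  # each row sums to 1: time of day, given who pays
prop.table(gender_vs_time)     # all four cells together sum to 1
```

For the question asked here (“does who pays depend on the time of day?”), the column proportions are the relevant ones, which is why the lab uses a margin of 2.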

Practice Problems

Further Data Exploration

For the following questions, use the methods we’ve learned about in previous labs to perform your own comparisons about different groups in the data and state your conclusion.

  1. Do smokers tip differently than nonsmokers?

Hint: Refer to Lab 4 to solve this problem. We have 1 continuous outcome and 1 categorical grouping variable.

Answer
by(tips$tip_perc, tips$Smoker, summary)
## tips$Smoker: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.68   13.69   15.56   15.93   18.50   29.20 
## ------------------------------------------------------------ 
## tips$Smoker: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.564  10.677  15.385  16.320  19.506  71.034
boxplot(tips$tip_perc ~ tips$Smoker,
        xlab = "Smoker in Party?",
        ylab = "Tip Rate")

After analyzing the table and side-by-side boxplots, it seems that smokers and non-smokers tip about the same percentage on their bills. They both had similar median tipping percentages around 15.5%. From the boxplot, we can see there is more variation in tipping behavior for smokers with two very high tips. Overall, I would conclude that smokers do not tip differently than nonsmokers.
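
When only one statistic per group is needed, a more compact alternative to by() is tapply(), which returns the results as a single named vector. The sketch below is illustrative only: it uses R’s built-in ‘ToothGrowth’ dataset (a continuous outcome ‘len’ and a two-level grouping variable ‘supp’) in place of the tips data.

```r
# Median of a continuous outcome within each level of a grouping variable
group_medians <- tapply(ToothGrowth$len, ToothGrowth$supp, median)
group_medians

# Difference between the two group medians
unname(diff(group_medians))
```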

  2. Does tipping behavior change at lunch versus dinner?

We can apply the same type of analysis to this question as to the question above. Our new categorical grouping variable is Time.

Answer
by(tips$tip_perc, tips$Time, summary)
## tips$Time: Day
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.296  13.915  15.408  16.413  19.392  26.631 
## ------------------------------------------------------------ 
## tips$Time: Night
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.564  12.319  15.540  15.952  18.821  71.034
boxplot(tips$tip_perc ~ tips$Time,
        xlab = "Time",
        ylab = "Tip Rate")

The results look very similar to the previous question. We see there are some extreme tip percentages at night, so we should examine the medians as they are more robust to outliers. The medians between meals during the day and meals at night look very similar (approximately 15.5% for both). The box plots look to be about the same shape and spread, besides the few outliers. It seems that there is little difference in tipping percentages based on the time of day the meal occurred.
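
To illustrate why the median is the safer summary here, the sketch below uses a handful of made-up tip rates (hypothetical values, not taken from the dataset) and adds one extreme value similar in size to the largest tip rate seen above.

```r
# Hypothetical tip rates (in percent)
tip_rates    <- c(12, 14, 15, 16, 18)
with_outlier <- c(tip_rates, 71)

c(mean = mean(tip_rates),    median = median(tip_rates))
c(mean = mean(with_outlier), median = median(with_outlier))
```

The single extreme value pulls the mean up by several percentage points, while the median barely moves.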

  3. Does tipping behavior differ by days of the week?

Again, this question can be completed by using a side by side boxplot and summary of the tipping percentages broken out by day.

Answer
by(tips$tip_perc, tips$Day, summary)
## tips$Day: Fri
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.36   13.37   15.56   16.99   19.66   26.35 
## ------------------------------------------------------------ 
## tips$Day: Sat
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.564  12.386  15.183  15.315  18.827  32.573 
## ------------------------------------------------------------ 
## tips$Day: Sun
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.945  11.998  16.110  16.690  18.789  71.034 
## ------------------------------------------------------------ 
## tips$Day: Thu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.296  13.821  15.385  16.128  19.269  26.631
boxplot(tips$tip_perc ~ tips$Day,
        xlab = "Day",
        ylab = "Tip Rate")

Overall, tipping behavior appears consistent across the 4 days measured in this dataset. The median tip rates are all about 15-16%, and the variability is similar across days as well. One note is that Sunday has two high outliers and Saturday also has a couple of higher tips. However, in general, we can conclude that tipping behavior does not differ much by day of the week.

Regression Review

  1. Suppose a table is 1 standard deviation above average in terms of total bill. How many dollars above average in terms of tip would you expect it to be?
  • Hint: Refer to slide 10 on the Regression lecture notes
Answer
cor(tips$TotBill, tips$Tip) * sd(tips$Tip)
## [1] 0.9349715
  2. Suppose a table is $2 above average tip. How many dollars above the average total bill would you expect it to be?
  • Hint: This is the same process as found on slide 6 of the Regression lecture notes.
Answer
Zx <- 2 / sd(tips$Tip)
Zy <- Zx * cor(tips$Tip,tips$TotBill)
Zy * sd(tips$TotBill)
## [1] 8.695428
  3. Suppose a table is $10 above the average total bill. What would we expect the tip to be? Find a solution to this with and without using the “model” object we created above.
Answer
# By hand:
Zx <- 10 / sd(tips$TotBill)
Zy <- Zx * cor(tips$TotBill,tips$Tip)
(y <- mean(tips$Tip) + Zy * sd(tips$Tip))
## [1] 4.048524
# Using the model:
model$coefficients[1] + model$coefficients[2]*(mean(tips$TotBill)+10)
## (Intercept) 
##    4.048524
# Using the raw numbers:
0.9203 + 0.1050 * (mean(tips$TotBill) + 10)
## [1] 4.047824
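
The by-hand and model-based answers agree because both describe the same line; the raw-number version differs slightly only because the printed coefficients are rounded. As a further sketch, the same comparison can be made with predict() instead of extracting coefficients manually. The example below uses R’s built-in ‘cars’ dataset and a hypothetical shift of 5 units so that it runs on its own.

```r
x <- cars$speed
y <- cars$dist
fit <- lm(dist ~ speed, data = cars)

delta <- 5  # how far above the average x value we are (hypothetical choice)

# By hand: convert to standard units, multiply by r, convert back
z_x <- delta / sd(x)
z_y <- z_x * cor(x, y)
by_hand <- mean(y) + z_y * sd(y)

# Using the model: predict at mean(x) + delta
from_model <- unname(predict(fit, newdata = data.frame(speed = mean(x) + delta)))

all.equal(by_hand, from_model)
```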