Practice Problems

1.

Recall the ‘tips’ dataset that we used in a previous lab. If we divide the tips up based on whether the customer was a smoker or non-smoker, the data in each group are fairly right-skewed. Using the methods we covered in this lab, perform a two-sample t-test of whether there is a significant difference in tips between smokers and non-smokers, and find the corresponding confidence interval.

library(ggplot2)

set.seed(58008)
icecream <- as.data.frame(matrix(c(rep("A", 40), rep("B", 60)), nrow = 100))
colnames(icecream) <- "group"
var1 <- rlnorm(40, meanlog = 0, sdlog = 1)
var2 <- rgamma(60, shape = 1, rate = 0.5)

icecream$cups <- c(var1, var2)

# Untransformed data: both groups appear right-skewed
ggplot(icecream, aes(x = cups)) + geom_histogram(fill = "pink", color = "black", bins = 20) + facet_wrap(~group)

# Log transformed data looks much more normal
ggplot(icecream, aes(x = log(cups))) + geom_histogram(fill = "pink", color = "black", bins = 20) + facet_wrap(~group)

# Standard deviations are very close; the test below still uses var.equal = FALSE (Welch), which does not assume equal variances
by(icecream$cups, icecream$group, sd)
## icecream$group: A
## [1] 1.513396
## ------------------------------------------------------------ 
## icecream$group: B
## [1] 1.462479
# Log transformed t-test:
icecream$logcups <- log(icecream$cups)
t.test(icecream$logcups ~ icecream$group, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  icecream$logcups by icecream$group
## t = 0.23729, df = 97.517, p-value = 0.8129
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -0.3502319  0.4453565
## sample estimates:
## mean in group A mean in group B 
##      0.13500594      0.08744364
# Confidence interval: exponentiate to return to the original scale.
# The result is an interval for the ratio of geometric means (group A / group B).
ci <- t.test(icecream$logcups ~ icecream$group, var.equal = FALSE)$conf.int
exp(ci)
## [1] 0.7045247 1.5610466
## attr(,"conf.level")
## [1] 0.95
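
To see what the exponentiated interval estimates, here is a minimal sketch on simulated data. The vectors `x` and `y` and the helper `geo_mean` are hypothetical and not part of the lab. It illustrates that exponentiating the difference in log-scale means gives the ratio of the two groups' geometric means, so an interval like `exp(ci)` is read as an interval for that ratio rather than for a difference on the original scale.

```r
# Hypothetical simulated data (not the lab's icecream or tips data)
set.seed(1)
x <- rlnorm(40, meanlog = 0,   sdlog = 1)
y <- rlnorm(60, meanlog = 0.2, sdlog = 1)

# Welch t-test on the log scale
fit <- t.test(log(x), log(y), var.equal = FALSE)

# Hypothetical helper: geometric mean of a positive vector
geo_mean <- function(v) exp(mean(log(v)))

# Ratio of geometric means, computed two ways
ratio_hat      <- geo_mean(x) / geo_mean(y)
ratio_from_fit <- exp(unname(fit$estimate[1] - fit$estimate[2]))

# Back-transformed confidence interval for the ratio
ratio_ci <- exp(as.numeric(fit$conf.int))

print(c(ratio_hat = ratio_hat, ratio_from_fit = ratio_from_fit))
print(ratio_ci)
```

An interval that contains 1 on this ratio scale corresponds to a log-scale interval that contains 0.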


2.

On the course website, the cystic fibrosis dataset contains paired data on the patient’s reduction in FVC for each treatment period for a drug vs. placebo. Conduct a Wilcoxon Signed-Rank Test to determine if there is a significant difference in reduction of FVC for drug vs. placebo.

fibrosis <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/cystic-fibrosis.txt")
# Making a histogram of the data: we note there is a potential outlier in the placebo group at around 1000
hist(fibrosis$Drug)

hist(fibrosis$Placebo)

# Conducting our Wilcoxon signed-rank test. Use the paired=TRUE argument because this data is paired
wilcox.test(fibrosis$Drug, fibrosis$Placebo, paired=TRUE)
## 
##  Wilcoxon signed rank exact test
## 
## data:  fibrosis$Drug and fibrosis$Placebo
## V = 19, p-value = 0.03528
## alternative hypothesis: true location shift is not equal to 0
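
As a rough check on what `V` measures, the sketch below recomputes the signed-rank statistic by hand on made-up paired values. The vectors `drug` and `placebo` here are hypothetical and are not the cystic fibrosis data. The sketch assumes there are no zero differences and no tied absolute differences.

```r
# Hypothetical paired measurements (not the cystic fibrosis data)
drug    <- c(121, 94, 153, 110, 87, 142, 105, 138)
placebo <- c(100, 99, 121,  82, 96, 101, 102,  90)

# Paired differences and the ranks of their absolute values
d     <- drug - placebo
ranks <- rank(abs(d))

# V is the sum of the ranks belonging to the positive differences
v_manual <- sum(ranks[d > 0])

# Same statistic as reported by wilcox.test with paired = TRUE
fit <- wilcox.test(drug, placebo, paired = TRUE)
v_r <- unname(fit$statistic)

print(c(manual = v_manual, wilcox.test = v_r))
```

A value of `V` near zero or near its maximum of n(n + 1)/2 means the differences mostly share one sign, which is evidence against the null hypothesis of no shift.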


3.

Previously we looked at the nhanes data containing the heights and weights of men. For this problem, we will look at the nhanes data containing the heights and weights of women and perform a similar analysis. Using the nhanes-aw data from the website, find the Spearman’s rank correlation between height and weight.

nhanes_woman <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/nhanes-aw.txt")

# Plotting the data to see if there are any obvious outliers
plot(nhanes_woman$Weight~nhanes_woman$Height)

# Spearman's Rank Correlation
cor.test(nhanes_woman$Height, nhanes_woman$Weight, method = "spearman")
## Warning in cor.test.default(nhanes_woman$Height, nhanes_woman$Weight, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  nhanes_woman$Height and nhanes_woman$Weight
## S = 2169597258, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2996993
# For comparison, here's what we got using Pearson's correlation:
cor.test(nhanes_woman$Height, nhanes_woman$Weight, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_woman$Height and nhanes_woman$Weight
## t = 15.897, df = 2647, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2600585 0.3295970
## sample estimates:
##       cor 
## 0.2952187
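
The link between the two methods can be sketched on simulated data: Spearman's rho is Pearson's correlation computed on the ranks of each variable. The vectors `height` and `weight` below are hypothetical simulated values, not the nhanes data, and are rounded so that ties occur as they do in the real data.

```r
# Hypothetical simulated data with ties (not the nhanes data)
set.seed(2)
height <- round(rnorm(200, mean = 162, sd = 7))
weight <- round(55 + 0.4 * (height - 162) + rnorm(200, sd = 12))

# Spearman's rho from the built-in method
rho_builtin <- cor(height, weight, method = "spearman")

# Pearson's correlation applied to the (average-tie) ranks
rho_ranks <- cor(rank(height), rank(weight), method = "pearson")

print(c(builtin = rho_builtin, from_ranks = rho_ranks))
```

Because only the ranks enter the calculation, a single extreme weight or height moves Spearman's rho much less than it moves Pearson's correlation.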