Last week in lab, we began analyzing the Infant Diarrhea study. In this lab, we will further analyze that data set using what we now know about outliers, transforming data, and non-parametric testing procedures.

Examining the data

To start, let’s examine the distribution of the Infant Diarrhea data by group. Notice that the data are right-skewed with several outliers.

diarrhea <- read.delim("http://myweb.uiowa.edu/pbreheny/data/diarrhea.txt")
boxplot(diarrhea$Stool ~ diarrhea$Group, main = "Effect of Bismuth Salicylate",
        horizontal = TRUE, col = c("lightblue", "pink"))

The mean and standard error are heavily influenced by outliers; therefore, the two-sample t-test may be inadequate for analyzing these data.
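
One quick numerical check of this (using the diarrhea data frame loaded above) is to compare the mean and median within each group; a mean sitting well above the median is another sign of right skew and influential outliers.

# Compare summary statistics by group; note how far the mean sits above the median
by(diarrhea$Stool, diarrhea$Group, summary)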

Transformations

Recall the distribution of the Odds Ratio, and how it was right-skewed. We 'fixed' this by transforming it to a new statistic (the Log Odds Ratio), which is approximately normally distributed. The same idea can be applied to right-skewed data. (Note: If you've switched computers during the semester, you may have to install the ggplot2 package prior to running the code.)

diarrhea$logStool <- log(diarrhea$Stool)
library(ggplot2)

#Original
ggplot(diarrhea, aes(x = Stool)) + geom_histogram(fill = "tomato3", color = "black", bins = 20) + facet_wrap(~Group)

#Log-transformed
ggplot(diarrhea, aes(x = logStool)) + geom_histogram(fill = "tomato3", color = "black", bins = 20) + facet_wrap(~Group)

We can see that the distribution of logStool is still right-skewed; however, it is not nearly as severe as before. Now we can perform a two-sample t-test and obtain a more powerful result. Compare this with the t-test on the original data.

# Original
t.test(diarrhea$Stool~diarrhea$Group,var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  diarrhea$Stool by diarrhea$Group
## t = 2.245, df = 167, p-value = 0.02608
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
##    9.457362 147.396700
## sample estimates:
##   mean in group Control mean in group Treatment 
##                260.2976                181.8706
# log-transformed
t.test(diarrhea$logStool~diarrhea$Group,var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  diarrhea$logStool by diarrhea$Group
## t = 2.82, df = 167, p-value = 0.005383
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
##  0.1025092 0.5810823
## sample estimates:
##   mean in group Control mean in group Treatment 
##                5.212370                4.870574

Note: The confidence bounds provided are on the log scale. In order to obtain a more interpretable interval, we need to exponentiate them.


Exponentiating the Confidence Interval

Using our previous t-test, we can extract the confidence interval with $conf.int, store it as a variable, and then exponentiate.

logCI <- t.test(diarrhea$logStool~diarrhea$Group,var.equal=TRUE)$conf.int
exp(logCI)

Note: When we’re working on the log scale, we’re now thinking about the ratio between the two groups, since log(a)-log(b) = log(a/b).

By taking the difference of the log means, we are actually calculating the log of the ratio between the two groups. We can exponentiate this result for interpretation.

by(diarrhea$logStool, diarrhea$Group, mean)
## diarrhea$Group: Control
## [1] 5.21237
## ------------------------------------------------------------ 
## diarrhea$Group: Treatment
## [1] 4.870574
estimate <- 5.2124-4.8706
exp(estimate)

How would you interpret this estimate?
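
As an aside, the same estimate can be computed without retyping rounded values; here is a small sketch using the logStool variable created earlier:

# Group means on the log scale, their difference, then exponentiate
logmeans <- tapply(diarrhea$logStool, diarrhea$Group, mean)
exp(logmeans["Control"] - logmeans["Treatment"])

Since exponentiating a mean of logs gives a geometric mean, this estimate is the ratio of the two groups' geometric means.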

Non-parametric tests

When the normality assumption is violated, another way to analyze the data is with a non-parametric test. These 'rank-based' methods make no assumptions about the underlying distribution and are robust to outliers, which makes them a powerful option whenever distributional assumptions are questionable.
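
To see why ranking tames outliers, consider a toy example: making the largest observation far more extreme does not change its rank at all.

# Toy illustration: the ranks are identical whether the largest value is 50 or 5000
rank(c(3, 7, 12, 20, 50))
rank(c(3, 7, 12, 20, 5000))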

Wilcoxon Rank Sum Test

Refer back to the Infant Diarrhea study. Instead of transforming the data or discarding outliers, we can use the Wilcoxon Rank Sum Test to test whether the treatment and control groups have different stool output values. Ranking the data minimizes the impact of outliers, and removes assumptions about the underlying distribution of the data.

# Rank-Sum Test
wilcox.test(diarrhea$Stool~diarrhea$Group)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  diarrhea$Stool by diarrhea$Group
## W = 4452, p-value = 0.005573
## alternative hypothesis: true location shift is not equal to 0

How would you interpret this result?
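
As an optional aside, here is a rough sketch of where the W statistic above comes from: rank all of the observations together, add up the ranks in the first (Control) group, and subtract the smallest possible rank sum for a group of that size.

# Optional check: rebuild the W statistic from the ranks
r  <- rank(diarrhea$Stool)                 # rank all observations together
n1 <- sum(diarrhea$Group == "Control")     # size of the first (Control) group
sum(r[diarrhea$Group == "Control"]) - n1 * (n1 + 1) / 2

This should reproduce the W reported above (ties are handled with average ranks, just as in wilcox.test).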

Wilcoxon Signed-Rank Test

If the data we are analyzing are paired, the signed-rank test is a non-parametric procedure that takes into account the relationship within each pair. Let's revisit the familiar paired data set from the Oatbran study.

# Signed Rank Test
oatbran <- read.delim("http://myweb.uiowa.edu/pbreheny/data/oatbran.txt")
wilcox.test(oatbran$CornFlakes, oatbran$OatBran, paired=TRUE)
## 
##  Wilcoxon signed rank exact test
## 
## data:  oatbran$CornFlakes and oatbran$OatBran
## V = 93, p-value = 0.008545
## alternative hypothesis: true location shift is not equal to 0
# For comparison, in case you don't remember:
t.test(oatbran$CornFlakes, oatbran$OatBran, paired=TRUE)
## 
##  Paired t-test
## 
## data:  oatbran$CornFlakes and oatbran$OatBran
## t = 3.3444, df = 13, p-value = 0.005278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1284606 0.5972537
## sample estimates:
## mean of the differences 
##               0.3628571

Describe the difference between the two results. What conclusions can you draw?
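
Incidentally, the paired signed-rank test is equivalent to a one-sample signed-rank test applied to the within-pair differences, which mirrors how the paired t-test works; a quick sketch with the oatbran data:

# Equivalent form: a one-sample signed-rank test on the within-pair differences
wilcox.test(oatbran$CornFlakes - oatbran$OatBran)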

Spearman’s Rank Correlation

Another non-parametric option for continuous data is Spearman's Rank Correlation. This test ranks the values of both variables before computing a correlation coefficient. Let's revisit the nhanes data for men. Remember that the two variables, height and weight, are both continuous. When we plot the data, we can see a clear outlier: a man weighing over 800 pounds.

nhanes <- read.delim("http://myweb.uiowa.edu/pbreheny/data/nhanes-am.txt")
plot(nhanes$Weight~nhanes$Height)

Let's see how much this outlier influences our results by comparing Spearman's Rank Correlation with the Pearson correlation we used previously.

#Spearman's Rank Correlation
cor.test(nhanes$Height, nhanes$Weight, method = "spearman")

#For comparison, here's what we got using Pearson's correlation:
cor.test(nhanes$Height, nhanes$Weight, method = "pearson")

How much of an influence did the outlier have?
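
As a sanity check, Spearman's rank correlation is simply Pearson's correlation computed on the ranked values, so the Spearman estimate above can be reproduced by hand:

# Spearman's rho by hand: Pearson's correlation of the ranks
# (drop any incomplete pairs first, as cor.test does)
ok <- complete.cases(nhanes$Height, nhanes$Weight)
cor(rank(nhanes$Height[ok]), rank(nhanes$Weight[ok]))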

Parametric advantages: When the assumptions hold, the parametric tests are more powerful and construction of the confidence intervals is straightforward.

Non-parametric advantages: There are minimal assumptions, and they are more powerful when parametric assumptions are invalid.

In summary, when you have continuous data, don’t automatically use a t-test. Look at the data, and if it’s skewed or contains large outliers, consider a transformation or a non-parametric option.


Practice Problem

1. Transforming Data

Suppose we have a data set with 100 measurements of how many cups of ice cream individuals eat every day. The individuals were separated into two groups, "A" and "B". We want to know whether there is a significant difference in ice cream consumption between the two groups. The data set icecream is generated by running the code below.

If we plot the data as-is, we see that the measurements are highly right-skewed in both groups. Using the methods we covered in this lab, perform a two-sample t-test to determine whether there is a significant difference in ice cream consumption between groups "A" and "B", and find the corresponding confidence interval.

set.seed(58008)
icecream <- data.frame(group = c(rep("A", 40), rep("B", 60)))
var1 <- rlnorm(40, meanlog = 0, sdlog = 1)
var2 <- rgamma(60, shape = 1, rate = 0.5)

icecream$cups <- c(var1, var2)

# Original data: both groups appear right-skewed
ggplot(icecream, aes(x = cups)) + geom_histogram(fill = "pink", color = "black", bins = 20) + facet_wrap(~group)


2. Wilcoxon Signed-Rank Test

On the course website, the cystic fibrosis data set contains paired data on each patient's reduction in FVC during the drug treatment period and the placebo period. Create a histogram for the drug and placebo groups and note whether there are any outliers. Conduct a Wilcoxon Signed-Rank Test to determine if there is a significant difference in the reduction of FVC for drug vs. placebo.

fibrosis <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/cystic-fibrosis.txt")


3. Spearman’s Rank Correlation

Previously, we looked at the nhanes data containing the heights and weights of men. For this problem, we will look at the nhanes data for women and perform a similar analysis. Using the nhanes-aw data from the website, find Spearman's rank correlation between height and weight.

nhanes_woman <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/nhanes-aw.txt")