Objectives

  1. Practice applying (log) transformations to skewed data

  2. Introduce nonparametric testing procedures

Last week in lab, we began analyzing the Infant Diarrhea study. In this lab, we will further analyze that data set using what we now know about outliers, transforming data, and nonparametric testing procedures.

Examine the data

Recall the data we worked with last week.

Researchers in Peru conducted a double-blind randomized controlled trial, published in The New England Journal of Medicine, to determine whether bismuth salicylate would improve the outcomes of infants suffering from diarrhea.

In their study, all infants received the standard therapy for diarrhea: oral re-hydration. In addition to the rehydration, 85 babies received bismuth salicylate, while 84 babies received a placebo. The total stool volumes for all infants over the course of their illness were measured. To adjust for body size, the researchers divided by body weight to obtain their outcome of interest: stool output per kilogram of body weight. The results of their study are available on the course website.

Let’s examine the distribution of the Infant Diarrhea data by group to visualize our data.

diarrhea <- read.delim("https://raw.githubusercontent.com/IowaBiostat/data-sets/main/diarrhea/diarrhea.txt")
boxplot(diarrhea$Stool~diarrhea$Group, main = "Effect of Bismuth Salicylate", 
        xlab = "Stool Output (mL/kg)", ylab = "Group", horizontal = TRUE, col = c("lightblue", "pink"))

What is your outcome of interest? What type of data is it (continuous vs discrete/binary)? How many groups do you have? If you have multiple groups, are they paired groups or independent groups? What type of test do we have in our toolbox that we could use to analyze this type of data? Are there any concerns you have with using this test after inspecting the distribution of the data?

Answer

Our outcome of interest is total stool volume per kg of body weight. This is a continuous outcome. We have two INDEPENDENT groups: a treatment group with 85 babies and a control group with 84 babies. When we have two independent samples with a continuous outcome of interest, we typically use a two-sample t-test (either Student’s t-test or Welch’s t-test). After visualizing the data, I see both groups are right-skewed with several outliers. We know the mean and standard deviation are heavily influenced by outliers. Therefore, the two-sample t-test may be inadequate for analyzing this data. We should apply a transformation to the data or use a nonparametric approach.

Transformations

Recall the distribution of the odds ratio, and how it was right-skewed. We ‘fixed’ this by transforming it to a new statistic (the log odds ratio), which is approximately normally distributed. The same idea can be used for other right-skewed data. We can apply the log transformation and re-examine the distribution of our diarrhea data to see if it becomes approximately normal.

Recall: Any time we write “log”, we are referring to the natural log (ln)
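
A quick check in R (a small illustrative snippet, not part of the lab analysis) confirms that log() uses base e by default; other bases are available through the base argument or shortcuts like log10():

```r
log(exp(1))          # natural log by default: returns 1
log(100, base = 10)  # base-10 log: returns 2
log10(100)           # equivalent shortcut: returns 2
```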

diarrhea$logStool <- log(diarrhea$Stool)

boxplot(diarrhea$logStool~diarrhea$Group, main = "Effect of Bismuth Salicylate", 
        xlab = "Log(Stool Output)", ylab = "Group", horizontal = TRUE, col = c("lightblue", "pink"))

We can also look at histograms of the data to get a better sense of normality/skewness.

library(ggplot2)

#Original
ggplot(diarrhea, aes(x = Stool)) + geom_histogram(fill = "tomato3", color = "black", bins = 20) + facet_wrap(~Group)

#Log-transformed
ggplot(diarrhea, aes(x = log(Stool))) + geom_histogram(fill = "tomato3", color = "black", bins = 20) + facet_wrap(~Group)

We can see that the distribution of log(Stool) is still right-skewed; however, the skew is much less severe than before. We can perform a two-sample t-test on log(Stool), where the normality assumption is more reasonable, and achieve a more powerful result. Compare this with the results from the t-test using the original data.

# Original
t.test(diarrhea$Stool ~ diarrhea$Group, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  diarrhea$Stool by diarrhea$Group
## t = 2.245, df = 167, p-value = 0.02608
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
##    9.457362 147.396700
## sample estimates:
##   mean in group Control mean in group Treatment 
##                260.2976                181.8706
# log-transformed
t.test(diarrhea$logStool ~ diarrhea$Group, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  diarrhea$logStool by diarrhea$Group
## t = 2.82, df = 167, p-value = 0.005383
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
##  0.1025092 0.5810823
## sample estimates:
##   mean in group Control mean in group Treatment 
##                5.212370                4.870574

Note: The confidence bounds provided are on the log scale. In order to obtain a more interpretable interval, we need to exponentiate them (which we will do in the next section).


Log Ratios and Confidence Intervals

Two sample (continuous)

Note: When we’re working on the log scale, we’re now thinking about the ratio between the two groups, since log(a)-log(b) = log(a/b).

By taking the difference of the log means, we are actually calculating the log ratio of the two groups. We can exponentiate this result for interpretation.
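
A small numeric sanity check (made-up values, not study data) shows why exponentiating a difference of logs recovers a ratio:

```r
a <- 6
b <- 2
log(a) - log(b)       # same as log(a/b) = log(3)
exp(log(a) - log(b))  # recovers the ratio a/b = 3
```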

In our example, we can find the difference in the mean log_stool output of the treatment and control groups and then exponentiate that difference to get an estimate of the ratio of stool outputs between the two groups.

by(diarrhea$logStool, diarrhea$Group, mean)
## diarrhea$Group: Control
## [1] 5.21237
## ------------------------------------------------------------ 
## diarrhea$Group: Treatment
## [1] 4.870574
estimate <- 5.2124-4.8706 # difference in mean(log_stool)
exp(estimate) # exponentiate to get the estimated ratio of stool outputs
## [1] 1.407479

How would you interpret this estimate?

Answer

The mean stool output in the control group is approximately 1.41 times higher than in the treatment group. In other words, the control group has an approximately 41% higher mean stool output compared to the treatment group.

Using our previous t-test, we can extract the confidence interval using “conf.int” and store the result as a variable before exponentiating it to get the confidence interval for the ratio of mean stool outputs on the original scale.

logCI <- t.test(diarrhea$logStool~diarrhea$Group,var.equal=TRUE)$conf.int
exp(logCI)
## [1] 1.107948 1.787973
## attr(,"conf.level")
## [1] 0.95

One sample (continuous)

Let’s find a confidence interval for the mean stool output for just the treatment group. We want a confidence interval for a one-sample continuous outcome, so we will use the formula

\(\bar{x} \pm t_{\alpha/2, df} * SE\)

All of these estimates should use the log-transformed data for our example

In our example, \(\bar{x}\) is the mean of the log_stool for the treatment group, s is the standard deviation of the log_stool samples in the treatment group, \(SE = s/\sqrt{n}\), and \(t_{\alpha/2, df}\) is the critical value that captures the central 95% of the t distribution when we have n-1 df. After we find this confidence interval, we will exponentiate it to find the confidence interval for the stool output (not on the log scale).

First we need to get the log_stool data for just the treatment group:

log_stool_treat <- diarrhea$logStool[diarrhea$Group == "Treatment"]

Now we can find \(\bar{x}\), s, SE, and t.

(x_bar <- mean(log_stool_treat))
## [1] 4.870574
(s <- sd(log_stool_treat))
## [1] 0.7448087
(n <- length(log_stool_treat)) # length() gives us n
## [1] 85
(SE <- s/sqrt(n))
## [1] 0.08078585
(df <- n - 1)
## [1] 84
(t_df <- qt(0.025, df))
## [1] -1.98861

So our 95% CI for the mean log_stool is

\(4.87 \pm 1.99 * 0.0808\)

(log_CI <- x_bar + c(t_df, -t_df) * SE) # CI for LOG_stool, still need to exponentiate it
## [1] 4.709923 5.031226

Therefore the 95% CI for the stool output in ml/kg for the treatment group is:

\((e^{4.71}, e^{5.03})\)

(CI <- exp(log_CI))
## [1] 111.0436 153.1206

How would you interpret this 95% confidence interval?

Answer We are 95% confident that the true mean stool output for infants in the bismuth treatment group is between 111 ml/kg and 153 ml/kg.


Nonparametric tests

When the normality assumption is violated, another way to analyze the data is with a nonparametric test. These ‘rank-based’ methods do not require a distributional assumption (such as normality) and are robust to the presence of outliers, which makes them a powerful option when distributional assumptions are questionable. There are several nonparametric methods we could use, but the following are the ones we will be covering in this course.

  • Two-sample studies: Mann-Whitney U Test also known as the Wilcoxon Rank Sum Test
  • One-sample (or paired) studies: Wilcoxon Signed-Rank Test
  • Both continuous: Spearman’s Rank Correlation
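
All three tests begin by replacing the raw values with their ranks, which is what blunts the effect of outliers. A quick sketch with made-up numbers:

```r
x <- c(12, 15, 14, 900)  # one extreme outlier
rank(x)                  # 1 3 2 4 -- 900 is simply "largest"
rank(c(12, 15, 14, 16))  # 1 3 2 4 -- identical ranks
```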

Wilcoxon Rank Sum Test

Refer back to the Infant Diarrhea study. Instead of transforming the data or discarding outliers, we can use the Wilcoxon Rank Sum Test to test whether the treatment and control groups have different stool output values. Ranking the data minimizes the impact of outliers and removes the need for assumptions about the underlying distribution of the data.

# Rank-Sum Test
wilcox.test(diarrhea$Stool~diarrhea$Group)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  diarrhea$Stool by diarrhea$Group
## W = 4452, p-value = 0.005573
## alternative hypothesis: true location shift is not equal to 0

How would you interpret this result?

Answer

The study provides strong evidence that the group of infants with diarrhea who received the bismuth treatment have lower stool output compared to the infants with diarrhea in the placebo group.

Wilcoxon Signed-Rank Test

If the data we are analyzing is paired, the signed-rank test is a nonparametric procedure that takes into account the relationship between the two groups. Let’s revisit the familiar paired data set from the Oatbran study.

# Signed Rank Test
oatbran <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/oatbran/oatbran.txt')
wilcox.test(oatbran$CornFlakes, oatbran$OatBran, paired=TRUE)
## 
##  Wilcoxon signed rank exact test
## 
## data:  oatbran$CornFlakes and oatbran$OatBran
## V = 93, p-value = 0.008545
## alternative hypothesis: true location shift is not equal to 0
# For comparison, in case you don't remember:
t.test(oatbran$CornFlakes, oatbran$OatBran, paired=TRUE)
## 
##  Paired t-test
## 
## data:  oatbran$CornFlakes and oatbran$OatBran
## t = 3.3444, df = 13, p-value = 0.005278
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.1284606 0.5972537
## sample estimates:
## mean difference 
##       0.3628571

Describe the difference between the two results. What conclusions can you draw?

Answer

The results from the parametric paired t-test and the nonparametric signed-rank test are consistent. Both tests conclude that there is strong evidence to suggest that men on the oatbran diet have lower cholesterol levels than men on the cornflake diet. The parametric assumptions appear to hold, so the paired t-test provides slightly more power. We also get a straightforward confidence interval from the parametric test output.

Spearman’s Rank Correlation

Another nonparametric option is Spearman’s Rank Correlation for continuous data. This test ranks the values of both variables in the data set before computing a correlation coefficient. Let’s revisit the nhanes data for men. Remember that the two variables of height and weight are continuous. When we plot the data, we can see a clear outlier that is over 800 pounds.

nhanes <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/nhanes-am/nhanes-am.txt')
plot(nhanes$Weight~nhanes$Height)

Let’s see how our results would have been influenced by this outlier using Spearman’s Rank Correlation.

#Spearman's Rank Correlation
cor.test(nhanes$Height, nhanes$Weight, method = "spearman")

#For comparison, here's what we got using Pearson's correlation:
cor.test(nhanes$Height, nhanes$Weight, method = "pearson")

How much of an influence did the outlier have?

Answer The Spearman’s rho correlation is higher than the Pearson correlation (0.44 and 0.41, respectively). Both correlations are comparable and we come to the same conclusion that there is overwhelming evidence that the correlation is not equal to zero. It appears that the outlier wasn’t too influential in examining correlation.

Comparing Parametric and Nonparametric Tests

Parametric advantages: When the assumptions hold, the parametric tests are more powerful and construction of the confidence intervals is straightforward.

Nonparametric advantages: There are minimal assumptions, and they can be more powerful than parametric tests when the parametric assumptions (assumptions about the distribution of the data) are invalid.

In summary, when you have continuous data, don’t automatically use a t-test. Look at the data, and if it’s skewed or contains large outliers, consider a transformation or a nonparametric option.
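
To see why this matters (a simulated sketch with made-up data, not one of the lab data sets): a single extreme value can wipe out a t-test’s significance while leaving a rank-based test untouched.

```r
# Illustrative (made-up) data: two small groups with a true shift
g1 <- 1:10
g2 <- g1 + 3.5

t.test(g1, g2)$p.value       # significant on the clean data
wilcox.test(g1, g2)$p.value  # also significant

# Replace the largest observation with an extreme outlier
g2[10] <- 200

t.test(g1, g2)$p.value       # no longer significant: the variance exploded
wilcox.test(g1, g2)$p.value  # unchanged: 200 holds the same rank as 13.5
```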


Practice Problem

1. Transforming Data

Suppose we have a dataset with 100 measurements of how many cups of ice cream individuals eat every day. The individuals were separated into two groups “A” and “B”. We want to know whether there is a significant difference in ice cream consumption between the two groups. The data ‘icecream’ will be generated by running the code below.

If we plot the data as-is, we see that the measurements are highly skewed right for both groups. Using the methods we covered in this lab, perform a 2-sample t-test on whether there is a significant difference in the ice cream consumption between groups “A” and “B”, and find the confidence interval.

set.seed(58008)
icecream <- data.frame(group = c(rep("A", 40), rep("B", 60)))
var1 <- rlnorm(40, meanlog = 0, sdlog = 1)
var2 <- rgamma(60, shape = 1, rate = 0.5)

icecream$cups <- c(var1, var2)

ggplot(icecream, aes(x = cups)) + geom_histogram(fill = "pink", color = "black", bins = 20) + facet_wrap(~group)

Answer

Since the data are right skewed, we will apply a log transformation before performing the two-sample t-test.

icecream$logcups <- log(icecream$cups) # transform
ggplot(icecream, aes(x = log(cups))) + geom_histogram(fill = "pink", color = "black", bins = 20) + facet_wrap(~group) # visualize

The log-transformed data appear much more normal, so we will proceed with the two-sample t-test. To decide between Student’s and Welch’s versions, we can compare the sample sizes of the two groups as well as the standard deviations.

by(icecream$cups, icecream$group, sd)
## icecream$group: A
## [1] 1.513396
## ------------------------------------------------------------ 
## icecream$group: B
## [1] 1.462479
table(icecream$group)
## 
##  A  B 
## 40 60

The groups are reasonably balanced and the standard deviations look similar. I feel comfortable using Student’s t-test, but Welch’s approach would be fine too.

t.test(icecream$logcups ~ icecream$group, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  icecream$logcups by icecream$group
## t = 0.22213, df = 98, p-value = 0.8247
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -0.3773412  0.4724658
## sample estimates:
## mean in group A mean in group B 
##      0.13500594      0.08744364

We need to remember to exponentiate the confidence interval given:

ci <- t.test(icecream$logcups ~ icecream$group, var.equal = TRUE)$conf.int
exp(ci)
## [1] 0.6856821 1.6039443
## attr(,"conf.level")
## [1] 0.95

The exponentiated 95% confidence interval (0.69, 1.60) contains 1, so there is insufficient evidence to suggest that mean ice cream consumption is different between groups A and B.


2. Wilcoxon Signed-Rank Test

On the course website, the cystic fibrosis dataset contains paired data on the patient’s reduction in FVC for each treatment period for a drug vs. placebo. Create a histogram for the drug and placebo group and note if there are any outliers. Conduct a Wilcoxon Signed-Rank Test to determine if there is a significant difference in reduction of FVC for drug vs. placebo.

fibrosis <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/cystic-fibrosis/cystic-fibrosis.txt')
Answer

First visualize your data

hist(fibrosis$Drug)

hist(fibrosis$Placebo)

There may be one outlier in the placebo group. (We could make boxplots to check.)

We will conduct a Wilcoxon Signed-Rank Test to determine if there is a significant difference in reduction of FVC for drug vs. placebo.

wilcox.test(fibrosis$Drug, fibrosis$Placebo, paired=TRUE)
## 
##  Wilcoxon signed rank exact test
## 
## data:  fibrosis$Drug and fibrosis$Placebo
## V = 19, p-value = 0.03528
## alternative hypothesis: true location shift is not equal to 0
mean(fibrosis$Drug)
## [1] 159.7143
mean(fibrosis$Placebo)
## [1] 296.2143

There is moderate evidence to suggest that the drug leads to a smaller reduction in FVC scores than placebo in patients with Cystic Fibrosis.


3. Spearman’s Rank Correlation

Previously we looked at the nhanes data containing the heights and weights of men. For this problem, we will look at the nhanes data containing the heights and weights of women and perform a similar analysis. Using the nhanes-aw data from the website, find the Spearman’s rank correlation between height and weight.

nhanes_woman <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/nhanes-aw/nhanes-aw.txt')
Answer

Again, let’s start by visualizing our data.

plot(nhanes_woman$Weight~nhanes_woman$Height)

There are no extreme outliers, but there are some points that could be concerning. We can compute Spearman’s Rank Correlation and compare it to the Pearson correlation.

cor.test(nhanes_woman$Height, nhanes_woman$Weight, method = "spearman")
## Warning in cor.test.default(nhanes_woman$Height, nhanes_woman$Weight, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  nhanes_woman$Height and nhanes_woman$Weight
## S = 2169597258, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2996993
cor.test(nhanes_woman$Height, nhanes_woman$Weight, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_woman$Height and nhanes_woman$Weight
## t = 15.897, df = 2647, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2600585 0.3295970
## sample estimates:
##       cor 
## 0.2952187

The correlations are very comparable and both tests come to the same conclusion. There is overwhelming evidence to suggest that the correlation is different from 0. From the data we observed, there is a moderate positive association between women’s height and weight.