Lab 14

In today’s lab we will begin to cover how to carry out an ANOVA analysis in R. ANOVA is a large topic, and in this course we will only cover the basics. The second half of the lab will be review for your quiz on Thursday!

ANOVA analysis

To begin to dive into an ANOVA analysis, we will first read in the flicker/eye color dataset from the class website. The dataset summarizes an individual’s critical flicker frequency, defined as the highest frequency at which a flickering light source can be detected. This study recorded critical flicker frequency and iris color for 19 subjects.

flick <- read.delim("http://myweb.uiowa.edu/pbreheny/data/flicker.txt")
attach(flick)

The first step of the ANOVA analysis is to create a linear model using the lm() function. The lm() function takes a formula of the form Y ~ X, just like other functions we have used in the past. For this problem we will use lm() to fit the model below. Once this is done, we can use the anova() function to obtain the sums of squares, which tell us how much of the total variation in Flicker is explained by our model: take the SS for Color and divide it by the sum of the SS for Color and the SS for the residuals.

model1 <- lm(Flicker~Color)
anova(model1)
## Analysis of Variance Table
## 
## Response: Flicker
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## Color      2 22.997 11.4986  4.8023 0.02325 *
## Residuals 16 38.310  2.3944                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this output we can carry out this calculation as 22.997/(38.310 + 22.997) ≈ 0.375. This tells us that eye color explains about 37.5% of the variability in critical flicker detection. However, is this a bigger difference than we would expect by chance alone? The p-value is 0.02325, so it is quite unlikely that this result is due to chance alone. If you are curious, this p-value is based on the F statistic of 4.8023.
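
Rather than copying numbers off the screen, you can pull the sums of squares straight out of the anova() table; a minimal sketch:

# Extract the Sum Sq column from the ANOVA table and compute the
# proportion of variation in Flicker explained by Color
ss <- anova(model1)[["Sum Sq"]]
ss[1] / sum(ss)   # 22.997 / (22.997 + 38.310), about 0.375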

Pairwise comparison of means

This low p-value tells us to reject the null hypothesis that the mean flicker detection rate is the same for each eye color. So now you might be curious about which eye color is best! One way to address this is to make a boxplot!

boxplot(Flicker~Color)

This boxplot indicates that blue-eyed people have the best flicker detection, while brown-eyed people have the worst.

However, we have to be careful here, because someone may say that we do not have enough evidence to make that claim. To formally test this, we can carry out multiple t-tests, one for each two-group comparison.

Multiple Comparisons

Tukey’s “Honest Significant Difference”

You saw in class why it’s important to account for each separate comparison you make, in order to avoid making too many type I errors. For ANOVA, the most common method for adjusting for multiple comparisons (and the method you’ll have to use on the homework) is called the “Tukey” correction (or “Tukey’s Honest Significant Differences”). R makes this relatively easy to accomplish with the TukeyHSD() function.

fit <- aov(Flicker~Color)

TukeyHSD(fit)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Flicker ~ Color)
## 
## $Color
##                  diff        lwr       upr     p adj
## Brown-Blue  -2.579167 -4.7354973 -0.422836 0.0183579
## Green-Blue  -1.246667 -3.6643959  1.171063 0.3994319
## Green-Brown  1.332500 -0.9437168  3.608717 0.3124225

From the output we get, for each pair of eye colors, the estimated difference in means (diff), the lower and upper bounds of a 95% family-wise confidence interval (lwr and upr), and a Tukey-adjusted p-value (p adj).
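
If you prefer a picture, TukeyHSD objects also have a plot method that draws each pairwise confidence interval; a quick sketch:

# Plot the Tukey confidence intervals; intervals that do not
# cross zero correspond to significant pairwise differences
plot(TukeyHSD(fit))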

Bonferroni Adjustment

Another approach is the Bonferroni adjustment; in R we use the pairwise.t.test() function.

pairwise.t.test(Flicker,Color, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Flicker and Color 
## 
##       Blue  Brown
## Brown 0.021 -    
## Green 0.606 0.451
## 
## P value adjustment method: bonferroni

Here R automatically scales your p-values so you can compare them to 0.05 as usual, but you could also calculate the Bonferroni-adjusted alpha level (here it would be 0.05/3 ≈ 0.017); you would then run the code above with p.adj = "none". The values above are p-values, and it can be seen that there is a significant difference between the blue and brown groups, but not between any other pair of groups.
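
For comparison, this is what the unadjusted version looks like; you would compare these p-values to 0.017 instead of 0.05:

# Unadjusted pairwise t-tests; compare each p-value to 0.05/3
pairwise.t.test(Flicker, Color, p.adj = "none")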

Quiz Review

Below is an outline of things you should know, from the TAs’ perspective.

Two-sample categorical data

Chi-square

Be able to create a contingency table with the outcomes as the columns and the groups as the rows. Know how to compute a chi-squared test! We can almost promise this will be on the quiz. The formula for the chi-squared statistic is \(\sum \frac{(observed - expected)^2}{expected}\), summed over all cells. Then place the value on a chi-squared distribution and find the area to the right to get the p-value. A minimal sketch in R is below.
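
This sketch uses made-up counts (the table and numbers are hypothetical, not quiz data):

# Hypothetical 2x2 table: rows are groups, columns are outcomes
tab <- matrix(c(20, 30, 40, 10), nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("A", "B"), Outcome = c("Yes", "No")))
chisq.test(tab, correct = FALSE)   # correct = FALSE matches the hand formula
chisq.test(tab)$expected           # expected counts under independence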

Fisher’s exact test

Know that this tests the same hypothesis as a chi-squared test. It is especially useful when any expected cell count is below 5, and it is necessary when the expected counts drop below 1. In R it runs on the same kind of table (sketch below).
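
Using the hypothetical table tab from the chi-squared sketch above:

# Fisher's exact test on the same hypothetical 2x2 table
fisher.test(tab)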

Odds Ratio

Remember that from a contingency table this is ad/bc. If the odds ratio is greater than 1 (for this example, let’s say it is 1.5), then we would say: “The odds of (group 1) experiencing (outcome Y) are 1.5 times the odds of (group 2) experiencing (outcome Y).”

If the odds ratio is less than 1 (let’s say 0.6), we would say: “The odds of (group 1) experiencing (outcome Y) are 0.6 times the odds of (group 2) experiencing (outcome Y),” or: “The odds of (group 1) experiencing (outcome Y) are 40% lower than the odds of (group 2) experiencing (outcome Y).”

The CI for an odds ratio is \(\exp\left(\log(OR) \pm Z\sqrt{1/a + 1/b + 1/c + 1/d}\right)\), where Z for a 95% CI is 1.96.
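
As a sketch, here is that formula in R, again with hypothetical cell counts:

# 95% CI for the odds ratio from a 2x2 table with cells a, b, c, d
a <- 20; b <- 30; c <- 40; d <- 10      # hypothetical counts
or <- (a * d) / (b * c)
se <- sqrt(1/a + 1/b + 1/c + 1/d)
exp(log(or) + c(-1, 1) * 1.96 * se)     # lower and upper CI limits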

Two-sample continuous data

When this is the case, we need to use Welch’s or Student’s t-test! Know the difference between the two and when to use each (a sketch of both calls is below).
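
In R both run through t.test(); the var.equal argument switches between them. A minimal sketch with simulated data (x and y are made up):

set.seed(1)
x <- rnorm(20, mean = 10, sd = 2)   # hypothetical group 1
y <- rnorm(20, mean = 12, sd = 2)   # hypothetical group 2
t.test(x, y)                        # Welch's: default, unequal variances
t.test(x, y, var.equal = TRUE)      # Student's: assumes equal variances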

Methods for handling outliers and badly skewed data! We can use the Mann-Whitney rank sum test, log transform the data, or use a permutation test. Know these methods and at least what each one does (sketches below)!
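
As a hedged sketch, here is what each of the three looks like in R on the simulated x and y above (for the log transform the data must be positive):

wilcox.test(x, y)          # Mann-Whitney rank sum test
t.test(log(x), log(y))     # t-test on the log scale (positive data only)

# Simple permutation test for the difference in means
obs <- mean(x) - mean(y)
perm <- replicate(10000, {
  z <- sample(c(x, y))                    # shuffle group labels
  mean(z[1:20]) - mean(z[21:40])
})
mean(abs(perm) >= abs(obs))               # two-sided permutation p-value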

Know the different studies

Know the difference between retrospective, prospective, and cross-sectional studies.

Example questions (answers at the bottom)

Example 1

In a study of 793 individuals who were in bike accidents, 147 wore helmets while riding, and 17 of those got head injuries. The rest of the bikers did not wear helmets, and 428 of them did not get a head injury!

  1. Make a contingency table for the data

  2. Without running any tests, does there appear to be a benefit to wearing a helmet? (hint: odds ratio)

  3. Make a 95% CI for this odds ratio

  4. What are the expected counts for the contingency table?

  5. Calculate the chi-squared statistic and explain its significance

Example 2

A study compared the miles per gallon of American cars (sample 1) to Japanese cars (sample 2). The sample size for American cars was 249, with a sample mean of 20.145 and a sample standard deviation of 6.415. Japanese cars had a sample size of 79, a sample mean of 30.481, and a sample standard deviation of 6.108. (The pooled standard deviation is 6.343.)

  1. If we assume the data are normal, what test do we run?

  2. Conduct a t-test comparing the two group means and interpret the results

  3. Further analysis shows that two of the American cars in the sample were getting less than 5 miles per gallon. How might this affect the test results? How might you remedy this issue?

Example 3

Try this one on your own. The Predators (a hockey team) just reached the second round of the playoffs. I was curious whether the Predators experienced any benefit from playing at home this season, so I gathered data on how many goals they scored each game and whether they were home or away (regular season only). In home games they scored an average of 3.098 goals in 41 games, with a standard deviation of 1.841. In away games (also 41) they scored an average of 2.237, with a standard deviation of 1.43. The pooled standard deviation is 1.650.

  1. Which method is best in this case for testing the association between location and goals scored?

  2. It seems they did better at home. Could this difference be explained by chance alone?

Answers

Example 1

  1. With the groups as rows and the outcome as columns:

                   No head injury   Head injury
     Helmet                   130            17
     No helmet                428           218

  2. Yes, the odds ratio is \((130*218)/(17*428) = 3.89\). If you were wearing a helmet, the odds of no head injury are 3.89 times the odds of no head injury when you were NOT wearing a helmet.

  3. \(\log(3.89) \pm 1.96\sqrt{1/130 + 1/428 + 1/17 + 1/218} = [0.8272548, 1.8895635]\)

     exp(c(0.8272548, 1.8895635)) = [2.287032, 6.616480], so the 95% CI for the odds ratio is (2.29, 6.62).

  4. The expected counts:

                   No head injury   Head injury
     Helmet                 103.4          43.6
     No helmet              454.6         191.4

  5. \(\chi^2 = 28.33\). With 1 degree of freedom, this is far out in the tail of the chi-squared distribution (p-value well below 0.05), so there is strong evidence of an association between helmet use and head injury.
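
If you want to check this in R, here is a hedged sketch using the table from part 1 (note that chisq.test() applies a continuity correction by default, so we turn it off to match the hand formula):

helmet <- matrix(c(130, 17, 428, 218), nrow = 2, byrow = TRUE,
                 dimnames = list(Helmet = c("Yes", "No"),
                                 Injury = c("No", "Yes")))
chisq.test(helmet, correct = FALSE)   # statistic about 28.3, tiny p-value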

Example 2

  1. Student’s

  2. \(t = \frac{20.145 - 30.481}{6.34*\sqrt{1/249 + 1/79}} = -12.62\)
    Using (249 + 79 - 2) = 326 degrees of freedom, the p-value is far less than 0.05; there is a difference in MPG: American cars get fewer miles per gallon.
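
As a quick hedged check, you can get the p-value for this t statistic directly in R:

# Two-sided p-value for t = -12.62 with 326 degrees of freedom
2 * pt(-12.62, df = 326)   # essentially zero, far below 0.05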

  3. The t-test procedure can be affected by the impact outliers have on the mean and standard error.

Assuming the measurements are accurate, you might consider a non-parametric (Wilcoxon rank sum) approach to remove the effect of these outliers.

You can also consider a log transformation to shrink the scale of the variables.

Note: After being exponentiated, a confidence interval computed on log-transformed data reflects the ratio of group means.

Example 3

  1. Probably Student’s

  2. Conduct a t-test: \(t = \frac{3.098 - 2.237}{1.650\sqrt{1/41 + 1/41}} = 2.36\). With 80 degrees of freedom, the p-value is 0.02045, so the difference is unlikely to be explained by chance alone.
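
For reference, a hedged sketch of the same calculation in R using only the summary statistics:

# t statistic from the summary statistics and its two-sided p-value
t_stat <- (3.098 - 2.237) / (1.650 * sqrt(1/41 + 1/41))
t_stat                                  # about 2.36
2 * pt(-abs(t_stat), df = 41 + 41 - 2)  # about 0.02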