Objectives

In today’s lab we will:

  1. Learn how to make stacked bar plots
  2. Compute and compare summary statistics
  3. Learn how to visualize continuous data using figures
  4. Review for Quiz 1

Stacked Barplots

This comes mostly as an “FYI” for your upcoming homework 3, where you will be expected to make stacked barplots for a certain dataset. For more experienced R programmers, ggplot2 is the easiest way to do this. This section, however, shows how to make them in base R using the titanic dataset.

titanic <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/titanic.txt")

This code differs from a regular bar plot in two ways. First, height now takes a two-way table rather than a vector: the first variable passed to table() (the rows) defines the segments stacked within each bar, and the second variable (the columns) defines the bars. Second, the “beside” argument is set to FALSE. That is also barplot's default, so it is written out here only for emphasis. Try changing it to TRUE and see what it does.

barplot(height = table(titanic$Survived, titanic$Class),
        beside=FALSE, col = c("blue", "red"), main = "Survival Rate Between Classes")
legend(x = "topleft", legend = c("Died", "Survived"), title = "Survival Status", fill = c("blue", "red"))
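As a quick check of how the table drives the plot, here is a minimal sketch using R's built-in Titanic table. This is an assumption made for illustration: it is not the course file loaded above, and its counts may differ. The sketch builds a Survived-by-Class table and converts the counts to within-class proportions with prop.table(), which is one way to make the bars show survival rates rather than raw counts.

```r
# Built-in 4-way table (Class x Sex x Age x Survived); NOT the course's titanic.txt
counts <- margin.table(Titanic, margin = c(4, 1))  # rows = Survived, columns = Class

# Column proportions: within each class (column) the proportions sum to 1
rates <- prop.table(counts, margin = 2)

# Each bar is a class; the stacked segments are the rows (No / Yes)
barplot(height = rates, beside = FALSE, col = c("blue", "red"),
        ylim = c(0, 1.4),  # extra headroom so the legend does not cover the bars
        main = "Proportion Surviving by Class", ylab = "Proportion")
legend(x = "topleft", legend = rownames(rates),
       title = "Survived", fill = c("blue", "red"))
```

Because every bar now has the same total height, differences between classes are easier to compare than in the count version.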

Now, let’s draw one graph for each sex, so 2 graphs in total. To view more than 1 graph in the same plotting window, you can use the following code to make a 1x2 panel (1 row, 2 columns):

par(mfrow=c(1,2)) # To turn it back to normal, run par(mfrow = c(1,1))

Now that we separated the panel into a 1x2 plane, we can see the following graphs side by side:

{
par(mfrow=c(1,2))

# Bar plot for the females
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,1],
        beside=FALSE, 
        col = c("blue", "red"), 
        main = "Female Survival Rate vs Classes")
legend(x = "topleft", 
       legend = c("Died", "Survived"), 
       title = "Survival Status", 
       fill = c("blue", "red"))

# Bar plot for the males
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,2],
        beside=FALSE, 
        col = c("blue", "red"), 
        main = "Male Survival Rate vs Classes")
legend(x = "topleft", 
       legend = c("Died", "Survived"), 
       title = "Survival Status", 
       fill = c("blue", "red"))
}
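If you would rather not have to remember the previous layout, one option (a sketch, not the only approach) is to save the old settings when you change them. par() returns the previous values of whatever it changes, so they can be handed back afterwards. The two barplots below are placeholders with made-up numbers.

```r
old_par <- par(mfrow = c(1, 2))   # switch to 1 row, 2 columns; keep the old setting

barplot(c(a = 3, b = 5), main = "Left panel")    # placeholder plot
barplot(c(a = 4, b = 2), main = "Right panel")   # placeholder plot

par(old_par)                      # restore the layout that was active before
```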

Summary Statistics

Today we will be using the tailgating dataset. A description of this study can be found here:

https://myweb.uiowa.edu/pbreheny/data/tailgating.html

The data can be loaded into R using the following code.

tailgating <- read.delim("http://myweb.uiowa.edu/pbreheny/data/tailgating.txt")

We have already learned how to compute some summary statistics in R, but today we will learn how to visualize the distribution of continuous data. First, let’s take a look at the summary of distance.

summary(tailgating$Distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.82   27.57   32.49   41.01   39.52  356.96

Standard deviation

The summary() output tells us most of what we would like to know, but it does not include the standard deviation. For that, use the function ‘sd’:

sd(tailgating$Distance)
## [1] 44.16035
  • How would you generally interpret this? Do you think the data has a small or large spread?
    • The data have a large spread: the standard deviation (44.16) is even larger than the mean (41.01).
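To see what sd() is doing, the sketch below computes the sample standard deviation by hand on a small made-up vector (hypothetical values, not the tailgating data) and then shows how a single extreme value inflates it.

```r
x <- c(15, 22, 27, 28, 31, 33, 35, 40, 52)   # hypothetical distances

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
n <- length(x)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))
all.equal(manual_sd, sd(x))   # TRUE

# Adding one very large value increases the standard deviation
x_out <- c(x, 350)
sd(x_out) > sd(x)             # TRUE
```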

Data by drug group

Now let’s look at distance by drug group status. The ‘by’ function allows us to run a function over data split into groups.

  • The first parameter is the data of interest
  • The second parameter is the data that defines the groups
  • The third parameter is the function to run over the data
by(tailgating$Distance, tailgating$Group, summary)
## tailgating$Group: ALC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.89   28.83   35.42   36.83   40.21   68.34 
## ------------------------------------------------------------ 
## tailgating$Group: MDMA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.01   22.32   26.83   27.61   28.46   56.61 
## ------------------------------------------------------------ 
## tailgating$Group: NODRUG
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.70   28.80   33.37   47.33   43.57  356.96 
## ------------------------------------------------------------ 
## tailgating$Group: THC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.82   27.75   31.90   42.61   39.52  346.72
  • Summarize the differences between the groups.
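    • The medians are fairly similar across groups (about 27 to 35), with MDMA the lowest. The means differ more: the NODRUG and THC means (47.33 and 42.61) sit well above their medians because each of those groups has a maximum near 350, much larger than in the other groups. (Other answers acceptable.)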
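The same grouping idea can be tried on a tiny made-up data frame where the answers are easy to check by hand. The data frame `toy` and its values are hypothetical. tapply() is used alongside by() here because it returns a plain named vector, which is convenient when you only need one number per group.

```r
# Hypothetical data: two groups of three distances each
toy <- data.frame(
  Group    = c("A", "A", "A", "B", "B", "B"),
  Distance = c(20, 30, 40, 25, 35, 300)
)

by(toy$Distance, toy$Group, summary)   # full summary for each group

group_means   <- tapply(toy$Distance, toy$Group, mean)
group_medians <- tapply(toy$Distance, toy$Group, median)

group_means     # A = 30, B = 120
group_medians   # A = 30, B = 35
```

Group B's single large value moves its mean far more than its median, the same pattern seen in the NODRUG and THC groups above.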

Histograms

As discussed in class, we can visualize the distribution of distance using a histogram. Here is how to do this in R:

hist(tailgating$Distance)

Compare the mean and median

We could add the mean and median to the plot using the function ‘abline’ to add lines. Let’s see how the two compare.

hist(tailgating$Distance, main = "Histogram of Tailgating Distance", 
     xlab = "Following Distance")
abline(v=mean(tailgating$Distance), 
       col="steelblue4", lwd=2) #lwd makes the line thicker (line width)
abline(v=median(tailgating$Distance), col="orange", 
       lwd=2, lty = 2)#lty makes the line dashed (line type) 
legend(x = "topright", 
       legend = c("Median", "Mean"), 
       col = c("orange", "steelblue4"),
       lwd = 2, lty = c(2,1))

# the mean is the solid line and the median is the dashed line 

Change the bin size

The bin size (increments seen on the x-axis) can impact how our data looks. If these bins are large, we might not be able to see our data in detail. Above, our bin size is pretty large because the range is vast. Let’s see how the data looks if we change the bin size using the argument ‘breaks’.

As a refresher, the seq() function returns a vector of evenly spaced numbers: the first argument is the starting value (the minimum), the second is the ending value (the maximum), and the third is the increment (for our purposes, the bin size).

hist(tailgating$Distance, breaks = seq(0, 400, 2)) # bin size of 2

hist(tailgating$Distance, breaks = seq(0, 400, 100)) # bin size of 100

hist(tailgating$Distance, breaks = seq(0, 400, 10)) # bin size of 10

You can now see that most of the data follow a roughly bell-shaped distribution, but a few outliers far to the right make the overall distribution right skewed. The histogram with a bin size of 10 shows an appropriate amount of detail.
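To see how seq() and breaks interact without drawing anything, the sketch below uses a small made-up vector (hypothetical values) and asks hist() for the bin counts only, via plot = FALSE.

```r
x <- c(15, 22, 27, 28, 31, 33, 35, 40, 52, 350)   # hypothetical distances

seq(0, 400, 100)              # 0 100 200 300 400 -> 4 bins of width 100
length(seq(0, 400, 10)) - 1   # 40 bins of width 10

# plot = FALSE returns the histogram object instead of drawing it
wide   <- hist(x, breaks = seq(0, 400, 100), plot = FALSE)
narrow <- hist(x, breaks = seq(0, 400, 10),  plot = FALSE)

wide$counts          # 9 0 0 1: almost everything lands in the first bin
sum(narrow$counts)   # 10: every observation is still counted exactly once
```

With bins of width 100, nine of the ten values share a single bar, which is why very wide bins hide the shape of the data.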

Customize your histogram

The great thing about R is that there are many options to customize your figures. Below is code for the same figure, with arguments added to customize the x and y labels (xlab and ylab, respectively). The main argument sets the title; pass the empty string "" to leave the title off instead of using the default. There are also many color options in R: the col argument colors the bars, and the border argument colors their outlines.

#customized labels, solid color & white border   
hist(tailgating$Distance, col= "pink", border="white", breaks = seq(1, 400, 10),
     xlab="Distance",
     ylab="Frequency",
     main = "")

Specify histogram by drug group

Now let’s visualize distance broken down by drug group. We can use the “par” function to view additional plots in one window.

par(mfrow=c(2,2)) # view all four histograms in a 2 by 2 window

hist(tailgating$Distance[tailgating$Group=="ALC"], col= "yellow", breaks = seq(1, 400, 10), 
     main = "", xlab = "ALC")
hist(tailgating$Distance[tailgating$Group=="MDMA"], col= "red", breaks = seq(1, 400, 10), 
     main = "", xlab = "MDMA")
hist(tailgating$Distance[tailgating$Group=="NODRUG"], col= "blue", breaks = seq(1, 400, 10), 
     main = "", xlab = "NoDrug")
hist(tailgating$Distance[tailgating$Group=="THC"], col= "green", breaks = seq(1, 400, 10), 
     main = "", xlab = "THC")

  • Which groups are the main sources of the outliers?
    • The NODRUG and THC groups have the major outliers.
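The four near-identical hist() calls can also be written as a loop. The sketch below does this on a made-up data frame (`toy` and its values are hypothetical stand-ins for the tailgating file), using split() to get one vector of distances per group.

```r
# Hypothetical data standing in for the tailgating file
toy <- data.frame(
  Group    = rep(c("ALC", "MDMA", "NODRUG", "THC"), each = 3),
  Distance = c(18, 35, 68,  19, 27, 57,  20, 33, 357,  15, 32, 347)
)

by_group <- split(toy$Distance, toy$Group)   # named list, one vector per group

par(mfrow = c(2, 2))
for (g in names(by_group)) {
  # Same breaks in every panel so the x-axes are directly comparable
  hist(by_group[[g]], breaks = seq(0, 400, 10), main = "", xlab = g)
}
par(mfrow = c(1, 1))   # back to a single panel

sapply(by_group, max)  # largest distance in each group
```

Using identical breaks in every panel matters: if each histogram picked its own bins, a group with one extreme value would look very different from the others for reasons unrelated to the bulk of its data.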

Box Plots

We can also plot this data using a box plot.

boxplot(tailgating$Distance ~ tailgating$Group, col=rainbow(4))

Although it is good to know there are outliers, the large range can make it difficult to see the boxes. We can hide the outliers for a better look by using the argument “outline=FALSE”. The outliers are still in the data; they are just not drawn.

boxplot(tailgating$Distance ~ tailgating$Group, col= rainbow(4), outline=FALSE)

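Remember that a box plot is built from the 0th, 25th, 50th, 75th, and 100th percentiles. When outliers are drawn as separate points, they mark the extremes, and the whiskers end at the most extreme remaining values. The next section shows how to find other quantiles.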
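To check which numbers boxplot() actually draws, the sketch below compares fivenum() with boxplot.stats() on a small made-up vector (hypothetical values). By default R flags a point as an outlier when it lies more than 1.5 times the box length beyond the box, and the whiskers stop at the most extreme points that are not flagged.

```r
x <- c(15, 22, 27, 28, 31, 33, 35, 40, 52, 350)   # hypothetical distances

fivenum(x)    # 15 27 32 40 350: min, lower hinge, median, upper hinge, max

bs <- boxplot.stats(x)
bs$stats      # 15 27 32 40 52: the upper whisker stops at 52, not at the max
bs$out        # 350: drawn as a separate point (or hidden by outline = FALSE)
```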

Calculate a quantile

As an FYI, you can find specific quantiles of interest using the quantile function. The 2nd argument is the quantile you would like, given as a proportion between 0 and 1:

quantile(tailgating$Distance, 0.30)
##      30% 
## 28.15008
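quantile() also accepts several proportions at once. The sketch below, on a small made-up vector (hypothetical values), asks for the three quartiles together and confirms that the 0.5 quantile is the median and that the 0 and 1 quantiles are the minimum and maximum.

```r
x <- c(15, 22, 27, 28, 31, 33, 35, 40, 52, 350)   # hypothetical distances

quantile(x, c(0.25, 0.50, 0.75))   # all three quartiles in one call

# unname() drops the "50%" style labels so only the numbers are compared
all.equal(unname(quantile(x, 0.50)), median(x))   # TRUE
all.equal(unname(quantile(x, 0)), min(x))         # TRUE
all.equal(unname(quantile(x, 1)), max(x))         # TRUE
```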

Quiz Review

Observational vs Experimental

State whether each of the following scenarios is an observational or an experimental study design.

Errors

Type I Error

A Type I error is committed when a true null hypothesis is rejected. In other words, a Type I error means rejecting the null hypothesis when the null is in fact true. In terms of disease detection (where the null hypothesis is no disease), this is a false positive.

Type I Error Rate (\(\alpha\))

The Type I error rate is the proportion of true null hypotheses that were rejected.

Type II Error

A Type II error is committed when a false null hypothesis is not rejected. In other words, a Type II error means failing to reject the null hypothesis when the null is in fact false. In the disease detection example, this is a false negative.

Type II Error Rate (\(\beta\))

The Type II error rate is the proportion of false null hypotheses that were not rejected.

False Discovery Rate

The false discovery rate is the fraction of null hypothesis rejections that were incorrect.
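The three rates use different denominators, which is easy to mix up. The sketch below works through them on made-up counts (hypothetical numbers chosen for illustration): 900 true nulls of which 45 were rejected, and 100 false nulls of which 30 were not rejected.

```r
# Hypothetical counts: rows are the decision, columns are the truth
results <- matrix(c(855, 45,    # true nulls:  not rejected, rejected
                     30, 70),   # false nulls: not rejected, rejected
                  nrow = 2,
                  dimnames = list(Decision = c("Don't Reject", "Reject"),
                                  Truth    = c("True Null", "False Null")))

# Type I error rate: rejected true nulls / all true nulls
type1_rate <- results["Reject", "True Null"] / sum(results[, "True Null"])

# Type II error rate: non-rejected false nulls / all false nulls
type2_rate <- results["Don't Reject", "False Null"] / sum(results[, "False Null"])

# False discovery rate: rejected true nulls / all rejections
fdr <- results["Reject", "True Null"] / sum(results["Reject", ])

type1_rate   # 0.05   (45 / 900)
type2_rate   # 0.3    (30 / 100)
fdr          # about 0.391 (45 / 115)
```

The Type I and Type II error rates divide by column totals (what is actually true), while the false discovery rate divides by a row total (what was decided).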

Practice Questions:

  1. Fill in the table below using the following information: Suppose that an investigator conducts 800 experiments with the null hypothesis being true 700 times. The investigator rejected the null hypothesis when the null was true 10% of the time and failed to reject the null when the null was false 20% of the time.

                    True Null   False Null   Total
     Don’t Reject         630           20     650
     Reject                70           80     150
     Total                700          100     800

  2. Consider a study in which researchers were interested in whether students think better when standing versus when seated, and also in whether drinking or eating may affect academic performance. To test this, researchers asked 7th grade students across several different middle schools around the country a series of questions and recorded whether or not they answered all questions correctly. Students were assigned at random to be either seated or standing when given the questions, and they were also randomized to be either drinking or eating during the process. Their data are presented below:

                 Drinking                               Eating
     Group       Perfect Responses   Total Responses    Perfect Responses   Total Responses
     Sitting                    71               134                   88               104
     Standing                   94               142                   57                90
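
A natural first step with a table like this is to turn the counts into proportions so the four groups can be compared directly. The sketch below does that, assuming “Total Responses” means the number of students in each group (an interpretation, since the handout does not define the column). The object names are made up for this example.

```r
groups <- list(Position  = c("Sitting", "Standing"),
               Condition = c("Drinking", "Eating"))

# Counts copied from the table above (filled column by column)
perfect <- matrix(c(71, 94, 88, 57),    nrow = 2, dimnames = groups)
total   <- matrix(c(134, 142, 104, 90), nrow = 2, dimnames = groups)

# Proportion of students in each group who answered everything correctly
prop_perfect <- perfect / total
round(prop_perfect, 3)
# Sitting:  Drinking 0.530, Eating 0.846
# Standing: Drinking 0.662, Eating 0.633
```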

Vocab recap

Selection bias

Instead of random sampling, certain subgroups of the population were more likely to be included than others.

Nonresponse bias

Nonresponders can differ from responders in many important ways, so results based only on those who respond may not reflect the whole sample.

Generalizing from a sample to a different population

This occurs any time a study’s results are applied to a population other than the one the sample was drawn from.

In each of the following examples, determine which bias(es), if any, may be present.

  • Doctors want to investigate whether Tylenol performs better than Ibuprofen in curing headaches. They design an experiment in which participants are randomly assigned to one of the two treatments. The patients are blind to which treatment they receive.

    • No Bias
  • A parent-teacher association at an affluent school in Chicago, Illinois wanted to study how pervasive the drug culture was among high school students in the United States. To answer this question, they handed out a survey to their high school students at a school assembly.

    • Selection Bias
  • An investigator recently conducted a survey to study how long New Year’s resolutions last. The individuals who were still on track with their goals were more likely to respond.

    • Non-Response Bias