In today’s lab we will:
This section is mostly an “FYI” for your upcoming Homework 3, where you will be expected to make stacked barplots for a certain dataset. More experienced R programmers may find ggplot2 the easiest way to do this. This section, however, shows how to make this kind of plot in base R using the titanic dataset.
titanic <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/titanic.txt")
There are two major differences in the code below. The first is that the height argument now takes a table rather than a vector. The first argument to table() defines the stacked segments within each bar, and the second argument defines the bars (columns). The second difference is that the “beside” argument is written out explicitly as FALSE (this is also its default value). Try setting it to TRUE and see what it does.
barplot(height = table(titanic$Survived, titanic$Class),
beside=FALSE, col = c("blue", "red"), main = "Survival Rate Between Classes")
legend(x = "topleft", legend = c("Died", "Survived"), title = "Survival Status", fill = c("blue", "red"))
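To see what barplot() receives, it can help to print the table before plotting it. The sketch below uses a small made-up data frame (the name `toy` and all of its values are hypothetical, not the real titanic data) so that it runs without a download. Rows of the table become the stacked segments and columns become the bars.

```r
# Toy data (hypothetical values, NOT the real titanic data)
toy <- data.frame(
  Survived = c("Died", "Died", "Survived", "Died", "Survived", "Survived", "Died"),
  Class    = c("1st",  "2nd",  "1st",      "3rd",  "3rd",      "1st",      "3rd")
)

# Rows come from the first argument (stacks), columns from the second (bars)
tab <- table(toy$Survived, toy$Class)
print(tab)

# Stacked bars (beside = FALSE) next to side-by-side bars (beside = TRUE)
par(mfrow = c(1, 2))
barplot(tab, beside = FALSE, col = c("blue", "red"), main = "Stacked")
barplot(tab, beside = TRUE,  col = c("blue", "red"), main = "Side by side")
par(mfrow = c(1, 1))  # back to one plot per window
```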
Now, let’s plot the graph separately for each sex, so 2 graphs total. If we want to view more than 1 graph in 1 window, we can use the following code to make a 1x2 panel (1 row, 2 columns):
par(mfrow=c(1,2)) # To turn it back to normal, run par(mfrow = c(1,1))
Now that we have split the window into a 1x2 layout, we can see the following graphs side by side:
{
par(mfrow=c(1,2))
# Bar plot for the females
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,1],
beside=FALSE,
col = c("blue", "red"),
main = "Female Survival Rate vs Classes")
legend(x = "topleft",
legend = c("Died", "Survived"),
title = "Survival Status",
fill = c("blue", "red"))
# Bar plot for the males
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,2],
beside=FALSE,
col = c("blue", "red"),
main = "Male Survival Rate vs Classes")
legend(x = "topleft",
legend = c("Died", "Survived"),
title = "Survival Status",
fill = c("blue", "red"))
}
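The code above builds the same three-way table twice and picks slices by position. As a hedged alternative, the table can be built once and sliced by name. The sketch below uses made-up data (the data frame `toy` and the labels "female" and "male" are assumptions for illustration; check the labels in the real data with dimnames() first).

```r
# Toy data (hypothetical values and labels, NOT the real titanic data)
toy <- data.frame(
  Survived = c("Died", "Survived", "Died", "Survived", "Died", "Survived"),
  Class    = c("1st",  "1st",      "2nd",  "2nd",      "1st",  "2nd"),
  Sex      = c("female", "female", "male", "male", "male", "female")
)

# Build the three-way table once and reuse it
tab3 <- table(toy$Survived, toy$Class, toy$Sex)

# The third dimension is sorted alphabetically; look before indexing
dimnames(tab3)[[3]]

# Selecting a slice by name avoids guessing whether it is slice 1 or 2
female_tab <- tab3[, , "female"]
male_tab   <- tab3[, , "male"]
print(female_tab)
```

Each slice is an ordinary two-way table, so it can be passed to barplot() as the height argument in the same way as before.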
Today we will be using the tailgating dataset. A description of this study can be found here:
https://myweb.uiowa.edu/pbreheny/data/tailgating.html
The data can be loaded into R using the following code.
tailgating <- read.delim("http://myweb.uiowa.edu/pbreheny/data/tailgating.txt")
We have already learned how to compute some summary statistics in R, but today we will learn how to visualize the distribution of continuous data. First, let’s take a look at the summary of distance.
summary(tailgating$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.82 27.57 32.49 41.01 39.52 356.96
The summary tells us most of the information we would like to know, but it does not include the standard deviation. For that, use the function ‘sd’.
sd(tailgating$Distance)
## [1] 44.16035
Now let’s look at distance by drug group status. The ‘by’ function allows us to apply a function to data that has been split into groups.
by(tailgating$Distance, tailgating$Group, summary)
## tailgating$Group: ALC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.89 28.83 35.42 36.83 40.21 68.34
## ------------------------------------------------------------
## tailgating$Group: MDMA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.01 22.32 26.83 27.61 28.46 56.61
## ------------------------------------------------------------
## tailgating$Group: NODRUG
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.70 28.80 33.37 47.33 43.57 356.96
## ------------------------------------------------------------
## tailgating$Group: THC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.82 27.75 31.90 42.61 39.52 346.72
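If only one statistic per group is needed, tapply() is a compact alternative that returns a named vector. This is a minimal sketch on made-up numbers (the data frame `toy` and its groups "A" and "B" are hypothetical, not the tailgating data).

```r
# Toy data (hypothetical values, NOT the real tailgating data)
toy <- data.frame(
  Distance = c(20, 30, 40, 25, 35, 90),
  Group    = c("A", "A", "A", "B", "B", "B")
)

# tapply(values, groups, function) returns one value per group
group_means   <- tapply(toy$Distance, toy$Group, mean)
group_medians <- tapply(toy$Distance, toy$Group, median)
print(group_means)    # A = 30, B = 50
print(group_medians)  # A = 30, B = 35
```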
As discussed in class, we can visualize the distribution of distance using a histogram. Here is how to do this in R:
hist(tailgating$Distance)
We could add the mean and median to the plot using the function ‘abline’ to add lines. Let’s see how the two compare.
hist(tailgating$Distance, main = "Histogram of Tailgating Distance",
xlab = "Following Distance")
abline(v=mean(tailgating$Distance),
col="steelblue4", lwd=2) #lwd makes the line thicker (line width)
abline(v=median(tailgating$Distance), col="orange",
lwd=2, lty = 2)#lty makes the line dashed (line type)
legend(x = "topright",
legend = c("Median", "Mean"),
col = c("orange", "steelblue4"),
lwd = 2, lty = c(2,1))
# the mean is the solid line and the median is the dashed line
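To see why the two lines can sit far apart, here is a small sketch with made-up numbers (the vector `x` is hypothetical): one extreme value pulls the mean upward but barely moves the median.

```r
# Toy values (hypothetical): five typical distances plus one extreme value
x <- c(20, 25, 30, 35, 40, 350)

mean(x)    # pulled upward by the extreme value
median(x)  # 32.5, barely affected by it

# Without the extreme value the two agree
typical <- x[x < 100]
mean(typical)    # 30
median(typical)  # 30
```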
The bin size (increments seen on the x-axis) can impact how our data looks. If these bins are large, we might not be able to see our data in detail. Above, our bin size is pretty large because the range is vast. Let’s see how the data looks if we change the bin size using the argument ‘breaks’.
As a refresher, the seq() function gives you a sequence of numbers: the first argument is the starting value (minimum), the second argument is the ending value (maximum), and the third argument is the increment (for this purpose, the bin size).
hist(tailgating$Distance, breaks = seq(0, 400, 2)) # bin size of 2
hist(tailgating$Distance, breaks = seq(0, 400, 100)) # bin size of 100
hist(tailgating$Distance, breaks = seq(0, 400, 10)) # bin size of 10
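The effect of the break points can also be checked numerically. The sketch below uses made-up values (the vector `x` is hypothetical, not the tailgating data) and the plot = FALSE argument of hist(), which returns the bin counts instead of drawing anything.

```r
# seq(from, to, by) gives evenly spaced break points
seq(0, 400, 100)   # 0 100 200 300 400

# Toy values (hypothetical, NOT the real tailgating data)
x <- c(15, 22, 28, 31, 33, 36, 41, 57, 350)

# With bins of width 100, almost everything falls into the first bin
h100 <- hist(x, breaks = seq(0, 400, 100), plot = FALSE)
h100$counts        # 8 0 0 1

# With bins of width 10, the same range is split into 40 bins
h10 <- hist(x, breaks = seq(0, 400, 10), plot = FALSE)
length(h10$counts) # 40
```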
You can now see that most of the data follow a roughly bell-shaped distribution, but there are some outliers that cause the data to be right skewed. The histogram with a bin size of 10 allows us to see an appropriate amount of detail in the plot.
The great thing about R is there are many options to customize your figures. Below is code for the same figure, but we have added arguments to customize the x and y labels (xlab and ylab, respectively). The main title argument (main) allows you to create a title, or you can omit the default title by setting it to the empty string "". There are also many color options in R. The col argument allows you to color the bars.
#customized labels, solid color & white border
hist(tailgating$Distance, col= "pink", border="white", breaks = seq(1, 400, 10),
xlab="Distance",
ylab="Frequency",
main = "")
Now let’s visualize distance broken down by drug group. We can use the “par” function to view additional plots in one window.
par(mfrow=c(2,2)) # view all four histograms in a 2 by 2 window
hist(tailgating$Distance[tailgating$Group=="ALC"], col= "yellow", breaks = seq(1, 400, 10),
main = "", xlab = "ALC")
hist(tailgating$Distance[tailgating$Group=="MDMA"], col= "red", breaks = seq(1, 400, 10),
main = "", xlab = "MDMA")
hist(tailgating$Distance[tailgating$Group=="NODRUG"], col= "blue", breaks = seq(1, 400, 10),
main = "", xlab = "NoDrug")
hist(tailgating$Distance[tailgating$Group=="THC"], col= "green", breaks = seq(1, 400, 10),
main = "", xlab = "THC")
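When the four calls differ only in the group name, a loop is one way to avoid the repetition. This is a sketch on made-up data (the data frame `toy` and its groups "A" to "D" are hypothetical). It also saves the old par() settings so the layout can be restored afterwards.

```r
# Toy data (hypothetical values, NOT the real tailgating data)
toy <- data.frame(
  Distance = c(20, 30, 40, 25, 35, 90, 22, 28, 31, 45, 50, 60),
  Group    = rep(c("A", "B", "C", "D"), each = 3)
)

old_par <- par(mfrow = c(2, 2))   # par() returns the previous settings
for (g in unique(toy$Group)) {
  # One histogram per group, all sharing the same break points
  hist(toy$Distance[toy$Group == g],
       breaks = seq(0, 100, 10),
       main = "", xlab = g)
}
par(old_par)                      # restore the original layout
```

Using the same breaks in every panel keeps the x-axes comparable across groups.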
We can also plot this data using a box plot.
boxplot(tailgating$Distance ~ tailgating$Group, col=rainbow(4))
Although it is good to know there are outliers, the large range can make it difficult to see the boxes. We can hide the outliers for a better look by using the argument “outline=FALSE”. This only leaves them out of the plot; they remain in the data.
boxplot(tailgating$Distance ~ tailgating$Group, col= rainbow(4), outline=FALSE)
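Remember that the boxplot is built from the 0th, 25th, 50th, 75th, and 100th percentiles. Note that when R draws outliers as separate points, the whiskers stop at the most extreme values that are not flagged as outliers. The following section shows how to find quantiles directly.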
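One way to check exactly what boxplot() draws is boxplot.stats(), which returns the five plotted values and the points treated as outliers. A minimal sketch on made-up numbers (the vector `x` is hypothetical):

```r
# Toy values (hypothetical): four typical values and one extreme value
x <- c(1, 2, 3, 4, 100)

b <- boxplot.stats(x)
b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker: 1 2 3 4 4
b$out    # points drawn individually as outliers: 100

range(x) # the true minimum and maximum: 1 100
```

Here the upper whisker ends at 4 rather than at the maximum of 100, because 100 lies more than 1.5 times the box length above the box.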
As an FYI, you can find specific quantiles of interest using the quantile function (the second argument is the quantile you would like, given as a proportion between 0 and 1).
quantile(tailgating$Distance, 0.30)
## 30%
## 28.15008
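The second argument can also be a vector, which returns several quantiles in one call. A short sketch on made-up values (the vector `x` is hypothetical):

```r
# Toy values (hypothetical): the integers 1 through 11
x <- 1:11

# probs can hold several proportions at once
q <- quantile(x, probs = c(0.3, 0.5, 0.9))
print(q)   # 30% = 4, 50% = 6, 90% = 10

# The result is a named vector, so one quantile can be picked out by name
q["50%"]
```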
State whether each of the following scenarios is an observational or an experimental study design.
Assume that we are interested in assessing the effects of mothers’ smoking habits during pregnancy on the weight of their babies. To assess this question, we collected data including the mothers’ smoking habits during pregnancy and the birth weight of their babies from medical records.
Assume that we are interested in assessing whether students perform better on exams if they drink a caffeinated beverage the morning of the exam. We randomize students into two groups - one group that drinks caffeinated coffee and one that drinks decaf coffee on the day of the exam. We then assess exam performance for each group using a predetermined metric.
A Type I error is committed when a true null hypothesis is rejected. In other words, a Type I error is rejecting the null hypothesis when the null is in fact true. In terms of disease detection (where the null hypothesis is no disease), this is a false positive.
The Type I error rate is the proportion of true null hypotheses that were rejected.
A Type II error is committed when a false null hypothesis is not rejected. In other words, a Type II error is failing to reject the null hypothesis when the null is in fact false. In terms of disease detection, this is a false negative.
The Type II error rate is the proportion of false null hypotheses that failed to be rejected.
The false discovery rate is the fraction of null hypothesis rejections that were incorrect.
 | True Null | False Null | Total |
---|---|---|---|
Don’t Reject | | | |
Reject | | | |
Total | | | |
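The three rates defined above can be read off a filled-in table of this shape. The sketch below uses counts that are entirely made up for illustration (they are not from any study in this lab), just to show which cells go in each numerator and denominator.

```r
# Hypothetical counts (made up for illustration only)
counts <- matrix(c(90, 20,
                   10, 80),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Decision = c("Don't Reject", "Reject"),
                                 Truth    = c("True Null", "False Null")))
addmargins(as.table(counts))  # adds the row and column totals

# Type I error rate: rejected true nulls / all true nulls
type1_rate <- counts["Reject", "True Null"] / sum(counts[, "True Null"])          # 10 / 100

# Type II error rate: non-rejected false nulls / all false nulls
type2_rate <- counts["Don't Reject", "False Null"] / sum(counts[, "False Null"])  # 20 / 100

# False discovery rate: rejected true nulls / all rejections
fdr <- counts["Reject", "True Null"] / sum(counts["Reject", ])                    # 10 / 90
```

Note that the Type I error rate and the false discovery rate share a numerator but divide by different totals: a column total for the former and a row total for the latter.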
Group | Perfect.Responses | Total.Responses | Perfect_Responses | Total_Responses |
---|---|---|---|---|
Sitting | 71 | 134 | 88 | 104 |
Standing | 94 | 142 | 57 | 90 |
Is this a controlled experiment or observational study?
Which group had a higher success rate (standing or sitting)?
Which type of consumption (eating or drinking) had a higher success rate?
Instead of random sampling, certain subgroups of the population were more likely to be included than others.
Nonresponders can differ from responders in many important ways.
Anytime the study violates the principle of generalizing to the population that the sample was drawn from.
In each of the following examples, determine which bias(es), if any, may be present.
Doctors want to investigate whether Tylenol performs better than Ibuprofen in curing headaches. They design an experiment in which participants are randomly assigned to one of the two treatments. The patients are blinded to which treatment they receive.
A parent-teacher association at an affluent school in Chicago, Illinois wanted to study how pervasive the drug culture was among high school students in the United States. To answer this question, they handed out a survey to their high school students at a school assembly.
An investigator recently conducted a survey to study how long New Year’s resolutions last. The individuals who were still on track with their goals were more likely to respond.