Today we will be using the tailgating dataset. A description of this study can be found here:
https://iowabiostat.github.io/data-sets/tailgating/tailgating.html
The data can be loaded into R using the following code.
tailgating <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/tailgating/tailgating.txt')
We have already learned how to compute some summary statistics in R, but today we will learn how to visualize the distribution of continuous data. First, let’s take a look at the summary of distance.
summary(tailgating$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.82 27.57 32.49 41.01 39.52 356.96
The summary gives us most of the information we would like to know. How about the standard deviation? Use the function ‘sd’.
sd(tailgating$Distance)
## [1] 44.16035
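As a quick check on what ‘sd’ computes, the sketch below reproduces it by hand on a small made-up vector (`x` is hypothetical, not taken from the tailgating data). The sample standard deviation is the square root of the sum of squared deviations from the mean, divided by n - 1.

```r
# Hypothetical toy data, not from the tailgating study
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)

# Sample standard deviation: note the n - 1 in the denominator
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

manual_sd
sd(x)                        # should match the manual calculation
all.equal(manual_sd, sd(x))
```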
Now let’s look at distance by drug group. The ‘by’ function allows us to apply a function to a variable after splitting it into groups.
by(tailgating$Distance, tailgating$Group, summary)
## tailgating$Group: ALC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.89 28.83 35.42 36.83 40.21 68.34
## ------------------------------------------------------------
## tailgating$Group: MDMA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.01 22.32 26.83 27.61 28.46 56.61
## ------------------------------------------------------------
## tailgating$Group: NODRUG
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.70 28.80 33.37 47.33 43.57 356.96
## ------------------------------------------------------------
## tailgating$Group: THC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.82 27.75 31.90 42.61 39.52 346.72
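If only one statistic per group is needed, `tapply` is a compact alternative to ‘by’. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) rather than the tailgating data; the same pattern should carry over to `tailgating$Distance` and `tailgating$Group`.

```r
# Hypothetical toy data standing in for the tailgating data frame
toy <- data.frame(
  Distance = c(30, 34, 38, 22, 26, 30),
  Group    = c("ALC", "ALC", "ALC", "MDMA", "MDMA", "MDMA")
)

# Apply a single function within each group
group_means <- tapply(toy$Distance, toy$Group, mean)
group_sds   <- tapply(toy$Distance, toy$Group, sd)

group_means  # one mean per group
group_sds    # one standard deviation per group
```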
As discussed in class, we can visualize the distribution of distance using a histogram. Here is how to do this in R:
hist(tailgating$Distance)
We could add the mean and median to the plot using the function ‘abline’ to add lines. Let’s see how the two compare.
hist(tailgating$Distance, main = "Histogram of Tailgating Distance",
     xlab = "Following Distance")
abline(v = mean(tailgating$Distance),
       col = "steelblue4", lwd = 2) # lwd makes the line thicker (line width)
abline(v = median(tailgating$Distance), col = "orange",
       lwd = 2, lty = 2) # lty makes the line dashed (line type)
legend(x = "topright",
       legend = c("Median", "Mean"),
       col = c("orange", "steelblue4"),
       lwd = 2, lty = c(2, 1))
# the mean is the solid line and the median is the dashed line
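The summary above reported a mean (41.01) well above the median (32.49). A small made-up example (the vectors `typical` and `with_outlier` are hypothetical) illustrates why: a single extreme value pulls the mean toward it while the median barely moves.

```r
# Hypothetical following distances; the last value added is an extreme outlier
typical      <- c(20, 25, 30, 35, 40)
with_outlier <- c(typical, 350)

mean(typical)         # 30
median(typical)       # 30
mean(with_outlier)    # about 83.3, pulled toward the outlier
median(with_outlier)  # 32.5, barely changes
```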
The bin size (increments seen on the x-axis) can impact how our data looks. If these bins are large, we might not be able to see our data in detail. Above, our bin size is pretty large because the range is vast. Let’s see how the data looks if we change the bin size using the argument ‘breaks’.
As a refresher, the seq() function returns a sequence of numbers: the first argument is the minimum, the second argument is the maximum, and the third argument is the increment (for our purposes, the bin size).
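For instance, with small illustrative numbers (chosen here only for demonstration):

```r
small_breaks <- seq(0, 20, 5)
small_breaks                       # 0 5 10 15 20

# Break points from 0 to 400 in steps of 10
n_breaks <- length(seq(0, 400, 10))
n_breaks                           # 41 break points, which define 40 bins
```

The breaks passed to hist must cover the full range of the data; otherwise hist stops with an error.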
hist(tailgating$Distance, breaks = seq(0, 400, 2)) # bin size of 2
hist(tailgating$Distance, breaks = seq(0, 400, 100)) # bin size of 100
hist(tailgating$Distance, breaks = seq(0, 400, 10)) # bin size of 10
You can now see that most of the data are roughly bell-shaped, but a few large outliers make the distribution right-skewed. The histogram with a bin size of 10 shows an appropriate amount of detail.
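To see exactly how the bin size changes what is counted, hist can return the bin counts without drawing anything by setting `plot = FALSE`. The sketch below uses a made-up vector (`x` is hypothetical), not the tailgating data.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 11, 12, 25)

wide   <- hist(x, breaks = seq(0, 30, 10), plot = FALSE)  # bin size of 10
narrow <- hist(x, breaks = seq(0, 30, 5),  plot = FALSE)  # bin size of 5

wide$counts    # 3 2 1
narrow$counts  # 3 0 2 0 1 0
```

By default each bin includes its right endpoint, so the value 25 falls in the bin from 20 to 25 rather than the bin from 25 to 30.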
The great thing about R is that there are many options to customize your figures. Below is code for the same figure, but with added arguments to customize the x- and y-axis labels (xlab and ylab, respectively). The main argument sets the title; to omit the default title, use main = "". There are also many color options in R: the col argument colors the bars, and the border argument colors their outlines.
#customized labels, solid color & white border
hist(tailgating$Distance, col= "pink", border="white", breaks = seq(1, 400, 10),
xlab="Distance",
ylab="Frequency",
main = "")
Now let’s visualize distance broken down by drug group. We can use the “par” function to show multiple plots in one window.
par(mfrow=c(2,2)) # view all four histograms in a 2 by 2 window
hist(tailgating$Distance[tailgating$Group=="ALC"], col= "yellow", breaks = seq(1, 400, 10),
main = "", xlab = "ALC")
hist(tailgating$Distance[tailgating$Group=="MDMA"], col= "red", breaks = seq(1, 400, 10),
main = "", xlab = "MDMA")
hist(tailgating$Distance[tailgating$Group=="NODRUG"], col= "blue", breaks = seq(1, 400, 10),
main = "", xlab = "NoDrug")
hist(tailgating$Distance[tailgating$Group=="THC"], col= "green", breaks = seq(1, 400, 10),
main = "", xlab = "THC")
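The layout set by “par” stays in effect for later plots until it is changed back. One possible pattern, sketched here on a made-up data frame (`toy` and its values are hypothetical), is to save the old settings, draw the panels in a loop, and then restore the settings.

```r
# Hypothetical toy data standing in for the tailgating data frame
toy <- data.frame(
  Distance = c(30, 34, 38, 41, 22, 26, 30, 33),
  Group    = rep(c("ALC", "MDMA"), each = 4)
)

# par() returns the previous settings, so they can be restored later
old_par <- par(mfrow = c(1, 2))  # 1 by 2 layout

for (g in unique(toy$Group)) {
  hist(toy$Distance[toy$Group == g],
       breaks = seq(0, 50, 10),  # same breaks in every panel so the x-axes are comparable
       main = "", xlab = g)
}

par(old_par)  # back to the previous layout for later figures
```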
We can also plot this data using a box plot. First we can make a single plot across all groups.
boxplot(tailgating$Distance,
ylab = "Distance")
Then, we can make side by side boxplots for each group.
boxplot(tailgating$Distance ~ tailgating$Group,
col=rainbow(4),
xlab = "Group",
ylab = "Distance")
Although it is good to know there are outliers, the large range can make it difficult to see the boxes. We can hide the outliers for a better look by using the argument “outline=FALSE”. This only stops them from being drawn; the data themselves are unchanged.
boxplot(tailgating$Distance ~ tailgating$Group, col= rainbow(4), outline=FALSE)
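Since hiding the outliers removes them from view, it can help to list them separately. A minimal sketch using `boxplot.stats` on a made-up vector (`x` is hypothetical):

```r
# Hypothetical following distances with one extreme value
x <- c(20, 25, 30, 35, 40, 350)

stats <- boxplot.stats(x)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # 350, the point that would be drawn as an outlier
```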
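Remember that a boxplot is based on the 0th, 25th, 50th, 75th, and 100th percentiles. In R’s default boxplot, the whiskers stop at the most extreme observations within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual point.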
As an FYI, you can find any quantile of interest using the quantile function; the second argument is the quantile you would like, given as a proportion between 0 and 1.
quantile(tailgating$Distance, 0.30)
## 30%
## 28.15008
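The second argument can also be a vector, which returns several quantiles at once. A small sketch on made-up data (`x` is hypothetical):

```r
# Hypothetical toy data: the integers 1 through 9
x <- 1:9

q <- quantile(x, c(0.25, 0.50, 0.75))
q                      # 3, 5, 7 under R's default method

quantile(x, 0.50) == median(x)  # the 50% quantile is the median
```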
State whether each of the following scenarios is an observational or an experimental study design.
A Type I error is committed when a true null hypothesis is rejected. In other words, a Type I error means rejecting the null hypothesis when the null is in fact true. In terms of disease detection (where the null hypothesis is no disease), this is a false positive.
The Type I error rate is the proportion of true null hypotheses that were rejected.
A Type II error is committed when a false null hypothesis is not rejected. In other words, a Type II error means failing to reject the null hypothesis when the null is in fact false. In terms of disease detection, this is a false negative.
The Type II error rate is the proportion of false null hypotheses that were not rejected.
The false discovery rate is the fraction of null hypothesis rejections that were incorrect.
Instead of random sampling, certain subgroups of the population were more likely to be included than others.
Nonresponders can differ from responders in many important ways.
Anytime the study violates the principle of generalizing to the population that the sample was drawn from.
In each of the following examples, determine which bias(es), if any, may be present.
Doctors want to investigate whether Tylenol performs better than Ibuprofen in curing headaches. They design a controlled experiment in which participants are randomly assigned to one of the two treatments. The patients are blinded to which treatment they receive.
A parent-teacher association at an affluent school in Chicago, Illinois wanted to study how pervasive the drug culture was among high school students in the United States. To answer this question, they handed out a survey to their high school students at a school assembly.
| | True Null | False Null | Total |
|---|---|---|---|
| Don’t Reject | | | |
| Reject | | | |
| Total | | | |
| | True Null | False Null | Total |
|---|---|---|---|
| Don’t Reject | 630 | 20 | 650 |
| Reject | 70 | 80 | 150 |
| Total | 700 | 100 | 800 |
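Applying the definitions of the Type I error rate, Type II error rate, and false discovery rate to the counts in the filled-in table above gives the following. This is a sketch; the variable names are made up for illustration.

```r
# Counts copied from the filled-in table above
true_null_rejected      <- 70
true_null_total         <- 700
false_null_not_rejected <- 20
false_null_total        <- 100
total_rejected          <- 150

# Proportion of true null hypotheses that were rejected
type1_rate <- true_null_rejected / true_null_total        # 70 / 700 = 0.10

# Proportion of false null hypotheses that were not rejected
type2_rate <- false_null_not_rejected / false_null_total  # 20 / 100 = 0.20

# Fraction of rejections that were incorrect
fdr <- true_null_rejected / total_rejected                # 70 / 150, about 0.467

c(type1 = type1_rate, type2 = type2_rate, fdr = fdr)
```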
Group | Perfect.Responses | Total.Responses | Perfect_Responses | Total_Responses |
---|---|---|---|---|
Sitting | 71 | 134 | 88 | 104 |
Standing | 94 | 142 | 57 | 90 |