In today’s lab we will:
For a bar chart, each column represents a group defined by a categorical variable. If you have data that is categorical (like we see in the ‘titanic’ dataset), you will want to use a bar chart to display information. A simple way to create barplots is to use the barplot function in R.
counts <- c(8, 15, 11, 16)
barplot(height = counts,
names.arg = c("I", "II", "III", "IV"),
col = "blue",
main = "Distribution of Cancer Stage",
xlab = "Stage",
ylab = "Frequency")
You will notice that there are a lot of parameters here (height, names.arg, etc.). The best way to determine how each parameter is used is to use the “?” command in the console.
?barplot
Notice that in the documentation, a lot of the parameters have default values. Therefore, for the barplot function, “height” is the only required parameter; using all other parameters is optional.
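As a quick check of the point above, the sketch below calls barplot with only height and then looks up two of the defaults it relied on. Using getS3method() and formals() is just one way to peek at the defaults; the ?barplot help page lists the same information.

```r
counts <- c(8, 15, 11, 16)

# Only "height" is supplied; every other parameter falls back to its default
barplot(height = counts)

# Look up the defaults that the plain call above relied on
defaults <- formals(getS3method("barplot", "default"))
defaults$horiz   # FALSE: bars are drawn vertically
defaults$beside  # FALSE: a matrix or table is drawn as stacked bars
```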
There is no real limit on how many parameters you can put in one of these functions, but it is easiest to read if you separate them onto different lines.
See http://www.statmethods.net/advgraphs/parameters.html for point, color, and line options and their corresponding codes.
Note: Not all plots have the same arguments. The help function is a great way to see which arguments a function accepts.
Once you run your plotting command and the figure pops up, you can still add either lines or text to the plot using these functions:
# a: intercept
# b: slope
# h: the y-value(s) for horizontal line(s)
# v: the x-value(s) for vertical line(s)
abline(a, b, h, v)
# x: x coordinate for the text
# y: y coordinate for the text
# "Text": The text to be written
text(x, y, "Text")
These functions must be called after the plot function is called.
Note that to clear the previous lines/text, run the plot function again. All previous lines and text will remain on the graph until a new plot is made.
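To make the templates above concrete, here is a minimal sketch that adds a line and text to the cancer stage bar plot. It relies on barplot returning the x-coordinates of the bar midpoints; the choice to draw the line at the mean count and to label each bar with its count is only for illustration.

```r
counts <- c(8, 15, 11, 16)

# barplot() invisibly returns the x-coordinate of the middle of each bar
mids <- barplot(height = counts,
                names.arg = c("I", "II", "III", "IV"),
                ylim = c(0, 20),   # leave room above the tallest bar
                main = "Distribution of Cancer Stage")

# Horizontal dashed line at the mean count
abline(h = mean(counts), lty = 2)

# Write each count just above its bar (pos = 3 places text above the point)
text(x = mids, y = counts, labels = counts, pos = 3)
```

Storing the midpoints is helpful because the bars are not centered at 1, 2, 3, 4 on the x-axis, so guessing x-coordinates for text can put labels in the wrong place.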
The legend function in R can be used to add a legend to a plot. There are a variety of arguments that this legend function can take. Some of the most commonly used are described below.
counts <- c(8, 15, 11, 16)
barplot(height = counts,
names.arg = c("I", "II", "III", "IV"),
col = "pink",
main = "Distribution of Cancer Stage",
xlab = "Stage",
ylab = "Frequency")
abline(h = c(8, 15), lty = 2, lwd = 4, col = c("blue", "red"))
stages <- c("Stage 1 Freq", "Stage 2 Freq")
legend(x = "topleft",
legend = stages,
lty = 2,
lwd = 4,
col = c("blue", "red"))
(The usage of a legend and lines on this specific plot isn’t particularly useful to us, but we do it anyway to show how you can add these elements to a plot).
Usually you can right-click on a plot in RStudio or R to copy it and then paste it into a Word/Google document. You can also click “Export” in the plots window and choose “Save as Image” to save it to your folder or “Copy to Clipboard” to paste it into your own Word document. If this is giving you trouble, you can always save a plot to a pdf using the form below (after you have set a working directory):
# setwd("H:/HawkID/BIOS4120Labs") # set working directory to the folder where you want the image saved
pdf("Name_of_Plot.pdf", height = 3.5, width = 5.25)
## INSERT PLOT CODE HERE
dev.off()
This will save a pdf of the plot in your current working directory.
This comes mostly as an “FYI” for your upcoming homework 3. You will be expected to make stacked barplots for a certain dataset. For more experienced R programmers, ggplot is the easiest way to do this. We will refer to the supplemental lab document to briefly introduce and show example plots using ggplot. However, this section will introduce how to plot in this way with the titanic dataset in base R.
titanic <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/titanic/titanic.txt')
There is one major difference in the code: height now takes a table rather than a vector. The first argument to table() defines the stacked segments within each bar, and the second argument defines the bars (columns) themselves.
barplot(height = table(titanic$Survived, titanic$Class),
col = c("blue", "red"),
main = "Survival Between Classes")
legend(x = "topleft",
legend = c("Died", "Survived"),
title = "Survival Status",
fill = c("blue", "red"))
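To see why the first variable becomes the stacks and the second becomes the bars, it can help to print a small table before plotting it. The sketch below uses a hypothetical toy data frame (not the titanic file) so that the counts are easy to verify by hand.

```r
# Hypothetical toy data (not the titanic file), just to show the table layout
toy <- data.frame(
  Survived = c("No", "No", "Yes", "Yes", "No", "Yes"),
  Class    = c("1st", "2nd", "1st", "1st", "2nd", "2nd")
)

tab <- table(toy$Survived, toy$Class)
tab           # rows: No / Yes (the stacks); columns: 1st / 2nd (the bars)
colSums(tab)  # total height of each bar

# Each column of the table becomes one bar; each row becomes one colored segment
barplot(height = tab, col = c("blue", "red"))
```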
Now, let’s plot the graph separately for each sex, so 2 graphs total. If you want to view more than 1 graph in the same plotting window, you can use the following code to make a 1x2 panel (1 row, 2 columns):
par(mfrow=c(1,2)) # To turn it back to normal, run par(mfrow = c(1,1))
Now that we separated the panel into a 1x2 plane, we can see the following graphs side by side:
par(mfrow=c(1,2))
# Bar plot for the females
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,1],
col = c("blue", "red"),
main = "Female Survival vs Classes")
legend(x = "topleft",
legend = c("Died", "Survived"),
title = "Survival Status",
fill = c("blue", "red"),
cex = 0.8) # shrinks size of legend
# Bar plot for the males
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,2],
col = c("blue", "red"),
main = "Male Survival vs Classes")
legend(x = "topleft",
legend = c("Died", "Survived"),
title = "Survival Status",
fill = c("blue", "red"),
cex = 0.8)
There is a close relationship between confidence intervals and hypothesis testing. All values within a constructed 95% interval are considered “plausible” values for the parameter that we are estimating. Values outside the interval are rejected as unlikely and improbable.
If you were to repeat the process of creating a confidence interval an infinite number of times, 95% of the interval estimates for \(\mu\) will contain the true parameter value, \(\mu\). We treat the population mean \(\mu\) as being fixed. Any particular interval may or may not contain the true population mean \(\mu\).
We say that we are “95% confident” that the interval contains the true population \(\mu\) because the procedure used to construct this interval produces a correct interval estimate 95% of the time.
We DO NOT say there is a 95% probability that \(\mu\) lies between these two values. (\(\mu\) is fixed)
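The repeated-sampling interpretation can be checked with a small simulation. This is only a sketch under assumed settings: the true mean, standard deviation, sample size, number of repetitions, and seed are all arbitrary choices made for illustration.

```r
set.seed(4120)   # arbitrary seed so the run is reproducible
mu <- 10         # the "true" mean, known only because we simulate the data
n_reps <- 2000   # number of repeated samples

# For each repetition: draw a sample, build a 95% interval, check if it covers mu
covers <- replicate(n_reps, {
  x  <- rnorm(25, mean = mu, sd = 3)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

mean(covers)  # proportion of intervals containing mu; should be close to 0.95
```

Each individual interval either contains 10 or it does not; the 95% describes how often the procedure succeeds across the repetitions.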
In class, you learned that there are a lot of wrong ways to think about the hypothesis testing process. The courtroom is a helpful example that illustrates the correct usage of p-values and hypothesis tests. Let’s look at it in terms of “innocent until proven guilty”: As the person analyzing data, you are the judge. The hypothesis test is the trial, and the null hypothesis is the defendant.
If the evidence presented doesn’t prove the defendant is guilty beyond a reasonable doubt, you still have not proved that the defendant is innocent. (We never say that we accept the null hypothesis)
So how would that verdict be announced? It enters the court record as “Not guilty.” That phrase is perfect: “Not guilty” doesn’t mean the defendant is innocent, because that has not been proven. It just means the prosecution couldn’t prove its case to the necessary, “beyond a reasonable doubt” standard. It failed to convince the judge to abandon the assumption of innocence.
If you follow that rationale, then you can see that “failure to reject the null” is just the statistical equivalent of “not guilty.” In a trial, the burden of proof falls to the prosecution. When analyzing data, the entire burden of proof falls to the sample data you’ve collected. This is why our sampling procedure is so important. Just as “not guilty” is not the same thing as “innocent,” neither is “failing to reject” the same as “accepting” the null hypothesis.
This method of thinking about hypothesis tests will come in handy when we start formally testing our own hypotheses.
If the value of the parameter specified by the null hypothesis (for instance \(H_o: \mu = 0\)) is contained within the 95% confidence interval, then the null hypothesis cannot be rejected at the 0.05 level. If the value specified by the null hypothesis is not in the interval, then the null hypothesis can be rejected at the 0.05 level. Likewise, for a 99% confidence interval, if the value specified by the null hypothesis is in the interval, then the null hypothesis cannot be rejected at the 0.01 level.
*Disclaimer: though these concepts have a strong relationship, one method is not a substitute for the other. On future homework and quizzes, you will need to know how to do both methods.
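The link between the interval and the test can be seen with a one-sample t-test, which reports both a confidence interval and a p-value. The sketch below uses simulated data with arbitrary settings, so the specific numbers carry no meaning; the point is only that the two decision rules agree.

```r
set.seed(1)  # arbitrary seed
x <- rnorm(30, mean = 0.8, sd = 2)  # simulated sample, for illustration only

res <- t.test(x, mu = 0, conf.level = 0.95)  # tests Ho: mu = 0

# Rule 1: is the null value inside the 95% confidence interval?
null_in_ci <- res$conf.int[1] <= 0 && 0 <= res$conf.int[2]
# Rule 2: is the p-value below 0.05?
reject <- res$p.value < 0.05

res$conf.int
res$p.value
null_in_ci != reject  # TRUE: 0 is inside the interval exactly when we fail to reject
```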
In lab last week we worked with the titanic dataset. Today we want to know whether sex played a significant role in the survival rates of the passengers on board. Therefore, we want to compare survival rates between males and females.
What is the null hypothesis for this study on the ‘titanic’ dataset?
Say, for example, that we have the null hypothesis \(H_o: \mu_{female} = 0.5\) and obtain a 95% confidence interval of (0.415, 0.481). Remember that the interpretation of this confidence interval is that we are 95% confident that the true population \(\mu\) lies within this interval. Would we reject or retain the null hypothesis?
Suppose we have a test with an alpha level of 0.05. If we find a p-value of 0.03, we can reject the null hypothesis.
From our results in question 3, the 95% confidence interval would contain the value specified by the null hypothesis.
Suppose we have a test with an alpha level of 0.01. If we find a p-value of 0.03, we can reject the null hypothesis.
From our results in question 5, the 99% confidence interval would contain the value specified by the null hypothesis.
Ho: average survival rate for females = average survival rate for males
We are interested in the number of people who eat at restaurants in downtown Iowa City. On the night of this study, there were 16 people at Shorts, 21 at Donnelly’s Pub, 10 at Blue Moose, and 13 at Joe’s Place.
Hint: Use the ‘c()’ function
people <- c(16, 21, 10, 13)
par(mfrow = c(1,1)) # turns plot region back to "normal"
barplot(people)
barplot(people,
main = "Iowa City Night Life")
barplot(people,
main = "Iowa City Night Life",
ylim = c(0,25))
To do this first create a vector named “restaurants” that contains the names of the restaurants.
## Create a vector named "restaurants" that contains the names of the restaurants.
restaurants <- c("Shorts", "Donnelly's Pub", "Blue Moose", "Joe's Place")
Now add the bar categories and axis labels
## Add the bar categories to the bar plot
barplot(people,
main = "Iowa City Night Life",
ylim = c(0,25),
names.arg = restaurants)
## Label the x and y-axis
barplot(people,
main = "Iowa City Night Life",
ylim = c(0,25),
names.arg = restaurants,
ylab = "Number of People",
xlab = "Restaurants in Iowa City")
Let’s say that the threshold of having fun is 10 people.
# Add a green dashed line (lty = 2) at the threshold
barplot(people,
main = "Iowa City Night Life",
ylim = c(0,25),
names.arg = restaurants,
ylab = "Number of People",
col = "orange")
abline(h = 10, col = "green", lty = 2, lwd = 2)
## Label the line
text(x = 3, y = 11, labels = "Fun Threshold")
barplot(people,
main = "Iowa City Night Life",
ylim = c(0,25),
names.arg = restaurants,
ylab = "Number of People",
col = "orange")
abline(h = 10, col = "green", lty = 2, lwd = 2)
legend(x = "topright",
legend = "Fun Threshold",
lty = 2,
lwd = 2,
col = "green")
Hint: Use the “pdf” and “dev.off” functions.
pdf("Bar-Plot-01.pdf", height = 3.5, width = 7.25)
barplot(people,
main = "Iowa City Night Life",
ylim = c(0,25),
names.arg = restaurants,
ylab = "Number of People",
col = "orange")
abline(h = 10, col = "green", lty = 2)
legend(x = "topright",
legend = "Fun Threshold",
lty = 2,
col = "green")
dev.off()