Objectives

In today’s lab we will:

  1. Use RStudio to create graphics on categorical data
  2. Learn how to make stacked bar plots
  3. Briefly introduce ggplot using the supplemental lab document
  4. Discuss the relationship between hypothesis testing and confidence intervals

Plotting Categorical Data

Bar Charts/Plots – ‘barplot()’

For a bar chart, each column represents a group defined by a categorical variable. If you have data that is categorical (like we see in the ‘titanic’ dataset), you will want to use a bar chart to display information. A simple way to create barplots is to use the barplot function in R.

counts <- c(8, 15, 11, 16)
barplot(height = counts, 
        names.arg = c("I", "II", "III", "IV"), 
        col = "blue", 
        main = "Distribution of Cancer Stage", 
        xlab = "Stage",
        ylab = "Frequency")

Determining Parameters for Plotting Functions

You will notice that there are a lot of parameters here (height, names.arg, etc.). The best way to determine how each parameter is used is to use the “?” command in the console.

?barplot

Notice that in the documentation, a lot of the parameters have default values. Therefore, for the barplot function, “height” is the only required parameter; using all other parameters is optional.

There is no real limit on how many parameters you can put in one of these functions, but it is easiest to read if you separate them onto different lines.
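As a minimal sketch of the point about defaults, the call below supplies only height and lets every other parameter fall back to its default. The counts vector is the same illustrative one used above, and the name bar_midpoints is just a label chosen for this example.

```r
# Same illustrative counts as in the first example
counts <- c(8, 15, 11, 16)

# Only the required parameter is supplied; all others use their defaults
bar_midpoints <- barplot(height = counts)

# barplot() invisibly returns the x-coordinates of the bar midpoints,
# which can be useful later when placing text or lines over specific bars
bar_midpoints

# Print the parameter list (with defaults) of the default barplot method
args(getS3method("barplot", "default"))
```

Comparing this bare plot with the fully labeled one above is a quick way to see what each optional parameter contributes.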

Common Graphical Parameters

  • xlab, ylab – the label on the x-axis and y-axis, respectively
  • xlim, ylim – a vector representing the x and y limits, respectively
  • pch – an integer representing the type of plot points. You can also use a quoted character as the plot point, e.g. “|”
  • lty – an integer representing the type of line
  • main – sets a title for the plot
  • col – sets the color for points, lines, or graphics for your plot
  • names.arg – the names of the bars

See http://www.statmethods.net/advgraphs/parameters.html for point, color, and line options and their corresponding codes.

Note: Not all plots have the same arguments. The help function is a great way to see which arguments a function accepts.
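The parameters listed above can be tried out together in one small plot. The following is only a sketch: the week and score vectors are made-up values for illustration, and a scatter plot is used because pch and lty have no visible effect on a bar plot.

```r
# Made-up data, purely for illustration
week  <- 1:6
score <- c(3, 5, 4, 7, 6, 8)

plot(x = week, y = score,
     pch = 17,                 # filled triangles as plot points
     col = "darkgreen",        # color of the points
     xlim = c(0, 7),           # x-axis limits
     ylim = c(0, 10),          # y-axis limits
     xlab = "Week",
     ylab = "Score",
     main = "Graphical Parameters Demo")

# lines() adds a connecting line to the existing plot; lty = 2 is dashed
lines(x = week, y = score, lty = 2, col = "darkgreen")
```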

Adding Lines and Text to Plots

Once you run your plotting command and the figure pops up, you can still add either lines or text to the plot using these functions:

# a: intercept
# b: slope
# h: the y-value(s) for horizontal line(s)
# v: the x-value(s) for vertical line(s)
abline(a, b, h, v) 

# x: x coordinate for the text
# y: y coordinate for the text
# "Text": The text to be written
text(x, y, "Text")

These functions must be called after the plot function is called.

Note that to clear the previous lines/text, run the plot function again. All previous lines and text will remain on the graph until a new plot is made.
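Because the usage lines above are templates rather than runnable code, here is a small sketch with concrete values. The x and y vectors are made-up data, and the coordinates passed to text() were picked by eye for this particular plot.

```r
# Made-up data, purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# The plot must exist before anything can be added to it
plot(x, y, main = "abline() and text() Demo")

abline(a = 0, b = 2, col = "blue")        # line with intercept 0 and slope 2
abline(h = 6, lty = 2, col = "red")       # horizontal line at y = 6
abline(v = 3, lty = 3, col = "darkgray")  # vertical line at x = 3

# Label placed just above the horizontal line
text(x = 1.5, y = 6.4, labels = "y = 6")
```

Running plot(x, y) again would start a fresh figure without the added lines and text.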

Creating Legends

The legend function in R can be used to add a legend to a plot. There are a variety of arguments that this legend function can take. Some of the most commonly used are described below.

  • you can set x equal to a specified keyword (“bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right”) to indicate placement of the legend or you can use the x and y arguments to specify x and y coordinates for legend position.
  • the legend argument takes a character expression to be included in the legend
  • lwd takes an integer to indicate the line width
  • the col argument dictates the color of the points or lines that appear in the legend
counts <- c(8, 15, 11, 16)
barplot(height = counts, 
        names.arg = c("I", "II", "III", "IV"), 
        col = "pink", 
        main = "Distribution of Cancer Stage", 
        xlab = "Stage",
        ylab = "Frequency")
abline(h = c(8, 15), lty = 2, lwd = 4, col = c("blue", "red"))

stages <- c("Stage 1 Freq", "Stage 2 Freq")
legend(x = "topleft", 
       legend = stages,
       lty = 2,
       lwd = 4,
       col = c("blue", "red"))

(The usage of a legend and lines on this specific plot isn’t particularly useful to us, but we do it anyway to show how you can add these elements to a plot).

Saving Plots

Usually you can right-click on a plot in RStudio or R to copy it and then paste it into a Word/Google document. You can also click “Export” in the plots window and choose “Save as Image” to save it to a folder or “Copy to Clipboard” to paste it into a document. If this gives you trouble, you can always save a plot to a pdf using the form below (after you have set a working directory):

# setwd("H:/HawkID/BIOS4120Labs") # set working directory to the folder where you want the image saved

pdf("Name_of_Plot.pdf", height = 3.5, width = 5.25)

## INSERT PLOT CODE HERE

dev.off()

This will save a pdf of the plot in your current working directory.
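To make the template concrete, the sketch below saves the cancer stage bar plot from earlier. It writes to R's temporary directory so that it runs without a working directory being set; the file name stage_counts.pdf and the variable out_file are hypothetical choices for this example.

```r
# Same illustrative counts as in the earlier bar plot
counts <- c(8, 15, 11, 16)

# Hypothetical output path inside R's temporary directory
out_file <- file.path(tempdir(), "stage_counts.pdf")

# Open the pdf device; everything plotted until dev.off() goes into the file
pdf(out_file, height = 3.5, width = 5.25)

barplot(height = counts,
        names.arg = c("I", "II", "III", "IV"),
        main = "Distribution of Cancer Stage",
        xlab = "Stage",
        ylab = "Frequency")

# Close the device so the file is actually written
dev.off()

# Confirm that the file now exists
file.exists(out_file)
```

If dev.off() is forgotten, the pdf stays open and later plots keep going into the file instead of appearing in the plots window.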

Stacked Barplots

This comes mostly as an “FYI” for your upcoming homework 3. You will be expected to make stacked barplots for a certain dataset. For more experienced R programmers, ggplot is the easiest way to do this. We will refer to the supplemental lab document to briefly introduce and show example plots using ggplot. However, this section will introduce how to plot in this way with the titanic dataset in base R.

titanic <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/titanic/titanic.txt')

There is one major difference in the code: height now takes a table rather than a vector. The first argument to table() defines the stacked segments within each bar, and the second argument defines the bars (columns).

barplot(height = table(titanic$Survived, titanic$Class),
        col = c("blue", "red"),
        main = "Survival Between Classes")
legend(x = "topleft",
       legend = c("Died", "Survived"),
       title = "Survival Status",
       fill = c("blue", "red"))
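To see why the order of the arguments to table() matters, it can help to print the table before plotting it. The sketch below uses a tiny made-up data frame (toy, with hypothetical columns class and survived) rather than the titanic data, so the counts are easy to verify by hand.

```r
# Tiny made-up data set, purely for illustration
toy <- data.frame(
  class    = c("1st", "1st", "2nd", "2nd", "2nd", "3rd", "3rd", "3rd", "3rd"),
  survived = c("Yes", "Yes", "Yes", "No",  "No",  "No",  "No",  "No",  "Yes")
)

# First argument -> rows of the table -> stacked segments within each bar
# Second argument -> columns of the table -> one bar per column
tab <- table(toy$survived, toy$class)
tab

barplot(height = tab,
        col = c("blue", "red"),
        main = "Toy Example: Survival by Class")

# The legend labels follow the row order of the table
legend(x = "topleft",
       legend = rownames(tab),
       title = "Survived",
       fill = c("blue", "red"))
```

Swapping the two arguments to table() would instead give one bar per survival status, stacked by class.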

Now, let’s plot the graph separated by sex, so 2 graphs total. To view more than one graph in a single plotting window, you can use the following code to make a 1x2 panel (1 row, 2 columns):

par(mfrow=c(1,2)) # To turn it back to normal, run par(mfrow = c(1,1))

Now that we separated the panel into a 1x2 plane, we can see the following graphs side by side:

par(mfrow=c(1,2))

# Bar plot for the females
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,1],
        col = c("blue", "red"), 
        main = "Female Survival vs Classes")
legend(x = "topleft", 
       legend = c("Died", "Survived"), 
       title = "Survival Status", 
       fill = c("blue", "red"),
       cex = 0.8) # shrinks size of legend
# Bar plot for the males
barplot(height = table(titanic$Survived, titanic$Class, titanic$Sex)[,,2],
        col = c("blue", "red"), 
        main = "Male Survival vs Classes")
legend(x = "topleft", 
       legend = c("Died", "Survived"), 
       title = "Survival Status", 
       fill = c("blue", "red"),
       cex = 0.8)
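Indexing the third dimension by position ([,,1] and [,,2]) relies on knowing the order of the levels. A slightly more defensive sketch is shown below on made-up data: it looks up the level names with dimnames() and indexes by name, and it restores the panel layout afterwards using the value returned by par(). The toy data frame and its level names ("Female", "Male") are assumptions for this example and may not match the labels in the titanic data.

```r
# Tiny made-up data set, purely for illustration
toy <- data.frame(
  survived = c("No", "Yes", "Yes", "No", "No", "Yes", "No", "No"),
  class    = c("1st", "1st", "2nd", "2nd", "1st", "2nd", "2nd", "1st"),
  sex      = c("Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male")
)

# Three-way table: survival status x class x sex
tab3 <- table(toy$survived, toy$class, toy$sex)

# Names of the slices along the third dimension
sex_levels <- dimnames(tab3)[[3]]
sex_levels

# par() returns the previous settings, so they can be restored later
old_par <- par(mfrow = c(1, 2))

for (s in sex_levels) {
  barplot(height = tab3[, , s],        # slice selected by name, not position
          col = c("blue", "red"),
          main = paste(s, "Survival vs Class"))
}

# Return to the previous panel layout
par(old_par)
```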

Hypothesis Testing and Confidence Intervals

There is a close relationship between confidence intervals and hypothesis testing. All values within a constructed 95% interval are considered “plausible” values for the parameter that we are estimating. Values outside the interval are rejected as implausible.

Interpreting Confidence Intervals

If you were to repeat the process of creating a confidence interval an infinite number of times, 95% of the interval estimates for \(\mu\) will contain the true parameter value, \(\mu\). We treat the population mean \(\mu\) as being fixed. Any particular interval may or may not contain the true population mean \(\mu\).

  • We say that we are “95% confident” that the interval contains the true population \(\mu\) because the procedure used to construct this interval produces a correct interval estimate 95% of the time.

  • We DO NOT say there is a 95% probability that \(\mu\) lies between these two values. (\(\mu\) is fixed)

Visualization
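One way to picture the repeated-sampling interpretation is with a small simulation. The sketch below is not part of the original lab materials: it assumes a normal population with a made-up true mean (true_mu) and standard deviation (sigma), draws many samples, builds a 95% interval from each with t.test(), and records how often the interval contains the true mean.

```r
set.seed(4120)        # arbitrary seed so the results are reproducible

true_mu <- 10         # assumed "true" population mean (made up)
sigma   <- 2          # assumed population standard deviation (made up)
n       <- 25         # sample size for each simulated study
n_sims  <- 1000       # number of repeated samples

lower   <- numeric(n_sims)
upper   <- numeric(n_sims)
covered <- logical(n_sims)

for (i in seq_len(n_sims)) {
  x  <- rnorm(n, mean = true_mu, sd = sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  lower[i]   <- ci[1]
  upper[i]   <- ci[2]
  covered[i] <- ci[1] <= true_mu && true_mu <= ci[2]
}

# Proportion of intervals that contain the true mean (should be near 0.95)
mean(covered)

# Draw the first 50 intervals; those that miss the true mean are red
show <- 1:50
plot(NA,
     xlim = range(lower[show], upper[show]),
     ylim = c(1, 50),
     xlab = "Interval estimate for the mean",
     ylab = "Sample number",
     main = "95% Confidence Intervals from Repeated Samples")
segments(x0 = lower[show], y0 = show, x1 = upper[show], y1 = show,
         col = ifelse(covered[show], "blue", "red"))
abline(v = true_mu, lty = 2)   # the fixed true mean
```

Each horizontal segment is one interval. The dashed vertical line does not move between samples, which reflects the idea that the mean is fixed and the intervals are what vary.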

Hypothesis Testing Analogy - “Null Until Proven Otherwise”

In class, you learned that there are a lot of wrong ways to think about the hypothesis testing process. The courtroom is a helpful example that illustrates the correct usage of p-values and hypothesis tests. Let’s look at it in terms of “innocent until proven guilty”: As the person analyzing data, you are the judge. The hypothesis test is the trial, and the null hypothesis is the defendant.

If the evidence presented doesn’t prove the defendant is guilty beyond a reasonable doubt, you still have not proved that the defendant is innocent. (We never say that we accept the null hypothesis)

So how would that verdict be announced? It enters the court record as “Not guilty.” That phrase is perfect: “Not guilty” doesn’t mean the defendant is innocent, because that has not been proven. It just means the prosecution couldn’t prove its case to the necessary, “beyond a reasonable doubt” standard. It failed to convince the judge to abandon the assumption of innocence.

If you follow that rationale, then you can see that “failure to reject the null” is just the statistical equivalent of “not guilty.” In a trial, the burden of proof falls to the prosecution. When analyzing data, the entire burden of proof falls to the sample data you’ve collected. This is why our sampling procedure is so important. Just as “not guilty” is not the same thing as “innocent,” neither is “failing to reject” the same as “accepting” the null hypothesis.

This method of thinking about hypothesis tests will come in handy when we start formally testing our own hypotheses.

Source: http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis

Relationship between Confidence Intervals & Hypothesis Testing

If the value of the parameter specified by the null hypothesis (for instance \(H_o: \mu = 0\)) is contained within the 95% interval, then the null hypothesis cannot be rejected at the 0.05 level. If the value specified by the null hypothesis is not in the interval, then the null hypothesis can be rejected at the 0.05 level. Likewise, for a 99% confidence interval, if the value specified by the null hypothesis is in the interval, then the null hypothesis cannot be rejected at the 0.01 level.

*Disclaimer: though these concepts have a strong relationship, one method is not a substitute for the other. On future homework and quizzes, you will need to know how to do both methods.
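The correspondence described above can be checked numerically for a one-sample t-test. The following is a sketch on simulated data (the sample x and the helper function check_agreement are made up for this example): for each confidence level, it asks whether the null value is inside the interval and whether the p-value falls below the matching significance level.

```r
set.seed(2024)                          # arbitrary seed for reproducibility
x <- rnorm(30, mean = 0.4, sd = 1)      # made-up sample
null_value <- 0                         # value specified by the null hypothesis

# Hypothetical helper: compares the interval-based and p-value-based decisions
check_agreement <- function(x, null_value, conf_level) {
  res <- t.test(x, mu = null_value, conf.level = conf_level)
  in_interval <- res$conf.int[1] <= null_value && null_value <= res$conf.int[2]
  reject      <- res$p.value < (1 - conf_level)
  c(in_interval = in_interval, reject = reject)
}

agree_95 <- check_agreement(x, null_value, conf_level = 0.95)
agree_99 <- check_agreement(x, null_value, conf_level = 0.99)

agree_95   # exactly one of the two entries is TRUE
agree_99   # likewise at the 0.01 level
```

In each result, the null value being inside the interval goes together with failing to reject, and being outside goes together with rejecting.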

Practice Problems and Solutions

In lab last week we worked with the titanic dataset. Today we want to know whether sex played a significant role in the survival rates of the passengers on board. Therefore, we want to compare survival rates between males and females.

  1. Define the null hypothesis for this study on the ‘titanic’ dataset.

  2. Say, for example, that we have the null hypothesis \(H_o: \mu_{female} = 0.5\) and we obtain a 95% confidence interval of (0.415, 0.481). Recall that we interpret this interval by saying we are 95% confident that the true population \(\mu_{female}\) lies within it. Would we reject or retain the null hypothesis?

True or False?

  3. Suppose we have a test with an alpha level of 0.05. If we find a p-value of 0.03, we can reject the null hypothesis.

  4. From our results in question 3, the 95% confidence interval would contain the value specified by the null hypothesis.

  5. Suppose we have a test with an alpha level of 0.01. If we find a p-value of 0.03, we can reject the null hypothesis.

  6. From our results in question 5, the 99% confidence interval would contain the value specified by the null hypothesis.

Solutions

Problem 1

Ho: average survival rate for females = average survival rate for males

Problem 2: Reject the null hypothesis (0.5 lies outside the interval (0.415, 0.481))
Problem 3: True
Problem 4: False
Problem 5: False
Problem 6: True

Bar Plot Exercise

Note: If you would like more practice creating and annotating simple bar plots, you can work through this exercise on your own. Assignment 3 will ask you to make stacked bar plots, so it may be more useful to familiarize yourself with and practice making those instead.

We are interested in the number of people who eat at restaurants in downtown Iowa City. On the night of this study, there were 16 people at Shorts, 21 at Donnelly’s Pub, 10 at Blue Moose, and 13 at Joe’s Place.

Step One: Create a vector named “people” of the counts.

Hint: Use the ‘c()’ function

people <- c(16, 21, 10, 13)

Step Two: Create a basic bar plot of “people” using the ‘barplot()’ function.

par(mfrow = c(1,1)) # turns plot region back to "normal"
barplot(people)

What improvements could be made?

  • Main title
  • Make the y axis longer than the tallest bar
  • Label the bars
  • Label the x axis
  • Color

Step Three: Give the bar plot a main title.

barplot(people, 
        main = "Iowa City Night Life")

Step Four: Adjust the y axis to be high enough.

barplot(people, 
        main = "Iowa City Night Life",
        ylim = c(0,25))

Step Five: Give the axes appropriate names and labels.

To do this first create a vector named “restaurants” that contains the names of the restaurants.

## Create a vector named "restaurants" that contains the names of the restaurants.

restaurants <- c("Shorts", "Donnelly's Pub", "Blue Moose", "Joe's Place")

Now add the bar categories and axis labels

## Add the bar categories to the bar plot

barplot(people, 
        main = "Iowa City Night Life",
        ylim = c(0,25),
        names.arg = restaurants)

## Label the x and y-axis

barplot(people, 
        main = "Iowa City Night Life",
        ylim = c(0,25),
        names.arg = restaurants,
        ylab = "Number of People",
        xlab = "Restaurants in Iowa City")

Step Six: Add a green, dashed line labeled “Fun Threshold”

Let’s say that the threshold of having fun is 10 people.

# Draw the bar plot, then add a green, dashed line (lty = 2)
barplot(people, 
        main = "Iowa City Night Life",
        ylim = c(0,25),
        names.arg = restaurants,
        ylab = "Number of People",
        col = "orange")

abline(h = 10, col = "green", lty = 2, lwd = 2)

## Label the line

text(x = 3, y = 11, labels = "Fun Threshold")

Step Seven: Use a legend, instead of the text function, to describe the line.

barplot(people, 
        main = "Iowa City Night Life",
        ylim = c(0,25),
        names.arg = restaurants,
        ylab = "Number of People",
        col = "orange")

abline(h = 10, col = "green", lty = 2, lwd = 2)

legend(x = "topright",
       legend = "Fun Threshold",
       lty = 2,
       lwd = 2,
       col = "green")

Step Eight: Save the plot.

Hint: Use the “pdf” and “dev.off” functions.

pdf("Bar-Plot-01.pdf", height = 3.5, width = 7.25)

barplot(people, 
        main = "Iowa City Night Life",
        ylim = c(0,25),
        names.arg = restaurants,
        ylab = "Number of People",
        col = "orange")

abline(h = 10, col = "green", lty = 2)

legend(x = "topright",
       legend = "Fun Threshold",
       lty = 2,
       col = "green")

dev.off()
## png 
##   2