In today’s lab we will begin by reviewing the summary statistics that were covered in lecture. We will then show you how to create plots such as histograms, bar charts, box plots, and scatter plots, which will be essential for your next homework. At the end of the lab there is a review section for your quiz on Thursday, February 9th.

Summary Statistics

If you go to the class website you will find a data set named tips, which we will be using today. It contains variables recorded on 244 tips that were received over a period of a few months. The data can be read into R using the following code.

tips <- read.delim("http://myweb.uiowa.edu/pbreheny/data/tips.txt")

The summary statistics that were brought up in class, such as the mean, median, standard deviation, quantiles, minimum, and maximum, can all be computed with R functions. For a reminder of how they are called, please refer back to lab 2.
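As a quick refresher, the sketch below calls these functions on a small made-up vector of bills. The numbers are invented for illustration and are not taken from the tips data. The function names are the standard base R ones; lab 2 remains the reference for how they are used in this course.

```r
# Hypothetical total bills, made up for illustration only
bills <- c(10.50, 12.00, 15.25, 18.00, 22.75, 48.00)

mean(bills)                      # arithmetic average
median(bills)                    # middle value of the sorted data
sd(bills)                        # sample standard deviation
quantile(bills, c(0.25, 0.75))   # first and third quartiles
min(bills)                       # smallest value
max(bills)                       # largest value
```

The same calls work on a column of a data frame, for example a column such as tips$TotBill once the data have been read in.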

In many data sets you will want information about a certain subset of the data. For example, if we want the mean of the total bills in the tips data set, but only for bills where the variable Time is "Night", we can apply the mean function to tips$TotBill[tips$Time == "Night"] as shown below:

mean(tips$TotBill[tips$Time == "Night"])
## [1] 20.79716

The operator “==” tests each observation and returns TRUE or FALSE according to whether its Time is “Night”. If the result is TRUE, that observation’s TotBill is included in the mean, whereas if it is FALSE it is left out.
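To make the TRUE/FALSE step visible, the sketch below uses a tiny made-up data frame (the name toy and its values are hypothetical, not part of the tips data) and prints the logical vector before using it to pick out the night bills.

```r
# Hypothetical mini data set, made up for illustration only
toy <- data.frame(
  TotBill = c(12.50, 30.00, 18.25, 25.00),
  Time    = c("Day", "Night", "Day", "Night")
)

# One TRUE/FALSE value per row
is_night <- toy$Time == "Night"
is_night

# Square brackets keep only the entries where the test is TRUE
toy$TotBill[is_night]

# Mean of the night bills only
mean(toy$TotBill[is_night])
```

Storing the comparison in a variable such as is_night is optional; writing the test directly inside the square brackets gives the same result.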

With this information, can you find the mean total bill at night and compare it to the mean total bill during the day? What differences do you see, and why might this be?

Histograms

With data like these, it is often helpful to plot them. For the tips data we will start with a histogram, which is created with the hist() function as shown below:

hist(tips$TotBill)

From this histogram we can see that many of the bills are around $15, but some are higher than $50. After looking at this, we can think about splitting up the histogram by time of day. The code below does just that. The “col” argument sets the color of the bars, and the “breaks” argument makes both histograms use the same bin boundaries on the x axis.

hist(tips$TotBill[tips$Time=="Day"], col = "red", breaks = seq(1, 60, 5))

hist(tips$TotBill[tips$Time=="Night"], col = "blue", breaks = seq(1, 60, 5))
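To see what the breaks values are doing, the sketch below prints the bin boundaries produced by seq(1, 60, 5) and draws two histograms with those boundaries, one above the other. The bills here are simulated stand-ins rather than the tips data, and the par(mfrow = ...) layout is an optional extra that is not used elsewhere in this lab.

```r
# Bin boundaries shared by both histograms: 1, 6, 11, ..., 56
bin_edges <- seq(1, 60, 5)
bin_edges

# Simulated stand-in bills, for illustration only (not the tips data)
set.seed(1)
day_bills   <- runif(50, min = 5, max = 40)
night_bills <- runif(50, min = 5, max = 55)

# Stack the two plots so the shared bins line up
old_par <- par(mfrow = c(2, 1))
hist(day_bills,   col = "red",  breaks = bin_edges, main = "Day")
hist(night_bills, col = "blue", breaks = bin_edges, main = "Night")
par(old_par)  # restore the previous plot layout
```

Note that hist() stops with an error if any value falls outside the range of the breaks, so the boundaries need to cover all of the data being plotted.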

What do these histograms show about how the total bills change from day to night?

Bar Charts

If your data are categorical, you will want to use a bar chart, which is made using the code below. You must first run require(lattice) to load the package that provides the barchart function. Several different bar charts are shown below, and we will discuss them during the lab session.

titanic <- read.delim("http://myweb.uiowa.edu/pbreheny/data/titanic.txt")
table1 <- table(titanic$Class)
require(lattice)
## Loading required package: lattice
barchart(table1)

barchart(table(titanic$Survived))

barchart(table(titanic$Class, titanic$Survived))

barchart(table(titanic$Class, titanic$Survived), horizontal = FALSE)

barchart(table(titanic$Class,titanic$Sex,titanic$Age,titanic$Survived),auto.key=TRUE)

barchart(table(titanic$Class,titanic$Sex,titanic$Age,titanic$Survived),auto.key=TRUE,scales="free")
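Each of these charts is drawn from the counts that table() produces. The sketch below shows those counts for a tiny made-up set of passenger records (the data frame toy and its values are hypothetical, not the titanic data), so you can see what barchart() is given.

```r
# Hypothetical passenger records, made up for illustration only
toy <- data.frame(
  Class    = c("1st", "1st", "2nd", "3rd", "3rd", "3rd"),
  Survived = c("Yes", "No", "Yes", "No", "No", "Yes")
)

# One-way table: number of passengers in each class
class_counts <- table(toy$Class)
class_counts

# Two-way table: class (rows) by survival (columns)
class_by_survival <- table(toy$Class, toy$Survived)
class_by_survival

# print() makes sure the chart is drawn even when run from a script
library(lattice)
print(barchart(class_counts))
```

Adding more variables inside table() adds more dimensions to the counts, which is why the later charts are split into groups and panels.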

Box Plots

We can also plot these data using a box plot. In the code below, the “ylab” argument lets us write a y-axis label:

boxplot(tips$TotBill ~ tips$Time, ylab = "Total Bill")

From a box plot we can read off the median and quartiles, tell whether the data are skewed (which suggests where the mean lies in relation to the median), and see whether there are any outliers. Discuss these features for the box plots shown.
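The sketch below connects a box plot to the numbers behind it, using a small made-up vector of right-skewed bills (invented for illustration, not from the tips data). boxplot.stats() is the base R helper that computes the values boxplot() draws; note that boxplot() does not draw the mean.

```r
# Hypothetical right-skewed bills, made up for illustration only
bills <- c(8, 10, 11, 12, 13, 14, 15, 17, 20, 60)

median(bills)   # the line inside the box
mean(bills)     # not drawn by boxplot(); pulled upward by the large bill

# Lower whisker, lower hinge, median, upper hinge, upper whisker
boxplot.stats(bills)$stats

# Points drawn individually as outliers
boxplot.stats(bills)$out

boxplot(bills, ylab = "Total Bill")
```

In this toy example the mean is larger than the median, which is the usual pattern when the data are skewed to the right.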

Scatter Plots

When we are curious about the relationship between two continuous variables, we will want to use a scatter plot, which is made using the plot function as shown below:

plot(tips$TotBill, tips$Tip)

Looking at this plot, we can see a positive association between the variables along with a good amount of variation. Also, if you look closely at the plot you can see horizontal lines beginning to form. Why do you think this is?

If you want practice creating and reading plots:

  1. Make a scatter plot that compares the total bill to the tip for only those who smoke. Now do the same for those who don’t smoke.

  2. Make a histogram of those who smoke versus those who do not smoke.

Quiz Review

Selection bias

Instead of random sampling, certain subgroups of the population were more likely to be included than others.

Nonresponse bias

Nonresponders can differ from responders in many important ways.

Perception bias

The perception of benefit from a treatment (placebo effect).

Confirmation bias (touched on, but not named in notes)

The tendency to interpret new evidence as confirmation of one’s existing beliefs or theories.
(If doctors think that the polio vaccine causes polio, a patient with a borderline instance of disease is more likely to be diagnosed with polio if the doctor knows that the vaccine was administered.)

Confounding

The two things being studied are both highly correlated with a third thing.
(Think ice cream sales and murder rates both being related to weather.)

In each of the following examples, determine which bias(es) may be present. If possible, determine which direction the bias may skew the results. Then, state the null and alternative hypotheses.

A doctor wanted to investigate whether Tylenol is better than Ibuprofen at curing headaches, so he designed an experiment in which he randomly selected which treatment he would give people and blinded them to which one they got. He then noted how much their condition improved in either case.

A statistician who was also a Subway enthusiast was heartbroken to find out that his footlong sandwich was only 11 inches long. He set out to determine the true mean sandwich length by measuring his Subway lunch every day for a month. He hoped to gather enough evidence to prove false advertising.

A parent-teacher association for schools in Austin, Minnesota was wondering how pervasive drug culture was among their high school students, compared to the national average. In order to get a handle on the situation, they handed out a survey to the students at a school assembly during homecoming week.

In a randomized controlled double blind study, 12 people in the treatment group died before receiving the treatment, so the researchers decided to omit them from the data analysis.

Know your errors:

Type 1 error: This is when the null hypothesis is true but we reject it.

Type 2 error: This is when the null hypothesis is false but we fail to reject it.

Question: If I ran 1000 experiments, each with alpha equal to 0.10 and with the null hypothesis true in every one, how many Type 1 errors can I expect to have?
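One way to explore this question is by simulation. The sketch below assumes the null hypothesis is true in every experiment: each experiment draws 20 values from a normal distribution with mean 0 and tests whether the mean is 0, and we then count how often the test rejects at alpha = 0.10. The sample size, the distribution, and the use of t.test are illustrative choices, not part of the question.

```r
set.seed(42)          # makes the simulation reproducible
n_experiments <- 1000
alpha <- 0.10

# Each experiment: 20 draws from a population whose true mean is 0,
# so the null hypothesis "mean = 0" is true by construction
p_values <- replicate(n_experiments, t.test(rnorm(20), mu = 0)$p.value)

# Every rejection here is a Type 1 error, because the null is true
type1_errors <- sum(p_values < alpha)
type1_errors

# Long-run expected count when the null is always true
n_experiments * alpha
```

The simulated count will vary from run to run, but it should land near the expected value on the last line.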