In today’s lab we will:
First, let’s read in the ‘titanic’ dataset and compute some summary statistics.
titanic <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/titanic.txt")
summary(titanic)
## Class Sex Age Survived
## Length:2201 Length:2201 Length:2201 Length:2201
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
By default, when the summary() function encounters categorical data, it produces a table for that column, as evidenced above, when it created 4 separate tables. A better way to look at the variables in the dataset would be to use the table() function. Recall from lab 1 that we can extract vectors from the dataset using the ‘$’ symbol:
table(titanic$Class)
##
## 1st 2nd 3rd Crew
## 325 285 706 885
But the table function is more versatile than that. For example, we can create tables using 2 variables:
table(titanic$Class,titanic$Survived)
##
## Died Survived
## 1st 122 203
## 2nd 167 118
## 3rd 528 178
## Crew 673 212
If we give the function more than two variables, it creates multiple tables (one for each level):
table(titanic$Class, titanic$Survived, titanic$Sex)
## , , = Female
##
##
## Died Survived
## 1st 4 141
## 2nd 13 93
## 3rd 106 90
## Crew 3 20
##
## , , = Male
##
##
## Died Survived
## 1st 118 62
## 2nd 154 25
## 3rd 422 88
## Crew 670 192
It is recommended to keep the number of variables down to 2 or 3, as more than that begins to get a bit cluttered and confusing.
If we save a table, we can use square brackets to access individual numbers [row,column]. When the row or column entry is left blank using this type of indexing, R will take all the rows or columns (whichever one was left blank):
class_surv <- table(titanic$Class,titanic$Survived)
class_surv
##
## Died Survived
## 1st 122 203
## 2nd 167 118
## 3rd 528 178
## Crew 673 212
class_surv[3,2]
## [1] 178
# The 3 indicates the third row and the 2 indicates the second column,
# so this is the number of 3rd class passengers who survived.
class_surv[4, 1]
## [1] 673
class_surv[,2]
## 1st 2nd 3rd Crew
## 203 118 178 212
# Note that the column position is left blank. This extracts all rows and 2nd column.
# In the context of the data, this shows us the number of people that survived in each class.
class_surv[3,]
## Died Survived
## 528 178
# Note that the row position is left blank. This extracts the 3rd row (3rd class) and all columns.
# In the context of the data, this shows us the number of people who died and survived from the 3rd class.
Additionally, we can use the “==” to access data with certain specifications. For instance, if we wanted to access only the data in which people survived, we could use the following code:
justSurv <- titanic[titanic$Survived == "Survived",]
# Note that the name of the column you want should be in quotations.
# ie, == "Survived",] is different from == Survived,]. The latter will produce an error code.
# Or if we only want 2nd class travelers.
justSecond <- titanic[titanic$Class == "2nd",]
We can also use prop.table() to get the proportions for each cell of a table:
prop.table(class_surv, 1) # Gives proportions for each class (by row)
##
## Died Survived
## 1st 0.3753846 0.6246154
## 2nd 0.5859649 0.4140351
## 3rd 0.7478754 0.2521246
## Crew 0.7604520 0.2395480
prop.table(class_surv, 2) # Gives proportions for the survival groups (by column)
##
## Died Survived
## 1st 0.08187919 0.28551336
## 2nd 0.11208054 0.16596343
## 3rd 0.35436242 0.25035162
## Crew 0.45167785 0.29817159
Let’s investigate Titanic survival rates based on class.
(For the sake of practice, we will do this manually and then with an R function.)
Recall the steps for computing a weighted average:
Identify the outcome, the group you are interested in, and the confounder that you are adjusting for
Calculate \(w_1, w_2, ..., w_n\), the overall proportion of observations that belong to each level of the confounder
Calculate \(\bar{x}_1,\bar{x}_2, ..., \bar{x}_n\), the average (or percentage) for that group at each level of the confounder
Calculate the weight average \(\bar{x} = \Sigma_iw_i\bar{x}_i\)
, , Sex = Female
Survived
Class Survived Total
1st 141 145
2nd 93 106
3rd 90 196
Crew 20 23
, , Sex = Male
Survived
Class Survived Total
1st 62 180
2nd 25 179
3rd 88 510
Crew 192 862
From “class_surv” above, calculate the overall percentages of survival for each class.
Now, create a table listing the percentage of passengers in each class who survived, broken down by sex.
Create a table listing the total proportion of male and female passengers on the ship.
Finally, construct a weighted average of the percentage of passengers in each class who survived, controlling for the effect of sex (i.e., report one number for each class).
Do any of these results surprise you? What changed in Part a when compared to Part d? What conclusions can we draw from this?
## Part a
## 1st 2nd 3rd Crew
## 0.6246154 0.4140351 0.2521246 0.2395480
## Part b
##
## Female Male
## 1st 0.9724138 0.3444444
## 2nd 0.8773585 0.1396648
## 3rd 0.4591837 0.1725490
## Crew 0.8695652 0.2227378
## Part c
##
## Female Male
## 0.2135393 0.7864607
## Part d
## 1st : 0.479
## 2nd : 0.297
## 3rd : 0.234
## Crew: 0.361
We can also use R to solve these problems. There is no simple function that allows you to calculate the weighted average. Below are a few of ways to do this. Note that the first method could introduce mistakes since you are inputting values individually and also could be a lengthy process depending on the structure of the dataset. Typically, we would prefer to use the second method which is more efficient and has less opportunity for error.
overallSex <- prop.table(table(titanic$Sex))
#First Method
firstclass <- c(141, 62) / c(145, 180) # From table provided
(first_method1 <- weighted.mean(firstclass, overallSex))
## [1] 0.4785406
#Second Method
classtable <- table(titanic$Sex,titanic$Class,titanic$Survived)
classes <- prop.table(classtable, 1:2)[,,2]
(first_method2 <- weighted.mean(classes[,1], overallSex))
## [1] 0.4785406
Now let’s say that we want to investigate the difference in survival by sex for the Titanic data set. Construct a weighted average of the percentage of passengers for each sex who survived, controlling for the effect of class. For the sake of learning the process do not use the weighted.mean() function except to check your answer
## , , Class = 1st
##
## Survived
## Sex Survived Total
## Female 141 145
## Male 62 180
##
## , , Class = 2nd
##
## Survived
## Sex Survived Total
## Female 93 106
## Male 25 179
##
## , , Class = 3rd
##
## Survived
## Sex Survived Total
## Female 90 196
## Male 88 510
##
## , , Class = Crew
##
## Survived
## Sex Survived Total
## Female 20 23
## Male 192 862
First we find the proportion of people who survived by sex:
## Female Male
## 0.7319149 0.2120162
Then we find the percentage of passengers of each sex who survived broken down by class:
##
## 1st 2nd 3rd Crew
## Female 0.9724138 0.8773585 0.4591837 0.8695652
## Male 0.3444444 0.1396648 0.1725490 0.2227378
Next we create a table listing the total proportion of passengers on the ship in each class:
##
## 1st 2nd 3rd Crew
## 0.1476602 0.1294866 0.3207633 0.4020900
Finally we construct a weighted average of the percentage of passengers of each sex who survived, controlling for the effect of class:
## F : 0.754
## M : 0.214
Then we can check our answer using the weighted.mean() function:
overallclass <- prop.table(table(titanic$Class))
#Second Method
practice_table <- table(titanic$Class,titanic$Sex,titanic$Survived)
sexes <- prop.table(practice_table, 1:2)[,,2]
(weighted.mean(sexes[,1], overallclass))
## [1] 0.7541256
(weighted.mean(sexes[,2], overallclass))
## [1] 0.2138535