Introduction

Overview

The goal of this lab is two-fold:

  • Learn how to work with large, complex data sets
  • Learn about genomic data

Genomic data tends to be large and complex, so these two skills go hand-in-hand, although there are certainly lots of other types of data that are large and complex besides genomic data. We will be working on data from GTEx, the Genotype-Tissue Expression project.

The main idea of the GTEx project is to understand how the expression levels of genes varies depending on tissue. The DNA sequence of your genes is the same in all of your tissues, but what those tissues do with that DNA varies greatly. Muscle cells, for examples, make a lot of actin and myosin, the proteins that make up muscle cells. In biological jargon, muscle tissue expresses the actin and myosin genes at a high level. Liver cells, on the other hand, make lots of enzymes and express actin and myosin at much lower levels.

Our primary tool for working with these objects (merging, subsetting, cross-referencing them, etc.) is going to be the R package data.table; we’ll also use magrittr. If data.table is not installed on your machine, you’ll have to install it with install.packages('data.table'); if you’re on a Windows machine, you’ll also need to install curl to run the code below: install.packages('curl'). Once it’s installed (you only need to install it once, but will have to load it every time you open R), you load it with:

library(data.table)
library(magrittr)

Getting the data

I prepared the following two files:

  • Expression data: Approximately 1.6 gigabytes of expression data from the GTEx project; the entire project contains hundreds of gigabytes of data, but 1.6 GB seemed about right for our purposes today. Note that the size of the .rds file, which is compressed, is only 211 Mb, but when you load it into R, it consumes 1.6 GB of RAM. You can certainly download the full set of expression data from GTEx, but it’s too large to fit into RAM on many machines. The file I am providing is a subset of that data, restricted to protein-coding genes.
  • Gene metadata: An additional file containing some metadata about the genes and what they do.

Before proceeding, you may wish to set up a folder for this project run everything from that working directory, but this is up to you.

download.file('https://d1ypx1ckp5bo16.cloudfront.net/big-data-lab/expression.rds', 'expression.rds')
download.file('https://d1ypx1ckp5bo16.cloudfront.net/big-data-lab/gdt.rds', 'gdt.rds')
E <- readRDS('expression.rds')
gdt <- readRDS('gdt.rds')

In addition, let’s download two data sets from the GTEx portal itself:

samp <- fread('https://storage.googleapis.com/gtex_analysis_v7/annotations/GTEx_v7_Annotations_SampleAttributesDS.txt')
subj <- fread('https://storage.googleapis.com/gtex_analysis_v7/annotations/GTEx_v7_Annotations_SubjectPhenotypesDS.txt')

How the data is organized

The bulk of the data is contained in E, an enormous matrix:

class(E)
# [1] "matrix" "array"
dim(E)
# [1] 11688 18851
E[1:10, 1:4]
#                          ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457
# GTEX-1117F-0226-SM-5GZZ7               0               0               0           7.584
# GTEX-1117F-0426-SM-5EGHI               0               0               0           1.850
# GTEX-1117F-0526-SM-5EGHJ               0               0               0           5.418
# GTEX-1117F-0626-SM-5N9CS               0               0               0           7.931
# GTEX-1117F-0726-SM-5GIEN               0               0               0           2.327
# GTEX-1117F-1326-SM-5EGHH               0               0               0           2.890
# GTEX-1117F-2226-SM-5N9CH               0               0               0           6.386
# GTEX-1117F-2426-SM-5EGGH               0               0               0           9.010
# GTEX-1117F-2526-SM-5GZY6               0               0               0          11.380
# GTEX-1117F-2826-SM-5GZXL               0               0               0           7.251

Each row of E corresponds to a sample; each column corresponds to a gene. For example, the above tells us that in sample GTEX-1117F-0226-SM-5GZZ7, the expression level of gene ENSG00000000457 was 7.584. What is gene ENSG00000000457? More information on the genes is provided in the object gdt:

class(gdt)
# [1] "data.table" "data.frame"
dim(gdt)
# [1] 18851     3
gdt[1:10,]
#                  ID   Symbol                                                 Description
#  1: ENSG00000000003   TSPAN6                                               tetraspanin 6
#  2: ENSG00000000005     TNMD                                                 tenomodulin
#  3: ENSG00000000419     DPM1 dolichyl-phosphate mannosyltransferase subunit 1, catalytic
#  4: ENSG00000000457    SCYL3                                    SCY1 like pseudokinase 3
#  5: ENSG00000000460 C1orf112                         chromosome 1 open reading frame 112
#  6: ENSG00000000938      FGR              FGR proto-oncogene, Src family tyrosine kinase
#  7: ENSG00000000971      CFH                                         complement factor H
#  8: ENSG00000001036    FUCA2                                        alpha-L-fucosidase 2
#  9: ENSG00000001084     GCLC                 glutamate-cysteine ligase catalytic subunit
# 10: ENSG00000001167     NFYA                nuclear transcription factor Y subunit alpha

How do I learn more about sample GTEX-1117F-0226-SM-5GZZ7? For example, what tissue did it come from? That information is contained in the object samp:

class(samp)
# [1] "data.table" "data.frame"
dim(samp)
# [1] 15598    63
samp[1:10, c("SAMPID", "SMTS", "SMTSD", "SMGEBTCHD")]
#                       SAMPID           SMTS                        SMTSD  SMGEBTCHD
#  1: GTEX-1117F-0003-SM-58Q7G          Blood                  Whole Blood 01/19/2014
#  2: GTEX-1117F-0003-SM-5DWSB          Blood                  Whole Blood 01/28/2014
#  3: GTEX-1117F-0003-SM-6WBT7          Blood                  Whole Blood  8/11/2014
#  4: GTEX-1117F-0226-SM-5GZZ7 Adipose Tissue       Adipose - Subcutaneous 03/05/2014
#  5: GTEX-1117F-0426-SM-5EGHI         Muscle            Muscle - Skeletal 02/08/2014
#  6: GTEX-1117F-0526-SM-5EGHJ   Blood Vessel              Artery - Tibial 02/08/2014
#  7: GTEX-1117F-0626-SM-5N9CS   Blood Vessel            Artery - Coronary 03/22/2014
#  8: GTEX-1117F-0726-SM-5GIEN          Heart     Heart - Atrial Appendage 02/27/2014
#  9: GTEX-1117F-1326-SM-5EGHH Adipose Tissue Adipose - Visceral (Omentum) 02/08/2014
# 10: GTEX-1117F-2226-SM-5N9CH          Ovary                        Ovary 03/22/2014

Finally, there is also information about the subjects that these samples came from; that information is contained in the object subj:

class(subj)
# [1] "data.table" "data.frame"
dim(subj)
# [1] 752   4
subj
#          SUBJID SEX   AGE DTHHRDY
#   1: GTEX-1117F   2 60-69       4
#   2: GTEX-111CU   1 50-59       0
#   3: GTEX-111FC   1 60-69       1
#   4: GTEX-111VG   1 60-69       3
#   5: GTEX-111YS   1 60-69       0
#  ---                             
# 748:  GTEX-ZYY3   2 60-69       4
# 749:  GTEX-ZZ64   1 20-29       0
# 750:  GTEX-ZZPT   1 50-59       4
# 751:  GTEX-ZZPU   2 50-59       0
# 752:      K-562   2 50-59      NA

There is much more subject-level data that was collected on these subjects, such as manner of death, race, exact age, etc. However, these are the only variables that can be publicly shared.

It is difficult/impossible to determine what many of the variables in samp/subj mean just from the variable name. I’ll explain all the ones that I refer to in the lab, but if you’re curious about what other ones mean, consult the data dictionaries:

How data.table works

data.table has many advantages over basic R data frames. To get a sense of what it can do, consider the following lines of code, which creates new columns called Date in the samp data table, formatted as a date variable (the original is just coded as a character string) and Age in the subj data table, which is a string that represents a range (which we are replacing my its midpoint):

samp[, Date := as.Date(SMGEBTCHD, format="%m/%d/%Y")]
subj[, Age := as.numeric(substr(AGE, 1, 2)) + 5]

We could do this with a data frame too, but some important differences are:

  • We can refer to the variable SMGEBTCHD, which only exists in the samp object, without having to write samp$SMGEBTCHD. This is extremely convenient.
  • This isn’t obvious from the above code, but data.table avoids making any extra copies of the data while it is executing the command. Many operations with data frames create additional temporary copies; this doesn’t really matter if the data frame is small, but for big data, this can be very slow and potentially even crash your machine if you run out of memory.

In addition, as we will see, data.table has some neat syntax to allow you to perform complex operations very simply. To illustrate its use, I’m going to create a smaller data table from samp for code illustration purposes:

DT <- samp[, .(Date, SMTS, SMTSD, SMMAPRT)]

Subsetting

data.table is organized somewhat similarly to another popular database management language called SQL. It’s organizing principle is sometimes summarized in shorthand as DT[i,j,by], where

  • i: Subset
  • j: Do/compute
  • by: Group/aggregate

For example, to find the subset of samples with mapping rates below 60%, we could write:

DT[SMMAPRT < 0.6]
#            Date        SMTS                                    SMTSD   SMMAPRT
#   1: 2012-04-02       Brain Brain - Anterior cingulate cortex (BA24) 0.1503121
#   2: 2012-04-02       Brain                     Brain - Hypothalamus 0.5982282
#   3: 2012-06-09       Brain                           Brain - Cortex 0.1136317
#   4: 2012-04-02       Brain                 Brain - Substantia nigra 0.4861057
#   5: 2012-04-02       Brain          Brain - Caudate (basal ganglia) 0.4257365
#  ---                                                                          
# 238: 2012-06-08 Bone Marrow         Cells - Leukemia cell line (CML) 0.2173972
# 239: 2012-06-02 Bone Marrow         Cells - Leukemia cell line (CML) 0.3541893
# 240: 2012-05-27 Bone Marrow         Cells - Leukemia cell line (CML) 0.4001249
# 241: 2012-06-09 Bone Marrow         Cells - Leukemia cell line (CML) 0.5073986
# 242: 2012-06-01 Bone Marrow         Cells - Leukemia cell line (CML) 0.3599809

We could subset by column as well:

DT[, .(SMTS, SMTSD)]
#                  SMTS                            SMTSD
#     1:          Blood                      Whole Blood
#     2:          Blood                      Whole Blood
#     3:          Blood                      Whole Blood
#     4: Adipose Tissue           Adipose - Subcutaneous
#     5:         Muscle                Muscle - Skeletal
#    ---                                                
# 15594:    Bone Marrow Cells - Leukemia cell line (CML)
# 15595:    Bone Marrow Cells - Leukemia cell line (CML)
# 15596:    Bone Marrow Cells - Leukemia cell line (CML)
# 15597:    Bone Marrow Cells - Leukemia cell line (CML)
# 15598:    Bone Marrow Cells - Leukemia cell line (CML)

But the “do” operation can also be more complex than that:

DT[, .(AvgMR = mean(SMMAPRT, na.rm=TRUE))]
#       AvgMR
# 1: 0.967304

The power of data.table starts to emerge when we combine these operations:

DT[Date < '2012-02-17', .(AvgMR = mean(SMMAPRT, na.rm=TRUE))]
#        AvgMR
# 1: 0.8452967
DT[Date > '2012-02-17', .(AvgMR = mean(SMMAPRT, na.rm=TRUE))]
#        AvgMR
# 1: 0.9684423

Exercise: What is the average age of donors who died suddenly (DTHHRDY=1)? After long illnesses (DTHHRDY=4)?


Aggregations

Finally, the by option in data.table lets us loop these operations over groups of samples. For example, we can look at average mapping rate by year:

DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), year(Date)]
#    year    N     AvgMR
# 1: 2014 8746 0.9888829
# 2: 2013 3126 0.9881517
# 3: 2015  695 0.9876827
# 4:   NA   58 0.9912968
# 5: 2011  230 0.8352740
# 6: 2012 2743 0.8341782

Or average mapping rate by tissue type:

DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS]
#                SMTS    N     AvgMR
#  1:           Blood 2561 0.9263712
#  2:  Adipose Tissue  886 0.9638252
#  3:          Muscle  718 0.9523384
#  4:    Blood Vessel 1041 0.9674999
#  5:           Heart  727 0.9615041
#  6:           Ovary  138 0.9882082
#  7:          Uterus  117 0.9873945
#  8:          Vagina  124 0.9881847
#  9:          Breast  306 0.9819909
# 10:            Skin 1362 0.9702419
# 11:  Salivary Gland  104 0.9874789
# 12:           Brain 2076 0.9590078
# 13:   Adrenal Gland  205 0.9892859
# 14:         Thyroid  564 0.9454281
# 15:            Lung  607 0.9452809
# 16:          Spleen  169 0.9872840
# 17:        Pancreas  268 0.9850086
# 18:       Esophagus 1111 0.9881868
# 19:         Stomach  272 0.9845115
# 20:           Colon  539 0.9886610
# 21: Small Intestine  143 0.9884259
# 22:        Prostate  159 0.9836195
# 23:          Testis  284 0.9876906
# 24:           Nerve  498 0.9427457
# 25:       Pituitary  192 0.9835776
# 26:           Liver  188 0.9883874
# 27:          Kidney   50 0.9878707
# 28:  Fallopian Tube    7 0.9887148
# 29:         Bladder   11 0.9889152
# 30:    Cervix Uteri   11 0.9873638
# 31:     Bone Marrow  160 0.9528658
#                SMTS    N     AvgMR

In the above, note that .N is a special variable that counts the number of rows in each subset of the data we are doing our calculations by. So, for example, we now know that there were 169 spleen samples, and that they had an average map rate of 98.7%.

Finally, we can put all of these things together to, for example, calculate the average map rate for each subtype of blood vessel:

DT[SMTS == 'Skin', .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTSD]
#                                  SMTSD   N     AvgMR
# 1: Skin - Not Sun Exposed (Suprapubic) 408 0.9812895
# 2:      Skin - Sun Exposed (Lower leg) 592 0.9472045
# 3:     Cells - Transformed fibroblasts 362 0.9909403

Piping

It is often useful to take the results of one data table operation and run another data table operation on it. This could be done by running two separate lines of code and saving the intermediate results as an R object, but this is (a) inefficient and (b) extremely error-prone. A more concise alternative is to “pipe” the results of one operation to the next. For example, suppose we want to take our table of average map rates by tissue type and sort by map rate:

DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS][order(AvgMR)]
#                SMTS    N     AvgMR
#  1:           Blood 2561 0.9263712
#  2:           Nerve  498 0.9427457
#  3:            Lung  607 0.9452809
#  4:         Thyroid  564 0.9454281
#  5:          Muscle  718 0.9523384
#  6:     Bone Marrow  160 0.9528658
#  7:           Brain 2076 0.9590078
#  8:           Heart  727 0.9615041
#  9:  Adipose Tissue  886 0.9638252
# 10:    Blood Vessel 1041 0.9674999
# 11:            Skin 1362 0.9702419
# 12:          Breast  306 0.9819909
# 13:       Pituitary  192 0.9835776
# 14:        Prostate  159 0.9836195
# 15:         Stomach  272 0.9845115
# 16:        Pancreas  268 0.9850086
# 17:          Spleen  169 0.9872840
# 18:    Cervix Uteri   11 0.9873638
# 19:          Uterus  117 0.9873945
# 20:  Salivary Gland  104 0.9874789
# 21:          Testis  284 0.9876906
# 22:          Kidney   50 0.9878707
# 23:          Vagina  124 0.9881847
# 24:       Esophagus 1111 0.9881868
# 25:           Ovary  138 0.9882082
# 26:           Liver  188 0.9883874
# 27: Small Intestine  143 0.9884259
# 28:           Colon  539 0.9886610
# 29:  Fallopian Tube    7 0.9887148
# 30:         Bladder   11 0.9889152
# 31:   Adrenal Gland  205 0.9892859
#                SMTS    N     AvgMR

For readability, it is often best to separate these onto different lines:

DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS] %>%
  .[order(AvgMR)] %>%
  .[1:5]
#       SMTS    N     AvgMR
# 1:   Blood 2561 0.9263712
# 2:   Nerve  498 0.9427457
# 3:    Lung  607 0.9452809
# 4: Thyroid  564 0.9454281
# 5:  Muscle  718 0.9523384

Note that we can also pipe the results to other R functions:

DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS] %>%
  .[order(AvgMR)] %>%
  use_series('AvgMR') %>%
  median()
# [1] 0.9850086

Again, you can accomplish this same task in completely separate lines of code, saving intermediate results every time, but this is (a) harder to read, (b) harder to modify, and (c) a constant source of errors, because you will inevitably change the intermediate variable in one place and forget to change it in another.


Exercise: What are the most and least common subtypes of brain samples? Produce an ordered list.


Merging

Suppose we wanted to know something like the average age of the donors for each tissue type. We cannot obtain this information directly (yet), because the age information is in the subj data table, while the tissue type information is in the samp data table. To answer this kind of question, we will have to merge the two data sets.

In order to merge two data sets, we need to have a key: a column that appears in both data tables and that tells us which rows to combine. In our case, we don’t exactly have such a column, but we can clearly create a subject ID from the sample ID by cutting off everything after the second dash:

samp[, SUBJID := gsub('([^-]*)-([^-]*)-.*', '\\1-\\2', SAMPID)]

This involves regular expressions; if you don’t know what they are, don’t worry about it. Once we’ve created the column, we can merge the two data sets using merge:

sdt <- merge(samp, subj, by='SUBJID')

We have just combined the sample information and subject information into a single data set which I’m calling sdt for “samples data table”. To get a better understanding for what just happened, let’s look here:

dim(samp)
# [1] 15598    65
dim(subj)
# [1] 752   5
dim(sdt)
# [1] 15598    69
samp[SUBJID=='GTEX-WWTW', .(SUBJID, SAMPID, SMTS, SMTSD)]
#       SUBJID                  SAMPID           SMTS                               SMTSD
# 1: GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH          Blood Cells - EBV-transformed lymphocytes
# 2: GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE           Skin     Cells - Transformed fibroblasts
# 3: GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS         Muscle                   Muscle - Skeletal
# 4: GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue        Adipose - Visceral (Omentum)
subj[SUBJID=='GTEX-WWTW', .(SUBJID, SEX, AGE, DTHHRDY)]
#       SUBJID SEX   AGE DTHHRDY
# 1: GTEX-WWTW   1 50-59       0
sdt[SUBJID=='GTEX-WWTW', .(SUBJID, SAMPID, SMTS, SEX, AGE, DTHHRDY)]
#       SUBJID                  SAMPID           SMTS SEX   AGE DTHHRDY
# 1: GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH          Blood   1 50-59       0
# 2: GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE           Skin   1 50-59       0
# 3: GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS         Muscle   1 50-59       0
# 4: GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue   1 50-59       0

This is sometimes called a “one-to-many” merge: a single entry in subj is being matched and merged with many entries in samp. But each of the GTEX-WWTW samples is coming from the same person, so they are all coming from a man in his fifties who died after being on a ventilator. There are more complicated issues that come up in merging, like what to do when records can’t be matched, but none of those issues are present in this example, thankfully.

One issue that does come up, however, is the fact that there are subjects in sdt that we do not have any expression data on. So let’s amend our earlier assignment to also subset by samples that are in the expression matrix:

sdt <- merge(samp, subj, by='SUBJID')[SAMPID %in% rownames(E)]

Note that sdt matches up with E now:

dim(sdt)
# [1] 11688    69
dim(E)
# [1] 11688 18851
all.equal(sdt$SAMPID, rownames(E))
# [1] TRUE

Exercise: So, what is the average age of the donors for each tissue type?


Expression vs tissue type

OK, now that we are familiar with how to use data.table, let’s get back to the central biological question behind the GTEx project: how does the expression of genes vary with tissue type?

Initial attempt

Let’s say we decide to start by asking the question: what genes are the most highly expressed in the pancreas? We might think about approach the question like this:

  1. Subset E to include just pancreas samples
  2. Take the mean of each gene in the pancreas subset
  3. Sort and look at the top genes

Let’s see how well this approach works. Subsetting E is pretty straightforward – we just use data.table to get a list of the sample IDs that belong to pancreas.

To take the mean of each gene, we can use a function called apply. apply is a simple but very useful function in R that applies a specified function to every row or table in a matrix. For example, suppose we had a matrix called X. To find the mean of each row:

apply(X, 1, mean)

Or, to find the median of each column:

apply(X, 2, median)

OK, so let’s execute our plan. I’ll combine steps 1 and 2 into the following line of code:

pancreasMeans <- apply(E[sdt[SMTS=="Pancreas"]$SAMPID,], 2, mean)

Now, let’s see what those top genes are:

(topGenes <- head(sort(pancreasMeans, decreasing=TRUE), n=20))
# ENSG00000115386 ENSG00000137392 ENSG00000142789 ENSG00000172023 ENSG00000153002 ENSG00000219073 
#      65328.1290      54361.7903      28047.9032      23745.4786      17186.9113      10602.2218 
# ENSG00000142615 ENSG00000172016 ENSG00000243480 ENSG00000164266 ENSG00000162438 ENSG00000231500 
#       6051.7357       5972.2494       4901.6599       4768.4274       3741.7242       2362.6488 
# ENSG00000143954 ENSG00000215704 ENSG00000122406 ENSG00000177954 ENSG00000163682 ENSG00000168028 
#       2128.6467       1380.4270       1140.2210        978.6609        849.6633        822.7073 
# ENSG00000143947 ENSG00000188846 
#        821.2524        806.2048
gdt[names(topGenes)]
#                  ID Symbol                                 Description
#  1: ENSG00000115386  REG1A          regenerating family member 1 alpha
#  2: ENSG00000137392   CLPS                                    colipase
#  3: ENSG00000142789 CELA3A chymotrypsin like elastase family member 3A
#  4: ENSG00000172023  REG1B           regenerating family member 1 beta
#  5: ENSG00000153002   CPB1                         carboxypeptidase B1
#  6: ENSG00000219073 CELA3B chymotrypsin like elastase family member 3B
#  7: ENSG00000142615 CELA2A chymotrypsin like elastase family member 2A
#  8: ENSG00000172016  REG3A          regenerating family member 3 alpha
#  9: ENSG00000243480  AMY2A              amylase, alpha 2A (pancreatic)
# 10: ENSG00000164266 SPINK1    serine peptidase inhibitor, Kazal type 1
# 11: ENSG00000162438   CTRC                              chymotrypsin C
# 12: ENSG00000231500  RPS18                       ribosomal protein S18
# 13: ENSG00000143954  REG3G          regenerating family member 3 gamma
# 14: ENSG00000215704 CELA2B chymotrypsin like elastase family member 2B
# 15: ENSG00000122406   RPL5                        ribosomal protein L5
# 16: ENSG00000177954  RPS27                       ribosomal protein S27
# 17: ENSG00000163682   RPL9                        ribosomal protein L9
# 18: ENSG00000168028   RPSA                        ribosomal protein SA
# 19: ENSG00000143947 RPS27A                      ribosomal protein S27a
# 20: ENSG00000188846  RPL14                       ribosomal protein L14

At the top of the list are the REG genes and the chymotrypsin genes; this makes sense as these genes are directly related to the function of the pancreas. However, there are also a bunch of “ribosomal protein” results. These are perhaps not so interesting – they indicate that pancreas cells are making ribosomes, but all cells do that. So perhaps these genes are high in all tissues, not just pancreas. In order to see whether this is the case, I’m going to create a function called tissuePlot:

tissuePlot <- function(gene) {
  op <- par(mar=c(5,7,2,1))
  boxplot(E[,gene] ~ sdt$SMTS, horizontal=TRUE, las=1, col='gray', xlab='Expression', ylab='')
  mtext(gdt[gene, "Symbol", with=FALSE], line=0.5)
  par(op)
}
tissuePlot("ENSG00000231500")

So the pancreas makes a lot of RPS18, but plenty of other tissues do too. Let’s see how, say, colipase compares:

tissuePlot("ENSG00000137392")

This seems much more relevant as a pancreas-specific gene – and in fact, the pancreas is the only organ that produces colipase. At this point, it’s worth taking a moment to think about how we should define a better metric that prioritizes genes like CLPS over genes like RPS18. I explore two ideas in this lab, but there are many other reasonable approaches to addressing this question.

Revised approach A

One simple approach is based on calculating the difference in means between the tissue of interest and all other tissues. We could define a tissue-specific gene as one which is more highly expressed in that tissue than any other tissue, and then rank genes by the difference between the tissue of interest and the tissue with the second-highest expression.

To implement this idea, let’s define a scoring function, then apply it to all the rows of E. Note that I am using data.table and the aggregation features in my scoring function.

score <- function(x, tissue) {
  sdt[, TEMP := x]
  means <- sdt[, .(Avg = mean(TEMP)), SMTS]
  means[SMTS==tissue]$Avg - max(means[SMTS!=tissue]$Avg)
}
a <- apply(E, 2, score, tissue='Pancreas')

Now let’s look at the top genes according to this metric:

(topGenes <- head(sort(a, decreasing=TRUE), n=20))
# ENSG00000115386 ENSG00000137392 ENSG00000142789 ENSG00000172023 ENSG00000153002 ENSG00000219073 
#     57690.45916     54303.59989     27995.70975     20523.49230     17104.19415     10582.93090 
# ENSG00000142615 ENSG00000243480 ENSG00000164266 ENSG00000162438 ENSG00000172016 ENSG00000143954 
#      6040.26096      4891.72496      4380.88762      3736.90332      2900.71688      2018.09118 
# ENSG00000215704 ENSG00000240038 ENSG00000115263 ENSG00000114204 ENSG00000124713 ENSG00000142677 
#      1376.90391       727.10583       352.17229       264.68390       248.36380       112.81369 
# ENSG00000204140 ENSG00000196975 
#       104.86277        56.95739
gdt[names(topGenes)]
#                  ID   Symbol                                 Description
#  1: ENSG00000115386    REG1A          regenerating family member 1 alpha
#  2: ENSG00000137392     CLPS                                    colipase
#  3: ENSG00000142789   CELA3A chymotrypsin like elastase family member 3A
#  4: ENSG00000172023    REG1B           regenerating family member 1 beta
#  5: ENSG00000153002     CPB1                         carboxypeptidase B1
#  6: ENSG00000219073   CELA3B chymotrypsin like elastase family member 3B
#  7: ENSG00000142615   CELA2A chymotrypsin like elastase family member 2A
#  8: ENSG00000243480    AMY2A              amylase, alpha 2A (pancreatic)
#  9: ENSG00000164266   SPINK1    serine peptidase inhibitor, Kazal type 1
# 10: ENSG00000162438     CTRC                              chymotrypsin C
# 11: ENSG00000172016    REG3A          regenerating family member 3 alpha
# 12: ENSG00000143954    REG3G          regenerating family member 3 gamma
# 13: ENSG00000215704   CELA2B chymotrypsin like elastase family member 2B
# 14: ENSG00000240038    AMY2B              amylase, alpha 2B (pancreatic)
# 15: ENSG00000115263      GCG                                    glucagon
# 16: ENSG00000114204 SERPINI2                    serpin family I member 2
# 17: ENSG00000124713     GNMT                 glycine N-methyltransferase
# 18: ENSG00000142677  IL22RA1     interleukin 22 receptor subunit alpha 1
# 19: ENSG00000204140   CLPSL1                             colipase like 1
# 20: ENSG00000196975    ANXA4                                  annexin A4

Note that all of the “ribosomal protein” genes are gone. Try plotting some of the other genes on this list. Do they look similar to colipase?

How many pancreas-specific genes are there, according to our working definition?

sum(a>0)
# [1] 48

Exercise: Try applying this approach to another tissue, such as brain. How many brain-specific genes are there?


Revised approach B

An alternative approach is to look at how many standard deviations above the rest a certain tissue is. This approach opens the door for some genes with more modest abundance to show up at the top of our list, provided that they are not made in any other tissues. Again, let’s define a score and apply it:

score <- function(x, tissue) {
  sdt[, TEMP := x]
  ms <- sdt[, .(Avg = mean(TEMP), SD = sd(TEMP)), SMTS == tissue]
  (ms[SMTS==TRUE]$Avg-ms[SMTS==FALSE]$Avg)/max(ms[SMTS==FALSE]$SD, 1)
}
b <- apply(E, 2, score, tissue='Pancreas')

And look at the top genes according to this approach:

(topGenes <- head(sort(b, decreasing=TRUE), n=20))
# ENSG00000137392 ENSG00000162438 ENSG00000153002 ENSG00000243480 ENSG00000142789 ENSG00000215704 
#       208.42385       198.08592       177.06821       144.42881       142.22279       142.07116 
# ENSG00000219073 ENSG00000142615 ENSG00000114204 ENSG00000204140 ENSG00000143954 ENSG00000240038 
#       130.11153       125.69585       125.67201        80.72087        57.57300        55.12126 
# ENSG00000164266 ENSG00000185176 ENSG00000115386 ENSG00000115263 ENSG00000184945 ENSG00000172023 
#        49.88636        47.19105        40.32896        36.27333        32.35585        28.08859 
# ENSG00000080293 ENSG00000138472 
#        18.38351        15.54598
gdt[names(topGenes)]
#                  ID   Symbol                                 Description
#  1: ENSG00000137392     CLPS                                    colipase
#  2: ENSG00000162438     CTRC                              chymotrypsin C
#  3: ENSG00000153002     CPB1                         carboxypeptidase B1
#  4: ENSG00000243480    AMY2A              amylase, alpha 2A (pancreatic)
#  5: ENSG00000142789   CELA3A chymotrypsin like elastase family member 3A
#  6: ENSG00000215704   CELA2B chymotrypsin like elastase family member 2B
#  7: ENSG00000219073   CELA3B chymotrypsin like elastase family member 3B
#  8: ENSG00000142615   CELA2A chymotrypsin like elastase family member 2A
#  9: ENSG00000114204 SERPINI2                    serpin family I member 2
# 10: ENSG00000204140   CLPSL1                             colipase like 1
# 11: ENSG00000143954    REG3G          regenerating family member 3 gamma
# 12: ENSG00000240038    AMY2B              amylase, alpha 2B (pancreatic)
# 13: ENSG00000164266   SPINK1    serine peptidase inhibitor, Kazal type 1
# 14: ENSG00000185176   AQP12B                               aquaporin 12B
# 15: ENSG00000115386    REG1A          regenerating family member 1 alpha
# 16: ENSG00000115263      GCG                                    glucagon
# 17: ENSG00000184945   AQP12A                               aquaporin 12A
# 18: ENSG00000172023    REG1B           regenerating family member 1 beta
# 19: ENSG00000080293     SCTR                           secretin receptor
# 20: ENSG00000138472   GUCA1C              guanylate cyclase activator 1C

Many of the same genes show up, but REG1A, for example, moves down the list and genes like AQP12B move up. We can get a sense for why this happens by plotting these two genes:

par(mfrow=c(1,2))
tissuePlot('ENSG00000115386')
tissuePlot('ENSG00000185176')

REG1A is much more highly expressed, but also found in the small intestine. AQP12B, on the other hand, is almost never expressed in any other tissue.

With this measure, we might define a pancreas-specific gene as one that is, say, 3 standard deviations above the rest. How many pancreas-specific genes are there, according to our working definition?

sum(b>3)
# [1] 43

Both approaches (A and B) seem reasonable; how much overlap is there between them?

length(intersect(names(b[b>3]), names(a[a>0])))
# [1] 38

So, quite a bit of overlap: of the 43 pancreas-specific genes according to definition B, 38 are also pancreas-specific according to definition A. Perhaps the best way of defining tissue-specific genes is to combine definitions A and B? It’s an interesting idea, but in the interest of time, we’ll move on.

Expression vs subject characteristics

Something we haven’t looked at yet is the extent to which gene expression within a tissue depends on subject characteristics. Let’s look at a gene that I happen to know something about, called SCN5A. SCN5A is a heart gene involved in controlling the electrical channels that keep the heart beating regularly; many cardiac diseases are caused by mutations in SCN5A.

tissuePlot('ENSG00000183873')

I collaborate with researchers in the cardiology department here at the University of Iowa, and one of their concerns with using this data from GTEx was whether it makes sense to pool all of these samples or not, given that they come from a variety of subjects and were donated in a variety of ways. In particular, their concern was that expression might be different in a healthy, living heart than in a diseased, dying heart.

This concept can be measured in a number of ways, but one of those ways is the “Hardy scale”, which categorizes how long a subject was dying prior to the collection of the sample:

  1. On a ventilator immediately before death
  2. Deaths due to accident, blunt force trauma or suicide
  3. Sudden unexpected deaths of people who had been reasonably healthy (e.g., from heart attack)
  4. Intermediate death (between 2 and 4)
  5. Death after a long illness (e.g., cancer)

Does SCN5A expression depend on Hardy classification?

sdt[, SCN5A := E[,'ENSG00000183873']]
boxplot(SCN5A ~ DTHHRDY, sdt[SMTS == "Heart"], col='gray', las=1, xlab='Hardy scale', ylab='SCN5A')

Very much so – there is a clear trend here in which SCN5A expression in the heart seems to decline as a subject dies.

Further explorations

There are certainly lots of other questions we could explore here – other tissues, other subject characteristics, as well as a whole world of gene characteristics and annotations that we’re not going to get into. If there’s time left in lab, I encourage you to explore the data set and ask your own questions. For example,

  • Look at other tissues
  • Look at other subject / sample characteristics
  • Explore alternative measures for defining tissue-specific genes
  • Look at subtissues; for example, are there genes specific not just to the brain, but to the cortex specifically?

Hopefully this lab has given you some experience with a useful tool in data.table, shown the kinds of operations and issues that arise in big data sets, and exposed you to gene expression, a common type of genomic data. If you have questions, please let me know!