The goal of this lab is two-fold:
Genomic data tends to be large and complex, so these two skills go hand-in-hand, although there are certainly lots of other types of data that are large and complex besides genomic data. We will be working on data from GTEx, the Genotype-Tissue Expression project.
The main idea of the GTEx project is to understand how the expression levels of genes varies depending on tissue. The DNA sequence of your genes is the same in all of your tissues, but what those tissues do with that DNA varies greatly. Muscle cells, for examples, make a lot of actin and myosin, the proteins that make up muscle cells. In biological jargon, muscle tissue expresses the actin and myosin genes at a high level. Liver cells, on the other hand, make lots of enzymes and express actin and myosin at much lower levels.
Our primary tool for working with these objects (merging, subsetting,
cross-referencing them, etc.) is going to be the R package
data.table
; we’ll also use magrittr
. If
data.table
is not installed on your machine, you’ll have to
install it with install.packages('data.table')
; if you’re
on a Windows machine, you’ll also need to install curl
to
run the code below: install.packages('curl')
. Once it’s
installed (you only need to install it once, but will have to load it
every time you open R), you load it with:
library(data.table)
library(magrittr)
I prepared the following two files:
.rds
file, which is compressed, is only 211 Mb, but
when you load it into R, it consumes 1.6 GB of RAM. You can certainly
download the full
set of expression data from GTEx, but it’s too large to fit into RAM
on many machines. The file I am providing is a subset of that data,
restricted to protein-coding genes.Before proceeding, you may wish to set up a folder for this project run everything from that working directory, but this is up to you.
download.file('https://d1ypx1ckp5bo16.cloudfront.net/big-data-lab/expression.rds', 'expression.rds')
download.file('https://d1ypx1ckp5bo16.cloudfront.net/big-data-lab/gdt.rds', 'gdt.rds')
E <- readRDS('expression.rds')
gdt <- readRDS('gdt.rds')
In addition, let’s download two data sets from the GTEx portal itself:
samp <- fread('https://storage.googleapis.com/adult-gtex/annotations/v7/metadata-files/GTEx_v7_Annotations_SampleAttributesDS.txt')
subj <- fread('https://storage.googleapis.com/adult-gtex/annotations/v7/metadata-files/GTEx_v7_Annotations_SubjectPhenotypesDS.txt')
The bulk of the data is contained in E
, an enormous
matrix:
class(E)
# [1] "matrix" "array"
dim(E)
# [1] 11688 18851
E[1:10, 1:4]
# ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457
# GTEX-1117F-0226-SM-5GZZ7 0 0 0 7.584
# GTEX-1117F-0426-SM-5EGHI 0 0 0 1.850
# GTEX-1117F-0526-SM-5EGHJ 0 0 0 5.418
# GTEX-1117F-0626-SM-5N9CS 0 0 0 7.931
# GTEX-1117F-0726-SM-5GIEN 0 0 0 2.327
# GTEX-1117F-1326-SM-5EGHH 0 0 0 2.890
# GTEX-1117F-2226-SM-5N9CH 0 0 0 6.386
# GTEX-1117F-2426-SM-5EGGH 0 0 0 9.010
# GTEX-1117F-2526-SM-5GZY6 0 0 0 11.380
# GTEX-1117F-2826-SM-5GZXL 0 0 0 7.251
Each row of E
corresponds to a sample; each column
corresponds to a gene. For example, the above tells us that in sample
GTEX-1117F-0226-SM-5GZZ7, the expression level of gene ENSG00000000457
was 7.584. What is gene ENSG00000000457? More information on the genes
is provided in the object gdt
:
class(gdt)
# [1] "data.table" "data.frame"
dim(gdt)
# [1] 18851 3
gdt[1:10,]
# Key: <ID>
# ID Symbol Description
# <char> <char> <char>
# 1: ENSG00000000003 TSPAN6 tetraspanin 6
# 2: ENSG00000000005 TNMD tenomodulin
# 3: ENSG00000000419 DPM1 dolichyl-phosphate mannosyltransferase subunit 1, catalytic
# 4: ENSG00000000457 SCYL3 SCY1 like pseudokinase 3
# 5: ENSG00000000460 C1orf112 chromosome 1 open reading frame 112
# 6: ENSG00000000938 FGR FGR proto-oncogene, Src family tyrosine kinase
# 7: ENSG00000000971 CFH complement factor H
# 8: ENSG00000001036 FUCA2 alpha-L-fucosidase 2
# 9: ENSG00000001084 GCLC glutamate-cysteine ligase catalytic subunit
# 10: ENSG00000001167 NFYA nuclear transcription factor Y subunit alpha
How do I learn more about sample GTEX-1117F-0226-SM-5GZZ7? For
example, what tissue did it come from? That information is contained in
the object samp
:
class(samp)
# [1] "data.table" "data.frame"
dim(samp)
# [1] 15598 63
samp[1:10, c("SAMPID", "SMTS", "SMTSD", "SMGEBTCHD")]
# SAMPID SMTS SMTSD SMGEBTCHD
# <char> <char> <char> <char>
# 1: GTEX-1117F-0003-SM-58Q7G Blood Whole Blood 01/19/2014
# 2: GTEX-1117F-0003-SM-5DWSB Blood Whole Blood 01/28/2014
# 3: GTEX-1117F-0003-SM-6WBT7 Blood Whole Blood 8/11/2014
# 4: GTEX-1117F-0226-SM-5GZZ7 Adipose Tissue Adipose - Subcutaneous 03/05/2014
# 5: GTEX-1117F-0426-SM-5EGHI Muscle Muscle - Skeletal 02/08/2014
# 6: GTEX-1117F-0526-SM-5EGHJ Blood Vessel Artery - Tibial 02/08/2014
# 7: GTEX-1117F-0626-SM-5N9CS Blood Vessel Artery - Coronary 03/22/2014
# 8: GTEX-1117F-0726-SM-5GIEN Heart Heart - Atrial Appendage 02/27/2014
# 9: GTEX-1117F-1326-SM-5EGHH Adipose Tissue Adipose - Visceral (Omentum) 02/08/2014
# 10: GTEX-1117F-2226-SM-5N9CH Ovary Ovary 03/22/2014
Finally, there is also information about the subjects that these
samples came from; that information is contained in the object
subj
:
class(subj)
# [1] "data.table" "data.frame"
dim(subj)
# [1] 752 4
subj
# SUBJID SEX AGE DTHHRDY
# <char> <int> <char> <int>
# 1: GTEX-1117F 2 60-69 4
# 2: GTEX-111CU 1 50-59 0
# 3: GTEX-111FC 1 60-69 1
# 4: GTEX-111VG 1 60-69 3
# 5: GTEX-111YS 1 60-69 0
# ---
# 748: GTEX-ZYY3 2 60-69 4
# 749: GTEX-ZZ64 1 20-29 0
# 750: GTEX-ZZPT 1 50-59 4
# 751: GTEX-ZZPU 2 50-59 0
# 752: K-562 2 50-59 NA
There is much more subject-level data that was collected on these subjects, such as manner of death, race, exact age, etc. However, these are the only variables that can be publicly shared.
It is difficult/impossible to determine what many of the variables in samp/subj mean just from the variable name. I’ll explain all the ones that I refer to in the lab, but if you’re curious about what other ones mean, consult the data dictionaries:
data.table
has many advantages over basic R data frames.
To get a sense of what it can do, consider the following lines of code,
which creates new columns called Date
in the
samp
data table, formatted as a date variable (the original
is just coded as a character string) and Age
in the
subj
data table, which is a string that represents a range
(which we are replacing my its midpoint):
samp[, Date := as.Date(SMGEBTCHD, format="%m/%d/%Y")]
subj[, Age := as.numeric(substr(AGE, 1, 2)) + 5]
We could do this with a data frame too, but some important differences are:
SMGEBTCHD
, which only
exists in the samp
object, without having to write
samp$SMGEBTCHD
. This is extremely convenient.data.table
avoids making any extra copies of the data while it is executing the
command. Many operations with data frames create additional temporary
copies; this doesn’t really matter if the data frame is small, but for
big data, this can be very slow and potentially even crash your machine
if you run out of memory.In addition, as we will see, data.table
has some neat
syntax to allow you to perform complex operations very simply. To
illustrate its use, I’m going to create a smaller data table from
samp
for code illustration purposes:
DT <- samp[, .(Date, SMTS, SMTSD, SMMAPRT)]
data.table
is organized somewhat similarly to another
popular database management language called SQL. It’s organizing
principle is sometimes summarized in shorthand as
DT[i,j,by]
, where
i
: Subsetj
: Do/computeby
: Group/aggregateFor example, to find the subset of samples with mapping rates below 60%, we could write:
DT[SMMAPRT < 0.6]
# Date SMTS SMTSD SMMAPRT
# <Date> <char> <char> <num>
# 1: 2012-04-02 Brain Brain - Anterior cingulate cortex (BA24) 0.1503121
# 2: 2012-04-02 Brain Brain - Hypothalamus 0.5982282
# 3: 2012-06-09 Brain Brain - Cortex 0.1136317
# 4: 2012-04-02 Brain Brain - Substantia nigra 0.4861057
# 5: 2012-04-02 Brain Brain - Caudate (basal ganglia) 0.4257365
# ---
# 238: 2012-06-08 Bone Marrow Cells - Leukemia cell line (CML) 0.2173972
# 239: 2012-06-02 Bone Marrow Cells - Leukemia cell line (CML) 0.3541893
# 240: 2012-05-27 Bone Marrow Cells - Leukemia cell line (CML) 0.4001249
# 241: 2012-06-09 Bone Marrow Cells - Leukemia cell line (CML) 0.5073986
# 242: 2012-06-01 Bone Marrow Cells - Leukemia cell line (CML) 0.3599809
We could subset by column as well:
DT[, .(SMTS, SMTSD)]
# SMTS SMTSD
# <char> <char>
# 1: Blood Whole Blood
# 2: Blood Whole Blood
# 3: Blood Whole Blood
# 4: Adipose Tissue Adipose - Subcutaneous
# 5: Muscle Muscle - Skeletal
# ---
# 15594: Bone Marrow Cells - Leukemia cell line (CML)
# 15595: Bone Marrow Cells - Leukemia cell line (CML)
# 15596: Bone Marrow Cells - Leukemia cell line (CML)
# 15597: Bone Marrow Cells - Leukemia cell line (CML)
# 15598: Bone Marrow Cells - Leukemia cell line (CML)
But the “do” operation can also be more complex than that:
DT[, .(AvgMR = mean(SMMAPRT, na.rm=TRUE))]
# AvgMR
# <num>
# 1: 0.967304
The power of data.table
starts to emerge when we combine
these operations:
DT[Date < '2012-02-17', .(AvgMR = mean(SMMAPRT, na.rm=TRUE))]
# AvgMR
# <num>
# 1: 0.8452967
DT[Date > '2012-02-17', .(AvgMR = mean(SMMAPRT, na.rm=TRUE))]
# AvgMR
# <num>
# 1: 0.9684423
Exercise: What is the average age of donors who died suddenly (DTHHRDY=1)? After long illnesses (DTHHRDY=4)?
Finally, the by
option in data.table
lets
us loop these operations over groups of samples. For example, we can
look at average mapping rate by year:
DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), year(Date)]
# year N AvgMR
# <int> <int> <num>
# 1: 2014 8746 0.9888829
# 2: 2013 3126 0.9881517
# 3: 2015 695 0.9876827
# 4: NA 58 0.9912968
# 5: 2011 230 0.8352740
# 6: 2012 2743 0.8341782
Or average mapping rate by tissue type:
DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS]
# SMTS N AvgMR
# <char> <int> <num>
# 1: Blood 2561 0.9263712
# 2: Adipose Tissue 886 0.9638252
# 3: Muscle 718 0.9523384
# 4: Blood Vessel 1041 0.9674999
# 5: Heart 727 0.9615041
# 6: Ovary 138 0.9882082
# 7: Uterus 117 0.9873945
# 8: Vagina 124 0.9881847
# 9: Breast 306 0.9819909
# 10: Skin 1362 0.9702419
# 11: Salivary Gland 104 0.9874789
# 12: Brain 2076 0.9590078
# 13: Adrenal Gland 205 0.9892859
# 14: Thyroid 564 0.9454281
# 15: Lung 607 0.9452809
# 16: Spleen 169 0.9872840
# 17: Pancreas 268 0.9850086
# 18: Esophagus 1111 0.9881868
# 19: Stomach 272 0.9845115
# 20: Colon 539 0.9886610
# 21: Small Intestine 143 0.9884259
# 22: Prostate 159 0.9836195
# 23: Testis 284 0.9876906
# 24: Nerve 498 0.9427457
# 25: Pituitary 192 0.9835776
# 26: Liver 188 0.9883874
# 27: Kidney 50 0.9878707
# 28: Fallopian Tube 7 0.9887148
# 29: Bladder 11 0.9889152
# 30: Cervix Uteri 11 0.9873638
# 31: Bone Marrow 160 0.9528658
# SMTS N AvgMR
In the above, note that .N
is a special variable that
counts the number of rows in each subset of the data we are doing our
calculations by. So, for example, we now know that there were 169 spleen
samples, and that they had an average map rate of 98.7%.
Finally, we can put all of these things together to, for example, calculate the average map rate for each subtype of blood vessel:
DT[SMTS == 'Skin', .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTSD]
# SMTSD N AvgMR
# <char> <int> <num>
# 1: Skin - Not Sun Exposed (Suprapubic) 408 0.9812895
# 2: Skin - Sun Exposed (Lower leg) 592 0.9472045
# 3: Cells - Transformed fibroblasts 362 0.9909403
It is often useful to take the results of one data table operation and run another data table operation on it. This could be done by running two separate lines of code and saving the intermediate results as an R object, but this is (a) inefficient and (b) extremely error-prone. A more concise alternative is to “pipe” the results of one operation to the next. For example, suppose we want to take our table of average map rates by tissue type and sort by map rate:
DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS][order(AvgMR)]
# SMTS N AvgMR
# <char> <int> <num>
# 1: Blood 2561 0.9263712
# 2: Nerve 498 0.9427457
# 3: Lung 607 0.9452809
# 4: Thyroid 564 0.9454281
# 5: Muscle 718 0.9523384
# 6: Bone Marrow 160 0.9528658
# 7: Brain 2076 0.9590078
# 8: Heart 727 0.9615041
# 9: Adipose Tissue 886 0.9638252
# 10: Blood Vessel 1041 0.9674999
# 11: Skin 1362 0.9702419
# 12: Breast 306 0.9819909
# 13: Pituitary 192 0.9835776
# 14: Prostate 159 0.9836195
# 15: Stomach 272 0.9845115
# 16: Pancreas 268 0.9850086
# 17: Spleen 169 0.9872840
# 18: Cervix Uteri 11 0.9873638
# 19: Uterus 117 0.9873945
# 20: Salivary Gland 104 0.9874789
# 21: Testis 284 0.9876906
# 22: Kidney 50 0.9878707
# 23: Vagina 124 0.9881847
# 24: Esophagus 1111 0.9881868
# 25: Ovary 138 0.9882082
# 26: Liver 188 0.9883874
# 27: Small Intestine 143 0.9884259
# 28: Colon 539 0.9886610
# 29: Fallopian Tube 7 0.9887148
# 30: Bladder 11 0.9889152
# 31: Adrenal Gland 205 0.9892859
# SMTS N AvgMR
For readability, it is often best to separate these onto different lines:
DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS] %>%
.[order(AvgMR)] %>%
.[1:5]
# SMTS N AvgMR
# <char> <int> <num>
# 1: Blood 2561 0.9263712
# 2: Nerve 498 0.9427457
# 3: Lung 607 0.9452809
# 4: Thyroid 564 0.9454281
# 5: Muscle 718 0.9523384
Note that we can also pipe the results to other R functions:
DT[, .(.N, AvgMR = mean(SMMAPRT, na.rm=TRUE)), SMTS] %>%
.[order(AvgMR)] %>%
use_series('AvgMR') %>%
median()
# [1] 0.9850086
Again, you can accomplish this same task in completely separate lines of code, saving intermediate results every time, but this is (a) harder to read, (b) harder to modify, and (c) a constant source of errors, because you will inevitably change the intermediate variable in one place and forget to change it in another.
Exercise: What are the most and least common subtypes of brain samples? Produce an ordered list.
Suppose we wanted to know something like the average age of the
donors for each tissue type. We cannot obtain this information directly
(yet), because the age information is in the subj
data
table, while the tissue type information is in the samp
data table. To answer this kind of question, we will have to merge the
two data sets.
In order to merge two data sets, we need to have a key: a column that appears in both data tables and that tells us which rows to combine. In our case, we don’t exactly have such a column, but we can clearly create a subject ID from the sample ID by cutting off everything after the second dash:
samp[, SUBJID := gsub('([^-]*)-([^-]*)-.*', '\\1-\\2', SAMPID)]
This involves regular expressions; if you don’t know what they are,
don’t worry about it. Once we’ve created the column, we can merge the
two data sets using merge
:
sdt <- merge(samp, subj, by='SUBJID')
We have just combined the sample information and subject information
into a single data set which I’m calling sdt
for “samples
data table”. To get a better understanding for what just happened, let’s
look here:
dim(samp)
# [1] 15598 65
dim(subj)
# [1] 752 5
dim(sdt)
# [1] 15598 69
samp[SUBJID=='GTEX-WWTW', .(SUBJID, SAMPID, SMTS, SMTSD)]
# SUBJID SAMPID SMTS SMTSD
# <char> <char> <char> <char>
# 1: GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH Blood Cells - EBV-transformed lymphocytes
# 2: GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE Skin Cells - Transformed fibroblasts
# 3: GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS Muscle Muscle - Skeletal
# 4: GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue Adipose - Visceral (Omentum)
subj[SUBJID=='GTEX-WWTW', .(SUBJID, SEX, AGE, DTHHRDY)]
# SUBJID SEX AGE DTHHRDY
# <char> <int> <char> <int>
# 1: GTEX-WWTW 1 50-59 0
sdt[SUBJID=='GTEX-WWTW', .(SUBJID, SAMPID, SMTS, SEX, AGE, DTHHRDY)]
# Key: <SUBJID>
# SUBJID SAMPID SMTS SEX AGE DTHHRDY
# <char> <char> <char> <int> <char> <int>
# 1: GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH Blood 1 50-59 0
# 2: GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE Skin 1 50-59 0
# 3: GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS Muscle 1 50-59 0
# 4: GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue 1 50-59 0
This is sometimes called a “one-to-many” merge: a single entry in
subj
is being matched and merged with many entries in
samp
. But each of the GTEX-WWTW
samples is
coming from the same person, so they are all coming from a man in his
fifties who died after being on a ventilator. There are more complicated
issues that come up in merging, like what to do when records can’t be
matched, but none of those issues are present in this example,
thankfully.
One issue that does come up, however, is the fact that there are
subjects in sdt
that we do not have any expression data on.
So let’s amend our earlier assignment to also subset by samples that are
in the expression matrix:
sdt <- merge(samp, subj, by='SUBJID')[SAMPID %in% rownames(E)]
Note that sdt
matches up with E
now:
dim(sdt)
# [1] 11688 69
dim(E)
# [1] 11688 18851
all.equal(sdt$SAMPID, rownames(E))
# [1] TRUE
Exercise: So, what is the average age of the donors for each tissue type?
OK, now that we are familiar with how to use data.table
,
let’s get back to the central biological question behind the GTEx
project: how does the expression of genes vary with tissue type?
Let’s say we decide to start by asking the question: what genes are the most highly expressed in the pancreas? We might think about approach the question like this:
E
to include just pancreas samplesLet’s see how well this approach works. Subsetting E is pretty
straightforward – we just use data.table
to get a list of
the sample IDs that belong to pancreas.
To take the mean of each gene, we can use a function called
apply
. apply
is a simple but very useful
function in R that applies a specified function to every row or table in
a matrix. For example, suppose we had a matrix called X
. To
find the mean of each row:
apply(X, 1, mean)
Or, to find the median of each column:
apply(X, 2, median)
OK, so let’s execute our plan. I’ll combine steps 1 and 2 into the following line of code:
pancreasMeans <- apply(E[sdt[SMTS=="Pancreas"]$SAMPID,], 2, mean)
Now, let’s see what those top genes are:
(topGenes <- head(sort(pancreasMeans, decreasing=TRUE), n=20))
# ENSG00000115386 ENSG00000137392 ENSG00000142789 ENSG00000172023 ENSG00000153002 ENSG00000219073
# 65328.1290 54361.7903 28047.9032 23745.4786 17186.9113 10602.2218
# ENSG00000142615 ENSG00000172016 ENSG00000243480 ENSG00000164266 ENSG00000162438 ENSG00000231500
# 6051.7357 5972.2494 4901.6599 4768.4274 3741.7242 2362.6488
# ENSG00000143954 ENSG00000215704 ENSG00000122406 ENSG00000177954 ENSG00000163682 ENSG00000168028
# 2128.6467 1380.4270 1140.2210 978.6609 849.6633 822.7073
# ENSG00000143947 ENSG00000188846
# 821.2524 806.2048
gdt[names(topGenes)]
# ID Symbol Description
# <char> <char> <char>
# 1: ENSG00000115386 REG1A regenerating family member 1 alpha
# 2: ENSG00000137392 CLPS colipase
# 3: ENSG00000142789 CELA3A chymotrypsin like elastase family member 3A
# 4: ENSG00000172023 REG1B regenerating family member 1 beta
# 5: ENSG00000153002 CPB1 carboxypeptidase B1
# 6: ENSG00000219073 CELA3B chymotrypsin like elastase family member 3B
# 7: ENSG00000142615 CELA2A chymotrypsin like elastase family member 2A
# 8: ENSG00000172016 REG3A regenerating family member 3 alpha
# 9: ENSG00000243480 AMY2A amylase, alpha 2A (pancreatic)
# 10: ENSG00000164266 SPINK1 serine peptidase inhibitor, Kazal type 1
# 11: ENSG00000162438 CTRC chymotrypsin C
# 12: ENSG00000231500 RPS18 ribosomal protein S18
# 13: ENSG00000143954 REG3G regenerating family member 3 gamma
# 14: ENSG00000215704 CELA2B chymotrypsin like elastase family member 2B
# 15: ENSG00000122406 RPL5 ribosomal protein L5
# 16: ENSG00000177954 RPS27 ribosomal protein S27
# 17: ENSG00000163682 RPL9 ribosomal protein L9
# 18: ENSG00000168028 RPSA ribosomal protein SA
# 19: ENSG00000143947 RPS27A ribosomal protein S27a
# 20: ENSG00000188846 RPL14 ribosomal protein L14
# ID Symbol Description
At the top of the list are the REG genes and the chymotrypsin genes;
this makes sense as these genes are directly related to the function of
the pancreas. However, there are also a bunch of “ribosomal protein”
results. These are perhaps not so interesting – they indicate that
pancreas cells are making ribosomes, but all cells do that. So perhaps
these genes are high in all tissues, not just pancreas. In
order to see whether this is the case, I’m going to create a function
called tissuePlot
:
tissuePlot <- function(gene) {
op <- par(mar=c(5,7,2,1))
boxplot(E[,gene] ~ sdt$SMTS, horizontal=TRUE, las=1, col='gray', xlab='Expression', ylab='')
mtext(gdt[gene, "Symbol", with=FALSE], line=0.5)
par(op)
}
tissuePlot("ENSG00000231500")
So the pancreas makes a lot of RPS18, but plenty of other tissues do too. Let’s see how, say, colipase compares:
tissuePlot("ENSG00000137392")
This seems much more relevant as a pancreas-specific gene – and in fact, the pancreas is the only organ that produces colipase. At this point, it’s worth taking a moment to think about how we should define a better metric that prioritizes genes like CLPS over genes like RPS18. I explore two ideas in this lab, but there are many other reasonable approaches to addressing this question.
One simple approach is based on calculating the difference in means between the tissue of interest and all other tissues. We could define a tissue-specific gene as one which is more highly expressed in that tissue than any other tissue, and then rank genes by the difference between the tissue of interest and the tissue with the second-highest expression.
To implement this idea, let’s define a scoring function, then apply
it to all the rows of E
. Note that I am using
data.table
and the aggregation features in my scoring
function.
score <- function(x, tissue) {
sdt[, TEMP := x]
means <- sdt[, .(Avg = mean(TEMP)), SMTS]
means[SMTS==tissue]$Avg - max(means[SMTS!=tissue]$Avg)
}
a <- apply(E, 2, score, tissue='Pancreas')
Now let’s look at the top genes according to this metric:
(topGenes <- head(sort(a, decreasing=TRUE), n=20))
# ENSG00000115386 ENSG00000137392 ENSG00000142789 ENSG00000172023 ENSG00000153002 ENSG00000219073
# 57690.45916 54303.59989 27995.70975 20523.49230 17104.19415 10582.93090
# ENSG00000142615 ENSG00000243480 ENSG00000164266 ENSG00000162438 ENSG00000172016 ENSG00000143954
# 6040.26096 4891.72496 4380.88762 3736.90332 2900.71688 2018.09118
# ENSG00000215704 ENSG00000240038 ENSG00000115263 ENSG00000114204 ENSG00000124713 ENSG00000142677
# 1376.90391 727.10583 352.17229 264.68390 248.36380 112.81369
# ENSG00000204140 ENSG00000196975
# 104.86277 56.95739
gdt[names(topGenes)]
# ID Symbol Description
# <char> <char> <char>
# 1: ENSG00000115386 REG1A regenerating family member 1 alpha
# 2: ENSG00000137392 CLPS colipase
# 3: ENSG00000142789 CELA3A chymotrypsin like elastase family member 3A
# 4: ENSG00000172023 REG1B regenerating family member 1 beta
# 5: ENSG00000153002 CPB1 carboxypeptidase B1
# 6: ENSG00000219073 CELA3B chymotrypsin like elastase family member 3B
# 7: ENSG00000142615 CELA2A chymotrypsin like elastase family member 2A
# 8: ENSG00000243480 AMY2A amylase, alpha 2A (pancreatic)
# 9: ENSG00000164266 SPINK1 serine peptidase inhibitor, Kazal type 1
# 10: ENSG00000162438 CTRC chymotrypsin C
# 11: ENSG00000172016 REG3A regenerating family member 3 alpha
# 12: ENSG00000143954 REG3G regenerating family member 3 gamma
# 13: ENSG00000215704 CELA2B chymotrypsin like elastase family member 2B
# 14: ENSG00000240038 AMY2B amylase, alpha 2B (pancreatic)
# 15: ENSG00000115263 GCG glucagon
# 16: ENSG00000114204 SERPINI2 serpin family I member 2
# 17: ENSG00000124713 GNMT glycine N-methyltransferase
# 18: ENSG00000142677 IL22RA1 interleukin 22 receptor subunit alpha 1
# 19: ENSG00000204140 CLPSL1 colipase like 1
# 20: ENSG00000196975 ANXA4 annexin A4
# ID Symbol Description
Note that all of the “ribosomal protein” genes are gone. Try plotting some of the other genes on this list. Do they look similar to colipase?
How many pancreas-specific genes are there, according to our working definition?
sum(a>0)
# [1] 48
Exercise: Try applying this approach to another tissue, such as brain. How many brain-specific genes are there?
An alternative approach is to look at how many standard deviations above the rest a certain tissue is. This approach opens the door for some genes with more modest abundance to show up at the top of our list, provided that they are not made in any other tissues. Again, let’s define a score and apply it:
score <- function(x, tissue) {
sdt[, TEMP := x]
ms <- sdt[, .(Avg = mean(TEMP), SD = sd(TEMP)), SMTS == tissue]
(ms[SMTS==TRUE]$Avg-ms[SMTS==FALSE]$Avg)/max(ms[SMTS==FALSE]$SD, 1)
}
b <- apply(E, 2, score, tissue='Pancreas')
And look at the top genes according to this approach:
(topGenes <- head(sort(b, decreasing=TRUE), n=20))
# ENSG00000137392 ENSG00000162438 ENSG00000153002 ENSG00000243480 ENSG00000142789 ENSG00000215704
# 208.42385 198.08592 177.06821 144.42881 142.22279 142.07116
# ENSG00000219073 ENSG00000142615 ENSG00000114204 ENSG00000204140 ENSG00000143954 ENSG00000240038
# 130.11153 125.69585 125.67201 80.72087 57.57300 55.12126
# ENSG00000164266 ENSG00000185176 ENSG00000115386 ENSG00000115263 ENSG00000184945 ENSG00000172023
# 49.88636 47.19105 40.32896 36.27333 32.35585 28.08859
# ENSG00000080293 ENSG00000138472
# 18.38351 15.54598
gdt[names(topGenes)]
# ID Symbol Description
# <char> <char> <char>
# 1: ENSG00000137392 CLPS colipase
# 2: ENSG00000162438 CTRC chymotrypsin C
# 3: ENSG00000153002 CPB1 carboxypeptidase B1
# 4: ENSG00000243480 AMY2A amylase, alpha 2A (pancreatic)
# 5: ENSG00000142789 CELA3A chymotrypsin like elastase family member 3A
# 6: ENSG00000215704 CELA2B chymotrypsin like elastase family member 2B
# 7: ENSG00000219073 CELA3B chymotrypsin like elastase family member 3B
# 8: ENSG00000142615 CELA2A chymotrypsin like elastase family member 2A
# 9: ENSG00000114204 SERPINI2 serpin family I member 2
# 10: ENSG00000204140 CLPSL1 colipase like 1
# 11: ENSG00000143954 REG3G regenerating family member 3 gamma
# 12: ENSG00000240038 AMY2B amylase, alpha 2B (pancreatic)
# 13: ENSG00000164266 SPINK1 serine peptidase inhibitor, Kazal type 1
# 14: ENSG00000185176 AQP12B aquaporin 12B
# 15: ENSG00000115386 REG1A regenerating family member 1 alpha
# 16: ENSG00000115263 GCG glucagon
# 17: ENSG00000184945 AQP12A aquaporin 12A
# 18: ENSG00000172023 REG1B regenerating family member 1 beta
# 19: ENSG00000080293 SCTR secretin receptor
# 20: ENSG00000138472 GUCA1C guanylate cyclase activator 1C
# ID Symbol Description
Many of the same genes show up, but REG1A, for example, moves down the list and genes like AQP12B move up. We can get a sense for why this happens by plotting these two genes:
par(mfrow=c(1,2))
tissuePlot('ENSG00000115386')
tissuePlot('ENSG00000185176')
REG1A is much more highly expressed, but also found in the small intestine. AQP12B, on the other hand, is almost never expressed in any other tissue.
With this measure, we might define a pancreas-specific gene as one that is, say, 3 standard deviations above the rest. How many pancreas-specific genes are there, according to our working definition?
sum(b>3)
# [1] 43
Both approaches (A and B) seem reasonable; how much overlap is there between them?
length(intersect(names(b[b>3]), names(a[a>0])))
# [1] 38
So, quite a bit of overlap: of the 43 pancreas-specific genes according to definition B, 38 are also pancreas-specific according to definition A. Perhaps the best way of defining tissue-specific genes is to combine definitions A and B? It’s an interesting idea, but in the interest of time, we’ll move on.
Something we haven’t looked at yet is the extent to which gene expression within a tissue depends on subject characteristics. Let’s look at a gene that I happen to know something about, called SCN5A. SCN5A is a heart gene involved in controlling the electrical channels that keep the heart beating regularly; many cardiac diseases are caused by mutations in SCN5A.
tissuePlot('ENSG00000183873')
I collaborate with researchers in the cardiology department here at the University of Iowa, and one of their concerns with using this data from GTEx was whether it makes sense to pool all of these samples or not, given that they come from a variety of subjects and were donated in a variety of ways. In particular, their concern was that expression might be different in a healthy, living heart than in a diseased, dying heart.
This concept can be measured in a number of ways, but one of those ways is the “Hardy scale”, which categorizes how long a subject was dying prior to the collection of the sample:
Does SCN5A expression depend on Hardy classification?
sdt[, SCN5A := E[,'ENSG00000183873']]
boxplot(SCN5A ~ DTHHRDY, sdt[SMTS == "Heart"], col='gray', las=1, xlab='Hardy scale', ylab='SCN5A')
Very much so – there is a clear trend here in which SCN5A expression in the heart seems to decline as a subject dies.
There are certainly lots of other questions we could explore here – other tissues, other subject characteristics, as well as a whole world of gene characteristics and annotations that we’re not going to get into. If there’s time left in lab, I encourage you to explore the data set and ask your own questions. For example,
Hopefully this lab has given you some experience with a useful tool
in data.table
, shown the kinds of operations and issues
that arise in big data sets, and exposed you to gene expression, a
common type of genomic data. If you have questions, please let me
know!