# install.packages("curl")
# install.packages("readr")
library(dplyr)
library(ggplot2)
library(lubridate)
Working with big data in genomics
Introduction
Overview
The goal of this lab is three-fold:
- Get practice manipulating large, complex data sets
- Learn about genomic data
Genomic data tends to be large and complex, and dimension reduction is important, so these skills go hand-in-hand. We will be working on data from GTEx, the Genotype-Tissue Expression project.
The main idea of the GTEx project is to understand how the expression levels of genes varies depending on tissue. The DNA sequence of your genes is the same in all of your tissues, but what those tissues do with that DNA varies greatly. Muscle cells, for examples, make a lot of actin and myosin, the proteins that make up muscle cells. In biological jargon, muscle tissue expresses the actin and myosin genes at a high level. Liver cells, on the other hand, make lots of enzymes and express actin and myosin at much lower levels.
Since previous instructors have covered it already, we’ll use dplyr to working with these objects: merging, subsetting, cross-referencing them, and so on (the other widely used package is data.table). If you’re on a Windows machine, you’ll need to install curl (as well as dplyr, ggplot2, lubridate, and readr, although you might already have them installed):
Getting the data
I prepared the following two files:
- Expression data: Approximately 1.6 gigabytes of expression data from the GTEx project; the entire project contains hundreds of gigabytes of data, but 1.6 GB seemed about right for our purposes today. Note that the size of the
.rds
file, which is compressed, is only 211 Mb, but when you load it into R, it consumes 1.6 GB of RAM. You can certainly download the full set of expression data from GTEx, but it’s too large to fit into RAM on many machines. The file I am providing is a subset of that data, restricted to protein-coding genes. - Gene metadata: An additional file containing some metadata about the genes and what they do.
In addition, let’s download two data sets from the GTEx portal itself:
<- readr::read_tsv("https://storage.googleapis.com/adult-gtex/annotations/v7/metadata-files/GTEx_v7_Annotations_SampleAttributesDS.txt")
samp <- readr::read_tsv("https://storage.googleapis.com/adult-gtex/annotations/v7/metadata-files/GTEx_v7_Annotations_SubjectPhenotypesDS.txt") subj
How the data is organized
The bulk of the data is contained in E
, an enormous matrix:
class(E)
# [1] "matrix" "array"
dim(E)
# [1] 11688 18851
1:10, 1:4]
E[# ENSG00000000003 ENSG00000000005 ENSG00000000419
# GTEX-1117F-0226-SM-5GZZ7 0 0 0
# GTEX-1117F-0426-SM-5EGHI 0 0 0
# GTEX-1117F-0526-SM-5EGHJ 0 0 0
# GTEX-1117F-0626-SM-5N9CS 0 0 0
# GTEX-1117F-0726-SM-5GIEN 0 0 0
# GTEX-1117F-1326-SM-5EGHH 0 0 0
# GTEX-1117F-2226-SM-5N9CH 0 0 0
# GTEX-1117F-2426-SM-5EGGH 0 0 0
# GTEX-1117F-2526-SM-5GZY6 0 0 0
# GTEX-1117F-2826-SM-5GZXL 0 0 0
# ENSG00000000457
# GTEX-1117F-0226-SM-5GZZ7 7.584
# GTEX-1117F-0426-SM-5EGHI 1.850
# GTEX-1117F-0526-SM-5EGHJ 5.418
# GTEX-1117F-0626-SM-5N9CS 7.931
# GTEX-1117F-0726-SM-5GIEN 2.327
# GTEX-1117F-1326-SM-5EGHH 2.890
# GTEX-1117F-2226-SM-5N9CH 6.386
# GTEX-1117F-2426-SM-5EGGH 9.010
# GTEX-1117F-2526-SM-5GZY6 11.380
# GTEX-1117F-2826-SM-5GZXL 7.251
Each row of E
corresponds to a sample; each column corresponds to a gene. For example, the above tells us that in sample GTEX-1117F-0226-SM-5GZZ7, the expression level of gene ENSG00000000457 was 7.584. What is gene ENSG00000000457? More information on the genes is provided in the object gdt
:
class(gdt)
# [1] "data.table" "data.frame"
dim(gdt)
# [1] 18851 3
1:10,]
gdt[# ID Symbol
# 1 ENSG00000000003 TSPAN6
# 2 ENSG00000000005 TNMD
# 3 ENSG00000000419 DPM1
# 4 ENSG00000000457 SCYL3
# 5 ENSG00000000460 C1orf112
# 6 ENSG00000000938 FGR
# 7 ENSG00000000971 CFH
# 8 ENSG00000001036 FUCA2
# 9 ENSG00000001084 GCLC
# 10 ENSG00000001167 NFYA
# Description
# 1 tetraspanin 6
# 2 tenomodulin
# 3 dolichyl-phosphate mannosyltransferase subunit 1, catalytic
# 4 SCY1 like pseudokinase 3
# 5 chromosome 1 open reading frame 112
# 6 FGR proto-oncogene, Src family tyrosine kinase
# 7 complement factor H
# 8 alpha-L-fucosidase 2
# 9 glutamate-cysteine ligase catalytic subunit
# 10 nuclear transcription factor Y subunit alpha
How do I learn more about sample GTEX-1117F-0226-SM-5GZZ7? For example, what tissue did it come from? That information is contained in the object samp
:
class(samp)
# [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(samp)
# [1] 15598 63
1:10, c("SAMPID", "SMTS", "SMTSD", "SMGEBTCHD")]
samp[# # A tibble: 10 × 4
# SAMPID SMTS SMTSD SMGEBTCHD
# <chr> <chr> <chr> <chr>
# 1 GTEX-1117F-0003-SM-58Q7G Blood Whole Blood 01/19/20…
# 2 GTEX-1117F-0003-SM-5DWSB Blood Whole Blood 01/28/20…
# 3 GTEX-1117F-0003-SM-6WBT7 Blood Whole Blood 8/11/2014
# 4 GTEX-1117F-0226-SM-5GZZ7 Adipose Tissue Adipose - Subcutaneous 03/05/20…
# 5 GTEX-1117F-0426-SM-5EGHI Muscle Muscle - Skeletal 02/08/20…
# 6 GTEX-1117F-0526-SM-5EGHJ Blood Vessel Artery - Tibial 02/08/20…
# 7 GTEX-1117F-0626-SM-5N9CS Blood Vessel Artery - Coronary 03/22/20…
# 8 GTEX-1117F-0726-SM-5GIEN Heart Heart - Atrial Appendage 02/27/20…
# 9 GTEX-1117F-1326-SM-5EGHH Adipose Tissue Adipose - Visceral (Omentu… 02/08/20…
# 10 GTEX-1117F-2226-SM-5N9CH Ovary Ovary 03/22/20…
Finally, there is also information about the subjects that these samples came from; that information is contained in the object subj
:
class(subj)
# [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(subj)
# [1] 752 4
subj# # A tibble: 752 × 4
# SUBJID SEX AGE DTHHRDY
# <chr> <dbl> <chr> <dbl>
# 1 GTEX-1117F 2 60-69 4
# 2 GTEX-111CU 1 50-59 0
# 3 GTEX-111FC 1 60-69 1
# 4 GTEX-111VG 1 60-69 3
# 5 GTEX-111YS 1 60-69 0
# 6 GTEX-1122O 2 60-69 0
# 7 GTEX-1128S 2 60-69 2
# 8 GTEX-113IC 1 60-69 NA
# 9 GTEX-113JC 2 50-59 2
# 10 GTEX-117XS 1 60-69 2
# # ℹ 742 more rows
There is much more subject-level data that was collected on these subjects, such as manner of death, race, exact age, etc. However, these are the only variables that can be publicly shared.
It is difficult/impossible to determine what many of the variables in samp/subj mean just from the variable name. I’ll explain all the ones that I refer to in the lab, but if you’re curious about what other ones mean, consult the data dictionaries:
Primary operations
To start, we’re just going to briefly review some of the main dplyr operations. One of the most important is mutate()
, which creates new variables (or modifies existing ones). In particular, let’s create new columns called Date
in the samp
data table, formatted as a date variable (the original is just coded as a character string) and Age
in the subj
data table (the original is a string that represents a range, which we are replacing my its midpoint):
# Create Age and Date variables
<- mutate(samp, Date = as.Date(SMGEBTCHD, format = "%m/%d/%Y"))
samp <- mutate(subj, Age = as.numeric(substr(AGE, 1, 2)) + 5) subj
Now, let’s create a smaller data set from samp
just for illustration purposes:
<- samp |> select(Date, SMTS, SMTSD, SMMAPRT) dat
Subsetting
One extremely common task is to extract subsets (filter) the data. For example, to find the subset of samples with mapping rates below 60%, we could write:
# Samples with map rate under 60%
filter(dat, SMMAPRT < 0.6)
# # A tibble: 242 × 4
# Date SMTS SMTSD SMMAPRT
# <date> <chr> <chr> <dbl>
# 1 2012-04-02 Brain Brain - Anterior cingulate cortex (BA24) 0.150
# 2 2012-04-02 Brain Brain - Hypothalamus 0.598
# 3 2012-06-09 Brain Brain - Cortex 0.114
# 4 2012-04-02 Brain Brain - Substantia nigra 0.486
# 5 2012-04-02 Brain Brain - Caudate (basal ganglia) 0.426
# 6 2012-05-27 Brain Brain - Spinal cord (cervical c-1) 0.264
# 7 2012-04-02 Brain Brain - Hippocampus 0.329
# 8 2012-06-09 Skin Skin - Sun Exposed (Lower leg) 0.365
# 9 2012-04-02 Thyroid Thyroid 0.573
# 10 2012-04-02 Blood Whole Blood 0.460
# # ℹ 232 more rows
To get summaries, we use summarize()
:
# Average map rate
summarize(dat, avg_mr = mean(SMMAPRT, na.rm = TRUE))
# # A tibble: 1 × 1
# avg_mr
# <dbl>
# 1 0.967
The power of tools like dplyr (and data.table and SQL, which have similar functions) becomes more clear when you combine subsetting and summarization, tying them together with pipes:
# Combine subset with calculation
|> filter(Date < "2012-02-17") |>
dat summarize(avg_mr = mean(SMMAPRT, na.rm = TRUE))
# # A tibble: 1 × 1
# avg_mr
# <dbl>
# 1 0.845
|> filter(Date >= "2012-02-17") |>
dat summarize(avg_mr = mean(SMMAPRT, na.rm = TRUE))
# # A tibble: 1 × 1
# avg_mr
# <dbl>
# 1 0.968
Exercise: What is the average age of donors who died suddenly (DTHHRDY=1)? After long illnesses (DTHHRDY=4)?
Aggregations
One final extremely useful operation is Finally, the by
option in data.table
lets us loop these operations over groups of samples. For example, we can look at average mapping rate by year:
# Map rate by year
|>
dat group_by(year(Date)) |>
summarize(
n = n(),
avg_mr = mean(SMMAPRT, na.rm = TRUE)
)# # A tibble: 6 × 3
# `year(Date)` n avg_mr
# <dbl> <int> <dbl>
# 1 2011 230 0.835
# 2 2012 2743 0.834
# 3 2013 3126 0.988
# 4 2014 8746 0.989
# 5 2015 695 0.988
# 6 NA 58 0.991
Or average mapping rate by tissue type:
|>
dat group_by(SMTS) |>
summarize(
n = n(),
avg_mr = mean(SMMAPRT, na.rm = TRUE)
)# # A tibble: 31 × 3
# SMTS n avg_mr
# <chr> <int> <dbl>
# 1 Adipose Tissue 886 0.964
# 2 Adrenal Gland 205 0.989
# 3 Bladder 11 0.989
# 4 Blood 2561 0.926
# 5 Blood Vessel 1041 0.967
# 6 Bone Marrow 160 0.953
# 7 Brain 2076 0.959
# 8 Breast 306 0.982
# 9 Cervix Uteri 11 0.987
# 10 Colon 539 0.989
# # ℹ 21 more rows
In the above, note that n()
is a special function that counts the number of rows in each subset of the data we are doing our calculations by. So, for example, we now know that there were 160 bone marrow samples, and that they had an average map rate of 95.3%.
We can put all of these things together to, for example, calculate the average map rate for each subtype of skin sample:
# Combine subset, calculation, and grouping
|>
dat filter(SMTS == "Skin") |>
group_by(SMTSD) |>
summarize(
n = n(),
avg_mr = mean(SMMAPRT, na.rm = TRUE)
)# # A tibble: 3 × 3
# SMTSD n avg_mr
# <chr> <int> <dbl>
# 1 Cells - Transformed fibroblasts 362 0.991
# 2 Skin - Not Sun Exposed (Suprapubic) 408 0.981
# 3 Skin - Sun Exposed (Lower leg) 592 0.947
We can continue to add steps to our pipeline. For example, maybe we want to see all the sample types, and have them ordered:
|>
dat group_by(SMTS) |>
summarize(
n = n(),
avg_mr = mean(SMMAPRT, na.rm = TRUE)
|>
) arrange(avg_mr) |>
print(n = Inf)
# # A tibble: 31 × 3
# SMTS n avg_mr
# <chr> <int> <dbl>
# 1 Blood 2561 0.926
# 2 Nerve 498 0.943
# 3 Lung 607 0.945
# 4 Thyroid 564 0.945
# 5 Muscle 718 0.952
# 6 Bone Marrow 160 0.953
# 7 Brain 2076 0.959
# 8 Heart 727 0.962
# 9 Adipose Tissue 886 0.964
# 10 Blood Vessel 1041 0.967
# 11 Skin 1362 0.970
# 12 Breast 306 0.982
# 13 Pituitary 192 0.984
# 14 Prostate 159 0.984
# 15 Stomach 272 0.985
# 16 Pancreas 268 0.985
# 17 Spleen 169 0.987
# 18 Cervix Uteri 11 0.987
# 19 Uterus 117 0.987
# 20 Salivary Gland 104 0.987
# 21 Testis 284 0.988
# 22 Kidney 50 0.988
# 23 Vagina 124 0.988
# 24 Esophagus 1111 0.988
# 25 Ovary 138 0.988
# 26 Liver 188 0.988
# 27 Small Intestine 143 0.988
# 28 Colon 539 0.989
# 29 Fallopian Tube 7 0.989
# 30 Bladder 11 0.989
# 31 Adrenal Gland 205 0.989
Exercise: What are the most and least common subtypes of brain samples? Produce an ordered list.
Merging
Suppose we wanted to know something like the average age of the donors for each tissue type. We cannot obtain this information directly (yet), because the age information is in the subj
data table, while the tissue type information is in the samp
data table. To answer this kind of question, we will have to merge the two data sets.
In order to merge two data sets, we need to have a key: a column that appears in both data tables and that tells us which rows to combine. In our case, we don’t exactly have such a column, but we can clearly create a subject ID from the sample ID by cutting off everything after the second dash:
# Creating SUBJID column
<- mutate(samp, SUBJID = gsub('([^-]*)-([^-]*)-.*', '\\1-\\2', SAMPID)) samp
This involves regular expressions; if you don’t know what they are, don’t worry about it. Once we’ve created the column, we can merge the two data sets using inner_join()
:
# Merging samp and subj
<- inner_join(samp, subj, by = "SUBJID") sdt
We have just combined the sample information and subject information into a single data set which I’m calling sdt
for “samples data table”. To get a better understanding for what just happened, let’s look here:
# Let's see what we did
dim(samp)
# [1] 15598 65
dim(subj)
# [1] 752 5
dim(sdt)
# [1] 15598 69
|> filter(SUBJID == "GTEX-WWTW") |> select(SUBJID, SAMPID, SMTS, SMTSD)
samp # # A tibble: 4 × 4
# SUBJID SAMPID SMTS SMTSD
# <chr> <chr> <chr> <chr>
# 1 GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH Blood Cells - EBV-transformed lymp…
# 2 GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE Skin Cells - Transformed fibrobla…
# 3 GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS Muscle Muscle - Skeletal
# 4 GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue Adipose - Visceral (Omentum)
|> filter(SUBJID == "GTEX-WWTW") |> select(SUBJID, SEX, AGE, DTHHRDY)
subj # # A tibble: 1 × 4
# SUBJID SEX AGE DTHHRDY
# <chr> <dbl> <chr> <dbl>
# 1 GTEX-WWTW 1 50-59 0
|> filter(SUBJID == "GTEX-WWTW") |> select(SUBJID, SAMPID, SMTS, SEX, AGE, DTHHRDY)
sdt # # A tibble: 4 × 6
# SUBJID SAMPID SMTS SEX AGE DTHHRDY
# <chr> <chr> <chr> <dbl> <chr> <dbl>
# 1 GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH Blood 1 50-59 0
# 2 GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE Skin 1 50-59 0
# 3 GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS Muscle 1 50-59 0
# 4 GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue 1 50-59 0
This is sometimes called a “one-to-many” merge: a single entry in subj
is being matched and merged with many entries in samp
. But each of the GTEX-WWTW
samples is coming from the same person, so they are all coming from a man in his fifties who died after being on a ventilator. There are more complicated issues that come up in merging, like what to do when records can’t be matched, but none of those issues are present in this example, thankfully.
One issue that does come up, however, is the fact that there are subjects in sdt
that we do not have any expression data on. So let’s amend our earlier assignment to also subset by samples that are in the expression matrix:
# Merging: revised
<- inner_join(samp, subj, by = "SUBJID") |>
sdt filter(SAMPID %in% rownames(E))
Note that sdt
matches up with E
now:
dim(sdt)
# [1] 11688 69
dim(E)
# [1] 11688 18851
all.equal(sdt$SAMPID, rownames(E))
# [1] TRUE
Exercise: So, what is the average age of the donors for each tissue type?
Plotting
Instead of obtaining numerical summaries like we did earlier, plotting is also extremely useful. Thankfully, because filter()
, mutate()
, and so on return data frames, they work very well with ggplot2. I’m not going to cover this exhaustively, but let’s just create a few plots to see how well they work together.
For example, earlier we calculated the average age of donors at different levels of the Hardy scale. Let’s create a plot instead. There are two small issues: * Hardy scale is recorded as a number even though it’s really more of a category (factor) * Age is always 45, 55, etc., so let’s add some random noise to it (this is called “jittering”)
So, let’s mutate these variables before we plot them (note that we are not modifying the original data set, just creating a temporary version for plotting purposes):
|>
sdt mutate(Hardy = factor(DTHHRDY)) |>
mutate(Age = Age + runif(length(Age), -4.5, 4.5)) |>
ggplot(aes(Hardy, Age)) +
geom_boxplot()
What if, instead of Age, we wanted to plot expression? Here it gets a little tricky because expression is in a separate matrix, so we need to merge the expression data for the gene of interest in first. Let’s write a little function to do that:
<- function(gene) {
add_gene data.frame(SAMPID = rownames(E), expression = E[, gene]) |>
inner_join(sdt, by = "SAMPID")
}
Now let’s look at how expression of the gene SCN5A changes by Hardy level. Because the gene expression is highly skewed, I’m going to plot it on the log scale. Because you can’t take the log of zero, I’m going to add 1 first (another call to mutate()
:
<- filter(gdt, Symbol == "SCN5A")$ID
gene_id add_gene(gene_id) |>
mutate(Hardy = factor(DTHHRDY)) |>
mutate(expression = expression + 1) |>
ggplot(aes(Hardy, expression)) +
geom_boxplot() +
ylab("SCN5A") +
scale_y_log10()
It looks like this gene has very low expression and doesn’t vary by Hardy scale. However, I happen to know that SCN5A is a calcium channel gene mainly expressed in the heart, so let’s filter to look at just heart samples:
add_gene(gene_id) |>
filter(SMTS == "Heart") |>
mutate(Hardy = factor(DTHHRDY)) |>
mutate(expression = expression + 1) |>
ggplot(aes(Hardy, expression)) +
geom_boxplot() +
ylab("SCN5A") +
scale_y_log10()
Now a pattern has emerged: SCN5A is noticeably more highly expressed in the hearts of individuals who died suddenly compared to those who suffered long, debilitating illnesses.
Expression by tissue
OK, now let’s get back to the central biological question behind the GTEx project: how does the expression of genes vary with tissue type?
Initial attempt
Let’s say we decide to start by asking the question: what genes are the most highly expressed in the pancreas? We might think about approach the question like this:
- Subset
E
to include just pancreas samples - Take the mean of each gene in the pancreas subset
- Sort and look at the top genes
<- filter(sdt, SMTS == "Pancreas") |> pull(SAMPID)
panc_id <- E[panc_id,] |>
panc_means colMeans() |>
::enframe(name = "ID", value = "mean") tibble
The last step of calling enframe()
converts the vector into a tabular data frame (tbl_df
); this will allows us to integrate the results into our dplyr pipelines.
Now, let’s see what those top genes are:
# Top genes: Initial approach
inner_join(panc_means, gdt) |>
arrange(-mean) |>
print(n = 20)
# # A tibble: 18,851 × 4
# ID mean Symbol Description
# <chr> <dbl> <chr> <chr>
# 1 ENSG00000115386 65328. REG1A regenerating family member 1 alpha
# 2 ENSG00000137392 54362. CLPS colipase
# 3 ENSG00000142789 28048. CELA3A chymotrypsin like elastase family member 3A
# 4 ENSG00000172023 23745. REG1B regenerating family member 1 beta
# 5 ENSG00000153002 17187. CPB1 carboxypeptidase B1
# 6 ENSG00000219073 10602. CELA3B chymotrypsin like elastase family member 3B
# 7 ENSG00000142615 6052. CELA2A chymotrypsin like elastase family member 2A
# 8 ENSG00000172016 5972. REG3A regenerating family member 3 alpha
# 9 ENSG00000243480 4902. AMY2A amylase, alpha 2A (pancreatic)
# 10 ENSG00000164266 4768. SPINK1 serine peptidase inhibitor, Kazal type 1
# 11 ENSG00000162438 3742. CTRC chymotrypsin C
# 12 ENSG00000231500 2363. RPS18 ribosomal protein S18
# 13 ENSG00000143954 2129. REG3G regenerating family member 3 gamma
# 14 ENSG00000215704 1380. CELA2B chymotrypsin like elastase family member 2B
# 15 ENSG00000122406 1140. RPL5 ribosomal protein L5
# 16 ENSG00000177954 979. RPS27 ribosomal protein S27
# 17 ENSG00000163682 850. RPL9 ribosomal protein L9
# 18 ENSG00000168028 823. RPSA ribosomal protein SA
# 19 ENSG00000143947 821. RPS27A ribosomal protein S27a
# 20 ENSG00000188846 806. RPL14 ribosomal protein L14
# # ℹ 18,831 more rows
At the top of the list are the REG genes and the chymotrypsin genes; this makes sense as these genes are directly related to the function of the pancreas. However, there are also a bunch of “ribosomal protein” results. These are perhaps not so interesting – they indicate that pancreas cells are making ribosomes, but all cells do that. So perhaps these genes are high in all tissues, not just pancreas. In order to see whether this is the case, I’m going to create a function called tissue_plot
:
<- function(gene) {
tissue_plot <- gdt |> filter(ID == gene) |> pull(Symbol)
symbol add_gene(gene) |>
ggplot(aes(expression, SMTS)) +
geom_boxplot() +
xlab(symbol)
}tissue_plot("ENSG00000231500")
So the pancreas makes a lot of RPS18, but plenty of other tissues do too. Let’s see how, say, colipase compares:
# Tissue_plot: Colipase
tissue_plot("ENSG00000137392")
This seems much more relevant as a pancreas-specific gene – and in fact, the pancreas is the only organ that produces colipase. At this point, it’s worth taking a moment to think about how we should define a better metric that prioritizes genes like CLPS over genes like RPS18.
Let’s refine that approach
There are many ways to answer this question – no single right or wrong answer. However, it seems that something which will help us in our efforts is to create a matrix of the expression for each gene, for each tissue. In other words, we created a pancreas means vector above, but let’s repeat that for every tissue.
First, let’s take our earlier code and turn it into a function that can accept any tissue (as opposed to hard-coding pancreas):
<- function(tissue) {
tissue_mean <- filter(sdt, SMTS == tissue) |> pull(SAMPID)
tissue_id |> colMeans()
E[tissue_id,] }
Now, we need to call this function for every tissue and stack the results. The easiest way to do this is with sapply()
:
<- unique(sdt$SMTS) |> sapply(tissue_mean) tm
At this point, as I said, there are lots of good ways to answer this question. For example, one approach might be what I’ll call the “pancreas fraction” (the fraction of total expression that occurs in pancreas):
<- tm[, 'Pancreas'] / rowMeans(tm)
pf ::enframe(pf, name = "ID", value = "fraction") |>
tibbleinner_join(gdt, by = "ID") |>
arrange(-fraction) |>
print(n = 20)
# # A tibble: 18,851 × 4
# ID fraction Symbol Description
# <chr> <dbl> <chr> <chr>
# 1 ENSG00000137392 29.8 CLPS colipase
# 2 ENSG00000162438 29.8 CTRC chymotrypsin C
# 3 ENSG00000243480 29.8 AMY2A amylase, alpha 2A (pancreatic)
# 4 ENSG00000219073 29.7 CELA3B chymotrypsin like elastase family member 3B
# 5 ENSG00000142789 29.7 CELA3A chymotrypsin like elastase family member 3A
# 6 ENSG00000142615 29.7 CELA2A chymotrypsin like elastase family member 2A
# 7 ENSG00000153002 29.6 CPB1 carboxypeptidase B1
# 8 ENSG00000215704 29.6 CELA2B chymotrypsin like elastase family member 2B
# 9 ENSG00000204140 29.2 CLPSL1 colipase like 1
# 10 ENSG00000185176 29.1 AQP12B aquaporin 12B
# 11 ENSG00000184945 29.0 AQP12A aquaporin 12A
# 12 ENSG00000114204 28.3 SERPINI2 serpin family I member 2
# 13 ENSG00000187733 28.2 AMY1C amylase, alpha 1C (salivary)
# 14 ENSG00000237763 27.6 AMY1A amylase, alpha 1A (salivary)
# 15 ENSG00000143954 26.9 REG3G regenerating family member 3 gamma
# 16 ENSG00000115386 26.4 REG1A regenerating family member 1 alpha
# 17 ENSG00000174876 26.3 AMY1B amylase, alpha 1B (salivary)
# 18 ENSG00000172023 26.3 REG1B regenerating family member 1 beta
# 19 ENSG00000164266 25.1 SPINK1 serine peptidase inhibitor, Kazal type 1
# 20 ENSG00000115263 24.9 GCG glucagon
# # ℹ 18,831 more rows
Many of the same genes show up as before, but the ribosomal proteins (highly expressed in all tissue) have been weeded out; all of these genes seem completely reasonable.
Do you have a different idea in mind? Awesome! Try it out and let me know how it goes!