Working with big data in genomics

Author

Patrick Breheny

Download the R code

Introduction

Overview

The goal of this lab is three-fold:

  • Get practice manipulating large, complex data sets
  • Learn about genomic data

Genomic data tends to be large and complex, and dimension reduction is important, so these skills go hand-in-hand. We will be working on data from GTEx, the Genotype-Tissue Expression project.

The main idea of the GTEx project is to understand how the expression levels of genes varies depending on tissue. The DNA sequence of your genes is the same in all of your tissues, but what those tissues do with that DNA varies greatly. Muscle cells, for examples, make a lot of actin and myosin, the proteins that make up muscle cells. In biological jargon, muscle tissue expresses the actin and myosin genes at a high level. Liver cells, on the other hand, make lots of enzymes and express actin and myosin at much lower levels.

Since previous instructors have covered it already, we’ll use dplyr to working with these objects: merging, subsetting, cross-referencing them, and so on (the other widely used package is data.table). If you’re on a Windows machine, you’ll need to install curl (as well as dplyr, ggplot2, lubridate, and readr, although you might already have them installed):

# install.packages("curl")
# install.packages("readr")
library(dplyr)
library(ggplot2)
library(lubridate)

Getting the data

I prepared the following two files:

  • Expression data: Approximately 1.6 gigabytes of expression data from the GTEx project; the entire project contains hundreds of gigabytes of data, but 1.6 GB seemed about right for our purposes today. Note that the size of the .rds file, which is compressed, is only 211 Mb, but when you load it into R, it consumes 1.6 GB of RAM. You can certainly download the full set of expression data from GTEx, but it’s too large to fit into RAM on many machines. The file I am providing is a subset of that data, restricted to protein-coding genes.
  • Gene metadata: An additional file containing some metadata about the genes and what they do.

In addition, let’s download two data sets from the GTEx portal itself:

samp <- readr::read_tsv("https://storage.googleapis.com/adult-gtex/annotations/v7/metadata-files/GTEx_v7_Annotations_SampleAttributesDS.txt")
subj <- readr::read_tsv("https://storage.googleapis.com/adult-gtex/annotations/v7/metadata-files/GTEx_v7_Annotations_SubjectPhenotypesDS.txt")

How the data is organized

The bulk of the data is contained in E, an enormous matrix:

class(E)
# [1] "matrix" "array"
dim(E)
# [1] 11688 18851
E[1:10, 1:4]
#                          ENSG00000000003 ENSG00000000005 ENSG00000000419
# GTEX-1117F-0226-SM-5GZZ7               0               0               0
# GTEX-1117F-0426-SM-5EGHI               0               0               0
# GTEX-1117F-0526-SM-5EGHJ               0               0               0
# GTEX-1117F-0626-SM-5N9CS               0               0               0
# GTEX-1117F-0726-SM-5GIEN               0               0               0
# GTEX-1117F-1326-SM-5EGHH               0               0               0
# GTEX-1117F-2226-SM-5N9CH               0               0               0
# GTEX-1117F-2426-SM-5EGGH               0               0               0
# GTEX-1117F-2526-SM-5GZY6               0               0               0
# GTEX-1117F-2826-SM-5GZXL               0               0               0
#                          ENSG00000000457
# GTEX-1117F-0226-SM-5GZZ7           7.584
# GTEX-1117F-0426-SM-5EGHI           1.850
# GTEX-1117F-0526-SM-5EGHJ           5.418
# GTEX-1117F-0626-SM-5N9CS           7.931
# GTEX-1117F-0726-SM-5GIEN           2.327
# GTEX-1117F-1326-SM-5EGHH           2.890
# GTEX-1117F-2226-SM-5N9CH           6.386
# GTEX-1117F-2426-SM-5EGGH           9.010
# GTEX-1117F-2526-SM-5GZY6          11.380
# GTEX-1117F-2826-SM-5GZXL           7.251

Each row of E corresponds to a sample; each column corresponds to a gene. For example, the above tells us that in sample GTEX-1117F-0226-SM-5GZZ7, the expression level of gene ENSG00000000457 was 7.584. What is gene ENSG00000000457? More information on the genes is provided in the object gdt:

class(gdt)
# [1] "data.table" "data.frame"
dim(gdt)
# [1] 18851     3
gdt[1:10,]
#                 ID   Symbol
# 1  ENSG00000000003   TSPAN6
# 2  ENSG00000000005     TNMD
# 3  ENSG00000000419     DPM1
# 4  ENSG00000000457    SCYL3
# 5  ENSG00000000460 C1orf112
# 6  ENSG00000000938      FGR
# 7  ENSG00000000971      CFH
# 8  ENSG00000001036    FUCA2
# 9  ENSG00000001084     GCLC
# 10 ENSG00000001167     NFYA
#                                                    Description
# 1                                                tetraspanin 6
# 2                                                  tenomodulin
# 3  dolichyl-phosphate mannosyltransferase subunit 1, catalytic
# 4                                     SCY1 like pseudokinase 3
# 5                          chromosome 1 open reading frame 112
# 6               FGR proto-oncogene, Src family tyrosine kinase
# 7                                          complement factor H
# 8                                         alpha-L-fucosidase 2
# 9                  glutamate-cysteine ligase catalytic subunit
# 10                nuclear transcription factor Y subunit alpha

How do I learn more about sample GTEX-1117F-0226-SM-5GZZ7? For example, what tissue did it come from? That information is contained in the object samp:

class(samp)
# [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
dim(samp)
# [1] 15598    63
samp[1:10, c("SAMPID", "SMTS", "SMTSD", "SMGEBTCHD")]
# # A tibble: 10 × 4
#    SAMPID                   SMTS           SMTSD                       SMGEBTCHD
#    <chr>                    <chr>          <chr>                       <chr>    
#  1 GTEX-1117F-0003-SM-58Q7G Blood          Whole Blood                 01/19/20…
#  2 GTEX-1117F-0003-SM-5DWSB Blood          Whole Blood                 01/28/20…
#  3 GTEX-1117F-0003-SM-6WBT7 Blood          Whole Blood                 8/11/2014
#  4 GTEX-1117F-0226-SM-5GZZ7 Adipose Tissue Adipose - Subcutaneous      03/05/20…
#  5 GTEX-1117F-0426-SM-5EGHI Muscle         Muscle - Skeletal           02/08/20…
#  6 GTEX-1117F-0526-SM-5EGHJ Blood Vessel   Artery - Tibial             02/08/20…
#  7 GTEX-1117F-0626-SM-5N9CS Blood Vessel   Artery - Coronary           03/22/20…
#  8 GTEX-1117F-0726-SM-5GIEN Heart          Heart - Atrial Appendage    02/27/20…
#  9 GTEX-1117F-1326-SM-5EGHH Adipose Tissue Adipose - Visceral (Omentu… 02/08/20…
# 10 GTEX-1117F-2226-SM-5N9CH Ovary          Ovary                       03/22/20…

Finally, there is also information about the subjects that these samples came from; that information is contained in the object subj:

class(subj)
# [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
dim(subj)
# [1] 752   4
subj
# # A tibble: 752 × 4
#    SUBJID       SEX AGE   DTHHRDY
#    <chr>      <dbl> <chr>   <dbl>
#  1 GTEX-1117F     2 60-69       4
#  2 GTEX-111CU     1 50-59       0
#  3 GTEX-111FC     1 60-69       1
#  4 GTEX-111VG     1 60-69       3
#  5 GTEX-111YS     1 60-69       0
#  6 GTEX-1122O     2 60-69       0
#  7 GTEX-1128S     2 60-69       2
#  8 GTEX-113IC     1 60-69      NA
#  9 GTEX-113JC     2 50-59       2
# 10 GTEX-117XS     1 60-69       2
# # ℹ 742 more rows

There is much more subject-level data that was collected on these subjects, such as manner of death, race, exact age, etc. However, these are the only variables that can be publicly shared.

It is difficult/impossible to determine what many of the variables in samp/subj mean just from the variable name. I’ll explain all the ones that I refer to in the lab, but if you’re curious about what other ones mean, consult the data dictionaries:

Primary operations

To start, we’re just going to briefly review some of the main dplyr operations. One of the most important is mutate(), which creates new variables (or modifies existing ones). In particular, let’s create new columns called Date in the samp data table, formatted as a date variable (the original is just coded as a character string) and Age in the subj data table (the original is a string that represents a range, which we are replacing my its midpoint):

# Create Age and Date variables
samp <- mutate(samp, Date = as.Date(SMGEBTCHD, format = "%m/%d/%Y"))
subj <- mutate(subj, Age = as.numeric(substr(AGE, 1, 2)) + 5)

Now, let’s create a smaller data set from samp just for illustration purposes:

dat <- samp |> select(Date, SMTS, SMTSD, SMMAPRT)

Subsetting

One extremely common task is to extract subsets (filter) the data. For example, to find the subset of samples with mapping rates below 60%, we could write:

# Samples with map rate under 60%
filter(dat, SMMAPRT < 0.6)
# # A tibble: 242 × 4
#    Date       SMTS    SMTSD                                    SMMAPRT
#    <date>     <chr>   <chr>                                      <dbl>
#  1 2012-04-02 Brain   Brain - Anterior cingulate cortex (BA24)   0.150
#  2 2012-04-02 Brain   Brain - Hypothalamus                       0.598
#  3 2012-06-09 Brain   Brain - Cortex                             0.114
#  4 2012-04-02 Brain   Brain - Substantia nigra                   0.486
#  5 2012-04-02 Brain   Brain - Caudate (basal ganglia)            0.426
#  6 2012-05-27 Brain   Brain - Spinal cord (cervical c-1)         0.264
#  7 2012-04-02 Brain   Brain - Hippocampus                        0.329
#  8 2012-06-09 Skin    Skin - Sun Exposed (Lower leg)             0.365
#  9 2012-04-02 Thyroid Thyroid                                    0.573
# 10 2012-04-02 Blood   Whole Blood                                0.460
# # ℹ 232 more rows

To get summaries, we use summarize():

# Average map rate
summarize(dat, avg_mr = mean(SMMAPRT, na.rm = TRUE))
# # A tibble: 1 × 1
#   avg_mr
#    <dbl>
# 1  0.967

The power of tools like dplyr (and data.table and SQL, which have similar functions) becomes more clear when you combine subsetting and summarization, tying them together with pipes:

# Combine subset with calculation
dat |> filter(Date < "2012-02-17") |>
  summarize(avg_mr = mean(SMMAPRT, na.rm = TRUE))
# # A tibble: 1 × 1
#   avg_mr
#    <dbl>
# 1  0.845
dat |> filter(Date >= "2012-02-17") |>
  summarize(avg_mr = mean(SMMAPRT, na.rm = TRUE))
# # A tibble: 1 × 1
#   avg_mr
#    <dbl>
# 1  0.968

Exercise: What is the average age of donors who died suddenly (DTHHRDY=1)? After long illnesses (DTHHRDY=4)?


Aggregations

One final extremely useful operation is Finally, the by option in data.table lets us loop these operations over groups of samples. For example, we can look at average mapping rate by year:

# Map rate by year
dat |>
  group_by(year(Date)) |>
  summarize(
    n = n(),
    avg_mr = mean(SMMAPRT, na.rm = TRUE)
  )
# # A tibble: 6 × 3
#   `year(Date)`     n avg_mr
#          <dbl> <int>  <dbl>
# 1         2011   230  0.835
# 2         2012  2743  0.834
# 3         2013  3126  0.988
# 4         2014  8746  0.989
# 5         2015   695  0.988
# 6           NA    58  0.991

Or average mapping rate by tissue type:

dat |>
  group_by(SMTS) |>
  summarize(
    n = n(),
    avg_mr = mean(SMMAPRT, na.rm = TRUE)
  )
# # A tibble: 31 × 3
#    SMTS               n avg_mr
#    <chr>          <int>  <dbl>
#  1 Adipose Tissue   886  0.964
#  2 Adrenal Gland    205  0.989
#  3 Bladder           11  0.989
#  4 Blood           2561  0.926
#  5 Blood Vessel    1041  0.967
#  6 Bone Marrow      160  0.953
#  7 Brain           2076  0.959
#  8 Breast           306  0.982
#  9 Cervix Uteri      11  0.987
# 10 Colon            539  0.989
# # ℹ 21 more rows

In the above, note that n() is a special function that counts the number of rows in each subset of the data we are doing our calculations by. So, for example, we now know that there were 160 bone marrow samples, and that they had an average map rate of 95.3%.

We can put all of these things together to, for example, calculate the average map rate for each subtype of skin sample:

# Combine subset, calculation, and grouping
dat |>
  filter(SMTS == "Skin") |>
  group_by(SMTSD) |>
  summarize(
    n = n(),
    avg_mr = mean(SMMAPRT, na.rm = TRUE)
  )
# # A tibble: 3 × 3
#   SMTSD                                   n avg_mr
#   <chr>                               <int>  <dbl>
# 1 Cells - Transformed fibroblasts       362  0.991
# 2 Skin - Not Sun Exposed (Suprapubic)   408  0.981
# 3 Skin - Sun Exposed (Lower leg)        592  0.947

We can continue to add steps to our pipeline. For example, maybe we want to see all the sample types, and have them ordered:

dat |>
  group_by(SMTS) |>
  summarize(
    n = n(),
    avg_mr = mean(SMMAPRT, na.rm = TRUE)
  ) |>
  arrange(avg_mr) |>
  print(n = Inf)
# # A tibble: 31 × 3
#    SMTS                n avg_mr
#    <chr>           <int>  <dbl>
#  1 Blood            2561  0.926
#  2 Nerve             498  0.943
#  3 Lung              607  0.945
#  4 Thyroid           564  0.945
#  5 Muscle            718  0.952
#  6 Bone Marrow       160  0.953
#  7 Brain            2076  0.959
#  8 Heart             727  0.962
#  9 Adipose Tissue    886  0.964
# 10 Blood Vessel     1041  0.967
# 11 Skin             1362  0.970
# 12 Breast            306  0.982
# 13 Pituitary         192  0.984
# 14 Prostate          159  0.984
# 15 Stomach           272  0.985
# 16 Pancreas          268  0.985
# 17 Spleen            169  0.987
# 18 Cervix Uteri       11  0.987
# 19 Uterus            117  0.987
# 20 Salivary Gland    104  0.987
# 21 Testis            284  0.988
# 22 Kidney             50  0.988
# 23 Vagina            124  0.988
# 24 Esophagus        1111  0.988
# 25 Ovary             138  0.988
# 26 Liver             188  0.988
# 27 Small Intestine   143  0.988
# 28 Colon             539  0.989
# 29 Fallopian Tube      7  0.989
# 30 Bladder            11  0.989
# 31 Adrenal Gland     205  0.989

Exercise: What are the most and least common subtypes of brain samples? Produce an ordered list.


Merging

Suppose we wanted to know something like the average age of the donors for each tissue type. We cannot obtain this information directly (yet), because the age information is in the subj data table, while the tissue type information is in the samp data table. To answer this kind of question, we will have to merge the two data sets.

In order to merge two data sets, we need to have a key: a column that appears in both data tables and that tells us which rows to combine. In our case, we don’t exactly have such a column, but we can clearly create a subject ID from the sample ID by cutting off everything after the second dash:

# Creating SUBJID column
samp <- mutate(samp, SUBJID = gsub('([^-]*)-([^-]*)-.*', '\\1-\\2', SAMPID))

This involves regular expressions; if you don’t know what they are, don’t worry about it. Once we’ve created the column, we can merge the two data sets using inner_join():

# Merging samp and subj
sdt <- inner_join(samp, subj, by = "SUBJID")

We have just combined the sample information and subject information into a single data set which I’m calling sdt for “samples data table”. To get a better understanding for what just happened, let’s look here:

# Let's see what we did
dim(samp)
# [1] 15598    65
dim(subj)
# [1] 752   5
dim(sdt)
# [1] 15598    69
samp |> filter(SUBJID == "GTEX-WWTW") |> select(SUBJID, SAMPID, SMTS, SMTSD)
# # A tibble: 4 × 4
#   SUBJID    SAMPID                  SMTS           SMTSD                        
#   <chr>     <chr>                   <chr>          <chr>                        
# 1 GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH Blood          Cells - EBV-transformed lymp…
# 2 GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE Skin           Cells - Transformed fibrobla…
# 3 GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS Muscle         Muscle - Skeletal            
# 4 GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue Adipose - Visceral (Omentum)
subj |> filter(SUBJID == "GTEX-WWTW") |> select(SUBJID, SEX, AGE, DTHHRDY)
# # A tibble: 1 × 4
#   SUBJID      SEX AGE   DTHHRDY
#   <chr>     <dbl> <chr>   <dbl>
# 1 GTEX-WWTW     1 50-59       0
sdt |> filter(SUBJID == "GTEX-WWTW") |> select(SUBJID, SAMPID, SMTS, SEX, AGE, DTHHRDY)
# # A tibble: 4 × 6
#   SUBJID    SAMPID                  SMTS             SEX AGE   DTHHRDY
#   <chr>     <chr>                   <chr>          <dbl> <chr>   <dbl>
# 1 GTEX-WWTW GTEX-WWTW-0002-SM-4MVNH Blood              1 50-59       0
# 2 GTEX-WWTW GTEX-WWTW-0008-SM-4MVPE Skin               1 50-59       0
# 3 GTEX-WWTW GTEX-WWTW-0626-SM-4MVNS Muscle             1 50-59       0
# 4 GTEX-WWTW GTEX-WWTW-1326-SM-4MVNT Adipose Tissue     1 50-59       0

This is sometimes called a “one-to-many” merge: a single entry in subj is being matched and merged with many entries in samp. But each of the GTEX-WWTW samples is coming from the same person, so they are all coming from a man in his fifties who died after being on a ventilator. There are more complicated issues that come up in merging, like what to do when records can’t be matched, but none of those issues are present in this example, thankfully.

One issue that does come up, however, is the fact that there are subjects in sdt that we do not have any expression data on. So let’s amend our earlier assignment to also subset by samples that are in the expression matrix:

# Merging: revised
sdt <- inner_join(samp, subj, by = "SUBJID") |>
  filter(SAMPID %in% rownames(E))

Note that sdt matches up with E now:

dim(sdt)
# [1] 11688    69
dim(E)
# [1] 11688 18851
all.equal(sdt$SAMPID, rownames(E))
# [1] TRUE

Exercise: So, what is the average age of the donors for each tissue type?


Plotting

Instead of obtaining numerical summaries like we did earlier, plotting is also extremely useful. Thankfully, because filter(), mutate(), and so on return data frames, they work very well with ggplot2. I’m not going to cover this exhaustively, but let’s just create a few plots to see how well they work together.

For example, earlier we calculated the average age of donors at different levels of the Hardy scale. Let’s create a plot instead. There are two small issues: * Hardy scale is recorded as a number even though it’s really more of a category (factor) * Age is always 45, 55, etc., so let’s add some random noise to it (this is called “jittering”)

So, let’s mutate these variables before we plot them (note that we are not modifying the original data set, just creating a temporary version for plotting purposes):

sdt |>
  mutate(Hardy = factor(DTHHRDY)) |>
  mutate(Age = Age + runif(length(Age), -4.5, 4.5)) |>
  ggplot(aes(Hardy, Age)) +
  geom_boxplot()

What if, instead of Age, we wanted to plot expression? Here it gets a little tricky because expression is in a separate matrix, so we need to merge the expression data for the gene of interest in first. Let’s write a little function to do that:

add_gene <- function(gene) {
  data.frame(SAMPID = rownames(E), expression = E[, gene]) |>
    inner_join(sdt, by = "SAMPID")
}

Now let’s look at how expression of the gene SCN5A changes by Hardy level. Because the gene expression is highly skewed, I’m going to plot it on the log scale. Because you can’t take the log of zero, I’m going to add 1 first (another call to mutate():

gene_id <- filter(gdt, Symbol == "SCN5A")$ID
add_gene(gene_id) |>
  mutate(Hardy = factor(DTHHRDY)) |>
  mutate(expression = expression + 1) |>
  ggplot(aes(Hardy, expression)) +
  geom_boxplot() +
  ylab("SCN5A") +
  scale_y_log10()

It looks like this gene has very low expression and doesn’t vary by Hardy scale. However, I happen to know that SCN5A is a calcium channel gene mainly expressed in the heart, so let’s filter to look at just heart samples:

add_gene(gene_id) |>
  filter(SMTS == "Heart") |>
  mutate(Hardy = factor(DTHHRDY)) |>
  mutate(expression = expression + 1) |>
  ggplot(aes(Hardy, expression)) +
  geom_boxplot() +
  ylab("SCN5A") +
  scale_y_log10()

Now a pattern has emerged: SCN5A is noticeably more highly expressed in the hearts of individuals who died suddenly compared to those who suffered long, debilitating illnesses.

Expression by tissue

OK, now let’s get back to the central biological question behind the GTEx project: how does the expression of genes vary with tissue type?

Initial attempt

Let’s say we decide to start by asking the question: what genes are the most highly expressed in the pancreas? We might think about approach the question like this:

  1. Subset E to include just pancreas samples
  2. Take the mean of each gene in the pancreas subset
  3. Sort and look at the top genes
panc_id <- filter(sdt, SMTS == "Pancreas") |> pull(SAMPID)
panc_means <- E[panc_id,] |>
  colMeans() |>
  tibble::enframe(name = "ID", value = "mean")

The last step of calling enframe() converts the vector into a tabular data frame (tbl_df); this will allows us to integrate the results into our dplyr pipelines.

Now, let’s see what those top genes are:

# Top genes: Initial approach
inner_join(panc_means, gdt) |>
  arrange(-mean) |>
  print(n = 20)
# # A tibble: 18,851 × 4
#    ID                mean Symbol Description                                
#    <chr>            <dbl> <chr>  <chr>                                      
#  1 ENSG00000115386 65328. REG1A  regenerating family member 1 alpha         
#  2 ENSG00000137392 54362. CLPS   colipase                                   
#  3 ENSG00000142789 28048. CELA3A chymotrypsin like elastase family member 3A
#  4 ENSG00000172023 23745. REG1B  regenerating family member 1 beta          
#  5 ENSG00000153002 17187. CPB1   carboxypeptidase B1                        
#  6 ENSG00000219073 10602. CELA3B chymotrypsin like elastase family member 3B
#  7 ENSG00000142615  6052. CELA2A chymotrypsin like elastase family member 2A
#  8 ENSG00000172016  5972. REG3A  regenerating family member 3 alpha         
#  9 ENSG00000243480  4902. AMY2A  amylase, alpha 2A (pancreatic)             
# 10 ENSG00000164266  4768. SPINK1 serine peptidase inhibitor, Kazal type 1   
# 11 ENSG00000162438  3742. CTRC   chymotrypsin C                             
# 12 ENSG00000231500  2363. RPS18  ribosomal protein S18                      
# 13 ENSG00000143954  2129. REG3G  regenerating family member 3 gamma         
# 14 ENSG00000215704  1380. CELA2B chymotrypsin like elastase family member 2B
# 15 ENSG00000122406  1140. RPL5   ribosomal protein L5                       
# 16 ENSG00000177954   979. RPS27  ribosomal protein S27                      
# 17 ENSG00000163682   850. RPL9   ribosomal protein L9                       
# 18 ENSG00000168028   823. RPSA   ribosomal protein SA                       
# 19 ENSG00000143947   821. RPS27A ribosomal protein S27a                     
# 20 ENSG00000188846   806. RPL14  ribosomal protein L14                      
# # ℹ 18,831 more rows

At the top of the list are the REG genes and the chymotrypsin genes; this makes sense as these genes are directly related to the function of the pancreas. However, there are also a bunch of “ribosomal protein” results. These are perhaps not so interesting – they indicate that pancreas cells are making ribosomes, but all cells do that. So perhaps these genes are high in all tissues, not just pancreas. In order to see whether this is the case, I’m going to create a function called tissue_plot:

tissue_plot <- function(gene) {
  symbol <- gdt |> filter(ID == gene) |> pull(Symbol)
  add_gene(gene) |>
    ggplot(aes(expression, SMTS)) + 
    geom_boxplot() +
    xlab(symbol)
}
tissue_plot("ENSG00000231500")

So the pancreas makes a lot of RPS18, but plenty of other tissues do too. Let’s see how, say, colipase compares:

# Tissue_plot: Colipase
tissue_plot("ENSG00000137392")

This seems much more relevant as a pancreas-specific gene – and in fact, the pancreas is the only organ that produces colipase. At this point, it’s worth taking a moment to think about how we should define a better metric that prioritizes genes like CLPS over genes like RPS18.

Let’s refine that approach

There are many ways to answer this question – no single right or wrong answer. However, it seems that something which will help us in our efforts is to create a matrix of the expression for each gene, for each tissue. In other words, we created a pancreas means vector above, but let’s repeat that for every tissue.

First, let’s take our earlier code and turn it into a function that can accept any tissue (as opposed to hard-coding pancreas):

tissue_mean <- function(tissue) {
  tissue_id <- filter(sdt, SMTS == tissue) |> pull(SAMPID)
  E[tissue_id,] |> colMeans()
}

Now, we need to call this function for every tissue and stack the results. The easiest way to do this is with sapply():

tm <- unique(sdt$SMTS) |> sapply(tissue_mean)

At this point, as I said, there are lots of good ways to answer this question. For example, one approach might be what I’ll call the “pancreas fraction” (the fraction of total expression that occurs in pancreas):

pf <- tm[, 'Pancreas'] / rowMeans(tm)
tibble::enframe(pf, name = "ID", value = "fraction") |>
  inner_join(gdt, by = "ID") |>
  arrange(-fraction) |>
  print(n = 20)
# # A tibble: 18,851 × 4
#    ID              fraction Symbol   Description                                
#    <chr>              <dbl> <chr>    <chr>                                      
#  1 ENSG00000137392     29.8 CLPS     colipase                                   
#  2 ENSG00000162438     29.8 CTRC     chymotrypsin C                             
#  3 ENSG00000243480     29.8 AMY2A    amylase, alpha 2A (pancreatic)             
#  4 ENSG00000219073     29.7 CELA3B   chymotrypsin like elastase family member 3B
#  5 ENSG00000142789     29.7 CELA3A   chymotrypsin like elastase family member 3A
#  6 ENSG00000142615     29.7 CELA2A   chymotrypsin like elastase family member 2A
#  7 ENSG00000153002     29.6 CPB1     carboxypeptidase B1                        
#  8 ENSG00000215704     29.6 CELA2B   chymotrypsin like elastase family member 2B
#  9 ENSG00000204140     29.2 CLPSL1   colipase like 1                            
# 10 ENSG00000185176     29.1 AQP12B   aquaporin 12B                              
# 11 ENSG00000184945     29.0 AQP12A   aquaporin 12A                              
# 12 ENSG00000114204     28.3 SERPINI2 serpin family I member 2                   
# 13 ENSG00000187733     28.2 AMY1C    amylase, alpha 1C (salivary)               
# 14 ENSG00000237763     27.6 AMY1A    amylase, alpha 1A (salivary)               
# 15 ENSG00000143954     26.9 REG3G    regenerating family member 3 gamma         
# 16 ENSG00000115386     26.4 REG1A    regenerating family member 1 alpha         
# 17 ENSG00000174876     26.3 AMY1B    amylase, alpha 1B (salivary)               
# 18 ENSG00000172023     26.3 REG1B    regenerating family member 1 beta          
# 19 ENSG00000164266     25.1 SPINK1   serine peptidase inhibitor, Kazal type 1   
# 20 ENSG00000115263     24.9 GCG      glucagon                                   
# # ℹ 18,831 more rows

Many of the same genes show up as before, but the ribosomal proteins (highly expressed in all tissue) have been weeded out; all of these genes seem completely reasonable.

Do you have a different idea in mind? Awesome! Try it out and let me know how it goes!