12 HOMEWORK

12.1 HOMEWORK

You will write a report describing the genome wide gene expression of putative proteins in your target organism. You will have a Quarto file (qmd file) at the end and upload the file on klms.kaist.ac.kr until 22nd Dec. It should include R code-chunk that can be executable. Insert figures in the qmd file.

See the following guide. These steps in the guide include minimal requirements for your reference. You can do any additional work you want. You can do it with the same organism but don’t use the same expression sample (GSE) with the following guide.

12.1.1 Preparation a qmd file

Install R, Rstudio, Quarto on your local computer (OR use posit-cloud)
Create a new quarto file
Start with the YAML head as follows

```{r}

    ---
    title: "HOMEWORK"
    eval: false
    author: Your name and student number
    date: The date you make this
    format: 
      html: 
        toc: true
        number-sections: true
        code-overflow: wrap
    ---
    
```

12.1.2 Choose a target organism

Additional requirements

Include a brief description of the organism in the quarto file

Choose a target organism you are interested in
Search for its information in NCBI genome database (https://www.ncbi.nlm.nih.gov/datasets/genome/

Click “Scientific name”

Confirm if there is any Series in GEO database and Click Reference genome ID

At the bottom, memorize Genbank id “U00096.3”

12.1.3 Download

Use rentrez to download gb file (it takes time)

library(rentrez)

recs <- entrez_fetch(db="nuccore", id="U00096.3", rettype="gbff")
write(recs, file="examples/u00096-3.gb")

12.1.4 Visualize genes with putative functions

Additional requirements

What’s the proportion of the putative proteins

Use readGenBank() function in genbankr package to load the genbank file
Use cds() function in Biostrings package to load cds information
Use pyranges function for filtering
Use ggbio for visualization

library(genbankr)
library(Biostrings)
library(BSgenome)
library(plyranges)
library(ggbio)

mygenome <- readGenBank("examples/u00096-3.gb")
mycds <- cds(mygenome)

gr1 <- granges(mycds)
gr2 <- mycds |> 
  plyranges::filter(grepl("putative", product)) |> 
  granges()
  

ggplot() + 
  layout_circle(mycds, geom = "ideo", fill = "gray70", radius = 9, trackWidth = 1) +
  layout_circle(mycds, geom = "scale", size = 3, trackWidth = 1, scale.n=20)   +
  layout_circle(gr1, geom = "rect", color = "steelblue",  radius = 5)  +
  layout_circle(gr2, geom = "rect", trackWidth = 1, scale.n=20, radius = 4)

12.1.5 Codon usage

Additional requirements

How long is the whole genome?
What is the average length of cds?

Plot for the codon usage
Use getSeq() function in BSgenome package to read whole genome sequence

myseq <- BSgenome::getSeq(mygenome)
mycdsseq <- BSgenome::getSeq(myseq, gr1)
codon_usage <- trinucleotideFrequency(mycdsseq, step=3)
global_codon_usage <- trinucleotideFrequency(mycdsseq, step=3, simplify.as="collapsed")
names(global_codon_usage) <- GENETIC_CODE[names(global_codon_usage)]
codonusage2 <- split(global_codon_usage, names(global_codon_usage))
global_codon_usage2 <- sapply(codonusage2, sum)

Use ggplot for the bar graph
Use dplyr, tibble for tidy data (load tidyverse)

library(tidyverse)

global_codon_usage2 |> 
  data.frame() |> 
  tibble::rownames_to_column() |> 
  dplyr::rename(gcu = global_codon_usage2, aa = rowname) |> 
  ggplot(aes(x=aa, y=gcu)) +
  geom_bar(stat="identity")

12.1.6 CDS length distribution

seqlen <- nchar(mycdsseq) |> 
  data.frame() |> 
  dplyr::rename(len = nchar.mycdsseq.)
seqlen |> 
  ggplot(aes(x=len)) +
  geom_density(fill="green", alpha=0.2, adjust=3) +
  scale_x_continuous(limits = c(-1000, 8000))

seqlen |> 
  summarize(m = mean(len), s = sd(len))

12.1.7 Load GEO data

Additional requirements

Include description of the data

Click Series related to the target organism

Choose any GSE with enough number of samples

Avoid to use zero feature gse dataset as below. Try to use another gse dataset which contains features

library(GEOquery)

gse <- getGEO('GSE17276', GSEMatrix = TRUE, destdir = "examples")
mygse <- gse[[1]]
class(mygse)
mygse

12.1.8 Draw boxplot

Additional requirements

Describe what is the difference between normalization and standardization
Do you need to normalize your data?

library(tidyverse)

mypdata <- pData(mygse)
myfdata <- fData(mygse)
myexp <- as.data.frame(exprs(mygse))

# boxplot
myexp |> 
  rownames_to_column() |> 
  pivot_longer(-rowname) |> 
  ggplot(aes(x=name, y=value)) +
  geom_boxplot()

12.1.9 The expression of putative proteins

Since we used the same organism for the genome info and GEO database, we can find the expression of putative genes

library(plyranges)

putative_locus_tags <- mycds |> 
  mcols() |> 
  data.frame() |> 
  dplyr::filter(grepl("putative", product)) |> 
  dplyr::slice_sample(n=100) |>  ## 100 samples for visulization
  dplyr::pull(locus_tag)

Get the expression of the putative protein genes

newexp <- myexp |> 
  bind_cols(myfdata) |> 
  dplyr::filter(ORF %in% putative_locus_tags) |> 
  dplyr::select(-ID, -GENE_SYMBOL) |> 
  dplyr::select(ORF, everything()) |> 
  tidyr::drop_na()

Merge two dataframes by accession number “GSMxxx” for the addition of group information (phenogroup)

phenogroup <- mypdata |> 
  dplyr::select(geo_accession, source_name_ch1)

newexp_t <- newexp |> 
  pivot_longer(-ORF) |> 
  pivot_wider(names_from = "ORF") 

newexp_m <- phenogroup |> 
  left_join(newexp_t, by = join_by(geo_accession == name)) |> 
  dplyr::rename(accno = geo_accession, type = source_name_ch1)

Visualization of the putative genes with respect to the phenotype

newexp_m |> 
  pivot_longer(-c(accno, type)) |> 
  group_by(type, name) |> 
  summarize(m = mean(value), s = sd(value)) |> 
  ggplot(aes(x=type, y=m, fill=type)) +
  geom_bar(stat="identity", position = "dodge") +
  facet_wrap(name ~ .) +
  scale_x_discrete(labels = NULL)