Data Bonus - "Kaplan Meier"
Learning outcomes
After having completed this chapter you will be able to:
- Load data into R
- Explore TCGA data and draw Kaplan-Meier (KM) curves
- Understand the statistical concepts behind KM curves
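As a reminder of the statistics behind a KM curve, the product-limit estimate can be computed by hand. The sketch below uses a small invented data set (the `time` and `event` vectors are made up for illustration, not TCGA data): at each observed death time it counts the patients still at risk and the deaths, then multiplies the conditional survival fractions.

```r
# Toy data (invented for illustration): follow-up time in days and event flag
time  <- c(5, 8, 8, 12, 15, 20, 22, 30)
event <- c(1, 1, 0, 1, 0, 1, 0, 1)   # 1 = death observed, 0 = censored

# Distinct times at which at least one death was observed
event_times <- sort(unique(time[event == 1]))

# At each event time: number of patients at risk (n_i) and deaths (d_i)
n_at_risk <- sapply(event_times, function(t) sum(time >= t))
n_deaths  <- sapply(event_times, function(t) sum(time == t & event == 1))

# Product-limit estimate: S(t) = product over t_i <= t of (1 - d_i / n_i)
surv <- cumprod(1 - n_deaths / n_at_risk)

km_manual <- data.frame(time = event_times, n_at_risk, n_deaths, surv)
print(km_manual)
```

Censored patients never trigger a drop in the curve, but they leave the risk set after their last follow-up, which is why the denominator shrinks between event times.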
Exercises
TCGA
Background
TCGA stands for The Cancer Genome Atlas. Choose your favourite cancer type and download some clinical and genetic information on its patients (for instance cervical cancer, CESC).
Hint
Install and load the TCGAbiolinks package to retrieve the data more easily.
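# Install once if needed: BiocManager::install("TCGAbiolinks")
library(TCGAbiolinks)
library(SummarizedExperiment)

# Query RNA-Seq gene expression for the TCGA-CESC project.
# legacy = TRUE targets the GDC legacy archive; recent TCGAbiolinks
# releases may no longer accept this argument.
query <- GDCquery(project = "TCGA-CESC",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  legacy = TRUE)

# Download the files matched by the query (platform IlluminaHiSeq_RNASeqV2)
GDCdownload(query)

# Prepare the expression table (see ?GDCprepare for the options)
?GDCprepare
CESCrnaseqSE <- GDCprepare(query, save = TRUE, save.filename = "object_tcga_cesc")
head(CESCrnaseqSE)

# Extract the raw count matrix from the SummarizedExperiment object
CESCMatrix <- assay(CESCrnaseqSE, "raw_count")

# Retrieve clinical data
clin.cesc <- GDCquery_clinic("TCGA-CESC", "clinical")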
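Before any survival analysis, the clinical table has to be turned into a follow-up time and an event indicator. The sketch below does this on a toy stand-in for `clin.cesc`: the values are invented, and the column names `vital_status`, `days_to_death` and `days_to_last_follow_up` are an assumption about what the clinical table contains, so check them with `colnames(clin.cesc)` first.

```r
# Toy stand-in for clin.cesc (values invented; column names are an assumption)
clin_toy <- data.frame(
  submitter_id           = paste0("P", 1:6),
  vital_status           = c("Dead", "Alive", "Alive", "Dead", "Alive", "Dead"),
  days_to_death          = c(320, NA, NA, 1100, NA, 45),
  days_to_last_follow_up = c(NA, 900, 1500, NA, 60, NA),
  stringsAsFactors       = FALSE
)

# First look at the table: structure and distribution of vital status
str(clin_toy)
print(table(clin_toy$vital_status, useNA = "ifany"))

# Event indicator: 1 = death observed, 0 = censored at last follow-up
clin_toy$status <- as.integer(clin_toy$vital_status == "Dead")

# Follow-up time: days to death for deceased patients, last follow-up otherwise
clin_toy$time <- ifelse(clin_toy$status == 1,
                        clin_toy$days_to_death,
                        clin_toy$days_to_last_follow_up)

print(clin_toy[, c("submitter_id", "time", "status")])
```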
Questions
- Inspect the dataset
- Choose 2 columns of quantitative measures (or convert a qualitative measure into a quantitative one) and test whether there is a significant association between those variables.
- Formulate point 2 as a biological null hypothesis, then as a statistical one.
- Still using the TCGAbiolinks package, assess whether the chosen metadata columns are associated with survival.
- TCGAbiolinks provides a framework to draw survival curves. If your data did not come from TCGA, how would you have written it yourself?
Hint
Check the source code of the function. Do you know how? Hint of the hint: type the function name in R without the parentheses.
- Bonus: Generate a signature of 10 genes (you can choose them randomly) and check whether this signature is associated with outcome.
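For data that do not come from TCGA, one possible approach is to use the `survival` package directly. The sketch below is not the TCGAbiolinks implementation: it runs on simulated data (the expression matrix, follow-up times and all object names are invented for illustration), scores a random 10-gene signature as the mean expression of its genes, splits patients at the median score, and compares the two groups with KM curves and a log-rank test.

```r
library(survival)

set.seed(1)  # reproducible toy data
n_patients <- 60
n_genes    <- 200

# Simulated expression matrix (genes in rows, patients in columns)
expr <- matrix(rnorm(n_genes * n_patients), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("P", seq_len(n_patients))))

# Simulated follow-up: time in days and event flag (1 = death, 0 = censored)
clin <- data.frame(time   = round(rexp(n_patients, rate = 1 / 800)) + 1,
                   status = rbinom(n_patients, size = 1, prob = 0.6))

# Random 10-gene signature; score = mean expression of the signature genes
signature  <- sample(rownames(expr), 10)
clin$score <- colMeans(expr[signature, ])

# Split patients at the median score into two groups
clin$group <- factor(ifelse(clin$score > median(clin$score), "high", "low"),
                     levels = c("low", "high"))

# Kaplan-Meier curves per group and log-rank test
km_fit  <- survfit(Surv(time, status) ~ group, data = clin)
logrank <- survdiff(Surv(time, status) ~ group, data = clin)
p_value <- pchisq(logrank$chisq, df = length(logrank$n) - 1, lower.tail = FALSE)

plot(km_fit, col = c("blue", "red"), xlab = "Days", ylab = "Survival probability")
legend("topright", legend = levels(clin$group), col = c("blue", "red"), lty = 1)
print(p_value)
```

Because the signature and the outcome are both random here, no association is expected. With real data, replace `expr` and `clin` by the expression matrix and the clinical table, making sure the patient identifiers are matched between the two.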