vdocs.Rmd
library(vdocs)
vdocs
creates beautiful, rigorous, and transparent data
analysis reports via “PCS lab notebooks” like the one below!
With a “PCS lab notebook”, practitioners can provide a complete
narrative of their data analysis and easily document any human judgment
calls made along the way to promote trustworthy and veridical data
science. For more on veridical data science, see Kumbier and Yu
(2020). In addition to the convenient documentation template,
vdocs
provides starter code to run a PCS-style analysis
with basic prediction and stability checks.
To get started, the easiest way to create a new PCS Lab Notebook is
to open a new Rmarkdown file from template in RStudio: go to
File > New File > R Markdown... > From Template > PCS Lab Notebook > OK
and voila! A directory, whose name was specified in the
Name
dialog box, has been created and contains the .Rmd PCS
lab notebook template.
This lab notebook template has been auto-populated with code to perform a standard PCS-style analysis, but of course, any or all of the R code can be changed. The PCS lab notebook still adds much-needed value by providing a checklist of questions that every data scientist should consider and document throughout the analysis pipeline. While responding to these questions takes time, we highly encourage every data scientist or practitioner to put in this extra effort as a step towards our greater goal of ensuring scientific reproducibility!
Let us now walk through an example usage of the PCS lab notebook using breast cancer (BRCA) data from The Cancer Genome Atlas (TCGA). For the sake of this walkthrough, the question of interest is two-fold: (1) whether we can predict the breast cancer subtype (known as the PAM50 subtype) using gene expression data and (2) which genes lead to these predictions. There are five different subtypes (Luminal A, Luminal B, HER2-enriched, Basal-like, and Normal-like) in this classification problem. More information on these PAM50 cancer subtypes can be found in The Cancer Genome Atlas Network (2012).
To begin, we must first obtain the data of interest. Here, we will
make use of the TCGAbiolinks
package, which can be
installed through Bioconductor
. In the following code
snippet to be run in the R console, we download gene expression data,
measured via RNA-Seq, for TCGA breast cancer patients along with their
PAM50 breast cancer subtype classification.
# Set query for TCGA BRCA gene expression data, measured via RNA-Seq
query <- TCGAbiolinks::GDCquery(
project = "TCGA-BRCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
legacy = TRUE
)
# Download and prepare data requested in query
TCGAbiolinks::GDCdownload(query)
data <- TCGAbiolinks::GDCprepare(query)
# Convert clinical data and assay data into data.frames
y_data <- data.frame(SummarizedExperiment::colData(data))
X_data <- data.frame(t(SummarizedExperiment::assay(data)))
# Remove samples that do not have a corresponding PAM50 subtype
keep_samples <- !is.na(y_data$paper_BRCA_Subtype_PAM50)
y <- as.factor(y_data$paper_BRCA_Subtype_PAM50[keep_samples])
X <- X_data[keep_samples, ]
# Save assay data (X) and PAM50 subtype data (y) to disk
if (!dir.exists(file.path("TCGA BRCA Example", "data"))) {
dir.create(file.path("TCGA BRCA Example", "data"), recursive = TRUE)
}
saveRDS(y, "TCGA BRCA Example/data/tcga_brca_subtypes.rds")
saveRDS(X, "TCGA BRCA Example/data/tcga_brca_array_data.rds")
As a result, the following have been saved to disk:
X
: a data frame of gene expression values, measured via
RNA-Seq, where each row corresponds to a patient and each column
corresponds to a gene.y
: a response vector with the PAM50 breast cancer
subtypes.After obtaining the data and saving it to disk, we can next input the required parameters to run the PCS lab notebook. We provide a brief description of the parameters below.
X
.
Note that this file path should be relative to the location of the
Rmarkdown file.y
. Note that
this file path should be relative to the location of the Rmarkdown
file.These parameters can be set in one of two ways:
params:
section within the yaml header of the Rmarkdown),
input the desired values in the value:
slot for each
parameter.Knit with parameters
. A
pop-up box will show up with slots to fill out for each parameter.After inputting the necessary parameters, users can modify any of the
template code to meet one’s needs. The template code can be easily
identified as they are found within R code chunks (i.e., code chunks
that begin with ```{r ...}
). While any of the R code chunks
can be modified, some code chunks that frequently require modification
are those labeled:
X
and/or y
data.caret
. This code chunk is only run when the
modeling package is set to be “caret” in step 2.h2o
. This code chunk is only run when the
modeling package is set to be “h2o” in step 2.tidymodels
. This code chunk is only run
when the modeling package is set to be “tidymodels” in step 2.For this example analysis, we choose to do some data cleaning and
reduce the number of features in the X
matrix. Originally,
the X
data consisted of ~20,000 genes, but to reduce the
computational burden of this example, we choose to keep only the top
1000 genes, ranked by highest variance. We also choose to remove
constant or duplicated columns from the data. To accomplish this, we
insert the following code snippet into the code chunk labeled
preprocess-data
.
Xtrain <- log(Xtrain + 1) %>%
removeConstantCols(verbose = 1) %>%
removeDuplicateCols(verbose = 1) %>%
filterColsByVar(max_p = 1000)
Xvalid <- log(Xvalid + 1)[, colnames(Xtrain)]
Xtest <- log(Xtest + 1)[, colnames(Xtrain)]
As for the modeling section, the PCS lab notebook, by default,
provides code to run both a random forest and xgboost fit. We will use
these default models for this example. However, these models can be
replaced, removed, or added onto. Please see the help page for
fitModels
(? vdocs::fitModels
) for more
information on how to specify these models, their parameters, and other
training options.
The final and by far the most important step is to provide a transparent narrative for the entire analysis. One of the key assets of the PCS lab notebook is that this narrative is made simple through a guided checklist of questions that every practitioner should consider and document throughout the analysis pipeline. These questions can be answered in one of two ways:
{asis, interactive_text = TRUE}
before knitting. (Check
out the Visual R
Markdown editor that was rolled out in RStudio v1.4! This visual
markdown editing mode is great for those who are less familiar with
Rmarkdown.)Note: though we list this step as last for organization and clarity, we encourage practitioners to provide this narrative concurrently with any code modifications (in step 3).