STARInsight's notebook interface lets bioinformatics researchers perform custom analyses on genomic data stored on Annai's platform. Within a notebook, you can use common computing languages like R, Python, Scala, and SQL to query genomic variant data and perform complex calculations. The operations you run are loaded into STARInsight's compute infrastructure, so you get access to the same computing resources that are available to STARInsight's apps (like pPCA or K-Means).
The documentation in this article is focused on using R in the STARInsight notebook via the SparkR interpreter. Documentation for other supported languages (like Python and Scala) is coming soon.
Begin by navigating to the notebook tab and clicking the create notebook button.
You will be prompted to specify an analysis set and a filter set. After clicking “Create” your notebook will open in a separate tab. NOTE: you may need to manually override your browser's pop-up blocker to do this.
Notebooks are divided into “paragraphs” (the equivalent, for users familiar with Jupyter notebooks, of the “cell”). The first paragraph displays a list of the sample IDs for the analysis set you selected when creating the notebook.
Before you can begin working with the data, you’ll need to run this first paragraph. Running your first paragraph loads your notebook into the system STARInsight uses to allocate computing resources. There could be some delay if other analysis jobs and notebooks are in the queue.
Doing so loads the accepted variants into a temporary table called `variants` in SparkSQL. You can see the count of accepted variants after “Long” in the screenshot below.
In the second paragraph, run a SQL query to retrieve variants and make them available to R. Note that in this example we have run the `collect` function, which brings data from the different Spark executors back to the Spark driver. This is a good way to get started if you’re not familiar with building parallelized operations, but it comes with a performance drawback: the entire result set must fit on the driver.
```
%r
variants <- collect(sql(sqlContext, "SELECT * from variants"))
```
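If the variant table is large, collecting everything at once can be slow or exhaust driver memory. A lighter-weight pattern, sketched below assuming the same `sqlContext` and `variants` table are in scope, is to push filtering into SparkSQL and collect only the rows you need. The `chromosome` column name here is illustrative, not guaranteed; substitute a real column from your table.

```r
%r
# Filter on the Spark side first, then collect only the subset.
# NOTE: assumes an active SparkR session from the notebook; the
# "chromosome" column name is hypothetical.
chr1 <- collect(sql(sqlContext,
                    "SELECT * from variants WHERE chromosome = '1'"))
nrow(chr1)  # only the filtered rows reach the driver
```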
Once you’ve made the variants available to your R instance, you can begin analyzing your data. The following genomic data fields are currently supported:
- Genotype: Note that genotype values are NOT phased.
- RS ID
- Ref: The reference allele for this position, based on HG19.
- Alt: The observed allele for this position.
- Quality Score: Present but not populated for all data projects.
- Read Depth: Present but not populated for all data projects.
- Imputed: A true/false value, depending on your selection on the Pre-Processing page.
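Once collected, `variants` is an ordinary R data frame, so base R tools are enough for a first look at these fields. The sketch below uses a toy data frame standing in for the collected table; the column names here are illustrative and may not match STARInsight's actual output.

```r
# Toy stand-in for the collected 'variants' data frame; column
# names are illustrative, not guaranteed to match STARInsight's.
toy <- data.frame(
  sample   = c("S1", "S2", "S3"),
  genotype = c("0/1", "1/1", "0/0"),  # unphased, slash-separated
  ref      = c("A", "A", "G"),
  alt      = c("T", "T", "C"),
  stringsAsFactors = FALSE
)

str(toy)                          # column names and types at a glance
gt_counts <- table(toy$genotype)  # tally genotypes across samples
gt_counts
```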
Working with Scratch Space
Your STARInsight domain has access to 20 GB of "scratch space." You can use this scratch space to store anonymized metadata, custom variant annotation data, or other -omics files like RNA-Seq data. This scratch space can also be used as a destination for outputting results files that are too large or unwieldy to be viewed in the notebook UI.
The scratch space can be accessed at `/scratch` using R directory commands like `setwd()`. Contact us for help configuring the folder structure within this directory, or to populate and remove files.
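As a sketch of the workflow, the example below stages a small file and reads it back. It uses a temporary directory so it runs anywhere; on STARInsight you would point `setwd()` at `/scratch` instead, and the file name here is made up.

```r
# Demonstrates the scratch-space pattern against a temporary
# directory; on STARInsight, substitute "/scratch".
scratch <- tempdir()
writeLines("sample\tvalue\nS1\t42",
           file.path(scratch, "demo_metadata.txt"))

setwd(scratch)
list.files(pattern = "demo")      # confirm the file is there
meta <- read.table("demo_metadata.txt", header = TRUE, sep = "\t")
meta
```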
- By default the notebook will print plots at full screen width. Resize them with commands like this:
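The exact resizing mechanism depends on the notebook backend, so treat this as a hedged sketch: one portable approach is to render the plot to an explicit graphics device with your chosen dimensions rather than relying on the default full-width output.

```r
# Render a plot at a chosen size by writing to an explicit graphics
# device. pdf() takes dimensions in inches; png() works the same
# way with pixel dimensions.
pdf("small_plot.pdf", width = 5, height = 4)
plot(1:10, (1:10)^2, type = "b", xlab = "x", ylab = "x squared")
dev.off()
file.exists("small_plot.pdf")   # the sized plot is now on disk
```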
- The STARInsight notebook outputs variant data like this... We recommend the `reshape2` package for transforming variant data into alternate table formats. A good tutorial can be found here.
- Use commands like `read.table()` to pull data out of files stored in your domain's notebook scratch space.
```
%r
setwd("/LEVOGENOMICS/SCRATCH")
file <- "clinical_metadata.txt"
t.data <- read.table(file, header = TRUE, sep = "")
master.df
```
- The notebook will output a narrow table that wraps after only a few columns when using the `head()` command. Set the width wider to view more of the columns per line.
```
%r
library("reshape2")
data <- dcast(variants, sample ~ chrpos)
options(width = 180)
head(data)
```
Known Issues with the Notebook
- Column headers will not appear when printing a data frame.
- Some common R plotting functions (like `ggbiplot`) do not work as expected. We have found `biplot` to work more reliably.
- Underscores in a data row are sometimes treated as HTML tags like `<strong>`.
- Packages loaded beyond the standard SparkR set do not always persist across paragraphs in the notebook. You may need to reimport them at the start of new paragraphs.
- Specifying a default directory in your notebook scratch space will not persist across paragraphs.
- There are issues with the `table` commands. Please use
- Commenting out multiple lines of code at once using the "COMMAND /" shortcut does not work consistently.