The Pre-Processing page lets you configure a set of rules that filter out variants or samples you do not want to use in a STARInsight app such as pPCA, or bring into the notebook interface.
Start by heading to the Pre-Processing tab in the Web UI.
Click the "Create Filter Set" button on the left side of the screen.
Once you've given your filter set a name and quick description, you can start assigning rules.
Two types of pre-processing rules are available: imputation rules and filter widgets.
Imputation rules allow you to assign a homozygous reference value for variant positions that have not been explicitly ingested into STARInsight. Imputation rules can be toggled at the project level. In the example below, STARInsight will assign 0|0 values for unrecorded positions in "Clinical Trial A" but not "ICGC."
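The effect of a project-level imputation rule can be sketched in a few lines of Python. This is only an illustration of the logic described above, not STARInsight code: the `impute_hom_ref` function, the dictionary layout, and the sample names are all hypothetical.

```python
# Hypothetical in-memory model; none of these names are STARInsight APIs.
HOM_REF = "0|0"


def impute_hom_ref(calls, positions, impute_projects):
    """Return a copy of `calls` where unrecorded positions are filled with
    0|0, but only for samples whose project has imputation toggled on.

    calls: {sample: {"project": str, "genotypes": {position: genotype}}}
    """
    result = {}
    for sample, record in calls.items():
        genotypes = dict(record["genotypes"])  # copy so the input is untouched
        if record["project"] in impute_projects:
            for pos in positions:
                # Only fill positions that were never recorded.
                genotypes.setdefault(pos, HOM_REF)
        result[sample] = {"project": record["project"], "genotypes": genotypes}
    return result


# Made-up data: imputation is on for "Clinical Trial A" and off for "ICGC".
calls = {
    "S1": {"project": "Clinical Trial A", "genotypes": {"1:100": "0|1"}},
    "S2": {"project": "ICGC", "genotypes": {"1:100": "1|1"}},
}
imputed = impute_hom_ref(calls, ["1:100", "1:200"], {"Clinical Trial A"})
print(imputed["S1"]["genotypes"])  # 1:200 is filled with 0|0
print(imputed["S2"]["genotypes"])  # 1:200 stays unrecorded
```

Recorded genotypes are never overwritten in this sketch; only missing positions receive the homozygous reference value.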
Filter widgets are applied after imputation rules. You can build complex combinations of filter widgets by dragging them into the "Include" and "Exclude" dropzones on the Pre-Processing page. The basic rule to remember is that filter widgets in the "Include" column are combined with the AND operator, while filter widgets in the "Exclude" column are combined with the AND NOT operator.
The query in the example screenshot below would be: "Variants between positions 1:1 and 1:999 AND Read Depth greater than 40 AND NOT positions between 1:400 and 1:599."
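If it helps to see the Include/Exclude logic written out, the same query can be expressed as a small Python sketch. The `Variant` class and the helper functions are hypothetical and exist only to illustrate how AND and AND NOT combine; they are not part of STARInsight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    depth: int


def in_range(chrom, start, end):
    """Rule: variant lies on `chrom` between start and end (inclusive)."""
    return lambda v: v.chrom == chrom and start <= v.pos <= end


def depth_greater_than(threshold):
    """Rule: read depth is strictly greater than `threshold`."""
    return lambda v: v.depth > threshold


def passes(variant, include, exclude):
    """Include rules are ANDed together; each exclude rule is AND NOT."""
    return all(rule(variant) for rule in include) and not any(
        rule(variant) for rule in exclude
    )


include = [in_range("1", 1, 999), depth_greater_than(40)]
exclude = [in_range("1", 400, 599)]

variants = [
    Variant("1", 100, 50),   # kept: in range, depth > 40, not excluded
    Variant("1", 450, 50),   # dropped: falls in the excluded range
    Variant("1", 700, 30),   # dropped: read depth too low
    Variant("1", 1200, 60),  # dropped: outside the included range
]
kept = [v.pos for v in variants if passes(v, include, exclude)]
print(kept)
```

Only the variant at position 100 survives, because it satisfies both Include rules and matches no Exclude rule.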
When combining several widgets into a single filter set, you can optimize performance by listing the least specific filter widget at the top of the column. For example, if you were working with high-quality data where most positions had a read depth of 50, you would apply your filter widgets in this order.
The list below explains the function of each of the filter widgets.
This widget prompts you to supply a BED file containing a list of regions of interest. The within vs. outside radio button can be used to more finely target your filter set. For example, imagine you uploaded a single BED file that targeted positions 1:200 to 1:300. Selecting the "Within these ranges" radio button would filter your data down to only positions 1:200 to 1:300. Selecting the "Outside these ranges" button would do the opposite: returning every position except positions 1:200 to 1:300.
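The within/outside behavior can be sketched as follows. This is an illustration only: `parse_bed` and `filter_by_regions` are hypothetical helpers, and the sketch assumes the standard BED convention (0-based start, end not included), so positions 1:200 to 1:300 are written as `1 199 300`. How STARInsight itself interprets BED coordinates is not covered here.

```python
def parse_bed(text):
    """Parse the first three BED columns into (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def filter_by_regions(positions, regions, mode="within"):
    """Keep 1-based (chrom, pos) pairs within or outside the BED regions."""
    def covered(chrom, pos):
        # BED is 0-based and half-open, so 1-based pos p is inside a
        # region when start < p <= end.
        return any(c == chrom and s < pos <= e for c, s, e in regions)

    if mode == "within":
        return [p for p in positions if covered(*p)]
    if mode == "outside":
        return [p for p in positions if not covered(*p)]
    raise ValueError("mode must be 'within' or 'outside'")


regions = parse_bed("1\t199\t300\n")  # covers positions 1:200 to 1:300
positions = [("1", 150), ("1", 200), ("1", 300), ("1", 301)]

within = filter_by_regions(positions, regions, "within")
outside = filter_by_regions(positions, regions, "outside")
print(within)
print(outside)
```

The two modes are exact complements: every position lands in one result or the other, never both.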
Chromosome / Position
This widget is available as an alternative to supplying a BED file. Entering the following values would create a rule to include only variants on Chromosome 1 between positions 500 and 1000.
This widget filters out any variants that do not match the quality score criteria you provide. Quality scores are typically included in the Variant Call Files that you provide when first uploading your genomic data to STARInsight. Entering the following value would filter out any variants with quality scores lower than 40.
This widget filters out any variants that do not match the read depth criteria you provide. Read depth values are typically included in the Variant Call Files that you provide when first uploading your genomic data to STARInsight. Entering the following value would filter out any variants with read depths lower than 30.
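The quality score and read depth widgets both behave like simple threshold filters. A minimal sketch, assuming each variant is a dictionary with hypothetical `qual` and `dp` keys (named after the QUAL and DP fields commonly found in Variant Call Files), might look like this:

```python
def keep_at_least(variants, field, minimum):
    """Keep variants whose `field` value is at least `minimum`.

    Assumes every variant carries a value for `field`; how missing
    values should be handled is not covered by this sketch.
    """
    return [v for v in variants if v[field] >= minimum]


# Made-up variants for illustration.
variants = [
    {"id": "1:500", "qual": 55.0, "dp": 28},
    {"id": "1:750", "qual": 38.5, "dp": 45},
    {"id": "1:900", "qual": 60.0, "dp": 31},
]

# Filter out quality scores lower than 40, then read depths lower than 30.
passed_quality = keep_at_least(variants, "qual", 40)
passed_both = keep_at_least(passed_quality, "dp", 30)
print([v["id"] for v in passed_both])
```

Here 1:750 fails the quality threshold and 1:500 fails the read depth threshold, so only 1:900 remains.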
This widget excludes variant positions which are unrecorded for a large number of the samples in an analysis set. This widget is an alternative to using the project-based imputation rules discussed earlier in this article. Entering the following value would filter out any variant positions which are unrecorded for more than 25% of samples in an analysis set.
That means the widget would exclude position 1:1001 below but not position 1:1002.
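The 25% rule can be sketched in Python as follows. The data is made up for illustration and `drop_sparse_positions` is a hypothetical helper, not a STARInsight function. An unrecorded genotype is represented as `None`.

```python
def drop_sparse_positions(calls, max_missing_fraction):
    """Drop positions unrecorded in MORE than the given fraction of samples.

    calls: {position: {sample: genotype or None}}, with at least one
    sample per position.
    """
    kept = {}
    for pos, by_sample in calls.items():
        missing = sum(1 for gt in by_sample.values() if gt is None)
        if missing / len(by_sample) <= max_missing_fraction:
            kept[pos] = by_sample
    return kept


# Illustrative analysis set with four samples.
calls = {
    # Unrecorded in 2 of 4 samples (50%): excluded.
    "1:1001": {"S1": "0|1", "S2": None, "S3": None, "S4": "1|1"},
    # Unrecorded in 1 of 4 samples (25%): kept, since 25% is not more than 25%.
    "1:1002": {"S1": "0|1", "S2": "0|0", "S3": None, "S4": "1|1"},
}
kept = drop_sparse_positions(calls, 0.25)
print(sorted(kept))
```

Note that the threshold is exclusive in this sketch: a position is dropped only when its unrecorded fraction is strictly greater than the value entered.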
This widget excludes samples that contain more than a certain proportion of unrecorded positions. It is an alternative to using the project-based imputation rules discussed earlier in this article. Entering the following value would filter out any sample where more than 50% of variant positions were unrecorded.
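The sample-level version of the same idea can be sketched like this. As before, the data and the `drop_sparse_samples` helper are hypothetical and only illustrate the rule, with `None` standing in for an unrecorded position.

```python
def drop_sparse_samples(calls, max_missing_fraction):
    """Drop samples where MORE than the given fraction of positions
    is unrecorded.

    calls: {sample: {position: genotype or None}}, with at least one
    position per sample.
    """
    kept = {}
    for sample, by_position in calls.items():
        missing = sum(1 for gt in by_position.values() if gt is None)
        if missing / len(by_position) <= max_missing_fraction:
            kept[sample] = by_position
    return kept


# Illustrative data covering four variant positions.
calls = {
    # 3 of 4 positions unrecorded (75%): excluded.
    "S1": {"1:10": None, "1:20": None, "1:30": None, "1:40": "0|1"},
    # 2 of 4 positions unrecorded (50%): kept, since 50% is not more than 50%.
    "S2": {"1:10": "0|0", "1:20": None, "1:30": None, "1:40": "1|1"},
    # Fully recorded: kept.
    "S3": {"1:10": "0|1", "1:20": "0|0", "1:30": "1|1", "1:40": "0|1"},
}
kept = drop_sparse_samples(calls, 0.5)
print(sorted(kept))
```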