Principal Component Analysis (PCA) reduces a very large number of variables (for example, all of the different genotype values in a data set) to a much smaller number of synthetic dimensions, or “factors”, that account for most of the statistical variation between samples. STARInsight assigns a score to each combination of sample and variant position based on genotype value, and uses those scores to calculate PCA factors.
STARInsight's PCA app is based on the Principal Component Analysis function in MLlib, the machine learning library for Apache Spark.
This article will cover how to run the PCA app and interpret its results.
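To make the idea in the introduction concrete, here is a minimal from-scratch sketch in plain Python. It is not STARInsight's implementation and does not call MLlib. The genotype scoring scheme (0, 1, or 2 alternate alleles), the sample data, and every function name are assumptions made for illustration, since this article does not say how STARInsight scores genotypes.

```python
import math

# ASSUMPTION: score each genotype by its count of alternate alleles.
# The real STARInsight scoring scheme is not documented in this article.
GENOTYPE_SCORES = {"0/0": 0.0, "0/1": 1.0, "1/1": 2.0}


def score_matrix(genotypes):
    """Turn rows of genotype strings (one row per sample) into numeric scores."""
    return [[GENOTYPE_SCORES[g] for g in row] for row in genotypes]


def center_columns(matrix):
    """Subtract each position's mean score so every column averages zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def covariance(centered):
    """Sample covariance between positions (columns) of a centered matrix."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def top_component(cov, iterations=500):
    """Power iteration: return (eigenvalue, unit eigenvector) of largest magnitude."""
    p = len(cov)
    vec = [float(i + 1) for i in range(p)]  # fixed, non-uniform starting vector
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variation left to explain
            return 0.0, vec
        vec = [x / norm for x in nxt]
    eigenvalue = sum(v * sum(c * w for c, w in zip(row, vec))
                     for v, row in zip(vec, cov))
    return eigenvalue, vec


def pca_factors(matrix, num_factors):
    """Return num_factors factor values per sample, strongest factor first."""
    centered = center_columns(matrix)
    cov = covariance(centered)
    components = []
    for _ in range(num_factors):
        eigenvalue, vec = top_component(cov)
        components.append(vec)
        # Deflate: remove the variation already explained by this component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j]
                for j in range(len(vec))] for i in range(len(vec))]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]


# Hypothetical data: four samples, four variant positions.
genotypes = [
    ["0/0", "0/1", "1/1", "0/0"],
    ["0/0", "0/0", "1/1", "0/1"],
    ["1/1", "0/1", "0/0", "0/0"],
    ["1/1", "1/1", "0/0", "0/1"],
]
factors = pca_factors(score_matrix(genotypes), num_factors=2)
for sample_number, (f1, f2) in enumerate(factors, start=1):
    print(f"sample {sample_number}: factor 1 = {f1:.3f}, factor 2 = {f2:.3f}")
```

In this toy data set the first two samples share one genotype pattern and the last two share another, so the two pairs land on opposite sides of Factor 1. The sign of each factor is arbitrary; only the relative positions of the samples carry meaning.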
Running the App
Begin by clicking the PCA tile within the apps page.
A launch modal will prompt you to specify a few parameters.
An analysis set is a collection of samples which you have selected and saved from the Search page. Only your saved analysis sets will appear in this dropdown.
Preprocessing Filter Set
A pre-processing filter set identifies variants relevant to your research. Only your saved filter sets will appear in this dropdown.
Please note that this algorithm cannot be used with filter sets that cover more than 65,535 positions. To analyze filter sets which are larger than this, please use the Scalable Dimension Reduction app.
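If you track filter set sizes outside of STARInsight, the rule above can be expressed as a small pre-check. This is only an illustrative sketch: the function name and the idea of knowing the position count in advance are assumptions, and the only values taken from this article are the 65,535 limit and the two app names.

```python
MAX_PCA_POSITIONS = 65_535  # limit stated in this article


def choose_app(position_count):
    """Suggest which app to use for a filter set covering position_count positions."""
    if position_count < 0:
        raise ValueError("position_count cannot be negative")
    if position_count <= MAX_PCA_POSITIONS:
        return "PCA"
    return "Scalable Dimension Reduction"


print(choose_app(40_000))
print(choose_app(120_000))
```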
Setting the variance normalization parameter divides each sample's PCA factor values by that sample's first factor (which always accounts for the greatest variation). For example, imagine a sample whose resulting factors were (0.8, 0.7, 0.6). Applying variance normalization would yield the following changes...
(0.8 / 0.8 = 1, 0.7 / 0.8 = 0.875, 0.6 / 0.8 = 0.75) such that the final results became...
(1, 0.875, 0.75)
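The same arithmetic can be written as a short Python sketch. The function name is hypothetical and the zero check is an added assumption; the article does not say how STARInsight handles a first factor of zero.

```python
def normalize_by_first_factor(factors):
    """Divide every factor value for one sample by that sample's first factor."""
    first = factors[0]
    if first == 0:
        # ASSUMPTION: treat a zero first factor as an error rather than guess.
        raise ValueError("first factor is zero; cannot normalize")
    return [value / first for value in factors]


normalized = normalize_by_first_factor([0.8, 0.7, 0.6])
print([round(value, 3) for value in normalized])  # [1.0, 0.875, 0.75]
```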
Number of Factors
Specify the number of factors you want to return here. Note that specifying more factors will increase the amount of time it takes your PCA job to run.
The results report contains three sections.
Many of the fields provided in the Summary Information panel are self-explanatory, but here are a few values that bear further explanation.
Samples Excluded From This Report = STARInsight will exclude samples based on the values you have provided for the "Sample Exclusion" filter on the pre-processing filter set used in this analysis.
This count indicates how many samples from your original analysis set were excluded from the analysis because of the rule you provided.
Positions Excluded From This Report = STARInsight will exclude positions based on the values you have provided for the "Position Exclusion" filter on the pre-processing filter set used in this analysis.
Total Positions in Analysis Set = This is the total number of variant positions you specified in the filter set used in the analysis.
In the results plot, each sample included in your analysis is represented as a point on an X-Y plot, where the X axis represents values for Factor 1 and the Y axis represents values for Factor 2.
Use the dropdown menus below your plot to relabel samples. You will see one dropdown for each project included in the analysis. You can use any metadata field available for the sample's project as a label. For example, if your project has six metadata fields ("Project ID", "Sample ID", "Gender", "Treatment", "Treatment Response", "Ethnicity"), you will see those in the dropdown.
If you see some data points that you want to analyze further, you can drag a selection rectangle over them and click "Save" to create a new analysis set that contains only the selected samples. Pan around the plot by holding SHIFT, left-clicking, and dragging. You can zoom in and out by pinching or by using your mouse's scroll wheel.
Depending on the number of samples you select, this table could be very large, so we’ve provided a preview of the Factor 1 and Factor 2 values for 100 samples from your job. To download the entirety of your results (including PCA factors beyond 1 and 2), use the "Download Full Results" button at the bottom of the screen.
If you requested more than two factors in your PCA job, these values will appear for each sample in your full results file.