Scalable Dimension Reduction

Overview

Scalable Dimension Reduction (SDR) is a clustering algorithm suitable for performing continental ancestry studies or for identifying outliers before performing additional tertiary analysis.

SDR takes a very large number of variables (for example, all of the different genotype values in a data set) and reduces them to a much smaller number of synthetic dimensions, or “factors”, which account for most of the statistical difference between samples. STARInsight assigns a score to each combination of sample and variant position based on genotype value, and uses those scores to calculate SDR factors.

SDR is based on the Qatar Computing Research Institute’s Scalable Principal Component Analysis algorithm for the Apache Spark computing framework. While SDR is similar to Principal Component Analysis, it has several important differences (discussed below).

This article will teach you how to run the SDR app and interpret its results.

Running the App

Begin by clicking the SDR tile within the Apps page.

A modal will prompt you to provide some parameters.

 

Analysis Set

An analysis set is a collection of samples which you have selected and saved from the Search page. Only your saved analysis sets will appear in this dropdown. 

Preprocessing Filter Set

A pre-processing filter set identifies variants relevant to your research. Only your saved filter sets will appear in this dropdown. 

Note that your STARInsight domain may be subject to a limit on the number of chromosomal positions that a single filter set can encompass. If you create a filter set that exceeds this limit, it will not appear in this dropdown.

Number of Factors

Specify the number of factors you want SDR to return. Note that requesting more factors will increase the time it takes your SDR job to run. Your results will vary dramatically based on the number of factors you specify here (see below for an explanation of expected behavior).

Interpreting Results

The results report contains three sections.

Summary Information

Many of the fields in the Summary Information panel are self-explanatory, but a few values bear further explanation. Note that many of these values are also included in an Excel file that you can download at the bottom of the report.

 

Samples Excluded From This Report = STARInsight will exclude samples based on the values you have provided for the "Sample Exclusion" filter on the pre-processing filter set used in this analysis.

This count indicates how many samples from your original analysis set were excluded from the analysis because of the rule you provided. 

Angles of PCs versus initial PC = The values in this array indicate the angle of the intersection between the different factors you have requested. If you run a job with four factors, the first number will be zero, the second will indicate the angle between Factor 1 and Factor 2, the third between Factor 2 and Factor 3, and the fourth the angle between Factor 3 and 4.
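To make the layout of this array concrete, here is a minimal sketch that computes it for a set of made-up factor directions, following the description above (a leading zero, then the angle between each factor and the next). The angle_degrees and consecutive_angles helpers and the example vectors are hypothetical, and the choice of degrees is an assumption; this article does not state which unit the report uses.

```python
import math


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def consecutive_angles(factors):
    """First entry is 0, then the angle between each factor and the next."""
    angles = [0.0]
    for previous, current in zip(factors, factors[1:]):
        angles.append(angle_degrees(previous, current))
    return angles


# Hypothetical factor directions in a three-position space.
factors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
]
angles = consecutive_angles(factors)
```

With these example vectors the array has four entries: zero, a right angle between Factor 1 and Factor 2, and two 45-degree angles for the remaining pairs. Classical PCA would give a right angle for every pair.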

Total Positions in Analysis Set = This is the total number of variant positions you specified in the filter set used in the analysis.

Column-wise Standard Deviations = These indicate the standard deviation of the values for the different factors you have requested. The first number in the array will indicate the standard deviation of the Factor 1 value for all samples in your analysis set; the second number will indicate the standard deviation of Factor 2 for all samples, and so on.

Variance of Columnar Variances = This is the standard deviation across the column-wise variances.

Column-wise Variances = The numbers in this array indicate the contribution of each factor to the total variance in the data set. The first number is the variance for Factor 1, the second for Factor 2, and so on. These numbers will always be in descending order.
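The relationship between the column-wise standard deviations and column-wise variances can be sketched with Python's statistics module. The factor_values table below is invented for illustration, and the use of the sample (n - 1) formulas is an assumption; this article does not say whether STARInsight uses the sample or population form.

```python
import statistics

# Hypothetical factor values: one row per sample, one column per factor.
factor_values = [
    [2.0, 1.0],
    [-2.0, -1.0],
    [4.0, 0.0],
    [-4.0, 0.0],
]

# Transpose so each tuple holds one factor's values across all samples.
columns = list(zip(*factor_values))

# Sample variance and standard deviation per factor (assumed n - 1 form).
column_variances = [statistics.variance(col) for col in columns]
column_std_devs = [statistics.stdev(col) for col in columns]

# The report states the variances are always in descending order.
is_descending = all(
    a >= b for a, b in zip(column_variances, column_variances[1:])
)
```

Each standard deviation is the square root of the matching variance, so the two arrays carry the same information on different scales.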

Positions Excluded from this Report = STARInsight will exclude positions based on the values you have provided for the "Position Exclusion" filter on the pre-processing filter set used in this analysis.

Results Plots

In the results plot, each sample included in your analysis is represented on an X-Y plot, where the X axis represents values for Factor 1 and the Y axis represents values for Factor 2.

Use the dropdown menus below your plot to relabel samples. You will see one dropdown for each project included in the analysis. You can use any metadata field available for the sample's project as a label. For example, if your project has six metadata fields ("Project ID", "Sample ID", "Gender", "Treatment", "Treatment Response", "Ethnicity"), you will see those in the dropdown.

 

If you see some data points that you want to analyze further, you can drag a selection rectangle over them and hit "Save" to create a new analysis set that contains only the selected samples. Pan around the plot by holding SHIFT, left-clicking, and dragging. You can zoom in and out by pinching or using your mouse's scroll option.

 

Results Table

Depending on the number of samples you select, this table could be very large, so we’ve provided a preview of the Factor 1 and Factor 2 values for 100 samples from your job. To download your complete results (including factors beyond 1 and 2), use the "Download Full Results" button at the bottom of the screen.

The Excel file contains two tabs: one with the factor values for each sample in your analysis set, and the other with some of the same summary information about your job that appears in the top section of the report.

Important Differences from PCA

Scalable Dimension Reduction differs from classical Principal Component Analysis in important ways. In exchange for the ability to scale to a very large number of variants, SDR makes sacrifices in the following areas.

Results Depend on Factors Requested

Unlike with classical PCA, the number of factors that you request will change the results. This means that some trial and error is required to achieve useful clustering with SDR.

Variance Contribution

With both classical PCA and SDR, the returned factors provide descending contributions to the total variance observed in the data set.

With classical PCA, Annai's bioinformatics team has observed that the first two factors returned account for over 60% of the total variation. With SDR, the first two factors will typically account for 25% of total variance. 
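The percentages above are shares of total variance, which can be computed as a running sum of factor variances divided by the total variance of the data set. The sketch below uses invented numbers, chosen so that the first two factors cover 25%; the cumulative_share helper is hypothetical. It assumes total_variance is the variance of the whole data set, not just the sum over the factors that were returned.

```python
def cumulative_share(factor_variances, total_variance):
    """Running fraction of total variance covered by the first k factors."""
    shares = []
    running = 0.0
    for variance in factor_variances:
        running += variance
        shares.append(running / total_variance)
    return shares


# Hypothetical numbers chosen so the first two factors cover 25%.
factor_variances = [15.0, 10.0, 5.0]
total_variance = 100.0

shares = cumulative_share(factor_variances, total_variance)
first_two = shares[1]  # share covered by Factor 1 and Factor 2 together
```

A low share for the first two factors means the two-dimensional results plot shows only part of the structure in the data, which is one reason to download and inspect the additional factors.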

Factors are Not Orthonormal

Classical PCA returns factors which are orthonormal. In visual space, this is represented by factors intersecting as perpendicular lines. [1] 

SDR does not return orthonormal factors. Refer to the "Angles of PCs versus initial PC" metric in the Summary Information panel to gauge the intersection angles of the different factors.
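For readers unfamiliar with the term, a set of vectors is orthonormal when each has unit length and every pair has a dot product of zero. The following sketch tests that property; the is_orthonormal helper and both example sets are hypothetical and are not taken from SDR output.

```python
import math


def is_orthonormal(vectors, tolerance=1e-9):
    """True if every vector has unit length and every pair has zero dot product."""
    for i, u in enumerate(vectors):
        for j, v in enumerate(vectors):
            dot = sum(a * b for a, b in zip(u, v))
            # A vector dotted with itself should give 1; any other pair, 0.
            expected = 1.0 if i == j else 0.0
            if not math.isclose(dot, expected, abs_tol=tolerance):
                return False
    return True


# Perpendicular unit vectors, as classical PCA would return.
pca_like = [[1.0, 0.0], [0.0, 1.0]]

# Unit vectors that are not perpendicular (made-up example).
non_orthogonal = [[1.0, 0.0], [0.6, 0.8]]

pca_result = is_orthonormal(pca_like)
other_result = is_orthonormal(non_orthogonal)
```

The second set fails the check even though both of its vectors have unit length, because their dot product is 0.6 instead of zero.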

[1] Image credit to Wikipedia.org
