## Introduction

Principal Components Analysis (PCA) is a mathematical technique that has many uses. In Golden Helix SVS, Principal Components Analysis is used to find patterns between and among samples within genotypic or

other data (such as copy number variation data) that are shared across all markers or across all data.

Mathematically speaking, Golden Helix SVS takes the data (and numerically recodes it in the case of genotypic data) and converts it into a “Wishart matrix”, which basically consists of the correlations between the data of different samples. (More precisely, a “covariance matrix”.) This matrix is then analyzed for its “principal components”, that is, the “eigenvectors” it has corresponding to the highest “eigenvalues” it has. These eigenvectors show any patterns between and among the samples in the original data.

## Uses for Principal Component Analysis in Golden Helix SVS

One use of PCA in Golden Helix SVS is to correct data for association tests to remove confounding patterns of data between samples such as population structure. Another confounding pattern can be “batch effects”, which are the results of different samples being analyzed in different batches.

Normally, only the principal components with the top eigenvalues are used for this PCA correction.

Another use for Principal Components Analysis in Golden Helix SVS is to organize data from different samples into groupings, or “clusters”, of samples that have patterns of similar data. This is done by finding just a few principal components (in the Wishart matrix) with the highest eigenvalues, then taking two or three of these eigenvectors and scatter-plotting them against each other. (The Golden Helix SVS 3-D plot capability, **Plot > Scatter Plot 3D**, may be used when plotting three eigenvectors against each other.) Each plot point represents one sample, and the plot point’s position in the graph represents the respective eigenvectors’ values corresponding to that sample.

The plot points will usually have a tendency to organize into either lines (straight or otherwise) or else clusters around certain places on the graph. These patterns visually illustrate similarities or differences between the respective samples.

Normally, only the principal components with the top two, three, five, or ten eigenvalues are used for cluster plotting as described above.

## The Problem

When the sample size gets to be over 4,000, finding the principal components using the PCA features we have had up until now in Golden Helix SVS will show a noticeable delay, with 14,000 samples taking on the order of 12 minutes for principal component processing and 28,000 samples taking more than an hour. Finally, there is an ultimate limit for the sample size for these PCA features to function at all, which is just over 40,000.

## The Solution

To address this issue, a new feature has been added to the new release of Golden Helix SVS, SVS 8.9.1, which is designed to find a few principal components when the sample size is large or very large. This new feature has been tested on datasets containing 400,000 samples.

This new feature, **Genotype > Compute Large Data PCA**, works on either a spreadsheet with mapped genotypic data or on a spreadsheet with mapped numeric data.

## How Do We Do This?

As hinted above, computing principal components requires two steps: (1) computing the Wishart matrix, and (2) computing the actual principal components (eigenvectors and eigenvalues having the largest eigenvalues) of the Wishart matrix.

Up until now, the two Principal Component Analysis features of Golden Helix SVS have (1) stored the Wishart matrix in random-access memory, which, for large sample sizes, has a very large memory requirement, (2) computed all of the Wishart matrix’s eigenvectors and eigenvalues internally, copying out whatever component values were requested, even if only a very few components needed to be found, and (3) thrown away the Wishart matrix when all calculations were done.

The new **Genotype > Compute Large Data PCA** feature, by contrast, (1) stores the Wishart matrix on hard drive, (2) only computes those eigenvectors and eigenvalues that were requested, and (3) makes the Wishart matrix available for later use as a Golden Helix SVS spreadsheet.

## Step 1: Find the Wishart Matrix

Since random-access memory will not hold this matrix, optimum techniques are used for (1) preparing data from your spreadsheet and transforming it into a form that may easily be used to compute the Wishart matrix, writing this information into a temporary spreadsheet, then (2) scanning this data in an optimum way to allow forming the Wishart matrix on your hard drive in one pass.

The amount of repetitive scanning that must be done for forming the Wishart matrix can be minimized by setting the amount of “Transpose and analysis memory usage” to a high value.

## Step 2: Find the Principal Components

This step utilizes an algorithm discussed in Halko N., Martinsson P.G., Tropp J.A. (2010), ‘Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions’, arXiv:0909.4061v2 [math.NA] 14 Dec 2010.

For short, we will this algorithm as the “Halko algorithm”.

This method uses random numbers to find an orthonormal matrix “Q” (meaning containing columns that would qualify as eigenvectors for some matrix) whose range (the set of possible results of multiplying this matrix by any vector) approximates the range of another matrix “A” of a much larger size. This matrix “Q” may then be used to derive the eigenvectors of “A” which have the largest eigenvalues.

The essence of this method is as follows:

(1) Create a random-number matrix “Q”.

(2) Multiply “A” and “Q”. Place this result back into “Q”.

(3) Use a process called “Gram-Schmidt” to change matrix “Q” into an orthonormal matrix.

(4) Compute the transpose of “Q” matrix-multiplied by “A” then matrix-multiplied by “Q”. Call this “B”.

(5) Since “B” fits easily into random-access memory, use an off-the-shelf utility to compute all of its eigenvalues and eigenvectors. Call the matrix of resulting eigenvectors of “B” “V” and the matrix containing the eigenvalues on its diagonal (and zero elsewhere) “L”, so that we have “B” equaling “V” times “L” times the transpose of “V”.

(6) We approximate “A” as “Q” times “B” times the transpose of “Q”, which is the same as “Q” times “V” times “L” times the transpose of “V” times the transpose of “Q”. In other words, we claim that the diagonal of “L” approximates the eigenvalues of “A”, and that the product of “Q” and “V” approximates a matrix of eigenvectors of “A”.

To improve accuracy, Golden Helix SVS repeats steps (2) through (5) until the eigenvalues of a given repetition are close to the same as the eigenvalues found by the previous repetition. This works because taking “A” and multiplying it with the latest “Q” many times exaggerates the strongest aspects of “A” until only a linear combination of the eigenvectors of “A” corresponding to the highest eigenvalues (the principal components of “A”) is left in “Q”.

This algorithm works very well when you only need those very few principal components of “A”. No time is wasted internally finding the remaining components of “A”.

Additionally, multiplying “A” with “Q” can be done through the large-data technique of scanning “A” as stored on the hard drive and performing the multiplication operation with individual rows of “A” as they are read in from the hard drive.

## How To Use

This feature may be accessed from either a spreadsheet with mapped genotypic data or from a spreadsheet with mapped numeric data. In either case, use **Genotype > Compute Large Data PCA** from the spreadsheet menu.

If your data is genotypic, the following dialog will come up:

The options for both Step 1 and Step 2 will show. The *PCA Marker Normalization* and *Genetic Model for PCA* choices in Step 1 are the same options as those in **Genotype > Genotype Principal Component Analysis**.

You may choose to skip Step 2 at first and only compute a Wishart matrix, then, later on, re-run only Step 2, selecting *Use Pre-Computed Wishart Matrix* and selecting the Wishart matrix you computed earlier.

The default *Number of Principal Components* is 15 because if these components are used for plotting eigenvectors against each other, the first 10 components will usually contain the most interesting information, with 15 adding 5 extra components for good measure.

Since the Halko algorithm depends upon random numbers, it will also be dependent upon the random number “seed”. To be able to reproduce the same results, use *Fixed random seed (reproducible results)*.

Meanwhile, if the random number seed changes, the results will slightly change. The changes in results indicate where this algorithm is less accurate. Almost always, the error in this algorithm is within 10 times how much a result changes when the random number seed is changed. Use *Dynamic random seed (investigate precision)* to use a random seed that will be different every time you run Step 2 of this feature.

Meanwhile, if your data is numeric, this alternative dialog will come up:

Here also, the options for Step 1 and Step 2 will show.

For Step 1, the *PCA Marker Normalization* choices match those that would have been available for genotypic data, and if your data is recoded from genotypic data using the additive model, will produce the same results.

Meanwhile, if you want compatibility with **Numeric > Numeric Principal Component Analysis**, you should select *Apply no normalization*, which is the default. The output will match the output of **Numeric > Numeric Principal Component Analysis** when, in that feature, you check *Center data by marker* and un-check *Center data by sample*.

Step 2 for numeric data is exactly the same as Step 2 for genotypic data.

## Outputs

The output spreadsheet for eigenvalues and the output spreadsheet for eigenvectors are in exactly the same format as those generated by **Genotype > Genotype Principal Component Analysis** and by **Numeric > Numeric Principal Component Analysis**. This means that you can use your output *Principal Components/Eigenvectors* spreadsheet from a **Genotype > Compute Large Data PCA** run as a pre-computed principal components spreadsheet to apply PCA correction to your data. To do this, use **Genotype > Genotype Principal Component Analysis** or **Numeric > Numeric Principal Component Analysis**, select the *Use precomputed principal components* choice that exists in either feature and choose your output *Principal Components/Eigenvectors* spreadsheet.

## Performance

Setting the “Transpose and analysis memory usage” to its new default level of 2 GB, 4K samples can be processed to get 15 principal components in a matter of seconds,14K samples can be processed in a minute and a half, 28K samples can be processed on the order of eleven minutes, and 40K samples, which took hours before, can be processed on the order of a half hour.

Recently, an analysis using a preliminary version of this new feature was able to process a dataset with 400,000 samples, ten times the upper limit of the number of samples that the earlier features could process. This took about three days–it wasn’t instantaneous, but the job did get done. Previously, this dataset would have been impossible to analyze at all.