Large Scale PCA Analysis in SVS – Webcast Q&A Follow Up

January 26, 2022

Thank you to those who attended our recent webcast by Gabe Rudy, Large Scale PCA Analysis in SVS. For those who could not attend, you can find a link to the recording here.

While this webcast discussed methods for principal components analysis (PCA) in SVS, including the new capability for performing PCA on large sample sizes, it also touched on the overall workflow of association testing in SVS. That workflow uses several SVS features besides PCA that help assess data quality and improve the analysis, such as finding the Identity by State (IBS) and Identity by Descent (IBD) for pairs of samples, LD pruning of marker data, and determining a Genomic-Control “inflation factor” for the test results.

We had several questions related to the overall workflow of association testing that we were not able to answer during the live webcast. Here are the questions and their answers:

Why should I filter with IBD and not IBS?

In the demonstration, Gabe actually filtered with IBS (Identity by State): he used the IBS matrix and its heat-map graph to find the one sample that needed to be eliminated from the analysis. In SVS, that matrix is computed from the IBD (Identity by Descent) dialog, so Gabe brought up that dialog first. While it was open, he digressed to explain the expected values in the IBD PI matrix for sample pairs with different degrees of relatedness (parent/child, siblings, etc.) before showing the IBS matrix and its heat map.

Would you always want to normalize using “Theoretical standard deviation under HWE” when running PCA?

When running principal components analysis, normalizing each marker’s data by the standard deviation of that marker’s data is standard practice, whether the actual standard deviation or an easily computed theoretical value is used. (If p and q are the allele frequencies and the marker data is in Hardy-Weinberg equilibrium (HWE), the standard deviation of this data will be proportional to the square root of p times q. In practice, most genotypic data that one encounters is close to Hardy-Weinberg equilibrium.)

An alternative “normalization” is not to normalize at all. Compared with normalizing by the standard deviation of each marker’s data, running PCA this way tends to give more weight to markers with a higher minor allele frequency.
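To make the difference between these choices concrete, here is a minimal Python sketch of the three options for a single marker. It is an illustration only, not SVS code: the function names, the 0/1/2 additive coding, and the choice of sqrt(2pq) as the theoretical HWE standard deviation (which is proportional to the square root of p times q) are assumptions made for this example.

```python
from math import sqrt


def normalize_marker(genotypes, method="hwe"):
    """Center one marker's additive genotype codes and scale them.

    genotypes: counts (0, 1 or 2) of one allele, one entry per sample.
    method: "hwe"    -> divide by the theoretical SD under HWE, sqrt(2pq)
            "actual" -> divide by the sample standard deviation
            "none"   -> center only, no scaling
    (Hypothetical helper for illustration; not an SVS function.)
    """
    n = len(genotypes)
    mean = sum(genotypes) / n
    p = mean / 2.0  # estimated frequency of the counted allele
    q = 1.0 - p
    centered = [g - mean for g in genotypes]

    if method == "none":
        return centered
    if method == "hwe":
        sd = sqrt(2.0 * p * q)
    elif method == "actual":
        sd = sqrt(sum(c * c for c in centered) / (n - 1))
    else:
        raise ValueError("method must be 'hwe', 'actual' or 'none'")

    if sd == 0.0:  # monomorphic marker: nothing to scale
        return centered
    return [c / sd for c in centered]


def sum_of_squares(values):
    """Total squared deviation a marker contributes to the PCA input."""
    return sum(v * v for v in values)


# A common marker (allele frequency 0.5) and a rare one (0.0625).
common = [0, 1, 1, 2, 1, 0, 2, 1]
rare = [0, 0, 0, 0, 0, 0, 0, 1]

for method in ("none", "hwe"):
    print(method,
          sum_of_squares(normalize_marker(common, method)),
          sum_of_squares(normalize_marker(rare, method)))
```

Without scaling, the common marker contributes several times more squared deviation than the rare one, so it pulls harder on the components. After dividing by the HWE standard deviation, the two contributions are of similar size.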

Do you know why the inflation factor would not incorporate lineage or pedigree information? Specifically, if I had an unrelated population, why could I still have a higher lambda?

Background: If an association test uses a chi-squared statistic to determine a set of p-values, we define the “inflation factor” lambda as the ratio of the median of the chi-squared statistics from the actual test to the median of the expected or ideal chi-squared distribution for the test. According to several statistical models, the actual distribution will be spread out or “inflated” in proportion to how much confounding affects the test results. Ideally, lambda should be 1. If lambda approaches 2 or higher, there may be a problem with the data.
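The definition above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming a test with one degree of freedom; the names `inflation_factor` and `EXPECTED_MEDIAN_1DF` are made up for this example and are not part of SVS.

```python
from statistics import NormalDist, median

# Median of the chi-squared distribution with 1 degree of freedom.
# A 1-df chi-squared variable is the square of a standard normal variable,
# so its median is the square of the normal 75th percentile (about 0.455).
EXPECTED_MEDIAN_1DF = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(chi2_stats):
    """Ratio of the observed median chi-squared statistic to the expected
    median under the null, assuming 1 degree of freedom."""
    return median(chi2_stats) / EXPECTED_MEDIAN_1DF


# "Ideal" statistics: evenly spaced quantiles of the 1-df chi-squared
# distribution, then the same statistics uniformly inflated by 50%.
std_normal = NormalDist()
ideal = [std_normal.inv_cdf(0.5 + (i / 1000) / 2) ** 2 for i in range(1, 1000)]
inflated = [1.5 * x for x in ideal]

print(inflation_factor(ideal))     # close to 1
print(inflation_factor(inflated))  # close to 1.5
```

Because lambda depends only on the median, it summarizes the bulk of the test statistics and is not driven by a handful of truly associated markers in the tail.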

Answer: The “inflation factor” lambda is based strictly on the chi-squared values of the test results and is not tied to whatever model is used to represent the data. Although we hope to reduce the confounding, and with it the lambda value, by taking measures such as PCA correction or mixed models, there is no guarantee that the resulting test results will have a lower lambda value.

Is there a way to evaluate lambda from the test results if the lambda value is not output by the test itself?

To find an approximate “inflation factor” lambda from a set of p-values alone, accepted statistical practice is to “work backward” through the conversion of chi-squared values to p-values: use the appropriate utility function to convert p-values back to chi-squared values, either for each p-value or just for the median p-value. Usually, one degree of freedom is assumed for the chi-squared distribution in this process. The standard formula is then used to obtain an approximate lambda value.

This practice is implemented by the new add-in script, Calculate Approximate Lambda from P Values. Please use this script to obtain an approximate lambda if your test does not output chi-squared values or a lambda value.
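For readers who want to see the arithmetic, here is a minimal Python sketch of the “work backward” approach using the median p-value. It is not the code of the add-in script; the function names are hypothetical, and one degree of freedom is assumed. The conversion relies on the fact that a 1-df chi-squared statistic is the square of a standard normal value, so the two-sided normal quantile can stand in for the inverse chi-squared function.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def p_to_chi2_1df(p):
    """Return the 1-df chi-squared statistic whose upper-tail probability is p.

    If X is chi-squared with 1 df, then X = Z**2 for a standard normal Z,
    and P(X > x) = p exactly when |Z| exceeds the normal quantile at p / 2.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p-values must lie in (0, 1]")
    z = _STD_NORMAL.inv_cdf(p / 2.0)
    return z * z


def approximate_lambda(p_values):
    """Approximate inflation factor from p-values alone (1 df assumed)."""
    observed_median = p_to_chi2_1df(median(p_values))
    expected_median = p_to_chi2_1df(0.5)  # median of the ideal distribution
    return observed_median / expected_median


# Perfectly uniform p-values give a lambda of about 1; p-values that skew
# small give a lambda above 1.
uniform = [i / 1000 for i in range(1, 1000)]
print(approximate_lambda(uniform))
print(approximate_lambda([0.1, 0.3, 0.9]))
```

Converting only the median p-value gives the same answer as converting every p-value and taking the median of the results when the number of p-values is odd, because the conversion is monotone. With an even count the two variants can differ slightly, since the median then averages the two middle values.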

Why did you use an IBS distance matrix and not an IBD sheet for the EMMAX test to help with inflation?

We normally prefer to use the IBS (Identity by State) matrix rather than the IBD (Identity by Descent) PI matrix as a kinship matrix for EMMAX. The IBS matrix is based on the genomic state of the samples, irrespective of the descent of the patients or animals from which the samples were taken, and what we are correcting for is the genomic patterns that happen to exist in the data. The IBD PI matrix does have the proper format to be used as a kinship matrix, since it estimates relationships between samples, but it is focused on estimating those actual relationships (or their absence) rather than on finding genomic similarities.

Another matrix that may be used as a kinship matrix is the GBLUP Genomic Relationship Matrix (GRM), which, like the IBS matrix, is based on the genomic state of the samples. The difference between the two is that the GRM is more strictly a measure of the correlation between samples, computed from an additive recoding of the genotypes, whereas the IBS matrix merely counts how many alleles happen to match between the genotypes of each pair of samples.
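The contrast between allele counting and a correlation-style measure can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the SVS implementation: genotypes are coded as 0/1/2 counts of one allele, the IBS value is expressed as a similarity (the fraction of alleles shared), and the GRM follows one commonly used formulation (centering by twice the allele frequency and dividing by the sum of 2pq over markers), which may differ from the formula SVS uses. The function names are hypothetical.

```python
def ibs_similarity(g1, g2):
    """Fraction of alleles two samples share identical by state.

    At each marker the samples share 2 - |g1 - g2| alleles (0, 1 or 2);
    the result is the total shared divided by the maximum possible.
    """
    shared = sum(2 - abs(a - b) for a, b in zip(g1, g2))
    return shared / (2.0 * len(g1))


def genomic_relationship_matrix(genotypes):
    """GRM from a samples-by-markers table of additive genotype codes.

    Each marker is centered by twice its allele frequency, and the
    cross-products between samples are scaled by the sum of 2pq.
    """
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    freqs = [sum(row[m] for row in genotypes) / (2.0 * n_samples)
             for m in range(n_markers)]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    centered = [[row[m] - 2.0 * freqs[m] for m in range(n_markers)]
                for row in genotypes]
    return [[sum(a * b for a, b in zip(ci, cj)) / scale for cj in centered]
            for ci in centered]


# Three samples, four markers. Samples 0 and 1 are identical; sample 2 has
# the opposite homozygous genotype at two of the markers.
samples = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
]

print(ibs_similarity(samples[0], samples[1]))
print(ibs_similarity(samples[0], samples[2]))
grm = genomic_relationship_matrix(samples)
print(grm[0][1], grm[0][2])
```

In this toy data, samples 0 and 2 still share half of their alleles by state, yet their GRM entry is negative because, relative to the allele frequencies in the data set, they deviate in opposite directions. IBS similarities always lie between 0 and 1, while GRM entries behave like covariances and can fall below zero.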

If you have any other questions, either about principal components analysis or performing association testing using SVS, or any other questions about SVS, please contact us at support@goldenhelix.com.
