Analyzing Whole Exome, Large-n Cohorts in SVS

It’s come to my attention in recent weeks, through various customer interactions, that many users are not aware of the fantastic functionality that exists in SNP and Variation Suite (SVS) for large-n DNASeq workflows, including large cohort analyses with case/control variables.

The data you’ll see below are the publicly available 1kG Phase 1 v3 exome sequences from 1,092 individuals with a simulated case/control phenotype: 167 cases and 925 controls. I filtered this dataset down to rare variants based on exons from RefSeq Genes version 105 and the NHLBI ESP6500SI-V2 Exomes Variant Frequencies 0.019 track (using a MAF threshold of 0.01), and removed all variants defined as common by dbSNP version 137 as provided within SVS. Finally, the dataset was filtered to non-synonymous coding variants using the Coding Variant Classification function, which left a dataset of 204,818 markers.

SVS has many functions for comparing two groups or for filtering and annotating your dataset, but this is just one example. We also have options for family analyses, for narrowing a dataset to a particular region of interest or a certain set of genes, and for activating the dataset by sample zygosity state, just to name a few. You can customize your research analysis as you see fit! That’s the beauty of SVS.

Now with a dataset of rare, non-synonymous coding variants we can proceed with various collapsing methods. Collapsing methods, or burden tests, work by combining several variants into a single covariate based on gene regions. This is necessary when working with rare variants because traditional GWAS association tests lack the power to detect significance at the individual-variant level.
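The collapsing idea can be sketched in a few lines. This is not SVS code; the genotype matrix and function name below are purely illustrative, assuming genotypes are coded as minor-allele counts:

```python
# Minimal sketch of gene-level collapsing: combine the rare variants in one
# gene region into a single per-sample burden score (illustrative only).
import numpy as np

def burden_scores(genotypes):
    """genotypes: (n_samples, n_variants) array of minor-allele counts
    (0/1/2) for the rare variants in one gene. Returns one score per sample."""
    return genotypes.sum(axis=1)  # simplest collapse: total rare-allele count

# toy gene region: 3 samples x 3 rare variants
geno = np.array([[0, 1, 0],
                 [0, 0, 0],
                 [2, 0, 1]])
print(burden_scores(geno))  # [1 0 3]
```

The per-sample score then serves as the single covariate tested against case/control status, in place of many underpowered single-variant tests.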

I started out by running a single-variant association test, just to see what I could find. I used the Basic Allelic model, testing significance with Fisher’s exact test, found under the Genotype menu in SVS. An advantage of the basic allelic test is that it doubles the number of observations, perhaps giving a slight power boost for rare-variant testing; a disadvantage is that it ignores genotype-specific information. Fisher’s exact test is a statistical significance test used when analyzing contingency tables (a contingency table displays the multivariate frequency distribution of categorical variables). It is called “exact” because it calculates the p-value exactly rather than relying on an approximation, as many other statistical tests do.
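As a rough sketch of what the basic allelic test computes for one variant, here is Fisher’s exact test on a 2×2 table of allele counts. The counts are made up for illustration; note each sample contributes two alleles, which is the “doubling of observations” mentioned above (167 cases → 334 alleles, 925 controls → 1,850 alleles):

```python
# Fisher's exact test on a 2x2 allele-count table for one hypothetical variant.
# Counts are invented for illustration; SVS computes this internally.
from scipy.stats import fisher_exact

#            minor  major   <- allele counts (2 alleles per sample)
table = [[30,  304],        # cases:    167 samples -> 334 alleles
         [20, 1830]]        # controls: 925 samples -> 1850 alleles

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)  # enrichment of the minor allele in cases
```

The p-value comes from summing exact hypergeometric probabilities over tables at least as extreme as the one observed, which is what distinguishes it from chi-squared approximations.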

(Interesting fact: Sir Ronald Fisher (the guy this test is named after) was an English statistician, evolutionary biologist, and eugenicist. It’s said he originally had the idea for this test because Dr. Muriel Bristol-Roach, an algal biologist, stated she could tell whether tea or milk was added to her cup first. Fisher created the test to examine her claim and he ended up essentially proving her claim!)

The first option for analyzing rare variants is KBAC with permutation testing. This method, originally developed by Liu and Leal (2010), stands for Kernel-Based Adaptive Cluster test. The test groups the variant data from each gene region into multi-marker genotypes, then applies a case/control test to the counts. The test is weighted so that genotypes found more often in cases are given higher weights, which can help distinguish potentially causal from non-causal genotypes. The standard version of the test uses a one-sided hypothesis; in other words, it expects that you are looking for genes where the cases carry a burden of variation relative to the controls. If you suspect that it is the controls that may carry more mutations, you should either invert the case/control status and run the test again to see results in the opposite direction, or consider using the two-sided testing options. In this instance 1M permutations were used.
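To make the permutation-testing half of this concrete, here is a generic sketch of how a permutation p-value is obtained for a one-sided statistic. This is not KBAC’s actual weighted statistic; the simple case-minus-control burden difference and all names and data below are illustrative assumptions:

```python
# Generic permutation p-value sketch (not KBAC's weighted statistic):
# shuffle case/control labels to build the null distribution of the statistic.
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(stat_fn, burden, is_case, n_perm=10_000):
    observed = stat_fn(burden, is_case)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(is_case)       # break genotype/phenotype link
        if stat_fn(burden, shuffled) >= observed: # one-sided: cases enriched
            hits += 1
    return (hits + 1) / (n_perm + 1)              # add-one avoids p = 0

# one-sided statistic: mean burden in cases minus mean burden in controls
diff = lambda b, c: b[c].mean() - b[~c].mean()

burden = np.array([3, 2, 2, 1, 0, 0, 0, 0])
is_case = np.array([True, True, True, False, False, False, False, False])
pval = perm_pvalue(diff, burden, is_case)
print(pval)
```

The 1M permutations used in the actual run serve the same purpose: the more permutations, the finer the resolution of small p-values.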

The next option we’ll discuss is CMC, the Combined Multivariate and Collapsing method, developed by Li and Leal (2008). This method provides the option of analyzing both common and rare variants in your dataset. For this run the filtering was the same as above minus the allele-frequency filters: filtering to the exonic regions and to non-synonymous variants only. The resulting dataset of both rare and common variants consisted of 239,004 markers.

This method first needs variants binned by frequency. The function is found under the DNASeq Menu > Variant Binning by Frequency Source; in this case NHLBI ESP6500SI-V2 Exomes – Variant Frequencies was used. In each defined bin, the variant type (heterozygous or homozygous) is given a binary indicator according to what is collapsed in that bin; then a multivariate test is performed on the counts across the bins. In SVS we have two options for CMC analysis, a Hotelling T-squared test or regression; here we’ll discuss Hotelling’s test. Hotelling’s test requires a case/control dependent variable, which is then tested against the independent binned variables.
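For intuition, here is a sketch of a two-sample Hotelling’s T-squared test applied to per-sample indicators for two frequency bins. The data, sample sizes, and function name are invented for illustration; SVS’s implementation is not shown here:

```python
# Two-sample Hotelling's T-squared test on p collapsed bin indicators per
# sample (here p = 2: a rare bin and a common bin). Data are made up.
import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2(X, Y):
    """X: cases (n1 x p), Y: controls (n2 x p). Returns (T^2, p-value)."""
    n1, p = X.shape
    n2 = Y.shape[0]
    d = X.mean(axis=0) - Y.mean(axis=0)
    # pooled sample covariance of the two groups
    S = ((n1 - 1) * np.cov(X, rowvar=False) +
         (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    # T^2 relates to an F distribution with (p, n1 + n2 - p - 1) df
    f_stat = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    return t2, f_dist.sf(f_stat, p, n1 + n2 - p - 1)

# toy bin indicators: cases carry variants in both bins more often
cases    = np.array([[1, 1]] * 10 + [[1, 0]] * 8 + [[0, 1]] * 6 + [[0, 0]] * 16, float)
controls = np.array([[1, 0]] * 5 + [[0, 1]] * 4 + [[0, 0]] * 51, float)
t2, p = hotelling_t2(cases, controls)
print(t2, p)
```

Testing all bins jointly is what makes CMC "multivariate": a single test asks whether the vector of bin means differs between cases and controls, rather than testing each bin separately.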

(Interesting fact: Harold Hotelling (an American) was influenced by Sir Ronald Fisher’s work, “Statistical Methods for Research Workers” and considered him a colleague. He held professorial positions at various prestigious institutions: Stanford, Columbia and Univ. of North Carolina, where he helped develop the universities’ statistics departments. He also sponsored refugees from Nazism and European anti-Semitism.)

The features described above are just part of what makes Golden Helix’s SVS software so flexible. Researchers have the option of family analyses or of narrowing a dataset to a specific region or set of genes, and when working with rare non-synonymous variants, SVS can also perform burden tests and collapsing methods. If you have any questions about SVS capabilities, please contact our support team; we will be happy to help.
