Our recent blog post about the release of SNP & Variation Suite v7.4 gave you a sneak peek into what can be achieved with the revamped SVS/Python integration, which includes the incorporation of NumPy and SciPy libraries and new graphical layout capabilities. A Python package (such as SciPy) is similar to an R package, which you may be more familiar with. Just as those with the know-how can write R functions using commands from R packages, one can write Python scripts using the commands from the SciPy package. In short, this new functionality enables us and other members of the Golden Helix community to rapidly deliver much more advanced methods and workflows for a wider array of data analysis challenges – as they arise – without having to wait months for the next version of SVS to be released. Further, these new features can be accessed by the broader community through SVS’s standard point-and-click interface.
In this post I highlight some new analysis features, such as ANOVA and Nonparametric testing, that take advantage of the enhanced functionality. All scripts mentioned are available for anyone to download and use from our scripts repository.
First is the ANOVA with Phenotype and SNPs script, which was created to solve a particular user problem. The user had genotypic data and wanted not only F statistics, as in a genotypic test, but also the mean and standard deviation of the phenotype per genotype group. SVS currently reports the average dependent value per group but does not report the standard deviation. The ANOVA script provides the additional output of means and standard deviations per group. Using the SciPy package, which provides functions for a one-way ANOVA F-Test and the Kruskal-Wallis H-Test, we were able to fulfill this request in just a few days. With this new script, the user can perform either an ANOVA test or its nonparametric counterpart and also get information on the distribution of the phenotype per genotype group for each SNP under consideration.
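To give a feel for the kind of calculation involved, here is a minimal sketch for a single SNP. It is not the repository script itself: the function name `anova_by_genotype`, the genotype labels, and the phenotype values are all made up for illustration. Only `scipy.stats.f_oneway` and `scipy.stats.kruskal` are real SciPy functions.

```python
import numpy as np
from scipy import stats

def anova_by_genotype(phenotype, genotype, nonparametric=False):
    """Test a numeric phenotype across genotype groups for one SNP.

    Hypothetical helper for illustration only. Returns the test statistic,
    the p-value, and a per-group summary of (n, mean, standard deviation).
    """
    phenotype = np.asarray(phenotype, dtype=float)
    genotype = np.asarray(genotype)
    labels = sorted(set(genotype.tolist()))
    groups = [phenotype[genotype == g] for g in labels]

    if nonparametric:
        stat, p = stats.kruskal(*groups)    # Kruskal-Wallis H-Test
    else:
        stat, p = stats.f_oneway(*groups)   # one-way ANOVA F-Test

    # ddof=1 gives the sample standard deviation for each group
    summary = {g: (len(v), v.mean(), v.std(ddof=1))
               for g, v in zip(labels, groups)}
    return stat, p, summary

# Made-up phenotype values for three genotype groups
phenotype = [4.1, 3.9, 4.3, 5.0, 5.2, 4.8, 6.1, 5.9, 6.3]
genotype = ["A_A"] * 3 + ["A_B"] * 3 + ["B_B"] * 3

f_stat, f_p, summary = anova_by_genotype(phenotype, genotype)
h_stat, h_p, _ = anova_by_genotype(phenotype, genotype, nonparametric=True)

for label, (n, mean, sd) in summary.items():
    print(f"{label}: n={n}, mean={mean:.3f}, sd={sd:.3f}")
print(f"F={f_stat:.3f} (p={f_p:.3g}), H={h_stat:.3f} (p={h_p:.3g})")
```

In a real analysis the same function would be applied to each SNP column in turn, with missing genotypes or phenotypes dropped first.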
We also created a second ANOVA script, ANOVA on Numeric Columns. Instead of mimicking a genotypic test, this script takes a categorical column as the dependent variable and performs an ANOVA test on all selected binary or numeric columns. Again, you can choose either the ANOVA F-Test or the Kruskal-Wallis H-Test. This script also has the option to output multiple testing corrections (Bonferroni and FDR). The resulting spreadsheet contains statistics about all of the numeric columns or phenotypes, split into groups based on the categorical dependent column.
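The idea of running one test per numeric column and then correcting for multiple testing can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the helper names are hypothetical, the data are made up, and the FDR correction shown is the Benjamini-Hochberg procedure, which may or may not be the variant the script uses.

```python
import numpy as np
from scipy import stats

def bonferroni(pvals):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR variant)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

def anova_on_columns(columns, groups, nonparametric=False):
    """Run one test per numeric column, grouping rows by a categorical column."""
    groups = np.asarray(groups)
    labels = sorted(set(groups.tolist()))
    test = stats.kruskal if nonparametric else stats.f_oneway
    names, pvals = [], []
    for name, values in columns.items():
        values = np.asarray(values, dtype=float)
        _, p = test(*[values[groups == g] for g in labels])
        names.append(name)
        pvals.append(p)
    return names, np.array(pvals)

# Made-up data: one categorical column and two numeric columns
groups = ["a", "a", "a", "b", "b", "b"]
columns = {"x": [1, 2, 3, 7, 8, 9], "y": [1, 5, 3, 2, 4, 3]}

names, pvals = anova_on_columns(columns, groups)
for name, p, p_bonf, p_fdr in zip(names, pvals, bonferroni(pvals),
                                  benjamini_hochberg(pvals)):
    print(f"{name}: p={p:.4g}, Bonferroni={p_bonf:.4g}, FDR={p_fdr:.4g}")
```

Bonferroni is the more conservative of the two corrections, so with many columns the FDR-adjusted values will generally flag more columns at the same cutoff.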
Both of the above scripts have included the Kruskal-Wallis H-Test as a nonparametric equivalent to the ANOVA F-Test. We also have scripts for exclusively numeric data that perform other nonparametric tests, such as the Mann-Whitney Rank Sum test and the Wilcoxon Rank Sum test. Why do we need nonparametric tests for genotypic, categorical, and numeric data? First, many of you asked for it! Second, sometimes the data requires it.
An assumption common to parametric tests is that each sample comes from a normally distributed population. However, this assumption is not always appropriate. For example, copy number data often comes from a bimodal, trimodal, or otherwise multimodal distribution. The plot below shows histograms of simulated data from this type of distribution, split on case/control status.
When we use a conventional association test to test for a difference in the means, we find a large p-value (0.6919). This value is remarkably different from the p-values we find when nonparametric association tests are used. If you were making decisions based on a p-value cutoff, you would likely draw very different conclusions if you did not look at the data more closely!
|Test|Statistic|p-value|
|---|---|---|
|Numeric Association Test|-0.3993 (T-statistic)|0.6919|
|Wilcoxon Rank Sum Test|-3.327 (Z-statistic)|0.00088|
|Mann-Whitney Rank Sum Test|77 (U-statistic)|0.00046|
If we look at the distributions assumed by the parametric test, shown by the density plots below, this disparity makes sense. Because the test assumes that the data come from normal distributions (which is clearly not the case), the means appear to be very similar. Closer inspection of the data shows that these assumptions are not valid and that nonparametric approaches should be considered.
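The effect is easy to reproduce in a small simulation. The sketch below is not the data behind the table above, so its numbers will not match. It uses made-up values in which controls are bimodal and cases are unimodal, with group means chosen to be nearly equal. The seed, sample sizes, and distribution parameters are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical copy-number-like data:
#   controls are bimodal: 32 values near 2.0 and 8 values near 4.0 (mean about 2.4)
#   cases are unimodal:   40 values near 2.4                       (mean about 2.4)
controls = np.concatenate([rng.normal(2.0, 0.1, 32),
                           rng.normal(4.0, 0.1, 8)])
cases = rng.normal(2.4, 0.1, 40)

# Parametric test: compares the group means
t_stat, t_p = stats.ttest_ind(cases, controls)

# Rank-based tests: compare the ordering of the observations
u_stat, u_p = stats.mannwhitneyu(cases, controls, alternative="two-sided")
z_stat, z_p = stats.ranksums(cases, controls)

print(f"t-test:            T={t_stat:.3f}, p={t_p:.3g}")
print(f"Mann-Whitney:      U={u_stat:.1f}, p={u_p:.3g}")
print(f"Wilcoxon rank-sum: Z={z_stat:.3f}, p={z_p:.3g}")
```

Because the group means are nearly identical, the t-test sees almost no difference, while most cases rank above most controls, so the rank-based tests report a very small p-value. In SciPy, `mannwhitneyu` and `ranksums` can give slightly different p-values because they apply different corrections to the normal approximation.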
The incorporation of NumPy and SciPy has allowed us to develop scripts that use several types of nonparametric tests. The new script Nonparametric Association Tests (Binary Dependent) provides access to the Mann-Whitney Rank Sum test and the Wilcoxon Rank Sum test, both of which require a binary dependent variable and use each selected numeric column as the independent variable in its own test. The statistics are based on the ranks of the observations, so the tests do not require the data to follow any specific distribution. Or maybe you’re interested in the correlation between a numeric dependent variable and several other numeric columns, which may or may not follow a normal distribution. Nonparametric Correlation can compute the Spearman Rank Correlation test and Kendall’s Tau Correlation test with just a few clicks and a little computation time.
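A rank-based correlation between one numeric dependent variable and several numeric columns can be sketched as below. Again, this is an illustration and not the script's code: `rank_correlations` is a hypothetical helper and the data are made up. The SciPy calls `spearmanr`, `kendalltau`, and `pearsonr` are real.

```python
import numpy as np
from scipy import stats

def rank_correlations(dependent, columns):
    """Spearman and Kendall correlations of each column with the dependent.

    Hypothetical helper for illustration only.
    """
    dependent = np.asarray(dependent, dtype=float)
    results = {}
    for name, values in columns.items():
        values = np.asarray(values, dtype=float)
        rho, rho_p = stats.spearmanr(dependent, values)
        tau, tau_p = stats.kendalltau(dependent, values)
        results[name] = {"spearman_rho": rho, "spearman_p": rho_p,
                         "kendall_tau": tau, "kendall_p": tau_p}
    return results

# Made-up data: monotonic but strongly nonlinear relationships
x = np.arange(1, 11)
columns = {"exp_x": np.exp(x), "neg_x_cubed": -(x ** 3.0)}

results = rank_correlations(x, columns)
pearson_r, _ = stats.pearsonr(x, columns["exp_x"])

for name, r in results.items():
    print(f"{name}: rho={r['spearman_rho']:.3f}, tau={r['kendall_tau']:.3f}")
print(f"Pearson r for exp_x: {pearson_r:.3f}")
```

Both rank correlations reach their extremes here because the relationships are perfectly monotonic, whereas the Pearson correlation for `exp_x` is noticeably below 1 because the relationship is not linear.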
Python is a capable scripting language for a wide range of research needs, and we continuously develop new scripts to support both custom workflows and simple customer requests. Not only can we provide extra utility between major product releases, but the internal functions also allow Golden Helix’s development team to deliver a large return with very little demand on resources. This means faster implementation of new features for our customers. I’ve only mentioned a few of the available scripts in SVS 7.4. Many more, along with instructions, are available through our script repository. …And that’s my two SNPs.
This article was very informative. Thank you very much, Autumn.
Thank you very much for a very extensive and informative post. I have myself been tinkering around with Python and ANOVA. Will have a look at your scripts later on. Thanks again.