Bryce Christensen

SVS, Population Genetics, and 1000 Genomes Phase 3

One frequent question I hear from SVS customers is whether whole exome sequence data can be used for principal components analysis (PCA) and other applications in population genetics. The answer is, “yes, but you need to be cautious.” What does cautious mean? Let’s take a look at the 1000 Genomes project for some examples.

Phase 3 of the 1000 Genomes Project was released in 2014. The latest major update to the autosomal data, version 5, was released in September. This iteration of the 1000 Genomes Project (1kGP) includes whole-genome sequence data for over 2500 individuals from 26 distinct worldwide populations. As a nice bonus, there is also a large amount of GWAS chip data available. Almost all of the samples have data for the Affymetrix SNP6.0 (Affy6) platform, and many of them also have Illumina 2.5M data. The chip data gives us a nice baseline for comparing methods and results in SVS.

For this analysis, I reduced the 1kGP data to exon regions only. I downloaded the 1kGP autosomal VCF files, used the UCSC browser to create a BED file of RefSeq exon regions (+/- 5 bp), then filtered the files and combined them using VCFtools. The resulting VCF file, containing about 2.2M variants, was imported to SVS for analysis.
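The region filter described above can be sketched in a few lines of Python. This is not the VCFtools command I ran, just a minimal illustration of the logic; the function names are hypothetical, and it assumes the BED and VCF files use the same chromosome naming (UCSC exports typically say "chr1" while the 1kGP VCF files typically say "1", so one side may need renaming first). BED coordinates are 0-based and half-open, while VCF positions are 1-based, so the sketch converts before comparing.

```python
import bisect
from collections import defaultdict

def load_padded_regions(bed_lines, pad=5):
    """Parse BED lines into merged, padded intervals per chromosome.

    Returns {chrom: (starts, ends)} with 0-based, half-open coordinates.
    """
    regions = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((max(0, int(start) - pad), int(end) + pad))

    merged = {}
    for chrom, intervals in regions.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:          # overlaps or touches the previous one
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged

def in_regions(merged, chrom, pos):
    """True if a 1-based VCF position falls inside a padded region."""
    if chrom not in merged:
        return False
    starts, ends = merged[chrom]
    zero_based = pos - 1
    i = bisect.bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]

def filter_vcf_lines(vcf_lines, merged):
    """Yield header lines plus the records that fall inside the regions."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_regions(merged, chrom, int(pos)):
            yield line
```

For files the size of the 1kGP release, a dedicated tool is the practical choice; the sketch is only meant to make the coordinate handling explicit.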

Preparing the Data
SVS has methods for PCA, relatedness estimation, fixation indices, inbreeding estimation and other population-level statistics that are designed for use with data from GWAS chips. But even with GWAS data, we don’t usually use all available SNPs to run these statistics. It is usually best to reduce the data to a set of highly-informative markers. Generally speaking, informative markers will have high call rates, high minor allele frequencies (MAF), and will mostly be independent of one another (low pairwise linkage disequilibrium, or LD).

When working with exome data in SVS, it is usually possible to identify a set of informative variants to use for population statistics. The most important requirement is that your variant calls were processed and/or imported in such a way that homozygous-reference genotypes are recorded explicitly rather than left as missing. In that scenario, the software will treat the data no differently than genotypes from a GWAS chip, and you can proceed to select a set of informative SNPs. The following slide summarizes such a process for the 1kGP exome data as compared with the corresponding Affy6 genotypes:


Both datasets were filtered to include only SNPs with a minor allele frequency greater than 1%, with a maximum pairwise LD r-squared of 0.10 within a 50-SNP window. The two processes returned similar-sized SNP lists: about 74k from Affy6 and 64k from the exomes. But there is minimal overlap in content here. Only 1011 of the Affy6 SNPs even fall within the same regions (exons +/- 5 bp) that the exome variants are drawn from. The Affy6 SNPs have an average MAF of 0.118; the average MAF in the exomes is 0.068. The median spacing between the Affy6 SNPs is 15 kb; the median spacing of the exome variants is 3.8 kb. The exome variants are certain to be under different selective pressures than the broadly distributed Affy6 SNPs.
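To make the two filters concrete, here is a minimal Python sketch of a MAF filter followed by greedy LD pruning. It is an assumption-laden illustration, not the SVS implementation: genotypes are coded as alternate-allele counts (0, 1, 2, or None for missing), r-squared is taken as the squared correlation of those counts, and the window is counted in SNP index positions. All function names are hypothetical.

```python
def minor_allele_freq(genos):
    """MAF from genotypes coded 0/1/2 (alt-allele count); None means missing."""
    called = [g for g in genos if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)

def r_squared(a, b):
    """Squared Pearson correlation of two genotype vectors (pairwise complete)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def select_markers(matrix, min_maf=0.01, max_r2=0.10, window=50):
    """Return indices of SNPs kept after MAF filtering and greedy LD pruning.

    matrix: list of SNPs in genomic order, each a list of per-sample genotypes.
    A SNP is kept if its MAF exceeds min_maf and its r-squared with every
    already-kept SNP inside the preceding window is at most max_r2.
    """
    kept = []
    for i, genos in enumerate(matrix):
        if minor_allele_freq(genos) <= min_maf:
            continue
        recent = [j for j in kept if i - j < window]
        if all(r_squared(genos, matrix[j]) <= max_r2 for j in recent):
            kept.append(i)
    return kept
```

The defaults mirror the thresholds quoted above (MAF greater than 1%, r-squared of at most 0.10, 50-SNP window). A greedy pass keeps the first SNP of each correlated cluster, so results depend on marker order.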

PCA Results
I used SVS to perform PCA on both datasets to find out if we get similar results from these two very different variant sets. The results were striking. The first two principal components, which broadly capture the continental-level stratification of the 1kGP samples, were essentially identical. As shown in the figure below, the only noticeable difference is that the first component is reflected around zero. Because the sign of a principal component is arbitrary, this would make no practical difference if, for example, you were using the principal components as a correction factor in an association test. Further comparisons show that the first four components of the two variant sets are very similar, but diverge beginning with the fifth component. Beyond that point, PCA is generally detecting different features of the data. So you can use PCA to identify population stratification, but you need to be cautious about how you interpret the results.
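One simple way to compare components from two variant sets while ignoring reflection is to look at the absolute correlation of the per-sample scores. The Python sketch below is only an illustration of that idea, not a description of how I made the comparison in SVS; the function names are hypothetical, and it assumes both sets of scores list the samples in the same order.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def compare_components(pcs_a, pcs_b):
    """Compare matching components from two PCA runs.

    pcs_a, pcs_b: lists of components, each a list of per-sample scores.
    Returns one (abs_correlation, reflected) tuple per component, where
    reflected is True when the two runs differ only by sign direction.
    """
    results = []
    for comp_a, comp_b in zip(pcs_a, pcs_b):
        r = pearson(comp_a, comp_b)
        results.append((abs(r), r < 0))
    return results
```

An absolute correlation near 1 means the two runs recovered the same axis, whichever way it happens to point; values that fall away from 1 mark the point where the two variant sets begin to capture different features.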


What about relatedness testing?
Another common application for population-based data is relatedness testing. I ran the two variant sets through the SVS Identity-by-Descent (IBD) algorithm to see if there were any related samples in the 1kGP phase 3 data.

We must be careful about the structure of the population in this analysis. For the IBD method to give uniform results, the subjects need to come from a homogeneous population. The IBD algorithm is based on observed-versus-expected allele sharing, and the expected sharing rate for each SNP is based on that SNP’s observed MAF.

If subjects are drawn from two ethnic groups, and one group is much larger than the other, the majority group will establish the expected sharing rate. A common result of this is that the minority group will appear to have excess sharing at a large number of loci with “rare” minor alleles, and will have inflated IBD estimates as a result. This is especially problematic when the minority group is from a population that is not well represented in HapMap and other diversity catalogs. I first noticed this several years ago in a GWAS that included several Native Americans in addition to the Caucasian majority. The Native Americans were all unrelated, but had pair-wise IBD estimates indicating that they were all inter-related at the level of 2nd-3rd degree relatives. The 1kGP data has similar effects, but with much more diversity in the data.
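A toy calculation shows why the pooled allele frequency misleads. The numbers below are made up purely for illustration (900 majority samples with an allele at 1%, 100 minority samples with the same allele at 40%), Hardy-Weinberg proportions are assumed, and this is not the SVS IBD algorithm, only the intuition behind the inflation.

```python
def pooled_frequency(groups):
    """Allele frequency across groups given (n_samples, allele_freq) pairs."""
    total = sum(n for n, _ in groups)
    return sum(n * p for n, p in groups) / total

def prob_both_carry(p):
    """Chance that two unrelated people each carry at least one copy."""
    carrier = 1.0 - (1.0 - p) ** 2
    return carrier ** 2

# Hypothetical cohort: (sample count, allele frequency) per group.
groups = [(900, 0.01), (100, 0.40)]

pooled = pooled_frequency(groups)              # dominated by the majority group
expected = prob_both_carry(pooled)             # sharing the pooled MAF predicts
observed_minority = prob_both_carry(0.40)      # sharing within the minority group

print(f"pooled frequency: {pooled:.3f}")
print(f"expected sharing from pooled MAF: {expected:.4f}")
print(f"actual sharing within minority group: {observed_minority:.4f}")
```

With these made-up inputs the pooled frequency is about 0.049, so two unrelated people are expected to both carry the allele less than 1% of the time, while two unrelated members of the minority group both carry it about 41% of the time. Repeated across many such loci, that gap is what reads as excess sharing.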

I ran the IBD algorithm on the same two variant sets that were used for PCA. As expected, the ethnic diversity creates noise that makes it difficult to distinguish between more distant relationships, but the Affy6 and exome data agree that there appear to be 12 first-degree relative pairs in the 1kGP phase 3 version 5 data. The release notes indicate the removal of related samples in a previous version, but I think that they missed a few. Consider the following table:

Sample1   Sample2   Population   SuperPop   IBD (Exome)   IBD (Affy6)   Relationship
NA20891   NA20900   GIH          SAS        0.531         0.542         Parent-Child
NA20882   NA20900   GIH          SAS        0.545         0.539         Parent-Child
HG03733   HG03899   STU          SAS        0.587         0.571         Full Sib
HG03873   HG03998   ITU – STU    SAS        0.572         0.564         Full Sib
HG03750   HG03754   STU          SAS        0.557         0.544         Parent-Child
NA19904   NA19913   ASW          AFR        0.500         0.500         Parent-Child
NA20317   NA20318   ASW          AFR        0.500         0.500         Parent-Child
NA20320   NA20321   ASW          AFR        0.500         0.500         Parent-Child
NA20334   NA20355   ASW          AFR        0.500         0.500         Parent-Child
NA20359   NA20362   ASW          AFR        0.500         0.500         Parent-Child
HG02429   HG02479   ACB          AFR        0.492         0.483         Full Sib
NA19331   NA19334   LWK          AFR        0.424         0.439         Full Sib

IBD estimates of about 0.5 indicate a first-degree relationship. The values in the table range from 0.424 to 0.587. The values are a little bit noisy due to the multiple ethnic groups, but I’m quite certain of the relationships. I re-ran the analysis separately within the SAS (South Asian) and AFR (African) super-populations, and the results within the groups were much cleaner and confirmed this result.

The IBD algorithm also gives the probability of sample pairs sharing 0, 1 or 2 alleles IBD genome-wide, which can be used to infer the type of relationship. Parent-offspring pairs should have 100% sharing of 1 allele, and no sharing of 0 or 2 alleles. Sibling pairs should share 25%, 50%, and 25% respectively for 0, 1 and 2 copies. Based on this data (not shown), we can determine the types of relationships observed. Two results stand out to me:

  1. Samples NA20882, NA20891 and NA20900 form a family trio.
  2. Samples HG03873 and HG03998 appear to be siblings, but are reported to be from different populations. One is reported to be Indian Telugu (ITU), and the other Sri Lankan Tamil (STU). I reviewed the principal components to see if one of the pair might be mislabeled, but the ITU and STU groups are very similar in the PCA results, even when the South Asian populations are examined alone.
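The way the 0, 1 and 2 allele sharing probabilities separate parent-child pairs from siblings can be sketched in Python. The overall IBD proportion is half the probability of sharing one allele plus the probability of sharing two, which is why both relationship types land near 0.5. The function names and the cutoffs below are hypothetical, chosen only for illustration; they are not the rules SVS applies, and real estimates will be noisier than the ideal values.

```python
def pi_hat(z0, z1, z2):
    """Overall proportion of alleles shared IBD from the three probabilities."""
    return 0.5 * z1 + z2

def classify_first_degree(z0, z1, z2, tol=0.15, edge=0.1):
    """Rough relationship call from IBD sharing probabilities.

    tol:  how far the overall IBD proportion may sit from 0.5 (hypothetical).
    edge: cutoff for treating the 0-copy or 2-copy share as absent (hypothetical).
    """
    if abs(pi_hat(z0, z1, z2) - 0.5) > tol:
        return "not first-degree"
    if z0 < edge and z2 < edge:
        return "Parent-Child"      # nearly all sharing is exactly one allele
    if z0 > edge and z2 > edge:
        return "Full Sib"          # sharing is spread over 0, 1 and 2 alleles
    return "ambiguous"

# Idealized expectations from the text.
print(classify_first_degree(0.0, 1.0, 0.0))
print(classify_first_degree(0.25, 0.5, 0.25))
```

The two idealized calls come out as Parent-Child and Full Sib respectively, even though both pairs have an overall IBD proportion of exactly 0.5.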

Final Thoughts
The SVS software doesn’t distinguish between sequence variants and chip-based genotypes for most genotype analysis functions. As such, exome variant data can be used for many GWAS-style analysis procedures. But be aware that exome data is distributed very differently than are the SNPs from GWAS chips, and you need to be cautious about interpreting the results.

Posted in Customer Questions, General statistical genetics principles, How to's and advanced workflows
Andreas Scherer

Final Thoughts on PAGXXIII


The Plant & Animal Genome XXIII Conference (PAG) was again a success. It is the venue where leading genetic scientists and researchers in plant and animal research meet with their peers, and the event continues to grow. The largest share of registrants comes from academia (64%), with industry (25%) and government (11%) making up the remainder. This year, well over 3000 attendees from more than 50 countries visited. It is a truly global event, with about 40% of attendees traveling from outside the USA.

One of the highlights was the plenary sessions. The Town & Country Ballroom was packed each and every time. Here is a brief overview:
Continue reading

Posted in About GHI, Big picture

VarSeq helps fuel expansion at NorthShore University HealthSystem


Golden Helix and the NorthShore University HealthSystem recently announced our collaboration as NorthShore puts the VarSeq software to work in its gene panel pipeline. VarSeq will be used in NorthShore’s clinical lab to help them realize their aggressive expansion plan for 2015.

The NorthShore Next Generation Sequencing (NGS) lab focuses primarily on oncology, covering both somatic mutations in a variety of tumor types and inherited cancer syndromes. The lab hopes to move forward with exome sequencing later in 2015.

We are very excited to work with NorthShore in the coming year, and we look forward to supporting our many other collaborations as well. Read the full press release here.

Posted in Big picture, Clinical genetics, News, events, & announcements
Andreas Scherer

Genetic Testing for Cancer


In 1914 the German cytologist Theodor Boveri coined the phrase “Cancer is a disease of the genome”. At the time, his ideas were as revolutionary as they were contested. Fast forward more than a hundred years, and Next-Generation Sequencing permits a highly sensitive analysis of cancer cells. It can help us to understand mutations associated with cancer development and progression. It also reveals other genomic rearrangements previously unknown to occur in the cancer genome. Translation of these findings for clinical purposes is increasingly part of standard care today. I fully believe that next-generation sequencing will rapidly become a powerful tool for the personalized diagnosis and management of cancer. This e-book will focus on the parts of this process that are best understood: Cancer Gene Panels.
Continue reading

Posted in About GHI, Best practices in genetic analysis, Bioinformatic support, Clinical genetics
Cheryl Rogers

Dr. Andreas Scherer to speak at ITI 2015

The Integrative Therapies Institute will soon host its annual conference, ITI 2015, from January 23rd through the 25th in sunny San Diego, and our own Dr. Andreas Scherer has been invited to speak.

Some of the most prominent genomic and integrative medicine specialists will gather at ITI 2015 to share case studies and protocols with the community. Attendees can expect to get keen insights into best practices, aimed at improving patient outcomes.

Dr. Scherer will present Bioinformatics of Cancer Gene Panels at 5:30 pm on Friday, January 23rd. The talk will address the value of NGS in a clinical setting as well as the bioinformatics tools and best practices used to achieve the goal of a reproducible workflow for analyzing NGS gene panel data.

For more information or to register for ITI 2015, please visit the official website here.

Posted in About GHI, Best practices in genetic analysis, Big picture, Clinical genetics
Gabe Rudy

In Pursuit of Longevity: Analyzing the Supercentenarian Whole Genomes with VarSeq

If you haven’t been closely watching the twittersphere or other headline sources of the genetics community, you may have missed the recent chatter about the whole genome sequencing of 17 supercentenarians (people who live beyond 110 years).

While genetics only explains 20-30% of the longevity of those with average life-spans, it turns out there are a number of good reasons to think extreme longevity has a very large genetic component (and, interestingly, not a lot to do with lifestyle choices like smoking, alcohol use, exercise, etc.).

The collaboration between Stanford researchers (including the visionary Dr. Leroy Hood) and the Supercentenarian Research Foundation resulted in an open access paper in PLOS ONE about the sequencing and analysis of 17 of the 22 supercentenarians alive in the US.

This group clearly wants their data and results to have the best chance of making an impact and being of significance to the genetics community. Along with the paper being open, there are supplemental tables of various sets of interesting variants, as well as a reference to their project’s website with directions for requesting access to the full raw variant calls for each individual.

Continue reading

Posted in Best practices in genetic analysis, Big picture, Clinical genetics
Andreas Scherer

What to expect from Golden Helix in 2015


There is a lot we can be grateful for at Golden Helix. The past year was marked by two major breakthrough launches. Earlier in 2014, we shipped SVS 8, which unified SVS with our GenomeBrowse product and improved SVS’ data management and visualization capabilities. In addition, we added a number of new methods in SVS, such as SKAT-O, MM-KBAC, and various genomic prediction algorithms.
Continue reading

Posted in About GHI, Big picture
Cheryl Rogers

PAG Bound!

Once again, we will be kicking off our year with our annual trip to San Diego for PAG XXIII. This year, it could not come at a better time. Over the last few weeks, it has been bitterly cold in Montana, with temps barely reaching above zero degrees, and I for one am looking forward to the warm sun. Moreover, this being my first PAG, I am really looking forward to meeting some of you!

This year you will find us at booth 124 and with the addition of Genomic Prediction to SVS, we have some great product demonstrations as well as t-shirt give-aways lined up for you. If you haven’t seen our latest designs, you can check them out here.

Here is our demo and t-shirt give-away schedule (bold times are t-shirt give-aways!):

Sunday, January 11

  • 3:15 pm – GenomeBrowse: Curating Your Own Reference Sequence and Gene Track
  • 5:00 pm – GWAS using Sus Scrofa as a model
  • 7:00 pm – GenomeBrowse: Using Evernote to Share Publication Quality Plots and Notes

Monday, January 12

  • 11:00 am – Genomic Prediction with gBLUP and Bayes C-Pi
  • 1:30 pm – GenomeBrowse: Curating Your Own Reference Sequence and Gene Track
  • 3:30 pm – GWAS using Sus Scrofa as a model

Tuesday, January 13

  • 10:15 am – RNASeq Differential Expression using Public Data from Mus musculus
  • 1:30 pm – Genomic Prediction with gBLUP and Bayes C-Pi

This year’s PAG is certain to be superb with a great line-up of speakers, workshops, and presentations. We can’t wait and we hope to see you there!

Posted in News, events, & announcements, Plant & animal
Cheryl Rogers

2014 in a Nutshell

It’s cliché, I know, but wow…2014 flew by! And what a great year it was for the Golden Helix team: we made upgrades to both GenomeBrowse and SVS and released a brand new product, VarSeq!

In April, we released GenomeBrowse 2.0, which was a reflection of our most frequent user requests. Users now have the ability to upload genome reference sequences through our Data Conversion Wizard. GenomeBrowse also now supports BED and WIG files as well as several other file formats like BAM, GTF, 2Bit, VCF, FASTA, TSV, and CSV. It also includes native integration of Evernote and the ability to access files over a network (FTP or HTTP) or the cloud. A new remote-control feature lets researchers drive GenomeBrowse programmatically over HTTP, so its visualization can be integrated into existing workflows.

Continue reading

Posted in Uncategorized
Andreas Scherer

Golden Helix Gives Back – The Winners

Last month, I announced our Golden Helix Gives Back Campaign. During times like this, when funding is tight, we wanted to make a statement to our community. We at Golden Helix are committed to empowering researchers and practitioners in the life science field. For those hard-working people, it is nice to catch a break from time to time.

After our announcement, we were surprised at the volume of inquiries, numbering into the hundreds. While we wish we could help everyone, our team of reviewers had the difficult job of selecting the three researchers that we felt were most deserving.
Continue reading

Posted in About GHI, Big picture