Creating Annotation Tracks from 1000 Genomes Phase 1 Data


If you have ever worked with NGS variant data, you may have come to realize that the first task at hand is the seemingly simple categorization of your variants into two bins: known and novel.

Of course, if you’ve ever worked with NGS variant data, you may have also come to the realization that this step is more complex than it seems. It requires the important choice of how you define a variant as “known” or “common” and what the implications of that choice are for your analysis. You are not short on sources for known variants, from the venerable dbSNP archive to the clean-slate 1000 Genomes project. It was even mentioned at ASHG/ICHG this year that Complete Genomics intends to add a “globally diverse” panel of reference genomes to their current repository of 69 public genomes.

Often the best choice for your novel variant categorization is to use a reference panel of control individuals with the closest matching ethnicity. But rather than have to face the task of compiling all those reference panels yourself for analysis, it may be preferable to curate them from a known source. In this blog post we will demonstrate the process for curating this data using the 1000 Genomes project’s data on 1,094 individuals of diverse origin as an example. The results of this process are annotation tracks of allele frequencies that can be used in the SNP & Variation Suite (SVS) genome browser and filtering workflows (like those in our NGS Variant Analysis tutorial).

About 1000 Genomes Project to Date
The goal of the 1000 Genomes project is to “find most genetic variants that have frequencies of at least 1% in the population studied.” The project was broken down into three pilot projects and the main project. The pilot projects served to assess and help define the project specifications. Data were analyzed by multiple centers on various platforms, and the methods for capturing gene regions were also assessed. The pilot projects are now complete, and they determined that 4x coverage was sufficient to meet the goal of identifying most of the variants with a frequency greater than or equal to 1%. The project has now moved on to the main project, which is broken up into three phases: Phase 1 will sequence 1,167 samples from 13 populations, Phase 2 will sequence 633 samples from 7 populations, and Phase 3 will sequence 779 samples from about 8 populations. Unlike the HapMap project, no assumptions regarding “health” are made with these samples.

To date, a subset of 1,094 samples from Phase 1 has had whole genome sequencing performed and the data made available. We will use this dataset to create variant frequency annotations for the whole population set and sub-populations.

Let’s Start Curating
For the 1,094-sample public data set, there are two types of files: an “All Sites” file and genotype call files split by chromosome.

The “All Sites” file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) gives counts of all alleles, alternate alleles, and alternate allele frequencies (AAF). The AAF is computed from a joint analysis of the alignments of all samples as the “alt alleles” counts vary greatly. This file was used directly to create the 1kG_Phase1_All-Sites-2011_05-GHI_GRCh_37_Homo_sapiens annotation track available to download in SVS. No extra work was required to create this track other than selecting the proper fields from the VCF file to include. The header information in the VCF file claimed that sites with more than one alternate allele would have the allele frequencies listed for each ALT allele in the same order as listed in the ALT field. In practice there was only ever one alternate allele frequency per site, so the same frequency was used for each “Ref/Alt” pair.
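
As a rough illustration of "selecting the proper fields", the sketch below pulls the alternate allele frequency out of a single sites-only VCF record. It assumes the frequency is stored under the standard VCF INFO key AF (with AC and AN for the counts); the function names and the example record are made up for illustration and are not taken from the real file or from SVS.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict; flag entries map to True."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def site_frequencies(vcf_line):
    """Return one (chrom, pos, ref, alt, aaf) tuple per ALT allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info_field = fields[:8]
    info = parse_info(info_field)
    alts = alt.split(",")
    # Assumption: the frequency lives under the INFO key "AF".
    afs = [float(x) for x in info["AF"].split(",")]
    if len(afs) == 1:
        # Mirror the behaviour described above: a single AF value is
        # reused for every Ref/Alt pair at the site.
        afs = afs * len(alts)
    return [(chrom, int(pos), ref, a, f) for a, f in zip(alts, afs)]


# Made-up example record, not a line from the real file.
line = "1\t10583\trs0000\tG\tA,T\t100\tPASS\tAN=2184;AC=314;AF=0.14"
rows = site_frequencies(line)
for row in rows:
    print(row)
```

Each returned tuple corresponds to one row of a frequency track: location, reference allele, alternate allele, and AAF.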

The second type of file contains genotype calls which we enhanced using SNPTools (an imputation program) to make the calls more accurate. Because the calls are imputed, there are no missing values in the calls for each variant. As a result, the genotype calls are not directly equivalent to the allele count results in the “All Sites” file. The benefit of the genotype calls, however, is that the samples can be subdivided by population to get population-specific variant genotype calls as well as population-specific alternate allele frequencies.

Defining Populations

So how should the populations be defined in order to build these alternate allele frequencies? Well, there are a couple of options. The populations of the samples given by the 1000 Genomes project are from several different continents; in the first phase of the project, the continents include the Americas, Europe, East Asia, and West Africa.

Population Code | Population Description                         | Continent   | Number of Samples
ASW             | HapMap African ancestry individuals from SW US | Americas    | 61
CEU             | CEPH individuals                               | Europe      | 87
CHB             | Han Chinese in Beijing                         | East Asia   | 97
CHS             | Han Chinese South                              | East Asia   | 100
CLM             | Colombian in Medellin, Colombia                | Americas    | 60
FIN             | HapMap Finnish individuals from Finland        | Europe      | 93
GBR             | British individuals from England and Scotland  | Europe      | 89
IBS             | Iberian populations in Spain                   | Europe      | 14
JPT             | Japanese individuals                           | East Asia   | 89
LWK             | Luhya individuals                              | West Africa | 97
MXL             | HapMap Mexican individuals from LA California  | Americas    | 66
PUR             | Puerto Rican in Puerto Rico                    | Americas    | 55
TSI             | Toscan individuals                             | Europe      | 98
YRI             | Yoruba individuals                             | West Africa | 88

It seems that the “Continent” categorization results in large enough groups to provide meaningful allele frequencies. But to validate this approach, let’s use principal component analysis (PCA) of the genotypes for all samples. We should find in the first two principal components (PCs) a strong segregation by population and, hopefully, continent.

This is easier said than done. Just take a look at the size of those genotype VCF files! We went ahead and imported those massive VCF files all at once into a Golden Helix SVS project. The import took several days on a Windows 2008 server with 8 cores and 64 GB of RAM but went smoothly and resulted in a spreadsheet with 38,877,749 genotype columns and 1,094 sample rows.

Before we ran PCA on the genotype data, a basic set of filters, available in the Golden Helix SVS program, was applied to get a set of genotypes of common variants in linkage equilibrium for autosomes only.

  1. LD Pruning: The variant genotypes (38,877,749 in total) for all populations were first pruned by linkage disequilibrium (LD) to inactivate the first marker of any pair of markers within a 50-marker window that had a Composite Haplotype Method (CHM) R^2 value greater than 0.5. This removed 12,189,892 markers, leaving a total of 26,687,857 markers.
  2. Autosomes: Only markers from autosomes were selected; all other markers were filtered from the genotype spreadsheet. This left 25,675,011 markers.
  3. MAF >= 0.3: Markers with a Minor Allele Frequency (MAF) below 0.3 were filtered out, so only markers with an MAF >= 0.3 were kept in the spreadsheet for PCA calculations. This left only 493,175 markers.
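
The three filters above can be sketched on a toy genotype matrix as follows. This is not the SVS implementation: the squared Pearson correlation of alternate allele counts is used here as a simple stand-in for the CHM R^2 statistic, and the function names and the tiny data set are invented for illustration.

```python
import numpy as np


def maf(genotypes):
    """Minor allele frequency per marker (samples x markers, values 0/1/2)."""
    alt_freq = genotypes.mean(axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)


def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either marker is monomorphic."""
    xc = x - x.mean()
    yc = y - y.mean()
    denom = (xc @ xc) * (yc @ yc)
    if denom == 0:
        return 0.0
    return float((xc @ yc) ** 2 / denom)


def ld_prune(genotypes, window=50, r2_max=0.5):
    """Boolean mask of markers kept after pairwise pruning within a window."""
    n_markers = genotypes.shape[1]
    keep = np.ones(n_markers, dtype=bool)
    for i in range(n_markers):
        for j in range(i + 1, min(i + window, n_markers)):
            if r_squared(genotypes[:, i], genotypes[:, j]) > r2_max:
                keep[i] = False  # inactivate the first marker of the pair
                break
    return keep


# Toy data: 6 samples x 5 markers, coded as alternate allele counts.
genotypes = np.array([
    [0, 0, 2, 1, 0],
    [1, 1, 0, 1, 0],
    [2, 2, 1, 0, 0],
    [0, 0, 1, 2, 0],
    [1, 1, 2, 0, 0],
    [2, 2, 0, 1, 1],
])
chroms = ["1", "1", "2", "X", "3"]

autosomes = {str(c) for c in range(1, 23)}
is_autosome = np.array([c in autosomes for c in chroms])

# Apply the filters in the same order as the list above.
keep = ld_prune(genotypes) & is_autosome & (maf(genotypes) >= 0.3)
print(keep)
```

In the toy data the first marker is dropped because it duplicates the second, the fourth because it sits on chromosome X, and the fifth because its MAF is below 0.3.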

PCA was performed looking for the top 10 principal components using an additive genetic model, and the markers were normalized based on the Theoretical Sigma at Hardy-Weinberg Equilibrium (HWE). Outliers were not removed, and the components were not recomputed. See Figure 1 for a scree plot of the first 10 eigenvalues computed from the principal component analysis.
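
For readers who want to see the calculation outside of SVS, here is a minimal NumPy sketch of this kind of PCA. It assumes that "Theoretical Sigma at HWE" for an additive coding means the standard deviation sqrt(2p(1-p)) of the alternate allele count at allele frequency p; that reading, the function name, and the simulated two-population data are assumptions, not a description of SVS internals.

```python
import numpy as np


def genotype_pca(genotypes, n_components=10):
    """Top eigenvalues/eigenvectors of the normalized sample-by-sample matrix."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0               # alternate allele frequency per marker
    sigma = np.sqrt(2.0 * p * (1.0 - p))   # assumed HWE standard deviation
    polymorphic = sigma > 0                # monomorphic markers carry no signal
    z = (g[:, polymorphic] - 2.0 * p[polymorphic]) / sigma[polymorphic]
    k = z @ z.T / z.shape[1]               # samples x samples
    eigvals, eigvecs = np.linalg.eigh(k)   # returned in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]


# Simulated data: two groups of 10 samples with very different allele frequencies.
rng = np.random.default_rng(0)
pop_a = rng.binomial(2, 0.1, size=(10, 200))
pop_b = rng.binomial(2, 0.9, size=(10, 200))
genotypes = np.vstack([pop_a, pop_b])

eigvals, eigvecs = genotype_pca(genotypes, n_components=10)
print(eigvals.round(2))  # the values a scree plot would show
```

Plotting the first two columns of `eigvecs` against each other, colored by population, gives the same kind of scatter plot described below.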

Figure 1: Scree plot of eigenvalues

The first two principal components were plotted in an XY Scatter plot in Golden Helix SVS and the data points were colored based on population group.

Figure 2: EV1 vs EV2 Colored on Population Group

By examining the plot in Figure 2, we can clearly see three major clusters of individuals. But this is also a perfect example of what admixed populations look like in PCA plots. Specifically, the data points for the samples in the ASW, PUR, CLM, and MXL populations appear to be mixtures of the most distant population clusters. These populations, not coincidentally, also happen to be the four populations in the Americas continent classification group. After these populations were hidden from the PCA plot, three clean continental groups were clearly defined (Figure 3).

Figure 3: EV1 vs EV2 Colored on Population – Admixture Groups Removed

Naturally, the remaining populations were grouped into a European population, an Asian population, and an African population as follows:

Population Group | Continent Group | Number of Samples
CEU              | Europe          | 87
FIN              | Europe          | 93
GBR              | Europe          | 89
IBS              | Europe          | 14
TSI              | Europe          | 98
Total            | Europe          | 381
CHB              | Asia            | 97
CHS              | Asia            | 100
JPT              | Asia            | 89
Total            | Asia            | 286
LWK              | Africa          | 97
YRI              | Africa          | 88
Total            | Africa          | 185
Total            | Overall         | 852
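
The grouping above is easy to capture as a lookup table. The short sketch below encodes the population-to-continent assignment and re-derives the group totals from the per-population sample counts given earlier; the variable and function names are illustrative only.

```python
# Continent group for each non-admixed population, as defined in the table above.
CONTINENT_GROUP = {
    "CEU": "Europe", "FIN": "Europe", "GBR": "Europe", "IBS": "Europe", "TSI": "Europe",
    "CHB": "Asia", "CHS": "Asia", "JPT": "Asia",
    "LWK": "Africa", "YRI": "Africa",
}

# Sample counts per population code for the 1,094 Phase 1 samples.
SAMPLE_COUNTS = {
    "ASW": 61, "CEU": 87, "CHB": 97, "CHS": 100, "CLM": 60, "FIN": 93, "GBR": 89,
    "IBS": 14, "JPT": 89, "LWK": 97, "MXL": 66, "PUR": 55, "TSI": 98, "YRI": 88,
}


def group_totals(sample_counts, continent_group):
    """Sum sample counts per continent group, skipping ungrouped populations."""
    totals = {}
    for population, n_samples in sample_counts.items():
        group = continent_group.get(population)
        if group is None:
            continue  # admixed populations (ASW, CLM, MXL, PUR) are left out
        totals[group] = totals.get(group, 0) + n_samples
    return totals


totals = group_totals(SAMPLE_COUNTS, CONTINENT_GROUP)
print(totals)
print(sum(totals.values()))
```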

Now that we have clearly defined and validated continent groups, the next step is to create annotation tracks for population specific alternate allele frequencies. First we need to create subset spreadsheets for each continent. Initially though, these subset spreadsheets will contain the full set of variants. To get variants unique to each continent group we will count the number of alternate alleles for each variant in our population subset and filter out those variants that don’t have at least one alternate allele (i.e. all samples are the reference for that locus). The number of variants per continent group that remain are listed in the table below.

Continent Group | Number of Variants
Europe          | 16,404,865
Asia            | 14,114,929
Africa          | 23,071,081

We can then compute the alternate allele frequencies (the count of alternate alleles divided by the total number of alleles) for each variant site in each continent group, and build annotation tracks that contain the location of each variant, the reference allele, the alternate alleles, and the alternate allele frequencies.
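
The per-group calculation can be sketched as below. It assumes diploid genotypes coded as alternate allele counts (0, 1, or 2) with no missing calls, which matches the imputed genotype files described earlier; the function name and the four-sample example are hypothetical.

```python
import numpy as np


def group_alt_allele_freq(genotypes, sample_groups, group):
    """Return (present, aaf) for one continent group.

    present: True where at least one sample in the group carries an alternate allele.
    aaf: alternate allele count divided by the total number of alleles in the group.
    """
    rows = np.array([g == group for g in sample_groups])
    subset = genotypes[rows]
    alt_count = subset.sum(axis=0)
    total_alleles = 2 * subset.shape[0]  # diploid samples, no missing calls
    return alt_count > 0, alt_count / total_alleles


# Toy data: 4 samples x 3 variants.
genotypes = np.array([
    [0, 1, 0],
    [0, 2, 0],
    [1, 0, 0],
    [2, 0, 0],
])
groups = ["Europe", "Europe", "Africa", "Africa"]

present_eu, aaf_eu = group_alt_allele_freq(genotypes, groups, "Europe")
print(present_eu, aaf_eu)
```

Only the variants flagged in `present` would go into that group's track, each with its group-specific frequency.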

The names of these annotation tracks are:

  • 1kG_Phase1_AFR-Sites-2011_05-GHI_GRCh_37_Homo_sapiens.idf
  • 1kG_Phase1_ASN-Sites-2011_05-GHI_GRCh_37_Homo_sapiens.idf
  • 1kG_Phase1_EUR-Sites-2011_05-GHI_GRCh_37_Homo_sapiens.idf

These annotation tracks can be downloaded through the Golden Helix SVS Annotation Track Manager by opening Golden Helix SVS, going to Tools > Manage Annotation Tracks, and clicking on the Download from Network button. As a reminder, all of these annotation tracks are for hg_19 (GRCh_37), so the filter for annotation tracks must be set to this build for the tracks to be visible in the download list.

Using These Annotation Tracks
Here are just some examples of ways these annotation tracks might be used to analyze your own variant data set:

  1. Filter sequence data based on presence (or absence) in the 1kG Phase 1 All Sites probe track or the continent specific probe tracks.
    Select > Filter by Annotation > Filter by Probe Track Membership
  2. Bin the variants based on the Alternate Allele Frequencies in the 1kG Phase 1 continent specific probe tracks. This can then be used for the CMC collapsing method.
    Quality Assurance > Genotype > Variant Binning by Frequency Track and
    Analysis > Collapsing Methods > CMC with Hotelling T Squared Test or
    Analysis > Collapsing Methods > CMC with Regression.

Note: the above filtering and analysis methods require the Sequence Module for Golden Helix SVS.
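
Conceptually, both uses above come down to looking up each of your variants in a frequency track. The sketch below is not SVS code: it represents a track as a plain dictionary keyed by (chromosome, position, ref, alt), and the 1% cut-off between "rare" and "common" is an arbitrary value chosen for illustration.

```python
def classify_variants(variants, track, rare_threshold=0.01):
    """Bin variants as novel, known_rare, or known_common using a frequency track.

    track maps (chrom, pos, ref, alt) to an alternate allele frequency.
    rare_threshold is a hypothetical cut-off, not an SVS default.
    """
    bins = {}
    for variant in variants:
        aaf = track.get(variant)
        if aaf is None:
            label = "novel"          # not present in the track at all
        elif aaf < rare_threshold:
            label = "known_rare"
        else:
            label = "known_common"
        bins.setdefault(label, []).append(variant)
    return bins


# Made-up variants and track entries.
variants = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 300, "G", "A")]
track = {("1", 100, "A", "G"): 0.25, ("1", 200, "C", "T"): 0.004}

bins = classify_variants(variants, track)
print(bins)
```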

With all the great public data repositories out there, it can still seem like a large hurdle to utilize those often unrefined sources in your own experiment’s analysis. At Golden Helix, we think providing public data in a pre-curated and immediately useful form, like we showed in this blog post, is just as important as the analysis methods themselves in performing meaningful analysis. We will continue to listen to our customers and incorporate new data releases and sources into our annotation repository. Of course, everything we did here is completely repeatable and supported within SVS, so you can just as easily create private or public annotation tracks. …And that’s my two SNPs.

Greta Linse Peterson

About Greta Linse Peterson

Greta Peterson is Golden Helix’s Director of Services. Her main duty is managing the Field Application Scientist and Customer Support teams. Greta and her team are also responsible for software quality control, ensuring that the software releases are subject to the most rigorous testing protocols, and for all the technical documentation and tutorials. In addition, Greta writes Python scripts for extending SVS functionality and conducts software demonstrations and training for customers and prospects. Greta joined Golden Helix in 2008 when she completed her Master’s degree in both Mathematics and Statistics at Montana State University in Bozeman. When Greta is not working, she enjoys spending time with her family and hiking the surrounding areas of Bozeman.

30 Responses to Creating Annotation Tracks from 1000 Genomes Phase 1 Data

  1. yuandejian says:

    hi Greta
    could you tell me what softwares you use to generate the pca plot? Is eigensoft or gcta?
    And could you tell me how to make the figure 4s of this link
    Thank you very much and I look forward to your response.

  2. Giulio Genovese says:

    Do you have a list of those 493,175 markers available to share?
