It has now been about a year since Illumina and Affymetrix announced their respective exome genotyping arrays. Both products were launched with ambitious visions of how they would enable researchers to learn significantly more about the cause of human diseases.
Sales of the Illumina chip exceeded expectations, and the company said that it sold enough exome chips through the first quarter of 2012 to test 1.3 million samples. Exact sales figures for the Affy product are harder to come by, but the company reports strong interest and claims a competitive advantage over Illumina thanks to greater content. I’ve heard through various sources that sales of the exome chips are being driven by research consortia buying the chips in large volumes.
I haven’t yet had the opportunity to work with experimental data from either exome chip, but I have been very curious to learn more about the content available on each array. I have also been anxious to get some answers about a certain technical aspect of these products that has bothered me since they were announced.
So in the spirit of election season, I decided to research the candidates to see how they compare on some selected issues. I’ll share what I learned in this blog post. (Hint: the missing genotype calls from these chips might be more interesting than the successful calls.)
Comparing the Chips
I downloaded annotation files for both chips directly from the respective manufacturers. The Affymetrix annotation file is called “Axiom_Exome_1A.na32.annot.csv,” and the Illumina annotation file is called “HumanExome-12v1-1_A.csv.” Both annotation files give fairly extensive information about each variant, but I only used the mapping information and the specified reference and alternate alleles for these comparisons. I used public data from multiple sources, including dbNSFP and the NHLBI Exome Sequencing Project, to annotate the variants assayed by each chip and created the following table to compare them side-by-side. Let’s take a look at the numbers:
| Metric | Affymetrix Axiom Exome 1A | Illumina HumanExome12v1-1_A |
|---|---|---|
| SNVs in RefSeq exons | 286,144 | 225,467 |
| SNVs in RefSeq exons +/- 100 bp | 289,085 | 227,551 |
| Non-Synonymous SNVs (per dbNSFP) | 267,477 | 214,328 |
| Possibly or Probably Damaging SNVs (PolyPhen2 via dbNSFP) | 97,526 | 80,752 |
| “Damaging” SNVs (SIFT via dbNSFP) | 82,903 | 66,762 |
| SNVs found in NHLBI ESP | 242,972 | 188,781 |
| Alternate Allele Frequency of SNVs from NHLBI ESP (Q1/median/Q3) | 0.00028 / 0.00074 / 0.0052 | 0.00021 / 0.00059 / 0.0034 |
| RefSeq gene transcripts with at least 1 coding SNV | 23,844 (849 unique) | 23,028 (33 unique) |
| Exon SNVs per RefSeq transcript (Q1/median/Q3) | 6 / 11 / 20 | 4 / 9 / 16 |
As advertised, the Affy chip definitely seems to win the content battle. It has more total features than the Illumina chip, more coding variants, and more functionally significant variants. Figure 1 (below) clearly shows that the Affy chip has more variants per gene than its competitor. The Affy chip includes variants from 849 gene transcripts that are not represented at all on the Illumina chip, while Illumina has coverage of just 33 gene transcripts not covered by Affy. It is interesting to note that about 230,000 SNVs are shared by the two chips. That means that the Illumina chip has fewer than 13,000 unique features, while Affy has more than 100,000 unique features. Illumina’s marketing materials claim “unparalleled coverage of putative functional exonic variants” — that sounds like election year hyperbole and fuzzy math to me. In their defense, it may have been written before the formal launch of the Affy product.
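The shared and unique counts come down to set arithmetic on variant keys. Here is a minimal sketch of that idea, assuming each annotation file has already been reduced to (chromosome, position, ref, alt) tuples. The function name and the tiny demo lists are made up for illustration; they are not real chip content, and no column names from the vendor files are assumed.

```python
def compare_chips(affy_variants, illumina_variants):
    """Count shared and chip-specific variants.

    Each variant is a hypothetical (chrom, pos, ref, alt) tuple.
    """
    affy = set(affy_variants)
    illumina = set(illumina_variants)
    return {
        "shared": len(affy & illumina),
        "affy_only": len(affy - illumina),
        "illumina_only": len(illumina - affy),
    }


# Made-up demo data, for illustration only.
affy_demo = [("11", 100, "A", "G"), ("11", 113, "C", "T"), ("11", 121, "G", "A")]
illumina_demo = [("11", 100, "A", "G"), ("11", 113, "C", "T"), ("12", 500, "T", "C")]

summary = compare_chips(affy_demo, illumina_demo)
print(summary)
```

Keying on the alleles as well as the position means two chips that assay different alternate alleles at the same site are not counted as sharing a variant; whether that is the right definition depends on the comparison you want.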
A Technical Question
I have been bothered by a certain question ever since the exome chips were announced. I wondered how it was possible to assay so many variants in a space as small as the exome without having the allele probes overlap multiple variant positions. I have shown in a previous blog post how a single mismatched base in the target sequence will prevent a probe from hybridizing. My concern with the exome chips was that if multiple targeted variants are so close together that the probes cover two (or more) variant positions, and a subject happens to have the alternate allele for both variants on the same copy of the chromosome, then there is a chance that one or more probes will fail to hybridize. In such a scenario, the probes that do not hybridize will probably result in a failed genotype call.
I looked at the marker spacing of both arrays to determine whether in fact there was a risk for overlapping allele probes, and the answer was an overwhelming yes. Genotyping arrays typically have probes between 30 bp and 50 bp in length, and both chips have over 20,000 variants that are less than 10 bases from the previous variant. I picked a random gene (AHNAK) for closer examination and found serious problems on both chips. In a single exon of that gene, Affy has 7 consecutive SNPs with inter-marker spacing of 13, 8, 4, 18, 1 and 24 bp, while Illumina has 5 SNPs with spacing of 13, 8, 4, and 19 bp. This was cause for concern. I couldn’t figure out how to design probes to uniquely assay each variant, but thought that perhaps the manufacturers were smarter than I am and the array design was more sophisticated than I believed. So I went to the vendors for more information.
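The spacing check itself is simple to reproduce. The sketch below is one way to do it, not necessarily how the numbers above were produced: it sorts positions per chromosome and counts variants that fall fewer than a given number of bases after the previous one. The AHNAK gaps are taken from this post, but the starting coordinate of 1000 and the chromosome label are arbitrary placeholders.

```python
from collections import defaultdict
from itertools import accumulate


def close_neighbors(variants, max_gap=10):
    """Count variants lying fewer than max_gap bases after the previous
    variant on the same chromosome. `variants` is an iterable of (chrom, pos)."""
    by_chrom = defaultdict(set)
    for chrom, pos in variants:
        by_chrom[chrom].add(pos)
    count = 0
    for positions in by_chrom.values():
        ordered = sorted(positions)
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev < max_gap:
                count += 1
    return count


def positions_from_gaps(start, gaps):
    """Rebuild marker positions from a start coordinate and inter-marker gaps."""
    return list(accumulate(gaps, initial=start))


# Gaps quoted in the post; the start coordinate 1000 is a placeholder.
affy_ahnak = [("chr", p) for p in positions_from_gaps(1000, [13, 8, 4, 18, 1, 24])]
illumina_ahnak = [("chr", p) for p in positions_from_gaps(1000, [13, 8, 4, 19])]

print(close_neighbors(affy_ahnak, max_gap=10))      # gaps under 10 bp
print(close_neighbors(affy_ahnak, max_gap=30))      # gaps under a 30 bp probe length
print(close_neighbors(illumina_ahnak, max_gap=10))
```

Setting `max_gap` to the probe length (30 bp is the low end of the range mentioned above) shows that every consecutive pair in the Affy AHNAK example sits closer together than a single probe is long.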
I wrote identical emails to the technical support representatives at each company. I gave the respective examples from AHNAK and asked the following question:
“It seems therefore that the probes used to assay these SNPs must overlap one another. If a subject has a variant at one of these positions, then any other probes spanning that variant might fail to hybridize, resulting in failed genotype calls. Am I correct? Or is the product designed to account for this? If so, how?”
Illumina responded to me within hours, and Affy replied the following morning. I’ll share their complete responses here, with my own comments to follow.
Response from Illumina:
Thank you very much for your inquiry. There are several points to consider in answering this question.
-The HumanExome beadchip was designed by a consortium of scientists as a custom chip. While it is true that Illumina scored these assays, and that in the scoring process, SNPs in such close proximity would have been tagged in the scorefile, the decisions to include or exclude given assays from the design were made by the consortium.
-Unlike our GoldenGate genotyping technology, in which the allele-specific hybridization is coupled to the PCR amplification and labeling of the target sequences, the Infinium process starts off with an unbiased whole-genome amplification and fragmentation that results in a large excess of target sequence for hybridizing to the array probes.
-The exome chip probes target SNPs that are rare variants, with very low minor allele frequencies, further reducing the chances that any SNPs underlying the probe targets will affect results from that probe.
In summary, the unbiased nature of the amplification step, the overabundance of target sequence, and the extremely low MAF of the target SNPs all combine to reduce the likelihood that the assays for SNPs in close physical proximity will affect each other’s results.
If you have any further questions, please let us know.
Response from Affymetrix:
Thank you for contacting Affymetrix Applications Support.
You are correct. It is possible for nearby SNPs to interfere with each other, dependent on both proximity and frequency of the interfering SNP. Probesets can be designed to hybridize to either strand, so in many cases nearby variants were avoided on the Exome array by designing the probeset to hybridize to the other strand, with no interfering SNP. In some cases, such as the one you describe, this is impossible to fully arrange. Here we would have avoided using any probesets with known, interfering variants within 5 bp of the target, but at least a few of the listed variants would have to be interrogated by probesets with an interfering SNP within the 30 bp probe sequence.
In many cases, the potentially interfering SNP actually has very little impact on the measurement of the target SNP. In others, the effect can be quite large, but this is nearly always caught by our standard SNP QC filtering metrics.
The most likely outcome when the interfering SNP has a large effect, but low frequency, is that individuals with the interfering variant will receive a call of “no call” for the target SNP. There is the possibility of an incorrect call, say heterozygote when the truth is homozygote, but this turns out to be quite uncommon.
If the interfering SNP has a large effect and high frequency, the clustering solution for the target SNP will be visually incorrect, and will almost always be caught by our standard SNP QC metrics, as well. However, it’s always a good idea to visually inspect the clustering solutions for SNPs producing apparently significant results, especially in cases such as this where there’s reason to suspect problems.
The application of the advanced workflow (especially the R-script for SNP filtering) should be able to identify problems with interfering SNPs.
Hope this helps. Please feel free to contact us if you have additional questions.
The Illumina response opened by shifting blame for any design flaws to the consortium that designed the content. This is not unique–Illumina actually sells several genotyping chips, particularly in the agrigenomic sector, that were designed by scientific consortia and are therefore not fully supported by Illumina. But I was surprised that the Illumina representative was reluctant to take responsibility for the HumanExome chip, which has been such a central focus of recent marketing campaigns. They make a valid point that the majority of assayed SNPs are very rare, and therefore the probability of any given subject having two consecutive variants on the same chromosome is very low (I have no data to confirm if any pairs of consecutive rare variants are expected to be in LD or are independent). The comment about unbiased whole-genome amplification just confused me. It doesn’t matter how much target sequence you have–if the target doesn’t match the probe, it won’t hybridize. If anything, having an abundance of target sequence would seem to guarantee that the probes will hybridize to the copy of the chromosome without interfering SNPs, thus producing an incorrect genotype. Anyway, the bottom line is that Illumina admits to the possibility of neighboring SNPs interfering with one another.
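To put rough numbers on the rarity argument, here is a back-of-the-envelope sketch under loudly stated assumptions: both neighboring variants have the same alternate allele frequency (the Illumina ESP median of 0.00059 from the table), genotypes follow random mating, and the two variants are either fully independent or always travel together on the same haplotype. Real pairs will fall somewhere between these two extremes, and I have no data on where.

```python
def carrier_probability(haplotype_freq):
    """Probability that a diploid subject carries at least one copy of a
    haplotype with the given frequency, assuming random mating."""
    return 1.0 - (1.0 - haplotype_freq) ** 2


p = 0.00059  # median alternate allele frequency (Illumina SNVs found in ESP)

# Frequency of a haplotype carrying BOTH alternate alleles:
independent_hap = p * p  # the two variants occur independently
linked_hap = p           # the two variants always occur together

print(f"independent: {carrier_probability(independent_hap):.3g}")
print(f"always linked: {carrier_probability(linked_hap):.3g}")
```

Under independence the double-variant haplotype is vanishingly rare; under complete linkage it is as common as either variant alone, so every carrier of one variant would be exposed to the interference problem.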
The Affymetrix response had a different tone. They recognize the problem, describe the design process by which they attempt to account for it, and actually suggest workflows to address the issue. I found the contrast quite refreshing. The Axiom chemistry uses probes designed with the variant base at the tail end of the probe, as opposed to older designs with the variant base at the center of the probe. By putting the variant base at the end of the probe, they are able to design primers to assay a SNP from either direction on the DNA strand, making it possible to design unique primers for SNPs at very close intervals. I think that the Illumina Infinium system also places the variant at the end of the probe, but I don’t see any indication that they exploited this fact in the array design.
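The either-direction idea can be illustrated with a toy model. This is a sketch under my own simplifying assumptions, not Affymetrix's actual design algorithm: a probe is modeled as occupying the 30 bases on one side of the target, a side is rejected if another known variant lies within 5 bp of the target (the two figures come from the Affy response), and otherwise the side with fewer interfering variants wins. The function name and the "left"/"right" labels are hypothetical, and the coordinates reuse the AHNAK spacing with an arbitrary start of 1000.

```python
def choose_probe_side(target, known_variants, probe_len=30, exclusion=5):
    """Pick the flank whose probe footprint contains the fewest other variants.

    Returns (side, interfering_positions), or (None, []) when both flanks
    have another variant within `exclusion` bp of the target.
    """
    others = sorted(v for v in set(known_variants) if v != target)
    flanks = {
        "left": [v for v in others if target - probe_len <= v < target],
        "right": [v for v in others if target < v <= target + probe_len],
    }
    candidates = []
    for side, hits in flanks.items():
        if not any(abs(v - target) <= exclusion for v in hits):
            candidates.append((len(hits), side, hits))
    if not candidates:
        return None, []
    _, side, hits = min(candidates)
    return side, hits


# AHNAK spacing from the post (13, 8, 4, 18, 1, 24); start coordinate is a placeholder.
ahnak = [1000, 1013, 1021, 1025, 1043, 1044, 1068]
for pos in ahnak:
    print(pos, choose_probe_side(pos, ahnak))
```

In this toy model only the first and last SNPs in the cluster get a footprint free of other variants; the five in the middle each end up with at least one neighbor inside the 30 bp window, which is consistent with the Affy statement that some of these variants must be interrogated with an interfering SNP in the probe sequence.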
The exome chips provide an economical method to assay a large number of coding variants and investigate the role of rare DNA variants in causing disease. The potential for interference between neighboring SNPs is real, but I wouldn’t let it stop me from using either of the chips. Just be aware that if you see several consecutive missing genotypes for an individual where other subjects have good call rates, then it might be worth taking a closer look at the data. I’ve heard anecdotes from early adopters of the Illumina chip that the genotype calling has been “difficult.” It could be due in part to interfering SNPs, but I would guess that the rarity of the SNPs is a bigger issue, as that can make it very difficult to call genotypes based on signal clustering.
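One simple way to look for that pattern is to scan each subject's calls in chromosomal order for runs of no-calls. The sketch below assumes genotypes are held in a plain Python list with `None` marking a missing call; the function name and the threshold of three consecutive markers are arbitrary choices for illustration.

```python
def missing_runs(calls, min_run=3):
    """Return (start_index, length) for each run of at least `min_run`
    consecutive missing calls. `calls` holds one subject's genotypes in
    chromosomal marker order, with None marking a no-call."""
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call is None:
            if start is None:
                start = i  # a new run of missing calls begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Handle a run that extends to the last marker.
    if start is not None and len(calls) - start >= min_run:
        runs.append((start, len(calls) - start))
    return runs


# Made-up genotypes for one subject.
subject = ["AA", "AG", None, None, None, "GG", None, "AA"]
print(missing_runs(subject))
```

This only finds the runs; it does not check that the same markers call well in other subjects, so a flagged run should still be compared against the per-marker call rate across the cohort before blaming an interfering SNP.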
Which product wins?*** For me, it would depend on the application. If I had a cohort of subjects with existing GWAS data and I wanted to run an exome chip to supplement that data, or if I was only interested in the exome, I would choose the Affy product due to greater content and what I perceive as better product design. If I had a group of subjects that had never previously been genotyped on a genome-wide scale, and I wanted to do a thorough GWAS including exome variants, I would strongly consider Illumina and look at one of the combined GWAS+exome arrays. I think the HumanOmniExpressExome has a very nice balance of content, and would be a great choice for most applications. The Omni2.5Exome is great if you can afford it, but I think the Omni5Exome is overkill for most applications.
***Bryce’s opinions don’t necessarily reflect the opinions of Golden Helix Inc.