“Where is the missing heritability?” is a question asked frequently in genetic research, usually in the context of diseases with large heritability estimates, say 60-80%, for which only perhaps 5-10% of that heritability has been found. The difficulty seems to come down to the common disease/common variant hypothesis not holding up – or, perhaps more accurately, to the frequency of the assayed markers not being in line with the frequency of the disease (or specific sub-phenotype thereof). Most of the technologies directed toward finding the genetic links to diseases – e.g., the first generation of major microarray platforms used in genome-wide association studies (GWAS) – were developed using this hypothesis as a premise.
Limitations of First Generation Microarrays
One major limitation is that the microarrays used in most major GWAS efforts to date employ common genetic variants originally identified in a rather small number of presumably healthy people (HapMap Phase I). Many high-profile and heavily researched diseases, such as Type 1 diabetes, are really not so common, appearing in 1 person out of, perhaps, 500-800. Why, then, should we expect that common genetic polymorphisms found in a handful of HapMap individuals would be linked to the causes of disease in the relatively small proportion of people who have Type 1 diabetes? The assumption that common single nucleotide polymorphisms (SNPs) will reliably tag such variants is shaky.
Admittedly, we will find some additional signal if we use massive sample sizes, but we will still be missing the bulk of the heritability because of one important mathematical fact: correlation does not obey a transitive relationship. If A is correlated with B, and B with C, then A is not necessarily correlated with C, unless the correlations are perfect. The first generation of microarrays operated on the premise that nearby SNPs in linkage disequilibrium will be sufficiently correlated with the causative SNP to get in the ballpark of the causative variant. That is, they assume transitivity. However, if the causative variants are rare in the healthy population, then they are unlikely to be highly correlated with common variants typed in healthy individuals, and larger sample sizes are going to give, at best, diminishing returns.
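A small simulation makes the non-transitivity concrete. The variables below are synthetic Gaussian stand-ins with arbitrary 0.6 correlations, not genotype data: A and C are each correlated about 0.6 with B, yet correlate only about 0.36 with each other (and with other constructions the A-C correlation can even be negative).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# B plays the role of a common tag SNP; A and C each share part of their
# signal with B but through independent noise, so A and C end up much
# more weakly correlated with each other than either is with B.
b = rng.normal(size=n)
a = 0.6 * b + 0.8 * rng.normal(size=n)   # corr(A, B) close to 0.6
c = 0.6 * b + 0.8 * rng.normal(size=n)   # corr(B, C) close to 0.6

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print(round(corr(a, b), 2))  # close to 0.6
print(round(corr(b, c), 2))  # close to 0.6
print(round(corr(a, c), 2))  # close to 0.36 only -- transitivity fails
```

Under this construction the A-C correlation is the product of the two pairwise correlations (0.6 × 0.6 = 0.36); in general it is only bounded within a wide interval, which can include zero and negative values.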
Alzheimer’s is often given as an example of the success of the common disease/common variant hypothesis, yet, ironically, it provides an excellent illustration of the failure of tagging SNPs. We were recently involved in quality control and analysis of the GenADA Alzheimer’s study posted by GlaxoSmithKline on dbGaP. The study used the Affymetrix 500K array, which does not assay the specific common polymorphism in APOE, a gene on chromosome 19 that has been linked to Alzheimer’s in several linkage and association studies. Separately, that SNP, rs429358, was genotyped using low-throughput methods. When we ran the association, there was no appreciable signal in the APOE region from the 500K SNPs, but rs429358 was significant at almost a 1e-60 level in a sample of 1577 individuals. We then searched chromosome 19 for the SNP most correlated with rs429358 and found one whose correlation had a trend-test p-value below 1e-13. However, that SNP’s p-value for association with case/control status was 0.0036 – a value not even considered nominally significant in a genome-wide context. Interestingly, we imputed this 500K data set up to the ~900K Affymetrix 6.0 density and still did not see a genome-wide significant signal. Could vast sample sizes have found this signal? Perhaps, but how many samples would it take? In this case, typing the correct marker made all the difference.
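For a back-of-the-envelope sense of how many samples the tag SNP would have needed, the sketch below plugs the numbers quoted above (n = 1577, p ≈ 1e-60 at rs429358, p ≈ 1e-13 for the tag’s correlation with it) into the textbook power approximation that LD r² scales the noncentrality of a 1-df trend test. This is my rough illustration, not part of the original analysis; the variable names are mine and only the Python standard library is used.

```python
from statistics import NormalDist

def chisq_1df(p):
    """1-df chi-square statistic implied by a two-sided p-value (X = z^2)."""
    z = -NormalDist().inv_cdf(p / 2)
    return z * z

n = 1577                       # GenADA sample size (from the post)
x_causal = chisq_1df(1e-60)    # statistic at rs429358 itself
x_corr = chisq_1df(1e-13)      # statistic for the best tag's correlation with rs429358

# LD r^2 between the tag and the causal SNP implied by the correlation
# test, using the approximation X = n * r^2.
r2 = x_corr / n

# Under the standard approximation, testing the tag instead of the causal
# SNP shrinks the noncentrality by r^2, so the sample size must grow by
# ~1/r^2 to reach the same statistic at genome-wide significance (5e-8).
x_target = chisq_1df(5e-8)
n_needed = n * x_target / (x_causal * r2)
print(f"implied r^2 = {r2:.3f}; samples needed via the tag = {n_needed:,.0f}")
```

Under these assumptions the implied r² is roughly 0.035 and the required sample size comes out on the order of five thousand, several-fold larger than the study, and that is before any penalty for multiple testing or imperfect phenotyping.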
The Marriage of Next-Gen Sequencing and Microarrays
I believe the way around this is to locate variants in the diseased individuals and then run all of the machinery of GWAS that has served us well so far. One way is to use next-generation sequencing on modest numbers of cases for a given disease (possibly pooling samples) to find markers that are rare in the overall population but common and more highly penetrant in the disease population, and then use those markers in association studies – perhaps with custom microarrays. Given that the Alzheimer’s-associated SNP was common among Alzheimer’s patients, one might have sequenced even a dozen Alzheimer’s patients and used the SNPs so discovered in a GWAS to find the highly significant signal.
Some who espouse the rare variant hypothesis say that there will be many, many rare variants that add up to explaining the missing heritability. I’m inclined to think that for most diseases there will be relatively few and that we just haven’t found them yet. I also think we will have to modify our view of heterogeneous “common diseases” like heart disease – there are probably dozens of ways the cardiovascular system can go wrong, many of them being characterized as “heart disease,” but each being its own unique disease. Perhaps the only problem with the common disease/common variant paradigm is that there are hardly any truly common diseases when we consider sub-phenotypes.
What About That Height Paper?
How then do we reconcile all of the above with the fascinating recent paper in Nature Genetics by Jian Yang, et al., “Common SNPs Explain a Large Proportion of the Heritability for Human Height,” in which 45% of the variance in human height is explained by considering all 294,831 common SNPs in the study simultaneously? Prior to this paper, GWAS on tens of thousands of individuals had found approximately 50 genome-wide significant SNPs and determined that these accounted for only around 5% of height variability. The authors write,
There are two logical explanations for the failure of validated SNP associations to explain the estimated heritability: either the causal variants each explain such a small amount of variation that their effects do not reach stringent significance thresholds and/or the causal variants are not in complete linkage disequilibrium (LD) with the SNPs that have been genotyped.
After showing in a regression framework that 294,831 SNPs together account for 45% of the variation, the authors conclude that both explanations hold – there are many causal variants of weak effect, and that the remaining variation is accounted for by untyped variants that are not sufficiently correlated with the typed variants to explain the remaining heritable variability.
For me, there is a disconnect between demonstrating that thousands of variables in a regression model together have a high correlation with height and concluding that there must therefore be thousands of weakly penetrant causes. Rather, the data seem equally consistent with many weak correlations arising from quite few untyped causal variants. I think the paper shows a very nice result, in that the heritability is not missing, but I disagree that it must reflect a huge number of weak effects. In any case, the authors ultimately invoke untyped variants to explain the other half of the variability, which is consistent with the view that we simply have not yet typed the variants that matter for our phenotypes of interest.
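The point that many weak marginal correlations are compatible with very few causal variants can be shown with a toy simulation. Everything below is hypothetical – one continuous “causal” variable, 200 weakly correlated “tag” predictors, arbitrary coefficients – and is only meant to show a joint regression explaining a large share of phenotypic variance while no single predictor explains much on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 200

# A single untyped causal variable (continuous stand-in for a genotype)...
causal = rng.normal(size=n)
# ...and m typed predictors, each only weakly correlated with it.
tags = 0.15 * causal[:, None] + rng.normal(size=(n, m))

# Phenotype driven entirely by the one causal variable plus noise.
y = causal + rng.normal(size=n)

# Variance explained jointly by regressing y on all m predictors at once.
Xc = tags - tags.mean(axis=0)
yc = y - y.mean()
beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
r2_joint = 1 - np.var(yc - Xc @ beta) / np.var(yc)

# Best variance explained by any single predictor on its own.
r2_single = max(np.corrcoef(tags[:, j], y)[0, 1] ** 2 for j in range(m))
print(f"joint R^2 = {r2_joint:.2f}, best single-predictor r^2 = {r2_single:.3f}")
```

In this toy setup the joint R² comes out large (a sizable fraction of the phenotypic variance) while the best single predictor explains only a percent or two, echoing the 45%-versus-5% gap without requiring thousands of causal effects.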
I have a fundamental cognitive dissonance with the view that it will require the incorporation of tens of thousands of low-odds SNPs to explain disease variation. It clashes with the paradigm that drives scientists to search for simpler and more fundamental causes for effects, dating back to Newton, who said “Natura valde simplex est et sibi consona”, or “nature is exceedingly simple and harmonious with itself”. The evolution of a field of knowledge towards becoming a science begins with classification (e.g. taxonomies), followed by correlation, followed by causation. GWAS is still firmly entrenched in the correlation phase, and correlation only gives us potential directions towards locating the cause. When we consider a disease that presents a consistent phenotype – a single effect – it is difficult for me to posit thousands of causes. Thousands of weak correlations, on the other hand, are totally expected due to the highly interconnected nature of human biology.
The Future of GWAS
I believe success in explaining the “missing heritability” will come if we use clinical data, proteomics, gene expression and metabolomic biomarkers to define subphenotypes (known as deep phenotyping), use next-generation sequencing to sample the variants in those disease subgroups and then follow those with disease-focused GWAS based on custom arrays. Custom arrays with 10,000 SNP and indel markers are currently priced in the ~$50/sample range for large studies. Whole-exome next-gen studies are in the ~$3000/sample range and falling, and whole-genome sequencing is below $10,000/sample and falling. In the coming years, I expect to see many cost-effective and productive studies consisting of sequencing 50-100 cases to find variants to design a custom chip, followed by a GWAS of a few thousand samples. The genotyping work for a 10,000 person study of this sort could be done for under $1M, and this will only go down over time. I believe this marriage of next-generation sequencing and custom microarrays is likely to be a long and fruitful one and is an important part of our own product development direction. …And that’s my two SNPs.
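As a sanity check on the price tag, the arithmetic behind the under-$1M figure, using the per-sample costs quoted above and a hypothetical split of 100 sequenced cases plus 10,000 array-genotyped samples, is simply:

```python
# Back-of-the-envelope cost check for the proposed design, using the
# per-sample prices quoted in the post (2010-era figures).
n_cases_sequenced = 100        # whole-exome next-gen sequencing
n_gwas_samples = 10_000        # custom 10K-marker array follow-up
cost_per_exome = 3_000         # $/sample
cost_per_array = 50            # $/sample at large-study volume

total = n_cases_sequenced * cost_per_exome + n_gwas_samples * cost_per_array
print(f"${total:,}")  # $800,000 -- comfortably under the $1M figure
```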
That’s why the article/review/opinion paper by Robert Plomin et al. is so very interesting… Common disorders are quantitative traits: there’s no such thing as “common diseases”; we are all unique, hence we each “fit” on a continuous scale of quantitative trait frequencies… That’s why I think we should start from what we can measure (genotypes) and then differentiate the “cases” from the “controls” (at least in genomics) – admittedly I am still pondering this idea…, i.e., is it valid?
Going back to my comment on a science taking its baby steps through classification, then transitioning to correlation and to causation, it is interesting to note that the concept of disease is essentially classification. Someone observes a pattern, gives it a name, and a phenotype is born. Diagnosis consists of pattern matching. The next step is to form correlations with these classifications. However, when we ultimately move to causation, I believe many of our classifications will be seen to be faulty and will need to be abandoned. For instance, a mutation in gene XYZ could be the cause of 10% of cases across 5 psychiatric conditions, and a drug addressing that mutation could treat the disease for those 10%. Unfortunately, as long as we call the disease by the names of those 5 conditions, we will have to run 5 separate clinical trials by indication to get FDA approval in the US, when really we ought to call the indication “disorder of gene XYZ” and run a trial for just those who have that mutation. That is, we must ultimately abandon many of our cherished disease classifications, which emerged out of a classification scheme based not on a common cause, but only on a common effect.
Classification tends to be discrete, which is in conflict with the quantitative or continuous characterizations of disease that you advocate. I agree with you that the quantitative measures are better characterizations, and indeed we are all unique. My sense of where phenotype characterization has to go is through molecular measurement — gene expression, proteomics, and small-molecule biomarkers, and that these will have to be assessed longitudinally. An important part of preventative medicine will be looking for departures from baseline rates of change for various biomarkers per individual. Early cancer detection will probably be the first application.
This is an interesting article, but I’m really not sure why the future involves microarrays. Personally, I just assume that the price of doing genomics will continue to fall through the third and fourth generations of sequencers, making microarrays more and more obsolete. I can see their utility in diagnostics – e.g., given a panel of known disease-causing SNPs, which does a patient have? – but clinging to them as an integral part of GWAS just doesn’t make sense to me.
Furthermore, I’m not really sure I believe in the model you’ve proposed of 50-100 cases with NGS, followed by large cohorts of microarray data. That is based on the assumption (which contradicts what you said above) that disease subtypes are all simple. Using your own example of heart disease, there may be thousands of ways in which the heart can go wrong, so you’d barely scratch the surface in your first 100 patients.
Rather, I believe the future is in collecting large databases of healthy and diseased individuals’ genotypes, which will give improved resolving power – and that only happens when we start sequencing as many people as we can and pooling the data. When this happens, I simply see microarrays disappearing from the GWAS landscape and being pushed to the diagnostic side.
Have I missed something?
Thanks for your thoughts, Anthony. How long the model I proposed remains viable depends on the pace of innovation and on how close the cost curves of next-gen sequencing (NGS) and microarrays must come to crossing before we can dispense with microarrays altogether. At the moment, with a ~200X difference in assay price between NGS and custom microarrays, we have a ways to go. Assay price is not the only cost we have to forecast, however. At recent conferences, for the first time, I have heard directors at major sequencing centers say that the cost of bioinformatics infrastructure and personnel for NGS far outstrips the cost of the sequencing itself; I have never heard that said of microarrays.
In the model I am proposing with 50-100 cases with NGS followed by custom arrays, I am assuming that we are looking at a homogeneous phenotype. Absolutely we should be collecting large databases of healthy and diseased individuals’ genotypes as you suggest. I would add that we should also capture RNA and urine longitudinally at periodic intervals, but perhaps that is a topic for another blog post. Still, if you are focused on a specific disease and costs are constrained, I believe the approach I set forth will bring more findings for your research dollar today. In 3-5 years that may well not be the case.
I think there’s a logical flaw in your motivating question: “Why, then, should we expect that common genetic polymorphisms found in a handful of HapMap individuals would be linked to the causes of disease in the relatively small proportion of people who have Type 1 diabetes?”
It’s actually entirely plausible that common variants can affect risk for rare outcomes, and indeed we see it all around us. Lung cancer is very rare (less than 0.1% of the population), smoking is common, yet we know that smoking is a huge risk factor for lung cancer. Dying in a car accident is rare, drunk driving is all too common, but again, drunk driving hugely increases your risk of dying in a crash. The key in all of these cases (smoking, drinking and driving, carrying a high-risk common allele) is that they increase your risk from “really, really small” to “really small”.
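The arithmetic behind “really, really small” to “really small” can be made explicit. The numbers below are purely hypothetical, chosen only to show the shape of the effect of a large relative risk on a rare outcome:

```python
# Hypothetical figures: a rare outcome with a common exposure carrying a
# large relative risk still leaves the absolute risk small.
baseline_risk = 0.0005   # hypothetical risk of the outcome without the exposure
relative_risk = 20       # hypothetical relative risk conferred by the exposure

exposed_risk = baseline_risk * relative_risk
print(exposed_risk)  # ~0.01: "really, really small" -> "really small"
```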
Good points, Jeff. Your analogies with lung cancer are very apt. I should have added a qualifier with my statement. I wasn’t trying to say that common variants cannot affect risk, and indeed the data shows they can. Rather I was trying to make the point that if you are looking for a correlation with an uncommon disease, the most power comes from matching the frequency of your biomarkers with that of the disease. And then to get the highest power biomarkers, look for variants that are common in the diseased population and rare in the healthy population. But your point that common variants can affect risk is well taken.
Pingback: Why Isn’t the Missing Heritability Nearly Neutral and Tightly Networked? [Mike the Mad Biologist]
Dear Dr Lambert,
Thank you for the important questions raised. Just one note: as you know, heritability coefficients for common complex diseases (phenotypes) depend on the gene pool of the particular human population (Falconer). In my population-based study (Caucasus, Russia) of complex phenotypes and quantitative physiological traits, I obtained heritabilities (both h2 and Ga) for the same trait(s) that varied from 55 to 70%, depending on the degree of isolation and gene drift of the populations studied and the diversity of the connecting gene pool. Do you agree that results of GWA or linkage studies, as well as CNV & LOH findings, for any complex disease (phenotype) can likewise differ across diverse human populations? Thanks in advance for your response.
Thanks for writing. Certainly the results of GWA and linkage studies can differ by population. For example, there are instances of studies replicating in several Caucasian cohorts, then failing to replicate in Chinese cohorts, and vice versa. What do you think are the implications of your widely varying heritability estimates?
Glad to find your response, many thanks! Sorry for the delay; I just today returned to my Moscow office from a long expedition to the Caucasus highlands for an ethnic genetics study. There was no internet connection there.
Regarding Chinese and Caucasians: that is understandable. But in my group’s long-term study in the Caucasus, we found significant differences between Caucasus ethnic populations as well, in both H2/Ga and LODs. In studies of genetic isolates we also found population diversity within the same ethnic isolates, connected with founder effects and other gene drift factors, as well as with the endogamy and consanguinity traditional in these small ethnic isolates. We reported findings on population stratification and LODs (Bulayeva et al., 2007, J. Genomics, UK). Most interesting is a recent finding in a CNV & LOH study in which we typed Affymetrix SNP 5 & 6 arrays for the same pedigree members (affected and non-affected) from the same isolates. To be concrete in the discussion, if you have any interest, I’d like to send you some tables with LODs and CNV & LOH for the same clinical phenotypes in the same ethnically diverse populations (isolates). May I do that, and how (your email address)?
Pingback: To Impute, or not to Impute | Our 2 SNPs…®
Pingback: “Making Sense” of all that data |