Exaggerating your number of controls or being precise? “Variant not found in over 10,000 chromosomes from EVS…”

Reading through the last issue of AJHG, I saw a couple of papers mention that the putative rare variant they were investigating was “not present in over 10,000 control chromosomes from the EVS”.

My first reaction was, “What? Do they mean the NHLBI 5400 Exome Sequencing Project? They only have 5,400 exomes, not over 10,000! I wonder if there is some new control database I wasn’t aware of?” But being in journal-reading mode, I didn’t bother to look it up and moved on.

I was pouring orange juice for my two-year-old this morning when, for some reason, pictures of perfect karyograms with paired diploid chromosomes came into my head. Aha! They are technically correct, and I suppose being very precise. They could also have said their variant wasn’t heterozygous or homozygous in any of the diploid chromosome pairs from the 5,400 exomes of the NHLBI 5400 Exome Sequencing Project. Or simply, “It wasn’t present in over 5,400 exomes from the Exome Variant Server”. With the variant being on an autosome, there were indeed two chromosomes per sample where the mutation could have occurred.
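To put numbers on it, here is a quick back-of-the-envelope sketch in Python. The function name and its parameters are my own invention for illustration, not anything provided by the EVS, and it assumes every sample has a genotype call at the site and that the site is autosomal (two copies per sample).

```python
def control_chromosomes(n_samples: int, copies_per_sample: int = 2) -> int:
    """Count the chromosomes screened at one site.

    Hypothetical helper: assumes every sample was successfully
    genotyped at the site. The default of 2 copies per sample
    only holds for autosomal sites.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if copies_per_sample < 1:
        raise ValueError("copies_per_sample must be at least 1")
    return n_samples * copies_per_sample


# 5,400 exomes at an autosomal site
print(control_chromosomes(5400))  # 10800
```

So "over 10,000 control chromosomes" and "5,400 exomes" describe the same data; the first just counts each sample twice.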

I don’t know. It seems a bit weird to claim your number of controls in terms of chromosomes rather than samples.

If you’ve seen this precedent before, I’d love to hear it.

4 thoughts on “Exaggerating your number of controls or being precise? “Variant not found in over 10,000 chromosomes from EVS…”

  1. Bryce Christensen

    I’ve often seen similar language, particularly in the context of imputation where the size of the reference dataset is quantified by the number of reference haplotypes, rather than the number of individual subjects used.

  2. R Segurado

    Given that chromosomes are the unit of observation in the usual case-control test for genetic association, maybe it’s not that peculiar a way of phrasing it. But it is cheeky.

  3. Jeffrey Rosenfeld

    I think this is just another part of the general push to exaggerate the number of samples that were really used in a GWAS-type study. You initially start with 1000 cases and 2000 controls, but because of various problems (genotyping error rate, duplication of samples, cryptic relatedness…) you reduce your numbers down to 800 cases and 1900 controls. Do you report that the study used 3000 samples, or 2700 samples? Either number can technically be termed correct, since you did initially investigate 3000 samples, but I would think that the 2700 number is more honest.

  4. Dan Gaston

    From reading lots of clinical genetics papers recently, mainly to determine what the proper number of statistical controls is, I have found it quite common there to report the number of control chromosomes, as it is the true sample number for determining the frequency of an observed variant.


