About Gabe Rudy
Meet Gabe Rudy, GHI’s Vice President of Product Development and team member since 2002. Gabe thrives in the dynamic and fast-changing field of bioinformatics and genetic analysis. Leading a killer team of Computer Scientists and Statisticians in building powerful products and providing world-class support, Gabe puts his passion into enabling Golden Helix’s customers to accelerate their research. When not reading or blogging, Gabe enjoys the outdoor Montana lifestyle. But most importantly, Gabe truly loves spending time with his sons, daughter, and wife. Follow Gabe on Twitter @gabeinformatics.
Recent on Our 2 SNPs...®
On my flight back from this year’s Molecular Tri-Conference in San Francisco, I couldn’t help but ruminate over the intriguing talks, engaging round table discussions, and fabulous dinners with fellow speakers. And I kept returning to the topic of how we aggregate, share, and update data in the interest of understanding our genomes.
Of course, there were many examples of each of these topics given by speakers and through the many conversations I had. The ENCODE project’s massive data output is illuminating the functional value of the genome outside protein coding genes. The CHARGE consortium, with its deeply phenotyped and heavily sequenced cohort of 14,000 individuals, will take a step forward in our understanding of the genome as large as those made by the HapMap and 1000 Genomes Project.
Continue reading →
Tis the season of quiet, productive hours. I’ve been spending a lot of mine thinking about file formats. Actually I’ve been spending mine implementing a new one, but more on that later.
File formats are amazingly important in big data science. In genomics, it is hard not to be awed by how successful the BAM file format is.
I thought one of the most tweetable moments at ASHG 2013 was when Jeffrey Reid from BCM Human Genome Sequencing Center (HGSC) talked about how they offloaded to the cloud (via DNAnexus) 2.4 million hours of compute time to perform the alignment and variant calling on ~4k genomes and ~12k exomes.
In the process, they produced roughly half a petabyte of BAM files (well mostly BAM files, VCFs are an order of magnitude smaller, but part of the output mix).
I’d speculate that Heng Li’s binary file format for storing alignments of short reads to a reference genome is responsible for more bytes of data being stored on the cloud (and maybe in general) than any other file format in the mere 4 years since it was invented.
But really, the genius of the format was not in the clever and extensible encoding of the output of alignment algorithms (the CIGAR string and key-value pair “tag” field have held up remarkably well through years of innovation and dozens of tools), but in the one-to-one relationship it shared with its text-based counterpart, the SAM file. Continue reading →
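To make the CIGAR string mentioned above concrete, here is a minimal sketch in Python of splitting one into its operations and counting how many reference bases the alignment covers. The function names `parse_cigar` and `reference_span` are made up for illustration; the operation letters follow the SAM specification, and the example string is invented rather than taken from real data.

```python
import re

# One CIGAR operation is a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string such as '5S90M2D5M' into (length, op) pairs."""
    if cigar == "*":  # SAM uses '*' when no alignment is available
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to catch characters the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops

def reference_span(cigar):
    """Count reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

# Hypothetical read: 5 soft-clipped bases, 90 aligned, a 2-base deletion, 5 aligned.
print(parse_cigar("5S90M2D5M"))
print(reference_span("5S90M2D5M"))
```

Because the same string appears verbatim in the text-based SAM record, a sketch like this works on either representation once the record has been decoded.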
I’m a believer in the signal. Whole genomes and exomes have lots of signal. Man, is it cool to look at a pile-up and see a mutation as clear as day that you arrived at after filtering through hundreds of thousands or even millions of candidates.
When these signals sit right in the genomic “sweet spot” of mappable regions with high coverage, you don’t need fancy heuristics or statistics to tell you what the genotype is of the individual you’re looking at. In fact, it gives us the confidence to think that at the end of the day, we should be able to make accurate variant calls, and once done, throw away all these huge files of reads and their alignments, and qualities and alternate alignments and yadda yadda yadda (yes I’m talking BAM files).
But we can’t.
Thankfully, many variants of importance do fall in the genomic sweet spot, but there are others, also of potential importance, where the signal is confounded. Continue reading →
My investigation into my wife’s rare autoimmune disease
I recently got invited to speak at the plenary session of AGBT about my experience in receiving and interpreting my Direct to Consumer (DTC) exomes. I’ve touched on this before in my post discussing my own exome and a caution for clinical labs setting up a GATK pipeline based on buggy variants I received in an updated report.
But I haven’t had a chance to discuss the potentially most interesting member of my exome trio: my wife.
While my exome analysis falls squarely in the “narcissisome” camp of investigating a healthy individual with no expectation of finding highly penetrant functional alleles, I have a meaningful and nuanced question to ask of my wife’s exome: Can exome data provide a plausible genetic story about the pathology of a complex disorder like autoimmune diseases?
While I commented previously about the incomplete picture we have after the “GWAS era” in complex diseases, it turns out Rheumatoid Arthritis is arguably one of the success stories.
The overall heritability of RA is estimated at 60%, and after many GWAS studies, recent meta-analyses pin the portion of the heritability we can account for around 50%.
But my wife was diagnosed not with RA but with Juvenile Idiopathic Arthritis (JIA). Although symptomatically these end up being classified and treated similarly, JIA is defined by early onset: symptoms first diagnosed before the age of 16. Continue reading →
In preparation for a webcast I’ll be giving on Wednesday on my own exome, I’ve been spending more time with variant callers and the myriad of false-positives one has to wade through to get to interesting, or potentially significant, variants.
So recently, I was happy to see a message in my inbox from the 23andMe exome team saying they had been continuing to work on improving their exome analysis and that a “final” analysis was now ready to download.
This meant I had both an updated “variants of interest” report as well as updated variant calls in a new VCF file. I’ll get to the report in a second, which lists rare or novel variants in clinically associated genes, but first let’s look at what changed in the variant calls. Continue reading →
Join me on December 5th for a one-hour webcast as I explore my personal exome provided by the Exome Pilot project of 23andMe.
Exome sequencing has seen many success stories in the realm of diagnosing highly penetrant monogenic disorders as well as in informing treatment of certain cancers. As the use of exome sequencing expands to more complex polygenic disorders and peeks into the realm of consumer genetics, we are faced with a set of challenges in both the bioinformatics and interpretation steps of analysis.
I will be acting as an asymptomatic consumer enthusiast as I apply the transparent techniques of high-impact variant discovery using SNP & Variation Suite (SVS) and GenomeBrowse.
These analysis techniques will reflect those commonly used in a clinical diagnostic lab setting to find putative variants for monogenic disorders. As I weed out false positives and genes with low functional significance, I will face the more daunting challenge of interpreting highly credible loss of function or missense variants and what impact, if any, they imply for my disease risk, pharmacogenomic profile, or other annotated genomic traits.
To Find a Killer Variant: Successes and Challenges on the Journey to Mass Adoption of NGS in the Clinic
Recently, I have been spending some time analyzing real patient data. I’m preparing for a webcast I’ll be giving in which I will walk through the process of replicating the findings of Dr. Gholson Lyon’s study on the novel disease diagnosis he named Ogden Syndrome.
Being so close to data that comes directly from clinical settings got me thinking about Gholson’s original paper and subsequent editorial in Nature.
The Ogden Syndrome paper was not only a great case study of using bioinformatic filtering to find a causal mutation with deadly consequences; Dr. Gholson Lyon also used it to make a statement.
When I read his editorial in Nature, I thought he was simply proposing we raise the bar on the standards of lab work and informatics done when sequencing patients for research projects.
It turns out he was saying something much more radical with deep implications on what it will take for NGS to really get traction in the clinic. Continue reading →
After my latest blog post, Jeffrey Rosenfeld reached out to me. Jeff is a member of the Analysis Group of the 1000 Genomes Project and shared some fascinating insights that I got permission to share here:
I saw your great blog post about the problems in the lack of overlap between Complete Genomics and 1000 Genomes data. I just had a paper published that addresses this same issue that I think you and the Golden Helix team would find interesting:
Jeff Continue reading →
I recently curated the latest population frequency catalog from the 1000 Genomes Project onto our annotation servers, and I had very high hopes for this track. First of all, I applaud 1000 Genomes for the amount of effort they have put in to providing the community with the largest set of high-quality whole genome controls available.
My high hopes are well justified.
After all, the 1000 Genomes Phase 1 project completed at the end of 2010, and they have released their catalog of computed variants and corresponding population frequencies at least five times since.
In May 2011, they announced an interim release based only on the low coverage whole genomes. This release was widely used, and one we also curated. Then in October 2011, their official first release was announced – an integrated call set that incorporated their exome data. Following that, version1 was released and re-released three times throughout November and December 2011. In 2012 we saw version2 in February, and finally version3 was released in April.
But as it turns out, simply using the 1000 Genomes Phase1 Variant Set v3 as your population filter will fail to filter out some common and well-validated variants. Continue reading →
There is nothing cooler than having something arrive that you have been excitedly waiting for: last week I got an email notification that my 23andMe exome results were ready.
Actually, I got 3 emails that my exome results were ready.
You see, I lucked out.
It all began two years ago on DNA day when Hacker News reported that 23andMe was running a special deal on their personal SNP-array genotyping and interpretation service. Looking across the room at my 7 month pregnant wife, I smiled and pulled out my credit card.
I then proceeded to enter in its numbers – 3 times.
Thankfully 23andMe allowed for returning your spit-in-a-tube DNA up to 6 months from the purchase of your order. Given one of my spit providers was my yet-to-be-born son, this was a very fortuitous policy.
Minus the frustrations of getting a newborn to provide what seems like a million little droplets of spit, the 23andMe customer experience turned out to be really quite entertaining and useful.
For example, based on the roughly 1 million SNPs dispersed across the genome, 23andMe can predict fun traits like earwax type and the ability to taste bitter food. They can provide useful pharmacogenomic assessments like your sensitivity to Warfarin and how fast you metabolize caffeine. Finally, they can predict your lifetime risk of contracting common genetically-linked diseases such as Coronary Heart Disease, Type 2 Diabetes, and certain types of cancers.
But as I talked about in my recent post, the research that 23andMe’s predictions are based on has serious limitations as to how much of the real genetic risk for these complex diseases it can account for with the common SNP-based genotyping it provides with its service.
There will probably be many ways in which further research and sequencing techniques will account for this missing genetic risk, but a promising research direction is examining rare variants in and around the protein-coding genes (the exome) that have the potential to directly influence biological function. Continue reading →
Exaggerating your number of controls or being precise? “Variant not found in over 10,000 chromosomes from EVS…”
My first reaction was, “What? Do they mean the NHLBI 5400 Exome Sequencing Project? They only have 5,400 exomes, not over 10,000! I wonder if there is some new control database I wasn’t aware of?” But being in journal-reading mode, I didn’t bother to look it up and went on.
I was pouring orange juice for my two year old this morning and pictures of perfect karyograms with attached diploid chromosomes came to my head for some reason. Aha! They are technically correct, and I suppose being very precise. They could also have said their variant wasn’t heterozygous or homozygous in any of the diploid chromosomes from the 5,400 exomes from NHLBI 5400ESP project. Or simply “It wasn’t present in over 5,400 exomes from the Exome Variant Server”. With the variant being on an autosomal chromosome, there were indeed two chromosomes where the mutation could have occurred per sample.
I don’t know. It seems a bit weird to claim your number of controls in terms of chromosomes rather than samples.
If you’ve seen this precedent before, I’d love to hear it.
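The chromosome arithmetic behind that “over 10,000” claim is easy to check. This is a minimal sketch that assumes an autosomal variant, so each diploid sample contributes two chromosomes; the function name `autosomal_chromosomes` is made up for illustration.

```python
def autosomal_chromosomes(n_samples, ploidy=2):
    """Chromosome copies at an autosomal site, assuming every sample is diploid."""
    return n_samples * ploidy

# 5,400 exomes from the Exome Variant Server, two copies per sample.
print(autosomal_chromosomes(5400))  # 10800
```

The same shortcut would not hold for variants on the X or Y chromosome, where the number of copies depends on the sex of each sample.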
Today I ran into an interesting fact about how a prolifically used catalog of population controls classifies African Americans with potential impacts on research outcomes.
The 1000 Genomes Project is arguably our best common set of controls used in genomic studies.
They recently finished what was termed “Phase 1” of the project, and they have been releasing full sets of the variants they discovered in 1,092 individuals from various population backgrounds.
One of the key attributes of any population variant catalog is the frequency with which the variant allele shows up.
Having allele frequency information allows you to filter variants from your own samples. For example, researchers often do a first pass filter of their variants to only investigate “rare” variants. Continue reading →
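A first-pass “rare variant” filter of the kind described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function name `rare_variants`, the tuple layout of a variant, the toy catalog frequencies, and the 1% cutoff. Variants missing from the catalog are treated as having a frequency of zero, which is itself a choice worth questioning.

```python
def rare_variants(variants, catalog_af, max_af=0.01):
    """Keep variants whose catalog allele frequency is below max_af.

    variants: list of (chrom, pos, ref, alt) tuples from your own samples.
    catalog_af: dict mapping the same tuples to a population allele frequency.
    Variants absent from the catalog are treated as frequency 0.0 (an assumption).
    """
    return [v for v in variants if catalog_af.get(v, 0.0) < max_af]

# Toy data, not real catalog values.
my_variants = [("1", 100, "A", "G"), ("2", 200, "C", "T"), ("3", 300, "G", "A")]
catalog = {("1", 100, "A", "G"): 0.25, ("2", 200, "C", "T"): 0.004}

print(rare_variants(my_variants, catalog))
```

Note that the result depends entirely on which population the catalog frequency was computed from, so a variant common in one population can slip through when the catalog under-represents it.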
Type 2 Diabetes, Rheumatoid Arthritis, Obesity, Crohn’s Disease and Coronary Heart Disease are examples of common, chronic diseases that have a significant genetic component.
It should be no surprise that these diseases have been the target of much genetic research.
Yet over the past decade, the tools of our research efforts have failed to unravel the complete biological architecture of these diseases. Continue reading →
As I’ve mentioned in previous blog posts, one of the great aspects of our scientific community is the sharing of public data. With a mission of providing powerful and accurate tools to researchers, we at Golden Helix especially appreciate the value of having rich and extensive public data to test and calibrate those tools. Public data allow us to do everything from testing the limits of our spreadsheet interface by importing whole genome samples from the 1000 Genomes project, to providing real-world data for incoming users to try full-feature examples of analysis workflows.
In this blog post we will examine the richness of Complete Genomics’ public samples available from their website. I will lead you through a series of filtering routines to reduce the search space from millions of variants to something more manageable (2,500 variants), all while discussing interesting characteristics of the dataset, perhaps giving you ideas for exploring your next sequence project. Continue reading →
In a series of previous blog posts, I gave an overview of Next Generation Sequencing trends and technologies. In Part Two of the three part series, the range of steps and programs used in the bioinformatics of NGS data was defined as primary, secondary and tertiary analysis. In Part Three I went into more details on the needs and workflows of tertiary analysis, and recently the Sequence Analysis Module in SVS was released to meet some of those needs.
Since then we have had quite a few requests for more details on the file formats and programs involved in getting data ready for tertiary analysis. Sequencing service providers most likely will have a set of tools orchestrated into a secondary analysis pipeline to process the sequence data from the machine to the point of being deliverable for tertiary analysis. But there are good reasons to understand this pipeline yourself in more detail. For one, you may want to know what to expect or ask from a service provider, internal core lab, or collaborating bioinformatician. Or you may simply want to learn about the types of files produced by various pipelines and what that means for your tertiary analysis. Continue reading →
The advances in DNA sequencing are another magnificent technological revolution that we’re all excited to be a part of. Similar to how the technology of microprocessors enabled the personalization of computers, or how the new paradigms of web 2.0 redefined how we use the internet, high-throughput sequencing machines are defining and driving a new era of biology.
Biologists, geneticists, clinicians, and pretty much any researcher with questions about our genetic code can now more affordably and capably than ever get DNA samples sequenced and processed in their pursuit for answers. Yes, this new-found technology produces unwieldy outputs of data. But thankfully, as raw data is processed down to just the differences between genomes, we are dealing with rich and accurate information that can easily be handled by researchers on their own terms.
In Part 1 of this series, I followed the story of how this revolution came about through the innovation of companies competing during and after the years of the Human Genome Project. In Part 2, I enumerated the tiers of analysis that start at processing the parallel measurements of the current generation’s high throughput sequencing machines and end with genomic variants of sequenced samples. Though more innovations are sure to come in this realm of pipelined processing of sequence data, it’s not until the final tier, tertiary analysis, that researchers are able to apply their own personal expertise to the process of extracting insight from their research data.
In this final installment of the Hitchhiker’s Guide to Next Generation Sequencing, I’m going to explore in more depth the workflows of tertiary analysis, focusing primarily on genotypic variants. Over the last three to four years, the scientific community has proposed a set of methods and tools for us to review as we explore the current landscape of solutions. So let’s examine the current state of sense making, how the field is progressing, and the challenges that lie ahead. Continue reading →
When you think about the cost of doing genetic research, it’s no secret that the complexity of bioinformatics has been making data analysis a larger and larger portion of the total cost of a given project or study. With next-gen sequencing data, this reality is rapidly setting in. In fact, it’s been commonly suggested that the total cost of storing and analyzing sequence data will soon be greater than the cost of obtaining the raw data from sequencing machines, if it isn’t already.
In my previous post, A Hitchhiker’s Guide to Next Generation Sequencing – Part 1, I covered the history and forces behind the decreasing cost curve of producing sequence data. Through innovation and competition, high throughput sequencing machines are making it more affordable to use whole exome or whole genome sequencing as a research tool in fields that benefit from deeply understanding the genetic components of disease or other phenotypic traits.
In this post I plan to explore, in depth, what goes into the analysis of sequence data and why both the cost and complexity of the bioinformatics should not be ignored. Whether you plan to send samples to a sequencing-as-a-service center, or brave the challenges of sequencing samples yourself, this post will help distinguish the fundamental difference in analyses by their usage patterns and complexity. While some bioinformatics packages work well in a centralized, highly tuned and continuously improved pipeline, others fall into a long tail of valuable but currently isolated tools that allow you to gain insight and results from your sequence data. Continue reading →
If you have had any experience with Golden Helix, you know we are not a company to shy away from a challenge. We helped pioneer the uncharted territory of copy number analysis with our optimal segmenting algorithm, and we recently handcrafted a version that runs on graphics processing units that you can install in your desktop. So it’s probably no surprise to you that the R&D team at Golden Helix has been keeping an eye on the developments of next generation sequencing technologies. But what may have surprised you, as it certainly did us, was the speed with which these sequencing hardware platforms and services advanced. In a matter of a few short years, the price dropped and the accuracy improved to reach today’s standards where acquiring whole exome or whole genome sequence data for samples is both affordable and accurate.
In a series of three blog posts, I’m going to cover the evolution of sequencing technologies as a research tool, the bioinformatics of getting raw sequence data into something you can use, and finally the challenges and unmet needs Golden Helix sees in the sense-making of that processed sequence data.
To start with, let’s look at the story of how we got to where we are today. If you ever wondered what’s the difference between an Illumina HiSeq 2000 and a SOLiD 4hq, or why it seems that every six months the purported cost of whole genome sequencing is halved, then this story is for you. Continue reading →
In the paper “Runs of homozygosity reveal highly penetrant recessive loci in schizophrenia,” Dr. Todd Lencz introduced a new way of doing association testing using SNP microarray platforms. The method, which he termed whole genome homozygosity association, first identifies patterned clusters of SNPs demonstrating extended homozygosity (runs of homozygosity or “ROHs”) and then employs both genome-wide and regionally-specific statistical tests on ROH clusters for association to disease. This approach differs from single SNP or haplotype association methods employed in most genome-wide association studies, and may be better for identifying chromosomal segments that harbor rare, penetrant recessive loci. Another advantage of this approach is that it reduces the number of covariates and therefore multiple testing corrections for genome-wide significance.
The original algorithms for Dr. Lencz’s study were co-developed by Dr. Christophe Lambert from Golden Helix and later implemented in the SNP and Variation Suite (SVS).
But, like all good science that gets adopted by the community, some other papers and programs have added to the original ROH method. An especially important contribution is the addition of more advanced filtering criteria for ensuring that algorithmically detected ROHs are true population variants and not due to random chance, genotype calling anomalies, or low marker coverage over certain regions of the genome. This is the first area we looked at for improving our method. Continue reading →
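For readers new to the idea, the very first step, finding stretches of consecutive homozygous calls in one sample, can be sketched as below. This is a deliberately simplified illustration and not the algorithm from Dr. Lencz’s paper or the SVS implementation: the function name `runs_of_homozygosity`, the two-letter genotype strings, and the `min_snps` threshold are all assumptions, and it ignores missing calls, physical distance between markers, and the filtering criteria discussed above.

```python
def runs_of_homozygosity(genotypes, min_snps=3):
    """Return inclusive (start, end) index pairs of homozygous runs.

    genotypes: ordered list of two-character calls such as 'AA' or 'AG'.
    min_snps: minimum number of consecutive homozygous SNPs to report a run.
    """
    runs, start = [], None
    for i, gt in enumerate(genotypes):
        if gt[0] == gt[1]:          # homozygous call extends or opens a run
            if start is None:
                start = i
        else:                       # heterozygous call closes any open run
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    # Close a run that reaches the end of the marker list.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs

# Toy genotypes for one sample along one chromosome.
calls = ["AA", "GG", "CC", "AG", "TT", "TT", "CT", "AA", "AA", "AA", "AA"]
print(runs_of_homozygosity(calls))
```

An association test would then compare how often cases and controls carry overlapping runs, which is where the clustering and statistical steps described in the paper come in.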