If you have had any experience with Golden Helix, you know we are not a company to shy away from a challenge. We helped pioneer the uncharted territory of copy number analysis with our optimal segmenting algorithm, and we recently handcrafted a version that runs on graphics processing units that you can install in your desktop. So it’s probably no surprise to you that the R&D team at Golden Helix has been keeping an eye on the developments of next-generation sequencing technologies. But what may have surprised you, as it certainly did us, was the speed at which these sequencing hardware platforms and services advanced. In a matter of a few short years, the price dropped and the accuracy improved to the point where acquiring whole exome or whole genome sequence data for samples is both affordable and accurate.
In a series of three blog posts, I’m going to cover the evolution of sequencing technologies as a research tool, the bioinformatics of getting raw sequence data into something you can use, and finally the challenges and unmet needs Golden Helix sees in the sense-making of that processed sequence data.
To start with, let’s look at the story of how we got to where we are today. If you ever wondered what’s the difference between an Illumina HiSeq 2000 and a SOLiD 4hq, or why it seems that every six months the purported cost of whole genome sequencing is halved, then this story is for you.
How We Got Here
As I’m sure Frederick Sanger could tell you, DNA sequencing is nothing new. In 1980 he received the Nobel Prize in Chemistry (or a share of one) for his method of determining the base sequence of nucleic acids. But it wasn’t until ten years later, with the audacious pursuit of sequencing the entire human genome, that the real driver of innovation took hold: competition.
With both the Human Genome Project and Celera racing to the finish line, improvements were made to the entire pipeline: from the wet work to the sequence detection hardware to the bioinformatics. What was originally expected to, optimistically, take fifteen years was out the door in ten. Hot on the heels of these original innovators was a new round of start-ups mirroring the dot-com era with their ruthless competitiveness and speed of innovation.
First out of the gate was the venerable 454 Life Sciences, later acquired by Roche, with their large-scale parallel pyrosequencing capable of long reads of 400 to 600 bases. Reads of this length made it possible to sequence novel organisms without a reference genome and to assemble a genome de novo with confidence. Although it drew on the microprocessor industry’s advances in producing highly accurate small-scale parallel components, the 454 system was still fairly expensive per Gb (billion base pairs) of sequence data acquired.
Over a pint of beer at the local chemist’s bar near Cambridge University, a couple of Brits decided they could do better. In their informal “Beer Summit”, they hashed out an idea for a sequencing chemistry that had the potential to scale to a very cheap and high-throughput sequencing solution. With the biochemistry skills of Shankar Balasubramanian and the lasers of David Klenerman, a massively parallel technique of reversible terminator-based sequencing was matured and commercialized with the funding of their start-up, which they called Solexa. On the strength of that potential, Solexa was purchased by U.S.-based Illumina. The promise held out, and the 1G Genetic Analyzer released in 2006 could sequence a personal genome for about $100,000 in three months.
Coming to market at the same time, but seeming to have just missed the wave, was the Applied Biosystems (ABI) SOLiD system of parallel sequencing by stepwise ligation. Like the Solexa technology, it produces extremely high throughput short reads cheaply, but SOLiD has the added advantage of reading two bases at a time with a fluorescent label. Because a single base pair change is reflected in two consecutive di-base measurements, while an isolated change in just one measurement points to a sequencing error, this two-base encoding is inherently better at distinguishing real single nucleotide variations from potential sequencing errors. With a spec sheet otherwise seemingly head-to-head with Illumina’s Solexa technology, the momentum of the market went to the company that first shipped working machines out to the eager sequencing centers. That prize was won by Illumina by a margin of nearly a year.
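To make the two-base idea concrete, here is a minimal sketch in Python. It assumes the commonly described color-space scheme, in which each base gets a 2-bit code and each overlapping pair of bases is reported as the XOR of the two codes. The function names and the example sequences are made up for illustration, and this is a toy model rather than anything taken from the instrument's software.

```python
# Toy model of two-base ("color space") encoding. The 2-bit base codes and
# the XOR rule are assumptions for illustration, not vendor code.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq):
    """Return one color (0-3) per overlapping pair of adjacent bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def changed_positions(colors_a, colors_b):
    """Indices at which two equal-length color lists disagree."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


reference = "ATGGCA"
snp_read = "ATGACA"  # hypothetical real variant: G -> A at base index 3

ref_colors = encode_colors(reference)
snp_colors = encode_colors(snp_read)

# A true single-base change alters the two colors that overlap that base.
snp_diff = changed_positions(ref_colors, snp_colors)

# A lone measurement error alters just one color.
error_colors = list(ref_colors)
error_colors[2] ^= 1
error_diff = changed_positions(ref_colors, error_colors)

print("SNP changes colors at:", snp_diff)
print("Single error changes colors at:", error_diff)
```

In this sketch the variant shows up as two adjacent color changes, while the simulated measurement error touches only one, which is the property that lets the two cases be told apart.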
Drivers of the Cost Curve
With the fierce competition to stay relevant in an exploding marketplace, the three “next-generation” sequencing platforms, Roche 454, Illumina and ABI SOLiD, vastly improved the throughput, read length, and quality of their sequencing hardware from their initial offerings. New companies such as Ion Torrent and Pacific Biosciences began to innovate their way into the market with new sequencing technology, each with its own advantages: Ion Torrent with simple chemistry and inexpensive hardware, and Pacific Biosciences with extremely long reads and reduced sample prep. These “third-generation” sequencing companies have the potential to completely change the cost structure of sequencing by removing entire steps of chemistry or using alternatives to complex optical instruments. But despite the allure of the novel, Illumina has set the competitive bar very high in terms of throughput and cost per Gb of sequence data produced with its recent release of the HiSeq 2000.
Alongside the technological drivers, there are two other factors I see that are making sequencing a viable and affordable research tool. First is the “democratization of sequencing” effect causing more sequencing machines to show up in smaller institutes, and second is the centralization and specialization found in the “sequencing as a service” business model. Let’s explore both of these and how they may influence the researcher.
Democratization of Sequencing
While larger genome institutes are snatching up the latest high throughput sequencing machines, their existing machines are often being bought by smaller institutes quite happy with the throughput of the previous generation of hardware. Roche is even building a different class of product, the 454 Junior, for people who want easier sample prep and less expensive machines and do not need as high a throughput per run. ABI is similarly releasing the SOLiD PI at a lower price point.
The target audience for these products is not those wanting to sequence whole genome or even whole exome human samples in their own lab, but rather people who need the flexibility and turn-around time of their own sequencing machine and want to investigate targeted regions or organisms that do not require gigabases of sequence data.
This is the democratization of sequencing: the power of being able to run experiments that treat sequencers as glorified microscopes or cameras, peering into the uncharted territory of certain bacteria or capturing the result of RNA expression experiments.
Sequencing as a Service
On the other hand, if you are interested in getting full genome or exome sequences of humans, you may be interested in another emerging trend: the centralization and growth of sequencing centers that reduce costs by taking advantage of their focused expertise and economies of scale.
Despite what a sequencing platform vendor may tell you, every system comes with its quirks and issues. From the sample prep, to loading and unloading the sequencer, to monitoring and processing the bioinformatics pipeline, there are a lot of places for the process to go wrong and for time and money to be wasted. But if you take a page out of the book of the highly efficient manufacturers of the 21st century, you can see amazing consistency and accuracy with a process of continuous improvement in place.
By focusing on just human whole genome or whole exome sample processing, a sequencing center gains the expertise to provide the best-quality data the underlying technology is capable of. Complete Genomics has taken it one step further, building their sequencing service around their own unique sequencing chemistry and hardware to have complete control over every step of the process. This would be a good place to raise the flag that just having an outsourced service provider does not eliminate the potential for batch effects and poor study design to confound your downstream analysis. I will talk more about this in Part 2 with the discussion of the analysis of sequence data.
An often undervalued part of the production of quality sequence data is the bioinformatics pipeline. It takes the raw reads from the machine, performs the assembly or the alignment to a reference genome, and finally calls the variants, that is, the differences between the consensus sequence of the sample and the reference genome. The single purpose software tools used in the pipeline have been rapidly developed by the sequencing centers themselves and by other bioinformatics researchers. Although the pipeline is built mostly on these open source tools, the expertise and compute power required to run it benefit greatly from the economies of scale and specialization of the sequence service providers.
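As a rough illustration of that last step, the sketch below calls variants from reads that are assumed to be already aligned. It takes a majority-vote consensus at each reference position and reports the positions where the consensus differs from the reference. The function name, the `min_depth` parameter, and the sequences are all hypothetical, and production variant callers do far more (base qualities, indels, diploid genotypes), so treat this as a toy.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_depth=2):
    """Toy caller: majority-vote consensus per position vs. the reference.

    aligned_reads is a list of (start, sequence) pairs with 0-based starts.
    Substitutions only: no indels, no base qualities.
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust a call here
        consensus, _ = counts.most_common(1)[0]
        if consensus != reference[pos]:
            # Report a 1-based position, as most variant formats do.
            variants.append((pos + 1, reference[pos], consensus, depth))
    return variants


# Made-up reference and reads; three of four reads carry a T at position 6.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTATGT"), (3, "TATGTA"), (4, "ATGTAC")]
variants = call_variants(reference, reads)
print(variants)
```

Even in this toy the depth threshold matters: positions covered by a single read are skipped rather than called, which hints at why coverage and pipeline tuning have so much influence on the quality of the final variant list.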
Though it is not immediately obvious, we can now see that the democratization of sequencing and the centralization of sequencing through service providers fulfill complementary market needs. If your research is focused on human samples, and you want to do large runs covering whole exomes or genomes, it makes sense to consider the benefits of sequencing services both in terms of price and quality.
Sequencing as a Service Sounds Good. Now What?
So you’ve determined that sequencing as a service is a good way to go, and you may be wondering what you should be requesting from your service provider. What data should you be keeping? What do you need to ensure you will have the data in a format ready for downstream analysis?
Although you may have the option to order just the raw reads from sequencing service providers, we have discussed some reasons why it makes sense to have them run their data processing pipeline on the reads so they can provide you with data ready for downstream analysis.
In fact, sequence alignment algorithms such as BWA have matured to the point where it doesn’t even make sense to keep the raw reads in their FASTQ format once alignment has been done. To allow for the easiest use of your data in downstream analysis, you should ask for your aligned data in the now standardized and efficient BAM file format and your variant calls in the near-standardized VCF format (although variant calls in any text format are usually sufficient).
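To give a feel for why a text format for variant calls is easy to work with downstream, here is a minimal sketch that reads the eight fixed columns of VCF-style text (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name and the two records are made up for illustration, and the sketch ignores per-sample genotype columns and header metadata, so an established parser is the better choice for real work.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_text(text):
    """Parse the eight fixed VCF columns from tab-delimited text.

    Header lines (starting with '#') are skipped; per-sample genotype
    columns, if present, are ignored in this sketch.
    """
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])  # 1-based position
        record["ALT"] = record["ALT"].split(",")  # may list several alleles
        records.append(record)
    return records


# Made-up example records, not real variant calls.
example = "\n".join([
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tC\tT\t50\tPASS\tDP=20",
    "2\t67890\t.\tG\tA,C\t30\tPASS\tDP=12",
])
records = parse_vcf_text(example)
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
```

Because each call is a single tab-delimited line, the same handful of lines of code can filter, count, or reshape variants from any provider that sticks to these columns.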
For Next Time
If it seems like I’m just getting into the good stuff, don’t worry. In my next post, Part 2 of this series, I will break down the analysis of sequencing data into three categories: Primary, Secondary and Tertiary analysis. Just as important as choosing the right sequence provider and asking for the right services is understanding how to get the most out of your sequence data.
Stay tuned as we evaluate the current landscape of analysis solutions for your sequencing study and the unmet needs of this exciting and rapidly evolving market for using high throughput sequencing as a research tool. …And that’s my two SNPs.