Detection of CNVs in NGS Data
Our Secondary Analysis 2.0 blog series continues with Part III: Detection of CNVs in NGS Data. We will give you an overview of some design principles of a CNV analytics framework for next-gen sequencing data. There are a number of different approaches to CNV detection. The published algorithms share common strategies to solve the underlying computational problems. In principle, CNV detection methods incorporate three major steps:
- Data preprocessing to correct for biases in the data and create a baseline for detecting variation.
- State assignment to give each target (or segment) a copy-number state.
- Large Event Calling by defining the boundaries of multi-target events using a segmentation algorithm.
In the data preprocessing phase, the algorithms correct for systematic biases and normalize the data to establish a baseline for detecting variation. The two most common methods for addressing systematic bias are GC-content and mappability correction. The most common methods for normalization are Principal Component Analysis (PCA) and normalization relative to reference samples.
One source of bias in the coverage data is GC-content bias. Regions with high or low GC-content tend to have lower mean read depth because PCR amplification is less efficient there. When correcting for GC-bias, CNV calling algorithms generally either filter out regions with extreme GC-content or normalize the coverage to account for the bias. Algorithms that use the filtering approach include XHMM and OncoSNP-SEQ, while algorithms that normalize for GC-content include CLAMMS, ReadDepth, Patchwork and Control-FREEC.
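To make the normalization approach concrete, here is a minimal sketch of one common way to do it: group targets by GC-content and rescale each group to the sample-wide median depth. This is an illustration only, not the procedure of any tool named above; the function name `gc_normalize` and the default bin count are our own assumptions.

```python
import numpy as np

def gc_normalize(depth, gc, n_bins=20):
    """Rescale depth so that every GC bin shares the sample-wide median.

    depth: mean read depth per target
    gc:    GC fraction per target, in [0, 1]
    n_bins is an illustrative default, not a recommended value.
    """
    depth = np.asarray(depth, dtype=float)
    gc = np.asarray(gc, dtype=float)
    # Assign each target to an equal-width GC bin covering [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(gc, edges) - 1, 0, n_bins - 1)
    overall_median = np.median(depth)
    corrected = depth.copy()
    for b in np.unique(bins):
        in_bin = bins == b
        bin_median = np.median(depth[in_bin])
        if bin_median > 0:
            # Targets in a low-coverage GC bin are scaled up, and vice versa.
            corrected[in_bin] = depth[in_bin] * overall_median / bin_median
    return corrected
```

Bins whose median depth is zero are left unchanged here; a filtering-style method would drop such targets instead.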
Another source of bias in the coverage data is mappability bias. Mappability for a given region is the probability that a read originating from the region is unambiguously mapped to it. Regions with low mappability tend to produce more ambiguous reads, which can cause errors in CNV detection. Generally, algorithms will address mappability bias by filtering out low mappability regions. Methods that address mappability bias in this way include CODEX, Control-FREEC and OncoSNP-SEQ.
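The filtering step described above can be sketched in a few lines. We assume a per-target mappability score in [0, 1] is already available; the function name and the 0.9 cutoff are hypothetical choices for illustration, not values taken from any of the tools listed.

```python
import numpy as np

def filter_low_mappability(depth, mappability, min_mappability=0.9):
    """Drop targets whose mappability falls below a cutoff.

    Returns the retained depths and the boolean mask that was applied,
    so the same mask can be reused on other per-target arrays.
    The 0.9 default is an assumed, illustrative threshold.
    """
    depth = np.asarray(depth, dtype=float)
    mappability = np.asarray(mappability, dtype=float)
    keep = mappability >= min_mappability
    return depth[keep], keep
```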
Several CNV detection algorithms perform their primary normalization via Principal Component Analysis (PCA) on the coverage data. PCA uses an orthogonal transformation to convert a set of observations into a set of linearly uncorrelated variables called principal components. The CONIFER and XHMM algorithms perform normalization using PCA by removing the k strongest principal components. As an alternative to PCA, it is also possible to perform normalization using a set of reference samples. Here, deviation from the average coverage in the reference samples serves as an indicator of CNV occurrence. Generally, this is done by computing evidence metrics, such as a Z-score, relative to the reference samples. This approach normalizes out biases present across the reference samples, thereby reducing or eliminating the need to explicitly correct for systematic biases such as GC-content and mappability. Algorithms that rely on reference samples for CNV detection include CoNVaDING, VisCap, CLAMMS and CNVkit.
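The PCA normalization described above can be sketched with a singular value decomposition: center the samples-by-targets coverage matrix, zero out the k largest singular values, and rebuild the matrix. This is a generic sketch of the idea under our own naming (`remove_top_components`), not the actual code of CONIFER or XHMM, which apply additional transformations and filters.

```python
import numpy as np

def remove_top_components(coverage, k):
    """Remove the k strongest principal components from a coverage matrix.

    coverage: array of shape (n_samples, n_targets)
    Returns the centered matrix with the top-k components subtracted.
    """
    X = np.asarray(coverage, dtype=float)
    # Center each target (column) across samples before the decomposition.
    centered = X - X.mean(axis=0)
    # Singular values come back sorted from largest to smallest.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:k] = 0.0  # discard the k strongest components
    return (U * s_kept) @ Vt
```

The choice of `k` is left to the caller, which is exactly the parameter whose selection is hard to justify objectively.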
While PCA-based normalization has the advantage of handling varied and even unknown sources of noise in the data, the approach has two major disadvantages compared to reference sample normalization in a clinical setting. First, it requires significantly more samples to provide robust results. Clinical labs may have as few as 15-20 samples as they validate and configure a test. Reference sample-based normalization can provide reasonable results with far fewer samples. Second, the number k of principal components to factor out of the data is a somewhat subjective parameter, yet it is highly influential on the final result. For clinical validation of bioinformatics methods in a genetic test, algorithms should be robust, meaning that small changes in inputs will not lead to dramatically different results. Additionally, they need to be as transparent as possible with regard to the usage of intermediate values and metrics. Each false negative (known CNV missed by the algorithm) must be investigated and understood to characterize the limits of a test. For these reasons, we feel the black-box nature of PCA is inferior to the reference sample normalization approach for an algorithm in the clinical context.
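To show how few moving parts the reference-sample approach needs, here is a minimal sketch that computes a per-target Z-score and coverage ratio for one sample against a set of reference samples. The function name and the choice to scale each sample by its own mean depth are our assumptions for illustration; this is not the exact metric of any tool named in this post.

```python
import numpy as np

def reference_z_scores(sample_depth, reference_depths):
    """Compare one sample to reference samples, target by target.

    sample_depth:     array of shape (n_targets,)
    reference_depths: array of shape (n_refs, n_targets)
    Returns (z, ratio), each of shape (n_targets,).
    """
    sample = np.asarray(sample_depth, dtype=float)
    refs = np.asarray(reference_depths, dtype=float)
    # Scale every sample by its own mean so sequencing depth cancels out.
    sample_norm = sample / sample.mean()
    refs_norm = refs / refs.mean(axis=1, keepdims=True)
    ref_mean = refs_norm.mean(axis=0)
    # Sample standard deviation; a target with identical reference values
    # has zero spread and would need special handling in real use.
    ref_sd = refs_norm.std(axis=0, ddof=1)
    z = (sample_norm - ref_mean) / ref_sd
    ratio = sample_norm / ref_mean
    return z, ratio
```

Every intermediate value here (normalized depth, reference mean, reference spread) can be inspected per target, which is what makes a missed call straightforward to investigate.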
During the state assignment step, a copy number state is assigned to each target (or each segment, if segmentation is performed first). The classification problem requires some empirical criteria for assigning a copy number state to a given region. Some algorithms rely on empirically defined thresholds to determine copy number state, while others use Hidden Markov Models (HMMs) for classification. Thresholding is the simplest method for CNV classification. This approach involves setting thresholds for one or more of the metrics and calling a CNV if the metrics at the target fall above or below the thresholds. Thresholding is used by CoNVaDING, ReadDepth, Patchwork, Control-FREEC and BIC-Seq. Alternatively, classification can be performed using an HMM. HMMs are statistical models that represent the system as a Markov process with hidden states. In an HMM, the state of the system is not directly observable, but a single evidence variable, which depends on the state, can be observed. Algorithms that use HMMs for classification include XHMM, CANOES, and CLAMMS.
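A thresholding classifier is short enough to write out in full. The sketch below combines two evidence metrics, a coverage ratio and a Z-score, and calls a CNV only when both pass their cutoffs. The cutoff values and the function name are illustrative assumptions, not the defaults of any tool listed above.

```python
import numpy as np

def call_by_threshold(ratio, z, del_ratio=0.75, dup_ratio=1.25, min_abs_z=3.0):
    """Assign a copy number state to each target using fixed cutoffs.

    All three thresholds are assumed, illustrative values.
    """
    ratio = np.asarray(ratio, dtype=float)
    z = np.asarray(z, dtype=float)
    states = np.full(ratio.shape, "normal", dtype=object)
    # Both metrics must agree before a target is called.
    states[(ratio <= del_ratio) & (z <= -min_abs_z)] = "deletion"
    states[(ratio >= dup_ratio) & (z >= min_abs_z)] = "duplication"
    return states.tolist()
```

Note that each target is classified independently of its neighbors, which is the main limitation the HMM approach addresses.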
When compared to thresholding approaches to state assignment, HMMs have several key advantages. First, HMMs account for conditional dependencies between the states of adjacent target regions, increasing the probability that a target will have the same state as its neighbor. Second, the probabilistic nature of HMMs allows us to quantify the uncertainty of CNV calls by assigning a probability to each called region. However, unlike thresholding methods, HMMs cannot easily incorporate multiple evidence variables. This shortcoming has led many researchers to rely on thresholding methods despite the other advantages of HMMs.
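The neighbor dependency is easiest to see in code. Below is a minimal three-state HMM decoded with the Viterbi algorithm, using a per-target Z-score as the single evidence variable. The Gaussian emission means, the transition probabilities and the function name are all assumptions chosen for illustration; they are not the parameters of XHMM, CANOES or CLAMMS.

```python
import numpy as np

STATES = ("deletion", "normal", "duplication")

def viterbi_cnv(z, shift=3.0, p_enter=1e-3, p_stay=0.9):
    """Most likely state path for a sequence of per-target Z-scores.

    Emissions are unit-variance Gaussians centered at -shift, 0 and +shift.
    shift, p_enter and p_stay are assumed, illustrative parameters.
    """
    z = np.asarray(z, dtype=float)
    means = np.array([-shift, 0.0, shift])
    # Log emission probability up to a constant shared by all states.
    log_emit = -0.5 * (z[:, None] - means[None, :]) ** 2
    # trans[i, j] is the probability of moving from state i to state j.
    trans = np.array([
        [p_stay, 1.0 - p_stay - p_enter, p_enter],
        [p_enter, 1.0 - 2.0 * p_enter, p_enter],
        [p_enter, 1.0 - p_stay - p_enter, p_stay],
    ])
    log_trans = np.log(trans)
    log_start = np.log(np.array([p_enter, 1.0 - 2.0 * p_enter, p_enter]))

    n = len(z)
    score = np.empty((n, 3))
    back = np.zeros((n, 3), dtype=int)
    score[0] = log_start + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans  # rows: previous state
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]

    # Trace the best path backwards from the final target.
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [STATES[i] for i in path]
```

With these parameters, a weak target such as Z = -1.0 flanked by strong deletion evidence stays inside the deletion, because leaving and re-entering the state costs more than the poor emission.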
Large Event Calling
Large events must be constructed by merging targets in the same contiguous region into a single event with well-defined boundaries. The most common approach is a simple merging procedure of consecutive targets with the same copy number state. This approach is used by CLAMMS, CANOES, CoNVaDING, OncoSNP-SEQ and XHMM. Unfortunately, these methods fail to reliably call large CNVs as one contiguous event, which is desirable from a clinical interpretation perspective. Instead, these methods will call larger CNV events as a group of smaller events separated by outliers.
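The simple merging procedure, and its weakness, can be sketched as follows. We assume each target is a `(chrom, start, end, state)` tuple sorted by position; that layout and the function name are hypothetical, chosen only for this example.

```python
from itertools import groupby

def merge_adjacent_calls(targets):
    """Merge runs of consecutive targets that share a non-normal state.

    targets: list of (chrom, start, end, state), sorted by chrom and start.
    Returns a list of (chrom, start, end, state, n_targets) events.
    """
    events = []
    # groupby only groups consecutive items, so a single differing
    # target in the middle of a run starts a new group.
    for (chrom, state), run in groupby(targets, key=lambda t: (t[0], t[3])):
        run = list(run)
        if state != "normal":
            events.append((chrom, run[0][1], run[-1][2], state, len(run)))
    return events
```

A single outlier target called "normal" inside a deletion splits the run, so one biological event is reported as two, which is the fragmentation described above.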
Other methods, especially those focused on calling large events, up to and including chromosome-level aneuploidy, perform segmentation before calling copy number state. The most common segmentation algorithm is Circular Binary Segmentation (CBS). CBS performs segmentation by iteratively computing segments to maximize the variance between segments while minimizing the variance within each segment. It is used by ExomeCNV, VarScan2 and Patchwork. Segmentation before event calling will result in the detection of only large events spanning multiple targets.
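To give a feel for segmentation, here is a deliberately simplified sketch using plain recursive binary segmentation on per-target log ratios. It is not CBS itself: it tests a single change point at a time rather than circular arcs, and it uses a fixed cutoff instead of a permutation test. The function name, the cutoff and the minimum segment size are our own assumptions.

```python
import numpy as np

def segment(values, threshold=4.0, min_size=2):
    """Split a series into segments with differing means.

    Simplified binary segmentation, not full CBS. threshold and min_size
    are assumed, illustrative parameters.
    Returns a list of (start, end, mean) with end exclusive.
    """
    values = np.asarray(values, dtype=float)
    # Robust noise estimate from first differences (median absolute value).
    sigma = np.median(np.abs(np.diff(values))) / (0.6745 * np.sqrt(2))
    sigma = max(sigma, 1e-12)  # avoid division by zero on flat input
    breakpoints = []

    def recurse(lo, hi):
        n = hi - lo
        if n < 2 * min_size:
            return
        seg = values[lo:hi]
        # Candidate split sizes: the left part holds i targets.
        i = np.arange(min_size, n - min_size + 1)
        left_sum = np.cumsum(seg)[i - 1]
        left_mean = left_sum / i
        right_mean = (seg.sum() - left_sum) / (n - i)
        # Difference in means, scaled by its standard error.
        stat = np.abs(left_mean - right_mean) * np.sqrt(i * (n - i) / n) / sigma
        best = int(stat.argmax())
        if stat[best] > threshold:
            cut = lo + int(i[best])
            breakpoints.append(cut)
            recurse(lo, cut)
            recurse(cut, hi)

    recurse(0, len(values))
    edges = [0] + sorted(breakpoints) + [len(values)]
    return [(a, b, float(values[a:b].mean())) for a, b in zip(edges[:-1], edges[1:])]
```

Because `min_size` forbids very short segments, single-target events are absorbed into their neighbors, which mirrors the limitation noted above.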
Stay tuned for Part IV of our Secondary Analysis 2.0 blog series where we will show some examples of how our CNV caller, VS-CNV, identifies CNVs.