Cardiac Structural and Sarcomere Genes Associated With Cardiomyopathy Exhibit Marked Intolerance of Genetic Variation
Background—The clinical significance of variants in genes associated with inherited cardiomyopathies can be difficult to determine because of uncertainty regarding population genetic variation and a surprising amount of tolerance of the genome even to loss-of-function variants. We hypothesized that genes associated with cardiomyopathy might be particularly resistant to the accumulation of genetic variation.
Methods and Results—We analyzed the rates of single nucleotide genetic variation in all known genes from the exomes of >5000 individuals from the National Heart, Lung, and Blood Institute’s Exome Sequencing Project, as well as the rates of structural variation from the Database of Genomic Variants. Most variants were rare, with over half unique to 1 individual. Cardiomyopathy-associated genes exhibited a rate of nonsense variants about 96.1% lower than that of other Mendelian disease genes. We tested the ability of in silico algorithms to distinguish between a set of variants in MYBPC3, MYH7, and TNNT2 with strong evidence for pathogenicity and variants from the Exome Sequencing Project data. Algorithms based on conservation at the nucleotide level (genomic evolutionary rate profiling, PhastCons) did not perform as well as amino acid-level prediction algorithms (Polyphen-2, SIFT). Variants with strong evidence for disease causality were found in the Exome Sequencing Project data at prevalence higher than expected.
Conclusions—Genes associated with cardiomyopathy carry very low rates of population variation. The existence in population data of variants with strong evidence for pathogenicity suggests that even for Mendelian disease genetics, a probabilistic weighting of multiple variants may be preferred over the single gene causality model.
New DNA sequencing technologies are poised to transform the genetic evaluation of patients. Soon, the availability of genetic information will no longer be a barrier to our understanding of the genetic basis of disease. Rather, our ability to understand and interpret the data will be paramount. The interpretation of clinical genetic testing is a complex process that requires an appreciation of factors establishing causality as well as a detailed understanding of the tolerated genetic variation present in human genomes of different ethnicities. Until recently, much of the genetic variation in human populations was unknown. With large-scale population sequencing projects such as the 1000 Genomes Project,1 the true extent of this variation is now becoming clear. Indeed, recent analyses indicate a surprising prevalence of tolerated genetic variation.2–4
Clinical genetic testing is increasingly available for conditions such as hypertrophic cardiomyopathy, where it is used for predictive family testing, and long QT syndrome, where it may alter management as well as affect family screening.5–7 The yield from genetic testing, however, can be variable. Evidence for or against a variant’s role is assembled from previous reports in the literature; cosegregation; the likelihood that the variant disrupts the reading frame (weighted more toward nonsense variants, small insertion–deletion variants, or splice site variants); and the algorithmic predictions based on conservation, constraint, or protein motif disruption. Despite such resources, a large number of variants found through clinical genetic testing remain of unclear significance. Greatly lacking is knowledge of the population genetic variation in these and other genes, which is needed for the interpretation of variants not just in Mendelian diseases, but also for common disease risk assessment8,9 and pharmacogenomics.10–12
One recent project to catalog population-scale single-nucleotide variant data has been the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP).13,14 This large-scale effort is aimed at sequencing the exome, consisting of the protein coding regions (exons) of the human genome, from members of several different cohorts followed throughout the country for the purpose of defining the genetic components of complex diseases. In contrast with the 1000 Genomes study, which has low coverage of hundreds of genomes, the NHLBI exomes study has high-coverage (average >100×), high-quality sequencing data for >5000 individuals of white and black ethnicity. Thus, it represents a valuable comparison data set for variants thought to cause monogenic Mendelian disease. One limitation of both of these data sets, however, is the absence of structural variants. These may be particularly important because of their tendency to disrupt the reading frame. The Database of Genomic Variants (DGV)15 is a curated repository of structural variation (consisting of insertions, deletions, and copy number variants), which serves a similar purpose as the above for structural variation.
Using these sources of population genetic variation, we sought to characterize the tolerance of the human genome to variation in genes associated with Mendelian diseases with a specific focus on those that have been associated with inherited cardiomyopathy.
National Heart, Lung, and Blood Institute Exome Sequencing Project Data
Data from the NHLBI ESP5400 data set were accessed on December 12, 2011, and downloaded for analysis. These data are the accumulation of variants from the exomes of 5379 individuals from multiple cohorts of the ESP, including the Women’s Health Initiative, Framingham Heart Study, Jackson Heart Study, Multi-Ethnic Study of Atherosclerosis, Atherosclerosis Risk in Communities, Coronary Artery Risk Development in Young Adults, Cardiovascular Health Study, Genomic Research on Asthma in the African Diaspora, Lung Health Study, Pulmonary Arterial Hypertension population, Acute Lung Injury cohort, and Cystic Fibrosis cohort. The primary purpose of the ESP is to sequence the exomes of a large number of individuals selected for the extremes of primarily complex traits from these cohorts. Although these exomes may not represent a true sample of the general population, they do represent a phenotyped cohort that is unlikely to be enriched for Mendelian disease, with the possible exception of cystic fibrosis. Resulting single-nucleotide variant calls were filtered for depth and base call thresholds and were annotated for quality using a support vector machine algorithm by the NHLBI ESP data analysis group. Only calls that passed all quality filters were used for downstream analysis. Further information regarding alignment, variant calls, and filtering, as well as the entirety of these data, is available at http://evs.gs.washington.edu/EVS/.
1000 Genomes Data
Database of Genomic Variants
For evaluation of structural variation, the November 2010 data release from the Database of Genomic Variants (http://projects.tcag.ca/variation), aligned to the hg19 version of the human genome, was accessed. This includes data from 42 separate studies evaluating for structural variation involving segments of DNA >1 kb, as well as smaller insertions–deletions (indels). The data are collected from small individual genome and population-level studies without known enrichment for disease.
Gene annotation data were accessed from the Online Mendelian Inheritance in Man (OMIM) database16 (http://www.ncbi.nlm.nih.gov/omim). All genes as annotated in the NCBI Reference Sequence Database (RefSeq)17 via the University of California, Santa Cruz (UCSC) Genome Browser18 (including alternate isoforms) were divided into subgroups by OMIM annotation and literature review according to their known association with (1) cardiomyopathy, specifically hypertrophic cardiomyopathy (HCM) or dilated cardiomyopathy (DCM); (2) any other Mendelian disease; or (3) neither of the above. After accounting for alternate isoforms, there were 120 isoforms of 46 separate genes associated with inherited cardiomyopathy (online-only Data Supplement Table I), 5764 isoforms of 2831 separate genes with other Mendelian disease association, and 25 437 isoforms of 16 102 separate genes without known Mendelian disease association for which variant data from the ESP5400 data set were available.
Analysis of Population Variation Data
Variants from the ESP5400 data set were grouped by gene into the 3 categories previously described, and minor allele frequencies for each variant were extracted. Variant subtypes were analyzed by predicted functional effect (synonymous, missense, nonsense, and splice), and the sum of minor allele frequencies across known isoforms was used to derive a raw expected number of variants per type per transcript. For synonymous, missense, and nonsense variants, this number was then normalized for transcript length based on data from RefSeq. For splice site variants, this number was normalized by the number of known exons per transcript.
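The normalization described above can be sketched as follows. This is a minimal illustration, not the authors' R code: the function name, the (class, minor allele frequency) input format, the per-megabase unit, and the example numbers are all assumptions made for the sketch.

```python
from collections import defaultdict

def variant_rates(variants, coding_length_bp, n_exons):
    """Sum minor allele frequencies by variant class and normalize.

    variants: iterable of (variant_class, minor_allele_frequency) pairs
    coding_length_bp: coding length of the transcript in base pairs
    n_exons: number of known exons in the transcript
    """
    maf_sum = defaultdict(float)
    for variant_class, maf in variants:
        maf_sum[variant_class] += maf

    rates = {}
    for variant_class, total in maf_sum.items():
        if variant_class == "splice":
            # Splice-site variants are normalized by exon count.
            rates[variant_class] = total / n_exons
        else:
            # Other classes: expected variants per megabase per chromosome.
            rates[variant_class] = total / (coding_length_bp / 1e6)
    return rates

# Hypothetical transcript: 5 kb of coding sequence spread over 10 exons.
example = [
    ("missense", 0.001),
    ("missense", 0.0001),
    ("nonsense", 0.0001),
    ("splice", 0.0002),
]
rates = variant_rates(example, coding_length_bp=5000, n_exons=10)
```

Because the sum of minor allele frequencies is the expected number of variant alleles on a single chromosome, dividing by coding length gives a rate in the same form as the "variants per megabase of coding region per chromosome" reported in the results.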
In order to evaluate the distribution of small indels (1–50 bp), which were notably absent from the public release of the ESP5400 data set, the subset of called indels from the 1000 Genomes Phase 1 March 2012 release was retrieved and annotated using ANNOVAR19 software against the NCBI RefSeq database17 to determine the subset in coding regions of genes with any disease association as above.
Curation of Known Variants
We manually curated a set of variants in MYH7, MYBPC3, and TNNT2 with strong evidence for causing cardiomyopathy. This set comprised missense variants seen in patients at the Stanford Center for Inherited Cardiovascular Disease from September 2010 to December 2011 and considered likely or very likely disease causing. To supplement this list, we selected variants from a publicly available repository of sarcomeric variants20 with the highest number of independent citations. These variants were then manually curated, and any variants we considered likely or very likely disease causing were included in our high-confidence set. Curation relied on published data, cases from our clinical cohort, and case or control data from commercial genetic testing laboratories. Classification was based on segregation data, presence in multiple unrelated cases, absence in controls, and availability of compelling animal model or in vitro data. Variants were considered very likely disease causing only if strong segregation data and animal model data were available.
Algorithmic Prediction of Variant Pathogenicity
All missense variants from the NHLBI ESP5400 data set as well as variants from our curated list of known pathogenic variants in HCM were scored using genomic evolutionary rate profiling,21 a measure of evolutionary constraint at a nucleotide base level using a rejected substitution score, and PhastCons,22 another measure of evolutionary conservation at the nucleotide base level utilizing multiple sequence alignment, using the SeattleSeq SNP Annotation server (http://snp.gs.washington.edu/SeattleSeqAnnotation). Polyphen-223 (http://genetics.bwh.harvard.edu/pph2/) and SIFT24 (http://sift.jcvi.org/) scores, both predictions of pathogenicity of missense variants based on the effects of the predicted resulting amino acid substitution, were obtained from their respective servers.
Structural Variation Analysis
Structural variants from DGV were grouped on the basis of Mendelian disease association. The average number of structural variants per gene was computed. Because both structural variants and the transcripts they affect vary in size, we normalized by evaluating only structural variants affecting protein coding regions of genes and calculating the percentage of each gene’s coding region (based on transcript length) affected by a deletion in DGV.
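One way to compute the fraction of a coding region affected by deletions is sketched below. It assumes half-open (start, end) coordinates on a single chromosome and non-overlapping coding intervals; the function name and the toy coordinates are hypothetical and do not come from DGV.

```python
def fraction_covered(coding_intervals, deletions):
    """Fraction of coding bases overlapped by at least one deletion.

    Both arguments are lists of half-open (start, end) pairs.
    Coding intervals are assumed not to overlap one another.
    """
    # Merge overlapping deletions so that no base is counted twice.
    merged = []
    for start, end in sorted(deletions):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    total = sum(end - start for start, end in coding_intervals)
    covered = 0
    for c_start, c_end in coding_intervals:
        for d_start, d_end in merged:
            # Length of the overlap, or 0 if the intervals are disjoint.
            covered += max(0, min(c_end, d_end) - max(c_start, d_start))
    return covered / total if total else 0.0

# Toy example: two 100-bp exons and two overlapping deletions.
exons = [(100, 200), (300, 400)]
dels = [(150, 320), (310, 330)]
frac = fraction_covered(exons, dels)  # 80 of 200 coding bases are covered
```

Merging the deletions first matters because the same region is often reported by more than one study, and counting each report separately would overstate coverage.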
All data analysis was carried out using the R Statistical Programming Language. Tests for statistical significance between groups were nonparametric tests without assumption of the underlying distribution. These included the Wilcoxon rank-sum test for direct comparison between 2 groups, the Kruskal–Wallis test for analysis of variance, and Spearman’s rank-order coefficient for correlation. Given that most genes are not in linkage with each other, linkage between genes does not affect the results of the Kruskal–Wallis test significantly.
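As an illustration of one of these rank-based statistics, the sketch below computes Spearman's rank-order correlation with tied values given their average rank. The authors worked in R, where base functions such as `wilcox.test`, `kruskal.test`, and `cor(..., method = "spearman")` cover these tests; which calls they actually used is not stated, and the helper names and data here are assumptions for the example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up per-gene rates for two variant classes.
rho = spearman_rho([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
```

Working on ranks rather than raw rates is what makes the statistic insensitive to the heavily skewed distribution of per-gene variant rates.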
For the analysis of the exonic distribution of pathogenic and ESP variants, Fisher exact test was used. Although Fisher’s test does assume independence of events that may not necessarily be true for the distribution of variants in a gene because of linkage disequilibrium, given the overall rarity of most variants analyzed (almost all <1% minor allele frequency and the majority being unique), it is unlikely that a rare variant in 1 exon significantly affects the probability of a variant in another exon.
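For a single 2×2 comparison, the Fisher exact test can be sketched as below. This is a simplification: the analysis above compares the distribution across all exons of a gene, which is a larger table. The function name and the example counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )

# Hypothetical counts: rows are pathogenic vs population variants,
# columns are "inside a given exon set" vs "elsewhere in the gene".
p_example = fisher_exact_two_sided(10, 16, 0, 50)
```

The test conditions on the table margins, so it stays valid for the small counts typical of curated variant lists, where a chi-square approximation would not.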
Most Genetic Variation Is Rare
Most variants in the population data were not shared between many individuals. Private variants, those that were found only in 1 person, were abundant. Out of the 9974 total variants called in the NHLBI exomes distributed among 46 separate cardiomyopathy-associated genes, 9103 (91%) had minor allele frequencies <1%. Of these rare variants, 5448 (60%) were private. This predominance of rare variants was almost identical in other genes, regardless of whether they were associated with Mendelian disease. Common variants (minor allele frequency >5%) comprised only 5% of all genetic variation in the coding regions of human genes.
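The proportions quoted for the cardiomyopathy-associated genes can be reproduced directly from the counts in this paragraph. The variable names are illustrative; the counts are those reported above.

```python
total_variants = 9974    # variants called in the 46 cardiomyopathy-associated genes
rare_variants = 9103     # minor allele frequency < 1%
private_variants = 5448  # found in only 1 individual

rare_share = rare_variants / total_variants            # share of all variants that are rare
private_share_of_rare = private_variants / rare_variants
private_share_of_all = private_variants / total_variants

print(f"rare: {rare_share:.0%}")
print(f"private among rare: {private_share_of_rare:.0%}")
print(f"private among all: {private_share_of_all:.0%}")
```

The last figure, about 55%, is what underlies the statement that more than half of all variants in these genes were unique to 1 individual.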
We found many genes for which a large amount of genetic variation was not only expected, but also likely serves a critical purpose. Among Mendelian disease–associated genes, the 5 with the highest rates of missense variation were all human leukocyte antigen loci (online-only Data Supplement Table II), where high rates of polymorphism are thought to be selectively maintained.25 Another well-recognized gene locus with very high missense variation was the ABO blood group locus. Among non-Mendelian disease–associated genes, those with the most variation included many of the olfactory receptor genes, consistent with the survival advantage of a sophisticated sensing system for environmental odorant molecules.26
Missense and nonsense variant rates did not seem correlated when looking across all genes (Spearman’s ρ = 0.36). This remained true when looking at the subset of genes with Mendelian disease association or the subset without Mendelian disease association.
Mendelian Disease Genes Exhibit Lower Rates of Genetic Variation
We found significantly lower levels of variation in genes associated with Mendelian disease as compared with genes without a known association (Table 1). In general, this reduction was much stronger for types of genetic variation that would be predicted to have more impact on the resulting protein product, such as splice site or nonsense variants. Mendelian disease genes had a 67.3% lower rate of nonsense variants as compared with genes without known disease association (P=9.6×10^-6). These variants were even rarer in cardiomyopathy-associated genes (Figure 1), which exhibited a 98.7% lower nonsense variant rate as compared with nondisease-associated genes and a 96.1% lower rate as compared with the remaining Mendelian disease–associated genes (P=5.7×10^-7). Lower variant rates were also seen for both missense and splice site variants. Interestingly, this pattern was reversed for synonymous variation, with cardiomyopathy-associated genes having slightly higher rates of variation (116.4 variants per megabase of coding region per chromosome in cardiomyopathy genes versus 90.8 and 95.1 variants per megabase of coding region per chromosome for non-OMIM and OMIM genes, respectively; P=2.7×10^-3).
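The three reported reductions in nonsense variant rate are mutually consistent, as a short calculation shows. It treats the reductions as multiplicative and ignores the small difference between "all Mendelian" and "remaining Mendelian" genes, so it is an approximate check rather than a reanalysis.

```python
# Reported relative reductions in nonsense variant rate.
mendelian_vs_nondisease = 0.673      # Mendelian genes vs genes without disease association
cardiomyopathy_vs_mendelian = 0.961  # cardiomyopathy genes vs remaining Mendelian genes

# Fraction of the non-disease rate that remains after both reductions.
remaining = (1 - mendelian_vs_nondisease) * (1 - cardiomyopathy_vs_mendelian)
cardiomyopathy_vs_nondisease = 1 - remaining
print(f"{cardiomyopathy_vs_nondisease:.1%}")  # 98.7%

# Synonymous variation runs the other way: ratio of reported rates per megabase.
synonymous_vs_non_omim = 116.4 / 90.8
synonymous_vs_omim = 116.4 / 95.1
```

On the same footing, the synonymous rate in cardiomyopathy genes is roughly 28% higher than in non-OMIM genes and roughly 22% higher than in OMIM genes.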
Nonsense Variants Are Extremely Rare in Cardiac Structural and Sarcomere Genes
Single-nucleotide variants thought to have the most effect on protein function are ones that result in a premature stop codon, that is, nonsense variants. We looked at cardiomyopathy-associated genes in the NHLBI exome data to evaluate for the overall prevalence of this type of variation in a population without known inherited cardiomyopathy. Overall, we found that nonsense variants were extremely rare in these genes. In fact, in the subset of genes that are routinely sequenced for clinical purposes in HCM, we found only 1 nonsense variant each in MYH7 and MYBPC3. Nonsense variants were completely absent in the sarcomeric genes ACTC1, TNNT2, TNNI3, MYL2, MYL3, and TPM1. Although the nonsense variant in MYH7 has not been reported previously, the nonsense variant found in MYBPC3 (p.Trp1214Ter) has been associated with hypertrophic cardiomyopathy in 1 published report in an Asian Indian population.27
Among cardiomyopathy-associated genes, the gene with the greatest number of nonsense variants in the ESP5400 exomes data was the very large gene titin (TTN), which has been implicated in familial DCM. This may be largely due to its immense size, as the coding region of titin consists of upward of 100 kb. In total, we noted 23 predicted nonsense variants in titin in the NHLBI exome data. The majority of these nonsense variants seemed to be distributed evenly throughout the length of the gene, although 2 notable clusters of nonsense variants were found near the 5′ end of the gene (Figure 2). This is in direct contrast to a recent report of a high burden of variants in the A band of the titin protein (corresponding to a group of exons near the 3′ end of the transcript) associated with DCM.28 Both clusters of nonsense variants in our analysis were in exons that are specific to the novex alternate splice isoforms of titin, the first in the terminal exon (exon 46) of the novex-3 isoform (NM_133379) and the second in exon 44 of the novex-2 isoform (NM_133437). Neither of these is the major cardiac isoform of titin, which may explain why nonsense variants in these regions may be more tolerated.
In contrast, DMD, which has been implicated in Duchenne and Becker muscular dystrophy as well as X-linked familial cardiomyopathy,29,30 was noted to manifest an extremely low rate of nonsense variants despite its enormous size. Of all human genes, DMD spans the largest region of the genome: encompassing 2.4 million bases, with a coding region consisting of about 14 kb spread over >70 exons. The NHLBI data set, however, contained only 1 predicted nonsense variant within this gene.
Prediction of Pathogenicity of Missense Variants Remains Challenging
We collected 46 variants, 40 of which were missense, with particularly strong evidence of causality from 3 genes most often found to be causal in HCM (MYBPC3, MYH7, and TNNT2; online-only Data Supplement Table III). Given a large amount of ambiguity over the effects of missense variants in the genome, we compared the missense variants from this pathogenic list to missense variants from the NHLBI exome data within the same genes. These 40 pathogenic missense variants were generally located in regions within these 3 genes that were notable for very low variant frequencies in the population data, suggesting that these are regions with vital functions that do not tolerate high rates of variation (Figure 3).
Furthermore, 10/26 of the pathogenic missense variants in MYH7 and 6/10 of the pathogenic missense variants in TNNT2 were found in exons that were notable for a complete absence of nonsynonymous likely benign variation (online-only Data Supplement Table IV). These exons in MYH7 (exons 6, 7, 9, 13, and 19 of NM_000257) and TNNT2 (exon 10 of NM_000364) thus likely encode critical functional domains in the resultant peptide. In support of this, the above noted exons in MYH7 all encode for portions of the functional head and neck domains.31 In addition, the above-mentioned exon in TNNT2 encodes a portion of a tropomyosin-binding site, with induced variants in this exon previously shown to strongly reduce binding efficacy.32,33 In general, exonic distribution was strikingly different between the pathogenic variants and ESP5400 variants in MYH7 (P=.0059) and TNNT2 (P=.013). This difference was not statistically significant in MYBPC3, which may be because of the low number of pathogenic missense variants in this gene in our collection, consistent with reports that the majority of disease-causing variants in this gene tend to be frameshift, splice, or nonsense variants rather than missense.34,35
Of note, 4 of the 46 variants with good evidence of pathogenicity were present in the NHLBI exome data. The individual frequencies of these variants were very low: 3 were found in only 1 individual each, and 1 variant in TNNT2, p.Arg278Cys, was found in 6 individuals in the NHLBI exome cohort. No phenotype information was available to us for these individuals. These variants were removed from the NHLBI ESP variant list for any further analysis.
We used widely accepted variant classification algorithms to predict the pathogenicity of missense variants. We found the evolutionary constraint-based algorithms genomic evolutionary rate profiling and PhastCons to be poorly predictive of variant pathogenicity in these data. Notably, genomic evolutionary rate profiling scores seemed on the whole to be higher in the NHLBI ESP variant set (Figure 4), the opposite of what would be expected. Although PhastCons predicted scores of >0.95 (max score of 1) for all the variants in our curated causative variant list, the majority of presumably tolerated missense variants (67%) from the NHLBI exome data set were also noted to have a similarly high PhastCons score, resulting in a c-statistic for classification of 0.52, akin to no discriminatory power (Figure 5).
The use of algorithms based on amino acid substitution gave much better results. SIFT had modest discriminatory power with a c-statistic of 0.70. Polyphen-2, which also uses information about peptide structure and interaction, performed the best with a c-statistic of 0.77. It should be noted, however, that Polyphen-2 is based on a machine-learning algorithm that was trained on variants that may have included some of those from our curated list.
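The c-statistic used to compare these algorithms is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen population variant, with ties counted as one half. A minimal sketch follows; the function name and the toy scores are assumptions, and the scores are taken to be oriented so that higher means more likely pathogenic (a SIFT score, where low values indicate a damaging substitution, would need to be inverted first, for example as 1 minus the score).

```python
def c_statistic(pathogenic_scores, population_scores):
    """Area under the ROC curve computed by direct pairwise comparison."""
    wins = 0.0
    for p in pathogenic_scores:
        for q in population_scores:
            if p > q:
                wins += 1.0   # pathogenic variant ranked higher
            elif p == q:
                wins += 0.5   # ties contribute half a win
    return wins / (len(pathogenic_scores) * len(population_scores))

# Toy scores only, not taken from the study data.
auc = c_statistic([0.9, 0.8], [0.1, 0.8])  # 3.5 wins out of 4 pairs
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation, which is why the 0.52 reported for PhastCons is described as akin to no discriminatory power.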
Cardiomyopathy Genes Exhibit Less Structural Variation
We attempted to recapitulate these findings in other types of genetic variation by evaluating the distribution of small indels in data from the 1000 Genomes Project. There were notably only 5969 indels from this data set in coding regions, of which 868 were in Mendelian disease–associated genes and 26 were in cardiomyopathy-associated genes. These figures gave total rates of 17 indels per 1000 exons in non-Mendelian disease genes, 10 indels per 1000 exons in Mendelian disease genes, and 9 indels per 1000 exons in cardiomyopathy genes. However, the overall low number of these types of variants in these data limited any further statistical analysis.
We then used data from DGV to query on a per gene basis the number of all structural variants that have been reported as well as the overall extent of the coding region of genes that are covered by known structural variants. We found that the total number, per gene, of all structural variants and only structural variants affecting coding regions did not differ between genes associated with Mendelian disease and those that are not (Table 2). However, we did note a 53% reduction of coding region covered by reported deletion type structural variants in genes that are specifically associated with cardiomyopathy as compared with genes without Mendelian disease association (P=0.02).
Recent studies have suggested a surprising rate of tolerance to genetic variation within the human genome. Here, we show that this tolerance does not extend to genes associated with cardiomyopathy, especially structural and sarcomere genes. This observation fits with a systems model of organism function where some genes are disproportionately intolerant of variation because their function has less redundancy. In addition, in describing population variation data for these genes, we note the presence of a surprising number of disease-associated variants in a population without enrichment for cardiomyopathy.
In contrast to the high rate of genetic variation found in genes dependent on diversity for effective function, such as the olfactory receptor loci, we found that population genetic variation, especially variation expected to affect protein function, was rare in Mendelian disease–associated genes. We hypothesized that genes essential for cardiac function might be among the genes most intolerant of variation. Not only was this the case, but the strength of these associations was also found to be dependent on the severity of the predicted alteration of protein function, exemplified by the extreme rarity of nonsense variants in cardiomyopathy-specific genes. These findings extended to structural variants as well, specifically with regard to the percentage of the coding transcript that is involved in deletion-type structural variants in individuals without disease.
One strength of our study is in the practical application to clinical genetic testing, which relies on data from unaffected individuals to judge the likely pathogenicity of novel variants. Because our understanding of human genetic variation has improved, it has become clear that even rare genetic variation can be normal and well tolerated, representing a challenge in linking genotypes to phenotypes. One recent study has estimated, using 1000 Genomes data, that the average person has as many as 100 loss of function variants per genome.2 This population level of variation has implications for the interpretation of results of clinical genetic testing. However, our results indicate that this variation is not evenly distributed, and genes for which associations with Mendelian disease have been established have much lower levels of such variation, likely representing the effects of purifying selection.
Why genes associated with cardiomyopathy show even lower rates of genetic variation than other Mendelian disease–associated genes is not self-evident, but many possibilities exist as to why this may be the case. One study has suggested that Mendelian disease genes may not necessarily be the hubs of gene networks36 (because to manifest disease, a variant cannot be fatal). However, genes associated with cardiomyopathy may be an exception given their essential functions within the sarcomere and the heart’s unique position in serving all other organs. Variants in these highly structured peptides with molecular motor functions that operate constantly throughout life would be expected to be heavily selected against in the general population. The finding of a slight increase in synonymous variants in cardiomyopathy-associated genes is unexpected. It is possible that this represents a decrease in codon use bias in these genes relative to others, which may in turn reflect a decreased need of efficiency of translation of these structural proteins, but why this may be the case is not evident.
One intriguing finding in cardiomyopathy genetics is the contrast between disease-causing variants found in MYBPC3 and those in MYH7, the 2 genes with the highest number of HCM-causing variants. Indeed, the high rate of nonsense pathogenic variants found in MYBPC337 is in contrast with the almost universal missense nature of those found in MYH7. The extreme rarity of nonsense variants in cardiomyopathy genes in the data presented here suggests that a high probability for pathogenicity for such variants found in MYBPC3 in patients would be appropriate. The absence of disease-causing nonsense variants in MYH7 is curious. It may be that MYH7 haploinsufficiency is not tolerated at all. We do note that predisposition of genes toward 1 type of variation versus another is not uncommon given the poor correlation between rates of different types of variation noted in our data, which may be driven by the resulting effects of such variants (dominant-negative effects in missense variants versus haploinsufficiency states in nonsense variants).
Missense variants remain among the most difficult to interpret in a clinical context. Without a large number of affected and unaffected family members to show cosegregation of variant with disease, it is often difficult to determine if a missense variant truly is pathogenic. Much has been made of the use of measures of evolutionary conservation to prioritize missense variants. Our analysis shows that although these measures can help exclude variants at positions in the genome that do not show conservation, they are unable to efficiently discriminate between likely causative and noncausative variants. Although evolutionary conservation at the nucleotide base level seems to be a necessary characteristic of a pathogenic variant, it is not sufficient in and of itself to classify a variant as causative. Algorithms using the predicted effects of the resulting amino acid substitution showed much better classification potential, although this may in part reflect the use of cardiomyopathy-causative variants as training data for these classifiers.
Our analysis also confirms recent evidence that the overwhelming majority of variation in the human genome is rare (ie, affecting <1% of the population). Interestingly, more than half of variants analyzed were private (found in only 1 person). In fact, taking all 8 commonly sequenced genes for HCM together (ACTC1, TNNT2, TNNI3, MYL2, MYL3, TPM1, MYH7, and MYBPC3), we found 159 private missense variants, 3 private splice site variants, and 2 private nonsense variants for 164 private variants that would have the potential to affect the resulting protein. Assuming that none of these variants was found in the same person, this would imply that 3% of a general population sample who were to be sequenced today would have candidate variants not seen previously on a small HCM disease genetic testing panel. This highlights the continued importance of cosegregation and other supporting data in deciding whether a novel variant is causative of disease.
It was also surprising to find 4 of 46 gold standard pathogenic variants present in this population sample, with a total pathogenic allele count of 9 among 5379 individuals. These data would imply a background prevalence of variants believed causative of HCM of approximately 0.2% (based on 46 variants in 3 genes and, thus, likely a substantial underestimate). This is much higher than expected in a general population: the prevalence of HCM itself is estimated to be 0.2% in multiple populations,38–40 and the yield of genetic testing is far from 100%, so carriers of any small set of causal variants should be rarer than the disease. This result is consistent with other recently published studies finding higher than expected prevalence of genetic variants associated with other Mendelian cardiovascular diseases, such as familial DCM14 and long QT syndrome,41 though the burden of evidence of pathogenicity for variants in these studies was variable.
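The two population frequencies discussed in this and the preceding paragraph follow from simple counts. The sketch below reproduces them under the stated assumptions that every private variant and every pathogenic allele is carried by a different person; the variable names are illustrative.

```python
individuals = 5379

# Private, potentially protein-altering variants in the 8 commonly sequenced HCM genes.
private_missense, private_splice, private_nonsense = 159, 3, 2
private_total = private_missense + private_splice + private_nonsense

# Upper bound on carriers: assumes no 2 private variants occur in the same person.
novel_candidate_fraction = private_total / individuals

# Curated pathogenic alleles observed in the same cohort.
pathogenic_alleles = 9
# Assumes each allele is carried by a different, heterozygous individual.
carrier_fraction = pathogenic_alleles / individuals

print(f"{private_total} private variants: {novel_candidate_fraction:.1%}")  # 3.0%
print(f"{pathogenic_alleles} pathogenic alleles: {carrier_fraction:.2%}")   # 0.17%
```

The second figure rounds to the approximately 0.2% quoted above, which already matches the estimated prevalence of HCM even though it reflects only 46 variants in 3 genes.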
Although it remains possible that some individuals within these cohorts may harbor undiagnosed HCM given that phenotype data for these individuals are not publicly available, the genetic prevalence rate would still be expected to be much lower than that observed in these data. Based on these genetic variant prevalence data, current estimates of HCM prevalence would have to be too low by a factor of at least 2 for our current models of HCM disease inheritance to be true. Given that these estimates of HCM disease prevalence were based on multimodality screening in diverse populations, it seems likely that some proportion of the variants thought to be causal of HCM under a single gene model cannot be. Alternatively, we posit that the idea of a single gene disorder with variable penetrance is likely an artifact of a limited genomic window, and that what has commonly been perceived as a single gene disorder may in fact be the result of a combination of multiple genetic variants each contributing a portion of the variance, with variants contributing differently in different individuals. Just as some have suggested that a number of rare variants with strong effect size may be the driver of the inherited component in many common diseases,8,42,43 so too might this be the case for what have historically been perceived as monogenic disorders.
Our study has limitations. No individual phenotype data for the cohorts in NHLBI-ESP, 1000 Genomes, or DGV are publicly available, so it is not possible to determine whether individuals carrying variants from our curated set have features of an undiagnosed cardiomyopathy. Although the accumulated set of variants from these 5379 individuals is available, individual exomes cannot be reconstructed, so it is not possible to determine which variants are shared on the same chromosome. The family structure of the individuals within the NHLBI-ESP data was also unknown. It is thus possible that a rare variant could be overrepresented if many members of the same family were sequenced.
In conclusion, using publicly available exome-wide sequencing data from thousands of individuals, we found that genes associated with Mendelian diseases show much lower rates of protein-altering genetic variation, including missense, nonsense, and splice site variation, with an extreme intolerance of variation noted specifically in cardiomyopathy-associated genes. Cardiomyopathy-associated genes specifically showed intolerance to structural variation as well. Nonsense variants in genes that have been recurrently linked to hypertrophic cardiomyopathy were extremely rare, and our results suggest that such variants in these genes found on clinical testing have a very high likelihood of being pathogenic. In contrast, novel missense variants were present in at least 3% of individuals, and thus, the careful interpretation of missense variants found on clinical genetic testing is critical. Current in silico classification schemes for predicting the pathogenicity of missense variants unfortunately have low power in classifying cardiomyopathy variants. Finally, we note a much higher than expected prevalence of variants with strong evidence for pathogenicity. This finding suggests that, using the power of genome sequencing, a new framework for heterogeneous Mendelian disorders such as inherited cardiomyopathies needs to be developed, where variants found in patients and family members are viewed probabilistically on a spectrum from unlikely to likely contributors of variable individual magnitude. Although this model challenges the classic single variant in a single gene disorder view, it may also begin to explain some of the significant variability in disease expression found in family members with the same causal variant.
The authors would like to thank the NHLBI GO Exome Sequencing Project and its ongoing studies, which produced and provided exome variant calls for comparison: the Lung GO Sequencing Project (HL-102923), the WHI Sequencing Project (HL-102924), the Broad GO Sequencing Project (HL-102925), the Seattle GO Sequencing Project (HL-102926), and the Heart GO Sequencing Project (HL-103010).
Sources of Funding
S. Pan is supported by National Institutes of Health grant 5T15LM007033. This work was also supported in part by National Institutes of Health grants DP2OD004613, R01HL105993, and UL1RR029890 (Dr Ashley).
Dr Ashley reports equity and consulting in relation to Personalis Inc. The other authors report no conflicts.
The online-only Data Supplement is available at http://circgenetics.ahajournals.org/lookup/suppl/doi:10.1161/CIRCGENETICS.112.963421/-/DC1.
- Received February 6, 2012.
- Accepted September 21, 2012.
- © 2012 American Heart Association, Inc.
- MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, et al
- Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, et al
- Gersh BJ, Maron BJ, Bonow RO, Dearani JA, Fifer MA, Link MS, et al
- Ackerman MJ, Priori SG, Willems S, Berul C, Brugada R, Calkins H, et al
- Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, et al
- Norton N, Robertson PD, Rieder MJ, Züchner S, Rampersaud E, Martin E, et al
- 16. Online Mendelian Inheritance in Man, OMIM®. Online Mendelian Inheritance in Man. Retrieved December 11, 2011, from http://omim.org
- Pruitt KD, Tatusova T, Klimke W, Maglott DR
- Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, et al
- Wang K, Li M, Hakonarson H
- 20. NHLBI Program for Genomic Applications, Harvard Medical School. Genomics of Cardiovascular Development, Adaptation, and Remodeling. Retrieved January 20, 2012, from http://www.cardiogenomics.org
- Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al
- Richard P, Charron P, Carrier L, Ledeuil C, Cheav T, Pichereau C, et al
- Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL
- Maron BJ, Gardin JM, Flack JM, Gidding SS, Kurosaki TT, Bild DE
Recent studies have revealed a high degree of genetic variation in human populations, some of which would be predicted to cause loss of function of the genes encoded. These studies provide a challenge to the careful interpretation of the results from genetic testing and raise concerns about our ability to distinguish between benign and pathogenic variants. We analyzed data from >5000 participants in the National Heart, Lung, and Blood Institute Exome Sequencing Project. To study tolerance of genetic variation in different genes, we derived rates of genetic variation within (1) genes without Mendelian disease association, (2) genes associated with Mendelian disease, and (3) genes associated with inherited cardiomyopathies. We found that genes associated with Mendelian diseases exhibit markedly lower rates of genetic variation. This was even more marked for genes associated with cardiomyopathy. Nonsense variants were extremely rare in most cardiomyopathy genes, suggesting that when such variants are found, they are likely to be pathogenic. We also compared known pathogenic variants in MYH7, MYBPC3, and TNNT2 with those in the population data. We found neither rarity nor nucleotide evolutionary conservation helpful in distinguishing benign from pathogenic variants in these genes. However, the exon distribution of pathogenic and benign variants in MYH7 and TNNT2 was significantly different. Rates of pathogenic variants in population data were higher than would be anticipated, suggesting that a single gene/variant model may not be sufficient to explain many cases of inherited cardiomyopathy. These findings highlight the continued importance of cosegregation and other supporting data in determining variant pathogenicity.