Multiplexed Targeted Resequencing Identifies Coding and Regulatory Variation Underlying Phenotypic Extremes of High-Density Lipoprotein Cholesterol in Humans
Background: Genome-wide association studies have uncovered common variants at many loci influencing human complex traits, such as high-density lipoprotein cholesterol (HDL-C). However, the contribution of the identified genes is difficult to ascertain from current efforts interrogating common variants with small effects. Thus, there is a pressing need for scalable, cost-effective strategies for uncovering causal variants, many of which may be rare and noncoding.
Methods: Here, we used a molecular inversion probe target capture approach to resequence both coding and regulatory regions at 7 HDL-C–associated loci in 797 individuals with extremely high HDL-C versus 735 low-to-normal HDL-C controls. Our targets included protein-coding regions of GALNT2, APOA5, APOC3, SCARB1, CCDC92, ZNF664, CETP, and LIPG (>9 kb) and proximate noncoding regulatory features (>42 kb).
Results: Exome-wide genotyping in 1114 of the 1532 participants yielded a >90% genotyping concordance rate with molecular inversion probe-identified variants in ≈90% of participants. This approach rediscovered nearly all established genome-wide association studies associations in GALNT2, CETP, and LIPG loci with significant and concordant associations with HDL-C from our phenotypic extremes design at 0.1% of the sample size of lipid genome-wide association studies. In addition, we identified a novel, rare, CETP noncoding variant enriched in the extreme high HDL-C group (P<0.01, score test).
Conclusions: Our targeted resequencing of individuals at the HDL-C phenotypic extremes offers a novel, efficient, and cost-effective approach for identifying rare coding and noncoding variation differences in extreme phenotypes and supports the rationale for applying this methodology to uncover rare variation—particularly noncoding variation—underlying myriad complex traits.
Although genome-wide association studies (GWAS) have elucidated the role of common genetic variation to many human complex traits, such as blood lipids, the role of rare genetic variation remains poorly defined.1 This is especially true for rare noncoding variants, which are not captured by whole-exome sequencing currently being applied to large numbers of participants. One strategy to capture novel variation that may include putatively causal variants is targeted resequencing of genes at candidate loci for lipid traits. Indeed, this approach has been applied to the follow-up of initial GWAS studies for low-density lipoprotein cholesterol and triglycerides.2,3 These efforts have largely sequenced the coding regions of candidate genes, with the goal of identifying protein-altering variants that may have a profound functional impact. However, given that the majority of GWAS-implicated variants are in the noncoding genome,4 the contribution of rare noncoding variants to these traits is underexplored.
Plasma levels of high-density lipoprotein cholesterol (HDL-C) are highly heritable. There are >70 loci significantly associated with HDL-C levels through testing of common variants (minor allele frequency [MAF], >0.05) on genome-wide genotyping arrays.5 However, pinpointing the causal variants and genes from these associated loci is challenging. Current efforts to resolve this have included fine mapping of identified loci to determine causal variants,6 but these methods are limited in that they focus on common single-nucleotide polymorphisms (SNPs) with generally small effect sizes. Given that common SNPs are estimated to explain only a fraction of the heritability of HDL-C levels,7 additional variance may be explained by low frequency (MAF, 0.01–0.05) or rare variation (MAF, <0.01) not yet captured in existing genotyping arrays and imputation reference panels. Furthermore, the identification of rare, causal, noncoding variants with strong effect sizes on HDL-C may help to delineate causal and heritable mechanisms governing HDL metabolism that could directly relate to coronary heart disease risk. One limitation hampering targeted sequencing efforts for the noncoding genome is the relatively poor annotation of functional elements most likely to harbor variants of significance. A related issue is that targeted sequencing efforts are costly and scale with the size of the genomic targets, so methods have largely been developed for reliably amplifying and sequencing coding regions of genes. Thus, there is a pressing need for efficient and scalable method for capturing the noncoding genome to apply to large populations to uncover causal variation underlying complex traits, such as HDL-C.
Here, we investigated the feasibility of targeting the noncoding regions of candidate gene loci to identify rare variants underlying HDL-C phenotypic extreme using a novel, cost-effective approach. We adapt a recently reported target capture method involving molecular inversion probes (MIPs)8,9 utilized for autism spectrum disorder candidate gene sequencing to date. Modifying this method, we show the ability to capture noncoding genomic regions in a cohort of high HDL-C cases versus low HDL-C controls. Our results validate previously reported coding and noncoding SNP associations with HDL-C, identify gene-level associations in these 7 regions with this trait, and also show the promise of expanding large-scale targeted resequencing of noncoding regions via this approach for complex traits.
Materials and Methods
The data, analytic methods, and study materials will be made available to other researchers for purposes of reproducing the results or replicating the analyses reported here. Data may be made available on request to the corresponding authors. All recruitment of human participants of this study was approved by the Institutional Review Board of the Perelman School of Medicine at the University of Pennsylvania, and all participants provided informed consent. Full Materials and Methods are available in the Data Supplement of the article.
Multiplexed Targeted Sequencing Approach and Variants Identified
We sought to apply multiplexed targeted sequencing with the purpose of identifying novel and known noncoding variants underlying HDL-C. In doing so, we also sought to test the hypothesis that noncoding variation at HDL-C loci could underlie extreme HDL phenotypes similarly to coding variation traditionally identified by targeted resequencing approaches to date. Thus, we used MIPs to resequence 7 HDL candidate gene regions (Figure 1) in 1532 participants with either extremely high HDL-C versus low HDL-C controls (Table 1; Materials and Methods in the Data Supplement; Figure I in the Data Supplement). After initial sequencing read quality control, alignment, genotyping, filtering, and principal component analysis-based additional sample filtering (Materials and Methods in the Data Supplement; Tables I through IV in the Data Supplement; Figures II through X in the Data Supplement), a total of 1500 of 1532 original samples remained for further variant analysis.
To validate the variants identified from our MIP sequencing, we genotyped 1114 of the 1532 participants (681 high HDL-C individuals and 433 low HDL-C individuals) on the probe-based Illumina Exome Array. Among the variants genotyped on this array, 38 were within our target regions. We observed a high concordance rate in variant discovery between MIP sequencing and genotyping results, with 32 of 38 SNPs (84.2%) overlapping on the Exome Array called with >90% concordance across all participants, and 987 of 1114 participants (88.6%) demonstrating >90% concordance of all genotyped SNPs (Figure 2).
The final MIP sequencing variant call set contained 1956 SNPs and 689 distinct insertion/deletion events (indels; 78 insertions and 611 deletions) for a total of 2645 unique variants in 1500 samples. Of these, 556 correspond with previously reported variants in dbSNP (v141), suggesting that the remaining 2089 were novel discoveries without any previous annotation (Tables 2 and 3). We found that our MIP sequencing capture approach did not preferentially identify variants of a given annotation across our selected genomic targets (Figure XI in the Data Supplement). After these efforts, the MIP sequencing variants were then tested for association with HDL-C using a framework sensitive to MAF and protein coding status of the different variants (Figure XII in the Data Supplement).
Association of Single Variants From Targeted Sequencing With Extremely High HDL-C
We tested the association of 336 common and low-frequency (MAF, ≥0.01) SNPs and indels identified with high versus low HDL levels and observed 34 alleles at significantly greater frequencies among the high HDL-C participants (P<1.49×10−4, score test; Table 4). Of these, 17 were previously reported by the Global Lipids Genetics Consortium GWAS.7
Replication of HDL-C Associations From GWAS Through MIP Sequencing
In addition to rare, noncoding variants identified from MIP sequencing, we also recovered common variants previously associated with HDL-C through the Global Lipids Genetics Consortium plus MetaboChip GWAS.7 In the Global Lipids Genetics Consortium study, 49 variants that exceeded genome-wide significance (P<5×10−8) in their associations with HDL-C are located in regions that overlap with MIP sequencing targets. We observed all of the 49 variants in the MIP sequencing variant call set and likewise observed all of them at common or low frequencies (MAF, >0.01) in the 1500 samples. A total of 17 of the 49 exceeded an experimental statistical threshold (score test P<1.49×10−4), with an additional 10 that were nominally significant (score test P<0.01; Table V in the Data Supplement; Figures XIII through XVII in the Data Supplement). All of the experiment-wide significant and nominally significant associations we identified were directionally consistent with prior reports of SNPs as those loci with HDL-C levels and with comparable MAFs to those reported for each variant from 1000 Genomes Project (phase 3 v5a, European sample set).10,11 An additional 17 SNPs and 2 indels not previously identified in the Global Lipids Genetics Consortium GWAS were significantly associated with HDL-C (Table VI in the Data Supplement).
Rare, Novel, Noncoding Variants With Nominally Significant Associations With HDL-C
Because of the small sample size of our study, we expected modest power to demonstrate association beyond a reasonable doubt. Thus, we examined variants that exhibited nominally significant associations (P<0.01, score test) with elevated HDL-C and identified 68 such SNPs and indels (Table VII in the Data Supplement). These included 54 noncoding variants (ie, located outside of protein-coding sequence) and 14 coding variants. Among the coding and noncoding variants identified with nominally significant associations were 11 rare variants (MAF, ≤0.01), 6 low-frequency variants (0.01<MAF<0.05), and 8 variants not previously described in Single Nucleotide Polymorphism database. Of the noncoding variants identified, 12 were found to have Combined Annotation Dependent Depletion scores of ≥10, suggestive of deleteriousness to gene expression or function (Table VII in the Data Supplement).12 We evaluated the putative impact of the noncoding variants we identified across our regions by exploring overlap between these SNPs and transcription factor binding sites and microRNA seed sites, which identified multiple common noncoding variants across our loci that overlapped such regulatory features (Table VII in the Data Supplement). Among the noncoding SNPs with potential functional impact on gene expression is a proximal variant 21 bp upstream of the transcription start site of CETP, rs34498052 (chr16:56995814 G>A), that was previously identified in a resequencing study of 68 genes in French Canadian myocardial infarction cases and controls. Although this variant overlaps multiple epigenetic marks from Encyclopedia of DNA Elements, including CpG methylation marks in HepG2 hepatocytes and Human Microvascular Endothelial Cells endothelial cells, it was extremely rare (MAF, 0.001; allele count [AC], 3), which made statistical interpretation challenging because the score test is not intended or calibrated for that end of the frequency spectrum given our sample size. More conservatively, for variants identified with >5 allelic copies among the 1500 participants, we identified a single rare, novel, noncoding SNP in a splice region of the CETP gene (chr16:57005300 G>A) that was nominally associated with high HDL-C (P=0.009, score test; AC, 8).
We also investigated the association of these SNPs with expression of genes as expression quantitative trait loci from the Genotype-Tissue Expression Project (Tables VII and VIII in the Data Supplement).13 Analysis of expression quantitative trait loci across human tissues identified 21 of the 54 noncoding SNPs with at least 1 significant expression quantitative trait loci in a human tissue. Among these are a set of noncoding SNPs at the CCDC92 locus associated with reduced CCDC92 expression and that of other genes in subcutaneous adipose tissues, consistent with the recent identification of a sentinel SNP at this locus in linkage disequilibrium with our identified SNPs that was associated with coronary artery disease and also with decreased CCDC92 expression in the same tissue.14 As another example, we show that another set of SNPs downstream of the LIPG gene are associated with LIPG gene expression in skeletal muscle and skin tissues. These SNPs are in linkage disequilibrium with other GWAS-implicated SNPs downstream of LIPG that we previously showed to reduce endothelial lipase protein levels.15 Thus, our MIP sequencing experiment identified multiple regulatory variants underlying high HDL-C that also correlated with cis-regulatory effects on gene expression across human tissues.
Rare Variant Burden Associations With Extremely High HDL-C
Lastly, we tested the hypothesis that the genomic regions we targeted harbor rare variants that collectively contribute to the relationship of these genes with HDL-C levels. We performed aggregate rare variant burden using a framework that categorized rare variants (MAF, <0.01) on the basis of their coding status, deleteriousness, and genic region (Tables 5 and 6; Figure XII in the Data Supplement). We first identified rare coding variants believed to be nonbenign in their putative functional consequence (n=213), organized them based on their predicted impact on protein function (ie,  disruptive,  disruptive plus missense, or  loss-of-function; see Materials and Methods for definitions), and then tested aggregate rare coding variant burden across all targeted genic regions for each predicted impact category. We found that for each predicted impact category, the collection of all rare coding variants did not exhibit a level of rare variant burden that was significantly associated with HDL-C. Similarly, variant aggregation over the coding regions of the individual gene targets separately (n=8) did not identify any individual region with significant variant burden associated with high versus low HDL-C (collapsing test; Tables 5 and 6).
We then asked whether the burden of rare noncoding variants across all targets contributed to extremely high HDL-C. Because of the fact that a methodological framework for predicting the potential regulatory impact of noncoding variants genome wide has yet to be widely accepted, the rare noncoding variants were not subdivided into putative functional categories like the coding variants described above. Thus, we first analyzed all rare noncoding variants as a single group, which resulted in a variant burden that was not significantly associated with high HDL-C in our cohort (P=0.5; Tables 5 and 6). We next grouped rare noncoding variants by physical genic region (n=10) and performed variant burden analyses separately on each region. This approach identified a collection of 151 rare variants in the APOA4-APOA5 intergenic region that were nominally significantly associated with extremely high HDL-C (P=9.43×10−3, collapsing test; Tables 5 and 6). Within this region, we noted a collection of 3 different indels as multiple alternative alleles at the position chr11:116678249 (hg19). Of these, a rare deletion CAA>C (MAF, 0.003; AC, 7) exhibited nominally significant association with high HDL-C (P=0.0427, score test). The second allele was a common deletion (MAF, 0.06; AC, 138) that was not associated with high HDL-C (P=0.75, score test). The third allele was the same common (MAF, 0.26; AC, 605) yet previously unreported insertion of CAA>CAAA at chr11:116678249 that was significantly associated with high HDL-C (P=8.9×10−4, score test) in the single-variant analysis.
We hypothesized that these particular common alternative alleles were driving the nominally significant rare variant burden association signal for the APOA4-APOA5 intergenic region. To test this, we removed it (and all other nonrare variants at multiallelic sites) and reassessed rare variant burden and found a complete attenuation of the association (P=0.43; Tables 5 and 6), thus suggesting that the originally significant association of the cluster of APOA4-APOA5 intergenic variants with HDL-C was driven by common alleles alone.
Translating GWAS trait- and disease-associated common variants to bona fide causal variants, genes, and biological mechanisms has been a major challenge for human genetics. This is due in part to small effect sizes of GWAS variants, and thus resequencing of candidate genes at GWAS loci at the phenotypic extremes of complex traits has become a leading approach to identify rare variants with larger effects. To date, this approach has been applied to coding regions of GWAS candidate genes, yet coding variants account for only a small fraction (≈11%) of all variants tagged complex trait GWAS studies,16 underscoring the need to search the noncoding genome for rare, putatively causal variants. Here, we utilized an inexpensive, modular, and scalable targeted sequencing approach for identifying rare noncoding variants in candidate genes influencing HDL-C—a complex trait with 72 associated loci from GWAS.7 Our proof-of-principle resequencing study of 7 candidate gene regions in 797 extremely high HDL-C versus 735 low HDL-C participants rediscovered and validated nearly all prior GWAS-implicated tag SNPs and revealed ≈2000 variants in noncoding regions of targets, including rare, novel noncoding variants that were nominally associated with HDL-C in our study. As such, our findings provide one of the first applications of a multiplexed targeted resequencing study of noncoding variants across multiple loci at the phenotypic extremes of a complex trait.
We selected 7 candidate loci for targeted sequencing of coding and noncoding regions in our cohorts (Figure 1). Four of the targeted loci, APOC3, SCARB1, CETP, and LIPG, have known roles in HDL metabolism for which loss of function has been shown to elevate HDL-C in humans.17 To explore the hypothesis that rare noncoding variants may underlie GWAS-implicated loci for HDL-C levels, we selected 3 HDL-C loci newly identified through GWAS, GALNT2, SBNO1, and the CCDC92-ZNF664 region for our targeted sequencing. Some sequencing efforts have suggested that GALNT2 coding variants segregate with elevated HDL-C, whereas a recent report from our group found an opposite result for 2 rare coding variants.18 Similarly, the contribution of either coding or noncoding rare variants at the CCDC92-ZNF664 and SBNO1 loci to HDL metabolism remains completely unexplored. Therefore, we evaluated these loci for rare coding and noncoding variants to better determine the directional relationship of these genes with HDL-C beyond the initial common variant associations.
We rediscovered previously implicated variants in our cohort, along with the initial discovery of a few novel candidates requiring statistical support. Most notably, we found significant or nominally significant associations for a majority (55%) of GWAS-implicated HDL-C variants overlapping our targeted regions with consistent directionality to prior associations of these variants. However, we replicated these associations at <1/100th the cohort size of the most recent GWAS for HDL-C (188 577 participants7 versus 1532 participants in our study) through our phenotypic extremes design. We also identified 3 rare (MAF, <0.01) or low-frequency (MAF, <0.05) nonsynonymous coding variants associated with HDL-C levels with directionalities consistent with previous reports (CETP Ala390Pro,19 CETP Arg468Gln,20 and LIPG Asn396Ser21). Collectively, these findings support the utility of candidate gene and noncoding locus resequencing at the extremes of a continuous trait distribution to enrich for trait-associated alleles, which may allow ascertainment of genetic associations in smaller populations than historical sizes for complex trait GWAS, such as understudied ethnicities and population isolates.
Our study also has important methodological implications for future targeted resequencing efforts. To date, MIP sequencing has been applied to targeted sequencing of coding regions of candidate genes with a sample preparation cost of <$1 per participant.8 Our use of MIPs to interrogate noncoding regions of HDL-C candidate genes represents one of the first applications of this methodology for regulatory DNA regions. Our sequencing efforts were completed at a comparable cost with the prior applications, with similar target-coverage depths across coding and noncoding targets. Additionally, our modified dual-barcoding approach allowed us to multiplex all 1532 samples for sequencing in a single lane of an Illumina HiSeq2500 sequencing run with a median base coverage per participant of 110-fold—a robust depth for novel and rare variant identification at a sequencing cost of ≈$2000. Thus, our study highlights the utility of an MIP-based approach for sequencing of noncoding regions at a low per-sample cost.
Among the variants we identified were multiple noncoding variants in CETP and LIPG associated with high HDL-C in our cohort. CETP is a circulating regulator of HDL metabolism, and genetic inactivation of CETP is associated with increased HDL-C in humans.17 Similarly, LIPG, which encodes an enzyme catabolizing HDL called endothelial lipase, contributes to heritable differences in HDL-C. LIPG loss-of-function genetic variants are causal contributors to elevated HDL-C in humans.18,22 Here, we expanded the allelic spectrum of rare noncoding variation in these 2 HDL-C modulating genes contributing to high HDL-C levels in humans. In both cases, the frequency of these variants and our limited sample size requires further analysis in follow-up cohorts to demonstrate conclusive association of these rare alleles with HDL-C.
Our current study has limitations, which serve as opportunities for further study. First, our total cohort size of 1532 participants limits both the ascertainment of the full spectrum of rare variants that may underlie extremely high HDL-C levels, as well as the power of our statistical tests of common variant association and rare variant burden. Second, our population of high and low HDL-C participants was largely of European ancestry, thus limiting our ability to extrapolate the variants discovered to other populations. Additionally, our ability to validate variant discovery in this population for our MIP sequencing-identified variants was limited to SNPs largely in coding regions as we utilized the Exome Array for confirmatory genotyping; thus we were restricted in our ability to validate indels overlapping these regions. Also, we used conventional strategies for rare-variant grouping, which focused on gene-level aggregation. However, for noncoding sequences, it was not obvious which variant grouping strategy is optimally powered, which remains an open question in the field. Finally, because we selected a finite sequence of noncoding genome with genomic annotations that we believed a priori would be functional and lipid related (eg, enhancer marks in liver) and focused on annotations in HepG2 cell lines for this selection, it remains possible that rare-variant burden either exists in other sequences we did not target here and that such selection could have differed if we had used alternative cell sources for regulatory annotation, such as primary human hepatocytes.
In conclusion, our MIP-based targeted sequencing approach has demonstrated the successful capture of noncoding regions for the discovery of rare, noncoding variants associated with HDL-C in a cohort of extremely high versus low HDL-C participants. Though efforts to better identify the spectrum of noncoding variants underlying complex traits have initiated, including denser genotyping of noncoding variants20 and whole-genome sequencing,22 these approaches remain expensive and not readily applicable to the study of large populations or large case-control designs. Our results offer a scalable and cost-effective targeted approach that complement future, larger candidate loci resequencing efforts for the discovery of putatively causal noncoding variants. These efforts, coupled with appropriate functional investigation of identified variants for impact on gene regulation, may substantially refine the causal genes at loci implicated from GWAS studies and also help further explain the missing heritability underlying complex traits, such as HDL-C.
The authors would like to thank J. Shendure, B. Martin, and J. Schug for their expertise and guidance in the initiation and optimization of our molecular inversion probe sequencing pipeline.
Sources of Funding
This work is supported by the National Institutes of Health grants F30 HL124967 (Dr Khetarpal), R37HL055323, R01 HL111398 and R01 HL089309 (Dr Rader), and R01DK101478 (Dr Voight).
Dr Voight currently serves as an associate editor at Circulation: Genomics and Precision Medicine. The other authors report no conflicts.
↵* Drs Khetarpal and Babb contributed equally to this work as co-first authors.
Guest Editor for this article was Christopher Semsarian, MBBS, PhD, MPH.
The Data Supplement is available at http://circgenetics.ahajournals.org/lookup/suppl/doi:10.1161/CIRCGEN.117.002070/-/DC1.
- Received December 18, 2017.
- Accepted May 9, 2018.
- © 2018 American Heart Association, Inc.
- Bomba L,
- et al
- Patel AP
- Liu DJ
- O’Roak BJ,
- Vives L,
- Fu W,
- Egertson JD,
- Stanaway IB,
- Phelps IG,
- et al
- Zhao W
- Maurano MT
- Larach DB,
- et al
- Vitali C,
- et al