Multiple Associated Variants Increase the Heritability Explained for Plasma Lipids and Coronary Artery Disease
Background—Plasma lipid levels as well as coronary artery disease (CAD) have been shown to be highly heritable with estimates ranging from 40% to 60%. However, top variants detected by large-scale genome-wide association studies explain only a fraction of the total variance in plasma lipid phenotypes and CAD.
Methods and Results—We performed a conditional and joint association analysis using summary-level statistics from 2 large genome-wide association meta-analyses: the Global Lipids Genetics Consortium (GLGC) study, and the Coronary Artery Disease Genome-Wide Replication and Meta-Analysis (CARDIoGRAM) study. There were 100 184 individuals from 46 GLGC studies for plasma lipids, and 22 233 cases and 64 762 controls from 14 studies for CAD. We detected several loci where multiple independent single-nucleotide polymorphisms were associated with lipid traits within a locus (12 out of 33 loci for high-density lipoprotein cholesterol, 10 of 35 loci for low-density lipoprotein cholesterol, 13 of 44 loci for total cholesterol, and 8 of 28 loci for triglycerides), reaching genome-wide significance (P<5×10−8), nearly doubling the heritability explained by genome-wide association studies (from 3.6 to 7.6% for high-density lipoprotein cholesterol, from 5.0 to 8.8% for low-density lipoprotein cholesterol, from 5.5 to 8.8% for total cholesterol, and from 5.7 to 8.5% for triglycerides). Multiple single-nucleotide polymorphisms were also associated with CAD (3 of 15 loci; an increase from 9.6% to 11.4% of heritability explained).
Conclusions—These results demonstrate that a portion of the missing heritability for lipid traits and CAD can be explained by multiple variants at each locus.
Plasma lipids and lipoproteins are heritable risk factors for coronary artery disease (CAD),1 with heritability estimates ranging from 40% to 60% for total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides, and 30% to 60% for CAD.2–5 Genome-wide association studies (GWAS) for plasma lipids and CAD have successfully identified >95 gene regions for plasma lipids and 46 for CAD.1,6 Despite the success of GWAS, single-nucleotide polymorphisms (SNPs) at these loci explain only a modest proportion of the heritability: 20% to 25% of the heritability for plasma lipids and <10% of the heritability of CAD.1 These observations have led to the general question of how to account for the unexplained heritability.
Typically GWAS uses a single-locus test, in which each variant is tested individually for association with a specific phenotype and the best SNP at each locus is reported. A single SNP may not capture the overall amount of variation at a locus because there may be multiple causal variants at a locus. If individual-level genotype data are available, we can detect additional SNPs using conditional analysis; however, conditional analysis is often infeasible when many studies have contributed results to a large-scale meta-analysis.
Recently, Yang et al7 developed a conditional and joint association analysis tool that leverages summary-level statistics and estimated linkage disequilibrium from a reference sample with individual-level genotype data. In the current study, we applied this approach to lipid traits studied by the Global Lipids Genetics Consortium (GLGC) as well as to the dichotomous trait of CAD studied by the Coronary Artery Disease Genome-Wide Replication and Meta-Analysis (CARDIoGRAM) Consortium.5,6
GWAS Summary-Level Statistics for Lipids and CAD
We obtained the summary-level statistics (rsID, effect allele, the other allele, frequency of the effect allele, effect size, standard error, P value, and sample size) from the GLGC meta-analysis study for lipids6 and from the CARDIoGRAM meta-analysis study for CAD.5 These studies included up to 100 184 individuals from 46 studies for lipids, and 22 233 cases and 64 762 controls for CAD, respectively.
Reference Samples for Linkage Disequilibrium
We used the European ancestry individual-level genotype of 9796 individuals and phenotype data of the Atherosclerosis Risk in Communities Study (ARIC) cohort8 as a reference sample. ARIC represents a large population-based cohort, and this cohort contributed to the GLGC and CARDIoGRAM meta-analyses. SNP quality control was performed, excluding SNPs with missingness >2%, minor allele frequency <0.01 or Hardy–Weinberg equilibrium P value <1×10−6. Among a total of 805 437 genotyped SNPs, 617 428 SNPs were retained in the ARIC data. We discarded samples with missingness >3% and 1 of each pair of samples with an estimated genetic relatedness >0.25. A total of 8682 individuals of European ancestry in the ARIC cohort were included for linkage disequilibrium calculation. The SNP data for ARIC were phased by MaCH and imputed into the HapMap Phase 2 CEU panel by minimac, the same panel that was used for the initial GWAS.9,10 We used the best guess genotypes of the imputed SNPs and excluded imputed SNPs with Hardy–Weinberg equilibrium P value <1×10−6, imputation quality Rsq <0.3, or minor allele frequency <0.01 and retained 2 490 789 SNPs in the ARIC cohort.
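The SNP-level quality control thresholds described above can be sketched as a simple filter. This is a minimal illustration, assuming per-SNP summary fields are already available; the `SnpQC` record, its field names, and `passes_qc` are hypothetical and are not part of the tools used in the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnpQC:
    """Per-SNP quality metrics (hypothetical record layout)."""
    rsid: str
    missing_rate: float          # fraction of samples with a missing call
    maf: float                   # minor allele frequency
    hwe_p: float                 # Hardy-Weinberg equilibrium P value
    rsq: Optional[float] = None  # imputation quality; None for genotyped SNPs


def passes_qc(snp, max_missing=0.02, min_maf=0.01, min_hwe_p=1e-6, min_rsq=0.3):
    """Apply the thresholds quoted in the text to one SNP."""
    if snp.maf < min_maf or snp.hwe_p < min_hwe_p:
        return False
    if snp.rsq is None:
        # Genotyped SNP: exclude if missingness exceeds 2%.
        return snp.missing_rate <= max_missing
    # Imputed SNP: exclude if imputation quality Rsq is below 0.3.
    return snp.rsq >= min_rsq


# Made-up example records, for illustration only.
examples = [
    SnpQC("rs_example_1", missing_rate=0.01, maf=0.20, hwe_p=0.5),
    SnpQC("rs_example_2", missing_rate=0.05, maf=0.20, hwe_p=0.5),
    SnpQC("rs_example_3", missing_rate=0.00, maf=0.05, hwe_p=0.5, rsq=0.2),
]
kept = [s.rsid for s in examples if passes_qc(s)]
```

Sample-level filters (missingness >3% and relatedness >0.25) would be applied separately and are not shown here.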
Conditional and Joint GWAS Analysis
We performed a stepwise model selection procedure to select independently associated SNPs using the GCTA tool available online (http://www.complextraitgenomics.com/software/gcta/massoc.html) for each lipid trait and CAD. Briefly, the procedure begins with the most significant SNP with P<5×10−8 in the single-SNP meta-analysis and tests all the remaining SNPs conditional on the selected SNP(s) in the model. It then selects the SNP with the minimum conditional P value and fits all the selected SNPs in a model, dropping the SNP with the largest P value >5×10−8. The algorithm iterates until no SNP is added to or removed from the model. The joint effects of all selected SNPs are estimated after the model has been optimized. We define a locus as a chromosomal region at which adjacent pairs of associated SNPs are <1 Mb distant. Details about the conditional and joint analysis are fully described elsewhere.7
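The stepwise procedure can be illustrated with a simplified sketch. It is not the GCTA implementation: it assumes standardized genotypes and phenotype (unit variance), a single sample size `n` for all SNPs, and a known LD correlation matrix `R`, so that joint effects follow from marginal effects as b_J = R^{-1} b_M. All function names, the collinearity cutoff `max_r2`, and the example numbers are assumptions made for illustration.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def joint_fit(sel, beta, R, n):
    """Joint effects and two-sided normal P values for the SNPs in `sel`."""
    Rss = [[R[i][j] for j in sel] for i in sel]
    bj = solve(Rss, [beta[i] for i in sel])
    # Residual variance after fitting the selected SNPs jointly.
    resid = 1.0 - sum(b * beta[i] for b, i in zip(bj, sel))
    pvals = []
    for k in range(len(sel)):
        unit = [1.0 if m == k else 0.0 for m in range(len(sel))]
        inv_kk = solve(Rss, unit)[k]  # k-th diagonal element of Rss^-1
        se = math.sqrt(resid * inv_kk / n)
        pvals.append(math.erfc(abs(bj[k] / se) / math.sqrt(2)))
    return bj, pvals


def stepwise_select(beta, R, n, p_thresh=5e-8, max_r2=0.9):
    """Forward selection on conditional P values with backward elimination."""
    m = len(beta)
    marginal_p = [joint_fit([j], beta, R, n)[1][0] for j in range(m)]
    top = min(range(m), key=marginal_p.__getitem__)
    if marginal_p[top] >= p_thresh:
        return [], []
    sel = [top]
    while True:
        best, best_p = None, p_thresh
        for j in range(m):
            # Skip SNPs already selected or nearly collinear with one of them.
            if j in sel or any(R[j][s] ** 2 > max_r2 for s in sel):
                continue
            p = joint_fit(sel + [j], beta, R, n)[1][-1]  # P of j given sel
            if p < best_p:
                best, best_p = j, p
        if best is None:
            break
        sel.append(best)
        # Drop the SNP with the largest joint P value while it exceeds the threshold.
        while len(sel) > 1:
            _, pv = joint_fit(sel, beta, R, n)
            worst = max(range(len(sel)), key=pv.__getitem__)
            if pv[worst] <= p_thresh:
                break
            sel.pop(worst)
    return sel, joint_fit(sel, beta, R, n)[0]


# Invented three-SNP example: SNP 1 is masked by negative LD with SNP 0,
# and SNP 2 has no effect of its own.
R = [[1.0, -0.6, 0.5], [-0.6, 1.0, -0.3], [0.5, -0.3, 1.0]]
true_joint = [0.04, 0.03, 0.0]
beta_marginal = [sum(R[i][j] * true_joint[j] for j in range(3)) for i in range(3)]
selected, joint = stepwise_select(beta_marginal, R, n=100_000)
```

In this toy example the second SNP is not significant on its own but is selected once the first SNP is in the model, which is the masking pattern the joint analysis is designed to uncover.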
Estimation of the Variance Explained by the Joint Association
We calculated the variance explained using the following equation where βM and βJ are effect sizes in standard deviation units obtained from the original meta-analysis and the joint analysis, respectively.11
$$q^2 = 2 \times \beta_M \times \beta_J \times \mathrm{MAF} \times (1 - \mathrm{MAF}) \times 100$$
Variance explained was calculated for (1) the top SNPs from original meta-analysis and (2) the top original SNPs plus the additionally associated SNPs found from the conditional and joint analysis.
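As a minimal sketch of this calculation, the per-SNP formula can be evaluated and summed over a set of SNPs. Summing across SNPs to obtain a total is an assumption here, as are the function name and the example values; for a lead SNP considered alone, the joint effect is taken to equal the marginal effect.

```python
def variance_explained(snps):
    """Percent of phenotypic variance explained by a set of SNPs.

    `snps` is an iterable of (beta_m, beta_j, maf) tuples, with effect sizes
    in standard deviation units: beta_m from the original meta-analysis and
    beta_j from the joint analysis.
    """
    return sum(2 * bm * bj * maf * (1 - maf) * 100 for bm, bj, maf in snps)


# Invented values for illustration only.
top_only = [(0.10, 0.10, 0.50)]
top_plus_secondary = [(0.10, 0.12, 0.50), (0.02, 0.06, 0.30)]

q2_top = variance_explained(top_only)
q2_all = variance_explained(top_plus_secondary)
```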
Replication Analysis for Lipid Traits
We replicated the variance explained by all jointly associated SNPs detected from the GLGC data using 7312 individuals from the Malmö Diet and Cancer (MDC) cohort as an independent sample, with 2 methods: method A, a multiple regression of the SNPs selected from the discovery set (GLGC); and method B, a replication analysis using the SNPs with their effect sizes estimated from the discovery sample (GLGC) to predict the phenotype in MDC. In method A, 2 predictors were created in the MDC cohort by PLINK12 for each lipid trait, 1 based on all additionally associated SNPs or their proxy SNPs (r2>0.8) and the other based on the GWAS top SNPs only, and the observed lipid phenotypes were regressed on the predictors. In method B, we again created 2 predictors in the MDC cohort, but with SNP effects estimated from the GLGC data set, and regressed the observed lipid phenotypes on the predictors. In both methods, adjusted R2 values in MDC were compared with the explained variances for lipid traits in GLGC. We also checked the independence of multiple associated SNPs within a locus by comparing β from the model using multiple SNPs with β from the model using each SNP alone, for the locus with the largest number of associated SNPs for each trait.
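The idea behind method B can be sketched as follows: build a predictor as the weighted sum of allele dosages, using discovery-sample effect sizes as weights, then regress the phenotype on that predictor and report the adjusted R2. This is a simplified single-predictor illustration with made-up data, not the PLINK-based pipeline used in the study; all function names are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages for each individual."""
    return [sum(w * g for w, g in zip(weights, row)) for row in dosages]


def r_squared(x, y):
    """R2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def adjusted_r_squared(r2, n, n_predictors=1):
    """Adjust R2 for the number of predictors in the regression."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up data: 6 individuals, 2 SNPs, and weights standing in for
# discovery-sample effect sizes.
dosages = [[0, 1], [1, 0], [2, 1], [1, 2], [0, 0], [2, 2]]
weights = [0.5, -0.25]
phenotype = [0.1, 0.4, 0.9, -0.2, 0.0, 0.6]

score = polygenic_score(dosages, weights)
r2 = r_squared(score, phenotype)
adj_r2 = adjusted_r_squared(r2, n=len(phenotype))
```

Method A differs in that the SNP effects are re-estimated in the replication cohort rather than carried over from the discovery sample.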
Informed Consent and Institutional Review Board Approval
Most of the analyses used summary statistics from prior publications. For genetic association analyses in the MDC cohort using deidentified genotype and phenotype data, each participant had provided written informed consent and approval was given by the institutional review board at Partners Healthcare.
Using summary statistics of ≈2.5 million SNPs from the GLGC meta-analysis of 100 184 individuals for 4 lipid fractions along with SNP linkage disequilibrium estimated in 8682 unrelated European Americans selected from the ARIC cohort study (see Methods), we identified 62, 61, 68, and 41 jointly associated SNPs for each lipid trait (HDL-C, LDL-C, TC, and triglycerides) with P<5×10−8 (Tables I–IV in the Data Supplement), respectively. When compared with previous results conducted by conventional conditional analysis in the original GLGC study, we could detect more associated SNPs with each trait (11 versus 29 for HDL-C, 12 versus 26 for LDL-C, 12 versus 24 for TC, and 9 versus 13 for triglycerides; Table V in the Data Supplement).
For the loci where the increasing alleles of at least 2 SNPs were negatively correlated, some associated variants were undetected in the original GWAS. For example, rs180349 and rs3741298 at the APOA1-C3-A4-A5 locus on chromosome 11 did not exhibit a significant association with HDL-C in single-SNP meta-analyses (P value from the single-SNP meta-analysis [PM]=8.67×10−3 and 4.12×10−4, respectively), but both SNPs reached genome-wide significance when fitted jointly (Table I in the Data Supplement). In addition, the significance and effect size of the leading SNP at the locus also increased (PM=2.94×10−42 to PJ=1.84×10−73 for rs964184; Table I in the Data Supplement). There were 12 of 33 HDL-C, 10 of 35 LDL-C, 13 of 44 TC, and 8 of 28 triglyceride loci harboring ≥2 associated SNPs, with the maximum number of 9, 9, 6, and 7 SNPs at a locus, respectively. The lead SNPs (33 for HDL-C, 35 for LDL-C, 44 for TC, and 28 for triglycerides) explained 3.6%, 5.0%, 5.5%, and 5.7% of phenotypic variance, respectively. These values were almost doubled (7.6% for HDL-C, 8.8% for LDL-C, 8.8% for TC, and 8.5% for triglycerides) when all jointly associated SNPs (62 for HDL-C, 61 for LDL-C, 68 for TC, and 41 for triglycerides) were taken into account.
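The masking effect of negatively correlated increasing alleles can be seen in a two-SNP worked example. This is a sketch assuming standardized genotypes, an LD correlation r between the two SNPs, and no other associated variants nearby; the subscripts 1 and 2 are labels introduced here for illustration.

```latex
% Expected marginal (single-SNP) effects in terms of the joint effects:
\beta_{M,1} = \beta_{J,1} + r\,\beta_{J,2}, \qquad
\beta_{M,2} = \beta_{J,2} + r\,\beta_{J,1}

% Solving for the joint effects:
\beta_{J,1} = \frac{\beta_{M,1} - r\,\beta_{M,2}}{1 - r^{2}}, \qquad
\beta_{J,2} = \frac{\beta_{M,2} - r\,\beta_{M,1}}{1 - r^{2}}
```

When both joint effects are positive and r is negative, each marginal effect is smaller than the corresponding joint effect, so a SNP can fall short of genome-wide significance in the single-SNP analysis and still be significant when both SNPs are fitted together.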
Coronary Artery Disease
Using summary statistics of ≈2.5 million SNPs from the CARDIoGRAM meta-analysis of 22 233 cases and 64 762 controls for CAD along with the same reference SNP data from the ARIC cohort described above, we identified 18 jointly associated SNPs for CAD with P<5×10−8. Of these SNPs associated with CAD, 3 of 15 loci represent multiple associated SNPs within a single locus (Figure 1; Table VI in the Data Supplement).
We found 2 association signals for CAD at the 9p21 CDKN2A and CDKN2B locus. We found a significant joint association with CAD in the LDLR region, where multiple common variants for LDL-C and rare mutations in familial hypercholesterolemia have been previously reported.13 Two SNPs, rs8099996 and rs1122608, which are 11 024 bp apart, were retained in the stepwise model selection as jointly associated SNPs with PJ<3.5×10−11 (Table VI in the Data Supplement). The secondary SNP (rs8099996; PM=0.67) was masked by the primary SNP (rs1122608; PM=9.7×10−10) in single-SNP analyses, but it appeared significant (rs8099996; P=5.0×10−13) in conditional analysis on the primary SNP (Figure 2). This region was also significant for LDL-C and TC (Figure 1C). The APOA5-A4-C3-A1 gene cluster locus was significant for 4 lipid traits as well as CAD (Figure 1A and 1B). When they were fitted jointly, their effects, as well as statistical significance, were substantially increased compared with those in single-SNP analyses. The 15 leading SNPs explained 9.6% of phenotypic variance. The 3 additional SNPs detected by the joint analysis accounted for 1.8% of the variance explained.
For a set of 184 SNPs in Tables I to IV in the Data Supplement, we evaluated the associations with CAD in CARDIoGRAM meta-analysis. Of these 184 SNPs, 38 SNPs were nominally associated (PM<0.05) and 20 SNPs showed a significant association after the Bonferroni correction (PM<2.2×10−4; Tables I–IV in the Data Supplement).
Replication of the Lipid Results in an Independent Sample
We validated the direction of effect in each SNP between GLGC results and MDC results for all the jointly associated variants (Figure I in the Data Supplement). The multiple regression analysis showed that the prediction R2 values of top primary SNPs were 4.6%, 5.8%, 6.7%, and 5.2%, consistent with the estimate of 3.6%, 5.0%, 5.5%, and 5.7% of variance explained in the discovery sample (GLGC), for HDL-C, LDL-C, TC, and triglycerides, respectively (Figure II in the Data Supplement; method A). And the R2 values of the additionally associated SNPs were 5.2%, 3.1%, 2.7%, and 2.2%, in line with the estimate of 4.0%, 3.8%, 3.3%, and 2.8% of variance explained by these SNPs in the discovery sample (GLGC), respectively (Figure II in the Data Supplement; method A).
In addition, when we used the SNP effects estimated from the GLGC data set (method B), the R2 values of top primary SNPs were 3.4%, 4.6%, 4.6%, and 4.4%, consistent with the estimates in the discovery sample (GLGC), and those of the additionally associated SNPs were 4.1%, 3.1%, 2.6%, and 2.2%, in line with the variance explained by these SNPs in the discovery sample (GLGC), respectively (Figure II in the Data Supplement; method B). Therefore, these replication analyses in an independent sample confirmed that additional associated variants could explain ≈2% to 4% of phenotypic variation for each lipid trait.
Figure III in the Data Supplement shows that β from the multiple SNP model and β from the single-SNP model are consistent for each locus with the largest number of multiple SNPs (A: HDL-C for CETP locus; B: LDL-C for APOE-C1-C2 locus; C: TC for APOE-C1-C2 locus; D: triglycerides for LPL locus), suggesting that the variants from the multiple SNP model are independent.
In this study, we detected several loci where multiple independent SNPs were associated with lipid traits, and accounting for these variants nearly doubled the heritability explained by the previous GWAS results (HDL-C, LDL-C, TC, and triglycerides). In addition, the joint associations of lipid traits were validated in an independent sample.
GWAS results have explained only a fraction of the heritability of complex traits. There has been extensive debate regarding this unexplained heritability, with hypotheses ranging from rare variants to epistasis.14–16 Here, we explored the possibility of multiple independent signals at a given locus as a contributor to the unexplained heritability.
Conditional analysis has been used as a tool to identify secondary association signals at a locus, starting with the top associated SNP, across the whole genome followed by a stepwise procedure of selecting additional SNPs, one by one, according to their conditional P values. However, nearly always, pooled individual level genotype data are unavailable in large-scale meta-analyses. In that sense, the tool we used is useful in terms of saving computational time and cost because it does not require individual genotype data except for the samples used as a linkage disequilibrium reference. As a result, the current study clearly indicates that an increased portion of the missing heritability could be explained by the joint influence of multiple variants within a locus, suggesting the importance of digging into known loci to identify causal variants and understand the genetic architecture of complex diseases.
For plasma lipids, we found that many signals at specific loci can explain a large proportion of the variance. From our estimation, 9 SNPs in the CETP locus could explain as much as 2.6% of the variance in HDL-C; 9 SNPs in the APOE-C1-C2 locus explained 2.1% of that in LDL-C; 6 SNPs in the APOE-C1-C2 locus explained 1.1% of that in TC; 7 SNPs in the LPL locus explained 1.7% of that in triglycerides; and 2 SNPs in the CDKN2A/CDKN2B locus explained 2.8% of that in CAD. This is consistent with a previous report that suggests greater heritability of common variants in known loci.17
The variance explained by the top SNPs for each lipid trait from the GLGC data set was relatively small compared with the one from the original article.6 This could be because of variability in estimates from different studies as well as the different method used in this analysis compared with that of original GWAS, where only 1 study (ie, the Framingham Heart Study) contributed to the estimation of variance explained. Although we showed evidence of multiple associations at several loci using summary statistics of the CARDIoGRAM meta-analysis, recently published data of the CARDIoGRAMplusC4D meta-analysis with 63 746 CAD cases and 130 681 controls based on the Metabochip array might also be useful to find additional associated signals.1
In summary, we detected several loci where multiple associated SNPs within a single locus were associated with lipid traits or CAD. For lipid traits, these variants nearly doubled the heritability explained.
Sources of Funding
H. Tada is supported by a grant for studying overseas from Japanese Circulation Society. The Malmö Diet and Cancer study was made possible by grants from the Swedish Cancer Society, the Swedish Medical Research Council, the Swedish Dairy Association, the Albert Påhlsson and Gunnar Nilsson Foundations, and the Malmö city council. O. Melander is supported by the European Research Council (StG-282255); the Swedish Heart and Lung Foundation; Swedish Research Council; the Novo Nordisk Foundation; the Skåne University Hospital donation funds; the Medical Faculty, Lund University; the governmental funding of clinical research within the national health services; the Albert Påhlsson Research Foundation; Region Skåne; the King Gustav V and Queen Victoria Foundation; and the Marianne and Marcus Wallenberg Foundation. G.M. Peloso is supported by award number T32HL007208 from the National Heart, Lung, and Blood Institute. S. Kathiresan is supported by a Research Scholar award from the Massachusetts General Hospital (MGH), the Howard Goodman Fellowship from MGH, the Donovan Family Foundation, R01HL107816, and a grant from Fondation Leducq.
Guest Editor for this article was Robert A. Hegele, MD.
The Data Supplement is available at http://circgenetics.ahajournals.org/lookup/suppl/doi:10.1161/CIRCGENETICS.113.000420/-/DC1.
- Received November 8, 2013.
- Accepted June 27, 2014.
- © 2014 American Heart Association, Inc.
- Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, Heath AC, et al
- 8.↵The ARIC investigators.The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. Am J Epidemiol. 1989;129:687–702.
- Zuk O, Hechter E, Sunyaev SR, Lander ES
- Linsel-Nitschke P, Götz A, Erdmann J, Braenne I, Braund P, Hengstenberg C, et al
CLINICAL PERSPECTIVE
Plasma lipids are heritable risk factors for coronary artery disease (CAD), with heritability estimates ranging from 40% to 60% for plasma lipids and 30% to 60% for CAD. Genome-wide association studies have successfully identified 157 loci for plasma lipids and 46 for CAD. Despite the success of genome-wide association studies, single-nucleotide polymorphisms (SNPs) at these loci explain only a modest proportion of the heritability. These observations have led to the general question of how to account for the unexplained heritability. Typically, genome-wide association studies use a single-locus test, in which each variant is tested individually for association with a specific phenotype and the best SNP at each locus is reported. A single SNP may not capture the overall amount of variation at a locus because there may be multiple causal variants at a locus. If individual-level genotype data are available, we can detect additional SNPs using conditional analysis; however, conditional analysis is often infeasible when many studies have contributed results to a large-scale meta-analysis. Recently, Yang et al developed a conditional and joint association analysis tool that leverages summary-level statistics and estimated linkage disequilibrium from a reference sample. In this study, we applied this approach to lipid traits studied by the Global Lipids Genetics Consortium as well as to CAD studied by the Coronary Artery Disease Genome-Wide Replication and Meta-Analysis Consortium. We detected several loci where multiple SNPs within a single locus were associated with lipid traits or CAD. For lipid traits, these variants nearly doubled the heritability explained. This is consistent with recent studies that suggest a polygenic architecture for complex traits that includes hundreds of common variants.