Genome-Wide Association Study of Cardiovascular and Cerebrovascular Diseases Based on Multi-Step Screening
Abstract: Genome-wide association studies (GWAS) are an effective means of studying the genetic effects underlying complex diseases and traits. Existing association analyses rely mainly on marginal statistical tests, which do not account for correlation between features and suffer from unstable threshold selection. Taking cardiovascular and cerebrovascular diseases as the object of study, this paper proposes a new GWAS method based on multi-step screening. The method consists of two steps: first, the Gini index is used for initial feature screening to obtain a candidate subset of single-nucleotide polymorphisms (SNPs); then, random forest recursive cluster elimination (RF-RCE) identifies the disease-associated SNPs within that candidate subset. Experimental results show that multi-step screening outperforms single-step feature selection, and that the SNP subset selected by Gini-index-based RF-RCE is more strongly associated with the disease and better suited to predicting cardiovascular and cerebrovascular disease.
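The first screening step named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact Gini formulation, so this assumes the common one: each SNP is scored by the decrease in Gini impurity when samples are partitioned by genotype (coded 0/1/2), and the highest-scoring SNPs are kept as candidates. The function names, the `top_k` parameter, and the toy data are hypothetical, and the second stage (RF-RCE) is not implemented here.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(genotypes, labels):
    """Impurity decrease from partitioning the samples by genotype value."""
    n = len(labels)
    groups = {}
    for g, y in zip(genotypes, labels):
        groups.setdefault(g, []).append(y)
    weighted = sum(len(ys) / n * gini_impurity(ys) for ys in groups.values())
    return gini_impurity(labels) - weighted


def screen_snps(snp_matrix, labels, top_k):
    """Rank SNP columns by Gini gain; return the top_k indices and all scores.

    Assumption: snp_matrix is a list of samples, each a list of genotype
    codes (0, 1 or 2), and labels holds case/control status (1/0).
    """
    n_snps = len(snp_matrix[0])
    scores = [
        gini_gain([row[j] for row in snp_matrix], labels)
        for j in range(n_snps)
    ]
    ranked = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores


# Toy data (hypothetical): 6 samples x 3 SNPs, first three samples are cases.
toy = [
    [2, 0, 1],
    [2, 1, 1],
    [2, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
]
labels = [1, 1, 1, 0, 0, 0]
candidates, scores = screen_snps(toy, labels, top_k=2)
print(candidates, [round(s, 4) for s in scores])
```

In this toy set the first SNP separates cases from controls perfectly and the third carries no information, so the first two columns survive the screen. In the two-step scheme, a candidate list like this would then be passed to the random-forest-based elimination stage.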