The application of random forest for high dimensional DNA methylation data
-
Abstract: Objective To apply the random forest algorithm to high-dimensional DNA methylation data from a case-control study of rheumatoid arthritis (RA) and to evaluate its performance. Methods The RA dataset was obtained from the Gene Expression Omnibus (GEO) data repository (accession number GSE42861) and contained 689 samples (354 patients and 335 controls). A total of 2,433 cytosine-phosphate-guanine dinucleotide (CpG) sites on chromosome 9 were included, because a known RA-associated gene region is located on this chromosome. First, variable importance scores were calculated with random forest and the variables were ranked by score. Second, a stepwise random forest procedure was applied to the ranked variables to find the subset most likely to be associated with the outcome. Third, stepwise logistic regression was performed on the reduced subset. Results Stepwise random forest selected 80 important CpG sites, of which 13 were statistically significant in the logistic regression model. The logistic regression model containing these 13 sites achieved a prediction accuracy of 88.29%. Conclusions The random forest algorithm can greatly reduce the number of noise variables and improve statistical power, and is suitable for the analysis of high-dimensional DNA methylation data.
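The abstract does not spell out how the stepwise random forest step works. One plausible reading is a nested search: take the variables in importance order, evaluate the top-1, top-2, ..., top-k subsets, and keep the subset with the lowest error estimate. The Python sketch below illustrates only that search, under that assumption. The function `stepwise_subset`, the callback `error_of`, the toy error function, and the site names are all hypothetical; in a real analysis `error_of` would refit a random forest on the given sites and return, for example, its out-of-bag error.

```python
from typing import Callable, List, Sequence, Tuple


def stepwise_subset(
    ranked: Sequence[str],
    error_of: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Return the top-k prefix of `ranked` with the lowest estimated error.

    `ranked` lists variables from most to least important.
    `error_of` is a hypothetical callback that scores a candidate subset;
    lower is better.
    """
    if not ranked:
        raise ValueError("ranked must contain at least one variable")
    best_k, best_err = 1, error_of(ranked[:1])
    for k in range(2, len(ranked) + 1):
        err = error_of(ranked[:k])
        if err < best_err:  # strict '<' keeps the smaller subset on ties
            best_k, best_err = k, err
    return list(ranked[:best_k]), best_err


# Toy stand-in for a forest's error estimate (not real data): three
# informative sites each lower the error, every noise site raises it slightly.
INFORMATIVE = {"cpg_a", "cpg_b", "cpg_c"}


def toy_error(subset: Sequence[str]) -> float:
    hits = sum(1 for name in subset if name in INFORMATIVE)
    noise = len(subset) - hits
    return 0.5 - 0.1 * hits + 0.01 * noise


ranked_sites = ["cpg_a", "cpg_b", "cpg_c", "cpg_x", "cpg_y"]
selected, err = stepwise_subset(ranked_sites, toy_error)
print(selected, round(err, 2))
```

With the toy error function, the search stops improving once the noise sites start to enter, so only the three informative sites are kept. The retained subset would then be passed to stepwise logistic regression, as described in the Methods.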
-
Key words:
- Arthritis, rheumatoid /
- DNA methylation /
- Epidemiologic methods