Resampling classification model for predicting blood glucose control in middle-aged and elderly diabetic patients in China
WANG Ping and ZHANG Le contributed equally to this article
Abstract:
Objective To improve the predictive performance of blood glucose control classification models for diabetic patients by applying resampling algorithms.
Methods Imbalanced blood glucose control data of diabetic patients in the China Health and Retirement Longitudinal Study (CHARLS) database were resampled. The classification performance of logistic regression (LR), support vector machines (SVM), and random forest (RF) was compared before and after resampling. Stratified 5-fold cross-validation and the area under the receiver operating characteristic (ROC) curve (AUC) were used to determine the optimal parameters of the models. Model performance before and after resampling was evaluated using accuracy, sensitivity, specificity, precision, geometric mean (G-mean), F1 score, and AUC.
Results The resampling algorithms improved the sensitivity, G-mean, and F1 score of all three classification models. The oversampling algorithm adaptive synthetic sampling (ADASYN) and the combined sampling algorithms, synthetic minority over-sampling technique with edited nearest neighbors (SMOTE-ENN) and synthetic minority over-sampling technique with Tomek links (SMOTE-Tomek), increased the AUC of all three classification models to varying degrees: ADASYN increased the AUC of the LR model by 2.13%, SMOTE-ENN increased the AUC of the LR model by 3.05%, and SMOTE-Tomek increased the AUC of the RF model by 2.13%.
Conclusions ADASYN, SMOTE-ENN, and SMOTE-Tomek handle the imbalanced blood glucose control data of diabetic patients well and improve the predictive performance of blood glucose control classification models.
Key words: Resampling algorithm; Imbalance classification; Diabetes; Blood glucose control
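The SMOTE-family oversampling named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote_sketch` and `oversample_to_ratio`, the neighbour count `k=5`, and the `ratio` parameter (minority size divided by majority size after resampling) are assumptions made here for illustration, and only the Python standard library is used. Each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists (at least 2 rows).
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def oversample_to_ratio(majority, minority, ratio, k=5, seed=0):
    """Add synthetic samples until len(minority) / len(majority) reaches ratio."""
    target = math.ceil(ratio * len(majority))
    n_new = max(0, target - len(minority))  # never remove original samples
    return minority + smote_sketch(minority, n_new, k=k, seed=seed)
```

A common precaution, assumed here rather than stated in the source, is to resample only the training folds of the stratified cross-validation, so that the held-out fold keeps the original class distribution.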
Table 1. Performance comparison of resampling classification models

| Resampling algorithm | Classification model | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | G-mean | F1 score | AUC (95% CI) |
|---|---|---|---|---|---|---|---|---|
| Imbalanced data | LR | 83.67 | 12.50 | 97.56 | 50.00 | 0.349 | 0.200 | 0.692 (0.547~0.898) |
| | SVM | 83.67 | 0 | 100.00 | 0 | 0 | — | 0.692 (0.056~0.866) |
| | RF | 83.67 | 0 | 100.00 | 0 | 0 | — | 0.680 (0.494~0.838) |
| RUS | LR | 65.31 | 25.00 | 73.17 | 15.83 | 0.428 | 0.191 | 0.671 (0.568~0.908) |
| | SVM | 59.18 | 87.50 | 53.66 | 26.92 | 0.685 | 0.412 | 0.689 (0.513~0.878) |
| | RF | 59.18 | 75.00 | 56.10 | 25.00 | 0.649 | 0.375 | 0.668 (0.264~0.800) |
| SMOTE0.5 | LR | 69.39 | 62.50 | 70.73 | 29.41 | 0.665 | 0.400 | 0.732 (0.534~0.863) |
| | SVM | 81.63 | 37.50 | 90.24 | 42.86 | 0.582 | 0.143 | 0.729 (0.405~0.735) |
| | RF | 75.51 | 12.50 | 87.80 | 16.67 | 0.331 | 0.716 | 0.701 (0.460~0.808) |
| SMOTE0.7 | LR | 79.59 | 75.00 | 80.49 | 42.86 | 0.777 | 0.546 | 0.710 (0.501~0.829) |
| | SVM | 69.39 | 37.50 | 75.61 | 23.08 | 0.533 | 0.286 | 0.717 (0.516~0.813) |
| | RF | 61.22 | 75.00 | 58.54 | 26.09 | 0.663 | 0.387 | 0.686 (0.453~0.804) |
| SMOTE1 | LR | 55.10 | 87.50 | 48.78 | 25.00 | 0.653 | 0.389 | 0.695 (0.497~0.826) |
| | SVM | 67.34 | 50.00 | 70.73 | 25.00 | 0.595 | 0.333 | 0.698 (0.279~0.721) |
| | RF | 59.18 | 75.00 | 56.10 | 25.00 | 0.649 | 0.375 | 0.680 (0.427~0.817) |
| ADASYN | LR | 71.43 | 75.00 | 70.73 | 33.33 | 0.728 | 0.462 | 0.713 (0.519~0.859) |
| | SVM | 67.35 | 87.50 | 63.41 | 31.82 | 0.745 | 0.467 | 0.717 (0.454~0.786) |
| | RF | 61.22 | 75.00 | 58.54 | 26.09 | 0.663 | 0.387 | 0.707 (0.431~0.807) |
| SMOTE-ENN | LR | 66.75 | 74.14 | 66.47 | 7.78 | 0.702 | 0.141 | 0.737 (0.667~0.804) |
| | SVM | 69.16 | 72.41 | 69.00 | 8.19 | 0.707 | 0.147 | 0.746 (0.679~0.813) |
| | RF | 58.83 | 77.59 | 58.12 | 6.60 | 0.672 | 0.122 | 0.703 (0.644~0.763) |
| SMOTE-Tomek | LR | 71.12 | 67.24 | 71.27 | 8.19 | 0.692 | 0.146 | 0.752 (0.683~0.821) |
| | SVM | 77.07 | 65.52 | 77.51 | 10.00 | 0.713 | 0.174 | 0.748 (0.725~0.769) |
| | RF | 69.85 | 68.96 | 69.88 | 8.03 | 0.694 | 0.144 | 0.748 (0.682~0.814) |

Note: RUS, random under-sampling; SMOTE, synthetic minority over-sampling technique; ADASYN, adaptive synthetic sampling; SMOTE-ENN, synthetic minority over-sampling technique and edited nearest neighbors; SMOTE-Tomek, synthetic minority over-sampling technique and Tomek links; LR, logistic regression; SVM, support vector machines; RF, random forest; G-mean, geometric mean; AUC, area under the curve; CI, confidence interval. Accuracy, sensitivity, specificity, and precision are expressed as percentages. "—" indicates that the value could not be obtained.
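The threshold-based metrics in Table 1 follow from the confusion matrix, which the sketch below makes explicit. It uses the standard definitions (G-mean as the square root of sensitivity times specificity, F1 as the harmonic mean of precision and sensitivity). The function name `classification_metrics` and the handling of undefined values are assumptions made for illustration, and the counts in the usage note are inferred from the reported percentages rather than given in the source.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute threshold-based metrics from confusion-matrix counts.

    Assumes at least one positive and one negative sample in the test set.
    """
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Assumed convention: precision is 0 when no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    # F1 is undefined when precision and sensitivity are both 0.
    if precision + sensitivity > 0:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = None
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "g_mean": g_mean,
        "f1": f1,
    }
```

Assuming a test set of 49 patients with 8 positives, `classification_metrics(tp=1, fn=7, tn=40, fp=1)` reproduces the LR row for the imbalanced data (accuracy 83.67%, sensitivity 12.50%, specificity 97.56%, precision 50.00%, G-mean 0.349, F1 0.200). With `tp=0, fn=8, tn=41, fp=0` the F1 score is undefined, which is consistent with the "—" entries for SVM and RF.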