Construction and validation of a prediction model for nonalcoholic fatty liver disease based on machine learning

CHENG Linjie, YUAN Qing, LI Wansong, LIU Ying, YANG Lei

Citation: CHENG Linjie, YUAN Qing, LI Wansong, LIU Ying, YANG Lei. Construction and validation of a prediction model for nonalcoholic fatty liver disease based on machine learning[J]. CHINESE JOURNAL OF DISEASE CONTROL & PREVENTION, 2025, 29(6): 682-687. doi: 10.16462/j.cnki.zhjbkz.2025.06.009

doi: 10.16462/j.cnki.zhjbkz.2025.06.009
Funds: A Cohort Study of Natural Population Health Trends in Hebei Province (226Z7705G)

Details
    Corresponding authors:

    LIU Ying, E-mail: wayymbb@126.com

    YANG Lei, E-mail: yanglei1127@hebmu.edu.cn

  • Chinese Library Classification (CLC) number: R575.5

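  • Abstract:
    Objective: To construct and validate machine learning (ML) models for predicting non-alcoholic fatty liver disease (NAFLD), to identify the best-performing model, and to interpret that model with the SHapley Additive exPlanations (SHAP) framework.
    Methods: Data collected from January 2017 to March 2020 in the US National Health and Nutrition Examination Survey database were randomly divided 7:3 into a training set and a test set. Least absolute shrinkage and selection operator (LASSO) regression was used for feature selection, and six algorithms were used to build prediction models. Models were evaluated by the area under the receiver operating characteristic curve (AUC) and interpreted with calibration curves, decision curve analysis, variable importance plots, and SHAP plots.
    Results: Of the 6 918 participants, 3 974 (57.44%) were diagnosed with NAFLD. The eXtreme gradient boosting (XGBoost) model performed better overall than the other models, with an AUC of 0.851, an accuracy of 0.757, a sensitivity of 0.760, and a specificity of 0.754 on the test set. The main predictors included body roundness index, waist circumference, triglyceride-glucose index, alanine aminotransferase, glycosylated hemoglobin, and high-density lipoprotein cholesterol. For practical application, a user interface was developed for use by medical staff.
    Conclusions: The study constructed and validated six ML models for predicting NAFLD. Among them, the XGBoost model performed best and can provide a reliable reference for early clinical screening of patients at high risk of NAFLD.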
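The two mechanical steps named in the Methods, a 7:3 random split and evaluation by AUC, can be sketched in plain Python. This is a minimal illustration and not the authors' code: the function names, the seed, and the toy labels and scores are assumptions made for the example.

```python
import random

def train_test_split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def roc_auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# 6 918 is the sample size reported in the abstract.
train_idx, test_idx = train_test_split_indices(6918)
print(len(train_idx), len(test_idx))

# Toy labels and scores, purely illustrative.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise AUC is quadratic in the sample size, which is fine for a sketch; a rank-based formula or a library routine would be the usual choice for the full data.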
  • Figure 1. LASSO regression selection process

    A: coefficient paths of the variables; B: cross-validation curve of the LASSO regression; LASSO: least absolute shrinkage and selection operator.
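For intuition about why the coefficient paths in panel A reach exactly zero, the sketch below uses the textbook special case of a linear model with an orthonormal design, where the LASSO estimate is the soft-thresholded least-squares coefficient. The paper's binary outcome would call for a penalised logistic model, so this is an analogy rather than the authors' procedure, and the coefficients are hypothetical.

```python
def soft_threshold(beta, lam):
    """Shrink beta toward zero by lam; return 0.0 when |beta| <= lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical unpenalised coefficients, for illustration only (not from the paper).
ols = {"x1": 0.80, "x2": -0.30, "x3": 0.05}

# A larger penalty shrinks every coefficient and drops the weakest variables first.
for lam in (0.0, 0.1, 0.5):
    path = {name: round(soft_threshold(b, lam), 2) for name, b in ols.items()}
    selected = [name for name, b in path.items() if b != 0.0]
    print(lam, path, selected)
```

Variables whose coefficient is still non-zero at the penalty chosen by cross-validation (panel B) are the ones retained for modelling.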

    Figure 2. Model evaluation

    A: receiver operating characteristic curves of the different models on the test set; B: decision curve analysis of the different models on the test set; C: calibration curves of the different models on the test set; AUC: area under the receiver operating characteristic curve; DT: decision tree; EN: elastic net; LR: logistic regression; MLP: multilayer perceptron; SVM: support vector machine; XGBoost: eXtreme gradient boosting.
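Decision curve analysis (panel B) plots net benefit against the threshold probability. A minimal sketch of the standard net benefit formula, TP/n - FP/n × pt/(1 - pt), is given below; the function names and toy data are assumptions for illustration, not values from the paper.

```python
def net_benefit(labels, probabilities, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - fp / n * weight

def net_benefit_treat_all(labels, threshold):
    """Reference strategy: classify every participant as positive."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Toy data, purely illustrative.
labels = [1, 1, 0, 0]
probabilities = [0.9, 0.6, 0.7, 0.2]
print(net_benefit(labels, probabilities, 0.5))
print(net_benefit_treat_all(labels, 0.5))
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all value and zero (the treat-none strategy).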

    Figure 3. Visual interpretation of the machine learning model using SHAP

    A: variables ranked by importance according to mean(|SHAP value|); B: SHAP summary plot; SHAP: SHapley Additive exPlanations; WC: waist circumference; BRI: body roundness index; TyG: triglyceride-glucose index; HbA1c: glycosylated hemoglobin; HDL-C: high-density lipoprotein cholesterol.
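The ranking in panel A is the mean absolute SHAP value per variable. Assuming the SHAP values are already available as one row per participant and one column per variable, the ranking step can be sketched as follows; the feature names and numbers are placeholders, not results from the paper.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    n = len(shap_rows)
    means = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

# Placeholder SHAP matrix: 2 participants x 3 hypothetical features.
toy_shap = [
    [0.1, -0.2, 0.4],
    [-0.1, 0.3, -0.6],
]
ranking = rank_by_mean_abs_shap(toy_shap, ["f1", "f2", "f3"])
for name, importance in ranking:
    print(name, round(importance, 3))
```

Taking the absolute value first matters: a variable that pushes some predictions up and others down would otherwise average out to near zero.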

    Table 1. Comparison of baseline characteristics in the non-NAFLD and NAFLD groups

    | Variable | Non-NAFLD group (n=2 944) | NAFLD group (n=3 974) | Z/χ² value | P value |
    |---|---|---|---|---|
    | Age/years | 43.00(27.00, 62.00) | 54.00(39.00, 65.00) | -15.22 | < 0.001 |
    | Gender | | | 31.55 | < 0.001 |
    | - Male | 1 314(44.63) | 2 045(51.46) | | |
    | - Female | 1 630(55.37) | 1 929(48.54) | | |
    | Race | | | 96.25 | < 0.001 |
    | - Non-Hispanic whites | 921(31.28) | 1 337(33.64) | | |
    | - Non-Hispanic blacks | 897(30.47) | 908(22.85) | | |
    | - Non-Hispanic Asian | 410(13.93) | 490(12.33) | | |
    | - Mexican Americans | 265(9.00) | 600(15.10) | | |
    | - Other Hispanics | 301(10.22) | 449(11.30) | | |
    | - Other races | 150(5.10) | 190(4.78) | | |
    | BMI/(kg·m⁻²) | 25.10(22.10, 28.60) | 31.60(27.70, 36.70) | -42.25 | < 0.001 |
    | Arm circumference/cm | 30.60(27.80, 33.70) | 35.20(32.10, 38.70) | -37.05 | < 0.001 |
    | Waist circumference/cm | 88.50(79.88, 97.80) | 106.60(97.33, 118.20) | -46.28 | < 0.001 |
    | Blood urea nitrogen/(mmol·L⁻¹) | 5.00(3.93, 6.07) | 5.00(4.28, 6.07) | -5.48 | < 0.001 |
    | Serum creatinine/(μmol·L⁻¹) | 74.26(62.76, 87.52) | 75.14(62.76, 88.40) | -0.83 | 0.407 |
    | Serum uric acid/(μmol·L⁻¹) | 291.50(243.90, 340.84) | 333.10(279.60, 386.60) | -20.87 | < 0.001 |
    | Albumin/(g·L⁻¹) | 41.00(39.00, 43.00) | 41.00(38.00, 43.00) | -9.72 | < 0.001 |
    | Alkaline phosphatase/(U·L⁻¹) | 72.00(59.00, 85.00) | 77.43(66.00, 92.00) | -12.69 | < 0.001 |
    | Globulin/(g·L⁻¹) | 30.00(28.00, 33.00) | 31.00(28.35, 34.00) | -6.92 | < 0.001 |
    | ALT/(U·L⁻¹) | 15.00(11.35, 20.00) | 20.00(14.00, 28.00) | -23.78 | < 0.001 |
    | AST/(U·L⁻¹) | 18.00(15.84, 22.00) | 19.00(16.00, 24.00) | -7.72 | < 0.001 |
    | GGT/(U·L⁻¹) | 17.00(12.00, 23.79) | 24.00(17.00, 34.00) | -25.51 | < 0.001 |
    | Total bilirubin/(μmol·L⁻¹) | 6.84(5.13, 10.26) | 6.84(5.13, 8.55) | -3.22 | 0.001 |
    | Total protein/(g·L⁻¹) | 72.00(69.00, 74.00) | 72.00(69.00, 74.00) | -0.75 | 0.456 |
    | Lactate dehydrogenase/(U·L⁻¹) | 151.00(135.00, 170.00) | 156.00(141.00, 174.00) | -7.92 | < 0.001 |
    | TC/(mmol·L⁻¹) | 4.58(4.01, 5.22) | 4.78(4.14, 5.41) | -7.29 | < 0.001 |
    | TG/(mmol·L⁻¹) | 1.01(0.76, 1.40) | 1.52(1.10, 2.11) | -31.48 | < 0.001 |
    | FPG/(mmol·L⁻¹) | 5.00(4.66, 5.33) | 5.33(4.94, 6.00) | -25.79 | < 0.001 |
    | HDL-C/(mmol·L⁻¹) | 1.43(1.22, 1.68) | 1.19(1.03, 1.42) | 27.16 | < 0.001 |
    | White blood cell/(×10⁹·L⁻¹) | 6.50(5.40, 7.80) | 7.20(6.02, 8.60) | -15.36 | < 0.001 |
    | Neutrophils/(×10⁹·L⁻¹) | 3.62(2.80, 4.70) | 4.10(3.27, 5.20) | -12.76 | < 0.001 |
    | Lymphocyte/(×10⁹·L⁻¹) | 2.00(1.60, 2.40) | 2.20(1.80, 2.70) | -11.66 | < 0.001 |
    | Red blood cell/(×10¹²·L⁻¹) | 4.64(4.34, 4.97) | 4.79(4.48, 5.12) | -11.99 | < 0.001 |
    | Mean corpuscular volume/fL | 89.10(86.00, 92.03) | 88.20(85.00, 91.20) | -7.05 | < 0.001 |
    | Platelet/(×10⁹·L⁻¹) | 237.00(203.00, 276.00) | 244.00(207.25, 286.00) | -5.00 | < 0.001 |
    | Glycosylated hemoglobin/% | 5.40(5.20, 5.70) | 5.70(5.40, 6.30) | -25.79 | < 0.001 |
    | SBP/mmHg | 116.79(107.67, 130.67) | 123.67(113.33, 135.33) | -13.39 | < 0.001 |
    | DBP/mmHg | 70.33(64.67, 77.08) | 75.33(69.00, 82.00) | -18.60 | < 0.001 |
    | Triglyceride-glucose index | 8.30(7.99, 8.67) | 8.81(8.45, 9.22) | -35.06 | < 0.001 |
    | Body roundness index | 4.03(2.97, 5.28) | 6.40(5.10, 8.11) | -44.58 | < 0.001 |

    Note: NAFLD, non-alcoholic fatty liver disease; ALT, alanine aminotransferase; AST, aspartate aminotransferase; GGT, gamma-glutamyltransferase; TC, total cholesterol; TG, triglycerides; FPG, fasting plasma glucose; HDL-C, high-density lipoprotein cholesterol; SBP, systolic blood pressure; DBP, diastolic blood pressure.
    Values are expressed as M(P25, P75) or number of people (proportion/%).
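The two derived indices in the last rows of Table 1 are not defined on this page. The sketch below uses the definitions most commonly found in the literature and assumes, without confirmation from the source, that the paper followed them: TyG = ln(TG × FPG / 2) with both values in mg/dL, and BRI = 364.2 - 365.5 × sqrt(1 - (WC/(2π))² / (0.5 × height)²). The unit conversion factors and function names are likewise assumptions.

```python
import math

# Conventional conversion factors from mmol/L to mg/dL (assumed, not from the paper).
TG_MMOL_TO_MGDL = 88.57
GLUCOSE_MMOL_TO_MGDL = 18.016

def tyg_index(tg_mmol_l, fpg_mmol_l):
    """Triglyceride-glucose index: ln(TG [mg/dL] * FPG [mg/dL] / 2)."""
    tg = tg_mmol_l * TG_MMOL_TO_MGDL
    fpg = fpg_mmol_l * GLUCOSE_MMOL_TO_MGDL
    return math.log(tg * fpg / 2.0)

def body_roundness_index(waist_cm, height_cm):
    """Body roundness index from waist circumference and height (same units)."""
    waist_radius = waist_cm / (2.0 * math.pi)
    half_height = 0.5 * height_cm
    eccentricity = math.sqrt(1.0 - (waist_radius / half_height) ** 2)
    return 364.2 - 365.5 * eccentricity

# Median TG and FPG of the non-NAFLD group from Table 1.
print(round(tyg_index(1.01, 5.00), 2))

# Median waist circumference of the non-NAFLD group; the height of 170 cm is
# hypothetical because Table 1 does not report height.
print(round(body_roundness_index(88.5, 170.0), 2))
```

Plugging the group medians of TG and FPG into the formula gives about 8.30, in line with the tabulated median TyG for the non-NAFLD group, although the median of an index need not equal the index of the medians.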

    Table 2. Performance of machine learning models on the test set

    | Model | AUC | Accuracy | Sensitivity | Specificity | F1 score | Brier score |
    |---|---|---|---|---|---|---|
    | DT | 0.825 | 0.759 | 0.697 | 0.804 | 0.707 | 0.170 |
    | EN | 0.838 | 0.763 | 0.682 | 0.821 | 0.705 | 0.194 |
    | LR | 0.843 | 0.770 | 0.694 | 0.824 | 0.716 | 0.157 |
    | MLP | 0.847 | 0.764 | 0.723 | 0.794 | 0.719 | 0.179 |
    | SVM | 0.849 | 0.771 | 0.688 | 0.831 | 0.714 | 0.154 |
    | XGBoost | 0.851 | 0.757 | 0.760 | 0.754 | 0.722 | 0.153 |

    Note: DT, decision tree; EN, elastic net; LR, logistic regression; MLP, multilayer perceptron; SVM, support vector machine; XGBoost, eXtreme gradient boosting; AUC, area under curve.
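Apart from AUC, the columns of Table 2 follow from a confusion matrix and from the predicted probabilities. The sketch below shows the standard definitions; the counts and probabilities are hypothetical and do not reproduce any row of the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

def brier_score(labels, probabilities):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities)) / len(labels)

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=20, tn=30, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))

# Hypothetical outcomes and predicted probabilities.
print(round(brier_score([1, 0], [0.8, 0.3]), 3))
```

Because the Brier score depends on calibration as well as discrimination, a model can rank well on AUC and still score poorly on it, which is why the table reports both.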
Publication history
  • Received: 2025-01-06
  • Revised: 2025-04-12
  • Published online: 2025-07-07
  • Issue date: 2025-06-10
