Application of SIS- and MDS-based dimensionality reduction strategies and machine learning statistical modeling methods to breast cancer transcriptome data
Abstract:
Objective: To investigate a two-step dimensionality reduction strategy based on sure independence screening (SIS) and multi-dimensional scaling (MDS), combined with three machine learning algorithms, support vector machine (SVM), random forest (RF), and gradient boosting machine (GBM), for building statistical models to predict the risk of lymph node metastasis in breast cancer, and thereby to provide a scientific basis for identifying high-risk groups and for early intervention.
Methods: SIS and MDS were used as the first-step dimensionality reduction methods, and the least absolute shrinkage and selection operator (LASSO) was used as the second step. The variables selected by the two-step strategies SIS+LASSO and MDS+LASSO were fed into the three machine learning models SVM, RF, and GBM, respectively. The area under the receiver operating characteristic (ROC) curve (AUC) was used as the metric of prediction performance.
Results: For all three prediction models (SVM, RF, and GBM), the SIS+LASSO and MDS+LASSO two-step strategies improved prediction stability and reduced running time and memory consumption compared with the single-step SIS and MDS strategies. The MDS+LASSO two-step strategy also improved prediction accuracy compared with the single-step MDS strategy. Under all strategies, GBM achieved higher prediction accuracy than SVM and RF.
Conclusions: Adding LASSO to SIS and MDS as a second dimensionality reduction step compensates for the shortcomings of single-step SIS and MDS in terms of computing speed, memory consumption, choice of modeling method, and prediction accuracy. Across the different dimensionality reduction strategies, GBM outperforms SVM and RF in prediction performance.
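To make the Methods concrete, the sketch below (a minimal illustration, not the authors' implementation) builds the SIS+LASSO pipeline with scikit-learn on synthetic placeholder data and evaluates SVM, RF, and GBM by AUC. SIS is approximated here by marginal correlation screening, the 88 screened variables follow Table 1, and the data shapes and all hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))   # placeholder expression matrix (samples x genes)
y = rng.integers(0, 2, size=200)   # placeholder lymph node metastasis labels (0/1)

def sis_screen(X, y, k=88):
    # Step 1a (SIS, approximated): keep the k genes with the largest absolute
    # marginal correlation with the outcome.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(r)[::-1][:k]

def mds_embed(X, k=88):
    # Step 1b (MDS): embed the samples into k dimensions with metric MDS.
    return MDS(n_components=k, random_state=0).fit_transform(X)

def lasso_select(X, y):
    # Step 2 (LASSO): L1-penalised logistic regression drops redundant variables;
    # C is illustrative and would normally be tuned by cross-validation.
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
    return SelectFromModel(lasso, prefit=True).transform(X)

# SIS+LASSO path; the MDS+LASSO path is identical with mds_embed(X) used in
# place of X[:, sis_screen(X, y)].
X_red = lasso_select(X[:, sis_screen(X, y)], y)
X_tr, X_te, y_tr, y_te = train_test_split(X_red, y, test_size=0.3,
                                          random_state=0, stratify=y)

models = {
    "SVM": SVC(probability=True, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, prob):.3f}")
```

In practice the LASSO penalty and the model hyperparameters would be tuned, and the single-step SIS or MDS strategies correspond to skipping the LASSO step and feeding all 88 screened variables or components directly into the classifiers.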
Figure 1. Research workflow
Note: SIS, sure independence screening; MDS, multi-dimensional scaling; LASSO, least absolute shrinkage and selection operator; SVM, support vector machine; RF, random forest; GBM, gradient boosting machine.
Figure 2. Prediction performance of SIS and SIS+LASSO under different statistical modeling approaches
Note: AUC, area under the curve; SVM, support vector machine; RF, random forest; GBM, gradient boosting machine; SIS, sure independence screening strategy; LASSO, least absolute shrinkage and selection operator; SIS+LASSO, the strategy combining SIS and LASSO.
Figure 3. Prediction performance of MDS and MDS+LASSO under different statistical modeling approaches
Note: AUC, area under the curve; SVM, support vector machine; RF, random forest; GBM, gradient boosting machine; MDS, multi-dimensional scaling; LASSO, least absolute shrinkage and selection operator; MDS+LASSO, the strategy combining MDS and LASSO.
Table 1. AUC, time, and memory consumption under different dimensionality reduction strategies

Strategy       Model   AUC①                    Memory/KB   Time/s
SIS(88)        SVM     0.879 (0.822, 0.933)    0.020       189.233
               RF      0.910 (0.875, 0.952)    0.102       484.055
               GBM     0.917 (0.880, 0.967)    4.020       380.851
SIS+LASSO      SVM     0.875 (0.828, 0.928)    0.007       50.761
               RF      0.888 (0.855, 0.929)    0.057       402.752
               GBM     0.895 (0.861, 0.949)    3.833       97.875
MDS(88)        SVM     0.735 (0.684, 0.778)    0.018       210.714
               RF      0.879 (0.846, 0.928)    0.145       641.008
               GBM     0.903 (0.871, 0.947)    4.739       380.852
MDS+LASSO      SVM     0.873 (0.821, 0.919)    0.006       56.422
               RF      0.885 (0.860, 0.922)    0.108       608.742
               GBM     0.900 (0.858, 0.943)    4.160       105.042

Note: 1. AUC, area under the curve; SIS, sure independence screening; LASSO, least absolute shrinkage and selection operator; MDS, multi-dimensional scaling; SVM, support vector machine; RF, random forest; GBM, gradient boosting machine.
2. SIS(88) and MDS(88) indicate that the number of variables selected by the method is 88; SIS+LASSO, the strategy combining SIS and LASSO; MDS+LASSO, the strategy combining MDS and LASSO.
① Reported as M (P25, P75).
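Table 1 reports AUC as M (P25, P75), i.e., a median with its interquartile range, which implies repeated evaluation. The sketch below shows one way such a summary could be produced over repeated random train/test splits; the resampling scheme, number of repeats, and data shapes are assumptions for illustration and are not taken from the excerpt.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))    # placeholder for the reduced feature matrix
y = rng.integers(0, 2, size=200)  # placeholder labels (0/1)

aucs = []
for seed in range(100):           # number of repeats is an assumption
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed, stratify=y)
    prob = (GradientBoostingClassifier(random_state=seed)
            .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    aucs.append(roc_auc_score(y_te, prob))

# Summarise the repeated AUC values as median (P25, P75), as in Table 1.
p25, med, p75 = np.percentile(aucs, [25, 50, 75])
print(f"AUC: {med:.3f} ({p25:.3f}, {p75:.3f})")
```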