Cell-of-origin subtype classification and prognosis of diffuse large B-cell lymphoma based on variable importance analysis
Keywords:
- Diffuse large B-cell lymphoma /
- Penalized regression /
- Prognosis /
- Subtype classification /
- Variable importance analysis
Abstract: Objective: Gene expression profiling (GEP) is the gold standard for cell-of-origin (COO) classification of diffuse large B-cell lymphoma (DLBCL). This study aimed to establish a parsimonious GEP-based model that accurately predicts the COO subtypes of DLBCL and to provide a reference for its clinical application. Methods: Gene expression and clinical data from six DLBCL datasets in the GEO database were collected; one dataset served as the training set and the remaining five as validation sets. A variable importance analysis strategy based on penalized regression was constructed to identify the optimal subset of variables, and logistic regression was then used to determine the final six-gene model for COO classification. Survival analysis was used to assess the relationship between clinical prognosis and the two COO subtypes predicted in the training and validation sets. Results: The six-gene model performed well in the training set [AUC (95% CI): 0.999 (0.997~1.000); discrimination slope (95% CI): 0.944 (0.920~0.966)] and in the validation sets [AUC (95% CI) ranged from 0.910 (0.820~0.999) to 1.000; discrimination slope (95% CI) ranged from 0.506 (0.350~0.966) to 0.927 (0.841~0.987)]. In the prognostic models, the subtype predicted by the six genes was a risk predictor in both the training and validation sets (all P < 0.05). Conclusions: The six genes in the six-gene model have important clinical value for the classification and prognosis of DLBCL. The gene ranking based on variable importance provides a reference for further research on gene function and targeted drugs.
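The two performance measures quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: AUC as the probability that a randomly chosen case of one subtype receives a higher predicted probability than a randomly chosen case of the other, and the discriminant (discrimination) slope as the difference between the mean predicted probabilities of the two groups. The probabilities below are hypothetical and are not taken from the study.

```python
from statistics import mean


def auc(positive_probs, negative_probs):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in positive_probs:
        for n in negative_probs:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_probs) * len(negative_probs))


def discrimination_slope(positive_probs, negative_probs):
    """Mean predicted probability in positives minus mean in negatives."""
    return mean(positive_probs) - mean(negative_probs)


# Hypothetical predicted probabilities of belonging to the "positive" subtype.
positive_probs = [0.9, 0.8, 0.6]  # cases that truly belong to that subtype
negative_probs = [0.1, 0.3, 0.7]  # cases that truly belong to the other subtype

auc_value = auc(positive_probs, negative_probs)
slope_value = discrimination_slope(positive_probs, negative_probs)
print(f"AUC = {auc_value:.3f}, discrimination slope = {slope_value:.3f}")
```

On these toy numbers, eight of the nine pairs are ordered correctly, so the AUC is about 0.889, and the slope is 0.4. A model can have an AUC close to 1 while its slope is much lower, which is why the abstract reports both.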
Table 1. Variable ranking lists from different penalized regression methods and the aggregated ranking list obtained by rank aggregation

| Rank | LASSO | aLASSO | EN | Ridge regression | MCP | SCAD | Rank aggregation |
|---|---|---|---|---|---|---|---|
| 1 | TNFRSF13B | TNFRSF13B | MYBL1 | AFFX-HUMISGF3A/M97935_MB_at | MYBL1 | MYBL1 | MYBL1 |
| 2 | MYBL1 | MYBL1 | TNFRSF13B | CCL5 | TNFRSF13B | TNFRSF13B | TNFRSF13B |
| 3 | MAML3 | MAML3 | BATF | SCARB1 | MAML3 | MAML3 | MAML3 |
| 4 | CYB5R2 | CYB5R2 | CYB5R2 | PXK | CYB5R2 | CYB5R2 | CYB5R2 |
| 5 | BATF | BATF | ASB13 | PXK | S1PR2 | BATF | BATF |
| 6 | ASB13 | ASB13 | MAML3 | C15orf40 | ENTPD1 | S1PR2 | S1PR2 |
| 7 | S1PR2 | S1PR2 | LIMD1 | PDE7A | LMO2 | ASB13 | ASB13 |
| 8 | LIMD1 | LIMD1 | S1PR2 | CLEC12A | PALD1 | LIMD1 | LIMD1 |
| 9 | SERPINA9 | SERPINA9 | SERPINA9 | LACTB | ZBTB32 | LMO2 | SERPINA9 |
| 10 | ENTPD1 | ENTPD1 | MME | LACTB | FUT8 | SERPINA9 | ENTPD1 |
| 11 | LMO2 | LMO2 | PIM2 | NLRP11 | ASB13 | ENTPD1 | LMO2 |
| 12 | FUT8 | FUT8 | FUT8 | CDC42SE2 | PTK2 | FUT8 | FUT8 |
| 13 | ZBTB32 | ZBTB32 | ENTPD1 | 1552621_at | PIM2 | ZBTB32 | PALD1 |
| 14 | PALD1 | PALD1 | ENTPD1 | 1552622_s_at | SERPINA9 | PALD1 | ZBTB32 |
| 15 | BCL2L10 | BCL2L10 | CFLAR | TRNT1 | ARHGAP24 | BCL2L10 | BCL2L10 |
| 16 | PIM2 | PIM2 | BCL6 | ARHGAP5 | 215164_at | STAG3 | PDE7A |

Note: LASSO, least absolute shrinkage and selection operator; aLASSO, adaptive least absolute shrinkage and selection operator; EN, elastic net; SCAD, smoothly clipped absolute deviation penalty; MCP, minimax concave penalty.
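As a rough illustration of how several ranking lists can be merged into one, the sketch below applies a simple Borda count to the top four entries of each method in Table 1. The aggregation algorithm actually used is not described in this section, so the Borda count is a swapped-in stand-in, and the function name `borda_aggregate` is hypothetical.

```python
from collections import defaultdict

# Top four entries of each method's list, copied from Table 1.
rankings = {
    "LASSO": ["TNFRSF13B", "MYBL1", "MAML3", "CYB5R2"],
    "aLASSO": ["TNFRSF13B", "MYBL1", "MAML3", "CYB5R2"],
    "EN": ["MYBL1", "TNFRSF13B", "BATF", "CYB5R2"],
    "Ridge": ["AFFX-HUMISGF3A/M97935_MB_at", "CCL5", "SCARB1", "PXK"],
    "MCP": ["MYBL1", "TNFRSF13B", "MAML3", "CYB5R2"],
    "SCAD": ["MYBL1", "TNFRSF13B", "MAML3", "CYB5R2"],
}


def borda_aggregate(ranked_lists):
    """Borda count: in a list of length k, position i (0-based) earns k - i points.

    Items missing from a list earn nothing from it. Ties in the total
    score are broken alphabetically so the result is deterministic.
    """
    scores = defaultdict(int)
    for ranked in ranked_lists:
        k = len(ranked)
        for position, item in enumerate(ranked):
            scores[item] += k - position
    return sorted(scores, key=lambda item: (-scores[item], item))


aggregated = borda_aggregate(rankings.values())
print(aggregated[:4])  # ['MYBL1', 'TNFRSF13B', 'MAML3', 'CYB5R2']
```

On this truncated input the Borda order happens to agree with the first four entries of the rank aggregation column. That agreement is not guaranteed for the full lists or for other aggregation methods.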
Table 2. Univariate and multivariable Cox proportional hazards regression analyses in the training and validation datasets

| Dataset | Variable | Univariate β | Univariate SE | Univariate HR (95% CI) | Univariate χ2 | Univariate P | Multivariable β | Multivariable SE | Multivariable HR (95% CI) | Multivariable χ2 | Multivariable P |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training set GSE10846 (n=350) | Six-genes | 1.040 | 0.179 | 2.830 (1.993~4.018) | 33.813 | <0.001 | 0.981 | 0.202 | 2.668 (1.795~3.968) | 23.514 | <0.001 |
| Training set GSE10846 (n=350) | International prognostic index | 1.091 | 0.192 | 2.978 (2.044~4.340) | 32.252 | <0.001 | 1.135 | 0.195 | 3.110 (2.120~4.562) | 33.688 | <0.001 |
| Training set GSE10846 (n=350) | Treatment | -0.554 | 0.177 | 0.574 (0.406~0.812) | 9.851 | 0.002 | -0.685 | 0.205 | 0.503 (0.337~0.754) | 11.077 | 0.001 |
| Training set GSE10846 (n=350) | Gender | 0.031 | 0.173 | 1.032 (0.735~1.449) | 0.032 | 0.858 | ― | ― | ― | ― | ― |
| Validation set ComBat data (n=624) | Six-genes | 0.457 | 0.157 | 1.579 (1.161~2.147) | 8.470 | 0.004 | 0.425 | 0.167 | 1.530 (1.104~2.121) | 6.500 | 0.011 |
| Validation set ComBat data (n=624) | International prognostic index | 0.966 | 0.165 | 2.627 (1.900~3.632) | 34.121 | <0.001 | 0.914 | 0.166 | 2.494 (1.799~3.458) | 30.090 | <0.001 |
| Validation set ComBat data (n=624) | Gender | -0.086 | 0.158 | 0.918 (0.674~1.251) | 0.292 | 0.588 | ― | ― | ― | ― | ― |

Note: "―" indicates that data are not available. SE, standard error of β; HR, hazard ratio; CI, confidence interval.
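The columns of Table 2 are linked by standard Cox regression relationships, which the following sketch checks on the univariate six-gene row of the training set. It assumes that the second numeric column is the standard error of β, that the interval is a Wald interval using 1.96, and that the χ2 column is the Wald statistic; the function name `cox_summary` is hypothetical.

```python
import math


def cox_summary(beta, se, z=1.96):
    """Return HR, Wald 95% CI bounds and Wald chi-square for one coefficient."""
    hr = math.exp(beta)               # hazard ratio
    lower = math.exp(beta - z * se)   # lower 95% confidence limit
    upper = math.exp(beta + z * se)   # upper 95% confidence limit
    wald_chi2 = (beta / se) ** 2      # Wald chi-square with 1 degree of freedom
    return hr, lower, upper, wald_chi2


# Univariate six-gene row, training set (values as printed in Table 2).
hr, lower, upper, wald_chi2 = cox_summary(beta=1.040, se=0.179)
print(f"HR = {hr:.3f} ({lower:.3f}~{upper:.3f}), Wald chi2 = {wald_chi2:.2f}")
```

The recomputed values agree with the tabulated 2.830 (1.993~4.018) and 33.813 up to the rounding of β and its standard error to three decimals, which supports reading that column as a standard error.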