XGBoost-Based Gene Regulatory Network Inference Method for Steady-State Data
Abstract: Inferring gene regulatory networks (GRNs) from steady-state gene expression data remains a challenge in systems biology: many potential regulatory relationships, direct or indirect, are hard to identify, and the accuracy and reliability of traditional methods still leave room for improvement. To address this, we propose a method based on a boosting ensemble model (XGBoost) that applies randomization and regularization to counter overfitting. Because the weights obtained from different subproblems are not on a consistent scale, we post-process the initial weights with normalization and statistical methods. We evaluate the method on the benchmark datasets from the DREAM5 challenge. The results show that it outperforms existing methods: on the simulated dataset generated in silico, both the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC) are significantly better than those of existing methods; on the real experimental data for two organisms, E. coli and S. cerevisiae, its AUROC is higher than that of the best existing methods.
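The weight-consistency issue mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each target gene is handled as a separate regression subproblem yielding one raw importance score per candidate regulator (for example, feature importances of a per-target boosted model), and that the scores are rescaled per target before all edges are ranked together. The function names, the sum-to-one scaling, and the toy numbers are illustrative assumptions, not the procedure used in the paper.

```python
def normalize_per_target(raw):
    """Rescale each target's raw scores to sum to 1 (assumed scheme)."""
    normalized = []
    for row in raw:
        total = sum(row)
        normalized.append([w / total if total > 0 else 0.0 for w in row])
    return normalized


def rank_edges(scores, regulators, targets):
    """Return (regulator, target, score) triples, strongest first."""
    edges = [
        (regulators[j], targets[i], scores[i][j])
        for i in range(len(targets))
        for j in range(len(regulators))
        if regulators[j] != targets[i]  # skip self-regulation
    ]
    return sorted(edges, key=lambda e: e[2], reverse=True)


# Hypothetical raw weights: rows are targets, columns are regulators.
# The two subproblems produce scores on very different scales.
regulators = ["r1", "r2"]
targets = ["t1", "t2"]
raw = [
    [80.0, 20.0],  # subproblem for t1
    [1.0, 9.0],    # subproblem for t2
]

ranked_raw = rank_edges(raw, regulators, targets)
ranked_norm = rank_edges(normalize_per_target(raw), regulators, targets)

print([e[:2] for e in ranked_raw])
print([e[:2] for e in ranked_norm])
```

In this toy case, ranking the raw scores places both edges into t1 ahead of every edge into t2 simply because the t1 subproblem produced larger numbers. After per-target rescaling, the dominant regulator of t2 is no longer buried by that difference in scale.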