Academic Salon Topic: Category encoding method to select feature genes for the classification of bulk and single-cell RNA-seq data
Speaker: Yan Zhou (周彦), Distinguished Research Fellow, Shenzhen University
Time: Thursday, June 16, 2022, 15:00-16:30
Venue: Tencent Meeting, ID: 696750498
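About the Speaker: Yan Zhou is a Distinguished Research Fellow and doctoral supervisor at Shenzhen University. He completed his bachelor's, master's, and doctoral studies at Northeast Normal University, graduating in 2013, and then carried out postdoctoral research at the University of Illinois at Urbana-Champaign before joining Shenzhen University in 2015. During this period he has been a frequent visitor to the University of Hong Kong, Hong Kong Baptist University, and City University of Hong Kong. His research covers statistics, biostatistics, machine learning, medical statistics, and related areas of data science. He has received a Category C award under the Shenzhen Peacock Plan and the title of "Nanshan District Leading Talent". He has led seven funded projects in total, including national General Program and Youth Program grants. As first author, he has published more than thirty SCI papers in leading international journals, including Genome Research (impact factor: 14.38), Bioinformatics (impact factor: 7.38), Statistics in Medicine, and BMC Genomics (impact factor: 3.86); his most-cited paper has more than 100 citations. He currently serves as a member of the Guangdong Provincial Teaching Steering Committee for Higher Education. His society roles include Deputy Secretary-General and Executive Council Member of the Guangdong Field Statistics Association, Council Member of the China Industrial Statistics Association, and Council Member of the China Environmental and Resource Statistics Conference.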
Abstract: Bulk and single-cell RNA-seq (scRNA-seq) data are being used as alternatives to traditional technology in biological and medical research. These data are used, for example, for the detection of differentially expressed (DE) genes. Several statistical methods have been developed for the classification of bulk and single-cell RNA-seq data, and feature genes are vitally important for this classification. The majority of genes, however, are not differentially expressed and are thus irrelevant for class distinction. Removing these irrelevant genes is therefore necessary to improve classification performance and save computation time, and it also aids the detection of the important feature genes.
In this paper, a category encoding (CAEN) method is proposed to select feature genes for bulk and single-cell RNA-seq data classification. This novel method encodes categories by employing the rank of the sequenced samples for each gene in each class. For each gene, a correlation coefficient between gene and class is then computed from the rank of each sample and the new rank of its category. The sure screening and rank consistency properties of the proposed CAEN method are also established. Simulation studies show that the classifier using the proposed CAEN method performs better than, or at least as well as, the existing methods in most settings. Existing real datasets were also analyzed, and the results demonstrate superior performance of the proposed method over current competitors. The method has been implemented in an R package to facilitate wide use.
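The rank-based idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' CAEN implementation or their R package. It assumes, for illustration only, that each class is encoded by the rank of its mean sample rank and that genes are scored by the Pearson correlation between sample ranks and these class codes. The function names (`average_ranks`, `gene_score`, `select_genes`) and the toy data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def gene_score(expression, labels):
    """Score one gene (assumed scheme, not the published CAEN statistic)."""
    sample_ranks = average_ranks(expression)
    classes = sorted(set(labels))
    # Mean rank of the samples belonging to each class.
    class_mean_rank = [
        mean([r for r, lab in zip(sample_ranks, labels) if lab == c])
        for c in classes
    ]
    # Encode each class by the rank of its mean sample rank.
    code_of = dict(zip(classes, average_ranks(class_mean_rank)))
    encoded = [code_of[lab] for lab in labels]
    return pearson(sample_ranks, encoded)


def select_genes(counts, labels, top_k):
    """Return the top_k gene names ordered by decreasing score."""
    scores = {g: gene_score(v, labels) for g, v in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (made up): geneA separates the classes, geneC is constant.
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
counts = {
    "geneA": [1, 2, 3, 10, 11, 12, 30, 31, 32],
    "geneB": [5, 9, 1, 7, 3, 8, 2, 6, 4],
    "geneC": [5, 5, 5, 5, 5, 5, 5, 5, 5],
}
selected = select_genes(counts, labels, 2)
print(selected)
```

In this toy example the gene whose expression cleanly separates the three classes receives the highest score, and the constant gene scores zero, so it would be screened out before any classifier is trained.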