Identifying and analyzing different cancer subtypes using RNA-seq data of blood platelets
Yu-Hang Zhang1,2,*, Tao Huang2,*, Lei Chen4,*, YaoChen Xu5, Yu Hu2, Lan-Dian Hu2, Yudong Cai3 and Xiangyin Kong2
1Department of General Surgery, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai 200233, People’s Republic of China
2Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, People’s Republic of China
3School of Life Sciences, Shanghai University, Shanghai 200444, People’s Republic of China
4College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People’s Republic of China
5Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People’s Republic of China
*These authors have contributed equally to this work
Lan-Dian Hu, email: email@example.com
Yudong Cai, email: firstname.lastname@example.org
Xiangyin Kong, email: email@example.com
Keywords: cancer detection, liquid biopsy, RNA-seq data, support vector machine, maximum relevance minimum redundancy
Received: June 15, 2017 Accepted: August 16, 2017 Published: September 15, 2017
Detection and diagnosis of cancer are especially important for early prevention and effective treatment. Traditional methods of cancer detection are usually time-consuming and expensive. Liquid biopsy, a recently proposed noninvasive detection approach, can improve the accuracy and decrease the cost of detection by using a personalized expression profile. However, few studies have analyzed this type of data, even though such analysis could lead to more effective methods for detecting different cancer subtypes. In this study, we applied reliable machine learning algorithms to analyze data retrieved from patients who had one of six cancer subtypes (breast cancer, colorectal cancer, glioblastoma, hepatobiliary cancer, lung cancer and pancreatic cancer) as well as from healthy persons. Quantitative gene expression profiles were used to encode each sample and were then analyzed by the maximum relevance minimum redundancy (mRMR) method, which yielded two ranked feature lists of genes: the mRMR feature list and the MaxRel feature list. The incremental feature selection method was applied to the mRMR feature list to extract the optimal feature subset, which was used with the support vector machine algorithm to obtain the best performance for distinguishing cancer subtypes and healthy controls. Ten-fold cross-validation of the constructed optimal classification model yielded an overall accuracy of 0.751. In addition, we extracted the top eighteen features (genes), including TTN, RHOH, RPS20 and TRBC2, from the MaxRel feature list and performed a detailed analysis of them. The results indicated that these genes could be important biomarkers for discriminating different cancer subtypes and healthy controls.
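The feature-selection pipeline summarized above (relevance ranking, mRMR ranking, then incremental feature selection over the ranked list) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The gene names and expression levels are hypothetical toy data, the "relevance minus mean redundancy" scoring is one common form of mRMR assumed here, and a leave-one-out lookup-table classifier stands in for the support vector machine with ten-fold cross-validation used in the study.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mrmr_rank(columns, labels):
    """Return (MaxRel list, mRMR list) of feature names.

    MaxRel ranks by relevance to the labels only. mRMR greedily picks the
    feature maximizing relevance minus mean redundancy with those already
    selected (an assumed, commonly used form of the criterion).
    """
    relevance = {f: mutual_information(col, labels) for f, col in columns.items()}
    maxrel = sorted(columns, key=lambda f: -relevance[f])
    selected, remaining = [], list(columns)
    while remaining:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(mutual_information(columns[f], columns[s])
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return maxrel, selected

def loo_accuracy(columns, labels, features):
    """Stand-in evaluator (NOT an SVM): leave-one-out accuracy of a
    classifier that predicts the majority label among the other samples
    sharing the same feature values; unseen patterns count as misses."""
    rows = list(zip(*(columns[f] for f in features)))
    hits = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        votes = Counter(l for j, (r, l) in enumerate(zip(rows, labels))
                        if j != i and r == row)
        if votes and votes.most_common(1)[0][0] == label:
            hits += 1
    return hits / len(labels)

def incremental_feature_selection(ranked, evaluate):
    """IFS: score the top-k features for k = 1..N, keep the best prefix."""
    best_k = max(range(1, len(ranked) + 1), key=lambda k: evaluate(ranked[:k]))
    return ranked[:best_k]

# Hypothetical toy data: 12 samples, 3 classes, discretized expression levels.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
columns = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],  # separates class 0
    "gene_b": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # noisy copy of gene_a
    "gene_c": [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],  # weakly marks class 2
}

maxrel, mrmr = mrmr_rank(columns, labels)
optimal = incremental_feature_selection(
    mrmr, lambda feats: loo_accuracy(columns, labels, feats))
print("MaxRel list:", maxrel)
print("mRMR list:", mrmr)
print("Optimal subset:", optimal)
```

On this toy data the two rankings differ: the redundant gene_b ranks second by relevance alone but drops to last under mRMR, which is the behaviour the redundancy term is meant to produce. With scikit-learn available, the evaluator could instead wrap an SVM such as sklearn.svm.SVC scored by ten-fold cross-validation, which would be closer to the setup the abstract describes.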