Comparing progression molecular mechanisms between lung adenocarcinoma and lung squamous cell carcinoma based on genetic and epigenetic networks: big data mining and genome-wide systems identification

Non-small-cell lung cancer (NSCLC) is the predominant type of lung cancer in the world. Lung adenocarcinoma (LADC) and lung squamous cell carcinoma (LSCC) are subtypes of NSCLC. We usually regard them as different disease due to their unique molecular characteristics, distinct cells of origin and dissimilar clinical response. However, the differences of genetic and epigenetic progression mechanism between LADC and LSCC are complicated to analyze. Therefore, we applied systems biology approaches and big databases mining to construct genetic and epigenetic networks (GENs) with next-generation sequencing data of LADC and LSCC. In order to obtain the real GENs, system identification and system order detection are conducted on gene regulatory networks (GRNs) and protein-protein interaction networks (PPINs) for each stage of LADC and LSCC. The core GENs were extracted via principal network projection (PNP). Based on the ranking of projection values, we got the core pathways in respect of KEGG pathway. Compared with the core pathways, we found significant differences between microenvironments, dysregulations of miRNAs, epigenetic modifications on certain signaling transduction proteins and target genes in each stage of LADC and LSCC. Finally, we proposed six genetic and epigenetic multiple-molecule drugs to target essential biomarkers in each progression stage of LADC and LSCC, respectively.


Parameter estimation of the stochastic regression models of candidate GENs via system identification method and system order detection
To identify the parameters in equations (1), (2), (4), (5) in the main article, these equations can be rewritten in the following regression forms:      , and ϕ q,L [n] denote the regression vector for sample n, comprising protein/gene/miRNA/ lncRNA expression data and DNA methylation profiles. θ j,P , θ i,G , θ p,M , and θ q,L are the parameter vectors associated with the protein interactions and gene/miRNA/lncRNA regulations, respectively, containing protein interaction abilities, transcriptional regulatory abilities, post-transcriptional regulatory abilities, and basal levels.
Besides, for all N samples, the regression forms in (10), (11), (12), (13) at different samples can be augmented as follows, respectively: [ which can be also simply represented as follows, respectively: (16) Hence we can use the constrained least square estimation to estimate the parameter vectors θ j ,P, θ i ,G, θ p ,M, θ q ,L, containing protein interaction abilities, transcriptional regulatory abilities, post-transcriptional regulatory abilities, and basal levels by the corresponding NGS data and DNA methylation profiles. The constrained least square estimation problem of parameter vectors θ j,P , θ i,G , θ p,M , θ q,L in equation (18), (19), (20), (21) can be solved by the following form, respectively: subject to 0 0 1 0 0 The inequality constraints in the above constrained least square parameter estimation problem can ensure that the posttranscriptional regulatory abilities of miRNA on genes/miRNAs/lncRNAs are always non-positive. Therefore, we could solve the constrained least square parameter estimation problem to obtain protein interactive parameters, i.e., θ j,P and gene/miRNA/lncRNA regulatory parameters, i.e., θ i,G , θ p,M , and θ q,L through the system identification method in MATLAB optimization toolbox based on a reflective Newton method for minimizing a quadratic function [1].
Due to large-scale measurement of expressed cellular proteins has not been realized yet and mRNA abundance can explain 73% of the variance in protein abundance [2], it allows us to use NGS of gene expressions to substitute protein expressions. Thus we can use the gene, miRNA, and lncRNA expression data and DNA methylation profiles of each stage (normal stage, early stage, middle stage, and advanced stage) in LADC and LSCC, which are obtained from the Cancer Brower website (https://genome-cancer. (22), (23), (24) and (25) to identify the protein interactive abilities, transcriptional regulatory abilities, post-transcriptional regulatory abilities and basal levels in θ j,P , θ i,G , θ p,M , and θ q,L . Since the candidate interactions and regulations were constructed by big data mining from numerous databases and experimental datasets which may contain some plausible information, it is possible that many false positives protein interactive abilities, transcriptional regulatory abilities, and post-transcriptional regulatory abilities are included. To eliminate false positives protein interactive abilities, transcriptional regulatory abilities, and post-transcriptional regulatory abilities, we applied system order detection scheme to prune out these insignificant abilities out of system order as false positives to obtain real GENs.
Akaike Information Criterion (AIC) is a system order detection method based on the identification method in solving (22), (23), (24) and (25) for detecting the real system order. The AICs of j-th protein of PPIN, i-th gene of GRN, p-th miRNA of GRN, and q-th lncRNA of GRN are shown in the following form, respectively: where Δ j,P denotes the number of parameters of protein j, i.e., Δ j,P = J j +1 in the estimation problem of protein interactive model of the PPIN in equation (1); Δ i,G , Δ p,M , and Δ q,L represent the number of parameters, i.e., Δ i,G = J i +Q i +P i +1, Δ p,M = J p +P p +1, and Δ q,L = J q +P q +1 in the estimation problem of gene i, miRNA p, and lncRNA q in the regulatory model of the GRN, respectively; 2 , j P σ is estimated residual error obtained from the system identification method, i.e.,  ∆ q L respectively by tradeoff between residual error and parameter association number. Therefore, we could prune the false positives protein interactive abilities, transcriptional regulatory abilities, and post-transcriptional regulatory abilities in candidate GENs to obtain real GENs at each stage of LADC and LSCC through deleting the insignificant interactions and regulations out of true system order one protein by one protein, one gene by one gene, one miRNA by one miRNA, and one lncRNA by one lncRNA.

Extracting core GENs from the real GENs by using the PNP method
However, these real GENs remain large and complex. It is difficult to reveal the significant information to get an insight to the genetic and epigenetic mechanisms of progression in LADC and LSCC to allow us to compare and investigate progression mechanisms between LADC and LSCC. Hence, we used PNP method to extract the core GENs from the real GENs in different lung conditions (i.e. normal stage, early stage, middle stage, advanced stage of LADC and LSCC). The system models, describing the PPIs, gene regulations, miRNA regulations, and lncRNA regulations, and epigenetic regulations by DNA methylation of real where J j , J i , Q i , P i , J p , P p , J q , and P q represent the denotes the matrix associated with interactive abilities of proteins; The sub-network structure matrix H tg =ˆˆ1 represent the matrices associated with transcriptional regulatory abilities of TFs on genes, lncRNAs, and miRNAs, respectively; The sub-network structure matrix H mg =ˆˆ1 indicate the matrices associated with post-transcriptional regulatory abilities of miRNAs on genes, lncRNAs, and miRNAs, respectively; The sub-network structure matrix H lg = 11 1 1 is the matrix associated with transcriptional regulatoryt abilities of lncRNAs on genes. In the real GENs, if an interaction between any two proteins or a regulation between any two elements of TFs, genes, miRNAs, and lncRNA is disconnected, its corresponding ability will be set to zero. Based on singular value decomposition (SVD), the network structure projection method, PNP, can be described as follows: where ( ) ( ) for k=1, ...,J+Q+P and t=1, ...,J+I+Q+P Hence we can obtain the projection values of all node, including proteins, genes, miRNAs, and lncRNAs in real GENs. If the projection value D L (k) or D R (t) is close to zero, it means that the corresponding node is almost independent to the top M left-singular vectors or top M right-singular vectors. In other words, if a node of the real GEN has a higher projection value, it is more important for the principal network structure of GEN. Since the aim of this study is to compare progression genetic and epigenetic mechanisms between LADC and LSCC, we can identify the core proteins and TFs with top projection values, and then with these core proteins and TFs to connect with their miRNAs/lncRNAs to form the core GENs for further carcinogenesis investigation.