Sense-antisense gene-pairs in breast cancer and associated pathological pathways

More than 30% of human protein-coding genes form hereditary complex genome architectures composed of sense-antisense (SA) gene pairs (SAGPs) transcribing their RNAs from both strands of a given locus. Such architectures represent important novel components of genome complexity contributing to gene expression deregulation in cancer cells. Therefore, the architectures might be involved in cancer pathways and, in turn, be used for novel drug targets discovery. However, the global roles of SAGPs in cancer pathways has not been studied. Here we investigated SAGPs associated with breast cancer (BC)-related pathways using systems biology, prognostic survival and experimental methods. Gene expression analysis identified 73 BC-relevant SAGPs that are highly correlated in BC. Survival modelling and metadata analysis of the 1161 BC patients allowed us to develop a novel patient prognostic grouping method selecting the 12 survival-significant SAGPs. The qRT-PCR-validated 12-SAGP prognostic signature reproducibly stratified BC patients into low- and high-risk prognostic subgroups. The 1381 SAGP-defined differentially expressed genes common across three studied cohorts were identified. The functional enrichment analysis of these genes revealed the GABPA gene network, including BC-relevant SAGPs, specific gene sets involved in cell cycle, spliceosomal and proteasomal pathways. The co-regulatory function of GABPA in BC cells was supported using siRNA knockdown studies. Thus, we demonstrated SAGPs as the synergistically functional genome architectures interconnected with cancer-related pathways and associated with BC patient clinical outcomes. Taken together, SAGPs represent an important component of genome complexity which can be used to identify novel aspects of coordinated pathological gene networks in cancers.

Potential tumor antigen marker in bladder cancer [6] C11orf57 chromosome 11 open reading frame 57 Gene is deleted in a patient of head and neck paraganglioma [7] C2orf3 chromosome 2 open reading frame 3 Association with childhood acute lymphoblastic leukemia [8] C6orf120 chromosome 6 open reading frame 120 CCNE2 cyclin E2 Prognostic marker for lymph node-negative breast cancers [9] CCPG1 cell cycle progression 1 (CCPG1) Alteration in cell cycle progression [10] CYB561D2 cytochrome b-561 domain containing 2 Tumor suppressor in lung cancer, potential drug target [11] DBF4 activator of S phase kinase Activating subunit of CDC7. Together with CDC7 demonstrates overexpression upon p53 loss in various tumors.

Identification of BC-relevant SAGPs using Kendall's Tau correlation analysis
From the set of 728 SAGPs (see Results section in the main manuscript), we obtained 334 nonredundant SAGPs for which each gene-member was supported by at least one U133 A&B Affymetrix probeset (Table S1, A). Using microarray expression data from the Uppsala, Stockholm [2] and Harvard [38] cohorts, we selected two clinical subgroups in each cohort: a subset of grade 3 breast tumours, which belong to the basal-like subtype; and grade 3 breast tumours, which belong to the luminal A, luminal B, Normal-like and ERBB2 subtypes collectively termed the "non-basal-like" subgroup.
For each patient cohort and clinical subgroup, Kendall's Tau correlation coefficients were calculated for the probeset pairs. High-confidence correlated probeset pairs (p<0.05) were selected, and we identified common probeset pairs (73 SAGPs, Table S1B and S1C) among the three cohorts and two subsets. Common SAGPs were selected via Gene Symbols. For genes with multiple probesets, we selected the probeset that displayed the highest correlation with its probeset partner for a given SAGP. Thus, we identified SAGPs that were meaningful for breast cancer as complex gene pairs architectures rather than as individual genes. We assumed SAGP as an overlapping transcriptional architecture [46] which is composed not only of protein coding-coding SA host genes but includes all the transcripts within the territory of a SAGP and which could be tightly transcriptionally/post-transcriptionally related. Hence, we included also those Affymetrix probesets that might not represent a transcript with sense (or antisense) overlap, but still belongs to the same complex locus of a SAGP.

Negative control gene sets used for analyses of SAGPs.
The nearest genes neighbours (NGNs) set served as a negative control for SAGPs in correlation, DNA CNV and proximal promoters analyses, and it was selected using the following criteria: 1) the nearest upstream or downstream neighbouring gene for each gene partner of a SAGP together composing a nearest gene pair; 2) NGNs should not be involved in the same complex sense-antisense architecture [47] according to the RefSeq track of the UCSC Genomic browser; 3) every NGN should be a protein-encoding gene represented by a RefSeq NM_ID; 4) the NGN should not be the same for two different SAGPs, and it should be driven from the studied set of 73 SAGPs; 5) the NGN must be represented by at least 1 probeset from Affymetrix U133A&B platform (Tables S3A and S3B).
Additionally, we used the set of reproducibly correlated pairs of non-overlapping neighbouring genes (PNGs) identified using the same procedure and BC subgroups cohorts as 73-SAGPs for proximal promoter analysis; the procedure for their isolation is described in detail below.
We suggested a selection of the genes in the pairs located in vicinity of each other, which are located on opposite DNA strands, but not sharing common nucleotides (an essential property of sense-antisense gene pairs). According to this criteria, we found at least 2473 non-overlapping neighbouring genes in total in the human genome ( Figure S3). The total available genomic pool of non-overlapping neighbouring genes was quite limited due to the following factors: 1) requirement of the absence, or at least depletion, of sense-antisense architectures/overlaps between non-overlapping neighbouring genes; 2) gene partners of non-overlapping neighbouring genes must be located on the opposite DNA strand of the same chromosome; 3) the distance between genes in the non-overlapping neighbouring genes (distance between the 5'-ends or

Survival prognosis via Data Driven Grouping (DDg) Analysis
The Rotated 2D Data-Driven grouping (2D RDDg) Briefly, X -and Y-axes rotations were performed in order to find more optimal partition of patients into high risk and low risk subgroups for each pair of Affymetrix probesets.
The rotated 2D Data-Driven grouping (2-D RDDg) is a generalization of the 2-D DDg [48] that considers patients' grouping using different angles for separating the data. In other words, the original X, Y axes are iteratively rotated by angle α (Figure 3B), without losing their orthogonallity property, and in each rotation the patients are grouped similarly as in 2-D DDg ( Figure S6) where each design is divided into two sub-designs ( Figure 3A and 3C).
The best grouping is the one that minimizes the Wald P value of the β coefficient of the Cox proportional model. The algorithm runs with the corresponding R function.
Note that instead of rotating (transforming) the data by using trigonometric functions: where Χ΄, Y΄ and Χ, Y denote the new and the old coordinates, respectively, we rotate the axes themselves ( Figure 3B). Analytically, the algorithm works as follows: 1. Assume probesets ID list of size n that used to form all possible pairs. For each pair we seek the optimal patient partition into two groups. For probesets i = 1 and j = 2 (i = 1, …, N-1; j = i+1, …, N), form the candidate cutoffs vectors ⃗⃗ = * and ⃗⃗ = * of dimension 1 x Q each, where * is the log-transformed intensities within ( 10 , 90 ) and Q is the size of ⃗⃗ . Each element of the ⃗⃗ is a trial cutoff value (see 1D data-driven function below).
2. Find the cut-off values ⃗⃗ that produce the global minimum P value of the 1-D data-driven grouping and the respective 10 best local minimum P values. Estimate the same for ⃗⃗ .
3. For the first element of the "filtered" ⃗⃗ as ⃗⃗ , z i = 1 and z j = 1, evaluate the prognostic significance of pair i, j for the cutoffs (⃗⃗ ,⃗⃗ ) by model (1) and by each of the seven designs of Figure 3A.
where k = 1, …, K are the number of patients and x ij is a dichotomous variable (specified in iii) Repeat steps 2-3 and 4i, ii for all tan(α) and cut-off combinations.
6. Provided that the assumptions of model (1) are satisfied, the best cutoff pairs, rotation angle and grouping scheme are selected as follows: i) At significance level α (typically α = 5%), select the design(s) with number of significant 2-D cut-offs and rotation angles higher than 15%, that is approximately (100 × (# of unique designs))/100.
ii) Among all combinations of (i), select the rotation angle(s) with number of significant 2D cut-offs higher than (100 × (# of unique angles))/100.  Table S8). Therefore, Affymetrix probesets which were highly survival significant in one of the two cohorts but not significant in another cohort, have been excluded.

Weighted Voting Grouping (WVG)
The multivariate voting algorithm derives a small pair gene signature that is able to separate the patients into two or more significantly different risk groups. It takes into account the grouping information across several independent pairs. Each 2-D RDDg survival significant pair provides a patients' grouping scheme that may or may not be similar to the others. Ideally, we wish to combine the groupings of several significant features into a composite, final grouping and show that it is still able to separate patients into two distinct disease risk groups. To do this we: 1. Select the g significant features of the list, sorted by the 2-D P value in ascending order.
Assign to each pair g the weight = 4. The best signature is the one involving G * pairs that minimize the P value of 1-D DDg (step 3).

Survival analysis in 73-SAGPs
Application of the 1-D DDg procedure [48] to the 146 genes belonging to 73-SAGPs revealed 32 Affymetrix probesets representing 27 individual host genes associated with survival (Wald pvalue <0.05) in the Stockholm and Uppsala cohorts (Table S5) [2]. Current literature analyses showed strong evidence for an association with cancer for 13 genes out of 27 (Table S6)

Spliceosome genes as potential drug targets for BC
We analysed the literature regarding the 26 spliceosomal genes identified by the SAGS that were robustly overexpressed in the HR subgroups (Table S9B and Figure 6) .
SF3B5 and SF3B3 belong to the same SF3b protein complex as an important specific subcomponent of the spliceosome, U2-snRNP. The SF3b complex is particularly interesting because it is a promising anticancer drug target [49]. Spliceostatin A (FR901464) is a potent, natural antitumor product that binds to the SF3b complex and inhibits pre-mRNA splicing in vitro and in vivo. An analogue of FR901464, meayamycin, is a highly effective anti-proliferative agent against human MCF-7 BC cells. Consequently, specific splicing changes induced by SSA can down-regulate genes that are important for cell division, including Cyclin A2 and Aurora A kinase. This mechanism could explain the anti-proliferative effects of SSA [50]. The antispliceosome drug, E7107, entered phase I clinical trials for thyroid cancer, and it resulted in stable disease or delayed disease progression in a subset of patients [51]. Other potential drug targets in BC are the members of the ancient family of Sm and like-Sm (LSm) RNA-binding proteins, LSM3, LSM4 and LSM7, as was shown for LSM1 [52]. The spliceosomal heteroheptameric Sm complex is overexpressed in malignant breast tumours compared to benign lesions [53,54]. Interestingly, the Sm complex defines events that occur during telomerase RNA biogenesis [55]. U2-snRNP-related splicing factors RBM17/SPF45 [56] and SF3B1 [57]) represent potential biomarkers of multidrug resistance. Recently, the whole spliceosome was proposed as a target for anticancer treatment in BC [54]. The authors suggested that targeting the deregulated spliceosome core machinery can block mTOR and induce autophagy in cancer cells [54]. Our data indicate that a specific splicing cycle stage, the precatalytic stage, or complex B, might be affected in the high risk subgroup identified by the SAGS. The naturally occurring bioflavonoid isoginkgetin, which inhibits spliceosome complex B splicing in vitro [58], is another potentially intriguing drug for BC patients.
Another important property of most anti-spliceosome drugs is their highly selective tumor cytotoxicity [59]. A substantial challenge for the application of novel, promising anti-spliceosome drugs is the identification of the specific subsets of tumours that will be susceptible to splicinginhibition therapy [60]. One could suggest that the transient, short term pre-treatment of "spliceosome-enriched" breast tumors with drugs specifically targeting the spliceosome may not lead to substantial drug side effects, though it could potentially lead to a significant increase of tumor sensitivity to the subsequent course of standard chemo/hormonal therapy. In this context, anti-spliceosome drugs could be a promising alternative to inhibit cell cycle progression and tumor growth in breast tumors enriched with deregulated (overexpressed) spliceosome genes.
Specific trial studies in patient subgroups identified by the SAGS could provide the clues to resolve this challenge.

Definitions
Coding-Coding Cis-Sense-Antisense Gene Pair (SAGP) is a complex transcriptional architecture containing a pair of protein-encoding host genes on opposite strands of a chromosome sharing common (overlapping) region. It also includes all other transcripts (with SA overlaps or without) on both DNA strands within the common span of these two genes.  Figures S1 and S4). Used as a negative control in proximal promoter analysis.
Sense-Antisense Gene Signature (SAGS) is comprised of twelve synergistic survival significant SAGPs.