Cell-surface marker discovery for lung cancer

Lung cancer is the leading cause of cancer deaths in the United States. Novel lung cancer targeted therapeutic and molecular imaging agents are needed to improve outcomes and enable personalized care. Since these agents typically cannot cross the plasma membrane while carrying cytotoxic payload or imaging contrast, discovery of cell-surface targets is a necessary initial step. Herein, we report the discovery and characterization of lung cancer cell-surface markers for use in development of targeted agents. To identify putative cell-surface markers, existing microarray gene expression data from patient specimens were analyzed to select markers with differential expression in lung cancer compared to normal lung. Greater than 200 putative cell-surface markers were identified as being overexpressed in lung cancers. Ten cell-surface markers (CA9, CA12, CXorf61, DSG3, FAT2, GPR87, KISS1R, LYPD3, SLC7A11 and TMPRSS4) were selected based on differential mRNA expression in lung tumors vs. non-neoplastic lung samples and other normal tissues, and other considerations involving known biology and targeting moieties. Protein expression was confirmed by immunohistochemistry (IHC) staining and scoring of patient tumor and normal tissue samples. As further validation, marker expression was determined in lung cancer cell lines using microarray data and Kaplan–Meier survival analyses were performed for each of the markers using patient clinical data. High expression for six of the markers (CA9, CA12, CXorf61, GPR87, LYPD3, and SLC7A11) was significantly associated with worse survival. These markers should be useful for the development of novel targeted imaging probes or therapeutics for use in personalized care of lung cancer patients.


INTRODUCTION
Lung cancer is the second most commonly diagnosed cancer and the leading cause of cancer deaths in both men and women in the United States [1,2]. Although the mortality rate for lung cancer has declined over the last several decades, the overall 5-year survival rate has not substantially improved over the last 30 years [1,2]. The majority of lung cancers (57%) are diagnosed at a distant stage [1]. Only 16% of lung cancers are diagnosed at a localized stage, for which the 5-year survival rate is 55% [1,2]. The 5-year survival rate is lower for regional and distant cancers (28% and 4%, respectively) [1,2]. For all stages combined, the 5-year survival rate is only 18% [1,2]. Thus, there is a need for new ways to diagnose and treat this disease to improve clinical outcomes.
Early detection of lung cancer improves the patient's chance of survival. Computed tomography (CT) is the most commonly used modality for lung cancer early detection, staging, treatment evaluation and follow-up [3-5]. Based on the results of the National Lung Screening Trial (NLST), screening by low-dose helical CT has been recommended for the early detection of lung cancer; however, this only applies to high-risk current and former smokers [2,3,6]. Currently, low-dose CT (LDCT) is the only approved method for lung cancer screening [7]. LDCT is useful for detecting small peripheral masses, but other techniques are needed for tumors that arise in the central airways [8]. There is also a need for improved methods to discriminate malignant from benign lesions [3,5]. Positron emission tomography (PET) with 18F-fluorodeoxyglucose (18F-FDG) can be used for metabolic imaging of lung cancer [3,5,9,10]. It is useful for the detection of metastases and discrimination of malignant from benign lesions [5,9,10]. However, other abnormalities, including inflammation and infection, can also be observed using 18F-FDG PET, resulting in false positives [3-5,9,10]. Other PET tracers based on alternate pathways, such as proliferation and amino acid uptake, are currently being studied for use in lung cancer [3]. Magnetic resonance imaging (MRI) is used only for limited applications, but investigations are being conducted to potentially expand the utility of MRI in the management of lung cancer [3-6]. Autofluorescence is used during bronchoscopy to identify precancerous and cancerous lesions and post-operatively to detect recurrence [3,11,12]. However, the current approaches lack specificity due to false positives resulting from other abnormalities such as inflammation [3,11,12].
While it is unlikely that molecular imaging agents are practical for use in lung cancer screening, development of novel lung cancer targeted molecular imaging agents has potential to address a number of clinical needs in the diagnosis and management of lung cancer and to augment the personalized care of patients. Since 18F-FDG PET imaging is not reliable in the context of inflammation, a lung cancer specific PET imaging tracer is needed for use in this context, e.g., following surgery or radiation therapy [13]. A lung cancer specific PET tracer could also potentially be used to better distinguish malignant from benign nodules of the lung, which is an unmet clinical need that could improve early detection of lung cancer [14]. Additionally, imaging biomarkers that can noninvasively provide predictive or prognostic information are needed to improve the clinical management of lung cancer [15]. Development of fluorescently labeled lung cancer specific agents could improve early detection via fluorescence bronchoscopy. Such lung cancer targeted fluorescent agents could also be used intraoperatively for margin detection and identification of mediastinal lymph nodes that contain metastases [16].
In recent years, kinase targeted therapies have been developed that have shown improved efficacy in treatment of lung cancer compared to standard chemotherapy, e.g., epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors [17] and anaplastic lymphoma kinase (ALK) inhibitors [18]. Immune checkpoint inhibitors, e.g. anti-PD1 and anti-CTLA-4, are another class of targeted therapies that have shown efficacy in treatment of lung cancer [19]. However, these new targeted treatments are only applicable to a fraction of patients, and development of resistance and recurrence has been a considerable problem in patients that do respond [20,21]. Studies involving combination therapies have demonstrated increased efficacy and it has been proposed that combinations of therapies that target distinct pathways or mechanisms could increase the period of disease-free survival, or even be curative [22,23]. However, current targeted therapies are associated with systemic toxicities, lowering the potential for effective combinations. Hence, novel targeted therapies that have low systemic toxicity are needed for use in combination with the existing toolbox of therapies. In addition, companion imaging agents are needed to identify patients that are likely to respond to the corresponding targeted therapy and to non-invasively follow treatment response.
To successfully implement the personalized treatment of lung cancer, molecular imaging agents and targeted therapeutics are needed that can detect the tumor with high specificity and selectivity. Since targeting moieties conjugated to imaging contrast or therapeutic agents cannot cross the plasma membrane, development of agents that target cell-surface markers that are differentially expressed on lung tumors relative to normal tissues or benign lesions is a rational approach toward achieving this objective. Thus, the identification and comparison of cell-surface markers is a crucial first step in the development of novel cancer-specific molecular imaging agents and targeted therapeutics. We have previously identified and validated novel bona fide cancer cell-surface markers by mRNA expression profiling and immunohistochemistry (IHC) of colon, melanoma, pancreatic and breast cancer patient tissue samples [24-30]. We have also developed imaging agents that target these identified tumor cell-surface markers [24,25,29,31-34].
The goal of the current work was to identify a set of cell-surface markers that cover a broad range of lung cancers and analyze the expression of these markers in relation to survival of lung cancer patients. Once determined, such markers may be useful targets for the development of lung cancer targeted imaging and therapeutic agents.

Cell-surface marker identification
Our goal was to identify cell-surface markers that can be used for targeted agent development. However, different classes of targeted agent require different metrics for selection. For example, molecular imaging agents typically deliver tracer levels of radioactivity or nontoxic payloads for image contrast. In this case, the most important metric is target expression in tumor relative to surrounding normal lung tissue. Alternately, targeted therapeutic agents can deliver cytotoxic payloads or inhibit pathways that are important for normal cellular functions. Hence, marker discovery for targeted therapy requires evaluation of expression in tumor versus expression in a range of tissues that are of concern for systemic toxicity.
Gene expression profiling was performed using mRNA expression microarray data from patient samples of lung cancer and normal tissues. Available data sets were evaluated for quality, compiled, normalized and a MarkerScore determined for ranking differential expression in tumor relative to normal (see Methods). The probesets were intersected with a list of potential surface accessible gene products to annotate the target location and filter the data set, yielding a set of 11,838 potential surface accessible probesets for further analysis. Gene expression data for these probesets were sorted by MarkerScore using Excel 2010, and the list was analyzed for probesets exhibiting differentially high expression in lung tumors relative to normal lung tissue samples as determined using a combination of statistical tests described in the Methods. This resulted in a list of 360 probesets (282 genes) (Supplementary Table 1). Note that the number of probesets does not correspond to the number of genes, since several genes are detected by multiple probesets in the arrays. Our cell-surface list includes some genes that are membrane associated but do not have cell-surface domains, e.g. code for proteins that are secreted, are associated with the cytoplasmic side of the plasma membrane or with internal membranes only. We reviewed the literature for the list of 282 genes and 268 probesets (208 genes) were identified that likely have cell-surface domains (annotated as 1 in the Cell Membrane column in Supplementary Table 1). These 208 genes were evaluated for potential use as lung cancer specific cell-surface markers based on intensity and breadth of expression among the lung cancer samples relative to their differentially low expression in non-neoplastic lung tissue samples.
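The filter-then-rank step described above can be illustrated with a minimal sketch. The MarkerScore itself is defined in the Methods and is not reproduced here; the score below (difference between median tumor and median normal log2 intensity) is only a hypothetical stand-in, and the probeset identifiers and function name are likewise invented for illustration.

```python
from statistics import median

def rank_surface_candidates(expression, surface_probesets):
    """Keep surface-annotated probesets and rank them by tumor-vs-normal difference.

    expression: {probeset_id: {"tumor": [...], "normal": [...]}} of log2 intensities.
    surface_probesets: set of probeset ids annotated as potentially surface accessible.
    Returns a list of (probeset_id, score), highest score first.
    """
    ranked = []
    for probeset, groups in expression.items():
        if probeset not in surface_probesets:
            continue  # intersect with the surface-accessible list
        # ASSUMPTION: hypothetical stand-in score, NOT the paper's MarkerScore
        score = median(groups["tumor"]) - median(groups["normal"])
        ranked.append((probeset, score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Toy data with made-up probeset ids
toy_expression = {
    "ps_A": {"tumor": [9, 10, 11], "normal": [4, 5, 6]},
    "ps_B": {"tumor": [7, 7, 8], "normal": [6, 7, 7]},
    "ps_C": {"tumor": [12, 12, 12], "normal": [3, 3, 3]},  # not surface annotated
}
toy_ranking = rank_surface_candidates(toy_expression, {"ps_A", "ps_B"})
```

In this toy example the highest-scoring probeset (`ps_C`) is excluded because it is not on the surface list, which mirrors the order of operations in the text: annotation filter first, ranking second.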
In addition to higher expression in tumor samples relative to normal lung samples, expression in other tissues associated with toxicity and clearance, e.g. liver, kidney, heart, etc., was also considered and markers that were expressed in these tissues were de-emphasized. From the ranked list, 10 markers were selected for further evaluation: CA9, CA12, CXorf61, DSG3, FAT2, GPR87, KISS1R, LYPD3, SLC7A11 and TMPRSS4. Five of these markers, CXorf61, DSG3, FAT2, GPR87, and LYPD3, were selected based primarily on their high and broad expression among the lung cancer samples relative to normal lung. Additional markers (CA9, CA12, KISS1R and SLC7A11) were selected based on their expression profile and the current availability of molecular imaging probes targeting them [25,. KISS1R and TMPRSS4 have known high affinity ligands and inhibitors, respectively, for potential use in targeting [71-80]. Despite its relatively low ranking based on marker score, CA9 was selected for further investigation due to the availability of high affinity inhibitors for imaging [49-51] and its general applicability among several cancer types in addition to lung cancer, including cancers of the brain, breast, cervix, colon, head and neck, kidney, ovaries, and pancreas [25,35,81-83]. Figure 1 shows representative mRNA expression profiles of four of the selected markers in patient samples; the mRNA expression profiles for the remaining six selected markers are shown in Supplementary Figure 1. For each of these markers, the mRNA expression is significantly higher in the lung tumor samples than in the normal lung samples (p < 0.0001) (Table 1 and Supplementary Tables 2-11). However, none of these markers is expressed at a high level in 100% of the lung tumor samples. Nevertheless, for each marker, a percentage of cancer cases has very high expression relative to normal lung. Therefore, a combination of markers may be required to cover all types of lung cancer.
The mRNA expression of these markers in organs involved in toxicity and clearance was also evaluated. The expression of the markers GPR87, KISS1R and SLC7A11 is significantly higher in the lung tumors than in all of the other normal organs examined (Table 1, Supplementary Tables 7, 8 [...] Table 1 and Supplementary Tables 2-11). In the case of CA12, the expression is significantly higher in the kidney than in the lung tumors (p < 0.0001) (Figure 1A and Supplementary Table 3). For CA9, the expression is significantly higher in the small intestines than in the lung tumors (p = 0.0039) (Supplementary Figure 1A and Supplementary [...] Table 2 and Supplementary Tables 12-21. Statistical differences among the different histological classes are reported [...] Supplementary Tables 22-31. Interestingly, some markers are significantly higher in all NSCLC subclasses relative to normal lung, while other markers have differential expression among the sub-classes.

Confirmation of marker protein expression
Since mRNA expression does not always translate to protein, we needed to confirm protein expression of the selected markers. To do this, we performed immunohistochemistry (IHC) of a tissue microarray (TMA) consisting of lung tumor samples and adjacent normal lung samples, as well as several other control tissues (liver, spleen, and lymph node). Figures 3 and 4 show representative images of IHC-stained sections from lung tumors and nonneoplastic "normal" lung specimens from the TMA. As can be seen from the images, the lung tumor specimens have greater cell density compared to the normal lung specimens. Supplementary Figures 3 and 4 show higher magnification images of the lung tumor samples from the TMA.
The IHC staining was scored by a pathologist who specializes in thoracic oncology (F.K.K.) on a scale from 0 to 3+, with 3+ representing the strongest intensity. A summary of the scoring data for each marker in normal lung and lung tumor tissue is given in Tables 3 and 4. The IHC analysis of the other control tissues for each marker is given in Supplementary Table 32. Samples showing any percentage of cell staining were included in this analysis. For each of the markers, a percentage of tumor samples had higher expression than the normal lung samples. The ten markers were divided into two groups based on expression in normal lung tissue (Tables 3 and 4). The first group consisted of six markers with limited (TMPRSS4) or no expression (CA12, FAT2, GPR87, LYPD3, and SLC7A11) in normal lung (Table 3). For TMPRSS4, staining in normal lung was observed in only one of the eight specimens, and only 5% of the cells in that specimen had staining (Table 3). The remaining four markers (CA9, CXorf61, DSG3, and KISS1R) showed some expression in normal lung (Table 4). The only marker with high staining intensity (3+) in some (25%) of the normal lung specimens was CA9. For CXorf61, DSG3 and KISS1R, the expression was of low staining intensity (1+) for ≥50% of the normal lung specimens. The average percentage of cell staining is reported as a heterogeneity score (Tables 3 and 4 and Supplementary Table 32). When samples received a pathology score of 0, they also received a 100% heterogeneity score indicating that they were uniformly unstained. For samples that stained (pathology score of 1 or greater), the heterogeneity score indicates the percentage of cell staining regardless of intensity. Values close to 100% indicate more homogeneous staining. Some markers showed homogeneous staining, i.e. DSG3, KISS1R, and SLC7A11, whereas other markers were very heterogeneous, i.e. CA9 and LYPD3. For five of the markers, staining was observed in the lymphocytes in all tissues.
The staining intensities of the lymphocytes were 3+ for GPR87, 2+ to 3+ for CXorf61, DSG3 and KISS1R, and 1+ to 2+ for FAT2. The TMA used in this study consists of a proportional representation of the histological subtypes of NSCLC as seen in the clinic. The markers were also analyzed for expression in the two predominant histological subtypes of NSCLC, adenocarcinoma and SCC (Tables 5  and 6). The remaining samples were of other histological classes (acinar cell carcinoma, adenosquamous carcinoma, large cell carcinoma, large cell neuroendocrine carcinoma, neuroendocrine carcinoma, mesothelioma or pleomorphic carcinoma), with a sample number ≤ 5 or were not otherwise specified, and these data were combined into a category termed as "other" due to the lower sample numbers.
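The heterogeneity score described in this section can be sketched as follows. This is one reading of the definition given in the text (mean percentage of stained cells over samples that stained, with 100% assigned when every sample scored 0); the function name, the input layout and the use of the sample standard deviation are assumptions for illustration.

```python
from statistics import mean, stdev

def heterogeneity_score(samples):
    """Summarize staining heterogeneity for one marker in one tissue group.

    samples: list of (pathology_score, percent_cells_stained) tuples,
             with pathology_score in 0..3 and the percentage in 0..100.
    Returns (mean_percent, sd_percent).
    """
    # Only samples that stained (score of 1 or greater) contribute, regardless of intensity
    stained = [pct for score, pct in samples if score >= 1]
    if not stained:
        # All samples scored 0: uniformly unstained, reported as 100%
        return 100.0, 0.0
    # ASSUMPTION: spread reported as sample standard deviation
    sd = stdev(stained) if len(stained) > 1 else 0.0
    return mean(stained), sd

unstained_group = heterogeneity_score([(0, 0), (0, 0)])
mixed_group = heterogeneity_score([(1, 50), (3, 100), (0, 0)])
```

Under this reading, a value near 100% means either uniform staining or, for groups scored 0 throughout, uniform absence of staining, so the score should be read together with the pathology score.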

Marker expression in cell lines
As further confirmation, marker expression was evaluated in established human lung cancer cell lines. We have previously demonstrated that mRNA levels obtained from Affymetrix microarray data derived from cell lines are representative of levels obtained by quantitative realtime reverse-transcriptase polymerase chain reaction (qRT-PCR) of the same cell lines [30]. Hence, for each of the markers, we analyzed mRNA expression microarray data for cell lines (Supplementary Figures 5-14). For each marker, NSCLC cell lines with high and low/no mRNA expression were selected ( Figure 5). These cell lines could be useful when developing models to test imaging or therapeutic agents targeting the markers.

Survival analyses
As a further validation of each marker related to tumor biology and patient prognosis, both mRNA and protein expression data for the selected markers were evaluated in terms of patient survival. The mRNA expression was dichotomized at the median cut-point and the five-year survival was compared for the groups with high and low expression of the marker; analyses were also conducted by tertile cut-points (Table 7). High expression for five of the markers (CA9, CA12, CXorf61, LYPD3, and SLC7A11) was significantly associated with worse survival (p < 0.05) when the data were dichotomized (Figure 6). For genes with multiple probesets (CA9 and CA12), the association was significant for all of the probesets (Table 7). When we analyzed the data using tertiles of expression, all of these markers were significantly associated with survival, except for CXorf61 (Table 7 and Figure 7). In addition, when analyzed by tertiles, GPR87 expression was significantly associated with survival (Table 7 and Figure 7C). The tertile analysis revealed that the third of specimens with highest LYPD3 expression was associated with worse survival relative to the two thirds of specimens with lower expression values (Figure 7D), and for SLC7A11, the two thirds of specimens with higher expression were associated with worse survival relative to the third with the lowest expression levels (Figure 7E).
Figure 4: Representative images of IHC stained patient lung tumor and normal lung tissue specimens from the tissue microarray (TMA) for the remaining selected markers. A representative normal lung sample and representative lung tumor samples with scores of 0, 1+, 2+, and 3+ are shown for each marker. The images are taken at 10x magnification. * Protein expression is stained but gene names are used to conserve space.
Table footnotes: * Protein expression is scored but gene names are used to conserve space. # Heterogeneity score indicates the average percentage of cell staining in samples that stained regardless of pathology score. For samples with pathology scores of 0 only, 100% heterogeneity score indicates uniformly unstained.
A metagene signature was generated using the first principal component from a principal component analysis (PCA) of the 8 probes that were significantly (p < 0.05) associated with survival based on the median split (three probes in CA12, two in CA9, and one each in CXorf61, LYPD3, and SLC7A11). The first principal component was dichotomized by the median into low and high expression. High expression of the metagene was significantly associated with worse survival (p < 0.01) (Figure 8A). Using a hierarchical classification and regression tree (CART) approach on the same variables used for the PCA, we determined LYPD3 and CA12 to be the two most predictive markers and determined their respective cut points. Four subgroups were identified (low LYPD3/low CA12, low LYPD3/high CA12, high LYPD3/low CA12 and high LYPD3/high CA12); high expression of both markers was correlated with decreased survival, whereas low expression of both markers correlated with increased survival (p < 0.0001) (Figure 8B).
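Once cut points for two markers are available, assigning patients to the four LYPD3/CA12 subgroups is a simple lookup, sketched below. The cut-point values in the example are invented, the CART fitting that produced the actual cut points is not shown, and treating values strictly above the cut as "high" is an assumption.

```python
def assign_subgroup(lypd3, ca12, lypd3_cut, ca12_cut):
    """Label one patient by two marker cut points.

    ASSUMPTION: values strictly above a cut point count as "high";
    the cut points themselves are hypothetical inputs, not the paper's values.
    """
    lypd3_level = "high" if lypd3 > lypd3_cut else "low"
    ca12_level = "high" if ca12 > ca12_cut else "low"
    return f"{lypd3_level} LYPD3/{ca12_level} CA12"

def count_subgroups(patients, lypd3_cut, ca12_cut):
    """Count patients per subgroup; patients is a list of (lypd3, ca12) values."""
    counts = {}
    for lypd3, ca12 in patients:
        label = assign_subgroup(lypd3, ca12, lypd3_cut, ca12_cut)
        counts[label] = counts.get(label, 0) + 1
    return counts

# Toy expression values and made-up cut points
toy_counts = count_subgroups([(8, 5), (6, 5), (8, 7), (9, 4)], lypd3_cut=7, ca12_cut=6)
```

Each resulting subgroup could then be passed to a survival estimator to compare outcomes across the four strata.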
To analyze protein expression, we used both normalized and non-normalized data (described in the Methods section). We dichotomized the expression of each marker into groups with staining intensity ≥2+ and <2+ and compared survival between the high- and low-expression groups. The only marker for which high expression was significantly correlated with poor survival by this analysis was LYPD3 (Figure 9A). In a second analysis, four groupings were used (<1+, ≥1+ and <2+, ≥2+ and <3+, and ≥3+); in this analysis, patients with CA-IX expression ≥3+ had significantly longer survival than those with CA-IX expression <3+ (Figure 9B).
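The survival comparisons in this section rest on two steps: splitting patients by an expression cut point and estimating a Kaplan–Meier curve for each group. A minimal sketch of both steps is given below, using only the Python standard library. The function names and toy data are illustrative assumptions; the statistical test used to compare the curves is not specified in this section and is not implemented here.

```python
from statistics import median

def dichotomize_by_median(values):
    """Label each value "high" if above the median, otherwise "low".

    ASSUMPTION: values equal to the median are assigned to "low".
    """
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times:  follow-up time per patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns [(t, S(t))] at each distinct time with at least one death.
    """
    curve = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for x in times if x >= t)  # still under observation at t
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Toy example: four patients, one censored at time 2
toy_groups = dichotomize_by_median([1.0, 2.0, 3.0, 4.0])
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In practice, a curve would be computed separately for the "high" and "low" groups and the two curves compared; established survival analysis packages provide both the estimator and the associated tests.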

DISCUSSION
A major bottleneck in the development of targeted imaging and therapeutic agents for use in personalized medicine has been the availability of adequately vetted molecular targets. Individual targets are often reported in the literature for a given cancer type or clinical application, but it is rare that these target markers are compared with other potential targets simultaneously, using the same tissue specimens and analyses in order to estimate the potential utility of one marker relative to others. Potential targets are often reported based on mRNA expression alone, without confirmation of protein expression, which typically is the intended target. Additionally, elevated mRNA expression in cancer does not necessarily correspond to equivalent protein expression or subcellular localization. Tumor marker expression is often reported for only a small set of patient samples, only in tumor cell lines, or only reported for tumors without consideration of expression in surrounding normal tissues or normal tissues of concern for agent clearance or toxicity. Each of these concerns can lead to inadequately informed decisions about targets to pursue for development of targeted agents for a given application. To identify suitable targets for a given clinical application, studies are needed that can identify and compare marker expression among patient tumor sample sets that are representative of the intended target population and that include secondary levels of confirmation. Without adequate target discovery, costly decisions to undertake agent development may be made that are destined to fail.
Herein we report a systematic lung cancer cell-surface marker discovery effort. Our approach made use of the large amount of microarray data available for many clinical types of cancer. We specifically screened lung cancer array data for the high expression of genes in cancer samples that were poorly expressed in normal lung and several other key tissues. We then further narrowed the list to those genes we expected to be expressed at the cell surface. The goal of this work was to simultaneously identify and validate promising markers in lung cancer that can be used as targets for development of novel agents for use in personalized medicine. We focused on the identification of cell-surface markers because targeted agents that are designed for delivery of imaging contrast or cytotoxic payloads are typically conjugates with greater mass than small molecule drugs that can pass through the cell membrane via common transport mechanisms. By gene expression profiling of patient microarray data, we have identified greater than 200 putative cell-surface markers (Supplementary Table 1). From this list, we selected 10 promising markers (CA9, CA12, CXorf61, DSG3, FAT2, GPR87, KISS1R, LYPD3, SLC7A11, and TMPRSS4) for confirmation of protein expression in patient samples. By IHC, we determined differential protein expression of these markers in lung tumor specimens relative to normal lung and other normal tissues of concern for toxicity (Tables 3 and 4 and Supplementary Table 32). As secondary confirmations, we also demonstrated that lung cancer cell lines endogenously express these markers (Figure 5 and Supplementary Figures 5-14). These lung tumor cell lines can be useful for the development of agents targeted to these markers.
We have also shown that survival correlates with expression for several of the described markers (Table 7 and Figures 6 and 7).
Many of the markers that were identified by this method had been previously reported for lung cancer or other cancer types (see below). This serves to validate our approach to discovery, but also highlights a key feature of our method; to identify and directly compare the relative utility of multiple markers simultaneously. Since the patient specimens used for identification and validation also have corresponding clinical data available, we were able to provide further evidence of the potential clinical relevance of each given marker, i.e. survival prognosis. Hence, we report a practical and systematic process that can be used to discover cell-surface markers that can be used for making decisions about targeted agent development for use in personalized medicine. This approach can also be applied to any class of cancer including rare cancer types that have not had the scrutiny of NSCLC.
Array data may miss many good targets, as it is possible to have low mRNA expression with high corresponding protein levels. Other approaches, such as proteomics and transcriptome sequencing, may find other potential cancer markers. For example, proteomics approaches have been successfully applied toward the identification of membrane-associated proteins in lung tumor tissue relative to normal lung tissue [84,85]. Nonetheless, we have identified a suite of 10 potential cell-surface markers that identify the majority of lung tumor samples we analyzed.
Similar approaches have been used by other groups for lung cancer marker discovery. Nakamura et al. have several reports where a similar approach was used to identify lung cancer markers [86][87][88][89][90]. However, only one of these studies focused on cell-surface [86] and there are important differences, e.g. mRNA expression was determined by laser-capture microdissection of tumor cells, which decreases contamination from tumor infiltrating cells but also decreases the sample number that can be practically examined. RNA profiles are typically distorted by the processes required for laser capture microdissection limiting its usefulness in quantitative analyses. Nonetheless, useful markers can be identified in this way. Their cell-surface study identified SEZ6L2 as a cell-surface marker for lung cancer, which was also identified by our approach as having higher mRNA expression relative to normal tissues. However, it was not one of the candidates selected for further validation in our study due to its lower ranking based on marker score. Our initial screen identified 272 additional candidates that were not investigated further for this work, but still might prove useful. For example, SEZ6L2 was higher ranked than CA9 in our analysis but CA9 was selected due to the availability of molecular imaging probes targeting this marker and based on its potential applicability among several cancer types. Another similar study by Gugger et al. was limited to cell-surface G-protein-coupled receptor (GPCR) discovery and identified 5 GPCRs as being overexpressed [91]. Differential mRNA expression identified GPR87 as a lung SCC target, but protein expression was not confirmed [91]. Our current study effectively confirms the differential mRNA expression of GPR87 and goes on to demonstrate that protein expression was also higher in a large set of lung cancer samples.
Recently, Botling et al. used prognostic impact to select NSCLC biomarkers for IHC confirmation [92]. The study was not limited to cell-surface markers and the majority of markers discovered were intracellular, but the cell-surface gene CADM1 was identified and protein expression confirmed. Unfortunately, the CADM1 protein has lower expression in tumor samples compared to normal lung samples [92]. We made this same observation in our mRNA expression microarray data set and, consequently, this protein was not selected for further study. In contrast to these studies, all of the markers selected for validation in our study had high protein expression in tumor cells compared to normal cells in a large fraction of the samples (Tables 3 and 4). These results suggest that using large sample numbers of whole tumor tissue digests is sufficient for the initial identification of markers that have high and broad expression among tumor cells and that laser-capture microdissection may be unnecessary for the detection of promising targets. As stated above, a number of the markers in this study had previously been reported as expressed as mRNA or protein in lung cancer: CAIX [83,93-107], CAXII [83,108], KK-LC-1 [109-113], desmoglein 3 [114-116], GPR87 [91,117-119], Ly6/PLAUR domain-containing protein 3 [118,120-125] and solute carrier family 7 member 11 protein [63,118]. However, we also confirmed protein expression in patient specimens for two novel lung cancer targets, i.e. FAT2 and KiSS-1R. Our results for KiSS-1R conflicted with previous reports that mRNA and protein levels of KISS1R were lower in NSCLC tissue relative to normal lung tissue [126,127]. In those reports, KISS1R levels were assessed by reverse-transcriptase polymerase chain reaction (RT-PCR) and Western blot (WB). In addition, KISS1R expression was reported to be associated with better survival in patients with NSCLC [126]. In our study, we found higher mRNA and protein expression of KISS1R in lung tumors relative to normal lung tissue. Also, we did not observe an association between KISS1R mRNA expression in adenocarcinoma and improved survival. There are at least two mRNA splice variants resulting in different protein products. The different methodologies may be detecting different forms of the KISS1R product, leading to these conflicting results. See Table 8 for a review of the literature regarding expression of all of the selected markers and comparison of results presented herein. Differences observed herein relative to published results are likely due to differences in the study populations and the way expression was evaluated.
Figure caption (fragment): Each of these markers shows a statistically significant difference in survival for patients with high vs. low mRNA expression.
Six of the markers studied are potential targets for development of smart-bomb or Trojan horse therapies that deliver cytotoxic agents or therapeutic radionuclides specifically to cancer cells. CXorf61, DSG3, GPR87, KISS1R, LYPD3, and SLC7A11 genes had high mRNA expression in patient lung tumor specimens and low mRNA expression in tissues of concern for toxicity (Table 1, Supplementary Tables 4, 5 and 7-10, Figure 1B-1D, and Supplementary Figure 1B, 1C and 1E). Although DSG3 and LYPD3 are expressed in epithelial layers, and KISS1R and SLC7A11 are expressed in normal brain, the basement membrane and blood-brain barrier would likely inhibit uptake in those normal tissues. A recent report describes an antibody-drug conjugate targeting LYPD3 which showed efficacy in preclinical mouse models of lung cancer and is currently being tested in clinical trials [125]. IHC staining revealed that GPR87, LYPD3, and SLC7A11 had high positivity in lung tumor tissues but did not stain normal lung tissue (Table 3). High mRNA expression of four of these markers, CXorf61, GPR87, LYPD3 and SLC7A11, significantly correlated with decreased survival (Table 7, Figures 6C-6E and 7C-7E) and high IHC staining also correlated with decreased survival for LYPD3 (Figure 9A), indicating the potential need for improved therapies for these patients.
Combining mRNA expression data for LYPD3 and CA12 identifies four subgroups (low LYPD3/low CA12, low LYPD3/high CA12, high LYPD3/low CA12 and high LYPD3/high CA12). High expression of both markers was correlated with decreased survival, whereas low expression of both markers correlated with increased survival (Figure 8B). These results could help guide treatment plans for patients with expression of these markers. In addition, a bivalent targeting ligand with low affinity for each individual target, but high affinity for tumor cells expressing both markers, would focus treatment on tumors with the worst prognosis, but spare normal tissues that express only one of the targets and decrease unwanted systemic toxicities [128]. This bivalent targeting ligand could be used as both a therapeutic agent and companion diagnostic.
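As an illustration of the four-subgroup stratification described above, a minimal Python sketch follows. The function name `two_marker_groups` and the use of the cohort median as the high/low cut point are assumptions made for this example only; the cut point actually used for Figure 8B is not stated in this section.

```python
from statistics import median


def two_marker_groups(lypd3, ca12):
    """Label each patient by high/low expression of LYPD3 and CA12.

    lypd3, ca12: per-patient log2 expression values, in the same patient order.
    Assumption: "high" means strictly above the cohort median (hypothetical cut).
    """
    cut_lypd3 = median(lypd3)
    cut_ca12 = median(ca12)
    labels = []
    for l_val, c_val in zip(lypd3, ca12):
        l_state = "high" if l_val > cut_lypd3 else "low"
        c_state = "high" if c_val > cut_ca12 else "low"
        labels.append(l_state + " LYPD3/" + c_state + " CA12")
    return labels
```

Each resulting label could then serve as the grouping variable for a survival comparison between subgroups.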
We report a systematic approach for the identification of novel lung cancer markers with cell-surface expression that may have potential utility in personalized medicine applications. Our approach supported existing literature describing a number of known markers overexpressed in lung cancers and further showed whether or not expression of these markers significantly correlates with prognosis. The large numbers of patient lung tumor and normal tissue specimens included in the analysis enabled the discrimination of tumor expression from normal lung tissue as well as the analysis of recognized sub-types of NSCLC. Evaluation of both mRNA and protein expression results allows for comparison of the two major molecular manifestations of gene expression and the confirmation of cell-surface markers targetable for molecular imaging and delivery of cytotoxic agents or radionuclides. Inclusion of clinical data with corresponding mRNA and protein expression allowed for the correlation of marker expression with prognosis. Evaluation of a set of promising markers using the same tissue and data sets allows for the simultaneous evaluation of the relative utility of each marker as a target for specific clinical applications. Determination of marker expression in established lung cancer cell lines provides laboratory tools for the development of novel agents that target these specific markers. We identified 208 potential cell-surface markers specifically overexpressed in some lung tumors, but not in normal lung. We further demonstrated that 10 of these targets were detectable by immunohistochemistry and therefore good candidates for the development of novel targeted therapeutics. Some of our candidates are already being targeted in this way. 
For example, the xCT transporter (SLC7A11) was confirmed to be a potentially robust lung tumor imaging marker and PET imaging agents are already being developed for this transporter but have yet to be applied toward use in lung tumor imaging, except in a small pilot clinical trial [62-70]. Additionally, the Ly6/PLAUR domain-containing protein 3 (LYPD3) emerged as a novel target for development of a lung cancer targeted therapy, which could be co-developed with a companion imaging agent for the personalized treatment of lung cancer.

Tissue data and analyses
Compilation and quality control assessments of public mRNA expression microarray data sets were […] Tables 33 and 34). The datasets were combined and normalized together. IRON [130] was used to normalize all samples against the median sample (GSM475685). Affymetrix probesets that do not detect cataloged human genes were removed prior to further analysis. The list of genes evaluated was further filtered using a curated list of probesets (Supplementary Table 35) that correspond only to secreted or outer membrane proteins, as derived from manual assessment and Gene Ontology terms [131]. For the remaining probesets, averages (avg) and standard deviations (sd) of log2 intensities were calculated within lung tumors and normals, separately. A cutoff of avg_normal + 3 sd_normal was used for determining elevated expression in lung tumor samples for each probeset. Percentages of samples with elevated expression were calculated within lung tumors (%elevated_tumor) and normals (%elevated_normal), separately. Log2 ratios (avg_tumor_elevated - avg_normal) of average elevated tumor (samples above the +3 sd_normal cutoff) vs. average normal, two-sided T-tests and Mann-Whitney U-tests, and Hellinger distances were calculated between the lung tumor and normal groups. Probesets were identified as elevated in lung tumors using the following criteria: avg_tumor_elevated > 5, %elevated_tumor > 25%, log2 ratio_elevated ≥ 2 (4-fold), both lung all-tumor vs. all-normal T-test and U-test p-values < 4.2237e-6 (Bonferroni correction, P/N = 0.05/11,838), and Hellinger distance > 1/3. Elevated genes were then ranked in decreasing order by a MarkerScore, calculated as (%elevated_tumor - %elevated_normal) × (log2 ratio_elevated). These genes were then manually assessed for cell-surface/membrane location using UniProt and the Human Protein Atlas, and a gene was kept in the analysis if either source listed the protein as cell-surface/membrane.
The MarkerScore was used to rank genes in priority for additional manual inspection (including identifying probesets with high intensity and broad expression in lung tumors relative to normal lung and minimal expression outside of lung) and experimental validation as described below.
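The per-probeset calculations described above can be sketched in Python as follows. This is an illustrative sketch, not the pipeline used in the study: the function name `marker_score`, the use of the sample standard deviation, and the expression of percentages in percentage points are assumptions, and the T-test, U-test and Hellinger distance criteria are omitted for brevity.

```python
from statistics import mean, stdev

# Bonferroni-corrected p-value threshold quoted in the text: 0.05 / 11,838 tests.
BONFERRONI_P = 0.05 / 11838


def marker_score(tumor, normal, n_sd=3.0):
    """Summarize one probeset from log2 intensities of tumor and normal samples."""
    avg_normal = mean(normal)
    # Assumption: sample standard deviation; the text does not specify the form.
    cutoff = avg_normal + n_sd * stdev(normal)

    # Tumor samples above the avg_normal + 3 sd_normal cutoff count as elevated.
    elevated = [x for x in tumor if x > cutoff]
    pct_tumor = 100.0 * len(elevated) / len(tumor)
    pct_normal = 100.0 * sum(x > cutoff for x in normal) / len(normal)

    if not elevated:
        # No elevated tumor samples: the log2 ratio is undefined, score set to 0.
        return {"cutoff": cutoff, "pct_tumor": pct_tumor, "pct_normal": pct_normal,
                "log2_ratio": None, "marker_score": 0.0, "passes": False}

    avg_tumor_elevated = mean(elevated)
    # Difference of log2 averages, i.e. a log2 ratio of elevated tumor vs. normal.
    log2_ratio = avg_tumor_elevated - avg_normal
    score = (pct_tumor - pct_normal) * log2_ratio
    # Only the three intensity-based criteria from the text are checked here.
    passes = avg_tumor_elevated > 5 and pct_tumor > 25 and log2_ratio >= 2
    return {"cutoff": cutoff, "pct_tumor": pct_tumor, "pct_normal": pct_normal,
            "log2_ratio": log2_ratio, "marker_score": score, "passes": passes}
```

Because the score multiplies the difference in percent elevated by the log2 ratio, a probeset ranks highly only when elevation is both frequent among tumors and large in magnitude.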

Cell line data and analyses
Additional verification of lung tumor expression was assessed using non-small cell lung cancer (NSCLC) cell line gene expression data from the Cancer Cell Line Encyclopedia (CCLE) [132]. All 991 CEL files were normalized using IRON [130] against the median sample. Principal component analysis (PCA) was performed, and samples were identified that did not cluster with other samples of the same conformed site of origin (SOO). For 51 of these samples, literature and other notations in the metadata were used to support reclassification of the originally reported SOO to a new conformed SOO that agreed with the gene expression metadata. Twenty outlier samples, for which no justification could be found for altering their reported SOO, were discarded due to large disagreement between gene expression and reported SOO. The remaining 971 samples were then de-batched using COMBAT [133], using the batch reported in the metadata, and conformed SOO as covariate. From this batch-corrected data set, 114 cell lines were identified as NSCLC.
[…] Table 37). Slides were stained using a Ventana Discovery XT automated system (Ventana Medical Systems, Tucson, AZ, USA) as per the manufacturer's protocol using proprietary reagents. Slides were deparaffinized on the automated system with EZ Prep solution (Ventana). Heat-induced antigen retrieval methods were used in either RiboCC or Cell Conditioning 1 (Ventana) as listed in Supplementary Table 37. Primary antibodies were diluted using Dako diluent (Carpinteria, CA, USA) at the optimal ratios and incubation times listed in Supplementary Table 37. The appropriate anti-mouse or anti-rabbit secondary antibody (Ventana Omnimap or Ultramap) was used for 12 to 20 min incubation. The Ventana ChromoMap kit detection system was used first and then slides were counterstained with hematoxylin. Following staining, slides were dehydrated and coverslipped. Positive controls were used following the antibody manufacturer recommendations. Negative controls were established by omitting the antibodies during the primary antibody incubation step.
Slides were scored by a pulmonary pathologist (F.K.K.) and each sample given a numerical intensity score (0-3) where 0 = negative, 1 = weak, 2 = moderate and 3 = strong staining. The percentage of tumor cell staining was also scored. This percentage is independent of the staining intensity. A heterogeneity score was calculated by determining the average cell staining percent for cells that stained, regardless of pathology score. For samples with pathology scores of 0 only, a 100% heterogeneity score indicates uniformly unstained.

Lung cancer patients and patient data
The protocol for this study was approved by the University of South Florida Institutional Review Board. The study included 442 lung cancer patients who were diagnosed with adenocarcinoma and recruited from Moffitt Cancer Center's Total Cancer Care (TCC™) program [134] between April 2006 and August 2010. Patients for this analysis provided informed consent to the TCC™ protocol either at Moffitt (n = 186) or one of eighteen TCC™ consortium/affiliate institutions (n = 282). The demographic information of the patient cohort and details of the study design have been published elsewhere [135].

Statistical analyses
GraphPad Prism (Version 5.04, La Jolla, CA, USA) was used to generate the box/whiskers plots. Box plot whiskers represent the minimum to maximum values in the group, the box spans the 25th to 75th percentiles (the middle 50% of values), and the center line represents the median value. SAS software (Version 9.4, Cary, NC, USA) was used for data analysis. Dunnett's multiple comparison test was used for testing lung tumor (control) versus normal tissues and for testing normal lung (control) versus different lung cancer histologies. Tukey's all pairwise comparisons were used for testing between different cancer histologies. For all tests, p ≤ 0.05 was considered significant.
Survival analyses were performed using Stata/MP 12.1 (StataCorp LP, College Station, TX, USA), with Kaplan-Meier survival curves and the log-rank test. Overall survival was the primary endpoint and was assessed from the date of surgery to the date of last follow-up or death. Among individuals without an event (i.e., death), censoring occurred at either 5 years or date of last follow-up if less than 5 years. Normalized IHC values were calculated by taking the product of the staining intensity and percent tumor cell staining for each marker. Principal component analysis (PCA) was utilized to generate a "metagene" score of mRNA gene probes. We utilized a classification and regression tree (CART) approach to explore potential novel biomarker combinations. CART is a nonparametric data-mining tool that can segment data into meaningful subgroups and has been adapted for failure time data [136] using the martingale residuals of a Cox model to approximate chi-square values for any number of biomarker combinations.
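To make the normalized IHC value and the survival estimate concrete, a minimal Python sketch is given below. It is not the Stata code used in the study: the function names are hypothetical, administrative censoring of all follow-up beyond 5 years is an assumption about how the 5-year rule was applied, and the log-rank test is not implemented.

```python
def normalized_ihc(intensity, percent_positive):
    """Product of staining intensity (0-3) and percent tumor cell staining (0-100)."""
    return intensity * percent_positive


def censor_at(time_years, died, horizon=5.0):
    """Truncate follow-up at the horizon; later events are treated as censored."""
    if time_years > horizon:
        return horizon, False
    return time_years, died


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times: follow-up in years; events: True for death, False for censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve
```

For example, a patient who died 6 years after surgery contributes 5 years of censored follow-up under `censor_at`, and therefore lowers the number at risk without producing a step in the curve.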

ACKNOWLEDGMENTS
The authors wish to acknowledge the Lung Cancer Center of Excellence and Lung SPORE at Moffitt for use of the tissue microarray (TMA). We also acknowledge Noel Clark and the Tissue Core facility, and the Bioinformatics and Biostatistics Core facility at H. Lee Moffitt Cancer Center & Research Institute, for help with the IHC staining of the TMA, profiling of the expression array data and statistical analyses. This study was conducted with approval of the University of South Florida Institutional Review Board (IRB).

CONFLICTS OF INTEREST
AC and DM are listed as inventors on patent number US20160051704 A1 "Molecular Imaging Probes for Lung Cancer Intraoperative Guidance" that covers the described markers. The other authors declare that they have no conflicts of interest.