A smoking-associated 7-gene signature for lung cancer diagnosis and prognosis

Smoking is responsible for 90% of lung cancer cases. There is currently no clinically available gene test for early detection of lung cancer in smokers, or an effective patient selection strategy for adjuvant chemotherapy in lung cancer treatment. In this study, concurrent coexpression with multiple signaling pathways was modeled among a set of genes associated with smoking and lung cancer survival. This approach identified and validated a 7-gene signature for lung cancer diagnosis and prognosis in smokers using patient transcriptional profiles (n=847). The smoking-associated gene coexpression networks in lung adenocarcinoma tumors (n=442) were highly significant in terms of biological relevance (network precision = 0.91, FDR<0.01) when evaluated with numerous databases containing multi-level molecular associations. The gene coexpression network in smoking lung adenocarcinoma patients was confirmed in qRT-PCR assays of the identified biomarkers and involved signaling pathway genes in human lung adenocarcinoma cells (H23) treated with 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK). Furthermore, the western blotting results of p53, phospho-p53, Rb and EGFR in NNK-treated H23 and transformed normal human lung epithelial cells (BEAS-2B) support their functional involvement in smoking-induced lung cancer carcinogenesis and progression.


Introduction
Lung cancer remains the leading cause of cancer-related mortality for both men and women, and its incidence is increasing worldwide (1). Smoking is the strongest population-attributable risk factor in lung carcinogenesis and is responsible for approximately 90% of lung cancer incidents (2)(3)(4). Currently, there are no effective diagnostic screening tools for early detection of lung cancer in smokers. CT scans are offered for lung cancer screening in smokers. Nevertheless, neither the American Cancer Society nor the U.S. Preventive Services Task Force recommends CT scans due to concerns about accuracy in the interpretation of results. Furthermore, the mechanistic effect of smoking on lung cancer progression remains unclear. Despite our previous finding that smoking intensity at the time of diagnosis is a significant and independent prognostic factor for lung cancer (5), smoking status in itself is not a prognostic determinant of lung cancer.
Non-small cell lung cancer (NSCLC) accounts for 85-90% of lung cancer cases. NSCLC includes two major subtypes, adenocarcinoma and squamous cell carcinoma. Owing to the limitations of the current screening techniques, most patients with NSCLC are diagnosed at advanced disease stage. A minority (~25-30%) of patients with NSCLC are diagnosed with stage I disease and receive surgical resection as the major treatment option (6). However, 35-50% of stage I NSCLC patients will relapse within five years following surgery (6), indicating that a subgroup of these patients might benefit from adjuvant chemotherapy. Meanwhile, adjuvant chemotherapy of stage II and stage III disease has resulted in only modest survival benefits (7). While tumor recurrence remains the major treatment failure for lung cancer, it is not currently possible to identify specific high-risk patients for adjuvant chemotherapy. As a consequence, current multi-modality therapy is of limited efficacy, with an overall 5-year survival rate of about 15% (8).
In this study, we sought to identify a gene signature for lung cancer diagnosis and prognosis in smokers. Genes implicated in cancer initiation and progression show dysregulated interactions with their molecular partners (9), and these cancer genes are more likely to actively interact with signaling proteins (10). Because tumors utilize different signaling pathways, we modeled crosstalk with a diverse set of signaling pathways to identify gene signatures that perform more uniformly across heterogeneous tumor sets. Specifically, implication networks (11,12) were used to model concurrent coexpression with multiple signaling pathways among a set of genes associated A smoking-associated 7-gene signature for lung cancer diagnosis and prognosis with smoking and lung cancer survival. This approach identified and validated a smoking-associated 7-gene signature using patient microarray profiles of (n=847). Furthermore, BEAS-2B cell line transformed from normal human lung epithelial cells and human lung adenocarcinoma cells (H23) were treated with 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), a major tobacco-specific carcinogen (13,14), for qRT-PCR and western blots validation of the identified biomarkers and involved signaling pathways.

Materials and methods
Microarray profiles and patient samples. Four patient cohorts with published microarray transcriptional profiles were used in this study.  (16). The third cohort contains 111 NSCLC samples from Bild et al (17). The fourth cohort contains 164 airway epithelial cell lung tissue samples from current and former smokers published by Spira et al (2). This cohort has 60 lung cancer samples (48 NSCLC, 11 small cell lung cancer and 1 unknown histology) and 69 normal lung tissue samples. Patient gene expression profiles from Shedden et al (15), Raponi et al (16) and Spira et al (2) were quantified with Affymetrix HG-U133A. The dataset from Bild et al (17) was quantified with Affymetrix HG-U133 Plus 2. The raw microarray data were quantile-normalized and log2 transformed with dChip (18) for further analysis. Implication networks. The implication induction algorithm (11) based on prediction logic (19) was used to derive coexpression between each pair of genes using software Genet (11,12,20). In the biological context, the six foremost implication rules relating Evaluation of gene coexpression networks. The following pathway databases were used to evaluate the biological relevance of the derived coexpression networks, including NCBI Entrez Gene (21), Kyoto Encyclopedia of Genes and Genomes (KEGG) (22), NCI-Nature Pathway Interaction Database (http://pid.nci.nih.gov/), protein-protein interaction database STRING 8 (23), and Pathway Studio 7.0 (Ariadne Genomics, Rockville, MD, USA). In addition, five gene set collections [positional, curated, motif, computational and Gene Oncology (GO)] and canonical pathway databases from the MSigDB (24) were used in the network precision and FDR evaluation. Using these resources, a coexpression relation is considered a true positive (TP) if the pair of genes satisfy any of the following: i) on the same chromosome or cytogenetic band; ii) in the same curated or canonical pathway; iii) sharing a cis-regulator motif, binding motif, or transcription factor binding site; iv) annotated by the same GO term; v) having protein-protein interaction; or vi) within the same computational gene sets mined from cancer-oriented microarray data. The coexpression relation is considered a false positive (FP) if the gene pair do not satisfy all five conditions listed above (25). If at least one gene in the pair is not annotated, a coexpression relation is labeled as nondiscriminatory (ND). Coexpression relations labeled as ND were excluded in this evaluation as they were not confirmed. Network precision is defined as: The portion of FP over all positive cases is defined as q-value: The FDR of the smoking-mediated coexpression networks was calculated by averaging the q-values obtained from the null distribution generated in 1,000 random permutations of the class labels in the test cohort.
The stability of the computationally derived smokingmediated coexpression networks was evaluated using different subsets of patient samples from the training set in 100 iterations. The stability is defined as the portion of the smoking-mediated coexpression relations obtained from the original data that were retrieved by using only a random subset of the training data and the full test data.
Densitometry. Relative EGFR, p53, phospho-p53, and Rb expression was determined by densitometric analysis using ImageJ software provided by NIH (http://rsb.info.nih.gov/ij/ index.html). X-ray films were scanned at 300 and 600 DPI using a CanoScan (Canon, Lake Success, NY, USA) and images were imported into ImageJ for analysis. The raw signal intensity was determined by selecting the peak corresponding to each band and integrating the intensity within that peak. Local background intensity (calculated by averaging the background intensities at the upper and lower bounds of the peak) was integrated and subtracted from each raw intensity to give the backgroundcorrected signal intensity. To account for loading differences, the corrected signal intensity for the assayed proteins was divided by the corrected GAPDH intensity. Samples treated with varying NNK exposure times were compared to untreated controls for the H23 and BEAS-2B cell lines.
RNA isolation, complementary DNA synthesis, and qRT-PCR gene expression profiling. Total-RNA was isolated from both cell lines using the mirVana™ miRNA Isolation kit and following the manufacturer's protocol (Ambion, Austin, TX, USA). Total-RNA was eluted in 100 µl of nuclease-free water and stored at -80˚C. RNA concentration was determined using the NanoDrop 1000 Spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA). RNA quality, 28S/18S ratio, and a visual image of the 28S and 18S bands were evaluated using the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Total-RNA (1 µg) was converted into complementary DNA (cDNA) using the High Capacity cDNA Reverse Transcription Kit from Applied Biosystems (Life Technologies, Carlsbad, CA, USA). Thermal cycling conditions were as follows: 25˚C for 10 min, 2 cycles of 37˚C for 60 min and 85˚C for 5 sec followed by a programmed hold at 4˚C.
All qRT-PCR reactions were performed on a 7500 real-time PCR system from Applied Biosystems. The reports were generated using SDS2.3 software (Applied Biosystems). The Ct values obtained were normalized to the UBC housekeeping gene in each sample. Fold changes were computed using the 2 -∆∆Ct method of 5 biological replicates and 3 technical replicates (26). Statistical significance was computed using repeated ANOVA tests in R and is considered statistically significant at P≤0.05.
The coexpression relation of a gene pair derived with the implication induction algorithm was compared with the observed NNK-induced gene expression changes. The coexpression relation is confirmed when the observed NNK-induced gene expression changes are consistent with the predicted Table I. Summary of clinical characteristics of patients from the Director's Challenge Study (15).

UM and HLM (training set)
MSK and DFCI (testing set)

Results
Identification of a smoking-associated 7-gene signature. Lung cancer survival genes were first selected from the whole genome on the training set (UM and HLM; n=256) from Shedden et al (15). A total of 2,310 genes were significantly associated with overall survival (P<0.05, univariate Cox model). Second, from this set of 2,310 genes, 217 genes exhibited significant differential expression (P<0.05, t-tests) in smokers versus non-smokers. These 217 survival and smokingassociated genes, as well as 6 major signaling pathway genes (EGF, EGFR, MET, KRAS, E2F3, and E2F5) were included in the network analysis. These signaling pathways are included in human NSCLC disease mechanisms delineated by the KEGG Pathway Database (http://www.genome.jp/kegg/pathway/hsa/ hsa05223.html). They were selected based on their reported clinical relevance in NSCLC. These 6 signaling pathway genes were not significantly associated with survival nor were they differentially expressed in smokers. Patient samples in the training set were separated into two groups: smokers (patients who smoked in the past or who are currently smoking) and non-smokers (patients who never smoked). For each smoking-defined patient group, a coexpression network among the 223 genes was constructed.
Between each pair of the 223 genes, significant (P<0.05; z-tests) coexpression relations were retrieved in the smoker group and the non-smoker group, constituting smoking-mediated gene coexpression networks in NSCLC. By comparing the coexpression types between each pair of genes in the two networks, differential network components were identified and considered important for further evaluation. These differential components are interactions that were present in the smoker group but missing in the non-smoker group, or conversely, those present in the non-smoker group but absent in the smoker group. From the differential components associated with the smoker group and non-smoker group, genes having direct coexpression relations with all 6 lung cancer signaling pathway genes were identified as the signature genes (Fig. 1). As a result, 6 genes were identified from the smoker group and 1 gene was identified from the non-smoker group. This constituted the smoking-associated 7-gene signature for NSCLC (Table II).
Prognostic validation in lung adenocarcinoma. We sought to investigate if the identified gene signature could provide accurate prognostic prediction of survival in lung adenocarcinoma patients. On the training cohort, the original microarray gene expression profiles of the identified 7 gene probes were fitted into a Cox model as covariates. A survival risk score was generated for each patient in the training set. A training model ( Fig. 2A) was identified and applied to the test set (MSK and DFCI; n=186; Fig. 2B) without re-estimation of parameters. In both training and test sets, this scheme separated patients into two groups with different survival outcomes (P<0.007, Kaplan-Meier analyses).  (Table III). There was no significant difference in prognostic value between the hazard ratio of the 7-gene risk score and cancer stage (II vs. I). The results demonstrate that the 7-gene risk score could provide a more accurate prognosis than some commonly used clinicopathological parameters.
The 7-gene signature gave accurate prognostic prediction in smokers in both training and test sets in Shedden's cohorts (15) (P<0.01; Fig. 2C and D), but not in non-smokers in Kaplan-Meier analyses (P<0.12, results not shown). In addition, gene expression-defined high-and low-risk groups had significant association with smoking (P<0.02, χ 2 tests) and smoking cessation (P<0.00001, χ 2 tests; Table I). These results further confirmed the smoking association of the identified 7-gene signature.

Prognostic validation on other histological subtypes of NSCLC.
The prognostic performance of the 7-gene signature was further evaluated on cohorts from Raponi et al (16) and Bild et al (17), which include another major subtype of NSCLC, squamous cell lung carcinoma. For robust validation, patient samples in these two studied cohorts were randomly partitioned into separate training and test sets. A prognostic classifier was constructed on the training set using the Cox model and validated on the test set without re-estimation of parameters.
In the Raponi cohort (16) of squamous cell carcinoma patients, the 7-gene signature stratified patients into two groups with distinct survival outcomes (log-rank P<0.005, Kaplan-Meier analysis) in the training set (Fig. 2E). This model generated borderline significant stratification (P=0.06, Kaplan-Meier analysis) in the test set (Fig. 2F). This might be owing to the fact that 10 patients (8%) of Raponi's cohort were either non-smokers or their smoking status was not known, whereas the 7-gene signature provides refined prognosis specifically in smoking lung cancer patients.
In the Bild cohort (17) containing both lung adenocarcinoma and squamous cell carcinoma patients, the 7-gene signature   (17) of lung adenocarcinoma and squamous cell carcinoma. Log-rank tests were used to assess the statistical significance in survival probability between the two prognostic groups. Red curves, low-risk patient group; green curves, high-risk patient group.
stratified patients into two distinct survival groups in both training and test sets (P<0.04, Kaplan-Meier analyses) (Fig. 2G  and H). Overall, these results demonstrate that the 7-gene signature could select high-risk NSCLC patients with a smoking history for chemotherapy.
Early diagnostic detection of lung cancer in smokers. We further investigated whether the 7-gene signature could be used for diagnostic screening of lung cancer in smokers. The smoking cohort from Spira et al (2) was separated into a training set (n=77) and two independent test sets (n=52 and n=35). Using a nearest neighbour algorithm implemented in WEKA (27), the 7-gene classifier could accurately identify lung cancer patients from normal patients with an overall accuracy of 73 and 74% in two test sets, respectively. The odds ratio of predicted lung cancer risk was highly significant in all three sets [OR=3.85,  Table II]. These results indicate that the identified 7-gene signature has important implications in diagnostic screening of lung cancer risk in smokers.
To further evaluate the NNK-induced protein expression, western blot assays were performed in NNK-treated BEAS-2B and H23 cells after 15 min, 1 and 16 h exposures. The results show that EGFR had consistent overexpression at both the mRNA and protein levels over the time course in BEAS-2B cells after NNK treatment. In H23 cells, EGFR exhibited overexpression at the mRNA level; however, protein expression was downregulated following NNK exposure (Fig. 3C and D). These results indicate that in normal lung epithelial cells, the EGFR gene is overexpressed upon NNK treatment, consistent with previous findings (28); whereas in lung adenocarcinoma cells, the NNK-induced transcriptional and translational regulation of EGFR are not concordant at the same time points.
p53 had consistent NNK-induced expression patterns at mRNA and protein levels, with short-term downregulation followed by upregulation in BEAS-2B cells. As H23 is p53 deficient, p53 protein was not expressed in these cells (Fig. 3E and F). Interestingly, downregulation of phospho-p53 was consistently observed in both NNK-treated BEAS-2B and H23 cells ( Fig. 3G and H), concordant with its mRNA and total protein expression. These results are consistent with the report that NNK induces damage in the p53 gene (29).
In NNK-treated BEAS-2B cells, Rb gene expression was first significantly downregulated at 15 min, returned to its normal expression at 1 h, and then showed modest overexpression at 16 h; whereas the Rb protein showed a steady overexpression following the NNK treatment at all-time points. In NNK-treated H23 cells, both mRNA and protein expression of Rb were downregulated at all-time points (Fig. 3I and J). These results are consistent with the reported increased phosphorylation of the Rb Ser 795 6-15 h after NNK treatment of normal human bronchial epithelial cells (NHBE) and small airway epithelial cells (SAEC). This, in turn, promoted cells entering into the S phase (at 15-21 h) (30).
Confirmation of smoking-associated gene coexpression network in lung adenocarcinoma. There were 17 gene coexpressions specifically associated with smokers, and one coexpression specifically associated with non-smokers significant (P<0.05) in both training and test sets from Shedden et al (15) (Fig. 4A). Among these 18 coexpression relations, 11 gene associations were confirmed with multiple biological databases ( Fig. 4A; network precision= 0.91; FDR<0.01), and the network structure was stable (Fig. 4B).
The smoking-associated gene coexpression network in lung adenocarcinoma patients (Fig. 4A) was further validated using the NNK-induced gene expression changes in lung adenocarcinoma cell line H23 (Fig. 3A). Results show that most of the coexpressions in the smoking-associated network of lung adenocarcinoma tumors were confirmed by the coexpressions observed in NNK-treated H23 cells, at varying time points (Fig. 4C). All of the 17 smoking-associated gene coexpressions in NSCLC patients were observed in the cell experiments, except two coexpressions, one between EGF and LTF and one between CRTAC1 and GPRC5C. These two unobserved gene coexpressions could be related to other sources of carcinogens in tobacco, because NNK is only one of the carcinogens in tobacco, among about 54 others (14). Overall, the smokingassociated gene coexpression network in NSCLC patients was largely confirmed in NNK-treated cell experiments, elucidating a network of smoking-induced gene alterations in NSCLC.
Comparison with Bayesian belief networks and gene association networks based on Pearson's correlation. This study presents novel implication network formalism for biomarker discovery. The ability to model cyclic relations in Genet overcomes the fundamental drawback of acyclic Bayesian networks in modelling molecular networks (31). In comparison with Bayesian belief networks, expression profiles of the identified 7 signature genes and 6 signalling pathway genes were used to build causal networks with TETRAD IV (http://www.phil.cmu. edu/projects/tetrad/current.html) for smoking and non-smoking lung adenocarcinoma patients in training and test sets, respectively (Fig. 2). There was only one interaction associated with smokers (between MET and SEMA3C; Fig. 2E) present in both training and test sets, which was considered a true positive when evaluated with MSigDB. In contrast, Genet generated significantly more biologically relevant gene coexpression relations that were validated by the biological databases and experimental results, confirming its topological advantage over the Bayesian belief networks.
Large-scale gene coexpression networks have been used in disease classification (32). These studies construct pair-wise gene coexpression networks by using correlation coefficients computed from gene expression profiles. Such networks indicate the distance or similarity between each pair of gene expression profiles but do not provide the direction or causal relations in the gene regulatory patterns. We have compared Genet with gene association networks based on Pearson's correlation. In constructing smoking-mediated coexpression networks using 217 smoking and survival associated genes and 6 signalling hallmarks, both models had the same network precision and FDR. However, Genet generated significantly more biologically relevant gene association relations that were validated by the test set (20). These results indicate that prediction logic is more robust than Pearson's correlation for inducting gene association networks.

Discussion
In the United States, about 90% of male and 75-80% of female lung cancer deaths can be attributed to smoking each year (14). In recent years, lung adenocarcinoma, a rare tumor type in the early 20th century, has replaced squamous cell lung cancer as the most frequent cell type of NSCLC (33). The observations in the United States and abroad suggest that increases in lung adenocarcinoma cases since 1950 are more consistent with changes in smoking behavior and cigarette design than with diagnostic advances or histologic interpretation (34)(35)(36). The gene-smoking interactions and their relationship to lung cancer are not well established in epidemiology studies (14).
This study identified a 7-gene signature for lung cancer diagnosis and prognosis in smokers. The identified biomarker genes are involved in multiple lung cancer signaling pathways through concurrent coexpression with major signaling proteins. The 7-gene signature provided an accurate estimate of risk for tumor development and recurrence (as indicated by lung cancer survival) in smokers. The 7-gene signature also appeared to be a more accurate prognostic factor than commonly used clinicopathological factors for NSCLC. These results indicate the potential utility of this gene signature in predicting lung cancer risk in smokers before symptoms can be detected with morphological assessments in clinic. Such early detection could significantly improve the clinical outcome in lung cancer treatment. Furthermore, the 7-gene assay could potentially be used to identify specific patients at high-risk for tumor recurrence/ metastasis using customized Affymetrix arrays, thus improving patient selection for adjuvant chemotherapy.
The gene expression-defined prognostic groups had a strong association with smoking and smoking cessation. Smokers were more likely to have the poor prognosis gene expression pattern than non-smokers. Furthermore, current smokers showed a stronger association with the poor prognosis gene expression pattern than previous smokers. These results suggest that the identified 7-gene signature is associated with smoking induced lung cancer initiation and progression, and the poor prognosis gene expression pattern might be reversed after smoking cession. Tobacco smoke contains a substantial amount of NNK, and the lowest dose shown to induce lung cancer in animal studies is remarkably close to the total dose of exposure experienced by a smoker in their lifetime (37). The smokingassociated gene coexpression network computationally derived from NSCLC patient transcriptional profiles was confirmed in the NNK-treated H23 cell line, further attesting to its biological relevance and smoking association in lung cancer.
Using the same methodology, a 6-gene (20) and an 8-gene signature were also identified from 217 smoking and survival associated genes, by modeling concurrent coexpression with different sets of 6 signaling hallmarks randomly selected from 10 KEGG human NSCLC signaling pathways (Table III). These 10 signaling proteins were selected based on their reported clinical relevance in NSCLC. The prognostic performance of the 6-and 8-gene signatures was comparable with the 7-gene signature (20) (Fig. 1). The 6-and 7-gene signatures both outperformed the clinicopathological covariates, but the 8-gene signature did not (results not shown). There is one common gene, SEMA3C, between the 6-and 7-gene signatures. In the experimental validation, all 10 signaling pathway genes showed significant differential expression in NNK treated normal lung epithelial cells and lung adenocarcinoma cells. The observed NNK-induced protein expression of p53, phospho-p53, Rb and EGFR was largely concordant with their mRNA expression levels in the BEAS-2B normal lung epithelial cells. In lung adenocarcinoma cell line H23, the NNK-induced gene expression was concordant with protein expression of p53, phospho-p53 and Rb, but not of EGFR. These results indicate that p53, Rb and EGFR might be functionally involved in smoking-induced lung cancer initiation and progression. EGFR mutations, associated with better chemoresponse, are significantly associated with non-smokers compared to smokers in a large epidemiology study (38). The identified gene signatures were concurrently coexpressed with these signaling pathways in patient transcriptional profiles. The association of these gene signatures with smoking, smoking cessation, as well as lung cancer risk and survival, in turn, supports the involvements of these oncoproteins in smoking induced lung cancer initiation and progression.