Identification of diagnostic markers in colorectal cancer via integrative epigenomics and genomics data

Apart from genetic mutations, epigenetic alteration is a common phenomenon that contributes to neoplastic transformation in colorectal cancer. Transcriptional silencing of tumor-suppressor genes without changes in the DNA sequence is explained by the existence of promoter hypermethylation. To test this hypothesis, we integrated the epigenome and transcriptome data from a similar set of colorectal tissue samples. Methylation profiling was performed using the Illumina InfiniumHumanMethylation27 BeadChip on 55 paired cancer and adjacent normal epithelial cells. Fifteen of the 55 paired tissues were used for gene expression profiling using the Affymetrix GeneChip Human Gene 1.0 ST array. Validation was carried out on 150 colorectal tissues using the methylation-specific multiplex ligation-dependent probe amplification (MS-MLPA) technique. PCA and supervised hierarchical clustering in the two microarray datasets showed good separation between cancer and normal samples. Significant genes from the two analyses were obtained based on a ≥2-fold change and a false discovery rate (FDR) P-value of <0.05. We identified 1,081 differentially hypermethylated CpG sites and 36 hypomethylated CpG sites. We also found 709 upregulated and 699 downregulated genes from the gene expression profiling. A comparison of the two datasets revealed 32 overlapping genes with 27 being hypermethylated with downregulated expression and 4 hypermethylated with upregulated expression. One gene was found to be hypomethylated and downregulated. The most enriched molecular pathway identified was cell adhesion molecules that involved 4 overlapped genes, JAM2, NCAM1, ITGA8 and CNTN1. In the present study, we successfully identified a group of genes that showed methylation and gene expression changes in well-defined colorectal cancer tissues with high purity. The integrated analysis gives additional insight regarding the regulation of colorectal cancer-associated genes and their underlying mechanisms that contribute to colorectal carcinogenesis.


Introduction
In recent decades, the incidence of colorectal cancer (CRC) has increased by 2-to 4-fold in many Eastern Asia countries such as China, Japan, South Korea and Singapore (1,2). The high risk of CRC among the Asian population, including Malaysia, is associated with a low fiber diet and high tobacco consumption (3). One of the screening methods to detect early stage of CRC is by measuring the level of carcinoembryonic antigen (CEA) in the serum, however, the sensitivity of the marker was reported to be <80% (4)(5)(6). Therefore, the identification of new biological markers for CRC is crucial.
Epigenetic markers such as methylation markers in CRC were first reported 10 years ago in DNA from stool and blood samples (7). DNA methylation is an epigenetic mechanisms that involves the enzymatic process of adding the methyl group to the 5-carbon position of the cytosine to form 5-methylcytosine (8)(9)(10). This modification mostly occurs in the CG enriched site known as CpG islands (CGI), which are present in 70% of the annotated gene promoter regions (11). Two typical DNA methylation patterns, global hypomethylation and CGI hypermethylation, have emerged as potential signatures in the cancer genome (12). Hypermethylation is generally found in the promoter CGI region, whereas global hypomethylation frequently occurs in CpG dinucleotides that are located in the repetitive sequences of DNA (satellite repeats or retrotransposon) (13).
CGI hypermethylation in the promoter region is thought to be linked with the transcriptional inactivation of

Identification of diagnostic markers in colorectal cancer via integrative epigenomics and genomics data
tumor-suppressor genes (14,15). This is mediated through the methyl-CpG binding proteins (MBDs) resulting in a compacted chromatin conformation (15,16). The compacted chromatin hinders the accessibility of the transcriptional machinery from binding to the promoter region, thereby leading to the repression of gene expression (17). The inverse relationship between DNA methylation and transcript level was reported in a study involving chromosomes 6, 20 and 22 in 43 healthy human tissue and primary cells (18). The study revealed that one-third of the differentially methylated genes (representing 17% of the 873 analyzed genes) were found to be inversely associated with their transcript levels (18). Apart from the conventional polymerase chain reaction (PCR) method, the microarray chip-based study is a widely used approach in exploring methylation markers in CRC (14,19,20). In a study using the Illumina GoldenGate ® methylation array on 28 normal mucosa and 91 CRC samples, 202 CpG sites with 90 hypermethylated and 42 hypomethylated loci that involved 132 genes were identified (21). Using the level of CpG island methylator phenotype (CIMP), CRC was divided into three different subgroups (21). A more recent study identified 169 hypermethylated loci and validated 11 of these loci that could be distinguished between CRC and non-neoplastic colonic mucosa (22). Among the genes were dedicator of cytokinesis 8 (DOCK8), visual system homeobox 2 (VSX2), microRNA 34b (miR-34b), glucagon-like peptide 1 receptor (GLP1R1), B-cell translocation gene 4 (BTG4), BEN domain containing 4 (BEND4), neuronal pentraxin I (NPTX1), ALX homeobox 3 (ALX3), zinc finger protein 583 (ZNF583), homer homolog 2 (HOMER2) and gap junction protein, gamma 1 (GJC1) (22).
Hypermethylated markers identified previously in CRC using a single array profile cannot reflect completely the complexity of the disease. To address this issue, epigenomic and genomic data from microarray analyses have been used to obtain a more comprehensive insight of the molecular mechanisms involved in CRC (17,(23)(24)(25). The Cancer Genome Atlas study revealed a comprehensive molecular image of CRC by integrating data between promoter methylation, DNA copy number, exome sequencing, microRNA expression and messenger RNA expression in 224 CRC and normal samples (26). Apart from identifying gene mutations involved in the well-established signaling pathways, the authors of that study also documented the new critical role of MYC in directing the transcriptional activation and repression in CRC. Additional findings included repetitive mutations in APC membrane recruitment protein 1 (FAM123B) and AT rich interactive domain 1A (ARID1A) as well as a novel mutation in SRY (sex determining region Y)-box 9 (SOX9) (26).
Studies using a similar integrative approach of analyses on Asian patients are lacking. Therefore, the aim of the present study was to investigate the biological complexity of CRC by integrating DNA methylation and gene expression profiling signatures using paired samples from local patients. The integrated signatures may later serve as potential diagnostic markers for the prognostication of CRC. We also hypothesized that CpG hypermethylation in the promoter region of CRC-associated genes led to a low expression of the respective genes.

Materials and methods
Clinical samples. A total of 55 paired colorectal carcinoma and their corresponding adjacent normal epithelial cells (10 cm away from the tumor) were collected at surgery from patients at the Universiti Kebangsaan Malaysia Medical Centre, Kuala Lumpur, Malaysia. Patients provided written informed consent to participate in the present study. This study was approved by the UKM Research Ethics Committee (Reference no: UKM 1.5.3.5/244/UMBI-004-2012).
Sample preparation. The clinical tissues were snap-frozen in liquid nitrogen prior to sectioning. Tissue samples were sectioned into 5-7 µm thickness using a cryostat (Microtome Cryostat HM550; Microm International GmbH, Walldorf, Germany). The sections were stained with hematoxylin and eosin (H&E) and were examined by the histopathologist. Tissue samples were considered representative when >80% malignant cells were present. Normal cells also contained >80% normal epithelial cells and were free from malignant or inflammatory cells. Patients with chemotherapy or radiotherapy prior to surgery were excluded from the study.
DNA methylation profiling assay. Genomic DNA was isolated using the Qiagen DNeasy Tissue kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. The quantity and purity of DNA were quantified using the NanoDrop ND-1000 spectrophotometer (Thermo Fisher Scientific, Leicester, UK). Gel electrophoresis was used to check the integrity of the DNA. Only samples with good purity were included in the study. Methylation profiling was performed in 110 samples using the HumanMethylation27 Beadchip to analyze 27,578 CpG sites covering 14,495 genes. Bisulphite conversion was carried out in the methylation assay using the EZ DNA Methylation-Gold kit (Zymo Research, Irvine, CA, USA). The microarray study was carried out according to the Infinium II Methylation Assay manual protocol. All the chips were scanned on a single BeadArray reader to avoid bias.
Gene expression profiling assay. Total RNA from 15 paired representative tissues were extracted using the Qiagen QIAampMini Plus kit (Qiagen) according to the manufacturer's instructions. The quantity and purity of RNA were quantified using the NanoDrop ND-1000 spectrophotometer (Thermo Fisher Scientific). The integrity of RNA was measured using the Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). Only samples with an OD 260/280 of 1.8-2.1 and RNA integrity number (RIN) ≥6.0 were included in the gene expression study. The GeneChip Human Gene 1.0 ST array (Affymetrix, Santa Clara, CA, USA) that has 764,885 distinct probes covering 28,869 well-annotated genes was used. The assay was carried out using the Affymetrix expression protocol. The microarray chips were scanned using the GeneChip Scanner 3000 7G (Affymetrix). The Affymetrix ® Genotyping Console™ (Affymetrix) was used to extract the expression data and Partek Genomic Suite 6.6 (version 6.12.0713) (Partek Inc., St. Louis, MO, USA) was utilized for subsequent analysis.
Statistical analysis of DNA methylation profiling. The control panel from Illumina BeadStudio software 2011 (version 1.9.0) (Illumina, San Diego, CA, USA) was used to determine the quality of our methylation microarray assay. β-value was used to determine the methylation status of each sample. This value was derived from each locus from the microarray and range from the lowest methylation value (β=0) to the highest methylation value (β=1). β-value was generated as the intensity of the methylated probe/total of the intensity of the methylated probe and the intensity of the unmethylated probe.
The generated β-value was exported to the Partek Genomic Suite 6.6 (version 6.12.0713) (Partek) for subsequent analysis. PCA mapping was used to determine the quality of the samples. Three-way analysis of variance (ANOVA) with ≥2-fold change and P<0.05 with FDR were used to compare the differential methylated CpG loci between the cancer and normal groups. To remove the batch effect, the scanned date of the chips was controlled. A Gene Ontology enrichment analysis was carried out to investigate whether the genes found to be differentially methylated could be classified into a Gene Ontology category more often than expected by chance. A functional group with a high enrichment score was considered the leading group.

Statistical analysis of genome-wide gene expression profiling.
Differential gene expression analysis was carried out using the Partek Genomic Suite 6.6 software with three-way ANOVA analysis. The expression data were normalized using quantile normalization and robust multi-array analysis (RMA) background correction. The differentially expressed genes were reported to be significant in the cancer group when the fold-change was >2.0 and P-value with FDR was <0.05. The batch effect was removed as a source of variation.
Sub-analysis of methylation and expression profile. The analysis was performed using the Partek Genomic Suite 6.6 software. We used the Pearson (linear) correlation to determine the correlation of our data. The correlated genes were considered significant at P<0.05.

Integration of methylation and expression profile statistical analysis.
Overlapped genes were identified based on significant gene symbols from the methylation and expression datasets. Datasets were imported into the MySQL relational database for downstream data analysis. The MySQL database allows rapid and accurate data filtering across different datasets. The datasets were compared in the pair-wise manner. Unique gene symbols identified between the overlapping comparisons were used in downstream analysis. Along with the unique gene list, methylation and expression values were extracted from the datasets for chromosome mapping, circular map generation and KEGG Pathway mapping. Mapping of the integrated gene list allowed visualization of overlapping genes on chromosome map overview, circular map overview and also KEGG pathway maps.
Validation of genes using MS-MLPA. A total of four significant genes (SFRP2, BTG4, APC and GPX7) were selected from the DNA methylation profile for validation purpose. Validation was carried out using the MS-MLPA and followed the manufacturer's instructions. The primers were designed and customized following the guidelines from MRC Holland (MRC-Holland, Amsterdam, The Netherlands).
The amplification PCR product was carried out by using the 3500 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). The electrophoresis result was assessed by using the Coffalyser software version 1.0.0.43 (MRC-Holland). For the analyses, dataset of MS-MLPA was analyzed by the method described in a previous study (27). To quantify the methylation status of each of the genes, the probe relative peak area ratio of the digested sample was compared with that of the undigested samples. The digested sample with the probes of a relative peak area ratio of ≥0.25 was defined as methylated.

Results
Principal component analysis and sources of variation determine the quality of the microarray studies. Principal component analysis (PCA) showed that the normal group (indicated by red color) was clustered distinctly from the tumor group (indicated by blue color) from the profiling data (Fig. 1). For the methylation study, one normal and one tumor sample were removed since the samples were classified in the opposite group. By applying three-way ANOVA, sources of variation were generated. The 'Sentrix barcode' factor (for methylation study) and 'Scan date' (for expression study) factor were removed to avoid the batch effects.
Genome-wide DNA methylation profiling of CRC in matched samples. The methylation profiling involved 110 CRC samples together with their neighboring non-cancerous colonic cells. The mean age for all patients was 60.4±12.83 years (Table I). A locus was considered to be detected at the two cut-off levels when the mean signal intensity from multiple probes for a particular CpG locus was significantly higher than the negative control on the same chip. A total of 1,123 loci (845 genes) were found to be differentially methylated in CRC compared with the normal colonic epithelial samples. These loci were further classified into 1,081 hypermethylated loci (804 genes), 36 hypomethylated loci (36 genes) and 6 sex-chromosome methylated loci (5 genes). Sex-chromosomes loci were eliminated for the subsequent analysis to avoid bias to the present study. This is due to the methylation involved in the X-inactivation process that silences one out of the two copies of X chromosome in female in order to compensate for the gene dosage effect. Supervised hierarchical clustering of the significant differentially methylated loci in our study is shown in Fig. 2A. The top 10 highly significant hypermethylated loci were protein kinase, cAMP-dependent, regulatory, type I, β (PRKAR1B), cannabinoid receptor interacting protein 1 (C2orf32), zinc finger protein 542 (ZNF542), KH domain containing, RNA binding, signal transduction-associated 3 (KHDRBS), interferon regulatory factor 4 (IRF4), tissue factor pathway inhibitor 2 (TFPI2), potassium voltage-gated channel, KQT-like subfamily, member 5 (KCNQ5), filamin-binding LIM protein 1 (FBLIM1), eyes absent homolog 4 (EYA4) and spastic paraplegia 20 (SPG20). The top 10 most significantly hypomethylated loci were Fc receptor-like 3 (FCRL3), Granzyme K (Granzyme 3; Tryptase II) (PRSS1), bactericidal/permeability-increasing protein (BPI), v-akt murine thymoma viral oncogene homolog 3 (AKT3), pipecolic acid oxidase (PIPOX), peptidase inhibitor 3 (PI3), bactericidal/ permeability-increasing protein-like 3 (BPIL3), solute carrier family 26, member 4 (SLC26A4), long intergenic non-protein coding RNA 152 (MGC4677) and defensin, β 119 (DEFB119).
Genome wide expression profiling of CRC identifies differentially expressed genes in the same group of patients. The data of the genome wide expression profiling have been recently reported (28). Statistical analysis following normalization of data identified 1,408 differentially expressed genes in CRC compared to the normal samples. Supervised hierarchical clustering clearly showed a total of 709 genes were upregulated and 699 were downregulated genes (Fig. 2B). The top 10 most significant differentially upregulated genes were claudin 1 (CLDN1), phosphoribosyl pyrophosphate amidotransferase (PPAT), tumor protein D52-like 2 (TPD52L2), serine hydroxymethyltransferase 2 (SHMT2), cell division cycle associated 7 (CDCA7), chaperonin containing TCP1, subunit 3 (CCT3), eukaryotic translation initiation factor 2, subunit 2 β (EIF2S2), thyroid hormone receptor interactor  profiling data. Data analysis of MS-MLPA showed high methylation for all four validated genes in CRC compared to the normal samples (Fig. 3).

Integrated analysis reveals 32 important genes in CRC.
We identified 32 significant overlapping genes from the methylation and gene expression profiles (Table II). Most of the significant genes (27 genes) have a negative association (hypermethylation and downregulation) and only 5 genes showed a positive association (hypermethylation and upregulation or hypomethylation and downregulation). We then used the Gene Ontology (GO) enrichment analysis to classify the genes into the categories of cellular component, biological process and   (Fig. 5A). For the biological process and molecular function categories, significant genes were highly enriched in the wide group of detection of mechanical stimulus (ES=235.27) (Fig. 5B) and glycosaminoglycan binding (ES=11.19) (Fig. 5C), respectively. Using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, we identified a total of 95 pathways and 11 of these pathways were associated with the CRC. These includes well-documented pathways in CRC including the cell adhesion molecule, Hedgehog signaling, phosphatidylinositol 3'-kinase (PI3K)-Akt signaling, focal adhesion, pathways in cancer, basal cell carcinoma, colorectal cancer, Wnt signaling, Janus kinase/signal transducers and activators of transcription (JAK-STAT) signaling, transcriptional misregulation in cancer and mitogen-activated protein kinase (MAPK) signaling pathway. Only one pathway was found to be enriched (P<0.05) in our data, the cell adhesion molecules (CAMs) (Fig. 6). Significant genes detected in this pathway were JAM2, NCAM1, ITGA8 and CNTN1. Circular map shows the chromosome distribution of three profiles. We analyzed the frequency of the methylated loci,   the upregulated or downregulated genes and the integrated genes in each chromosome. Fig. 7 shows the distribution of genes for each profile. We found that the genes with hypermethylated promoters in CpG islands and also those that were differentially expressed were generally identified in all chromosomes. Chromosome 1 has the highest frequency of the hypermethylated loci (91 loci) followed by chromosome 7 (78 loci), chromosome 19 (77 loci) and chromosome 6 (73 loci). Conversely, chromosome 20 has the highest frequency of the hypomethylated loci (7 loci) followed by chromosome 1 (6 loci), chromosome 17 (4 loci) and chromosome 7 (3 loci). The integrated genes were distributed in all the chromosomes.
We identified four clusters of genes when we examined closely the distribution of methylated genes in each chromosome. These clusters were located in chromosome 6, 16 and 19 and belonged to specific gene families. One cluster was within the HIST1H family [histone cluster 1, H2bb (HIST1H2BB) with histone cluster 1, H3c (HIST1H3C) and histone cluster 1, H3f (HIST1H3F) with histone cluster 1, H3g (HIST1H3G), another in the FOX family (forkhead box F1 (FOXF1), forkhead box F1 (FOXC2) and forkhead box FL (FOXL1)] and two large clusters in the zinc finger (ZNF) family. For the gene expression study, 16 clusters were distributed across the chromosomes including one single cluster which encompassed >10 genes. This cluster included the metallothionein 2A (MT2A), nucleoporin 93 kDa (NUP93) and 8 other genes derived from the metallothionein 1 (MT1) family.

Discussion
The low sensitivity and specificity of the current serum-based markers in the detection and prognostication of CRC have rendered the identification of new candidates crucial. The emergence of methylation biomarkers for CRC has important implications since the methylation events are reversible. However, methylation markers derived from a single profiling are not sufficient to reveal the overall mechanism involved in CRC. To overcome this issue, the integrative approach combining different genomic profiling analyses could provide new insights into the biology of the CRC and help in identifying potential diagnostic markers.
To show the validity of our results, we correlated our findings with those of a recent methylation study conducted in 2010 which involved 91 cancer and 28 normal samples (21). Our results showed a higher number of hypermethylated genes with 804 genes compared to the 132 genes reported in that study. Overlapping our data with theirs revealed 59 genes, 7 of which were in our top 50 significant gene list based on P-values (21). We then compared our data with those from a Korean study that used 22 paired colorectal cancer and adjacent normal mucosa (29). We found 16 hypermethylated genes overlapped between our data and their top 20 hypermethylation genes with 10 genes appearing on our top 50 significant gene list. These included alcohol dehydrogenase, iron containing, 1 (ADHFE1), bol, boule-like (BOLL), cut-like homeobox 2 (CUTL2), KCNQ5, protocadherin γ Figure 6. Cell adhesion molecules (CAMs) pathway. CAMs pathway (hsa04514) is an enriched pathway in the integration profile with P-value 0.0122. The highlighted genes were the genes found in our integration profile. subfamily C, 4 (PCDHGC4), PRKAR1B, solute carrier family 6 (neutral amino acid transporter), member 15 (SLC6A15), SPG20, TFPI2 and unc-5 homolog C (UNC5C) (29). The ADHFE1, BOLL, SLC6A15 and TFPI2 genes were validated using pyrosequencing (29).
SFRP2 is a dominant-negative inhibitor in the Wnt signaling pathway that is involved in regulating cell proliferation, migration, and differentiation (46,47). Hypermethylation of SFRP2 that resulted in the downregulation of SFRP2 was detected via methylation-specific PCR (MSP) and quantitative PCR (qPCR) in colorectal carcinoma as compared to adenoma (48). Another study reported that hypermethylation of SFRP2 in stool DNA may be a potential biomarker in the detection of CRC (31). Another gene, SPG20, encodes Spartin, which is a multifunctional protein that plays a vital role in the turnover of lipid droplet and intracellular epidermal growth factor receptor trafficking (49,50). Hypermethylation of SPG20 downregulates Spartin and may lead to cytokinesis arrest, which in turn is associated with carcinogenesis (33). A recent study showed that SPG20 has 80.2% sensitivity and 100% specificity in detecting colorectal cancer in stool samples using the MSP approach (22).
We also identified the COLI2A1, MME, HIST1H3B and GRIN2B genes, which were hypermethylated with a high expression. Hypermethylation in the promoters of MME and GRIN2B were previously reported to contribute to Alzheimer's disease and seizures, respectively (51,52). In the integrated list of genes, AKT3 was the only gene with hypomethylation and a low gene expression. These data were supported by a study on hepatocellular carcinoma in which the gene was also found to be hypomethylated (53). In a separate study, the specific knock-down of AKT3 reduced the level of phosphorylated Akt and inhibited cell growth in malignant melanoma (54). The low mRNA level of AKT3 was associated with a higher grade of malignant glioma (55). The positive association exhibited for AKT3 has suggested that there is possibly another layer of regulation. One of the possible explanations for this phenomenon is the existence and influence of long non-coding RNAs (lncRNA). Functional lncRNA may act as an activator or repressor for the expression of genes at the post-transcriptional level via mechanisms such as alternative splicing, influencing RNA polymerase binding efficiency and even by modification of epigenetic state of the gene (56)(57)(58)(59).
From the integrative analysis, we found that 11 out of 95 KEGG pathways were associated with colorectal carcinogenesis. The most enriched pathway identified in our data was cell adhesion molecules, covered by the integrated genes JAM2, NCAM1, ITGA8 and CNTN1. CAM plays a vital physical function in defining the multicellular organism's structure and signaling as well (60,61). It is known to facilitate intercellular adherence and is able to promote cell invasion, motility and migration (62). CAMs such as L1-CAM and neurone glialrelated (Nr)-CAM have been shown to be associated with the Wnt signaling pathway and their overexpression has been associated with poor prognosis (63,64). To the best of our knowledge, there is no data available describing the methylation status of the CAMs genes with the exception of JAM2. One study reported an inverse association between methylation and the gene expression of JAM2 in pre-malignant and malignant colorectal tissues (32). This observation supported our findings on JAM2, which is involved in cell-cell adhesion and cell-environment interaction.
In conclusion, our integrative analysis which combined DNA methylation and gene expression profiling datasets revealed a potential panel of biomarkers for the diagnosis or prognosis of colorectal cancer. Our integrative analysis also revealed a list of novel genes not previously reported in colorectal cancer.