Identification of breast cancer molecular subtypes

Haibe-Kains B

Introduction: Since the advent of array-based technology and the sequencing of the human genome, scientists have attempted to bring new insights into breast cancer biology and prognosis. From gene expression data, Perou et al. highlighted the key molecular differences between breast tumors by identifying sets of co-expressed genes and tumors sharing similar "genetic portraits" (Perou et al., Nature 2000). Hierarchical clustering combined with a large set of genes (called the "intrinsic gene list" in the literature) identified several subtypes based mainly on ER and HER2 phenotypes and proliferation. Although these early results were promising, the clustering model developed in the original publications suffers from serious drawbacks, namely its instability and the difficulty of applying it to new data (Pusztai et al., The Oncologist 2006).

Methods: To address these concerns, we recently introduced a new clustering model to robustly identify the breast cancer molecular subtypes. This model consists of two steps: (i) identifying gene modules, i.e. sets of genes that are specifically co-expressed with genes of interest; and (ii) identifying molecular subtypes using a simple model-based clustering in a low-dimensional space defined by these gene modules (Wirapati et al., BCR 2008; Desmedt et al., CCR 2008).
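As an illustration of step (i), a module can be summarized for each tumor as a single score. The sketch below assumes the score is a signed average of the module genes, with a weight of +1 or -1 giving the direction of co-expression with the gene of interest; this is one common convention and not necessarily the exact formula of the cited papers. The function name `module_score`, the genes `GENE_A` and `GENE_B`, and all expression values are hypothetical.

```python
def module_score(expression, weights):
    """Signed average of the module genes measured in one tumor.

    expression: dict mapping gene name -> expression value for the tumor
    weights:    dict mapping module gene -> +1 or -1 (assumed sign of
                co-expression with the gene of interest)
    """
    shared = [g for g in weights if g in expression]
    if not shared:
        raise ValueError("no module gene is measured in this sample")
    total = sum(weights[g] * expression[g] for g in shared)
    return total / sum(abs(weights[g]) for g in shared)


# Hypothetical toy module built around ESR1; GENE_A and GENE_B are placeholders.
esr1_module = {"ESR1": 1, "GENE_A": 1, "GENE_B": -1}

# Made-up expression values; OTHER is not in the module and is ignored.
tumor = {"ESR1": 2.0, "GENE_A": 1.0, "GENE_B": -1.0, "OTHER": 5.0}

esr1_score = module_score(tumor, esr1_module)
```

Averaging only over the genes actually measured keeps the score computable on platforms that lack some of the module genes, which matters when the model is applied to independent datasets.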

Results: From two large microarray datasets (> 600 patients), seven gene modules were built to represent key biological processes in breast cancer: ER phenotype (ESR1), HER2 phenotype (ERBB2), proliferation (AURKA), immune response (STAT1), angiogenesis (VEGF), tumor invasion (PLAU) and apoptosis (CASP3). Since previous publications highlighted the relevance of ER and HER2 phenotypes for breast cancer subtype identification, as recently confirmed by Kapp et al. (BMC Genomics 2006), we used the ESR1 and ERBB2 module scores to fit our model-based clustering. The model was built on a series of 344 breast cancer patients. The resulting classification was shown to be robust in a set of 14 independent microarray datasets, including > 2700 patients.
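The clustering step in the two-dimensional space of ESR1 and ERBB2 module scores can be sketched as a Gaussian mixture fitted by expectation-maximization. This is a minimal illustration and not the published implementation: it assumes three components, diagonal covariances, hand-chosen starting means and synthetic data, none of which come from the abstract. The names `density`, `posteriors` and `fit_mixture` are hypothetical.

```python
import math
import random


def density(x, mean, var):
    """2-D Gaussian density with diagonal covariance (an assumed simplification)."""
    d = 1.0
    for xi, m, v in zip(x, mean, var):
        d *= math.exp(-0.5 * (xi - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return d


def posteriors(x, weights, means, variances):
    """Probability that the score pair x belongs to each component (Bayes' rule)."""
    joint = [w * density(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def fit_mixture(points, init_means, n_iter=50):
    """Fit the mixture by expectation-maximization from rough starting means."""
    k = len(init_means)
    means = [list(m) for m in init_means]
    weights = [1.0 / k] * k
    variances = [[1.0, 1.0] for _ in range(k)]
    for _ in range(n_iter):
        # E-step: membership probabilities of every tumor
        resp = [posteriors(p, weights, means, variances) for p in points]
        # M-step: update weight, mean and variance of each component
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(points)
            means[j] = [sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                        for d in range(2)]
            variances[j] = [max(sum(r[j] * (p[d] - means[j][d]) ** 2
                                    for r, p in zip(resp, points)) / nj, 1e-6)
                            for d in range(2)]
    return weights, means, variances


# Synthetic (ESR1 score, ERBB2 score) pairs around three made-up centers.
random.seed(0)
true_centers = [(-1.5, -0.5), (-0.5, 2.0), (1.5, -0.5)]
points = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
          for cx, cy in true_centers for _ in range(60)]

weights, means, variances = fit_mixture(points, [(-1, 0), (0, 1.5), (1, 0)])

# Applying the frozen model to a new, hypothetical patient.
new_patient = (1.4, -0.4)
probs = posteriors(new_patient, weights, means, variances)
```

Once fitted, the parameters can be stored and reused unchanged, so scoring a new patient requires only the two module scores and one call to `posteriors`.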

Conclusion: This method has several advantages over the previously published hierarchical clustering: (i) the low dimensionality of the input space (two dimensions) increases the stability of the clustering and facilitates the visualization of its results; (ii) the computational cost is low; (iii) the model is easily applicable to new data; (iv) the model returns the probability that a patient belongs to each subtype, facilitating the interpretation of the results. Moreover, this novel clustering model yields robust classifications in numerous microarray datasets. Given its easy applicability and good performance, this new model could be used by clinicians to study prognosis and treatment effects with respect to the molecular subtypes of breast cancer.
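To make advantage (iv) concrete, and assuming the model-based clustering is a Gaussian mixture with K components, the probability that a patient with module scores x = (ESR1 score, ERBB2 score) belongs to subtype k follows from Bayes' rule. The symbols below are notation introduced here for illustration, with mixing proportion \pi_k, mean \mu_k and covariance \Sigma_k for each subtype:

```latex
P(k \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
```

These probabilities sum to one over the subtypes, so a tumor lying between two clusters receives intermediate values instead of a forced hard assignment.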