Model selection in meta-analytical framework for prototype-based clustering

Haibe-Kains B, Desmedt C, Sotiriou C and Bontempi G

Background: Dimension reduction methods are justified in microarray analysis by characteristics of the data: a high feature-to-sample ratio, high correlation among coexpressed genes, and a high level of noise due to the complexity of the technology. Clustering is widely used for dimension reduction because it keeps the new features interpretable. The approach replaces a cluster of correlated genes with its centroid (called a feature) and can be used in biologically driven microarray analysis. We aimed to use a priori biological knowledge efficiently to improve clustering methodology for dimension reduction.

Methods: We introduced a new method, called prototype-based clustering, to identify genes that are coexpressed with a prototype, i.e., a gene representative of a biological process of interest. For each gene to cluster, we fitted univariate and multivariate linear models in which the prototypes play the role of explanatory variables. We compared these models on the basis of their leave-one-out cross-validation error, computed with the PRESS statistic. Using Friedman's test, the models exhibiting the lowest errors were selected to identify the cluster of the gene. This method was used in a meta-analytical framework in order to combine model selection across different datasets.
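
To illustrate the leave-one-out step, the following is a minimal Python sketch of how a PRESS statistic could be computed for an ordinary least squares model without refitting the model once per sample. It relies on the standard identity that the leave-one-out residual equals the ordinary residual divided by one minus the leverage. The function name `press`, the use of NumPy, and the simulated data are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def press(X, y):
    """PRESS (sum of squared leave-one-out residuals) for OLS with an intercept.

    X: prototype expression values, shape (n_samples,) or (n_samples, n_prototypes)
    y: expression values of the gene to cluster, shape (n_samples,)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # univariate model: a single prototype
    # Design matrix with an intercept column
    A = np.column_stack([np.ones(len(y)), X])
    # Leverages (diagonal of the hat matrix) from the thin QR factorization
    Q, _ = np.linalg.qr(A)
    leverage = np.sum(Q ** 2, axis=1)
    # Ordinary least squares fit and residuals
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    # Leave-one-out residual = residual / (1 - leverage)
    return float(np.sum((residuals / (1.0 - leverage)) ** 2))

# Hypothetical data: 20 samples, two prototypes; the gene follows prototype 0
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(20, 2))
gene = 1.5 * prototypes[:, 0] + rng.normal(scale=0.1, size=20)

press_univariate_0 = press(prototypes[:, 0], gene)  # model with prototype 0 only
press_univariate_1 = press(prototypes[:, 1], gene)  # model with prototype 1 only
press_multivariate = press(prototypes, gene)        # model with both prototypes
```

In this sketch, the model with the lowest PRESS suggests the prototype with which the gene is coexpressed. The statistical comparison of models by Friedman's test and the combination across datasets are not shown here.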

Results: We applied our method to two public microarray datasets of untreated breast cancer (BC) patients. As prototype genes, we used hallmarks of BC involving various biological processes: estrogen receptor, her2/neu signaling, proliferation, tumor invasion, immune response, angiogenesis, and apoptosis. We reduced the number of variables from ~20,000 to seven while keeping valuable information 1) to robustly define BC molecular subtypes based on the estrogen receptor and her2/neu features and 2) to investigate the impact of the seven features on clinical outcome. These two questions were addressed using fifteen public microarray datasets ($\approx 2100$ patients).

Conclusions: Prototype-based clustering allowed efficient reduction of the dimensionality of microarray data by focusing on targeted biological processes. We successfully applied this method to BC samples in order to gain new insights into BC biology.