Model selection in meta-analytical framework for prototype-based clustering
Haibe-Kains B, Desmedt C, Sotiriou C and Bontempi G
Background:
The use of dimension reduction methods in microarray analysis is
justified by the characteristics of the data such as the high
feature-to-sample ratio, the high correlation of coexpressed genes and
the high level of noise due to complex technology. Clustering analysis
is widely used to perform dimension reduction, keeping the new features
interpretable. This method consists of replacing a cluster of
correlated genes with a cluster centroid (called a feature), and can be
used in biologically driven microarray analysis. We aimed to use
a priori biological knowledge efficiently to improve clustering
methodology for dimension reduction.
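
As an illustration of the centroid idea described above, the following is a minimal sketch that replaces each cluster of genes with one feature per sample. It assumes the centroid is the per-sample arithmetic mean of the genes in the cluster; the gene names, cluster names and the function `cluster_centroids` are hypothetical and are not taken from the original work.

```python
from statistics import fmean


def cluster_centroids(expression, clusters):
    """Replace each cluster of genes with its centroid.

    expression: dict mapping gene name -> list of expression values,
                one value per sample (same sample order for all genes).
    clusters:   dict mapping cluster name -> list of gene names.
    Returns a dict mapping cluster name -> list of centroid values,
    one per sample. The centroid is assumed to be the mean here.
    """
    centroids = {}
    for name, genes in clusters.items():
        profiles = [expression[g] for g in genes]
        # zip(*profiles) walks through the samples, yielding the values
        # of all genes in the cluster for one sample at a time.
        centroids[name] = [fmean(values) for values in zip(*profiles)]
    return centroids


# Toy data: three hypothetical genes measured on three samples.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 4.0, 5.0],
    "geneC": [0.0, 0.0, 6.0],
}
clusters = {"feature1": ["geneA", "geneB"], "feature2": ["geneC"]}
features = cluster_centroids(expression, clusters)
```

In this toy case three gene profiles are reduced to two features, each of which remains interpretable as the average behaviour of a named group of genes.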
Methods: We introduced a new
method called prototype-based clustering to identify genes that are
coexpressed with one prototype, i.e., a gene representative of a
biological process of interest. For each gene to cluster, we fitted
univariate and multivariate linear models with the prototypes as
explanatory variables. We compared these models based
on their leave-one-out cross-validation error computed by the PRESS
statistic. Using Friedman's test, the models exhibiting the lowest
errors were selected to identify the cluster of the gene. This method
was used in a meta-analytical framework to combine model
selection from different datasets.
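
The leave-one-out error used for model comparison can be sketched as follows for the univariate case. The sketch relies on the standard least-squares identity that the leave-one-out residual equals the ordinary residual divided by one minus the leverage, so no refitting is needed. The function `press_univariate`, the prototype names and the toy values are hypothetical. Picking the prototype with the smallest PRESS is a simplification: the selection step described above relies on Friedman's test and on combining datasets, neither of which is reproduced here.

```python
def press_univariate(x, y):
    """PRESS statistic of the simple linear model y = a + b * x.

    x: expression values of one prototype across samples.
    y: expression values of the gene to cluster, same sample order.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    press = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (intercept + slope * xi)
        # Leverage of sample i in simple linear regression with intercept.
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        # Leave-one-out residual obtained without refitting the model.
        press += (residual / (1.0 - leverage)) ** 2
    return press


# Toy data: two hypothetical prototypes and one gene, five samples.
prototypes = {
    "protoA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "protoB": [2.0, 1.0, 4.0, 3.0, 5.0],
}
gene = [1.1, 1.9, 3.2, 3.9, 5.1]

errors = {name: press_univariate(x, gene) for name, x in prototypes.items()}
# Simplified selection: keep the prototype with the lowest PRESS.
best = min(errors, key=errors.get)
```

For multivariate models the same identity holds with the leverages taken from the diagonal of the hat matrix, which would call for a linear algebra routine rather than the closed form used in this sketch.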
Results: We applied our method
to two public microarray datasets of untreated breast cancer (BC)
patients. We used hallmarks of BC involving various biological
processes such as estrogen receptor, her2/neu signaling, proliferation,
tumor invasion, immune response, angiogenesis, and apoptosis as
prototype genes. We reduced the number of variables from ~20,000 to
seven while keeping valuable information 1) to robustly define BC
molecular subtypes based on the estrogen receptor and her2/neu features
and 2) to investigate the impact of the seven features on clinical
outcome. These two questions were addressed using fifteen public
microarray datasets (~2,100 patients).
Conclusions: The use of
prototype-based clustering allowed for efficient reduction of the
dimensionality of microarray data by focusing on targeted biological
processes. We successfully applied this method to BC samples in order
to gain new insights into BC biology.