ArrayMining

Category Genomics>Gene Expression Analysis/Profiling/Tools

Abstract ArrayMining is a free web-based application for microarray analysis combining a broad choice of algorithms based on ensemble and consensus methods, using ‘automatic parameter’ selection and integration with annotation databases.

It provides easy access to a wide choice of ‘feature selection’, clustering, prediction, gene set analysis, and cross-study normalization methods. In contrast to some other microarray-related web-tools, multiple algorithms and data sets for an analysis task can be combined using ‘ensemble feature selection’, ensemble prediction, consensus clustering, and cross-platform data integration.

By interlinking different analysis tools in a modular fashion, new ‘exploratory routes’ become available, e.g. ensemble sample classification using features obtained from a ‘gene set analysis’ and data from multiple studies. The analysis is further simplified by automatic parameter selection mechanisms and linkage to web tools and databases for ‘functional annotation’ and ‘literature mining’.

The ArrayMining tool set consists of six (6) main modules for microarray analysis: Cross-Study Normalization, Gene Selection, Class Discovery, Class Assignment, Gene Set Analysis, and Gene Network Analysis.

Each of these modules features multiple analysis methods accessible through a unified web-interface. The user can upload their own data in tab-delimited text-file format or as zip-compressed Affymetrix CEL-files, which will be automatically extracted, normalized and summarized using the Robust Multi-array Average (RMA) method.

Alternatively, various example data sets have been made available directly on the webpage, and access to the GEO database (see G6G Abstract Number 20013), the largest public microarray database, is provided in the ‘class discovery module’.

After submitting an analysis task, an output webpage containing the downloadable results as plots, tables, VRML-files etc. is generated. Depending on the chosen module and algorithm, the data can be forwarded to further analysis modules and will be interlinked with annotation data from external web-tools and databases.

1) Cross-study normalization module -- Current microarray studies often contain only a small number of samples, resulting in limited robustness and reliability of statistical analyses. To alleviate this problem, five (5) cross-study normalization methods have been made available on ArrayMining.net to combine samples from two (2) different studies:

An approach based on linked gene- and sample-clustering (XPN), an empirical Bayes method (EB), a median rank score based method (MRANK), an outlier-removing discretization technique (NorDi) and a quantile discretization procedure (QDISC).

While the first three (3) methods provide continuous-valued outputs, the last two (2) are based on discretization to filter out noise, exploiting the fact that for ‘higher-level analysis’ often only a general categorization of ‘gene expression’ levels in different conditions is required (e.g. “unaltered”, “up”- or “down”-regulated), but potentially resulting in a higher loss of biological information.

The input data sets can originate from different microarray platforms, but the associated ‘gene sets’ need to overlap significantly and the samples should be derived from the ‘same tissue type’ under comparable biological conditions. As a result, the combined data can be downloaded or forwarded to other modules, and density and quantile-quantile plots are generated to compare different algorithms.
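The idea behind discretization-based merging can be illustrated with a short sketch. The code below is not ArrayMining's QDISC implementation; it is a simplified quantile discretization, assuming each sample is binned independently by rank and that only genes shared by both studies are kept. The function names and the default of three bins (e.g. "down", "unaltered", "up") are hypothetical.

```python
def quantile_discretize(values, n_bins=3):
    """Map each value to a bin index 0..n_bins-1 according to its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def discretize_study(matrix, n_bins=3):
    """Discretize every sample (column) of a genes x samples matrix on its own."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    binned = [quantile_discretize(col, n_bins) for col in columns]
    return [[binned[s][g] for s in range(n_samples)] for g in range(n_genes)]


def combine_studies(study_a, study_b, n_bins=3):
    """Merge two studies given as dicts: gene id -> list of expression values.

    Only genes present in both studies are kept (assumption of this sketch).
    """
    shared = sorted(set(study_a) & set(study_b))
    a = discretize_study([study_a[g] for g in shared], n_bins)
    b = discretize_study([study_b[g] for g in shared], n_bins)
    return {g: a[i] + b[i] for i, g in enumerate(shared)}
```

Because each sample is reduced to within-sample quantile bins, the two platforms no longer need to share a measurement scale, which is the trade-off described above: less platform-specific noise, but also less quantitative detail.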

2) Gene selection module -- Identifying differentially expressed genes is a common starting point for the biological interpretation of microarray data. The manufacturer's gene selection module enables the comparison and combination of a diverse choice of methods for this purpose:

The Empirical Bayes t-statistic (eBayes), the Significance Analysis of Microarrays method (SAM; see G6G Abstract Number 20066), a correlation-based combinatorial feature selection approach (CFS), a ranking method based on Random Forest classification (RF-MDA) and a Partial-Least-Squares based filter (PLS-CV) using the weight vectors defining the first latent components in cross-validated PLS-models.

To exploit the synergies of different algorithms, the manufacturer has implemented a method to compute aggregated gene ranks from the sum of ranks of individual methods (ENSEMBLE). The resulting outcome reports provide a ranked list of genes, in which known ‘gene identifiers’ become ‘clickable navigation’ items, referring the user to related entries in functional annotation databases and literature search engines.
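Rank aggregation by summing ranks can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's code: each method is assumed to report one score per gene with higher scores meaning more relevant, ties are broken alphabetically, and the function names are hypothetical.

```python
def ranks_from_scores(scores):
    """Convert gene -> score (higher is better) into gene -> rank (1 is best)."""
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def ensemble_ranking(score_tables):
    """Order genes by the sum of their ranks across all selection methods."""
    rank_tables = [ranks_from_scores(table) for table in score_tables]
    genes = set(rank_tables[0])
    for table in rank_tables[1:]:
        genes &= set(table)  # keep only genes scored by every method
    rank_sums = {g: sum(table[g] for table in rank_tables) for g in genes}
    return sorted(rank_sums, key=lambda g: (rank_sums[g], g))
```

A gene that is ranked consistently high by several methods ends up ahead of a gene that only one method favours, which is the intended effect of the aggregation.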

Additionally, box plots and heat maps visualize the expression values of top-ranked genes across different sample-groups. If the supplied data uses common gene identifiers, the list of selected genes can be forwarded to external analysis tools, e.g. the ‘functional annotation clustering’ service of the DAVID web database (see G6G Abstract Number 20263).

3) Class discovery module -- Clustering methods allow experimenters to identify natural groupings among microarray samples based on their expression patterns across the genes. To account for the great variety of existing scoring and search space exploration methods, the class discovery module includes both ‘partition-based’ and ‘hierarchical clustering’ algorithms, an evaluation based on multiple validity indices and a ‘consensus clustering’ method.

Currently, the partition-based clustering methods available are k-Means, PAM, SOM and SOTA, and the hierarchical clustering methods are Average Linkage Agglomerative Clustering, Divisive Analysis Clustering and a combination of the agglomerative and divisive approaches, Hybrid Hierarchical Clustering.

To combine the information content from multiple clusterings into a single representative solution, the manufacturer has implemented their own consensus clustering approach, which maximizes a score for the agreement between sample-pair assignments of the consensus clustering and all input clusterings using a fast simulated annealing approach.
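The general principle of such a consensus search can be sketched as follows. This is not the manufacturer's algorithm; the exact agreement score, move set and cooling schedule used by ArrayMining are not given here, so the sketch assumes a simple count of agreeing sample pairs, single-sample relabeling moves and a geometric cooling schedule. All names and parameter defaults are hypothetical.

```python
import math
import random
from itertools import combinations


def pair_agreement(candidate, clusterings):
    """Count sample pairs whose co-membership matches each input clustering."""
    score = 0
    for i, j in combinations(range(len(candidate)), 2):
        together = candidate[i] == candidate[j]
        for labels in clusterings:
            if (labels[i] == labels[j]) == together:
                score += 1
    return score


def consensus_clustering(clusterings, k, steps=2000,
                         t_start=1.0, t_end=0.01, seed=0):
    """Search for a k-cluster labeling that maximizes pair agreement."""
    rng = random.Random(seed)
    n = len(clusterings[0])
    current = [rng.randrange(k) for _ in range(n)]
    current_score = pair_agreement(current, clusterings)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(n)
        old_label = current[i]
        current[i] = rng.randrange(k)
        new_score = pair_agreement(current, clusterings)
        delta = new_score - current_score
        # Always accept improvements; accept worse moves with decaying probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current_score = new_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        else:
            current[i] = old_label
    return best, best_score
```

Scoring pairs rather than labels makes the result independent of how each input method happens to number its clusters. Recomputing the full score after every move keeps the sketch short; a faster variant would update only the pairs involving the moved sample.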

Optionally, different types of data standardization and two gene filtering methods can be applied prior to the analysis. An alternative filtering approach is to first use the gene set analysis module (see below...) to extract “meta-genes” representing biological pathways and forward this data to the class discovery module.

For each analysis, the user obtains a tabular summary of the calculated validity indices and clustering results, together with various graphical outputs: a silhouette plot, a 2D principal components plot and a 3D Virtual Reality Modeling Language (VRML) visualization. The 3D visualization includes density estimation contour surfaces, is based on an ‘Independent Component Analysis’ of the data and is generated with the manufacturer's software package “vrmlgen” for the R programming language.

4) Class Assignment module -- An important goal behind ‘microarray analysis’ is to improve the ‘diagnosis of diseases’ with genetic components by predicting the disease type based on labeled training data. This module is therefore dedicated to supervised learning methods, including various common methods for microarray sample classification (SVM, RF, PAM and kNN).

The manufacturer also provides access to an in-house developed ‘rule-based machine learning’ approach, BioHEL, which learns ‘structured classification rule sets’, known as “decision lists”, by applying a ‘genetic algorithm’ within an iterative rule learning (IRL) framework.
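To make the notion of a decision list concrete, the sketch below shows how such an ordered rule set is applied to a sample once it has been learned. It does not reproduce BioHEL's rule representation or its learning procedure; it assumes each rule is a set of interval conditions on expression values, and the gene names, thresholds and class labels are invented for illustration.

```python
def classify(decision_list, default_class, sample):
    """Return the label of the first rule whose conditions all hold.

    decision_list: ordered list of (conditions, label) pairs, where
    conditions maps a gene id to an inclusive (low, high) interval.
    sample: dict mapping gene id -> expression value.
    """
    for conditions, label in decision_list:
        if all(low <= sample[gene] <= high
               for gene, (low, high) in conditions.items()):
            return label
    return default_class  # the final "else" branch of the list


# Hypothetical rule set: gene names and thresholds are placeholders.
EXAMPLE_RULES = [
    ({"GENE_A": (2.0, float("inf"))}, "tumor"),
    ({"GENE_B": (float("-inf"), 0.5), "GENE_C": (1.0, 3.0)}, "normal"),
]
```

Because the rules are checked in order and the first match wins, the list reads as a chain of "if, else if, else" statements.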

BioHEL has previously been shown to achieve high prediction accuracies on complex biological data sets, while being based on easily interpretable “if-then-else”-rules. The prediction methods can be evaluated and compared based on the widely accepted external two-level ‘cross-validation methodology’, using ‘automatic parameter’ optimization within a nested cross-validation.

As with the other modules, an ensemble of algorithms is available both for selection and prediction to obtain more robust results. Moreover, since prediction models derived from training data of a single study can typically not be applied to samples from other platforms and laboratories, the combination of cross-study normalization (see above...) with prediction provides a means to obtain more ‘general models’ based on a larger sample size.
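The two-level scheme can be sketched as follows: an outer loop estimates accuracy on held-out samples, while an inner loop, run only on the outer training part, selects the parameter. This is an illustrative sketch, not the tool's code; it assumes a plain k-nearest-neighbour classifier, round-robin fold assignment and a small grid for k, all of which are choices made for the example.

```python
def k_folds(indices, n_folds):
    """Split a list of indices into n_folds disjoint folds (round-robin)."""
    return [indices[f::n_folds] for f in range(n_folds)]


def knn_predict(train, query, k):
    """Majority vote among the k training items closest to the query."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(sorted(votes), key=votes.get)


def accuracy(data, train_idx, test_idx, k):
    """Fraction of test samples classified correctly."""
    train = [data[i] for i in train_idx]
    hits = sum(knn_predict(train, data[i][0], k) == data[i][1] for i in test_idx)
    return hits / len(test_idx)


def nested_cv(data, k_grid=(1, 3, 5), outer=5, inner=3):
    """data: list of (feature_vector, label). Returns mean outer accuracy."""
    indices = list(range(len(data)))
    outer_scores = []
    for test_idx in k_folds(indices, outer):
        train_idx = [i for i in indices if i not in test_idx]

        def inner_score(k):
            # Parameter tuning sees only the outer training samples.
            scores = []
            for val_idx in k_folds(train_idx, inner):
                fit_idx = [i for i in train_idx if i not in val_idx]
                scores.append(accuracy(data, fit_idx, val_idx, k))
            return sum(scores) / len(scores)

        best_k = max(k_grid, key=inner_score)
        outer_scores.append(accuracy(data, train_idx, test_idx, best_k))
    return sum(outer_scores) / len(outer_scores)
```

Keeping the outer test fold out of the parameter search is what prevents the tuned parameter from leaking information into the reported accuracy.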

The results for an analysis contain various performance measures for evaluation and Z-scores for the genes that were most frequently selected across different cross-validation cycles. To obtain more insights on these genes, similar analysis plots and annotation tools are available.

5) Gene Set Analysis module -- Two common problems in microarray analysis are ‘high noise levels’ for single genes and a high number of redundant or uninformative genes. Using gene set analysis (GSA) to aggregate functionally related genes into gene sets and summarizing their expression values to a robust “meta”-gene expression vector is a promising approach to overcome some of these limitations.

Moreover, ‘differentially expressed gene sets’ can provide insights on the differences between the ‘biological conditions’ of the samples on the level of ‘molecular modules’ and ‘biochemical pathways’.

The manufacturer's gene set analysis module provides access to three (3) functional annotation sources to identify functionally related genes in a data set and extract corresponding gene sets: the Gene Ontology (GO) database, the KEGG database, and a collection of 37 cancer-related ‘gene sets’ from the Van Andel Institute in Michigan.

Alternatively, users can specify their own gene sets using the gene identifiers for the data set of interest.

Summarized meta-gene expression vectors for a gene set are obtained by transforming the expression levels using Principal Component Analysis (PC-GSA) or Multidimensional Scaling (MDS-GSA).
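A principal-component summary of a gene set can be sketched without external libraries. The code below is a simplified illustration, not the module's PC-GSA implementation: it assumes the meta-gene is the projection of each sample onto the first principal component of the centered gene-set matrix, and it finds that component by power iteration. The function name and iteration count are hypothetical.

```python
import math


def first_pc_scores(expr, n_iter=200):
    """Summarize a gene set as one meta-gene value per sample.

    expr: list of gene rows, each a list of per-sample expression values.
    Note: the sign of a principal component is arbitrary, and the uniform
    start vector fails if it is exactly orthogonal to the leading component.
    """
    n_genes, n_samples = len(expr), len(expr[0])
    centered = [[x - sum(row) / n_samples for x in row] for row in expr]
    # Gene-by-gene covariance matrix.
    cov = [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (n_samples - 1)
         for j in range(n_genes)]
        for i in range(n_genes)
    ]
    v = [1.0 / math.sqrt(n_genes)] * n_genes
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n_genes)) for i in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance in the gene set
        v = [x / norm for x in w]
    # Project every sample onto the leading component.
    return [sum(v[g] * centered[g][s] for g in range(n_genes))
            for s in range(n_samples)]
```

The resulting vector has one value per sample and can stand in for the whole gene set in downstream steps such as clustering or classification.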

The outcome is presented as a ‘ranked list’ of gene sets and additionally contains box plots and heat maps similar to those on the gene selection module. Meta-gene expression values derived from the gene sets can be downloaded or forwarded to other analysis modules, e.g. to be used as predictors in ‘sample classification’.

6) Gene Network Analysis module -- This module measures the similarity of ‘expression patterns’ for pairs of genes in microarray data to ‘construct co-expression networks’. These networks are graphs in which the nodes represent genes and the edges connect genes which are regarded as significantly co-expressed (e.g. when their correlation is above a given threshold).
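A thresholded co-expression network of this kind can be sketched as follows. This is a minimal illustration rather than the module's method: it assumes Pearson correlation as the similarity measure and, as a further assumption, uses the absolute correlation so that strongly anti-correlated genes are also connected. The function names and the default threshold are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(expr, threshold=0.8):
    """expr: dict gene id -> expression values across samples.

    Returns (gene1, gene2, correlation) for every pair whose absolute
    correlation reaches the threshold.
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges
```

The choice of threshold directly controls how dense the resulting graph is, so it is worth trying several values before interpreting modules.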

Co-expressed genes might represent co-regulated and/or functionally related genes, which become active in the same biological conditions. Identifying network modules of co-expressed genes can provide insights on the differences between the ‘biological states’ of different microarray samples on a ‘molecular level’.

Due to the high noise levels in many microarray studies, care must be taken when evaluating and interpreting the results of gene co-expression analysis.

This module is based on the Weighted Gene Co-Expression Network Analysis (WGCNA) method by Zhang and Horvath (2005) and six (6) network visualization approaches, as well as various topological descriptors to analyze the data. Users can either upload their own microarray data or use one of the pre-processed example data sets.

The HTML report resulting from an analysis will provide the user with a table of network statistics, a visualization of the graph, the ‘downloadable network’ in different file formats and a list of the connected components. The statistics table contains descriptor values such as the average degree and path length, the ‘global clustering coefficient’ and the largest diameter across all connected components, among others.
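Some of these descriptors are simple to compute from an edge list, as the sketch below shows for the average degree, the connected components and the global clustering coefficient. It is an illustration under stated assumptions, not the module's code: the network is treated as undirected and unweighted, and the global clustering coefficient is taken to be the ratio of closed to all connected node triples. Function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(nodes, edges):
    """Undirected adjacency sets from a node list and (a, b) edge pairs."""
    adj = {node: set() for node in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)


def connected_components(adj):
    """List of components, each a sorted list of nodes (depth-first search)."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


def global_clustering_coefficient(adj):
    """Closed triples divided by all connected triples (0.0 if there are none)."""
    closed = triples = 0
    for neighbours in adj.values():
        for a, b in combinations(sorted(neighbours), 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0
```

Path length and diameter would additionally require a shortest-path search within each component, which is omitted here to keep the sketch short.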

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site ArrayMining

Price Free

G6G Abstract Number 20547

G6G Manufacturer Number 104162