Expression Profiler: Next Generation

Category Genomics>Gene Expression Analysis/Profiling/Tools

Abstract Expression Profiler: Next Generation is an open, extensible, web-based collaborative platform for the analysis of microarray gene expression, sequence and protein-protein interaction (PPI) data. It exposes distinct chainable components for clustering, pattern discovery, statistics (through R), machine-learning algorithms and visualization.

The new architecture of Expression Profiler: Next Generation (EP:NG) modularizes the original design (the previous version of this product), so that individual analysis components can be developed by different groups yet still work together seamlessly and share the same user interface look and feel.

Components for gene expression data preprocessing, missing value imputation, filtering, clustering, visualization, significant gene finding, between group analysis and other statistical analyses are available from the EBI (European Bioinformatics Institute) web site.

The web-based design of Expression Profiler supports data sharing and collaborative analysis in a secure environment.

Developed tools are integrated with the microarray gene expression database ArrayExpress (see G6G Abstract Number 20012) and form the exploratory analytical front-end to those data.

The EP:NG platform gives a unified style to all the constituent components, regardless of the origin of their development, their physical location and their manner of execution. EP:NG supports quick and easy integration of third-party tools and algorithms.

The EP:NG infrastructure is built to support

(1) data sharing between groups of collaborators independent of location and format,

(2) execution of algorithms in a pipeline fashion and

(3) the integration of a constantly expanding collection of methods of data analysis.

Expression Profiler Components (EPCs)--

Data Upload - Users can either upload their own data or select a published dataset from the ArrayExpress database.

To retrieve data from ArrayExpress, one can explore the database via its online interface, export an expression matrix by selecting the desired samples and measurements (Assays and Quantitation Types), and then send it to EP:NG.

The Data Upload component can accept data in a number of formats including basic delimited files such as those exported by Microsoft Excel.

Data Selection - Data Selection presents a brief statistical overview of the data (data distribution histogram, mean, and standard deviation) and allows the user to select genes and conditions that have particular expression values.
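As an illustration of the kind of overview and filtering described above, here is a minimal Python sketch. The function names, the dictionary layout of the expression matrix and the selection rule (absolute value at or above a threshold in a minimum number of conditions) are assumptions made for the example, not EP:NG's actual interface.

```python
from statistics import mean, stdev

def summarize(matrix):
    """Return the mean and sample standard deviation of all non-missing values."""
    values = [v for row in matrix.values() for v in row if v is not None]
    return mean(values), stdev(values)

def select_genes(matrix, threshold, min_conditions):
    """Keep genes whose |value| >= threshold in at least min_conditions conditions."""
    return {
        gene: row
        for gene, row in matrix.items()
        if sum(1 for v in row if v is not None and abs(v) >= threshold) >= min_conditions
    }

# Hypothetical log-ratio matrix: gene -> values across three conditions (None = missing).
expr = {
    "geneA": [2.1, 1.8, -0.2],
    "geneB": [0.1, None, 0.3],
    "geneC": [-1.9, -2.2, -1.7],
}
overall_mean, overall_sd = summarize(expr)
selected = select_genes(expr, threshold=1.5, min_conditions=2)
print(sorted(selected))
```

Here geneB is dropped because none of its values reaches the threshold, while geneA and geneC pass in at least two conditions.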

Data Transformation - Either before or after sub-selecting data, various transformation procedures can be applied.

These include K-nearest neighbor imputation to fill in missing values (this can be a lengthy procedure, especially on large datasets), LOWESS normalization (an integrated third-party component) and conversion of absolute intensity values from a two-channel experiment to log ratios.
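Two of these transformations can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code EP:NG runs: the function names are hypothetical, the log ratio is taken as base 2 of channel 1 over channel 2, and the imputation uses a plain unweighted mean of the k nearest rows, measured by root-mean-square difference over shared non-missing columns.

```python
import math

def log_ratio(red, green):
    """log2(red / green) for one two-channel spot; None if a value is missing or non-positive."""
    if red is None or green is None or red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

def knn_impute(rows, k=2):
    """Fill each missing value (None) with the mean of that column in the k nearest rows."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors are rows that do have a value in column j.
            candidates = sorted(
                (dist(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [val for d, val in candidates[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

rows = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
filled = knn_impute(rows, k=2)
print(log_ratio(400.0, 100.0), filled[0])
```

Every missing value triggers a distance computation against all other rows, which is one reason the procedure becomes slow on large matrices.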

From here one can proceed to further sub-selections or apply a suitable analysis component.

Similarity Search - The Similarity Search provides a means of selecting groups of genes related to given ones. The user specifies one or several genes, chooses a similarity measure and receives those genes most closely co-expressed with the selected genes within the dataset.

A wide variety of distance measures is available, including the Euclidean metric, Pearson correlation, Manhattan distance, Spearman's rank correlation and chord distance.
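The following Python sketch shows how such a search could work with four of the measures named above. It is a hedged illustration: the function names and the toy matrix are hypothetical, correlation is turned into a distance as 1 minus the Pearson coefficient, and chord distance is taken as the Euclidean distance between profiles scaled to unit length.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation, so perfectly correlated profiles are at distance ~0."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (sd_a * sd_b)

def chord(a, b):
    """Euclidean distance after scaling both profiles to unit length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return euclidean([x / norm_a for x in a], [y / norm_b for y in b])

def similarity_search(matrix, query, distance, top=2):
    """Return the `top` genes whose profiles are closest to that of `query`."""
    ranked = sorted(
        (distance(matrix[query], profile), gene)
        for gene, profile in matrix.items()
        if gene != query
    )
    return [gene for _, gene in ranked[:top]]

expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],   # same shape as g1, twice the amplitude
    "g3": [4.0, 3.0, 2.0, 1.0],   # opposite trend
    "g4": [1.1, 2.1, 2.9, 4.2],   # close to g1 in absolute terms
}
print(similarity_search(expr, "g1", euclidean, top=1))
print(similarity_search(expr, "g1", pearson_distance, top=1))
```

The choice of measure changes the answer: Euclidean distance favors g4, which is close in absolute value, while the correlation-based distance favors g2, which has the same shape at a different amplitude.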

Hierarchical and K-groups Clustering - EP:NG provides both hierarchical and partitioning-based clustering methods. As stated above, there are many distance measures available. The clustering algorithms are implemented in C, and results are visualized as publication-quality vector-based SVGs or in a raster format, PNG.

Data can be hierarchically clustered on both genes and conditions simultaneously, and one can interactively zoom in on interesting sub-trees.
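To make the idea of a cluster tree with sub-trees concrete, here is a small Python sketch of agglomerative clustering with average linkage. It is an assumption-laden illustration, not EP:NG's C implementation: the linkage rule, the function names and the nested-tuple representation of the tree are all choices made for the example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles, distance):
    """Repeatedly merge the two closest clusters; return the tree as nested pairs of names."""
    # Each cluster is (tree, members): tree is a name or a nested pair, members are leaf names.
    clusters = [(name, [name]) for name in profiles]

    def linkage(a, b):
        total = sum(distance(profiles[x], profiles[y]) for x in a[1] for y in b[1])
        return total / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        a, b = clusters[i], clusters[j]
        merged = ((a[0], b[0]), a[1] + b[1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

profiles = {
    "g1": [1.0, 2.0],
    "g2": [1.1, 2.1],
    "g3": [5.0, 5.0],
    "g4": [5.2, 4.9],
}
tree = average_linkage(profiles, euclidean)
print(tree)
```

Clustering the other dimension of the matrix amounts to running the same function on the transposed data, and each nested pair in the result is a sub-tree that a viewer could zoom into.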

There are two (2) K-groups clustering algorithms in EP:NG: K-means and K-medoids. K-means is a well-known approach to partitioning the data into a specified number of clusters, and K-medoids is a variant of it.

It differs in that it uses existing objects from the dataset as cluster centers in its calculations. This allows the use of a 'distance matrix' derived from any measure; hence, this component can be applied to more diverse data types, such as sequences.
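The point about distance matrices can be illustrated with a short Python sketch that clusters sequences using only their pairwise distances. This is a simple alternating K-medoids variant written for illustration, not the EP:NG implementation; the naive initialization (the first k items), the Hamming distance and the toy sequences are assumptions, and degenerate cases such as duplicate medoids are not handled.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))

def k_medoids(dist, k, max_iter=100):
    """Partition items 0..n-1 into k clusters using only a precomputed distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # naive deterministic start: the first k items
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest total
        # distance to the rest of its cluster (an existing object, not a mean).
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters  # maps medoid index -> member indices

seqs = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
dist = [[hamming(s, t) for t in seqs] for s in seqs]
clusters = k_medoids(dist, k=2)
for medoid, members in clusters.items():
    print(seqs[medoid], [seqs[i] for i in members])
```

No cluster mean is ever computed, which is why the method works for objects such as sequences where an average is not defined.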

Clustering Comparison - A problem that arises naturally with all partitioning clustering methods is to find the appropriate K (number of clusters). The Clustering Comparison component implements an algorithm that takes two K-groups clustering results and matches the clusters by membership.

By examining the output, the user can judge which number of clusters best fits the dataset according to a chosen criterion.
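One plausible way to match clusters by membership is sketched below in Python. The greedy pairing by Jaccard index is an assumption chosen for illustration; the source does not specify which matching algorithm the component uses, and the cluster names and gene sets are hypothetical.

```python
def match_clusters(first, second):
    """Greedily pair clusters from two partitions by shared membership (Jaccard index)."""
    pairs = []
    for name_a, members_a in first.items():
        for name_b, members_b in second.items():
            shared = len(members_a & members_b)
            union = len(members_a | members_b)
            pairs.append((shared / union, name_a, name_b))
    pairs.sort(reverse=True)  # best-overlapping pairs first

    matched, used_a, used_b = [], set(), set()
    for score, name_a, name_b in pairs:
        if score > 0 and name_a not in used_a and name_b not in used_b:
            matched.append((name_a, name_b, score))
            used_a.add(name_a)
            used_b.add(name_b)
    return matched

# Hypothetical results of clustering the same six genes with K=2 and K=3.
k2 = {"A": {"g1", "g2", "g3"}, "B": {"g4", "g5", "g6"}}
k3 = {"X": {"g1", "g2"}, "Y": {"g3", "g4"}, "Z": {"g5", "g6"}}
matched = match_clusters(k2, k3)
for name_a, name_b, score in matched:
    print(name_a, name_b, round(score, 2))
```

In this toy case cluster Y of the K=3 result has no partner: it takes one gene from each of the K=2 clusters, which hints that the third cluster splits existing groups rather than revealing a new one.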

Signature Algorithm - The Signature Algorithm component is implemented in R. It identifies a co-expressed subset in a user-submitted set of genes, removes unrelated genes from the input and identifies additional genes in the same dataset that follow a similar pattern of expression.

Co-expression is identified with respect to a subset of conditions, which is also provided as the output of the algorithm. It is a fast algorithm useful for exploring the modular structure of expression data matrices.
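The two-step idea described above (score conditions with the input genes, then score all genes on the selected conditions) can be sketched in Python as follows. This is a heavily simplified illustration, not the R implementation used by EP:NG: the fixed thresholds, the scoring by plain and weighted means, and the toy data are all assumptions.

```python
def signature(matrix, input_genes, cond_threshold=1.0, gene_threshold=1.0):
    """Return (genes, conditions) forming a co-expressed module around the input genes."""
    n_cond = len(next(iter(matrix.values())))
    # Step 1: score each condition by the mean expression of the input genes,
    # and keep the conditions where the input set responds strongly.
    cond_scores = [
        sum(matrix[g][c] for g in input_genes) / len(input_genes)
        for c in range(n_cond)
    ]
    conditions = [c for c, s in enumerate(cond_scores) if abs(s) >= cond_threshold]
    if not conditions:
        return [], []
    # Step 2: score every gene in the dataset on the selected conditions,
    # weighting each condition by its score from step 1.
    gene_scores = {
        g: sum(cond_scores[c] * row[c] for c in conditions) / len(conditions)
        for g, row in matrix.items()
    }
    genes = sorted(g for g, s in gene_scores.items() if s >= gene_threshold)
    return genes, conditions

# Hypothetical log-ratio matrix over four conditions.
expr = {
    "g1": [2.0, 2.2, 0.1, -0.1],
    "g2": [1.8, 2.0, -0.2, 0.2],
    "g3": [2.1, 1.9, 0.0, 0.1],    # not in the input, but co-expressed with g1 and g2
    "g4": [-0.1, 0.2, 1.5, -1.4],  # in the input, but unrelated
    "g5": [0.0, -0.1, 0.1, 0.0],
}
genes, conditions = signature(expr, input_genes=["g1", "g2", "g4"])
print(genes, conditions)
```

With this input the unrelated gene g4 is removed, the co-expressed gene g3 is added, and the first two conditions are reported as the subset on which the module is defined.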

Between Group Analysis and Ordination - Standard multivariate analysis methods, such as principal component analysis (PCA) and correspondence analysis (COA), are provided in the Ordination component. These methods are frequently used to search for underlying structures in datasets.

Between Group Analysis (BGA) presents a multiple discriminant approach that can be used with expression data matrices of any dimensionality. BGA is carried out by ordinating groups (sets of grouped microarray samples) and then projecting the individual sample locations on the resulting axes.

It is used in the framework of conventional ordination techniques such as PCA or COA and, as such, allows for great flexibility with regard to the assumptions that one makes in carrying out the analysis.

When combined with COA it is especially powerful as it allows one to examine in detail the correspondences between the grouped samples and those genes which most facilitate the discrimination of these groupings.

This is a supervised method, because it uses the known sample groupings, so it is important to test its results using a test dataset or re-sampling accuracy analysis methods. The Ordination and BGA EPCs are implemented using the R multivariate data analysis package ADE-4.
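For the special case of two groups, the procedure of ordinating the group means and projecting individual samples can be shown in a short Python sketch. With only two groups, the single between-group axis is the direction joining the two group means, which avoids an eigen-decomposition. This is an illustrative simplification, not the ADE-4 based component; the function names, the toy data and the midpoint decision rule are assumptions.

```python
import math

def group_mean(samples):
    """Per-gene mean over a list of samples (each sample is a list of gene values)."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def between_group_axis(group_a, group_b):
    """Unit vector joining the two group means, plus the means themselves."""
    mean_a, mean_b = group_mean(group_a), group_mean(group_b)
    diff = [b - a for a, b in zip(mean_a, mean_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff], mean_a, mean_b

def project(sample, axis, origin):
    """Coordinate of a sample on the axis, measured from the origin."""
    return sum((x - o) * w for x, o, w in zip(sample, origin, axis))

# Hypothetical training samples: three genes measured in two samples per group.
group_a = [[1.0, 5.0, 2.0], [1.2, 5.1, 2.1]]
group_b = [[3.0, 5.0, 2.0], [3.2, 4.9, 2.1]]

axis, mean_a, mean_b = between_group_axis(group_a, group_b)
midpoint = [(a + b) / 2 for a, b in zip(mean_a, mean_b)]

# A held-out test sample is projected onto the axis and assigned to the nearer group.
held_out = [2.9, 5.0, 2.0]
score = project(held_out, axis, midpoint)
predicted = "B" if score > 0 else "A"

# The gene with the largest absolute weight contributes most to the discrimination.
top_gene = max(range(len(axis)), key=lambda g: abs(axis[g]))
print(predicted, top_gene)
```

Keeping the held-out sample out of the computation of the axis is what makes the prediction a fair test of the method.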

Pipelines and workflows - It is important not only to be able to find and execute analytical procedures, but also to do so in a logical, user-defined sequence of steps. This naturally leads to the concept of analysis pipelines.

EP:NG was designed to enable the user to combine components into sequences. There are two (2) major ways to do this.

First, via the interface: each component provides annotated links to other EPCs that logically follow (e.g. Data Selection links to Data Transformation and to Hierarchical Clustering, etc.).

Second, programmatically: a program can send a simple XML query to EP:NG indicating which components to launch and in what sequence. The final result will be shown to the user.

Whenever a user or a program runs an EP:NG component, the parameters and the sequence of steps taken to obtain the results are stored in the internal database.

This allows one to define an analysis workflow, a process that can later be applied repeatedly with the same parameters to other datasets.
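The record-and-replay idea can be sketched generically in Python. This is a conceptual illustration only: the Workflow class, the component registry and the two toy components are hypothetical, and they stand in for EP:NG's internal database and real components, whose interfaces are not described here.

```python
import math

def log2_transform(data):
    return [math.log2(v) for v in data]

def filter_above(data, threshold=0.0):
    return [v for v in data if v > threshold]

# Hypothetical registry mapping component names to functions.
COMPONENTS = {"log2_transform": log2_transform, "filter_above": filter_above}

class Workflow:
    """Runs components and records each step so the sequence can be reapplied."""

    def __init__(self):
        self.steps = []  # recorded (component name, parameters) pairs, in order

    def run(self, name, data, **params):
        self.steps.append((name, params))
        return COMPONENTS[name](data, **params)

    def replay(self, data):
        """Apply the recorded steps, with the same parameters, to another dataset."""
        for name, params in self.steps:
            data = COMPONENTS[name](data, **params)
        return data

wf = Workflow()
step1 = wf.run("log2_transform", [1.0, 4.0, 16.0])
step2 = wf.run("filter_above", step1, threshold=1.0)

# The same two steps, with the same threshold, applied to a second dataset.
replayed = wf.replay([2.0, 8.0, 32.0])
print(step2, replayed)
```

Because only component names and parameters are stored, the recorded sequence is independent of the data it was first run on.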

System Requirements

Web based.


Manufacturer Web Site Expression Profiler: Next Generation

Price Free

G6G Abstract Number 20271

G6G Manufacturer Number 100859