AutoClass@IJM

Category Genomics>Gene Expression Analysis/Profiling/Tools and Intelligent Software>Bayesian Network Systems/Tools

Abstract AutoClass@IJM is an advanced computational resource with a web interface to AutoClass, an unsupervised Bayesian classification system developed by the Ames Research Center at the National Aeronautics and Space Administration (NASA).

AutoClass has many advanced features with broad application in biological sciences:

1) It determines the number of classes automatically;

2) It allows the user to mix discrete and real-valued data; and

3) It handles missing values.

End users upload their data sets through the manufacturer's web interface; computations are then queued on the manufacturer's cluster server.

When the clustering is completed, a URL to the results is sent back to the end-user via e-mail.

AutoClass is a general-purpose clustering algorithm.

It is an unsupervised Bayesian classification system based on the finite mixture model, supplemented by a Bayesian method and an Expectation-Maximization algorithm for determining the optimal classes.

AutoClass uses maximum likelihood to find the class description that best predicts the data.

AutoClass takes a database of cases described by a combination of real and discrete valued attributes, and automatically finds the natural classes in that data.

It does not need to be told how many classes are present or what they look like -- it extracts this information from the data itself.

The classes are described probabilistically, so that an object can have partial membership in the different classes, and the class definitions can overlap.
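To make partial membership concrete, the following is a minimal sketch of an Expectation-Maximization loop for a one-dimensional mixture of two Gaussian classes. It is not the AutoClass implementation: the number of classes is fixed at two here (AutoClass determines it automatically), and the function names, the starting values and the sample data are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_two_classes(data, n_cycles=50):
    """Fit two Gaussian classes by EM; return class means and memberships."""
    # Crude starting point (an assumption): class means at the data extremes.
    means = [min(data), max(data)]
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    memberships = []
    for _ in range(n_cycles):
        # E-step: probability that each item belongs to each class.
        memberships = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds)]
            total = sum(joint)
            memberships.append([j / total for j in joint])
        # M-step: re-estimate each class from its weighted members.
        for k in range(2):
            resp = [row[k] for row in memberships]
            n_k = sum(resp)
            weights[k] = n_k / len(data)
            means[k] = sum(r * x for r, x in zip(resp, data)) / n_k
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, data)) / n_k
            stds[k] = max(math.sqrt(var), 1e-3)  # avoid a zero-width class
    return means, memberships

# Hypothetical measurements: two groups plus one item lying between them.
data = [0.9, 1.0, 1.1, 1.2, 3.0, 4.8, 5.0, 5.1, 5.2]
means, memberships = em_two_classes(data)
for x, row in zip(data, memberships):
    print(f"{x:4.1f}  class 1: {row[0]:.3f}  class 2: {row[1]:.3f}")
```

Each row of membership probabilities sums to one, so an item is never forced into a single class; items far from both class centres are the ones most likely to show split membership.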

AutoClass generates reports on the classes it has found at the end of its search.

AutoClass has been used and tested on many data sets, both within NASA and by industry, academia and other agencies. These applications typically find surprising classifications that show patterns in the data unknown to the user.

Examples include: discovery of new classes of infra-red stars in the IRAS Low Resolution Spectral catalogue, new classes of airports in a database of all USA airports, discovery of classes of proteins, introns and other patterns in DNA/protein sequence data, and others.

AutoClass@IJM Web Application Description --

Using AutoClass@IJM requires two (2) steps:

1) preparing the data and 2) submitting data files (with optional modifications of default clustering parameters).

A URL to the results is sent back to the user via e-mail. The return time can vary from minutes/hours to days depending on the size of data set and the cluster load.

AutoClass@IJM Preparing Data Files --

AutoClass can handle three (3) different types of data:

1) Singly bounded real numbers (Real Scalar as named by AutoClass), such as length, weight, etc;

2) Real numbers distributed on the two sides of an origin (Real location), such as Cartesian coordinates (in this case, the origin is 0.0), microarray log ratio, elevation (where sea level is the origin), etc; and

3) Discrete data: any qualitative data, such as chromosome number, phenotype, eye color, etc.

For each type of data, the web interface provides a specific input field.
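As a rough aid when sorting columns into the three types above, the sketch below suggests a type from the values in a column. It is only a heuristic under stated assumptions: the function name, the missing-value markers and the decision rule are hypothetical and are not part of AutoClass@IJM. The final choice remains with the user; for example, chromosome number is numeric but should be treated as discrete.

```python
def suggest_data_type(values, missing=("", "NA")):
    """Suggest an AutoClass data type for one column of text values.

    Missing entries are ignored, since AutoClass handles missing values.
    """
    present = [v for v in values if v not in missing]
    if not present:
        return "unknown"  # nothing to base a suggestion on
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        return "discrete"  # at least one non-numeric label
    if min(numbers) > 0:
        return "real scalar"  # bounded on one side, e.g. length or weight
    return "real location"  # values on both sides of an origin


print(suggest_data_type(["1.5", "2.0", "", "3.2"]))   # lengths
print(suggest_data_type(["-0.4", "1.2", "NA"]))       # log ratios
print(suggest_data_type(["blue", "brown", "green"]))  # eye color
```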

AutoClass@IJM Submitting Data Files --

The user must provide an e-mail address and upload the data files. AutoClass uses several parameters; the manufacturer provides an optimized default set.

The default parameters are the AutoClass defaults except for the ‘max_n_cycles’ parameter (the maximum number of cycles).

AutoClass chooses the best among 100 classifications. Each classification is performed as an iterative process: a classification stops if the convergence criteria are met or if the maximum number of cycles is reached.

Gene expression data are especially difficult to cluster because they are very noisy.

As a result, in the manufacturer's experience, the AutoClass default maximum number of cycles (200) is reached too often.

The manufacturer therefore set this parameter to 1,000 so that most classifications converge before the maximum number of cycles is reached.
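The control flow described above (several classification tries, each stopping on convergence or at the cycle limit, with the best one kept) can be sketched as follows. This is an illustration only, not AutoClass code: the function names, the tolerance value and the toy update rule are assumptions; only ‘max_n_cycles’, the limit of 1,000 cycles and the 100 tries come from the text.

```python
import random

def run_one_try(update, state, max_n_cycles=1000, tolerance=1e-6):
    """Iterate until the score stops changing or the cycle limit is hit.

    Returns (state, score, cycles_used, converged).
    """
    score = float("-inf")
    for cycle in range(1, max_n_cycles + 1):
        state, new_score = update(state)
        if abs(new_score - score) < tolerance:
            return state, new_score, cycle, True
        score = new_score
    return state, score, max_n_cycles, False

def best_of(update, make_start, n_tries=100, **kwargs):
    """Run several tries from different starts and keep the best score."""
    results = [run_one_try(update, make_start(), **kwargs)
               for _ in range(n_tries)]
    return max(results, key=lambda result: result[1])

def toy_update(x):
    """Toy stand-in for one cycle: move halfway towards 3 and score it."""
    x = (x + 3.0) / 2.0
    return x, -(x - 3.0) ** 2

random.seed(0)
state, score, cycles, converged = best_of(
    toy_update, lambda: random.uniform(-100.0, 100.0))
print(cycles, converged)
```

With a low cycle limit, a slowly converging try is cut off before it meets the convergence test, which is the situation the higher limit is meant to avoid.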

The user may also change the ‘error’ parameter. This ‘error’ is relative (i.e. the ratio of the error to the value) for real scalars, and it is a constant for real location values.
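The difference between the two interpretations of ‘error’ can be shown with a short calculation. The function below is a hypothetical helper written for illustration, not part of AutoClass or the web interface.

```python
def absolute_uncertainty(value, error, data_type):
    """Convert the 'error' setting into an absolute uncertainty on a value."""
    if data_type == "real scalar":
        # Relative error: the uncertainty scales with the value itself.
        return abs(value) * error
    if data_type == "real location":
        # Constant error: the same uncertainty for every value.
        return error
    raise ValueError(f"no error setting applies to type {data_type!r}")


# With error = 0.01, a real scalar of 200.0 is uncertain by about 2.0,
# while a real location of 200.0 is uncertain by 0.01.
print(absolute_uncertainty(200.0, 0.01, "real scalar"))
print(absolute_uncertainty(200.0, 0.01, "real location"))
```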

Each analysis is submitted as a single job. Once submitted, the job is queued, and when it starts running, a first e-mail is sent to the end-user.

AutoClass@IJM Output Data --

AutoClass computes classes using all of the input data. After completion of the job, a single zipped archive is generated containing:

1) A tab-delimited file that associates each ID with the index of its class;

2) Two (2) CDT files. If the input data is exclusively numerical, these result files can be read by Java Treeview-like software (see below); otherwise, a spreadsheet program can be used to read them.

One file contains the experimental data and the probabilities for each item to belong to the different classes; the second file contains only the experimental data (to help the visual identification of classes, blank lines are introduced between classes in the CDT files);

3) A log file from AutoClass.

A second e-mail, which contains a URL to the zipped archive, is then sent to the end-user so that they can download their results.
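Once the archive has been unpacked, the tab-delimited file that associates each ID with its class index can be regrouped by class with a few lines of code. The sketch below assumes two columns per line (ID, then class index) and no header row; the exact layout and file name are not described here, so both are assumptions, as are the sample IDs.

```python
import csv
from collections import defaultdict

def group_ids_by_class(lines):
    """Group item IDs by class index from tab-delimited lines."""
    classes = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        item_id, class_index = row[0], row[1]
        classes[class_index].append(item_id)
    return dict(classes)


# Hypothetical file content; a real file would be opened with
# open(path, newline="") and passed to the function in the same way.
sample = ["geneA\t1", "geneB\t2", "geneC\t1"]
for class_index, ids in group_ids_by_class(sample).items():
    print(class_index, ids)
```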

Java Treeview - extensible visualization of microarray data -

Java Treeview is an open-source, cross-platform software product that handles very large datasets well, and supports extensions to the file format that allow the results of additional analysis to be visualized and compared.

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site AutoClass@IJM

Price Contact manufacturer.

G6G Abstract Number 20500

G6G Manufacturer Number 104120