## NIEHS Software Tools

**Category** Genomics>Gene Expression Analysis/Profiling/Tools, Genomics>Genetic Data Analysis/Tools, Proteomics>Mass Spectrometry Analysis/Tools

**Abstract** NIEHS (National Institute of Environmental Health Sciences) Software Tools are a group of biological software tools that were written by scientists within the Institute and are available to the public at no cost.

1) ExP - Expression Predictor Featuring Simplified Fuzzy ARTMAP Supervised Classification and Prediction --

Expression Predictor (ExP) is a stand-alone, desktop application developed with the Java(TM) programming language for classifying and predicting samples based on gene expression data using a simplified fuzzy adaptive resonance theory map (SFAM) neural network (NN) architecture.

Leave-one-out cross validation of the samples at successive vigilance parameter settings permits the determination of the classifier setting that gives the highest accuracy of prediction.
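The parameter sweep described above can be sketched as follows. This is a minimal illustration, not ExP's code: the `fit_predict` interface, the function names and the toy radius-based classifier are all assumptions made for the example, and the toy classifier is a stand-in rather than an SFAM network.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
# Hypothetical interface: train on (X, y) at one setting, then label one sample.
FitPredict = Callable[[List[Sample], List[str], Sample, float], str]


def loocv_accuracy(X: List[Sample], y: List[str],
                   fit_predict: FitPredict, setting: float) -> float:
    """Hold out each sample once, train on the rest, count correct predictions."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if fit_predict(train_X, train_y, X[i], setting) == y[i]:
            correct += 1
    return correct / len(X)


def best_setting(X: List[Sample], y: List[str], fit_predict: FitPredict,
                 settings: Sequence[float]) -> Tuple[float, float]:
    """Return (accuracy, setting) for the setting with the highest accuracy."""
    scored = [(loocv_accuracy(X, y, fit_predict, s), s) for s in settings]
    return max(scored, key=lambda pair: pair[0])


def toy_classifier(train_X, train_y, x, radius):
    """Stand-in, NOT SFAM: nearest neighbour, rejected if farther than radius."""
    dist, label = min(
        (sum((a - b) ** 2 for a, b in zip(row, x)) ** 0.5, lab)
        for row, lab in zip(train_X, train_y)
    )
    return label if dist <= radius else "unknown"


X = [[0.0], [0.1], [1.0], [1.1]]
y = ["a", "a", "b", "b"]
accuracy, setting = best_setting(X, y, toy_classifier, [0.05, 0.5])
```

In ExP the swept value would be the vigilance parameter of the SFAM network; here the classifier is passed in as a function so that the cross-validation loop stays independent of the model.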

Gene expression data can be imported into ExP using a standard format for analysis and the prediction results are exported in a tab-delimited format.

2) Extracting Patterns and Identifying co-expressed Genes (EPIG) --

EPIG is a method for Extracting microarray gene expression Patterns and Identifying co-expressed Genes.

Through evaluation of the correlations among profiles, the magnitude of variation in gene expression profiles, and profile signal-to-noise ratios, EPIG extracts a set of patterns representing co-expressed genes without a pre-defined seeding of the patterns.

3) fdrMotif - Identifying cis-elements by an EM Algorithm Coupled with False Discovery Rate (FDR) Control --

fdrMotif determines the number of binding sites in each sequence by performing statistical tests under a probability model. It is iterative, alternating between updating the position weight matrix (PWM) and significance testing. It starts with an initial PWM and a set of sequences (e.g., from ChIP experiments).

It generates many sets of background (null) sequences under the input sequence probability model. At each model estimation step, fdrMotif determines the number of binding sites in each sequence by performing statistical tests.

The FDR in the original dataset is controlled by monitoring the proportion of background subsequences that are declared as binding sites.
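One simple way to turn this monitoring into a number is to compare how many background subsequences pass a score cutoff with how many pass in the original data. The sketch below assumes that estimator; the function name, the score cutoff and the averaging over background sets are illustrative assumptions and may differ from the rule fdrMotif actually applies.

```python
def estimated_fdr(real_scores, background_sets, cutoff):
    """Empirical FDR estimate: mean null hits per background set / real hits.

    Assumed estimator for illustration only, not fdrMotif's exact procedure.
    """
    real_hits = sum(score >= cutoff for score in real_scores)
    if real_hits == 0:
        return 0.0  # nothing declared, so nothing can be a false discovery
    null_hits = [sum(score >= cutoff for score in bg) for bg in background_sets]
    mean_null_hits = sum(null_hits) / len(background_sets)
    return min(1.0, mean_null_hits / real_hits)


real = [5.0, 6.0, 7.0, 1.0, 2.0]
background = [[1.0, 2.0, 5.0, 0.0, 1.0], [0.0, 1.0, 2.0, 3.0, 6.0]]
fdr = estimated_fdr(real, background, cutoff=5.0)
```

Raising the cutoff until the estimate falls below a target level is one way such a quantity could be used to control the FDR.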

The PWM is updated using an EM algorithm with two (2) iterative steps (the E and M steps) until convergence. In the E-step, fdrMotif normalizes the sum of the probabilities over all positions in a sequence to the number of binding sites found in the sequence.
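The E-step normalization can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the per-position probabilities are taken as already computed, and the function name is hypothetical.

```python
def normalize_site_probabilities(position_probs, n_sites):
    """Rescale per-position probabilities so they sum to n_sites.

    position_probs: unnormalized probability that a site starts at each position.
    n_sites: number of binding sites declared for this sequence.
    """
    total = sum(position_probs)
    if total == 0:
        return [0.0 for _ in position_probs]
    return [p * n_sites / total for p in position_probs]


scaled = normalize_site_probabilities([0.2, 0.1, 0.1, 0.4, 0.2], n_sites=2)
```

Plain rescaling can push a single position above 1 when one position dominates; how fdrMotif handles that case is not described here, so the sketch leaves it unaddressed.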

4) GA/KNN - Variable selection and sample classification using a Genetic Algorithm and K-Nearest Neighbors (GA/KNN) method --

The GA/KNN software selects the most discriminative variables for sample classification.

It can be used for analysis of microarray gene expression data, proteomic data such as those from the SELDI-TOF, or other high-dimensional data.
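To make the idea concrete, the sketch below scores a candidate subset of variables by the leave-one-out accuracy of a k-nearest-neighbors classifier restricted to those variables; a genetic algorithm would then search for subsets that maximize this score. The function names, the Euclidean distance and the majority-vote rule are assumptions for illustration, and the genetic search itself is omitted.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k, features):
    """Majority vote among the k nearest training samples, using only `features`."""
    dists = sorted(
        (math.dist([row[f] for f in features], [x[f] for f in features]), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def subset_fitness(X, y, features, k=3):
    """Leave-one-out KNN accuracy for one candidate subset of variables."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k, features) == y[i]:
            correct += 1
    return correct / len(X)


# Toy data: variable 0 separates the classes, variable 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 0.9], [1.2, 9.1]]
y = ["a", "a", "a", "b", "b", "b"]
good = subset_fitness(X, y, features=[0], k=1)
bad = subset_fitness(X, y, features=[1], k=1)
```

On this toy data the subset containing only variable 0 scores higher than the subset containing only variable 1, which is the kind of contrast the search exploits.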

5) GADEM - A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery --

GADEM is an unbiased de novo motif discovery tool implementing an expectation-maximization (EM) algorithm. The manufacturers have created an updated version, v1.3.1, with the following improvements and additions.

The manufacturers added a 'seeded' analysis in which a user-specified position weight matrix (PWM) is the starting PWM model. Seeded analyses are at least 10x faster and perhaps more accurate than the already scalable 'unseeded' analyses, and can identify short and less abundant motifs, and variants of dominant motifs.

The manufacturers created an approach for estimating the number of binding sites in the data, including non-uniform motif priors that take advantage of the high spatial resolution of ChIP-seq data.

GADEM can now produce a report that gives each motif's fold enrichment for input data vs. background/random sequence data. These changes substantially enhance GADEM's functionality and efficiency for motif discovery in large-scale genomic data.

6) Genetic Algorithm Method for Optimizing a Position Weight Matrix (GAPWM) --

Position weight matrices (PWM) are simple models commonly used in motif finding algorithms to identify short functional elements, such as cis-regulatory motifs, on genes.

When few experimentally verified motifs are available, estimation of the PWM may be poor. The Genetic Algorithm Method for Optimizing a Position Weight Matrix (GAPWM) implements a simple method to improve a poorly estimated PWM using chromatin immunoprecipitation (ChIP) data.
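For readers unfamiliar with PWMs, the sketch below shows how one is typically used to score candidate sites: each window of the sequence receives a log-odds score against a background model, and the best-scoring window is reported. The uniform background, the toy two-column matrix and the function names are assumptions for illustration; the genetic algorithm that GAPWM uses to refine the matrix is not shown.

```python
import math

# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def site_score(site, pwm, background=BACKGROUND):
    """Log-odds score of one site: sum over columns of log2(p_base / bg_base)."""
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(site, pwm)
    )


def best_site(sequence, pwm):
    """Slide the PWM along the sequence; return (start, score) of the best window."""
    width = len(pwm)
    scores = [
        (site_score(sequence[i:i + width], pwm), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scores)
    return start, score


# Toy PWM of width 2 that prefers the dinucleotide "AC".
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
start, score = best_site("GGACG", toy_pwm)
```

A poorly estimated PWM assigns misleading column probabilities, so true sites score no better than background; improving those probabilities is the problem GAPWM addresses.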

7) Modk-Prototypes for Simultaneous Clustering of Gene Expression Data with Clinical Chemistry and Pathological Evaluations --

The Modk-prototypes algorithm clusters biological samples by simultaneously considering microarray gene expression data and classes of known phenotypic variables, such as clinical chemistry evaluations and histopathologic observations. To measure the dissimilarity of samples, it constructs an objective function that combines the sum of the squared Euclidean distances for numeric microarray and clinical chemistry data with simple matching for categorical histopathology values.

Separate weighting terms are used for microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples.

The dynamic validity index for numeric data was modified with a category utility measure for determining the number of clusters in the data sets. A cluster's prototype, formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members.
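The dissimilarity measure and the cluster prototype can be sketched as below. The dictionary layout, the function names and the way the three weights enter (as a weighted sum) are assumptions made for illustration; the sketch does not cover the validity index or the clustering loop.

```python
from collections import Counter
from statistics import mean


def squared_distance(a, b):
    """Sum of squared differences between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mismatches(a, b):
    """Simple matching: count categorical values that differ."""
    return sum(x != y for x, y in zip(a, b))


def dissimilarity(sample, prototype, w_expr, w_chem, w_path):
    """Weighted mix of numeric distances and categorical mismatches."""
    return (
        w_expr * squared_distance(sample["expr"], prototype["expr"])
        + w_chem * squared_distance(sample["chem"], prototype["chem"])
        + w_path * mismatches(sample["path"], prototype["path"])
    )


def cluster_prototype(samples):
    """Mean of each numeric feature, mode of each categorical feature."""
    return {
        "expr": [mean(col) for col in zip(*(s["expr"] for s in samples))],
        "chem": [mean(col) for col in zip(*(s["chem"] for s in samples))],
        "path": [
            Counter(col).most_common(1)[0][0]
            for col in zip(*(s["path"] for s in samples))
        ],
    }


samples = [
    {"expr": [1.0, 2.0], "chem": [10.0], "path": ["necrosis"]},
    {"expr": [3.0, 4.0], "chem": [20.0], "path": ["necrosis"]},
    {"expr": [2.0, 3.0], "chem": [30.0], "path": ["normal"]},
]
proto = cluster_prototype(samples)
d = dissimilarity(samples[2], proto, w_expr=1.0, w_chem=0.5, w_path=2.0)
```

Changing the three weights shifts how strongly each data domain pulls a sample toward a prototype, which is the role the weighting terms play in the description above.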

8) Phase-shifted Analysis of Gene Expression (PAGE) --

PAGE is a Java-based software product for the phase-shifted analysis of gene expression developed along the lines of the original q-Clustering algorithm to analyze gene expression from multiple biological conditions across dose and time series experiments.

Grouping of gene expression patterns is performed in q-intervals of the measurements using phase-shifts to find clusters of genes which share trends of expression profiles within the dataset. The PAGE method has three (3) phases:

Phase 1 - Gene expression pattern matrix transformation into -1, 0, 1 to indicate the direction of expression change from each biological condition at fixed time and dose points. All biological replicates are averaged if provided.

Phase 2 - Generate q-Clusters which have similar patterns of expression over q consecutive conditions.

Phase 3 - Assign a significance score for each bi-cluster in all q-Clusters and identify the inhibition patterns of each q-Cluster.

Window size and threshold parameters are used in PAGE to specify the q-interval consecutive points and the slope cut-off for defining the upward and downward trends respectively. The patterns and genes within the q-Clusters are visualized in trend plots and also compared to determine biological relevance from the gene annotations.
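Phase 1 can be sketched for a single gene as follows. The sketch assumes that the direction of change is taken between consecutive conditions and that the threshold is applied symmetrically to the difference of replicate means; the function name and these details are illustrative assumptions rather than PAGE's documented behavior.

```python
def direction_pattern(replicates_per_condition, threshold):
    """Turn one gene's measurements into a -1/0/1 pattern of expression change.

    replicates_per_condition: one list of replicate values per condition,
    ordered along the time or dose series.
    threshold: slope cut-off; smaller changes are treated as flat (0).
    """
    # Average the biological replicates for each condition.
    means = [sum(reps) / len(reps) for reps in replicates_per_condition]
    pattern = []
    for previous, current in zip(means, means[1:]):
        slope = current - previous
        if slope > threshold:
            pattern.append(1)    # upward trend
        elif slope < -threshold:
            pattern.append(-1)   # downward trend
        else:
            pattern.append(0)    # no change beyond the cut-off
    return pattern


gene = [[1.0, 1.2], [2.0, 2.2], [2.1, 2.1], [0.5, 0.7]]
pattern = direction_pattern(gene, threshold=0.5)
```

Genes whose patterns agree over a window of q consecutive entries would then be candidates for the same q-Cluster in Phase 2.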

PAGE Application Features --

PAGE is an interactive tool that uses a line graph to dynamically illustrate the phase-shifted patterns of gene expressions based on the q-Cluster selected by the users. Each line shown on the line graph represents the trend of a bi-cluster whose score is equal to or below the maximum threshold value.

The line graph is capable of zooming and can be exported in JPG format. Also, the genes associated with the trends shown on the line graph can be exported to a text file for other analysis. Furthermore, all phase-shifted patterns in each of the q-Clusters can be exported as a tab-delimited text file, which can be readily imported to other spreadsheet applications.

9) Principal Variance Component Analysis (PVCA) --

Principal Variance Component Analysis (PVCA) is a hybrid approach using principal components analysis (PCA) and variance component analysis as a methodology to determine and quantify sources of variability most prominent in microarray gene expression data.

Batch effects are often present in microarray data due to any number of factors, including, for example, a poor experimental design or the combination of gene expression data from different studies with limited standardization.

To estimate the variability of experimental effects including batch, a novel hybrid approach known as principal variance component analysis (PVCA) has been developed.

The approach leverages the strengths of two (2) very popular data analysis methods: first, principal component analysis (PCA) is used to efficiently reduce data dimension while maintaining the majority of the variability in the data, and then, the variance components analysis (VCA) fits a mixed linear model using factors of interest as random effects to estimate and partition the total variability.

The PVCA approach can be used as a screening tool to determine which sources of variability (biological, technical or other) are most prominent in a given microarray data set.

Using the eigenvalues associated with their corresponding eigenvectors as weights, associated variations of all factors are standardized and the magnitude of each source of variability (including each batch effect) is presented as a proportion of total variance.
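The final weighting step can be sketched as below. The sketch takes the eigenvalues of the retained principal components and, for each component, the variance estimates produced by the mixed model as given inputs; the function name and input layout are assumptions for illustration, and neither the PCA nor the mixed-model fit is shown.

```python
def pvca_proportions(eigenvalues, variance_components):
    """Eigenvalue-weighted average of per-component variance proportions.

    eigenvalues: one eigenvalue per retained principal component.
    variance_components: one dict per component, mapping factor name
    (including the residual) to its estimated variance.
    """
    weighted = {}
    for eigenvalue, components in zip(eigenvalues, variance_components):
        component_total = sum(components.values())
        for factor, variance in components.items():
            # Standardize within the component, then weight by its eigenvalue.
            share = variance / component_total
            weighted[factor] = weighted.get(factor, 0.0) + eigenvalue * share
    total_weight = sum(eigenvalues)
    return {factor: value / total_weight for factor, value in weighted.items()}


eigenvalues = [3.0, 1.0]
components = [
    {"batch": 2.0, "treatment": 1.0, "residual": 1.0},
    {"batch": 0.0, "treatment": 3.0, "residual": 1.0},
]
proportions = pvca_proportions(eigenvalues, components)
```

The returned proportions sum to 1, so each factor's value reads directly as its share of the total variance captured by the retained components.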

Although PVCA is a generic approach for quantifying the corresponding proportion of variation of each effect, it can be a handy assessment for estimating batch effect before and after batch normalization.

10) SA-Modk-Prototypes for Simultaneous Clustering of Gene Expression Data with Clinical Chemistry and Pathological Evaluations using Simulated Annealing --

The SA-Modk-prototypes algorithm clusters biological samples by simultaneously considering microarray gene expression data and classes of known phenotypic variables, such as clinical chemistry evaluations and histopathologic observations. To measure the dissimilarity of samples, it constructs an objective function that combines the sum of the squared Euclidean distances for numeric microarray and clinical chemistry data with simple matching for categorical histopathology values.

Simulated annealing is used to avoid local minima in search of the global solution.
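The way simulated annealing escapes local minima can be illustrated with the textbook Metropolis acceptance rule and a geometric cooling schedule. Both are assumptions here: the source does not state which acceptance rule or schedule SA-Modk-prototypes uses, and the function names are hypothetical.

```python
import math


def accept_move(delta, temperature, u):
    """Metropolis rule for minimization.

    delta: change in the objective if the move is made (new minus old).
    temperature: current annealing temperature (> 0).
    u: a uniform random draw in [0, 1), passed in to keep this deterministic.
    """
    if delta <= 0:
        return True  # improvements are always accepted
    # Worse moves are accepted with probability exp(-delta / temperature).
    return u < math.exp(-delta / temperature)


def geometric_cooling(initial_temperature, rate, steps):
    """Temperatures T0, T0*rate, T0*rate^2, ... for the given number of steps."""
    return [initial_temperature * rate ** k for k in range(steps)]


schedule = geometric_cooling(1.0, 0.5, 3)
```

At high temperature a move that worsens the clustering objective is often accepted, which lets the search leave a local minimum; as the temperature falls, such moves become rare and the search settles.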

Separate weighting terms are used for microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples.

The dynamic validity index for numeric data was modified with a category utility measure for determining the number of clusters in the data sets.

A cluster's prototype, formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members.

11) Systematic variation normalization (SVN) --

Systematic variation normalization (SVN) is a procedure for removing systematic variation in microarray gene expression data.

Based on an analysis of how systematic variation contributes to variability in microarray data sets, the SVN procedure includes background subtraction determined from the distribution of pixel intensity values and log conversion, linear or non-linear regression, restoration or transformation, and multi-array normalization.

When a non-linear regression is required, an empirical polynomial approximation approach is used. Either the high terminated points or their averaged values in the distributions of the pixel intensity values observed in control channels may be used for rescaling multi-array datasets.

These pre-processing steps remove systematic variation in the data attributable to variability in microarray slides, assay-batches, the array process, or experimenters. Biologically meaningful comparisons of gene expression patterns between control and test channels or among multiple arrays are therefore unbiased when normalized datasets are used.

*System Requirements*

Contact manufacturer.

*Manufacturer*

- National Institute of Environmental Health Sciences (NIEHS)
- Research Triangle Park
- North Carolina USA 27709-2233
- Fax: (919) 541-4395

**Manufacturer Web Site**
NIEHS Software Tools

**Price** Contact manufacturer.

**G6G Abstract Number** 20788

**G6G Manufacturer Number** 104362