## Multifactor Dimensionality Reduction (MDR) plus additional Tools

**Category** Genomics > Genetic Data Analysis/Tools and Genomics > Gene Expression Analysis/Profiling/Tools

**Abstract** Multifactor Dimensionality Reduction (MDR) is an open-source software package that implements the method of the same name.

Multifactor dimensionality reduction (MDR) (the method) was developed as a nonparametric and model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility.

Epistasis --

Epistasis or gene-gene interaction is a fundamental component of the genetic architecture of complex traits such as disease susceptibility. Epistasis has been recognized for many years and has been described essentially from two (2) different perspectives, biological and statistical.

Biological epistasis results from physical interactions among biomolecules in gene regulatory networks (GRNs) and biochemical pathways at the cellular level in an individual.

Statistical epistasis is the deviation from additivity in a linear mathematical model that describes the relationship between multi-locus genotypes and phenotype variation at the population level.
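The deviation from additivity can be made concrete with a small sketch. It assumes a simplified setting that is not part of the description above: two loci with only two genotype classes each, and a hypothetical 2x2 table of mean phenotype values (the function names are illustrative only). Under a purely additive model the interaction contrast is zero, so a nonzero contrast indicates statistical epistasis.

```python
def interaction_contrast(table):
    """Deviation from additivity for a 2x2 table of phenotype means.

    table[i][j] is the mean phenotype for genotype class i at locus A
    and genotype class j at locus B. An additive model gives exactly 0.
    """
    (y00, y01), (y10, y11) = table
    return y00 - y01 - y10 + y11


def marginal_effects(table):
    """Main effect of each locus, averaged over the other locus."""
    (y00, y01), (y10, y11) = table
    effect_a = (y10 + y11) / 2 - (y00 + y01) / 2
    effect_b = (y01 + y11) / 2 - (y00 + y10) / 2
    return effect_a, effect_b


# Hypothetical tables: one additive, one purely epistatic (XOR-like).
additive = [[0.1, 0.3], [0.4, 0.6]]
epistatic = [[0.1, 0.5], [0.5, 0.1]]

additive_contrast = interaction_contrast(additive)
epistatic_contrast = interaction_contrast(epistatic)
epistatic_main_effects = marginal_effects(epistatic)
```

In the second table neither locus has a main effect on its own, yet the interaction contrast is large. This is the kind of pattern that methods looking only at single-locus effects would miss.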

Epistasis, along with other phenomena such as locus heterogeneity, phenocopy, and gene-environment interaction, is a major source of complexity in the mapping relationship between genotype and phenotype.

Multifactor dimensionality reduction (MDR) --

As stated above, Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility.

The goal of MDR is to change the representation of the data using a constructive induction algorithm to make non-additive interactions easier to detect using any classification method such as naïve Bayes or logistic regression.

This is accomplished by first labeling each genotype combination as high-risk or low-risk using some function of a discrete endpoint such as case-control status.

A new MDR variable with two (2) levels is constructed by pooling all high-risk genotype combinations into one group and all low-risk combinations into another group.
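The labeling and pooling steps can be sketched as follows. This is a minimal illustration, not the package's implementation: the function names are hypothetical, and the risk rule (a genotype combination is high-risk when its case:control ratio exceeds the overall case:control ratio) is an assumed choice for the "function of a discrete endpoint".

```python
from collections import Counter


def mdr_risk_table(genotypes, labels, threshold=None):
    """Label each multi-locus genotype combination as high-risk (1) or low-risk (0).

    genotypes: list of tuples, one tuple of genotype codes per subject.
    labels:    list of 0/1 values (0 = control, 1 = case).
    threshold: case:control ratio above which a combination is high-risk.
               Assumption: defaults to the overall case:control ratio.
    """
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, labels) if y == 0)
    if threshold is None:
        threshold = sum(cases.values()) / max(sum(controls.values()), 1)
    table = {}
    for combo in set(cases) | set(controls):
        # Cross-multiplied form of cases/controls > threshold, which
        # avoids dividing by zero when a combination has no controls.
        table[combo] = 1 if cases[combo] > threshold * controls[combo] else 0
    return table


def mdr_transform(genotypes, table, default=0):
    """Build the two-level MDR variable by pooling combinations by risk label."""
    return [table.get(g, default) for g in genotypes]


# Hypothetical two-locus data with an XOR-like pattern and no main effects.
genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
labels = [0, 0, 1, 1, 1, 1, 0, 0]

risk_table = mdr_risk_table(genotypes, labels)
mdr_variable = mdr_transform(genotypes, risk_table)
```

The resulting single variable carries the interaction in a form that any classifier can use directly, even though neither locus is informative on its own.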

Traditionally, variables constructed using MDR have been evaluated with a probabilistic naïve Bayes classifier that is combined with 10-fold cross validation.

Ten-fold cross validation allows estimation of a testing accuracy of a model by leaving out 1/10 of the data as an independent test set. The model is developed on 9/10 of the data and then evaluated on the remaining test set.
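The hold-out scheme, applied in turn to each tenth of the data, can be sketched as below. The function names and the `fit`/`predict` callables are hypothetical; the majority-vote classifier is a simple stand-in used only to keep the example self-contained, not the naïve Bayes classifier mentioned above.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_accuracy(x, y, fit, predict, k=10, seed=0):
    """Average testing accuracy over k folds (requires len(x) >= k)."""
    accuracies = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        # The model only ever sees the training portion of the data.
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, x[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


def fit_majority(xs, ys):
    """Stand-in classifier: map each attribute value to its majority class."""
    votes = {}
    for xv, yv in zip(xs, ys):
        votes.setdefault(xv, [0, 0])[yv] += 1
    return {xv: int(counts[1] > counts[0]) for xv, counts in votes.items()}


def predict_majority(model, xv):
    """Predict the stored majority class, or 0 for an unseen value."""
    return model.get(xv, 0)


# Hypothetical data: a two-level attribute that predicts the endpoint perfectly.
x = [0, 1] * 20
y = [0, 1] * 20
folds = k_fold_indices(len(x))
testing_accuracy = cross_validated_accuracy(x, y, fit_majority, predict_majority)
```

For an honest estimate, any step that looks at the endpoint, including the high-risk/low-risk labeling, would need to happen inside `fit` so that it is repeated on each training portion.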

This process is repeated for each 1/10 of the data, and the resulting prediction accuracies are averaged.

Permutation testing has been used to statistically evaluate the results from MDR. In permutation testing, the endpoint labels are randomly shuffled, which breaks any real association between the variables and the endpoint. Running MDR on the shuffled data shows what would be expected by chance under the null hypothesis of no association.

For example, a 1,000-fold permutation test yields a distribution of 1,000 testing accuracies. The position of the observed model's testing accuracy within that distribution determines its *p-value*.

The advantage of permutation testing is that it controls for false positives due to multiple testing, as long as the entire MDR model fitting process is repeated in each permuted dataset.

The disadvantage of this approach is that permutation testing is computationally expensive and often not practical for large datasets such as those from genome-wide association studies (GWAS).
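A permutation test of this kind can be sketched as follows. The names are hypothetical, and `statistic` stands in for the entire model fitting process (attribute selection, MDR construction, and cross-validation) that returns a testing accuracy. Adding one to the numerator and denominator is an assumed convention that keeps the estimated *p-value* above zero.

```python
import random


def permutation_p_value(x, y, statistic, n_permutations=1000, seed=0):
    """Estimate a p-value for statistic(x, y) by shuffling the endpoint labels.

    statistic: callable that reruns the whole analysis and returns one
               number, where larger means a stronger association.
    """
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if statistic(x, shuffled) >= observed:
            at_least_as_extreme += 1
    # Assumed convention: count the observed result as one permutation.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def agreement(x, y):
    """Toy statistic: fraction of subjects whose attribute equals the endpoint."""
    return sum(a == b for a, b in zip(x, y)) / len(y)


# Hypothetical data in which the attribute matches the endpoint exactly.
x = [0, 1] * 25
y = list(x)
p_value = permutation_p_value(x, y, agreement, n_permutations=200)
```

The cost is visible in the loop: the full analysis runs once per permutation, which is why the approach becomes impractical when a single run is already expensive.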

The MDR method combines attribute selection, attribute construction, and classification with cross-validation and permutation testing to provide a comprehensive and advanced approach to detecting nonlinear interactions.

Exploratory Visual Analysis (EVA) --

EVA is a database and Graphical User Interface (GUI) for the exploratory visual analysis of statistical results (not raw data) from high-throughput genetic and genomic experiments.

How often have you been handed an Excel spreadsheet with >30,000 Affymetrix gene IDs and *p-values* from a statistical analysis and been left with the daunting challenge of extracting something biologically meaningful?

The EVA system allows you to store these results in a database together with knowledge about each gene from public databases such as Entrez Gene.

The GUI allows you to visually explore the *p-values* in the context of Gene Ontology (GO), biochemical pathway, protein domain, chromosomal location, or phenotype, thus facilitating biological interpretation.

The prototype EVA database was programmed in Oracle while the prototype EVA GUI was programmed in Visual Basic.

An open-source version of EVA in Java is under development and is available upon request from the manufacturer.

Symbolic Modeler (SyMod) --

The SyMod software package provides open-source access to two (2) different methods.

The first method, Symbolic Discriminant Analysis (SDA), was developed by the manufacturer as a nonlinear alternative to Fisher’s Linear Discriminant Analysis (LDA).

The goal of SDA is to identify the optimal combination of attributes and mathematical functions for predicting a discrete endpoint.

Unlike LDA, SDA makes no assumptions about the functional form of the model.

Given a list of attributes (e.g. gene expression variables) and mathematical functions (e.g. +, -, *, /, log, sqrt, abs, AND, OR, <, >, etc.), SDA optimizes model discovery using any wrapper algorithm.

The manufacturers have used ‘genetic programming’ (see the Genetic Programming Systems Category for additional info…) as a wrapper for SDA although other stochastic search methods such as simulated annealing could be used.
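The idea of searching over combinations of attributes and functions can be illustrated with a deliberately small sketch. It is not the SyMod implementation. It swaps the genetic programming wrapper for an exhaustive search, restricts the function set to three binary operators applied to one pair of attributes, and assumes a midpoint cutoff between the class means as the classification rule. All names are hypothetical.

```python
import itertools
import operator
from statistics import mean

# Assumed, heavily restricted function set for the sketch.
BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def candidate_models(n_attributes):
    """Enumerate every model of the form: attribute_i <op> attribute_j."""
    for i, j in itertools.combinations(range(n_attributes), 2):
        for name in BINARY_OPS:
            yield (i, name, j)


def accuracy(model, rows, labels):
    """Classify by comparing each discriminant score to the midpoint of the class means."""
    i, name, j = model
    scores = [BINARY_OPS[name](row[i], row[j]) for row in rows]
    mean1 = mean(s for s, y in zip(scores, labels) if y == 1)
    mean0 = mean(s for s, y in zip(scores, labels) if y == 0)
    cutoff = (mean1 + mean0) / 2
    # Predict class 1 on whichever side of the cutoff the class-1 mean lies.
    predictions = [int((s > cutoff) == (mean1 > mean0)) for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical data: the endpoint is 1 exactly when attributes 0 and 1
# share a sign; attribute 2 is uninformative.
rows = [(1, 1, 1), (1, 1, -1), (-1, -1, 1), (-1, -1, -1),
        (1, -1, 1), (1, -1, -1), (-1, 1, 1), (-1, 1, -1)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

best_model = max(candidate_models(3), key=lambda m: accuracy(m, rows, labels))
best_accuracy = accuracy(best_model, rows, labels)
```

Exhaustive enumeration is only feasible for a search space this small. With larger function sets and nested expressions the space grows too quickly, which is where a stochastic wrapper such as genetic programming or simulated annealing comes in.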

The second method that will be included in SyMod is symbolic regression.

Symbolic regression is similar to SDA but is used for continuous endpoints.

The alpha version of SyMod is ready for public testing; contact the manufacturer for additional info.

Weka-CG --

Weka is an open-source data mining software package with a number of advanced machine learning methods such as decision trees, neural networks (NNs), and support vector machines (SVMs).

The manufacturers are distributing their own version of Weka with integrated tools for computational genetics (CG).

The first new tool added to Weka-CG is the Multifactor Dimensionality Reduction (MDR) method (see above...).

Here, MDR has been added to Weka-CG as a filter for constructive induction so that constructed attributes (i.e. SNP combinations) can be analyzed with any number of different methods included in Weka (e.g. logistic regression).

*System Requirements*

Contact manufacturer.

*Manufacturer*

- Computational Genetics Laboratory
- Department of Genetics
- Dartmouth Medical School
- Lebanon, NH, USA

**Manufacturer Web Site**
Multifactor Dimensionality Reduction (MDR) plus additional Tools

**Price** Contact manufacturer.

**G6G Abstract Number** 20766

**G6G Manufacturer Number** 104344