OpenBiomind

Category Genomics>Gene Expression Analysis/Profiling/Tools and Genomics>Genetic Data Analysis/Tools

Abstract OpenBiomind is a toolkit for the analysis of ‘gene expression’, SNP and other ‘biological datasets’ using advanced ‘machine learning’ and ‘pattern mining’ techniques, and includes traditional clustering, hybrid clustering, and other techniques.

OpenBiomind is both modular and command-line driven. Components communicate via standardized file formats.

OpenBiomind commands (see below...) cover:

1) Dataset enhancement, with information extracted from gene and protein ontologies, as well as other dataset treatments/transformations (such as ‘feature selection’ and the creation of validation folds).

2) Generation of multiple classification models for a given dataset, using several modalities of Genetic Programming (GP). (An option to use MOSES will be integrated soon.)

MOSES (Meta-Optimizing Semantic Evolutionary Search) is a new approach to ‘program evolution’, based on representation-building and probabilistic modeling. MOSES has been successfully applied to solve hard problems in domains such as ‘computational biology’, sentiment evaluation, and agent control.

Its results tend to be more accurate, and to require fewer objective function evaluations, than those of other program evolution systems. Best of all, the result of running MOSES is not a large nested structure or numerical vector, but a compact and ‘comprehensible program’ written in a simple Lisp-like mini-language.

3) Important Feature computation.

4) Clustering using direct expression or MOBRA/MUTIC transformations.

5) Cluster visualization (conventional, raster, and color-coded).

6) Graph visualization showing various inter-relationships among features: co-expression, co-occurrence and utility and differentiation ranks.

7) Multiple “pipelines” (workflows) - sequences of chained commands where the output of one or more commands is used as the input for a different command, including a “complete pipeline” command where all possible chains are explored.

OpenBiomind commands --

1) EnhanceDataset: adds “synthetic” or “enhanced” features to datasets based on the match of the genes in a dataset to existing gene and protein ontologies. In other words, it is a way to enrich the dataset using systematized ‘domain knowledge’.

This command is in fact a relatively sophisticated kind of ‘dataset transformation’, based on ontological data known about genes. For the purposes of OpenBiomind, an ontology is a set of ‘gene categories’ defined by some aspect of biological relevance.

As examples, GO (Gene Ontology) is a set of functional gene categories, while PIR (Protein Information Resource) is a set of gene categories defined by protein types.
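As a rough illustration of the idea of category-based enhancement, the sketch below adds one synthetic feature per gene category to each sample. The aggregation rule (mean expression of the member genes), the function name `enhance`, and the dictionary-based data layout are all assumptions made for this example; the source does not state how EnhanceDataset actually computes its enhanced features.

```python
from statistics import mean

def enhance(samples, categories):
    """Add one synthetic feature per category to every sample.

    samples:    list of dicts mapping gene name -> expression value
    categories: dict mapping category name -> list of member gene names
    Assumption: a category's value is the mean expression of its member
    genes that are present in the sample.
    """
    enhanced = []
    for sample in samples:
        row = dict(sample)  # keep the original features
        for name, genes in categories.items():
            values = [sample[g] for g in genes if g in sample]
            if values:  # skip categories with no matching gene
                row[name] = mean(values)
        enhanced.append(row)
    return enhanced
```

Categories with no matching gene in the dataset are simply skipped, so the original features are never altered.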

2) DatasetTransformer: used for dividing a given dataset into ‘validation folds’ and the selection of features more relevant to the classification problem at hand.

This command “transforms” a given dataset into a package of train-test cross-validation folds, and may also apply feature selection to the dataset. Division into folds and feature selection are put in the same command because of a logical dependency: feature selection is fair only if features are selected on the training dataset and then applied to the testing dataset.

It can also be used to apply feature selection to a pre-defined pair of train-test datasets.
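The dependency between folds and feature selection can be sketched as follows. This is a minimal illustration, not OpenBiomind's implementation: the function names, the in-memory data layout, and the ranking score (absolute difference of class means) are assumptions chosen for brevity. The point it shows is that features are ranked on the training samples of each fold only, and the same selection is then applied to that fold's test samples.

```python
import random
from statistics import mean

def make_folds(n_samples, k, seed=0):
    """Split sample indices into k (train, test) index pairs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = []
    for i in range(k):
        test = sorted(idx[i::k])
        held_out = set(test)
        train = sorted(j for j in idx if j not in held_out)
        folds.append((train, test))
    return folds

def select_features(rows, labels, train, n_keep):
    """Rank features using training samples only; return top n_keep indices.

    Assumed score: absolute difference between the two class means.
    """
    scores = []
    for f in range(len(rows[0])):
        pos = [rows[i][f] for i in train if labels[i] == 1]
        neg = [rows[i][f] for i in train if labels[i] == 0]
        score = abs(mean(pos) - mean(neg)) if pos and neg else 0.0
        scores.append((score, f))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(f for _, f in scores[:n_keep])

def fold_and_select(rows, labels, k, n_keep, seed=0):
    """Build k folds, selecting features separately inside each fold."""
    result = []
    for train, test in make_folds(len(rows), k, seed):
        keep = select_features(rows, labels, train, n_keep)
        result.append({
            "features": keep,
            "train": [[rows[i][f] for f in keep] for i in train],
            "test": [[rows[i][f] for f in keep] for i in test],
        })
    return result
```

Because the test samples never influence the ranking, the accuracy measured on each test fold is not inflated by information leaking from the selection step.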

3) MetaTask: executes a large number of ‘non-deterministic classification’ experiments (or tasks) over the same dataset.

4) UtilityComputer: computes the utility of genes (as well as enhanced features) based on their frequency across a large number of classification models (generated by a MetaTask).
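The frequency-based notion of utility can be illustrated with a short sketch. It assumes, hypothetically, that each classification model has already been reduced to the collection of feature names it uses, and it normalizes counts by the number of models; neither the representation nor the normalization is specified by the source.

```python
from collections import Counter

def feature_utility(models):
    """Return the fraction of models in which each feature appears.

    models: iterable of collections of feature names, one per model.
    A feature used several times inside one model is counted once.
    """
    models = list(models)
    if not models:
        return {}
    counts = Counter()
    for features in models:
        counts.update(set(features))
    return {f: c / len(models) for f, c in counts.items()}
```

Sorting the resulting dictionary by value gives a simple ranking of the features that the models rely on most often.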

5) ClusteringTransformer: generates clustering datasets using inputs from either categorical datasets or large numbers of classification models (again, those generated by a MetaTask).

6) Clusterize: clustering using Omniclust, OpenBiomind's simple clustering algorithm.

The Clusterize command operates on clustering data (like the data produced by MUTIC and MOBRA transformations), grouping features in that data according to their similarity. Currently, the only clustering algorithm effectively implemented is Omniclust, and therefore the clustering method to be used cannot yet be set as a parameter.

7) ViewClusters: ‘clustering visualization’ in classic raster image view. The ViewClusters command produces a raster graphic image showing a given clustering result.

8) GraphFeatures: portrays relations (of co-expression, co-occurrence in models) among genes (and enhanced features) in graph visualization, also displaying several other kinds of related information.

This command combines results generated by several other OpenBiomind processes - mainly clustering-related dataset transformations and computation of useful features - in order to generate a graph that is a visualization of some relationship(s) between the most useful features found in a MetaTask.
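One of the relationships mentioned above, co-occurrence in models, can be sketched as a weighted edge list. The representation of models as collections of feature names, the function name, and the `min_count` threshold are assumptions for this illustration and are not taken from GraphFeatures itself.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(models, min_count=1):
    """Count how many models contain each pair of features.

    Returns a dict mapping (feature_a, feature_b), with feature_a < feature_b,
    to the number of models using both; pairs below min_count are dropped.
    """
    counts = Counter()
    for features in models:
        for a, b in combinations(sorted(set(features)), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}
```

Raising `min_count` keeps only the stronger relationships, which is one simple way to keep a drawn graph readable.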

9) SimpleGraph: similar to GraphFeatures, but with far fewer parameters; it uses heuristics that guarantee a clear-looking, fully connected graph.

10) CompletePipeline: runs all the above processes over a categorical dataset, in a coordinated way, channeling outputs of some commands as inputs to others, etc.

CompletePipeline performs the “usual” OpenBiomind operations on a given dataset (or pair of training and testing datasets), encapsulating in a pipeline (workflow) the inputs and outputs of multiple commands.

11) FoldSelectSNPs: converts a SNP dataset into a float-coded version divided in folds and feature-selected - that is, ready for MetaTasks.

FoldSelectSNPs is much like a version of DatasetTransformer (see above...) specifically adapted to deal with SNP datasets.

SNPs, or Single-Nucleotide Polymorphisms, basically represent allelic information focused on the level of individual bases instead of whole genes.

A SNP value will thus typically be represented by a pair of symbols denoting one of the possible allelic pairings, for instance AA, AB and BB. Therefore, SNP data is discrete, in contrast to ‘gene expression’ data, which is represented by continuous numeric values.

12) SNPUtilityComputer: similar to UtilityComputer (see above...) and operating under the same principles, but dealing with ‘conceptual particularities’ of SNP experiments.

While UtilityComputer accounts for the frequency of usage of genes across ‘classification models’ learned from gene expression data, SNPUtilityComputer does the same for models learned from SNP data.

One may wonder why a different command is necessary for dealing with SNP features when the basic principle for gauging ‘feature utility’ (frequency across models) is the same.

The answer to that lies in biology: ‘genes’ associated with important SNPs are also very important (perhaps even ‘more’ important than their individual important SNPs, depending on the focus of the research project at hand).

Therefore, SNPUtilityComputer also computes the utility of ‘genes’ in a given MetaTask output generated from SNP data. In the case of a gene g, its “SNP utility” is computed as the percent of models containing SNPs belonging to g.
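The gene-level definition above can be written out directly. In this sketch the SNP-to-gene mapping, the representation of each model as a collection of SNP identifiers, and the function name are hypothetical; only the rule itself (percent of models containing at least one SNP belonging to g) comes from the text.

```python
def gene_snp_utility(models, snp_to_gene):
    """Percent of models that contain at least one SNP of each gene.

    models:      iterable of collections of SNP identifiers, one per model
    snp_to_gene: dict mapping SNP identifier -> gene name
    """
    models = list(models)
    if not models:
        return {}
    hits = {}
    for snps in models:
        # Each gene is counted at most once per model.
        genes = {snp_to_gene[s] for s in snps if s in snp_to_gene}
        for gene in genes:
            hits[gene] = hits.get(gene, 0) + 1
    return {gene: 100.0 * c / len(models) for gene, c in hits.items()}
```

Note that a gene can score higher than any of its individual SNPs, since different models may pick different SNPs of the same gene.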

System Requirements

OpenBiomind is developed in Java and designed for portability and simplicity. OpenBiomind file formats (for data and for results) are human-readable plain text and can be manipulated with other standard tools such as the GNU coreutils, sed, awk, Perl, Python, etc. Contact manufacturer.

Manufacturer

Manufacturer Web Site OpenBiomind

Price GNU General Public License, version 2.

G6G Abstract Number 20558

G6G Manufacturer Number 100427