Biolearn

Category Intelligent Software>Bayesian Network Systems/Tools, Cross-Omics>Pathway Analysis/Gene Regulatory Networks/Tools and Genomics>Gene Expression Analysis/Profiling/Tools

Abstract Biolearn is a general package for applying ‘probabilistic graphical models’ to biological applications.

Biolearn release 1.0 concentrates on ‘structure learning’ for Bayesian networks; future releases are expected to include other types of graphical models, support inference applications, and allow plug-and-play addition of new types of probability distributions, scoring functions, and search algorithms.

Structure learning for Bayesian networks: definition and uses --

1) Bayesian networks are probabilistic models that represent statistical dependencies among variables of interest, such as biomolecules.

They can be used to perform structure learning, the elucidation of influence connections among variables, which can reveal the underlying structure of signaling pathways and other biological systems.

2) High-dimensional data sets are abundant in a variety of scientific domains, including genetics, social networks, and astronomy.

To gain insight into the processes that generate these data, it is of interest to extract statistical dependencies between variables in the data. Based on these dependencies, it is possible to postulate causal relationships.

For example, one can formulate hypotheses about ‘gene regulation’ by understanding statistical dependencies in ‘gene expression’ data. An effective method for extracting such causal relationships is to learn the structure of a Bayesian network from the data.

Within the space of all possible network structures, one searches for the structure that best explains the data; a ‘scoring function’ is used to evaluate the quality of each candidate network, given the data.

Invoking the Biolearn structure-learning application --

The Biolearn structure learning application reads observation data on the nodes of a Bayesian network, optionally performs discretization on the data, and then applies a ‘search algorithm’ for finding an optimal network structure.

The user can choose among several possible search algorithms and several possible scoring functions (see below). The application can be invoked to run a single structure-learning search on the data, or to run any number of structure-learning searches on random samples of the data and perform model averaging.

The results are output as text files, and optionally also visualized graphically. The distribution includes four (4) command files that can be used to invoke the Biolearn structure-learning application: two Windows .bat files and two UNIX .sh files.

For each system, one command file invokes the application as a command-line utility, running non-interactively and providing its output only as a text output file. The other invokes the application as a ‘graphic interactive’ application, providing its output both as a text output file and through graphic visualization.

A run of the Biolearn application is guided by a Specification file: a text file specifying the input data; the choice of algorithm, scoring function, and discretization method; and other user-controlled options. The command files accept between one (1) and three (3) command-line arguments:

1) The first, mandatory argument is the directory containing the five (5) jar files.

2) The second, optional argument is the name of the Specification file. If the second argument is omitted, the application by default searches for a file named ‘biolearn.spec.txt’ in the current directory.

The command-line version of the application fails if it cannot find the specification file; the interactive version, if it cannot find the specification file, opens a file-chooser window so the user can locate it interactively.

3) The third, optional argument specifies the starting point of the search.
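Assuming command-file names like those below (the names here are illustrative; check the distribution for the actual file names), typical UNIX invocations would look like:

```shell
# Hypothetical script names -- consult the distribution for the real ones.

# Non-interactive, command-line run: jar directory plus explicit spec file
sh biolearn-cmdline.sh /path/to/jars myrun.spec.txt

# Interactive, graphic run: with the spec-file argument omitted, the
# application looks for biolearn.spec.txt in the current directory
sh biolearn-interactive.sh /path/to/jars
```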

The Biolearn Specification file --

Each line in the Specification file specifies the user's choice for one of the user-controlled options for the run. In a ‘non-interactive’ invocation, the input data and all options for the run must be specified in the specification file; in an ‘interactive run’, the input data and some of the options can be specified or changed interactively.
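As a purely hypothetical illustration of the one-option-per-line layout (the keywords below are invented for illustration, not actual Biolearn syntax; consult the User Manual for the real Specification file format):

```
InputData        expression_data.txt
ScoringFunction  BDe
SearchAlgorithm  GreedyHillClimbing
Discretization   3
OutputFile       network_out.txt
```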

Scoring functions --

The Biolearn application provides a choice of three (3) scoring functions that can be used in the ‘structure-learning search’; each scoring function is associated with a different type of probability distribution on the variables.

1) BDe scoring function, using ‘conditional probability tables’ for the probability distributions on the variables.

2) NormalGamma scoring function, using ‘regression trees’ for the probability distributions on the variables.

3) Mean Square Error as the scoring function, using ‘linear Gaussian’ probability distributions on the variables.
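To illustrate the idea behind the MeanSquareError score (a generic sketch of linear Gaussian scoring, not Biolearn's implementation), the local score of a variable given a candidate parent can be taken as the negative mean squared residual of a least-squares linear fit; higher is better:

```python
from statistics import mean

def mse_score(child, parent=None):
    """Negative mean squared error of the best linear fit of `child`
    on `parent` (a linear Gaussian model with one parent); with no
    parent, the fit is simply the mean of `child`."""
    if parent is None:
        mu = mean(child)
        return -mean((y - mu) ** 2 for y in child)
    mx, my = mean(parent), mean(child)
    sxx = sum((x - mx) ** 2 for x in parent)
    sxy = sum((x - mx) * (y - my) for x, y in zip(parent, child))
    b = sxy / sxx          # regression slope
    a = my - b * mx        # intercept
    return -mean((y - (a + b * x)) ** 2 for x, y in zip(parent, child))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.0, 9.1]   # roughly y = 2x + 1
print(mse_score(ys, xs) > mse_score(ys))  # True: the parent improves the score
```

A structure-learning search compares such local scores across candidate parent sets, preferring parents that explain more of a variable's variance.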

Edge and split penalties --

The user may specify a ‘score penalty’ on adding edges to the network, and similarly on adding splits (applicable when regression-tree distributions are used).

Discretization --

The NormalGamma and MeanSquareError ‘scoring functions’ are suitable for dealing directly with continuous input data.

The BDe scoring function, however, requires discrete input data; if the BDe scoring function is used and the input data is continuous, it is automatically discretized.
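As an example of what discretization does, a simple equal-frequency scheme mapping continuous values to three discrete levels (one common approach; Biolearn's actual discretization methods and options are chosen in the Specification file) could look like:

```python
def discretize(values, levels=3):
    """Equal-frequency discretization: map each continuous value to a
    level 0..levels-1 according to its rank in the sorted data.
    (A sketch of one common scheme, not Biolearn's implementation.)"""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * levels // len(values)
    return labels

print(discretize([0.1, 5.2, 2.3, 9.9, 4.4, 7.7], levels=3))
# → [0, 1, 0, 2, 1, 2]
```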

Search algorithms --

The application provides a choice of several search algorithms for finding the optimal structure.

1) Greedy Hill Climbing - likely the most useful algorithm for most applications. It is an enhanced version of the classic greedy hill-climbing algorithm, using random restarts to avoid getting stuck too easily in a local maximum, and using ‘Tabu search’ to traverse plateaus (i.e., regions of the search space in which a single step leaves the score unchanged).

2) Sparse Candidate - The Sparse Candidate algorithm is designed for ‘structure learning’ in very large networks. It is described in Learning Bayesian network structure from massive datasets: The “sparse candidate” algorithm, Friedman, Nachman and Pe'er, Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence 1999, 206-215.

As a rule of thumb, if the search involves more than 50 variables, use of the Sparse Candidate algorithm is likely to be necessary.

3) Exhaustive search - prohibitively expensive except for very small networks, but guaranteed to find the optimal structure when the network is small enough. It exhaustively checks all possible network structures and outputs the one with the highest score.
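As a sketch of the core idea behind greedy hill climbing over structures (omitting the random restarts and Tabu search that Biolearn adds, and using a toy scoring function in place of the real ones):

```python
from itertools import permutations

def is_acyclic(n, edges):
    """Return True if the directed graph on nodes 0..n-1 has no cycle."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
    state = [0] * n  # 0 = unvisited, 1 = on DFS stack, 2 = finished

    def dfs(u):
        state[u] = 1
        for v in adj[u]:
            if state[v] == 1 or (state[v] == 0 and not dfs(v)):
                return False
        state[u] = 2
        return True

    return all(state[i] == 2 or dfs(i) for i in range(n))

def hill_climb(n, score):
    """Greedy hill climbing over DAG structures: repeatedly apply the
    single edge addition or deletion that most improves score(edges),
    stopping at a local maximum.  `score` maps a frozenset of (u, v)
    edges to a number."""
    edges, current = frozenset(), score(frozenset())
    improved = True
    while improved:
        improved = False
        candidates = []
        for e in permutations(range(n), 2):
            new = edges - {e} if e in edges else edges | {e}
            if is_acyclic(n, new):
                candidates.append((score(new), new))
        best, best_edges = max(candidates, key=lambda c: c[0])
        if best > current:
            edges, current, improved = best_edges, best, True
    return edges, current

# Toy score that rewards exactly the chain 0 -> 1 -> 2.
target = {(0, 1), (1, 2)}
edges, s = hill_climb(3, lambda e: len(e & target) - len(e - target))
print(sorted(edges), s)  # [(0, 1), (1, 2)] 2
```

With a real decomposable score (such as BDe or MeanSquareError) substituted for the toy score, this loop is the skeleton that the enhanced algorithm builds on.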

Biolearn Documentation --

The manufacturer provides an extensive, well-documented User Manual in PDF format.

System Requirements

Biolearn is implemented in Java and distributed as a jar. It is compiled with Java version 1.6.0, and therefore requires Java 1.6.0 or later.

In addition to the Biolearn jar, the distribution also includes four (4) jars from open-source providers that are used for specific functions of Biolearn:

1) Jung-1.7.6.jar, provided by the JUNG Framework Development team.

2) Commons-collections-3.2.jar, provided by the Apache Commons project.

3) Colt.jar, provided by the Colt project.

(Note: The above three (3) jars are used for the graphical visualization of Bayesian networks); and

4) Jama-1.0.2.jar, provided by NIST, and used for implementing the MeanSquareError scoring function.

Manufacturer

Manufacturer Web Site Biolearn

Price Contact manufacturer.

G6G Abstract Number 20602

G6G Manufacturer Number 104203