Category Cross-Omics>Data/Text Mining Systems/Tools and Cross-Omics>Knowledge Bases/Databases/Tools

Abstract BioGraph is a data integration and data mining platform for the exploration and discovery of biomedical information. This platform offers prioritizations of putative disease genes, supported by functional hypotheses.

The manufacturers have shown that BioGraph can retrospectively confirm recently discovered disease genes and identify potential susceptibility genes, outperforming existing technologies, without requiring prior domain knowledge.

Additionally, BioGraph allows for generic biomedical applications beyond gene discovery.

Motivation behind BioGraph --

For the computational identification of suitable targets among candidate genes in a biomedical context, the intelligence and intelligibility of the method are of vital importance for evaluating the prioritizations. Protein-protein interaction networks are often adopted, but are limited in functional expressivity.

Integrating multiple types of biomedical knowledge can enhance the quality of automatically generated functional hypotheses that relate a context (e.g., a disease) to a target set (e.g., a set of candidate genes).

What BioGraph provides --

BioGraph provides an online resource and data mining method for the automated inference of functional hypotheses between biomedical entities. Assessment of these hypotheses can consequently be used for the ranking of targets, in the context of a research domain, such as a disease.

BioGraph’s resource is a knowledge base that integrates many biomedical databases into a 'common network' of heterogeneous relations. These databases are selected based on their practices of manual curation by experts, guaranteeing that the integrated knowledge is accurate and valid.

The manufacturer's methodology generates a “map of relations” linking biomedical research subjects to potential targets, such as diseases, genes, ontology annotations, pathways, etc. and offers literature support for these putative functional hypotheses.

Assessment of these hypotheses’ plausibility and specificity to source and targets allows for various applications in the identification of promising research targets.

BioGraph’s Integration of heterogeneous knowledge sources --

BioGraph is based on the integration of 21 publicly available curated databases containing biomedical relations between heterogeneous biomedical entities such as genes, diseases, compounds, pathways, ontology terms, protein domains, disease and gene families, and microRNAs.

The integrated databases were selected based on their quality of relations with respect to the curation methods and peer-reviewed references to the literature.

The manufacturers did Not integrate databases constructed from high-throughput experiments with statistical or computational inferences where No manual curation of the indexed relations was performed.

The integrated databases in BioGraph consist of three (3) types:

1) Curated databases (e.g., OMIM and various protein-protein interaction databases) constructed by manual extraction of published, peer-reviewed knowledge about a specific type of relation, guaranteeing the quality of the relations in these databases.

2) Curated ontology databases (e.g., Gene Ontology and Medical Subject Headings) using hierarchical classifications of subjects.

3) Curated annotation databases (e.g., Gene Ontology Annotations and KEGG pathway database) that relate biomedical entities or concepts to ontology terms.

Relations between concepts are extracted from the knowledge resources and represented in a common format. Each relation is annotated with a semantic relation type (denoting the meaning of the relation, e.g., protein interaction or disease-drug) and with references to supporting literature, as provided by the integrated databases.

All relations in the network are weighted equally, independent of their support in the databases or the literature.
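The common format and the equal weighting described above can be pictured with a minimal sketch. The field names, the identifiers, and the choice to treat relations as undirected are assumptions made for illustration; they are not taken from BioGraph itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One relation in a common format (field names are hypothetical)."""
    source: str              # concept identifier, e.g. a gene
    target: str              # concept identifier, e.g. a disease
    relation_type: str       # semantic type, e.g. "protein interaction"
    references: tuple = ()   # identifiers of supporting literature


def build_adjacency(relations):
    """Collapse relations into an unweighted adjacency map.

    Assumption: relations are treated as undirected, and a pair of
    concepts counts once no matter how many references support it.
    """
    adjacency = defaultdict(set)
    for rel in relations:
        if rel.source == rel.target:
            continue  # ignore self-loops in this sketch
        adjacency[rel.source].add(rel.target)
        adjacency[rel.target].add(rel.source)
    return dict(adjacency)


# Invented example data: two references for the same pair add no weight.
relations = [
    Relation("GENE_A", "GENE_B", "protein interaction", ("REF_1",)),
    Relation("GENE_A", "GENE_B", "protein interaction", ("REF_2",)),
    Relation("GENE_B", "DISEASE_X", "gene disease", ("REF_3",)),
]
adjacency = build_adjacency(relations)
```

Storing neighbors in a set is what makes every relation count equally here: additional literature support for a pair does not change the structure of the network.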

To sanitize the resulting network for the subsequent data mining algorithms, concepts that are disconnected from the largest connected network are removed, and dangling concepts (i.e., concepts connected to only one other concept) are pruned.
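The two sanitization steps can be sketched as follows. This is an illustrative reading of the description, not BioGraph's code: the function names are hypothetical, the adjacency map is assumed to be symmetric, and whether pruning is repeated until no dangling concepts remain is not stated in the source, so it is exposed as an optional flag.

```python
from collections import deque


def largest_component(adjacency):
    """Return the set of nodes in the largest connected component."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def sanitize(adjacency, iterate=False):
    """Keep the largest component, then prune concepts with at most one neighbor.

    iterate=False performs a single pruning pass (the literal reading);
    iterate=True repeats until no dangling concept is left (an assumption).
    """
    keep = largest_component(adjacency)
    graph = {node: set(adjacency[node]) & keep for node in keep}
    while True:
        dangling = [node for node, nbrs in graph.items() if len(nbrs) <= 1]
        if not dangling:
            break
        for node in dangling:
            for neighbor in graph.pop(node):
                if neighbor in graph:
                    graph[neighbor].discard(node)
        if not iterate:
            break
    return graph


# Invented example: a triangle A-B-C, a dangling concept D, and a separate pair E-F.
network = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C"}, "E": {"F"}, "F": {"E"},
}
clean = sanitize(network)
```

In the example, E and F are dropped because they lie outside the largest component, and D is pruned because it has a single neighbor.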

As a result, the integrated network comprises 54,567 biomedical entities representing unique biomedical concepts and 425,353 unique relations among these entities, supported by 244,258 references to 52,866 items from the biomedical literature.

The integrated network is updated frequently as its source databases are updated, and the list of integrated databases may be extended with additional resources.

BioGraph’s Prioritization principle --

The manufacturers utilize stochastic random walks (trajectories on the network that consist of taking successive steps from one entity to a random related entity) on the knowledge network to measure the a priori importance or accessibility of concepts in a graph.

This technique determines the global centrality of concepts in the manufacturer's integrated network.

For this purpose, the manufacturers compute the limit distribution, which gives the probability of visiting each concept when performing an infinite random walk on the integrated network.

Google’s PageRank algorithm adopts a similar link analysis algorithm to rank web pages by their relative importance.
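As a rough illustration of such a limit distribution, the sketch below approximates it by power iteration on a small invented network. It assumes an unweighted, undirected network in which every concept has at least one neighbor, and a network that is connected and not bipartite so that the iteration converges; the function name and tolerances are hypothetical.

```python
def stationary_distribution(adjacency, tol=1e-12, max_iter=10_000):
    """Approximate the limit distribution of a simple random walk."""
    nodes = list(adjacency)
    prob = {node: 1.0 / len(nodes) for node in nodes}  # uniform start
    for _ in range(max_iter):
        nxt = dict.fromkeys(nodes, 0.0)
        for node, neighbors in adjacency.items():
            # A walker at `node` moves to each neighbor with equal probability.
            share = prob[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        delta = sum(abs(nxt[node] - prob[node]) for node in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


# Invented example network with five relations.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}
prior = stationary_distribution(network)
```

For an undirected network of this kind the result is proportional to the number of relations each concept has, so the well-connected concepts A and C receive the highest prior probabilities.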

Network hubs (top ranked concepts with a high prior probability) are generic and unspecific target concepts in the network. These hubs indicate important concepts for diverse biomedical processes, but should be avoided when trying to find relevant and non-obvious links between seemingly unrelated concepts.

To compute the proximity of targets to a source concept, analogously to the prior probabilities, the manufacturers compute the limit distribution of a stochastic model of random walks with restarts at the source concept (with a restart probability of 0.25 at each step).

As such, the manufacturers compute the a posteriori accessibility of each concept from the source concept, i.e., the probability of visiting each target concept from the source disease, pathway, etc.
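A random walk with restarts can be sketched in the same style. The restart probability of 0.25 comes from the description above; everything else (the function name, the unweighted undirected network, the tolerance, the example concepts) is an assumption made for illustration.

```python
def random_walk_with_restart(adjacency, source, restart=0.25,
                             tol=1e-12, max_iter=10_000):
    """Approximate the limit distribution of a walk that restarts at `source`."""
    nodes = list(adjacency)
    prob = dict.fromkeys(nodes, 0.0)
    prob[source] = 1.0  # the walk starts at the source concept
    for _ in range(max_iter):
        nxt = dict.fromkeys(nodes, 0.0)
        nxt[source] = restart  # probability mass that jumps back to the source
        for node, neighbors in adjacency.items():
            # The remaining mass follows a random relation out of `node`.
            share = (1.0 - restart) * prob[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        delta = sum(abs(nxt[node] - prob[node]) for node in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


# Invented chain: source S, intermediate concept M, target T.
network = {"S": {"M"}, "M": {"S", "T"}, "T": {"M"}}
posterior = random_walk_with_restart(network, "S")
```

Because a quarter of the probability mass returns to the source at every step, concepts far from the source receive less posterior probability than nearby ones; in the example, T ends up with less than M.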

Each concept is scored by its posterior probability divided by the square root of its prior probability, and concepts are ranked by this score.
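The scoring rule can be written in a few lines. The probabilities below are invented purely to show the effect of the normalization, and the function and concept names are hypothetical.

```python
import math


def rank_targets(posterior, prior, targets):
    """Score each target as posterior / sqrt(prior) and sort best first."""
    scores = {t: posterior[t] / math.sqrt(prior[t]) for t in targets}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented numbers: a hub with a high prior and a specific gene with a low prior.
prior = {"HUB_GENE": 0.09, "SPECIFIC_GENE": 0.0001}
posterior = {"HUB_GENE": 0.03, "SPECIFIC_GENE": 0.002}
ranking = rank_targets(posterior, prior, ["HUB_GENE", "SPECIFIC_GENE"])
```

Although the hub has the higher posterior probability in this example, dividing by the square root of the prior ranks the specific gene first, which matches the stated aim of avoiding generic hub concepts.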

In practice, for a gene prioritization query, a user of the web application provides a research subject (for example, a disease, but also a pathway, a GO annotation or a gene may represent a research subject) and a list of research targets (e.g., putative genes or compounds) that need to be ranked in relation to the research subject.

The manufacturer's algorithm then assesses and ranks the relations between the source concept and each of the target concepts.

Since any type of concept can be provided as the subject or target of a prioritization, the manufacturer's method does Not require prior domain knowledge from the user, i.e., there is No need to define a gene set of known disease causing genes for the identification of related genes, which results in a more reproducible and robust user experience.

BioGraph’s Automated generation of functional hypotheses --

The method of performing random walks to determine the accessibility of target concepts implicitly generates ensembles of indirect paths between source and target concepts, which may serve as functional hypotheses for highly ranking targets.

Using backtracking, the manufacturers can heuristically determine highly probable simple paths (i.e., paths that do Not contain cycles) of the random walk that starts in the source concept and ends in the target concept.

The backtracking heuristic incrementally builds partial candidate paths, working from the target back to the source, while abandoning the least likely paths along the way. This leads to valid and specific paths that offer incentives for further functional research.
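The exact heuristic is not described in detail here, so the sketch below substitutes a simple beam-style search as a stand-in: partial paths are grown backwards from the target, only the most probable ones are kept at each round, and paths that reach the source are collected. The function name, the `beam_width` and `max_length` parameters, and the use of plain step probabilities (ignoring restarts) are all assumptions.

```python
def likely_simple_paths(adjacency, source, target, beam_width=10, max_length=4):
    """Return simple source-to-target paths with their walk probabilities.

    Paths are built from the target back to the source. After each round
    only the `beam_width` most probable partial paths are kept.
    """
    partial = [((target,), 1.0)]  # paths are stored in forward order
    complete = []
    for _ in range(max_length):  # max_length is counted in relations (edges)
        extended = []
        for path, prob in partial:
            head = path[0]
            for previous in adjacency[head]:
                if previous in path:
                    continue  # keep the path simple (no cycles)
                # Probability of stepping from `previous` to `head`.
                step = 1.0 / len(adjacency[previous])
                candidate = ((previous,) + path, prob * step)
                if previous == source:
                    complete.append(candidate)
                else:
                    extended.append(candidate)
        extended.sort(key=lambda item: item[1], reverse=True)
        partial = extended[:beam_width]  # abandon the least likely paths
        if not partial:
            break
    complete.sort(key=lambda item: item[1], reverse=True)
    return complete


# Invented network: S reaches T via a specific concept A or via a hub.
network = {
    "S": {"A", "HUB"}, "A": {"S", "T"}, "T": {"A", "HUB"},
    "HUB": {"S", "T", "X", "Y", "Z"},
    "X": {"HUB"}, "Y": {"HUB"}, "Z": {"HUB"},
}
paths = likely_simple_paths(network, "S", "T")
```

In the example the path through A scores 0.25 and the path through the hub scores 0.1, so the more specific route is listed first.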

The resulting set of paths is presented to the user as a network with putative hypotheses linking the source to the target. Each directed edge represents a supporting relation among intermediate concepts, with annotated semantic meanings and literature references, intelligibly supporting the relation for evaluation by the user.

In cases where the target is highly ranked, specific and relevant connections and concepts are included in the constructed hypotheses. If the functional hypotheses linking the concepts are limited to visiting general hub concepts, this is usually a sign that the source and target concepts can be considered unrelated, which is reflected in a poor ranking score.

BioGraph’s Application Programming Interface (API) --

BioGraph provides RESTful XML web services to integrate their discovery services into your own software.

Note: The API is currently still under development and is subject to changes.

System Requirements

Contact manufacturer.


Manufacturer Web Site BioGraph

Price Contact manufacturer.

G6G Abstract Number 20190A

G6G Manufacturer Number 104213