Category Genomics>Genetic Data Analysis/Tools

Abstract Gentrepid is an advanced tool that can be used for candidate 'disease gene' prediction. Gentrepid utilizes methodology from the fields of structural bioinformatics and systems biology.

It uses a combined bioinformatics approach encompassing methods of domain comparison and protein pathway and interaction data analysis. The system combines two (2) methods for the automated prediction of disease genes within known disease intervals.

The first, Common Pathway Scanning (CPS), is based on the assumption that common phenotypes are generally associated with disruption in proteins that participate in the same complex or pathway.

Recently, it was shown that disease genes preferentially interact with other disease-causing genes, and a previous study predicted that 10% of proteins interacting with a disease gene product are likely to participate in the same disease.

The manufacturer's second method, Common Module Profiling (CMP), is based on the principle that candidate genes may have similar functions to disease genes that have already been determined.

CMP is similar in concept to methods using functional annotations, but many human proteins lack annotation and, therefore, similarities would be missed when comparing keywords alone.

CMP uses a domain-based comparative sequence analysis to identify those proteins with potential functional similarity.

Domain-based sequence comparison searches have been shown to be more accurate than full-sequence searches as commonly applied in BLAST or PSI-BLAST database searches.

Unlike the keyword systems, CMP calculates a measure of domain-based similarity to known disease genes rather than making a binary comparison.

Both methods use two (2) sources of input for disease gene prediction. First, known disease genes are used to predict novel disease genes in chromosomal intervals associated with the same disease.

Second, without knowledge of the disease genes, candidate disease genes are predicted by comparing all the genes in the multiple intervals associated with the same disease to find relationships between proteins linking the intervals.

The proteins may be related via a common pathway or shared domains.

Gentrepid Annotation pipeline --

All biological data in Gentrepid were combined into a relational database. Human disease gene information was extracted from the OMIM database and lists of genes flanking the disease genes were obtained from EntrezGene.

Protein sequence data were taken from GenBank and complete protein domain annotation was performed on all protein sequences using Pfam Hidden Markov models.

Finally, all genes were mapped to the latest pathway and protein-protein interaction (PPI) data.

There are currently over 250 biological pathway and network resources available.

The manufacturer utilized data from BioCarta (see G6G Abstract Number 20264), KEGG and OPHID, the most comprehensive databases of their type.

BioCarta and KEGG are chiefly pathway databases with BioCarta specializing in signaling pathways and KEGG in metabolic pathways.

OPHID is a secondary PPI database containing literature-derived interaction data from BIND, MINT and HPRD, as well as data from recent high-throughput experimentation. OPHID also contains transferred interactions from orthologous proteins in model organisms.

CPS --

Potential disease genes were predicted by identifying all proteins within a disease interval that are part of a pathway described in BioCarta or KEGG.

PPI data from OPHID were used to identify novel disease genes by finding the interaction partners of known disease genes in a disease interval.

Three levels of interactions were tested for potential disease genes, based on the shortest path length to a known disease gene.
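
The level-based search described above can be illustrated with a breadth-first search over an interaction graph. This is a minimal sketch only, not Gentrepid's implementation: the function name, the gene labels and the representation of PPI data as a list of pairs are all assumptions made for illustration.

```python
from collections import deque

def candidates_by_level(ppi_pairs, known_disease_genes, interval_genes, max_level=3):
    """Return {gene: level} for interval genes within max_level interactions
    of a known disease gene, where level is the shortest path length.
    (Hypothetical helper; the input format is an assumption.)"""
    # Build an undirected adjacency map from the interaction pairs.
    neighbours = {}
    for a, b in ppi_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Multi-source breadth-first search starting from all known disease genes.
    distance = {gene: 0 for gene in known_disease_genes}
    queue = deque(known_disease_genes)
    while queue:
        gene = queue.popleft()
        if distance[gene] == max_level:
            continue  # do not expand beyond the deepest level tested
        for partner in neighbours.get(gene, ()):
            if partner not in distance:
                distance[partner] = distance[gene] + 1
                queue.append(partner)

    # Keep only genes inside the disease interval, excluding the seeds themselves.
    return {g: d for g, d in distance.items() if g in interval_genes and d > 0}

# Made-up example: D1 is a known disease gene; A, C, E and X lie in the interval.
example_ppi = [("D1", "A"), ("A", "B"), ("B", "C"), ("C", "E")]
print(candidates_by_level(example_ppi, {"D1"}, {"A", "C", "E", "X"}))
```

In this toy network A is a direct partner (level 1) and C is reached at level 3, while E lies four interactions away and X has no interactions, so neither is reported.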

When CPS is applied across multiple intervals, i.e. in the absence of known disease genes, all interaction partners and pathways associated with the genes in each interval are compared across intervals.

Disease genes are predicted by identifying common pathways or interaction partners shared by the intervals associated with a specific phenotype.
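
The cross-interval comparison can be sketched as a grouping of genes by pathway, keeping only pathways that are hit in more than one interval. This is an illustrative sketch under stated assumptions: the function name, the interval and pathway identifiers, and the threshold of two intervals are hypothetical and are not taken from Gentrepid.

```python
from collections import defaultdict

def shared_pathway_candidates(intervals, gene_pathways, min_intervals=2):
    """intervals: {interval_id: set of gene ids}
    gene_pathways: {gene id: set of pathway ids}
    Returns {pathway: {interval_id: set of genes}} for every pathway that
    contains genes from at least min_intervals different intervals."""
    hits = defaultdict(lambda: defaultdict(set))
    for interval_id, genes in intervals.items():
        for gene in genes:
            for pathway in gene_pathways.get(gene, ()):
                hits[pathway][interval_id].add(gene)
    return {
        pathway: dict(by_interval)
        for pathway, by_interval in hits.items()
        if len(by_interval) >= min_intervals
    }

# Made-up example: three intervals linked to one phenotype.
example_intervals = {"interval_1": {"G1", "G2"}, "interval_2": {"G3"}, "interval_3": {"G4"}}
example_pathways = {"G1": {"P_a"}, "G2": {"P_b"}, "G3": {"P_a"}, "G4": {"P_c"}}
print(shared_pathway_candidates(example_intervals, example_pathways))
```

Here only pathway P_a links two intervals, so G1 and G3 would be the candidates. The same grouping could be applied to shared interaction partners instead of pathways.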

CMP --

CMP compares the Pfam domain content of each protein within a disease interval to identify putative disease genes. Different calculations are performed depending on whether CMP uses known disease genes or multiple intervals as input.

Known disease genes - When known disease genes are used as input, a protein (candidate) observed to have disease-like domains is assigned a score.

Scores are based on the similarity between the protein's domains and the domains in the known disease gene using SSEARCH bit scores. SSEARCH is an implementation of the Smith-Waterman local alignment algorithm. Scores are normalized by matching the equivalent region of the disease gene against itself on a domain-by-domain basis.
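
The normalization step can be read as dividing each candidate-versus-disease bit score by the bit score the disease gene's domain obtains against itself. The sketch below assumes that reading and takes precomputed bit scores as plain numbers; it does not run SSEARCH, and the function name and domain labels are hypothetical. How per-domain scores are combined into a single score is not stated in the abstract, so the sketch leaves them separate.

```python
def normalized_domain_scores(candidate_vs_disease, disease_self):
    """candidate_vs_disease: {domain: bit score of candidate domain vs disease-gene domain}
    disease_self: {domain: bit score of the disease-gene domain aligned to itself}
    Returns {domain: normalized score}; a score of 1.0 would mean the candidate
    domain scores as well as the disease gene's own domain."""
    scores = {}
    for domain, bits in candidate_vs_disease.items():
        self_bits = disease_self.get(domain)
        if not self_bits or self_bits <= 0:
            continue  # no usable self-score, so this domain cannot be normalized
        scores[domain] = bits / self_bits
    return scores

# Made-up bit scores for two shared domains.
example_scores = normalized_domain_scores(
    {"DOMAIN_A": 150.0, "DOMAIN_B": 40.0},
    {"DOMAIN_A": 300.0, "DOMAIN_B": 200.0},
)
print(example_scores)
```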

Multiple intervals - When CMP is used across multiple intervals, a census of all domains in every interval associated with the disease is taken.

Disease genes are predicted based on the similarity of their domain content to genes from other intervals associated with the phenotype. The domain combination is tested for over-representation in the intervals compared to the genome as a whole.
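
The abstract does not name the statistical test used for over-representation. As one common choice, and purely as an assumption, the sketch below uses a hypergeometric upper-tail probability: the chance of seeing at least k genes with a given domain combination among n interval genes, when K of the N genes in the genome carry it. The function name and all numbers are illustrative.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes without replacement from a genome of
    N genes, K of which carry the domain combination of interest."""
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return favourable / total

# Made-up numbers: 3 of 400 interval genes carry a combination
# found in 20 of 20000 genes genome-wide.
p_value = hypergeom_upper_tail(k=3, n=400, K=20, N=20000)
print(p_value)
```

A small probability would suggest that the domain combination occurs in the disease intervals more often than expected by chance. In practice a correction for testing many domain combinations would also be needed.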

Note: Gentrepid accelerates the disease gene discovery process, significantly reducing the cost of expensive experimental studies.

Successful identification of the disease gene enables targeted research on how mutations in the gene contribute to disease and provides specific leads towards cures.

System Requirements

Contact manufacturer.


Manufacturer Web Site Gentrepid

Price Contact manufacturer.

G6G Abstract Number 20418

G6G Manufacturer Number 104047