GeneSniffer

Category Genomics>Genetic Data Analysis/Tools

Abstract GeneSniffer is a computer program that was specifically developed to assist with the prioritization of candidate disease susceptibility genes within defined genomic intervals.

For each gene within a given genomic interval, GeneSniffer downloads the appropriate web pages from NCBI’s Gene, OMIM and PubMed databases and interrogates the text using a list of disease-specific keywords provided by the investigator. Each keyword is assigned a score between 1 and 10 depending on its significance.

Homologues of each gene are identified by the Basic Local Alignment Search Tool (BLAST), and these are scored and weighted according to the degree of homology.

A cumulative hit-score is calculated for each gene from the database hits and the weighted homologue database hits, and the output is presented as a web-page (in HTML format) to provide the investigator with information as to the source of the hits and links to relevant external web-pages.

The method can also use the observed LOD score function (the LOD score is a statistic giving the level of confidence in an estimate of linkage distance between two loci) in the region of the quantitative trait locus (QTL) to apply the localization data as a weighting function, so that genes are considered more relevant the closer they lie to the observed QTL peak.
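As an illustration of this localization weighting, the Python sketch below scales each gene's hit-score by the LOD score at the gene's position relative to the peak LOD score. The linear scaling, the function name `lod_weight` and the sample genes are assumptions made for illustration; the actual weighting function used by GeneSniffer is not described here.

```python
def lod_weight(lod_at_gene, peak_lod):
    """Return a weight between 0 and 1: 1.0 at the QTL peak, lower further away.

    Assumption: simple linear scaling of the local LOD score by the peak LOD score.
    """
    if peak_lod <= 0:
        raise ValueError("peak_lod must be positive")
    # Clamp so that negative or above-peak values stay within [0, 1].
    return max(0.0, min(1.0, lod_at_gene / peak_lod))


# Hypothetical genes: (symbol, raw hit-score, LOD score at the gene's position)
genes = [("GENE_A", 12, 3.6), ("GENE_B", 20, 1.8), ("GENE_C", 8, 0.0)]
peak = 3.6

# Weighted hit-score for each gene
weighted = {symbol: score * lod_weight(lod, peak) for symbol, score, lod in genes}

for symbol, value in sorted(weighted.items(), key=lambda item: item[1], reverse=True):
    print(symbol, round(value, 2))
```

With these sample values, GENE_A sits at the peak and keeps its full score, while GENE_B has the higher raw score but is down-weighted because it lies where the LOD score is half the peak value.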

GeneSniffer features/capabilities include:

1) GeneSniffer is written in Python, a free programming language similar to Perl.

2) GeneSniffer saves a significant amount of time compared to carrying out these analyses 'manually'.

3) GeneSniffer can be rerun at any time to incorporate database updates.

4) GeneSniffer removes human error from manual research techniques, but not database errors, so the outputs still need to be checked for false-positive results.

5) GeneSniffer scores can be used to prioritize 'candidate disease genes'.

6) GeneSniffer provides an informed starting point before expensive and time-consuming SNP analyses are undertaken.

7) GeneSniffer complements other gene identification techniques.

Note: GeneSniffer is not publicly available as a stand-alone program or as open-source code. If you wish to use GeneSniffer, please contact the manufacturer.

How GeneSniffer works --

1) List of genes - First, GeneSniffer collects the list of genes between two molecular coordinates corresponding to a linkage peak or other region of interest. This list is taken from NCBI's Map Viewer.

2) Collecting information from NCBI's Gene database - For each gene in the region, the Gene page from NCBI is collected and scanned for each of the keywords. If the words or phrases are present, their weightings are added to give a cumulative hit-score.
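The keyword scan in this step can be sketched in Python as follows. This is a minimal illustration, not GeneSniffer's own code: the function name `keyword_hit_score`, the sample keywords with their weightings, and the rule that each keyword counts once per page are all assumptions.

```python
import re


def keyword_hit_score(text, keyword_weights):
    """Sum the weightings of every keyword or phrase found in the text.

    Assumption: a keyword contributes its weighting once, however often it occurs.
    Returns the cumulative hit-score and the keywords that were found.
    """
    hits = {}
    for keyword, weight in keyword_weights.items():
        # Whole-word, case-insensitive match; re.escape keeps phrases literal.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits[keyword] = weight
    return sum(hits.values()), hits


# Hypothetical investigator keywords with weightings between 1 and 10
keywords = {"insulin": 8, "glucose transport": 6, "obesity": 4}
page_text = "This gene regulates insulin secretion and glucose transport in beta cells."

score, hits = keyword_hit_score(page_text, keywords)
print(score, hits)  # 14 {'insulin': 8, 'glucose transport': 6}
```

Returning the matched keywords alongside the score mirrors the report's aim of showing the investigator where each hit came from.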

3) Collecting information from OMIM and displaying results - For each gene the process is repeated for OMIM, if the information is available.

4) Searching and scanning PubMed abstracts - PubMed abstracts relevant to each gene are identified. Curated abstracts are collected from Gene and OMIM, and then an additional search of PubMed is carried out using alternative symbols and specific words from the gene description.
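One way to picture the additional PubMed search is as a query string built from a gene's alternative symbols and words taken from its description. The sketch below is an assumption about how such a query could be assembled: the function name `pubmed_query`, the choice to join all terms with OR, and the restriction to PubMed's Title/Abstract field are illustrative and not taken from GeneSniffer.

```python
def pubmed_query(alternative_symbols, description_words):
    """Build a PubMed search string from gene symbols and description words.

    Assumption: every term is searched in the Title/Abstract field and the
    terms are combined with OR.
    """
    terms = list(alternative_symbols) + list(description_words)
    if not terms:
        raise ValueError("at least one search term is required")
    return " OR ".join(f'"{term}"[Title/Abstract]' for term in terms)


# Example input: a gene symbol, one alias and one word from the gene description
query = pubmed_query(["LEP", "OB"], ["leptin"])
print(query)
```

The resulting string could be pasted into the PubMed search box; the abstracts returned would then be scanned for keywords in the same way as the Gene and OMIM pages.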

5) Viewing PubMed search results - To view the details of PubMed results, click on the gene symbol and go to the gene page.

The PubMed abstracts containing keywords are listed together with the keywords. Clicking on the reference will open the PubMed abstract.

6) Identifying and screening mouse orthologs - If curated, the mouse ortholog of each gene is identified through HomoloGene.

Pages of information are collected from the Mouse Genome Informatics database and screened for the keywords. Scores are given in the gene-list and results can be viewed through the gene page.

7) Identifying conserved domains - The protein sequence from each gene is searched with BLAST against the Conserved Domain (CD) database.

The number of conserved domains is given on the gene-list page and details can be seen on the gene page. Conserved domains can be used to infer function.

8) Expression in a disease-relevant tissue - UniGene's expression profiler is the current source of expression data. Expression within each of 32 tissues is estimated by comparing the number of Expressed Sequence Tags (ESTs) available from tissue-specific libraries.

One tissue of the 32 types available is selected as the most relevant tissue to the disease.

Using the Expression profiler, the program assesses whether the gene is expressed in this tissue of interest and then compares the level of expression with the other tissues.

The result is given as a number in the gene-list. A number higher than one indicates that the gene is expressed at relatively higher levels in the tissue of interest; a number lower than one indicates that the gene is expressed at relatively lower levels.

NE indicates that the gene is not expressed in the tissue of interest, and ND means that no data were available.

Clicking on the gene symbol shows details of the results in the gene page.
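The expression result described in this step can be sketched as a ratio. The Python example below assumes that the comparison is the expression level in the tissue of interest divided by the mean level in the other tissues; that formula, the function name `relative_expression` and the sample values are assumptions, since the exact calculation is not stated here.

```python
def relative_expression(level_by_tissue, tissue):
    """Return "ND", "NE", or a ratio of expression in `tissue` to the other tissues.

    Assumption: the ratio is the level in the tissue of interest divided by
    the mean level across all other tissues with data.
    """
    if tissue not in level_by_tissue:
        return "ND"  # no data available for the tissue of interest
    level = level_by_tissue[tissue]
    if level == 0:
        return "NE"  # not expressed in the tissue of interest
    others = [value for name, value in level_by_tissue.items() if name != tissue]
    mean_others = sum(others) / len(others) if others else 0.0
    if mean_others == 0:
        return float("inf")  # expressed only in the tissue of interest
    return level / mean_others


# Hypothetical EST-derived expression levels for one gene
levels = {"pancreas": 30.0, "liver": 10.0, "brain": 20.0}
print(relative_expression(levels, "pancreas"))  # 2.0
```

A value of 2.0 here would be read as the gene being expressed at about twice the average level of the other tissues.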

9) Druggability - The NCBI Gene page is screened with a list of terms which, if found, suggest that the gene or gene product might make a good drug target.

The incidence of these terms is used to compile a "druggability" score and the details of hit terms can be seen on the gene page.

10) Identification of homologs and gene family members - The protein sequence from each gene is used to BLAST the human non-redundant protein database to identify similar proteins. This can then be used to infer function.

The degree of homology is scored using % identity, bit score and the relative lengths of the query and matching proteins.

The number of proteins reaching the threshold to be identified as homologs is given in the gene-list and details can be viewed in the gene page.
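To make the homology scoring concrete, the sketch below combines % identity, bit score and relative protein lengths into a single weighting between 0 and 1. The formula, the bit-score threshold of 50 and the function name `homology_weight` are assumptions chosen for illustration; the actual scoring scheme and threshold are not given here.

```python
def homology_weight(percent_identity, bit_score, query_len, subject_len,
                    min_bit_score=50.0):
    """Return a homology weighting in [0, 1] for one BLAST hit.

    Assumptions: hits below `min_bit_score` are not treated as homologs
    (weight 0.0); otherwise the weight is the identity fraction multiplied
    by the ratio of the shorter to the longer protein length.
    """
    if query_len <= 0 or subject_len <= 0:
        raise ValueError("protein lengths must be positive")
    if bit_score < min_bit_score:
        return 0.0
    length_ratio = min(query_len, subject_len) / max(query_len, subject_len)
    return (percent_identity / 100.0) * length_ratio


# Hypothetical hit: 80% identity, bit score 200, proteins of 400 and 500 residues
weight = homology_weight(80.0, 200.0, 400, 500)
print(round(weight, 2))  # approximately 0.64
```

Including the length ratio penalizes hits where a short protein matches only part of a much longer one, which is one plausible reason for taking relative lengths into account.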

11) Screening homolog information - Each homolog that is identified is then subjected to a similar analysis: its Gene and OMIM pages and PubMed abstracts are screened for the keywords.

In the same fashion as for the gene, hit scores are totaled and weighted.

These scores are then multiplied by the homology weighting to give a final score for each homolog, which is shown in the first column of the homolog table. Homolog scores from each of the databases are totaled and given in the gene-list.

12) Totaling the hit-scores - Finally, once the analyses have been completed, the scores are totaled. The scores derived from screening each gene are totaled to give a gene total hit-score, and the scores derived from the homologs of each gene are totaled to give a homolog total hit-score.

These are then both added to give a grand total score that can be used for 'ranking disease candidacy'.
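The totaling described in steps 11 and 12 can be sketched as follows. The class names, field names and sample scores are hypothetical; only the arithmetic (homolog scores multiplied by the homology weighting, then gene and homolog totals added) follows the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Homolog:
    name: str
    homology_weight: float   # weighting from the homology scoring step
    database_scores: dict    # hit-score per database, e.g. {"Gene": 5}


@dataclass
class Gene:
    symbol: str
    database_scores: dict
    homologs: list = field(default_factory=list)


def gene_total(gene):
    """Total of the hit-scores obtained by screening the gene itself."""
    return sum(gene.database_scores.values())


def homolog_total(gene):
    """Total of the homolog hit-scores, each multiplied by its homology weighting."""
    return sum(h.homology_weight * sum(h.database_scores.values())
               for h in gene.homologs)


def grand_total(gene):
    """Gene total plus homolog total, used here to rank candidates."""
    return gene_total(gene) + homolog_total(gene)


# Hypothetical genes and scores
gene_a = Gene("GENE_A", {"Gene": 10, "OMIM": 4, "PubMed": 6},
              [Homolog("HOM_1", 0.5, {"Gene": 5, "OMIM": 3, "PubMed": 0})])
gene_b = Gene("GENE_B", {"Gene": 5, "OMIM": 0, "PubMed": 0})

ranking = sorted([gene_b, gene_a], key=grand_total, reverse=True)
for gene in ranking:
    print(gene.symbol, gene_total(gene), homolog_total(gene), grand_total(gene))
```

In this example GENE_A ranks first with a gene total of 20, a homolog total of 4.0 and a grand total of 24.0.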

13) Final results - The final results are presented as a list of genes with their corresponding scores.

14) Presenting results in a ‘Genome Browser’ - The results of GeneSniffer can also be output in General Feature Format (GFF) so that they can be displayed as interactive results.
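As a sketch of what such output could look like, the Python example below writes one scored gene as a tab-separated feature line with nine columns. It assumes the GFF3 flavor of the format, uses "GeneSniffer" as the source column and "gene" as the feature type, and assumes gene symbols contain no characters that GFF3 requires to be escaped; none of these details are confirmed for the real program.

```python
def gff3_line(seqid, start, end, strand, symbol, score):
    """Format one gene and its score as a GFF3 feature line.

    Columns: seqid, source, type, start, end, score, strand, phase, attributes.
    Assumption: `symbol` needs no GFF3 escaping.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    attributes = f"ID={symbol};Name={symbol}"
    columns = [seqid, "GeneSniffer", "gene", str(start), str(end),
               f"{score:g}", strand, ".", attributes]
    return "\t".join(columns)


# Hypothetical gene with a grand total score of 24
lines = ["##gff-version 3",
         gff3_line("chr1", 1000, 5000, "+", "GENE_A", 24.0)]
print("\n".join(lines))
```

A file of such lines can be loaded as a custom track in genome browsers that accept GFF3, with the score column carrying the hit-score.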

System Requirements

Contact manufacturer.

Manufacturer

Manufacturer Web Site GeneSniffer

Price Contact manufacturer.

G6G Abstract Number 20430

G6G Manufacturer Number 104058