geneid

Category Genomics>Genetic Data Analysis/Tools and Cross-Omics>Sequence Analysis/Tools

Abstract geneid is a software program that can be used to predict genes along a DNA sequence in a large set of organisms.

While its accuracy compares favorably to that of other existing tools, geneid is more efficient in terms of speed and memory usage and it offers some rudimentary support to integrate predictions from multiple sources.

geneid predicts genes in anonymous genomic sequences and is designed with a hierarchical structure.

In the first step, splice sites, start and stop codons are predicted and scored along the sequence using Position Weight Arrays (PWAs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov Model for coding DNA.
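The first two steps can be illustrated with a minimal sketch. It is not geneid's implementation: it assumes a plain position weight matrix rather than a full Position Weight Array, a first-order Markov chain instead of the higher orders normally used, and a uniform background. All function names and toy parameter values are hypothetical.

```python
import math

BASES = "ACGT"
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds_matrix(freqs):
    """Turn per-position base frequencies into log-odds weights."""
    return [{b: math.log(col[b] / BACKGROUND) for b in BASES} for col in freqs]


def site_score(matrix, seq, pos):
    """Sum the per-position weights over the window starting at pos."""
    return sum(col[seq[pos + i]] for i, col in enumerate(matrix))


def coding_llr(seq, coding, noncoding):
    """Log-likelihood ratio of first-order Markov chains, coding vs. non-coding."""
    return sum(math.log(coding[a][b] / noncoding[a][b]) for a, b in zip(seq, seq[1:]))


def exon_score(left_site, right_site, llr):
    """Exon score: the two defining site scores plus the coding log-likelihood ratio."""
    return left_site + right_site + llr


def peaked(base):
    """Hypothetical toy column that strongly prefers one base."""
    return {b: (0.97 if b == base else 0.01) for b in BASES}


# Hypothetical toy parameters, not taken from any geneid parameter file.
START = log_odds_matrix([peaked("A"), peaked("T"), peaked("G")])
NONCODING = {a: {b: 0.25 for b in BASES} for a in BASES}
CODING = {a: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2} for a in BASES}

seq = "CCATGACGT"
start = site_score(START, seq, 2)             # window "ATG" scores positively
llr = coding_llr(seq[2:], CODING, NONCODING)  # coding evidence downstream of the site
```

A window that matches the signal gets a positive site score and a mismatching window a negative one, so candidate sites can be ranked before exons are built from them.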

Finally, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons.
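The assembly step can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions, not geneid's algorithm: it only requires that the chosen exons do not overlap, and it ignores reading-frame compatibility and the rules of the gene model. The function name `assemble` is hypothetical.

```python
from bisect import bisect_left


def assemble(exons):
    """Pick the highest-scoring chain of non-overlapping exons.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (best_total, chosen_exons).
    """
    exons = sorted(exons, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    best = [(0.0, [])]  # best[i]: optimum using only the first i exons
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_left(ends, start, 0, i)
        take_total = best[j][0] + score
        if take_total > best[i][0]:
            best.append((take_total, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]


candidates = [(1, 100, 5.0), (50, 150, 8.0), (160, 300, 4.0), (120, 200, -1.0)]
total, chain = assemble(candidates)
```

In this toy input the overlapping exons (1, 100) and (50, 150) compete, and the chain keeps the higher-scoring one together with (160, 300). Copying the chain at each step keeps the sketch short; a back-pointer table would scale better.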

geneid supports integrating predictions from multiple sources via external General Feature Format (gff) files; redefining the ‘general gene structure’, or gene model, is also feasible.
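As an illustration of what such external evidence looks like, the sketch below reads tab-separated gff records with the standard columns (sequence name, source, feature, start, end, score, strand, frame, and an optional group field). The parser, its names, and the example records are hypothetical; the exact feature names geneid expects should be taken from its own documentation.

```python
from typing import List, NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    group: str


def parse_gff(text: str) -> List[GffRecord]:
    """Parse gff text, skipping blank lines and '#' comment lines."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
        score = None if cols[5] == "." else float(cols[5])  # "." marks a missing score
        group = cols[8] if len(cols) > 8 else ""
        records.append(GffRecord(cols[0], cols[1], cols[2], int(cols[3]),
                                 int(cols[4]), score, cols[6], cols[7], group))
    return records


# Hypothetical evidence records, built with explicit tabs for clarity.
EXAMPLE = "\n".join([
    "# hypothetical evidence records",
    "\t".join(["seq1", "est", "exon", "100", "250", "12.5", "+", "0", "geneA"]),
    "\t".join(["seq1", "blast", "hsp", "400", "520", ".", "-", "."]),
])

records = parse_gff(EXAMPLE)
```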

The accuracy of geneid compares favorably to that of other existing tools, but geneid is likely more efficient in terms of speed and memory usage. Currently, geneid v1.2 analyzes the whole human genome in 3 hours (approx. 1 Gbp / hour).

Its main features/capabilities include:

1) geneid accuracy is comparable to that of other existing "ab initio" gene prediction tools.

2) geneid is very efficient in terms of speed and memory usage. In practice, geneid can analyze ‘chromosome size sequences’ at a rate of about 1 Gbp per hour on a 2.80 GHz Intel(R) Xeon CPU.

For the largest human chromosome (chr1), it requires 1/2 Gbyte of RAM plus the size of the Fasta sequence.

3) geneid offers support to integrate predictions from multiple sources (ESTs, blast HSPs) and to re-annotate genomic sequences, via external gff files, together with the redefinition of the "gene model".

4) geneid output can be customized to different levels of detail, including an exhaustive listing of potential signals and exons.

Furthermore, several output formats such as gff or XML are available.

5) Parameter files are available in geneid v1.2 for Drosophila melanogaster, human (which can also be used for vertebrate genomes), Dictyostelium discoideum, and Tetraodon nigroviridis (which can be used for Fugu rubripes), among many other species spanning the four (4) "classical" kingdoms.

The currently available additional parameter files can be found under the "geneid parameter files" section on the manufacturer's web-site.

Training geneid --

To build a parameter file for geneid, it is necessary to "train" the program. Parameter configurations already exist for a number of eukaryotic species.

Training basically consists of computing position weight matrices (PWMs) or Markov models for the splice sites and start codons, and deriving a model for coding DNA (generally a Markov model of order 4 or 5).
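The two kinds of estimates mentioned above can be sketched as follows. This is a minimal illustration, not the geneid training procedure: it assumes the signal sequences are already aligned and of equal length, adds a pseudocount to avoid zero probabilities, uses a uniform background, and ignores any frame-specific handling of coding sequence. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"


def train_pwm(sites, background=0.25, pseudocount=1.0):
    """Log-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        matrix.append({b: math.log(((counts[b] + pseudocount) / total) / background)
                       for b in BASES})
    return matrix


def train_markov(seqs, order=5, pseudocount=1.0):
    """Transition probabilities P(next base | previous `order` bases)."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values()) + pseudocount * len(BASES)
        model[context] = {b: (nxt[b] + pseudocount) / total for b in BASES}
    return model


# Toy inputs: four start-codon-like sites and one short "coding" sequence.
pwm = train_pwm(["ATG", "ATG", "ATG", "CTG"])
markov = train_markov(["ACGACGACG"], order=2)
```

With real data the same counts would be gathered from the annotated gene models in the training set, and an order of 4 or 5 needs far more sequence than this toy example to estimate reliably.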

The basic requirements for a training set are an annotation file (preferably in geneid gff format) and a set of Fasta sequences corresponding to the gene models in the annotation file.

As few as 100 gene models can be enough to build a reasonably accurate geneid parameter file, but a user would generally want as many sequences as possible (> 500) to build an optimally accurate matrix and to be able to set aside some of the gene models for testing purposes.

If a user wants to evaluate the accuracy of the newly developed parameter file, they will also require an annotation file and Fasta files corresponding to the sequences in the evaluation set.

However, if a user has only a limited number of gene models to train geneid with (generally fewer than 500 sequences), they can use a "leave-one-out" strategy for evaluating the accuracy (more information is available in the training tutorial, provided by the manufacturer).
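The general idea of a leave-one-out evaluation can be sketched generically. The code below does not reproduce the tutorial's procedure: `train` and `evaluate` are hypothetical callbacks standing in for building a parameter file and measuring prediction accuracy on the held-out gene model, and the toy usage replaces gene models with plain numbers.

```python
def leave_one_out(items, train, evaluate):
    """Train on all items but one, score the held-out item, and average."""
    scores = []
    for i, held_out in enumerate(items):
        model = train(items[:i] + items[i + 1:])  # everything except item i
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)


# Toy stand-ins: the "model" is a mean and the "score" is the absolute error.
def toy_train(values):
    return sum(values) / len(values)


def toy_evaluate(model, value):
    return abs(model - value)


mean_error = leave_one_out([1.0, 2.0, 3.0], toy_train, toy_evaluate)
```

Every gene model is used for testing exactly once and for training in all other rounds, which is why the strategy suits small training sets.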

The user can go through an example of a typical geneid "training" protocol (Training geneid for the parasite Perkinsus marinus) by following the training tutorial.

Gene Predictions on Whole Genomes page --

The extensive predictions available on this page have been obtained by using the gene-finding software geneid and SGP2.

SGP2 (an additional product from this manufacturer) combines geneid predictions with a tblastx comparison of a query genome from one species (e.g. human) against an informant genome of another species (e.g. mouse).

Gene predictions on genomes --

The above (Gene Predictions on Whole Genomes page…) contains the sets of genes predicted by geneid on recently sequenced genomes (Drosophila melanogaster, Homo sapiens, Mus musculus, Fugu rubripes and Dictyostelium discoideum) for some of their most common releases.

geneid web server --

A geneid web server is available to submit sequences over the Internet. There is no limit to the length of the submitted sequence other than that imposed by the Internet (except when plotting is required).

System Requirements

Contact manufacturer.

Manufacturer

Manufacturer Web Site geneid

Price Contact manufacturer.

G6G Abstract Number 20389

G6G Manufacturer Number 104026