Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract GeneTUKit is a document-level gene normalization software system for full-text articles.

This software employs both local context surrounding gene mentions and global context from the whole full-text document.

It can normalize genes of different species simultaneously. Given a target article, the software outputs a list of normalized genes, and each predicted gene is associated with a confidence score.

In BioCreAtIvE III, the system performed well among 37 runs: it ranked first, fourth and seventh in terms of TAP-20, TAP-10 and TAP-5, respectively, on the 507 full-text test articles.

GeneTUKit departs from previous systems --

GeneTUKit departs from previous systems in two (2) aspects:

1) First, it combines local and global contexts to normalize genes at the document-level.

The goal of this software is not to normalize every "mention" correctly, but to suggest a list of normalized genes for a target document, to assist human annotators.

Most previous systems normalize genes at the mention-level and employ only the local context surrounding a mention (e.g. the sentence where the mention was recognized).

However, due to the high ambiguity of gene names, it may be insufficient to use only local context: inter-sentential or document-level context can be helpful with this task.

2) Second, GeneTUKit is designed for simultaneously normalizing genes of many different species, for full-text articles.

It is not limited to any specific organism, but rather deals with all species present in a gene database (such as Entrez Gene).

GeneTUKit has four (4) main modules --

The first module is for gene mention recognition, the second one for gene ID candidate generation and the third one for gene ID disambiguation.

In the fourth module, the software generates a confidence score for each gene ID, which indicates the strength of the association between that gene ID and the document.

1) First module - The manufacturers used three (3) methods for recognizing gene mentions in the first module.

The input text is processed by each method separately, and a mention is retained only if it is recognized by at least two methods.

If two mentions overlap but have different boundaries, the overlapping part is taken as the final mention.
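The voting rule above can be sketched as follows. This is only an illustration, not the GeneTUKit implementation: the function name vote_mentions, the character-offset span format (start, end) with an exclusive end, and the three recognizer outputs are all assumptions made for the example. The sketch keeps every stretch of text covered by at least two recognizers, which yields the overlapping part when boundaries differ.

```python
from collections import Counter

def vote_mentions(span_lists, min_votes=2):
    """Return maximal character spans covered by at least `min_votes` recognizers.

    `span_lists` holds one list of (start, end) spans per recognizer,
    with `end` exclusive. Hypothetical helper, not part of GeneTUKit.
    """
    votes = Counter()
    for spans in span_lists:
        covered = set()  # each recognizer votes at most once per character
        for start, end in spans:
            covered.update(range(start, end))
        votes.update(covered)

    kept = sorted(pos for pos, n in votes.items() if n >= min_votes)

    # Merge consecutive character positions back into spans.
    merged = []
    for pos in kept:
        if merged and pos == merged[-1][1]:
            merged[-1][1] = pos + 1
        else:
            merged.append([pos, pos + 1])
    return [tuple(span) for span in merged]

text = "The VDR gene and p65 protein"
recognizer_a = [(4, 12), (17, 20)]  # "VDR gene", "p65"
recognizer_b = [(4, 7)]             # "VDR"
recognizer_c = [(17, 28)]           # "p65 protein"

final = vote_mentions([recognizer_a, recognizer_b, recognizer_c])
final_text = [text[start:end] for start, end in final]
```

Here "VDR gene" and "VDR" agree only on "VDR", and "p65" and "p65 protein" agree only on "p65", so those two overlapping parts become the final mentions.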

2) Second module - The second module generates gene ID candidates for a recognized mention. In this module, an open-source indexing package, Lucene, was used to index all the genes in Entrez Gene.

Each mention was then queried and the top 50 gene IDs were returned as candidates.
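The index-then-query pattern can be illustrated with a small in-memory stand-in. GeneTUKit used Lucene for this step; the sketch below does not call Lucene and substitutes a plain token-overlap score for Lucene's ranking. The gene identifiers G1 to G3, the helper names build_index and candidates, and the scoring rule are assumptions for illustration only.

```python
from collections import defaultdict

def build_index(genes):
    """Map each lowercased name token to the set of gene IDs containing it."""
    index = defaultdict(set)
    for gene_id, names in genes.items():
        for name in names:
            for token in name.lower().split():
                index[token].add(gene_id)
    return index

def candidates(mention, index, top_k=50):
    """Return up to `top_k` gene IDs ranked by the number of shared tokens."""
    scores = defaultdict(int)
    for token in set(mention.lower().split()):
        for gene_id in index.get(token, ()):
            scores[gene_id] += 1
    # Highest score first; ties are broken by gene ID for determinism.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k]

# Made-up records standing in for Entrez Gene entries.
genes = {
    "G1": ["vitamin D receptor", "VDR"],
    "G2": ["vitamin K epoxide reductase"],
    "G3": ["androgen receptor"],
}
index = build_index(genes)
result = candidates("vitamin D receptor", index)
```

The top_k default of 50 mirrors the number of candidates returned per mention in the description above.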

The text of each mention and of each Entrez Gene entry was processed by the following rules, applied sequentially:

3) Third module - The third module is for disambiguating gene IDs, which is accomplished by a ranking algorithm. The algorithm was trained on the 32 full-text articles provided by BioCreAtIvE III.

Each article has a list of tuples (gene mention, gene ID and species); however, the annotations did not give the positions where a gene mention was recognized.

The training samples were generated as follows: for each gene ID candidate, if the gene ID appears in the manual annotation list, the candidate is taken as positive, otherwise negative.
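The labelling rule can be sketched in a few lines. The function name label_candidates, the dictionary layout and the identifiers G1 to G4 are assumptions for this example; only the rule itself (positive if the candidate gene ID is in the article's annotation list, negative otherwise) comes from the description above.

```python
def label_candidates(candidates_by_mention, annotated_ids):
    """Label each (mention, candidate gene ID) pair as 1 (positive) or 0 (negative)."""
    samples = []
    for mention, gene_ids in candidates_by_mention.items():
        for gene_id in gene_ids:
            label = 1 if gene_id in annotated_ids else 0
            samples.append((mention, gene_id, label))
    return samples

# Made-up candidates for two mentions in one article.
candidates_by_mention = {
    "VDR": ["G1", "G2"],
    "p65": ["G3", "G4"],
}
annotated_ids = {"G1", "G4"}  # gene IDs in the article's manual annotation list

samples = label_candidates(candidates_by_mention, annotated_ids)
```

Because the annotations carry no mention positions, the label depends only on the gene ID, so the same gene ID receives the same label wherever it appears as a candidate in the article.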

For each gene ID candidate and its corresponding mention, the manufacturers extracted features from local and global contexts. Some local context features are as follows:

Whether at least one word indicating gene functions of a gene ID appears in the sentences from which the mention was recognized.

The words indicating gene functions are obtained from the corresponding gene symbols after removing common words (such as protein and gene) and words containing capital letters or digits (e.g. VDR, p65).
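A minimal sketch of this local context feature follows. The helper names, the tokenization by regular expression and the two-word common-word list are assumptions; the actual word list and tokenizer used by GeneTUKit are not given here.

```python
import re

# Assumed stop list; the description names only "protein" and "gene" as examples.
COMMON_WORDS = {"protein", "gene"}

def function_words(gene_names):
    """Collect words from a gene's names that may indicate its function."""
    words = set()
    for name in gene_names:
        for word in re.findall(r"[A-Za-z0-9]+", name):
            if word.lower() in COMMON_WORDS:
                continue  # drop common words
            if any(ch.isupper() or ch.isdigit() for ch in word):
                continue  # drop symbol-like words such as VDR or p65
            words.add(word)
    return words

def has_function_word(gene_names, sentence):
    """Binary feature: does any function word occur in the mention's sentence?"""
    sentence_words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return bool(function_words(gene_names) & sentence_words)

names = ["vitamin D receptor", "VDR", "p65 protein"]
kept = function_words(names)
feature = has_function_word(["vitamin D receptor", "VDR"],
                            "The receptor binds calcitriol.")
```

In this example only "vitamin" and "receptor" survive the filtering, and the feature fires because "receptor" occurs in the sentence.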

The document-level, global context features are partly listed as follows:

In constructing these features, the manufacturers used dictionary-based matching to recognize species, as such a simple method can produce fairly good performance.

For finding full/abbreviated name mappings, the manufacturers adopted the method of Schwartz and Hearst: (Schwartz A.S., Hearst M.A. A simple algorithm for identifying abbreviation definitions in biomedical text. Proceedings of the 8th Pacific Symposium on Biocomputing. Kauai, Hawaii: World Scientific Publishing Co. Pte. Ltd; 2003).

Once features were obtained, the manufacturers used the ListNet ranking algorithm to rank the gene IDs for each mention, and the top gene ID was retained for further processing.
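For orientation, the listwise objective commonly associated with ListNet (cross-entropy between the top-one probability distributions of the true labels and the predicted scores) can be sketched as below, together with the selection of the top-ranked gene ID. Whether GeneTUKit used exactly this top-one variant is not stated above, so this is an assumption, and the scoring model that would produce the scores from features is left out. The function names and example values are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_top_one_loss(scores, labels):
    """Cross-entropy between top-one distributions of labels and scores."""
    p_true = softmax(labels)
    p_pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

def top_gene_id(gene_ids, scores):
    """Return the gene ID with the highest predicted score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return gene_ids[best]

gene_ids = ["G1", "G2", "G3"]  # made-up candidates for one mention
labels = [1.0, 0.0, 0.0]       # G1 is the annotated gene

good_scores = [2.0, 0.5, 0.1]  # ranks the annotated gene first
bad_scores = [0.1, 0.5, 2.0]   # ranks the annotated gene last

good_loss = listnet_top_one_loss(good_scores, labels)
bad_loss = listnet_top_one_loss(bad_scores, labels)
chosen = top_gene_id(gene_ids, good_scores)
```

A scoring that places the annotated gene first gives a lower loss than one that places it last, which is the signal a ranker of this kind is trained on.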

4) Fourth module - The fourth module uses a support vector machine (SVM) classifier to generate a confidence score for each predicted gene ID, measuring the association between that gene ID and the document.

The training examples were constructed in the same way as in the third module.

The features were constructed as follows:

System Requirements

Contact manufacturer.


Manufacturer Web Site GeneTUKit

Price Contact manufacturer.

G6G Abstract Number 20791

G6G Manufacturer Number 104364