Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract FACTA (Finding Associated Concepts with Text Analysis) is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query.

The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics.

Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents. This lets the user issue flexible queries (e.g. free keywords or Boolean combinations of keywords/concepts) and receive results immediately, even when the number of documents matching the query is very large.

The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts.

The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri such as UniProt, BioThesaurus, UMLS, KEGG, and DrugBank.

FACTA features/capabilities include:

FACTA receives a query from the user as the input.

A query can be a word (e.g. “p53”), a concept ID (e.g. “UNIPROT:P04637”), or a combination of these (e.g. “(UNIPROT:P04637 AND (lung OR gastric))”).

The system then retrieves all the documents that match the query from MEDLINE using word/concept indexes.

The concepts contained in the documents are then counted and ranked according to their relevance to the query. The results are presented to the user in a tabular format.

The relevant concepts of six (6) categories are displayed in a table and ranked by their frequencies.

The document icon next to each concept name in the table allows the user to view snippets from MEDLINE and see textual evidence of the association.

The user can also invoke another search by clicking a concept name in the table. This allows the user to explore associations between many different concepts in a highly interactive manner.
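The query-to-table flow described above can be sketched in a few lines. This is a minimal illustration, not FACTA's actual implementation (which is written in C++): the toy corpus, the disease IDs, the tuple-based query form and the function names are all assumptions made for the example. Only the concept ID UNIPROT:P04637 and the sample query come from the text.

```python
from collections import Counter

# Toy corpus: doc ID -> (words, concept IDs). Invented for illustration;
# these are not real MEDLINE records and the DISEASE:* IDs are hypothetical.
DOCS = {
    1: ({"p53", "lung", "cancer"}, {"UNIPROT:P04637", "DISEASE:lung_cancer"}),
    2: ({"p53", "gastric", "tumor"}, {"UNIPROT:P04637", "DISEASE:gastric_cancer"}),
    3: ({"p53", "breast"}, {"UNIPROT:P04637", "DISEASE:breast_cancer"}),
    4: ({"lung", "infection"}, {"DISEASE:pneumonia"}),
}

def build_index(docs, field):
    """Build an inverted index (term -> set of doc IDs) for one field."""
    index = {}
    for doc_id, fields in docs.items():
        for term in fields[field]:
            index.setdefault(term, set()).add(doc_id)
    return index

WORD_INDEX = build_index(DOCS, 0)      # index over words
CONCEPT_INDEX = build_index(DOCS, 1)   # index over concept IDs

def lookup(term):
    """Documents matching a single word or concept ID."""
    return WORD_INDEX.get(term, set()) | CONCEPT_INDEX.get(term, set())

def evaluate(query):
    """Evaluate a term string or a nested ("AND" | "OR", left, right) tuple."""
    if isinstance(query, str):
        return lookup(query)
    op, left, right = query
    left_hits, right_hits = evaluate(left), evaluate(right)
    return left_hits & right_hits if op == "AND" else left_hits | right_hits

def rank_concepts(query):
    """Count the concepts in the matching documents, most frequent first."""
    hits = evaluate(query)
    counts = Counter(c for doc_id in hits for c in DOCS[doc_id][1])
    return counts.most_common()

# (UNIPROT:P04637 AND (lung OR gastric))
query = ("AND", "UNIPROT:P04637", ("OR", "lung", "gastric"))
for concept, freq in rank_concepts(query):
    print(concept, freq)
```

In this sketch the Boolean operators map directly onto set intersection and union, which is one reason a pre-built index can answer combined word/concept queries quickly.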

Indexing -- FACTA’s real-time responses to the queries are made possible by the use of its own indexing scheme and implementation of the analysis engines in C++.

It uses two (2) indexes built offline - one for the words and the other for the concepts.

Both indexes are stored in memory to achieve quick responses while the actual sentences of MEDLINE abstracts are stored on external storage.

Currently, FACTA covers six (6) categories of biomedical concepts: human genes/proteins, diseases, symptoms, drugs, enzymes, and chemical compounds.

The concepts appearing in the documents are recognized by dictionary matching.
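As a rough illustration of dictionary matching, the sketch below scans a sentence and maps known names and synonyms to concept IDs, preferring the longest match. The dictionary entries, the tokenizer and the greedy longest-match strategy are assumptions for the example; the source does not describe how FACTA's matcher works internally.

```python
import re

# Hypothetical dictionary: lowercase name or synonym -> concept ID.
# UNIPROT:P04637 appears in the text; DISEASE:0001 is a made-up ID.
DICTIONARY = {
    "p53": "UNIPROT:P04637",
    "tumor protein p53": "UNIPROT:P04637",
    "lung cancer": "DISEASE:0001",
}

# Longest dictionary entry, measured in tokens, bounds the search window.
MAX_NAME_TOKENS = max(len(name.split()) for name in DICTIONARY)

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_concepts(text):
    """Return (matched name, concept ID) pairs using greedy longest match."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so "tumor protein p53" wins over "p53".
        for n in range(min(MAX_NAME_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in DICTIONARY:
                found.append((candidate, DICTIONARY[candidate]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return found

print(match_concepts("Tumor protein p53 is mutated in lung cancer."))
```

A matcher of this kind looks only at surface strings, so it cannot tell whether a matched name is actually used in the intended sense.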

In total, 80,260 unique concepts are indexed. The manufacturer used UniProt accession numbers as the concept IDs for genes/proteins and collected their names and synonyms from BioThesaurus (Liu et al., 2006).

The manufacturer also used UMLS (Humphreys and Lindberg, 1989) for diseases and symptoms.

The concept IDs and names for drugs, enzymes, and chemical compounds were collected from several databases including HMDB, KEGG, and DrugBank.

Ambiguity causes problems in indexing. For example, the term “collapse” is not necessarily used as a symptom name in the documents that produce the results, so ideally such occurrences should be disambiguated using the context and excluded from the counts for that category.

There is also intra-category ambiguity, e.g. some protein synonyms can be mapped to multiple gene/protein IDs. These problems are currently not addressed in FACTA.

Ranking -- Since the number of concepts contained in the documents is usually very large, it is important that the concepts are properly ranked when presented to the user.

Although frequencies are normally a good indicator of the relevance of a concept, they tend to overestimate the importance of common concepts.

FACTA can also rank the concepts by pointwise mutual information.

Pointwise mutual information indicates how much more often the query and a concept co-occur than would be expected by chance.
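For reference, the standard definition is PMI(q, c) = log [ P(q, c) / (P(q) P(c)) ], where each probability can be estimated as a document count divided by the total number of documents. The sketch below applies that definition to invented counts; the function name, the base-2 logarithm and the example numbers are assumptions, and the source does not state how FACTA estimates the probabilities.

```python
import math

def pmi(n_query_and_concept, n_query, n_concept, n_total):
    """Pointwise mutual information (base 2) from document counts.

    Equivalent to log2(P(q, c) / (P(q) * P(c))) with each probability
    estimated as count / n_total.
    """
    if min(n_query_and_concept, n_query, n_concept) <= 0:
        return float("-inf")  # no co-occurrence or no occurrences at all
    return math.log2(n_query_and_concept * n_total / (n_query * n_concept))

# Invented counts: 1000 documents in total, 100 of which match the query.
N_TOTAL, N_QUERY = 1000, 100

# A common concept: in 500 documents, 50 of them also match the query.
common = pmi(50, N_QUERY, 500, N_TOTAL)
# A rare concept: in only 20 documents, 10 of them also match the query.
rare = pmi(10, N_QUERY, 20, N_TOTAL)

print(f"common concept: frequency=50, PMI={common:.2f}")
print(f"rare concept:   frequency=10, PMI={rare:.2f}")
```

With these numbers the common concept co-occurs exactly as often as chance predicts, so its PMI is zero despite its higher raw frequency, while the rare concept co-occurs five times more often than chance and ranks first under PMI.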

System Requirements

Web based.


Manufacturer Web Site FACTA

Price Contact manufacturer.

G6G Abstract Number 20257

G6G Manufacturer Number 102854