Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract PolySearch is a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites.

This tool is designed specifically for extracting and analyzing text-derived relationships between human diseases, genes/proteins, mutations, Single Nucleotide Polymorphisms (SNPs), drugs, metabolites, pathways, tissues, organs and sub-cellular localizations.

It also displays, links and ranks text as well as sequence data in multiple forms and formats.

A distinguishing feature of PolySearch over other biomedical text mining tools is that it extracts and analyzes not only PubMed data, but also text data from multiple databases (DrugBank, SwissProt, HGMD, Entrez SNP, etc.).

This integration of current literature text and database ‘factoids’ allows PolySearch to extract and rank information that is not easily found in databases alone or in journals alone.

PolySearch, as the name suggests, is a tool that supports multiple (‘poly’) types of ‘biomedical text’ searches from multiple (‘poly’) types of databases.

It is also designed to facilitate the search, retrieval and compilation of disease-associated human ‘poly’morphisms (SNPs).

PolySearch exploits recent advances in text mining along with the ready availability of diverse biomedical databases and biomedical thesauruses to permit a wide variety of complex or expansive text searches over many biomedical domains.

PolySearch consists of seven (7) basic components:

1) A web-based user interface for constructing queries;

2) A collection of internal and external biomedical databases;

3) A collection of biomedical synonyms (custom thesauruses and all entity lists);

4) A general text search engine for extracting data from heterogeneous databases;

5) A schema for selecting, ranking and integrating content;

6) A display tool for displaying and synopsizing results; and

7) A PCR primer-designing tool to facilitate SNP and mutation studies.

PolySearch's query interface was written in standard HTML and Perl. PolySearch has been tested on a variety of platforms and is compatible with most common browsers (Firefox, Safari and Internet Explorer).

It uses a series of text boxes and pull-down menus to facilitate query construction.

The basic structure of almost every PolySearch query is ‘given a single X find all associated Y's’, where X can be any single human disease, gene/protein name, drug, metabolite, SNP, gene/protein sequence or user-provided text word and Y can be any one of all human diseases, genes/proteins, drugs, metabolites, pathways, tissues, organs, sub-cellular localizations, SNPs, PCR primers or user-supplied text words.

In each case the ‘X’ and ‘Y’ words can correspond to either a common name or synonyms.
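The 'given a single X find all associated Y's' pattern can be pictured as a small record with two typed slots. The following is only a sketch in Python (the PolySearch interface itself is written in HTML and Perl); the class name `Query`, the constants `X_TYPES` and `Y_TYPES`, and the method `describe` are hypothetical names introduced for illustration, with the category labels taken from the description above.

```python
from dataclasses import dataclass

# Category labels follow the text above; all identifiers here are hypothetical.
X_TYPES = frozenset({
    "disease", "gene/protein", "drug", "metabolite", "SNP",
    "gene/protein sequence", "text word",
})
Y_TYPES = frozenset({
    "disease", "gene/protein", "drug", "metabolite", "pathway", "tissue",
    "organ", "sub-cellular localization", "SNP", "PCR primer", "text word",
})


@dataclass(frozen=True)
class Query:
    x_type: str   # kind of the single known entity (X)
    x_value: str  # its common name or a synonym
    y_type: str   # kind of entity to retrieve (Y)

    def __post_init__(self):
        # Reject combinations that the description above does not list.
        if self.x_type not in X_TYPES:
            raise ValueError(f"unsupported X type: {self.x_type}")
        if self.y_type not in Y_TYPES:
            raise ValueError(f"unsupported Y type: {self.y_type}")

    def describe(self):
        return (f"given {self.x_type} '{self.x_value}', "
                f"find all associated {self.y_type} entries")


query = Query("disease", "colon cancer", "gene/protein")
print(query.describe())
```

Note that the two sets are not symmetric: in this sketch a pathway, tissue, organ or PCR primer can be requested as Y but cannot be supplied as X, which mirrors the lists given above.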

Once the general query is constructed and submitted, the user is presented with a second page (the 'query refinement page') that allows further refinement of the query, including the selection of association words, databases, query word synonyms and display options.

PolySearch query refinement page --

Through its query refinement page, PolySearch also allows users to add synonyms to their original query words (i.e. query synonym expansion). In particular, PolySearch uses its own thesauruses to automatically append synonyms to a query word (by clicking on the option for ‘automated synonym list’).

If the computer-generated synonyms appear inadequate, the user may further edit or add to this list. Users can also edit the set of association words used to refine PolySearch queries.
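Query synonym expansion of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PolySearch's actual code: the function `expand_query`, the lowercase-keyed `thesaurus` dictionary and the example synonyms are all hypothetical.

```python
def expand_query(term, thesaurus, user_synonyms=()):
    """Return the query term followed by its synonyms, without duplicates.

    `thesaurus` is assumed to map a lowercase term to a list of synonyms;
    `user_synonyms` stands for the entries a user adds by hand.
    """
    expanded = [term]
    seen = {term.lower()}
    for candidate in list(thesaurus.get(term.lower(), [])) + list(user_synonyms):
        # Compare case-insensitively so "Breast Carcinoma" is not added twice.
        if candidate.lower() not in seen:
            seen.add(candidate.lower())
            expanded.append(candidate)
    return expanded


# Illustrative thesaurus entry (not taken from PolySearch's own thesauruses).
thesaurus = {"breast cancer": ["breast carcinoma", "mammary cancer"]}
terms = expand_query("Breast Cancer", thesaurus,
                     user_synonyms=["Breast Carcinoma", "carcinoma of the breast"])
print(terms)
# ['Breast Cancer', 'breast carcinoma', 'mammary cancer', 'carcinoma of the breast']
```

Automatic synonyms come first and user additions are appended, so a manually supplied duplicate of a thesaurus entry is silently dropped.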

From the query refinement interface users can also choose to limit their search to 'PubMed only', or to perform their search on some of PolySearch's other reference databases (see below).

Limiting PolySearch searches to the PubMed database (the default configuration) is faster but the results tend to be less accurate.

Additionally, through the query refinement interface users can also specify:

1) How far back in time the PubMed records should be searched;

2) The number of abstracts to be searched; and

3) The minimum number of PubMed citations required to be considered as a hit.

Changing these values judiciously can also shorten the search times.
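The effect of these three settings can be sketched as successive filters over a list of retrieved records. The Python below is a minimal illustration, assuming each record is a (publication year, terms found) pair; the function `filter_hits`, its parameter names and the sample records are hypothetical and do not reflect how PolySearch stores its results.

```python
from collections import Counter


def filter_hits(records, current_year, years_back, max_abstracts, min_citations):
    """Apply the three search limits to a list of (year, terms) records."""
    # 1) Keep only records published within the chosen time window.
    recent = [r for r in records if r[0] >= current_year - years_back]
    # 2) Cap the number of abstracts that are examined.
    limited = recent[:max_abstracts]
    # 3) Count each term once per abstract and keep terms with enough citations.
    counts = Counter(term for _, terms in limited for term in set(terms))
    return {term: n for term, n in counts.items() if n >= min_citations}


# Made-up records for illustration only.
records = [
    (2007, ["BRCA1", "TP53"]),
    (2006, ["BRCA1"]),
    (1990, ["BRCA1", "EGFR"]),
    (2005, ["TP53"]),
]
print(filter_hits(records, current_year=2008, years_back=5,
                  max_abstracts=3, min_citations=2))
# {'BRCA1': 2, 'TP53': 2}
```

A shorter time window or a smaller abstract cap reduces the amount of text to scan, while a higher citation threshold trims weakly supported hits from the result list.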

PolySearch algorithms --

PolySearch does not use part-of-speech tagging, but rather it uses a dictionary or ‘bag-of-words’ approach to identify relevant text associations. Key to the success of dictionary-based text mining is having a comprehensive collection of words and synonyms, all of which are properly normalized or mapped to appropriate database accession numbers.
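A dictionary-based association search of this general kind can be sketched as follows. This Python example is not PolySearch's algorithm: it assumes that an association is counted whenever a query synonym and a dictionary term appear in the same sentence, and the function `find_associations`, the sample text and the accession-style identifiers are hypothetical.

```python
import re
from collections import Counter


def mentions(sentence, term):
    """True if `term` occurs in `sentence` as a whole word or phrase."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def find_associations(text, query_synonyms, target_dictionary):
    """Count sentences in which a query synonym co-occurs with a known term.

    `target_dictionary` maps each name or synonym to a normalized
    identifier, so different spellings of one entity are counted together.
    """
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not any(mentions(sentence, q) for q in query_synonyms):
            continue
        # A set ensures each entity is counted at most once per sentence.
        found = {acc for term, acc in target_dictionary.items()
                 if mentions(sentence, term)}
        counts.update(found)
    return counts


text = ("Mutations in BRCA1 are linked to breast cancer. "
        "TP53 is a tumor suppressor. "
        "Breast carcinoma risk rises with BRCA1 and breast cancer 1 variants.")
genes = {"BRCA1": "G0001", "breast cancer 1": "G0001", "TP53": "G0002"}  # made-up IDs
print(find_associations(text, ["breast cancer", "breast carcinoma"], genes))
# Counter({'G0001': 2})
```

Because matching relies entirely on the dictionary, a name that is missing from `target_dictionary` is never reported, however often it appears in the text.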

PolySearch maintains nine (9) different thesauruses, compendia or synonym lists for human genes, human proteins, human diseases, approved drugs, endogenous metabolites, protein/gene pathways, human tissues, human organs and sub-cellular localizations.

These thesauruses or compendia are obviously critical for many of the expansive queries (‘given one, find many’) supported by PolySearch. They are also critical for providing the sensitivity and specificity for many single-word queries (i.e. the automated synonym feature in the 'query refinement page').

PolySearch databases --

One of the unique features of PolySearch is its integration of multiple databases containing both text and sequence data.

Many of these databases (PubMed, OMIM, etc.) are housed externally and queried through various custom CGI tools written in Perl, while others (DrugBank, HMDB and the SNP databases) are housed internally by the manufacturer to accelerate PolySearch's query process. Below is a short description of each database.

1) PubMed - A service of the U.S. National Library of Medicine that includes over 17 million abstracts and paper titles from life science journals dating back to the 1950s.

2) Online Mendelian Inheritance in Man (OMIM) - A catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins University, and developed for the Web by the NCBI (the National Center for Biotechnology Information).

3) Genetic Association Database (GAD) - (see G6G Abstract Number 20314) - An archive of human genetic association studies of complex diseases and disorders.

4) SwissProt - A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications (PTMs), variants, etc.), a minimal level of redundancy and a high level of integration with other databases.

5) Human Protein Reference Database (HPRD) - A centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome.

6) DrugBank - A unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.

7) Human Metabolome Database (HMDB) - A freely available electronic database containing detailed information about small molecule metabolites found in the human body.

8) HapMap - A freely available resource that contains information pertaining to the haplotype map of the human genome. The HapMap database describes the common patterns of human DNA sequence variation.

9) Entrez SNP (dbSNP) - A central repository for both single base nucleotide substitutions (SNPs) and short deletion and insertion polymorphisms in the human genome.

10) CGAP SNP500cancer Database - A part of the Cancer Genome Anatomy Project that is specifically designed to contain data on the genetic variation in genes important in cancer.

11) Human Genome Mutation Database (HGMD) - A database that comprises various types of mutation within the coding regions, splicing and regulatory regions of human nuclear genes causing inherited disease.

PolySearch limitations --

PolySearch is not without some limitations. As a text mining tool, PolySearch uses a relatively simple dictionary approach to identify biological or biomedical associations. This means PolySearch cannot identify novel or newly named diseases, genes, cell types, drugs or metabolites.

Another limitation lies in its inability to extract context or meaning from sentences or terms. Methods that use artificial intelligence (AI), word context or machine learning (ML) could potentially improve the current 'term identification system'.

Efforts are underway to incorporate these improvements in future releases of PolySearch.

System Requirements



Manufacturer Web Site PolySearch

Price Contact manufacturer.

G6G Abstract Number 20505

G6G Manufacturer Number 104124