MedlineRanker

Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract The MedlineRanker webserver is a text mining system that allows flexible ranking of Medline (PubMed) abstracts for a topic of interest without requiring expert knowledge.

Given some abstracts related to a topic, the system automatically deduces the most discriminative words in comparison to a random selection.

These words are used to score other abstracts, including those from recent publications that have not yet been annotated, which can then be ranked by relevance.

The user defines their topic of interest using their own set of abstracts, which can be just a few examples, and they can also run the analysis with default parameters.

If the input contains closely related abstracts, the system returns relevant abstracts from the recent bibliography with high accuracy.

The web interface also allows customization of other parameters and inputs, such as the reference set of abstracts, which is compared to the query.

This tool can process thousands of abstracts from the Medline database in a few seconds, or millions in a few minutes.

MedlineRanker method and implementation --

The MedlineRanker method is derived from a supervised learning method which was tested on the subject of stem cells.

Briefly, noun usage is compared between a set of abstracts related to a topic of interest, called the training set, and the whole Medline or a subset, called the background set.

First, nouns are extracted from each English abstract, including the title, without counting multiple occurrences.

The original supervised learning method was improved by using a linear naïve Bayesian classifier, applied by calculating noun weights and a dot product refactored for speed so that it sums only over the features that actually occur.

The method also uses a split-Laplace smoothing scheme to counteract class skew.
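
As an illustration of this kind of linear classifier, the sketch below computes per-noun weights from document frequencies and scores an abstract with a sparse dot product. It is Python written for this description only: the toy abstracts are invented, and plain Laplace smoothing stands in for MedlineRanker's split-Laplace scheme.

import math
from collections import Counter

def noun_weights(training_docs, background_docs, smoothing=1.0):
    """Log-ratio weight per noun: how much more often it occurs in training
    abstracts than in background abstracts.  Each abstract is a set of nouns,
    so repeated occurrences within one abstract count once.  Plain Laplace
    smoothing is used here as a stand-in for the split-Laplace scheme."""
    train_df = Counter(n for doc in training_docs for n in set(doc))
    back_df = Counter(n for doc in background_docs for n in set(doc))
    n_train, n_back = len(training_docs), len(background_docs)
    weights = {}
    for noun in set(train_df) | set(back_df):
        p_train = (train_df[noun] + smoothing) / (n_train + 2.0 * smoothing)
        p_back = (back_df[noun] + smoothing) / (n_back + 2.0 * smoothing)
        weights[noun] = math.log(p_train / p_back)
    return weights

def score_abstract(nouns, weights):
    """Sparse 'dot product': sum the weights of only those nouns that occur."""
    return sum(weights.get(n, 0.0) for n in set(nouns))

# Toy example: two training abstracts against two background abstracts.
train = [{"kinase", "phosphorylation", "substrate"}, {"kinase", "signalling"}]
background = [{"patient", "trial"}, {"genome", "sequence"}]
w = noun_weights(train, background)
print(score_abstract({"kinase", "phosphorylation", "trial"}, w))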

An abstract is scored by summing the weights of its nouns, and a 'P-value' is defined as the proportion of 10,000 recent abstracts that obtain a higher score.
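
Given scores computed as above, this 'P-value' is purely empirical. A minimal sketch, assuming only that the scores of the test abstracts and of the 10,000 recent reference abstracts have already been computed:

import bisect

def empirical_p_values(test_scores, reference_scores):
    """P-value of a test abstract = proportion of reference abstracts
    (e.g. 10,000 recent ones) whose score is strictly higher."""
    ranked = sorted(reference_scores)
    n = len(ranked)
    return [(n - bisect.bisect_right(ranked, s)) / n for s in test_scores]

# Toy example: two test abstracts against a small reference set.
print(empirical_p_values([4.0, 0.5], [0.1, 0.3, 0.7, 2.0, 3.5]))  # [0.0, 0.6]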

Extraction of nouns from English abstracts is performed using the TreeTagger program (Helmut Schmid, Institute for Natural Language Processing, University of Stuttgart), and the extracted nouns are stored in a local MySQL database along with information from the Medline database.
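
The exact TreeTagger invocation and the MySQL schema are not described here; the following sketch only assumes TreeTagger's standard tab-separated output (token, part-of-speech tag, lemma, one token per line) and collects each noun once per abstract:

def nouns_from_treetagger(tagged_output):
    """Extract the set of nouns from TreeTagger output.  Noun tags in the
    English tagset start with 'N' (NN, NNS, NP, NPS); multiple occurrences
    of the same noun in one abstract are kept only once."""
    nouns = set()
    for line in tagged_output.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        token, pos, lemma = fields[0], fields[1], fields[2]
        if pos.startswith("N"):
            # TreeTagger typically reports '<unknown>' when it has no lemma.
            nouns.add(token.lower() if lemma == "<unknown>" else lemma.lower())
    return nouns

# Toy example of tagged output for "kinases phosphorylate substrates".
print(nouns_from_treetagger(
    "kinases\tNNS\tkinase\nphosphorylate\tVVP\tphosphorylate\nsubstrates\tNNS\tsubstrate"))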

MedlineRanker Results --

MedlineRanker user inputs -

There are three (3) different sets of data that the user can provide to help them get the most relevant results from MedlineRanker: the training set, the background set and the test set.

A user interested in ranked results related to a particular topic has to input some abstracts related to that topic as 'the training set'.

In the training set, an abstract is represented by its PubMed identifier (PMID). These identifiers can be easily retrieved from a PubMed search results page as explained in the MedlineRanker webserver online documentation.
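
MedlineRanker itself only needs the list of PMIDs; how they are gathered is up to the user. As one hedged illustration (not part of MedlineRanker), PMIDs matching a PubMed query can also be fetched programmatically through NCBI's E-utilities esearch service:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def pubmed_pmids(query, retmax=100):
    """Return PMIDs matching a PubMed query via NCBI E-utilities (esearch)."""
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
           + urllib.parse.urlencode({"db": "pubmed", "term": query, "retmax": retmax}))
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return [id_node.text for id_node in tree.findall(".//IdList/Id")]

# Example: a small training set of abstracts about a topic of interest.
print(pubmed_pmids("stem cell differentiation", retmax=20))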

Also, thanks to available Medline annotations the webserver can automatically construct the training set from a list of biomedical MeSH terms.

Some example training sets can also be selected just by clicking on hyperlinks.

If the user decides to run the analysis with the default parameters, the training set profile will be compared to a precomputed profile of the entire Medline database, and used to rank ten thousand (10,000) recent abstracts.

Beyond the input set, a 'second main parameter' of MedlineRanker is the choice of the reference abstracts, i.e. 'the background set'.

To construct a profile for the query topic, the noun frequencies in the training set are compared to the corresponding frequencies in the background set by a linear naïve Bayesian classifier.

The default background set is the entire Medline database, which is clearly suitable when ranking recent abstracts or the most recent years of the literature.

The manufacturer recommends using the default background set; however, the user can also provide their own list of PMIDs.

This may be useful when the abstracts that have to be ranked are all related to the same secondary topic.

For instance, if one is interested in ranking abstracts already related to protein binding according to their relevance for the topic ‘Phosphorylation’, an appropriate background set would be a list of abstracts related to ‘protein binding’.

The 'last main parameter' defines which abstracts are going to be ranked, i.e. 'the test set'.

By default, 10,000 recent abstracts are selected. By using this relatively small subset of Medline, the results can be returned quickly and the performance of the training set can be evaluated in a short amount of time.

The test set can be extended to the last months or years of Medline at a cost in computational time. The manufacturer's server can process approximately one million abstracts per minute.

Alternatively, the user can input their own test set as a list of PMIDs. This is very useful for focusing a search on a particular set of abstracts of interest.

For instance, if one were interested in ranking abstracts describing protein-protein interaction (PPI), the main PPI databases, such as the Human Protein Reference Database (HPRD), the Database of Interacting Proteins (DIP) or the Molecular INTeraction database (MINT), provide PMIDs for each described interaction.
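
Export formats differ between HPRD, DIP and MINT, so the short sketch below only assumes that the user has already saved a plain list of interaction-supporting PMIDs from each resource (the file names are hypothetical) and merges them into one de-duplicated test set:

def load_pmids(path):
    """Read one PMID per line, keeping only non-blank, purely numeric lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip().isdigit()}

# Hypothetical per-database exports of interaction-supporting PMIDs.
test_set = set()
for path in ("hprd_pmids.txt", "dip_pmids.txt", "mint_pmids.txt"):
    test_set |= load_pmids(path)

# One PMID per line is a convenient form for pasting into the web form.
print("\n".join(sorted(test_set)))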

MedlineRanker Results page --

The results page shows the ranked test set as a table, with the most relevant records at the top of the table. For each abstract, the table shows the rank, PMID, title and 'P-value'.

The discriminative words that were used to score the abstracts are highlighted in the column containing the article title.

Clicking on a PMID opens a pop-up window showing the whole abstract text with highlighted discriminative words, further info and a link to PubMed.

During the ranking process, a leave-one-out cross-validation is performed on a subset of the data. This provides an estimation of the method's predictive performance, including precision and recall, for several cut-offs and is displayed as a table.

Additionally, the probability of the correct ranking of a random pair of abstracts, one relevant and one irrelevant, is calculated from the area under a Receiver Operating Characteristic (ROC) curve.

This is provided to allow future comparisons with other algorithms. Finally, the list of ‘discriminative words’ with corresponding weights is given in decreasing order of importance.
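
The reported statistics have simple definitions that can be reproduced independently. The sketch below computes precision and recall at a score cut-off, and the AUC read as the probability that a randomly chosen relevant abstract outscores a randomly chosen irrelevant one; the toy scores and labels are illustrative, not MedlineRanker output.

def precision_recall_at_cutoff(scores, labels, cutoff):
    """Precision and recall when every abstract scoring above the cut-off
    is predicted relevant (label 1 = relevant, 0 = irrelevant)."""
    predicted = [s > cutoff for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(scores, labels):
    """Probability that a relevant abstract outscores an irrelevant one
    (ties count as half); this equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy ranking: four abstracts, the two relevant ones score highest.
scores, labels = [3.2, 2.5, 1.1, 0.4], [1, 1, 0, 0]
print(precision_recall_at_cutoff(scores, labels, 2.0))  # (1.0, 1.0)
print(roc_auc(scores, labels))                          # 1.0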

System Requirements

Web-based.

Manufacturer

MedlineRanker was created by the members of the Computational Biology and Data Mining (CBDM) group of Miguel Andrade at the Max Delbrueck Center for Molecular Medicine (MDC), Berlin.

Manufacturer Web Site MedlineRanker

Price Contact manufacturer.

G6G Abstract Number 20482

G6G Manufacturer Number 104107