BioText Search Engine

Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract The BioText Search Engine is a freely available Web-based application that provides biologists with new ways to access scientific literature.

One novel feature is the ability to search and browse article figures and their captions.

A grid view juxtaposes many different figures associated with the same keywords, providing new insight into the literature. An abstract/title search and list view shows at a glance many of the figures associated with each article.

The interface is carefully designed according to usability principles and techniques. The search engine is a work in progress, and more functionality will be added over time.

BioText Search Engine's features/capabilities include:

Design --

The current design consists of an interaction flow in which users can search either over the text of abstracts (plus titles, author names and other metadata) or over the text of the figure captions. For the query "mutagenesis", for example, a search over titles and abstracts shows many of each article's figures alongside its abstract, while a search over captions shows the figures that correspond to the matching captions.

The results can be viewed either in a list view (in the case of abstract search and caption search) or in a grid view (in the case of caption search). For the query "mutagenesis", the grid view of a caption search shows the corresponding figures.

Functionality --

Figure captions often contain important information about experimental methods.

For example, searching on "Western Blot" in the current collection (database) produces few results when run only over title and abstract text, but returns more than a thousand results in caption search. (Note that caption search does not currently also search over abstracts.) Similar behavior is seen for the queries PCR, "phylogenetic tree" and "sequence alignment".
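The difference between the two search modes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two-article collection, the field names `abstract` and `captions`, and the simple substring matching are hypothetical and are not taken from the BioText implementation, which uses Lucene.

```python
# Hypothetical two-article collection; field names and texts are invented.
ARTICLES = [
    {"id": 1,
     "abstract": "We study protein expression in yeast.",
     "captions": ["Western blot of the expressed protein.", "Growth curves."]},
    {"id": 2,
     "abstract": "A phylogenetic analysis of kinase genes.",
     "captions": ["Phylogenetic tree of the kinase family."]},
]

def search(phrase, field):
    """Return ids of articles whose given field contains the phrase (case-insensitive)."""
    phrase = phrase.lower()
    hits = []
    for article in ARTICLES:
        value = article[field]
        # A field holds either one string (abstract) or a list of strings (captions).
        texts = value if isinstance(value, list) else [value]
        if any(phrase in text.lower() for text in texts):
            hits.append(article["id"])
    return hits

abstract_hits = search("Western Blot", "abstract")   # method not named in any abstract
caption_hits = search("Western Blot", "captions")    # method named in a caption
```

In this toy collection the experimental method appears only in a caption, so the abstract search misses the article while the caption search finds it.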

The grid view may be especially useful for seeing commonalities among topics, such as all the phylogenetic trees that include a given gene, or seeing all images of embryo development of some species.

Implementation --

The current system indexes all 'Open Access' articles available at PubMed Central. This collection consists of more than 150 journals, 20,000 articles and 80,000 figures (new articles are downloaded daily). The figures are stored locally, in order to be able to present thumbnails quickly.

The Lucene open source search engine is used to index, retrieve and rank the text (using the default statistical ranking).

Note: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform applications.

Publication date is stored as a separate field and can also be used to sort the results.
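As a rough illustration of sorting on a separately stored date field, the following plain-Python sketch orders a list of hits either by relevance score or by publication date. The hit records, the field names `score` and `published`, and the function `sort_hits` are hypothetical; this is not Lucene's API and not the BioText code.

```python
from datetime import date

# Hypothetical search hits; titles, scores and dates are invented.
HITS = [
    {"title": "A", "score": 2.1, "published": date(2006, 5, 1)},
    {"title": "B", "score": 3.4, "published": date(2005, 1, 15)},
    {"title": "C", "score": 1.0, "published": date(2007, 3, 9)},
]

def sort_hits(hits, by="relevance"):
    """Sort hits by score (default) or by publication date, highest or newest first."""
    key = "published" if by == "date" else "score"
    return sorted(hits, key=lambda hit: hit[key], reverse=True)

by_relevance = [hit["title"] for hit in sort_hits(HITS)]
by_date = [hit["title"] for hit in sort_hits(HITS, by="date")]
```

Keeping the date in its own field means the same result set can be reordered without touching the text ranking.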

For tokenization, the standard analysis settings for Lucene are used: words are split at punctuation characters and hyphens (unless the token contains a number), and the text is lowercased, simply stemmed and stripped of stop words.
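The tokenization steps described above can be approximated in a short Python sketch. It is only an approximation under stated assumptions: the stop-word list, the plural-stripping `simple_stem` helper and the splitting rule are hypothetical stand-ins, and Lucene's actual analyzers behave differently in detail.

```python
import re

# Hypothetical stop-word list; Lucene's own default list differs.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "on", "is", "to"}

def simple_stem(token):
    """Crude stand-in for stemming: strip a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def tokenize(text):
    """Split, lowercase, remove stop words and stem, keeping numeric tokens whole."""
    tokens = []
    for chunk in text.split():
        chunk = chunk.strip(".,;:!?()[]\"'")
        if not chunk:
            continue
        if any(ch.isdigit() for ch in chunk):
            parts = [chunk]  # a token containing a number is not split further
        else:
            parts = re.split(r"[^A-Za-z]+", chunk)  # split at punctuation and hyphens
        for part in parts:
            word = part.lower()
            if word and word not in STOP_WORDS:
                tokens.append(simple_stem(word))
    return tokens

tokens = tokenize("The Western-Blot of IL-2 trees")
```

Here "Western-Blot" is split at the hyphen, whereas "IL-2" stays intact because it contains a number, which helps gene and protein names remain searchable as single tokens.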

The interface is Web-based and is implemented in Python (a general-purpose high-level programming language) and PHP (a widely used general-purpose scripting language).

Logs and other information are stored using MySQL [a Relational Database Management System (RDBMS)].

System Requirements



Project Leads

Manufacturer Web Site BioText Search Engine

Price Freely available.

G6G Abstract Number 20253

G6G Manufacturer Number 102853