@Note

Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract @Note is a platform for Biomedical Text Mining (BioTM) that aims at the effective translation of the advances between three (3) distinct classes of users: biologists, text miners and software developers.

Its main functional contributions are the ability to process abstracts and full-text documents; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenization and stop-word removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows users to correct annotations; and a Text Mining Module supporting dataset preparation and algorithm evaluation.

@Note improves interoperability, modularity, and flexibility when integrating in-house and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use.

@Note Functional modules -- @Note integrates four (4) main functional modules covering different tasks of Biomedical Text Mining (BioTM).

1) The Document Retrieval Module (DRM) accounts for Information Retrieval (IR) tasks.

2) The Document Conversion and Structuring Module (DCSM) covers the initial Information Extraction (IE) steps.

3) The Natural Language Processing Module (NLPM) supports tokenization, stemming, stop-word removal, and syntactic and semantic text processing.

In particular, the SYntactic Processing sub-module (SYP) carries out Part-Of-Speech (POS) tagging and shallow parsing, while the Lexicon-based Named Entity Recognition (NER) sub-module (L-NER) and the Model-based NER sub-module (M-NER) are responsible for semantic NER annotation.

4) The Text Mining Module (TMM) deals with Machine Learning (ML) algorithms, providing models for distinct IR or IE tasks (e.g. NER or document relevance assessment).

1) Document Retrieval Module (DRM) -- The DRM supports PubMed keyword-based queries as well as document retrieval from open-access and subscription-based Web-accessible journals.

It addresses the need to process full-text documents in order to obtain detailed information about biological processes. The module uses the Entrez Programming Utilities (eUtils) Web service.

External links are traversed sequentially, avoiding server overload and respecting journal policy. This module identifies most document source hyperlinks through general templates.
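The retrieval behaviour described above can be sketched as follows. This is a minimal illustration, not @Note's actual implementation: the function names `build_pubmed_query` and `fetch_sequentially` are hypothetical, and the one-second default pause is an assumed value. The ESearch endpoint and its `db`, `term` and `retmax` parameters belong to the public E-utilities interface.

```python
import time
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords, max_results=20):
    """Return an ESearch URL for a PubMed keyword query (hypothetical helper)."""
    params = {
        "db": "pubmed",
        "term": " AND ".join(keywords),
        "retmax": max_results,
    }
    return ESEARCH_URL + "?" + urlencode(params)


def fetch_sequentially(urls, fetch, delay_seconds=1.0):
    """Fetch URLs one at a time, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content. It is
    injected so the traversal logic can be exercised without network access.
    The default delay is an assumption, not a documented journal policy.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause to avoid overloading the server
        results.append(fetch(url))
    return results
```

In practice `fetch` could wrap `urllib.request.urlopen`. Journals that need JavaScript handling or redirects would require their own retrieval logic, as noted in the text.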

However, for journals where traversal is not straightforward (for example, due to JavaScript components or redirect actions), specific retrieval templates need to be implemented.

Apart from implementing the search and retrieval of problem-related documents, the DRM also supports document relevance assessment. Keyword-based queries deliver a list of candidate documents and the user usually evaluates the actual relevance of each of these documents.

Even taking into account document annotations, this process is laborious and time-consuming as some assessments demand careful reading of full-texts and the interpretation of implicit statements.

Foreseeing the need to automate relevance assessment, this module includes Machine Learning (ML) algorithms to obtain problem-specific document relevance classification models, thus delivering some degree of automation to this process.

2) Document Conversion and Structuring Module (DCSM) -- The DCSM is responsible for PDF-to-text document conversion and first-level structuring. PDF files need to be translated to a format that can be used by subsequent Natural Language Processing (NLP) modules.

@Note includes two (2) of the most successful free conversion programs, namely the pdftotext program (which is part of the Xpdf software and its Mac OS version) and PDFBox.

The process of XML-oriented document structuring is based on bibliographic data and general rules. The @Note catalogue provides title, authors, journal and abstract data.

Additional template rules search for known journal headings (such as Introduction, Implementation, Conclusions, and References), assuming that they are usually fully capitalized (or written with initial caps), start at the beginning of a line, and are followed by a newline.
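One way to express such a heading rule is a multi-line regular expression. The sketch below is illustrative only: the heading list contains just the four examples named above, and `split_sections` is a hypothetical helper rather than part of @Note.

```python
import re

# Hypothetical heading list; the text names only these four as examples.
KNOWN_HEADINGS = ["Introduction", "Implementation", "Conclusions", "References"]

# A heading must occupy a whole line (start of line, then newline) and be
# written either fully capitalized or with an initial capital.
_alternatives = "|".join(
    f"{re.escape(h.upper())}|{re.escape(h)}" for h in KNOWN_HEADINGS
)
HEADING_PATTERN = re.compile(rf"^({_alternatives})[ \t]*$", re.MULTILINE)


def split_sections(text):
    """Return (heading, body) pairs for each known heading found in text."""
    matches = list(HEADING_PATTERN.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        sections.append((match.group(1).capitalize(), body))
    return sections
```

Because the pattern is anchored to whole lines, a heading word that merely appears inside a sentence is not treated as a section boundary.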

3) Natural Language Processing Module (NLPM) -- The NLPM covers document pre-processing, syntactic annotation, semantic annotation and a user-friendly environment for the manual annotation of documents. Furthermore, it is able to process abstracts and full-text interchangeably.

Tokenization, sentence splitting and stop-word removal are the basic text processing steps, and typically they do not rely on previous pre-processing, whereas shallow parsing and Named Entity Recognition (NER) may be based on Part-Of-Speech (POS) annotation.
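The three basic steps can be illustrated with a few lines of code. This is a simplified sketch under stated assumptions: the stop-word list is a tiny illustrative sample, the sentence splitter ignores abbreviations such as "e.g.", and none of the function names come from @Note.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "is", "a", "an", "and", "in", "by", "to"}


def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return word tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence)


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

None of these functions depends on the output of another annotation layer, which is why they can run before any POS tagging or NER step.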

In fact, the developed tools are able to deal with both semantic and syntactic annotation, and the annotation processes have no precedence over one another, i.e. semantic annotation may occur after or before POS tagging.

Such multi-layer annotation can support text mining tasks (namely the construction of NER classification models) as well as further relationship extraction.

This module also supports the construction and use of lexical resources, encompassing data loaders for major biomedical databases such as BioCyc - (see G6G Abstract Number 20230), UniProt, ChEBI, and NCBI Taxonomy and integrative databases such as BioWarehouse - (see G6G Abstract Number 20238).

Also, it provides author-compiled lists of standard laboratory techniques, general physiological states, and verbs commonly related to biological events.

Currently, the system accounts for a total of fourteen (14) biological classes as follows: gene (including the subclasses metabolic and regulatory gene), protein (including the subclasses transcription factor and enzyme), pathway, reaction, compound, organism, DNA, RNA, physiological state, and laboratory technique.

The rewriting system attempts to match terms (of up to 7 words) against dictionary contents, checking for different term variants (e.g. hyphen and apostrophe variants) and excluding terms that are too short (fewer than 3 characters long).

Additional patterns are included to account for previously unknown terms and term variants. Besides class identification, the system also supports term normalization, grouping all term variants around a "common name" for visualization and statistical purposes.
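A dictionary lookup of this kind can be sketched as below. The 7-word limit and the 3-character minimum come from the description above. The greedy longest-match strategy, the variant handling in `normalize`, and all function names are assumptions made for illustration, not @Note's documented algorithm.

```python
import re

MAX_TERM_WORDS = 7   # longest term composition, from the text
MIN_TERM_CHARS = 3   # shorter candidates are skipped, from the text


def normalize(term):
    """Collapse hyphen and apostrophe variants into one lookup key."""
    key = term.lower().replace("'", "")
    key = re.sub(r"[-\s]+", " ", key)
    return key.strip()


def build_lexicon(entries):
    """Map each normalized variant to (common name, class).

    `entries` is an iterable of (common_name, term_class, variants) tuples.
    """
    lexicon = {}
    for common_name, term_class, variants in entries:
        for variant in [common_name, *variants]:
            lexicon[normalize(variant)] = (common_name, term_class)
    return lexicon


def annotate(tokens, lexicon):
    """Greedy longest-match lookup (an assumed strategy).

    Returns (start, end, common name, class) tuples over token positions.
    """
    annotations = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_TERM_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if len(candidate) < MIN_TERM_CHARS:
                continue
            hit = lexicon.get(normalize(candidate))
            if hit:
                annotations.append((i, i + n, hit[0], hit[1]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return annotations
```

Every matched variant is reported under its common name, so "ppGpp" and "guanosine tetraphosphate" would be counted as the same entity in later statistics.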

The M-NER sub-module aims at applying classification models to the Named Entity Recognition (NER) task, thereby accounting for the constantly changing biological terminology.

Both the L-NER and M-NER sub-modules provide invaluable aid to curators, but available techniques do not fully cope with terminological issues.

Manual curation is still an important BioTM requirement and @Note acknowledges this fact by providing a user-friendly environment where biologists (problem experts) may revise automatically annotated documents.

The manual annotation environment guarantees high-quality annotation and hence the extraction of relevant information. Annotated documents resulting from L-NER can be refined by eliminating or correcting existing annotations (e.g. changing the term class or adjusting term grams) and by adding new annotations.

Such annotation refinement may also support dictionary updates, accounting for term novelty and term synonymy.

Manually curated documents can be used as a training corpus at the Text Mining Module (TMM) to build classification models.

In fact, this curation environment makes it possible for biologists and researchers to cooperate in improving BioTM corpora and to build automated models upon expert-revised knowledge.

4) Text Mining Module (TMM) -- The TMM provides the workbench for conducting text mining experiments. This module is implemented as a low-level plug-in to YALE - (see G6G Abstract Number 20177), which also includes WEKA - (see G6G Abstract Number 20534).

These are two (2) open-source toolkits that allow the deployment of different problem-oriented text mining experiments (namely feature selection and model evaluation).

Currently, this module aims only at the construction and evaluation of Named Entity Recognition (NER) Machine Learning (ML) models that can be further used by the M-NER sub-module of the Natural Language Processing Module (NLPM), although other tasks such as document relevance assessment are already being developed.

NER-oriented dataset preparation was implemented by the authors using the General Architecture for Text Engineering (GATE) features and covers morphological, syntactical and context features.

Morphological features track term composition elements (such as capitalization, hyphenisation, alphanumeric data, quotes, and tildes) and affix information (3-5 characters long).

Syntactical features are based on Part-Of-Speech (POS) tagging. Context features capture the morphological and syntactical nature of the words in the neighborhood of the term (typically, two words for each side).
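The three feature families can be sketched for a single token as follows. This is not the GATE-based implementation mentioned above: the feature names are hypothetical, POS tags are assumed to come from an external tagger, and only a subset of the morphological cues is shown.

```python
def morphological_features(token):
    """Describe the surface form of a single token."""
    features = {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_hyphen": "-" in token,
        "has_digit": any(c.isdigit() for c in token),
        "has_quote": any(c in "'\"" for c in token),
        "has_tilde": "~" in token,
    }
    # Prefixes and suffixes of 3 to 5 characters, when the token is long enough.
    for n in range(3, 6):
        if len(token) >= n:
            features[f"prefix_{n}"] = token[:n].lower()
            features[f"suffix_{n}"] = token[-n:].lower()
    return features


def token_features(tokens, pos_tags, index, window=2):
    """Combine morphological, POS, and context features for one token.

    `pos_tags` must be aligned with `tokens`; `window` is the number of
    neighbouring words inspected on each side (two, as in the text).
    """
    features = morphological_features(tokens[index])
    features["pos"] = pos_tags[index]
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        if 0 <= j < len(tokens):
            features[f"pos_{offset:+d}"] = pos_tags[j]
            features[f"is_capitalized_{offset:+d}"] = tokens[j][:1].isupper()
    return features
```

Each token then becomes one row of the dataset, and text miners can keep or drop individual feature columns before training.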

Based on their expertise, text miners select the set of features that best describes each problem and perform mining experiments. Experiments evaluate different mining algorithms and alternative algorithm configurations. The resulting model can then be saved and further used in the Model-based NER (M-NER) sub-module.
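The description does not state which measures are used to compare models. As an assumption, the sketch below uses precision, recall and F1 over exact-match annotations, which are common choices for NER evaluation. The function name is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones by exact match.

    Both arguments are iterables of hashable annotations, for example
    (start, end, class) tuples.
    """
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Running the same function over the output of each candidate algorithm or configuration gives comparable scores for model selection.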

Low-level Integration Issues -- At the low level, @Note supports continuous development, where new features and services can be added and improved frequently, integrating many research efforts.

@Note is built on top of AIBench - (see G6G Abstract Number 20693), a Java application development framework used in a growing number of research projects.

AIBench comprises core libraries and delivers a set of functionalities in the form of plug-ins. Currently AIBench integrates the GATE text engineering plug-in and YALE data mining plug-in.

System Requirements

Contact manufacturer.

Manufacturer

IBB - Institute for Biotechnology and Bioengineering
Centre of Biological Engineering
University of Minho
Campus de Gualtar, 4710-057 Braga
Portugal

Manufacturer Web Site @Note

Price Contact manufacturer.

G6G Abstract Number 20696

G6G Manufacturer Number 104267

The G6G Directory of Omics and Intelligent Software
