Semantic Features in Text (SENT)

Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract Semantic Features in Text (SENT) is a functional interpretation tool based on literature analysis.

SENT uses Non-negative Matrix Factorization (NMF) to identify topics in the scientific articles related to a collection of genes or their products, and uses these topics to group and summarize the genes. (NMF is a method for finding a low-rank approximation to a matrix using a set of factors, the number of which must be provided beforehand.)

In addition, this application allows users to rank and explore the articles that best relate to the topics found, helping put the analysis results into context.

This approach is useful as an exploratory step in the workflow of interpreting and understanding experimental data, shedding some light on the complex underlying biological mechanisms.

This tool provides a user-friendly interface via a web site, and programmatic access via a SOAP web server.

SENT is an exploratory tool that uses literature analysis to describe a list of genes. The description takes the form of a set of 'Semantic Features' (see below), each consisting of a list of words that suggest a biological concept.

Each gene in the list has a certain level of relation to each semantic feature, and thus, to each biological concept; usually, if the factorization is appropriate, each gene may be identified with one, and only one, semantic feature.

The appropriateness of the factorization depends on the nature of the data and, since the number of factors to produce must be estimated a priori, on how good this estimate is.

There is an analytic measure of how appropriate a factorization is, called the 'Cophenetic Correlation Coefficient' (see below), that can help determine the optimal number of factors to produce, so that the biological concepts are captured more accurately.

Semantic Feature -- The term Semantic Feature was introduced previously in the literature to refer to NMF factors, reflecting the interpretability provided by non-negativity, sparseness and locality.

In SENT the manufacturer does not use this term to refer to the factors themselves, but to a more elaborate selection of terms that is derived from these factors.

SENT runs 10 executions of NMF, instead of just one, and, as a result, has 10 times the specified number of factors. These repeated factors are clustered back together and averaged.

From these factor averages, each semantic feature is selected as the 15 words with the highest score, where a word's score is its value in that factor group divided by its average value across the other factor groups. This scoring favors words that are both strong signals and specific to the semantic feature.
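The word selection described above can be sketched as follows. This is a minimal illustration, not SENT's actual code: the function name `semantic_feature`, the dictionary layout of the averaged factor groups, the toy values, and the small `eps` added to avoid division by zero are all assumptions.

```python
def semantic_feature(group_averages, group, top_n=15, eps=1e-9):
    """Pick the top_n words of `group`, scoring each word as
    (value in this group) / (average value in the other groups).

    group_averages: dict mapping a group name to a dict of word -> averaged weight.
    eps is an assumed guard against division by zero.
    """
    own = group_averages[group]
    others = [g for name, g in group_averages.items() if name != group]
    scores = {}
    for word, value in own.items():
        other_mean = sum(g.get(word, 0.0) for g in others) / len(others)
        scores[word] = value / (other_mean + eps)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy averaged factor groups (hypothetical values, stemmed words).
averages = {
    "g1": {"kinas": 0.9, "cell": 0.8, "apoptosi": 0.1},
    "g2": {"kinas": 0.1, "cell": 0.7, "apoptosi": 0.9},
    "g3": {"kinas": 0.1, "cell": 0.9, "apoptosi": 0.1},
}
top = semantic_feature(averages, "g1", top_n=2)
```

In this toy input "cell" has a high value in every group, so its ratio stays near 1, while "kinas" is strong only in g1 and ranks first.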

Cophenetic Correlation Coefficient -- This coefficient is estimated from the dendrogram used to group the results from the 10 executions of NMF.

High values of this coefficient indicate a high level of agreement across different executions, which indicates stability in the results and, thus, an appropriate factorization, both in terms of the nature of the data and in terms of the selection of the number of factors.
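For readers unfamiliar with the measure, a cophenetic correlation coefficient is the Pearson correlation between the original pairwise distances and the heights at which each pair is first joined in the dendrogram. The sketch below assumes a precomputed distance matrix between factors and average-linkage clustering; neither choice is stated for SENT, and the function names and toy matrices are illustrative only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cophenetic_correlation(dist):
    """dist: symmetric n x n matrix (list of lists) of pairwise distances.
    Builds an average-linkage dendrogram (an assumed linkage) and correlates
    the original distances with the merge heights."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    coph = [[0.0] * n for _ in range(n)]

    def avg_link(a, b):
        total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    next_id = n
    while len(clusters) > 1:
        # Merge the two closest clusters and record the merge height.
        a, b = min(combinations(clusters, 2), key=lambda p: avg_link(*p))
        height = avg_link(a, b)
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1

    pairs = list(combinations(range(n), 2))
    return pearson([dist[i][j] for i, j in pairs],
                   [coph[i][j] for i, j in pairs])


# Two tight, well-separated pairs of factors: a stable structure.
stable = [[0.0, 0.1, 1.0, 1.0],
          [0.1, 0.0, 1.0, 1.0],
          [1.0, 1.0, 0.0, 0.1],
          [1.0, 1.0, 0.1, 0.0]]
# Factors roughly equidistant from each other: a less stable structure.
unstable = [[0.0, 0.5, 0.6, 0.4],
            [0.5, 0.0, 0.45, 0.55],
            [0.6, 0.45, 0.0, 0.5],
            [0.4, 0.55, 0.5, 0.0]]
c_stable = cophenetic_correlation(stable)
c_unstable = cophenetic_correlation(unstable)
```

The cleanly separated matrix yields a coefficient at the maximum of 1, while the more ambiguous one scores lower, which matches the interpretation given above.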

SENT Methodology Overview --

The literature examined consists of titles and abstracts of articles from PubMed that are found to be related to each gene.

The text from these titles and abstracts is converted to a 'bag of words' representation; the terms used in the bag of words are the 'stems' from the words and 'bigrams'.
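A bag of words with bigrams can be sketched as below. This is only an illustration: the tokenizer is an assumed simple regular expression, and stemming is left out because the stemmer SENT uses is not named here, so lowercased tokens stand in for stems.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count single words and bigrams (pairs of adjacent words) in a text.
    Lowercased tokens stand in for stems; no real stemming is applied."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


bag = bag_of_words("Protein kinase activity regulates protein kinase signaling")
```

Here the word "protein" and the bigram "protein kinase" are each counted twice, and word order beyond adjacent pairs is discarded.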

From the whole collection of words and bigrams, only those with the best expressiveness potential are considered; the rest are eliminated from the analysis.

The measure of expressiveness potential is Term Frequency Inverse Document Frequency (TF-IDF).

(TF-IDF is a measure of the usefulness of a word that is commonly used in text mining. It gives importance to words that are very frequent, while penalizing them if they appear in too many documents.

This measure favors words that appear frequently in a small subset of documents, as these are useful for discriminating those documents.)
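The following sketch shows one common variant of term frequency inverse document frequency weighting, raw count multiplied by log(N / document frequency). The exact weighting and the cut-off SENT applies are not specified here, so the formula, the function name `tf_idf` and the toy counts are assumptions.

```python
from math import log


def tf_idf(docs):
    """docs: list of dicts mapping term -> raw count in that document.
    Returns one dict per document mapping term -> count * log(N / df),
    where df is the number of documents containing the term."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [{t: c * log(n / df[t]) for t, c in doc.items()} for doc in docs]


# Toy term counts for three documents (hypothetical values).
docs = [
    {"kinas": 3, "cell": 1},
    {"cell": 2, "apoptosi": 4},
    {"cell": 1},
]
weights = tf_idf(docs)
```

In this toy input "cell" appears in every document and receives a weight of zero, while "kinas" and "apoptosi", each frequent in a single document, receive positive weights.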

The process described above allows the manufacturer to represent the genes in a 'vector space model' using the bag of words derived from their literature.

This vector representation is then processed using NMF to summarize the data into combinations of signals called factors. These factors, when the factorization is appropriate, capture the main topics discussed in the analyzed text, and are easily interpretable.
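To make the factorization step concrete, here is a small pure-Python sketch of NMF using the widely known multiplicative update rules for the squared-error objective. It is not the bioNMF implementation; the update rule, the random initialization, the iteration count, the `eps` guard and the toy matrix are all assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (genes x terms) as W (genes x k)
    times H (k x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Toy gene-by-term matrix with two blocks of co-occurring terms.
v = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 4, 3]]
w1, h1 = nmf(v, k=2, iterations=1)
w, h = nmf(v, k=2, iterations=200)
```

In this layout each row of `w` gives a gene's relation to every factor, and each row of `h` is a factor's word profile. The number of factors `k` must be chosen in advance, as noted earlier.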

The NMF computation itself is sent to the bioNMF SOAP web-server for processing (bioNMF is a web-based tool for Non-negative Matrix Factorization in biology).

SENT Results --

When you examine the results you may find that some groups represent sparsely annotated genes whose articles carry little biological information, such as methodology articles describing microscopy or spectrometry experiments.

This often happens, and you may want to remove those genes from your input list and redo the analysis. Also, some genes may form large groups that deserve their own separate analysis job.

For these reasons, the first time you examine a set of genes it is advised that you use the recommended number of factors and do not turn on the ‘Fine grained analysis’ and ‘Build literature index’ options, since these are computationally expensive. Once you are sure that the input gene list is OK you may try these options.

Each job receives unique identifiers that are used to query the results, either through the web server or through the web site, interchangeably (except that, currently, the web server saves only the latest factorization, whereas the web site saves the results for each factorization).

The main results page shows a series of components:

Clustering Heat-Map - This image is used, along with the Cophenetic Correlation Coefficient, to examine the stability of the factorization.

It shows 10 different factorizations for a given number of factors; if, for instance, 8 factors are selected, 10 executions of an 8-factor factorization will yield 80 factors.

These factors are represented as columns in the heat-map image and, if the factorization is stable, they should cluster cleanly back into as many groups as the number of factors originally specified, each containing roughly 10 factors.

The rows of the image represent genes; they are sorted so that genes that should be assigned to a particular factor group end up together.

While the image shows a gene-factor matrix, the factors are actually clustered together according to their word profiles in order to favor more stable word profiles for the semantic features, which is preferred over a more stable grouping of genes. Each gene is assigned to the group that, on average, contributes most to that gene’s literature.
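The gene assignment rule can be illustrated with a short sketch. The data layout (one weight per factor column for each gene, plus a group label per column), the function name `assign_genes` and the toy values are assumptions made for the example, not SENT's internal representation.

```python
def assign_genes(gene_factor, factor_groups):
    """Assign each gene to the factor group with the highest average weight.

    gene_factor: dict mapping gene -> list of weights, one per factor column.
    factor_groups: list giving the group label of each factor column.
    """
    assignment = {}
    for gene, weights in gene_factor.items():
        totals, counts = {}, {}
        for weight, group in zip(weights, factor_groups):
            totals[group] = totals.get(group, 0.0) + weight
            counts[group] = counts.get(group, 0) + 1
        assignment[gene] = max(totals, key=lambda g: totals[g] / counts[g])
    return assignment


# Four factor columns clustered into two groups (hypothetical values).
groups = ["A", "A", "B", "B"]
gene_factor = {
    "gene1": [0.9, 0.8, 0.1, 0.2],
    "gene2": [0.1, 0.0, 0.7, 0.6],
}
assignment = assign_genes(gene_factor, groups)
```

Averaging within each group, rather than taking the single largest column, keeps one unusual NMF execution from deciding where a gene ends up.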

Job Overview and Factorizations - Contains general information about the job: genes analyzed, range of factors examined and their Cophenetic Correlation Coefficients, and number of articles used in the analysis.

It also provides links for downloading the data matrices that make up the results of the factorization being shown, in the form in which they would be retrieved from the Web Service.

Groups - These groups are formed from the factorization and include a list of words that compose a Semantic Feature, as well as the genes that have that Semantic Feature as their most representative one.

Gene Details - This page shows a Gene Ontology (GO) term enrichment analysis, performed by GeneCodis, and details for the list of genes. (GeneCodis is a grid-based tool that integrates different sources of biological information to search for biological features (annotations) that frequently co-occur in a set of genes, and ranks them by statistical significance.)

There is a page like this for each group and one for the complete job.

Literature Examination - Each word of a semantic feature may be used to query the list of articles associated with that group to establish a ranking of relevance. To do this the server must first have computed the literature index.

This is done once for every job and is maintained across factorizations.

There is a search tool that allows you to perform custom queries to sort the list of articles. The default rankings use the list of terms for the Semantic Feature as a query.

Note: A very good example of the Results page(s) is accessible at the bottom of the manufacturer's Quick Start Guide page, located on their web site.

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site SENT

Price Contact manufacturer.

G6G Abstract Number 20497

G6G Manufacturer Number 104118