InforSense TextSense
Category Cross-Omics>Data/Text Mining Systems/Tools
Abstract InforSense TextSense, built on the InforSense Platform, provides a wide range of text processing, analytics and visualization components. It allows scientists, informaticians and analysts in any discipline to rapidly create, execute and deploy applications that bring literature analytics into their work. The TextSense analytics components cover the most widely used text analysis functions. Each component has an intuitive visual interface for setting its parameters and can be composed with other components into an analytic workflow to create complete, problem-solving applications. The components are designed to integrate easily with the wide variety of InforSense analytical components so that applications for predictive modeling, structured data analysis, bioinformatics and cheminformatics can all be combined with text analytics.
With many applications in a large number of sectors, InforSense TextSense allows organizations to discover new concepts and relationships in large digital collections of textual data, including scientific papers, patents, business reports, laboratory reports, web pages and warranty records.
Note 1: Gene Expression Case Study - Gene expression profiling is widely used for target discovery in the drug development process. Such experiments result in a list of differentially expressed genes that the analyst will wish to investigate further. One information source that can be leveraged for this is the published scientific literature. Text analytics can be used to answer specific questions about the genes: Are there direct or indirect relationships between these genes and the disease under study? In which biological processes are these genes involved? In which biological pathways are these genes involved? In this way, hypotheses based on the experimental results may be supported or contradicted by information extracted from the published scientific literature.
InforSense TextSense Key Features include:
Import-Export --
InforSense TextSense provides a wide range of components for importing documents and other resources. These include:
- 1) Document import from database, file, compressed archive and web sources (including PubMed).
- 2) Format conversion from Portable Document Format (PDF), PostScript, MS Word, Rich Text Format (RTF) and MS PowerPoint.
- 3) Import of lists, thesauri, taxonomies and ontologies.
XML --
InforSense TextSense supports a wide range of processing operations on Extensible Markup Language (XML) documents. These include:
- 1) Parsing using XML Path Language (XPath) and XML Query (XQuery) (including very large and badly formed XML documents).
- 2) Transforming using Extensible Stylesheet Language (XSL).
- 3) Conversion between tabulated and XML data.
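The XML operations listed above can be illustrated with a minimal sketch in Python's standard library. This is not the TextSense API; the sample document, its element names and the helper functions xml_to_rows and rows_to_xml are all hypothetical, chosen only to show XPath-style selection and conversion between tabulated and XML data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative only.
XML_TEXT = """
<articles>
  <article pmid="1001"><title>BRCA1 and DNA repair</title><year>2005</year></article>
  <article pmid="1002"><title>TP53 in apoptosis</title><year>2006</year></article>
</articles>
""".strip()

def xml_to_rows(xml_text):
    """Flatten each <article> element into a dict (XML -> tabulated data)."""
    root = ET.fromstring(xml_text)
    rows = []
    # ElementTree supports a limited subset of XPath in findall().
    for article in root.findall("./article"):
        rows.append({
            "pmid": article.get("pmid"),
            "title": article.findtext("title"),
            "year": article.findtext("year"),
        })
    return rows

def rows_to_xml(rows):
    """Rebuild an <articles> tree from a list of dicts (tabulated data -> XML)."""
    root = ET.Element("articles")
    for row in rows:
        article = ET.SubElement(root, "article", pmid=row["pmid"])
        ET.SubElement(article, "title").text = row["title"]
        ET.SubElement(article, "year").text = row["year"]
    return root

rows = xml_to_rows(XML_TEXT)
rebuilt = rows_to_xml(rows)
# XPath predicate: select articles whose <year> child has the text "2006".
titles_2006 = [a.findtext("title") for a in rebuilt.findall("./article[year='2006']")]
```

Very large or badly formed XML documents, which the product is stated to handle, would need a streaming or fault-tolerant parser and are outside the scope of this sketch.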
Preprocessing --
InforSense TextSense supports a wide range of preprocessing components for parsing, cleaning and normalizing text. This includes operations for:
- 1) Parsing semi-structured text formats [XML, HyperText Markup Language (HTML), RIS (file format), Medical Literature Analysis and Retrieval System Online (Medline), etc.].
- 2) Splitting and grouping documents.
- 3) Stemming text.
- 4) Removal, extraction and replacement of text from documents.
- 5) Document filtering.
Annotation --
InforSense TextSense implements a generic architecture for tagging features within documents. Supported operations include:
- 1) Natural language processing (NLP) to annotate parts-of-speech and phrases.
- 2) Structural annotation (tokens, sentences, paragraphs, etc.).
- 3) Entity extraction using lists, thesauri, taxonomies and ontologies.
- 4) Syntactic annotation based on regular expressions.
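As a rough illustration of the annotation operations above, the following Python sketch tags tokens, list-based entities and a regular-expression pattern as character-offset annotations. It is an assumption-laden example, not the TextSense architecture: the gene list, the annotation tuple layout and the annotate function are hypothetical.

```python
import re

# Hypothetical gene list standing in for a thesaurus, taxonomy or ontology.
GENE_TERMS = {"BRCA1", "TP53"}

def annotate(text):
    """Return annotations as (start, end, type, value) tuples, sorted by offset."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        # Structural annotation: every word-like run becomes a token.
        annotations.append((match.start(), match.end(), "token", match.group()))
        # Entity extraction: exact lookup of the token in the term list.
        if match.group() in GENE_TERMS:
            annotations.append((match.start(), match.end(), "gene", match.group()))
    # Syntactic annotation: a regular expression that matches percentages.
    for match in re.finditer(r"\d+(?:\.\d+)?%", text):
        annotations.append((match.start(), match.end(), "percentage", match.group()))
    return sorted(annotations)

text = "BRCA1 expression fell by 40% while TP53 was unchanged."
annotations = annotate(text)
genes = [value for _, _, kind, value in annotations if kind == "gene"]
percentages = [value for _, _, kind, value in annotations if kind == "percentage"]
```

Keeping annotations as offsets over the unchanged text lets several annotation types overlap on the same span, which is one plausible reading of a "generic architecture for tagging features".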
Statistical Analysis --
Statistical analysis can be used to find trends and patterns within a document collection or it can be used to transform documents into the feature vector space in preparation for document categorization. Statistical analysis components include:
- 1) Calculation of feature statistics within and across document classes.
- 2) Feature vector generation and processing.
- 3) Document similarity calculations.
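The idea of transforming documents into a feature vector space and comparing them can be sketched as follows. This example assumes simple term-frequency vectors and cosine similarity; the source does not say which weighting or similarity measure TextSense uses, and the function names are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Build a sparse term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = term_vector("gene expression")
doc_b = term_vector("gene pathway")
similarity = cosine_similarity(doc_a, doc_b)
```

The same vectors could then be passed to a clustering or classification algorithm for document categorization.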
Document Categorization --
Once a document has been transformed into the feature vector space, traditional classification and clustering algorithms may be applied to categorize the document. These components include:
- 1) Document clustering (Hierarchical, K-Means and Expectation Maximization algorithms).
- 2) Document cluster labeling.
- 3) Document classification [Naïve Bayesian, Support Vector Machines (SVM), Decision Tree, Decision Rules and Neural Net (NN) algorithms].
Information Extraction --
Information in the form of relationships that exist between features in documents can be uncovered using one of the following components:
- 1) Co-occurrence finds cases where two or more features occur within a given window of text.
- 2) The association rule algorithm is a statistical method for finding and evaluating relationships between features in text.
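A minimal sketch of the co-occurrence idea, assuming the window is measured in tokens (the source does not specify whether the window is tokens, sentences or characters). The cooccurrences function and the sample feature set are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(tokens, features, window=5):
    """Count unordered pairs of distinct features found fewer than `window` tokens apart."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in features]
    counts = Counter()
    # positions is in increasing index order, so j > i for every pair.
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i < window:
            counts[tuple(sorted((a, b)))] += 1
    return counts

tokens = "BRCA1 interacts with RAD51 during repair while TP53 regulates apoptosis".split()
pairs = cooccurrences(tokens, {"BRCA1", "RAD51", "TP53"}, window=4)
```

Pair counts like these are the kind of input a heatmap or an association rule analysis could then evaluate statistically.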
Visualization --
InforSense TextSense adds to the suite of InforSense visualizers. The additional visualizers are:
- 1) A document viewer for reading document text. It includes a hierarchical feature browser, document cluster browser, interactive search and feature highlighting.
- 2) A heatmap viewer for exploring feature associations and co-occurrences.
- 3) Export of feature relationship data to the open-source Cytoscape network visualization tool (see G6G Abstract Number 20092).
Indexing --
Document collections may be indexed for rapid interactive querying. Supported operations include:
- 1) Flexible index creation.
- 2) Index searching with queries combining plain text and annotated features.
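Queries that combine plain text with annotated features can be pictured with a small inverted index. This is only an illustrative sketch under assumed data structures; build_index, search and the document layout are hypothetical and do not describe how TextSense stores its indexes.

```python
from collections import defaultdict

def build_index(documents):
    """Index documents given as {doc_id: (text, set_of_annotated_features)}."""
    index = defaultdict(set)
    for doc_id, (text, features) in documents.items():
        for word in text.lower().split():
            index[("text", word)].add(doc_id)
        for feature in features:
            index[("feature", feature)].add(doc_id)
    return index

def search(index, words=(), features=()):
    """Return ids of documents matching every given word AND every given feature."""
    keys = [("text", w.lower()) for w in words] + [("feature", f) for f in features]
    if not keys:
        return set()
    result = set(index.get(keys[0], set()))
    for key in keys[1:]:
        result &= index.get(key, set())
    return result

documents = {
    "d1": ("BRCA1 mutation in breast cancer", {"gene:BRCA1"}),
    "d2": ("TP53 mutation in lung cancer", {"gene:TP53"}),
}
index = build_index(documents)
hits = search(index, words=["mutation"], features=["gene:TP53"])
```

Namespacing the keys as ("text", ...) and ("feature", ...) keeps a plain word from colliding with an annotation that happens to share its spelling.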
Oracle Text --
InforSense Oracle Edition extends InforSense by allowing components within an InforSense application to be executed within Oracle, without the overhead of data transfer to and from the database. TextSense’s Oracle Text nodes add to this by providing Oracle’s Oracle Text functionality in the same framework.
This package includes components that perform the following in-database processing:
- 1) Classification and clustering.
- 2) Gist and theme extraction.
- 3) Search.
Extensibility --
The TextSense range of components can be extended to add extra functionality, including:
- 1) Rapid integration of new methods and third party tools via the InforSense Software Development Kit (SDK).
- 2) Simple-to-use Application Programming Interface (API) for the TextSense data formats.
Additionally for Scientists --
Create and deliver true cross-domain applications by using InforSense BioSense (see G6G Abstract Number 20033) and ChemSense to access additional analytical components for biology and chemistry, combining our components, your internal tools, and third-party software and data stores in one analytical workflow solution.
Note 2: Ontology Tagging Case Study - Ontologies are structured vocabularies that are used to describe knowledge (entities and the relationships between them) within a given domain. Well-known ontologies include the Gene Ontology in the biomedical domain and the Derwent World Patents Index Codes in the intellectual property domain. Analysts find it much easier to locate relevant literature if it has been categorized against an ontology that describes their domain. This is often done manually; for instance, various biomedical databases curate scientific papers according to the Gene Ontology (GO) concepts to which they refer. Using text analytics and machine learning techniques, documents from any source can be automatically categorized against any ontology. For instance, as well as accessing patents manually categorized by Derwent World Patents Index Codes, an analyst could also access Medline abstracts categorized by the same codes, which may contain useful information associated with the patents.
System Requirements
The InforSense Platform is based on the Java J2EE architecture and has been validated on a wide range of operating system environments. Currently supported platforms include:
- 1) Client: Microsoft Windows 2000/XP and Mac OS X Tiger (PowerPC architecture);
- 2) Server: Microsoft Windows 2000/XP, Mac OS X Tiger (PowerPC architecture), Linux (Intel architecture), Solaris 8 (SPARC architecture).
Manufacturer
- InforSense Limited
- Colet Court, 100 Hammersmith Road
- London, W6 7JP
- United Kingdom
- Tel.: +44 (0) 20 8237 8440
- Fax: +44 (0) 20 8237 8441
- information@inforsense.com
Manufacturer Web Site InforSense Limited
Price Contact manufacturer.
G6G Abstract Number 20034
G6G Manufacturer Number 101430




