Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract WordStat is a text mining and content analysis module specifically designed to study textual information such as responses to open-ended questions, interviews, titles, journal articles, public speeches, and electronic communications.

Whether you need a text mining tool for fast extraction of themes and trends or careful, precise measurement with a state-of-the-art quantitative content analysis method, WordStat combines both approaches in flexible, easy-to-use text analysis software.

The product's seamless integration with Simstat, a statistical data analysis tool, and QDA Miner (see G6G Abstract Number 20168R), qualitative data analysis software, gives you the flexibility to analyze text and relate its content to structured information, including numerical and categorical data.

WordStat can be used for information extraction and knowledge discovery from incident reports, customer complaints, and messages; analysis of news coverage or scientific literature; taxonomy development and validation; fraud detection; authorship attribution; patent analysis; etc.

Sample case studies using WordStat:

1) Mining Microarray Expression Data by Literature Profiling -- Authors: Damien Chaussabel and Alan Sher (Laboratory of Parasitic Diseases, National Institute of Allergy and Infectious Diseases, National Institutes of Health).

Description: The authors developed a mining technique based on the analysis of literature profiles generated by extracting the frequencies of certain terms from thousands of abstracts stored in the Medline literature database.

Terms are then filtered, on the basis of both repetitive occurrence and co-occurrence among multiple gene entries. Finally, clustering analysis is performed on the retained frequency values, shaping a coherent picture of the functional relationship among large and heterogeneous lists of genes.

Such data treatment also provides information on the nature and pertinence of the associations that were formed. The analysis of patterns of term occurrence in abstracts constitutes a means of exploring the biological significance of large and heterogeneous lists of genes.

This approach should contribute to optimizing the exploitation of microarray technologies by providing investigators with an interface between complex expression data and large literature resources.
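The profiling steps described above (term frequencies per gene, filtering by repeated occurrence and co-occurrence across genes, then clustering) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the gene names, abstracts, thresholds, and function names are hypothetical, and a simple single-linkage grouping on cosine similarity stands in for the clustering analysis used in the study.

```python
from collections import Counter
import math
import re

# Hypothetical toy data: gene identifier -> abstracts mentioning that gene.
abstracts_by_gene = {
    "GENE_A": ["Interferon signaling induces antiviral response.",
               "Antiviral interferon response in macrophages."],
    "GENE_B": ["Interferon induces antiviral genes.",
               "Macrophages show antiviral interferon activity."],
    "GENE_C": ["Ribosome assembly and translation.",
               "Translation requires ribosome biogenesis."],
    "GENE_D": ["Ribosome translation elongation.",
               "Ribosome subunit in translation."],
}

def term_profile(abstracts):
    """Fraction of a gene's abstracts that contain each term."""
    counts = Counter()
    for text in abstracts:
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    return {term: n / len(abstracts) for term, n in counts.items()}

def build_profiles(abstracts_by_gene, min_freq=0.75, min_genes=2):
    """Keep terms that recur (>= min_freq) for at least min_genes genes."""
    profiles = {g: term_profile(a) for g, a in abstracts_by_gene.items()}
    gene_hits = Counter()
    for profile in profiles.values():
        gene_hits.update(t for t, f in profile.items() if f >= min_freq)
    kept = sorted(t for t, n in gene_hits.items() if n >= min_genes)
    vectors = {g: [p.get(t, 0.0) for t in kept] for g, p in profiles.items()}
    return kept, vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.5):
    """Single-linkage grouping: a gene joins every cluster it is similar to."""
    clusters = []
    for gene in sorted(vectors):
        linked = [c for c in clusters
                  if any(cosine(vectors[gene], vectors[m]) >= threshold for m in c)]
        merged = sorted([gene] + [m for c in linked for m in c])
        clusters = [c for c in clusters if c not in linked] + [merged]
    return sorted(clusters)

kept_terms, gene_vectors = build_profiles(abstracts_by_gene)
gene_clusters = cluster(gene_vectors)
```

With this toy input, the retained terms are the ones that appear in every abstract of at least two genes, and genes sharing those terms end up grouped together.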

Full reference: Chaussabel, D., & Sher, A. (2001). Mining microarray expression data by literature profiling. Genome Biology, 3, 1-55.

2) Searching for Clinical Prediction Rules in Medical Literature Analysis and Retrieval System Online (Medline) --

Authors: Bette Jean Ingui and Mary A. M. Rogers (Upstate Medical University, Syracuse, New York).

Full reference: Ingui, B.J., & Rogers, M.A. (2001). Searching for clinical prediction rules in MEDLINE. Journal of the American Medical Informatics Association, 8, 391-397.

WordStat key features/capabilities include:

1) Integrated text mining analysis and visualization tools (clustering, multi-dimensional scaling, heat-maps, correspondence analysis).

2) Hierarchical categorization dictionary or taxonomy supporting words, word patterns, phrases and proximity rules.

3) Vocabulary and phrase finder for extraction of technical terms, recurring ideas and themes.

4) Keyword-in-context and keyword retrieval tools for easy identification of relevant text segments.

5) Machine learning algorithms for automatic document classification (Naive Bayes and K-Nearest Neighbors) with automatic feature selection and validation tools.

6) Import of documents and export of data, tables, and graphs in industry-standard formats.
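As an illustration of the keyword-in-context idea listed above, the following minimal sketch shows each keyword hit with a few words of surrounding text. It is a generic example, not WordStat's own implementation; the function name `kwic`, the `width` parameter, and the sample text are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each keyword hit."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample text.
sample = "The customer reported a billing error. Another billing complaint followed."
for left, match, right in kwic(sample, "billing", width=2):
    print(f"{left:>20} [{match}] {right}")
```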

Additional features/capabilities include:

Automated Text Classification --

1) Flexible feature selection for automatic selection of best subsets of attributes.

2) Numerous validation methods (leave-one-out, n-fold cross-validation, split sample).

3) Experimentation module allows easy comparison of predictive models and fine-tuning of classification models.

4) Classification models may be saved to disk and applied later using a stand-alone document classification utility program, a command line program or a programming library.

Note: The command line program and the programming library are part of the WordStat Software Developer's Kit (SDK), which is sold separately.
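To make the classification and validation features above more concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier evaluated with n-fold cross-validation. It does not use WordStat or its SDK; the function names, the add-one smoothing, and the toy documents are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict
import math
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_nb(docs):
    """docs: list of (text, label). Returns (class counts, word counts, vocabulary)."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(docs, n_folds):
    """Accuracy over n folds; every document is tested exactly once."""
    correct = 0
    for k in range(n_folds):
        test = docs[k::n_folds]
        train = [d for i, d in enumerate(docs) if i % n_folds != k]
        model = train_nb(train)
        correct += sum(predict_nb(model, text) == label for text, label in test)
    return correct / len(docs)

# Hypothetical labeled documents.
docs = [
    ("refund delayed and billing error", "complaint"),
    ("great service and friendly staff", "praise"),
    ("billing error on my invoice", "complaint"),
    ("friendly staff great experience", "praise"),
    ("delayed refund billing problem", "complaint"),
    ("great friendly helpful service", "praise"),
]
accuracy = cross_validate(docs, n_folds=3)
```

Setting `n_folds` to the number of documents holds out one document at a time, while a split-sample check would train on one fixed portion and test on the rest.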

A new version of WordStat, WordStat 6.1 (as of August 30, 2010), introduces several improvements, such as:

1) A new multilingual user interface (English, French and Spanish);

2) Improved linguistic support with integrated dictionaries and thesauruses for five languages (English, French, Spanish, German and Portuguese) to assist the development of taxonomies and content analysis dictionaries;

3) A 50% improvement over its predecessor in processing speed, allowing one to analyze up to 30 million words per minute.

System Requirements


Manufacturer Web Site WordStat

Price Contact manufacturer.

G6G Abstract Number 20169R

G6G Manufacturer Number 102232