GenCLiP (Genes Cluster with Literature Profiles)

Category Cross-Omics>Pathway Analysis/Gene Regulatory Networks/Tools and Cross-Omics>Data/Text Mining Systems/Tools

Abstract GenCLiP (Genes Cluster with Literature Profiles) is a software program for clustering gene lists by ‘literature profiling’ and constructing ‘gene co-occurrence networks’ related to custom keywords.

GenCLiP is based on the method of Chaussabel and Sher (Mining microarray expression data by literature profiling; Genome Biology 2002, 3: research0055.1-0055.16) that can retrieve ‘gene-related literature’ from PubMed, extract keywords from this literature, provide the user with an interface with which to curate these keywords, and then cluster the genes with keywords.

GenCLiP can also export a group of ‘negative control genes’ and a group of ‘positive control genes’ so that their cluster results can be compared with those of the analyzed genes. From this comparison, one can quickly acquire a primary profile of pathogenesis for a disease and of the genes already known to be related to it.

GenCLiP can further search the list of genes and the ‘gene co-occurrence networks’ related to the specified keywords and/or certain genes. Random simulations are done to verify whether the analyzed ‘gene list’ is related to the specified keywords, and/or engaged in the same networks related to the ‘specified keywords’.

Thus, GenCLiP highlights the potential ‘disease-related genes’ or pathways to be verified by experiment.

GenCLiP's workflow --

First, a group of ‘positive control genes’ and a group of ‘negative control genes’ are generated based on the imported ‘gene list’. Then the literature pertaining to each gene of these three (3) groups of genes is retrieved from PubMed.

After that, the keywords related to each group of genes are auto-extracted from the literature. The keywords can be manually curated. Then, each group of genes is clustered with the keywords, and the ‘gene co-occurrence networks’ can be constructed among each group of genes based on certain keywords.

After that, the cluster results and the ‘gene networks’ are compared against the three (3) groups of genes, and the user can select the keywords that are more related to the positive control genes and the analyzed genes, compared to the negative control genes, to construct gene networks.

Once the analyzed genes and the positive control genes are found to contain more genes (or more complex ‘gene networks’) related to certain keywords than the negative control genes, 10,000 or more random simulations are done to decide whether this difference could have arisen by chance.

Thus, an inference can be obtained for further experimental verification.

GenCLiP Generation of controls --

To generate the negative control genes, the full gene set from which the analyzed genes are derived is used to generate a group of genes randomly. To generate the positive control genes, the full gene set and ‘certain keywords’ are first used to search the database (e.g. PubMed or Entrez Gene) for all known genes related to the ‘specified keywords’.

Then, the known-related ‘gene set’ is used to generate a group of genes randomly.

Note: The analyzed genes, the positive control genes, and the negative control genes should all contain the same number of genes, and the average number of articles per gene should be comparable across the three groups.
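The random selection of controls described above can be sketched as follows. This is a minimal illustration, not GenCLiP's own code: the function name `make_controls` and its parameters are hypothetical, `known_related` stands for the result of the keyword search against the database, and the sketch does not attempt to match the average number of articles per gene.

```python
import random

def make_controls(analyzed, full_gene_set, known_related, seed=None):
    """Draw negative and positive control groups the same size as `analyzed`.

    Hypothetical helper: the names and the seeding are assumptions made
    for illustration only.
    """
    rng = random.Random(seed)
    n = len(analyzed)
    # Negative controls: genes picked at random from the full gene set.
    negative = rng.sample(sorted(full_gene_set), n)
    # Positive controls: genes picked at random from the known-related set.
    positive = rng.sample(sorted(known_related), n)
    return negative, positive
```

Note that `random.sample` raises `ValueError` when a pool holds fewer than `n` genes, so the known-related set must be at least as large as the analyzed list.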

GenCLiP Literature retrieval --

To retrieve literature pertaining to each gene of the three (3) groups of genes, the NCBI E-utilities web services ESearch and EFetch are used to access the PubMed database for its description. The user can provide either the gene symbol directly or a gene ID (HUGO, Entrez, or UniGene).
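As a rough illustration of the two web services, the sketch below only builds the request URLs and performs no network access. The endpoint paths and the `db`, `term`, `retmax`, `id`, `rettype` and `retmode` parameters are standard E-utilities parameters; the helper names, the default `retmax`, and the use of the bare gene symbol as the search term are assumptions, since the exact query GenCLiP sends is not stated here.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(gene_symbol, retmax=100):
    """Build an ESearch URL that lists PubMed IDs matching a gene symbol."""
    # Assumption: the gene symbol alone is used as the search term.
    params = {"db": "pubmed", "term": gene_symbol, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def efetch_url(pmids):
    """Build an EFetch URL that returns plain-text abstracts for PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

A caller would fetch the ESearch URL first, read the returned PubMed IDs, and pass them to `efetch_url` to obtain titles and abstracts.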

If an alternative gene ID is available, it can be converted to the appropriate input form using MatchMiner (a set of tools that enables the user to translate between disparate IDs for the same gene).

Each gene's literature is saved in a ‘text file’ with the gene's official symbol as the ‘file name’. To resolve the ambiguity of gene names, including synonyms (different names for the same gene) and homonyms (different genes or unrelated concepts with the same name), GenCLiP uses a ‘human gene thesaurus’ that collects all the aliases for each ‘gene name’ from the HUGO Nomenclature Committee database and the Entrez Gene database.
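One way to picture such a thesaurus is as a lookup from every alias to the official symbols that use it. The sketch below is an assumption about the data structure, not a description of GenCLiP's internals; the function names are hypothetical, and returning `None` for an alias shared by several genes is one simple way to flag a homonym.

```python
def build_alias_index(thesaurus):
    """Map each upper-cased name to the set of official symbols that use it.

    `thesaurus` is assumed to be {official_symbol: [alias, ...]}.
    """
    index = {}
    for symbol, aliases in thesaurus.items():
        for name in {symbol, *aliases}:
            index.setdefault(name.upper(), set()).add(symbol)
    return index

def resolve(name, index):
    """Return the official symbol, or None if unknown or ambiguous."""
    symbols = index.get(name.upper(), set())
    # Exactly one candidate: a synonym. Several candidates: a homonym.
    return next(iter(symbols)) if len(symbols) == 1 else None
```

With the toy entries `{"GENEA": ["ALPHA", "X1"], "GENEB": ["BETA", "X1"]}` (made-up names), `resolve("alpha", index)` yields `"GENEA"`, while the shared alias `"X1"` is reported as ambiguous.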

GenCLiP Auto-extraction of keywords --

Auto-extraction of keywords is performed for the ‘description’. Briefly, terms are first extracted from article titles and abstracts, and the occurrence of each term (the number of articles containing the term divided by the total number of articles) is calculated for each gene.

The terms are then filtered systematically using several criteria. Only the terms that pass through this filter for at least two (2) of the analyzed genes are retained. These retained terms are considered keywords.
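The occurrence calculation and the "at least two genes" rule can be sketched as below. The filtering criteria themselves are not listed here, so the single `min_occurrence` cutoff is a placeholder assumption, as are the function names and the simple lowercase word tokenizer.

```python
import re
from collections import Counter

def term_occurrences(abstracts):
    """Fraction of one gene's articles that contain each term."""
    counts = Counter()
    for text in abstracts:
        # Count each term at most once per article.
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    total = len(abstracts)
    return {term: n / total for term, n in counts.items()} if total else {}

def extract_keywords(literature, min_occurrence=0.25, min_genes=2):
    """Keep terms that pass the cutoff for at least `min_genes` genes.

    `literature` is assumed to be {gene: [title_and_abstract_text, ...]}.
    The cutoff stands in for the unspecified filtering criteria.
    """
    passing = Counter()
    for abstracts in literature.values():
        occ = term_occurrences(abstracts)
        passing.update(t for t, v in occ.items() if v >= min_occurrence)
    return sorted(t for t, n in passing.items() if n >= min_genes)
```

A term that is frequent for only one gene is dropped, which matches the requirement that a keyword be retained for at least two (2) of the analyzed genes.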

GenCLiP Manual curation of keywords --

Keywords can be manually curated. The user can remove unrelated keywords and add relevant keywords (single terms or phrases), set a weight for keywords that are perceived as more important than others, define certain keywords as one synonym entity, and determine which keywords have singular/plural forms.

GenCLiP Clustering analysis --

Clustering analysis is performed for the description. Briefly, occurrences of all keywords for each gene pass through the following processes:

1) The occurrence of each keyword in its singular and plural forms is averaged into one unique occurrence;

2) Each occurrence is multiplied by its weight;

3) Occurrences of synonyms are averaged into one unique occurrence, and each synonym entity is represented by a keyword.

An ‘array file’ is then generated and used to do ‘clustering analysis’ with the ‘average linkage hierarchical clustering’ algorithm for the description.
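The three preprocessing steps above can be sketched for a single gene as follows. This covers only the preparation of one row of the array file, not the clustering itself; the function name, the argument layout, and the choice to store the singular/plural average under the singular form are assumptions.

```python
def profile_row(occ, plural_pairs, weights, synonym_groups):
    """Apply steps 1 to 3 to one gene's {keyword: occurrence} mapping.

    plural_pairs:   [(singular, plural), ...]
    weights:        {keyword: weight}, default weight 1.0
    synonym_groups: {representative: [member, ...]} (members include
                    the representative keyword itself)
    """
    row = dict(occ)
    # 1) Average singular and plural forms into one occurrence.
    for singular, plural in plural_pairs:
        row[singular] = (row.pop(singular, 0.0) + row.pop(plural, 0.0)) / 2
    # 2) Multiply each occurrence by its weight.
    row = {k: v * weights.get(k, 1.0) for k, v in row.items()}
    # 3) Average each synonym group into its representative keyword.
    for representative, members in synonym_groups.items():
        values = [row.pop(m, 0.0) for m in members]
        row[representative] = sum(values) / len(values)
    return row
```

Stacking one such row per gene gives a gene-by-keyword matrix, which is the kind of input an average linkage hierarchical clustering routine expects.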

This file can also be used for clustering analysis with publicly available software, such as Cluster 3.0 (Cluster 3.0 is an enhanced version of Cluster, which was originally developed by Michael Eisen while at Stanford University) and SpotFire.

GenCLiP Network construction --

Gene co-occurrences are searched from the literature that contains certain keywords. The Neato program in the WinGraphviz software is used to create a two-dimensional layout.
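A minimal sketch of this step, assuming a naive exact-token match for gene symbols and a plain substring match for the keyword: it counts gene pairs that appear together in keyword-containing articles and writes them out as an undirected graph in the DOT language, which Neato can lay out. The function names are hypothetical, and real gene-name recognition would need the thesaurus handling described earlier.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_edges(articles, genes, keyword):
    """Count gene pairs mentioned together in articles containing `keyword`."""
    edges = Counter()
    for text in articles:
        lowered = text.lower()
        if keyword.lower() not in lowered:
            continue  # Only articles that mention the keyword are used.
        words = set(re.findall(r"[a-z0-9]+", lowered))
        present = sorted(g for g in genes if g.lower() in words)
        edges.update(combinations(present, 2))
    return edges

def to_dot(edges):
    """Serialize the edge counts as an undirected DOT graph."""
    lines = ["graph cooccurrence {"]
    for (a, b), n in sorted(edges.items()):
        lines.append(f'  "{a}" -- "{b}" [label="{n}"];')
    lines.append("}")
    return "\n".join(lines)
```

Saving the DOT text to a file and running a Graphviz command such as `neato -Tpng network.dot -o network.png` would produce a two-dimensional layout.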

[WinGraphviz - Graphviz is a project of AT&T Labs Research. It provides a collection of tools (including Neato) for manipulating graph structures and generating graph layouts. WinGraphviz is free software based on the Graphviz project.

It can render the DOT language to common image formats. It is also a Windows COM object, so it can be used in a Windows application or ASP service without a UNIX server.]

GenCLiP Random simulation --

Random simulation is performed in two (2) steps.

First, each gene of the full ‘gene set’ is used to search PubMed for whether its literature mentions certain keywords, and the resulting PubMed IDs are recorded.

Second, for each simulation, as many genes as there are analyzed genes are randomly picked from the full gene set, and the number of genes (and then gene pairs, i.e. two genes sharing the same PubMed ID) related to the specified keywords is counted.

The average number of articles per gene for the randomly picked genes should be comparable with that of the ‘analyzed genes’.

After 10,000 or more random simulations, the gene relatedness can be inferred to be non-random if two conditions hold: the distribution of the number of ‘related genes’ (or gene pairs) is similar to the expected ‘normal distribution’, and the probability that a set of ‘randomly picked genes’ contains at least as many related genes (or gene pairs) as the ‘analyzed genes’ is less than 0.05 (i.e. P<0.05).
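The probability in this test can be estimated empirically, as in the sketch below. It handles only the count of related genes; gene pairs, the check against a normal distribution, and the matching of articles per gene are left out. The function name and parameters are hypothetical, and `related` stands for the set of genes whose literature mentions the specified keywords.

```python
import random

def simulate_p_value(analyzed, full_gene_set, related, n_sim=10000, seed=None):
    """Estimate P(random gene set has >= as many related genes as `analyzed`)."""
    rng = random.Random(seed)
    pool = sorted(full_gene_set)
    observed = sum(g in related for g in analyzed)
    hits = 0
    for _ in range(n_sim):
        # Pick as many genes as were analyzed, without replacement.
        sample = rng.sample(pool, len(analyzed))
        if sum(g in related for g in sample) >= observed:
            hits += 1
    return hits / n_sim
```

A returned value below 0.05 corresponds to the P<0.05 criterion; with 10,000 simulations the smallest non-zero estimate is 0.0001.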

GenCLiP Literature display --

The literature containing certain genes and keywords can be searched and displayed with the genes and the keywords coded using different colors.

System Requirements

Contact manufacturer.


Manufacturer Web Site Southern Medical University Cancer Institute

Price Contact manufacturer.

G6G Abstract Number 20571

G6G Manufacturer Number 10477