CAESAR
Category Genomics>Genetic Data Analysis/Tools
Abstract CAESAR (CAndidatE Search And Rank) is a computational system that ranks all annotated human genes as candidates for a complex trait by using ontologies to semantically map natural language descriptions of the trait to a variety of gene-centric information sources.
This approach can be applied to any well-documented mono- or multifactorial trait in any organism for which an annotated gene set exists.
Applications include selection of candidate genes for association or resequencing studies, prioritization of candidates for functional genomics experiments, or evaluation of results from linkage and genome-wide association studies.
CAESAR exploits the knowledge of complex traits in the literature by using ontologies to semantically map the trait information to gene- and protein-centric information from several different public data sources, including tissue-specific gene expression, conserved protein domains, protein-protein interactions, metabolic pathways and the mutant phenotypes of homologous genes.
CAESAR uses four (4) possible methods of integration to combine the results of data searches into a prioritized candidate gene list.
In effect, CAESAR mimics the steps a researcher would undertake in selecting candidate genes, albeit faster, potentially more thoroughly, and in a more quantitative manner.
CAESAR represents a novel selection strategy in that it combines text and data mining to associate genetic information with extracted trait knowledge in order to prioritize candidate genes.
CAESAR is ultimately designed for traits in which the relevant biological processes may not be well understood and potentially hundreds of reasonable candidate genes exist.
The potential benefits to a researcher in adopting a computational approach to gene selection such as CAESAR include the ability to quickly and systematically process several hundred thousand biological annotations, many of which require highly specialized domain expertise to interpret.
This benefit will continue to grow in importance as the volume and technical detail of annotation data increases.
Relevant gene annotations can easily escape human consideration due to biases that investigators bring to the task of prioritization and that are difficult to overcome even by conscious effort.
Systematic coverage of the annotations is particularly valuable for complex traits, which may be affected by a wide array of biological processes, some of which may not have been directly implicated by previous studies.
CAESAR also reports the evidence supporting the prioritization rank of each gene, allowing an investigator to trace the line of reasoning and to exercise his or her own judgment as to its validity. Thus, it can be seen as a very sophisticated aid to manual prioritization.
Though intended to support the design of an association study involving a few hundred genes, CAESAR can also be used to prioritize a smaller number of candidates within a region of linkage, or to prioritize among polymorphisms that show significant association in a genome-wide study by annotating them with ranked genes.
CAESAR Methods --
CAESAR consists of three (3) main steps.
First, previously implicated genes mentioned in the input text are identified and ontology terms are ranked based on their similarity to the input text.
Second, genes are ranked for each data source independently based on the relevance of the ontology terms with which they are annotated.
Third, the individual gene lists are integrated to provide a single ranked list of candidate genes that combines evidence from all data sources.
The manufacturer refers to these three (3) steps as text mining, data mining and data integration, respectively.
1) Text mining --
Text mining is used to extract gene symbols and ontology terms from the input.
CAESAR requires a user-defined body of text (referred to as a corpus) as input. This text is ideally an authoritative and comprehensive source of biological knowledge about the trait of interest.
If an Online Mendelian Inheritance in Man (OMIM) identifier is supplied, CAESAR will use the OMIM record as input. Alternatively, the user can provide any other body of text, for instance one or more review articles.
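The text-mining step described above can be pictured with a minimal sketch. It is not CAESAR's implementation: the token-overlap similarity, the function names, the `TERM:n` identifiers and the toy corpus are all assumptions made for illustration, and the source does not say which similarity measure CAESAR uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_terms(corpus, ontology_terms):
    """Rank ontology terms by the fraction of their tokens found in the corpus.

    This overlap score is an assumed stand-in for a real similarity measure.
    """
    corpus_counts = Counter(tokenize(corpus))
    scores = {}
    for term_id, description in ontology_terms.items():
        tokens = set(tokenize(description))
        if not tokens:
            scores[term_id] = 0.0
            continue
        matched = sum(1 for token in tokens if token in corpus_counts)
        scores[term_id] = matched / len(tokens)
    # Highest-scoring terms first; ties keep their input order.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def find_gene_symbols(corpus, known_symbols):
    """Return the known gene symbols that appear verbatim in the corpus."""
    words = set(re.findall(r"[A-Za-z0-9-]+", corpus))
    return sorted(words & set(known_symbols))


# Toy inputs (hypothetical corpus and ontology, for illustration only).
corpus = (
    "Type 2 diabetes involves impaired insulin secretion; "
    "variants in TCF7L2 and KCNJ11 are implicated."
)
ontology_terms = {
    "TERM:1": "insulin secretion",
    "TERM:2": "glucose transport",
    "TERM:3": "hair follicle development",
}
ranked_terms = rank_terms(corpus, ontology_terms)
implicated = find_gene_symbols(corpus, {"TCF7L2", "KCNJ11", "PPARG"})
```

In this sketch the corpus could equally be the text of an OMIM record or a review article; only the plain string matters to the two functions.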
2) Data mining --
In the data-mining step, genes within each gene-centric data source are ranked based on their relevance to the trait-centric ontology terms.
Eight (8) sources of gene-centric information are used to map ranked ontology terms to the genes annotated with them.
The resulting output is eight (8) lists of gene scores, one for each functional category.
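As a rough sketch of this step, each gene in a data source can be scored from the ontology terms annotated to it. The summation rule, the function names and the toy data are assumptions; the source states only that genes are ranked by the relevance of their annotated terms, and only two of the eight sources are mocked up here.

```python
def score_genes(term_scores, annotations):
    """Score each gene as the sum of the scores of its annotated terms.

    Summation is an assumed scoring rule, used here only for illustration.
    """
    return {
        gene: sum((term_scores.get(term, 0.0) for term in terms), 0.0)
        for gene, terms in annotations.items()
    }


def score_all_sources(term_scores, sources):
    """Produce one gene-score table per data source, scored independently."""
    return {
        name: score_genes(term_scores, annotations)
        for name, annotations in sources.items()
    }


# Toy term scores, as might come out of the text-mining step (hypothetical).
term_scores = {"TERM:1": 1.0, "TERM:2": 0.5}

# Toy annotations for two hypothetical data sources.
sources = {
    "pathways": {"GENE_A": ["TERM:1", "TERM:2"], "GENE_B": ["TERM:2"]},
    "expression": {"GENE_A": [], "GENE_C": ["TERM:1"]},
}
per_source_scores = score_all_sources(term_scores, sources)
```

Each source yields its own table, so a gene missing from one source (GENE_B in "expression") simply has no entry there.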
3) Data integration --
The gene scores from the eight (8) sources are integrated to produce one combined score for each gene. Integration is accomplished using one of four (4) methods.
Each method represents a different approach that an investigator might choose when manually prioritizing candidate genes on the basis of evidence from several data sources.
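The source does not name the four integration methods, so the following is only a sketch of how per-source scores might be combined. The three strategies shown (sum, maximum and a user-weighted sum) are assumptions and should not be read as CAESAR's actual methods.

```python
def integrate(per_source, method="sum", weights=None):
    """Combine per-source gene scores into one ranked list.

    The 'sum', 'max' and 'weighted' strategies are hypothetical examples.
    """
    genes = set()
    for scores in per_source.values():
        genes.update(scores)

    combined = {}
    for gene in genes:
        # A gene absent from a source contributes a score of 0.0 there.
        values = {src: scores.get(gene, 0.0) for src, scores in per_source.items()}
        if method == "sum":
            combined[gene] = sum(values.values())
        elif method == "max":
            combined[gene] = max(values.values())
        elif method == "weighted":
            source_weights = weights or {}
            combined[gene] = sum(
                source_weights.get(src, 1.0) * value for src, value in values.items()
            )
        else:
            raise ValueError(f"unknown integration method: {method}")

    # Highest combined score first; ties are broken alphabetically by gene.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy per-source scores (hypothetical).
per_source = {
    "pathways": {"GENE_A": 1.5, "GENE_B": 0.5},
    "expression": {"GENE_C": 1.0, "GENE_B": 0.75},
}
by_sum = integrate(per_source, method="sum")
by_max = integrate(per_source, method="max")
by_weight = integrate(per_source, method="weighted", weights={"expression": 2.0})
```

The three calls give different orderings for the same inputs, which is the point of offering several strategies: summing rewards genes supported by many sources, while taking the maximum rewards one strong line of evidence.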
CAESAR relies on human expert knowledge in order to function effectively, but it does not require that the user actually possess all of this knowledge.
At a minimum, the user needs to select a relevant corpus, but much more user intervention is possible.
The user may manually modify the scores from the text-mining step and/or introduce genes in addition to those that were extracted from the corpus.
The final rankings may be modified based on user perceptions of the importance of particular data sources.
The user may also restrict the algorithm to consider only certain genomic regions or particular sets of genes.
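Restricting the candidates to a genomic region or a chosen gene set amounts to filtering the ranked list. The sketch below assumes a hypothetical function name, toy gene positions and a (chromosome, start, end) region format; none of these come from CAESAR itself.

```python
def restrict_candidates(ranked, allowed_genes=None, positions=None, region=None):
    """Keep only ranked genes that pass the optional gene-set and region filters."""
    kept = []
    for gene, score in ranked:
        if allowed_genes is not None and gene not in allowed_genes:
            continue
        if region is not None:
            chrom, start, end = region
            location = (positions or {}).get(gene)
            # Genes with no known position are dropped when a region is requested.
            if location is None:
                continue
            gene_chrom, gene_pos = location
            if gene_chrom != chrom or not (start <= gene_pos <= end):
                continue
        kept.append((gene, score))
    return kept


# Toy ranked list and gene positions (hypothetical values).
ranked = [("GENE_A", 1.5), ("GENE_B", 1.25), ("GENE_C", 1.0)]
positions = {
    "GENE_A": ("chr1", 1_200_000),
    "GENE_B": ("chr1", 9_800_000),
    "GENE_C": ("chr2", 500_000),
}
in_region = restrict_candidates(
    ranked, positions=positions, region=("chr1", 1_000_000, 5_000_000)
)
in_gene_set = restrict_candidates(ranked, allowed_genes={"GENE_B", "GENE_C"})
```

Filtering preserves the original rank order, so the same sketch covers prioritizing candidates within a region of linkage.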
While it is not advisable to eliminate human judgment and oversight from the candidate gene selection process, semi-automated methods such as CAESAR may well outperform an unaided expert because of the volume and complexity of the information involved.
At the very least, CAESAR provides a quantitative starting point for which the assumptions are clear and the user's biases are minimized.
System Requirements
CAESAR is written in Perl and requires a working installation of Perl in order to function.
Contact the manufacturer for additional System Requirements.
Manufacturer
CAESAR was developed by the Mohlke lab and the Vision lab at the University of North Carolina, Departments of Genetics and Biology.
Manufacturer Web Site CAESAR
Price Contact manufacturer.
G6G Abstract Number 20424
G6G Manufacturer Number 104053




