Genna

Category Intelligent Software>Data Mining Systems/Tools, Intelligent Software>Genetic Algorithm Systems and Intelligent Software>Neural Network Systems/Tools

Abstract Genna is a hybrid data mining algorithm that combines genetic algorithm and nearest neighbor technologies to provide an advanced modeling tool for tackling classification and regression data mining tasks. The data input to Genna consists of a set of independent variables and a dependent variable.

The objective of applying Genna is to structure this data in such a way that the dependent variable can be predicted accurately for a given vector consisting of values assigned to the independent variables (often referred to as the target exemplar or case). The variables (dependent and independent) may be categorical or numeric.

Given a training data set, Genna does not learn a model by induction. Instead it converts the data into a ‘Corporate Memory’, structuring the data so that, when a prediction is required, the memory itself can be used to retrieve cases comparable to the target. The prediction is then based on the outcomes of the retrieved comparables.

The advantages of this approach range from cognitive appeal to incremental learning. By cognitive appeal the manufacturer means that most humans approach problem solving in this manner -- retrieving similar previous experiences, adapting them to the current situation and then solving the current situation based on the solutions that succeeded in the past.

Incremental learning refers to the fact that as new data becomes available, it is added to the ‘Corporate Memory’ and can be used immediately to make further predictions. This is not the case with most other data mining approaches, where a model is learned from the data and any new data can only be incorporated into the model after an expensive learning process has been re-executed.

As the description above shows, the key to accurate predictions is the comparability / similarity index used to retrieve the comparable cases, together with the method by which the outcomes of the comparable cases are combined to produce a prediction.
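The two ingredients named above, a similarity index and a rule for combining outcomes, can be illustrated with a minimal sketch. The source does not describe Genna's actual metrics, so everything here is an assumption made for illustration: the function names, the weighted Euclidean distance, and the inverse-distance combination rule are all hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def predict(memory, target, weights, k=3):
    """Predict the outcome for `target` from its k most comparable cases.

    `memory` is a list of (attribute_vector, outcome) pairs.
    Outcomes are combined by inverse-distance weighting (an assumed choice).
    """
    ranked = sorted(memory, key=lambda case: weighted_distance(case[0], target, weights))
    comparables = ranked[:k]
    num = den = 0.0
    for attrs, outcome in comparables:
        # Closer cases get more influence; the small constant avoids division by zero.
        influence = 1.0 / (weighted_distance(attrs, target, weights) + 1e-9)
        num += influence * outcome
        den += influence
    return num / den
```

For a classification task, the weighted average in the last step would typically be replaced by a weighted vote over the outcomes of the comparables.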

The choice of comparability index and prediction mechanism is a complex process that can be viewed as an optimization problem: minimize predictive error subject to certain constraints defined on the parameters of the comparability index and prediction mechanism. Genna uses a 'genetic algorithm' to perform this optimization.

Genetic Algorithms mimic the natural process of evolution to navigate the search space of all possible solutions in a non-exhaustive manner, aiming to arrive quickly at the global optimum. Starting with a population of candidate solutions, an iterative process of evaluation, selection, crossover and mutation helps the population evolve and converge toward the global optimum, and the parallel nature of the search helps it navigate around local optima. Genetic Algorithms are known for their robustness and parallelizable nature, making them well suited to use in data mining.
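The evaluation, selection, crossover and mutation loop can be sketched as follows. This is not Genna's implementation, which the source does not disclose. It is a minimal illustration under stated assumptions: candidates are attribute weight vectors, fitness is the leave-one-out error of a 1-nearest-neighbour predictor, and the names `loo_error` and `evolve_weights` and all parameter defaults are hypothetical.

```python
import random

def loo_error(X, y, weights):
    """Leave-one-out mean absolute error of a 1-nearest-neighbour predictor."""
    total = 0.0
    for i, target in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, case in enumerate(X):
            if i == j:
                continue
            d = sum(w * (a - b) ** 2 for w, a, b in zip(weights, target, case))
            if d < best_d:
                best_j, best_d = j, d
        total += abs(y[i] - y[best_j])
    return total / len(X)

def evolve_weights(X, y, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Evolve attribute weights; returns (best_weights, best_error_per_generation)."""
    rng = random.Random(seed)
    n = len(X[0])
    population = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Evaluation: a lower leave-one-out error means a fitter candidate.
        scored = sorted(population, key=lambda w: loo_error(X, y, w))
        history.append(loo_error(X, y, scored[0]))
        # Selection: keep the better half, so the best candidate always survives.
        parents = scored[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            # Uniform crossover: each weight is taken from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            # Mutation: occasionally perturb a weight, clamped to [0, 1].
            child = [min(1.0, max(0.0, w + rng.gauss(0, 0.1)))
                     if rng.random() < mutation_rate else w
                     for w in child]
            children.append(child)
        population = parents + children
    best = min(population, key=lambda w: loo_error(X, y, w))
    return best, history
```

Because the better half of each generation is carried over unchanged, the best error recorded in `history` never gets worse from one generation to the next. Nothing in this sketch guarantees that the search reaches the global optimum.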

Genna (as stated above) can be used for classification and regression predictive tasks. The perspicuity of the model and its cognitive basis make it particularly suited to applications where the justification of an individual prediction is key. Examples of such domains are government and medicine.

Typical applications to which Genna has been applied include:

1) Churn Analysis.

2) House Price Prediction for Mass Appraisal.

3) Prognosis of Colorectal Patients.

Genna key features/capabilities include:

Attribute Weights: User or (Semi) Automatic Generation --

Genna provides the user with three (3) options for optimizing the similarity metric used in comparable retrieval. First, the user of the algorithm can provide a weighting for each of the independent variables in the input data. Secondly, the user can provide a ranking of the attributes based on the user's own domain knowledge.

This ranking is taken into account by the optimization carried out by the Genetic Algorithm within Genna. Finally, the user can suggest that the Genetic Algorithm generate the weights autonomously. After the generation of the weights the user can tune the weights and generate new models to obtain insights into the sensitivity of the model to changes in the individual attribute weights.
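The source does not say how a user-supplied ranking is taken into account during optimization. One plausible approach, shown here purely as an assumption, is a repair step that reorders a candidate's weight values so that higher-ranked attributes always receive the larger weights. The function name `apply_ranking` is hypothetical.

```python
def apply_ranking(weights, ranking):
    """Reorder weight values so they respect a user-supplied attribute ranking.

    `ranking` lists attribute indices from most to least important.
    The largest weight value goes to the top-ranked attribute, and so on.
    """
    ordered_values = sorted(weights, reverse=True)
    repaired = [0.0] * len(weights)
    for value, attribute in zip(ordered_values, ranking):
        repaired[attribute] = value
    return repaired
```

For example, with candidate weights `[0.2, 0.9, 0.5]` and a ranking that puts attribute 2 first, then 0, then 1, the repaired weights are `[0.5, 0.2, 0.9]`.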

Flexibility --

Genna gives the user flexibility over the type of model that is generated. Parameters of the algorithm control the nature of the distance metric employed, the prediction method employed and the number of comparables used within the prediction phase of the model.

The user can also influence the type of error distribution generated by the application of the model through the setting of a parameter that affects the trade-off used by the algorithm between accuracy and variability of the model. The algorithm uses well-established statistical metrics to generate a measure of the estimated accuracy and variability of the model.
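A trade-off parameter of this kind can be illustrated with a simple scoring function. The source does not name the statistical metrics Genna uses, so the choices below are assumptions: mean absolute error stands in for accuracy, the standard deviation of the absolute errors stands in for variability, and the name `fitness` and the parameter `tradeoff` are hypothetical.

```python
import statistics

def fitness(errors, tradeoff=0.5):
    """Score a candidate model from its per-case prediction errors (lower is better).

    tradeoff=0 scores on mean absolute error alone; larger values
    penalise models whose errors vary widely from case to case.
    """
    absolute = [abs(e) for e in errors]
    accuracy = statistics.mean(absolute)
    variability = statistics.pstdev(absolute)
    return accuracy + tradeoff * variability
```

With a non-zero `tradeoff`, two models with the same mean error are no longer scored equally: the one whose errors are more consistent across cases is preferred.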

In addition to the above features/capabilities Genna also provides the following features:

Ability to use Censored Observations --

Genna uniquely provides distance metrics and prediction mechanisms that explicitly handle censored observations. It does so by combining elements of evidence theory with well-established statistical techniques, such as the Kaplan-Meier estimator and Wilcoxon’s test, in the prediction process.
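For readers unfamiliar with censored data, the standard Kaplan-Meier estimator mentioned above can be sketched in a few lines. This shows only the textbook estimator, not how Genna combines it with evidence theory, which the source does not describe. The function and argument names are illustrative.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: time to event or to censoring for each case.
    observed:  True if the event was seen, False if the case was censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Cases still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
    return curve
```

A censored case contributes to the at-risk count up to its censoring time but never counts as an event, which is how the estimator uses partial information instead of discarding it.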

Ability to use Categorical and Numeric Attributes through the use of innovative distance metrics --

Generally, nearest neighbor algorithms use similarity metrics that are better suited either to categorical attributes or to numeric attributes. Using both types of attributes together introduces biases within the similarity metric used for comparable retrieval. Genna uses innovative similarity metrics that are suitable for numeric as well as categorical attributes.
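Genna's own metrics are not described in the source. As a point of reference, a well-known way to handle both attribute types in one distance is the Heterogeneous Euclidean-Overlap Metric (HEOM), sketched below as a stand-in. The function name `mixed_distance` and the `ranges` argument are illustrative.

```python
import math

def mixed_distance(a, b, ranges):
    """Distance over mixed attributes, in the style of HEOM.

    ranges[i] is the value range (max - min) of numeric attribute i,
    or None if attribute i is categorical.
    """
    total = 0.0
    for x, y, span in zip(a, b, ranges):
        if span is None:
            # Overlap metric: categories either match or they do not.
            d = 0.0 if x == y else 1.0
        else:
            # Range-normalised difference keeps numeric attributes within [0, 1].
            d = abs(x - y) / span if span else 0.0
        total += d * d
    return math.sqrt(total)
```

Normalising numeric differences by their range puts them on the same 0 to 1 scale as the categorical mismatch, so neither attribute type dominates the distance simply because of its units.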

Automatic Indexing of data for Scalability and Speed --

One of the potential shortfalls of the nearest neighbor family of algorithms is that, because they do not build “compact” models from data for use in predictions, the speed of the prediction process can suffer as the data volume increases. To alleviate this problem, Genna automatically indexes the data using clustering techniques to speed up the prediction process.
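The idea of indexing by clustering can be illustrated with a small sketch. The source does not say which clustering technique Genna uses. The version below assumes a basic k-means with a naive initialisation, and the names `build_index` and `nearest_case` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_index(points, k, iterations=10):
    """Group points with a basic k-means; returns (centroids, buckets)."""
    centroids = [list(p) for p in points[:k]]  # naive initialisation
    buckets = []
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, members in enumerate(buckets):
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, buckets

def nearest_case(query, centroids, buckets):
    """Search only the bucket whose centroid is closest to the query."""
    candidates = [c for c in range(len(centroids)) if buckets[c]]
    chosen = min(candidates, key=lambda c: sq_dist(query, centroids[c]))
    return min(buckets[chosen], key=lambda p: sq_dist(query, p))
```

Searching a single bucket trades a little accuracy for speed: the true nearest case can occasionally sit just across a cluster boundary, so practical systems often search the few closest buckets.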

Incremental Learning and Introspection --

Once a model is built using data mining, an important part of the deployment is monitoring the accuracy of the predictions made by the model. Over a period of time, the context in which the model is applied changes, a phenomenon referred to as ‘Concept Drift’ in the Machine Learning literature. With this shift in context the model becomes less accurate. Most data mining algorithms would need to be reapplied to new data, resulting in a new model being built and applied within the new context.

Genna approaches this problem differently. As new data is collected, whether it represents new observations or feedback from the application of the model, it is incorporated into the current model. If the data consists of new observations, this continuous learning is referred to as Incremental Learning. The incorporation of data on the accuracy of the model’s application, on the other hand, is referred to as Introspection.
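The distinction between the two kinds of update can be shown with a minimal sketch. The class below is hypothetical and is not Genna's API. It assumes numeric attributes and a 1-nearest-neighbour lookup, and it shows only one simple way feedback might be recorded.

```python
class CaseMemory:
    """A growing store of cases plus a running record of prediction errors."""

    def __init__(self):
        self.cases = []    # (attribute_vector, outcome) pairs
        self.errors = []   # absolute errors reported back after deployment

    def add_case(self, attributes, outcome):
        # Incremental learning: a new observation is usable immediately.
        self.cases.append((tuple(attributes), outcome))

    def predict(self, target):
        # 1-nearest-neighbour lookup over everything stored so far.
        def sq_dist(case):
            return sum((a - b) ** 2 for a, b in zip(case[0], target))
        return min(self.cases, key=sq_dist)[1]

    def record_feedback(self, target, actual):
        # Introspection: log how far off the prediction was, then keep the case.
        self.errors.append(abs(self.predict(target) - actual))
        self.add_case(target, actual)

    def recent_error(self, window=5):
        # Average of the most recent logged errors; 0.0 if none logged yet.
        recent = self.errors[-window:]
        return sum(recent) / len(recent) if recent else 0.0
```

A rising `recent_error` would be one simple signal of concept drift, and it could be used to trigger a fresh optimization of the similarity metric.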

System Requirements

Hardware Pentium III 550 MHz or above with 256 MB RAM (512 MB recommended). A CD-ROM drive is also required for installation.

Operating system Windows 2000 (Service Pack 3) / XP (Service Pack 1) and Microsoft Office 2000.

Manufacturer

Manufacturer Web Site Genna

Price Contact manufacturer.

G6G Abstract Number 20161

G6G Manufacturer Number 101036