NAG Data Mining Components

Category Intelligent Software > Data Mining Systems/Tools and Intelligent Software > Neural Network Systems/Tools

Abstract NAG Data Mining Components (DMC) is a collection of ‘callable components’ designed to help developers build fast, accurate, and robust applications for predictive analytics.

You select the components you need for problem solving and integrate them readily into your existing applications.

DMC incorporates routines for data cleaning (including imputation and outlier detection), data transformations (scaling, principal component analysis), clustering, classification, regression models and 'machine learning' methods (neural networks, radial basis functions, decision trees, nearest neighbors), and association rules.

Also included are utility functions, including random number generators and functions for rank ordering, sorting, mean and sum-of-squares updates, two-way classification comparison, and saving and loading models.

Who DMC is for -- Application developers working in areas such as life sciences, research, and finance, as well as Independent Software Vendors (ISVs), rely on NAG’s DMC components to enhance performance and significantly reduce development time.

Data mining plays an essential part in applications across a range of business activities including Bioinformatics, CRM, Web Analytics, Finance, e-Business, Retail, Consumer Behavioral Modeling, and Fraud Detection.

NAG Data Mining Components - Functionality

Data Cleaning --

Data Imputation - Missing values in the data are replaced with suitable values using one of three fundamental approaches: summary statistics, distance-based measures, or the EM algorithm for multivariate Normal data.
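
As a rough illustration of the summary-statistics approach (generic NumPy, not the NAG DMC interface), the sketch below fills missing values with per-column means; the distance-based and EM approaches are more involved.

```python
import numpy as np

# Column-mean imputation: replace NaNs with the mean of the observed values.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

col_means = np.nanmean(X, axis=0)                  # per-column means ignoring NaNs
missing = np.isnan(X)
X[missing] = np.take(col_means, np.where(missing)[1])
print(X)
```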

Outlier Detection - Outlier detection is concerned with finding suspect data records in a set of data. Records are identified as suspect if they do not appear to be drawn from an assumed data distribution.
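
A minimal sketch of the idea, assuming a Normal model and a z-score cut-off of 3; the actual tests and thresholds used in NAG DMC may differ.

```python
import numpy as np

# Flag records whose z-score exceeds a threshold under an assumed Normal model.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), [8.0, -7.5]])  # two planted outliers

z = (x - x.mean()) / x.std()
suspect = np.abs(z) > 3.0
print(np.where(suspect)[0])   # indices of the suspect records
```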

Data Transformations --

Scaling Data - The contribution that data values on a continuous variable make to a distance computation depends on the range of those values. For functions that do not include scaling as an option, it is therefore often preferable to transform all data values on continuous variables onto the same scale.
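
For example, standardizing each continuous column to zero mean and unit variance puts all variables on a comparable footing before distances are computed (an illustrative transformation, not the product's API).

```python
import numpy as np

# Standardize each continuous column so that no single variable dominates
# a distance computation purely because of its scale.
X = np.array([[170.0, 65000.0],
              [180.0, 72000.0],
              [165.0, 58000.0]])

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```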

Principal Component Analysis - Principal component analysis (PCA) is a tool for reducing the number of variables that you need to consider in an analysis.

PCA derives a set of orthogonal, i.e., uncorrelated, variables that contain most of the information in the original data. The new variables, called principal components, are calculated as linear transformations of the original data.
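
A compact way to see this is PCA computed from the singular value decomposition of the centred data; the sketch below is generic NumPy, not the NAG DMC routine.

```python
import numpy as np

# PCA via SVD of the centred data: the rows of Vt are the principal
# directions, and the scores are the data projected onto them.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)        # proportion of variance per component
scores = Xc @ Vt.T                     # principal component scores
print(explained)
```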

Cluster Analysis -- Cluster analysis is the statistical name for techniques that aim to find groups of similar data records in a study.

k-means Clustering - In k-means clustering the analyst decides in advance how many groups, or clusters, there are in the data.
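
The sketch below shows the underlying Lloyd's algorithm for k-means on synthetic two-dimensional data; it is illustrative only and does not reflect NAG DMC's interface or its handling of edge cases such as empty clusters.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each record to its nearest centre,
    then recompute the centres, until the centres stop moving."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(3.0, 0.3, size=(30, 2))])
labels, centres = kmeans(X, k=2)
print(centres)    # roughly (0, 0) and (3, 3)
```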

Hierarchical Clustering - Hierarchical clustering starts from the collection of data records and agglomerates them step-by-step until there is only one group. The analyst uses results from the hierarchical clustering to determine a natural number of clusters.
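
As an illustration of the agglomerate-then-cut idea, the sketch below uses SciPy's hierarchical clustering as a stand-in for the NAG DMC functions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Agglomerative (Ward) clustering: merge records step-by-step, then cut the
# resulting tree at a chosen number of clusters.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(4.0, 0.3, (20, 2))])

Z = linkage(X, method="ward")                      # full merge history (dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
print(np.bincount(labels))                         # cluster sizes
```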

Classification --

Classification Trees - NAG Data Mining Components includes functions to calculate binary and n-ary decision trees for classification. The binary classification tree uses the Gini index criterion at nodes, whereas the n-ary classification tree uses an entropy-based criterion.
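
The Gini index criterion can be illustrated in a few lines of NumPy: a candidate binary split is scored by the weighted impurity of its two child nodes (a sketch of the criterion only, not of the tree-building code).

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def split_gini(x, y, threshold):
    """Weighted Gini impurity of the split x <= threshold vs x > threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(split_gini(x, y, threshold=3.5))   # a pure split scores 0.0
```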

Generalized Linear Models - Generalized linear models allow a wide range of models to be fitted. These include logistic and probit regression models for binary data, and log-linear models for contingency tables.

In NAG DMC the following distributions are available: binomial distribution (for binary classification tasks) and Poisson distribution (typically used for count data).
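
As a sketch of how a binomial GLM (logistic regression) is fitted in principle, the code below uses Newton's method (iteratively reweighted least squares) on synthetic data; it does not reflect the NAG DMC interface.

```python
import numpy as np

# Logistic regression: binomial GLM with a logit link, fitted by IRLS.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept + 1 feature
true_beta = np.array([-0.5, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)                                  # IRLS weights
    H = X.T @ (W[:, None] * X)                         # Hessian of the log-likelihood
    beta = beta + np.linalg.solve(H, X.T @ (y - p))    # Newton step
print(beta)   # should be close to true_beta
```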

Nearest Neighbors - k-nearest neighbor models predict values based on values of the k most similar data records in a training set of data.

The measure of similarity is taken to be one of two distance functions. Prior probabilities can be set for the classes in the data. Training data are stored in a binary tree to enable efficient searching for nearest neighbors.
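
The sketch below shows the basic k-nearest-neighbor vote using Euclidean distance and a brute-force search; NAG DMC's choice of distance functions, prior probabilities, and tree-based search are not reproduced here.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training records."""
    d = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances to all records
    nearest = np.argsort(d)[:k]
    votes = np.bincount(y_train[nearest])
    return votes.argmax()

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))   # -> 1
```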

Regression --

Regression Trees - The two decision trees available in NAG DMC for regression tasks are both binary trees. Each regression tree minimizes the sum of squares about the mean for data at a node.

However, one of the regression trees uses a robust estimate of the mean, whereas the other uses the sample average.
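
The node-splitting criterion can be illustrated as follows: each candidate threshold is scored by the total sum of squares about the child means, and the smallest wins (a sketch of the criterion only, using the sample average rather than a robust estimate).

```python
import numpy as np

def best_split(x, y):
    """Pick the threshold on x minimizing the total sum of squares
    about the mean in the two child nodes."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # splits at x = 3.0
```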

Linear Regression - Linear regression models can be used to predict an outcome y from a number of independent variables. The predictive model is a linear combination of independent variables and a constant term.

NAG DMC can automatically select a good subset of independent variables to use in a model by using stepwise selection.
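
A simplified forward-selection variant of this idea is sketched below; the actual stepwise procedure in NAG DMC, and its stopping rules, will differ.

```python
import numpy as np

def forward_stepwise(X, y, max_vars):
    """Greedy forward selection: at each step add the variable that most
    reduces the residual sum of squares of the least-squares fit."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_vars):
        best_j, best_rss = None, np.inf
        for j in remaining:
            A = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta)**2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=100)
print(forward_stepwise(X, y, max_vars=2))   # should recover columns 1 and 4
```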

Multi-layer Perceptron Neural Networks - Multi-layer perceptrons (MLPs) are flexible non-linear models that may be represented by a directed graph.

The process of optimizing values of the free parameters in an MLP is known as training. Training involves minimizing the sum of the squared error function between MLP predictions and training data values.

NAG DMC uses a conjugate gradients optimizer to train MLPs.
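
The sketch below trains a tiny one-hidden-layer MLP by minimizing the sum of squared errors with SciPy's conjugate-gradient optimizer as a stand-in for the NAG DMC trainer (numerical gradients are used here for brevity).

```python
import numpy as np
from scipy.optimize import minimize

# One-hidden-layer MLP for regression; training = minimizing the summed
# squared error over the free parameters (weights and biases).
rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3.0 * X[:, 0])                      # target function

n_hidden = 8
sizes = [(n_hidden, 1), (n_hidden,), (1, n_hidden), (1,)]   # W1, b1, W2, b2

def unpack(theta):
    parts, i = [], 0
    for shape in sizes:
        n = int(np.prod(shape))
        parts.append(theta[i:i + n].reshape(shape))
        i += n
    return parts

def sse(theta):
    W1, b1, W2, b2 = unpack(theta)
    h = np.tanh(X @ W1.T + b1)                 # hidden layer
    pred = h @ W2.T + b2                       # linear output layer
    return np.sum((pred[:, 0] - y)**2)

theta0 = rng.normal(scale=0.5, size=sum(int(np.prod(s)) for s in sizes))
result = minimize(sse, theta0, method="CG")    # conjugate-gradient training
print(result.fun)                              # final sum of squared errors
```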

Nearest Neighbors - k-nearest neighbor models predict values based on values of the k most similar data records in a training set of data.

The measure of similarity is taken to be one of two distance functions. Training data are stored in a binary tree to enable efficient searching for nearest neighbors.

Radial Basis Function Models - A radial basis function (RBF) computes a scalar function of the Euclidean distance from its centre location to data records.

A linear combination of RBF outputs defines a RBF model. The advantages of RBF models are that the centres can be positioned to reflect domain knowledge and the optimization is fast and accurate.
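
A minimal sketch, assuming Gaussian RBFs at fixed centres: the basis outputs form a design matrix, and the combination weights follow from an ordinary least-squares solve, which is why the optimization is fast and exact once the centres are fixed.

```python
import numpy as np

# RBF model: Gaussian basis functions at fixed centres, weights by least squares.
rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.cos(2.0 * X[:, 0])

centres = np.linspace(-1, 1, 7).reshape(-1, 1)   # fixed centre locations
width = 0.4

def design(X):
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(d / width)**2)               # Gaussian basis outputs

Phi = design(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # linear fit of the combination weights
print(np.max(np.abs(Phi @ w - y)))               # fit error on the training data
```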

Association Rules -- The goal of association analysis is to determine relationships between nominal data values.

These models are typically used for market basket analysis.
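
For example, support and confidence for simple two-item rules over a handful of market baskets can be computed directly (an illustration of the measures, not of the product's association-rule mining algorithm).

```python
from itertools import combinations

# Each basket is the set of items bought together in one transaction.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
           {"bread", "milk", "butter"}, {"bread", "milk"}]
items = sorted(set().union(*baskets))
n = len(baskets)

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= b for b in baskets) / n

for a, b in combinations(items, 2):
    conf = support({a, b}) / support({a})        # confidence of the rule a -> b
    print(f"{a} -> {b}: support={support({a, b}):.2f}, confidence={conf:.2f}")
```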

Utility Functions -- The utility functions are designed to support the main functions described above and to help with prototyping. Utility functions included are:

Random number generators; Rank ordering; Sorting; Mean and sum of squares updates; Two-way classification comparison; and Saving and loading models.

System Requirements

Installers and Users' Notes are available for the following systems:

Manufacturer Numerical Algorithms Group (NAG)

Manufacturer Web Site NAG Data Mining Components

Price Contact manufacturer.

G6G Abstract Number 20343

G6G Manufacturer Number 102637