Category Intelligent Software > Data Mining Systems/Tools

Abstract CART data mining software is a decision tree tool that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships. This discovered knowledge is then used to generate reliable, predictive models for applications such as credit risk scoring (probability of default, loss given default); fraud detection; targeted marketing (new customer acquisition, cross-sell, up-sell); churn modeling [and related customer relationship management (CRM)]; document classification; microarray data analysis; genomics, proteomics; manufacturing and production line quality control.

In addition, CART is an advanced pre-processing complement to other data analysis techniques and data-mining packages, such as SAS. For example, CART's outputs (predicted values) can be used as inputs to improve the predictive accuracy of Neural Networks (NN) and Logistic Regression.

In the first stage of a data-mining project, CART can extract the most important variables from a very large list of potential predictors. Focusing on the top variables from the CART model can significantly speed up neural networks and other data-mining techniques. For neural nets in particular, CART bypasses "noise" and irrelevant variables, quickly and effectively selecting the best variables for input. The result is a significant reduction in neural-net training times and more accurate and robust neural networks. In addition, the CART outputs, or "predicted values," can be used as inputs to the neural net.

CART is an acronym for Classification and Regression Trees, a decision-tree procedure. A decision tree is a flow chart or diagram representing a classification system or predictive model. The tree is structured as a sequence of simple questions, and the answers to these questions trace a path down the tree. The end point reached determines the classification or prediction made by the model, which can be a qualitative judgment (e.g., these are responders) or a numerical forecast (e.g., sales will increase 15 percent).
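The question-path idea can be sketched as ordinary conditional logic; the variables, thresholds, and classes below are invented for illustration and do not come from a real CART model:

```python
# A minimal hand-coded decision tree, illustrating how a sequence of
# simple questions traces a path to a classification. Fields and
# thresholds here are hypothetical.

def classify(record):
    """Classify a customer record as a likely 'responder' or 'non-responder'."""
    if record["age"] < 35:                 # root question
        if record["income"] > 50_000:      # second question, left branch
            return "responder"
        return "non-responder"
    if record["prior_purchases"] >= 3:     # second question, right branch
        return "responder"
    return "non-responder"

print(classify({"age": 28, "income": 62_000, "prior_purchases": 0}))  # responder
```

A fitted CART tree is exactly this kind of structure, discovered from data rather than written by hand.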

CART's methodology is characterized by:

1) A reliable pruning strategy -- CART's developers determined definitively that no stopping rule could be relied on to discover the optimal tree, so they introduced the notion of over-growing trees and then pruning back; this idea, fundamental to CART, ensures that important structure is not overlooked by stopping too soon.

2) An advanced binary split search approach -- CART's binary decision trees are more sparing with data, detecting more structure before too little data remain for learning.

3) Automatic self-validation procedures -- In the search for patterns in databases it is essential to avoid the trap of "overfitting," or finding patterns that apply only to the training data. CART's embedded test disciplines ensure that the patterns found will hold up when applied to new data. Further, the testing and selection of the optimal tree are an integral part of the CART algorithm.
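The over-grow-then-prune idea of point 1) is commonly expressed through a cost-complexity measure: a subtree survives pruning only if its error reduction justifies its extra leaves. The error rates and leaf counts below are invented for illustration:

```python
# Cost-complexity pruning in miniature (pure Python; numbers invented).
# An overgrown subtree is compared against the single leaf that would
# replace it, at a given complexity penalty alpha.

def penalized_cost(error_rate, n_leaves, alpha):
    # R_alpha(T) = R(T) + alpha * |leaves(T)|: error plus a per-leaf penalty.
    return error_rate + alpha * n_leaves

# Hypothetical subtree vs. the collapsed leaf that pruning would leave:
subtree = {"error_rate": 0.10, "n_leaves": 5}
collapsed = {"error_rate": 0.18, "n_leaves": 1}

for alpha in (0.005, 0.05):
    keep = penalized_cost(**subtree, alpha=alpha) < penalized_cost(**collapsed, alpha=alpha)
    print(f"alpha={alpha}: {'keep subtree' if keep else 'prune to leaf'}")
```

At a small penalty the subtree is kept; at a larger one it is pruned back, which is why the full sequence of pruned trees must be tested rather than growth simply stopped early.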
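The self-validation of point 3) can be sketched as plain v-fold cross validation; `build_model` and `error_rate` below are hypothetical stand-ins for the real training and scoring steps:

```python
# A bare-bones sketch of v-fold cross validation: each fold is held out
# once while a model is built on the rest, and the averaged holdout error
# estimates performance on genuinely new data.

def cross_validate(records, v, build_model, error_rate):
    folds = [records[i::v] for i in range(v)]   # deal records into v folds
    errors = []
    for i in range(v):
        holdout = folds[i]
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        model = build_model(training)
        errors.append(error_rate(model, holdout))
    return sum(errors) / v                      # averaged holdout error

# Toy usage: the "model" is just the majority class of the training labels.
data = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
majority = lambda train: max(set(train), key=train.count)
err = lambda model, holdout: sum(lab != model for lab in holdout) / len(holdout)
print(cross_validate(data, 5, majority, err))   # -> 0.3
```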

In addition, CART accommodates many different types of real world modeling problems by providing a unique combination of automated solutions:

1) Surrogate splitters intelligently handle missing values -- CART handles missing values in the database by substituting "surrogate splitters," which are back-up rules that closely mimic the action of primary splitting rules. The surrogate splitter contains information that is typically similar to what would be found in the primary splitter. In CART, each record is processed using data specific to that record; this allows records with different data patterns to be handled differently, which results in a better characterization of the data.

2) Adjustable misclassification penalties help avoid the most costly errors -- CART can accommodate situations in which some misclassifications (incorrectly classified cases) are more serious than others. CART users can specify a higher penalty for misclassifying certain data, and the software will steer the tree away from that type of error. Further, when CART cannot guarantee a correct classification, it will try to ensure that the error it does make is less costly. If credit risk is classified as low, moderate, or high, for example, it would be much more costly to classify a high-risk person as low risk than as moderate risk.

3) Alternative splitting criteria make progress when other criteria fail -- CART includes seven (7) single-variable splitting criteria - Gini, symmetric Gini, twoing, ordered twoing, and class probability for classification trees, and least squares and least absolute deviation for regression trees - and one multi-variable splitting criterion, the linear combinations method. The default Gini method typically performs best, but, given specific circumstances, other methods can generate more accurate models. CART's unique "twoing" procedure, for example, is tuned for classification problems with many classes, such as modeling which of 170 products would be chosen by a given consumer. To deal more effectively with select data patterns, CART also offers splits on linear combinations of continuous predictor variables.
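A minimal sketch of the surrogate-splitter mechanism of point 1); the field names, thresholds, and back-up rules below are hypothetical, not Salford Systems' actual rules:

```python
# At each node, the primary splitting rule is tried first; if its variable
# is missing from the record, back-up "surrogate" rules that usually send
# the record the same way are tried in order.

def go_left(record):
    """Decide the branch taken at one node: primary rule, then surrogates."""
    rules = [
        ("income", lambda v: v < 40_000),       # primary splitter
        ("home_value", lambda v: v < 150_000),  # first surrogate, mimics primary
        ("age", lambda v: v < 30),              # second surrogate
    ]
    for field, test in rules:
        if record.get(field) is not None:
            return test(record[field])
    return True   # default direction if every splitter is missing

print(go_left({"income": None, "home_value": 120_000}))  # surrogate decides: True
```

Because each record uses whatever rule its own data supports, records with different missing-value patterns are routed individually instead of being dropped or crudely imputed.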
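The cost-sensitive behavior of point 2) can be sketched by choosing the prediction with the lowest expected misclassification cost rather than the most probable class; the cost matrix below is invented for the credit-risk example:

```python
# Cost-sensitive classification in miniature: pick the class that minimizes
# expected misclassification cost, so that unavoidable errors are cheap ones.

RISKS = ["low", "moderate", "high"]
# COST[true][predicted]: calling a high-risk applicant "low" is costliest.
COST = {
    "low":      {"low": 0,  "moderate": 1, "high": 2},
    "moderate": {"low": 2,  "moderate": 0, "high": 1},
    "high":     {"low": 10, "moderate": 2, "high": 0},
}

def cheapest_prediction(probs):
    """probs: estimated probability of each true class at a tree node."""
    expected = {
        pred: sum(probs[true] * COST[true][pred] for true in RISKS)
        for pred in RISKS
    }
    return min(expected, key=expected.get)

# 'low' is most probable, but the chance of 'high' makes calling it 'low'
# too expensive, so the safer middle class is chosen instead:
print(cheapest_prediction({"low": 0.5, "moderate": 0.2, "high": 0.3}))  # moderate
```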
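The default Gini criterion of point 3) reduces to simple arithmetic on the class counts in each candidate child node; the counts below are invented:

```python
# Gini impurity computed by hand for two candidate splits. The split that
# most reduces the weighted impurity of the child nodes is preferred.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_impurity(left_counts, right_counts):
    n = sum(left_counts) + sum(right_counts)
    return (sum(left_counts) / n) * gini(left_counts) + \
           (sum(right_counts) / n) * gini(right_counts)

# Parent node: 50 responders, 50 non-responders -> gini = 0.5.
# Candidate split A separates the classes better than candidate split B:
a = split_impurity([40, 10], [10, 40])
b = split_impurity([30, 20], [20, 30])
print(f"split A: {a:.3f}  split B: {b:.3f}")   # A is lower (purer), so A wins
```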

Model Deployment -- Any CART model can be easily deployed by translating it into one of the supported languages -- SAS-compatible code, C, or Predictive Modeling Markup Language (PMML)/Extensible Markup Language (XML) -- or into classic text output. This is critical for using your CART trees in large-scale production work.

The decision logic of a CART tree, including the surrogate rules used when primary splitting values are missing, is implemented automatically. The resulting source code can be dropped into an external application, eliminating errors due to hand-coding of decision rules and enabling fast and accurate model deployment.
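Generated scoring code of this kind typically amounts to nested conditionals; the function below is a hypothetical illustration of that shape (including a surrogate fallback for a missing primary splitter), not actual CART export output:

```python
# The flavor of auto-generated scoring code: the tree's decision logic
# becomes plain conditionals that can be dropped into a production
# application. All fields, thresholds, and class labels are invented.

def score(income=None, home_value=None, n_late_payments=None):
    # Node 1: primary splitter is income; home_value is its surrogate.
    if (income is not None and income < 40_000) or \
       (income is None and home_value is not None and home_value < 150_000):
        # Node 2: split on payment history.
        if n_late_payments is not None and n_late_payments > 2:
            return "high_risk"
        return "moderate_risk"
    return "low_risk"

print(score(income=35_000, n_late_payments=4))  # high_risk
```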

Additional Features/Benefits are:

1) Scalable -- Easily and quickly handles gigabyte-sized datasets.

2) GUI and Command-Line Interfaces -- Intuitive point-and-click and command-line control modes (issue commands at a prompt or via batch files).

3) Multiple Variable Types -- Efficiently searches any combination of categorical, continuous, and text data.

4) Automatic Self-Testing Procedures -- Automatically validates tree results using cross validation or user-specified test data.

5) Committee of Experts -- Yields higher accuracy with bootstrap resampling and ARCing technologies for tree combining.

Note: See CART 5.0 New and Enhanced Features (G6G Abstract Number 20053A1) for additional features.

System Requirements

CART requirements.


  • Salford Systems
  • 9685 Via Excelencia
  • Suite 208
  • San Diego, CA 92126
  • USA
  • Telephone: (619) 543-8880
  • Fax: (619) 543-8888

Manufacturer Web Site Salford Systems CART

Price Contact manufacturer.

G6G Abstract Number 20053

G6G Manufacturer Number 102305