See5 and C5.0

Category Intelligent Software>Data Mining Systems/Tools

Abstract See5 (Windows 2000/XP/Vista) and its Unix counterpart C5.0 are advanced data mining tools for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions.

See5/C5.0 features and capabilities:

1) See5/C5.0 has been designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric, time, date, or nominal fields.

See5/C5.0 also takes advantage of processors with quad cores, up to four CPUs, or Intel Hyper-Threading to speed up the analysis.

2) To maximize interpretability, See5/C5.0 classifiers are expressed as decision trees or sets of if-then rules, forms that are generally easier to understand than neural networks.

3) See5/C5.0 is available for Windows 2000/XP/Vista and Linux.

4) See5/C5.0 is easy to use and does not presume any special knowledge of statistics or machine learning (although these don't hurt, either!).

5) RuleQuest provides C source code so that classifiers constructed by See5/C5.0 can be embedded in your organization's own systems.

So what makes See5/C5.0 different? One short answer is its attention to the issue of 'comprehensibility'.

The manufacturer believes that a 'data mining system' should find patterns that provide insight in addition to supporting accurate predictions.

In line with this approach, See5/C5.0 emphasizes 'rule-based classifiers' because they are easier to understand -- each rule can be examined and validated separately, without having to consider it in the context of the classifier as a whole.
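To illustrate why separate rules are easy to inspect, here is a minimal Python sketch of a rule set. It is not output from See5/C5.0 and does not use its file formats; the `Rule` class, the loan-screening attributes, and the three rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Case = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Case], bool]  # the "if" part of the rule
    predicted_class: str               # the "then" part of the rule

    def fires(self, case: Case) -> bool:
        return self.condition(case)


# Hypothetical rules for an invented loan-screening task.
RULES: List[Rule] = [
    Rule("R1", lambda c: c["income"] < 20000 and c["debts"] > 5000, "reject"),
    Rule("R2", lambda c: c["income"] >= 50000, "accept"),
    Rule("R3", lambda c: c["years_employed"] >= 5 and c["debts"] <= 5000, "accept"),
]


def matching_rules(case: Case, rules: List[Rule]) -> List[Rule]:
    """Return every rule whose condition holds for the case."""
    return [rule for rule in rules if rule.fires(case)]


case = {"income": 62000, "debts": 1200, "years_employed": 7}
fired = [rule.name for rule in matching_rules(case, RULES)]
print(fired)  # ['R2', 'R3']
```

Each rule is a self-contained condition and conclusion, so a domain expert can check R2 without reading R1 or R3.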

When you use it you'll also notice that See5/C5.0 is fast.

See5/C5.0 can also generate decision trees, useful in situations where classifiers must be constructed even more quickly.

The emphasis on rule-based classifiers is only one aspect of See5/C5.0 (albeit an important one). Other advanced facilities include:

1) Boosting, a technique for constructing multiple classifiers to improve predictive accuracy;

2) 'Differential misclassification costs', allowing some mistakes to be identified as more important than others;

3) Case weights, when an application needs to specify the importance of each case;

4) Winnowing, which ignores less relevant attributes and estimates the relative importance of those remaining; and

5) Support for 'cross-validation trials' and sampling.
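As an illustration of differential misclassification costs, the sketch below applies the standard expected-cost decision rule at prediction time. It is an assumption about how such costs can be used in general, not a description of how See5/C5.0 applies them internally. The function name, class names, and cost values are hypothetical.

```python
from typing import Dict, Tuple


def min_cost_class(class_probs: Dict[str, float],
                   costs: Dict[Tuple[str, str], float]) -> str:
    """Return the class whose prediction has the lowest expected cost.

    costs[(predicted, actual)] is the cost of predicting `predicted` when
    the true class is `actual`. Pairs that are not listed default to 0 for
    a correct prediction and 1 for an error.
    """
    def expected_cost(predicted: str) -> float:
        return sum(
            prob * costs.get((predicted, actual),
                             0.0 if predicted == actual else 1.0)
            for actual, prob in class_probs.items()
        )

    return min(class_probs, key=expected_cost)


probs = {"healthy": 0.7, "ill": 0.3}

# Hypothetical cost: missing an illness is five times worse than a false alarm.
costs = {("healthy", "ill"): 5.0}

print(min_cost_class(probs, {}))     # healthy (all errors cost the same)
print(min_cost_class(probs, costs))  # ill (expected cost 0.7 versus 1.5)
```

With equal costs the more probable class wins; once one kind of mistake is made more expensive, the prediction shifts toward avoiding it.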

New in Release 2.06 of See5/C5.0 --

1) New algorithm for softening thresholds -

See5/C5.0 decision trees have an option to soften threshold tests for continuous attributes; values near the threshold cause both the low and high branches to be evaluated and combined probabilistically.

The methods for finding the bounds within which this combination is invoked have been re-designed to make them both faster and more effective.

This option can lead to noticeably better predictive performance and is now recommended for applications with many continuous attributes.
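The idea of a softened threshold can be sketched in Python as follows. This is an illustrative assumption, not the redesigned algorithm of Release 2.06: the bounds are supplied by hand, the weight is interpolated linearly between them, and all function and parameter names are hypothetical.

```python
from typing import Dict


def low_branch_weight(x: float, lower: float, threshold: float,
                      upper: float) -> float:
    """Weight given to the low branch; requires lower < threshold < upper."""
    if x <= lower:
        return 1.0   # clearly below the threshold: low branch only
    if x >= upper:
        return 0.0   # clearly above the threshold: high branch only
    if x <= threshold:
        # Falls linearly from 1.0 at `lower` to 0.5 at `threshold`.
        return 1.0 - 0.5 * (x - lower) / (threshold - lower)
    # Falls linearly from 0.5 at `threshold` to 0.0 at `upper`.
    return 0.5 * (upper - x) / (upper - threshold)


def soft_split(x: float, lower: float, threshold: float, upper: float,
               low_dist: Dict[str, float],
               high_dist: Dict[str, float]) -> Dict[str, float]:
    """Blend the class distributions of both branches by their weights."""
    w = low_branch_weight(x, lower, threshold, upper)
    classes = set(low_dist) | set(high_dist)
    return {c: w * low_dist.get(c, 0.0) + (1.0 - w) * high_dist.get(c, 0.0)
            for c in classes}


# Hypothetical test "age <= 50" softened over the interval [40, 60].
low = {"yes": 0.9, "no": 0.1}    # class distribution on the low branch
high = {"yes": 0.2, "no": 0.8}   # class distribution on the high branch
blended = soft_split(55.0, 40.0, 50.0, 60.0, low, high)
print(blended)
```

A value of 55 sits halfway between the threshold and the upper bound, so the low branch receives a weight of 0.25 and the high branch 0.75. Values outside the bounds behave exactly as in a hard threshold test.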

2) Improved boosting -

The boosting option generates several classifiers that are then voted to give a final prediction.

This option, which is commonly used to increase classification accuracy, has been updated to give better results, especially on applications that use differential misclassification costs.
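The voting step can be sketched in Python as shown below. The sketch covers only how several classifiers might be combined into one prediction, not how See5/C5.0 builds or weights them. The three classifiers, their vote weights, and the attribute names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Case = Dict[str, float]
Classifier = Callable[[Case], str]


def vote(case: Case, ensemble: List[Tuple[Classifier, float]]) -> str:
    """Return the class with the largest total vote weight."""
    totals: Dict[str, float] = defaultdict(float)
    for classifier, weight in ensemble:
        totals[classifier(case)] += weight
    return max(totals, key=totals.get)


# Hypothetical ensemble: (classifier, vote weight) pairs.
ensemble: List[Tuple[Classifier, float]] = [
    (lambda c: "high" if c["temp"] > 38.0 else "low", 1.2),
    (lambda c: "high" if c["pulse"] > 100 else "low", 0.7),
    (lambda c: "high" if c["temp"] > 39.0 else "low", 0.4),
]

prediction = vote({"temp": 38.5, "pulse": 90}, ensemble)
print(prediction)  # high
```

In this example two of the three classifiers predict "low", but the first classifier's larger weight (1.2 against a combined 1.1) decides the vote.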

3) Selecting tests -

Recent releases used a subset of the training data to eliminate some possible tests from consideration.

This could (very rarely) lead to different classifiers when the training data were reordered, or when See5/C5.0 was run on computers with multiple CPUs.

The use of data subsets has been discontinued in Release 2.06, at a cost of a small increase in the time required for applications with many continuous attributes and hundreds of thousands of training cases.

4) Faster classification with rule sets -

The process for finding all the rules that are satisfied by a case has been enhanced.

5) Discontinuation of Solaris support -

Solaris on SPARC architectures will no longer be supported.

Any Solaris licensees who might be inconvenienced by this change should contact the manufacturer to discuss possible remedies, such as moving their licenses to different computers.

6) Bug fix: attribute winnowing -

The attribute winnowing option attempts to identify unhelpful attributes and exclude them from classifiers.

A bug that could allow some or all of these attributes to be retained was corrected in January 2009.

System Requirements

Available for Windows 2000/XP/Vista and Linux.

Manufacturer

Manufacturer Web Site See5 and C5.0

Price Contact manufacturer.

G6G Abstract Number 20344

G6G Manufacturer Number 102311