CHAID-like Classification Trees
SAS MACRO
/****************************************************************/
/* S A S   S A M P L E   L I B R A R Y */
/* */
/* NAME: TREEDISC */
/* TITLE: TREEDISC MACRO - BETA VERSION */
/* PRODUCT: IML */
/* SYSTEM: ALL */
/* KEYS: */
/* PROCS: IML */
/* DATA: */
/* */
/* REF: */
/* MISC: Bug with LEAF= and NOMINAL= fixed 03Dec93 */
/* Added color options for DRAW=GR 07Mar94 */
/* Added POS= option for DRAW=GR 10May94 */
/* Added INTO_, TIE_, and POST_ to */
/* OUTTREE= data set 3Sep94 */
/* Added PFORMAT= option 17Oct94 */
/* Bug in CODE= with unformatted ordinal */
/* character variable fixed 15Nov94 */
/* Changed 2nd arg in PROBF call from */
/* 1e10 to 1e9 due to 6.11 limit 25Aug95 */
/* */
/****************************************************************/
The %TREEDISC macro requires the SAS/IML product. To draw the tree, the
SAS/OR product is required, and release 6.08 or later is recommended.
The XMACRO macros from the SAS/STAT sample library in 6.10 or later are
also required. This macro will not work in 6.04 or earlier releases.
Purpose
-------
The %TREEDISC macro generates a SAS data set which describes a decision
tree computed from an input data set to predict a specified categorical
dependent variable from one or more other predictor variables. The tree
can be listed or drawn or used to generate code for a SAS DATA step to
classify observations.
Overview
--------
The decision tree is constructed by partitioning the data set into two
or more subsets of observations based on the categories of one of the
predictor variables. After the data set is partitioned according to the
chosen predictor variable, each subset is considered for further
partitioning using the same algorithm that was applied to the entire
data set. Each subset is partitioned without regard to any other
subset. This process is repeated for each subset until some stopping
criterion is met. This recursive partitioning forms a tree structure.
The "root" of the tree is the entire data set. The subsets and
subsubsets form the "branches" of the tree. Subsets that meet a stopping
criterion and thus are not partitioned are "leaves". Any subset in the
tree, including the root or leaves, is a "node".
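The recursion described above can be sketched in a few lines. This illustration is written in Python rather than SAS/IML and is not taken from the macro; `grow`, `choose_split`, and `toy_chooser` are hypothetical names, and the toy stopping rule merely stands in for the chi-squared criterion that %TREEDISC applies.

```python
from collections import defaultdict

def grow(rows, choose_split, depth=1):
    """Recursively partition rows; returns a nested dict describing the tree.

    choose_split(rows) is a hypothetical callback: it returns None to stop,
    or a function that maps a row to a branch label.
    """
    node = {"size": len(rows), "depth": depth, "children": {}}
    branch_of = choose_split(rows)
    if branch_of is None:            # stopping criterion met: node is a leaf
        return node
    subsets = defaultdict(list)
    for row in rows:
        subsets[branch_of(row)].append(row)
    # Each subset is partitioned without regard to any other subset.
    for label, subset in sorted(subsets.items()):
        node["children"][label] = grow(subset, choose_split, depth + 1)
    return node

def count_leaves(node):
    """Count the nodes that have no children."""
    kids = list(node["children"].values())
    return 1 if not kids else sum(count_leaves(k) for k in kids)

def toy_chooser(rows):
    """Toy rule: split on 'x' while a node has > 2 rows and mixed x values."""
    if len(rows) <= 2 or len({r["x"] for r in rows}) < 2:
        return None
    return lambda r: r["x"]

rows = [{"x": v} for v in "aabbbc"]
tree = grow(rows, toy_chooser)
```

Here the root holds all six rows and is split once into three leaves, one per value of `x`.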
The number of subsets in a partition can range from two up to the number
of categories of the predictor variable. In this regard, %TREEDISC is
similar to the CHAID algorithm (Kass 1980), but differs from CART
(Breiman et al. 1984), which always forms two subsets, and from ID3 or
C4.5 (Quinlan 1993), which make every category a subset.
The predictor variable used to form a partition is chosen to be the
variable that is most significantly associated with the dependent
variable according to a chi-squared test of independence in a
contingency table (a cross-tabulation of the predictor and dependent
variable). The main stopping criterion used by %TREEDISC is the p-value
from this chi-squared test. A small p-value indicates that the observed
association between the predictor and the dependent variable is unlikely
to have occurred solely as the result of sampling variability.
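As a rough illustration of the test statistic, the following Python sketch computes the Pearson chi-squared statistic and its degrees of freedom for a contingency table. It is a textbook calculation, not code from the macro, and the function name `pearson_chi2` is an assumption made for this example.

```python
def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for a table of counts.

    table[i][j] = number of observations in predictor category i and
    dependent-variable category j. Row and column totals must be > 0.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # count under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df

stat, df = pearson_chi2([[10, 20], [20, 10]])        # associated
stat0, df0 = pearson_chi2([[10, 20], [20, 40]])      # exactly independent
```

The p-value is the upper tail probability of a chi-squared distribution with `df` degrees of freedom evaluated at `stat`.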
If a predictor has more than two categories, then there may be a very
large number of ways to partition the data set based on the categories.
A combinatorial search algorithm is used to find a partition that has a
small p-value for the chi-squared test. The p-values for each
chi-squared test are adjusted for the multiplicity of partitions.
Predictors can be nominal (aka free), ordinal (aka monotonic), or
ordinal with a floating category. For a nominal predictor, the
categories are not ordered and therefore can be combined in any way to
form a partition. For an ordinal predictor, the categories are ordered,
and only categories that are adjacent in the order can be combined when
forming a partition. A predictor that is ordinal with a floating
category has categories that are all ordered except for the one floating
category. The ordered categories can be combined only in accordance with
their order, but the floating category can be combined with any other
categories. The floating category is the first category.
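To give a feel for how the predictor type limits the search, the Python sketch below counts the ways to merge c categories into k branches using the multipliers given by Kass (1980). Whether %TREEDISC computes its multiplicity adjustment in exactly this form is not stated here, so treat the function `n_partitions` as an illustrative assumption.

```python
from math import comb, factorial

def n_partitions(c, k, kind):
    """Number of ways to merge c categories into k branches (1 <= k <= c)."""
    if k == 1:
        return 1
    if kind == "ordinal":      # choose k-1 cut points among the c-1 gaps
        return comb(c - 1, k - 1)
    if kind == "nominal":      # Stirling number of the second kind
        total = sum((-1) ** i * comb(k, i) * (k - i) ** c for i in range(k))
        return total // factorial(k)
    if kind == "ordfloat":     # float stands alone, or joins one of k groups
        return comb(c - 2, k - 2) + k * comb(c - 2, k - 1)
    raise ValueError(kind)

# Total number of partitions into 2..10 branches for a 10-category predictor.
totals = {kind: sum(n_partitions(10, k, kind) for k in range(2, 11))
          for kind in ("ordinal", "ordfloat", "nominal")}
```

For ten categories this gives 511 partitions for an ordinal predictor, 1535 for an ordinal predictor with a floating category, and 115974 for a nominal predictor.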
Categories and formats
----------------------
Categories are defined by the formatted values of the variables unless
you specify OPTIONS=NOFORMAT. If the categories are correctly defined by
unformatted values, then OPTIONS=NOFORMAT can save considerable computer
time and disk space. However, using unformatted floating-point variables
with more than 10 digits risks incorrect comparisons due to limits on
numerical precision. Prior to release 6.08, default format lengths
cannot be determined correctly, so be sure to specify a length with each
format.
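The precision risk can be seen in any language that uses binary floating point. The Python sketch below is only an analogy: Python's `format` with a 10-significant-digit specification stands in for a SAS numeric format, and `categories` is a hypothetical helper.

```python
def categories(values, spec=None):
    """Distinct category keys, using raw values or formatted strings."""
    keys = values if spec is None else [format(v, spec) for v in values]
    return sorted(set(keys))

# Three values that are all "0.3" on paper but differ in binary arithmetic.
values = [0.1 + 0.2, 0.3, 0.7 - 0.4]
raw = categories(values)              # more than one category
fmt = categories(values, ".10g")      # a single category, "0.3"
```

Comparing raw values splits what should be one category into several, while comparing formatted values does not.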
When formats are not used, the order of the categories is determined by
their unformatted (internal) values, i.e., numerical order for numeric
variables or alphabetical order for character variables. When formats
are used, the order of the categories is determined according to the
ORDER= argument, which by default uses the same order as for unformatted
values.
Ordinal predictors are allowed to be continuous, rather than
categorical, but the amount of computer time and memory increases with
the number of different values. If the number of different values of a
predictor is very large, it is advisable to use a format to categorize
the predictor.
In this version of %TREEDISC, there is no way to format some variables
but not others. Default formats are $. for character variables and
BEST12. for numeric variables.
Missing values
--------------
Missing values are treated as just another category. Using the default
ORDER=INTERNAL, missing values sort lower than nonmissing values.
If an ordinal floating predictor has missing values, then by default
the floating category will be the missing value. If an ordinal floating
predictor has no missing values, then the floating category will be
the first nonmissing value.
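The ordering rule can be mimicked as follows. In this Python sketch `None` stands in for a SAS missing value, and both function names are hypothetical.

```python
def ordered_categories(values):
    """Distinct values in internal order, with missing (None) sorted first."""
    return sorted(set(values),
                  key=lambda v: (v is not None, 0 if v is None else v))

def floating_category(values):
    """First category in sort order: the missing value if one is present."""
    return ordered_categories(values)[0]

with_missing = ordered_categories([3, None, 1, 3])   # [None, 1, 3]
```

With a missing value present, `floating_category` returns `None`; otherwise it returns the smallest nonmissing value.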
Numeric variables in SAS data sets can contain ordinary missing values
(.) and special missing values (.A, .B, ..., .Z, ._). However, PROC IML
cannot distinguish between different types of missing values. Hence, if
you specify OPTIONS=NOFORMAT, all missing values are treated as being in
the same category. If you do not specify OPTIONS=NOFORMAT, numeric
variables are converted to character variables before running IML, so
different missing values will be distinguishable.
Algorithm
---------
The algorithm is similar to the CHAID algorithm described in Kass
(1980). However, the published algorithm is ambiguous in step 3 on
p. 121, which says:
Step 3. For each compound category consisting of three or more of
the original categories, find the most significant binary
split (constrained by the type of the predictor) into which
the merger may be resolved. ...
This step does not specify how to find the required binary split. Using
direct search, finding an optimal binary split for nominal variables
requires time that is exponential in the number of categories. But the
purpose of the stepwise algorithm that Kass proposes is to avoid using
exponential time, so it is impossible to determine from the published
article how this step was intended to be implemented. The %TREEDISC
macro uses one of many possible compromises to avoid using excessive
time: direct search is used if the number of categories within a
compound category is less than or equal to a threshold specified by the
NOMSPLIT= or ORDSPLIT= argument; otherwise, only the merging step is
used.
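The cost of direct search is easy to see by enumerating the candidate binary splits. The Python sketch below is an illustration only, not the macro's IML code; `binary_splits` is a hypothetical function and `max_cats` plays the role of NOMSPLIT= or ORDSPLIT=.

```python
from itertools import combinations

def binary_splits(cats, ordinal, max_cats):
    """All binary splits of a compound category, or [] above the threshold.

    cats is a tuple of the original categories in the compound category.
    """
    m = len(cats)
    if m < 3 or m > max_cats:          # nothing to resolve, or too costly
        return []
    if ordinal:                        # m - 1 cut points
        return [(cats[:i], cats[i:]) for i in range(1, m)]
    first, rest = cats[0], cats[1:]    # nominal: 2**(m-1) - 1 splits
    splits = []
    for size in range(len(rest)):      # proper subsets of rest join `first`
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            splits.append((left, right))
    return splits

nominal = binary_splits(("a", "b", "c", "d"), ordinal=False, max_cats=10)
ordered = binary_splits(("a", "b", "c", "d"), ordinal=True, max_cats=100)
```

Four categories give 7 nominal splits but only 3 ordinal ones; each extra category roughly doubles the nominal count, which is why a much lower threshold is sensible for nominal predictors.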
The published CHAID algorithm also suffers from possible infinite loops.
To avoid such loops, after each sequence of merges or splits, %TREEDISC
chooses the set of compound categories that yields the minimum adjusted
p-value. The CHAID algorithm uses the final set of compound categories.
The choice used by %TREEDISC prevents infinite loops since the adjusted
p-value can never increase, and also tends to find compound categories
with better p-values than the original algorithm.
Kass (1980) uses a Bonferroni adjustment for the p-values computed from
the contingency tables relating predictors to the dependent variable.
The adjustment is conditional on the number of branches (compound
categories) in the partition and thus does not take into account the
fact that different numbers of branches are considered. The
conservatism shown in the simulations in Kass (1980) is due to the
ineffectiveness of the CHAID algorithm in finding partitions with small
p-values. %TREEDISC is more effective than CHAID at obtaining small
p-values. Simulations with %TREEDISC have shown that the adjusted
p-values can be slightly liberal for 5 or fewer categories in an ordinal
predictor. For example, the observed type 1 error rate may be as high as
6% for an alleged alpha of 5%, or as high as 11% for an alleged alpha of
10%. This degree of liberality should not be of any practical concern.
In addition to the Bonferroni adjustment, %TREEDISC uses Gabriel's
adjustment to increase the power for multiple comparisons in a
contingency table as suggested by Hawkins & Kass (1982). For an
observed chi-square value X for a dependent variable with r categories
and a predictor with c categories merged into k branches, the adjusted
p-value is computed as:
p = min( Bonf. mult. * Prob(chisquare(r-1,k-1)>X),
Prob(chisquare(r-1,c-1)>X) )
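A minimal Python sketch of this formula follows. It assumes that chisquare(r-1,k-1) denotes a chi-squared distribution with (r-1)*(k-1) degrees of freedom, which is an interpretation and not something the text states explicitly. The names `chi2_sf` and `adjusted_p` are hypothetical, and the Bonferroni multiplier is supplied by the caller.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(chi-squared with integer df > x)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # closed form for df = 1
    while a < df / 2.0:                        # add 2 df per iteration
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def adjusted_p(x, r, c, k, bonf_mult):
    """min(Bonferroni-adjusted p for k branches, Gabriel-type p for c)."""
    p_bonferroni = bonf_mult * chi2_sf(x, (r - 1) * (k - 1))
    p_gabriel = chi2_sf(x, (r - 1) * (c - 1))
    return min(p_bonferroni, p_gabriel)

# Hypothetical inputs: the multiplier 4 is C(4,1) for an ordinal predictor.
p_small = adjusted_p(20.0, r=3, c=5, k=2, bonf_mult=4)
# With a large multiplier the second term gives the smaller value.
p_capped = adjusted_p(5.0, r=2, c=10, k=2, bonf_mult=511)
```

Because the second term is itself a probability, the adjusted p-value can never exceed 1 even when the Bonferroni product does.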
Limitations
-----------
The number of observations may not exceed (2**31)-1, which is slightly
over 2 billion. %TREEDISC has been tested with up to a million
observations, which may take several days to process on a workstation
or fast PC.
The number of predictors should not exceed about 4000 due to limitations
in IML. If formatted values are used, the number of predictors is
limited by the number of SAS data sets that can be created under your
operating system.
The number of categories of a predictor or of the dependent variable
may not exceed 32767. For a nominal predictor, a large number of
categories (perhaps several hundred, depending on the machine) may
cause floating-point overflow. It is recommended that you keep the
number of categories fairly small to keep computer time and memory
usage down.
If formatted values are used, the categories must be distinguishable
by the first 16 characters of the formatted values.
There must be enough memory to store contingency tables relating the
dependent variable to each predictor; the minimum number of bytes
required is thus eight times the number of categories for the dependent
variable times the sum of the number of categories of all the
predictors. Additional memory is required depending on the value of the
MAXREAD= argument.
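The memory bound above is simple arithmetic, sketched here in Python. The function name and the category counts in the example are hypothetical.

```python
def min_table_bytes(n_dep_categories, predictor_categories):
    """Lower bound on bytes for the contingency tables (8 bytes per cell)."""
    return 8 * n_dep_categories * sum(predictor_categories)

# Hypothetical case: 3 dependent categories and four predictors.
small = min_table_bytes(3, [22, 23, 43, 35])

# Extreme case: one predictor, with both variables at the 32767 limit.
extreme = min_table_bytes(32767, [32767])
```

The first case needs under 3 KB, whereas the extreme case needs about 8.6 billion bytes for a single table, which is one reason to keep the number of categories small.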
Usage
-----
To construct a decision tree, you must specify the dependent variable
with the DEPVAR= argument and the predictor variables with the NOMINAL=,
ORDINAL=, or ORDFLOAT= arguments. No other arguments are required.
The tree structure can be listed using indentation to show the levels of
the tree by specifying LISTVAR= or OPTIONS=LIST. The SAS/OR procedure
NETDRAW can be used to draw the tree diagram by specifying DRAW= or
DRAWVAR=. You can generate code for a SAS DATA step to classify
observations in a data set by specifying CODE=.
If you are constructing a decision tree and request no specific display
of the tree, then the tree is listed. If you are processing a previously
computed tree and request no specific output, then the tree diagram is
drawn with NETDRAW. It is usually a good idea to list the tree before
drawing it to see how big the tree is.
The arguments may be listed within parentheses in any order, separated
by commas. For example:
%treedisc( data=iris, depvar=species, ordinal=petal: sepal:,
draw=lp, options=list noformat)
Do not use data set names or variable names that begin or end with an
underscore.
Arguments pertaining to input data sets:
DATA= SAS data set to be analyzed. If the data set is
TYPE=TREEDISC, then it is treated the same as the
INTREE= data set. If DATA= is omitted and INTREE= is not
specified, then the most recently created SAS data set
(_LAST_) is used.
INTREE= Name of input data set containing a previously computed
tree. If INTREE= is specified without DATA=, then no new
decision tree is computed, but the existing tree is drawn.
Arguments pertaining to input variables:
DEPVAR= Dependent variable. This argument is required for
computing a decision tree, although not for displaying
a previously computed tree.
FREQ= Frequency variable. Each observation is treated as if it
were repeated as many times as the value of the FREQ=
variable. Observations with a FREQ= value less than or
equal to zero are omitted from the tree construction.
Fractional values are allowed.
NOMINAL= Variable list specifying nominal predictors.
ORDINAL= Variable list specifying ordinal predictors.
ORDFLOAT= Variable list specifying floating predictors. A floating
predictor has its categories on an ordinal scale except
for one (the first) that does not belong with the rest
or whose position on the ordinal scale is unknown.
FORMAT= List of variables and formats to be used with them, using
the same syntax as in a FORMAT statement.
ORDER= ORDER= option to use with PROC FREQ to determine order of
categories. Default is internal (unformatted) order. This
argument does not apply if OPTIONS=NOFORMAT is specified.
Arguments for stopping criteria for construction of the decision tree:
ALPHA= Numeric value in the range (0,1), adjusted significance
level for using a predictor in the tree. If the adjusted
p-value for the most significant (optimally merged)
predictor does not exceed this value, then the node is
subdivided according to the predictor's merged
categories. The default is 0.1.
BRANCH= A positive integer specifying the minimum number of
observations in a node to qualify it for further
subdivision. The default is twice the value of LEAF=.
LEAF= A positive integer specifying the minimum number of
observations allowed in a leaf. Any partition that
would result in a leaf with fewer observations is
rejected. The default is 1.
MAXDEPTH= Maximum depth (number of nodes from root to leaf) allowed
in the tree. The default is 100.
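How the four stopping arguments interact can be summarized as a single check on one candidate partition. This Python sketch is an interpretation of the descriptions above, not the macro's logic: `may_split` is a hypothetical function, and the treatment of the boundary cases (a p-value equal to ALPHA=, a node already at MAXDEPTH=) is assumed.

```python
def may_split(size, depth, best_p, child_sizes,
              alpha=0.1, leaf=1, branch=None, maxdepth=100):
    """Return True if a node may be subdivided by the candidate partition."""
    if branch is None:
        branch = 2 * leaf            # documented default for BRANCH=
    if size < branch:                # too few observations to subdivide
        return False
    if depth >= maxdepth:            # assumption: children would be too deep
        return False
    if min(child_sizes) < leaf:      # partition would create too small a leaf
        return False
    return best_p <= alpha           # assumption: equality counts as significant

ok = may_split(size=50, depth=1, best_p=0.01, child_sizes=[30, 20])
```

In the macro a partition rejected because of LEAF= does not necessarily end the search at that node, since other partitions may still qualify; the sketch checks only one candidate.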
Arguments for saving the decision tree in a SAS data set:
OUTTREE= Name of the output data set describing the decision tree.
The default is _DATA_. The data set type is TREEDISC.
Variables include:
NODE_ is the node number. The root is 1 and the other
nodes in the tree are numbered consecutively.
DEPTH_ is the depth of the node, with the root being 1.
SPLIT_ names the predictor by which the node was
split. For the root, however, the name of the
dependent variable is given.
VALUES_ are values of the predictor variable for each
branch of the split. For the root, however, the
values of the dependent variable are given.
Missing values for character variables (i.e.
blanks) are indicated by a period. Individual
values are in _VAL1, _VAL2, etc.
TOTAL_ is the total sample size.
SIZE_ is the number of observations in the node.
ERRORS_ is the number of classification errors at the
node assuming no further splits.
COUNT_ contains the number of observations in each
category of the dependent variable in the node.
Individual values are in _COU1, _COU2, etc.
PCTOTAL_ contains the percentage of the total number of
observations in each category of the dependent
variable in the node.
Individual values are in _PCT1, _PCT2, etc.
PCNODE_ contains the percentage of the observations in
the node in each category of the dependent
variable.
Individual values are in _PCN1, _PCN2, etc.
PVAL1_ is the smallest adjusted p-value for
further splits.
PVAL2_ is the second smallest adjusted p-value for
further splits.
PVALUES_ gives the two smallest adjusted p-values for
further splits, formatted as usual for p-values.
INTO_ The category of the dependent variable into
which an observation in this branch is
classified by the decision tree.
POST_ The posterior probability estimate, which
is biased upwards.
TIE_ The number of categories that were tied
for greatest posterior probability, if any.
Most of these variable names do not include leading
underscores because NETDRAW insists on using only the
first three characters of a variable name.
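Several of these variables are simple functions of the per-category counts in a node. The Python sketch below shows one plausible derivation. It assumes that POST_ is the largest count divided by the node size, that PCTOTAL_ divides by the total sample size, and that a tie is resolved in favor of the first tied category; none of these details is confirmed by the text, the counts are invented, and `node_summary` is a hypothetical name.

```python
def node_summary(counts, total):
    """Derive node statistics from {dependent category: count} in a node."""
    size = sum(counts.values())
    top = max(counts.values())
    winners = [cat for cat, n in counts.items() if n == top]
    return {
        "SIZE_": size,
        "ERRORS_": size - top,       # misclassified if the node were a leaf
        "PCTOTAL_": [100.0 * n / total for n in counts.values()],
        "PCNODE_": [100.0 * n / size for n in counts.values()],
        "INTO_": winners[0],         # assumption: first tied category wins
        "POST_": top / size,         # assumption: largest share in the node
        "TIE_": len(winners),        # categories sharing the largest count
    }

# Invented counts for a node of 54 observations out of 150.
summary = node_summary({"A": 0, "B": 49, "C": 5}, total=150)
```

The estimate in POST_ is optimistic because the same observations were used both to choose the splits and to compute the proportion.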
Arguments for generating DATA step code to classify data:
CODE= PRINT|LOG|fileref, specifying where to write code for a
DATA step implementing the decision tree computed by
%TREEDISC. This code can be executed to classify the
original data or to classify a test data set to obtain
an unbiased estimate of the misclassification rate.
If you specify a fileref, you can run the code by:
data output_data_set;
set data_to_be_classified;
%inc fileref_in_CODE_argument;
run;
If you are using formats, each variable must be
identifiable by the first seven letters of its name,
since formatted values are named by adding an
underscore as a prefix to the original variable name.
The output data set contains the following new variables:
NODE_ The node number.
INTO_ The category of the dependent variable into
which the observation is classified by the
decision tree.
POST_ The posterior probability estimate, which
is biased upwards.
TIE_ The number of categories that were tied
for greatest posterior probability, if any.
You can use this data set with various procedures to
obtain information such as descriptive statistics
regarding each leaf in the tree.
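Once observations have been classified, the misclassification rate is the share of observations whose predicted category differs from the actual one. The Python sketch below illustrates the calculation on invented labels; in SAS the same summary would normally come from a cross-tabulation of the dependent variable with INTO_. The function names are hypothetical.

```python
from collections import Counter

def crosstab(actual, predicted):
    """Counts of (actual, predicted) pairs."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Fraction of observations classified into the wrong category."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Invented example with four observations, one of them misclassified.
actual = ["S", "O", "O", "V"]
predicted = ["S", "O", "V", "V"]
table = crosstab(actual, predicted)
rate = error_rate(actual, predicted)
```

Computed on the data used to build the tree, this rate is optimistic; computed on a separate test data set, it gives an unbiased estimate.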
Arguments for printed output:
TRACE= NONE|SHORT|MEDIUM|LONG|VERYLONG, amount of printed output
from IML tracing the decision tree computation. The
default is NONE.
TRACE=SHORT only prints the best predictor at each node.
TRACE=MEDIUM reports chi-squared and adjusted p-values
for every predictor considered at each node in the tree.
TRACE=LONG also prints the final contingency table for
each predictor that is selected, as well as the
chi-squared and p-value for each predictor that is
considered as they are computed, which is useful if you
are wondering if the program has hung while processing a
large data set.
TRACE=VERYLONG also prints the final contingency table
for each predictor that is considered.
PFORMAT= Format for printing p-values. By default, p-values
greater than .0001 are printed with a 6.4 format, and
smaller p-values are printed as .0001. If you have a
large sample size and get lots of tiny p-values, then
you might try something like PFORMAT=BEST8. (don't
forget the decimal point).
INDENT= Number of spaces to indent each level of the tree in the
tree listing and code. The default is 4.
SPACE= Number of spaces between values in the list in the
variables VALUES_, COUNT_, PCTOTAL_, and PCNODE_.
The default is 2.
LISTVAR= List of variables in the OUTTREE= data set, optionally
interspersed with quoted strings and formats, to print
in the tree listing. Use a slash to indicate the start
of a new line. Be very careful typing this, since
mistakes can produce baffling error messages. If you do
get error messages, then OPTIONS MPRINT may help to
diagnose them. The default is:
SPLIT_ 'value(s): ' VALUES_ /
'DV counts: ' COUNT_ ' Best p-value(s): ' PVALUES_ /.
Arguments for drawing the tree:
DRAW= Mode for drawing the tree diagram with PROC NETDRAW:
LP|LINEPRINTER (default)
FS|FULLSCREEN
GR|GRAPHICS
NO|NONE
For DRAW=LP, use a *large* line size (OPTIONS LS=).
Setting the page size (OPTIONS PS=) is tricky, since the
tree is drawn starting at the bottom of the page. If you
specify too large a value for PS=, you will get lots of
blank lines at the top of the page.
For interactive use, DRAW=FS is recommended. Use the SCALE
command in NETDRAW, usually SCALE MAX, to see the nodes in
the tree in their entirety. You can vary the size of the
nodes interactively, and you can scroll up, down, left,
and right to see the entire tree. However, the arcs in the
tree may not be drawn correctly in releases prior to 6.08.
For DRAW=GRAPHICS, be *sure* to specify GOPTIONS DEVICE=,
otherwise the tree will be drawn on the SAS log and will
be unintelligible. To see the tree on the screen, it is
better to use SAS Display Manager than to run in line or
batch mode, since with DMS you can zoom the graphics
window and the tree will expand to fill the window; in
non-DMS modes, if you zoom the graphics window, the
tree will stay the same size. This behavior may be
dependent on the operating system.
DRAWVAR= Variables to display in each node of the tree diagram.
The default is SPLIT_ VALUES_ COUNT_ PVALUES_.
BOX= Value of the BOXWIDTH= option to be used with PROC
NETDRAW. The default is 21.
NETOPT= Options to be included in the PROC NETDRAW statement.
The default is OUT=_NETDR_.
ACTOPT= Options to be included in the ACTNET statement. If
DRAW=GRAPHICS is used, the default is NODEFID TREE
CENTERSUBTREE RECTILINEAR. Otherwise, the
default is NODEFID TREE CENTERSUBTREE VBETWEEN=3.
Argument for controlling the size of the graph with DRAW=GRAPHICS:
POS=hpos vpos
Two positive integers giving values for the HPOS= and
VPOS= options in the GOPTIONS statement. If you do not
specify POS=, the tree diagram will occupy one page,
screen, or window. You can enlarge the tree diagram to
two or more pages/screens/windows by specifying values
for POS= equal to the number of character positions used
horizontally and vertically in the tree diagram divided
by the number of pages/screens/windows that you want
horizontally and vertically. Trial and error may be the
easiest way to determine these values.
Arguments for colors with DRAW=GRAPHICS:
CBACK= Color of background. The default depends on the device
and the version of SAS software being used. This argument
sets the background color with the CBACK= option in the
GOPTIONS statement, so the color persists for subsequent
graphs until you change it.
CFILL= Color to use for filling the nodes. Default is the
same as CBACK=.
CTEXT= Color of text in the nodes. The default is WHITE if
CFILL is BLACK, GRAY, or BLUE; otherwise, the default
is BLACK.
CLINES= Color of the lines connecting and outlining nodes.
The default is YELLOW if CFILL is BLACK, GRAY, or BLUE;
otherwise, the default is BLACK.
Arguments for tuning performance and memory usage:
MAXREAD= Maximum number of observations to read into memory at
one time in IML. The default is 100. If IML runs out of
memory, try a smaller value of MAXREAD=.
Miscellaneous options:
OPTIONS= List of additional options separated by blanks:
NOLIST Suppress listing of the tree structure.
LIST List the tree structure anyway.
READ Keep the entire data set in memory instead of
reading it from the disk over and over. This
will use more memory but save time computing the
tree.
DOFREQ Reduce the DATA= data set to a frequency table
using PROC FREQ. This saves time if there are
many duplicate observations in the data set.
NOFORMAT Do not format variables to determine
categories. This can save time if formats are
not needed to define categories. It may also
be necessary for processing a large number of
predictors.
CHAID Use the original CHAID merging and splitting
algorithm.
Esoteric arguments you are not likely to need:
MERGE= Numeric value in the range (0,1), raw significance level
for combining two categories into a compound category.
The default is 0.0001.
SPLIT= Numeric value in the range (0,1), raw significance level
for dividing one compound category that contains at
least three of the original categories into its most
significant binary split. The default is 0.049.
NOMSPLIT= A positive integer specifying the maximum number of
categories in a compound category of a nominal variable
for which splitting will be attempted. The default is 10.
ORDSPLIT= A positive integer specifying the maximum number of
categories in a compound category of an ordinal variable
for which splitting will be attempted. The default is
100.
The following statements may be useful for diagnosing errors:
%let _notes_=1; %* Prints SAS notes for all steps;
%let _echo_=1; %* Prints the arguments to the TREEDISC macro;
%let _echo_=2; %* Prints the arguments to the TREEDISC macro
after defaults have been set;
options mprint; %* Prints SAS code generated by the macro
language;
options mlogic symbolgen; %* Prints lots of macro debugging info;
This macro normally spends a lot of time checking the arguments for
validity, in hopes of avoiding mysterious error messages from the
generated SAS code. You can reduce the amount of time spent checking
arguments (and thereby speed up the macro at the risk of getting
inscrutable error messages if you make a mistake) by using one of the
following statements before invoking the macro:
%let _check_=1; %* reduce argument checking;
%let _check_=0; %* suppress argument checking--use at your own risk!;
References
----------
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984),
_Classification and Regression Trees_, Wadsworth: Belmont, CA.
Hawkins, D.M. & Kass, G.V. (1982), "Automatic Interaction Detection",
in Hawkins, D.M., ed., _Topics in Applied Multivariate Analysis_,
267-302, Cambridge Univ Press: Cambridge.
Kass, G.V. (1980), "An Exploratory Technique for Investigating Large
Quantities of Categorical Data", Applied Statistics, 29, 119-127.
Quinlan, J.R. (1993), _C4.5: Programs for Machine Learning_, Morgan
Kaufmann: San Mateo, CA.
Example
-------
*** The following statement should be changed as necessary
to refer to the location of TREEDISC on your system;
%inc 'treedisc.sas';
proc format;
value specname
1='SETOSA '
2='VERSICOLOR'
3='VIRGINICA ';
value specchar
1='S'
2='O'
3='V';
run;
data iris;
title 'Fisher (1936) Iris Data';
input sepallen sepalwid petallen petalwid species @@;
format species specname.;
label sepallen='Sepal Length in mm.'
sepalwid='Sepal Width in mm.'
petallen='Petal Length in mm.'
petalwid='Petal Width in mm.';
cards;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;
options ls=110 ps=80 nodate nonumber;
*** Compute a tree for predicting SPECIES from the petal and
sepal lengths and widths, which are treated as ordinal
predictors;
%treedisc( data=iris, depvar=species, ordinal=petal: sepal:,
outtree=trd, options=noformat, trace=2);
*** draw the tree diagram in lineprinter mode;
%treedisc( intree=trd, draw=lp);
*** Generate DATA step code to classify observations;
%treedisc( intree=trd, code=print);
*** This time save the code in a file named trdiris.code .
Filenames are system-dependent, so this name may need to be
changed;
%treedisc( intree=trd, code='trdiris.code');
*** Classify the data;
data out;
set iris;
%inc 'trdiris.code';
run;
*** Cross-tabulate actual species with the predicted species (INTO_);
proc freq data=out;
   tables species*into_;
   format species into_ specname.;
run;