CHAID-like Classification Trees
SAS MACRO


 /****************************************************************/
 /*          S A S   S A M P L E   L I B R A R Y                 */
 /*                                                              */
 /*    NAME: TREEDISC                                            */
 /*   TITLE: TREEDISC MACRO - BETA VERSION                       */
 /* PRODUCT: IML                                                 */
 /*  SYSTEM: ALL                                                 */
 /*    KEYS:                                                     */
 /*   PROCS: IML                                                 */
 /*    DATA:                                                     */
 /*                                                              */
 /*     REF:                                                     */
 /*    MISC: Bug with LEAF= and NOMINAL= fixed    03Dec93        */
 /*          Added color options for DRAW=GR      07Mar94        */
 /*          Added POS= option for DRAW=GR        10May94        */
 /*          Added INTO_, TIE_, and POST_ to                     */
 /*             OUTTREE= data set                  3Sep94        */
 /*          Added PFORMAT= option                17Oct94        */
 /*          Bug in CODE= with unformatted ordinal               */
 /*             character variable fixed          15Nov94        */
 /*          Changed 2nd arg in PROBF call from                  */
 /*             1e10 to 1e9 due to 6.11 limit     25Aug95        */
 /*                                                              */
 /****************************************************************/


The %TREEDISC macro requires the SAS/IML product. To draw the tree, the
SAS/OR product is required, and release 6.08 or later is recommended.
The XMACRO macros from the SAS/STAT sample library in 6.10 or later are
also required. This macro will not work in 6.04 or earlier releases.

Purpose
-------

The %TREEDISC macro generates a SAS data set which describes a decision
tree computed from an input data set to predict a specified categorical
dependent variable from one or more other predictor variables. The tree
can be listed or drawn or used to generate code for a SAS DATA step to
classify observations.

Overview
--------

The decision tree is constructed by partitioning the data set into two
or more subsets of observations based on the categories of one of the
predictor variables.  After the data set is partitioned according to the
chosen predictor variable, each subset is considered for further
partitioning using the same algorithm that was applied to the entire
data set.  Each subset is partitioned without regard to any other
subset.  This process is repeated for each subset until some stopping
criterion is met. This recursive partitioning forms a tree structure.
The "root" of the tree is the entire data set.  The subsets and
subsubsets form the "branches" of the tree. Subsets that meet a stopping
criterion and thus are not partitioned are "leaves". Any subset in the
tree, including the root or leaves, is a "node".

The number of subsets in a partition can range from two up to the number
of categories of the predictor variable. In this regard, %TREEDISC is
similar to the CHAID algorithm (Kass 1980), but differs from CART
(Breiman et al. 1984), which always forms two subsets, and from ID3 or
C4.5 (Quinlan 1993), which make every category a subset.

The predictor variable used to form a partition is chosen to be the
variable that is most significantly associated with the dependent
variable according to a chi-squared test of independence in a
contingency table (a cross-tabulation of the predictor and dependent
variable). The main stopping criterion used by %TREEDISC is the p-value
from this chi-squared test. A small p-value indicates that the observed
association between the predictor and the dependent variable is unlikely
to have occurred solely as the result of sampling variability.

If a predictor has more than two categories, then there may be a very
large number of ways to partition the data set based on the categories.
A combinatorial search algorithm is used to find a partition that has a
small p-value for the chi-squared test.  The p-values for each
chi-squared test are adjusted for the multiplicity of partitions.

Predictors can be nominal (aka free), ordinal (aka monotonic), or
ordinal with a floating category.  For a nominal predictor, the
categories are not ordered and therefore can be combined in any way to
form a partition.  For an ordinal predictor, the categories are ordered,
and only categories that are adjacent in the order can be combined when
forming a partition.  A predictor that is ordinal with a floating
category has categories that are all ordered except for the one floating
category. The ordered categories can be combined only in accordance with
their order, but the floating category can be combined with any other
categories. The floating category is the first category.
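
As an illustration of the three predictor types, the call below is a
sketch only. The data set SURVEY and the variables RESPONSE, REGION,
INCOME, and EDUC are hypothetical names and are not part of the macro
or its sample data. REGION is treated as nominal, INCOME as ordinal,
and EDUC as ordinal with a floating category, taken to be its first
category (for example, a missing or "unknown" value).

```sas
*** Hypothetical data set and variable names, for illustration only.
    REGION: categories may be combined in any way.
    INCOME: only adjacent categories may be combined.
    EDUC:   ordered, except that the first category floats;
%treedisc( data=survey, depvar=response,
           nominal=region,
           ordinal=income,
           ordfloat=educ);
```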

Categories and formats
----------------------

Categories are defined by the formatted values of the variables unless
you specify OPTIONS=NOFORMAT. If the categories are correctly defined by
unformatted values, then OPTIONS=NOFORMAT can save considerable computer
time and disk space. However, using unformatted floating-point variables
with more than 10 digits risks incorrect comparisons due to limits on
numerical precision. Prior to release 6.08, default format lengths
cannot be determined correctly, so be sure to specify a length with each
format.

When formats are not used, the order of the categories is determined by
their unformatted (internal) values, i.e., numerical order for numeric
variables or alphabetical order for character variables. When formats
are used, the order of the categories is determined according to the
ORDER= argument, which by default uses the same order as for unformatted
values.

Ordinal predictors are allowed to be continuous, rather than
categorical, but the amount of computer time and memory increases with
the number of different values. If the number of different values of a
predictor is very large, it is advisable to use a format to categorize
the predictor.

In this version of %TREEDISC, there is no way to format some variables
but not others. Default formats are $. for character variables and
BEST12. for numeric variables.
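
The advice to categorize a many-valued predictor with a format might
look like the following sketch. The data set SURVEY, the variables
RESPONSE, AGE, and INCOME, and the format name AGEGRP are all
hypothetical. Because formats apply to every variable, INCOME would
still be categorized by its default format (BEST12.) in this call.

```sas
*** Hypothetical names throughout. AGE is grouped into three ordered
    categories; the explicit width 9 follows the advice to give a
    length with each format;
proc format;
   value agegrp
      low -< 30  = 'under 30'
      30  -< 50  = '30 to 49'
      50  - high = '50 and up';
run;

%treedisc( data=survey, depvar=response, ordinal=age income,
           format=age agegrp9.);
```

With the default ORDER=, the categories keep the order of the
underlying unformatted values, so the three age groups remain in
ascending order.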

Missing values
--------------

Missing values are treated as just another category. Using the default
ORDER=INTERNAL, missing values sort lower than nonmissing values.
If an ordinal floating predictor has missing values, then by default
the floating category will be the missing value. If an ordinal floating
predictor has no missing values, then the floating category will be
the first nonmissing value.

Numeric variables in SAS data sets can contain ordinary missing values
(.) and special missing values (.A, .B, ..., .Z, ._). However, PROC IML
cannot distinguish between different types of missing values. Hence, if
you specify OPTIONS=NOFORMAT, all missing values are treated as being in
the same category. If you do not specify OPTIONS=NOFORMAT, numeric
variables are converted to character variables before running IML, so
different missing values will be distinguishable.

Algorithm
---------

The algorithm is similar to the CHAID algorithm described in Kass
(1980). However, the published algorithm is ambiguous in step 3 on
p. 121, which says:

  Step 3. For each compound category consisting of three or more of
          the original categories, find the most significant binary
          split (constrained by the type of the predictor) into which
          the merger may be resolved. ...

This step does not specify how to find the required binary split. Using
direct search, finding an optimal binary split for nominal variables
requires time that is exponential in the number of categories. But the
purpose of the stepwise algorithm that Kass proposes is to avoid using
exponential time, so it is impossible to determine from the published
article how this step was intended to be implemented.  The %TREEDISC
macro uses one of many possible compromises to avoid using excessive
time: direct search is used if the number of categories within a
compound category is less than or equal to a threshold specified by the
NOMSPLIT= or ORDSPLIT= argument; otherwise, only the merging step is
used.

The published CHAID algorithm also suffers from possible infinite loops.
To avoid such loops, after each sequence of merges or splits, %TREEDISC
chooses the set of compound categories that yields the minimum adjusted
p-value. The CHAID algorithm uses the final set of compound categories.
The choice used by %TREEDISC prevents infinite loops since the adjusted
p-value can never increase, and also tends to find compound categories
with better p-values than the original algorithm.

Kass (1980) uses a Bonferroni adjustment for the p-values computed from
the contingency tables relating predictors to the dependent variable.
The adjustment is conditional on the number of branches (compound
categories) in the partition and thus does not take into account the
fact that different numbers of branches are considered.  The
conservatism shown in the simulations in Kass (1980) is due to the
ineffectiveness of the CHAID algorithm in finding partitions with small
p-values. %TREEDISC is more effective than CHAID at obtaining small
p-values. Simulations with %TREEDISC have shown that the adjusted
p-values can be slightly liberal for 5 or fewer categories in an ordinal
predictor. For example, the observed type 1 error rate may be as high as
6% for an alleged alpha of 5%, or as high as 11% for an alleged alpha of
10%. This degree of liberality should not be of any practical concern.

In addition to the Bonferroni adjustment, %TREEDISC uses Gabriel's
adjustment to increase the power for multiple comparisons in a
contingency table as suggested by Hawkins & Kass (1982).  For an
observed chi-square value X for a dependent variable with r categories
and a predictor with c categories merged into k branches, the adjusted
p-value is computed as:

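   p = min( Bonf. mult. * Prob(chisquare(r-1,k-1)>X),
            Prob(chisquare(r-1,c-1)>X) )

where chisquare(a,b) denotes a chi-square variate with a*b degrees of
freedom, the usual degrees of freedom for a contingency table with a+1
rows and b+1 columns.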
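
The arithmetic of this adjustment can be sketched in a DATA step. This
is not the macro's own code. It takes the chi-square degrees of freedom
to be (r-1)*(k-1) and (r-1)*(c-1), and it assumes the Bonferroni
multiplier that Kass (1980) gives for an ordinal predictor, namely the
number of ways to cut c ordered categories into k adjacent groups. The
numeric values are made up for illustration, and the COMB function
requires a SAS release that provides it.

```sas
data _null_;
   r = 3;       /* categories of the dependent variable (made up)  */
   c = 6;       /* original categories of the predictor (made up)  */
   k = 3;       /* branches after merging (made up)                */
   x = 25.0;    /* observed chi-square for the merged table        */

   /* Assumed multiplier for an ordinal predictor (Kass 1980)      */
   bonf = comb(c-1, k-1);

   /* Bonferroni-adjusted p-value for the merged r-by-k table      */
   p_bonf = bonf * (1 - probchi(x, (r-1)*(k-1)));

   /* Gabriel-type bound using the unmerged r-by-c table's df      */
   p_gabriel = 1 - probchi(x, (r-1)*(c-1));

   p_adj = min(p_bonf, p_gabriel);
   put bonf= p_bonf= p_gabriel= p_adj=;
run;
```

Nominal and floating predictors would need different multipliers, which
this sketch does not cover.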

Limitations
-----------

The number of observations may not exceed (2**31)-1, which is slightly
over 2 billion. %TREEDISC has been tested with up to a million
observations, which may take several days to process on a workstation
or fast PC.

The number of predictors should not exceed about 4000 due to limitations
in IML. If formatted values are used, the number of predictors is
limited by the number of SAS data sets that can be created under your
operating system.

The number of categories of a predictor or of the dependent variable
may not exceed 32767. For a nominal predictor, a large number of
categories (perhaps several hundred, depending on the machine) may
cause floating-point overflow. It is recommended that you keep the
number of categories fairly small to keep computer time and memory
usage down.

If formatted values are used, the categories must be distinguishable
by the first 16 characters of the formatted values.

There must be enough memory to store contingency tables relating the
dependent variable to each predictor; the minimum number of bytes
required is thus eight times the number of categories for the dependent
variable times the sum of the number of categories of all the
predictors. Additional memory is required depending on the value of the
MAXREAD= argument.
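
As a worked example of this lower bound, the DATA step below assumes a
dependent variable with 3 categories and four predictors with 35, 23,
43, and 22 categories. These counts are illustrative only.

```sas
data _null_;
   depcats  = 3;                   /* categories of the dependent variable */
   predcats = 35 + 23 + 43 + 22;   /* summed over the four predictors      */
   minbytes = 8 * depcats * predcats;
   put minbytes=;                  /* writes minbytes=2952 to the log      */
run;
```

The figure covers only the contingency tables. Memory that depends on
MAXREAD= comes on top of it.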

Usage
-----

To construct a decision tree, you must specify the dependent variable
with the DEPVAR= argument and the predictor variables with the NOMINAL=,
ORDINAL=, or ORDFLOAT= arguments. No other arguments are required.

The tree structure can be listed using indentation to show the levels of
the tree by specifying LISTVAR= or OPTIONS=LIST.  The SAS/OR procedure
NETDRAW can be used to draw the tree diagram by specifying DRAW= or
DRAWVAR=. You can generate code for a SAS DATA step to classify
observations in a data set by specifying CODE=.

If you are constructing a decision tree and request no specific display
of the tree, then the tree is listed. If you are processing a previously
computed tree and request no specific output, then the tree diagram is
drawn with NETDRAW. It is usually a good idea to list the tree before
drawing it to see how big the tree is.

The arguments may be listed within parentheses in any order, separated
by commas. For example:

   %treedisc( data=iris, depvar=species, ordinal=petal: sepal:,
        draw=lp, options=list noformat)

Do not use data set names or variable names that begin or end with an
underscore.

Arguments pertaining to input data sets:

   DATA=      SAS data set to be analyzed. If the data set is
              TYPE=TREEDISC, then it is treated the same as the
              INTREE= data set. If neither DATA= nor INTREE= is
              specified, then the most recently created SAS data set
              (_LAST_) is used.

   INTREE=    Name of input data set containing a previously computed
              tree. If INTREE= is specified without DATA=, then no new
              decision tree is computed, but the existing tree is drawn.

Arguments pertaining to input variables:

   DEPVAR=    Dependent variable. This argument is required for
              computing a decision tree, although not for displaying
              a previously computed tree.

   FREQ=      Frequency variable. Each observation is treated as if it
              were repeated as many times as the value of the FREQ=
              variable. Observations with a FREQ= value less than or
              equal to zero are omitted from the tree construction.
              Fractional values are allowed.

   NOMINAL=   Variable list specifying nominal predictors.

   ORDINAL=   Variable list specifying ordinal predictors.

   ORDFLOAT=  Variable list specifying floating predictors. A floating
              predictor has its categories on an ordinal scale except
              for one (the first) that does not belong with the rest
              or whose position on the ordinal scale is unknown.

   FORMAT=    List of variables and formats to be used with them, using
              the same syntax as in a FORMAT statement.

   ORDER=     ORDER= option to use with PROC FREQ to determine order of
              categories. Default is internal (unformatted) order. This
              argument does not apply if OPTIONS=NOFORMAT is specified.
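
The FREQ= argument lets you analyze a table of counts instead of the
raw observations. The following sketch builds such a table by hand.
The data set RAW and the variables REGION, GENDER, and RESPONSE are
hypothetical. COUNT is the frequency variable that PROC FREQ writes to
its OUT= data set.

```sas
*** Hypothetical data set and variable names. The MISSING option keeps
    missing values as categories in the table;
proc freq data=raw noprint;
   tables region*gender*response / missing out=counts;
run;

%treedisc( data=counts, depvar=response, nominal=region gender,
           freq=count);
```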

Arguments for stopping criteria for construction of the decision tree:

   ALPHA=     Numeric value in the range (0,1), adjusted significance
              level for using a predictor in the tree. If the
              adjusted p-value for the most significant (optimally
              merged) predictor does not exceed this value, then the
              node is subdivided according to the predictor's merged
              categories. The default is 0.1.

   BRANCH=    A positive integer specifying the minimum number of
              observations in a node to qualify it for further
              subdivision. The default is twice the value of LEAF=.

   LEAF=      A positive integer specifying the minimum number of
              observations allowed in a leaf. Any partition that
              would result in a leaf with fewer observations is
              rejected. The default is 1.

   MAXDEPTH=  Maximum depth (number of nodes from root to leaf) allowed
              in the tree. The default is 100.
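
The stopping criteria can be combined in one call. The sketch below
uses the IRIS data set from the Example section. The particular values
are arbitrary and are not recommendations.

```sas
*** Require an adjusted p-value of at most 0.05 to split, at least
    5 observations per leaf, at least 15 observations in a node
    before it is considered for splitting, and at most 4 levels;
%treedisc( data=iris, depvar=species, ordinal=petal: sepal:,
           alpha=0.05, leaf=5, branch=15, maxdepth=4,
           options=noformat);
```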

Arguments for saving the decision tree in a SAS data set:

   OUTTREE=   Name of the output data set describing the decision tree.
              The default is _DATA_.  The data set type is TREEDISC.
              Variables include:

              NODE_    is the node number. The root is 1 and the other
                       nodes in the tree are numbered consecutively.

              DEPTH_   is the depth of the node, with the root being 1.

              SPLIT_   names the predictor by which the node was
                       split. For the root, however, the name of the
                       dependent variable is given.

              VALUES_  are values of the predictor variable for each
                       branch of the split. For the root, however, the
                       values of the dependent variable are given.
                       Missing values for character variables (i.e.
                       blanks) are indicated by a period.  Individual
                       values are in _VAL1, _VAL2, etc.

              TOTAL_   is the total sample size.

              SIZE_    is the number of observations in the node.

              ERRORS_  is the number of classification errors at the
                       node assuming no further splits.

              COUNT_   contains the number of observations in each
                       category of the dependent variable in the node.
                       Individual values are in _COU1, _COU2, etc.

              PCTOTAL_ contains the percentage of the total number of
                       observations in each category of the dependent
                       variable in the node.
                       Individual values are in _PCT1, _PCT2, etc.

              PCNODE_  contains the percentage of the observations in
                       the node in each category of the dependent
                       variable.
                       Individual values are in _PCN1, _PCN2, etc.

              PVAL1_   is the smallest adjusted p-value for
                       further splits.

              PVAL2_   is the second smallest adjusted p-value for
                       further splits.

              PVALUES_ gives the two smallest adjusted p-values for
                       further splits, formatted as usual for p-values.

              INTO_    The category of the dependent variable into
                       which an observation in this branch is
                       classified by the decision tree.

              POST_    The posterior probability estimate, which
                       is biased upwards.

              TIE_     The number of categories that were tied
                       for greatest posterior probability, if any.

              Most of these variable names do not include leading
              underscores because NETDRAW insists on using only the
              first three characters of a variable name.

Arguments for generating DATA step code to classify data:

   CODE=      PRINT|LOG|fileref, specifying where to write code for a
              DATA step implementing the decision tree computed by
              %TREEDISC. This code can be executed to classify the
              original data or to classify a test data set to obtain
              an unbiased estimate of the misclassification rate.

              If you specify a fileref, you can run the code by:

                 data output_data_set;
                    set data_to_be_classified;
                    %inc fileref_in_CODE_argument;
                 run;

              If you are using formats, each variable must be
              identifiable by the first seven letters of its name,
              since formatted values are named by adding an
              underscore as a prefix to the original variable name.

              The output data set contains the following new variables:

              NODE_    The node number.

              INTO_    The category of the dependent variable into
                       which the observation is classified by the
                       decision tree.

              POST_    The posterior probability estimate, which
                       is biased upwards.

              TIE_     The number of categories that were tied
                       for greatest posterior probability, if any.

              You can use this data set with various procedures to
              obtain information such as descriptive statistics
              regarding each leaf in the tree.
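
A sketch of classifying a separate test data set with the generated
code follows. It assumes a tree previously saved in TRD (as in the
Example section) and a hypothetical hold-out data set named TEST with
the same variables as the data used to build the tree. The file name
is system-dependent.

```sas
*** TEST is a hypothetical hold-out data set;
filename treecode 'trdiris.code';

%treedisc( intree=trd, code=treecode);

data scored;
   set test;
   %inc treecode;
run;

*** Cross-tabulate actual and predicted categories;
proc freq data=scored;
   tables species*into_;
run;

*** Summarize the posterior probability estimate by node;
proc means data=scored;
   class node_;
   var post_;
run;
```

The off-diagonal cells of the cross-tabulation are the
misclassifications in the test data.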

Arguments for printed output:

   TRACE=     NONE|SHORT|MEDIUM|LONG|VERYLONG, amount of printed output
              from IML tracing the decision tree computation. The
              default is NONE.

              TRACE=SHORT only prints the best predictor at each node.

              TRACE=MEDIUM reports chi-squared and adjusted p-values
              for every predictor considered at each node in the tree.

              TRACE=LONG also prints the final contingency table for
              each predictor that is selected, as well as the
              chi-squared and p-value for each predictor that is
              considered as they are computed, which is useful if you
              are wondering if the program has hung while processing a
              large data set.

              TRACE=VERYLONG also prints the final contingency table
              for each predictor that is considered.

   PFORMAT=   Format for printing p-values. By default, p-values
              greater than .0001 are printed with a 6.4 format, and
              smaller p-values are printed as .0001. If you have a
              large sample size and get lots of tiny p-values, then
              you might try something like PFORMAT=BEST8. (don't
              forget the decimal point).

   INDENT=    Number of spaces to indent each level of the tree in the
              tree listing and code. The default is 4.

   SPACE=     Number of spaces between values in the list in the
              variables VALUES_, COUNT_, PCTOTAL_, and PCNODE_.
              The default is 2.

   LISTVAR=   List of variables in the OUTTREE= data set, optionally
              interspersed with quoted strings and formats, to print
              in the tree listing. Use a slash to indicate the start
              of a new line. Be very careful typing this, since
              mistakes can produce baffling error messages. If you do
              get error messages, then OPTIONS MPRINT may help to
              diagnose them. The default is:
              SPLIT_ 'value(s): ' VALUES_ /
              'DV counts: ' COUNT_ ' Best p-value(s): ' PVALUES_ /.
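
A customized listing might be requested as in the sketch below, which
is modeled on the default LISTVAR= value and uses only variables
documented for the OUTTREE= data set. It assumes a tree previously
saved in TRD, as in the Example section.

```sas
*** List each node's split and values on one line and its size and
    error count on the next, without drawing the tree;
%treedisc( intree=trd, draw=none,
           listvar=SPLIT_ 'value(s): ' VALUES_ /
                   'size: ' SIZE_ ' errors: ' ERRORS_ /);
```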

Arguments for drawing the tree:

   DRAW=      Mode for drawing the tree diagram with PROC NETDRAW:
                 LP|LINEPRINTER  (default)
                 FS|FULLSCREEN
                 GR|GRAPHICS
                 NO|NONE

              For DRAW=LP, use a *large* line size (OPTIONS LS=).
              Setting the page size (OPTIONS PS=) is tricky, since the
              tree is drawn starting at the bottom of the page. If you
              specify too large a value for PS=, you will get lots of
              blank lines at the top of the page.

              For interactive use, DRAW=FS is recommended. Use the SCALE
              command in NETDRAW, usually SCALE MAX, to see the nodes in
              the tree in their entirety. You can vary the size of the
              nodes interactively, and you can scroll up, down, left,
              and right to see the entire tree. However, the arcs in the
              tree may not be drawn correctly in releases prior to 6.08.

              For DRAW=GRAPHICS, be *sure* to specify GOPTIONS DEVICE=,
              otherwise the tree will be drawn on the SAS log and will
              be unintelligible. To see the tree on the screen, it is
              better to use SAS Display Manager than to run in line or
              batch mode, since with DMS you can zoom the graphics
              window and the tree will expand to fill the window; in
              non-DMS modes, if you zoom the graphics window, the
              tree will stay the same size. This behavior may be
              dependent on the operating system.

   DRAWVAR=   Variables to display in each node of the tree diagram.
              The default is SPLIT_ VALUES_ COUNT_ PVALUES_.

   BOX=       Value of the BOXWIDTH= option to be used with PROC
              NETDRAW. The default is 21.

   NETOPT=    Options to be included in the PROC NETDRAW statement.
              The default is OUT=_NETDR_.

   ACTOPT=    Options to be included in the ACTNET statement.  If
              DRAW=GRAPHICS is used, the default is NODEFID TREE
              CENTERSUBTREE RECTILINEAR. Otherwise, the
              default is NODEFID TREE CENTERSUBTREE VBETWEEN=3.

Argument for controlling the size of the graph with DRAW=GRAPHICS:

   POS=hpos vpos
              Two positive integers giving values for the HPOS= and
              VPOS= options in the GOPTIONS statement. If you do not
              specify POS=, the tree diagram will occupy one page,
              screen, or window. You can enlarge the tree diagram to
              two or more pages/screens/windows by specifying values
              for POS= equal to the number of character positions used
              horizontally and vertically in the tree diagram divided
              by the number of pages/screens/windows that you want
              horizontally and vertically. Trial and error may be the
              easiest way to determine these values.

Arguments for colors with DRAW=GRAPHICS:

   CBACK=     Color of background. The default depends on the device
              and the version of SAS software being used. This argument
              sets the background color with the CBACK= option in the
              GOPTIONS statement, so the color persists for subsequent
              graphs until you change it.

   CFILL=     Color to use for filling the nodes. Default is the
              same as CBACK=.

   CTEXT=     Color of text in the nodes. The default is WHITE if
              CFILL is BLACK, GRAY, or BLUE; otherwise, the default
              is BLACK.

   CLINES=    Color of the lines connecting and outlining nodes.
              The default is YELLOW if CFILL is BLACK, GRAY, or BLUE;
              otherwise, the default is BLACK.
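
A graphics-mode call using the color and size arguments might look
like the following sketch. It assumes a tree previously saved in TRD,
as in the Example section. The device name WIN and the POS= values are
only examples, so substitute values suited to your system.

```sas
*** The device name is system-dependent, WIN is only an example;
goptions device=win;

*** CTEXT= is left at its default, which is WHITE when CFILL= is
    BLUE. CLINES= is set explicitly because its default of YELLOW
    would be hard to see on a white background;
%treedisc( intree=trd, draw=gr, pos=120 60,
           cback=white, cfill=blue, clines=black);
```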

Arguments for tuning performance and memory usage:

   MAXREAD=   Maximum number of observations to read into memory at
              one time in IML. The default is 100. If IML runs out of
              memory, try a smaller value of MAXREAD=.

Miscellaneous options:

   OPTIONS=   List of additional options separated by blanks:

              NOLIST   Suppress listing of the tree structure.

              LIST     List the tree structure anyway.

              READ     Keep the entire data set in memory instead of
                       reading it from the disk over and over. This
                       will use more memory but save time computing the
                       tree.

              DOFREQ   Reduce the DATA= data set to a frequency table
                       using PROC FREQ. This saves time if there are
                       many duplicate observations in the data set.

              NOFORMAT Do not format variables to determine
                       categories. This can save time if formats are
                       not needed to define categories. It may also
                       be necessary for processing a large number of
                       predictors.

              CHAID    Use the original CHAID merging and splitting
                       algorithm.

Esoteric arguments you are not likely to need:


   MERGE=     Numeric value in the range (0,1), raw significance level
              for combining two categories into a compound category.
              The default is 0.0001.

   SPLIT=     Numeric value in the range (0,1), raw significance level
              for dividing one compound category that contains at
              least three of the original categories into its most
              significant binary split. The default is 0.049.

   NOMSPLIT=  A positive integer specifying the maximum number of
              categories in a compound category of a nominal variable
              for which splitting will be attempted.  The default is 10.

   ORDSPLIT=  A positive integer specifying the maximum number of
              categories in a compound category of an ordinal variable
              for which splitting will be attempted.  The default is
              100.

The following statements may be useful for diagnosing errors:

   %let _notes_=1;       %* Prints SAS notes for all steps;
   %let _echo_=1;        %* Prints the arguments to the TREEDISC macro;
   %let _echo_=2;        %* Prints the arguments to the TREEDISC macro
                            after defaults have been set;
   options mprint;       %* Prints SAS code generated by the macro
                            language;
   options mlogic symbolgen; %* Prints lots of macro debugging info;

This macro normally spends a lot of time checking the arguments for
validity, in hopes of avoiding mysterious error messages from the
generated SAS code.  You can reduce the amount of time spent checking
arguments (and thereby speed up the macro at the risk of getting
inscrutable error messages if you make a mistake) by using one of the
following statements before invoking the macro:

   %let _check_=1; %* reduce argument checking;
   %let _check_=0; %* suppress argument checking--use at your own risk!;

References
----------

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984),
_Classification and Regression Trees_, Wadsworth: Belmont, CA.

Hawkins, D.M. & Kass, G.V. (1982), "Automatic Interaction Detection",
in Hawkins, D.M., ed., _Topics in Applied Multivariate Analysis_,
267-302, Cambridge Univ Press: Cambridge.

Kass, G.V. (1980), "An Exploratory Technique for Investigating Large
Quantities of Categorical Data", Applied Statistics, 29, 119-127.

Quinlan, J.R. (1993), _C4.5: Programs for Machine Learning_, Morgan
Kaufmann: San Mateo, CA.

Example
-------

*** The following statement should be changed as necessary
    to refer to the location of TREEDISC on your system;
%inc 'treedisc.sas';

proc format;
   value specname
      1='SETOSA    '
      2='VERSICOLOR'
      3='VIRGINICA ';
   value specchar
      1='S'
      2='O'
      3='V';
run;

data iris;
   title 'Fisher (1936) Iris Data';
   input sepallen sepalwid petallen petalwid species @@;
   format species specname.;
   label sepallen='Sepal Length in mm.'
         sepalwid='Sepal Width  in mm.'
         petallen='Petal Length in mm.'
         petalwid='Petal Width  in mm.';
   cards;
 50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
 63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
 59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
 65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
 68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
 77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
 49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
 64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
 55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
 49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
 67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
 77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
 50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
 61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
 61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
 51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
 51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
 46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
 50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
 57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
 71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
 49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
 49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
 66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
 44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
 47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
 74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
 56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
 49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
 56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
 51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
 54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
 61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
 68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
 45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
 55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
 51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
 63 33 60 25 3 53 37 15 02 1
 ;

options ls=110 ps=80 nodate nonumber;

*** Compute a tree for predicting SPECIES from the petal and
    sepal lengths and widths, which are treated as ordinal
    predictors;
%treedisc( data=iris, depvar=species, ordinal=petal: sepal:,
           outtree=trd, options=noformat, trace=long);

*** draw the tree diagram in lineprinter mode;
%treedisc( intree=trd, draw=lp);

*** Generate DATA step code to classify observations;
%treedisc( intree=trd, code=print);

*** This time save the code in a file named trdiris.code .
    Filenames are system-dependent, so this name may need to be
    changed;
%treedisc( intree=trd, code='trdiris.code');

*** Classify the data;
data out;
   set iris;
   %inc 'trdiris.code';
run;

*** Cross-tabulate actual species with the predicted species (INTO_);
proc freq data=out; tables species*into_; format species into_ specname.; run;