Factor
The factor data type is much more complicated. A factor column
maintains a list of strings representing the levels of the factor.
When the file reader proc reads a string from a file or database into
a factor column, it compares the string to the list of level strings
and converts it into a level number (1 through the number of
levels).
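The lookup described above can be sketched in a few lines of Python.
This is only an illustration, not the pipeline's own code: the
function name to_level_number and the sample levels are hypothetical,
and returning None (standing in for NA) for a string that is not in
the list is an assumption made for this sketch.

```python
from typing import List, Optional


def to_level_number(value: str, levels: List[str]) -> Optional[int]:
    """Map a string to its 1-based level number within a fixed level list.

    Returns None (standing in for NA) when the string is not a known
    level; that choice is an assumption of this sketch.
    """
    try:
        return levels.index(value) + 1  # level numbers start at 1
    except ValueError:
        return None


# Hypothetical level list and input strings, for illustration only.
levels = ["low", "medium", "high"]
codes = [to_level_number(s, levels) for s in ["medium", "low", "high", "medium"]]
print(codes)  # [2, 1, 3, 2]
```

The column then stores the small integers rather than the strings;
the level list is the only place each distinct string is kept.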
If you know ahead of time all of the possible factor levels that can
be read, you can simply set up a factor column with these levels.
However, if you don't know all of the possible factor levels, there is a
potential problem. The pipeline is designed to be used for problems
with very large amounts of data. If every new factor level that
appears is simply added to the level list, then it is possible that the
system would allocate more and more different level strings, until you
run out of memory. Even if the number of possible levels is relatively
small, there are situations where it is useful to restrict it even further.
Some operations such as tree modeling and crosstabs can take
massive amounts of time or space for variables with many factor
levels.
To control the number of automatically created factor levels, each
column has two properties: max.auto.levels and overflow.level.
For a factor data type, the max.auto.levels property determines the
maximum number of levels that will be automatically created. The
actual number of levels in a factor can be set larger than this by
explicitly setting the level strings. The default value of
max.auto.levels is 10. A simple way to disable auto-creation of
factor levels is to set max.auto.levels to 0.
The overflow.level property is used when handling a new factor
level. If adding the new level would cause the number of levels to
exceed the max.auto.levels property, the overflow.level string is
used instead. For example, if the levels are "yes" and "no",
max.auto.levels is 2, and overflow.level is "yes", then all other
factor values will map to "yes". If the overflow level is not one of
the existing levels, it is added to the level list, and it is added
early enough that the total number of levels does not exceed
max.auto.levels. If overflow.level is "", the default, then overflow
values are mapped to NA.
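The interaction of the two properties can be sketched in Python as
follows. This is a minimal model of the behavior described above, not
the pipeline's implementation: the class FactorColumn, its method
encode, and the attribute names are hypothetical, and None stands in
for NA. One detail is an assumption of the sketch: if the overflow
level is not yet in the list and there is no room left for it (for
example, when max.auto.levels is 0), the value maps to NA.

```python
from typing import Iterable, List, Optional


class FactorColumn:
    """Hypothetical model of a factor column with capped auto-levels."""

    def __init__(self, levels: Iterable[str] = (),
                 max_auto_levels: int = 10, overflow_level: str = "") -> None:
        # Explicitly supplied levels may outnumber max_auto_levels.
        self.levels: List[str] = list(levels)
        self.max_auto_levels = max_auto_levels
        self.overflow_level = overflow_level

    def encode(self, value: str) -> Optional[int]:
        """Return the 1-based level number for value, or None for NA."""
        if value in self.levels:
            return self.levels.index(value) + 1

        # Keep one slot free for the overflow level if it is set and
        # not yet in the list, so that adding it later stays in bounds.
        reserve = 0
        if (self.overflow_level
                and self.overflow_level not in self.levels
                and value != self.overflow_level):
            reserve = 1

        if len(self.levels) + 1 + reserve <= self.max_auto_levels:
            self.levels.append(value)  # auto-create the new level
            return len(self.levels)

        # Overflow: the new level would exceed max_auto_levels.
        if not self.overflow_level:
            return None  # default overflow level "" maps to NA
        if self.overflow_level not in self.levels:
            if len(self.levels) >= self.max_auto_levels:
                return None  # assumption of this sketch: no room, so NA
            self.levels.append(self.overflow_level)
        return self.levels.index(self.overflow_level) + 1


# The example from the text: levels "yes" and "no", cap 2, overflow "yes".
yes_no = FactorColumn(["yes", "no"], max_auto_levels=2, overflow_level="yes")
yes_no_codes = [yes_no.encode(s) for s in ["no", "maybe", "yes", "unknown"]]
print(yes_no_codes)  # [2, 1, 1, 1]

# An overflow level that is not an existing level gets a reserved slot.
capped = FactorColumn(max_auto_levels=3, overflow_level="other")
capped_codes = [capped.encode(s) for s in ["a", "b", "c", "d", "a"]]
print(capped_codes, capped.levels)  # [1, 2, 3, 3, 1] ['a', 'b', 'other']
```

In the second example only two levels are auto-created from the data,
because the third slot is held back for "other"; this is one way to
read the requirement that the overflow level be added without
exceeding max.auto.levels.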