shared
Class SplitAttr

java.lang.Object
  |
  +--shared.SplitScore
        |
        +--shared.SplitAttr

public class SplitAttr
extends SplitScore

A class for determining, holding, and returning the information associated with an attribute split.


Field Summary
static int multiRealThresholdSplit
          SplitTypeEnum value.
static int nominalSplit
          SplitTypeEnum value.
static int noReasonableSplit
          SplitTypeEnum value.
static int partitionSplit
          SplitTypeEnum value.
static int realThresholdSplit
          SplitTypeEnum value.
static java.lang.String[] splitTypeEnum
          Names of SplitTypeEnum values.
 
Fields inherited from class shared.SplitScore
defaultSplitScoreCriterion, externalScore, gainRatio, logOptions, mutualInfo, mutualInfoRatio, normalizedMutualInfo, splitScoreCriterionEnum
 
Constructor Summary
SplitAttr()
          Constructor.
 
Method Summary
 void copy(SplitAttr original)
          Copies the given SplitAttr into this SplitAttr.
 boolean exist_split()
          Returns TRUE if there is a split stored in this SplitAttr.
 void free_type_info()
          Delete and clear typeInfo.
 int get_attr_num()
          Returns the attribute number for this split.
 double get_gain_ratio(boolean penalize)
          Returns the gain ratio.
 double get_mutual_info(boolean normalize, boolean penalize)
          Returns the mutual information.
 boolean get_penalize_by_mdl()
          Returns whether this split is penalized by minimum description length (MDL).
 void initialize(InstanceList[] instLists, int attributeNumber)
          Initialize attribute data and distribution arrays.
 boolean make_nominal_split()
          Helper function to do all processing for nominals.
 boolean make_nominal_split(InstanceList instList, int attributeNumber)
          Helper function to do all processing for nominals.
 boolean make_real_split(RealAndLabelColumn column, int attrNum, double minSplit, int smoothInst, double smoothFactor)
          Helper function to do all processing for real thresholds.
static boolean ok_to_split(int attrNum, BagCounters counters, double minSplit)
          Check if it is OK to make a split on the nominal attribute by making sure at least two branches have more than minSplit instances.
 double penalty()
          Get penalty.
 void reset()
          Reset values, except attribute number.
 void save_real_split(DoubleRef thresh, IntRef splitIndex, IntRef numDistinct)
          The data calculated by find_best_threshold() is saved in the SplitAttr via this function.
 double score()
          The criterion calculation depends on the score criterion.
 double score(double[][] sAndLDist, double[] sDist, double[] lDist)
          Computes the score and updates the cache; intended for cases where scores are computed many times for the same number of instances and entropy.
 double score(double[][] sAndLDist, double[] sDist, double[] lDist, double entropy)
          Computes the score and updates the cache; intended for cases where scores are computed many times for the same number of instances and entropy.
 double score(double[][] sAndLDist, double[] sDist, double[] lDist, double entropy, double totalWeight)
          Computes the score and updates the cache; intended for cases where scores are computed many times for the same number of instances and entropy.
 void set_attr_num(int num)
          Sets the attribute number for this split.
 void set_penalize_by_mdl(boolean choice)
          Sets if the split should be penalized by minimum description length.
 int split_type()
          Returns the type value of this SplitAttr.
 double threshold()
          Return the threshold.
 
Methods inherited from class shared.SplitScore
assign, display, display, display, get_cond_entropy, get_entropy, get_external_score, get_gain_ratio, get_label_dist, get_log_level, get_log_options, get_log_stream, get_mutual_info_ratio, get_mutual_info, get_split_and_label_dist, get_split_dist, get_split_entropy, get_split_score_criterion, get_unnormalized_mutual_info, has_distribution, has_distribution, has_external_score, normalize_by_num_splits, num_splits, release_label_dist, release_split_and_label_dist, release_split_dist, set_external_score, set_log_level, set_log_options, set_log_prefixes, set_log_stream, set_split_and_label_dist, set_split_dist, set_split_score_criterion, total_weight
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

noReasonableSplit

public static final int noReasonableSplit
SplitTypeEnum value.

realThresholdSplit

public static final int realThresholdSplit
SplitTypeEnum value.

multiRealThresholdSplit

public static final int multiRealThresholdSplit
SplitTypeEnum value.

nominalSplit

public static final int nominalSplit
SplitTypeEnum value.

partitionSplit

public static final int partitionSplit
SplitTypeEnum value.

splitTypeEnum

public static java.lang.String[] splitTypeEnum
Names of SplitTypeEnum values.

Constructor Detail

SplitAttr

public SplitAttr()
Constructor.

Method Detail

split_type

public int split_type()
Returns the type value of this SplitAttr.
Returns:
The type value of this attribute.
See Also:
noReasonableSplit, realThresholdSplit, multiRealThresholdSplit, nominalSplit, partitionSplit

get_mutual_info

public double get_mutual_info(boolean normalize,
                              boolean penalize)
Returns the mutual information. The mutual information must be >= 0.
Parameters:
normalize - TRUE if the mutual info is to be normalized, FALSE otherwise.
penalize - TRUE if the mutual info should be penalized, FALSE otherwise.
Returns:
The mutual information for this attribute split.
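
As an illustration of the quantity this method reports, the following is a minimal sketch of mutual information computed from a split-and-label weight table. It is not the SplitAttr implementation: the class name MutualInfoSketch, the helper names, and the [split][label] index order are assumptions made for the example, and the normalize and penalize options are not modeled.

```java
// Illustrative sketch only; not the shared.SplitAttr source.
// Assumption: sAndLDist is indexed as [split][label] and holds instance weights.
final class MutualInfoSketch {

    /** Entropy in bits of a weight vector whose entries sum to total. */
    static double entropy(double[] weights, double total) {
        double h = 0.0;
        for (double w : weights) {
            if (w > 0.0) {
                double p = w / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Mutual information I(S;L) = H(L) - H(L|S), clamped so it is never negative. */
    static double mutualInfo(double[][] sAndLDist) {
        int numSplits = sAndLDist.length;
        int numLabels = sAndLDist[0].length;
        double[] sDist = new double[numSplits];
        double[] lDist = new double[numLabels];
        double total = 0.0;
        for (int s = 0; s < numSplits; s++) {
            for (int l = 0; l < numLabels; l++) {
                sDist[s] += sAndLDist[s][l];
                lDist[l] += sAndLDist[s][l];
                total += sAndLDist[s][l];
            }
        }
        // Conditional entropy: label entropy inside each branch, weighted by branch size.
        double condEntropy = 0.0;
        for (int s = 0; s < numSplits; s++) {
            if (sDist[s] > 0.0) {
                condEntropy += (sDist[s] / total) * entropy(sAndLDist[s], sDist[s]);
            }
        }
        return Math.max(0.0, entropy(lDist, total) - condEntropy);
    }

    public static void main(String[] args) {
        // A split that separates two equally frequent labels perfectly.
        System.out.println(mutualInfo(new double[][] {{4, 0}, {0, 4}}));
        // A split that carries no information about the label.
        System.out.println(mutualInfo(new double[][] {{2, 2}, {2, 2}}));
    }
}
```

In this sketch a perfect two-way split of two equally frequent labels yields 1 bit, and a split independent of the label yields 0 bits, which is consistent with the requirement that the mutual information be >= 0.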

exist_split

public boolean exist_split()
Returns TRUE if there is a split stored in this SplitAttr.
Returns:
Returns TRUE if the SplitAttr contains a split, FALSE otherwise.

set_penalize_by_mdl

public void set_penalize_by_mdl(boolean choice)
Sets if the split should be penalized by minimum description length.
Parameters:
choice - TRUE if penalizing should occur, FALSE otherwise.

make_real_split

public boolean make_real_split(RealAndLabelColumn column,
                               int attrNum,
                               double minSplit,
                               int smoothInst,
                               double smoothFactor)
Helper function to do all processing for real thresholds. When a split is found, this also determines and stores the threshold, the mutual info, the cond info, and the MDL penalty for the cost of storing the threshold.
Parameters:
column - The column of real values for this attribute and their associated label values.
attrNum - The number of the attribute.
minSplit - The minimum split value.
smoothInst - The instance to be smoothed towards.
smoothFactor - The factor by which real values are smoothed.
Returns:
TRUE if a real threshold split was found for the attribute, FALSE otherwise.
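
The kind of threshold search this helper performs can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's algorithm: RealSplitSketch and bestThreshold are hypothetical names, instances are unweighted, candidates are scored by mutual information only, minSplit is taken to mean the minimum number of instances on each side, and smoothing (smoothInst, smoothFactor) and the MDL penalty are left out.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only; not the shared.SplitAttr source.
final class RealSplitSketch {

    /** Entropy in bits of a count vector whose entries sum to total. */
    static double entropy(double[] counts, double total) {
        double h = 0.0;
        for (double c : counts) {
            if (c > 0.0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /**
     * Returns the midpoint threshold with the highest mutual information,
     * or NaN when no candidate leaves at least minSplit instances on each side.
     */
    static double bestThreshold(double[] values, int[] labels, int numLabels, int minSplit) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Visit instances in increasing order of attribute value.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> values[i]));

        double[] left = new double[numLabels];
        double[] right = new double[numLabels];
        for (int label : labels) {
            right[label]++;
        }
        double labelEntropy = entropy(right, n);

        double bestInfo = -1.0;
        double best = Double.NaN;
        for (int k = 0; k < n - 1; k++) {
            int idx = order[k];
            left[labels[idx]]++;   // move one instance from the right side to the left
            right[labels[idx]]--;
            double lo = values[idx];
            double hi = values[order[k + 1]];
            int nLeft = k + 1;
            int nRight = n - nLeft;
            // Only boundaries between distinct values with enough instances per side qualify.
            if (lo == hi || nLeft < minSplit || nRight < minSplit) {
                continue;
            }
            double condEntropy =
                (nLeft * entropy(left, nLeft) + nRight * entropy(right, nRight)) / n;
            double info = labelEntropy - condEntropy;
            if (info > bestInfo) {
                bestInfo = info;
                best = (lo + hi) / 2.0;
            }
        }
        return best;
    }
}
```

A caller in this sketch would treat a NaN result the way the real method treats a FALSE return: no usable real threshold split exists for the attribute.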

save_real_split

public void save_real_split(DoubleRef thresh,
                            IntRef splitIndex,
                            IntRef numDistinct)
The data calculated by find_best_threshold() is saved in the SplitAttr via this function.
Parameters:
thresh - The threshold to be saved.
splitIndex - The index of the split to be saved.
numDistinct - The number of distinct splits.

free_type_info

public void free_type_info()
Delete and clear typeInfo.

reset

public void reset()
Reset values, except attribute number.
Overrides:
reset in class SplitScore

set_attr_num

public void set_attr_num(int num)
Sets the attribute number for this split.
Parameters:
num - The number of the new attribute.

make_nominal_split

public boolean make_nominal_split(InstanceList instList,
                                  int attributeNumber)
Helper function to do all processing for nominals. Nominal splits always exist.
Parameters:
instList - The InstanceList over which to make a nominal split.
attributeNumber - The number of the attribute to be split.
Returns:
TRUE if a nominal split exists, FALSE if not.

make_nominal_split

public boolean make_nominal_split()
Helper function to do all processing for nominals. Nominal splits always exist.
Returns:
TRUE if a nominal split exists, FALSE if not.

ok_to_split

public static boolean ok_to_split(int attrNum,
                                  BagCounters counters,
                                  double minSplit)
Check if it is OK to make a split on the nominal attribute by making sure at least two branches have more than minSplit instances. The split must produce 2 disjoint sets, each containing at least 'minSplit' instances (so there need to be at least twice 'minSplit' instances in total). This function checks whether there are enough instances for such a split to occur. The minSplit must be at least 1.
Parameters:
attrNum - The number of the attribute to be checked.
counters - Counters of the values for this attribute.
minSplit - The minimum number of instances each of the two sets must contain.
Returns:
TRUE if the attribute is ok to split.
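
The check described above can be sketched in a few lines. This is a hypothetical stand-in, not the library code: OkToSplitSketch is an invented name, a plain array of per-value instance counts replaces BagCounters, and "at least minSplit" is assumed as the comparison (the summary wording "more than minSplit" would use a strict comparison instead).

```java
// Illustrative sketch only; not the shared.SplitAttr source.
final class OkToSplitSketch {

    /**
     * Returns true when at least two attribute values each cover
     * at least minSplit instances. valueCounts[v] is the number of
     * instances whose attribute takes value v.
     */
    static boolean okToSplit(double[] valueCounts, double minSplit) {
        if (minSplit < 1.0) {
            throw new IllegalArgumentException("minSplit must be at least 1");
        }
        int bigEnough = 0;
        for (double count : valueCounts) {
            if (count >= minSplit) {
                bigEnough++;
                if (bigEnough >= 2) {
                    return true; // two sufficiently large branches found
                }
            }
        }
        return false;
    }
}
```

For example, with minSplit = 2, counts of {5, 5, 0} pass the check, while counts of {9, 1} do not because only one branch is large enough.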

get_attr_num

public int get_attr_num()
Returns the attribute number for this split.
Returns:
The attribute number.

penalty

public double penalty()
Get penalty. Only valid if you are penalizing.
Returns:
Returns the penalty value.

get_penalize_by_mdl

public boolean get_penalize_by_mdl()
Returns whether this split is penalized by minimum description length (MDL).
Returns:
TRUE if the split is penalized by minimum description length, FALSE otherwise.

get_gain_ratio

public double get_gain_ratio(boolean penalize)
Returns the gain ratio.
Parameters:
penalize - TRUE if penalization should occur, FALSE otherwise.
Returns:
The gain ratio.
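
For orientation, gain ratio in its usual textbook form is the mutual information divided by the entropy of the split distribution. The sketch below shows that definition only; GainRatioSketch is a hypothetical name, the penalize option is not modeled, and returning 0 for a zero split entropy is an assumption of this example rather than documented behavior.

```java
// Illustrative sketch only; not the shared.SplitAttr source.
final class GainRatioSketch {

    /** Entropy in bits of a weight vector (weights need not be normalized). */
    static double entropy(double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double h = 0.0;
        for (double w : weights) {
            if (w > 0.0) {
                double p = w / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Gain ratio = mutual information / split entropy; 0 when the split entropy is 0. */
    static double gainRatio(double mutualInfo, double[] sDist) {
        double splitEntropy = entropy(sDist);
        return splitEntropy > 0.0 ? mutualInfo / splitEntropy : 0.0;
    }
}
```

Dividing by the split entropy discounts splits that fan out into many branches: the same 0.5 bits of mutual information gives a ratio of 0.5 for two equal branches but 0.25 for four equal branches.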

threshold

public double threshold()
Return the threshold. Can only be called if the split exists and is a real threshold split.
Returns:
The threshold.

copy

public void copy(SplitAttr original)
Copies the given SplitAttr into this SplitAttr.
Parameters:
original - The SplitAttr to be copied.

initialize

public void initialize(InstanceList[] instLists,
                       int attributeNumber)
Initialize attribute data and distribution arrays. The first version bases it on splits that were done before calling us. The second version does the split based on a given categorizer and computes its worth based on the resulting instance lists.
Parameters:
instLists - The InstanceList to use in initialization.
attributeNumber - The number of the attribute.

score

public double score()
The criterion calculation depends on the score criterion. For gainRatio it's (surprise) gain ratio. For mutualInfo and normalizedMutualInfo it's mutualInfo.
Overrides:
score in class SplitScore
Returns:
The score for this split.

score

public double score(double[][] sAndLDist,
                    double[] sDist,
                    double[] lDist,
                    double entropy,
                    double totalWeight)
Computes the score and updates the cache; intended for cases where scores are computed many times for the same number of instances and entropy. This would happen, for instance, when determining the best threshold for a split.
Overrides:
score in class SplitScore
Parameters:
sAndLDist - The split and label distribution.
sDist - The split distribution.
lDist - The label distribution.
entropy - The entropy value.
totalWeight - The total weight of instances.
Returns:
The score for this split.

score

public double score(double[][] sAndLDist,
                    double[] sDist,
                    double[] lDist,
                    double entropy)
Computes the score and updates the cache; intended for cases where scores are computed many times for the same number of instances and entropy. This would happen, for instance, when determining the best threshold for a split.
Parameters:
sAndLDist - The split and label distribution.
sDist - The split distribution.
lDist - The label distribution.
entropy - The entropy value.
Returns:
The score for this split.

score

public double score(double[][] sAndLDist,
                    double[] sDist,
                    double[] lDist)
Computes the score and updates the cache; intended for cases where scores are computed many times for the same number of instances and entropy. This would happen, for instance, when determining the best threshold for a split.
Parameters:
sAndLDist - The split and label distribution.
sDist - The split distribution.
lDist - The label distribution.
Returns:
The score for this split.