shared
Class Entropy

java.lang.Object
  |
  +--shared.Entropy

public class Entropy
extends java.lang.Object

This class handles all of the Entropy based calculations. All logs are base 2 (other bases just scale the entropy). The reason for using log_bin is simply that the examples in Quinlan's C4.5 book use it, and those examples were used for testing. It's also a measure of the number of "bits" in information theory, so it's appealing in that sense too. The computation is based on:
1. "Boolean Feature Discovery in Empirical Learning" / Pagallo and Haussler.
2. "A First Course in Probability, 2nd Edition" / Ross, pages 354-359.
3. "C4.5: Programs for Machine Learning" / Ross Quinlan, pages 18-24.


Field Summary
static double M_LOG2E
          Constant for binary log calculations.
 
Constructor Summary
Entropy()
           
 
Method Summary
static double auto_lbound_min_split(double totalWeight)
          Automatically determine a good lower bound for minSplit, based on the total weight of an instance list at the start of training.
static double[] build_nominal_attr_split_dist(InstanceList[] currentLevel, int attrNum)
          Builds the distribution arrays necessary for calculating conditional entropy for nominal attributes.
static double[] build_nominal_attr_split_dist(InstanceList[] currentLevel, int attrNum, double unaccountedWeight)
          Builds the distribution arrays necessary for calculating conditional entropy for nominal attributes.
static RealAndLabelColumn build_real_and_label_column(InstanceList instList, int attrNum)
          Builds a column of real values and their associated label values for the given attribute.
static RealAndLabelColumn[] build_real_and_label_columns(InstanceList instList, int attrNum)
          Builds columns of real values and their associated label values.
static double[][] build_split_and_label_dist(InstanceList[] currentLevel, int attrNum)
          Build the splitAndLabelDist and splitDist arrays needed for calculating conditional entropy.
static double[][] build_split_and_label_dist(InstanceList[] currentLevel, int attrNum, double unaccountedWeight)
          Build the splitAndLabelDist and splitDist arrays needed for calculating conditional entropy.
static double cond_entropy(double[][] splitAndLabelDist, double[] splitDist, double totalWeight)
          Computes conditional entropy of the label given attribute X.
static double cond_entropy(InstanceList instList, int attrNumX)
          Computes conditional entropy of the label given attribute X.
static double entropy(double[] labelCount)
          Compute the entropy H(Y) for label Y.
static double entropy(double[] labelCount, double totalInstanceWeight)
          Compute the entropy H(Y) for label Y.
static double entropy(InstanceList instList)
          Compute the entropy H(Y) for label Y.
static void fill_scores(RealAndLabelColumn realAndLabelColumn, SplitScore split, double minSplit, double theEntropy, java.util.Vector outScores, IntRef numDistinctValues, double[][] splitAndLabelDist, double[] splitDist)
          Fills the Vector of scores with the scores for all the thresholds.
static double find_best_score(double totalKnownWeight, java.util.Vector scores, double minSplit, IntRef bestSplitIndex)
          Search a score array to find the best score/index.
static void find_best_threshold(RealAndLabelColumn realAndLabelColumn, double minSplit, SplitScore split, DoubleRef bestThreshold, IntRef bestSplitIndex, IntRef numDistinctValues, int smoothInst, double smoothFactor)
          Compute the best threshold for RealAndLabelColumn(s).
static double[][] get_score_array(RealAndLabelColumn realAndLabelColumn, SplitScore split, double minSplit, java.util.Vector outScores, IntRef numDistinctValues, int smoothInst, double smoothFactor)
          Calculates the distribution array for the given sorted RealAndLabelColumn.
static double j_measure(double[][] splitAndLabelDist, double[] splitDist, double[] labelCounts, int x, double totalWeight)
          Compute the J-measure.
static double j_measure(InstanceList instList, int attrNumX, int x)
          Compute the J-measure.
static double log_bin(double num)
          Returns the log base two of the supplied number.
static double min_split(double totalWeight, int numTotalCategories, double lowerBoundMinSplit, double upperBoundMinSplit, double minSplitPercent)
          Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent.
static double min_split(InstanceList instList, double lowerBoundMinSplit, double upperBoundMinSplit, double minSplitPercent)
          Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent.
static double min_split(InstanceList instList, double lowerBoundMinSplit, double upperBoundMinSplit, double minSplitPercent, boolean ignoreNumCat)
          Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent.
static double mutual_info(double ent, double[][] splitAndLabelDist, double[] splitDist, double totalWeight)
          Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X).
static double mutual_info(InstanceList instList, double ent, int attrNumX)
          Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X).
static double mutual_info(InstanceList instList, int attrNumX)
          Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X).
static void smooth_scores(java.util.Vector scores, int smoothInst, double smoothFactor)
          Normalizes the scores towards the instance at the supplied index.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

M_LOG2E

public static double M_LOG2E
Constant for binary log calculations.
Constructor Detail

Entropy

public Entropy()
Method Detail

auto_lbound_min_split

public static double auto_lbound_min_split(double totalWeight)
Automatically determine a good lower bound for minSplit, based on the total weight of an instance list at the start of training.
Parameters:
totalWeight - The total weight of instances in this InstanceList partition.
Returns:
A lower bound for minSplit.

log_bin

public static double log_bin(double num)
Returns the log base two of the supplied number.
Parameters:
num - The number for which a log is requested.
Returns:
The log base two of the supplied number.

find_best_threshold

public static void find_best_threshold(RealAndLabelColumn realAndLabelColumn,
                                       double minSplit,
                                       SplitScore split,
                                       DoubleRef bestThreshold,
                                       IntRef bestSplitIndex,
                                       IntRef numDistinctValues,
                                       int smoothInst,
                                       double smoothFactor)
Compute the best threshold for RealAndLabelColumn(s). Initially, all instances are at the right hand side of splitDist and splitAndLabelDist. To find the best threshold, we shift the instances from right to left and calculate their conditional entropy. Then we pick up the one with minimum conditional entropy and return bestThreshold and minCondEntropy as results. When two entropies are equal, we prefer the one that splits instances equally into two bins in the hope of making the tree shallower. Tests show minor effect but it does help pima for different leaf counts.
Parameters:
realAndLabelColumn - The column of real values and their associated labels
minSplit - The split value for which conditional entropy is minimal.
split - The scores for all available splits.
bestThreshold - The real value that is the best threshold over the supplied real values.
bestSplitIndex - Index of the best split in the supplied RealAndLabelColumn.
numDistinctValues - The number of non-equal values.
smoothInst - The index of the instance to be smoothed toward.
smoothFactor - The amount of normalization done towards the specified instance index.

get_score_array

public static double[][] get_score_array(RealAndLabelColumn realAndLabelColumn,
                                         SplitScore split,
                                         double minSplit,
                                         java.util.Vector outScores,
                                         IntRef numDistinctValues,
                                         int smoothInst,
                                         double smoothFactor)
Calculates the distribution array for the given sorted RealAndLabelColumn. This function then progresses through the input array, shifting counts from the right to the left (in the distributions). When the Real values in the input array change, the potential score of a split at that value is calculated, and stored in the output array, outScores. At the end, outScores contains the discrete thresholds, the scores associated with a split at each of these thresholds, and the indices into the original array.
Parameters:
realAndLabelColumn - The supplied column of real values and their associated label values.
split - The scores for all available splits.
minSplit - The split value for which conditional entropy is minimal.
outScores - The scores after smoothing.
numDistinctValues - The number of non-equal values.
smoothInst - The index of the instance to be smoothed towards.
smoothFactor - The amount of normalization done towards the specified instance index.
Returns:
The split and label distribution.

smooth_scores

public static void smooth_scores(java.util.Vector scores,
                                 int smoothInst,
                                 double smoothFactor)
Normalizes the scores towards the instance at the supplied index.
Parameters:
scores - The set of scores to be smoothed.
smoothInst - The index of the instance to be smoothed towards.
smoothFactor - The amount of normalization done towards the specified instance index.

fill_scores

public static void fill_scores(RealAndLabelColumn realAndLabelColumn,
                               SplitScore split,
                               double minSplit,
                               double theEntropy,
                               java.util.Vector outScores,
                               IntRef numDistinctValues,
                               double[][] splitAndLabelDist,
                               double[] splitDist)
Fills the Vector of scores with the scores for all the thresholds. The number of distinct values is only true for the range of numbers if the relevant range that does not include minSplit on the right and left.
Parameters:
realAndLabelColumn - The column of real values and their associated labels over which thresholds are created.
split - The SplitScore used for scoring this threshold split.
minSplit - The minimum value for splits.
theEntropy - The Entropy value
outScores - The Vector of scores to be filled.
numDistinctValues - The number of distinct real values for this attribute.
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.

entropy

public static double entropy(double[] labelCount)
Compute the entropy H(Y) for label Y. Giving us an InstanceList forces counters. If you don't give us the total instances, we count (slightly less efficient). Entropy without the sum acts a bit differently by allowing nodes with 0 instances and returning entropy -1. The reason is that the caller shouldn't be required to compute the sum just for this.
Parameters:
labelCount - The count of each label found in the data.
Returns:
The entropy for the supplied label counts.

entropy

public static double entropy(double[] labelCount,
                             double totalInstanceWeight)
Compute the entropy H(Y) for label Y. Giving us an InstanceList forces counters. If you don't give us the total instances, we count (slightly less efficient). Entropy without the sum acts a bit differently by allowing nodes with 0 instances and returning entropy -1. The reason is that the caller shouldn't be required to compute the sum just for this.
Parameters:
labelCount - The count of each label found in the data.
totalInstanceWeight - The total weight for all of the instances.
Returns:
The entropy for the supplied label counts.

entropy

public static double entropy(InstanceList instList)
Compute the entropy H(Y) for label Y. Giving us an InstanceList forces counters. If you don't give us the total instances, we count (slightly less efficient). Entropy without the sum acts a bit differently by allowing nodes with 0 instances and returning entropy -1. The reason is that the caller shouldn't be required to compute the sum just for this.
Parameters:
instList - The supplied instances for whihc entropy will be calculated.
Returns:
The entropy value.

find_best_score

public static double find_best_score(double totalKnownWeight,
                                     java.util.Vector scores,
                                     double minSplit,
                                     IntRef bestSplitIndex)
Search a score array to find the best score/index.
Parameters:
totalKnownWeight - Total weight of all Instances for which a value is known.
scores - The scores of the available Splits.
minSplit - The minimum value for a split.
bestSplitIndex - The index of the best split.
Returns:
The score of the best split.

build_split_and_label_dist

public static double[][] build_split_and_label_dist(InstanceList[] currentLevel,
                                                    int attrNum)
Build the splitAndLabelDist and splitDist arrays needed for calculating conditional entropy. All of the splitAndLabelDist arrays of the Instance Lists are concatenated. The unaccounted instances allow the list of nodes to be partial, i.e., not to contain all instances. The split will be created so that the unaccounted instances are in an extra split with the same label, so that the entropy will be decreased correctly as if they were in a pure node.
Parameters:
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute over which a split and label distribution is to be built.
Returns:
Distributions over each split and label pair.

build_split_and_label_dist

public static double[][] build_split_and_label_dist(InstanceList[] currentLevel,
                                                    int attrNum,
                                                    double unaccountedWeight)
Build the splitAndLabelDist and splitDist arrays needed for calculating conditional entropy. All of the splitAndLabelDist arrays of the Instance Lists are concatenated. The unaccounted instances allow the list of nodes to be partial, i.e., not to contain all instances. The split will be created so that the unaccounted instances are in an extra split with the same label, so that the entropy will be decreased correctly as if they were in a pure node.
Parameters:
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute over which a split and label distribution is to be built.
unaccountedWeight - The weight for instances that are not accounted for in this partition.
Returns:
Distributions over each split and label pair.

min_split

public static double min_split(double totalWeight,
                               int numTotalCategories,
                               double lowerBoundMinSplit,
                               double upperBoundMinSplit,
                               double minSplitPercent)
Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent. This function is called by inducers.
Parameters:
upperBoundMinSplit - Upper bound for the minimum split value.
totalWeight - The total weight of all instances in the list of instances for which a split is requested.
numTotalCategories - Number of possible values an instance may be categorized as.
lowerBoundMinSplit - Lower bound for the minimum split value.
minSplitPercent - The percentage of total weight per category that represents the minimum value for a split.
Returns:
The minSplit which is used in find_best_threshold().

min_split

public static double min_split(InstanceList instList,
                               double lowerBoundMinSplit,
                               double upperBoundMinSplit,
                               double minSplitPercent,
                               boolean ignoreNumCat)
Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent. This function is called by inducers.
Parameters:
instList - The list of instances over which a split is requested.
upperBoundMinSplit - Upper bound for the minimum split value.
lowerBoundMinSplit - Lower bound for the minimum split value.
minSplitPercent - The percentage of total weight per category that represents the minimum value for a split.
ignoreNumCat - Indicator that the number of values that an instance may be classified as should be ignored for this split computation.
Returns:
The minSplit which is used in find_best_threshold().

min_split

public static double min_split(InstanceList instList,
                               double lowerBoundMinSplit,
                               double upperBoundMinSplit,
                               double minSplitPercent)
Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent. This function is called by inducers.
Parameters:
instList - The list of instances over which a split is requested.
upperBoundMinSplit - Upper bound for the minimum split value.
lowerBoundMinSplit - Lower bound for the minimum split value.
minSplitPercent - The percentage of total weight per category that represents the minimum value for a split.
Returns:
The minSplit which is used in find_best_threshold().

cond_entropy

public static double cond_entropy(double[][] splitAndLabelDist,
                                  double[] splitDist,
                                  double totalWeight)
Computes conditional entropy of the label given attribute X. From Ross, Conditional entropy is defined as:
H(Y|X) = sum_x H(Y|X=x)*P(X=x).
= sum_x (-sum_y p(Y=y|X=x)log p(Y=y|X=x)) * P(X=x)
now derive Pagallo & Haussler's formula
= -sum_{x,y} p(Y=y, X=x) log p(Y=y|X=x)
Here we estimate p(Y=y, X=x) by counting, but if we have priors on the probabilities of the labels, then
p(x,y) = p(x|y)*p(y) = count(x,y)/s(y)* prior(y)
and p(x) = sum_y prior(y) count(x,y)/s(y).
By counting we get the following:
-sum_{x,y} num(Y=y,X=x)/num-rec * log num(Y=y,X=x)/num(X=x)
Parameters:
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.
totalWeight - The total weight distributed.
Returns:
The conditional entropy.

cond_entropy

public static double cond_entropy(InstanceList instList,
                                  int attrNumX)
Computes conditional entropy of the label given attribute X. From Ross, Conditional entropy is defined as:
H(Y|X) = sum_x H(Y|X=x)*P(X=x).
= sum_x (-sum_y p(Y=y|X=x)log p(Y=y|X=x)) * P(X=x)
now derive Pagallo & Haussler's formula
= -sum_{x,y} p(Y=y, X=x) log p(Y=y|X=x)
Here we estimate p(Y=y, X=x) by counting, but if we have priors on the probabilities of the labels, then
p(x,y) = p(x|y)*p(y) = count(x,y)/s(y)* prior(y)
and p(x) = sum_y prior(y) count(x,y)/s(y).
By counting we get the following:
-sum_{x,y} num(Y=y,X=x)/num-rec * log num(Y=y,X=x)/num(X=x)
Parameters:
instList - The instance list over which conditional entropy is calculated.
attrNumX - The number of the attribute for which conditional entropy is requested.
Returns:
The conditional entropy.

mutual_info

public static double mutual_info(double ent,
                                 double[][] splitAndLabelDist,
                                 double[] splitDist,
                                 double totalWeight)
Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X). Some researchers like Quinlan call this "gain." This is the amount of information gained about the category value of an instance after we test the variable X.
Parameters:
ent - Entropy value.
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.
totalWeight - Total weight of the Instances trained on.
Returns:
The mutual information value.

mutual_info

public static double mutual_info(InstanceList instList,
                                 int attrNumX)
Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X). Some researchers like Quinlan call this "gain." This is the amount of information gained about the category value of an instance after we test the variable X.
Parameters:
instList - The instance list over which mutual information is calculated.
attrNumX - The number of the attribute for which mutual information is requested.
Returns:
The mutual information value.

mutual_info

public static double mutual_info(InstanceList instList,
                                 double ent,
                                 int attrNumX)
Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X). Some researchers like Quinlan call this "gain." This is the amount of information gained about the category value of an instance after we test the variable X.
Parameters:
instList - The instance list over which mutual information is calculated.
ent - Entropy value.
attrNumX - The number of the attribute for which mutual information is requested.
Returns:
The mutual information value.

build_nominal_attr_split_dist

public static double[] build_nominal_attr_split_dist(InstanceList[] currentLevel,
                                                     int attrNum)
Builds the distribution arrays necessary for calculating conditional entropy for nominal attributes. All of the splitAndLabelDist arrays of the Instance Lists are concatenated. The unaccounted instances allow the list of nodes to be partial, i.e., not to contain all instances. The split will be created so that the unaccounted instances are in an extra split with the same label, so that the entropy will be decreased correctly as if they were in a pure node.
Parameters:
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute for which mutual information is requested.
Returns:
The distribution over splits.

build_nominal_attr_split_dist

public static double[] build_nominal_attr_split_dist(InstanceList[] currentLevel,
                                                     int attrNum,
                                                     double unaccountedWeight)
Builds the distribution arrays necessary for calculating conditional entropy for nominal attributes. All of the splitAndLabelDist arrays of the Instance Lists are concatenated. The unaccounted instances allow the list of nodes to be partial, i.e., not to contain all instances. The split will be created so that the unaccounted instances are in an extra split with the same label, so that the entropy will be decreased correctly as if they were in a pure node.
Parameters:
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute for which mutual information is requested.
unaccountedWeight - Weight that is not accounted for in the list of instances.
Returns:
The distribution over splits.

j_measure

public static double j_measure(double[][] splitAndLabelDist,
                               double[] splitDist,
                               double[] labelCounts,
                               int x,
                               double totalWeight)
Compute the J-measure. See papers by Goodman and Smyth, such as Data Engineering, v.4, no.4, pp.301-316, 1992. The J-measure summed over all values of x gives info-gain. The J-measure is
sum_y p(x,y)log(p(x,y)/(p(x)p(y)))
1/n * sum_y n(x,y)log(n(x,y)*n/(n(x)n(y)))
Used in t_entropy.java.
Parameters:
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.
labelCounts - Counts of each label found in the data.
x - The x value for the j-measure equation.
totalWeight - Total weight of all data.
Returns:
The j-measure value.

j_measure

public static double j_measure(InstanceList instList,
                               int attrNumX,
                               int x)
Compute the J-measure. See papers by Goodman and Smyth, such as Data Engineering, v.4, no.4, pp.301-316, 1992. The J-measure summed over all values of x gives info-gain. The J-measure is
sum_y p(x,y)log(p(x,y)/(p(x)p(y))) 1/n * sum_y n(x,y)log(n(x,y)*n/(n(x)n(y)))
Used in t_entropy.java.
Parameters:
instList - The list of Instances over which a j measure is to be calculated.
attrNumX - The number of attributes in the Schema of the Instances supplied.
x - The x value for the j-measure equation.
Returns:
The j-measure value.

build_real_and_label_columns

public static RealAndLabelColumn[] build_real_and_label_columns(InstanceList instList,
                                                                int attrNum)
Builds columns of real values and their associated label values. Invokes InstanceList's transpose function to provide a single column for the passed attribute number, sorts it, and returns the columns to the caller. The second calling argument, if set to an attribute index, results in a single column being transposed and sorted. When set to UNDEFINED_INT, all columns are so treated.
Parameters:
instList - The instance list containing the instance values for the attribute.
attrNum - The number of the attribute for which the real and label column is requested.
Returns:
The columns of real values and their associated labels, organized by attribute.

build_real_and_label_column

public static RealAndLabelColumn build_real_and_label_column(InstanceList instList,
                                                             int attrNum)
Builds a column of real values and their associated label values for the given attribute. Invokes InstanceList's transpose function to provide a single column for the passed attribute number, sorts it, and returns the columns to the caller. The second calling argument, if set to an attribute index, results in a single column being transposed and sorted. When set to UNDEFINED_INT, all columns are so treated.
Parameters:
instList - The instance list containing the instance values for the attribute.
attrNum - The number of the attribute for which the real and label column is requested.
Returns:
The column of real values and their associated labels.