java.lang.Object
  |
  +--shared.Entropy
This class handles all of the entropy-based calculations. All logs
are base 2 (other bases just scale the entropy). The reason for using
log_bin is simply that the examples in Quinlan's C4.5 book use it, and
those examples were used for testing. Base-2 logs also measure
information in "bits", as in information theory, so they are appealing
in that sense too. The computation is based on:
1. "Boolean Feature Discovery in Empirical Learning" / Pagallo and
Haussler.
2. "A First Course in Probability, 2nd Edition" / Ross, pages 354-359.
3. "C4.5: Programs for Machine Learning" / Ross Quinlan, pages 18-24.
| Field Summary | |
static double |
M_LOG2E
Constant for binary log calculations. |
| Constructor Summary | |
Entropy()
|
|
| Method Summary | |
static double |
auto_lbound_min_split(double totalWeight)
Automatically determine a good lower bound for minSplit, based on the total weight of an instance list at the start of training. |
static double[] |
build_nominal_attr_split_dist(InstanceList[] currentLevel,
int attrNum)
Builds the distribution arrays necessary for calculating conditional entropy for nominal attributes. |
static double[] |
build_nominal_attr_split_dist(InstanceList[] currentLevel,
int attrNum,
double unaccountedWeight)
Builds the distribution arrays necessary for calculating conditional entropy for nominal attributes. |
static RealAndLabelColumn |
build_real_and_label_column(InstanceList instList,
int attrNum)
Builds a column of real values and their associated label values for the given attribute. |
static RealAndLabelColumn[] |
build_real_and_label_columns(InstanceList instList,
int attrNum)
Builds columns of real values and their associated label values. |
static double[][] |
build_split_and_label_dist(InstanceList[] currentLevel,
int attrNum)
Build the splitAndLabelDist and splitDist arrays needed for calculating conditional entropy. |
static double[][] |
build_split_and_label_dist(InstanceList[] currentLevel,
int attrNum,
double unaccountedWeight)
Build the splitAndLabelDist and splitDist arrays needed for calculating conditional entropy. |
static double |
cond_entropy(double[][] splitAndLabelDist,
double[] splitDist,
double totalWeight)
Computes conditional entropy of the label given attribute X. |
static double |
cond_entropy(InstanceList instList,
int attrNumX)
Computes conditional entropy of the label given attribute X. |
static double |
entropy(double[] labelCount)
Compute the entropy H(Y) for label Y. |
static double |
entropy(double[] labelCount,
double totalInstanceWeight)
Compute the entropy H(Y) for label Y. |
static double |
entropy(InstanceList instList)
Compute the entropy H(Y) for label Y. |
static void |
fill_scores(RealAndLabelColumn realAndLabelColumn,
SplitScore split,
double minSplit,
double theEntropy,
java.util.Vector outScores,
IntRef numDistinctValues,
double[][] splitAndLabelDist,
double[] splitDist)
Fills the Vector of scores with the scores for all the thresholds. |
static double |
find_best_score(double totalKnownWeight,
java.util.Vector scores,
double minSplit,
IntRef bestSplitIndex)
Search a score array to find the best score/index. |
static void |
find_best_threshold(RealAndLabelColumn realAndLabelColumn,
double minSplit,
SplitScore split,
DoubleRef bestThreshold,
IntRef bestSplitIndex,
IntRef numDistinctValues,
int smoothInst,
double smoothFactor)
Compute the best threshold for RealAndLabelColumn(s). |
static double[][] |
get_score_array(RealAndLabelColumn realAndLabelColumn,
SplitScore split,
double minSplit,
java.util.Vector outScores,
IntRef numDistinctValues,
int smoothInst,
double smoothFactor)
Calculates the distribution array for the given sorted RealAndLabelColumn. |
static double |
j_measure(double[][] splitAndLabelDist,
double[] splitDist,
double[] labelCounts,
int x,
double totalWeight)
Compute the J-measure. |
static double |
j_measure(InstanceList instList,
int attrNumX,
int x)
Compute the J-measure. |
static double |
log_bin(double num)
Returns the log base two of the supplied number. |
static double |
min_split(double totalWeight,
int numTotalCategories,
double lowerBoundMinSplit,
double upperBoundMinSplit,
double minSplitPercent)
Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent. |
static double |
min_split(InstanceList instList,
double lowerBoundMinSplit,
double upperBoundMinSplit,
double minSplitPercent)
Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent. |
static double |
min_split(InstanceList instList,
double lowerBoundMinSplit,
double upperBoundMinSplit,
double minSplitPercent,
boolean ignoreNumCat)
Returns the minSplit which is used in find_best_threshold(), given lowerBoundMinSplit, upperBoundMinSplit, and minSplitPercent. |
static double |
mutual_info(double ent,
double[][] splitAndLabelDist,
double[] splitDist,
double totalWeight)
Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X). |
static double |
mutual_info(InstanceList instList,
double ent,
int attrNumX)
Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X). |
static double |
mutual_info(InstanceList instList,
int attrNumX)
Compute the mutual information which is defined as I(Y;X) = H(Y) - H(Y|X). |
static void |
smooth_scores(java.util.Vector scores,
int smoothInst,
double smoothFactor)
Normalizes the scores towards the instance at the supplied index. |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
public static double M_LOG2E
| Constructor Detail |
public Entropy()
| Method Detail |
public static double auto_lbound_min_split(double totalWeight)
totalWeight - The total weight of instances in this InstanceList partition.
public static double log_bin(double num)
num - The number for which a log is requested.
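A base-2 logarithm can be obtained from the natural logarithm by the change-of-base identity log2(x) = ln(x) * log2(e). The sketch below illustrates that identity only. It assumes that M_LOG2E holds log2(e), as the C constant of the same name does; the class name LogBinSketch and the method name logBin are hypothetical and are not part of shared.Entropy.

```java
// Hypothetical sketch; not the shared.Entropy source.
final class LogBinSketch {
    // Assumption: log2(e), the value conventionally named M_LOG2E in C's <math.h>.
    static final double M_LOG2E = 1.4426950408889634;

    // Binary logarithm via the identity log2(x) = ln(x) * log2(e).
    static double logBin(double num) {
        return Math.log(num) * M_LOG2E;
    }

    public static void main(String[] args) {
        // Prints a value very close to 3, since 2^3 = 8.
        System.out.println(logBin(8.0));
    }
}
```

Multiplying by a precomputed constant avoids a second call to Math.log for every logarithm, which may be why such a constant is kept as a field.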
public static void find_best_threshold(RealAndLabelColumn realAndLabelColumn,
double minSplit,
SplitScore split,
DoubleRef bestThreshold,
IntRef bestSplitIndex,
IntRef numDistinctValues,
int smoothInst,
double smoothFactor)
realAndLabelColumn - The column of real values and their associated labels.
minSplit - The split value for which conditional entropy is minimal.
split - The scores for all available splits.
bestThreshold - The real value that is the best threshold over the supplied real values.
bestSplitIndex - Index of the best split in the supplied RealAndLabelColumn.
numDistinctValues - The number of non-equal values.
smoothInst - The index of the instance to be smoothed toward.
smoothFactor - The amount of normalization done towards the specified instance index.
public static double[][] get_score_array(RealAndLabelColumn realAndLabelColumn,
SplitScore split,
double minSplit,
java.util.Vector outScores,
IntRef numDistinctValues,
int smoothInst,
double smoothFactor)
realAndLabelColumn - The supplied column of real values and their associated label values.
split - The scores for all available splits.
minSplit - The split value for which conditional entropy is minimal.
outScores - The scores after smoothing.
numDistinctValues - The number of non-equal values.
smoothInst - The index of the instance to be smoothed towards.
smoothFactor - The amount of normalization done towards the specified instance index.
public static void smooth_scores(java.util.Vector scores,
int smoothInst,
double smoothFactor)
scores - The set of scores to be smoothed.
smoothInst - The index of the instance to be smoothed towards.
smoothFactor - The amount of normalization done towards the specified instance index.
public static void fill_scores(RealAndLabelColumn realAndLabelColumn,
SplitScore split,
double minSplit,
double theEntropy,
java.util.Vector outScores,
IntRef numDistinctValues,
double[][] splitAndLabelDist,
double[] splitDist)
realAndLabelColumn - The column of real values and their associated labels over which thresholds are created.
split - The SplitScore used for scoring this threshold split.
minSplit - The minimum value for splits.
theEntropy - The Entropy value.
outScores - The Vector of scores to be filled.
numDistinctValues - The number of distinct real values for this attribute.
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.
public static double entropy(double[] labelCount)
labelCount - The count of each label found in the data.
public static double entropy(double[] labelCount,
double totalInstanceWeight)
labelCount - The count of each label found in the data.
totalInstanceWeight - The total weight for all of the instances.
public static double entropy(InstanceList instList)
instList - The supplied instances for which entropy will be calculated.
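The entropy H(Y) of a label with weighted counts can be written as H(Y) = -sum over y of p(y) * log2 p(y), where p(y) is the count for label y divided by the total weight. The following is a minimal sketch of that formula, not the shared.Entropy implementation. The class name EntropySketch is hypothetical, and treating zero counts as contributing nothing is an assumption that follows the usual convention 0 * log 0 = 0.

```java
// Hypothetical sketch; not the shared.Entropy source.
final class EntropySketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // H(Y) = -sum_y p(y) * log2 p(y), with p(y) = labelCount[y] / totalInstanceWeight.
    static double entropy(double[] labelCount, double totalInstanceWeight) {
        if (totalInstanceWeight <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        double h = 0.0;
        for (double count : labelCount) {
            if (count > 0.0) { // zero counts contribute nothing
                double p = count / totalInstanceWeight;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // Convenience overload that derives the total weight from the counts.
    static double entropy(double[] labelCount) {
        double total = 0.0;
        for (double count : labelCount) {
            total += count;
        }
        return entropy(labelCount, total);
    }

    public static void main(String[] args) {
        // 9 instances of one label and 5 of the other.
        System.out.println(entropy(new double[] {9.0, 5.0}));
    }
}
```

With counts of 9 and 5 the result is roughly 0.940 bits, and an even two-way split gives 1 bit.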
public static double find_best_score(double totalKnownWeight,
java.util.Vector scores,
double minSplit,
IntRef bestSplitIndex)
totalKnownWeight - Total weight of all Instances for which a value is known.
scores - The scores of the available Splits.
minSplit - The minimum value for a split.
bestSplitIndex - The index of the best split.
public static double[][] build_split_and_label_dist(InstanceList[] currentLevel,
int attrNum)
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute over which a split and label distribution is to be built.
public static double[][] build_split_and_label_dist(InstanceList[] currentLevel,
int attrNum,
double unaccountedWeight)
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute over which a split and label distribution is to be built.
unaccountedWeight - The weight for instances that are not accounted for in this partition.
public static double min_split(double totalWeight,
int numTotalCategories,
double lowerBoundMinSplit,
double upperBoundMinSplit,
double minSplitPercent)
upperBoundMinSplit - Upper bound for the minimum split value.
totalWeight - The total weight of all instances in the list of instances for which a split is requested.
numTotalCategories - Number of possible values an instance may be categorized as.
lowerBoundMinSplit - Lower bound for the minimum split value.
minSplitPercent - The percentage of total weight per category that represents the minimum value for a split.
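The parameter descriptions suggest that the minimum split is a fraction of the total weight per category, kept within the lower and upper bounds. The sketch below is one plausible reading of those descriptions and should be taken as an assumption, not as the shared.Entropy implementation: it computes minSplitPercent * totalWeight / numTotalCategories and clamps the result. The class name MinSplitSketch is hypothetical.

```java
// Hypothetical sketch; the real min_split may compute its value differently.
final class MinSplitSketch {
    // Assumed rule: a fraction of the per-category weight, clamped to the given bounds.
    static double minSplit(double totalWeight,
                           int numTotalCategories,
                           double lowerBoundMinSplit,
                           double upperBoundMinSplit,
                           double minSplitPercent) {
        if (numTotalCategories <= 0) {
            throw new IllegalArgumentException("numTotalCategories must be positive");
        }
        if (lowerBoundMinSplit > upperBoundMinSplit) {
            throw new IllegalArgumentException("lower bound exceeds upper bound");
        }
        double raw = minSplitPercent * totalWeight / numTotalCategories;
        return Math.max(lowerBoundMinSplit, Math.min(upperBoundMinSplit, raw));
    }

    public static void main(String[] args) {
        // A small training set: the raw value falls below the lower bound of 2.
        System.out.println(minSplit(14.0, 2, 2.0, 25.0, 0.1));
    }
}
```

Under this reading, small instance lists are governed by the lower bound and very large ones by the upper bound, with the percentage only mattering in between.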
public static double min_split(InstanceList instList,
double lowerBoundMinSplit,
double upperBoundMinSplit,
double minSplitPercent,
boolean ignoreNumCat)
instList - The list of instances over which a split is requested.
upperBoundMinSplit - Upper bound for the minimum split value.
lowerBoundMinSplit - Lower bound for the minimum split value.
minSplitPercent - The percentage of total weight per category that represents the minimum value for a split.
ignoreNumCat - Indicator that the number of values that an instance may be classified as should be ignored for this split computation.
public static double min_split(InstanceList instList,
double lowerBoundMinSplit,
double upperBoundMinSplit,
double minSplitPercent)
instList - The list of instances over which a split is requested.
upperBoundMinSplit - Upper bound for the minimum split value.
lowerBoundMinSplit - Lower bound for the minimum split value.
minSplitPercent - The percentage of total weight per category that represents the minimum value for a split.
public static double cond_entropy(double[][] splitAndLabelDist,
double[] splitDist,
double totalWeight)
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.
totalWeight - The total weight distributed.
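Conditional entropy can be computed directly from the joint and marginal weights as H(Y|X) = -sum over x and y of p(x,y) * log2 p(y|x), with p(x,y) = splitAndLabelDist / totalWeight and p(y|x) = splitAndLabelDist / splitDist. The sketch below illustrates this. It assumes the array is indexed as [split][label], which the documentation does not state, and the class name CondEntropySketch is hypothetical.

```java
// Hypothetical sketch; not the shared.Entropy source.
final class CondEntropySketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // H(Y|X) = -sum_x sum_y p(x,y) * log2 p(y|x).
    // Assumption: splitAndLabelDist[x][y] is the weight with split value x and label y.
    static double condEntropy(double[][] splitAndLabelDist,
                              double[] splitDist,
                              double totalWeight) {
        double h = 0.0;
        for (int x = 0; x < splitDist.length; x++) {
            if (splitDist[x] <= 0.0) {
                continue; // an empty split contributes nothing
            }
            for (int y = 0; y < splitAndLabelDist[x].length; y++) {
                double w = splitAndLabelDist[x][y];
                if (w > 0.0) {
                    h -= (w / totalWeight) * log2(w / splitDist[x]);
                }
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Three split values over two labels, 14 instances in total.
        double[][] joint = { {2.0, 3.0}, {4.0, 0.0}, {3.0, 2.0} };
        double[] split = {5.0, 4.0, 5.0};
        System.out.println(condEntropy(joint, split, 14.0));
    }
}
```

For the distribution in main the result is roughly 0.694 bits; the pure middle split contributes zero.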
public static double cond_entropy(InstanceList instList,
int attrNumX)
instList - The instance list over which conditional entropy is calculated.
attrNumX - The number of the attribute for which conditional entropy is requested.
public static double mutual_info(double ent,
double[][] splitAndLabelDist,
double[] splitDist,
double totalWeight)
ent - Entropy value.
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.
totalWeight - Total weight of the Instances trained on.
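Since mutual information is defined as I(Y;X) = H(Y) - H(Y|X), this overload can be read as taking a precomputed H(Y) in ent and subtracting the conditional entropy. The following is a minimal sketch under that reading. The class name MutualInfoSketch, the helper names, and the [split][label] array layout are assumptions made for illustration.

```java
// Hypothetical sketch; not the shared.Entropy source.
final class MutualInfoSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // H(Y) from label counts.
    static double entropy(double[] labelCount, double totalWeight) {
        double h = 0.0;
        for (double c : labelCount) {
            if (c > 0.0) {
                double p = c / totalWeight;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // H(Y|X); assumes splitAndLabelDist is indexed as [split][label].
    static double condEntropy(double[][] splitAndLabelDist,
                              double[] splitDist,
                              double totalWeight) {
        double h = 0.0;
        for (int x = 0; x < splitDist.length; x++) {
            for (int y = 0; y < splitAndLabelDist[x].length; y++) {
                double w = splitAndLabelDist[x][y];
                if (w > 0.0 && splitDist[x] > 0.0) {
                    h -= (w / totalWeight) * log2(w / splitDist[x]);
                }
            }
        }
        return h;
    }

    // I(Y;X) = H(Y) - H(Y|X), with H(Y) supplied by the caller as ent.
    static double mutualInfo(double ent,
                             double[][] splitAndLabelDist,
                             double[] splitDist,
                             double totalWeight) {
        return ent - condEntropy(splitAndLabelDist, splitDist, totalWeight);
    }

    public static void main(String[] args) {
        double[][] joint = { {2.0, 3.0}, {4.0, 0.0}, {3.0, 2.0} };
        double[] split = {5.0, 4.0, 5.0};
        double ent = entropy(new double[] {9.0, 5.0}, 14.0);
        System.out.println(mutualInfo(ent, joint, split, 14.0));
    }
}
```

Passing the entropy in lets a caller compute H(Y) once and reuse it while scoring many attributes on the same instance list.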
public static double mutual_info(InstanceList instList,
int attrNumX)
instList - The instance list over which mutual information is calculated.
attrNumX - The number of the attribute for which mutual information is requested.
public static double mutual_info(InstanceList instList,
double ent,
int attrNumX)
instList - The instance list over which mutual information is calculated.
ent - Entropy value.
attrNumX - The number of the attribute for which mutual information is requested.
public static double[] build_nominal_attr_split_dist(InstanceList[] currentLevel,
int attrNum)
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute for which mutual information is requested.
public static double[] build_nominal_attr_split_dist(InstanceList[] currentLevel,
int attrNum,
double unaccountedWeight)
currentLevel - The list of instances in the current partition for which a split is being determined.
attrNum - The number of the attribute for which mutual information is requested.
unaccountedWeight - Weight that is not accounted for in the list of instances.
public static double j_measure(double[][] splitAndLabelDist,
double[] splitDist,
double[] labelCounts,
int x,
double totalWeight)
splitAndLabelDist - Distributions over each split and label pair.
splitDist - The distribution over splits.
labelCounts - Counts of each label found in the data.
x - The x value for the j-measure equation.
totalWeight - Total weight of all data.
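The page does not give the J-measure equation. A common definition, due to Smyth and Goodman, is J(Y; X=x) = p(x) * sum over y of p(y|x) * log2(p(y|x) / p(y)), and the sketch below assumes that this is the quantity meant. It is an illustration only: the class name JMeasureSketch and the [split][label] array layout are assumptions, and the real method may differ.

```java
// Hypothetical sketch; assumes the Smyth and Goodman definition of the J-measure.
final class JMeasureSketch {
    static double log2(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    // J(Y; X=x) = p(x) * sum_y p(y|x) * log2(p(y|x) / p(y)).
    // Assumption: splitAndLabelDist[x][y] is the weight with split value x and label y.
    static double jMeasure(double[][] splitAndLabelDist,
                           double[] splitDist,
                           double[] labelCounts,
                           int x,
                           double totalWeight) {
        if (splitDist[x] <= 0.0) {
            return 0.0; // no weight on this split value
        }
        double sum = 0.0;
        for (int y = 0; y < labelCounts.length; y++) {
            double w = splitAndLabelDist[x][y];
            if (w > 0.0 && labelCounts[y] > 0.0) {
                double pyGivenX = w / splitDist[x];
                double py = labelCounts[y] / totalWeight;
                sum += pyGivenX * log2(pyGivenX / py);
            }
        }
        return (splitDist[x] / totalWeight) * sum;
    }

    public static void main(String[] args) {
        double[][] joint = { {2.0, 3.0}, {4.0, 0.0}, {3.0, 2.0} };
        double[] split = {5.0, 4.0, 5.0};
        double[] labels = {9.0, 5.0};
        for (int x = 0; x < split.length; x++) {
            System.out.println(jMeasure(joint, split, labels, x, 14.0));
        }
    }
}
```

Under this definition, summing the J-measure over every value x yields the mutual information I(Y;X), which gives a simple consistency check against mutual_info.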
public static double j_measure(InstanceList instList,
int attrNumX,
int x)
instList - The list of Instances over which a j measure is to be calculated.
attrNumX - The number of attributes in the Schema of the Instances supplied.
x - The x value for the j-measure equation.
public static RealAndLabelColumn[] build_real_and_label_columns(InstanceList instList,
int attrNum)
instList - The instance list containing the instance values for the attribute.
attrNum - The number of the attribute for which the real and label column is requested.
public static RealAndLabelColumn build_real_and_label_column(InstanceList instList,
int attrNum)
instList - The instance list containing the instance values for the attribute.
attrNum - The number of the attribute for which the real and label column is requested.