java.lang.Object
|
+--shared.Globals
|
+--shared.Categorizer
|
+--nb.NaiveBayesCat
This categorizer returns the category (label) that has the greatest relative probability of being correct, assuming independence of the attributes. The relative probability of a label is calculated by multiplying the relative probabilities for each attribute. How the relative probability of a label is calculated for a single attribute depends on whether the attribute is discrete or continuous.

By Bayes' Theorem, P(L=l | X1=x1, X2=x2, ... Xn=xn) = P(X1=x1, X2=x2, ... Xn=xn | L=l)*P(L=l)/P(X), where P(X) is P(X1=x1, ..., Xn=xn). Since P(X) is the same for every label, it can be ignored. The Naive Bayesian approach assumes complete independence of the attributes GIVEN the label, thus P(X1=x1, X2=x2, ... Xn=xn | L=l) = P(X1=x1|L=l)*P(X2=x2|L=l)*...*P(Xn=xn|L=l) and P(X1=x1|L=l) = P(X1=x1 ^ L=l)/P(L=l), where this quantity is approximated from the data. When the computed probabilities for two labels have the same value, the tie is broken in favor of the most prevalent label.

For discrete (nominal) attributes, the relative probability is the fraction of the label's training instances that share the attribute value. For example, if the instance being categorized has the first attribute = 1, and in the training set label A occurred 20 times, 10 of which had value 1 for the first attribute, then the relative probability is 10/20 = 0.5.

For continuous (real) attributes, the relative probability is based on the Normal Distribution of the values of the attribute on training instances with the label. The actual calculation is done with the Normal Density; constants, which do not affect the relative probability between labels, are ignored. For example, say 3 training instances have label 1 and these instances have the following values for a continuous attribute: 35, 50, 65. The program would use the mean and variance of this "sample", along with the attribute value of the instance being categorized, in the Normal Density equation. The evaluation of the Normal Density equation, without constant factors, provides the relative probability.

Unknown attributes are skipped over.
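The two worked examples above can be reproduced with a small sketch. It is not the NaiveBayesCat implementation: the class and method names are hypothetical, the variance uses an n - 1 divisor (an assumption, suggested by the variance being undefined for a single instance), and only the label-independent factor 1/sqrt(2*pi) is dropped from the Normal Density.

```java
/** Illustrative only: these names are not part of the nb package. */
final class RelativeProbabilitySketch {

    /** Nominal attribute: share of the label's training instances that have the value. */
    static double nominal(double matchingCount, double labelCount) {
        return matchingCount / labelCount;
    }

    /** Arithmetic mean of the training values observed for one label. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample variance with an n - 1 divisor (assumed); needs at least two values. */
    static double sampleVariance(double[] values) {
        double mu = mean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - mu) * (v - mu);
        }
        return sumSq / (values.length - 1);
    }

    /** Normal density at x with the label-independent factor 1/sqrt(2*pi) dropped. */
    static double continuous(double x, double mu, double variance) {
        double diff = x - mu;
        return Math.exp(-(diff * diff) / (2.0 * variance)) / Math.sqrt(variance);
    }

    /** Relative probability of a label: its prior times the per-attribute factors. */
    static double labelScore(double prior, double[] attributeFactors) {
        double score = prior;
        for (double f : attributeFactors) {
            score *= f;
        }
        return score;
    }

    public static void main(String[] args) {
        double[] values = {35, 50, 65};
        System.out.println(nominal(10, 20));        // prints 0.5
        System.out.println(mean(values));           // prints 50.0
        System.out.println(sampleVariance(values)); // prints 225.0
        System.out.println(continuous(50, mean(values), sampleVariance(values)));
    }
}
```

The factor 1/sqrt(variance) is kept because the variance differs from label to label, so it does affect the comparison between labels.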
Assumptions: This method calculates the probability of a label as the product of the probabilities of each attribute. This assumes that the attributes are independent, a condition unlikely to hold in reality; hence the "Naive" of the title. This method also assumes that every continuous attribute has a Normal distribution for each label value.

Comments: For nominal attributes, if a label has no occurrences of a given attribute value of the test instance, a probability of noMatchesFactor * ( 1 / # instances in training set ) is used. For nominal attributes, if an attribute value does not occur in the training set at all, the attribute is skipped by the categorizer, since it does not serve to differentiate the labels. The code can treat unknowns as a special value by performing the is_unknown check only in the real attribute case.

Helper class NBNorm is a simple structure that holds the parameters needed to calculate the Normal Distribution of each (attribute, label) pair. The NBNorms are stored in an Array2 table "continNorm", which is indexed by attribute number and label value.

For continuous attributes the variance must not equal 0, since it appears in the denominator. If the variance is undefined for a label value (e.g. if a label has only one instance in the training set), NaiveBayesInd sets the variance to defaultVariance, a static variable. If the variance is defined but equal to 0, NaiveBayesInd sets it to epsilon, a very small static variable. For continuous attributes, if a label does not occur in the training set, a zero relative probability is assigned. If a label occurs in the training set but has only unknown values for the attribute, noMatchesFactor is used as in the nominal attribute case above.

Complexity: categorize() is O(l*n), where l = the number of categories and n = the number of attributes.
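The fallback rules in the comments above can be sketched as follows. The class, method, and constant names are hypothetical, and the two constant values are placeholders: this page does not list the actual values of defaultVariance and epsilon. The n - 1 divisor is likewise an assumption.

```java
/** Illustrative only: not the NaiveBayesCat code, and the constants are placeholders. */
final class FallbackSketch {
    static final double DEFAULT_VARIANCE = 1.0; // placeholder for defaultVariance
    static final double EPSILON = 1e-10;        // placeholder for epsilon

    /** Variance that is safe to divide by, following the rules described above. */
    static double usableVariance(double[] values) {
        int n = values.length;
        if (n < 2) {
            return DEFAULT_VARIANCE; // variance undefined: fewer than two instances
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mu = sum / n;
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - mu) * (v - mu);
        }
        double variance = sumSq / (n - 1);
        return variance == 0.0 ? EPSILON : variance; // defined but zero
    }

    /** Nominal factor with the no-matches fallback for a label that never saw the value. */
    static double nominalFactor(double matchingCount, double labelCount,
                                double noMatchesFactor, double trainingSetSize) {
        if (matchingCount == 0.0) {
            return noMatchesFactor * (1.0 / trainingSetSize);
        }
        return matchingCount / labelCount;
    }

    public static void main(String[] args) {
        System.out.println(usableVariance(new double[] {42}));         // falls back to DEFAULT_VARIANCE
        System.out.println(usableVariance(new double[] {7, 7, 7}));    // falls back to EPSILON
        System.out.println(usableVariance(new double[] {35, 50, 65})); // prints 225.0
        System.out.println(nominalFactor(0, 20, 0.5, 100));
    }
}
```

Both fallbacks keep a single attribute from forcing a label's product to zero or dividing by zero, which would otherwise let one sparse attribute decide the whole categorization.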
| Field Summary | |
static double |
defaultEvidenceFactor
|
static double |
defaultKLThreshold
|
static boolean |
defaultLaplaceCorrection
|
static double |
defaultMEstimateFactor
Categorizer option defaults. |
static double |
defaultNoMatchesFactor
|
static int |
defaultUnknownIsValue
|
static boolean |
defaultUseEvidenceProjection
|
static double |
defaultVariance
Value to use for Variance when the actual variance is undefined because there is only one occurrence. |
static java.lang.String |
endl
|
static double |
epsilon
Value to use for Variance when the actual variance = 0. |
static int |
unknownAuto
|
static int |
unknownNo
Ported from the C++ enum UnknownIsValueEnum { unknownNo, unknownYes, unknownAuto }. |
static int |
unknownYes
|
| Constructor Summary | |
NaiveBayesCat(NaiveBayesCat source)
Copy Constructor. |
|
NaiveBayesCat(java.lang.String dscr,
InstanceList instList)
Constructor |
|
| Method Summary | |
AugCategory |
categorize(Instance instance)
Categorizes a single instance based upon the training data. |
int |
class_id()
Deprecated. CLASS_NB_CATEGORIZER has been deprecated |
java.lang.Object |
clone()
Returns a pointer to a deep copy of this NaiveBayesCat. |
static NBNorm[][] |
compute_contin_norm(InstanceList instList)
Compute the norms of the continuous attributes |
static double[] |
compute_importance(InstanceList instList)
Computes importance values for each nominal attribute using the mutual_info (entropy). |
void |
display_struct(java.io.BufferedWriter stream,
DisplayPref dp)
Prints a readable representation of the Cat to the given stream. |
static double |
findMax(double[] d)
findMax finds the largest value for an array of doubles |
static double |
findMin(double[] d)
findMin finds the smallest value for an array of doubles |
double |
get_evidence_factor()
Functions for retrieving and setting optional variables. |
double |
get_kl_threshold()
|
double |
get_m_estimate_factor()
|
double |
get_no_matches_factor()
|
int |
get_unknown_is_value()
|
boolean |
get_use_evidence_projection()
|
boolean |
get_use_laplace()
|
static void |
init_class_prob(BagCounters nominCounts,
double trainWeight,
double[] prob,
boolean useLaplace,
boolean useEvidenceProjection,
double evidenceFactor)
Initialize the probabilities to be the class probabilities P(L = l) |
static double |
kullback_leibler_distance(double[] p,
double[] q)
Compute a Kullback Leibler distance metric given an array of p(x) and q(x) for all x. |
void |
OK(int level)
Check state of object after training. |
CatDist |
score(Instance instance)
Returns a category given an instance by checking all attributes in schema and returning category with highest relative probability. |
void |
set_evidence_factor(double f)
|
void |
set_kl_threshold(double th)
|
void |
set_m_estimate_factor(double m)
Sets the m value for Laplace correction. |
void |
set_no_matches_factor(double nm)
|
void |
set_unknown_is_value(int unk)
set_unknown_is_value sets the value of unknownIsValue. |
void |
set_use_evidence_projection(boolean b)
|
void |
set_use_laplace(boolean lap)
|
static double |
sumArray(double[] d)
sumArray() adds the values of all the elements in the given array. |
boolean |
supports_backfit()
|
double |
total_train_weight()
|
| Methods inherited from class shared.Categorizer |
build_distr,
description,
get_distr,
get_log_level,
get_log_options,
get_log_stream,
get_schema,
has_distr,
num_categories,
set_description,
set_distr,
set_log_level,
set_log_options,
set_log_prefixes,
set_log_stream,
set_original_distr,
set_used_attr,
supports_scoring,
total_weight |
| Methods inherited from class java.lang.Object |
equals,
finalize,
getClass,
hashCode,
notify,
notifyAll,
toString,
wait,
wait,
wait |
| Field Detail |
public static final java.lang.String endl
public static final int unknownNo
public static final int unknownYes
public static final int unknownAuto
public static final double defaultMEstimateFactor
public static final boolean defaultLaplaceCorrection
public static final int defaultUnknownIsValue
public static final double defaultKLThreshold
public static final double defaultNoMatchesFactor
public static final boolean defaultUseEvidenceProjection
public static final double defaultEvidenceFactor
public static final double epsilon
public static final double defaultVariance
| Constructor Detail |
public NaiveBayesCat(java.lang.String dscr,
InstanceList instList)
dscr - the description of this Inducer.
instList - training data.
public NaiveBayesCat(NaiveBayesCat source)
source - the NaiveBayesCat to copy.
| Method Detail |
public AugCategory categorize(Instance instance)
instance - the instance to categorize.
public int class_id()
public java.lang.Object clone()
public static NBNorm[][] compute_contin_norm(InstanceList instList)
instList - the instances to calculate.
public static double[] compute_importance(InstanceList instList)
instList - the instances to use.
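As a rough illustration of the mutual information that compute_importance is described as using, the sketch below computes I(A;L) between one nominal attribute A and the label L from a table of counts. The class name, the count-table input, and the choice of base-2 logarithms are all assumptions; the page does not say how the library scales or normalizes the resulting importance values.

```java
/** Illustrative only: mutual information from a count table, not the library's code. */
final class MutualInfoSketch {

    /**
     * counts[a][l] is the number of training instances with attribute value a
     * and label l. Returns I(A;L) in bits.
     */
    static double mutualInformation(double[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double total = 0.0;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        for (int a = 0; a < rows; a++) {
            for (int l = 0; l < cols; l++) {
                rowSum[a] += counts[a][l];
                colSum[l] += counts[a][l];
                total += counts[a][l];
            }
        }
        double mi = 0.0;
        for (int a = 0; a < rows; a++) {
            for (int l = 0; l < cols; l++) {
                if (counts[a][l] > 0.0) { // empty cells contribute nothing
                    double joint = counts[a][l] / total;
                    double independent = (rowSum[a] / total) * (colSum[l] / total);
                    mi += joint * (Math.log(joint / independent) / Math.log(2.0));
                }
            }
        }
        return mi;
    }

    public static void main(String[] args) {
        // Attribute value determines the label exactly: 1 bit of information.
        System.out.println(mutualInformation(new double[][] {{10, 0}, {0, 10}}));
        // Attribute value says nothing about the label: 0 bits.
        System.out.println(mutualInformation(new double[][] {{5, 5}, {5, 5}}));
    }
}
```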
public void display_struct(java.io.BufferedWriter stream,
DisplayPref dp)
stream - the output stream to be written to.
dp - the preferences for display.
public static double findMax(double[] d)
d - the array of doubles.
public static double findMin(double[] d)
d - the array of doubles.
public double get_evidence_factor()
public double get_kl_threshold()
public double get_m_estimate_factor()
public double get_no_matches_factor()
public int get_unknown_is_value()
public boolean get_use_evidence_projection()
public boolean get_use_laplace()
public void set_evidence_factor(double f)
public void set_kl_threshold(double th)
public static void init_class_prob(BagCounters nominCounts,
double trainWeight,
double[] prob,
boolean useLaplace,
boolean useEvidenceProjection,
double evidenceFactor)
nominCounts - the BagCounters to initialize.
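The effect of the useLaplace flag on the class probabilities P(L = l) can be illustrated with the textbook add-one correction. This is a sketch under assumptions, not the body of init_class_prob: it takes a plain array of per-label counts instead of BagCounters, uses a hypothetical class name, and leaves out evidence projection entirely.

```java
/** Illustrative only: textbook add-one Laplace correction for class priors. */
final class ClassPriorSketch {

    /** Returns P(L = l) for each label from its training count. */
    static double[] classPriors(double[] labelCounts, boolean useLaplace) {
        double total = 0.0;
        for (double c : labelCounts) {
            total += c;
        }
        int numLabels = labelCounts.length;
        double[] prob = new double[numLabels];
        for (int l = 0; l < numLabels; l++) {
            prob[l] = useLaplace
                    ? (labelCounts[l] + 1.0) / (total + numLabels) // add one per label
                    : labelCounts[l] / total;                       // plain frequency
        }
        return prob;
    }

    public static void main(String[] args) {
        double[] counts = {20, 0, 30};
        System.out.println(java.util.Arrays.toString(classPriors(counts, false)));
        System.out.println(java.util.Arrays.toString(classPriors(counts, true)));
    }
}
```

With the correction enabled, the label that never occurs in the counts receives a small nonzero prior (1/53 here) instead of 0, so it is not ruled out before any attribute is examined.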
public static double kullback_leibler_distance(double[] p,
double[] q)
p - the first array.
q - the second array.
public void OK(int level)
public CatDist score(Instance instance)
instance - the instance to be scored.
public void set_m_estimate_factor(double m)
m - the new m-estimate factor.
public void set_no_matches_factor(double nm)
public void set_unknown_is_value(int unk)
unk - the new value of unknownIsValue.
public void set_use_evidence_projection(boolean b)
public void set_use_laplace(boolean lap)
public static double sumArray(double[] d)
d - an array of doubles to add.
public boolean supports_backfit()
public double total_train_weight()
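For kullback_leibler_distance(double[] p, double[] q) in the Method Detail above, the standard Kullback-Leibler definition, the sum over x of p(x) * log(p(x)/q(x)), can be sketched as follows. The class name is hypothetical, and the use of natural logarithms and the handling of zero entries are assumptions; this page does not state either for the library method.

```java
/** Illustrative only: standard KL divergence, not the library's implementation. */
final class KlSketch {

    /** Sum over x of p[x] * ln(p[x] / q[x]); terms with p[x] == 0 contribute 0. */
    static double klDistance(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("p and q must have the same length");
        }
        double sum = 0.0;
        for (int x = 0; x < p.length; x++) {
            if (p[x] > 0.0) {
                // If q[x] is 0 here, the term (and the result) is +Infinity.
                sum += p[x] * Math.log(p[x] / q[x]);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.5};
        double[] q = {0.25, 0.75};
        System.out.println(klDistance(p, p)); // prints 0.0
        System.out.println(klDistance(p, q));
        System.out.println(klDistance(q, p)); // differs from the line above: not symmetric
    }
}
```

Because the quantity is not symmetric in p and q, the argument order matters when comparing results against a threshold such as the one set by set_kl_threshold.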