nb
Class NaiveBayesCat

java.lang.Object
  |
  +--shared.Globals
        |
        +--shared.Categorizer
              |
              +--nb.NaiveBayesCat

public class NaiveBayesCat
extends Categorizer

This categorizer returns the category (label) that has the greatest relative probability of being correct, assuming independence of the attributes. The relative probability of a label is calculated by multiplying the relative probabilities for each attribute. The calculation of the relative probability for a label on a single attribute depends on whether the attribute is discrete or continuous.

By Bayes' Theorem, P(L=l | X1=x1, X2=x2, ..., Xn=xn) = P(X1=x1, X2=x2, ..., Xn=xn | L=l)*P(L=l)/P(X), where P(X) is P(X1=x1, ..., Xn=xn). Since P(X) is the same for every label, we can ignore it. The Naive Bayesian approach assumes complete independence of the attributes GIVEN the label, thus P(X1=x1, X2=x2, ..., Xn=xn | L=l) = P(X1=x1|L=l)*P(X2=x2|L=l)*...*P(Xn=xn|L=l), and P(X1=x1|L=l) = P(X1=x1 ^ L=l)/P(L=l), where this quantity is approximated from the data. When the computed probabilities for two labels have the same value, we break the tie in favor of the most prevalent label.

For a discrete attribute, if the instance being categorized has the first attribute = 1, and in the training set label A occurred 20 times, 10 of which had value 1 for the first attribute, then the relative probability is 10/20 = 0.5.

For continuous (real) attributes, the relative probability is based on the Normal Distribution of the values of the attribute on training instances with the label. The actual calculation is done with the Normal Density; constants, which do not affect the relative probability between labels, are ignored. For example, say 3 training instances have label 1 and these instances have the following values for a continuous attribute: 35, 50, 65. The program would use the mean and variance of this "sample", along with the attribute value of the instance that is being categorized, in the Normal Density equation. The evaluation of the Normal Density equation, without constant factors, provides the relative probability.

Unknown attributes are skipped over.
Assumptions : This method calculates the probability of a label as the product of the probabilities of each attribute. This assumes that the attributes are independent, a condition that is unlikely to hold in reality; hence the "Naive" of the title. This method also assumes that all continuous attributes have a Normal distribution for each label value.

Comments : For nominal attributes, if a label does not have any occurrences of a given attribute value of the test instance, a probability of noMatchesFactor * ( 1 / # instances in training set ) is used. For nominal attributes, if an attribute value does not occur in the training set, the attribute is skipped in the categorizer, since it does not serve to differentiate the labels. The code can treat unknowns as a special value by performing the is_unknown check only in the real attribute case.

Helper class NBNorm is a simple structure that holds the parameters needed to calculate the Normal Distribution of each Attribute,Label pair. The NBNorms are stored in an Array2 table "continNorm", which is indexed by attribute number and label value.

For continuous attributes the variance must not equal 0, since it is in the denominator. If the variance is undefined for a label value (e.g. if a label has only one instance in the training set), NaiveBayesInd will declare the variance to be defaultVariance, a static variable. In cases where the variance is defined but equal to 0, NaiveBayesInd will declare the variance to be epsilon, a very small static variable. For continuous attributes, if a label does not occur in the training set, a zero relative probability is assigned. If a label occurs in the training set but has only unknown values for the attribute, noMatchesFactor is used as in the nominal attribute case above.

Complexity : categorize() is O(l*n), where l = the number of categories and n = the number of attributes.
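
The per-attribute calculations described above can be illustrated with a minimal Java sketch. It is not the NaiveBayesCat source: the class name, the method names, and the values chosen for DEFAULT_VARIANCE and EPSILON are assumptions made for illustration (the real defaultVariance and epsilon values are not given on this page), and the n - 1 denominator in the sample variance is also an assumption.

```java
/** Illustrative sketch only; not the NaiveBayesCat source. */
final class RelativeProbabilitySketch {
    // Hypothetical stand-ins for the static fields defaultVariance and epsilon;
    // the real values are not stated in this documentation.
    static final double DEFAULT_VARIANCE = 1.0;
    static final double EPSILON = 1e-9;

    /** Nominal attribute: fraction of the label's training instances that have the value. */
    static double nominalRelativeProb(int matches, int labelCount) {
        return (double) matches / labelCount;
    }

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample variance (n - 1 denominator, an assumption); NaN if fewer than two values. */
    static double variance(double[] values) {
        if (values.length < 2) {
            return Double.NaN; // undefined, e.g. only one training instance for the label
        }
        double m = mean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return sumSquares / (values.length - 1);
    }

    /** Normal density with the constant factor 1/sqrt(2*pi) dropped. */
    static double continuousRelativeProb(double x, double[] values) {
        double var = variance(values);
        if (Double.isNaN(var)) {
            var = DEFAULT_VARIANCE; // variance undefined
        } else if (var == 0.0) {
            var = EPSILON;          // variance defined but zero
        }
        double diff = x - mean(values);
        return Math.exp(-0.5 * diff * diff / var) / Math.sqrt(var);
    }

    public static void main(String[] args) {
        // Label A occurred 20 times, 10 of them with the attribute value in question.
        System.out.println(nominalRelativeProb(10, 20)); // prints 0.5

        // Three training instances with label 1 have the values 35, 50, 65.
        double[] label1 = {35.0, 50.0, 65.0};
        System.out.println(continuousRelativeProb(50.0, label1));
        System.out.println(continuousRelativeProb(80.0, label1));
    }
}
```

In this sketch a value at the sample mean (50) scores higher than a value far from it (80), which is the only property the categorizer needs when comparing labels.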


Field Summary
static double defaultEvidenceFactor
           
static double defaultKLThreshold
           
static boolean defaultLaplaceCorrection
           
static double defaultMEstimateFactor
          Categorizer option defaults.
static double defaultNoMatchesFactor
           
static int defaultUnknownIsValue
           
static boolean defaultUseEvidenceProjection
           
static double defaultVariance
          Value to use for Variance when actual variance is undefined because there is only one occurrence.
static java.lang.String endl
           
static double epsilon
          Value to use for Variance when actual variance = 0.
static int unknownAuto
           
static int unknownNo
          Ported from the C++ enum UnknownIsValueEnum { unknownNo, unknownYes, unknownAuto }.
static int unknownYes
           
 
Fields inherited from class shared.Categorizer
CATEGORIZER_ID_BASE, CLASS_ATTR_CATEGORIZER, CLASS_ATTR_EQ_CATEGORIZER, CLASS_ATTR_SUBSET_CATEGORIZER, CLASS_BAD_CATEGORIZER, CLASS_BAGGING_CATEGORIZER, CLASS_CASCADE_CATEGORIZER, CLASS_CLUSTER_CATEGORIZER, CLASS_CONST_CATEGORIZER, CLASS_CONSTRUCT_CATEGORIZER, CLASS_DISC_CATEGORIZER, CLASS_DISC_NODE_CATEGORIZER, CLASS_DTREE_CATEGORIZER, CLASS_IB_CATEGORIZER, CLASS_LAZYDT_CATEGORIZER, CLASS_LEAF_CATEGORIZER, CLASS_LINDISCR_CATEGORIZER, CLASS_MAJORITY_CATEGORIZER, CLASS_MULTI_SPLIT_CATEGORIZER, CLASS_MULTITHRESH_CATEGORIZER, CLASS_NB_CATEGORIZER, CLASS_ODT_CATEGORIZER, CLASS_ONE_R_CATEGORIZER, CLASS_OPTION_CATEGORIZER, CLASS_PROJECT_CATEGORIZER, CLASS_RDG_CATEGORIZER, CLASS_STACKING_CATEGORIZER, CLASS_TABLE_CATEGORIZER, CLASS_THRESHOLD_CATEGORIZER, logOptions
 
Fields inherited from class shared.Globals
badCategorizer, CONFIDENCE_INTERVAL_Z, DBG, DEFAULT_DATA_EXT, DEFAULT_EPSILON, DEFAULT_EVAL_LIMIT, DEFAULT_LAMBDA, DEFAULT_MAX_EVALS, DEFAULT_MAX_STALE, DEFAULT_MIN_EXP_EVALS, DEFAULT_NAMES_EXT, DEFAULT_SAS_SEED, DEFAULT_SEARCH_METHOD, DEFAULT_SHOW_TEST_SET_PERF, DEFAULT_TEST_EXT, DISPLAY_NAMES, EMPTY_STRING, FIRST_CATEGORY_VAL, FIRST_NOMINAL_VAL, LEFT_NODE, MAX_NUM_CATEGORIES, Mcerr, Mcout, optionServer, optionsFileName, REAL_MAX, RIGHT_NODE, SHOW_TEST_SET_PERF_HELP, SINGLE_QUOTE, STORED_REAL_MAX, TS, UNDEFINED_INT, UNDEFINED_REAL, UNDEFINED_VARIANCE, UNKNOWN_AUG_CATEGORY, UNKNOWN_CATEGORY_VAL, UNKNOWN_NODE, UNKNOWN_NOMINAL_VAL, UNKNOWN_STORED_REAL_VAL, UNKNOWN_VAL_STR
 
Constructor Summary
NaiveBayesCat(NaiveBayesCat source)
          Copy Constructor.
NaiveBayesCat(java.lang.String dscr, InstanceList instList)
          Constructor
 
Method Summary
 AugCategory categorize(Instance instance)
          Categorizes a single instance based upon the training data.
 int class_id()
          Deprecated. CLASS_NB_CATEGORIZER has been deprecated
 java.lang.Object clone()
          Returns a pointer to a deep copy of this NaiveBayesCat.
static NBNorm[][] compute_contin_norm(InstanceList instList)
          Compute the norms of the continuous attributes
static double[] compute_importance(InstanceList instList)
          Computes importance values for each nominal attribute using the mutual_info (entropy).
 void display_struct(java.io.BufferedWriter stream, DisplayPref dp)
          Prints a readable representation of the Cat to the given stream.
static double findMax(double[] d)
          findMax finds the largest value for an array of doubles
static double findMin(double[] d)
          findMin finds the smallest value for an array of doubles
 double get_evidence_factor()
          Functions for retrieving and setting optional variables.
 double get_kl_threshold()
           
 double get_m_estimate_factor()
           
 double get_no_matches_factor()
           
 int get_unknown_is_value()
           
 boolean get_use_evidence_projection()
           
 boolean get_use_laplace()
           
static void init_class_prob(BagCounters nominCounts, double trainWeight, double[] prob, boolean useLaplace, boolean useEvidenceProjection, double evidenceFactor)
          Initialize the probabilities to be the class probabilities P(L = l)
static double kullback_leibler_distance(double[] p, double[] q)
          Compute a Kullback Leibler distance metric given an array of p(x) and q(x) for all x.
 void OK(int level)
          Check state of object after training.
 CatDist score(Instance instance)
          Returns a category given an instance by checking all attributes in schema and returning category with highest relative probability.
 void set_evidence_factor(double f)
           
 void set_kl_threshold(double th)
           
 void set_m_estimate_factor(double m)
          Sets the m value for the Laplace correction.
 void set_no_matches_factor(double nm)
           
 void set_unknown_is_value(int unk)
          set_unknown_is_value sets the value of unknownIsValue.
 void set_use_evidence_projection(boolean b)
           
 void set_use_laplace(boolean lap)
           
static double sumArray(double[] d)
          sumArray() adds the values of all the elements in the given array
 boolean supports_backfit()
           
 double total_train_weight()
           
 
Methods inherited from class shared.Categorizer
build_distr, description, get_distr, get_log_level, get_log_options, get_log_stream, get_schema, has_distr, num_categories, set_description, set_distr, set_log_level, set_log_options, set_log_prefixes, set_log_stream, set_original_distr, set_used_attr, supports_scoring, total_weight
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

endl

public static final java.lang.String endl

unknownNo

public static final int unknownNo
Ported from the C++ enum UnknownIsValueEnum { unknownNo, unknownYes, unknownAuto }.

unknownYes

public static final int unknownYes

unknownAuto

public static final int unknownAuto

defaultMEstimateFactor

public static final double defaultMEstimateFactor
Categorizer option defaults.

defaultLaplaceCorrection

public static final boolean defaultLaplaceCorrection

defaultUnknownIsValue

public static final int defaultUnknownIsValue

defaultKLThreshold

public static final double defaultKLThreshold

defaultNoMatchesFactor

public static final double defaultNoMatchesFactor

defaultUseEvidenceProjection

public static final boolean defaultUseEvidenceProjection

defaultEvidenceFactor

public static final double defaultEvidenceFactor

epsilon

public static final double epsilon
Value to use for Variance when actual variance = 0.

defaultVariance

public static final double defaultVariance
Value to use for Variance when actual variance is undefined because there is only one occurrence.
Constructor Detail

NaiveBayesCat

public NaiveBayesCat(java.lang.String dscr,
                     InstanceList instList)
Constructor
Parameters:
dscr - - the description of this Inducer.
instList - - training data.

NaiveBayesCat

public NaiveBayesCat(NaiveBayesCat source)
Copy Constructor.
Parameters:
source - - the NaiveBayesCat to copy.
Method Detail

categorize

public AugCategory categorize(Instance instance)
Categorizes a single instance based upon the training data.
Overrides:
categorize in class Categorizer
Parameters:
instance - - the instance to categorize.
Returns:
the predicted category.

class_id

public int class_id()
Deprecated. CLASS_NB_CATEGORIZER has been deprecated

Simple Method to return an ID.
Returns:
- an int representing this Categorizer.

clone

public java.lang.Object clone()
Returns a pointer to a deep copy of this NaiveBayesCat.
Overrides:
clone in class Categorizer
Returns:
- the copy of this Categorizer.

compute_contin_norm

public static NBNorm[][] compute_contin_norm(InstanceList instList)
Compute the norms of the continuous attributes
Parameters:
instList - - the instances to calculate.
Returns:
the array[][] of NBNorms.

compute_importance

public static double[] compute_importance(InstanceList instList)
Computes importance values for each nominal attribute using the mutual_info (entropy). Static function; used as helper by train() below.
Parameters:
instList - - the instances to use.
Returns:
- the array[] of importance values.

display_struct

public void display_struct(java.io.BufferedWriter stream,
                           DisplayPref dp)
Prints a readable representation of the Cat to the given stream.
Overrides:
display_struct in class Categorizer
Tags copied from class: Categorizer
Parameters:
stream - The output stream to be written to.
dp - The preferences for display.

findMax

public static double findMax(double[] d)
findMax finds the largest value for an array of doubles
Parameters:
d - - the array of doubles.
Returns:
the maximum number.

findMin

public static double findMin(double[] d)
findMin finds the smallest value for an array of doubles
Parameters:
d - - the array of doubles.
Returns:
the minimum number.

get_evidence_factor

public double get_evidence_factor()
Functions for retrieving and setting optional variables.

get_kl_threshold

public double get_kl_threshold()

get_m_estimate_factor

public double get_m_estimate_factor()

get_no_matches_factor

public double get_no_matches_factor()

get_unknown_is_value

public int get_unknown_is_value()

get_use_evidence_projection

public boolean get_use_evidence_projection()

get_use_laplace

public boolean get_use_laplace()

set_evidence_factor

public void set_evidence_factor(double f)

set_kl_threshold

public void set_kl_threshold(double th)

init_class_prob

public static void init_class_prob(BagCounters nominCounts,
                                   double trainWeight,
                                   double[] prob,
                                   boolean useLaplace,
                                   boolean useEvidenceProjection,
                                   double evidenceFactor)
Initialize the probabilities to be the class probabilities P(L = l)
Parameters:
nominCounts - - the BagCounter to initialize.
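
As a rough illustration of initializing P(L = l) from training data, the following Java sketch computes class probabilities from per-label weights. It does not reproduce init_class_prob: the real method reads its counts from a BagCounters object and also supports evidence projection, neither of which is shown, and the add-one form of the Laplace correction used here is the textbook variant, assumed for illustration rather than taken from the source. The class and method names are hypothetical.

```java
/** Illustrative sketch only; not the NaiveBayesCat source. */
final class ClassPriorSketch {
    /**
     * Fills prob[l] with an estimate of P(L = l).
     * labelWeights[l] is the total training weight of label l (hypothetical input;
     * the real method obtains this from a BagCounters object).
     */
    static void initClassProb(double[] labelWeights, double[] prob, boolean useLaplace) {
        int numLabels = labelWeights.length;
        double trainWeight = 0.0;
        for (double w : labelWeights) {
            trainWeight += w;
        }
        for (int l = 0; l < numLabels; l++) {
            if (useLaplace) {
                // Textbook add-one Laplace correction (an assumption, see lead-in).
                prob[l] = (labelWeights[l] + 1.0) / (trainWeight + numLabels);
            } else {
                // Plain relative frequency of the label in the training set.
                prob[l] = labelWeights[l] / trainWeight;
            }
        }
    }

    public static void main(String[] args) {
        double[] labelWeights = {20.0, 10.0};
        double[] prob = new double[labelWeights.length];

        initClassProb(labelWeights, prob, false);
        System.out.println(java.util.Arrays.toString(prob));

        initClassProb(labelWeights, prob, true);
        System.out.println(java.util.Arrays.toString(prob));
    }
}
```

With the correction enabled, a label that never occurs in the training set still receives a small non-zero probability in this sketch.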

kullback_leibler_distance

public static double kullback_leibler_distance(double[] p,
                                               double[] q)
Compute a Kullback Leibler distance metric given an array of p(x) and q(x) for all x. The Kullback Leibler distance is defined as: sum over all x of p(x) * log(p(x)/q(x)). We assume that no p(x) or q(x) is zero.
Parameters:
p - - the first array
q - - the second array.
Returns:
- the computed distance between p and q.
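
The definition above can be written out as a short Java sketch. This is an illustration rather than the library's implementation: the class and method names are hypothetical, and the natural logarithm is assumed because the page does not state the base. Like the documented method, it assumes that no entry of p or q is zero.

```java
/** Illustrative sketch only; not the NaiveBayesCat source. */
final class KullbackLeiblerSketch {
    /**
     * Sum over all x of p(x) * log(p(x) / q(x)).
     * Assumes p and q have the same length and contain no zeros.
     * The natural logarithm is an assumption; the base is not documented.
     */
    static double klDistance(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("p and q must have the same length");
        }
        double sum = 0.0;
        for (int x = 0; x < p.length; x++) {
            sum += p[x] * Math.log(p[x] / q[x]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.5};
        double[] q = {0.25, 0.75};
        System.out.println(klDistance(p, p)); // identical distributions give 0.0
        System.out.println(klDistance(p, q));
        System.out.println(klDistance(q, p)); // generally differs from klDistance(p, q)
    }
}
```

Note that the quantity is not symmetric in p and q, so the order of the two arguments matters.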

OK

public void OK(int level)
Check state of object after training. Checks that 1) the BagCounter is OK, 2) the number of test cases is > 0, and 3) no variances equal 0.

score

public CatDist score(Instance instance)
Returns a category given an instance by checking all attributes in schema and returning category with highest relative probability. The relative probability is estimated for each label, and the label with the highest value is the category returned. The probability for a given label is P(Nominal Attributes)*P(Continuous Attributes). Since the probability is a product, we can factor out any constants that multiply every label, since this will not change the ordering of the labels. P(Continuous Attribute Value X) is calculated using the normal density: Normal(X) = 1/(sqrt(2*pi)*std-dev)*exp((-1/2)*(X-mean)^2/var). This calculation can be stripped of the constant sqrt(2*pi) without changing the outcome. P(Nominal Attributes) is calculated as the percentage of a label's training set that had the test instance's value for each attribute. The majority label is returned if all are equal. See this file's header for more information.
Overrides:
score in class Categorizer
Parameters:
instance - - the instance to be scored.
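
The selection step described for score can be sketched as follows: multiply the class probability by the per-attribute relative probabilities for each label, then pick the label with the highest product, breaking ties in favor of the more prevalent label. This is a simplified illustration, not the actual method: the class and method names and the array-based inputs are hypothetical, and the real method returns a CatDist rather than a label index.

```java
/** Illustrative sketch only; not the NaiveBayesCat source. */
final class ScoreSketch {
    /**
     * labelCounts[l]  : number of training instances with label l (hypothetical input).
     * attrProbs[l][a] : relative probability of the instance's value for attribute a
     *                   given label l (hypothetical input, computed elsewhere).
     * Returns the index of the winning label.
     */
    static int pickLabel(double[] labelCounts, double[][] attrProbs) {
        double total = 0.0;
        for (double c : labelCounts) {
            total += c;
        }
        int best = -1;
        double bestScore = 0.0;
        for (int l = 0; l < labelCounts.length; l++) {
            double score = labelCounts[l] / total; // class probability P(L = l)
            for (double p : attrProbs[l]) {
                score *= p;                        // independence given the label
            }
            boolean higher = best == -1 || score > bestScore;
            boolean tieForMajority = best != -1 && score == bestScore
                    && labelCounts[l] > labelCounts[best];
            if (higher || tieForMajority) {
                best = l;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] counts = {10.0, 30.0};

        // Label 0 wins clearly: 0.25 * 0.9 versus 0.75 * 0.1.
        System.out.println(pickLabel(counts, new double[][] {{0.9}, {0.1}})); // prints 0

        // Exact tie (0.25 * 0.75 versus 0.75 * 0.25): the more prevalent label 1 wins.
        System.out.println(pickLabel(counts, new double[][] {{0.75}, {0.25}})); // prints 1
    }
}
```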

set_m_estimate_factor

public void set_m_estimate_factor(double m)
Sets the m value for the Laplace correction.
Parameters:
m - - the new m-estimate factor.

set_no_matches_factor

public void set_no_matches_factor(double nm)

set_unknown_is_value

public void set_unknown_is_value(int unk)
set_unknown_is_value sets the value of unknownIsValue. The variable unknownIsValue must be either 1, 2, or 3; any other value fails with an error.
Parameters:
unk - - the new value of unknownIsValue

set_use_evidence_projection

public void set_use_evidence_projection(boolean b)

set_use_laplace

public void set_use_laplace(boolean lap)

sumArray

public static double sumArray(double[] d)
sumArray() adds the values of all the elements in the given array
Parameters:
d - - an array of doubles to add.
Returns:
the sum of the doubles

supports_backfit

public boolean supports_backfit()

total_train_weight

public double total_train_weight()