Naïve Bayes in Hadoop

Naïve Bayes is a probabilistic data mining classifier which fits nicely into the MapReduce model and gives pretty good predictive performance for its simplicity. The Hadoop implementation uses a single map/reduce operation to calculate the mean and standard deviation of each attribute/class combination, as well as the global class distribution of the training dataset.

Some basic pseudo code:

instance = single row of training set
instance.class = class/target of row
instance.attributes = list of attributes

This will produce a file of means, standard deviations and a class distribution for which you can then load into a model (such as that found in Weka- weka.classifiers.bayes.NaiveBayes.distributionForInstance). This doesn’t support discrete attributes yet, only numeric/real ones. Working on it.

I was able to process a ~5GB file of ~200k rows/2000 attributes per row in 4 minutes on 30 nodes.

8 thoughts on “Naïve Bayes in Hadoop

  1. Cliff Moon

    Just to let you know your standard deviation calculation is incorrect. It should be:

    sqrt(abs(sumSq – mean * mean) / count)

  2. Pingback: Building a Naive Bayes Classifier in the Browser using Map-Reduce — Gary Sieling

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">