Naïve Bayes is a probabilistic data mining classifier which fits nicely into the MapReduce model and gives pretty good predictive performance for its simplicity. The Hadoop implementation uses a single map/reduce operation to calculate the mean and standard deviation of each attribute/class combination, as well as the global class distribution of the training dataset.

Some basic pseudo code:

instance = single row of training set

instance.class = class/target of row

instance.attributes = list of attributes

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
def map(key, instance): i = 0 for attribute in instance.attributes: collect(instance.class + "_" + i, attribute) i++ collect("target_" + instance.class, 1) # class distribution def reduce(key, values): if key.startsWith("target_"): # reduce class dist keys sum = 0 for v in values: sum += v collect(key,sum) else: # reduce attribute/class keys sum=0 sumSq = 0 count = 0 for v in values: sum += v sumSq += v*v count++ mean = sum/count collect(key + "_mean", mean) collect(key + "_stddev", sqrt(abs(sumSq - mean * sum) / count)) |

This will produce a file of means, standard deviations and a class distribution for which you can then load into a model (such as that found in Weka- weka.classifiers.bayes.NaiveBayes.distributionForInstance). This doesn’t support discrete attributes yet, only numeric/real ones. Working on it.

I was able to process a ~5GB file of ~200k rows/2000 attributes per row in 4 minutes on 30 nodes.

Cliff MoonJust to let you know your standard deviation calculation is incorrect. It should be:

sqrt(abs(sumSq – mean * mean) / count)

PetroHi,

where did the ‘collect’ function come from?

Thanks,

Petro

NickHi

The collect function is provided by Hadoop

-Nick

NickAlso note, while this looks like python it is pseudo code.

Nick@Cliff

Thanks, however you may want to check out Knuths std dev function as it more accurate

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

On-line algorithm

UdayNick

Is this code available for academic purpose to test on some data? Thanks

Uday

Pingback: Building a Naive Bayes Classifier in the Browser using Map-Reduce — Gary Sieling

GaryGreat- thanks for the useful resource. I translated this into Javascript so I could build a proof of concept of map-reduce in the browser – thought you might find it interesting.