nick has a blog

Naïve Bayes in Hadoop

with 5 comments

Naïve Bayes is a probabilistic data mining classifier which fits nicely into the MapReduce model and gives pretty good predictive performance for its simplicity. The Hadoop implementation uses a single map/reduce operation to calculate the mean and standard deviation of each attribute/class combination, as well as the global class distribution of the training dataset.

Some basic pseudo code:

instance = single row of training set
instance.class = class/target of row
instance.attributes = list of attributes

def map(key, instance):
	i = 0
	for attribute in instance.attributes:
		collect(instance.class + "_" + i, attribute)
		i++
 
	collect("target_" + instance.class, 1) # class distribution
 
def reduce(key, values):
	if key.startsWith("target_"): # reduce class dist keys
		sum = 0
		for v in values:
			sum += v
                collect(key,sum)
 
	else: # reduce attribute/class keys
		sum=0
		sumSq = 0
		count = 0
		for v in values:
			sum += v
			sumSq += v*v
			count++
 
		mean = sum/count
		collect(key + "_mean", mean)
		collect(key + "_stddev", sqrt(abs(sumSq - mean * sum) / count))

This will produce a file of means, standard deviations and a class distribution for which you can then load into a model (such as that found in Weka- weka.classifiers.bayes.NaiveBayes.distributionForInstance). This doesn’t support discrete attributes yet, only numeric/real ones. Working on it.

I was able to process a ~5GB file of ~200k rows/2000 attributes per row in 4 minutes on 30 nodes.

Written by Nick

April 19th, 2009 at 9:44 pm

5 Responses to 'Naïve Bayes in Hadoop'

Subscribe to comments with RSS or TrackBack to 'Naïve Bayes in Hadoop'.

  1. Just to let you know your standard deviation calculation is incorrect. It should be:

    sqrt(abs(sumSq - mean * mean) / count)

    Cliff Moon

    15 Nov 09 at 1:34 pm

  2. Hi,

    where did the ‘collect’ function come from?

    Thanks,
    Petro

    Petro

    18 Nov 09 at 7:52 am

  3. Hi
    The collect function is provided by Hadoop
    -Nick

    Nick

    18 Nov 09 at 11:44 am

  4. Also note, while this looks like python it is pseudo code.

    Nick

    18 Nov 09 at 11:45 am

  5. @Cliff
    Thanks, however you may want to check out Knuths std dev function as it more accurate

    http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

    On-line algorithm

    Nick

    18 Nov 09 at 11:47 am

Leave a Reply