nick has a blog

Archive for the ‘Machine Learning’ Category

Naïve Bayes in Hadoop

with 5 comments

Naïve Bayes is a probabilistic data mining classifier which fits nicely into the MapReduce model and gives pretty good predictive performance for its simplicity. The Hadoop implementation uses a single map/reduce operation to calculate the mean and standard deviation of each attribute/class combination, as well as the global class distribution of the training dataset.

Some basic pseudo code:

instance = single row of training set
instance.class = class/target of row
instance.attributes = list of attributes

def map(key, instance):
	i = 0
	for attribute in instance.attributes:
		collect(instance.class + "_" + i, attribute)
		i++
 
	collect("target_" + instance.class, 1) # class distribution
 
def reduce(key, values):
	if key.startsWith("target_"): # reduce class dist keys
		sum = 0
		for v in values:
			sum += v
                collect(key,sum)
 
	else: # reduce attribute/class keys
		sum=0
		sumSq = 0
		count = 0
		for v in values:
			sum += v
			sumSq += v*v
			count++
 
		mean = sum/count
		collect(key + "_mean", mean)
		collect(key + "_stddev", sqrt(abs(sumSq - mean * sum) / count))

This will produce a file of means, standard deviations and a class distribution for which you can then load into a model (such as that found in Weka- weka.classifiers.bayes.NaiveBayes.distributionForInstance). This doesn’t support discrete attributes yet, only numeric/real ones. Working on it.

I was able to process a ~5GB file of ~200k rows/2000 attributes per row in 4 minutes on 30 nodes.

Written by Nick

April 19th, 2009 at 9:44 pm

Machine Learning with Hadoop

without comments

At University this year I am working on a machine learning framework using Hadoop (MapReduce), with the intent on running it on the University cluster. Initially I started playing with Disco, but it was a bit tedious to setup on two nodes, let alone 100, so Hadoop it is. So far progress has been good, as I have a working prototype that can take generic classifiers and evaluate them (e.g. the basic functionality of Weka). The MapReduce model was a bit unusual at first, but once you understand the basics it is insanely easy to use, which is always a bonus.

mlhadoop

The framework itself has been kept fairly basic, map functions of classifiers are provided with an Instance of data (a single row of a training dataset), this then allows the classifier to query attribute types and values. The reducer output of a classifier produces a Model file, which the evaluator then uses to evaluate the classifier on a test dataset. This was a little hacky, because reducers only produce key/value pairs the Model files have to be highly customised to each classifier (as such, a classifier must implement a model parser), a little extra coding, but for the extra effort you get massive scalability..which is always good to have. I have tried to give classifiers lots of flexibility in terms of how they operate. Many algorithms are going to require multiple MapReduce jobs, so a classifier is able to create new tasks as required. This sort of functionality would allow for meta classifiers like Bagging to be implemented as well. I am still pondering on adding cross validation support, but given that cross validation is generally used to compensate for smaller datasets, it probably isn’t necessary.

Initial testing looks good, on a small setup I have at home (two Dual Xeon 3Ghz/6GB RAM servers)  I was able to process a 2GB dataset in three minutes, using the extremely basic zero rule classifier (which is similar in terms of functionality as a word count, so not to intensive). The first real classifier I am going to implement is Naive Bayes, it seems a fairly popular choice in the literature for map reduce applications, probably because it is so simple. In addition to this I have decided I will never ever buy servers that aren’t going into a data centre. Noise is not productive!

Written by Nick

March 25th, 2009 at 8:32 pm