nick has a blog

Archive for the ‘University’ Category

Naïve Bayes in Hadoop

with 5 comments

Naïve Bayes is a probabilistic data mining classifier which fits nicely into the MapReduce model and gives pretty good predictive performance for its simplicity. The Hadoop implementation uses a single map/reduce operation to calculate the mean and standard deviation of each attribute/class combination, as well as the global class distribution of the training dataset.

Some basic pseudo code:

instance = single row of training set
instance.class = class/target of row
instance.attributes = list of attributes

def map(key, instance):
	i = 0
	for attribute in instance.attributes:
		collect(instance.class + "_" + i, attribute)
		i++
 
	collect("target_" + instance.class, 1) # class distribution
 
def reduce(key, values):
	if key.startsWith("target_"): # reduce class dist keys
		sum = 0
		for v in values:
			sum += v
                collect(key,sum)
 
	else: # reduce attribute/class keys
		sum=0
		sumSq = 0
		count = 0
		for v in values:
			sum += v
			sumSq += v*v
			count++
 
		mean = sum/count
		collect(key + "_mean", mean)
		collect(key + "_stddev", sqrt(abs(sumSq - mean * sum) / count))

This will produce a file of means, standard deviations and a class distribution for which you can then load into a model (such as that found in Weka- weka.classifiers.bayes.NaiveBayes.distributionForInstance). This doesn’t support discrete attributes yet, only numeric/real ones. Working on it.

I was able to process a ~5GB file of ~200k rows/2000 attributes per row in 4 minutes on 30 nodes.

Written by Nick

April 19th, 2009 at 9:44 pm

Machine Learning with Hadoop

without comments

At University this year I am working on a machine learning framework using Hadoop (MapReduce), with the intent on running it on the University cluster. Initially I started playing with Disco, but it was a bit tedious to setup on two nodes, let alone 100, so Hadoop it is. So far progress has been good, as I have a working prototype that can take generic classifiers and evaluate them (e.g. the basic functionality of Weka). The MapReduce model was a bit unusual at first, but once you understand the basics it is insanely easy to use, which is always a bonus.

mlhadoop

The framework itself has been kept fairly basic, map functions of classifiers are provided with an Instance of data (a single row of a training dataset), this then allows the classifier to query attribute types and values. The reducer output of a classifier produces a Model file, which the evaluator then uses to evaluate the classifier on a test dataset. This was a little hacky, because reducers only produce key/value pairs the Model files have to be highly customised to each classifier (as such, a classifier must implement a model parser), a little extra coding, but for the extra effort you get massive scalability..which is always good to have. I have tried to give classifiers lots of flexibility in terms of how they operate. Many algorithms are going to require multiple MapReduce jobs, so a classifier is able to create new tasks as required. This sort of functionality would allow for meta classifiers like Bagging to be implemented as well. I am still pondering on adding cross validation support, but given that cross validation is generally used to compensate for smaller datasets, it probably isn’t necessary.

Initial testing looks good, on a small setup I have at home (two Dual Xeon 3Ghz/6GB RAM servers)  I was able to process a 2GB dataset in three minutes, using the extremely basic zero rule classifier (which is similar in terms of functionality as a word count, so not to intensive). The first real classifier I am going to implement is Naive Bayes, it seems a fairly popular choice in the literature for map reduce applications, probably because it is so simple. In addition to this I have decided I will never ever buy servers that aren’t going into a data centre. Noise is not productive!

Written by Nick

March 25th, 2009 at 8:32 pm

Parasitic JavaScript

without comments

I thought I would make a post about my recently completed honours project: Parasitic JavaScript. The idea came to me during a lecture about parasitic computing, or using computers to perform complex calculations without the owners knowledge. The first parasitic computing implementation used TCP packets to calculate 3-SAT solutions. Obviously sending millions of TCP packets over the place to calculate a + b = c is not very efficient, but it worked. Several days after thinking of the idea, I had a working prototype - and a few weeks later I changed my honours project. Using JavaScript has several benefits, the major being almost every web browser has it, and it also has the ability to request remote webpages without the using intervening (aka AJAX) . Using a work unit distribution model I was able to get web browsers to perform several distributed tasks: ray tracing, data mining (perceptron), and finding n-queens solutions using a genetic algorithm.

One of the issues I had was that on slower computers (e.g. my old iBook @ 1.33Ghz/1GB RAM) the computer would become a bit sluggish. So I started experimenting with ways of estimating the performance of the computer, as JavaScript provides no methods for this. I found that JavaScript’s setTimeout method would be less accurate when the computer was under high-load. So I developed a basic algorithm that monitors the accuracy of setTimeout and adjusts the speed at which Parasitic JavaScript was executing. This made Parasitic JavaScript play nicely with the older computers, while maximizing the execution speed of newer ones. 

You can download the full report here.

Recently I have been playing with MapReduce, and it got me thinking about how it could be implemented as Parasitic JavaScript. It would be interesting because there are many different tasks that can fall under the MapReduce model, and it would certainly reduce the amount of work required by the developer to get working in a distributed manner. Maybe this can be my next project!

Written by Nick

January 11th, 2009 at 5:06 pm

Exploring disco

without comments

At uni this year I am doing a directed study on distributing machine learning algorithms using MapReduce, with the intention of running it on the uni cluster. MapReduce is an interesting approach to distribution, it has two operations, map and reduce. The map operation is where all the processing happens for a particular item of data, and the reduce function combines the results of the map operations. Disco is a MapReduce framework that allows you to use Python, which happens to be my language of the moment.  It took a bit to get setup on my servers but several debian installs later and a few VM clones it’s working! 

The majority of MapReduce implementations of machine learning algorithms I found were some form of numerical function like bayes or neural networks. I thought I would start a bit different and try and create a rule learner, I am thinking of starting with a basic one like CN2. CN2 is fairly simple rule learner that just searches for rules with good coverage using an entropy or laplace function, and constantly repeating the search until it finds something acceptable. I think it is going to involve using multiple map and reduce chains, with each iteration of the chain doing  a refinement to the rules. Time to start coding..

Written by Nick

January 10th, 2009 at 11:30 pm

Posted in University