Transition into losing your database

I had the unpleasant experience recently of accidentally typing drop database backoffice; on our production servers at work during some database maintenance. This is most certainly something I do not recommend anyone do, but here is how we were able to recover everything and the lessons we learnt.

Context

First, a little background on our database and backup setup. At The Nile we operate multiple master-slave replicated MySQL database servers. The replication occurs via MySQL’s built-in binary logs, which record all changes made on the master. We have a slave server dedicated to performing daily backups of all of our databases, as well as hourly backups of important tables. Our database servers have 8 cores, 64GB of RAM, and 10x 10K SAS disks in RAID 10. Our slave servers had NO replication delay, which is ultimately why I am writing this post.

The database that got dropped is our core system database; it is a central repository for orders and product availability information. However, it is a separate system from our frontends – all orders are imported in a batch process, as opposed to when customers place the orders. It currently contains about 300 million rows and is 80GB in size. It’s not our biggest database, but it is still large.

Our backup format is just pure SQL from mysqldump. We arrange the backups such that each table is in a separate compressed gzip file. This allows partial imports of data in the event of single-table corruption, and it also allows us to import tables in parallel. Once the backup process is complete the tables are packaged into an uncompressed tarball.
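
To give a concrete picture of the layout, the per-table dump step looks roughly like this; the table names, paths and credentials are placeholders rather than our actual backup script:

import gzip
import os
import shutil
import subprocess
import tarfile

TABLES = ["orders", "order_lines", "availability"]   # placeholder table names
BACKUP_DIR = "/backups/backoffice"                    # placeholder path

os.makedirs(BACKUP_DIR, exist_ok=True)
for table in TABLES:
    # dump one table and compress it on the fly into its own .sql.gz file
    out_path = os.path.join(BACKUP_DIR, table + ".sql.gz")
    with gzip.open(out_path, "wb") as out:
        dump = subprocess.Popen(["mysqldump", "backoffice", table],
                                stdout=subprocess.PIPE)
        shutil.copyfileobj(dump.stdout, out)
        dump.wait()

# package the compressed per-table files into an uncompressed tarball
with tarfile.open("/backups/backoffice.tar", "w") as tar:
    tar.add(BACKUP_DIR, arcname="backoffice")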

The arrangement of backups can have a significant impact on recovery speed, and it is important to consider this when creating a backup system. If the tarball were compressed rather than the individual tables, starting a recovery would require expanding ~80GB of SQL onto the disk. By compressing the individual table files instead, we only expand ~10GB onto the disk. It also reduces the amount of data being read from the disk while importing, as we are able to use zcat table.sql.gz | mysql backoffice to import a table, moving an I/O bottleneck to CPU/memory.

The last part of backups is binary logging. Most of the time binary logs are used for slave replication, but when things go wrong they provide a complete log of all the changes that have occurred since the last full backup (...at least you hope so). The mysqlbinlog tool converts binary logs into SQL. In order to use binary logs you will need to know the time the database was dropped, as well as the time the full backup was started.

Recovery begins

If you have backups, recovery is a fairly simple process. It can just take a while. Here are the steps we took:

  1. Firstly, turn off everything. The last thing you want is some automated system script modifying a half-imported database.
  2. Copy the backups to the master and expand.
  3. The database in question had two very large tables (~50m rows each), while the rest of the tables had a maximum of 10 million rows. So we started three import processes: the two large tables were imported separately, while all the other tables were imported in the third process (a rough sketch of this is shown after the list).
  4. Drink coffee for 4 hours…
  5. Start the bin log import.
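
For step 3, a rough sketch of the three-way split looks like this; the dump file names are placeholders, not our actual recovery script:

import subprocess
from multiprocessing import Pool

LARGE_TABLE_DUMPS = ["orders.sql.gz", "order_lines.sql.gz"]         # placeholders
OTHER_TABLE_DUMPS = ["products.sql.gz", "suppliers.sql.gz"]         # placeholders

def import_dumps(files):
    # stream each compressed dump straight into mysql, one after another
    for f in files:
        subprocess.run(f"zcat {f} | mysql backoffice", shell=True, check=True)

if __name__ == "__main__":
    # one process per large table, plus one process for everything else
    jobs = [[LARGE_TABLE_DUMPS[0]], [LARGE_TABLE_DUMPS[1]], OTHER_TABLE_DUMPS]
    with Pool(processes=3) as pool:
        pool.map(import_dumps, jobs)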

The bin log import is probably the most difficult part of the process. You have to construct an SQL file that contains all the changes since the last backup, up until the point of failure. In our case we knew the exact time when the backup had started, and we knew the time when I entered the drop command. For us the command looked a little like:

mysqlbinlog -d backoffice --start-datetime="2011-03-23 04:00:00" --end-datetime="2011-03-23 11:00:00" binlog.005430 binlog.005431 binlog.005432 binlog.005433 binlog.005434 binlog.005435 > binlog.sql

Once you have the file, it’s important to create a new user with restricted permissions: one that can only INSERT, UPDATE and SELECT on the affected database, with no DROP access. Your specific requirements may be different. Now you just have to import the file:

mysql -f -urestricted_user -p backoffice < binlog.sql

Once that’s complete, it’s time for a beer.

Lessons learnt

It’s always important to learn things from disasters; otherwise you’re bound to repeat them. For us, the most important lesson was the fact that our replication was near instant. As drop statements are replicated, all of our slaves executed the drop command, consequently losing our live backup. It’s important to remember that slaves help protect against hardware failure, but not human error. As such we are implementing delayed replication using mk-slave-delay.

Additional steps that can be taken include restricting access to the drop command. However, in our case we already had this in place. As I was performing some clean-up maintenance at the time, I had used an account with drop access, and it only takes a little bit of muscle memory to type the wrong word.

Graduated, moved to Australia

So, it’s been a busy few months. In July, I completed all my coursework, the final piece of which was a report on machine learning using Hadoop/MapReduce. The result was a nice simple method for distributing machine learning algorithms over a Hadoop cluster without modification. The report is available for download. In August, I packed up and moved to Sydney, Australia, where I am working for The Nile as a senior systems engineer. In October, I flew back to NZ and officially graduated with a Bachelor of Computing and Mathematical Sciences. What’s next? I don’t know! But I hope this year is just as exciting as the last!

Naïve Bayes in Hadoop

Naïve Bayes is a probabilistic data mining classifier which fits nicely into the MapReduce model and gives pretty good predictive performance for its simplicity. The Hadoop implementation uses a single map/reduce operation to calculate the mean and standard deviation of each attribute/class combination, as well as the global class distribution of the training dataset.

Some basic pseudo code:

instance = single row of training set
instance.class = class/target of row
instance.attributes = list of attributes

This will produce a file of means, standard deviations and a class distribution, which you can then load into a model (such as Weka’s weka.classifiers.bayes.NaiveBayes.distributionForInstance). This doesn’t support discrete attributes yet, only numeric/real ones. Working on it.
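
To make the map/reduce step concrete, here is a rough Hadoop Streaming style sketch of the idea; the CSV layout with the class label in the last column is an assumption, and this is not the actual implementation.

# mapper.py: emit one value per attribute/class combination, plus a class count
import sys

for line in sys.stdin:
    if not line.strip():
        continue
    fields = line.strip().split(",")
    label, values = fields[-1], fields[:-1]
    for i, value in enumerate(values):
        print(f"{label}:{i}\t{value}")
    print(f"{label}:count\t1")

The reducer then computes the statistics per key (Hadoop delivers the keys already grouped by its sort phase); holding each key’s values in memory is fine for a sketch, though a streaming variance calculation would scale better:

# reducer.py: mean and standard deviation per attribute/class, plus class counts
import math
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        yield key, float(value)

for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    values = [v for _, v in group]
    if key.endswith(":count"):
        print(f"{key}\t{int(sum(values))}")          # class distribution
    else:
        n = len(values)
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        print(f"{key}\t{mean}\t{std}")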

I was able to process a ~5GB file of ~200k rows/2000 attributes per row in 4 minutes on 30 nodes.

Machine Learning with Hadoop

At University this year I am working on a machine learning framework using Hadoop (MapReduce), with the intent of running it on the University cluster. Initially I started playing with Disco, but it was a bit tedious to set up on two nodes, let alone 100, so Hadoop it is. So far progress has been good, as I have a working prototype that can take generic classifiers and evaluate them (e.g. the basic functionality of Weka). The MapReduce model was a bit unusual at first, but once you understand the basics it is insanely easy to use, which is always a bonus.

The framework itself has been kept fairly basic: map functions of classifiers are provided with an Instance of data (a single row of a training dataset), which allows the classifier to query attribute types and values. The reducer output of a classifier produces a Model file, which the evaluator then uses to evaluate the classifier on a test dataset. This was a little hacky: because reducers only produce key/value pairs, the Model files have to be highly customised to each classifier (as such, a classifier must implement a model parser). It is a little extra coding, but for the extra effort you get massive scalability, which is always good to have. I have tried to give classifiers lots of flexibility in terms of how they operate. Many algorithms are going to require multiple MapReduce jobs, so a classifier is able to create new tasks as required; this sort of functionality would allow meta classifiers like Bagging to be implemented as well. I am still pondering adding cross-validation support, but given that cross validation is generally used to compensate for smaller datasets, it probably isn’t necessary.
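
In rough outline, the contract a classifier fulfils looks something like this (a Python sketch for brevity; the names are illustrative and the Hadoop plumbing is glossed over):

class Classifier:
    def map(self, instance, emit):
        # called once per Instance (a single row of the training dataset);
        # the classifier can query attribute types/values and emit(key, value)
        # intermediate results for the reducer
        raise NotImplementedError

    def reduce(self, key, values, emit):
        # combine the mapped values for one key into the key/value pairs
        # that make up the Model file
        raise NotImplementedError

    def parse_model(self, model_pairs):
        # reducers only produce key/value pairs, so each classifier has to
        # turn its own Model file back into something the evaluator can use
        raise NotImplementedError

    def next_tasks(self, model):
        # optional: request further MapReduce jobs, for algorithms (or meta
        # classifiers like Bagging) that need more than one pass
        return []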

Initial testing looks good: on a small setup I have at home (two dual Xeon 3GHz/6GB RAM servers) I was able to process a 2GB dataset in three minutes, using the extremely basic zero rule classifier (which is similar in functionality to a word count, so not too intensive). The first real classifier I am going to implement is Naive Bayes; it seems a fairly popular choice in the literature for MapReduce applications, probably because it is so simple. In addition to this I have decided I will never ever buy servers that aren’t going into a data centre. Noise is not productive!

Stripping JFIF Comments from JPEG Images

Most JPEG images include a JFIF comment, which usually describes the creator of the image (not to be confused with Exif data). This data is usually in the range of 50 to 100 bytes, so not much, but it might make a difference if you have a 56k modem and a lot of images. Because this data is effectively useless it is safe to remove it from images.

I wrote a python script to do this; the general approach is sketched below.

The script basically searches for the JFIF comment header (\xFF\xEE) and strips the data out. Expect unexpected consequences if the images being processed do not contain JFIF comments!
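
The approach looks roughly like this; it is only a sketch that assumes a single \xFF\xEE segment and relies on the two-byte length field that follows a JPEG marker:

import struct
import sys

def strip_segment(data, marker=b"\xFF\xEE"):
    pos = data.find(marker)
    if pos == -1:
        return data                      # no comment segment, leave untouched
    # the two bytes after the marker are the segment length, including themselves
    (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
    return data[:pos] + data[pos + 2 + length:]

if __name__ == "__main__":
    with open(sys.argv[1], "rb") as f:
        data = f.read()
    with open(sys.argv[1], "wb") as f:
        f.write(strip_segment(data))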

Roasting Coffee!

Aside from computing, my other hobby is coffee! I have a Hottop coffee roaster that allows me to, well, roast coffee. I’ve had it for a while now and have built up quite a collection of different varieties of green beans (I’ve lost count of how many, but they fill up a good portion of the coffee room). Coffee roasting is actually a fairly boring thing to do: it takes about 30 minutes all up, and for the most part you do nothing. The interesting parts are blending and watching the beans eject out of the roaster; profiling can be interesting too, if you are into it, but the Hottop automates some of the process.

Most people who roast coffee take great care in the science of blending; I don’t. I prefer to pick some random things and give it a go. I do occasionally get some undrinkable cups, but I get some awesome ones too. That said, I do have some no-go blends, and I try to aim for a full-bodied coffee with minimal acidity.

The roasting process on the Hottop is fairly straightforward: let it pre-heat to 75°C, pour in the beans (that you blended while it was pre-heating) and wait about 18 minutes. Once the temperature of the roaster gets to about 210°C things start to happen fast; there are three stages of cracks. Coffee roasted just past the first crack is what you usually get in a cafe, the second crack is a darker roast, and the third crack is what you get at Starbucks (charcoal). At the end the Hottop ejects the beans out of the roaster onto a rotating cooling tray to cool down. Then you start drinking the next day!


Parasitic JavaScript

I thought I would make a post about my recently completed honours project: Parasitic JavaScript. The idea came to me during a lecture about parasitic computing, or using computers to perform complex calculations without their owners’ knowledge. The first parasitic computing implementation used TCP packets to calculate 3-SAT solutions. Obviously sending millions of TCP packets all over the place to calculate a + b = c is not very efficient, but it worked. Several days after thinking of the idea, I had a working prototype – and a few weeks later I changed my honours project. Using JavaScript has several benefits, the major one being that almost every web browser has it; it also has the ability to request remote webpages without the user intervening (aka AJAX). Using a work unit distribution model I was able to get web browsers to perform several distributed tasks: ray tracing, data mining (perceptron), and finding n-queens solutions using a genetic algorithm.

One of the issues I had was that on slower computers (e.g. my old iBook @ 1.33GHz/1GB RAM) the computer would become a bit sluggish. So I started experimenting with ways of estimating the performance of the computer, as JavaScript provides no methods for this. I found that JavaScript’s setTimeout method would be less accurate when the computer was under high load. So I developed a basic algorithm that monitors the accuracy of setTimeout and adjusts the speed at which Parasitic JavaScript executes. This made Parasitic JavaScript play nicely with the older computers, while maximizing the execution speed of newer ones.

You can download the full report here.

Recently I have been playing with MapReduce, and it got me thinking about how it could be implemented as Parasitic JavaScript. It would be interesting because there are many different tasks that can fall under the MapReduce model, and it would certainly reduce the amount of work required by a developer to get something working in a distributed manner. Maybe this can be my next project!

Exploring disco

At uni this year I am doing a directed study on distributing machine learning algorithms using MapReduce, with the intention of running them on the uni cluster. MapReduce is an interesting approach to distribution: it has two operations, map and reduce. The map operation is where all the processing happens for a particular item of data, and the reduce function combines the results of the map operations. Disco is a MapReduce framework that allows you to use Python, which happens to be my language of the moment. It took a bit to get set up on my servers, but several Debian installs and a few VM clones later it’s working!
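
As a toy, single-machine illustration of the two operations (just the shape of the model, not Disco’s API):

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # all of the per-item processing happens here: emit (word, 1) pairs
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # combine the mapped results that share a key
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog jumps the fox"]
mapped = [pair for line in lines for pair in map_fn(line)]
grouped = groupby(sorted(mapped), key=itemgetter(0))
print([reduce_fn(word, (count for _, count in group)) for word, group in grouped])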

The majority of MapReduce implementations of machine learning algorithms I found were some form of numerical function like Bayes or neural networks. I thought I would start a bit differently and try to create a rule learner; I am thinking of starting with a basic one like CN2. CN2 is a fairly simple rule learner that searches for rules with good coverage using an entropy or Laplace function, repeating the search until it finds something acceptable. I think it is going to involve multiple map and reduce chains, with each iteration of the chain refining the rules. Time to start coding...
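
For reference, the two rule-quality measures CN2 typically uses look roughly like this; representing an instance as a dict and a rule as any predicate over it is purely for illustration:

import math
from collections import Counter

def class_counts(instances, rule):
    # how many covered instances fall into each class for a candidate rule
    return Counter(inst["class"] for inst in instances if rule(inst))

def entropy(counts):
    # entropy of the class distribution covered by the rule (lower is better)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def laplace_accuracy(counts, num_classes):
    # (majority class count + 1) / (covered count + number of classes)
    covered = sum(counts.values())
    majority = max(counts.values(), default=0)
    return (majority + 1) / (covered + num_classes)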