October 12, 2008

ActiveRecord and extremely large tables or queries

Using larger and larger datasets, ActiveRecord has performance problems when trying to iterate over the data. Consider this simple example to iterate over a table of genes, and print out the name of each. This is an common use case for accessing database rows.

Gene.all.each {|gene| puts gene.name}

Unfortunately, if the gene table contains millions of rows, ActiveRecord will (try) to pull all of the rows into memory first, and then iterate over them as array. Pulling so much data into memory will make the process take a very long time. The solution to this problem is to pull smaller chunks of the data in at a time, and then iterate over each of these chunks in sequence.

This is the type of solution provided by paginating_find, which uses the SQL commands of LIMIT and OFFSET to pull smaller chunks of the table into memory. The advantage of using LIMIT and OFFSET is that they are (I believe) database agnostic and so will work across any database.

As discussed in the comments of this post on Jamis Buck’s blog, using the OFFSET command requires the DB engine to linearly search through the records to find the correct point at which the chunk of returned data begins. Therefore using OFFSET you may expect that the time taken to return the data increases proportionally with the size of the dataset. A (possibly MySQL specific) solution described by both Jamis Buck and Michael Schuerig relies can split the dataset into smaller chunks based on the primary key. Since the primary key is indexed, the time taken by the DB to find the correct place to start the next chunk of rows should be much faster.

Update

Michael Schuerig has also pointed out that his plugin accepts ActiveRecord syntax, so that table joins and conditions can be given, and then the returned data iterated over in smaller chunks.

SQL queries

These two above solutions describe iterating over single tables over data using ActiveRecord syntax, but what if you want to feed ActiveRecord a complex SQL query then iterate over the results in chunks? This was the question I asked the UK North West Ruby Users Group and was given a neat solution that relies on using the ActiveRecord connect method. If you use ActiveRecord::Base.connection.execute(statement) the data created is not returned in bulk, but can instead can be pulled from the database one row at a time. The only drawback is that the method to find the headers of the returned data is database adaptor specific, in the instance of MySQL, the example method is fetch_fields, called on the returned data object.

Native Drivers

I think it’s worth pointing out in a discussion about Ruby and database access, that if you install the native driver gem for a given database this can result in performance increase. For example if you’re using MySQL, this would be

sudo gem install mysql

I believe there are similar implementations for other database adaptors, which a Google search should find.

Example iterator using id column

Dump an SQL query to file, using execute.

August 13, 2008

Where time goes at the start of a new research project

One of the lessons I’ve learnt from my PhD is how much time is spent re-reading and editing a manuscript, which makes makes you wish you were still doing the fun computational analysis at the start . The second thing I learnt, is that when you do move on to starting a new research project you remember how much time you spend getting your data into the right format first.

Time sink 1: Getting database structure right

I always use a database for storing my data, as it is saves all the hassle from trying to combine data from many flat files. Using a databases does however have an overhead associated with setting up. The table structures need to be correct otherwise the queries can take hours to run. For example, how normalised should the structure be? This is computer science jargon for making sure each ‘piece’ of data appears no more than once in the database, and everything else is just a reference to it’s id. I’m sure normalisation is important, but I only need the database to work well enough for me it do what I need to. A fair bit of time is spent in trial and error – modifying and creating the tables as I go along.

Time sink 2: Knowing that my data is correct

Given a file containing results I need, in a few hours I can write a parser to get the data out of the file and into my database. The problem is that I need to know that the everything has been entered correctly. What if there’s a bug in my code? Most of the time the code will throw an error, but the bugs that don’t are the ones I’m worried about. With a large dataset I won’t be able to spot any errors by eye.

My answer is to use validations . These are sets of Ruby code that sit on top of the data model, and stop it from being saved to the database until some tests have been passed. For example I’ve saved a set of yeast gene nucleotide alignments into a database, and I then want to pull each column out of the alignment and save it as a row in the database. As I’m going calculate the cost of each amino acid, then relate the cost back to the position in the alignment sequence, I have to be sure that each database row matches correctly to the position in the alignment, and contains the correct amino acids. Writing a set of validations I make sure of this each time a row is saved. There is however a price, as with 2.2 million rows, testing each before saving takes time.

Time sink 3: Getting process to run in parallel

I can’t afford to leave long processes running for a week or more, as I need to move on to the next stage of the research. Running processes in parallel over a cluster is the obvious way to speed things up, however like databases, there is an overhead involved with using a cluster.

Parallelisation is personal niggle for me, as it’s easy to do by hand, but hard to automate. For example, I could manually break my dataset up into ten pieces, send each out to a node in a computer cluster, then start a script to analyse the data. I’ll then need to join the results back together, and then use another script to parse the data into a database. My problem is that doing all this manually can very easily lead to me making mistakes in mixing result files up, or running scripts in the wrong order. I want something that I can start with one command line operation, and then when it’s finished all the results are done and in my database.

Starfish is Ruby library for making parallel work easy. Starfish automatically takes care of breaking up a large datasets, and splitting it out between however many jobs I’ve started. Using Starfish I’ve been able to cut down the work that would have taken three days, to just five hours. This means that my database isn’t cast in stone, I can clear it, change the structure (see point one), then reload all the data again in around a day. This freedom wouldn’t be the case if I had to wait a week to reload the data every time I changed something. Starfish has been given a bit of beating on the web, because it’s not a “pure” map reduce implementation, but I don’t care about pure, I care about getting my data in the database quickly.