October 12, 2008

ActiveRecord and extremely large tables or queries

Using larger and larger datasets, ActiveRecord has performance problems when trying to iterate over the data. Consider this simple example to iterate over a table of genes, and print out the name of each. This is an common use case for accessing database rows.

Gene.all.each {|gene| puts gene.name}

Unfortunately, if the gene table contains millions of rows, ActiveRecord will (try) to pull all of the rows into memory first, and then iterate over them as array. Pulling so much data into memory will make the process take a very long time. The solution to this problem is to pull smaller chunks of the data in at a time, and then iterate over each of these chunks in sequence.

This is the type of solution provided by paginating_find, which uses the SQL commands of LIMIT and OFFSET to pull smaller chunks of the table into memory. The advantage of using LIMIT and OFFSET is that they are (I believe) database agnostic and so will work across any database.

As discussed in the comments of this post on Jamis Buck’s blog, using the OFFSET command requires the DB engine to linearly search through the records to find the correct point at which the chunk of returned data begins. Therefore using OFFSET you may expect that the time taken to return the data increases proportionally with the size of the dataset. A (possibly MySQL specific) solution described by both Jamis Buck and Michael Schuerig relies can split the dataset into smaller chunks based on the primary key. Since the primary key is indexed, the time taken by the DB to find the correct place to start the next chunk of rows should be much faster.

Update

Michael Schuerig has also pointed out that his plugin accepts ActiveRecord syntax, so that table joins and conditions can be given, and then the returned data iterated over in smaller chunks.

SQL queries

These two above solutions describe iterating over single tables over data using ActiveRecord syntax, but what if you want to feed ActiveRecord a complex SQL query then iterate over the results in chunks? This was the question I asked the UK North West Ruby Users Group and was given a neat solution that relies on using the ActiveRecord connect method. If you use ActiveRecord::Base.connection.execute(statement) the data created is not returned in bulk, but can instead can be pulled from the database one row at a time. The only drawback is that the method to find the headers of the returned data is database adaptor specific, in the instance of MySQL, the example method is fetch_fields, called on the returned data object.

Native Drivers

I think it’s worth pointing out in a discussion about Ruby and database access, that if you install the native driver gem for a given database this can result in performance increase. For example if you’re using MySQL, this would be

sudo gem install mysql

I believe there are similar implementations for other database adaptors, which a Google search should find.

Example iterator using id column

Dump an SQL query to file, using execute.

September 29, 2008

I wish there was Ruby on Rails for data

I’d like to think that learning Ruby on Rails has benefited my research. I’m certain that ActiveRecord has made it much, much easier to bridge the gap between my code and my database. I think validations make it easier for me to weed out bad data points in large data sets. I know for sure that RSpec has made it easy for me to test for every bug that I can think of in my code.

My nagging worry is that Rails was primarily designed for building web applications with a ‘nice’ sized dataset in the database. I can’t really say what a nice size dataset is, but I can guess that it is not 14 million rows. The difference in what Rails was designed for, and me using it for bioinformatics is highlighted when I need search for information about creating a certain type of spec, versus information about processing an ActiveRecord model over a cluster. I think that data processing, such statistics, analysis, and plotting, is where the gap lies between using Rails for its original purpose in building web applications, and subverting it to create a framework for a data centric project.

April 29, 2008

Reflection on a year of (attempted) open notebook science

A year of work on the importance of amino acid biosynthetic cost has led to the submission of a manuscript, and a preprint available on Nature Preceedings. The openness in this project was inspired by reading Jean Claude Bradley’s and Cameron Neylon’s blogs about open notebook science. I already believed in the philosophy behind open source software, and I thought that any early feedback would be useful to my research. In addition to any input received, I thought that early sharing of my research would in turn be useful to contribute back to the community.

The platform I chose was a blog, allowing results to posted as I produce them. I was already familiar with blogging, and Wordpress makes creating and maintaining a blog simple. During the early stages of my project I found it quite useful to blog, as it helped me to clarify my results and ideas while the project was still taking shape. I tried to do this about once a week, on a Friday, and summarise my latest results. Having this record of results was also helpful to refer to when discussing my latest findings. When we were writing the manuscript I also found it useful to browse back through all the entries I had created and include any ideas I had forgotten about. However, as the project progressed blogging became less important, as I had already produced my main findings and was more focused on writing the manuscript.

As for sharing information I found that writing a summary blog my research takes rather a large amount of effort. Furthermore my  blog is the only gateway to my research, and results only become available when I make the time and effort write them up. This therefore doesn’t satisfy Jean Claude Bradley’s criteria of no insider knowledge, but rather could be described as being selectively open about my research. On the positive side a blog post is a concise summary that distills my most recent progress in a way I hope is easily accessible to a casual reader. Another interesting point is that posting all my results online meant they were indexed by Google, as you would expect, but this also lead to some strange occurrences when searching online for material. For example searching for “Akashi & Gojobori”, a paper I based my work on, brings up two links to my blog ahead of the original manuscript. I find this a bit embarrassing, and I wonder if the paper authors have also encountered this?

With less time to spend on blogging, I also tried to stream my research using Twitter, sending short messages automatically using a bash script every time I committed an SVN update. While this approach takes a lot less effort on my part, I think this is the opposite end of the spectrum to blogging, and spews out large amounts of obscure repository check in messages. Ultimately I think it is of little interest for even someone directly involved in the project.

I’m still interested in open notebook science, though my lack of posting might indicate otherwise. I’m going to continue trying out new methods of sharing bioinformatics research, and the start of a new research project gives me the chance to start afresh in these approaches. My main focus should be passive approaches that build into my work flow without too much effort, but also produce a meaningful summary of the research. Therefore in addition to a blog I think it is important to maintain a summary page of the research, otherwise it may be difficult for people to understand what the point of my research is when they first come across my blog. I think this is similar to the combined wiki and blog format used by Jean Claude Bradley. Having spent some time thinking about I how could implement this, I think a landing page should be readily auto generated from the results. In my head I’m thinking a Ruby on Rails type of approach, with a templating library such as HAML and a series of Rake tasks to regenerate the landing page with any new results, as well as send out a twitter update.

Finally I thought it might be interesting to adopt version numbers for the project, similar to those used in software development. The usual layout is something like 1.2.3. The last number would be used to track simple code edits. The second number would be used to show milestones in the overall project, for example each could correspond to a figure. The first number would then be the manuscript revision. Every time a new manuscript is prepared for submission, this could be updated, where the first manuscript preparation would have the number 1.0.0 Hopefully this type of numbering would make the project easier to track and interested parties could see if the research has been updated significantly since they last checked.

In summary, open notebook science has not really had a large positive effect on my research. I think that this is mainly because using a blog alone is not an effective method of communicating scientific progress, because it requires substantial effort on my part to update, and second tracking the current state of the research can be difficult. However, I still believe that the principles of open notebook science can be beneficial to my research. In the next couple of months I’ll try some new methods to see what does work.