January 5, 2010

Genomics in a small microbiology lab

My post-doc is doing genomics of micro-organisms from starved cave environments. Several universities in the Kentucky area have banded together to get a sequencer, which allows a small microbiology lab like ours to do sequencing for a few thousand dollars. The biology department here doesn’t have the dedicated computing cluster required for genomic assembly and analysis. However, the availability of on-demand computing resources means this isn’t a problem, as we can rent a virtual machine with 64GB of RAM by the hour. The only bottleneck in my project will therefore be my ability to formulate a research question and properly analyse the genomic data.

The availability of cheaper sequencing and by-the-hour computer time means that smaller research laboratories are no longer restricted in their ability to do genomics. It’s not hard to imagine that a few years ago sequencing costs put novel genomics out of reach for most labs, while only labs at large institutions had access to dedicated computing facilities. From my experience of moving from a large to a small university, it seems the financial and infrastructure barriers to doing genomics are now much lower. Genomics, in microbes at least, can now be carried out by hundreds of smaller labs instead of being clustered at a few large sequencing centres and universities.

I remember when I started doing my masters five years ago that most papers began by discussing the "explosion of sequence data", but I think the availability of cheaper sequencing means that the explosion is just beginning. Now is a great time to be a bioinformatician – sequencing and computational power are now much easier to access and the problem will be finding people that can manage and process the data.

December 22, 2009

Deciding what genomes to sequence



We’re aiming to sequence three genomes from three different cave environments, where each cave differs in the degree of nutrient starvation. We will sequence P. fluorescens isolates from each cave and examine how the genomes have adapted to the starved cave environments compared with the available genomes of P. fluorescens from soil or plants.

We’ll be using Roche/454 sequencing, which provides ~800,000 reads of genomic DNA, where each read is approximately 500 nucleotides in length. In total a single 454 plate should provide 400Mbp of sequence data. Previous sequencing of P. fluorescens has shown that the genome size is just under 7Mbp, and therefore if we sequenced a single P. fluorescens genome this would generate 400Mbp / 7Mbp = 57X coverage. We would, however, like to sequence multiple genomes and there are two options for this.

Rubber grid

A rubber grid can be placed over the sequencing plate to divide it into individual segments, where each segment can be used to sequence a genome. The rubber grid, however, covers approximately one third of the sequencing plate and will reduce the amount of sequence data from 400Mbp to roughly 266Mbp. Therefore if we sequenced four P. fluorescens genomes using the rubber grid to divide the plate, this would theoretically provide roughly 10X coverage for each genome (266Mbp / 4 genomes / 7Mbp).

Sequence tags

The second approach to sequencing multiple genomes involves using sequencing tags. Each DNA fragment has a small oligonucleotide tag attached, where each tag is unique to one of the P. fluorescens isolates. As the fragment is sequenced the tag is also sequenced, and this allows the sequenced DNA to be attributed to a source genome based on the attached tag. There is therefore no need to use a rubber grid and the full 400Mbp of sequence data can be produced from the plate. This could theoretically provide 14X coverage for 4 genomes, or 11X coverage for 5 genomes. This may seem like the obvious choice, but the sequencing facility has not tried using sequencing tags before, so there is a risk in trying something for the first time.
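
The coverage figures for both options come from dividing the usable plate yield by the number of genomes and by the genome size. Here is a small Ruby sketch of that arithmetic, assuming the round numbers quoted in this post (800,000 reads of 500 nucleotides, a 7Mbp genome, and one third of the plate lost to the rubber grid). The constant and method names are my own, chosen for illustration.

```ruby
# Back-of-envelope coverage estimates for a single 454 plate.
# All figures are the approximate ones quoted in the post, not measured values.
READS_PER_PLATE = 800_000
READ_LENGTH_BP  = 500
GENOME_SIZE_BP  = 7_000_000   # P. fluorescens, just under 7Mbp
GRID_LOSS       = 1.0 / 3     # fraction of the plate covered by the rubber grid

# Total bases produced by a full plate.
def plate_yield_bp
  READS_PER_PLATE * READ_LENGTH_BP
end

# Expected fold coverage per genome when `genomes` share `yield_bp` equally.
def coverage(yield_bp, genomes)
  yield_bp.to_f / genomes / GENOME_SIZE_BP
end

full    = plate_yield_bp
gridded = full * (1 - GRID_LOSS)

puts format("single genome, full plate: %.1fX", coverage(full, 1))
puts format("4 genomes, rubber grid:    %.1fX", coverage(gridded, 4))
puts format("4 genomes, tags:           %.1fX", coverage(full, 4))
puts format("5 genomes, tags:           %.1fX", coverage(full, 5))
```

These are idealised numbers: they assume every read is usable and that reads are shared evenly between genomes, which a real run is unlikely to achieve.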

Making a decision

Over the next few days we have to decide the aim of this project, especially since my funding is only for one year. One option is to sequence one genome from each cave and then compare the cave genomes with existing P. fluorescens genomes. This would give some indication of how the genomes differ between caves and how they are adapted to caves.

A second approach could be to sequence multiple genomes from two caves. This would allow examination not only of how each cave has shaped the genome but also of how variable the genome is within the same species in the same environment.

December 15, 2009

Starting a new post doc at NKU

Two weeks ago I started my new job as a post-doctoral researcher at Northern Kentucky University. I’ll be working in the Barton geomicrobiology lab doing bioinformatics. I’m looking forward to being a bioinformatician in a microbiology laboratory, and I think computational and wet-lab skills complement each other.

I’ll be working on comparative genomics of Pseudomonas species isolated from cave environments. Microbes growing in caves often have limited rRNA similarity with existing characterised microbes, and a low rRNA similarity is indicative of a correspondingly low genome similarity. Several Pseudomonas soil and human pathogen species have already been sequenced, which means the genomes of cave Pseudomonas can be compared with the genomes of sequenced Pseudomonas species.

My PhD at Manchester was in the area of systems biology and molecular evolution, so the position here will be a chance for me to learn new skills in genome assembly and annotation. As this is my first post-doc, I think it’s also important for me to publish and establish my career as a scientist. Furthermore, I will have to think about what direction I want to take my career in and what I want to spend the next several years doing.

October 12, 2008

ActiveRecord and extremely large tables or queries

As datasets get larger and larger, ActiveRecord has performance problems when iterating over the data. Consider this simple example, which iterates over a table of genes and prints out the name of each. This is a common use case for accessing database rows.

Gene.all.each {|gene| puts gene.name}

Unfortunately, if the gene table contains millions of rows, ActiveRecord will try to pull all of the rows into memory first, and then iterate over them as an array. Pulling so much data into memory will make the process take a very long time. The solution to this problem is to pull smaller chunks of the data in at a time, and then iterate over each of these chunks in sequence.

This is the type of solution provided by paginating_find, which uses the SQL commands of LIMIT and OFFSET to pull smaller chunks of the table into memory. The advantage of using LIMIT and OFFSET is that they are (I believe) database agnostic and so will work across any database.
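
To make the LIMIT and OFFSET idea concrete, here is a minimal sketch of the chunking loop. It is not the code from paginating_find. The method name each_in_offset_chunks and the fetch_page callable are hypothetical, and an in-memory array stands in for the gene table so that the sketch runs without a database.

```ruby
# Iterate over a large table in fixed-size chunks using LIMIT and OFFSET.
# `fetch_page` is a hypothetical callable taking (limit, offset) and returning
# an array of rows. With ActiveRecord 2.x it might wrap something like:
#   Gene.find(:all, :order => 'id', :limit => limit, :offset => offset)
def each_in_offset_chunks(fetch_page, chunk_size = 1000)
  offset = 0
  loop do
    rows = fetch_page.call(chunk_size, offset)
    break if rows.empty?
    rows.each { |row| yield row }
    offset += rows.size
  end
end

# Stand-in for a database table so the sketch runs without a database.
GENES = (1..2500).map { |i| "gene_#{i}" }
PAGE_FROM_ARRAY = lambda { |limit, offset| GENES[offset, limit] || [] }

count = 0
each_in_offset_chunks(PAGE_FROM_ARRAY, 1000) { |_name| count += 1 }
puts count
```

Only one chunk of rows is held in memory at a time, at the cost of one query per chunk.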

As discussed in the comments of this post on Jamis Buck’s blog, using OFFSET requires the DB engine to search linearly through the records to find the point at which the chunk of returned data begins. Using OFFSET, you may therefore expect the time taken to return the data to increase in proportion to the size of the dataset. A (possibly MySQL specific) solution described by both Jamis Buck and Michael Schuerig splits the dataset into smaller chunks based on the primary key. Since the primary key is indexed, the DB should find the correct place to start the next chunk of rows much faster.

Update

Michael Schuerig has also pointed out that his plugin accepts ActiveRecord syntax, so that table joins and conditions can be given, and then the returned data iterated over in smaller chunks.

SQL queries

The two solutions above describe iterating over single tables using ActiveRecord syntax, but what if you want to feed ActiveRecord a complex SQL query and then iterate over the results in chunks? This was the question I asked the UK North West Ruby Users Group, and I was given a neat solution that relies on the ActiveRecord connection method. If you use ActiveRecord::Base.connection.execute(statement), the data is not returned in bulk but can instead be pulled from the database one row at a time. The only drawback is that the method for finding the headers of the returned data is database adaptor specific; in the case of MySQL, the method is fetch_fields, called on the returned data object.

Native Drivers

I think it’s worth pointing out, in a discussion about Ruby and database access, that installing the native driver gem for a given database can result in a performance increase. For example, if you’re using MySQL, this would be

sudo gem install mysql

I believe there are similar implementations for other database adaptors, which a Google search should find.

Example iterator using id column
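
A minimal sketch of the idea, assuming an integer primary key named id with positive values. This is not the code from either of the plugins mentioned above. The method name each_by_id and the fetch_after callable are hypothetical, and a small in-memory table stands in for the database.

```ruby
# Iterate over rows in chunks, using the indexed primary key to find where
# each chunk starts instead of OFFSET.
# `fetch_after` is a hypothetical callable taking (last_id, limit) and
# returning rows with id > last_id, ordered by id. With ActiveRecord 2.x it
# might wrap something like:
#   Gene.find(:all, :conditions => ['id > ?', last_id],
#             :order => 'id', :limit => limit)
def each_by_id(fetch_after, chunk_size = 1000)
  last_id = 0 # assumes ids are positive integers
  loop do
    rows = fetch_after.call(last_id, chunk_size)
    break if rows.empty?
    rows.each { |row| yield row }
    last_id = rows.last.id
  end
end

# In-memory stand-in for the genes table; ids need not be contiguous.
GeneRow = Struct.new(:id, :name)
GENE_ROWS = [1, 2, 5, 8, 13, 21, 34].map { |i| GeneRow.new(i, "gene_#{i}") }
FETCH_AFTER = lambda do |last_id, limit|
  GENE_ROWS.select { |g| g.id > last_id }.first(limit)
end

each_by_id(FETCH_AFTER, 3) { |gene| puts gene.name }
```

Because each chunk starts from the last id seen rather than from a row count, gaps left in the id column by deleted rows do not cause rows to be skipped or repeated.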

Dump an SQL query to file, using execute.
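
A minimal sketch of this, written against a stand-in result so that it runs without a database. The method name dump_result is hypothetical. I am assuming, as described above for MySQL, that the object returned by execute yields one row (an array of values) at a time from each, and that the column names come from fetch_fields.

```ruby
require 'stringio'

# Write a query result to `io` as tab-separated text, one row at a time.
# `result` is anything whose `each` yields one row (an array of values);
# `headers` is an array of column names.
# Assumption: with the MySQL adaptor this would be roughly
#   result  = ActiveRecord::Base.connection.execute(sql)
#   headers = result.fetch_fields.map { |field| field.name }
def dump_result(result, headers, io)
  io.puts headers.join("\t")
  rows = 0
  result.each do |row|
    # nil (SQL NULL) is written as an empty field
    io.puts row.map { |value| value.to_s }.join("\t")
    rows += 1
  end
  rows
end

# Stand-in result set so the sketch runs without a database.
FAKE_RESULT = [[1, "dnaA", 1401], [2, "recA", nil]]

out = StringIO.new
dump_result(FAKE_RESULT, %w[id name length], out)
puts out.string
```

To write to disk, pass an open file instead of the StringIO, for example File.open('genes.tsv', 'w') { |f| dump_result(result, headers, f) }. Values containing tabs or newlines are not escaped in this sketch.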

September 29, 2008

I wish there was Ruby on Rails for data

I’d like to think that learning Ruby on Rails has benefited my research. I’m certain that ActiveRecord has made it much, much easier to bridge the gap between my code and my database. I think validations make it easier for me to weed out bad data points in large data sets. I know for sure that RSpec has made it easy for me to test for every bug that I can think of in my code.

My nagging worry is that Rails was primarily designed for building web applications with a ‘nice’ sized dataset in the database. I can’t really say what a nice sized dataset is, but I can guess that it is not 14 million rows. The difference between what Rails was designed for and my use of it for bioinformatics is highlighted when I need to search for information about creating a certain type of spec, versus information about processing an ActiveRecord model over a cluster. I think that data processing, such as statistics, analysis, and plotting, is where the gap lies between using Rails for its original purpose of building web applications and subverting it to create a framework for a data-centric project.