February 15, 2010

A combined approach to genome sequencing

The aim of this research project is to sequence the genomes of four Pseudomonas fluorescens isolates from two different cave environments. The resulting genome sequences will allow us to estimate the selective pressure on the same species in two caves and then understand the overall selection pressures in nutrient-starved caves. The P. fluorescens genomes already available for Pf-5, SBW25 and Pf0-1 will also allow comparative genomics across completely different environments and further analysis of the P. fluorescens pan-genome. I’ll outline our sequencing strategy below.

De Novo genome sequencing

If there are genomes already available for P. fluorescens strains then the obvious choice should be reference-based (comparative) sequencing and assembly. However, comparative analysis of the three sequenced P. fluorescens strains shows that their genomes are much more dissimilar than might be expected for strains of the same species. Therefore it is likely that the genomes of our cave strains will be similarly unrelated to existing sequences.

Based on this we’re going to use 454 sequencing and perform de novo assembly of two P. fluorescens isolates; we’ll sequence a single isolate from two different cave sites. We would have preferred to have sequenced four genomes but this was ruled out by the low coverage per genome (~9X) and additional sample preparation costs which I’ve discussed previously. So instead we’ll sequence two genomes on a 454 plate split in half using the standard rubber gasket. Each sample will be prepared as a combination of paired-end and shotgun reads. This will provide uniform 35-40X coverage for each genome with the paired reads to improve assembly.
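
As a rough sketch of the arithmetic behind a coverage figure like this, average depth is simply the total number of sequenced bases divided by the genome length. The read count, read length and genome size below are illustrative assumptions, not figures from our sequencing facility, and the function name is made up for this example.

```python
def expected_coverage(n_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * mean_read_length / genome_size


# Hypothetical inputs (assumptions for illustration only):
READS_PER_HALF_PLATE = 600_000   # assumed number of reads from half a plate
MEAN_READ_LENGTH = 400           # assumed mean read length in bases
GENOME_SIZE = 6_700_000          # assumed genome size in bases

depth = expected_coverage(READS_PER_HALF_PLATE, MEAN_READ_LENGTH, GENOME_SIZE)
print(f"Expected average coverage: {depth:.1f}X")
```

This is an average only; real coverage is uneven across the genome, which is one reason the paired-end reads matter for assembly.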

Comparative genome assembly

We ruled out comparative assembly using short-read but high-throughput sequencing, such as Illumina or AB SOLiD, because of the anticipated low sequence identity between P. fluorescens strains. Nevertheless, once we have two de novo assembled cave genomes, we hope to do comparative assembly of other cave isolates using these as scaffolds. Previous P. fluorescens genomes show low sequence identity, but we hope that isolates of the same species from the same site will have enough genome similarity to allow one genome to act as a scaffold for the comparative assembly of a second.

Therefore we will take two further isolates of the same species from each site and send them for AB SOLiD sequencing. This will provide 90X coverage for each genome at a much lower price than 454 sequencing. Hopefully the combination of 454 and AB SOLiD will produce large amounts of data to compare variability between strains and sites.

February 11, 2010

Adapting to generating my own data

I’ve spent the last five years as a computational scientist and my research begins with pulling data out of files. I’m far removed from the laboratories that generated the data in the first place. This past month, however, I’ve had to learn about, and make decisions on, producing enough data to generate a complete genome. Prior to starting this post-doc in November 2009 I assumed that second generation sequencing easily allowed small labs like ours to obtain complete genome sequences. The reality, however, involves problems I would never have considered.

For example, paired-end sequencing is useful for assembling sequence reads into a de novo scaffold because the distance between each pair of reads is known. However, the extra effort required to prepare a paired-end sample results in an extra cost of a couple of thousand dollars. The expenses involved in research, and how they affect the project outline, are not something I have had to consider before because all I usually need is a computer and a desk.

Apart from cost we also have to decide how many genomes we want to sequence and how to do this on a single 454 plate. One approach is to use a rubber gasket to divide the plate into 2, 4, 8, or 16 sections and allocate a single sample to each section. The downside of this approach is that the gasket covers some of the sequencing wells on the plate, and therefore the more the plate is divided, the less sequencing capacity is available. The alternative to the rubber gasket is to label each sample with molecular barcodes; however, this will incur more costs because of the additional sample preparation.

When determining how many genomes to sequence we also had to consider the amount of sequence coverage for each genome. As we try to sequence more genomes there is less read depth for each individual genome and therefore each genome is harder to assemble. This is a constraint on our research aim of sequencing four Pseudomonas fluorescens isolates. This means our choice of research question is a balance of what we can theoretically achieve given the costs and amount of sequencing coverage available on a 454 plate.
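
The trade-off can be sketched numerically. The example below assumes a hypothetical total plate yield and a made-up linear model in which every gasket section costs a fixed fraction of the plate's capacity; neither number comes from the sequencing facility. It also uses the standard Poisson (Lander-Waterman) approximation that a fraction e^(-c) of bases receives no reads at average coverage c.

```python
import math


def coverage_per_genome(plate_bases, n_sections, genome_size, loss_per_section=0.0):
    """Expected depth per genome when a plate is split into equal sections.

    loss_per_section is a hypothetical fraction of total plate capacity
    hidden by the gasket for each section created (simple linear model).
    """
    usable_fraction = max(0.0, 1.0 - loss_per_section * n_sections)
    return plate_bases * usable_fraction / n_sections / genome_size


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read."""
    return math.exp(-coverage)


# Illustrative assumptions only:
PLATE_BASES = 500_000_000   # assumed yield of a whole plate, in bases
GENOME_SIZE = 6_700_000     # assumed genome size, in bases
LOSS_PER_SECTION = 0.03     # assumed capacity lost per gasket section

for sections in (2, 4, 8, 16):
    depth = coverage_per_genome(PLATE_BASES, sections, GENOME_SIZE, LOSS_PER_SECTION)
    gaps = expected_uncovered_fraction(depth)
    print(f"{sections:>2} sections: {depth:5.1f}X per genome, "
          f"expected uncovered fraction {gaps:.2e}")
```

Under these assumptions the depth per genome falls faster than simple division would suggest, because each extra section both shares the plate and hides more wells.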

Choices

I’m writing this from the point of view of my initial surprise at the difficulties of planning a sequencing project rather than to complain. The people who will do the sequencing for us have been very helpful. Also, it’s cheaper second generation sequencing that has made this research project possible.

January 5, 2010

Genomics in a small microbiology lab

My post-doc is doing genomics of micro-organisms from starved cave environments. Several universities in the Kentucky area have banded together to get a sequencer, which allows a small microbiology lab like ours to do sequencing for a few thousand dollars. The biology department here doesn’t have the dedicated computing cluster required for genomic assembly and analysis; however, the availability of on-demand computing resources means this isn’t a problem, as we can rent a virtual machine with 64GB of RAM by the hour. The only bottleneck in my project will therefore be my ability to formulate a research question and properly analyse the genomic data.

The availability of cheaper sequencing and by-the-hour computer time means that smaller research laboratories are no longer restricted in their ability to do genomics. It’s not hard to imagine that, a few years ago, sequencing costs put novel genomics out of reach for most labs, while only labs at large institutions had access to dedicated computing facilities. From my experience of moving from a large to a small university, it seems the financial and infrastructure barriers to doing genomics are now much lower. Genomics, in microbes at least, can now be carried out by hundreds of smaller labs instead of being clustered at a few large sequencing centres and universities.

I remember when I started doing my master’s five years ago that most papers began by discussing the "explosion of sequence data", but I think the availability of cheaper sequencing means that the explosion is just beginning. Now is a great time to be a bioinformatician: sequencing and computational power are now much easier to access and the problem will be finding people who can manage and process the data.

June 4, 2008

Using Github, Lighthouse, and Twitter in my research

I think git is great, and I now use git instead of Subversion to version my research. Github is the natural place to host a git repository and so that’s what I’ve been doing with my latest research project. A lot of big Ruby projects use Lighthouse to track bugs, feature requests, and things like this, so I thought I would try out how useful Lighthouse is for managing my bioinformatics research. So far I can say that it has been pretty handy: whenever I have an idea or todo for my research I can log it as a ticket on my Lighthouse page, like an online todo list for software development. I know that there are plenty of other systems such as Trac and Bugzilla, and I haven’t tried any of these, but for me Lighthouse is simple to use and does the job.

What’s also great is the integration between Github and Lighthouse, where I mark up my Github commit messages to indicate that the patch I’m sending solves a particular ticket on Lighthouse. Github will understand this, and automatically send the update to Lighthouse for the corresponding ticket. When the ticket status is updated, a link is automatically added pointing back to the git commit for the patch. As an aside, Github also integrates with Twitter, so my commit messages and a link to the patch are automatically sent to Twitter, without me having to write a custom bash script.
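
To make the commit markup concrete, here is a minimal sketch of how such a tag could be pulled out of a commit message. It assumes a bracketed form like `[#12 state:resolved]` (a ticket number followed by optional key:value pairs); check the Lighthouse documentation for the syntax it actually accepts. The function name and regular expression are my own illustration, not part of Github or Lighthouse.

```python
import re

# Assumed tag shape: "[#<ticket number> key:value key:value ...]"
TICKET_TAG = re.compile(r"\[#(\d+)((?:\s+\w+:[^\s\]]+)*)\s*\]")


def parse_ticket_updates(commit_message):
    """Return a list of (ticket_number, {field: value}) found in a message."""
    updates = []
    for match in TICKET_TAG.finditer(commit_message):
        ticket_number = int(match.group(1))
        # Split "state:resolved tagged:parser" into a dict of fields.
        fields = dict(pair.split(":", 1) for pair in match.group(2).split())
        updates.append((ticket_number, fields))
    return updates


# Hypothetical commit message for illustration:
message = "Fix codon table parsing [#12 state:resolved]"
print(parse_ticket_updates(message))
```

A hook on the hosting side would do something similar before forwarding the update to the ticket tracker.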

I know this sort of setup won’t be to everyone’s taste, but I thought it worth mentioning that my experience with these services so far has been positive.

February 27, 2008

Persistence of stale results on the web

I wrote in a previous post that, in yeast, the codon adaptation of a transcript explains ~40% of the variation in expression levels. Extending this analysis to a protein data set from the same experiment, I stated that codon adaptation explains, by contrast, very little of the variation in expression. More recently, to verify this result, we analysed codon adaptation trends in a second, independent protein data set. In this data set codon adaptation does appear to be significant, again explaining approximately 40% of the variation. Therefore in one data set measured protein expression is unrelated to codon adaptation, while in the second it is related.
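
For readers unfamiliar with the phrase, "explains ~40% of the variation" is usually a statement about the coefficient of determination (R squared) of a simple linear fit. The sketch below assumes that reading, which may differ from the exact statistic used in the manuscript, and the numbers in it are toy values invented purely to show the calculation, not real expression data.

```python
def r_squared(x, y):
    """Squared Pearson correlation: the fraction of variance in y
    accounted for by a straight-line fit on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return sxy ** 2 / (sxx * syy)


# Toy values (invented for illustration only):
codon_adaptation = [0.1, 0.2, 0.4, 0.5, 0.7, 0.8]
log_expression = [1.0, 2.5, 1.5, 3.5, 2.0, 4.0]
print(f"R squared = {r_squared(codon_adaptation, log_expression):.2f}")
```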

This disparity in results is not the reason I’m writing this post. Rather, last week while I was revising the manuscript I was looking for a reference on codon adaptation in yeast. A Google search for this finds the post I’ve just described, where I state that the codon adaptation index (CAI) is not important for protein expression. Having your own blog appear in a Google search is rather surprising, even more so when, in hindsight, I know the posted findings are not completely correct. If someone else did a similar search and found my post, how seriously would they consider the results? Probably not very, since the data is on a blog and not in a journal; but it has made me think about posting results that further investigation disproves, yet which still persist on the web in a Google-indexed blog post.

Once my manuscript is (hopefully) published, I will go back through all my previous blog posts and link them to the article. This way, anyone finding this blog or any of its posts will be directed to the manuscript, which describes the final findings more accurately and in more detail, after peer review.