February 15, 2010

A combined approach to genome sequencing

The aim of this research project is to sequence the genomes of four Pseudomonas fluorescens isolates from two different cave environments. The resulting genome sequences will allow us to estimate the selective pressure on the same species in two caves and then understand the overall selection pressures in nutrient-starved caves. The P. fluorescens genomes already available for Pf-5, SBW25 and Pf0-1 will also allow comparative genomics across completely different environments and further analysis of the P. fluorescens pan-genome. I’ll outline our sequencing strategy below.

De Novo genome sequencing

If there are genomes already available for P. fluorescens strains then the obvious choice should be comparative sequencing and assembly. However, comparative analysis of three P. fluorescens strains shows that their genomes are much more dissimilar than might be expected for strains of the same species. Therefore it is likely that the genomes of our cave strains will be similarly unrelated to existing sequences.

Based on this we’re going to use 454 sequencing and perform de novo assembly of two P. fluorescens isolates; we’ll sequence a single isolate from each of two different cave sites. We would have preferred to sequence four genomes but this was ruled out by the low coverage per genome (~9X) and the additional sample preparation costs, which I’ve discussed previously. So instead we’ll sequence two genomes on a 454 plate split in half using the standard rubber gasket. Each sample will be prepared as a combination of paired-end and shotgun reads. This will provide uniform 35-40X coverage for each genome, with the paired reads to improve assembly.
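As a rough illustration of how coverage per genome falls as more genomes share a plate, here is a minimal sketch. The plate yield and genome size are placeholder values I have assumed for the example; they are not our actual figures and are not meant to reproduce the ~9X or 35-40X numbers above, which also depend on the mix of paired-end and shotgun libraries.

```python
def expected_coverage(total_bases: float, genome_size_bp: float, n_genomes: int) -> float:
    """Average coverage per genome when one run is shared equally between genomes."""
    if n_genomes < 1 or genome_size_bp <= 0:
        raise ValueError("need at least one genome and a positive genome size")
    return total_bases / (genome_size_bp * n_genomes)


# Placeholder inputs (assumptions for illustration only).
ASSUMED_PLATE_YIELD_BASES = 4.2e8
ASSUMED_GENOME_SIZE_BP = 7e6

for n in (2, 4):
    cov = expected_coverage(ASSUMED_PLATE_YIELD_BASES, ASSUMED_GENOME_SIZE_BP, n)
    print(f"{n} genomes per plate: ~{cov:.0f}X each")
```

Doubling the number of genomes on a plate halves the coverage of each one, which is the trade-off behind choosing two genomes over four.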

Comparative genome assembly

We ruled out comparative assembly using sequencing that produces short reads in high numbers, such as Illumina or AB SOLiD, because of the anticipated low sequence identity between P. fluorescens strains. Nevertheless, once we have two de novo assembled cave genomes, we hope to do comparative assembly of other cave isolates using these as scaffolds. Previous P. fluorescens genomes show a low sequence identity, but we hope that isolates of the same species from the same site will have enough genome similarity to allow one genome to act as a scaffold for the comparative assembly of a second.

Therefore we will take two further isolates of the same species from each site and send them for AB SOLiD sequencing. This will provide 90X coverage for each genome at a much cheaper price compared with 454 sequencing. Hopefully the combination of 454 and AB SOLiD will produce large amounts of data to compare variability between strains and sites.

May 28, 2008

Relationship between amino acid usage and the cost of synthesising the amino acid

For my next research project I’m going to continue to look at the role biosynthetic cost plays in the cell. This time I want to see if the cost of synthesising an amino acid has a role in the rate of protein sequence evolution. A simple assumption is that selection will favour biosynthetically cheaper amino acids in proteins, because there will be less cost to synthesise the protein as a whole. Immediately following this idea, I must also assume that functional selection is a counter pressure, as amino acids will also be selected for their biochemical properties. As an example, an expensive but specific enzyme will probably confer a greater fitness effect than a cheaper but non-specific enzyme.

The first idea for this project is that expensive amino acids will be fixed across orthologs. My reasoning is that expensive amino acids must play an important functional role, otherwise they will be selected out in favour of a cheaper amino acid. In contrast, I believe cheaper amino acids will exhibit varying levels of fixation. This will be because any amino acid may have an important functional role regardless of cost, and therefore be fixed. At the same time though, the selection pressure to minimise biosynthetic cost will mean cheaper amino acids are used instead of expensive ones; the amino acid used doesn’t matter, as long as it is biosynthetically cheap.

One possible way of testing the different pressures of cost versus function is to compare the usage of amino acids inside and outside of protein domains. There will probably be a greater selection pressure for amino acids based on their biochemical properties inside domains, while outside of domains the selection for the use of cheaper amino acids may be greater. This is a crude approximation, as the pressures for both structure and function will be much more subtle than just whether the amino acid is inside a domain or not; however, with a large dataset I think it should be possible to see if there are any tangible effects. This is related to what Pedro discussed, where he speculated that the expansion of peptide binding domains may have a relationship with cost.

Another factor I would like to consider is the relationship with expression level. For example, saving a single ATP molecule per protein by changing an amino acid would be equivalent to saving around 10,000 ATP molecules in the cell for the most highly expressed proteins. Therefore I think that the selection pressure to minimise cost will be greater at high expression levels than at low expression levels.
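The arithmetic behind this is just the per-protein saving multiplied by the number of copies in the cell. A minimal sketch, where the copy number of 10,000 is the figure implied above for a highly expressed protein and the copy number of 10 for a lowly expressed protein is my own assumption for contrast:

```python
def atp_saved_per_cell(atp_saved_per_protein: float, copies_per_cell: int) -> float:
    """Total ATP saved in the cell by one cheaper substitution in one protein."""
    return atp_saved_per_protein * copies_per_cell


HIGH_EXPRESSION_COPIES = 10_000  # implied by the example in the text
LOW_EXPRESSION_COPIES = 10       # hypothetical value, for contrast only

high = atp_saved_per_cell(1, HIGH_EXPRESSION_COPIES)
low = atp_saved_per_cell(1, LOW_EXPRESSION_COPIES)
print(f"high expression: {high} ATP saved, low expression: {low} ATP saved")
print(f"ratio: {high / low:.0f}x")
```

Under these assumed copy numbers the same substitution saves a thousand times more ATP in the highly expressed protein, which is why I expect the pressure to scale with expression level.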

Methods

My initial plan was to run the CodeML tool on a set of protein coding orthologs for species closely related to S. cerevisiae. CodeML derives the amino acid substitution rates for each position in a set of aligned sequences, and I can compare this with the average cost of the amino acid at that position. To control for function, I can use Pfam to identify domains in the orthologs, then compare the average cost and rate of evolution of the amino acids inside and outside of domains. As a possible way to also take account of structure in the analysis, the tool crescendo calculates the expectation of an amino acid at a position given the local environment in the protein. Controlling for expression level should be relatively simple, as there are many yeast gene expression data sets that I can use as a covariate.
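The inside versus outside domain comparison could be sketched roughly as follows. This is only an outline under stated assumptions: the cost table holds made-up placeholder values rather than real biosynthetic costs, the domain coordinates are hypothetical stand-ins for what a Pfam search would report, and the function name is my own.

```python
from statistics import mean

# Placeholder costs per amino acid (NOT real biosynthetic cost values).
PLACEHOLDER_COST = {"G": 1.0, "A": 1.0, "S": 1.0, "L": 2.0, "K": 3.0, "W": 7.0}


def mean_cost_inside_outside(sequence, domains, cost):
    """Return (mean cost inside domains, mean cost outside domains).

    `domains` is a list of (start, end) pairs, 1-based and inclusive.
    Residues missing from `cost` are skipped; an empty group gives None.
    """
    inside_positions = set()
    for start, end in domains:
        inside_positions.update(range(start, end + 1))

    inside, outside = [], []
    for position, residue in enumerate(sequence, start=1):
        if residue not in cost:
            continue
        target = inside if position in inside_positions else outside
        target.append(cost[residue])

    return (mean(inside) if inside else None,
            mean(outside) if outside else None)


# Toy protein with one hypothetical domain covering positions 3-4.
inside_cost, outside_cost = mean_cost_inside_outside("GAWKLS", [(3, 4)], PLACEHOLDER_COST)
print(inside_cost, outside_cost)
```

In the real analysis the same per-position grouping would be applied to the substitution rates as well as the costs, over many orthologs rather than a single toy sequence.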

Image of three index cards describing future work

April 29, 2008

Reflection on a year of (attempted) open notebook science

A year of work on the importance of amino acid biosynthetic cost has led to the submission of a manuscript, and a preprint available on Nature Precedings. The openness in this project was inspired by reading Jean Claude Bradley’s and Cameron Neylon’s blogs about open notebook science. I already believed in the philosophy behind open source software, and I thought that any early feedback would be useful to my research. In addition to any input received, I thought that sharing my research early would be a useful way of contributing back to the community.

The platform I chose was a blog, allowing results to be posted as I produce them. I was already familiar with blogging, and WordPress makes creating and maintaining a blog simple. During the early stages of my project I found it quite useful to blog, as it helped me to clarify my results and ideas while the project was still taking shape. I tried to do this about once a week, on a Friday, and summarise my latest results. Having this record of results was also helpful to refer to when discussing my latest findings. When we were writing the manuscript I also found it useful to browse back through all the entries I had created and include any ideas I had forgotten about. However, as the project progressed blogging became less important, as I had already produced my main findings and was more focused on writing the manuscript.

As for sharing information, I found that writing a summary blog post of my research takes rather a large amount of effort. Furthermore my blog is the only gateway to my research, and results only become available when I make the time and effort to write them up. This therefore doesn’t satisfy Jean Claude Bradley’s criterion of no insider knowledge, but rather could be described as being selectively open about my research. On the positive side, a blog post is a concise summary that distills my most recent progress in a way I hope is easily accessible to a casual reader. Another interesting point is that posting all my results online meant they were indexed by Google, as you would expect, but this also led to some strange occurrences when searching online for material. For example, searching for “Akashi & Gojobori”, a paper I based my work on, brings up two links to my blog ahead of the original manuscript. I find this a bit embarrassing, and I wonder if the paper authors have also encountered this.

With less time to spend on blogging, I also tried to stream my research using Twitter, sending short messages automatically using a bash script every time I committed an SVN update. While this approach takes a lot less effort on my part, I think it sits at the opposite end of the spectrum to blogging, and spews out large amounts of obscure repository check-in messages. Ultimately I think it is of little interest even to someone directly involved in the project.

I’m still interested in open notebook science, though my lack of posting might indicate otherwise. I’m going to continue trying out new methods of sharing bioinformatics research, and the start of a new research project gives me the chance to start afresh in these approaches. My main focus should be passive approaches that build into my workflow without too much effort, but also produce a meaningful summary of the research. Therefore in addition to a blog I think it is important to maintain a summary page of the research, otherwise it may be difficult for people to understand what the point of my research is when they first come across my blog. I think this is similar to the combined wiki and blog format used by Jean Claude Bradley. Having spent some time thinking about how I could implement this, I think a landing page should be readily auto-generated from the results. In my head I’m thinking of a Ruby on Rails type of approach, with a templating library such as HAML and a series of Rake tasks to regenerate the landing page with any new results, as well as send out a Twitter update.

Finally I thought it might be interesting to adopt version numbers for the project, similar to those used in software development. The usual layout is something like 1.2.3. The last number would be used to track simple code edits. The second number would be used to show milestones in the overall project, for example each could correspond to a figure. The first number would then be the manuscript revision. Every time a new manuscript is prepared for submission, this could be updated, where the first manuscript preparation would have the number 1.0.0. Hopefully this type of numbering would make the project easier to track, and interested parties could see if the research has been updated significantly since they last checked.
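A minimal sketch of how such a numbering scheme might be bumped automatically. The function and part names are hypothetical, and resetting the lower numbers to zero when a higher one increases is an assumption borrowed from common software versioning practice rather than something I have settled on.

```python
def bump(version: str, part: str) -> str:
    """Increase one part of a 'manuscript.milestone.edit' version string."""
    manuscript, milestone, edit = (int(number) for number in version.split("."))
    if part == "manuscript":   # new manuscript prepared for submission
        return f"{manuscript + 1}.0.0"
    if part == "milestone":    # e.g. a new figure completed
        return f"{manuscript}.{milestone + 1}.0"
    if part == "edit":         # simple code edit
        return f"{manuscript}.{milestone}.{edit + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("1.2.3", "edit"))        # 1.2.4
print(bump("1.2.3", "milestone"))   # 1.3.0
print(bump("1.2.3", "manuscript"))  # 2.0.0
```

A task that regenerates the landing page could call something like this on each commit, so the version shown always reflects the current state of the project.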

In summary, open notebook science has not really had a large positive effect on my research. I think that this is mainly because using a blog alone is not an effective method of communicating scientific progress: first, it requires substantial effort on my part to update, and second, tracking the current state of the research can be difficult. However, I still believe that the principles of open notebook science can be beneficial to my research. In the next couple of months I’ll try some new methods to see what does work.