March 1, 2010

First look at Pseudomonas fluorescens sequencing results

We have received the 454 results for our two samples thanks to the University of Kentucky AGCT sequencing centre. The genomes sequenced were from two P. fluorescens isolates, R124 and KY485, cultured from two separate cave sites. The relationship of these cave strains to other P. fluorescens strains is shown below in a phylogenetic tree constructed from a 16S ribosomal gene alignment. The tree highlights how the cave strains we are sequencing relate to those already sequenced or being sequenced.

Pseudomonas fluorescens 16S phylogenetic tree

The current genomic scaffolds for both the R124 and KY485 strains are available on GitHub. I’m going to update the repositories as gaps are closed and the genomes annotated. The initial sequencing results show the genomes of both isolates are larger than we expected. The predicted genome size and coverage from the Roche GS De Novo Assembler (newbler) run for each strain are illustrated in the chart below (see here for the R code and data). The figure includes the genome sizes of already sequenced P. fluorescens isolates as references.

Genomic coverage of Pseudomonas fluorescens sequencing

Genome size

The graph shows that both cave genomes appear larger than the existing P. fluorescens genomes. The R124 strain is predicted to be marginally larger, by ~0.3 Mbp, than the largest already sequenced P. fluorescens genome, while the KY485 strain is much larger, by >4 Mbp. The sequence data is relatively fresh, however, so we expect the estimated genome sizes to change as we try to generate a complete build. Furthermore, the current data may contain plasmid sequences, which would inflate the size estimates.

Sequencing coverage

The unexpectedly large size of each genome resulted in less coverage than we hoped. The proportion of each predicted genome contained in scaffolds is highlighted by the darker grey bars in the bar chart above. The R124 assembly has a reasonable ~85% of the predicted genome at 22X coverage. However, we have only ~44% of the KY485 genome at 17X coverage, less than half the genome. A large portion of the KY485 genome is therefore still unknown.
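To illustrate why a larger-than-expected genome means lower coverage, here is a minimal Python sketch. Average depth is simply the total number of sequenced bases divided by the genome size, so the same run spread over a bigger genome gives fewer reads per base. The function names and all the numbers in the usage lines are hypothetical, chosen only for illustration; they are not our actual read totals or genome sizes.

```python
def mean_depth(total_read_bases, genome_size_bp):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_read_bases / genome_size_bp


def fraction_in_scaffolds(scaffold_bases, predicted_genome_bp):
    """Fraction of the predicted genome that is present in scaffolds."""
    if predicted_genome_bp <= 0:
        raise ValueError("predicted_genome_bp must be positive")
    return scaffold_bases / predicted_genome_bp


# Hypothetical numbers for illustration only, NOT our real data.
total_read_bases = 140_000_000

print(mean_depth(total_read_bases, 7_000_000))      # 20.0 for the smaller genome
print(mean_depth(total_read_bases, 8_000_000))      # 17.5 for the larger genome
print(fraction_in_scaffolds(3_400_000, 4_000_000))  # 0.85
```

The same amount of sequence gives 20X on a 7 Mbp genome but only 17.5X on an 8 Mbp genome, which is the effect described above.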

Next step

Over the next few weeks we will try to bridge gaps in the smaller of the two genomes using PCR and traditional sequencing. I’ll also try to estimate the size of the gaps in each genome assembly using other P. fluorescens genomes as a reference, and to determine whether any differences in GC content suggest the presence of plasmids in the sequencing data.
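The GC content check could look something like the Python sketch below. This is only a rough outline under stated assumptions, not the analysis we have run: the function names, the toy scaffold sequences and the cut-off of 0.05 (5 percentage points from the median) are all made up for illustration. A scaffold with unusual GC content is only a hint of plasmid or other foreign DNA, not proof.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def flag_gc_outliers(scaffolds, threshold=0.05):
    """Names of scaffolds whose GC fraction is more than `threshold` from the median.

    `scaffolds` maps scaffold name -> sequence string. The threshold is an
    arbitrary illustrative cut-off, not a validated value.
    """
    if not scaffolds:
        return []
    gc = {name: gc_fraction(seq) for name, seq in scaffolds.items()}
    centre = median(gc.values())
    return sorted(name for name, value in gc.items() if abs(value - centre) > threshold)


# Toy sequences for illustration only; real scaffolds would be read from the assembly.
toy_scaffolds = {
    "scaffold_1": "GCGCGCATGC",
    "scaffold_2": "GCGGCCATGC",
    "scaffold_3": "ATATATATGC",
}
print(flag_gc_outliers(toy_scaffolds))  # ['scaffold_3']
```

Comparing against the median rather than the mean keeps a few odd scaffolds from shifting the baseline they are being compared to.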

February 15, 2010

A combined approach to genome sequencing

The aim of this research project is to sequence the genomes of four Pseudomonas fluorescens isolates from two different cave environments. The resulting genome sequences will allow us to estimate the selective pressure on the same species in two caves and then understand the overall selection pressures in nutrient-starved caves. The P. fluorescens genomes already available for Pf-5, SBW25 and Pf0-1 will also allow comparative genomics across completely different environments and further analysis of the P. fluorescens pan-genome. I’ll outline our sequencing strategy below.

De Novo genome sequencing

If genomes are already available for P. fluorescens strains, the obvious choice would be comparative sequencing and assembly. However, comparative analysis of three P. fluorescens strains shows that their genomes are much more dissimilar than might be expected for strains of the same species. It is therefore likely that the genomes of our cave strains will be similarly unrelated to existing sequences.

Based on this we’re going to use 454 sequencing and perform de novo assembly of two P. fluorescens isolates, one from each of two different cave sites. We would have preferred to sequence four genomes, but this was ruled out by the low coverage per genome (~9X) and the additional sample preparation costs, which I’ve discussed previously. Instead we’ll sequence two genomes on a 454 plate split in half using the standard rubber gasket. Each sample will be prepared as a combination of paired-end and shotgun reads. This will provide uniform 35-40X coverage for each genome, with the paired reads improving assembly.

Comparative genome assembly

We ruled out comparative assembly using sequencing that produces short reads in high numbers, such as Illumina or AB SOLiD, because of the anticipated low sequence identity between P. fluorescens strains. Nevertheless, once we have two de novo assembled cave genomes, we hope to do comparative assembly of other cave isolates using these as scaffolds. Previous P. fluorescens genomes show low sequence identity, but we hope that isolates of the same species from the same site will be similar enough for one genome to act as a scaffold for the comparative assembly of a second.

Therefore we will take two further isolates of the same species from each site and send them for AB SOLiD sequencing. This will provide 90X coverage for each genome at a much lower price than 454 sequencing. Hopefully the combination of 454 and AB SOLiD will produce large amounts of data to compare variability between strains and sites.

January 5, 2010

Genomics in a small microbiology lab

My post-doc is doing genomics of micro-organisms from starved cave environments. Several universities in the Kentucky area have banded together to get a sequencer, which allows a small microbiology lab like ours to do sequencing for a few thousand dollars. The biology department here doesn’t have the dedicated computing cluster required for genomic assembly and analysis. However, the availability of on-demand computing resources means this isn’t a problem, as we can rent a virtual machine with 64GB of RAM by the hour. The only bottleneck in my project will therefore be my ability to formulate a research question and properly analyse the genomic data.

The availability of cheaper sequencing and by-the-hour computer time means that smaller research laboratories are no longer restricted in their ability to do genomics. It’s not hard to imagine that a few years ago sequencing costs put novel genomics out of reach for most labs, and that only labs at large institutions had access to dedicated computing facilities. From my experience of moving from a large to a small university, it seems the financial and infrastructure barriers to doing genomics are now much lower. Genomics, in microbes at least, can now be carried out by hundreds of smaller labs instead of being clustered at a few large sequencing centres and universities.

I remember when I started my master’s five years ago that most papers began by discussing the "explosion of sequence data", but I think the availability of cheaper sequencing means that the explosion is just beginning. Now is a great time to be a bioinformatician: sequencing and computational power are much easier to access, and the problem will be finding people who can manage and process the data.