May 10, 2010

Searching further for genomic islands using variance in tetranucleotide usage

Identifying genomic islands to understand the likely degree of mutualism in nutrient-starved environments is one of the main aims of our research. I previously tried to identify genomic islands using multivariate statistics of tetranucleotide sequence composition.

I’ve since tried a related approach that examines local versus global variance in tetranucleotide usage. This analysis also suggested some possible genomic islands in the Pseudomonas fluorescens R124 genome: regions where the localised variance in tetranucleotide usage diverges from that of the rest of the genome. One interesting result was that a genome scaffold with no sequence similarity to any reference genome did not show any unusual tetranucleotide variance. This raises further questions about the origin of this region and the possible functionality encoded within it.
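As a rough illustration of the local-versus-global idea, here is a minimal Python sketch. It is not the analysis used for R124: the scoring (mean squared deviation of a window's tetranucleotide frequencies from the genome-wide frequencies), the window and step sizes, the z-score cutoff, and the function names are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

# All 256 tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_freqs(seq):
    """Relative frequency of each tetranucleotide (overlapping counts)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Tetramers containing N or other ambiguity codes are ignored.
    total = sum(counts[t] for t in TETRAMERS) or 1
    return [counts[t] / total for t in TETRAMERS]

def divergence_scores(genome, window=5000, step=1000):
    """Score each window by its mean squared deviation from genome-wide usage.

    Assumed scoring scheme, for illustration only.
    """
    global_f = tetra_freqs(genome)
    scores = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        local_f = tetra_freqs(genome[start:start + window])
        msd = mean((l - g) ** 2 for l, g in zip(local_f, global_f))
        scores.append((start, msd))
    return scores

def flag_outliers(scores, z_cutoff=2.0):
    """Return start positions of windows whose score is an outlier (assumed cutoff)."""
    values = [s for _, s in scores]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [start for start, s in scores if (s - mu) / sigma > z_cutoff]
```

Windows flagged this way are only candidates. As the unremarkable result for the unaligned scaffold shows, a region can lack similarity to reference genomes and still have ordinary tetranucleotide usage.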

April 21, 2010

Looking for genomic islands through variability in tetranucleotide usage

I initially thought our Pseudomonas fluorescens sequencing data might contain plasmid DNA because one of the sequence scaffolds showed no sequence similarity to other P. fluorescens genomes. Morgan Langille, however, pointed out in the comments that this sequence could just as easily be the result of a horizontal gene transfer event.

I’ve been trying to learn about genomic islands in microbes, and recently I’ve been looking at identifying them through unusual frequencies in tetranucleotide usage. The theory is that horizontally transferred DNA will show different tetranucleotide usage compared with the rest of the genome. My analysis was based on that of Dick et al., who used self-organising maps to look for genomic islands in whole-community sequence data. My initial results suggest some regions may be genomic islands, but further work will be needed. Next I’m going to look at differences in local and global tetranucleotide variance, which also seems useful for identifying genomic islands.
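To make the idea of tetranucleotide usage concrete, here is a minimal Python sketch that builds a 256-element frequency vector for each sliding window of a sequence. It covers only the feature-extraction step; the self-organising map clustering of Dick et al. is not reproduced. The window and step sizes and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_freqs(seq):
    """Return the relative frequency of each tetranucleotide (overlapping counts)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only unambiguous tetramers are counted, so those containing N are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]

def window_profiles(seq, window=5000, step=1000):
    """Yield (start, frequency vector) for each sliding window (assumed sizes)."""
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        yield start, tetra_freqs(seq[start:start + window])
```

Each vector can then be passed to whatever clustering or ordination method is preferred. Windows whose vectors sit far from the bulk of the genome are the candidate islands.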

March 15, 2010

Discovering a plasmid in our sequence data

Last week I determined the likely order of our Pseudomonas fluorescens R124 sequencing scaffolds by mapping them onto reference genomes from the same species. This mapping also indicated that two of the sequence scaffolds (5 and 8) didn’t align (see this figure) and therefore may not be part of the genome assembly. The next logical step was to find out what type of sequence these scaffolds represented.

A megablast search showed scaffold 5 did align to reference P. fluorescens genomes, which was surprising since, as I wrote above, scaffold 5 did not appear to be part of the assembly. On closer inspection, however, scaffold 5 is only ~5Kb in size while the scaffold map I produced was on a megabase scale. Scaffold 5 was therefore just too small to be seen by eye next to the other, much larger scaffolds.

The blast search using scaffold 8 returned a more interesting result. The best hit was a plasmid in Pseudomonas syringae pv. phaseolicola. The alignment between scaffold 8 and the plasmid is shown below, where the plasmid open reading frames are shown in red and the aligned scaffold 8 regions are shown in blue.

This result indicates that the likely reason scaffold 8 does not align to any of the reference genomes is that it is plasmid in origin rather than genomic. A further blastx search with this scaffold identified four regions with sequence similarity to known proteins: conjugal transfer proteins involved in the transfer of genetic material, topoisomerases involved in unwinding DNA, and relaxases and replicases, which are likely to be involved in plasmid replication. A fifth type of protein may be related to Type IV (DNA or protein) secretion, but the functional annotation of these hits was less clear. The blastx result is shown in the image below.

I’m still learning microbial genomics, and I suspect it’s unsurprising to discover a plasmid containing sequence similarity to genes involved in replication and transfer. What does spark my interest is that the above blast image shows the rest of the plasmid does not appear in the first 100 results returned by blast. This might indicate there is relatively novel data, with low sequence similarity to known genes, waiting to be analysed.

UPDATE: Morgan Langille has rightly pointed out in a comment below that scaffold 8 could have low sequence similarity and still be part of the R124 genome if it’s an inserted genomic island.

March 8, 2010

Estimating genome scaffold order using reference genomes

The P. fluorescens genome sequencing results arrived last week, and so far I’ve been looking at how we can begin to assemble the scaffolds from R124, the smaller of the two genomes, into a complete draft. There are small holes in the scaffolds which we will PCR across, but the harder task is to cross the gaps between scaffolds, which could be tens or hundreds of thousands of kilobases long.

There are genomes available for other strains of the P. fluorescens species, so these can be used as a template to determine the order of our R124 scaffolds. I initially tried blasting the scaffolds against a reference genome and plotting the density of blast hits. However, when I posted these results on FriendFeed, Max pointed out this plot was difficult to interpret and Rob Syme suggested using mummer.

Mummer is a software package for aligning genomes and so I used the nucmer part of the package to compare the R124 scaffolds against the reference genomes of three other P. fluorescens strains. The plot below visualises each nucmer alignment match. This figure indicates the possible order of the scaffolds and also suggests that scaffold 5 (last row) does not appear in any of the reference genomes.

Nucmer Scaffold Alignment of R124 Scaffolds

I also visualised the nucmer results as a dotplot between the R124 scaffolds and the reference strains. This plot indicates the likely orientation of the scaffolds and also suggests possible rearrangements in scaffold 3 (purple) and scaffold 7 (yellow) versus the reference strains – a result which I find rather interesting.

March 3, 2010

Find repeating N regions in a fasta file with bioruby

Useful for designing primers to PCR across gaps in genome contigs
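The same idea can be sketched in plain Python rather than BioRuby. This is an illustrative stand-in, not the script from the original post: the small FASTA reader and the function names are assumptions, and coordinates are reported as 0-based, half-open intervals.

```python
import re

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def n_runs(seq, min_len=1):
    """Return (start, end) spans of runs of N at least min_len long."""
    return [m.span() for m in re.finditer(r"[Nn]+", seq)
            if m.end() - m.start() >= min_len]
```

Passing an open file handle to `read_fasta` and calling `n_runs` on each sequence gives the gap coordinates, and the sequence on either side of each gap is where primers would be designed.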