Identifying genomic islands through tetranucleotide variance
Outline
Horizontally transferred regions may be detected from unusual nucleotide composition in the genome. This analysis used differences between local and global variance in tetranucleotide usage to identify possible genomic islands.
Aims
Identifying genomic islands that indicate horizontal gene transfer and mutualism in cave microbes is one of the main aims of our research. In a previous analysis I used nucleotide frequency to find likely genomic islands. In this analysis I use a second sequence composition approach to identify genomic islands in Pseudomonas fluorescens R124. This method compares the local and global variance in tetranucleotide usage, where genomic islands are regions whose local variance differs from the global variance. Localised differences in tetranucleotide usage therefore suggest that the DNA may be the result of horizontal gene transfer. Descriptions of this method can be found in Reva and Tümmler 2004 and Reva and Tümmler 2005.
Results
Globally and locally normalised tetranucleotide variance was calculated for all scaffolds in the R124 genome using the oligowords software. The oligowords algorithm works as follows. The genomic sequence of each R124 scaffold is divided into fragments using an 8kb sliding window with a 2kb step size. The tetranucleotides in each fragment are then counted using a 4bp sliding window moving in 1bp steps. The sum of squared differences (SSD) is calculated between the observed tetranucleotide frequencies and the frequencies expected given the GC content. The SSD is normalised by the expected variance in tetranucleotide frequencies for both the 8kb fragment (local variance) and the entire scaffold sequence (global variance). Globally and locally normalised estimates of tetranucleotide variance for the P. fluorescens R124 scaffolds are shown in the figure below.
Genomic regions with similar global and local tetranucleotide variance lie on the diagonal in the figure. The figure highlights, however, that some regions in scaffolds 2 and 4 lie off the diagonal and therefore have different estimates of local and global variance; these regions are possible genomic islands. Scaffold 8 does not appear to have unusual tetranucleotide variance even though it shows no sequence similarity with other P. fluorescens genomes. This does not preclude the possibility that scaffold 8 is a genomic island or plasmid; it shows only that scaffold 8 does not appear to have unusual tetranucleotide usage.
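The off-diagonal regions were identified here by eye from the figure. As a rough illustration of how the same criterion could be applied programmatically, the sketch below fits a least-squares line through the (global variance, local variance) points and flags windows whose residual from that line is large. The function name, the input format and the 3 standard deviation cutoff are all assumptions made for this example; none of them come from oligowords or from the analysis above.

```python
from statistics import mean, pstdev

def off_diagonal(windows, n_sd=3.0):
    """Return labels of windows lying far from the fitted diagonal.

    windows: list of (label, global_variance, local_variance) tuples
             (hypothetical input format).
    n_sd:    residual cutoff in standard deviations (arbitrary choice).
    """
    xs = [w[1] for w in windows]
    ys = [w[2] for w in windows]
    mx, my = mean(xs), mean(ys)

    # Ordinary least-squares fit of local variance on global variance.
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return []  # all windows share one global variance; no line to fit
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx

    # Vertical distance of each window from the fitted line.
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sd = pstdev(residuals)
    if sd == 0:
        return []  # every window lies exactly on the line

    return [w[0] for w, r in zip(windows, residuals) if abs(r) > n_sd * sd]
```

Because the flagged windows are themselves included in the fit, a large island can pull the line towards itself and mask other outliers, so any hits from a sketch like this would still need to be checked against the plot.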
One important point to note is that analysis of genome sequence composition only suggests possible genomic islands. Regions of the genome may have atypical composition without being the result of horizontal gene transfer. Examples are ribosomal protein genes, which have specific codon usage for faster translation. Conversely, the sequence composition of horizontally transferred regions may over time be ameliorated to match that of the host genome, which means some HGT regions cannot be detected by composition alone. A full discussion can be found in the review by Lagiell et al.
Methods
The oligowords software v1.6.1.1 was used to divide the R124 scaffolds into fragments using an 8kb sliding window with a 2kb step size. Global and local tetranucleotide relative variance, normalised by mononucleotide frequency, was calculated for each fragment (n1_4mer:GRV + n1_4mer:RV parameters). Full details of the algorithm can be found in the implementation section of Ganesan et al. 2008 (Open Access).
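To make the windowing and counting steps concrete, here is a minimal Python sketch of the first part of the calculation. It is not the oligowords implementation: the function names are invented for this example, expected counts are assumed to come from independent bases at each fragment's own mononucleotide frequencies, and the normalisation of the SSD by expected local and global variance is left out. See Ganesan et al. 2008 for the actual formulas.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def sliding_windows(seq, window=8000, step=2000):
    """Yield (start, fragment) for each full-length window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]

def tetra_counts(fragment, k=4):
    """Count overlapping k-mers with a k bp window moving in 1 bp steps.

    Words containing anything other than A, C, G or T are skipped.
    """
    counts = Counter()
    for i in range(len(fragment) - k + 1):
        word = fragment[i:i + k]
        if set(word) <= set(BASES):
            counts[word] += 1
    return counts

def ssd(fragment, k=4):
    """Sum of squared differences between observed and expected k-mer counts.

    Assumption: expected counts treat bases as independent draws at the
    fragment's mononucleotide frequencies. No variance normalisation is done.
    """
    counts = tetra_counts(fragment, k)
    total = sum(counts.values())
    base_counts = Counter(b for b in fragment if b in BASES)
    n_bases = sum(base_counts.values())
    if total == 0 or n_bases == 0:
        return 0.0
    freq = {b: base_counts[b] / n_bases for b in BASES}

    result = 0.0
    for word in map("".join, product(BASES, repeat=k)):
        expected = float(total)
        for b in word:
            expected *= freq[b]
        result += (counts[word] - expected) ** 2
    return result
```

A scaffold held as a plain uppercase string could then be scanned with `[(start, ssd(frag)) for start, frag in sliding_windows(scaffold)]`, which gives one unnormalised SSD value per 8kb fragment.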