P. fluorescens genomics

Multivariate analysis of oligonucleotide frequency

Outline

Regions of a genome with atypical nucleotide composition may indicate horizontally transferred DNA and genomic islands. This analysis compares tetranucleotide usage across the P. fluorescens genome to highlight such regions.

Aims

The Pseudomonas fluorescens strains we sequenced were isolated from nutrient-starved caves. Because nutrients are scarce in these environments, we hypothesised that there will be microbial competition for limited resources, which will select for horizontal gene transfer and the uptake of DNA encoding novel metabolic activity. Based on this hypothesis we expect to find evidence of genomic islands and plasmids. A common method to identify genomic islands is through variability in %GC content, where a region with a markedly different %GC content is considered an island. Comparison of orthologs in Escherichia coli and Salmonella typhi has, however, shown that %GC content alone may be a poor indicator of horizontal gene transfer.

An alternative method to identify genomic islands may be to compare variation in oligonucleotide frequency, in particular the differential usage of tetranucleotides (4-mers). Several articles describe how oligonucleotide frequency may be used to detect genomic islands, and as a starting point I examined tetranucleotide variation across the P. fluorescens R124 genome. The results of this analysis are outlined below.

Results

I split the P. fluorescens R124 genome into 5Kbp fragments and calculated the frequency of all possible tetranucleotides in each fragment. Similar to the analysis by Dick et al., I used dimensionality reduction to visualise the distribution of genomic fragments and identify any outliers, which may indicate horizontally transferred DNA. Dick et al. used self-organising maps, but I used singular value decomposition (SVD) as I am more familiar with this method.

One use of SVD is to identify the underlying patterns in a data matrix which explain a large degree of variation. The SVD of a matrix produces a set of singular values, each paired with an underlying pattern (a singular vector) to which each variable in the data contributes a weight. The size of a singular value reflects how much of the variation its pattern explains. In this case the variables are the tetranucleotide frequencies in each 5Kbp genome fragment.

This figure indicates that the first singular value is much larger than the remaining singular values. This may indicate that a large degree of the variability in the tetranucleotide frequencies is explained by a single underlying trend.
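The relative strength of each singular value can be put into numbers. The following is a minimal Python sketch rather than the R code used for this analysis. It runs on random stand-in counts, not the R124 data, and it assumes that "row normalised" means dividing each row by its total. If the matrix is not mean-centred, part of a dominant first singular value may simply reflect the average tetranucleotide profile shared by all fragments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 20 fragments x 256 tetranucleotide counts.
# These are random numbers for illustration, not real genome data.
counts = rng.integers(1, 50, size=(20, 256)).astype(float)

# Row normalisation, assumed here to mean dividing each row by its total.
freqs = counts / counts.sum(axis=1, keepdims=True)

# Thin SVD: freqs = U @ diag(s) @ Vt, with s sorted from largest to smallest.
U, s, Vt = np.linalg.svd(freqs, full_matrices=False)

# Share of the total sum of squares carried by each singular value.
share = s**2 / np.sum(s**2)
print(share[:3])
```

Plotting `share` against its index gives the kind of singular value plot described above.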

Another output of the SVD analysis is a weight for each 5Kbp genome fragment on each of the singular values, describing how strongly the fragment is associated with that underlying pattern. Therefore the distribution of genome fragments can be visualised in two dimensions by plotting the weight of each fragment for the first two singular values. This is shown in the figure below. Each point represents one of the 5Kbp genome fragments and the distance between any two points indicates their similarity in tetranucleotide composition for the first two singular values. The aim of this figure is to identify fragments with low tetranucleotide similarity to the rest of the genome and therefore possible genomic islands.
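As a rough illustration of how such two-dimensional coordinates can be obtained, here is a Python sketch (again, not the original R code). The function names `fragment_coordinates` and `rank_by_distance` are hypothetical. Ranking fragments by distance from the median point is an assumption added for illustration; the analysis described here relied on visual inspection of the plot.

```python
import numpy as np


def fragment_coordinates(freqs, n_components=2):
    """Return each row's weight on the first n_components singular vectors."""
    U, s, _ = np.linalg.svd(np.asarray(freqs, dtype=float), full_matrices=False)
    # Scale the left singular vectors by their singular values so that
    # distances between points reflect the strength of each component.
    return U[:, :n_components] * s[:n_components]


def rank_by_distance(coords):
    """Return row indices ordered from most to least distant from the median point."""
    centre = np.median(coords, axis=0)
    dist = np.linalg.norm(coords - centre, axis=1)
    return np.argsort(-dist)
```

The signs of the singular vectors are arbitrary, so a plot of these coordinates may appear mirrored between runs or libraries without changing the distances between points.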

This figure indicates a continuous arc in the tetranucleotide distribution of 5Kbp genome fragments. The dense cluster of points in the bottom right of the figure shows that most of the P. fluorescens R124 genome has a homogeneous usage of tetranucleotides, which might be expected. A long tail at the top of the figure and a few points at the bottom indicate regions with differential usage of tetranucleotides. These regions are candidates for genomic islands and should be investigated further.

Methods

All 'N' regions in the P. fluorescens R124 genomic scaffolds were stripped and the remaining sequence was split into non-overlapping 5Kbp fragments. Fragments smaller than 5Kbp were ignored. A 4bp sliding window moving in 1bp steps was used to count tetranucleotide frequencies in each 5Kbp fragment. These frequencies were calculated and summed over both the forward and reverse complement strands of each fragment. This produced a 1238 by 256 matrix where each row corresponds to one of the 1238 genome fragments and each column is the frequency of one of the 256 possible combinations of four nucleotides. Using R, this matrix was row normalised and then decomposed using singular value decomposition.
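The fragmenting and counting steps above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the original code. It assumes each scaffold is fragmented separately after its 'N' bases are removed. The names `split_fragments`, `tetranucleotide_counts` and `count_matrix` are hypothetical.

```python
from collections import Counter
from itertools import product

FRAGMENT_SIZE = 5000  # 5Kbp fragments, as described in the text
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 tetranucleotides
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def split_fragments(scaffold, size=FRAGMENT_SIZE):
    """Strip 'N' bases, then cut into non-overlapping fragments, dropping a short tail."""
    clean = scaffold.upper().replace("N", "")
    return [clean[i:i + size] for i in range(0, len(clean) - size + 1, size)]


def tetranucleotide_counts(fragment):
    """Count 4-mers with a 1bp sliding window, summed over both strands."""
    counts = Counter()
    for strand in (fragment, reverse_complement(fragment)):
        for i in range(len(strand) - 3):
            counts[strand[i:i + 4]] += 1
    # Fixed column order; windows containing other ambiguity codes are ignored.
    return [counts[kmer] for kmer in KMERS]


def count_matrix(scaffolds, size=FRAGMENT_SIZE):
    """Build a fragments-by-256 matrix of raw tetranucleotide counts."""
    return [
        tetranucleotide_counts(fragment)
        for scaffold in scaffolds
        for fragment in split_fragments(scaffold, size)
    ]
```

Calling `count_matrix` on a list of scaffold sequences returns one row of 256 counts per 5Kbp fragment, ready for row normalisation and decomposition.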
