The last two weeks I’ve been trying to identify groups of orthologous genes across the genomes of four Pseudomonas fluorescens strains, including our strain R124. The R124 genome was annotated using the the IMG-ER service which also assigns genes to clusters of orthologous genes (COGs) during the process. I’d specifically like to be able to estimate the size of the pan-genome for Pseudomonas fluorescens though.
The pan-genome describes which orthologs are present in a group of related genomes. The core genome is the orthologs found in all genomes, the dispensable genome is the orthologs found in some of the species, while the unique genome is specific to an individual. Differences between genes in each of these groups will identify the possible differences in lifestyle between strains encoded in the genome. The size of the pan-genome also suggests the degree of the variation between the species/strains.
To estimate the pan-genome for P. fluorescens I followed the methods used by Hogg et. al which are described in the supplementary material as being suitable for pan-genome estimation. I modified this method to include a clustering technique and the steps I took are described in brief below.
Ortholog identification

All P. fluorescens protein coding gene sequences from all four strains were compiled into a fasta nucleotide database. The protein sequence of each gene was then used to search all six-frame translated sequences using tfasty. Using six frame translations attempts to account for possible frame shift errors in pair-wise comparisons.

Positive gene matches were those identified as having at least 70% sequence identity over at least 70% the length of the shortest of the two sequences.

The tfasty matches between sequences was used to generate a network of sequence similarity. Each node in the network represents a P. fluorescens gene from any of four genomes and each arc represents sequence similarity between the two sequences.

Groups of orthologous genes were then identified in the this network using the Markov clustering software (MCL). This software simulates a random walk over a network and more similar nodes are more likely to appear together in the same walk. Groups of orthologous genes were identified as those clustering together.
Clustering results
I wanted to test this clustering approach to determine how effective this process was in identifying orthologs. I aligned each cluster of sequences using MAFFT and then determined the consensus sequence from this alignment. The distribution of consensus sequence length as a percentage of the total alignment length is shown in the histogram below.

This figure shows that the majority of ortholog clusters have a consensus sequence length greater than 50% of the alignment length. However what I found unusual was that 2.3% of gene clusters produced a consensus sequence less than this.
I compared the number of sequences in the alignment, and length of alignment to determine if these factors could be related to the consensus sequence length. The graphs for these are shown below and suggest that neither of these two factors have an obvious effect.


I also compared the cluster efficiency, a metric produced by MCL, to determine if the way in which the data is clustered and therefore which sequences are included in each ortholog group is related to the consensus estimation. This also appeared to be unrelated.

Finally I manually looked at the alignments that had a low consensus length. In many of these alignment there were small sequences included which were much shorter than the rest of the sequences in the alignment. Plotting the difference between the largest and smallest sequence in each ortholog cluster does suggest there is a trend.

The likely reason for this is that small proteins are matching a conserved domain in a larger sequence and therefore being considered a match when generating the network for clustering. It’s debatable at what length a shorter sequence may not be considered to encode an orthologous function, but I opted for an arbitrary cut-off of a 50% difference in size. Repeating the analysis using this cutoff generated the following distribution of consensus length where the number of sequences with consensus less than 50% dropped for 2.3% to 1.6%.

This does not completely eliminate the poorly aligned gene clusters nor does this appear to reduce the obvious bias related to differences in sequence size. However this did seem to eliminate about a third of the poor sequence alignments. The best way to deal with the remaining clusters may just be to remove them.