November 16, 2010

Getting the terminal version of vim 7.3 to work on OSX 10.6

MacVim is an excellent GUI editor but I like to use the terminal version of vim from time-to-time also. The OSX 10.6 operating system ships with vim 7.2 which doesn’t include some of the cool new features in 7.3 such as relative line numbering, colour columns and undofiles (check out the excellent gundo plugin).

Googling around on the web suggests downloading the latest vim source using mercurial and compiling with the extend command line options. I spent a good few hours trying this with no luck – the python interpreter kept failing. So here’s how I did it in the end much more quickly using homebrew.


brew install macvim
echo "`ls /usr/local/Cellar/macvim/*/MacVim.app/Contents/MacOS/Vim | head -n 1` \$*" > /usr/local/bin/vim
chmod 755 /usr/local/bin/vim

UPDATE:

Seth has pointed out in the comments a simpler way to get the same result. Use a symbolic link called ‘vim’ to the mvim binary. The mvim program will then work out that it’s being called as vim rather than the GUI version.


brew install macvim
ln -s /usr/local/bin/mvim /usr/local/bin/vim

November 15, 2010

Genome scaffolding with Ruby

The slides I gave from a talk this weekend.

August 10, 2010

Identifying groups of orthologous genes across Pseudomonas fluorescens strains

The last two weeks I’ve been trying to identify groups of orthologous genes across the genomes of four Pseudomonas fluorescens strains, including our strain R124. The R124 genome was annotated using the the IMG-ER service which also assigns genes to clusters of orthologous genes (COGs) during the process. I’d specifically like to be able to estimate the size of the pan-genome for Pseudomonas fluorescens though.

The pan-genome describes which orthologs are present in a group of related genomes. The core genome is the orthologs found in all genomes, the dispensable genome is the orthologs found in some of the species, while the unique genome is specific to an individual. Differences between genes in each of these groups will identify the possible differences in lifestyle between strains encoded in the genome. The size of the pan-genome also suggests the degree of the variation between the species/strains.

To estimate the pan-genome for P. fluorescens I followed the methods used by Hogg et. al which are described in the supplementary material as being suitable for pan-genome estimation. I modified this method to include a clustering technique and the steps I took are described in brief below.

Ortholog identification

All P. fluorescens protein coding gene sequences from all four strains were compiled into a fasta nucleotide database. The protein sequence of each gene was then used to search all six-frame translated sequences using tfasty. Using six frame translations attempts to account for possible frame shift errors in pair-wise comparisons.

Positive gene matches were those identified as having at least 70% sequence identity over at least 70% the length of the shortest of the two sequences.

The tfasty matches between sequences was used to generate a network of sequence similarity. Each node in the network represents a P. fluorescens gene from any of four genomes and each arc represents sequence similarity between the two sequences.

Groups of orthologous genes were then identified in the this network using the Markov clustering software (MCL). This software simulates a random walk over a network and more similar nodes are more likely to appear together in the same walk. Groups of orthologous genes were identified as those clustering together.

Clustering results

I wanted to test this clustering approach to determine how effective this process was in identifying orthologs. I aligned each cluster of sequences using MAFFT and then determined the consensus sequence from this alignment. The distribution of consensus sequence length as a percentage of the total alignment length is shown in the histogram below.

This figure shows that the majority of ortholog clusters have a consensus sequence length greater than 50% of the alignment length. However what I found unusual was that 2.3% of gene clusters produced a consensus sequence less than this.

I compared the number of sequences in the alignment, and length of alignment to determine if these factors could be related to the consensus sequence length. The graphs for these are shown below and suggest that neither of these two factors have an obvious effect.

I also compared the cluster efficiency, a metric produced by MCL, to determine if the way in which the data is clustered and therefore which sequences are included in each ortholog group is related to the consensus estimation. This also appeared to be unrelated.

Finally I manually looked at the alignments that had a low consensus length. In many of these alignment there were small sequences included which were much shorter than the rest of the sequences in the alignment. Plotting the difference between the largest and smallest sequence in each ortholog cluster does suggest there is a trend.

The likely reason for this is that small proteins are matching a conserved domain in a larger sequence and therefore being considered a match when generating the network for clustering. It’s debatable at what length a shorter sequence may not be considered to encode an orthologous function, but I opted for an arbitrary cut-off of a 50% difference in size. Repeating the analysis using this cutoff generated the following distribution of consensus length where the number of sequences with consensus less than 50% dropped for 2.3% to 1.6%.

This does not completely eliminate the poorly aligned gene clusters nor does this appear to reduce the obvious bias related to differences in sequence size. However this did seem to eliminate about a third of the poor sequence alignments. The best way to deal with the remaining clusters may just be to remove them.

July 11, 2010

Genome annotation using the JGI’s Integrated Microbial Genomes

Last week our genome was annotated at the Joint Genome Institutes’s Integrate Microbial Genomes resource. This tool was recommended to me after speaking to the two helpful people at the JGI’s stand during American Society for Microbiology conference last month.

Submission and annotation process

Submitting our genome sequence data for annotation was a simple process which first requires an IMG account. The microbe to be annotated also requires a GOLD genome project identifier. I had previously created an NCBI genome project and these appear to automatically also be given GOLD ids, so I was able to use this. To submit a genome for annotation I went to the data submission page, selected “IMG ER Submissions”, then clicked “Submit Dataset to IMG ER” at the bottom of the page. The species name was enough to search and find our genome project so that I could then upload all of our scaffolds and non-scaffolded contigs for annotation. Our genome annotation was completed in around a day, and the annotation results were available a few days later.

Annotation results

An early look at the results of the genome annotation suggests nothing unusual. The number of genes, genes per megabase, and mean gene size for P. fluorescens R124 appears similar to the other P. fluorescens strains. One interesting point though was that no ribosomal RNA or CRISPR regions were found in the genome. I believe this is likely because these types of repetitive regions are those which cannot be easily assembled from the ~500bp length 454 reads, and are therefore gaps in the current assembly.

June 30, 2010

Building a draft genome sequence using a reference genome

I’d like to be able to have a draft sequence of our genome as opposed to just the set of ordered scaffolds. A complete draft sequence would be useful for identifying genomic rearrangements relative to other Pseudomonas fluorescens strains. A draft genome sequence will also allow me to estimate the distances between genes.

The likely order of scaffolds can be estimated by aligning our sequences to a closely related reference genome. This is easy to do using using nucmer, part of the mummer package. The figure below is an example of simply plotting the results of each scaffold match to three different reference strains.

Nucmer Scaffold Alignment of R124 Scaffolds

This output is useful but I wanted to move further to generate a genome map of each scaffold. This is somewhat more difficult to do though because rearrangements and repeats produce hits for the same scaffold in more than one area. I found the most pragmatic solution was to manually curate the nucmer alignment results by hand and order the sequencing scaffolds and contigs based on the longest continuous set of matches to the reference genome. I wrote the likely order as a YAML file which looks something like this:

This allowed to me specify the order of the scaffolds and contigs, the probable size of the unresolved gaps, and also add comments for regions I’m unsure about. Going from a YAML file to circos output is then just a case of rearranging the numbers to the correct format. This produced the map below which shows where the gaps between contigs and also shows the unresolved repeat regions left as gaps by newbler (green highlights on the inside track).

Genome map of R124 showing gaps and contigs