<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>michael barton</title>
	<atom:link href="http://www.michaelbarton.me.uk/research/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.michaelbarton.me.uk/research</link>
	<description>genomics, phylogenetics and molecular evolution</description>
	<lastBuildDate>Tue, 10 Aug 2010 09:30:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<atom:link rel='hub' href='http://www.michaelbarton.me.uk/research/?pushpress=hub'/>
		<item>
		<title>Identifying groups of orthologous genes across Pseudomonas fluorescens strains</title>
		<link>http://www.michaelbarton.me.uk/research/2010/08/identifying-groups-of-orthologous-genes-across-pseudomonas-fluorescens-strains/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/08/identifying-groups-of-orthologous-genes-across-pseudomonas-fluorescens-strains/#comments</comments>
		<pubDate>Tue, 10 Aug 2010 09:30:18 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[methods]]></category>
		<category><![CDATA[fasta]]></category>
		<category><![CDATA[orthology]]></category>
		<category><![CDATA[pan-genome]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/research/?p=229</guid>
		<description><![CDATA[The last two weeks I&#8217;ve been trying to identify groups of orthologous genes across the genomes of four Pseudomonas fluorescens strains, including our strain R124. The R124 genome was annotated using the IMG-ER service which also assigns genes to clusters of orthologous genes (COGs) during the process. I&#8217;d specifically like to be able to [...]]]></description>
			<content:encoded><![CDATA[<p>For the last two weeks I&#8217;ve been trying to identify groups of orthologous genes across the genomes of four <em>Pseudomonas fluorescens</em> strains, including our strain R124. The R124 genome was <a href="http://www.michaelbarton.me.uk/fluorescens-genomics/annotation/">annotated using the IMG-ER service</a>, which also assigns genes to clusters of orthologous genes (COGs) during the process. What I&#8217;d specifically like to do, though, is estimate the size of the pan-genome for <em>Pseudomonas fluorescens</em>.</p>
<p>The pan-genome describes which orthologs are present in a group of related genomes. The core genome is the set of orthologs found in all genomes, the dispensable genome is the set found in some but not all genomes, and the unique genome is the set specific to a single genome. Differences between the genes in each of these groups may point to differences in lifestyle between strains that are encoded in their genomes. The size of the pan-genome also suggests the degree of variation between the species or strains.</p>
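<p>As a rough illustration of these three categories, the sketch below labels ortholog clusters by how many genomes they occur in. The input layout (a cluster id mapped to the set of genomes containing a member gene) and the strain names other than R124 are assumptions made for the example, not data from this analysis.</p>

```python
from collections import Counter


def classify_clusters(clusters, n_genomes):
    """Label each ortholog cluster as core, dispensable, or unique.

    `clusters` maps a cluster id to the set of genomes that contain at
    least one member gene (a hypothetical input layout).
    """
    labels = {}
    for cluster_id, genomes in clusters.items():
        if len(genomes) == n_genomes:
            labels[cluster_id] = "core"          # present in every genome
        elif len(genomes) == 1:
            labels[cluster_id] = "unique"        # present in a single genome
        else:
            labels[cluster_id] = "dispensable"   # present in some genomes
    return labels


# Placeholder data: "strainA" to "strainC" stand in for the reference strains.
example = {
    "c1": {"R124", "strainA", "strainB", "strainC"},
    "c2": {"R124", "strainB"},
    "c3": {"R124"},
}
labels = classify_clusters(example, n_genomes=4)
summary = Counter(labels.values())
print(labels)
print(dict(summary))
```

<p>Counting the labels gives the sizes of the core, dispensable and unique sets for the genomes compared.</p>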
<p>To estimate the pan-genome for <em>P. fluorescens</em> I followed the methods used by <a href="http://www.ncbi.nlm.nih.gov/pubmed/17550610">Hogg <em>et al.</em></a>, which are described <a href="http://www.centerforgenomicsciences.org/documents/Hogg_etal_2007_supplemental_material/">in the supplementary material</a> as being suitable for pan-genome estimation. I modified this method to include a clustering technique, and the steps I took are described in brief below.</p>
<h2 id="ortholog-identification">Ortholog identification</h2>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/fluorescens-genomics/images/2010/07/search.svg" width="400" /></p>
<p>All <em>P. fluorescens</em> protein coding gene sequences from all four strains were compiled into a fasta nucleotide database. The protein sequence of each gene was then used to search all six-frame translated sequences using tfasty. Searching against six-frame translations is an attempt to account for possible frameshift errors in pairwise comparisons.</p>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/fluorescens-genomics/images/2010/07/close_hits.svg" width="400" /></p>
<p>Positive gene matches were those identified as having at least 70% sequence identity over at least 70% of the length of the shorter of the two sequences.</p>
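<p>The threshold above can be written as a small predicate. This is only a sketch: the function name and arguments are hypothetical, identity is assumed to be a fraction between 0 and 1, and the aligned length and both sequence lengths are assumed to be in the same units (for example amino acids).</p>

```python
def is_positive_match(identity, aligned_length, length_a, length_b,
                      min_identity=0.70, min_coverage=0.70):
    """Return True when a hit meets both the identity and coverage cut-offs.

    Coverage is measured against the shorter of the two sequences.
    """
    shorter = min(length_a, length_b)
    coverage = aligned_length / shorter
    return identity >= min_identity and coverage >= min_coverage


# Illustrative values only.
good_hit = is_positive_match(0.85, 240, 300, 320)      # 80% of the shorter sequence
short_overlap = is_positive_match(0.85, 150, 300, 320)  # 50% of the shorter sequence
low_identity = is_positive_match(0.65, 290, 300, 320)   # identity below 70%
print(good_hit, short_overlap, low_identity)
```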
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/fluorescens-genomics/images/2010/07/generate_network.svg" width="400" /></p>
<p>The tfasty matches between sequences were used to generate a network of sequence similarity. Each node in the network represents a <em>P. fluorescens</em> gene from any of the four genomes and each arc represents sequence similarity between two sequences.</p>
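<p>One way to represent such a network is as a weighted edge list. The sketch below collapses pairwise hits into undirected edges and writes them as tab-separated label, label, weight lines, which is the kind of input MCL reads with its <code>--abc</code> option. Using sequence identity as the edge weight and the gene identifiers shown are assumptions for the example.</p>

```python
def build_edges(matches):
    """Collapse pairwise hits into undirected, weighted edges.

    `matches` is an iterable of (query, subject, identity) tuples that have
    already passed the match thresholds. Self-hits are dropped, and when a
    pair was found in both search directions the higher identity is kept.
    """
    edges = {}
    for query, subject, identity in matches:
        if query == subject:
            continue
        pair = tuple(sorted((query, subject)))
        edges[pair] = max(identity, edges.get(pair, 0.0))
    return edges


def to_abc_lines(edges):
    """Format edges as tab-separated 'label label weight' lines."""
    return [f"{a}\t{b}\t{weight}" for (a, b), weight in sorted(edges.items())]


# Hypothetical gene identifiers and identities.
matches = [
    ("R124_0001", "R124_0001", 1.00),
    ("R124_0001", "refA_0042", 0.91),
    ("refA_0042", "R124_0001", 0.89),
    ("R124_0002", "refB_0310", 0.78),
]
edges = build_edges(matches)
lines = to_abc_lines(edges)
print("\n".join(lines))
```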
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/fluorescens-genomics/images/2010/07/identify_orthologs.svg" width="400" /></p>
<p>Groups of orthologous genes were then identified in this network using the <a href="http://www.micans.org/mcl/">Markov clustering software (MCL)</a>. This software simulates random walks over a network, and more similar nodes are more likely to appear together in the same walk. Groups of orthologous genes were identified as those clustering together.</p>
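<p>To give a feel for what Markov clustering does, here is a toy pure-Python version of its two alternating steps: expansion (squaring the column-stochastic matrix, which spreads random-walk flow) and inflation (raising entries to a power and rescaling, which strengthens strong flows). It is a teaching sketch, not the MCL software, and the fixed iteration count, the self-loops and the small example graph are assumptions.</p>

```python
def normalise_columns(m):
    """Rescale each column so that it sums to one."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def toy_mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster a small graph by alternating expansion and inflation."""
    n = len(adjacency)
    # Add self-loops so every column has a non-zero sum.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        m = multiply(m, m)                                   # expansion
        m = [[value ** inflation for value in row] for row in m]  # inflation
        m = normalise_columns(m)
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Nodes 0-2 are mutually similar; nodes 3-4 are similar to each other only.
adjacency = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = toy_mcl(adjacency)
print(clusters)
```

<p>On real data the inflation value controls how finely the network is split, so it is worth trying more than one setting.</p>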
<h2 id="clustering-results">Clustering results</h2>
<p>I wanted to test this clustering approach to determine how effective this process was in identifying orthologs. I aligned each cluster of sequences using MAFFT and then determined the consensus sequence from this alignment. The distribution of consensus sequence length as a percentage of the total alignment length is shown in the histogram below.</p>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/07/consensus_length.png" width="350" /></p>
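<p>The consensus measure can be sketched as follows. The exact consensus rule used for the figure above is not stated, so this example assumes a column counts towards the consensus when its most common character is a residue rather than a gap and appears in at least half of the sequences. The alignment is made up.</p>

```python
from collections import Counter


def consensus_fraction(alignment, min_support=0.5, gap="-"):
    """Fraction of alignment columns that yield a consensus residue.

    `alignment` is a list of equal-length aligned sequences. The
    `min_support` threshold is an assumption for this sketch.
    """
    length = len(alignment[0])
    consensus_columns = 0
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        if residue != gap and count / len(alignment) >= min_support:
            consensus_columns += 1
    return consensus_columns / length


# Made-up alignment: the last two columns are mostly gaps.
alignment = [
    "MKTAYL--",
    "MKTAYLGG",
    "MRT-YL--",
]
fraction = consensus_fraction(alignment)
print(fraction)
```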
<p>This figure shows that the majority of ortholog clusters have a consensus sequence length greater than 50% of the alignment length. However, what I found unusual was that 2.3% of gene clusters produced a consensus sequence shorter than this.</p>
<p>I compared the number of sequences in the alignment and the length of the alignment to determine if these factors could be related to the consensus sequence length. The graphs for these are shown below and suggest that neither of these two factors has an obvious effect.</p>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/07/consensus_vs_genes.png" width="350" /></p>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/07/consensus_vs_length.png" width="350" /></p>
<p>I also compared the cluster efficiency, a metric produced by MCL, to determine if the way in which the data is clustered and therefore which sequences are included in each ortholog group is related to the consensus estimation. This also appeared to be unrelated.</p>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/07/consensus_vs_cluster.png" width="350" /></p>
<p>Finally I manually looked at the alignments that had a low consensus length. In many of these alignments there were small sequences included which were much shorter than the rest of the sequences in the alignment. Plotting the difference between the largest and smallest sequence in each ortholog cluster does suggest there is a trend.</p>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/07/consensus_vs_difference.png" width="350" /></p>
<p>The likely reason for this is that small proteins are matching a conserved domain in a larger sequence and therefore being considered a match when generating the network for clustering. It&#8217;s debatable at what length a shorter sequence may not be considered to encode an orthologous function, but I opted for an arbitrary cut-off of a 50% difference in size. Repeating the analysis using this cut-off generated the following distribution of consensus length, where the proportion of clusters with a consensus less than 50% dropped from 2.3% to 1.6%.</p>
<p><img class="aligncenter size-medium" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/07/consensus_vs_difference_trimmed.png" width="350" /></p>
<p>This does not completely eliminate the poorly aligned gene clusters nor does this appear to reduce the obvious bias related to differences in sequence size. However this did seem to eliminate about a third of the poor sequence alignments. The best way to deal with the remaining clusters may just be to remove them.</p>
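<p>The size cut-off can be expressed as a simple filter on sequence pairs. The sketch assumes that a 50% difference in size means the shorter sequence must be at least half the length of the longer one, which is one possible reading. The last two lines check the relative change when the share of poorly aligned clusters goes from 2.3% to 1.6%.</p>

```python
def passes_size_filter(length_a, length_b, max_difference=0.5):
    """Keep a pair only if the shorter sequence is long enough.

    With max_difference=0.5 the shorter sequence must be at least 50%
    of the length of the longer one (an assumed interpretation).
    """
    shorter, longer = sorted((length_a, length_b))
    return shorter / longer >= 1.0 - max_difference


# Illustrative sequence lengths.
pairs = [(300, 320), (250, 400), (120, 400)]
kept = [pair for pair in pairs if passes_size_filter(*pair)]
print(kept)

# Relative reduction in poorly aligned clusters: (2.3 - 1.6) / 2.3
relative_drop = (2.3 - 1.6) / 2.3
print(f"{relative_drop:.0%}")
```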
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/08/identifying-groups-of-orthologous-genes-across-pseudomonas-fluorescens-strains/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Genome annotation using the JGI&#8217;s Integrated Microbial Genomes</title>
		<link>http://www.michaelbarton.me.uk/research/2010/07/genome-annotation-using-the-jgis-integrated-microbial-genomes/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/07/genome-annotation-using-the-jgis-integrated-microbial-genomes/#comments</comments>
		<pubDate>Sun, 11 Jul 2010 09:00:32 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[results]]></category>
		<category><![CDATA[annotation]]></category>
		<category><![CDATA[fluorescens]]></category>
		<category><![CDATA[genome]]></category>
		<category><![CDATA[jgi]]></category>
		<category><![CDATA[pseudomonas]]></category>
		<category><![CDATA[R124]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/research/?p=217</guid>
		<description><![CDATA[Last week our genome was annotated at the Joint Genome Institute&#8217;s Integrated Microbial Genomes resource. This tool was recommended to me after speaking to the two helpful people at the JGI&#8217;s stand during the American Society for Microbiology conference last month. Submission and annotation process Submitting our genome sequence data for annotation was a simple process [...]]]></description>
			<content:encoded><![CDATA[<p>Last week our genome was annotated at the <a href="http://www.jgi.doe.gov/">Joint Genome Institute&rsquo;s</a> <a href="http://img.jgi.doe.gov/">Integrated Microbial Genomes</a> resource. This tool was recommended to me after speaking to the two helpful people at the JGI&rsquo;s stand during the American Society for Microbiology conference last month.</p>
<h3 id="submission-and-annotation-process">Submission and annotation process</h3>
<p>Submitting our genome sequence data for annotation was a simple process which first requires an IMG account. The microbe to be annotated also requires a <a href="http://www.genomesonline.org/">GOLD</a> genome project identifier. I had previously created an NCBI genome project and these appear to automatically also be given GOLD ids, so I was able to use this. To submit a genome for annotation I went to the <a href="http://merced.jgi-psf.org/cgi-bin/img_er_submit/main.cgi">data submission page</a>, selected &ldquo;IMG ER Submissions&rdquo;, then clicked &ldquo;Submit Dataset to IMG ER&rdquo; at the bottom of the page. The species name was enough to search and find our genome project so that I could then upload all of our scaffolds and non-scaffolded contigs for annotation. Our genome annotation was completed in around a day, and the annotation results were available a few days later.</p>
<h3 id="annotation-results">Annotation results</h3>
<p>An <a href="http://www.michaelbarton.me.uk/fluorescens-genomics/annotation/">early look at the results of the genome annotation</a> suggests nothing unusual. The number of genes, genes per megabase, and mean gene size for <em>P. fluorescens</em> R124 appear similar to the other <em>P. fluorescens</em> strains. One interesting point though was that no ribosomal RNA or CRISPR regions were found in the genome. I believe this is likely because these types of repetitive regions cannot be easily assembled from the ~500 bp 454 reads, and are therefore gaps in the current assembly.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/07/genome-annotation-using-the-jgis-integrated-microbial-genomes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building a draft genome sequence using a reference genome</title>
		<link>http://www.michaelbarton.me.uk/research/2010/06/building-a-draft-genome-sequence-using-a-reference-genome/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/06/building-a-draft-genome-sequence-using-a-reference-genome/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 08:30:56 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[comparative]]></category>
		<category><![CDATA[R124]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/research/?p=207</guid>
		<description><![CDATA[I&#8217;d like to be able to have a draft sequence of our genome as opposed to just the set of ordered scaffolds. A complete draft sequence would be useful for identifying genomic rearrangements relative to other Pseudomonas fluorescens strains. A draft genome sequence will also allow me to estimate the distances between genes. The likely [...]]]></description>
			<content:encoded><![CDATA[<p>I&rsquo;d like to be able to have a draft sequence of our genome as opposed to just the set of ordered scaffolds. A complete draft sequence would be useful for identifying genomic rearrangements relative to other <em>Pseudomonas fluorescens</em> strains. A draft genome sequence will also allow me to estimate the distances between genes.</p>
<p>The likely order of scaffolds can be estimated by aligning our sequences to a closely related reference genome. This is easy to do using nucmer, part of <a href="http://mummer.sourceforge.net/">the mummer package</a>. The figure below is an example of simply plotting the results of each scaffold match to three different reference strains.</p>
<p><a href="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/03/scaffold_alignment1.png"><img alt="Nucmer Scaffold Alignment of R124 Scaffolds" class="aligncenter size-medium wp-image-162" height="300" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/03/scaffold_alignment1-300x300.png" title="Nucmer Scaffold Alignment of R124 Scaffolds" width="300" /></a></p>
<p>This output is useful but I wanted to move further to generate a genome map of each scaffold. This is somewhat more difficult to do though because rearrangements and repeats produce hits for the same scaffold in more than one area. I found the most pragmatic solution was to manually curate the nucmer alignment results and order the sequencing scaffolds and contigs based on the longest continuous set of matches to the reference genome. I wrote the likely order as a YAML file which looks something like this:</p>
<p><script src="http://gist.github.com/457510.js?file=gistfile1.yml"></script></p>
<p>This allowed me to specify the order of the scaffolds and contigs, the probable size of the unresolved gaps, and also add comments for regions I&rsquo;m unsure about. Going from a YAML file to <a href="http://mkweb.bcgsc.ca/circos/">circos</a> output is then just a case of rearranging the numbers to the correct format. This produced the map below, which shows where the gaps between contigs are and also shows the unresolved repeat regions left as gaps by newbler (green highlights on the inside track).</p>
<p><a href="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/06/gaps_highlight.png"><img alt="Genome map of R124 showing gaps and contigs" class="aligncenter size-medium wp-image-206" height="300" src="http://www.michaelbarton.me.uk/research/wp-content/uploads/2010/06/gaps_highlight-300x300.png" title="Genome map of R124" width="300" /></a></p>
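<p>The rearranging step amounts to turning an ordered list of sequences and gap sizes into cumulative coordinates. The sketch below shows that arithmetic on made-up entries. The dictionary keys are hypothetical and are not the schema of the YAML file linked above, and the output is a plain list of tuples rather than a circos input file.</p>

```python
def layout(entries):
    """Place ordered scaffolds and contigs on a single coordinate axis.

    `entries` is an ordered list of dicts with a 'name', a 'length' and an
    optional 'gap_after' giving the estimated gap before the next entry.
    Returns (name, start, end) tuples using 0-based, end-exclusive positions.
    """
    position = 0
    placed = []
    for entry in entries:
        start = position
        end = start + entry["length"]
        placed.append((entry["name"], start, end))
        position = end + entry.get("gap_after", 0)
    return placed


# Made-up sizes for illustration.
entries = [
    {"name": "scaffold_1", "length": 1000, "gap_after": 200},
    {"name": "scaffold_2", "length": 500, "gap_after": 50},
    {"name": "contig_9", "length": 300},
]
placed = layout(entries)
for name, start, end in placed:
    print(name, start, end)
```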
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/06/building-a-draft-genome-sequence-using-a-reference-genome/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Searching further for genomic islands using variance in tetranucleotide usage</title>
		<link>http://www.michaelbarton.me.uk/research/2010/05/searching-further-for-genomic-islands-using-variance-in-tetranucleotide-usage/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/05/searching-further-for-genomic-islands-using-variance-in-tetranucleotide-usage/#comments</comments>
		<pubDate>Mon, 10 May 2010 09:00:23 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[research]]></category>
		<category><![CDATA[results]]></category>
		<category><![CDATA[fluorescens]]></category>
		<category><![CDATA[genome-signature]]></category>
		<category><![CDATA[genomic-islands]]></category>
		<category><![CDATA[tetranucleotides]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/research/?p=199</guid>
		<description><![CDATA[Identifying genomic islands to understand the likely degree of mutualism in nutrient starved environments is one of the main hypotheses in our research. I previously tried to identify genomic islands using multivariate statistics of tetranucleotide sequence composition. I&#8217;ve since further tried to identify genomic islands based on a related approach examining local versus global variance [...]]]></description>
			<content:encoded><![CDATA[<p>Identifying genomic islands to understand the likely degree of mutualism in nutrient starved environments is one <a href="http://www.michaelbarton.me.uk/fluorescens-genomics/hypotheses/">of the main hypotheses in our research</a>. I previously tried to <a href="http://www.michaelbarton.me.uk/fluorescens-genomics/multivariate-analysis-of-oligonucleotide-frequency/">identify genomic islands using multivariate statistics of tetranucleotide sequence composition</a>.</p>
<p>I&#8217;ve since further tried to identify genomic islands based on a related approach examining <a href="http://www.michaelbarton.me.uk/fluorescens-genomics/tetranucleotide-variance-analysis/">local versus global variance in tetranucleotide usage</a>. This analysis also suggested some possible genomic islands in the <em>Pseudomonas fluorescens</em> R124 genome as regions with a divergent localised variance in tetranucleotide usage compared with the rest of the genome. One interesting result was that a genome scaffold <a href="http://www.michaelbarton.me.uk/research/2010/03/discovering-a-plasmid-in-our-sequence-data/">with no sequence similarity to any reference genome</a> did not show any unusual tetranucleotide variance. This leads to further questions about the origin of this region and the possible functionality encoded within.</p>
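<p>A heavily simplified version of the local versus global comparison is sketched below: the variance of tetranucleotide frequencies is computed in each sliding window and divided by the same variance for the whole sequence. This is an illustration of the idea only and is not the statistic used in the linked analysis. The window size, step and toy sequence are assumptions.</p>

```python
from collections import Counter
from itertools import product
from statistics import pvariance

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq):
    """Relative frequency of each of the 256 tetranucleotides in `seq`."""
    words = [seq[i:i + 4] for i in range(len(seq) - 3)]
    counts = Counter(w for w in words if set(w).issubset("ACGT"))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous tetranucleotides")
    return [counts[t] / total for t in TETRAMERS]


def variance_ratios(seq, size, step):
    """Map each window start to its local / global variance ratio."""
    global_variance = pvariance(tetra_frequencies(seq))
    ratios = {}
    for start in range(0, len(seq) - size + 1, step):
        window = seq[start:start + size]
        ratios[start] = pvariance(tetra_frequencies(window)) / global_variance
    return ratios


# Toy sequence: a poly-A stretch inserted into an ACGT repeat.
sequence = "ACGT" * 25 + "A" * 100 + "ACGT" * 25
ratios = variance_ratios(sequence, size=50, step=50)
for start, ratio in ratios.items():
    print(start, round(ratio, 2))
```

<p>In this toy case the windows covering the poly-A stretch stand out from the flanking windows, which is the kind of contrast a genomic island would be expected to show.</p>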
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/05/searching-further-for-genomic-islands-using-variance-in-tetranucleotide-usage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Looking for genomic islands through variability in tetranucleotide usage</title>
		<link>http://www.michaelbarton.me.uk/research/2010/04/looking-for-genomic-islands-through-variability-in-tetranucleotide-usage/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/04/looking-for-genomic-islands-through-variability-in-tetranucleotide-usage/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 09:00:33 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[post doc]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[results]]></category>
		<category><![CDATA[genome-signature]]></category>
		<category><![CDATA[genomic-islands]]></category>
		<category><![CDATA[tetranucleotides]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/research/?p=190</guid>
		<description><![CDATA[I initially thought our Pseudomonas fluorescens sequencing data might contain plasmid DNA because one of the sequence scaffolds showed no sequence similarity to other P. fluorescens genomes. Morgan Langille however pointed out in the comments that this sequence could just as easily be the result of a horizontal gene transfer event. I&#8217;ve been trying to [...]]]></description>
			<content:encoded><![CDATA[<p>I initially thought our <em>Pseudomonas fluorescens</em> sequencing data might contain plasmid DNA <a href="http://www.michaelbarton.me.uk/research/2010/03/discovering-a-plasmid-in-our-sequence-data/">because one of the sequence scaffolds showed no sequence similarity to other <em>P. fluorescens</em> genomes</a>. <a href="http://twitter.com/BetaScience">Morgan Langille</a> however <a href="http://www.michaelbarton.me.uk/research/2010/03/discovering-a-plasmid-in-our-sequence-data/#comments">pointed out in the comments</a> that this sequence could just as easily be the result of a horizontal gene transfer event.</p>
<p>I&#8217;ve been trying to learn about genomic islands in microbes and recently I&#8217;ve been looking at identifying genomic islands through unusual frequencies in tetranucleotide usage. The theory is that horizontally transferred DNA will show differential usage of tetranucleotides compared with the rest of the genome. My analysis was based on that of <a href="http://www.citeulike.org/user/michaelbarton/article/5626460">Dick <em>et al.</em></a>, who used <a href="http://www.bioinformaticszen.com/blog/2007/07/exploring-multivariate-data-using-svd-and-som/">self-organising maps</a> to look for genomic islands in whole community sequence data. My initial results <a href="http://www.michaelbarton.me.uk/fluorescens-genomics/multivariate-analysis-of-oligonucleotide-frequency/">suggest there may be some regions which are genomic islands</a>, but further work will be needed. Next I&#8217;m going to look at differences in local and global tetranucleotide variance, which also seems <a href="http://www.ncbi.nlm.nih.gov/pubmed/15239845">useful for identifying genomic islands</a>.</p>
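<p>The basic quantity behind this kind of analysis is a tetranucleotide usage profile. The sketch below counts overlapping four-base words in a sequence and compares two profiles with a simple sum of absolute differences. The distance measure is an assumption chosen for brevity; it is not the self-organising map approach described above.</p>

```python
from collections import Counter


def tetra_profile(seq):
    """Relative frequency of each overlapping tetranucleotide in `seq`."""
    words = [seq[i:i + 4] for i in range(len(seq) - 3)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def profile_distance(profile_a, profile_b):
    """Sum of absolute frequency differences over all observed words."""
    words = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(w, 0.0) - profile_b.get(w, 0.0)) for w in words)


# Toy sequences: a repeat with mixed composition and a poly-A run.
background = tetra_profile("ACGTACGT")
region = tetra_profile("AAAAAAAA")
distance = profile_distance(background, region)
print(background)
print(distance)
```

<p>A region whose profile sits far from the genome-wide profile is a candidate for horizontally transferred DNA, subject to the further checks mentioned above.</p>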
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/04/looking-for-genomic-islands-through-variability-in-tetranucleotide-usage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discovering a plasmid in our sequence data</title>
		<link>http://www.michaelbarton.me.uk/research/2010/03/discovering-a-plasmid-in-our-sequence-data/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/03/discovering-a-plasmid-in-our-sequence-data/#comments</comments>
		<pubDate>Mon, 15 Mar 2010 09:00:48 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[results]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[blast]]></category>
		<category><![CDATA[genome]]></category>
		<category><![CDATA[plasmid]]></category>
		<category><![CDATA[scaffolds]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/?p=174</guid>
		<description><![CDATA[Last week I determined the likely order of our Pseudomonas fluorescens R124 sequencing scaffolds by mapping them on to reference genomes from the same species. This mapping to reference genomes also indicated two of the sequence scaffolds (5 and 8) didn&#8217;t align (see this figure) and therefore may not be part of the [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I determined the likely order of our <em>Pseudomonas fluorescens</em> R124 sequencing scaffolds by <a href="http://www.michaelbarton.me.uk/2010/03/estimating-genome-scaffold-order-using-reference-genomes/">mapping them on to reference genomes from the same species</a>. This mapping to reference genomes also indicated two of the sequence scaffolds (5 and 8) didn&#8217;t align (<a href="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/scaffold_alignment1.png">see this figure</a>) and therefore may not be part of the genome assembly. The next logical step therefore was to find out what type of sequence these scaffolds represented.</p>
<p>A megablast search showed scaffold 5 did align to reference <em>P. fluorescens</em> genomes, which was surprising since, as I wrote above, scaffold 5 did not appear to be part of the assembly. After a closer look, however, scaffold 5 is only ~5Kb in size while the scaffold map I produced was on a megabase scale. Therefore scaffold 5 was just too small to be seen by eye when compared to the other much larger scaffolds.</p>
<p>The blast search using scaffold 8 returned a more interesting result. The best hit was a <a href="http://www.ncbi.nlm.nih.gov/nuccore/71558859">plasmid in <em>Pseudomonas syringae</em> pv. phaseolicola</a>. The alignment between scaffold 8 and the plasmid is shown below (<a href="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/alignment.png">click for the larger version</a>) where the plasmid open reading frames are shown in red and the aligned scaffold 8 regions are shown in blue.</p>
<p><a href="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/alignment.png"><img class="aligncenter size-thumbnail wp-image-173" title="R124 Scaffold alignment to P. syringae plasmid" src="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/alignment-150x150.png" alt="" height="150" /></a></p>
<p>This result indicates the likely reason that scaffold 8 does not align to any of the reference genomes is because it is plasmid in origin rather than genomic. A further blastx search with this scaffold identified four regions with sequence similarity to known proteins, which are as follows: <a href="http://www.ncbi.nlm.nih.gov/protein/71725274">conjugal transfer proteins</a> involved in the transfer of genetic material, <a href="http://www.ncbi.nlm.nih.gov/protein/49188560">topoisomerases</a> involved in unwinding DNA, and <a href="http://www.ncbi.nlm.nih.gov/protein/71725294">relaxases</a> and <a href="http://www.ncbi.nlm.nih.gov/protein/71277125">replicases</a> which are likely to be involved in plasmid replication. A fifth type of protein may be related <a href="http://www.ncbi.nlm.nih.gov/protein/38257080">to Type IV (DNA or protein) secretion</a>, however the functional annotation of these was less clear. The blastx image result is shown below.</p>
<p><a href="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/blast.png"><img class="aligncenter size-medium wp-image-172" title="R124 Scaffold 8 Blast result" src="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/blast-300x265.png" alt="" width="300" /></a></p>
<p>I&#8217;m still learning microbial genomics and I suspect it&#8217;s unsurprising to discover a plasmid containing sequence similarity to genes involved in replication and transfer. What does spark my interest is that the above blast image shows the rest of the plasmid does not appear in the first 100 results returned by blast. This might indicate there is relatively novel data with low sequence similarity to known genes waiting to be analysed.</p>
<p><strong>UPDATE: <a id="dsq-author-user-39982856" rel="nofollow" href="http://twitter.com/BetaScience" target="_blank"><span style="font-weight: normal;">Morgan Langille</span></a><span style="font-weight: normal;"> has rightly pointed out in a comment below that scaffold 8 could have low sequence similarity and still be part of the R124 genome if it&#8217;s an inserted genomic island.</span></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/03/discovering-a-plasmid-in-our-sequence-data/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Estimating genome scaffold order using reference genomes</title>
		<link>http://www.michaelbarton.me.uk/research/2010/03/estimating-genome-scaffold-order-using-reference-genomes/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/03/estimating-genome-scaffold-order-using-reference-genomes/#comments</comments>
		<pubDate>Mon, 08 Mar 2010 10:00:17 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[results]]></category>
		<category><![CDATA[mummer]]></category>
		<category><![CDATA[nucmer]]></category>
		<category><![CDATA[R124]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/?p=163</guid>
		<description><![CDATA[The P. fluorescens genome sequencing results arrived last week and so far I&#8217;ve been looking at how we can begin to assemble the scaffolds from the smaller of the two genomes R124 into a complete draft. There are small holes in the scaffolds which we will PCR across but the harder task is to cross the [...]]]></description>
			<content:encoded><![CDATA[<p>The <em>P. fluorescens</em> genome <a href="http://www.michaelbarton.me.uk/2010/03/pseudomonas-fluorescens-sequencing-results/">sequencing results arrived last week</a> and so far I&#8217;ve been looking at how we can begin to assemble the scaffolds from the smaller of the two genomes <a href="http://github.com/michaelbarton/Pseudomonas-fluorescens-R124-genome">R124</a> into a complete draft. There are small holes in the scaffolds which we will PCR across but the harder task is to cross the gaps between scaffolds, which could be tens or hundreds of thousands of bases long.</p>
<p>There are genomes available for other strains of the <em>P. fluorescens</em> species so these can be used as a template to determine the order of our R124 scaffolds. I initially tried blasting the scaffolds against a reference genome and plotting the density of blast hits. However when I <a href="http://friendfeed.com/michaelbarton/75219e91/r-density-plot-to-map-genome-scaffold-blastn">posted these results on FriendFeed</a> Max pointed out this plot was difficult to interpret and Rob Syme suggested using mummer.</p>
<p><a href="http://mummer.sourceforge.net/">Mummer</a> is a software package for aligning genomes and so I used the nucmer part of the package to compare the R124 scaffolds against the reference genomes of three other <em>P. fluorescens</em> strains. The plot below visualises each nucmer alignment match. This figure indicates the possible order of the scaffolds and also suggests that scaffold 5 (last row) does not appear in any of the reference genomes.</p>
<p><a href="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/scaffold_alignment1.png"><img alt="Nucmer Scaffold Alignment of R124 Scaffolds" class="aligncenter size-full wp-image-162" height="480" src="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/scaffold_alignment1.png" title="Nucmer Scaffold Alignment of R124 Scaffolds" width="480" /></a></p>
<p>I also visualised the nucmer results as a dotplot between the R124 scaffolds and the reference strains. This plot indicates the likely orientation of the scaffolds and also suggests possible rearrangements in scaffold 3 (purple) and scaffold 7 (yellow) versus the reference strains &#8211; a result which I find rather interesting.</p>
<p><a href="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/pairwise_alignment1.png"><img alt="" class="aligncenter size-full wp-image-161" height="480" src="http://www.michaelbarton.me.uk/wp-content/uploads/2010/03/pairwise_alignment1.png" title="Nucmer alignment of R124 scaffolds with reference genomes" width="480" /></a></p>
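<p>Reading order and orientation off alignment matches can also be done programmatically. The sketch below takes each scaffold&#8217;s single longest match to one reference, sorts scaffolds by where that match starts, and calls a scaffold reversed when its own coordinates run backwards along the match. The tuple layout and all coordinates are hypothetical, and using only the longest match is a simplification that ignores rearrangements and repeats.</p>

```python
def order_scaffolds(hits):
    """Order and orient scaffolds using each one's longest reference match.

    `hits` holds (scaffold, ref_start, ref_end, query_start, query_end)
    tuples. A query_start greater than query_end is taken to mean the
    scaffold aligns in reverse orientation (an assumed convention here).
    """
    best = {}
    for scaffold, ref_start, ref_end, q_start, q_end in hits:
        length = abs(ref_end - ref_start) + 1
        if scaffold not in best or length > best[scaffold][0]:
            strand = "+" if q_end >= q_start else "-"
            best[scaffold] = (length, min(ref_start, ref_end), strand)
    ordered = sorted(best.items(), key=lambda item: item[1][1])
    return [(name, strand) for name, (_, _, strand) in ordered]


# Made-up matches against a single reference genome.
hits = [
    ("scaffold_2", 900000, 1400000, 1, 500001),
    ("scaffold_1", 10000, 410000, 400001, 1),
    ("scaffold_2", 3000000, 3020000, 600000, 620000),
    ("scaffold_3", 2000000, 2600000, 1, 600001),
]
order = order_scaffolds(hits)
print(order)
```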
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/03/estimating-genome-scaffold-order-using-reference-genomes/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Find repeating N regions in a fasta file with bioruby</title>
		<link>http://www.michaelbarton.me.uk/research/2010/03/find-repeating-n-regions-in-a-fasta-file-with-bioruby/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/03/find-repeating-n-regions-in-a-fasta-file-with-bioruby/#comments</comments>
		<pubDate>Wed, 03 Mar 2010 10:00:47 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[methods]]></category>
		<category><![CDATA[bioruby]]></category>
		<category><![CDATA[contigs]]></category>
		<category><![CDATA[gaps]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/?p=151</guid>
		<description><![CDATA[Useful for designing primers to PCR across gaps in genome contigs]]></description>
			<content:encoded><![CDATA[<p>Useful for designing primers to PCR across gaps in genome contigs</p>
<p><script src="http://gist.github.com/320001.js?file=print_gap_regions.rb"></script></p>
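<p>The general approach can be sketched in plain Ruby without bioruby. This is a minimal sketch and not the code from the gist: the method names <code>parse_fasta</code> and <code>gap_regions</code> and the <code>min_length</code> parameter are assumptions made for illustration. It reports the 1-based start and end of every run of N characters in each sequence.</p>

```ruby
# Parse FASTA-formatted text into a hash of { identifier => sequence }.
# Hypothetical helper; bioruby could be used for this step instead.
def parse_fasta(text)
  sequences = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?('>')
      # Keep only the first word of the header as the identifier.
      current = line[1..-1].split.first
      sequences[current] = ''
    elsif current
      sequences[current] += line
    end
  end
  sequences
end

# Return [start, stop] pairs (1-based, inclusive) for each run of N or n
# that is at least min_length characters long.
def gap_regions(sequence, min_length = 1)
  regions = []
  sequence.scan(/n+/i) do
    match = Regexp.last_match
    next unless match[0].length >= min_length
    # match.end(0) is the 0-based exclusive end, i.e. the 1-based inclusive end.
    regions.push([match.begin(0) + 1, match.end(0)])
  end
  regions
end

# Toy input for illustration only, not real scaffold data.
fasta = ">scaffold_1\nACGTNNNN\nACGTNNAC\n"
parse_fasta(fasta).each do |id, seq|
  gap_regions(seq).each do |start, stop|
    puts [id, start, stop].join("\t")
  end
end
```

<p>Raising <code>min_length</code> skips short runs of N, which may be single ambiguous bases rather than real gaps between contigs.</p>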
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/03/find-repeating-n-regions-in-a-fasta-file-with-bioruby/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>First look at Pseudomonas fluorescens sequencing results</title>
		<link>http://www.michaelbarton.me.uk/research/2010/03/pseudomonas-fluorescens-sequencing-results/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/03/pseudomonas-fluorescens-sequencing-results/#comments</comments>
		<pubDate>Mon, 01 Mar 2010 10:00:07 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[results]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[fluorescens]]></category>
		<category><![CDATA[genome size]]></category>
		<category><![CDATA[sequencing]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/?p=131</guid>
		<description><![CDATA[We have received the 454 results for our two samples thanks to the University of Kentucky AGTC sequencing centre. The genomes sequenced were P. fluorescens isolates, R124 and KY485, cultured from two separate cave sites. The relationships of these cave strains to other P. fluorescens strains are shown below in the phylogenetic tree constructed from a [...]]]></description>
			<content:encoded><![CDATA[<p>We have received the 454 results for our two samples thanks to the <a href="http://www.uky.edu/Centers/AGTC/">University of Kentucky AGTC sequencing centre</a>. The genomes sequenced were <em>P. fluorescens</em> isolates, R124 and KY485, cultured <a href="http://wiki.cavescience.com/Research_Projects/Sequencing_Pseudomonas_cave_species/P._fluorescens_for_sequencing">from two separate cave sites</a>. The relationships of these cave strains to other <em>P. fluorescens</em> strains are shown below in the phylogenetic tree <a href="http://wiki.cavescience.com/Research_Projects/Sequencing_Pseudomonas_cave_species/16S_Phylogeny_of_P._fluorescens_isolates">constructed from a 16S ribosomal gene alignment</a>. This tree highlights the relationship of the cave strains we are sequencing to those already sequenced or being sequenced.</p>
<p><img alt="Pseudomonas fluorescens 16S phylogenetic tree" class="aligncenter size-full wp-image-130" height="556" src="http://www.michaelbarton.me.uk/wp-content/uploads/2010/02/tree.png" title="Pseudomonas fluorescens 16S phylogenetic tree" width="500" /></p>
<p>The current genomic scaffolds are available on github for both the <a href="http://github.com/michaelbarton/Pseudomonas-fluorescens-R124-genome">R124</a> and <a href="http://github.com/michaelbarton/Pseudomonas-fluorescens-KY485-genome">KY485</a> strains. I&#8217;m going to update the repositories as gaps are closed and the genomes annotated. So far, the initial sequencing results show that the genomes of both isolates are larger than we expected. The predicted genome size and coverage from the Roche GS De Novo Assembler (newbler) run for each strain are illustrated in the chart below (<a href="http://gist.github.com/317935">see here for the R code and data</a>). The figure includes the genome sizes of already sequenced <em>P. fluorescens</em> isolates as references.</p>
<p><img alt="Genomic coverage of Pseudomonas fluorescens sequencing" class="aligncenter size-full wp-image-134" src="http://www.michaelbarton.me.uk/wp-content/uploads/2010/02/genome_size.png" title="Genomic coverage of Pseudomonas fluorescens sequencing" width="520" /></p>
<h2 id="genome-size">Genome size</h2>
<p>The graph shows that both of our <em>P. fluorescens</em> genomes appear larger than the existing genomes. The R124 strain is predicted to be marginally larger, by ~0.3 Mbp, than the largest already sequenced <em>P. fluorescens</em> genome, while the KY485 strain is much larger, by &gt;4 Mbp. The sequence data is however relatively fresh, so we expect the estimated genome sizes to change as we try to generate a complete build. Furthermore, I believe the current data may contain sequences from plasmids, which would inflate the size estimates.</p>
<h2 id="sequencing-coverage">Sequencing coverage</h2>
<p>The unexpectedly large size of each genome resulted in less coverage than we hoped. The total genomic coverage in scaffolds is highlighted by the darker grey bars in the barchart above. The R124 assembly has a reasonable ~85% of the predicted genome at 22X coverage. However, we have only ~44% of the KY485 genome at 17X coverage &#8211; less than half the genome. This indicates that a large portion of the KY485 genome is still unknown.</p>
<h2 id="next-step">Next step</h2>
<p>Over the next few weeks we will be trying to bridge gaps in the smaller of the two genomes using PCR and traditional sequencing. I&#8217;ll also be trying to estimate the size of the gaps in each genome assembly using other <em>P. fluorescens</em> genomes as a reference, and to determine whether any differences in genomic GC content suggest the presence of plasmids in the sequencing data.</p>
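<p>A GC content comparison of this kind could start from something like the sketch below, which computes the GC fraction of each scaffold and lists those that differ from the pooled value. The method names and the 0.05 threshold are assumptions chosen for illustration, not values from this analysis, and a deviating scaffold is only a hint of a plasmid rather than evidence of one.</p>

```ruby
# Fraction of G and C bases among the unambiguous (A, C, G, T) bases.
# Returns nil when the sequence has no unambiguous bases (e.g. all N).
def gc_fraction(sequence)
  seq = sequence.upcase
  gc = seq.count('GC')
  total = gc + seq.count('AT')
  return nil if total.zero?
  gc.to_f / total
end

# Given { name => sequence }, return the names whose GC fraction differs
# from the pooled GC fraction of all sequences by more than threshold.
# The 0.05 default is an arbitrary illustrative value.
def gc_outliers(scaffolds, threshold = 0.05)
  pooled = gc_fraction(scaffolds.values.join)
  return [] if pooled.nil?
  scaffolds.select do |_name, seq|
    gc = gc_fraction(seq)
    next false if gc.nil?
    (gc - pooled).abs > threshold
  end.keys
end

# Toy sequences for illustration only, not real scaffold data.
toy = {
  'scaffold_a' => 'GCGCGCATAT' * 10,
  'scaffold_b' => 'GCGCGCATAT' * 10,
  'scaffold_c' => 'ATATATATGC'
}
puts gc_outliers(toy).inspect
```

<p>Pooling all bases weights the baseline by scaffold length, so one short scaffold with unusual composition does not shift the value it is compared against very far.</p>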
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/03/pseudomonas-fluorescens-sequencing-results/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Querying pubmed on the command line</title>
		<link>http://www.michaelbarton.me.uk/research/2010/02/querying-pubmed-on-the-command-line/</link>
		<comments>http://www.michaelbarton.me.uk/research/2010/02/querying-pubmed-on-the-command-line/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 10:00:14 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[methods]]></category>
		<category><![CDATA[bioruby]]></category>
		<category><![CDATA[boson]]></category>
		<category><![CDATA[pubmed]]></category>

		<guid isPermaLink="false">http://www.michaelbarton.me.uk/?p=122</guid>
		<description><![CDATA[I&#8217;ve had to read a lot of papers since I started my post-doc. Something that particularly bothers me is that it takes rather a lot of effort to get the pubmed record for interesting references in the paper I&#8217;m reading. I&#8217;m annoyed by the effort it takes to refine pubmed queries to find the specific [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve had to read a lot of papers since I started my post-doc. Something that particularly bothers me is that it takes rather a lot of effort to get the pubmed record for interesting references in the paper I&#8217;m reading. I&#8217;m annoyed by the effort it takes to refine pubmed queries to find the specific paper I want.</p>
<p>I&#8217;ve built a small command for <a href="http://github.com/cldwalker/boson">boson</a> that uses <a href="http://bioruby.open-bio.org/">bioruby</a> to query pubmed from the command line and makes refining by year, author, or journal relatively simple. You can see <a href="http://github.com/michaelbarton/pubmed-boson">the code on github</a> or watch the short video below to see how it works.</p>
<p><object width="480" height="295"><param name="movie" value="http://www.youtube.com/v/06BFmkAIr7s&#038;hl=en_US&#038;fs=1&#038;rel=0&#038;hd=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/06BFmkAIr7s&#038;hl=en_US&#038;fs=1&#038;rel=0&#038;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="295"></embed></object></p>
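<p>Refining a search like this amounts to appending field tags to the search term. The sketch below builds such a term with PubMed&#8217;s <code>[dp]</code> (publication date), <code>[au]</code> (author) and <code>[ta]</code> (journal abbreviation) tags. It is not the code from the boson command: the <code>pubmed_term</code> method and its option names are hypothetical, and sending the term to PubMed (for example through bioruby) is left out.</p>

```ruby
# Build a PubMed search term from free text plus optional refinements.
# pubmed_term and its keyword names are hypothetical, for illustration only.
def pubmed_term(text, year: nil, author: nil, journal: nil)
  # Parenthesise the free text so any OR inside it stays grouped.
  parts = ["(#{text})"]
  parts.push("#{year}[dp]") if year
  parts.push("#{author}[au]") if author
  parts.push("#{journal}[ta]") if journal
  parts.join(' AND ')
end

# Placeholder search values, not a real reference.
puts pubmed_term('pseudomonas fluorescens genome', year: 2009, author: 'Smith')
```

<p>Keeping the term builder separate from the network call makes it easy to check the query string before anything is sent.</p>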
]]></content:encoded>
			<wfw:commentRss>http://www.michaelbarton.me.uk/research/2010/02/querying-pubmed-on-the-command-line/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>


