December 1, 2007

Cleaning my dirty laundry in public: errors in my data

Recently, I wrote a post on Bioinformatics Zen about the importance of testing your code and data to confirm you are producing what you think you are. This is something I had thought about for a while, so I decided to test some of the work I have produced so far.

As part of my analysis of the effects of atomic and energetic cost in gene expression, I collected several data points for each gene: atomic composition, cost requirement for synthesis, codon adaptation index, and protein and transcript expression. These are all stored as tables in my database and are joined together with an SQL statement for the statistical analyses. Running the query, it appeared I was producing the data in the format I wanted, but I decided to test this to make sure.

I picked two proteins at random and, using their sequences, manually calculated all the characteristics that I had previously calculated computationally. A short Ruby script compared the two sets of values with each other. I was hoping that everything was as I expected, but this was not the case.
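A comparison like this can be sketched in a few lines of Ruby. This is only an illustration of the idea, not the script I actually ran: the `mismatches` method, the field names, and all the numbers below are made up for the example.

```ruby
# Report every field where a hand-calculated value disagrees with the
# value produced by the pipeline. All names and numbers are hypothetical.
TOLERANCE = 1e-6

def mismatches(expected, actual, tolerance = TOLERANCE)
  expected.each_with_object({}) do |(key, want), diff|
    got = actual[key]
    # A missing value counts as a mismatch, as does a numeric difference
    # larger than the tolerance.
    if got.nil? || (want - got).abs > tolerance
      diff[key] = { expected: want, actual: got }
    end
  end
end

manual   = { length: 120, carbon_per_residue: 2.85, cost_per_residue: 23.4 }
computed = { length: 121, carbon_per_residue: 2.85, cost_per_residue: 23.4 }

mismatches(manual, computed).each do |name, values|
  puts "#{name}: expected #{values[:expected]}, got #{values[:actual]}"
end
```

An empty result means the two sets of values agree within the tolerance; anything else is worth digging into.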

Carbon content of amino acids

I had incorrectly counted the number of carbon atoms in the side chains of glutamine and lysine. In particular, I overestimated the number in lysine by 4 carbon atoms. These two mistakes meant that all the protein cost estimates based on amino acid side chain carbon content were wrong.

Stupidity: High – all I had to do was count the number of times the letter ‘C’ appears in each amino acid side chain.
Impact: Low – all proteins would have been affected by similar amounts, but the errors would still have altered the correlations between carbon content and expression.
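Counting by eye is where I went wrong, so here is a minimal sketch of letting code do the counting instead. It assumes each side chain is written as a simple formula without brackets; the `SIDE_CHAINS` table and the `atom_counts` method are names invented for this example, and only the two residues I got wrong are included.

```ruby
# Side chain formulas for the two residues mentioned above,
# written out as simple formulas with no brackets.
SIDE_CHAINS = {
  "Q" => "CH2CH2CONH2",     # glutamine
  "K" => "CH2CH2CH2CH2NH2"  # lysine
}.freeze

# Tally the atoms of each element in a formula such as "CH2CH2CONH2".
# An element symbol with no number after it counts as one atom.
def atom_counts(formula)
  counts = Hash.new(0)
  formula.scan(/([A-Z][a-z]?)(\d*)/) do |element, number|
    counts[element] += number.empty? ? 1 : number.to_i
  end
  counts
end

SIDE_CHAINS.each do |residue, formula|
  puts "#{residue}: #{atom_counts(formula)['C']} carbon atoms"
end
```

This gives 3 side chain carbons for glutamine and 4 for lysine.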

Incorrect protein sequence length

In my analysis I average each cost type over protein length to give a per-residue cost. My testing, however, showed that these calculations were wrong. After a little digging around I found that the error lay in translating the DNA sequence into protein: each translated stop codon appends a ‘*’ to the protein sequence. All the protein lengths were therefore one residue too long, and so every calculated mean was wrong.

Stupidity: Low – I would never have thought about this unless testing had spotted it.
Impact: Low – the effect is greater for small proteins, but most proteins are quite long.
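The fix amounts to dropping the trailing ‘*’ before measuring the length. A minimal sketch, assuming the translated protein is held as a plain string (the method names here are invented for the example):

```ruby
# Number of residues in a translated protein, ignoring a trailing '*'
# left behind by the stop codon.
def residue_count(protein)
  protein.chomp("*").length
end

# Average a total cost over the true number of residues.
def per_residue(total_cost, protein)
  total_cost / residue_count(protein).to_f
end

puts residue_count("MKV*")      # 3, not 4
puts per_residue(30.0, "MKV*")  # 10.0
```

Using `chomp("*")` rather than removing every ‘*’ means a stop codon in the middle of a sequence would still show up in the length, which is itself a useful warning sign.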

Incorrect summing of costs over each protein

For each protein in my database I counted the number of each residue in the sequence. To calculate the total cost of a protein, I multiplied the frequency of each residue by the residue’s predicted cost and summed the results over the protein. Or at least that’s what I thought I was doing. An error in the SQL query meant that the sum was not being performed correctly, and the total cost of each protein came out anywhere between 2 and 10 times the correct amount.

Stupidity: Low – The SQL GROUP BY clause is tricky and easy to make mistakes with.
Impact: High – Each protein’s cost was incorrectly calculated by varying amounts.
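One way to catch this kind of error is to recompute the total outside the database and compare it with what the query returns. The sketch below does the frequency-times-cost sum directly from a sequence. The `COSTS` values are placeholders rather than my real cost estimates, and `total_cost` is a name made up for the example.

```ruby
# Placeholder per-residue costs; real values would come from the cost table.
COSTS = { "M" => 3.0, "K" => 2.0, "V" => 1.5 }.freeze

# Total cost of a protein: the frequency of each residue multiplied by
# that residue's cost, summed over the whole protein.
def total_cost(protein, costs)
  frequencies = protein.chomp("*").each_char.each_with_object(Hash.new(0)) do |residue, freq|
    freq[residue] += 1
  end
  # fetch raises a KeyError for a residue with no known cost,
  # rather than silently treating it as free.
  frequencies.sum { |residue, count| count * costs.fetch(residue) }
end

puts total_cost("MKKV*", COSTS)  # 3.0 + 2 * 2.0 + 1.5 = 8.5
```

If the figure from the SQL query is several times larger than this for the same protein, the grouping in the query is the first place to look.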

All these errors are fixed, and I feel a lot more confident in the conclusions I’m making. As well as fixing some mistakes, I hope this also serves to highlight the importance of testing in bioinformatics research.

  • I think it is great that you are doing this. Reporting and discussing errors openly and as soon as possible makes science a lot more efficient.