May 31, 2008

Motivations for getting involved in open research

Project involvement

In a recent post Cameron wrote about an idea for an open research project based on Solexa sequencing. I wanted to add some of my own thoughts on motivations for getting involved in open research projects. I had originally written this as a comment on Cameron’s post, but it grew so large that I thought it better to post it as a trackback instead.

The reason why we haven’t had great successes on this thus far is fundamentally down to the size of the network we have in place and the bias in the expertise of that network towards specific areas.

I think another reason there hasn’t been much success is that there is little motivation to get involved in a large collaborative research project. Take open source software, such as Apache, as a comparison. I think Apache is successful because there is a financial incentive for companies to get involved. IBM will pay its engineers to maintain Apache because it runs their websites, and therefore it makes financial sense. In science, however, I might be interested in research project X, and think it’s very cool, but what am I going to get out of getting involved? Unless I will be first or last author, I would be better off working on getting my own work published. This might be considered selfish, but in future when I apply for a job, I will be judged on my record of first/last author papers. Being on a paper in the middle of many other authors isn’t as useful, though arguably these large multi-author papers are where the greatest research is done. This is my opinion at an early stage in my career, but I’d be interested in contrasting opinions on this.

Putting these two together one obvious solution is to find a problem that is well suited to the people who are around, may be of interest to them, and is also quite useful to solve.

I think one aspect of a successful open research project is that there is a core team committed to driving the work, but at the same time it is easy for other people to contribute. To use another open source software project as an example, Ubuntu has a large community base, but what drives the project forward is a core team of Canonical developers that are always pushing towards the next goal. I believe a similar principle is what will work for ONS. You are a researcher working on your problem, and you will be the one moving towards writing a manuscript. At the same time you use web 2.0 tools so people can contribute comments and ideas. They benefit from early access to your data, and you benefit from their feedback. Using FlyWiki as an example, the core researchers had a commitment to getting the work published, but at the same time they used the wiki to make the data available in the interim period. Other researchers could then get early access to the data and feed it into their own work.

Applying this to Solexa sequencing: have a core team of engineers with an investment in getting the project working, but at the same time try to open everything up to the community as much as possible. “Here is where we are going. Here’s what we have. Do you have any ideas? Can you contribute?” Use Twitter streams, friendfeed, lighthouse, github, Google groups etc. Not everything will work, but the lower the barrier to contribute, the more people will.

April 29, 2008

Reflection on a year of (attempted) open notebook science

A year of work on the importance of amino acid biosynthetic cost has led to the submission of a manuscript, and a preprint available on Nature Precedings. The openness in this project was inspired by reading Jean-Claude Bradley’s and Cameron Neylon’s blogs about open notebook science. I already believed in the philosophy behind open source software, and I thought that any early feedback would be useful to my research. In addition to any input received, I thought that sharing my research early would in turn be a useful contribution back to the community.

The platform I chose was a blog, allowing results to be posted as I produce them. I was already familiar with blogging, and Wordpress makes creating and maintaining a blog simple. During the early stages of my project I found it quite useful to blog, as it helped me to clarify my results and ideas while the project was still taking shape. I tried to do this about once a week, on a Friday, and summarise my latest results. Having this record of results was also helpful to refer to when discussing my latest findings. When we were writing the manuscript I also found it useful to browse back through all the entries I had created and include any ideas I had forgotten about. However, as the project progressed blogging became less important, as I had already produced my main findings and was more focused on writing the manuscript.

As for sharing information, I found that writing a blog post summarising my research takes rather a large amount of effort. Furthermore my blog is the only gateway to my research, and results only become available when I make the time and effort to write them up. This therefore doesn’t satisfy Jean-Claude Bradley’s criterion of no insider knowledge, but rather could be described as being selectively open about my research. On the positive side a blog post is a concise summary that distills my most recent progress in a way I hope is easily accessible to a casual reader. Another interesting point is that posting all my results online meant they were indexed by Google, as you would expect, but this also led to some strange occurrences when searching online for material. For example searching for “Akashi & Gojobori”, a paper I based my work on, brings up two links to my blog ahead of the original manuscript. I find this a bit embarrassing, and I wonder if the paper authors have also encountered this?

With less time to spend on blogging, I also tried to stream my research using Twitter, sending short messages automatically using a bash script every time I committed an SVN update. While this approach takes a lot less effort on my part, I think it is at the opposite end of the spectrum to blogging: it spews out large amounts of obscure repository check-in messages. Ultimately I think it is of little interest even to someone directly involved in the project.
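To give an idea of what such a stream contains, here is a minimal sketch of turning a commit message into a short update. It is written in Python rather than the bash I actually used, and it only formats the text; it does not talk to SVN or Twitter. The function name commit_to_tweet and the "r42: " prefix style are assumptions made for illustration, not a record of my script.

```python
TWEET_LIMIT = 140  # Twitter's message length limit at the time of writing


def commit_to_tweet(revision: int, message: str, limit: int = TWEET_LIMIT) -> str:
    """Format an SVN revision number and log message as one short update.

    Hypothetical helper: the prefix format is an assumption for illustration.
    """
    prefix = f"r{revision}: "
    # Collapse newlines and runs of whitespace into single spaces.
    text = " ".join(message.split())
    room = limit - len(prefix)
    if len(text) > room:
        # Truncate and mark the cut so the whole update fits within the limit.
        text = text[: room - 3].rstrip() + "..."
    return prefix + text


print(commit_to_tweet(42, "Fix axis labels\non cost plot"))
```

Even tidied up like this, each message only makes sense to someone who already knows the repository, which is exactly the problem.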

I’m still interested in open notebook science, though my lack of posting might indicate otherwise. I’m going to continue trying out new methods of sharing bioinformatics research, and the start of a new research project gives me the chance to start afresh in these approaches. My main focus should be passive approaches that build into my workflow without too much effort, but also produce a meaningful summary of the research. Therefore in addition to a blog I think it is important to maintain a summary page of the research, otherwise it may be difficult for people to understand what the point of my research is when they first come across my blog. I think this is similar to the combined wiki and blog format used by Jean-Claude Bradley. Having spent some time thinking about how I could implement this, I think a landing page should be automatically generated from the results. In my head I’m thinking of a Ruby on Rails type of approach, with a templating library such as HAML and a series of Rake tasks to regenerate the landing page with any new results, as well as send out a Twitter update.
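As a rough illustration of the landing page idea, here is a small sketch that regenerates a plain-text summary page from a list of results. It uses Python's standard string.Template instead of the HAML and Rake setup I have in mind, and the page layout, the function name render_landing_page and the example entries are all placeholders rather than real project content.

```python
from string import Template

# Hypothetical page layout: a title, a one-paragraph summary, then the results.
PAGE = Template("""$title

$summary

Latest results:
$results
""")


def render_landing_page(title, summary, results):
    """Build the landing page text from (date, description) result pairs."""
    lines = "\n".join(f"- {date}: {text}" for date, text in results)
    return PAGE.substitute(title=title, summary=summary, results=lines)


# Placeholder entries, for illustration only.
page = render_landing_page(
    "Project title",
    "One-paragraph summary of what the research is about.",
    [("date 1", "first result"), ("date 2", "second result")],
)
print(page)
```

The point is that the page is rebuilt from the results themselves, so keeping it current costs no extra writing effort.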

Finally I thought it might be interesting to adopt version numbers for the project, similar to those used in software development. The usual layout is something like 1.2.3. The last number would be used to track simple code edits. The second number would be used to show milestones in the overall project, for example each could correspond to a figure. The first number would then be the manuscript revision. Every time a new manuscript is prepared for submission, this could be updated, where the first manuscript preparation would have the number 1.0.0. Hopefully this type of numbering would make the project easier to track, and interested parties could see if the research has been updated significantly since they last checked.
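A minimal sketch of how this numbering could be tracked, in Python. The class and method names are hypothetical, and resetting the lower numbers to zero when a higher one changes is an assumption borrowed from the usual software convention rather than something I have settled on.

```python
from typing import NamedTuple


class ProjectVersion(NamedTuple):
    manuscript: int  # first number: manuscript revision
    milestone: int   # second number: project milestone, e.g. a figure
    edit: int        # third number: simple code edit

    def __str__(self) -> str:
        return f"{self.manuscript}.{self.milestone}.{self.edit}"

    def bump_edit(self) -> "ProjectVersion":
        return ProjectVersion(self.manuscript, self.milestone, self.edit + 1)

    def bump_milestone(self) -> "ProjectVersion":
        # Assumption: a new milestone resets the edit counter.
        return ProjectVersion(self.manuscript, self.milestone + 1, 0)

    def bump_manuscript(self) -> "ProjectVersion":
        # Assumption: a new manuscript revision resets both lower counters.
        return ProjectVersion(self.manuscript + 1, 0, 0)


first_submission = ProjectVersion(1, 0, 0)
print(first_submission.bump_edit().bump_milestone())
```

Because the version is a tuple, two versions compare in the expected order, so a reader could check whether the project has moved on since their last visit.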

In summary, open notebook science has not really had a large positive effect on my research. I think that this is mainly because using a blog alone is not an effective method of communicating scientific progress: first, it requires substantial effort on my part to update, and second, tracking the current state of the research can be difficult. However, I still believe that the principles of open notebook science can be beneficial to my research. In the next couple of months I’ll try some new methods to see what does work.

February 1, 2008

A short essay on Open Notebook Science

As you might expect from the name, Open Notebook Science (ONS) has similarities with Open Source Software. The clearest likeness between the two is the belief that by sharing and collaborating, more can be achieved than through secrecy and competition. An open approach to software development has proven successful: the greatest achievement is the development and increasing adoption of the Linux operating system. On this foundation other applications like the Apache web server, MySQL database, and the PHP scripting language have been built, and the combination of the four is the engine running many websites, including this one. If ONS can enjoy a fraction of the success open software does, then science can only benefit.

ONS didn’t occur spontaneously, but is a step in the liberalisation of science by the freedom that the Web allows. An early example is the arXiv.org server, started in 1991 as a repository for the physics community to share manuscripts prior to publication. 17 years later it contains ~450,000 articles. Another, often overlooked, example of openness is the free biological databases, such as EMBL and GenBank, which allow unrestricted access to the genomes of all sequenced organisms. More recently, many journals have been adopting open access policies, whereby all research is freely available upon publication, where previously the reader had to pay a fee. Increasingly, research funding bodies such as NIH and BBSRC are also stipulating, as a condition of funding, that any articles resulting from the research are freely available at least 6 months after publication.

When you work in science, many ideas come from reading papers, attending talks, and speaking to colleagues in the pub. So I think it’s fair to say that we will profit from further sharing on websites, such as blogs and wikis, and the more everybody is open, the more the community benefits as a whole. Of course, being open raises questions about how scientists can still be recognised for their work, as well as how research can be commercialised. Most importantly, peer-review is still the best arbiter of research quality, and raw results must be viewed with this in mind.

One of the earliest adopters of complete openness is Jean-Claude Bradley, whose own and students’ laboratory notebooks are stored on a wiki and are freely available for anyone to read, updated as results are being produced. Jean-Claude also first discussed the term “Open Notebook” in relation to this, when he defined it as the researcher’s notebook being open to the world, so that there is no insider knowledge. Following Jean-Claude’s example, a small but growing number of researchers are using blogs, wikis, and project management systems to make their research available. Examples of people using blogs to share research are Cameron Neylon and Rosie Redfield, whose research groups use blogs either as the primary lab book or as a forum for describing and discussing results. In addition to Jean-Claude Bradley, other projects using wikis for ONS are 1CellPK and Maldi. Pedro Beltrao and Jeremiah Faith use software management systems, since many tools useful for tracking software development are also applicable to bioinformatics research.

There are questions that this kind of openness generates. For instance, what do the journals think about publishing research that has already appeared on a blog? For most journals informally posting your research online is considered in the same light as giving a talk at a conference. A few exceptions exist, such as Cell and Lancet, but on the whole publishers like NPG, BMC and PLoS are happy with this kind of sharing, though it is always worth checking. Another question worth asking is what your University’s policy towards intellectual property is: does it belong to the researcher or the institution? This leads into another point. In the last few years researchers have been increasingly expected to consider how their work can be commercialised, but any work disclosed on a blog or wiki cannot be patented, which should be borne in mind when you post new ideas or methods. Finally, there is common sense: how do your collaborators feel about early sharing of research? Or could the work you’re posting online be considered politically sensitive, for example because it involves animals or embryonic stem cells?

If after reading this, and looking at ONS researcher websites, you think that ONS can be useful to you, where do you begin? In my experience a blog is a safe and easy place to start. You can discuss other people’s research, and if you feel confident you could begin to mention the results you’ve been producing. Then depending on how you feel, move towards making your notebook entirely open, using a wiki. Services like wordpress.com and blogger.com offer easy blog creation for free, while wikispaces.com can be used to create a wiki. At the moment there is no single standard application for ONS, so a good idea is to experiment and see what suits you and your research.

So what is the future for Open Notebook Science? At present, proposals have been created for an ONS network, and a session at PSB. There is a small but increasing number of scientists who are adopting open practices in their research, while a further few follow the mantra of “no insider information” and are completely open. Returning to my point at the start of this article, the creator of Linux said, when talking about open source software, “Many eyes make all problems shallow”. If Open Notebook Science can benefit from similar principles, this will be to the advantage of the individual as well as the community as a whole.

January 18, 2008

Open Notebook Science software, reviews, and meetings

I am still revising the draft of my first manuscript, but at the same time planning my next research step as the finishing touches are applied to the current one.

Cost as an evolutionary pressure

My recent research looked at whether the expression of a gene negatively correlates with the metabolic cost of the encoded protein. The next project I’m considering is to explore whether cost is also important in amino acid substitution rates. The hypothesis is that the use of expensive amino acids is selected against unless there is an overriding functional pressure for the amino acid at a particular site. I assume that a functional requirement at a specific position will outweigh any benefits that could be gained from using a cheaper amino acid. Therefore expensive amino acids should be more conserved across orthologs, and site-specific substitution rates should decrease with cost.

I was impressed with how Pedro used Google Code as a platform for Open Notebook Science. What interested me in particular about Google Code is the issue tracking system. I think this feature allows people to contribute ideas and suggestions more easily than a blog does. Anyone can raise an issue with the research, which can then be tracked and dealt with, very much in the same way as in software development. Google Code also has the benefit of an SVN repository, which will make it easier to share code and data with other researchers.

Until now, I’ve been sharing my research via this blog and using Google Docs, but I think that using Google Code for this project will extend the opportunities to be open, and foster collaboration.

Bio::Blogs

Pedro has suggested that the next issue of Bio::Blogs focus on Open Science. I’m currently planning an essay to act as a round up of the current state of Open Notebook Science for this edition. The aim is to give a brief overview of what Open Notebook Science is, how it can be carried out, and some of the associated issues. Open Notebook Science already has a strong community, so any ideas or suggestions are very welcome. The outline I have so far is on Google Docs.

Open Notebook Science Meeting

Shirley Wu recently suggested holding an Open Notebook Science session at the Pacific Symposium on Biocomputing. Cameron Neylon has replied to this suggestion, and ONS meetings had also been proposed in the ONS Network grant application. I think this is a great idea. ONS is currently moving forward through discussions across blogs, which is relatively slow. A meeting will mean that people can meet face-to-face to discuss ONS issues and look at the direction ONS is taking, and ideas can be collaboratively reviewed and acted on much faster than over the web.

However, I am worried about the financial barriers to attending a conference like this. For example as a PhD student I will find it difficult to pay for travel to a meeting in Hawaii, as much as I would love to go there. The converse also applies to someone outside Europe attending a conference in the UK. In the spirit of ONS, where anyone who wants to take part is encouraged to, the same should apply to a meeting, where anyone who wants to attend can.

Of course in principle this sounds great, but how can this be done? Each continent could have a specific meeting for their region, which reduces travel costs. However this option suffers from splitting the community, and multiplying the cost of organising a single conference into possibly five different ones.

Another option is industrial partnership, where sponsor money can be used to fund travel and organisation costs. An Open Notebook Science meeting is attractive to companies such as Google and Nature whose business interests lie in predicting where the intersection of science and the world wide web is heading. I also think that the inclusion of people from journals and business at a meeting will be beneficial to the development of Open Notebook Science, as collaboration across industry, publishers and academia will increase participation in Open Notebook Science across the scientific community. A more inclusive meeting has the potential to interest third parties in the promise of a more dynamic and collaborative research model, very much in the same way that open source software, such as Apache and Linux, has benefitted and continues to benefit from the software industry.