Thursday, 7 April 2011

Writing code for a big scientific collaboration

One of the striking things about scientific software is the range of different contexts in which it’s needed.  Scientists need quick-and-dirty scripts to process their data and plot their results; they need prototypes so that they can experiment with new statistical techniques; and they sometimes need to build new software tools that they’ll use again and again in their research.  While a lot of this will be for their own personal research, sometimes the Scientist-Programmer finds themselves developing software as part of a large scientific collaboration.  This has some particular requirements.

Who are you writing for?
The key difference when developing software for a large collaboration is that you are now writing for other people.  Others will be using the software you make, so it has to be more user-friendly than if you were the only user.  Remember, they will most likely have spent a lot less time thinking about the details of the task than you have, so they won’t have your level of familiarity and expertise.

You may also be writing code that other programmers will need to work with.  So you need to do a good job!  We think you should hold your code to a high standard in any event, but it’s particularly important if other people are going to need to work with it.  So try to think about the craft of coding.  We can suggest some principles of good coding (here and here), and we certainly think you should be using literate programming.  And if you end up using other people’s code as well, you might need some help in surviving legacy code (which also tells you why other people will appreciate you writing high quality code).

What are you making?
There are a number of things you might be making for the collaboration.  These include the following.

    * data processing module (for a reduction pipeline)
    * tools for exploration and visualisation of data/results
    * statistical tools (automated and/or interactive)
    * simulation software (for example, to produce synthetic data or simulated results)
    * software that’s required in order to run the experiment in the first place
    * databases and interfaces to databases

Be professional!
We hope you do this anyway (!), but it’s extra-important to work and behave in a professional way when you’re developing software for a collaboration.  By holding yourself to good professional standards, as well as acting with professionalism in your dealings with your collaborators, you’ll not only produce good code, you’ll get people using it and giving you feedback.  Remember that your users will be very happy if you’re giving them good software tools and providing prompt and courteous technical support and bug fixes.  This is not only a great way to get a good reputation (which is very important in science, as in many other fields), but all your efforts in this regard help people in the project do more good science, more quickly.  So your simple professionalism will contribute directly to the success of the project.

The science bit…
If you’re writing code for a big scientific collaboration, you’re probably also interested in the science itself (and probably the specific science outcomes of the project).  This is a good thing, because it means you’ll have a greater understanding of what’s required from your code.  For data processing pipelines, you’ll have a good understanding of the characteristics of the data (you may even have worked on the hardware taking the measurements in the first place).  If your code produces analysis results, you’ll have an understanding of what sorts of results are sensible/stupid and how best to present the results.  And if you’re implementing statistical models, you’ll know what kinds of model are sensible and what prior knowledge it’s reasonable to assume.
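
As a small illustration of putting that domain knowledge into code, here is a minimal sketch of a sanity check that a pipeline step might run on incoming measurements.  The function name, the example readings and the limits are all made up for illustration; in a real project the limits would come from what you know about the instrument.

```python
import math


def check_readings(readings, lower, upper):
    """Split readings into (good, rejected) using a plausible physical range.

    Hypothetical helper: `lower` and `upper` stand in for limits you would
    choose from your own knowledge of the instrument and the science.
    """
    good, rejected = [], []
    for value in readings:
        # Reject NaN/inf as well as values outside the plausible range.
        if math.isfinite(value) and lower <= value <= upper:
            good.append(value)
        else:
            rejected.append(value)
    return good, rejected


# Invented example data: detector temperatures in kelvin.
good, rejected = check_readings(
    [4.2, 4.3, float("nan"), -1.0, 4.1], lower=0.0, upper=10.0
)
print(f"kept {len(good)} readings, rejected {len(rejected)}")
```

Keeping the rejected values, rather than silently dropping them, gives your collaborators something to inspect when a result looks odd.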

In conclusion
While it can take a lot of effort to produce good, robust code for a big collaboration, it’s also a great opportunity to be at the heart of the project.  It’s usually the case that the data are vital to the project, so that building the tools that are used to process/explore the data puts you in a great, central position in the project.  And it means that the good work that you do can have a direct impact on how successful the science is.
