Free E Books: Proggraming Articles

Thursday, 7 April 2011

Writing code for a big scientific collaboration

One of the striking things about scientific software is the range of different contexts in which it’s needed. Scientists need quick-and-dirty scripts to process their data and plot their results; they need prototypes so that they can experiment with new statistical techniques; and they sometimes need to build new software tools that they’ll use again and again in their research. While a lot of this will be for their own personal research, sometimes the Scientist-Programmer finds themselves developing software as part of a large scientific collaboration. This has some particular requirements.

Who are you writing for?
The key difference when developing software for a large collaboration is that you are now writing for other people. Others will be using the software you make, so it has to be more user-friendly than if you were the only user. Remember, they will most likely have spent a lot less time thinking about the details of the task than you have, so they won’t have your level of familiarity and expertise.

You may also be writing code that other programmers will need to work with. So you need to do a good job! We think you should hold your code to a high standard in any event, but it’s particularly important if other people are going to need to work with it. So try to think about the craft of coding. We can suggest some principles of good coding (here and here), and we certainly think you should be using literate programming. And if you end up using other people’s code as well, you might need some help in surviving legacy code (which also tells you why other people will appreciate you writing high quality code).

What are you making?
There are a number of things you might be making for the collaboration. These include the following.

    * data processing modules (for a reduction pipeline)
    * tools for exploration and visualisation of data/results
    * statistical tools (automated and/or interactive)
    * simulation software (for example, to produce synthetic data or simulated results)
    * software that’s required in order to run the experiment in the first place
    * databases and interfaces to databases

Be professional!
We hope you do this anyway (!), but it’s extra-important to work and behave in a professional way when you’re developing software for a collaboration. By holding yourself to good professional standards, as well as acting with professionalism in your dealings with your collaborators, you’ll not only produce good code, you’ll get people using it and giving you feedback. Remember that your users will be very happy if you’re giving them good software tools and providing prompt and courteous technical support and bug fixes. This is not only a great way to get a good reputation (which is very important in science, as in many other fields), but all your efforts in this regard help people in the project do more good science, more quickly. So your simple professionalism will contribute directly to the success of the project.

The science bit…
If you’re writing code for a big scientific collaboration, you’re probably also interested in the science itself (and probably the specific science outcomes of the project). This is a good thing, because it means you’ll have a greater understanding of what’s required from your code. For data processing pipelines, you’ll have a good understanding of the characteristics of the data (you may even have worked on the hardware taking the measurements in the first place). If your code produces analysis results, you’ll have an understanding of what sorts of results are sensible/stupid and how best to present the results. And if you’re implementing statistical models, you’ll know what kinds of model are sensible and what prior knowledge it’s reasonable to assume.

In conclusion
While it can take a lot of effort to produce good, robust code for a big collaboration, it’s also a great opportunity to be at the heart of the project. It’s usually the case that the data are vital to the project, so that building the tools that are used to process/explore the data puts you in a great, central position in the project. And it means that the good work that you do can have a direct impact on how successful the science is.

Generics 101, Part 3: Exploring Generics Through a Generic Stack Type

Java 2 Standard Edition 5.0 introduced generics to Java developers. Since their inclusion in the Java language, generics have proven to be controversial. In the last of his three-part series, Jeff Friesen introduces you to the need for generic methods, and looks at how generics are implemented to explain why you couldn’t assign new E[size] to elements.

Generics are language features that many developers have difficulty grasping. Removing this difficulty is the focus of this three-part series on generics.

Part 1 introduced generics by explaining what they are with an emphasis on generic types and parameterized types. It also explained the rationale for bringing generics to Java.

Part 2 dug deeper into generics by showing you how to codify a generic Stack type, and by exploring unbounded and bounded type parameters, type parameter scope, and wildcard arguments in the context of Stack.

This article continues from where Part 2 left off by focusing on generic methods as it explores several versions of a copy() method for copying one collection to another.

Also, this article digs into the topic of arrays and generics, which explains why you could not assign new E[size] to elements in Listing 1’s Stack type – see Part 2.

Finally, to reinforce your understanding of the material presented in all three parts of this series, this article closes with an exercises section of questions to answer.

Generic Copy Method

Suppose you want to create a method for copying one collection (perhaps a set or a list) to another collection. Your first impulse might be to create a void copy(Collection<Object> src, Collection<Object> dest) method. However, such a method's usefulness would be limited because it could only copy collections whose element types are Object; collections of Strings couldn't be copied, for example.

If you want to pass source and destination collections whose elements are of arbitrary type (but their element types agree), you need to specify the wildcard character as a placeholder for that type. For example, the following code fragment reveals a copy() method that accepts collections of arbitrary-typed objects as its arguments:

public static void copy(Collection<?> src, Collection<?> dest)
{
   Iterator<?> iter = src.iterator();
   while (iter.hasNext())
      dest.add(iter.next());
}

Although this method's parameter list is now correct, there is a problem, and the compiler outputs an add(capture#469 of ?) in java.util.Collection<capture#469 of ?> cannot be applied to (java.lang.Object) error message when it encounters dest.add(iter.next());.

This error message appears to be incomprehensible, but basically means that the dest.add(iter.next()); method call violates type safety. Because ? implies that any type of object can serve as a collection's element type, it's possible that the destination collection's element type is incompatible with the source collection's element type.

For example, suppose you create a List of String as the source collection and a Set of Integer as the destination collection. Attempting to add the source collection’s String elements to the destination collection, which expects Integers, violates type safety. If this copy operation were allowed, a ClassCastException would be thrown when trying to obtain the destination collection's elements.
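The failure described above can be reproduced by deliberately stepping around the compiler with raw types. The following is a minimal sketch, not part of the original article: the class name UnsafeCopyDemo and its methods are hypothetical, and a HashSet is assumed as the destination so that the bad element is only detected when it is read back.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; shows what the compiler's error is protecting against.
class UnsafeCopyDemo
{
   // Raw types switch off generic type checking; the compiler only warns.
   @SuppressWarnings({"rawtypes", "unchecked"})
   static void unsafeCopy(Collection src, Collection dest)
   {
      for (Object element : src)
         dest.add(element);
   }

   // Returns true if reading the polluted set fails with ClassCastException.
   static boolean readingFails()
   {
      List<String> names = new ArrayList<String>();
      names.add("Mercury");
      Set<Integer> numbers = new HashSet<Integer>();
      unsafeCopy(names, numbers); // succeeds: no element type check at run time
      try
      {
         int total = 0;
         for (Integer n : numbers) // the implicit cast to Integer fails here
            total += n;
         return false;
      }
      catch (ClassCastException cce)
      {
         return true;
      }
   }

   public static void main(String[] args)
   {
      System.out.println("ClassCastException on read: " + readingFails());
   }
}
```

Note that the copy itself succeeds; the exception only surfaces later, when the destination's elements are obtained, which is what makes this kind of bug hard to trace.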

You could avoid this problem by specifying void copy(Collection<String> src, Collection<String> dest), but this method header limits you to copying only collections of String. Alternatively, you might restrict the wildcard argument, which is demonstrated in the following code fragment:

public static void copy(Collection<? extends String> src,
                        Collection<? super String> dest)
{
   Iterator<? extends String> iter = src.iterator();
   while (iter.hasNext())
      dest.add(iter.next());
}

This code fragment demonstrates a feature of the wildcard argument: You can supply an upper bound or (unlike with a type parameter) a lower bound to limit the types that can be passed as actual type arguments to the generic type. Specify an upper bound via extends followed by the upper bound type after the ?, and a lower bound via super followed by the lower bound type after the ?.

You interpret ? extends String to mean that any actual type argument that is String or a subclass can be passed, and you interpret ? super String to imply that any actual type argument that is String or a superclass can be passed. Because String cannot be subclassed, this means that you can only pass source collections of String and destination collections of String or Object.
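Because String is final, the effect of the two bounds is easier to see with a type that does have subclasses. The sketch below is an illustration only: it swaps String for Number (an assumption, not the article's code) and uses a hypothetical BoundedCopy class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical demo class; Number stands in for String purely for illustration.
class BoundedCopy
{
   // src may hold Number or any subclass; dest may hold Number or any superclass.
   static void copy(Collection<? extends Number> src,
                    Collection<? super Number> dest)
   {
      for (Number n : src)
         dest.add(n);
   }

   public static void main(String[] args)
   {
      List<Integer> ints = Arrays.asList(1, 2, 3);
      List<Double> doubles = Arrays.asList(1.5, 2.5);
      List<Object> objects = new ArrayList<Object>();
      copy(ints, objects);    // Integer is a subclass of Number
      copy(doubles, objects); // Double is a subclass of Number
      System.out.println(objects.size()); // prints 5
      // copy(ints, new ArrayList<Integer>()); // would not compile:
      //                                       // Integer is not a supertype of Number
   }
}
```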

We still haven't solved the problem of copying collections of arbitrary element types to other collections (with the same element type). However, there is a solution: Use a generic method (a static or non-static method with a type-generalized implementation). Generic methods are syntactically expressed as follows:

<formal_type_parameter_list> return_type identifier(parameter_list)

The formal_type_parameter_list is the same as when specifying a generic type: it consists of type parameters with optional bounds. A type parameter can appear as the method's return_type, and type parameters can appear in the parameter_list. The compiler infers the actual type arguments from the context in which the method is invoked.

You'll discover many examples of generic methods in the collections framework. For example, its Collections class provides a public static <T> T max(Collection<? extends T> coll, Comparator<? super T> comp) method for returning the maximum element in the given Collection according to the ordering specified by the supplied Comparator.
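As a brief illustration of calling this method, the sketch below passes a comparator declared on CharSequence (a supertype of String) to show how Comparator<? super T> is satisfied, and how T can be either inferred or stated explicitly. The MaxDemo class and BY_LENGTH comparator are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class around the real Collections.max() method.
class MaxDemo
{
   // Orders any character sequences by length.
   static final Comparator<CharSequence> BY_LENGTH =
      new Comparator<CharSequence>()
      {
         public int compare(CharSequence a, CharSequence b)
         {
            return a.length() - b.length();
         }
      };

   static String longest(List<String> words)
   {
      // T is inferred as String; Comparator<CharSequence> is accepted
      // because CharSequence is a supertype of String.
      return Collections.max(words, BY_LENGTH);
   }

   public static void main(String[] args)
   {
      List<String> planets = Arrays.asList("Mars", "Jupiter", "Venus");
      System.out.println(longest(planets)); // prints Jupiter
      // The actual type argument can also be supplied explicitly:
      System.out.println(Collections.<String>max(planets, BY_LENGTH));
   }
}
```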

We can easily convert copy() into a generic method by prefixing the return type with <T> and replacing each wildcard with T. The resulting method header is <T> void copy(Collection<T> src, Collection<T> dest), and Listing 1 presents its source code as part of an application that copies a List of String to a Set of String.
Listing 1—Copy.java

// Copy.java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
public class Copy
{
   public static void main(String[] args)
   {
      List<String> planetsList = new ArrayList<String>();
      planetsList.add("Mercury");
      planetsList.add("Venus");
      planetsList.add("Earth");
      planetsList.add("Mars");
      planetsList.add("Jupiter");
      planetsList.add("Saturn");
      planetsList.add("Uranus");
      planetsList.add("Neptune");
      Set<String> planetsSet = new TreeSet<String>();
      copy (planetsList, planetsSet);
      Iterator<String> iter = planetsSet.iterator();
      while (iter.hasNext())
         System.out.println(iter.next());
   }
   public static <T> void copy(Collection<T> src, Collection<T> dest)
   {
      Iterator<T> iter = src.iterator();
      while (iter.hasNext())
         dest.add(iter.next());
   }
}

Within the copy() method, notice that type parameter T appears in the context of Iterator<T>, because src.iterator() returns an iterator over elements of the type passed to T, which is the type of src's elements. Otherwise, the method remains unchanged from its previous incarnations.

Listing 1 generates the following output:

Earth
Jupiter
Mars
Mercury
Neptune
Saturn
Uranus
Venus
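Listing 1's copy() requires the source and destination to have exactly the same element type. One possible variation, offered here as a sketch rather than as part of the original listing, combines the type parameter with the bounded wildcards shown earlier so that, for instance, a List of Integer can be copied to a List of Number. The FlexibleCopy class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical variation on Listing 1's copy() method.
class FlexibleCopy
{
   // src may produce T or any subtype; dest may accept T or any supertype.
   static <T> void copy(Collection<? extends T> src,
                        Collection<? super T> dest)
   {
      for (T element : src)
         dest.add(element);
   }

   public static void main(String[] args)
   {
      List<Integer> ints = Arrays.asList(1, 2, 3);
      List<Number> numbers = new ArrayList<Number>();
      copy(ints, numbers); // rejected by Listing 1's copy(Collection<T>, Collection<T>)
      System.out.println(numbers); // prints [1, 2, 3]
   }
}
```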

“Should I switch to Python?”

Rich has recently been considering switching to the Python programming language. Currently, Matlab is the language of choice in his department for rapid development and prototyping of code. It’s very good at this, but Mathworks (the company that produces Matlab) have been tinkering with the licensing terms, leading to hassles where none should exist. This is very frustrating and leads to the thought that it might be nice to use a free language where this will no longer be an issue.

But of course things are not quite that straightforward. Matlab is used for good reason – it’s very good at what it does. So is it worth the effort to stop using Matlab and instead learn to use Python? In this article we discuss some of the things that’ll need to be considered.

Why Python?
The first question is why, out of all the programming languages that exist, we should be considering Python. The bulk of the reasoning is actually contained in the specifics of the sections below, but the starting point is that Python has a good reputation for being nice to work with, it’s already used in some areas of science (suggesting it might be a sensible language to consider), and it has a wider community of users (including some big ones such as Google), so there should be good community support. So, this looks superficially promising. What about the specifics?

It’s free…
First up, Python is free. So no licence problems and no need to find the money to pay for it. This does mean that there isn’t a company whose raison d’etre is to build new functionality for Python, but there is an active community helping to develop it, so that’s probably not too much of a problem.

What do I need it for?
This is a key question when deciding whether to learn a new language. If you’re anything like us, you’re attracted to languages because you can do cool things with them, but you should be careful that they are the right cool things for your needs. In this case, Rich needs a language for building prototype implementations of statistical modelling tools. So, it needs to be fast to code in, object orientation would be desirable and lots of scientific library support is vital. Flat-out processing speed is a nice bonus, but is less essential as Rich is happy to recode in C++ if he needs to (or use a bigger computer).

Library support
For scientific programming, having the right libraries is vital. We need to generate plots, process data, invert matrices, perform Fast Fourier Transforms and all sorts of specialist things like that. All of these things can be found in libraries for various programming languages, so it’s sensible to make sure you have access to these. Python scores well on this count because of packages such as SciPy, BioPython, NumPy and matplotlib.

Usability
This is always tricky to assess without using the language, but the received wisdom on the Web, backed up by the opinions of some of our colleagues, is that Python is extremely user-friendly. Indeed, this is part of the stated design philosophy of Python (see here).

Speed
For prototyping scientific code, computational speed is a bonus rather than a necessity. At this stage, user time (for programming) is far more valuable than CPU time, so an interpreted language like Python is acceptable. Comparative benchmarking between languages is notoriously hard (and task specific), but the impression we’ve got is that Python and Matlab are probably of the same order of speed, and a couple of orders of magnitude slower than fully compiled languages like C++. However, in both cases people are working to make Matlab/Python implementations that are faster. And we probably won’t be losing out significantly by switching from Matlab to Python.

What does everyone else use?
It’s very useful if you’re surrounded by experts in the language you’re using. It’s also useful if your colleagues know the same languages as you, because they can pick up and use the things you write. In the case of Rich’s department, many people use Matlab but almost no-one uses Python. This is a downside. Of course, someone has to be first whenever a change like this is made, but it would mean that Rich would be on his own to a certain degree.

A transferable skill…
It’s always prudent to be developing transferable skills, and experience with Python would certainly count as that, because it’s widely used in industry and the commercial world. Matlab is also widely used, although perhaps more in science/engineering settings and less in places like the computing industry. It’s probably true to say that both have their merits in this regard.

What about Octave?
Wouldn’t it be nice if there were just a free version of Matlab? Well, there is (sort of): GNU Octave. This would be another good solution to Rich’s Matlab issues. We’re discounting it here mainly because of the concern that it’s less well supported than Python, and also because it’s less of a transferable skill. Neither of these reasons is a killer, however, so we wouldn’t try to dissuade anyone from going down the Octave route.

Scripting for science papers

Scientist-Programmers write a lot of scripts. It’s part and parcel of “trying stuff out”, it’s a quick way to get some number crunching done on those data, and it’s very useful for generating the figures and tables that you need for that paper you’re writing. In this article, I give a quick once-over of some of the things I’ve learned over the years about using scripts as a scientific tool.

A bit like prototypes…
Scripts share some characteristics with software prototypes. Your aim is typically to get an answer quickly, writing code that doesn’t (necessarily) need to be very reusable. There can also be a learning element here, if you’re trying to understand more about exactly how to solve a given problem. This means that you’ll be subject to many of the same considerations as in a prototype. Writing quick-and-dirty code in exchange for speed is okay here, provided you can test enough to be confident that you can trust the results it’s generating.

You *will* want to re-run these at some point
Often, you’ll be writing a script to run a one-off analysis. Perhaps there are enough stages involved that it’s easier to handle by writing it down in this way – for example, running an MCMC clustering analysis on some genetic data, summarising the results into a single ‘average’ clustering partition, then using annotation databases to search for patterns of biological function. All pretty straightforward stuff (and essentially just a set of modules, run in sequence). Despite this being a nominal one-off task, an important lesson I’ve learned over the years is that it’s surprising how often you’ll come back to a script months (or even years) later and need to use it again, either for the same task or a related one. This can happen for a number of reasons.

    * you’re returning to an old project
    * you’re responding to referees’ comments on a paper you wrote
    * you’re working on a further development of a previous project
    * you’ve come across the need to do a similar set of tasks for a completely different project

Whatever the reason, you will thank yourself if you’ve taken the time to write some comments and keep the code fairly legible and literate. This doesn’t take much time to do as you’re writing the script, but will save you huge headaches in getting restarted after months working on other things.

Turning your script into proper software
Sometimes your script will turn out to be useful more than once. It might even be useful enough that you end up using it regularly and perhaps other people start asking if they can have a copy. This is great, because you’ve made something useful! But at this point, you might want to consider turning your script into a proper piece of software. My suggestion for this is to treat your script as a kind of prototype, meaning that you should start afresh with the planning, coding and testing for the proper software. This is extra effort, of course, but by definition you’ve identified a case where it’ll be effort well spent.

Links to some great articles on programming

The internet is full of smart people writing intelligently on how to write good software. Very few of these articles are from the perspective of a scientist (hence this blog!) but a lot of what they write is useful, interesting and, occasionally, entertaining. This post is a list of some of the best articles, posts and websites that have taught us what we know today.

Six ways to write more comprehensible code, by Jeff Vogel, uses a game called ‘Kill Bad Aliens’ as a setting for its examples on writing better code. The code is C++ but the tips are nearly all applicable to any language. The only one I don’t agree with is number 2: ‘Use #define a lot. No, a LOT.’ #defines are part of the C/C++ pre-processor macro language, which replaces one string with another before the compiler is run. They allow the programmer to replace ‘magic numbers’ with more descriptive tokens without the overhead of creating constants. I agree that #defines are better than magic numbers, but I believe most of those values should be loaded as data and not hard-coded into your program. This way, when (and it will be when and not if) you change your mind, you only need to change the data files and not recompile your code. For something that needs as much tuning as a game, having loaded, not compiled, data is very important, as it allows quicker iterations, which means more iterations in a given time and therefore more chance to get the best possible experience.
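To make the "load it as data" suggestion concrete, here is a minimal sketch in Java rather than the article's C++. Everything in it is hypothetical: the TuningData class, the key names such as alien.speed, and the values. In practice the text would be read from a settings file; a string is used here so the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper: tunable values are read as data, not compiled in.
class TuningData
{
   // Parses key=value lines; in practice the text would come from a file.
   static Properties parse(String text)
   {
      Properties settings = new Properties();
      try
      {
         settings.load(new StringReader(text));
      }
      catch (IOException ioe)
      {
         throw new IllegalStateException("could not read settings", ioe);
      }
      return settings;
   }

   // Returns the named integer setting, or fallback if the key is absent.
   static int intSetting(Properties settings, String key, int fallback)
   {
      String raw = settings.getProperty(key);
      return raw == null ? fallback : Integer.parseInt(raw.trim());
   }

   public static void main(String[] args)
   {
      // Hypothetical keys and values, standing in for a tuning file.
      Properties settings = parse("alien.speed=3\nalien.count=12\n");
      System.out.println(intSetting(settings, "alien.speed", 1)); // prints 3
      System.out.println(intSetting(settings, "alien.lives", 5)); // prints 5
   }
}
```

Changing alien.speed then means editing the data and re-running, with no recompilation step.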

Bad names. Eric Lippert is one of the people who design the C# language at Microsoft. His blog can be deeply technical but it is always an interesting read. In this post he talks about bad variable/function/class names he has encountered in the C# compiler and what problems various names expose. For instance any name with ‘Misc’ in it is doing more than one thing which is ‘a bad thing™’. The comments have a few more chipped in by the peanut gallery.

Is it worth spending time on design? by Martin Fowler (the real name is DesignStaminaHypothesis but that name is a bit opaque). This post is about the hypothesis that there is a point below which spending time on design will actually slow you down but that this point is very difficult to judge and, in the author’s opinion, is lower than people think. I like this post because it shows that very few things are set in stone and it references two related points (links are in the article): 1) Productivity of a programmer is very hard to measure and 2) the author introduces the concept of technical debt to describe the cost to a project of not planning. Technical Debt is a very powerful concept and can be applied to any time you cut corners to get something done sooner than if you had done it properly.