2014-03-10

Jake Vanderplas is the Director of Research in the Physical Sciences at the University of Washington’s eScience Institute. He is a maintainer and/or frequent contributor to many open source Python projects, including scikit-learn, scipy, matplotlib, and others.  He occasionally blogs about Python, data visualization, open science, and related topics at Pythonic Perambulations. You may find his (BSD-licensed) code on Github, or follow him on Twitter at @jakevdp.

At the 223rd meeting of the American Astronomical Society in January, there was an excellent standing-room-only session on Astrophysics Code Sharing in which the question of open source licensing came up.  Code licensing is one of those issues that many people never have a reason to think about, but it turns out to be a very important piece of making your code available to others.  The goal of this post is to provide a quick guide to the subject for folks who don’t want to spend too much time researching it, and to offer some concrete advice to make sure your public scientific code is as useful and influential as it can possibly be.

Disclaimer: I have no formal legal training, and this post is not intended as legal advice. For a more thorough treatment of copyright law and licenses as they pertain to code and software, please see the list of references at the end of the article.

What Is a License?

For most people, the mention of a software license conjures up images of the paragraphs of legalese you must (pretend to) read before installing a proprietary software package. But licenses come in a variety of forms: generally, a software license is a legally-binding agreement which governs the use and redistribution of software. Licenses range from proprietary (think Microsoft Windows) to Open Source (think Linux), with many variations and gradations in between. Here we’ll focus primarily on Free and Open Source Software (FOSS) Licenses, as they are generally the most relevant to those who write and share scientific code.

Who Should Think About Licensing?

As we’ll discuss below, everybody who writes code should license it, and this is especially true in science. Increasingly, researchers in astronomy and related fields spend much of their energy producing scientific code. Because this code is an absolutely vital component of the reproducibility of scientific results, an important part of the scientific process is to make this code available to others. If you have shared or plan to share your scientific code—whether that’s on your own web page, on a blog, within supplemental websites of journals, or on a hosting service like Github or Bitbucket—this article is for you. Generally when scientists make their code public, they do so because they want it to be free to use and as useful as possible for as many people as possible. They want others to not only use it, but also extend it, fix bugs, incorporate it into their own research code, and thereby make it even more useful to more people. This article is written primarily with these scientific researchers in mind.

Summary: Three Pieces of Advice

This post will cover a lot of ground, but if you only take three pieces of information away from the article, let them be these:

Always license your code.  Unlicensed code is closed code, so any open license is better than none (but see #2).

Always use a GPL-compatible license. GPL-compatible licenses ensure broad compatibility for your code, and include GPL, new BSD, MIT, and others (but see #3).

Always use a permissive, BSD-style license. A permissive license such as new BSD or MIT is preferable to a copyleft license such as GPL or LGPL.

This list progresses from widely accepted to increasingly more contentious. I can’t claim my advice is sound for every possible situation. But for scientific researchers sharing their own code, I stand behind the recommendations. Below I’ll do my best to convince you why.

1. Always license your code.

Unlicensed code is closed code, so any open license is better than none (but see section #2 below).

Scientists generally share code that they hope others will use. And many of us operate under the assumption that posting it online effectively makes it available for anybody to use, modify, and incorporate into their projects. Somewhat surprisingly, it turns out that this is not the case in our copyright-driven world. The legalities of copyrights, licenses, etc. are difficult to distill, but a nice summary is given in a 2010 article by Arto Bendican.  It includes some examples of situations where copyright law may be at odds with naive intuition:

…if you stumble across some code with no attached licensing information, copyright laws would have you treat it as ‘all privileges retained’, even if its author in fact was just trying to make it available with no strings attached.

This is the first important thing to realize about licensing: adding a suitable license generally increases, not decreases, the openness of your code. Furthermore, adding a license lets you be explicit about your intent for the code. Do you want to be acknowledged as the author whenever your code is used or modified? Are you okay with other libraries incorporating, enhancing, and releasing versions your code? Are you OK with for-profit companies incorporating your code into their private code-bases? Choosing a license is the established way to inform potential users of your preferences regarding these, and other questions.

How to License Your Code

Below we’ll discuss suggestions of which license to use. First, though, let’s talk about the actual mechanics of licensing your code.  Licenses generally contain a few paragraphs of standard text in which you insert the date, your name, your organization, etc. (you can find texts of several common licenses at the Open Source Initiative). For smaller projects, you can include the full license text in a comment at the top of each source file.  Because this can be a bit tedious, other projects choose to have a single LICENSE.txt or COPYING.txt file in the source directory, and perhaps a small notice mentioning the license type at the top of each individual file in the codebase (see e.g. Numpy and SciPy for examples of this approach).

In summary, you should, at the very least, choose an open license and include it with your code.  It will make your intentions clear and allow others the freedom to use and adapt your code without worrying about an implicit copyright. In the next section we’ll go into a bit more depth on which license you should choose, and why.

2. Always use a GPL-compatible license.

GPL-compatible licenses ensure broad compatibility for your code, and include GPL, LGPL, new BSD, MIT, and others (but see #3, below)

There is a huge breadth of available licenses to use for your code. Wikipedia has a comprehensive comparison of free and open source software (FOSS) licenses; you can also find useful information on these licenses on the GNU website.  For the most part, any of these licenses is sufficient, though there are some important differences between them. Perhaps the best-known license is the GNU Public License (GPL), which guarantees the freedom of users to use, copy, and modify code. Code released under a GPL-compatible license is code that can be incorporated into another GPL-licensed codebase without modification to the license.  GPL-compatible licenses include GPL, LGPL, new BSD, MIT, and many others.

If the goal of publishing scientific code is to ensure that it is as useful as possible to as many people as possible, GPL-compatibility is a must in today’s world. GPL compatibility is so important that many non-GPL-compatible packages end up going to great lengths to retroactively change their license, a difficult process that involves gaining consensus of every person who has ever contributed to the project. For a list of examples, as well as a more complete discussion of the virtues of GPL compatibility, see this rather dense 2002 essay by David Wheeler, Make Your Open Source Software GPL-Compatible. Or Else.

I should clarify here that while I recommend a GPL-compatible license, I am not recommending necessarily recommending the use of a GPL license itself (see the next section).

You can find some more details on GPL compatibility here: http://producingoss.com/en/license-compatibility.html

3. Always use a permissive, BSD/MIT-style license

A permissive license such as BSD or MIT is preferable to a copyleft license such as GPL or LGPL.

Everything above this point in the article is, for the most part, non-controversial in the open source community. Here we’ll go into something that’s likely to spark more debate, and it has to do with some subtle distinctions between “permissive” licenses (which are often called “BSD-style”) and “copyleft” licenses (which are often called “GPL-style”).

The difference between the two is rooted in this: broadly speaking, a copyleft license requires derivative works to preserve the same license as original works. In the case of GPL-licensed code, what this means is that any codebase that incorporates any GPL code must also have a GPL license, and therefore be free for others to use. This is the sense in which the GPL and other copyleft licenses are sometimes called “viral” or “sticky” licenses.

The motivation for such a restriction seems reasonable enough: you work hard on your code and want it to be free for others to use. There is some injustice in the notion that someone would take the result of your hard work, make a superficial modification, and thereafter treat your code as their own intellectual property. This sense of justice (along with connected concerns related to software patenting issues) seems to be a central piece of the argument in favor of GPL-style copyleft licenses.

The case for BSD in big projects

Despite these concerns, there are a growing number of scientific software developers who push back against the idea of copyleft and rather advocate the use of permissive, BSD/MIT-style licenses. Perhaps the best articulation of the reason for this was in a 2004 forum post by the late John Hunter (creator of the Matplotlib visualization library): I’d highly recommend that you read the whole thing.

To summarize Hunter’s reasoning: the most important two predictors of success for a software project are the number of users and the number of contributors. Because of the restrictions and subtle legal issues involved with GPL licenses, many for-profit companies will not touch GPL-licensed code, even if they are happy to contribute their changes back to the community. A BSD license, on the other hand, removes these restrictions: Hunter mentions several specific examples of vital industry partnership in the case of matplotlib. He argues that in general, a good BSD-licensed project will, by virtue of opening itself to the contribution of private companies, greatly grow its two greatest assets: its user-base and its developer-base.

There is a tradeoff to choosing a BSD license, of course. It means that you’re effectively putting off-limits the incorporation of existing GPL-licensed functionality in your package. Hunter mentions some specific challenges this has posed for matplotlib; I’ve seen similar challenges within packages I’ve worked on, namely SciPy and Scikit-learn. Nevertheless, Hunter asserts that the net advantage comes from inviting the participation and contribution of industry partners to up the user-base and developer-base of the project. This is why core scientific projects like NumPy, SciPy, Matplotlib, IPython, Pandas, and many others have opted for BSD-style, permissive licenses.

The case for BSD in smaller projects

At this point, you might be wondering why this is relevant to you. After all, you are probably not developing the next IPython or Matplotlib, but simply releasing a small piece of research code that you hope others will find useful. Still, I would argue that BSD is the best option, because it will lower the barrier for your code to be as useful as possible to as many people as possible. This usefulness leads to the citations, recognition, and collaboration that are the primary currency of the academic researcher. We should recognize that these desirable outcomes are not protected  by the text of a license, but by established norms of the scientific community. For this reason, a scientific researcher should choose a license which offers the lowest barrier to the reproduction and extension of research results.

The world of scientific software is one in which many of the core packages have BSD-style licenses.  In this world, a GPL-style license is an unnecessary liability, as it contains barriers to use and leads to a one-way process of collaboration. A BSD-style license, on the other hand, enables two-way collaboration, in which developers of other projects can easily utilize, patch, improve, and cite your code to your mutual advantage.

In the astronomy world, many projects have followed this reasoning and opted for BSD-style licensing. AstroPy, astroML, emcee, and the yt project are well-known examples. The yt project is particularly interesting, because it started with a GPL license and, based on much of the above reasoning, undertook the fairly significant challenge of later switching to BSD-style. You can read an account of the reasoning and process on the yt blog. The benefits of a BSD-style license are strong enough that it was worth a significant effort to attain them: do your future self a favor and choose a BSD-style license from the start!

Going In-Depth: More Resources

Much has been written on this topic, and I’d encourage you to learn more if you’re interested. Following is a listing of several resources which discuss some of the above issues in more depth:

Producing Open Source Software: a free online book with a lot of great information and advice surrounding the production and maintenance of open source software, including licensing and related issues.

Understanding Open Source and Free Software Licensing: a full-length book discussing the finer points of open source licenses and copyright law.

A Quick Guide to Software Licensing for the Scientist-Programmer: An in-depth article on software licensing from a computational biology journal.

What makes computational open source software libraries successful? [pdf]: A general discussion of what makes scientific programming libraries successful.

Why we should be using BSD: The case for BSD-style licensing as made by the late John Hunter, creator of the matplotlib visualization library.

Open Source Licenses: The Open Source Initiative’s discussion of free and open licenses

With any discussion of licensing there are bound to be differing opinions. Where some prefer BSD to GPL, others make the case for GPL over BSD. Still others argue the superiority of other permissive licenses, such as Apache or Creative Commons. The right license choice often comes down to a matter of taste, and often depends on the (sometimes implicit) goals and priorities of the authors. Here are some questions to think about, which you may wish to discuss in the comment thread:

What core priorities drive advocacy of one license style over another?

What licensing practices are standard in your own community? What values do these reflect? How do these help or hinder progress?

How many in your community realize that unlicensed code defaults to non-open code?



Show more