Planet.python.org

Luke Plant: Bundling dependencies

2013-04-15

This post is about maintenance programming and the issue of Open Source
dependencies that may need customising. It compiles some of my current thoughts,
but I'm also eager to find out what other people do.

3 approaches to dependencies

Pure dependency

The source code of the dependency does not become a part of your project in
any way. For a web project with Python and virtualenv/pip, you would just
list the project name and version in requirements.txt, and it will be
installed when you deploy your project.

This is by far the easiest approach to dependencies.

Forked dependency

You create a fork of the library (usually hosted publicly, but not
necessarily) and add to it the changes you need. You then use this fork from
your main project.

This is done either in the hope that bug fixes and feature additions that you
make will be merged into the original, so that you won't have to maintain
your fork forever, or with the aim of keeping your changes small enough
that it will always be easy to merge in fixes from upstream.

Bundled dependency

You take a copy of the library and include it directly in your own source
tree, so that you can make whatever modifications you need. The code becomes
a part of your source code forever.

This post is about number 3 — the bundled dependency.

(There are, of course, variants and mixtures of these — for example, Django has often bundled dependencies, but this was
purely because of the confusing state of packaging, and the code was never
modified for use in Django. These libraries have been or will be un-bundled as
soon as possible.)

Avoid it if you possibly can

The first thing to say about bundling dependencies is that you should avoid doing
so if at all possible:

It can greatly increase the size of your code base.

You won't get critical fixes from upstream, and it can be hard to merge them
in.

Bundling a dependency can be a drastic decision — you are taking on all the
technical debt and maintenance burden of the code you are adding. Some
developers look at Open Source libraries and think “wow, all this free source
code I can just add to my project”. Your attitude needs to be exactly the
opposite: “Wow, look at all that code I'm going to have to maintain and debug”.

An external dependency is often much worse from a maintenance point of view than
code you have written yourself:

You may not understand the code very well at all, and you may not have access
to the original reasons for the way it is.

When you add it to your project, you typically lose its history, making it
harder to track down reasons for its current state.

Library code can often be over-generalised and complex. It copes with all
kinds of situations that you don't need, but you will have to understand and
maintain that complexity.

The code will not ‘fit’ into your project well — there may be all kinds of
conventions and decisions that make it alien to your project, but now it is
part of your project and needs to fit.

Alternatives

To avoid bundling a dependency, you can use the ‘forked dependency’ approach
described above. For the missing features you need, attempt to add extension
points that will give you the flexibility you need, rather than simply
hard-coding something very specific to your project that will never get
merged upstream.
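
As a rough illustration of what an extension point might look like, here is a minimal sketch. The names OrderExporter, format_line and ProjectExporter are invented for this example and do not come from any real library. The idea is that the fork gains one small overridable method, and the project-specific behaviour stays in your own code.

```python
# Hypothetical names throughout; this is a sketch, not any real library's API.


class OrderExporter:
    """Stand-in for a class that lives in the forked library."""

    def format_line(self, item):
        # Extension point added in the fork: projects override this method
        # instead of editing export() itself.
        return f"{item['name']},{item['qty']}"

    def export(self, items):
        return "\n".join(self.format_line(item) for item in items)


class ProjectExporter(OrderExporter):
    """Lives in your own project, not in the fork."""

    def format_line(self, item):
        # Project-specific format that upstream would never want hard coded.
        return f"{item['name']};{item['qty']};{item['sku']}"
```

A change like the new format_line hook is generic enough that upstream might accept it, while the subclass never needs to leave your project.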

Another alternative is to build what you need yourself, or very selectively add
parts of the dependency into your own source. This may seem more work, but could
be easier to maintain long term.

Finally, you could consider a monkey patch. But be very careful, and make
sure you know all the places where you are doing that kind of thing, so that you
can assess at what point you should be switching strategy.
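
One way to keep track of monkey patches is to route them all through a single helper that records what was patched and why. The sketch below assumes a hypothetical helper apply_patch, a hypothetical registry PATCHES, and a stand-in class Cart playing the role of the library code. None of these names come from a real package.

```python
# Hypothetical names throughout (Cart, apply_patch, PATCHES); a sketch only.

PATCHES = []  # one central record of every monkey patch in the project


class Cart:
    """Stand-in for a class that lives in a third-party library."""

    def total(self, prices):
        return sum(prices)


def apply_patch(target, name, replacement, reason):
    """Replace target.<name> with replacement and record the patch."""
    original = getattr(target, name)
    setattr(target, name, replacement)
    PATCHES.append({"target": target.__name__, "name": name, "reason": reason})
    return original


def total_with_shipping(self, prices):
    # Call through to the original behaviour, then add a flat shipping fee.
    return _original_total(self, prices) + 5


_original_total = apply_patch(
    Cart, "total", total_with_shipping,
    reason="flat shipping fee not supported upstream",
)
```

When the PATCHES list starts to grow, that is a fairly concrete signal that it may be time to switch strategy.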

When you should consider it

However, there are times when you should consider bundling the dependency:

When the changes you want to make are more than bug fixes.

When the changes can't be easily made by adding extension points to the original.

When the number/size of extensions is going to severely inhibit a developer's
ability to understand the code.

I recently took on a project that had bundled a copy of Satchmo. It was a bit of a shock, because
requirements.txt also listed Satchmo as a dependency, making me think I was in
situation 1, when actually I was in situation 3, which is much worse.

Sometimes, however, it is unavoidable, e.g. you need to add multiple fields
to different DB models, or you need to make invasive changes in other ways. As I
looked at the number of modifications made to the bundled Satchmo, I realised
there was no way that strategies 1 or 2 would be any good. Strategy 3 had
already been chosen, it was impossible to turn back the clock, and with
hindsight it was probably the right decision.

But the implementation of that decision was lacking in many ways.

So how do you cope when you are forced to bundle? Here are my hints so far.

Recognise that you have done a really bad thing, and you need to take equally
drastic action to cope with it. The bigger the dependency you've bundled, the
more likely it is that you have seriously damaged your ability to maintain the
project long term.

Make sure you include the tests of the original dependency, and integrate them
as part of your test suite.

Sounds obvious, but in the project that inspired this post, the opposite had
been done — they had copied all the source code, with the exception of every
file called 'tests.py' or directory called 'tests'. I do not know what
possessed them to do this, but this decision was an extremely expensive one
for their client, and has caused massive damage to the project.
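
If you still have a checkout of the original library, a small script can tell you which test files were left behind. This is a minimal sketch under that assumption. The function names is_test_path and missing_tests are invented here, and it only recognises the two naming patterns mentioned above (files called tests.py and directories called tests).

```python
from pathlib import Path


def is_test_path(relative_path):
    """True for tests.py files and for anything inside a tests/ directory."""
    return (
        relative_path.name == "tests.py"
        or "tests" in relative_path.parts[:-1]
    )


def missing_tests(upstream_root, bundled_root):
    """List upstream test files that have no counterpart in the bundled copy."""
    upstream_root = Path(upstream_root)
    bundled_root = Path(bundled_root)
    missing = []
    for path in upstream_root.rglob("*.py"):
        relative = path.relative_to(upstream_root)
        if is_test_path(relative) and not (bundled_root / relative).exists():
            missing.append(relative.as_posix())
    return sorted(missing)
```

Calling missing_tests with the path of the upstream checkout and the path of the bundled copy returns the relative paths of the test files you would need to copy across.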

Maintain the test suite properly.

Again, sounds obvious, but tests are extremely valuable to a project, and in
this situation it is vital that you keep them maintained.

It is acceptable to delete tests if they are checking requirements that you no
longer have. But you should be deleting the code that supports those tests as
well.

Take complete ownership of the code.

Having made the decision to bundle, don't treat the code like an external
dependency. It is your code now, and only you can fix it. Don't pretend you are
going to merge in upstream changes.

The code should live at the same ‘level’ as the rest of your code — for
example, it should be in the same directory, not off in some 'libs' directory
that makes it harder to find. You need to embrace the fact that it is part of
your maintenance burden.

On the other hand, it is your code now, so you can do what you want with it.
Don't be afraid of making changes. A tentative approach will leave you with
the worst of both worlds — a library that doesn't really do what you want, but
that you have to maintain. Make it do what you want.

Obviously, there can be some value in maintaining a separation between "your
stuff" and the "framework stuff" or "library stuff", but this is just good
coding practice — you wouldn't hard code something very specific into a
function that is supposed to be generic.

Delete, delete, delete.

If there is code that you don't need, just delete it. The more code you can
remove, the better. There can be a case for keeping some code around if:

It is causing very little nuisance to maintenance efforts.

It is fairly likely to be needed in the near future.

It is not causing runtime weaknesses (e.g. security problems),
because there is no entrance point to the code.

But note that just the existence of code is a maintenance problem. If, for
example, you need to change the signature of a function, you will do a search
for sites that call it. Every hit you get is something you have to
investigate, which takes time. If, in the process of this kind of
investigation, you find some code that might be unused, find out if it is, and
delete aggressively where appropriate.
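
A plain text search is usually where this starts, but for Python code the standard library ast module can narrow the hits down to actual calls. The sketch below is one possible approach, not a complete tool. The function name find_call_sites is invented here, and it will not see calls made dynamically (for example through getattr) or from templates.

```python
import ast


def find_call_sites(source, function_name, filename="<string>"):
    """Return sorted (line, column) pairs for calls to function_name."""
    tree = ast.parse(source, filename=filename)
    sites = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            called = func.id  # plain call: price(...)
        else:
            # method-style call: cart.price(...); other forms give None
            called = getattr(func, "attr", None)
        if called == function_name:
            sites.append((node.lineno, node.col_offset))
    return sorted(sites)


SAMPLE = """\
def price(item, tax):
    return item * tax

total = price(10, 1.2)
other = cart.price(3, 1.0)
unused = pricey(1)
"""

if __name__ == "__main__":
    for line, column in find_call_sites(SAMPLE, "price"):
        print(f"line {line}, column {column}")
```

Unlike a text search, this skips the definition itself and names that merely contain the search term, such as pricey in the sample.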

And code that might be needed one day is better deleted. By the time
you come to need it, it might be horribly broken, or broken in subtle ways
that will take you longer to debug than to write from scratch, or too complex
or badly performing for the context of your evolved application.

This applies to all kinds of code, including templates etc.

Clean aggressively.

If you delete unused code, you'll find that you may well end up with code that
has essentially unused generality, or various other things that no longer make
sense for your specific project.

This is my golden rule for maintenance:

Leave the code looking as if it had always been designed that way.

This is a general maintenance principle, but it is especially important for
the situation where you are trying to go from a larger code base to a smaller
one.

Ideally, there should never be artefacts that can only be explained by talking
about the history of the project. This applies to every detail, including:

names of models

names of fields

names of variables and functions

Altering models is not hard if you have a good database migration tool,
e.g. South for Django.

This principle may seem like it adds to the load of the maintenance
programming, but long term it reduces the load, and reduces the likelihood
that a project will collapse under its own weight. Even with this principle,
projects tend to become unmaintainable — the natural tendency of a project is
towards chaos, and you have to be very proactive about reversing that.

Example 1: after deleting some classes, you end up with a class hierarchy
where each base class is only used once. This adds a lot of overhead when
reading the code. You should clean aggressively — fold the classes together
(unless keeping them separate increases the clarity of the code).
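
To make Example 1 concrete, here is a small before-and-after sketch. All the class names are invented for illustration. After earlier deletions, each base class has only one subclass left, so the three-level hierarchy can be folded into one class with the same behaviour.

```python
# Hypothetical classes, for illustration only.

# Before: every base class is used exactly once.
class BasePaymentProcessor:
    def describe(self):
        return f"{self.name} charges {self.fee()}"


class CardPaymentProcessor(BasePaymentProcessor):
    def fee(self):
        return self.base_fee + 1


class ShopCardProcessor(CardPaymentProcessor):
    name = "card"
    base_fee = 2


# After: the same behaviour, readable in one place.
class CardProcessor:
    name = "card"
    base_fee = 2

    def fee(self):
        return self.base_fee + 1

    def describe(self):
        return f"{self.name} charges {self.fee()}"
```

The folded version reads as if it had always been designed that way, which is the point of the exercise.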

Example 2: The code I'm maintaining uses livesettings (and uses it far too
much in my opinion, for things that ought to be in settings.py). It includes
some options that are unlikely to change for a given project, or are likely to
become ignored easily. For example, there is an "Only authenticated users can
check out" setting. In a project with an overridden login form or login view
(which can easily happen), it's very easy for this switch to become (at least
partly) broken. When you are working on some code that branches on the value
of this switch, there is no point fixing both branches — you won't have decent
tests to ensure that the unused branch is really working.

Instead, find out what the current value is, and just delete the other
branch. Then find all instances of the setting being used, and clean up
similarly. Finally, delete the code that defines the switch in the first
place. Remove every trace — you always have the history if you really need to
see how something was done before.
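
A sketch of what this looks like in code, using a plain dictionary in place of the real settings machinery. The setting key AUTHENTICATED_CHECKOUT_ONLY and both function names are invented for this example, and it assumes the live value of the switch turned out to be True.

```python
# Hypothetical names; a dict stands in for the real settings system.

# Before: the code branches on a switch that nobody ever changes.
def can_check_out_before(user, settings):
    if settings.get("AUTHENTICATED_CHECKOUT_ONLY", True):
        return user.get("is_authenticated", False)
    return True  # the branch that is never exercised or tested


# After: the switch was always True, so the other branch and the setting
# itself are gone.
def can_check_out(user):
    return user.get("is_authenticated", False)
```

The simplified function no longer takes the settings argument at all, so every caller has to be updated too, which is a good way of finding the remaining uses of the switch.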

Lather, rinse, repeat.

The aggressive process of deleting and cleaning turns up more code to delete
and clean, and you should follow this up. You may not have the time to do it
right now, but you should be doing it as you go: whenever some coding has
turned up something that can be cleaned/deleted, first do the necessary
commit for whatever you were working on. Then do a round of
cleaning/deleting, finding all the code paths that are now dead or can be
simplified, commit the change, and repeat.

These things have to go together. Aggressive deleting and cleaning can be made a
lot easier if you have a good test suite. Of course, when deleting code, you
will do a search for sites that might call it. But it ought to be possible to
check if you can delete code simply by running the test suite with it absent.

What other approaches or hints do you have for dealing with this situation?