2015-11-14

Many systems boast of being ‘powerful’, and it sounds difficult to argue that
this is a bad thing. Almost everyone who uses the word assumes that it is always
a good thing.

The thesis of this post is that in many cases we need less powerful
languages and systems.

Before I get going, I should say first of all that there is very little in this
post by way of original insight. The train of thought behind it was set off by
reading Hofstadter's book Gödel, Escher, Bach: An Eternal Golden Braid, which helped me pull
together various things in my own thinking where I've seen the principle in
action. Philip Wadler's post on the rule of least power was also
formative, and most of all I've taken a lot from the content of this video
from a Scala conference about everything that is wrong with Scala, which makes the
following fairly central point:

Every increase in expressiveness brings an increased burden on all who care
to understand the message.

My aim is simply to illustrate this point using examples that might be more
accessible to the Python community than the internals of a Scala compiler.

I also need a word about definitions. What do we mean by “more powerful” or
“less powerful” languages? In this article, I mean something roughly like this:
“the freedom and ability to do whatever you want to do”, seen mainly from the
perspective of the human author entering data or code into the system. This
roughly aligns with the concept of “expressiveness”.

The problem with this kind of freedom is that every bit of power you insist on
having when writing in the language corresponds to power you must give up at other
points of the process — when ‘consuming’ what you have written. I'll illustrate
this with various examples which range beyond what might be described as
programming, but have the same principle at heart.

We'll also need to ask “Does this matter?” It matters, of course, to the extent
that you need to be able to ‘consume’ the output of your system. The players who
might ‘consume’ the message include software maintainers, compilers and other
development tools, which means you almost always care — this has implications
for performance and correctness as well as for human concerns.

Databases and schema

Starting at the low end of the scale in terms of expressiveness, there is what
you might call data rather than language. But the principle applies here.

In my years of software development, I've found that clients and users often ask
for “free text” fields. A free text field is maximally powerful as far as the
end user is concerned — they can put whatever they like in. In this sense, this
is the “most useful” field — you can use it for anything.

But precisely because of this, it is also the least useful, because it is the
least structured. Even search doesn't work reliably because of typos and
alternative ways of expressing the same thing. The longer I do software
development involving databases, the more I want to tightly constrain everything
as much as possible. When I do so, the data I end up with is massively more
useful. I can do powerful things when consuming the data only when I severely
limit the power (i.e. the freedom) of the agents putting data into the system.
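
To sketch the difference with a hypothetical Django model (the model and its fields are purely illustrative):

    from django.db import models

    class Booking(models.Model):
        # Free text: maximum freedom for whoever enters the data,
        # minimal use to anyone querying it later.
        notes = models.TextField(blank=True)

        # Tightly constrained: less freedom on input, but the data can be
        # reliably searched, filtered, aggregated and validated.
        STATUS_CHOICES = [
            ('pending', 'Pending'),
            ('confirmed', 'Confirmed'),
            ('cancelled', 'Cancelled'),
        ]
        status = models.CharField(max_length=10, choices=STATUS_CHOICES)
        num_guests = models.PositiveIntegerField()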

In terms of database technologies, the same point can be made. Databases that
are “schema-less” give you great flexibility and power when putting data in, and
are extremely unhelpful when getting it out. A key-value store is a more
technical version of "free text", with the same drawbacks — it is pretty
unhelpful when you want to extract info or do anything with the data, since you
cannot guarantee that any specific keys will be there.

HTML

The success of the web has been partly due to the fact that some of the core
technologies, HTML and CSS, have been deliberately limited in power. Indeed,
you probably wouldn't call them programming languages, but markup languages.
This, however, was not an accident, but a deliberate design principle on the part of Tim
Berners-Lee. I can't do better than to quote that page at length:

Computer Science in the 1960s to 80s spent a lot of effort making languages
which were as powerful as possible. Nowadays we have to appreciate the
reasons for picking not the most powerful solution but the least powerful.
The reason for this is that the less powerful the language, the more you can
do with the data stored in that language. If you write it in a simple
declarative form, anyone can write a program to analyze it in many ways. The
Semantic Web is an attempt, largely, to map large quantities of existing
data onto a common language so that the data can be analyzed in ways never
dreamed of by its creators. If, for example, a web page with weather data
has RDF describing that data, a user can retrieve it as a table, perhaps
average it, plot it, deduce things from it in combination with other
information. At the other end of the scale is the weather information
portrayed by the cunning Java applet. While this might allow a very cool
user interface, it cannot be analyzed at all. The search engine finding the
page will have no idea of what the data is or what it is about. Thus the
only way to find out what a Java applet means is to set it running in front
of a person.

This has become a W3C principle:

Good Practice: Use the least powerful language suitable for expressing
information, constraints or programs on the World Wide Web.

Note that this is almost exactly the opposite of Paul Graham's advice (with the caveat that 'power' is often
too informally defined to compare):

if you have a choice of several languages, it is, all other things being
equal, a mistake to program in anything but the most powerful one.

Python setup.py MANIFEST.in file

Moving up towards ‘proper’ programming languages, I came across this example —
the MANIFEST.in file format used by distutils/setuptools. If you have had to
create a package for a Python library, you may well have used it.

The file format is essentially a very small language for defining what files
should be included in your Python package (paths are relative to the directory
containing the MANIFEST.in file, which we'll call the working directory from
now on). It might look something
like this:
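
    include README.rst
    include LICENSE.txt
    recursive-include foo *.py
    prune foo/tests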

There are two types of directive: include type directives (include,
recursive-include, global-include and graft), and exclude type
directives (exclude, recursive-exclude, global-exclude and
prune).

This raises a question — how are these directives to be interpreted?

You could interpret them in this way:

A file from the working directory (or sub-directories) should be included in
the package if it matches at least one include type directive, and does
not match any exclude type directive.

This would make it a declarative language.
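
A minimal sketch of that rule (simplifying every directive to a single glob pattern, which glosses over the different directive types):

    import fnmatch

    def file_included(path, include_patterns, exclude_patterns):
        # A file is in the package if it matches at least one include pattern
        # and no exclude pattern. The order of the patterns is irrelevant.
        return (any(fnmatch.fnmatch(path, p) for p in include_patterns)
                and not any(fnmatch.fnmatch(path, p) for p in exclude_patterns))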

Unfortunately, that is not how the language is defined. The distutils docs for
MANIFEST.in are
specific about this — the directives are to be understood as follows (my
paraphrase):

1. Start with an empty list of files to include in the package (or technically,
a default list of files).

2. Go down the directives in the MANIFEST.in in order.

3. For every include type directive, copy any matching files from the working
directory to the list for the package.

4. For every exclude type directive, remove any matching files from the list
for the package.
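
A sketch of this interpretation, under the same simplification of directives to glob patterns:

    import fnmatch

    def files_to_package(all_files, directives):
        # Each directive is a command, executed in order, that mutates the
        # list of files to be included in the package.
        result = []  # technically, a default list of files
        for action, pattern in directives:
            if action == 'include':
                result += [f for f in all_files
                           if fnmatch.fnmatch(f, pattern) and f not in result]
            elif action == 'exclude':
                result = [f for f in result if not fnmatch.fnmatch(f, pattern)]
        return result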

As you can see, this interpretation defines a language that is imperative in
nature — each line of MANIFEST.in is a command that implies an action with side
effects.

The point to note is that this makes the language more powerful than my
speculative declarative version above. For example, consider the following:
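
    recursive-include foo *
    recursive-exclude foo/bar *
    recursive-include foo/bar *.png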

The end result of the above commands is that .png files that are below foo/bar
are included, but all other files below foo/bar are not. If I'm thinking
straight, to replicate the same result using the declarative language is harder
— you would have to do something like the following, which is obviously sub-optimal:
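
    recursive-include foo *
    recursive-exclude foo/bar *.txt *.rst *.gif *.jpeg

(and so on, with a pattern for every type of file other than .png that might ever appear under foo/bar)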

So, because the imperative language is more powerful, there is a temptation to
prefer that one. However, the imperative version comes with significant
drawbacks:

It is much harder to optimise.

When it comes to interpreting the MANIFEST.in and building a list of files to
include in the package, one fairly efficient solution for a typical case is
to first build an immutable list of all files in the directory and its
sub-directories, and then apply the rules: addition rules involve copying
from the full list to an output list, and subtraction rules involve removing
from the output list. This is how the Python implementation currently does
it.

This works OK, unless you have many thousands of files in the full list, most
of which are going to get pruned or not included, in which case you can spend
a lot of time building up the full list, only to ignore most of it.

An obvious shortcut is to not recurse into directories that would be
excluded by some exclude directive. However, you can only do that if the
exclude directives come after all include directives.

This is not a theoretical problem — I've found that doing setup.py sdist
and other commands can take 10 minutes to run, due to the large number of files
in the working directory (if you use the tool tox, for instance). This means
that runs of tox itself (which uses setup.py) become very slow. I am currently
attempting to fix this issue, but it is looking like it will be really hard.

Adding the optimised case might not look that hard (you can shortcut the file
system traversal using any exclude directives that come after all include
directives), but it adds sufficiently to the complexity that a patch is
unlikely to be accepted — it increases the number of code paths and the
chances of mistakes to the point of it not being worth it.

It might be that the only practical solution is to avoid MANIFEST.in
altogether and optimise only the case when it is completely empty.

The power has a second cost — MANIFEST.in files are harder to understand.

First, in understanding how the language works — the docs for this are
considerably longer than for the declarative version I imagined.

Second, in analysing a specific MANIFEST.in file — you have to execute the
commands in your head in order to work out what the result will be, rather
than being able to take each line on its own, or in any order that makes
sense to you.

This actually results in packaging bugs. For instance, it would be easy to believe
that a directive like:
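
    global-exclude *~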

at the top of a MANIFEST.in file would result in any file name ending in
~ (temporary files created by some editors) being excluded from the
package. In reality it does nothing at all, and the files will be erroneously
included if other commands include them.

Examples I've found of this mistake (exclude directives that don't
function as intended or are useless) include:

hgview (exclude directives at the top do nothing)

django-mailer (global-exclude at the top does nothing)

Another result is that you cannot group lines in the MANIFEST.in file in any
way you please, for clarity, since re-ordering changes the meaning of the
file.

In addition, virtually no-one will actually use the additional power. I'm
willing to bet that 99.99% of MANIFEST.in files do not make use of the additional
power of the imperative language (I downloaded 250 and haven't found any that
do). So we could have been served much better by a declarative language here
instead of an imperative one. But backwards compatibility forces us to stick
with this. That highlights another point — it is often possible to add features
to a language to make it more powerful, but compatibility concerns usually don't
allow you to make it less powerful, for example by removing features or adding
constraints.

URL reversing

One core piece of the Django web framework
is URL routing. This is the component that parses URLs and dispatches them to
the handler for that URL, possibly passing some components extracted from the
URL.

In Django, this is done using regular expressions. For an app that displays
information about kittens, you might have a kittens/urls.py with the following:
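
    # kittens/urls.py (a sketch; the URL names are illustrative)
    from django.conf.urls import url

    from kittens import views

    urlpatterns = [
        url(r'^kittens/$', views.list_kittens, name='list_kittens'),
        url(r'^kittens/(?P<id>\d+)/$', views.show_kitten, name='show_kitten'),
    ]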

The corresponding views.py file looks like:
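
    # kittens/views.py (a sketch)
    def list_kittens(request):
        # ... build and return a page listing kittens
        ...

    def show_kitten(request, id=None):
        # ... build and return a page for a single kitten
        ...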

Regular expressions have a capture facility built in, which is used to capture
parameters that are passed to the view functions. So, for example, if this app
were running on cuteness.com, a URL like http://www.cuteness.com/kittens/23/
results in calling the Python code show_kitten(request, id='23') (captured URL
parameters are passed as strings).

Now, as well as being able to route URLs to specific functions, web apps almost
always need to generate URLs. For example, the kitten list page will need to
include links to the individual kitten page i.e. show_kitten. Obviously we
would like to do this in a DRY way, re-using the URL routing configuration.

However, we would be using the URL routing configuration in the opposite
direction. When doing URL routing, we are doing:
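
    URL path ------------> (handler function, arguments)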

In URL reversing, we know the handler function and arguments we want the user
to arrive at, and want to generate a URL that will take the user there, after
going through the URL routing:
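
    URL path <------------ (handler function, arguments)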

In order to do this, we essentially have to predict the behaviour of the URL
routing mechanism. We are asking “given a certain output, what is the input?”

In the very early days Django did not include this URL reversing facility, but
it was found
that with most URLs, it was possible to 'reverse' the URL pattern. The regex can
be parsed looking for the static elements and the capture elements.
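
Using the kittens example above, reversing looks something like this (the import location has moved to django.urls in later Django versions):

    from django.core.urlresolvers import reverse

    reverse('show_kitten', kwargs={'id': 23})
    # -> '/kittens/23/'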

Note, first of all, that this is only possible at all because the language being
used to define URL routes is a limited one — regular expressions. We could
easily have defined URL routes using a more powerful language. For example, we could
have defined them using functions that:

take a URL path as input

raise NoMatch if they do not match

return a truncated URL and an optional set of captures if they do match.

Our kittens urls.py would look something like this:
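
    # A sketch of a hypothetical, more powerful routing API — NoMatch and
    # the list-of-functions form of url() do not exist in real Django:
    from kittens import views

    class NoMatch(Exception):
        pass

    def match_kitten(path):
        # Match the static prefix, returning the truncated path.
        prefix = 'kittens/'
        if path.startswith(prefix):
            return path[len(prefix):]
        raise NoMatch()

    def capture_id(path):
        # Capture a numeric id, returning the truncated path and the capture.
        part = path.split('/')[0]
        try:
            id = int(part)
        except ValueError:
            raise NoMatch()
        return path[len(part) + 1:], {'id': id}

    urlpatterns = [
        url([match_kitten], views.list_kittens),
        url([match_kitten, capture_id], views.show_kitten),
    ]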

Of course, we could provide helpers that make things like match_kitten and
capture_id much more concise:
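
    # Hypothetical helpers: m(prefix) builds a static matcher like
    # match_kitten, c(converter) builds a capture function like capture_id.
    urlpatterns = [
        url([m('kittens/')], views.list_kittens),
        url([m('kittens/'), c(int)], views.show_kitten),
    ]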

Now, this language for URL routing is actually a lot more powerful than our
regex based one, assuming that m and c are returning functions as above.
The interface for matching and capturing is not limited to the capabilities of
regexes — for instance, we could do database lookups for the IDs, or many other
things.

The downside, however, is that URL reversing would be entirely impossible. For
general, Turing complete languages, you cannot ask “given this output, what is
the input?”. We could potentially inspect the source code of the function and
look for known patterns, but it quickly becomes totally impractical.

With regular expressions, however, the limited nature of the language gives us
more options. In general, URL configuration based on regexes is not reversible —
a regex as simple as . cannot be reversed uniquely. (Since we normally want to
generate canonical URLs, a unique solution is important. As it happens, Django
currently picks an arbitrary character for this wild card, but other wild cards
are not supported.) But as long as wild cards of any sort are only
found within capture groups (and possibly some other constraints), the regex can
be reversed.

So, if we want to be able to reliably reverse the URL routes, we actually want a
language less powerful than regular expressions. Regular expressions were
presumably chosen because they were powerful enough, without realising that they
were too powerful.

Additionally, in Python defining mini-languages for this kind of thing is quite
hard, and requires a fair amount of boilerplate and verbosity both for
implementation and usage — much more than when using a string based language
like regexes. In languages like Haskell, relatively simple features like easy
definitions of algebraic data types and pattern matching make these things much
easier.

Regular expressions

The mention of regexes as used in Django's URL routing reminds me of another
problem:

Many usages of regexes are relatively simple, but whenever you invoke a regex,
you get the full power whether you need it or not. One consequence is that for
some regular expressions, the need to do backtracking to find all possible
matches means that it is possible to construct malicious input that takes a huge
amount of time to be processed by the regex implementation.
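
For example, here is a deliberately pathological pattern (the slowdown is real in Python's backtracking re engine):

    import re

    # Nested quantifiers mean the engine must try exponentially many ways
    # of splitting up the 'a's before it can conclude there is no match.
    evil = re.compile(r'^(a+)+b$')

    evil.match('a' * 30 + 'c')  # takes a very long time to fail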

This has been the cause of a whole class of Denial Of Service vulnerabilities
in many web sites and services, including
one in Django due to an accidentally 'evil' regex in the URL validator — CVE-2015-5145.

Django templates vs Jinja templates

The Jinja template engine was inspired by the
Django template language,
but with some differences in philosophy and syntax.

One major advantage of Jinja2 over Django is performance.
Jinja2's implementation strategy is to compile templates to Python code, rather
than run an interpreter written in Python, which is how Django works, and this
results in a big performance increase — often 5 to 20 times faster. (YMMV etc.)

Armin Ronacher, the author of Jinja, attempted
to use the same strategy to speed up Django template rendering. There were
problems, however.

The first problem was one he knew about when he proposed
the project — namely that the extension API in Django makes the approach taken
in Jinja very difficult. Django allows custom template tags that have almost
complete control
over the compilation and rendering steps. This allows some powerful custom
template tags, like addtoblock in django-sekizai, that seem impossible at first glance. However, if a slower
fallback was provided for these less common situations, a fast implementation
might still have been useful.

However, there is another key difference that affects a lot of templates, which
is that the context object that is passed in (which holds the data needed by the
template) is writable within the template rendering process in Django.
Template tags are able to assign to the context, and in fact some built-in
template tags like url do just
that.
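
For example, the url tag can store its result in a new context variable (a sketch; kitten here is an assumed context variable):

    {% url 'show_kitten' id=kitten.id as kitten_url %}
    <a href="{{ kitten_url }}">See this kitten</a>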

The result of this is that a key part of the compilation to Python that happens in Jinja is impossible in Django.

Notice that in both of these cases, it is the power of Django's template engine that
is the problem — it allows code authors to do things that are not
possible in Jinja2. However, the result is that a very large obstacle is placed
in the way of attempts to compile to fast code.

This is not a theoretical consideration. At some point, performance of template
rendering becomes an issue for many projects, and a
number have
been forced to switch to Jinja because of that. This is far from an optimal
situation!

Often the issues that make optimisation difficult are only clear with the
benefit of hindsight, and it isn't true to say that simply adding restrictions
to a language is necessarily going to make it easier to optimise. You might also
say that for the Django template designers, allowing the context object to be
writable was the obvious choice because Python data structures are typically
mutable by default. Which brings us to Python...

Python

There are many ways that we could think about the power of the Python language,
and how it makes life hard for every person and program that wants to make sense
of Python code.

Compilation and performance of Python is an obvious one. The unrestricted
effects that are possible at any point, including writable classes and modules
etc., not only allow authors to do some very useful things, they make it
extremely difficult to execute Python code quickly. PyPy
has made some impressive progress, but looking at the curve from PyPy 1.3
onward, which shows diminishing returns, makes it
clear that they are unlikely to make much bigger gains in the future. And the
gains that have been made in terms of run time have often been at the expense of
memory usage. There is simply a limit to how well you can optimise Python code.

(Please note, to all who continue reading this — I'm not a Python basher or a
Django basher for that matter. I'm a core developer of Django, and I use Python
and Django in almost all my professional programming work. The point of this
post is to illustrate the problems caused by powerful languages.)

However, rather than focus on the performance problems of Python, I'm going to
talk about refactoring and maintenance. If you do any serious work in a
language, you find yourself doing a lot of maintenance, and being able to do it
quickly and safely often becomes very important.

So, for example, in Python, and with typical VCS tools (Git or Mercurial, for
instance), if you re-order functions in a module, e.g. move a 10-line function to
a different place, you get a 20-line diff, despite the fact that nothing changed
in terms of the meaning of the program. And if something did change (the
function was both moved and modified), it's going to be very difficult to spot.

This happened to me recently, and set me off thinking just how ridiculously bad
our toolsets are. Why on earth are we treating our highly structured code as a
bunch of lines of text? I can't believe that we are still programming like this;
it is insane.

At first, you might think that this could be solved with a more intelligent
diff tool. But the problem is that in
Python, the order in which functions are defined can in fact change the meaning
of a program (i.e. change what happens when you execute it).

Here are a few examples:

Using a previously defined function as a default argument:
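
    def foo():
        return 1

    def bar(x, y=foo()):
        return x + y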

These functions can't be re-ordered or you'll get a NameError for foo in
the definition of bar.

Using a decorator:
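
    @decorateit   # decorateit is some decorator defined elsewhere
    def foo():
        return 1

    @decorateit
    def bar():
        return 2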

Due to unrestricted effects that are possible in @decorateit, you can't
safely re-order these functions and be sure the program will do the same
thing afterwards. Similarly, calling some code in the function argument list:
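
    def foo(x=Bar()):   # Bar is some class defined elsewhere
        pass

    def bar(y=Bar()):
        pass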

Similarly, class level attributes can't be re-ordered safely:
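
    class Foo(object):
        a = Bar()
        b = Bar()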

Due to unrestricted effects possible inside the Bar constructor, the
definitions of a and b cannot be re-ordered safely. (This might seem
theoretical, but Django, for instance, actually uses this ability inside
Model and Form definitions to provide a default order for the fields,
using a class level counter inside the base Field constructor).
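
A simplified sketch of that counter technique (the real Django implementation has more to it):

    class Field(object):
        creation_counter = 0

        def __init__(self):
            # Remember the order in which Field instances were created, so
            # that a Form/Model can present its fields in definition order.
            self.creation_counter = Field.creation_counter
            Field.creation_counter += 1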

Ultimately, you have to accept that a sequence of function statements in Python
is a sequence of actions in which objects (functions and default arguments) are
created, possibly manipulated, etc. It is not a re-orderable set of function
declarations as it might be in other languages.

This gives Python a certain power when it comes to writing it, but imposes
massive restrictions on what you can do in any automated way to manipulate
Python source code.

Above I used the simple example of re-ordering two functions or class
attributes. But every single type of refactoring that you might do in Python
becomes virtually impossible to do safely because of the power of the language.
For example, duck typing means you can't do method renames safely, and the
possibility of reflection/dynamic attribute access (getattr and friends) means
you can't in fact do any kind of automated rename safely.
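
For example, a contrived sketch of code that defeats automated renaming:

    class Kitten(object):
        def miaow(self):
            return 'Miaow!'

    def call_method(obj, method_name):
        # The method name is just runtime data here; no tool can statically
        # determine which methods might be looked up this way.
        return getattr(obj, method_name)()

    call_method(Kitten(), 'mi' + 'aow')  # renaming miaow() silently breaks this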

So, if we are tempted to blame our crude VCS or refactoring tools, we actually
have to blame the power of Python — despite the huge amount of structure in
correct Python source code, there is very little that any software tool can do
with it when it comes to manipulating it, and the line-based diffing that got me
so mad is actually a reasonable approach.

In an ideal world, with my dream language, when you rename a function, the
entire 'diff' in your VCS should simply be "Function foo renamed to bar". (And,
this should be exportable/importable, so that when you upgrade a dependency to a version in
which foo is renamed to bar, it should be exactly zero work to deal with this).
In a “less powerful” language, this would be possible, but the power given to
the program author in Python has taken power from all the other tools in the
environment.

Does this matter? It depends on how much time you spend manipulating your
code, compared to using code to manipulate data.

At the beginning of a project, you may be tempted to desire the most powerful
language possible, because it gives you the most help and freedom in terms of
manipulating data. But later on, you spend a huge amount of time manipulating
code, and often using an extremely basic tool to do so — a text editor. This
treats your highly structured code as one of the least structured forms of data
— a string of text — exactly the kind of manipulation you would avoid at all
costs inside your code. But all the practices you would choose and rely on
inside your program (manipulating all data inside appropriate containers) are no
longer available to you when it comes to manipulating the program itself.

Some popular languages make automated refactoring easier, but more is needed: to
actually make use of the structure of your code, you need an editor and VCS that
understand your code properly. Projects like Lamdu are a step in the right
direction, but they are still in their infancy, and unfortunately involve
re-thinking the entire software development stack.

Summary

When you consider the total system and all the players (whether software or
human), including the need to produce efficient code, and long term
maintainability, less powerful languages are actually more powerful — “slavery
is freedom”. There is a balance between expressiveness and reasonability.

The more powerful a language, the greater the burden on software tools, which
either need to be more complicated in order to work, or are forced to do
less than they could. This includes:

compilers — with big implications for performance

automated refactoring and VCS tools — with big implications for maintenance.

Similarly, the burden also increases for humans — for anyone attempting to
understand the code or modify it.

A natural instinct is to go for the most powerful solution, or a solution that
is much more powerful than is actually needed. We should try to do the opposite
— find the least powerful solution that will do the job.

This won't happen if creating new languages (which might involve parsers etc.)
is hard work. We should prefer software ecosystems that make it easy to create
very small and weak languages.
