2013-11-06

In 2003 the Perl development community was made aware of an algorithmic
complexity attack on Perl's hash table implementation[1]. This attack was
similar to attacks reported over the last few years against other languages
and packages, such as the Java, Ruby and Python hash implementations.

The basic idea of this attack is to precompute a set of keys which all hash
to the same value, and thus to the same storage bucket. These keys are then
fed (as a batch) to a target, which must compare each new key against every
previously stored key before inserting it, effectively turning the hash into
a linked list and changing the performance profile for inserting each item
from O(1) (amortized) to O(N). This means that the common practice of loading
arguments such as GET/POST parameters into hashes provided a vector for
denial of service attacks on many HTTP-based applications.
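To make the failure mode concrete, here is a toy sketch in Perl of a
separate-chaining insert (illustrative only, with invented names; Perl's real
implementation is C code in hv.c):

    # Toy separate-chaining insert: every insert compares the new key against
    # each key already in its bucket, so if an attacker arranges for all keys
    # to land in one bucket, each insert costs O(keys already stored).
    my @buckets = map { [] } 1 .. 8;

    sub toy_hash { unpack '%32C*', shift }    # stand-in hash: byte checksum

    sub toy_insert {
        my ( $key, $value ) = @_;
        my $chain = $buckets[ toy_hash($key) % @buckets ];
        for my $entry (@$chain) {              # walk the whole chain
            return $entry->[1] = $value if $entry->[0] eq $key;
        }
        push @$chain, [ $key, $value ];        # new key: append to the chain
    }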

As a response to this, Perl implemented
a mechanism by which it would
detect long chains of entries within a bucket and trigger a "hash split".
This meant it would double the number of buckets and then redistribute
the previously stored keys as required
into the newly added buckets. If, after this hash split, the
chain was still unacceptably long, Perl would cause the hash to go into
a special mode (REHASH mode) where it uses a per-process random hash seed for its
hash function. Switching a normal hash to this special mode would cause Perl
to allocate a new bucket array, recalculate the hash values of all previously
stored keys using the random seed, and redistribute the keys from the old
bucket array into the new one. This mitigated the attack by remapping the
previously colliding keys into a well distributed set of randomly chosen new
buckets.

At this point the Perl community thought we had put the subject of hash
collision attacks behind us, and for nearly 10 years we heard little
more on the subject.

Memory Exhaustion Attack On REHASH Mechanism

Over the years the subject of changing our hash function would occasionally
come up. For instance Jarkko made a number of comments that there were faster
hash functions, and in response I did a bit of research into the subject, but
little came of this work.

In 2012 this changed. I was working on several projects that made heavy use
of Perl's hash function, and I decided to invest some effort in seeing whether
other hash functions would provide performance improvements. At the same time
other people in the Perl community were becoming interested, partly due to my
work and partly due to the publicity from the multi-collision attacks on
Python's and Ruby's hash functions (MurmurHash and CityHash), publicity that
I did not actually notice until after I had pushed patches to switch Perl to
MurmurHash as its default hash, something that got reverted very quickly.

In restructuring the Perl hash implementation so it was easier
to test different hash functions, I became well acquainted with
the finer details of the implementation of the REHASH mechanism. Frankly it
got in the way and I wanted to remove it outright. While arguing
about whether it could be replaced with a conceptually simpler mechanism I
discovered that the defenses put in place in 2003 were not as strong
as had been previously believed. In fact they provided a whole new
and, arguably, more dangerous attack vector than the original attack
they were meant to mitigate.
This resulted in the perl5 security team announcing
CVE-2013-1667,
and the release of security patches for all major Perl versions since 5.8.x.

The problem was that the REHASH mechanism allowed an attacker to create
a set of keys which would cause Perl to repeatedly double the size
of the hash table, but never trigger the use of the randomized hash seed.
With relatively few keys the attacker could make Perl allocate a bucket array
with up to 2^32 hash buckets, or as many as memory would allow. Even if the
attack did not consume all the memory on the box there would be serious
performance consequences as Perl remapped the keys into ever-larger
bucket arrays. Even on fast 2013 hardware, counting from 0 to 2^32 takes a while!
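For a rough sense of scale, assuming one pointer per bucket (8 bytes on a
64-bit build), the bucket array alone at the top end works out to:

    # Back-of-the-envelope cost of the forced doublings: the bucket array
    # alone, never mind the keys, grows to tens of gigabytes at the top end.
    my $buckets          = 2**32;
    my $bytes_per_bucket = 8;      # one pointer per bucket (64-bit build)
    printf "%.0f GiB\n", $buckets * $bytes_per_bucket / 2**30;   # 32 GiB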

This issue affected all versions of Perl from 5.8.2 to 5.16.2.
It does not affect Perl 5.18. For those interested the security
patches for these versions are as follows:

At this time most Perl installations should be security-patched. Additionally,
official Perl maintenance releases 5.16.3 and 5.14.4 were published. But if
you would like to know whether you are vulnerable you can try the following
program:

The following are statistics generated by the time program for
the full attack (not the one-liner above) against a Perl 5.16 with and without the
fix applied (identical/zero lines omitted) on a laptop with 8GB:

Without the fix patch (0ff9bbd11bcf0c048e5b3e4b893c52206692eed2):

With the fix patch (f1220d61455253b170e81427c9d0357831ca0fac) applied:

But this doesn't explain all of the changes in Perl 5.18

The observant reader will have realized that if we could patch older Perls to
be robust to CVE-2013-1667, we could also have patched 5.18 and avoided the
problems that were caused by changing Perl's hash function. The reason we went
even further than those maintenance patches is that we found that Perl had
further, if less readily exploitable, vulnerabilities to attack, and we wanted
to do our best to fix them all.

This part of the story starts with Ruslan Zakirov posting a report to the
perl5-security mailing list. The report outlined the basis of a key discovery attack
on Perl's hash function. At first the Perl security team was not entirely convinced,
but he then followed up with more code that demonstrated that his attack
was sound. This development meant that the choice of a random seed would
not make Perl's hash function robust to attack. An attacker could
relatively efficiently determine the seed, and then use that knowledge
to construct a set of attack keys that could be used to attack the hash
function.

Nicholas Clark then ramped things up a notch further and did some in-depth analysis
on the attack and the issues involved. At the same time, so did Ruslan and I.
The conclusion of this analysis was that the attack exploited
multiple vulnerabilities in how Perl's hash data structure worked and
that the response would similarly require a multi-pronged approach.

Changes to the One-At-A-Time function

The first vulnerability was that Bob Jenkins' One-At-A-Time hash, which Perl
used, did not "mix" the seed together with the hashed data well enough for
short keys. This allowed an attacker to mount a key discovery attack by using
small sets of short keys, and the order they were stored in, to probe the
"seed" and eventually expose enough bits of it that a final collision attack
could be mounted.

We addressed this issue by making Perl append a four-byte, randomly
chosen suffix to every string it hashed. This means that we always
"mix" the seed at least 4 times, and we mix it with something that
the attacker cannot know. This effectively doubles the number of bits
used for "secret" state, and ensures that short keys do not "leak"
information about the original seed. The reason we use a suffix is
that adding a prefix is the same as starting with a different initial
seed state, so does not add any extra security. A suffix modifies
the final state after the user input is provided and increases the
search space an attacker must consider.

Related to this change was that the original One-At-A-Time function was
potentially vulnerable to multi-collision attacks. An attacker could
precalculate one or more suffixes such that

H( x ) == H( concat(x, suffix) )

which would then allow an attacker to trivially construct an infinite set of
keys, x, concat(x, suffix), concat(x, suffix, suffix), and so on, which would
always collide into the same bucket. We hardened
the hash by mixing in the length of the key into the seed. We believe
that this more or less eliminates the possibility of a multi-collision
attack as it means that the seed used to calculate H( concat(x, suffix) ) would
not be the same seed as H( concat(x, suffix, suffix) ). Cryptographers are
invited to prove us wrong.
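Putting the two measures together, a rough Perl rendering of the hardened
One-At-A-Time function might look like the sketch below. This is purely
illustrative; the real implementation is C code in hv_func.h and its mixing
details differ:

    use strict;
    use warnings;

    # Sketch of One-At-A-Time with the hardening described above: the key
    # length is mixed into the starting state, and a secret four-byte suffix
    # is appended to the key, so short keys no longer leak the seed and
    # appending a precomputed suffix changes the effective seed.
    sub hardened_oaat {
        my ( $key, $seed, $suffix ) = @_;     # $seed and $suffix are secret
        my $hash = ( $seed + length $key ) & 0xFFFFFFFF;
        for my $byte ( unpack 'C*', $key . $suffix ) {
            $hash = ( $hash + $byte )           & 0xFFFFFFFF;
            $hash = ( $hash + ( $hash << 10 ) ) & 0xFFFFFFFF;
            $hash ^= $hash >> 6;
        }
        $hash = ( $hash + ( $hash << 3 ) ) & 0xFFFFFFFF;
        $hash ^= $hash >> 11;
        return ( $hash + ( $hash << 15 ) ) & 0xFFFFFFFF;
    }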

Reduce Information Leakage

The second vulnerability was that it is all too easy to leak information about
the hash function to an attacker. For instance a web page might accept
a set of parameters and respond with information for each of those
parameters in the natural key order for the hash. This might provide
enough information to mount a key discovery attack.

In order to prevent this information leakage we randomize the element order
returned by the keys() and each() functions on the hash. We do this by adding a mask to
each hash, and when an insert into the hash occurs we modify the mask in
a pseudo-random way. During traversal we iterate over the bucket indices from
0 to the last bucket, and XOR each iteration value with the mask to pick the
bucket actually visited. The result is that every
time a new key is added to the hash the order of keys will change more
or less completely. This means that the "natural" key order of the hash
exposes almost no useful data to an attacker. Seeing one key in front of
another does not tell you anything about which bucket the key was stored
in. A nice side effect of this is that we can use this mask to
detect people inserting into a hash during each() traversal, which
generally indicates a problem in their code and can
produce quite surprising results and be very difficult to debug. In Perl 5.18
when this happens we warn the developer about it.
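As a sketch of the masked traversal (invented names, not perl's internals):
XOR-ing the bucket index with a value derived from the mask permutes the
indices, so every mask value yields a complete, but differently ordered, walk
over the buckets:

    # Visit bucket indices 0 .. $nbuckets-1 in an order scrambled by the
    # per-hash mask; perturbing the mask on each insert reshuffles the order.
    sub traversal_order {
        my ( $nbuckets, $mask ) = @_;          # $nbuckets is a power of two
        return map { $_ ^ ( $mask % $nbuckets ) } 0 .. $nbuckets - 1;
    }

    # e.g. traversal_order(8, 0b101) visits 5, 4, 7, 6, 1, 0, 3, 2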

A third vulnerability is related to the case where two keys are to be stored in
the same bucket. In this case the order of the keys was predictable: the
most recently added key would be "first" out during a keys() or each()
traversal. This in itself allows a small amount of data to leak to
a potential adversary. By
identifying such a case one could find two (or more) strings which had
the same least significant bits. By stuffing more keys into the hash and
triggering a hash split an attacker could determine that the newly added
bit of the hash value was different, or the same, for the two keys. Without
the key-order randomization logic mentioned previously the attacker could
also determine which of the two had a 1 or 0 in the most significant bit
of the used part of the hash value.

While we were not yet able to construct an actual attack based on this
information, we decided to harden against it anyway. This is done by randomly
choosing whether to insert the colliding key at the top of the bucket chain
or at the second position from the top. Similarly
during a hash split we also make such a decision when keys collide while
being remapped into the new buckets. The end result is that
the order of two keys colliding in a bucket is more or less random,
although the order of more than two keys is not.
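A minimal sketch of that insertion rule (illustrative; the real code is C in
hv.c):

    # A colliding entry goes either at the top of the chain or second from
    # the top, chosen at random, so an observer cannot infer insertion order
    # for a pair of colliding keys.
    sub insert_into_chain {
        my ( $chain, $entry ) = @_;            # $chain: array ref of entries
        if ( @$chain && int rand 2 ) {
            splice @$chain, 1, 0, $entry;      # second from the top
        }
        else {
            unshift @$chain, $entry;           # top of the chain
        }
    }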

People complain. Randomization is good anyway!

We introduced randomization into Perl's hash function in order to harden
it against attack. But we have discovered that this had other positive
consequences that we did not foresee.

The first of these initially appeared to be a downside. Perl's hash function
behaved consistently for more than a decade, and over that time Perl
developers inadvertently created dependencies on its key order. Most of the
examples of this were found in the test files of CPAN modules: many of us got
lazy and "froze" a key order into the test code, for example by embedding the
output of a Data::Dumper call into one's tests. Some of these, however, were
real bugs in reputable and well tested modules.
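A typical example of the frozen-order pattern, and one way to make such a
test order-independent (the test data here is made up):

    use strict;
    use warnings;
    use Test::More;
    use Data::Dumper;

    my %config = ( host => 'example.com', port => 80 );

    # Fragile: a string pasted from a one-off Dumper run silently assumes a
    # particular hash traversal order, so it breaks once that order changes.
    # is( Dumper( \%config ), $frozen_dump_string );

    # More robust: compare the data structure itself, or make the dump
    # deterministic before freezing it.
    $Data::Dumper::Sortkeys = 1;
    is_deeply( \%config, { host => 'example.com', port => 80 }, 'config ok' );
    done_testing();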

By making Perl's key order random, these dependencies on key order became
instantly visible, and a number of bugs that had probably manifested
themselves as "heisenbugs" became regular occurrences, and much easier to
track down and identify. I estimate that for every two "non-bugs" (in things
like test code) that we found, there was one "real bug" identified as well.
Considering one of these was in the Perl core and the other was in DBI, I
personally consider this to be a good result.

Many people object that randomization like this makes debugging harder.
The premise is that it becomes difficult to recreate a bug and thus
debug it. I believe that in practice it is the opposite. Randomization like
this means that a formerly rare bug becomes common, which in turn makes it
much more obvious that the bug is related to subtle dependencies on key
order, effectively making such problems much easier to find.

A last benefit of randomizing the hash function is that we can now,
at any time, change or replace the hash function that Perl is built with.
In fact 5.18 is bundled with multiple hash functions, including Siphash, which
is presumed to be cryptographically strong. Since hash order will from now on
be random, we don't have to worry if we change the function.
External code should already be robust to the hash order being
unpredictable.

How tangible are attacks on Perl's hash function?

There has been a lot of discussion on this subject. Obviously
erring on the side of caution on security matters is the
right course of action. Nevertheless there is a lot of debate on
how practical an attack like this is in real production
environments where Perl is commonly used, such as web servers. Here are
some of the points to keep in mind:

Perl's hash algorithm uses an array of buckets whose size is always between
the number of keys stored in it and a factor of two larger.
This means that a hash with 20 keys in it will generally have 32 buckets, a
hash with 32 keys will be split into 64 buckets, and so on. It also means
that the more keys are inserted into a hash, the less likely further keys are
to be put into the same bucket, so attacking a hash cannot make non-attack
keys slower[2]. An attacker basically only slows their own fetches down, and
except as a by-product of resource consumption they will not affect other
requests.
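On Perls of this era you can watch that sizing rule directly, since a
non-empty hash in scalar context reports its used and allocated bucket
counts:

    # Insert keys one at a time and print the "used/allocated" bucket counts;
    # the allocated count doubles (8, 16, 32, 64, ...) and so always stays
    # within a factor of two of the number of keys.
    my %h;
    for my $n ( 1 .. 40 ) {
        $h{$n} = 1;
        printf "%2d keys -> %s buckets\n", $n, scalar %h;
    }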

For an attack to reach DOS proportions the number of items
inserted into the hash would have to be very, very large. On modern CPUs
a linked list of thousands to hundreds of
thousands of keys would be necessary before there was serious degradation of service. At
this point even if the attack was unsuccessful in terms of degrading
Perl's hash algorithm, it would still function effectively as a
data-flooding denial of service attack. Therefore, focusing on the
hash complexity aspect of the attack seems unwarranted.

Rudimentary out-of-band measures are sufficient to mitigate an attack.
Hard restrictions on the number of keys that may be accepted by publicly
facing processes are sufficient to prevent an attack from causing any damage.
For instance, Apache defaults to restricting the number of parameters it
accepts to 512, which effectively hardens it against this type of attack.
(This is one reason the attack on the rehash mechanism is so important: a
"successful" attack requires relatively few keys.) Similarly,
a well designed application would validate the parameters it receives
and not put them in the hash unless they were recognized.
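For example, a parameter parser along the following lines (the names and
limits are invented for illustration) never lets unrecognized or excessive
keys into the hash in the first place:

    # Enforce a hard cap on the number of parameters and only admit known
    # names into the hash, so attacker-chosen keys never reach it.
    my %ALLOWED    = map { $_ => 1 } qw(user page sort);
    my $MAX_PARAMS = 512;

    sub parse_query {
        my ($query_string) = @_;
        my @pairs = split /&/, $query_string;
        die "too many parameters\n" if @pairs > $MAX_PARAMS;
        my %args;
        for my $pair (@pairs) {
            my ( $k, $v ) = split /=/, $pair, 2;
            $args{$k} = $v if $ALLOWED{$k};
        }
        return \%args;
    }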

So long as the hash function chosen is not vulnerable to multi-collision
attacks, simple per-process hash seed randomization makes the job of finding
an attack key set prohibitively difficult. One must first perform a hash-seed
discovery
attack, then generate a large set of keys. If restrictions on the number
of keys the process will accept
are in place then the keys must be very large before collisions would have
a noticeable effect. This also makes the job of
finding colliding keys all the more expensive.

In many circumstances, such as at a web-service provider, the hosts will be
behind load balancers. This either means that every web host uses a different
hash seed, making hash seed discovery attacks very difficult, or it requires
an attacker to hold open a very long-running, persistent session with the
server they wish to attack, which should be easily preventable via normal
monitoring procedures.

For all these reasons it appears that hash-complexity
attacks in the context of Perl and web hosting environments
are of limited interest so long as:

The hash function does not allow multi-collision attacks.

The hash function uses at least per-process hash seed randomization.

The interface to untrusted potential attackers uses simple, hard
limits on the number of keys it will accept.

These properties are relatively easy to accomplish without resorting to
cryptographically strong hash functions (which are generally slow), or other
complicated measures to prevent attacks. As the case of Perl's rehashing
flaw has
shown, the cure may be worse than the disease. The code that this CVE
exploits for its attack was added to Perl as part of our attempt to
defend against the hypothetical attack vector of excessive bucket
collisions. We are at this time unaware of any real
attack of this form.

Hash Functions For Dynamic Languages

It seems like the bulk of research on hash functions has focused on
finding fast, cryptographically secure hash functions for long strings
containing binary data. However, dynamic languages, like Perl, make heavy
use of hash functions on small strings, which are often restricted to simple
alphanumeric characters. Examples are email addresses, identifiers such as
variable and method names, single-character keys, lists of numeric ids, and
similar use cases. Not only are these strings relatively short, but they use
restricted sets of bits in their bytes.

So far it appears that Bob Jenkins' One-At-A-Time-Hash with minor
modifications provides an acceptable solution. It seems to have good
distribution properties, is reasonably fast for short strings, and, with the
hardening measures added in Perl 5.18, appears to be robust against attack.
Analysis done by the Perl5 Security Team suggests that One-At-A-Time-Hash is
intrinsically more secure than MurmurHash. However, to my knowledge, there is
no peer-reviewed cryptanalysis to prove it.

There seems to be very little research into fast, robust hash algorithms
which are suitable for dynamic languages. Siphash is a notable exception and
a step forward, but within the Perl community there seems to be consensus
that it is currently too slow, at least in the recommended Siphash-2-4
incarnation. It is also problematic that its current implementation only
supports 64-bit architectures.
(No doubt this will improve over time, or perhaps even already has.)

Universal hashing also seems promising in theory, but unproven in
practice for cases like Perl, where the size of the hash table can be
very small, and where the input strings are of variable size, and
often selected from restricted character sets. I attempted to implement
a number of Universal Hashing-style hash functions for Perl, and was
disappointed by extremely poor distributions over the buckets for
simple tasks like hashing a randomly selected set of integers in text
form. This may have been a flaw in my implementation, but it appeared that
Universal Hashing does not perform particularly well when the input
bytes are not evenly distributed. At the very least further work is
required to prove the utility of Universal Hashing in a dynamic language
context.

The dynamic/scripting language community needs the academic computing
community to provide a better toolbox of peer-reviewed string hash functions
which offer speed, good distribution over restricted character sets and short
strings, and which are sufficiently hardened that in practical deployments
they are robust to direct attack. Security is important, but theoretical
attacks which require large volumes of key/response exchanges cannot trump
requirements such as good distribution properties and acceptable performance
characteristics. Perl now makes it relatively easy to add and test new hash
functions (see hv_func.h),
and would make a nice test bed for those interested
in this area of research.

Afterwards

I would like to thank Nicholas Clark, Ruslan Zakirov, and Jarkko Hietaniemi
for their contributions which led to the changes in Perl 5.18. Nicholas
did a lot of deep analysis and provided the motivation to create my attack
proof, Ruslan provided analysis and working key-discovery attack scripts
against Perl's old hash function, and Jarkko motivated me to look into the general
subject.

[1]
See also:
The bug filed against Perl
and the original research.

[2]
If anything, such an attack might make access to keys that aren't part of the
attack faster: The attack keys cluster in one bucket. That means the other keys
are much more likely to spread out among the remaining buckets that now have
fewer keys on average than without the attack.
