2013-10-07

Introduction

The purpose of this post is to add some rigour to what I wrote in the previous post, and in particular to the subsection entitled “Why should we believe that the set of easily computable functions is a ‘random-like’ set?” There I proved that if the Rubik’s-cube-like problem is as hard as it looks, then there can be no polynomial-time-computable property that distinguishes between a random composition of 3-bit scramblers and a purely random Boolean function. This implies that there can be no polynomial-time-computable “simplicity” property that is satisfied by all Boolean functions of at most polynomial circuit complexity but is not satisfied by almost all Boolean functions.

I personally find the assumption that the Rubik’s-cube-like problem is hard very plausible. However, if you disagree with me, then I don’t have much more I can say (though see Boaz Barak’s first comment on the previous post). What Razborov and Rudich did was to use a different set of random polynomial-time-computable functions, one with better theoretical backing. They built them out of a pseudorandom function generator, which in turn is built out of a pseudorandom generator, which is known to exist if the discrete logarithm problem is hard. And the discrete logarithm problem is hard if factorizing large integers is hard. Since many people have tried hard to find an algorithm for factorizing large integers, there is quite strong empirical evidence that this problem is hard. It’s true that there are also people who think that it is not hard, but the existence of a pseudorandom generator does not depend on the hardness of factorizing. Perhaps a more significant advantage of the Razborov-Rudich argument is that any pseudorandom generator will do, so the correctness of their conclusion rests on a weaker hypothesis than the one I used earlier.

Pseudorandom generators

It’s time I said in more detail what a pseudorandom generator is. Suppose you have a function $g:\{0,1\}^k\to\{0,1\}^K$, with $K>k$. Then you have two obvious probability distributions on $\{0,1\}^K$. The first is just the uniform distribution, which we can think of as choosing a random 01-string of length $K$. The second is obtained by choosing an element $x$ uniformly at random from $\{0,1\}^k$ and applying the function $g$. This we can think of as a pseudorandom 01-string of length $K$. The idea is that if $g$ mixes things up sufficiently, then there is no efficient algorithm that will give significantly different results when fed a purely random 01-string and a pseudorandom 01-string.

We can be slightly more formal about this as follows. Suppose $g:\{0,1\}^k\to\{0,1\}^K$ is such a function. Define $p(h)$ to be the probability that $h(g(x))=1$ when $x$ is chosen randomly from $\{0,1\}^k$. Define $q(h)$ to be the probability that $h(u)=1$ when $u$ is chosen randomly from $\{0,1\}^K$. We say that $g$ is an $S$-hard pseudorandom generator if $|p(h)-q(h)|\le 1/S$ whenever $h$ can be computed in time $S$.
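
To make the definition concrete, here is a toy sketch in Python (my own illustration, not part of the theory: the names bad_g and h are invented). The “generator” is deliberately terrible, so a very simple test $h$ achieves a huge gap between $p(h)$ and $q(h)$:

```python
import random

k, K = 8, 16  # input and output lengths, with K > k

def bad_g(x: int) -> int:
    """A deliberately weak 'generator' from k bits to K = 2k bits:
    it just writes x down twice, so its output never looks random."""
    return (x << k) | x

def h(u: int) -> int:
    """A distinguishing test: outputs 1 iff the two halves of u agree."""
    return 1 if (u >> k) == (u & ((1 << k) - 1)) else 0

trials = 100_000
# p(h): probability that h = 1 on pseudorandom strings bad_g(x), x uniform in {0,1}^k.
p = sum(h(bad_g(random.randrange(1 << k))) for _ in range(trials)) / trials
# q(h): probability that h = 1 on uniform strings u in {0,1}^K.
q = sum(h(random.randrange(1 << K)) for _ in range(trials)) / trials
print(p - q)  # about 1 - 2^{-8}: an enormous distinguishing advantage
```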

It may look a little strange that $S$ appears twice there. Shouldn’t one talk about an $(S,\epsilon)$-hard pseudorandom generator, where the number of steps is $S$ and the difference in the probabilities is at most $\epsilon$? The reason for setting $\epsilon$ equal to $1/S$ is that, up to a polynomial, it is the only interesting value, for the following reason. Suppose that the difference in the probabilities is $\epsilon$. Then if we run the algorithm $m$ times, the difference in the expected number of times we get a 1 is $\epsilon m$. If that is significantly bigger than $\sqrt m$, then the probability that the difference in the actual number of times we get a 1 is not at least $\epsilon m/2$ will be small, so we can detect the difference between the two with high probability by counting how many 1s each one gives. This happens when $m$ is proportional to $\epsilon^{-2}$. Speaking a little roughly, if the probabilities differ by $\epsilon$, then you need at least $P_1(1/\epsilon)$ runs of the experiment and at most $P_2(1/\epsilon)$ runs to tell the difference between random and pseudorandom, where $P_1$ and $P_2$ are fixed polynomial functions. Since running the experiment $m$ times doesn’t affect the complexity of the detection process by more than a polynomial amount when $m$ depends polynomially on $S$, we might as well set $\epsilon=1/S$: if you prefer a bigger $\epsilon$ you can get it by repeating the experiment, and there is nothing to be gained from a smaller $\epsilon$ since the difference between random and pseudorandom is already hard to detect.
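
The square-root heuristic in that paragraph is just a standard concentration estimate. Spelled out (this is my addition, with $N_p$ and $N_q$ denoting the number of 1s in $m$ runs on pseudorandom and on random inputs, so that $N_p\sim\mathrm{Bin}(m,p)$ and $N_q\sim\mathrm{Bin}(m,q)$ with $p-q=\epsilon$):

$$\mathbb{E}[N_p-N_q]=\epsilon m,\qquad \mathrm{sd}(N_p-N_q)\le\sqrt{\mathrm{Var}(N_p)+\mathrm{Var}(N_q)}\le\sqrt{m/2},$$

so by Chebyshev’s inequality the two counts separate with high probability once $\epsilon m\gg\sqrt m$, that is, once $m\gg\epsilon^{-2}$.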

Intuitively, a pseudorandom generator is a function from a small Boolean cube to a big one whose output “looks random”. The formal definition makes precise what “looks random” means. I took it to mean “looks random to a computer program that runs in polynomial time”, but one can of course use a similar definition for any model of computation, or indeed any class whatever of potential distinguishing functions. If no function in that class can distinguish between random strings and images of $g$ with reasonable probability, then $g$ is pseudorandom for that class.

Pseudorandom function generators

A pseudorandom generator $g:\{0,1\}^k\to\{0,1\}^K$ produces a small subset of $\{0,1\}^K$ (technically it’s a multiset, but this isn’t too important) with the property that it is hard to distinguish between a random string in the image of $g$ and a purely random string. However, sometimes we want more than this. For example, sometimes we would like to find a function from $\{0,1\}^n$ to $\{0,1\}$ that “looks random”. We could of course think of such a function as a 01-string of length $2^n$ (that is, as a list of the values taken by the function at each point of $\{0,1\}^n$) and use a pseudorandom generator to generate it, but that will typically be very inefficient.

Here is a much better method, which was devised by Goldreich, Goldwasser and Micali. (Their paper is here.) Let $g:\{0,1\}^k\to\{0,1\}^{2k}$ be a pseudorandom generator. (Later I’ll be more precise about how hard it needs to be.) We can and will think of this as a pair of functions $(g_0,g_1)$, each one from $\{0,1\}^k$ to $\{0,1\}^k$. If we are now given a string $y=y_1\dots y_n\in\{0,1\}^n$, we can use it and the functions $g_0$ and $g_1$ to define a function $g_y:\{0,1\}^k\to\{0,1\}^k$. We simply take the composition of $g_0$s and $g_1$s that correspond to the digits of $y$. That is, $g_y=g_{y_n}\circ g_{y_{n-1}}\circ\dots\circ g_{y_1}$. To put that another way, you use the digits of $y$ to decide which of $g_0$ and $g_1$ to apply. For instance, if $y=01101$, then you apply $g_0$, then $g_1$, then $g_1$, then $g_0$, then $g_1$.

What we actually want is a function to $\{0,1\}$, but if we take the first digit then we’ve got one. We can think of it as a function of two variables: given $x\in\{0,1\}^k$ and $y\in\{0,1\}^n$, $h(x,y)$ equals the first digit of $g_y(x)$. But a function of two variables can also be thought of as a collection of functions of one variable in two different ways. We’ve thought so far of $y$ as indexing functions and $x$ as being the argument, but now let’s switch round: we’ll take $x$ as the index and $y$ as the argument. That is, let’s write $f_x(y)$ for $h(x,y)$, which is itself just the first digit of $g_{y_n}\circ\dots\circ g_{y_1}(x)$.
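
Here is a short Python sketch of the construction (my own illustration; SHA-256 is used purely as a stand-in for $g$ so that the code runs, and is of course not claimed to be a pseudorandom generator in the formal sense):

```python
import hashlib

K_BYTES = 16  # the seed length k, measured in bytes for convenience

def g(x: bytes) -> bytes:
    """Stand-in length-doubling map {0,1}^k -> {0,1}^{2k} (not a proven PRG)."""
    left = hashlib.sha256(b"0" + x).digest()[:K_BYTES]
    right = hashlib.sha256(b"1" + x).digest()[:K_BYTES]
    return left + right

def g0(x: bytes) -> bytes:
    return g(x)[:K_BYTES]   # left half of g(x)

def g1(x: bytes) -> bytes:
    return g(x)[K_BYTES:]   # right half of g(x)

def f(x: bytes, y: str) -> int:
    """The GGM function f_x(y): the digits of y choose which of g0, g1
    to apply to the seed x; return the first bit of the result."""
    for digit in y:
        x = g0(x) if digit == "0" else g1(x)
    return x[0] >> 7  # first bit of the final k-bit string

# y = 01101 means: apply g0, then g1, then g1, then g0, then g1.
print(f(bytes(K_BYTES), "01101"))
```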

We now have two probability distributions on the set of all functions from $\{0,1\}^n$ to $\{0,1\}$. One is just the uniform distribution — this is what we mean by a random function. The other is obtained by choosing a random string $x$ of length $k$ and taking the function $f_x$, which applies the composition $g_{y_n}\circ\dots\circ g_{y_1}$ to $x$ and takes the first digit — this is what we mean by a pseudorandom function (associated with this particular construction).

Note that we are now in a very similar situation to the one we were in with 3-bit scramblers earlier. We have a small bunch of efficiently computable functions — the functions $f_x$ — and we would like it to be hard to distinguish between those and entirely random functions. But now we shall be able to prove that it is hard, subject to widely believed hardness hypotheses. Also, even if you don’t believe those hypotheses, the reduction to them is interesting.

How easily can $f_x(y)$ be computed? Well, given $y$, we have to calculate the result of applying to $x$ a composition of $n$ functions, each of which is a polynomial-time-computable function from $\{0,1\}^k$ to $\{0,1\}^k$. So the number of steps we need is $nP(k)$ for some polynomial $P$. This is polynomial in $n$ provided that $k$ is polynomial in $n$. Accordingly, we take $k=n^c$ for some large constant $c$.

The idea now is to show that if we can distinguish between a random function of the form $f_x$ and a genuinely random function, then the pseudorandom generator is not after all very hard: in fact, it will have hardness at most $2^{C'n}$ for some constant $C'$, which (since $n=k^{1/c}$) is substantially less than exponential in $k^{\epsilon}$. Since the pseudorandom generator was arbitrary, this will show that no pseudorandom generator of that hardness exists.

By the way, let me draw attention to the parts of this proof that have always caused me difficulty (though I should say again that it’s the kind of difficulty that can be overcome if one is sufficiently motivated to do so — I’ve just been lazy about it up to now). The first is the point about the roles of the two variables $x$ and $y$ above and the way those roles switch round. Another is a wrong argument that has somehow made me feel that what is going on must be subtler than it actually is. That argument is that a pseudorandom generator is a function defined on $\{0,1\}^k$, so its hardness is a reasonable function of $k$, while the kind of pseudorandomness we’re interested in takes place at the level of Boolean functions defined on $\{0,1\}^n$, which have domains of size $2^n$, so breaking those in time polynomial in $2^n$ will surely have no bearing on the far smaller function that makes the pseudorandom generator.

I didn’t express that wrong argument very well — necessarily, since it’s wrong — but the thing I’ve been missing is that $k$ is quite large compared with $n$, and we are making really quite a strong assumption about the hardness of the pseudorandom generator. Specifically, we’re not just assuming that the generator has superpolynomial (in $k$) hardness: we’re assuming that its hardness is at least $2^{k^\epsilon}$ for some small positive constant $\epsilon$. That way the hardness can easily be comparable to $2^n$. So there isn’t some clever way of “dropping down a level” from subsets of $\{0,1\}^n$ to subsets of $\{0,1\}^k$ or anything like that.
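
To see that the numbers really do match up, with the parameter choice $k=n^c$ from above:

$$k=n^c\quad\Longrightarrow\quad 2^{k^\epsilon}=2^{n^{c\epsilon}}\ge 2^n\quad\text{whenever }c\epsilon\ge 1,$$

so a generator of hardness $2^{k^\epsilon}$ is (for suitable $c$) secure even against distinguishers whose running time is comparable to $2^n$.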

The third thing that got in the way of my understanding the proof is connected with levels. It’s surprising how often something easy can feel hard because it is talking about, say, sets of sets of sets. Here it is important to get clear about what we’re about to discuss, which is a sequence of probability distributions on the set of Boolean functions from $\{0,1\}^n$ to $\{0,1\}$. This shouldn’t be too frightening, since we’ve already discussed two such probability distributions: the uniform distribution and the distribution where you pick a random $x$ and take the function $f_x$. What we’re going to do now is create, in a natural way, a sequence of probability distributions that get gradually more and more random. That is, we’ll start with the distribution that’s uniform over all the $f_x$, and step by step we’ll introduce more randomness into the picture until after $2^n-1$ steps we’ll have the uniform distribution over all functions from $\{0,1\}^n$ to $\{0,1\}$.

I’m going to describe the sequence of distributions in a slightly different way from the way that Razborov and Rudich describe it. (In particular, their distributions start with the uniform distribution and get less and less random.) I don’t claim any particular advantage for my tiny reformulation — I just find it slightly easier. I also feel that if I’ve reworked something in even a minor way then it’s a sign that I understand it, so to a large extent I’m doing it for my own benefit.

First of all, let us take the set of binary sequences of length less than or equal to $n$. These sequences form a binary tree if we join each sequence to the two sequences you get by appending a 0 and a 1. Let us take a sequence of trees $T_0,T_1,\dots,T_{2^n-1}$, where $T_0$ consists just of the root of the tree (that is, the empty sequence) and $T_{2^n-1}$ is the full tree, and let us do so in such a way that each $T_{i+1}$ is obtained from $T_i$ by finding a leaf and adding its children: that is, picking a sequence $z$ in $T_i$ that is not a sequence of length $n$ and is not contained in any other sequence in $T_i$ and adding the sequences $z0$ and $z1$. It is not hard to see that this can be done, and since we add two sequences at a time and get from the tree that consists just of the empty sequence to the tree of all $2^{n+1}-1$ binary sequences of length at most $n$, there are $2^n$ trees in this sequence of trees.
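
Here is a small Python sketch of such a sequence of trees (my own illustration; a tree is represented as the set of binary strings at its vertices, with the empty string as the root):

```python
def tree_sequence(n: int):
    """Yield trees T_0, ..., T_{2^n - 1}: start from the root alone and
    repeatedly add both children of some leaf of length < n."""
    tree = {""}  # T_0: just the root (the empty sequence)
    yield set(tree)
    while True:
        # Expandable leaves: vertices of length < n whose children are absent.
        leaves = [z for z in sorted(tree) if len(z) < n and z + "0" not in tree]
        if not leaves:
            return  # we have reached the full tree
        z = leaves[0]                # any choice of leaf gives a valid sequence
        tree |= {z + "0", z + "1"}   # add the two children of z
        yield set(tree)

trees = list(tree_sequence(3))
assert len(trees) == 2 ** 3              # 2^n trees in the sequence
assert len(trees[-1]) == 2 ** 4 - 1      # the full tree has 2^{n+1} - 1 vertices
```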

Given a tree $T$, we create a probability distribution on the set of functions from $\{0,1\}^n$ to $\{0,1\}$ as follows. For every $y$ we let $l$ be maximal such that the initial segment $y_1\dots y_l$ belongs to $T$. We then take a random $x\in\{0,1\}^k$, apply the composition $g_{y_n}\circ\dots\circ g_{y_{l+1}}$, and take the first digit of the result. If $l=n$ then we interpret this to mean that we simply pick a random $x$ and take its first digit. An important point to be clear about here is that the random points $x$ are not chosen independently for each $y$: they depend only on the initial segment $y_1\dots y_l$. So what I should really have said is that for each vertex $z$ of $T$ we pick a random $x_z\in\{0,1\}^k$. Then for each $y$ we find the maximal initial segment $z=y_1\dots y_l$ that belongs to $T$, apply to $x_z$ the composition $g_{y_n}\circ\dots\circ g_{y_{l+1}}$, and pass to the first digit. If $l=n$ we interpret the composition as the identity function, so we simply take the first digit of $x_y$.
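
Continuing the toy Python from earlier (same caveats; g0, g1 and K_BYTES are the stand-ins from the sketch above, trees comes from the previous sketch, and randbytes needs Python 3.9+), one might sample from the distribution attached to a tree like this:

```python
import random

def sample_from_tree(tree: set):
    """Sample a function {0,1}^n -> {0,1} from the distribution attached
    to `tree`: one shared random seed x_z per vertex z of the tree."""
    seeds = {z: random.randbytes(K_BYTES) for z in tree}

    def fn(y: str) -> int:
        # Maximal initial segment z of y that belongs to the tree.
        z = max((y[:j] for j in range(len(y) + 1) if y[:j] in tree), key=len)
        x = seeds[z]
        for digit in y[len(z):]:       # apply the remaining g0/g1 compositions
            x = g0(x) if digit == "0" else g1(x)
        return x[0] >> 7               # first bit of the result
    return fn

# tree = {""} gives a random f_x; the full tree gives a uniformly random
# function, since every y then gets its own independent seed x_y.
f_pseudo = sample_from_tree(trees[0])
f_uniform = sample_from_tree(trees[-1])
print([f_pseudo(y) for y in ("000", "011", "101", "111")])
```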

Note that if $T=T_0$, then this is just applying the composition $g_{y_n}\circ\dots\circ g_{y_1}$ to a random $x$ and taking the first digit, which is exactly what it means to take a random function of the form $f_x$. Note also that if $T=T_{2^n-1}$, then for every $y$ all we do is take the first digit of $x_y$, which is another way of saying that whatever $y$ is, we choose a random $x_y$ and take its first digit, which of course gives us a function from $\{0,1\}^n$ to $\{0,1\}$ chosen uniformly at random. So it really is the case that the first distribution in our sequence is uniform over all $f_x$ and the last distribution is uniform over all functions from $\{0,1\}^n$ to $\{0,1\}$.

Now let’s think about the difference between the distribution that comes from $T_i$ and the distribution that comes from $T_{i+1}$. Let $w0$ and $w1$ be the two binary sequences that belong to $T_{i+1}$ but not to $T_i$. Let us also write $x_z$ for the random element of $\{0,1\}^k$ associated with any given 01-sequence $z$. Let $l$ be the length of $w$. Then the one thing we do differently when evaluating the random image of $y$ is this. If $y$ has $w$ as an initial segment, then instead of evaluating the first digit of $g_{y_n}\circ\dots\circ g_{y_{l+1}}(x_w)$ we evaluate the first digit of $g_{y_n}\circ\dots\circ g_{y_{l+2}}(x_{wy_{l+1}})$. If $y$ does not have $w$ as an initial segment, then nothing changes.

Note that the first of these evaluations can be described as the first digit of $g_{y_n}\circ\dots\circ g_{y_{l+2}}(g_{y_{l+1}}(x_w))$. The basic idea now is that if we can distinguish between that and the first digit of $g_{y_n}\circ\dots\circ g_{y_{l+2}}(x_{wy_{l+1}})$, then we can distinguish between the pair $(g_0(x_w),g_1(x_w))$ and the pair $(x_{w0},x_{w1})$. But $(x_{w0},x_{w1})$ is a purely random sequence in $\{0,1\}^{2k}$, whereas $(g_0(x_w),g_1(x_w))=g(x_w)$ is a random output from the pseudorandom generator.
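
So, in the notation above, the only difference between the two distributions is which seed is fed into the tail of the composition (this restatement is mine):

$$D_i:\ g_{y_n}\circ\dots\circ g_{y_{l+2}}\bigl(g_{y_{l+1}}(x_w)\bigr),\qquad D_{i+1}:\ g_{y_n}\circ\dots\circ g_{y_{l+2}}\bigl(x_{wy_{l+1}}\bigr),$$

where $g_{y_{l+1}}(x_w)$ is one half of $g(x_w)$ and $x_{wy_{l+1}}$ is one half of a uniformly random string of length $2k$.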

Let us now remind ourselves what we are trying to prove. Suppose that $\Phi$ is a simplicity property that can be computed in time $2^{Cn}$ for some constant $C$ (which is another way of saying that it can be computed in time that is polynomial in $2^n$). By “simplicity property” I mean that $\Phi(f)$ holds whenever $f$ has circuit complexity at most $nP(k)$, where $P$ is the polynomial function described earlier, and that $\Phi(f)$ does not hold for almost all functions. Actually, we can be quite generous in our interpretation of the latter statement: we shall assume that if $f$ is a purely random function, then $\Phi(f)$ holds with probability at most $1/2$.

If $\Phi$ has those two properties, then $\Phi(f_x)$ holds for every $x$, and therefore

$$\mathbb{P}_x[\Phi(f_x)]-\mathbb{P}_f[\Phi(f)]\ge\frac12.$$

I wrote $\mathbb{P}_x$ there to mean the probability when $x$ is chosen uniformly at random from $\{0,1\}^k$, and $\mathbb{P}_f$ to mean the probability when $f$ is chosen uniformly at random from the set of all functions from $\{0,1\}^n$ to $\{0,1\}$.

Let me now write $D_i$ for the probability distribution associated with $T_i$. From the above inequality and the fact that there are $2^n$ of these distributions it follows that there exists $i$ such that

$$\mathbb{P}_{f\sim D_i}[\Phi(f)]-\mathbb{P}_{f\sim D_{i+1}}[\Phi(f)]\ge 2^{-(n+1)}.$$

That is, the probability that $\Phi$ holds if you choose a function randomly using the $i$th distribution is greater by at least $2^{-(n+1)}$ than the probability if you use the $(i+1)$st distribution.
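
The averaging step is just a telescoping sum, with $D_0$ at the pseudorandom end and $D_{2^n-1}$ at the uniform end (my spelling-out):

$$\frac12\ \le\ \mathbb{P}_{D_0}[\Phi]-\mathbb{P}_{D_{2^n-1}}[\Phi]\ =\ \sum_{i=0}^{2^n-2}\Bigl(\mathbb{P}_{D_i}[\Phi]-\mathbb{P}_{D_{i+1}}[\Phi]\Bigr),$$

so at least one of the $2^n-1$ summands is at least $\frac12/(2^n-1)\ge 2^{-(n+1)}$.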

What we would like to show now is that this implies that the hardness of the pseudorandom generator $g$ is at most $2^{C'n}$. To do that, we condition on the values of $x_z$ for all sequences $z$ other than $w$, $w0$ and $w1$. (Recall that $w$ was defined to be the unique sequence such that $w0$ and $w1$ belong to $T_{i+1}$ but not to $T_i$.) By averaging, there must be some choice of all those sequences such that, conditioned on that choice, we still have

$$\mathbb{P}_{f\sim D_i}[\Phi(f)]-\mathbb{P}_{f\sim D_{i+1}}[\Phi(f)]\ge 2^{-(n+1)}.$$

We now have a way of breaking the pseudorandom generator $g$. Suppose we are given a sequence $u\in\{0,1\}^{2k}$ and want to guess whether it is a random sequence of the form $g(x)$ (with $x$ chosen uniformly from $\{0,1\}^k$) or a purely random element of $\{0,1\}^{2k}$. We create a function $f$ from $\{0,1\}^n$ to $\{0,1\}$ as follows. For each $y$, let $z$ be the maximal initial segment of $y$ that belongs to $T_{i+1}$. If $z$ is not equal to $w0$ or $w1$, then take the first digit of $g_{y_n}\circ\dots\circ g_{y_{j+1}}(x_z)$, where $j$ is the length of $z$ and $x_z$ is the fixed sequence from the set of sequences on which we have conditioned. If $y$ has $w$ as an initial segment and $w$ has length $l$, then apply $g_{y_n}\circ\dots\circ g_{y_{l+2}}$ to the left half of $u$ if the next digit $y_{l+1}$ of $y$ is 0 and to the right half of $u$ if it is 1. Then take the first digit of the result.

If $u$ is a random sequence of the form $g(x)$, then what we are doing is choosing a random $x_w$ and taking the first digit of $g_{y_n}\circ\dots\circ g_{y_{l+2}}(g_{y_{l+1}}(x_w))$ whenever $y$ has $w$ as an initial segment. Therefore, we are choosing a random function according to the distribution $D_i$, conditioned on the choices of the $x_z$. If on the other hand $u$ is a purely random sequence in $\{0,1\}^{2k}$, then we are choosing a random function according to the distribution $D_{i+1}$ under the same conditioning. Since the probabilities that $\Phi$ holds differ by at least $2^{-(n+1)}$ and $\Phi(f)$ can (by hypothesis) be computed in time $2^{Cn}$, it follows that the hardness of $g$ is at most $2^{C'n}$ for some constant $C'$.
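
Assembling the toy pieces, the distinguisher looks something like this (my sketch: Phi stands for the hypothetical simplicity property, treated as a black box on truth tables; fixed_seeds holds the conditioned-on values $x_z$; and g0, g1, K_BYTES come from the earlier sketches):

```python
def break_generator(u: bytes, tree_i: set, tree_i1: set,
                    fixed_seeds: dict, Phi, n: int) -> int:
    """Given u in {0,1}^{2k}, build the function f described above and
    output Phi(f). If Phi behaves differently on g(x) and on uniform u,
    this test breaks the generator."""
    # Recover w: the leaf whose children w0, w1 are in tree_i1 but not tree_i.
    (w0,) = {z for z in tree_i1 - tree_i if z.endswith("0")}
    w = w0[:-1]

    def f(y: str) -> int:
        if y.startswith(w):
            # Seed is the left or right half of u, according to the digit after w.
            x = u[:K_BYTES] if y[len(w)] == "0" else u[K_BYTES:]
            rest = y[len(w) + 1:]
        else:
            # Otherwise use the fixed (conditioned-on) seed of the maximal segment.
            z = max((y[:j] for j in range(len(y) + 1) if y[:j] in tree_i1), key=len)
            x, rest = fixed_seeds[z], y[len(z):]
        for digit in rest:
            x = g0(x) if digit == "0" else g1(x)
        return x[0] >> 7

    truth_table = [f(format(v, f"0{n}b")) for v in range(2 ** n)]
    return Phi(truth_table)
```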

Since $n=k^{\epsilon}$ for an arbitrarily small constant $\epsilon$ (recall that $k=n^c$ with $c$ as large as we like), it follows that if there is a polynomial-time-computable property that distinguishes between random and pseudorandom functions from $\{0,1\}^n$ to $\{0,1\}$, then no pseudorandom generator from $\{0,1\}^k$ to $\{0,1\}^{2k}$ can have hardness greater than $2^{C'k^{\epsilon}}$.

A small remark to make at this point is that the hardness of the generator needs to be defined in terms of circuit complexity for this argument to work. Basically, this is because it is not $\Phi$ itself that we are using to distinguish between random and pseudorandom sequences but a function that is created out of $\Phi$ (using in particular the fixed values of the random sequences $x_z$) in a not necessarily uniform way. So even if $\Phi$ can be computed in polynomial time, it does not follow that there is an algorithm (as opposed to circuit) that will break the generator in time $2^{C'n}$.

Can the “backwards” argument be given a similar foundation?

Recall that earlier I proposed a way of getting round the natural-proofs barrier and proceeded to argue that it almost certainly failed, for reasons very similar to the reasons for the natural-proofs barrier itself. The question I would like to consider here is whether that argument can be made to rely on the hardness of factorizing rather than on the hardness of a problem based on 3-bit scramblers that does not, as far as I know, connect to the main body of hardness assumptions that are traditionally made in theoretical computer science.

Here is an informal proposal for doing so. Let $m$ be a positive integer, let $n=\binom{m}{2}$, identify the points of $\{0,1\}^n$ with graphs on $m$ (labelled) vertices in some obvious way, and let $f$ take the value 1 if the corresponding graph contains a clique of size $t$ (for some suitable threshold $t$) and 0 otherwise. Also, let $e_i$ be the function that is 1 if the $i$th edge belongs to the graph and 0 otherwise. Those functions are chosen so that the function $(f,e_1,\dots,e_n):\{0,1\}^n\to\{0,1\}^{n+1}$
