Reading Douglas Hofstadter will usually bend your mind. He’s been something of an oddball maverick in the AI/cognitive science world, doing his own left-field research at Indiana University for about thirty years. He has a lot of ideas about how the mind and consciousness works, which he deals with in “The Mind’s I”, “Gödel, Escher, Bach”, and “I Am a Strange Loop”, among other books. His books have won awards and have become really popular. I think they’re really thought-provoking books overall but they have a bit of that stream-of-consciousness quality that makes it hard for me to see the big picture. Maybe (quite likely, actually), the problem is that I’m not in the same level of genius as he is.

Anyway, one of Hofstadter’s ideas is that analogy-making is the root of all cognition. In the late 80’s Hofstadter and one of his students, Melanie Mitchell, decided to write a computer program – COPYCAT – that could come up with creative analogies. It could answer questions of the form, “abc is to abd as 123 is to what?” (The answer being, of course, “124”). The idea was to come up with a small toy system that could show examples of creative analogy-making.

In this post, I’m going to see if I can duplicate the results produced by COPYCAT using markov chain monte carlo (MCMC) sampling. I’m going to represent the type of word problems they dealt with as:

```
abc -> lmn
abd -> ?
```

Copycat could come up with analogies, ranging from simple ones like:

```
abc -> lmn
abd -> lmo
```

To more creative ones like:

```
abc -> mrrjjj
abd -> mrrjjjj
```

(If you can’t see how that makes sense, think about it a bit.)

So how did it work? When I first learned about COPYCAT’s internal workings it was a bit surprising since it was much, *much* more complicated than I’d thought, and that the simple nature of the program’s inputs and outputs (short strings of ASCII characters) suggested.

The tl;dr version of how it works is that there are a large number of small programs called ‘codelets.’ Each codelet is responsible for some small pattern recognition task like establishing a predecessor/successor relationship between adjacent letters, or grouping a set of similar letters together. Codelets can operate on the output of other codelets. The system randomly samples from a large collection of codelets (this is called ‘parallel terraced scan’), finds sets of codelets that establish the correct given input-output relationships, and then draws analogies by substituting the inputs (say, substituting abc with abd).

The codelet system, despite the weird way in which it is formulated and presented, is really just a specialized programming language, nothing more. It occurred to me that I might be able to write a program similar to COPYCAT – but much, *much* simpler, using just an ordinary programming language (and modern methods from statistics and machine learning like proper MCMC sampling). One programming language that lends itself well to operating on strings (and in which it is very easy to generate valid random programs) is Brainfuck. So the idea is: We randomly sample from a set of brainfuck programs, take the ones that give the desired result (that is, given input abc, produce output abd), and then see what analogies they produce (what output they produce when, for instance, given abd).

The parallel terraced scan is just a sampling algorithm. I was curious to know how well an analogy-making program would perform if it didn’t use the kind of highly-specialized sampling algorithm COPYCAT used, but instead used standardized sampling methods like Metropolis-Hastings sampling. The metropolis algorithm is one of the oldest MCMC methods but it’s still a pretty useful method. I came up with a *prior probability* for a program as |*A*|^{-(length of program)}, where *A* is the size of the alphabet of the program. The likelihood of a program is how close the produced string is to the target string. An unbiased ‘closeness’ metric would be to just set distance to 0 if the strings are equal and a very large number if they aren’t equal. The problem with this is that it doesn’t lend itself well to optimization. A better way might be to use Levenshtein distance, which is a proper metric in the mathematical sense, so the likelihood of two strings s1 and s2 being the same could simply be the usual Gibbs measure exp(-L(s1,s2)/T)/Z, where L is the Levenshtein distance, T is a factor that controls the ‘hardness’ of the likelihood (the limit of T → 0 causes this likelihood to only be nonzero if the strings are equal) and Z is some normalizing factor.

With our prior and our likelihood we have now constructed a posterior over likely programs. All we need now to implement MCMC sampling is a proposal distribution. That is, we need a distribution that generates new candidate programs given the current candidate program. It is not very difficult to devise a proposal distribution for Brainfuck programs (see the notes at the end of this post).

Running the MCMC gives a lot of interesting results. For instance, here’s a list of some interesting ‘analogies’ the algorithm produces (input -> output are in the first rows, discovered analogies are in the second rows, and the Brainfuck program that produced them are in the third rows):

```
1 -> 2,3,4
2 -> 3,4,5
,+.+.+.>++
5 -> 2,4,6,8,10
7 -> 2,4,6,8,10,12,14
[,+<<[-.+,<,+<.<<,[,.++...-,>,>>.>>-[,<.,.-]]<..>]-<<]++.++.[>>,+.++.++.++.+<]
4 -> 4,3,2,1
10 -> 10,9,8,7,6,5,4,3,2,1
-->,[.-]
# if it's "told" what 'a' is (that is, given 'a' in its memory):
1,2,3 -> a,b,c
1,2,4 -> a,b,d
.+.++.>,+>>-,,>>
```

Of course, the relative frequency by which these solutions are produced differs a large amount from the COPYCAT program. This can easily be taken into account by the differing ‘prior’ imposed by the very different structure of codelet programs vs. Brainfuck programs. There’s nothing, in my opinion, that would inherently cause preference of one system over another, except maybe speed, because the codelets language is designed to have compact representations for these kinds of problems. But I don’t think the codelets system is unique in this regard; it’s probably much more efficient to design some custom language (most likely based on lambdas) for this, rather than using codelets.

### Conclusion

Does COPYCAT have anything to offer modern ML research? It’s hard to say! Could the ideas from parallel terraced scan be adapted to modern ML problems and take on the role of, for instance, an improved stochastic search method alongside modern MCMC methods? I don’t know. The original COPYCAT program was more of a toy example; the *real* implementation of parallel terraced scan ideas to more interesting problems was done in a later program – METACAT. I couldn’t get the original COPYCAT program to run as it’s written in a defunct dialect of lisp. There have been attempts at revivals in Java and Python, and thankfully the python version works. But I couldn’t find any verion of METACAT that works. So it’s a bit hard to objectively assess parallel terraced scan. But if you are aware of people porting/adapting these ideas to newer ML frameworks/problems, I’d absolutely love to hear about it in the comments.

### Note: Proposal distribution for Brainfuck programs.

We desire a function that randomly generates valid modifications of a Brainfuck program. A simple way of doing this might be to randomly change characters in the current program, or to either add or delete characters. This approach has a minor problem in that most generated programs are going to have unbalanced brackets (brainfuck requires [ and ] brackets to be balanced). Here we discuss two methods for solving this problem.

#### Method 1: Generating valid proposals.

We could make the following restriction: Whenever a bracket is deleted, delete its corresponding matching bracket as well. And whenever a bracket is inserted, insert a corresponding matching bracket somewhere in the code. This would work, however we note that this proposal distribution is asymmetric. That is given some program like p1=“,+[+.]”, the likelihood of going to p2=“,++.” (denoted q(p2|p1)) is not the same as the likelihood of going from p2 to p1. Thus to use the Metropolis-Hastings algorithm, we need to be able to calculate q(p2|p1) and q(p1|p2). This is somewhat problematic (because the behavior is conditional on the specific program under consideration), so we devise a proposal distribution that is designed to ease this calculation. The proposal distribution is as follows. First, consider a Brainfuck program as an abstract syntax tree (AST), with the characters `><+-.,`

being leaf nodes, and branches representing brackets. Thus, for instance, the following program:

`+[[>+]+]`

Would be represented in the AST as:

```
+
[
[
>+
]
+
]
```

Or, more graphically:

We can devise a proposal system that either deletes or inserts new leaf nodes, removes a branch (by ‘flattening’ its leaf nodes into the parent), or takes some random subsequence and groups it together under a new node. These four operations can be illustrated as follows:

There are no conditionals in this proposal function, and thus it is easy to calculate conditional probabilities. Let the probability of choosing insertion/deletion be α, and the probability of choosing flattening/grouping be 1-α. Now, the conditional probabilities, assuming the given string is S and the proposed string is R, are:

- If R is result of insertion into S: q(R|S) = α/6⨯(length(S)+1)
- If R is result of deletion from S: q(R|S) = α/length(S)

We can give an additional factor of (1/7)⨯(length(S)/(length(S)+1)) probability of choosing deletion and 6/7 of choosing insertion; this makes insertion/deletion symmetric, so that:

- If R is result of insertion or deletion of S: q(R|S) = q(S|R) = α/7⨯(length(S)+1).

Now, for flattening/groupings. In this case, let’s just pick either flattening or grouping with probability 1/2. Then we have:

- If R is result of flattening of S: q(R|S) = (1-α)/2⨯num_branches(S)
- If R is result of grouping in S: q(R|S) = (1-α)/2⨯num_subsequences(S)

Where num_branches is the total number of non-leaf nodes in the AST (except for the root node) and num_subsequences is the sum of the number of subsequences of each branch in the AST. That is:

where |*B*| is the number of child nodes of node *B*. Note that only the ratio of q(R|S)/q(S|R) is important, so we can rewrite the ratio as simply:

- If R is a result of flattening of S: q(R|S)/q(S|R) = num_branches(S)/num_subsequences(S)
- If R is a result of grouping of S, the inverse of the above.

With these conditional probabilities in hand, the full Metropolis-Hastings algorithm can be given.

#### Method 2: Rejection sampling

This is a much simpler method than the above. Insertion/deletion is as above, but instead of grouping/flattening we can just randomly insert a [ and a ] somewhere in the program, and if there occur adjacent ‘[]’ or ‘][‘, remove them. Then, if the program has unmatched brackets, set the likelihood of the program to zero (or the log-likelihood to a very negative number). This method is very simple to implement and verify, although it has the problem of potentially rejecting a lot of samples. Nevertheless, because of its ease of implementation, this was the first method that was implemented in our code.

Interesting post, thank you!

Wonder if MCMC can do something more complicated like convert a number from dec form to hex form?

Given enough examples, could MCMC learn to convert 16 to 0x10, 1024 to 0x400, etc?

I tried with RNN seq2seq model, and found it does work but not very precise, and it doesn’t work when testing string has a larger length than training string. The length problem seems could be solved by Neural Turing Machine or Neural GPU, but it would still be very interesting to see how MCMC or COPYCAT works on this problem.

Best regards 🙂

LikeLike

MCMC can in principle do that but in practice it depends on your choice of language (b****fuck might not be suitable for this) and how much time you’re willing to dedicate to inference. If the conditions are not set up correctly, the running time can grow exponentially.

LikeLike