Transfer Learning for Fun and Profit

Way back, what seems like ages ago now, I purchased Marcus Hutter‘s book Universal Artificial Intelligence, convinced that in its pages lay the secret of AI, and that if I understood that book then I would be able to write my own AI. So I read it from cover to cover (not an easy task for me at that time – I was unfamiliar with the math) over the course of a few months, and I was of course a bit disappointed.

The book lays out in great detail the AIXI model. If you’re familiar with AGI research you probably know what AIXI is, but if you’re not, you’ve probably never heard of it. I’m not going to attempt to fully describe AIXI here (a good intro is given in this page), except to say that at the heart of AIXI lies a search algorithm. This is a search over programs. To simplify it down a lot, AIXI weighs programs by how short they are and how well they fit its observations, and then takes the programs with highest probability as constituting a model of the world. This is called probabilistic induction; induction being of course the classical problem of finding out the next term in a sequence.

In Three kinds of probabilistic induction: Universal distributions and convergence theorems, Solomonoff introduces three kinds of probabilistic induction problems.

  • Sequence induction: Given a set of symbols s1,s2,… predict the next one.
  • Set induction: Given a set of data belonging to some set of sets or categories, classify a new data point as belonging to one of those categories.
  • Operator induction: Given a set of relations q1->a1, q2->a2, …, predict qn->?

Set induction is what is commonly referred to as classification in conventional ML. And Operator induction is referred to as ‘analogy-making’, and is the problem I attacked in my Brainfuck post, albeit in the very limited context of simple Brainfuck programs.

The kind of MCMC search over simple programs that I introduced in that post could also be adapted to set induction and sequence induction. With sequence induction, you just sample from the set of programs that produce a given sequence. But set induction is more tricky. You can’t just sample from the set of programs that produce given outputs, because our list of examples isn’t exhaustive. With operator or sequence induction, we knew exactly what our sampled program needed to put out. But with set induction, we need our sampled program to produce outputs stochastically. We can solve this by, instead of sampling over deterministic programs, sample over probabilistic programs, written in a probabilistic programming language like Church. Then, instead of checking if our sampled program produces the right output, we instead compute the log-likelihood of the program conditioned on the input samples. Our final posterior distribution becomes P(output | program)P(program)/Z, via Bayes’ theorem.

Let’s go back to AIXI for a second. The problem with AIXI, of course, is that the space of all possible programs is big, and searching through all possible programs is hard because for even a single program, it might take an extremely large amount of time for the program to halt and produce an output. So AIXI, which relies on testing every single program in existence, is doubly hard and thus completely intractable. (However, interestingly, finite-time versions of AIXI exist and are optimal in a certain sense, but even those are far too inefficient for practical AI).

In the years since, there’s been a small but active group of people who’ve been chipping away at the problems of AIXI, and trying to come up with ‘more computable’ general AIs. One of the main ideas here is transfer learning – using solutions from previous problems to figure out solutions to new problems. If an AGI had this ability, training it would become similar to training a human: Just give it a progression of harder and harder problems to solve (instead of asking it to solve really really hard problems at the outset, which would be very hard). It’s clear that humans use something similar, and we could even do things like input ready-made knowledge about the world to help the AGI even further.

One early attempt at doing this was the optimally ordered problem solver (OOPS):

[OOPS] spends part of the total search time for a new problem on testing programs that exploit previous solution-computing programs in computable ways. If the new problem can be solved faster by copy-editing/invoking previous code than by solving the new problem from scratch, then OOPS will find this out.

Sounds promising. In practice, though, OOPS was limited by the same sorts of problems that plagued its symbolic AI brethren of the 60’s and 70’s – it could not scale well to real-world problems, and required search time that inevitable grew exponentially in the size of the input. It’s clear that without utilizing stochastic search, methods like this will always have this sort of problem.

Genetic programming (GP) is another set of methods for program search that has had a lot of success. Genetic programming simply uses ‘gene’ representations for programs, and uses the methods of evolution (natural selection, mutation, crossover, etc.) to generate new programs. Listen to the GP/GA people and they will tell you that crossover is th the missing ingredient that we need. In practice, though, GA has never been shown to be better than MCMC methods.

The great thing about MCMC methods is that they really do seem to make it far more efficient to search the space, because we spend far more time in actual promising regions of the solution landscape than the vast deserts of nothingness. And this gives a kind of transer learning ‘for free’ – if we consider each ‘sample’ to be a complete source code listing containing all our previously-solved problems as functions, new samples can invoke old routines by simply calling their functions anywhere in the code. And if we couple this with a more efficient proposal distribution that cleverly generates new code with high (e.g. > 10%) acceptance probability, then we are in business, and can do some serious program searching.

The problem with Brainfuck is that the kind of abstractions that make programming in a higher-level language easy (and make transfer learning easy) – like variables and functions and so on – don’t exist. The upside to this is that sampling from Brainfuck programs is easy (which is why I chose it!) However, if using a higher-level language like Scheme or Church which does support variables and functions, the proposal distribution in MCMC becomes quite complicated and involved, especially if we want to use transfer learning, and still maintain that the MCMC give an unbiased chain over time!

I haven’t yet worked out the maths of this, but in future posts I might revisit the problem and give it a crack to see how it goes.

At any rate, it would be interesting to see how it would work as a driver for one of the main themes of this blog – vision. One could image using a probabilistic program search model as the top layer for a visual model of the world. After a deep adversarial convnet has figured out a good representation for the world, we use a search over probabilistic programs (primed with knowledge about 3d space, orientation, and constellation models) to figure out a very complex, efficient representation for the world. Again, all of this is just a lot of speculation, but it shows that the different sub-fields of AGI/ML are often related and can use ideas from each other to explore new avenues of research.



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s