The people at Google Deepmind have been working on a kind of neural architecture called Neural Programmer-Interpreters (NPIs). All of ML is ultimately about automating tasks, with the eventual goal of machines being able to automatically carry out all tasks that humans could do. Consider the problem of teaching a machine to do some particular task automatically. It could be a simple task like adding numbers, or a more complicated task like driving a car or folding laundry, or an even more complicated task like making a cup of coffee. Let’s consider the possible ways you could automate the task:
- The ancient (pre-AI) way to automate tasks was to write full highly-detailed program specifications to carry them out. For example, if you want to make a cup of coffee, you give the detailed sequence of voltage levels applied to the robot’s actuators to get them to move to a certain spot, grab a cup, etc. You may also include special conditional cases that work based on inputs from various sensors. The obvious shortcoming here is that to interface with the uncertainty and noisiness of the real world, you need to program in a huge number of conditional behaviors and special cases, and program complexity rapidly gets out of hand.
- The next step (the GOFAI way) was to try to program in enough logic and ‘common sense’ that the machine would figure out by itself how to do the task. For example, if you want it to be able to retrieve a red block from the table, you’d program in enough knowledge that it would know what red blocks are, what tables are, how to move its actuators to get to a table, etc. Completing a task is then analogous to searching for a symbolic sequence of steps that solve a problem, in a way similar to theorem proving. This approach generally failed because hard logic was too inflexible to deal with the real world in a simple way, and as it turns out, common sense is really hard to program into a computer simply based on logic rules.
- The next step (the neural/statistical AI way, from the 80’s to the present) was to come up with a lot of training examples that capture the variability in the real world, and then train some general learning machine on this large data set. At first, the available learning machines weren’t that good, and training them was hard. Over time, better algorithms and more computing power have made this approach very powerful. Unfortunately, this approach also has some issues, namely the requirement of a large data set, and that it’s hard to come up with a neural architecture that could learn non-trivial tasks from just a few training examples the way humans do.
Which brings us to NPI. NPI is an attempt to use neural methods to train machines to carry out simple tasks based on a small amount of training data. And, at its heart, it uses RNNs based on LSTM-cells. It also uses some ideas from the GOFAI approach. If you’re not familiar with RNNs, I highly recommend Andrej Karpathy’s popular blog post on the subject, which it explains it far better than I could. In fact you should probably read it before continuing if you haven’t done so already. I also suggest you read the description of LSTM networks on Colah’s blog if you want to know more about the details of how RNN/LSTM nets work.
Ok, I’ll assume you know what RNNs are now. If you wanted to train an RNN to perform a task, one way you might do it would be to write a sequence-to-sequence (s2s) translator as used in e.g. language translation, where the input is a sequence of symbols in one language, and the output is a sequence of symbols in another language. To automate tasks, the input would be the sequence of observations from the environment (sensors) and the output sequence is the set of functions the machine can perform on the environment. The training examples are sequences where we have ‘held the machine’s hand’ and helped it carry out some task.
People have tried training using this type of RNN before for task automation and, as I mentioned, it requires a lot of training data and it doesn’t generalize very well. Even though it works well enough for machine translation, for automating more algorithmic, compositional tasks, it doesn’t work as well. Let’s take a simple toy example: Sorting a list of numbers using bubblesort. You could imagine having a series of items laid out on the floor, and a ‘virtual robot’ that is able to perform the following actions:
- Walk up and down the line as needed,
- Swap two neighboring items.
In this case, the inputs would be the current position of the robot, and whether the current item is bigger or smaller than the next one. The output actions would be either moving the robot left/right or swapping items. You could manually go through several concrete training examples, providing the sensory inputs and the sequence of actions to take. However, if you train using a simple s2s translator net, you’d find that it would need a lot of training data to start sorting accurately, and wouldn’t generalize very well to larger sequences. Even though it could emulate sorting behavior under certain conditions, the robot wouldn’t have ‘gotten the point’ of sorting.
How do humans learn algorithms? If you’ve read my blog posts, you know the answer is obvious: We learn algorithms/programs by fitting them into our already-well-developed models of the world, decomposing them into hierarchical sequences of sub-programs which are individually simple enough to learn from a small number of training examples. For example, you can decompose bubblesort into repeated application of the sub-program ‘move up the list and swap any numbers that aren’t monotonically increasing.’
The people at Deepmind incorporated this kind of idea into NPI.
The NPI is just a s2s learner that is fitted with a ‘working memory’ in the form of a simple stack-like data structure. Every time it ‘enters’ a sub-program, it pushes the ‘parent’ program on to a stack (and re-initializes its hidden state), and every time it finishes a sub-program, it resumes the parent program where it left off. Since the behavior of sub-programs depends on what parent program is currently executing, it keeps the parent task (called the ‘Input program’) as an input vector to the s2s translator, like so:
So, in a nutshell, that’s all NPI is: An s2s translator where the input sequence is a sequence of (et, pt), where et is the environment at time t, and pt is the parent task at time t, instead of just (et). The output sequence is a sequence of the following:
- A true/false value indicating whether to terminate the current sub-program and go back to the parent program,
- The sub-program to carry out next,
- And the input arguments to that sub-program. For example, a voltage level, or a direction to move.
Since NPI is capable of producing output arguments to programs, the number of necessary programs is greatly reduced. E.g. we don’t have to have separate tasks for moving 1 cm, moving 2 cm, etc. We just have one program for movement and supply the amount to move as the argument. In fact, in the NPI model, all output actions (primitive atomic actions that can be performed on the environment) are performed with a single instruction – ACT. The type of action to perform is simply just the set of arguments to that function. For example, in the bubblesort task, swapping is performed via ACT(i, j, SWAP) and moving left is performed via ACT(1, LEFT) or ACT(2, LEFT), etc. You can specify application-specific tasks that would correspond to the abilities your robot has. In the NPI paper, output arguments are always 3-tuples of integers.
In this model, training is performed by specifying the hierarchical structure during training. This is one weakness of the model – it can’t infer the hierarchical structure by itself. Figuring out how to automatically decompose a program into sub-programs is an interesting and open research question.
There are a few other implementation details too. The major detail is encoding/decoding: how to represent the environment as a vector to the s2s machine, and how to decode the output vector from the s2s machine back into program calls. A few example encoders are given in section 4 of the paper. The general idea in the paper seems to be to mostly just concatenate sensor values together. For example, if the sensor has several discrete states, the sensor value would be a one-hot vector. An exception to this is if the sensor is a visual sensor. In this case, we don’t feed the raw sensor values but instead feed the image through a CNN and output the encoding the CNN produces.
After all the inputs have been concatenated together, they are fed through an MLP to obtain a shorter fixed-length encoding. ‘But wait’, you ask, ‘why don’t we just feed the concatenated vector to the RNN?’ That’s a good question. The answer is partly because it is easier to train the RNN when the input is shorter. But it’s also because the core RNN is shared across all tasks! That’s right – in the paper, the RNN is trained simultaneously on bubblesort, addition, and image canonicalization. This is another crucial aspect of the model, and it allows the system to transfer knowledge learnt from other tasks into new tasks and learn them far more efficiently.
So to sum up, NPI is a RNN/LSTM-based sequence-to-sequence translator, with:
- The ability to recurse into sub-programs and keep track of calling programs when doing so.
- A RNN core that is shared across all tasks, and
- Task-specific encoder/decoders that allow it to specialize for different situations.
During training, the encoder/decoders are trained alongside the RNN core. This is what makes training the NPI somewhat tricky, because even though the familiar gradient descent method is used, the architecture changes during training.
Assuming enough time and/or interest, I’ll write another post on how to train a model like this in TensorFlow.