What can physics tell us about understanding natural language?

Years ago, when I had just graduated with a degree in pure math, I was eager to apply the things I’d learned in class to the real world. Dabbling in physics seemed like the natural thing to do for someone with a math background. I read books on quantum mechanics, quantum field theory, and general relativity. I didn’t fully understand a lot of the material in them at the time. Later I went into engineering and then machine learning, and I forgot a lot of the physics I’d learned. But something I didn’t forget was physics’ somewhat distinctive jargon and techniques for modeling the world. Historically, physics had to deal with the problem of modeling the behavior of large collections of atoms and molecules, and so it blended statistical techniques with the kind of mathematical modeling that Newton and his successors pursued.

This blog is (on most days) about ML, so it’s worth mentioning at this point that a lot of people don’t realize physics and machine learning actually have an interesting history together, going back decades. One of Geoffrey Hinton’s landmark pieces of work was his paper on Boltzmann machines. Boltzmann machines are neural nets inspired directly by statistical physics, capable of learning any kind of distribution over their inputs. Note that feedforward neural nets like multilayer perceptrons are designed to learn a mapping from a set of inputs to a set of outputs. Boltzmann machines could do something different – they could learn the distribution over a set of inputs. That is, they can learn useful information about the input without being given an explicit mapping. In other words, they are capable of unsupervised learning.

While Boltzmann machines never became very popular (because they were very, very hard to train), later architectures like restricted Boltzmann machines (RBMs) were easier to train, and actually laid the groundwork for deep learning to build upon. Today’s deep learning architectures can trace their ancestry back to RBMs.

The Boltzmann machine was an example of physics-inspired models helping out ML. Of course, physicists do sometimes get a bit, shall we say, over-enthusiastic, and try to apply insights from physics to areas where they may not be appropriate. This isn’t unique to physicists, though – people in many fields feel the need to prove that their way of looking at things is the best way, or at least a really great one.

So when the news came that Henry W. Lin and the well-known physicist Max Tegmark had recently uploaded a paper to arXiv, Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language, on applying methods from physics to analyze common ML methods for natural language processing (NLP), I thought I’d give it a look.

The paper has a lot of ideas – which I’ll talk about later in this post – and it’s a bit difficult for me to disentangle them from already-known ideas in ML. For example, let’s talk about hidden Markov models (HMMs).

An HMM is a very simple model of sequences over time. The assumptions of the model are that there is some ‘hidden state’ s that we don’t observe directly, and that the hidden state at time t depends only on the state at time t – 1, and not on the state at time t – 2 or any earlier states (this is called the Markov property). The model also produces an output or observation o at each step, which depends only on the state at that step, and on nothing else. The observation is the only way we have of figuring out what goes on inside the model. At first you might think that systems that can be described using HMMs aren’t that common, but by cleverly choosing the right state representation, you can represent a huge variety of systems in the HMM framework. For example, think of a falling ball. You might think of representing the state as the position of the ball. But the next position of the ball doesn’t depend merely on the previous position – velocity and acceleration factor into it as well. If instead you take the state to be position plus velocity and acceleration, then the state at each time step can indeed be fully inferred from the state at the previous time step. And for a lot of problems, you can compress all relevant variables into the state vector like this. So HMMs are actually a lot more general than they might seem at first.

Still, most of the time in ML, when we talk about HMMs, we just restrict ourselves to talking about the very simple formulation where there is a (small) finite number of hidden states, and the transition probabilities are given explicitly. So the ball example is excluded (its hidden state is continuous).
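To make that finite-state formulation concrete, here is a minimal sketch in Python. The two weather states, the umbrella observations, and all the probabilities are made up for illustration:

```python
# A minimal finite-state HMM with explicit tables. The states, observations,
# and probabilities below are invented purely for illustration.
T = {"rain": {"rain": 0.7, "sun": 0.3},        # P(s_t | s_{t-1})
     "sun":  {"rain": 0.2, "sun": 0.8}}
E = {"rain": {"umbrella": 0.9, "none": 0.1},   # P(o_t | s_t)
     "sun":  {"umbrella": 0.2, "none": 0.8}}
INIT = {"rain": 0.5, "sun": 0.5}               # P(s_1)

def forward(obs):
    """Likelihood P(o_1..o_T) of an observation sequence (forward algorithm)."""
    alpha = {s: INIT[s] * E[s][obs[0]] for s in INIT}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * T[r][s] for r in alpha) * E[s][o]
                 for s in alpha}
    return sum(alpha.values())

likelihood = forward(["umbrella", "umbrella", "none"])
```

Everything we can ask of the model (likelihoods, most-probable state paths, and so on) is computed from these two tables plus the initial distribution, which is exactly the ‘small finite number of hidden states, explicit transition probabilities’ setting described above.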

The main insight in the paper is the contrast between two types of models: models where observations are generated by a Markov-like sequential process, and models where observations are generated by some kind of hierarchical grammar-like process. The main conclusion is that natural language has statistics that aren’t well-reproduced by HMMs, but very well-reproduced by hierarchical processes.

To say anything firmer about the paper, I’d need to read it in more depth and compare it with previous work. For example, it’s already known that HMM memory decays exponentially. However, the way it’s proven in the paper (using a new quantity called rational mutual information) is novel.
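The exponential decay of Markov memory is easy to check numerically. This Python sketch, with a toy two-state chain of my own choosing, computes the mutual information between the chain’s state at time 0 and at time t; it falls off geometrically, governed by the second eigenvalue of the transition matrix:

```python
import math

# Two-state Markov chain; measure how I(X_0; X_t) falls off with t.
# The transition matrix is a toy choice, not from the paper.
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]  # stationary distribution (solves pi P = pi)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mutual_info(t):
    """I(X_0; X_t) in nats, starting from the stationary distribution."""
    Pt = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(t):
        Pt = matmul(Pt, P)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            joint = pi[i] * Pt[i][j]
            if joint > 0:
                mi += joint * math.log(joint / (pi[i] * pi[j]))
    return mi

mis = [mutual_info(t) for t in (1, 5, 10, 20)]
```

For this chain the second eigenvalue is 0.7, and the mutual information shrinks roughly like a power of it, which is the exponential decay the paper contrasts with the power-law statistics of natural language.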

The take-home message is that simple RNNs may not be adequate for modeling language, but more hierarchical, deep, or otherwise structured RNN models (or different types of models entirely) may be better suited – which we intuitively already knew.

As an aside, I also love the little bits of physics trivia sprinkled throughout the paper. Like this one:

Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to construct “shortcuts” between random variables that are separated by large amounts of time with short-range interactions; 1D models will be doomed to exponential decay. Hence the ubiquity of power laws explains the success of deep learning. In fact, this can be seen as the Bayesian net version of the important result in statistical physics that there are no phase transitions in 1D.

And this one:

There are close analogies between our deep recursive grammar and more conventional physical systems. For example, according to the emerging standard model of cosmology, there was an early period of cosmological inflation when density fluctuations [got] added on a fixed scale as space itself underwent repeated doublings, combining to produce an excellent approximation to a power-law correlation function. This inflationary process is simply a special case of our deep recursive model (generalized from 1 to 3 dimensions).

Some fun with algebraic numbers

I’ve always been fascinated by the way computers can carry out billions of exact computations, without error, in the blink of an eye. There has historically been a bit of a divide between people who like using computers to calculate approximate answers to questions (numerical methods) and those who like being able to calculate exact answers and prove things rigorously (symbolic methods). You can see this divide in machine learning as well. The pros and cons are:

  • With numerical methods, you can often come up with really simple and general methods that work for a huge range of problems. For example, finite element PDE solvers can solve basically any classical physics problem without having to actually analytically solve the problem.
  • However, a huge drawback of numerical methods is that it’s sometimes hard to know how accurate your solution really is. Engineers (who use numerical methods a lot) use a bunch of tricks to try to overcome this problem but a lot of the time you do need to resort to more involved math to answer it definitively.
  • On the other hand, symbolic methods have the issue that the complexity of the solutions becomes impractically immense for most real-world problems (however, this isn’t always the case).

An example of a simple symbolic vs. numerical representation is rational numbers. We all know that integers can be represented exactly, and addition, subtraction, and multiplication of integers produce integers. Unfortunately, division isn’t included in this happy family, and division is (as it turns out) a pretty important arithmetic operation! So we can extend the set of integers by also including numbers of the form N/M where N and M are integers. Now all four elementary operations are covered, and we can do exact arithmetic. However, it’s worth noting that as we do more and more operations, the rational numbers in our computations tend to acquire very large numerators and denominators, and computation with them becomes less and less efficient. So this is really only good for short sequences of operations.
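You can watch this blow-up happen with Python’s exact Fraction type; the iteration x → x + 1/x here is just a toy example of mine:

```python
from fractions import Fraction

# Exact rational arithmetic: iterate x -> x + 1/x starting from 1 and
# record how many digits the (reduced) denominator has at each step.
x = Fraction(1)
digits = []
for _ in range(6):
    x = x + 1 / x
    digits.append(len(str(x.denominator)))
```

After just six exact steps the denominator already has a dozen digits, and the digit count roughly doubles per step, so the cost of each further exact operation keeps growing.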

Another elementary operation aside from +,*,-,/ that is also important is root-taking i.e. square roots, cube roots, and so on. Unfortunately, the set of rational numbers isn’t closed under these operations. But the set of algebraic numbers is.

Algebraic numbers are the set of numbers that are roots of (non-zero) univariate polynomials with integer coefficients. For example, 8x^3 − 4x^2 − 4x + 1 = 0. Algebraic numbers may in general be real or complex, and they include all integers and all rational numbers (an integer N is the root of x – N, and a rational number N/M is the root of Mx – N). Not all numbers are algebraic – π isn’t, for example.

As it turns out, the set of algebraic numbers contains a lot of interesting things. For example, it contains all of Euclidean geometry. You know – the part of high school 2D geometry where you construct shapes and prove things about them using just a straight edge (ruler) and compass. It was very important to the ancient Greeks to use only straight edges and compasses for any kind of geometrical construction, whether they were trying to draw a hexagon, or cut a triangle down the middle, or what have you. Well, as it turns out, all lengths, coordinates, etc. in Euclidean geometry are representable exactly as algebraic numbers. Actually this isn’t even that hard to prove. If you could represent algebraic numbers exactly on a computer, then you could do Euclidean geometry exactly. For example, let’s say you wanted to check if an object is symmetric around some line or point. You could rotate or flip it, and then compare the coordinates exactly to see if they match.

Algebraic numbers contain a lot of other useful things as well, like the roots of unity, which are useful for doing Fourier transforms, which in turn are useful for, well, pretty much everything!

So is it possible to represent algebraic numbers exactly, in the same way you can do for integer and rational numbers? And also be able to do computations on them efficiently (i.e. in polynomial time)? The answer is: Yes!

The secret lies in the following fact: every algebraic number has a unique minimal polynomial. That is, every algebraic number is the root of a univariate polynomial (with integer coefficients) that is irreducible – the polynomial cannot be factored into smaller-degree factors that also have integer coefficients. This minimal polynomial is, in addition, unique (up to multiplication by a constant factor). The proof of this isn’t that difficult, but it’s a bit technical, so I’m not going to get into it here. Suffice it to say that we can represent algebraic numbers by their minimal polynomials, along with some information specifying exactly which root of the minimal polynomial we mean (in general the minimal polynomial might be of degree N > 1 and have many roots). If we want to compare two algebraic numbers for equality, we first compare their minimal polynomials. If they are not the same, the algebraic numbers are not equal. If they are the same, we check that they both represent the same root.

How do we specify which root of the minimal polynomial we want? There are a few ways. One more advanced way uses Thom’s lemma and represents the root using the signs of the derivatives of the polynomial at that root. That method is less useful for complex algebraic numbers. A simpler method first enumerates all the roots of the polynomial (numerically, up to some specified precision) and then calculates the smallest absolute distance d between them. We then represent our desired root numerically, with error less than d/2, so it cannot be confused with any other root. Note that this still represents the root exactly, since we have the minimal polynomial as well.

In Julia, I can use the following type that captures all of what we’ve discussed:

type AlgebraicNumber{T<:Integer,F<:AbstractFloat}
    coeff::Vector{T}    # coefficients of the minimal polynomial
    apprx::Complex{F}   # numerical approximation to the root
    prec::F             # absolute precision of the approximation
end

Here we represent the minimal polynomial as an array of integer coefficients coeff, and an approximation to the root as well as the absolute precision in apprx and prec.

Ok, we have a way of representing algebraic numbers. What about doing arithmetic on them? Negating an algebraic number is easy – just replace x with –x in the polynomial. Taking the nth root of an algebraic number is also easy – just replace x with x^n in the polynomial. Inversion is also easy – just reverse the list of coefficients!
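Here is a quick Python sketch of those three substitutions acting on plain coefficient lists. This is my own illustration, not the Julia library; note also that the substituted polynomial can come out reducible in special cases (e.g. roots of unity), so a real implementation would re-minimalize afterwards:

```python
# Minimal-polynomial tricks on coefficient lists, where coeffs[i] is the
# coefficient of x**i. A sketch of the substitutions described in the text.

def negate(coeffs):
    """Polynomial with root -r: substitute x -> -x (flip odd coefficients)."""
    return [c if i % 2 == 0 else -c for i, c in enumerate(coeffs)]

def invert(coeffs):
    """Polynomial with root 1/r: reverse the coefficient list."""
    return coeffs[::-1]

def nth_root(coeffs, n):
    """Polynomial with root r**(1/n): substitute x -> x**n."""
    out = []
    for c in coeffs:
        out.append(c)
        out.extend([0] * (n - 1))
    return out[:len(out) - (n - 1)]

# 2 has minimal polynomial x - 2, i.e. [-2, 1]; sqrt(2) then gets x^2 - 2.
sqrt2 = nth_root([-2, 1], 2)
```

As a sanity check, evaluating the resulting polynomial at a floating-point sqrt(2) gives (approximately) zero, and reversing it gives a polynomial whose root is 1/sqrt(2).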

Addition and multiplication of algebraic numbers are trickier. In these cases, you have to construct a new polynomial that has as its roots the sums or products of the roots of the original polynomials. This can be done using resultants, but it winds up being very slow. A more interesting and much faster method is based on power sums of polynomial roots. I won’t get into the details here – the brief explanation is that you can ‘transform’ any polynomial with integer coefficients into a power-sum representation with rational coefficients and also recover the original polynomial from it. In this new domain, addition and multiplication become simple products and sums of coefficients. Alin Bostan has done a lot of work in this area and the implementation I use is based on his papers. It’s also the implementation used in SymPy.

I’ve written down all of these ideas in a Julia library that you can use to experiment with algebraic numbers. You can do fun things like represent sqrt(2) exactly:

    sqrt2 = sqrt(AlgebraicNumber(2))
    @assert(sqrt2^2 == 2)

Or define the golden ratio and prove that it is indeed the golden ratio:

    # Golden ratio
    ϕ = 1//2 + sqrt(AlgebraicNumber(5//4))

    @assert(1+1/ϕ == ϕ)

Or you can also do other stuff, like compute the ‘difficult’ minimal polynomial in this stackoverflow question, without even breaking a sweat:

    n = sqrt(AlgebraicNumber(9*5)) - sqrt(AlgebraicNumber(4*7)) + sqrt(AlgebraicNumber(35))
    d = 1 - sqrt(AlgebraicNumber(5)) + sqrt(AlgebraicNumber(7))
    α = n / d
    @assert(α.coeff == BigInt[3596, 2312, -280, -156, 19])

For the math-inclined, Algorithms in Real Algebraic Geometry by Basu, Pollack, and Roy is a great resource to learn more about algebraic number computation.

Recursive (not recurrent!) Neural Nets in TensorFlow

For the past few days I’ve been working on how to implement recursive neural networks in TensorFlow. Recursive neural networks (which I’ll call TreeNets from now on to avoid confusion with recurrent neural nets) can be used for learning tree-like structures (more generally, directed acyclic graph structures). They are highly useful for parsing natural scenes and language; see the work of Richard Socher (2011) for examples. More recently, in 2014, Ozan İrsoy used a deep variant of TreeNets to obtain some interesting NLP results.

The best way to explain TreeNet architecture is, I think, to compare with other kinds of architectures, for example with RNNs:



In RNNs, at each time step the network takes as input its previous state s(t-1) and its current input x(t) and produces an output y(t) and a new hidden state s(t). TreeNets, on the other hand, don’t have a simple linear structure like that. With RNNs, you can ‘unroll’ the net and think of it as a large feedforward net with inputs x(0), x(1), …, x(T), initial state s(0), and outputs y(0),y(1),…,y(T), with T varying depending on the input data stream, and the weights in each of the cells tied with each other. You can also think of TreeNets by unrolling them – the weights in each branch node are tied with each other, and the weights in each leaf node are tied with each other. The TreeNet illustrated above has different numbers of inputs in the branch nodes. Usually, we just restrict the TreeNet to be a binary tree – each node either has one or two input nodes. There may be different types of branch nodes, but branch nodes of the same type have tied weights.

The advantage of TreeNets is that they can be very powerful in learning hierarchical, tree-like structure. The disadvantages are, firstly, that the tree structure of every input sample must be known at training time. We will represent the tree structure like this (lisp-like notation):

(S (NP that movie) (VP was) (ADJP cool))

In each sub-expression, the type of the sub-expression must be given – in this case, we are parsing a sentence, and the type of the sub-expression is simply the part-of-speech (POS) tag. You can see that expressions with three elements (one head and two tail elements) correspond to binary operations, whereas those with four elements (one head and three tail elements) correspond to ternary operations, etc.

The second disadvantage of TreeNets is that training is hard because the tree structure changes for each training sample and it’s not easy to map training to mini-batches and so on.

Implementation in TensorFlow

There are a few methods for training TreeNets. The method we’re going to use is probably the simplest conceptually: it consists of assigning a tensor to every single intermediate form. So, for instance, imagine that we want to train on simple mathematical expressions, and our input expressions are the following (in lisp-like notation):

    (+ 1 2)
    (* (+ 2 1) 2)
    (+ (* 1 2) (+ 2 1))

Now our full list of intermediate forms is:

    a = 1
    b = 2
    c = (+ a b)
    d = (+ b a)
    e = (* d b)
    f = (* a b)
    g = (+ f d)

For example, f = (* 1 2), and g = (+ (* 1 2) (+ 2 1)). We can see that all of our intermediate forms are simple expressions of other intermediate forms (or inputs). Each of these corresponds to a separate sub-graph in our TensorFlow graph. So, for instance, for *, we would have two matrices W_times_l and W_times_r, and one bias vector bias_times. And for computing f, we would have:

    f = relu(W_times_l * a + W_times_r * b + bias_times)

Similarly, for computing d we would have:

    d = relu(W_plus_l * b + W_plus_r * a + bias_plus)

The full intermediate graph (excluding input and loss calculation) looks like:

    a = W_input * [1, 0]
    b = W_input * [0, 1]
    c = relu(W_plus_l  * a + W_plus_r  * b + bias_plus)
    d = relu(W_plus_l  * b + W_plus_r  * a + bias_plus)
    e = relu(W_times_l * d + W_times_r * b + bias_times)
    f = relu(W_times_l * a + W_times_r * b + bias_times)
    g = relu(W_plus_l  * f + W_plus_r  * d + bias_plus)
    output1 = sigmoid(W_output * a)
    output2 = sigmoid(W_output * c)
    output3 = sigmoid(W_output * e)
    output4 = sigmoid(W_output * g)

For training, we simply initialize our inputs and outputs as one-hot vectors (here, we’ve set the symbol 1 to [1, 0] and the symbol 2 to [0, 1]), and perform gradient descent over all W and bias matrices in our graph. The advantage of this method is that, as I said, it’s straightforward and easy to implement. The disadvantage is that our graph complexity grows as a function of the input size. This isn’t as bad as it seems at first, because no matter how big our data set becomes, there will only ever be one training example (since the entire data set is trained simultaneously), and so even though the size of the graph grows, we only need a single pass through the graph per training epoch. However, it seems likely that if our graph grows to a very large size (millions of data points) then we need to look at batch training.
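The bookkeeping that turns nested expressions into that flat, de-duplicated list of intermediate forms can be sketched in plain Python. The tuple encoding and the function name build_forms are my own, not from the accompanying TensorFlow code:

```python
# Flatten nested expressions into shared intermediate forms.
# Expressions are tuples like ('+', 1, 2); leaves are the input symbols.

def build_forms(expressions):
    """Return an ordered list of unique sub-expressions, leaves first."""
    forms, index = [], {}

    def visit(expr):
        if expr in index:             # shared sub-expression: reuse its form
            return index[expr]
        if isinstance(expr, tuple):   # visit children before the parent
            for child in expr[1:]:
                visit(child)
        index[expr] = len(forms)
        forms.append(expr)
        return index[expr]

    for e in expressions:
        visit(e)
    return forms

forms = build_forms([
    ('+', 1, 2),
    ('*', ('+', 2, 1), 2),
    ('+', ('*', 1, 2), ('+', 2, 1)),
])
```

Running this on the three example expressions yields exactly seven forms, corresponding to a through g above, with (+ 2 1) appearing once even though it occurs in two expressions.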
Batch training actually isn’t that hard to implement; it just makes it a bit harder to see the flow of information. We can represent a ‘batch’ as a list of variables: [a, b, c]. So, in our previous example, we could replace the operations with batch operations:

    [a, b]    = W_input * [[1, 0], [0, 1]]
    [c, d, g] = relu(W_plus_l  * [a, b, f] + W_plus_r  * [b, a, d] + bias_plus)
    [e, f]    = relu(W_times_l * [d, a]    + W_times_r * [b, b]    + bias_times)
    output    = sigmoid(W_output * [a, c, e, g])

You’ll immediately notice that even though we’ve rewritten it in a batch way, the order of variables inside the batches is totally random and inconsistent. This is the problem with batch training in this model: the batches need to be constructed separately for each pass through the network. If we think of the input as being a huge matrix where each row (or column) of the matrix is the vector corresponding to each intermediate form (so [a, b, c, d, e, f, g]) then we can pick out the variables corresponding to each batch using TensorFlow’s tf.gather function. So for instance, gathering the indices [1, 0, 3] from [a, b, c, d, e, f, g] would give [b, a, d], which is one of the sub-batches we need. The total number of sub-batches we need is two for every binary operation and one for every unary operation in the model.

For the sake of simplicity, I’ve only implemented the first (non-batch) version in TensorFlow, and my early experiments show that it works. For example, consider predicting the parity (even or odd-ness) of a number given as an expression. So 1 would have parity 1, (+ 1 1) (which is equal to 2) would have parity 0, (+ 1 (* (+ 1 1) (+ 1 1))) (which is equal to 5) would have parity 1, and so on. Training a TreeNet on the following small set of training examples:

    [+,[+,1,1],1 ]
Seems to be enough for it to ‘get the point’ of parity, and it is capable of correctly predicting the parity of much more complicated inputs, for instance:

Correctly, with very high accuracy (>99.9%), with accuracy only diminishing once the size of the inputs becomes very large. The code is just a single python file which you can download and run here.

Write once, infer anywhere: Modular, reusable probabilistic inference components

The Julia language is pretty interesting; it’s a scientific/numerical language (think Matlab or Scipy) with a syntax that corresponds pretty closely with mathematical notation. It has a just-in-time compiler so it’s fairly fast (you can write fast loops in it) and that seems to be the reason a lot of people adopt it over, say, Matlab. But that’s not the main reason it’s an interesting language. The main reason, in my opinion, is the type system.
It’s a very structured type system – pretty impressive and well thought-out. It lets you do things that would normally be the domain of functional languages like OCaml or Haskell. Better yet, it has strong enough type inference and conversion that most of the time you don’t need to worry about types. In other words, the types only add to the language – they don’t really take anything away from it. I highly recommend the Julia documentation’s own reference on types, as well as the wikibooks entry.

In this blog post I’m going to use Julia’s type system to write the expectation-maximization algorithm in a completely general way.

A couple of years ago Dahua Lin posted a message on julia-stats calling for the construction of a new language (within Julia) that would take general graphical models (in abstract form) and ‘compile’ them to Julia code. As far as I’m aware the idea didn’t result in a working system, but it’s a very intriguing one: it would free modelers from the hassle of re-implementing a whole bunch of inference algorithms (like EM) over and over again, and it would be awesome if it could be made to work. The slides are attached here (I’m linking them because I will be referring to them in this write-up).

Inspired by that, I decided to use the Distributions.jl package as a starting point to see how far I could go just by defining types and generic functions on those types to do various inference procedures (such as message passing). Note that I’m not defining a new language, just exploring what the Julia compiler itself can do. I found, pleasantly enough, that the interface Distributions.jl provides is rich enough to implement EM on both models Lin mentions in his presentation, with just some minimal EM code that specifies the high-level model and avoids the low-level ‘grunt’ details. To whet your appetite, here’s a generic implementation of the EM algorithm that works, without modification, on any type of mixture (fit_mm_em):

# This returns, for a model and observations x,
# the distribution over latent variables.
function infer(m::MixtureModel, x)
        K = length(m.component)  # number of mixture components
        N = size(x,2)            # number of data points
        lq = Array(Float64, N, K)
        for k = 1:K
                lq[:,k] = logpdf(m.component[k], x) .+ logpdf(m.mixing, k)
        end
        lq
end

logp_to_p(lp) = exp(lp .- maximum(lp))

function fit_mm_em{T}(m::MixtureModel{T}, x)
        # Expectation step
        lq = infer(m, x)

        # Normalize log-probability and convert to probability
        q = logp_to_p(lq)
        q = q ./ sum(q,2)

        # Maximization step
        cr = 1:length(m.component)
        comps = [fit_em(m.component[k], x, q[:,k]) for k = cr]
        mix   =  fit_em(m.mixing, [cr], vec(sum(q,1)))

        MixtureModel{T}(mix, comps)
end

# 'fallback' function
fit_em(m::Distribution, x, w) = fit_mle(typeof(m), x, w)

All that this requires is that fit_em be defined on the type of mixture component. If not, it will use the fallback fit_mle(). The mixture model type is defined as:

# A 'closed' mixture model defining a full generative model
type MixtureModel{T} <: Distribution
        mixing::Distribution   # Distribution over mixing components
        component::Vector{T}   # Individual component distributions
end

Now let’s have some fun. By passing a MixtureModel{MvNormal} type to the above function, we can get it to work on a Gaussian mixture (the first model Lin mentions in his presentation). See gmm_test.jl in the attached code. But what about Lin’s second model (page 9)? That involves mixture components and weights that have priors on their parameters. Do we have to change the EM function above? No. All we have to do is define a new type:
type DistWithPrior <: Distribution
        pri                    # a prior over the parameters of dist (tuple)
        dist::Distribution     # the distribution itself
end

And the corresponding functions on it:

import Distributions: logpdf

fit_em{T<:DistWithPrior}(m::T, x, w) = T(m.pri, fit_map(m.pri, typeof(m.dist), x, w))
logpdf{T<:DistWithPrior}(m::T, x) = logpdf(m.dist, x)

Pretty simple. Now by, say, creating a DistWithPrior(Dirichlet, Categorical) distribution, you can model a categorical variable with a Dirichlet prior, and the EM algorithm will update it according to that prior, automatically.

I think this is pretty cool because this is achieved without ever even worrying about the grunt details of mathematical formulas. You can see this working on a sample data set in dahualin.jl in the attached code.
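For readers who don’t use Julia, the same duck-typing idea can be sketched in Python: any component object exposing logpdf and a weighted fit can be dropped into one generic EM step. Everything below (the class names, the 1-D Gaussian component, and the toy data) is my own illustration, not a port of the attached code:

```python
import math

# A duck-typed EM step for a mixture model: any component class with
# logpdf(x) and a weighted fit(xs, ws) plugs in unchanged.

class Gaussian:
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def logpdf(self, x):
        z = (x - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2 * math.pi))

    def fit(self, xs, ws):
        # weighted maximum-likelihood estimate of mean and variance
        w = sum(ws)
        mu = sum(wi * xi for wi, xi in zip(ws, xs)) / w
        var = sum(wi * (xi - mu) ** 2 for wi, xi in zip(ws, xs)) / w
        return Gaussian(mu, max(math.sqrt(var), 1e-6))

def em_step(components, weights, xs):
    # E step: responsibilities q[n][k] proportional to pi_k * p_k(x_n)
    q = []
    for x in xs:
        lp = [math.log(w) + c.logpdf(x) for c, w in zip(components, weights)]
        m = max(lp)
        p = [math.exp(l - m) for l in lp]
        z = sum(p)
        q.append([pi / z for pi in p])
    # M step: refit each component using its responsibilities as weights
    comps = [c.fit(xs, [qn[k] for qn in q]) for k, c in enumerate(components)]
    mix = [sum(qn[k] for qn in q) / len(xs) for k in range(len(components))]
    return comps, mix

# Two clusters around -2 and +2; a few EM steps recover their means.
xs = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
comps, mix = [Gaussian(-1.0, 1.0), Gaussian(1.0, 1.0)], [0.5, 0.5]
for _ in range(5):
    comps, mix = em_step(comps, mix, xs)
```

The em_step function never looks inside Gaussian; swapping in any other component class with the same two methods is the Python analogue of defining fit_em and logpdf on a new Julia type.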

Neural Programmer-Interpreters: programs that can learn programs

Google DeepMind has been working on a kind of neural architecture called Neural Programmer-Interpreters (NPIs). All of ML is ultimately about automating tasks, with the eventual goal of machines being able to automatically carry out all tasks that humans can do. Consider the problem of teaching a machine to do some particular task automatically. It could be a simple task like adding numbers, or a more complicated task like driving a car or folding laundry, or an even more complicated task like making a cup of coffee. Let’s consider the possible ways you could automate the task:

  1. The ancient (pre-AI) way to automate tasks was to write full highly-detailed program specifications to carry them out. For example, if you want to make a cup of coffee, you give the detailed sequence of voltage levels applied to the robot’s actuators to get them to move to a certain spot, grab a cup, etc. You may also include special conditional cases that work based on inputs from various sensors. The obvious shortcoming here is that to interface with the uncertainty and noisiness of the real world, you need to program in a huge number of conditional behaviors and special cases, and program complexity rapidly gets out of hand.
  2. The next step (the GOFAI way) was to try to program in enough logic and ‘common sense’ that the machine would figure out by itself how to do the task. For example, if you want it to be able to retrieve a red block from the table, you’d program in enough knowledge that it would know what red blocks are, what tables are, how to move its actuators to get to a table, etc. Completing a task is then analogous to searching for a symbolic sequence of steps that solve a problem, in a way similar to theorem proving. This approach generally failed because hard logic was too inflexible to deal with the real world in a simple way, and as it turns out, common sense is really hard to program into a computer simply based on logic rules.
  3. The next step (the neural/statistical AI way, from the 80’s to the present) was to come up with a lot of training examples that capture the variability in the real world, and then train some general learning machine on this large data set. At first, the available learning machines weren’t that good, and training them was hard. Over time, better algorithms and more computing power have made this approach very powerful. Unfortunately, this approach also has some issues, namely the requirement of a large data set, and that it’s hard to come up with a neural architecture that could learn non-trivial tasks from just a few training examples the way humans do.

Which brings us to NPI. NPI is an attempt to use neural methods to train machines to carry out simple tasks based on a small amount of training data. And, at its heart, it uses RNNs based on LSTM cells. It also uses some ideas from the GOFAI approach. If you’re not familiar with RNNs, I highly recommend Andrej Karpathy’s popular blog post on the subject, which explains them far better than I could. In fact you should probably read it before continuing if you haven’t done so already. I also suggest you read the description of LSTM networks on Colah’s blog if you want to know more about the details of how RNN/LSTM nets work.

Ok, I’ll assume you know what RNNs are now. If you wanted to train an RNN to perform a task, one way you might do it would be to write a sequence-to-sequence (s2s) translator as used in e.g. language translation, where the input is a sequence of symbols in one language, and the output is a sequence of symbols in another language. To automate tasks, the input would be the sequence of observations from the environment (sensors) and the output sequence is the set of functions the machine can perform on the environment. The training examples are sequences where we have ‘held the machine’s hand’ and helped it carry out some task.

People have tried training using this type of RNN before for task automation and, as I mentioned, it requires a lot of training data and it doesn’t generalize very well. Even though it works well enough for machine translation, for automating more algorithmic, compositional tasks, it doesn’t work as well. Let’s take a simple toy example: Sorting a list of numbers using bubblesort. You could imagine having a series of items laid out on the floor, and a ‘virtual robot’ that is able to perform the following actions:

  1. Walk up and down the line as needed,
  2. Swap two neighboring items.

In this case, the inputs would be the current position of the robot, and whether the current item is bigger or smaller than the next one. The output actions would be either moving the robot left/right or swapping items. You could manually go through several concrete training examples, providing the sensory inputs and the sequence of actions to take. However, if you train using a simple s2s translator net, you’d find that it would need a lot of training data to start sorting accurately, and wouldn’t generalize very well to larger sequences. Even though it could emulate sorting behavior under certain conditions, the robot wouldn’t have ‘gotten the point’ of sorting.

How do humans learn algorithms? If you’ve read my blog posts, you know the answer is obvious: We learn algorithms/programs by fitting them into our already-well-developed models of the world, decomposing them into hierarchical sequences of sub-programs which are individually simple enough to learn from a small number of training examples. For example, you can decompose bubblesort into repeated application of the sub-program ‘move up the list and swap any numbers that aren’t monotonically increasing.’

The people at Deepmind incorporated this kind of idea into NPI.

The NPI is just an s2s learner fitted with a ‘working memory’ in the form of a simple stack-like data structure. Every time it ‘enters’ a sub-program, it pushes the ‘parent’ program onto a stack (and re-initializes its hidden state), and every time it finishes a sub-program, it resumes the parent program where it left off. Since the behavior of a sub-program depends on which parent program is currently executing, it keeps the parent task (called the ‘Input program’) as an input vector to the s2s translator, like so:


So, in a nutshell, that’s all NPI is: an s2s translator where the input sequence is a sequence of pairs (e_t, p_t), where e_t is the environment at time t and p_t is the parent task at time t, instead of just (e_t). The output at each step consists of the following:

  1. A true/false value indicating whether to terminate the current sub-program and go back to the parent program,
  2. The sub-program to carry out next,
  3. And the input arguments to that sub-program. For example, a voltage level, or a direction to move.

Since NPI is capable of producing output arguments to programs, the number of necessary programs is greatly reduced. For example, we don’t have to have separate programs for moving 1 cm, moving 2 cm, etc. We just have one program for movement and supply the amount to move as an argument. In fact, in the NPI model, all output actions (primitive atomic actions that can be performed on the environment) are performed with a single instruction – ACT. The type of action to perform is simply the set of arguments to that instruction. For example, in the bubblesort task, swapping is performed via ACT(i, j, SWAP) and moving left is performed via ACT(1, LEFT) or ACT(2, LEFT), etc. You can specify application-specific actions that correspond to the abilities your robot has. In the NPI paper, output arguments are always 3-tuples of integers.

In this model, the hierarchical structure of a program must be specified explicitly in the training data. This is one weakness of the model – it can’t infer the hierarchical structure by itself. Figuring out how to automatically decompose a program into sub-programs is an interesting and open research question.
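The push/pop control flow described above can be sketched as follows. The learned LSTM core is replaced here by a hand-scripted stand-in, purely to make the stack mechanics concrete; all names are illustrative, not from the paper.

```python
# A minimal sketch of NPI's stack-based control flow. The learned LSTM
# core is replaced by a scripted stand-in; all names are illustrative.

def run_npi(core, env, program, max_steps=100):
    """Execute `program`, recursing into sub-programs via a call stack."""
    stack = []       # saved (program, hidden_state) frames
    hidden = None    # the core RNN's hidden state
    for _ in range(max_steps):
        # The core maps (environment, current program) -> outputs.
        end, sub_program, args, hidden = core(env.observe(), program, hidden)
        if end:                                  # current sub-program done
            if not stack:
                return                           # top-level program finished
            program, hidden = stack.pop()        # resume the parent
        elif sub_program == "ACT":
            env.act(args)                        # primitive action
        else:
            stack.append((program, hidden))      # push the caller
            program, hidden = sub_program, None  # re-initialize hidden state

class DemoEnv:
    """Records primitive actions instead of acting on a real world."""
    def __init__(self):
        self.log = []
    def observe(self):
        return None
    def act(self, args):
        self.log.append(args)

def scripted_core(obs, program, hidden):
    """Stand-in for the LSTM: MAIN calls SUB once; SUB does one ACT."""
    step = 0 if hidden is None else hidden
    if step == 0 and program == "MAIN":
        return False, "SUB", None, 1    # call the sub-program
    if step == 0 and program == "SUB":
        return False, "ACT", "ping", 1  # one primitive action
    return True, None, None, step       # terminate, return to parent

env = DemoEnv()
run_npi(scripted_core, env, "MAIN")
# env.log == ["ping"]: SUB ran inside MAIN, then both terminated
```

In the real model, of course, the scripted core is an LSTM whose outputs are decoded into the termination flag, the next program, and its arguments.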

There are a few other implementation details too. The major one is encoding/decoding: how to represent the environment as a vector for the s2s machine, and how to decode the output vector from the s2s machine back into program calls. A few example encoders are given in section 4 of the paper. The general idea in the paper seems to be to mostly just concatenate sensor values together. For example, if a sensor has several discrete states, its value would be a one-hot vector. An exception is visual sensors: in that case, we don’t feed the raw sensor values, but instead feed the image through a CNN and use the encoding the CNN produces.

After all the inputs have been concatenated together, they are fed through an MLP to obtain a shorter fixed-length encoding. ‘But wait’, you ask, ‘why don’t we just feed the concatenated vector to the RNN?’ That’s a good question. The answer is partly because it is easier to train the RNN when the input is shorter. But it’s also because the core RNN is shared across all tasks! That’s right – in the paper, the RNN is trained simultaneously on bubblesort, addition, and image canonicalization. This is another crucial aspect of the model, and it allows the system to transfer knowledge learnt from other tasks into new tasks and learn them far more efficiently.
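Here is a pure-Python sketch of that encoding step: one-hot sensor readings concatenated, then passed through a small MLP to get a fixed-length state vector for the shared core. All sizes and names are invented for illustration.

```python
# A sketch of a task-specific NPI encoder: concatenate one-hot sensor
# readings, then compress them with a small MLP into a fixed-length
# vector for the shared RNN core. Sizes here are made up.
import random

def one_hot(value, num_states):
    return [1.0 if i == value else 0.0 for i in range(num_states)]

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP with a ReLU hidden layer, using plain lists."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]

def encode(sensors, params):
    """sensors: list of (value, num_states) discrete readings."""
    x = [v for value, n in sensors for v in one_hot(value, n)]
    return mlp(x, *params)

random.seed(0)
in_dim, hid, out = 10 + 2, 16, 8   # e.g. 10 positions + one binary flag
params = (
    [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(hid)],
    [0.0] * hid,
    [[random.gauss(0, 1) for _ in range(hid)] for _ in range(out)],
    [0.0] * out,
)
state = encode([(3, 10), (1, 2)], params)   # robot at position 3, flag = 1
# len(state) == 8: fixed length regardless of the raw sensor layout
```

In training, the weights of this encoder would be learned jointly with the shared core, which is what lets each task present its sensors to the core in a common format.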

So to sum up, NPI is an RNN/LSTM-based sequence-to-sequence translator, with:

  1. The ability to recurse into sub-programs, keeping track of the calling program when doing so,
  2. An RNN core that is shared across all tasks, and
  3. Task-specific encoders/decoders that allow it to specialize for different situations.

During training, the encoder/decoders are trained alongside the RNN core. This is what makes training the NPI somewhat tricky, because even though the familiar gradient descent method is used, the architecture changes during training.

Assuming enough time and/or interest, I’ll write another post on how to train a model like this in TensorFlow.

UPDATE: Apparently there’s already a Keras implementation of NPI.

The Neural Lace

In a recent interview, Elon Musk talked about AI safety. Here are his (abridged) words:

> Something I think is going to be quite important — I don’t know of a company that’s working on it seriously — is a **neural lace**.
>
> If you assume any rate of advancement in AI, we will be left behind by a lot…
>
> …I think one of the solutions… the solutions that seems maybe the best one, is to have an AI layer. If you think of the limbic system, your cortex, and then a digital layer, so a third layer above the cortex that could work well, and symbiotically, with you. I mean just as your cortex works symbiotically with your limbic system, your sort-of third digital layer could work symbiotically with the rest of you.

(Bold emphasis mine). Elon Musk is talking, of course, about brain enhancement. He talks about some of the engineering difficulties involved:

> The fundamental limitation is input/output. We’re already a cyborg… but we’re I/O bound. Particularly output bound… our input bandwidth is much better because we have a high-bandwidth visual interface.

(Look at the video to see the full quote; it’s worth it in my opinion).

I’ve actually been thinking along these lines for a long time. He’s 100% right that the main limitation is I/O bandwidth, particularly output bandwidth. It’s hard to put exact numbers on it, but it’s fair to say that the input from the eyes to the brain is on the order of a few megabits per second. Now compare this to output bandwidth. Our most common way of producing output is speech, which has a bandwidth of around a few hundred bits per second. Typing on a keyboard is another way of producing output, which for most people is somewhat slower than speech – around a few tens to a hundred bits per second. That’s a factor of 10,000 slower than vision. Is there a way to get an output bandwidth closer to megabits per second?
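The back-of-envelope arithmetic behind that factor of 10,000, using the rough order-of-magnitude figures above (these are assumptions, not measurements):

```python
# Rough order-of-magnitude figures from the discussion above
# (assumptions for a back-of-envelope comparison, not measurements):
visual_input_bps = 1_000_000   # visual input: ~a few megabits/s; take 1 Mbps
speech_output_bps = 300        # speech: a few hundred bits/s
typing_output_bps = 100        # typing: tens to ~a hundred bits/s

print(visual_input_bps // typing_output_bps)   # -> 10000
```

So even on generous assumptions, our fastest common output channels are four orders of magnitude slower than our visual input.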

You might ask: why don’t we try some form of output that uses not just the mouth and the fingers but the whole body? Doing that wouldn’t gain us much, because our fingers and speech-production mechanisms already account for a large share of the total external bandwidth available to the brain. So the only way to really get past the output bandwidth limitation is to capture signals directly from the brain itself.

Non-invasive brain signal capture techniques probably won’t suffice. The most common one is EEG, which I’ve worked with for several years and which is extremely limited. Current EEG systems fall far short of keyboard typing speed, and far, far short of speech. And EEG is not completely noninvasive either. MEG is a newer technology that’s less invasive than EEG but comparable (or worse) in terms of bandwidth. Technologies like fMRI do exist that offer reasonable bandwidth (up to 10,000 bits/second seems feasible), but they are incredibly bulky and impractical for anything but a specialized laboratory environment. And they too fall far short of the megabits-per-second target.

So the only way to achieve the target is invasive (surgically-implanted) probes. Imagine a meshwork layered very carefully on top of the cerebral cortex. Presumably, future technology would allow layering this meshwork in a minimally invasive manner.

Anyway, once you have access to the brain, then you have the issue of producing probes that can actually capture information from the brain reliably over a long period of time. The good news is that the cerebral cortex is exposed at the surface and so you could get very high bandwidth just by laying out your probes on the surface of the brain – no deep insertion required. The bad news is that brain tissue is very soft and vulnerable and you could easily wind up cutting and bruising your brain.

And even if you do circumvent those issues, you need to find a way to get your body to ‘accept’ the implants. Normally, when foreign objects are implanted into the body, the body coats them with layers of insulating material to protect its tissues. This is fine for things like pacemakers and hip implants, but totally disastrous for brain implants, because it would prevent the implants from reading the signals produced by neurons. This is an area of active research, and people are working on ways to create more bio-compatible materials. One interesting approach uses a very fine micro-scale metallic mesh into which cells taken from your own body are allowed to grow.

So producing reliable, effective, minimally-invasive neural implants is hard. Still, it sounds like a solvable problem, with people already coming up with better and better methods.

Once we have such devices, the question becomes: How do we use them? This is an interesting and very much open question. And it all boils down to what information you can ‘send’ via the probe.

Any electrical probe can be used for both output from the brain and input into the brain. Input into the brain opens up a lot of issues, both technical and ethical, so to keep this discussion short, let’s just consider a read-only probe system that passively ‘reads’ your cortex. What sort of information could you output using such a system?

The cerebral cortex deals with sensory, motor, language, planning, and higher-level thought functions. An advanced probe system could presumably read your high-level thoughts. For example, if you imagine a dog, or a game of basketball, or the concept of taking an integral, it would recognize these. Designing the data analysis system necessary for this kind of readout is far beyond current science, however, so let’s stick with things that we know how to do.

We have a fairly good ability to probe the sensorimotor regions of the brain and correlate activity in these regions with specific input stimuli or muscle actions. A probe system could read your sensorimotor areas, in particular your visual cortex, with the result that you would have a two-way visual connection with the outside world. You could see objects and process them into higher-level neural patterns via your normal visual processing route, and you could also imagine higher-level concepts, form visual pictures of them, and have the computer read those pictures. This opens up a new set of possibilities for communicating with a computer, where the primary mode of information transfer would not be textual (as it is now) but visual.