Years ago, when I had just graduated with a degree in pure math, I was eager to apply the things I’d learned in class to the real world. So dabbling in physics seemed like the natural thing to do for someone with math experience. I read books on quantum mechanics, quantum field theory, and general relativity. I didn’t fully understand a lot of the material in them at the time. Later I went into engineering and then machine learning, and I forgot a lot of the physics I’d learned. But something I didn’t forget was the somewhat unique physics jargon and techniques for modeling the world. Historically, physics had to deal with the problem of modeling the behavior of large collections of atoms and molecules, and so it blended statistical techniques with the kind of mathematical modeling that Newton and his successors pursued.

This blog is about ML (on most days) and it’s worth mentioning at this point that a lot of people don’t realize that physics and machine learning actually have an interesting history together, going back decades. One of Geoffrey Hinton’s landmark pieces of work was his paper on Boltzmann machines. Boltzmann machines are stochastic neural nets inspired directly by physical systems – they take their name from the Boltzmann distribution of statistical mechanics – and are capable, in principle, of learning arbitrary distributions. Note that feedforward neural nets like multilayer perceptrons are designed to learn any *mapping* from a set of *inputs* to a set of *outputs*. Boltzmann machines could do something different – they could learn the *distribution* over a set of inputs. That is, they can learn useful information about the input without being given an explicit mapping. In other words, they are capable of *unsupervised learning.*

While Boltzmann machines never became very popular (because they were very, very hard to train), later architectures like restricted Boltzmann machines (RBMs) were easier to train, and actually laid the groundwork for deep learning to build upon. Today’s deep learning architectures can trace their ancestry back to RBMs.

The Boltzmann machine was an example of something where physics-inspired models helped out ML. Of course, a lot of physicists do sometimes get a bit, shall we say, *over-enthusiastic*, and try to apply the insights from physics to areas where they may not be appropriate. This is more general than physicists though – a lot of fields feel the need to prove that their way of looking at things is the best way or at least a really great way.

So when I heard that Henry W. Lin and the well-known physicist Max Tegmark had recently uploaded a paper to arXiv, *Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language*, applying methods from physics to analyze common ML methods for natural language processing (NLP), I thought I’d give it a look.

The paper has a lot of ideas – which I’ll talk about later in this post – and it’s a bit difficult for me to disentangle them from ideas already known in ML. For example, let’s talk about hidden Markov models (HMMs).

An HMM is a very simple model of sequences over time. The assumptions of the model are that there is some ‘hidden state’ *s* that we can’t observe directly, and the hidden state at time *t* only depends on the state at time *t* – 1, and not on the state at time *t* – 2, or any earlier states (this is called the Markov property). The model also produces an *output* or *observation* *o* at each step, which depends only on the state at that step, and on nothing else. The observation is the only way we have of figuring out what goes on inside the model. At first you might think that systems that can be described using HMMs aren’t that common, but by cleverly choosing the right state representation, you can represent a huge variety of systems with the HMM model. For example, think of a falling ball. You might think of representing the state as the position of the ball. It’s obvious that the position of the ball doesn’t merely depend on the previous position – velocity and acceleration factor into it as well. But if you consider the state to be position *plus* velocity and acceleration, then indeed the state at each time step can be fully inferred from the state at the previous time step. And for a lot of problems, you can compress all the relevant variables into the state vector. So HMMs are actually a lot more general than they might seem at first.
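The falling-ball example can be sketched in a few lines. This is just an illustration of the augmented-state idea (the time step and drop height are made-up numbers): with position alone, the next step is underdetermined, but the pair (position, velocity) evolves by a fixed rule, since the acceleration due to gravity is constant.

```python
import numpy as np

dt = 0.1       # time step in seconds (arbitrary choice)
g = -9.81      # gravitational acceleration, m/s^2

# Augmented state = (position, velocity).
A = np.array([[1.0, dt],
              [0.0, 1.0]])                 # how the state carries forward
b = np.array([0.5 * g * dt**2, g * dt])    # contribution of constant gravity

state = np.array([100.0, 0.0])  # dropped from 100 m, initially at rest
for _ in range(5):
    # Markov property: the next state depends only on the current state.
    state = A @ state + b

print(state)  # position and velocity after 0.5 s
```

For constant acceleration this discrete update is exact: after 0.5 s the position matches the textbook formula 100 + ½g(0.5)² = 98.77375 m.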

Still, most of the time in ML, when we talk about HMMs, we just restrict ourselves to talking about the very simple formulation where there is a (small) finite number of hidden states, and the transition probabilities are given explicitly. So the ball example is excluded (its hidden state is continuous).
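To make the finite-state formulation concrete, here’s a minimal sampler for a two-state, two-observation HMM. The transition and emission probabilities are made up for illustration; the point is just the shape of the model: the state follows the Markov property, and each observation depends only on the current state.

```python
import numpy as np

rng = np.random.default_rng(0)

T = np.array([[0.9, 0.1],    # T[i, j] = P(next state = j | current state = i)
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],    # E[i, k] = P(observation = k | state = i)
              [0.1, 0.9]])

def sample(n_steps, s0=0):
    s, states, obs = s0, [], []
    for _ in range(n_steps):
        obs.append(rng.choice(2, p=E[s]))  # emission depends only on s
        states.append(s)
        s = rng.choice(2, p=T[s])          # Markov property: s_t+1 | s_t
    return states, obs

states, obs = sample(10)
print(obs)  # an observer only ever sees this sequence, never `states`
```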

The main insight in the paper is the contrast between two types of models: models where observations are generated by a Markov-like sequential process, and models where observations are generated by some kind of hierarchical, grammar-like process. The main conclusion is that natural language has statistics that aren’t well reproduced by HMMs, but are reproduced very well by hierarchical processes.
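A toy version of such a hierarchical process – a sketch loosely in the spirit of the paper’s recursive grammar, not their exact model – starts from one symbol and repeatedly expands each symbol into two children that usually copy their parent. Symbols that end up far apart in the final string may still share a recent common ancestor high up in the tree, which is how long-range correlations survive.

```python
import random

random.seed(0)

def expand(symbol, depth):
    """Recursively expand one symbol into 2**depth leaf symbols."""
    if depth == 0:
        return [symbol]
    # Each child copies its parent with probability 0.9, else flips.
    children = [symbol if random.random() < 0.9 else 1 - symbol
                for _ in range(2)]
    return [leaf for c in children for leaf in expand(c, depth - 1)]

seq = expand(0, depth=10)
print(len(seq))  # 2**10 = 1024 symbols in the generated "sentence"
```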

To say anything firmer about the paper, I’d need to read it in more depth and compare it with previous work. For example, it’s already known that HMM memory decays exponentially. However, the way it’s proven in the paper (using a new quantity they call *rational mutual information*) appears to be new.
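The exponential decay of Markov memory is easy to see numerically. Here’s a small sketch (my own example, not from the paper) that computes the mutual information between the state of a two-state Markov chain at time 0 and time *t*, directly from the transition matrix. The decay rate is governed by the chain’s second eigenvalue (here 0.8):

```python
import numpy as np

# A symmetric two-state chain; its eigenvalues are 1 and 0.8.
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])  # stationary distribution of this chain

def mutual_info(t):
    """Mutual information (in nats) between X_0 and X_t at stationarity."""
    Tt = np.linalg.matrix_power(T, t)
    joint = pi[:, None] * Tt           # P(X_0 = i, X_t = j)
    marg = joint.sum(axis=0)           # P(X_t = j), equals pi by stationarity
    return float(np.sum(joint * np.log(joint / (pi[:, None] * marg[None, :]))))

for t in (1, 5, 10, 20):
    print(t, mutual_info(t))
# For large t, each extra step shrinks MI by roughly 0.8**2 = 0.64 --
# exponential decay, in contrast to the power laws seen in natural text.
```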

The take-home message is that things like simple RNNs may not be adequate for modeling language, but more hierarchical or deep or other kinds of RNN models (or different types of models entirely) may be better suited. Which we intuitively already know.

As an aside, I also love the little bits of physics trivia sprinkled throughout the paper. Like this one:

Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to construct “shortcuts” between random variables that are separated by large amounts of time with short-range interactions; 1D models will be doomed to exponential decay. Hence the ubiquity of power laws explains the success of deep learning. In fact, this can be seen as the Bayesian net version of the important result in statistical physics that there are no phase transitions in 1D.

And this one:

There are close analogies between our deep recursive grammar and more conventional physical systems. For example, according to the emerging standard model of cosmology, there was an early period of cosmological inflation when density fluctuations [got] added on a fixed scale as space itself underwent repeated doublings, combining to produce an excellent approximation to a power-law correlation function. This inflationary process is simply a special case of our deep recursive model (generalized from 1 to 3 dimensions).