Links roundup

  • The latest super-resolution paper: http://arxiv.org/pdf/1609.04802.pdf Over the past few years super-resolution with convnets has become really good.
  • The latest in WaveNet-like research is a raw waveform-based speech recognizer from Facebook: http://arxiv.org/pdf/1609.03193v2.pdf What’s interesting is that the authors don’t cite WaveNet. Independent discovery?
  • Here’s a very nice intuitive/geometric description of various gradient descent algorithms. Definitely recommended for beginners.
  • A recent paper uses convnets to infer facial geometry from single photos. What’s more interesting to me than the facial recognition itself is (1) the fact that this is an end-to-end method, and (2) that the training data is generated synthetically: they take realistic 3D facial geometries, render them, and train on the results. With near-photo-realistic rendering of 3D geometries becoming faster every day, approaches based on training on generated visual data are becoming really popular. See, for example, this work on training self-driving cars using car simulator video games.
  • A paper on efficiently extending translation-invariant convolutions to other kinds of invariance (e.g. rotation invariance). I’m not really sure about the novelty of this work; I already used similar methods in my own thesis two years ago, and even then I thought it was too obvious to publish. However, they do offer some proofs in the article that might be interesting to examine.
  • Another interesting piece of work in the area of semi-supervised learning is this paper. In it, the authors pre-train on texture and shape cues using unlabelled data, and then use a smaller amount of labelled data to train the final classifier.

Learning machine learning

A very interesting read on one person’s experiences going from ML ‘novice’ to using it at work in just one year:

Machine Learning in a Year

By the way, I’m the PhD he talks about in his article:

I came in touch with a Ph.D student who was willing to help me out for 40 USD per hour, both with the problem sets as well as the overall understanding. This has been critical for me in order to move on, as he has uncovered a lot of black holes in my knowledge.

Lesson learned: It’s possible to get a good machine learning teacher for around 50 USD per hour. If you can afford it, it’s definitely worth it.

It’s always nice to hear from former ‘students’. Glad to hear I was of help, Per!

No Man’s Sky – The Astronomer’s Version

The much-awaited and much-hyped game No Man’s Sky came out recently. Its claim to fame is that instead of putting you in a ‘fixed’, pre-designed game universe, it drops you into a simulated galaxy whose planets are generated on-the-fly by procedural algorithms that guarantee you see something ‘novel’ on each planet. Its 64-bit random seed means there are about 18 quintillion (2^64 ≈ 1.8 × 10^19) possible planets, so it’s unlikely you’ll ever visit the ‘same’ planet twice.

The amount of hype the game received can probably only be compared to the amount of disappointment it seems to have generated amongst gamers, who say that the game is repetitive and boring.

I think the idea of a procedurally generated galaxy that you simply explore is a great one; it’s just that the execution here isn’t one that would personally motivate me to play. In this post I’m going to offer a sketch of a similar kind of game that I, and maybe a lot of other people, might find more interesting.

I could not find any explanation of the procedural algorithms behind the game with any acceptable level of authority or detail. I’m guessing that Hello Games hasn’t really shared the internals yet. But looking at the game empirically, it’s possible to figure out roughly how it works. There are algorithms for procedurally generating terrain, populating the terrain with plants/animals, and so on. I’m not sure how much detail there is in the generation of the plant and animal shapes themselves. Are the plants just picked from a large library of plants, and just have certain attributes (color, size, etc.) changed? Are they composed of a set of parts (leaves, stalks, etc.) that are permuted randomly? Or are the plants entirely generated using some procedural algorithm? I don’t know. But whatever the case, when you become more familiar with the game, you realize there isn’t that much variation between the plants, nor between the animals. Nor, indeed, between the planets themselves. They are all the same size, have similar-looking terrain, and there is little variation across the surfaces of the planets.

And this is the major problem with the procedural generation the game uses. Even though there is a lot of ‘shallow variation’ in color, size, shape, and so on, everything looks and feels roughly the same. In fact, you could probably argue that everything looks far more similar than in the comparatively small (but still large) game worlds featured in other games, even though those games don’t use procedural generation at all.

Another Way to Generate a Galaxy

First of all, I’m not a game designer, so I don’t know how much effort went into the procedural generation. My guess is “a lot.” But I wonder whether you couldn’t achieve much better results with a similar, or even smaller, amount of effort.

How could you do it? Focus more on realistically generating the galaxy itself, and less on the other aspects, like life and so on.

How about using realistic models of star system formation to generate star systems at various points of development? You could visit a star system in its early stages, consisting of nothing more than a protoplanetary disc. Or you could visit a star system in its late stages, where the sun has expanded to a red giant and burned the face off all of the planets. And everything in between.

You could go even further and generate things like neutron stars, binary/nova stars, white dwarfs, and even rogue planets, each giving rise to unique planetary systems with unique constraints. Who knows what could happen!

As for the planets themselves, you could have gas giants, ice giants, Earth-like planets, and so on, perhaps in their early bombardment stages, or their life-bearing stages, or their Venusian hell-hole stages. With widely varying sizes, orbital periods, magnetospheres, atmospheric compositions, etc. Your planets could be geologically active, in which case they would have volcanoes and mountain ranges, or not, in which case their surfaces would be mostly the result of cratering and wind erosion. You can have planets that are dry as a bone, or planets covered in kilometer-deep oceans – perhaps made of ammonia or methane and not water. You can have periodic asteroid collisions which would range from small to extinction event-level. And none of this has to be behavior that you explicitly program in. All of this could just be from a small set of rules governing planetary system development and orbital mechanics.

Now you just write code that takes those randomly-generated planetary conditions into account, and tries to simulate what the climate would look like on the planets, using realistic models of planetary physics and computational fluid dynamics simulation. For instance, you could have tidally-locked planets that are scorching hot on one side and frozen cold on the other. Or they could have oceans, and thus a milder climate, but now with extreme hurricane winds and ocean currents constantly raging across the planet’s surface. Or you could have a planet orbiting a neutron star, with the star’s intense gravity and magnetic field playing havoc on the planet’s system. Or planets with multiple large moons, with every tide being a tsunami. Or endless more possibilities.

Note that you aren’t programming in a ‘library’ of climate patterns, or even ‘procedurally generating’ climate. You are trying to deduce climate from other, simpler, randomly-generated parameters like orbit, mass, incident radiation, surface composition, and so on. This would almost certainly produce a lot of climate patterns that you simply could not have predicted beforehand – now that’s exciting.
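
As a small illustration of what ‘deducing climate from simpler parameters’ could look like, here is a minimal Python sketch (my own toy example, not anything from an actual game) that computes a planet’s blackbody equilibrium temperature from just the star’s luminosity, the orbital distance, and the surface albedo:

import math

SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def equilibrium_temperature(luminosity_watts, distance_m, albedo):
    """Blackbody equilibrium temperature of a planet, ignoring any greenhouse effect."""
    absorbed = luminosity_watts * (1.0 - albedo) / (16.0 * math.pi * SIGMA * distance_m ** 2)
    return absorbed ** 0.25

# Sanity check with Earth-like numbers: the Sun's luminosity, 1 AU, albedo 0.3
# gives roughly 255 K, the textbook value.
print(equilibrium_temperature(3.828e26, 1.496e11, 0.3))

A real game would layer rotation, atmosphere, and fluid dynamics on top of a number like this, but the point stands: the climate falls out of the generated orbital parameters rather than being hand-authored.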

What about life? Life is a bit more complicated and would take a lot of time to program realistically. You could either plunge in and attempt to do it, or make your game interesting enough that pre-programmed life need not exist. Why not just generate the setting, and let the players populate it with their own life?

This could be one of the objectives of the game: Introduce life, and try to either design new lifeforms that match the planet’s conditions, or ‘terraform’ the planets to match the life. You get points if your lifeforms successfully colonize the planet. Alternatively, players could focus on mining planets and setting up industrial centers and civilizations. Clever players would use the randomly-generated geological features that were unforeseen by the developers, and use them to their advantage, again in ways unforeseen by the developers. A true open world where nearly anything is possible.

A large portion of this civilization-building would be just community-driven development, with the only thing you need to code being the constraints and the physics of the worlds.

Is it possible?

Now you might object that doing something like this wouldn’t be feasible, because:

  1. It’s too hard to write the code, and a lot of the factors involved in exoplanet formation and climate are not known.
  2. It would be too straining on the computational abilities of game consoles and home computers.

I would argue that (1) is not a valid criticism. First of all, total realism and exacting physics are not necessary; you just need enough fidelity to make a convincing game environment. We already know how to do fast, efficient fluid dynamics simulation for games; you just have to do something equivalent for planetary simulation. Note that you don’t have to go into much detail; there are a lot of CFD methods that abstract over the fine-scale details of the simulation and are capable of rapidly simulating large-scale features (such as wind speed/humidity/etc. over large geographical areas, or hurricane wind patterns). There is no shortage of people skilled in both computer graphics and CFD who would be happy to implement such a system if they got paid to do it. Also, there is already a lot of research on simulating planetary climates, so you’re not starting from scratch.

(2) might be a more valid criticism, but based on my experience with physics simulation, I doubt that something like this would be much of an issue. Indeed, people came up with realistic-looking exoplanet models in the 1980s, on hardware less powerful than a smartphone or a Raspberry Pi (a gigaFLOP/s or so). One possibility would be to network together the computers that are playing the game, and have players occupying the same planet work together in simulating it.

Notes on the TensorFlow Implementation of Inception v3

The official TensorFlow repository has a working implementation of the Inception v3 architecture. Inception v3 is the 2015 iteration of Google’s Inception architecture for image recognition. If you are familiar with deep learning then you most definitely know all about it. If you aren’t, but keep up with tech news, then you probably best know it as ‘that learning algorithm that trained itself to recognize pictures of cats.’ And if you still have no idea what I’m talking about (or you think I’m talking about that Leonardo DiCaprio movie), then congrats on climbing out of your cave, and welcome to the world of machine learning!

Inception is a really great architecture, the result of multiple cycles of trial and error. I frequently find that it gives the best image recognition performance among the models I try.

The implementation of it was written by the same people who wrote TensorFlow, and so it seems to be well-written and makes use of a lot of TensorFlow tricks and techniques. I thought I’d study the code to see how they do things, and learn how to utilize TensorFlow better. In this blog post I’m sharing some of my notes.

First of all, the Inception code uses TF-Slim, a kind of abstraction library over TensorFlow that makes writing convolutional nets easier and more compact. As far as I can tell, TF-Slim hasn’t been used for any major projects aside from Inception, but it’s well suited to it, because the Inception architecture is ‘deep’ and has many layers. The Readme file on that page is worth a look.

Let’s dive into the code. slim/inception_model.py contains the code for the actual Inception model. The model makes heavy use of the various scoping mechanisms available in TensorFlow, and wraps the entire model into a new TensorFlow op. First, everything is wrapped in an op_scope named 'inception_v3'. Various arg_scopes are then used to set the default arguments for ops inside the model. TF’s arg_scopes are a simple way of setting default arguments for many ops at once, without having to repeat them each time an op is called. There are several nested arg_scopes, ostensibly one per module in the model.
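
To illustrate the arg_scope idea, here is a minimal sketch using the tf.contrib.slim flavour of the API (this is not the actual Inception code, and the layer sizes are made up):

import tensorflow as tf

slim = tf.contrib.slim

def tiny_convnet(images):
    # Every slim.conv2d created inside this scope inherits these defaults,
    # so they don't have to be repeated on every call.
    with slim.arg_scope([slim.conv2d],
                        padding='SAME',
                        activation_fn=tf.nn.relu,
                        weights_regularizer=slim.l2_regularizer(0.0004)):
        net = slim.conv2d(images, 32, [3, 3], scope='conv0')
        net = slim.conv2d(net, 64, [3, 3], scope='conv1')
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
    return net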

An interesting aspect of the model is that it’s not constructed of repeated modules. That is, they didn’t define an ‘inception cell’ and then repeatedly apply it to downscale the input. This is what I would have done (modularity is good: it prevents bugs from creeping in and makes the code easier to modify), and I’m curious to know whether there’s some fundamental reason that doing so wouldn’t have been worth it, or whether they just decided to keep it conceptually simpler this way.

The dictionary end_points in the model holds all intermediate tensors. So, for instance, end_points['conv0'] contains the output of the first convolution, which is then fed into another convolution, the result of which is saved as end_points['conv1'], and so on.

The rest of the model definition seems straightforward. Convolution, max-pooling, and dropout layers are repeatedly applied to the tensors; the result is the logits variable, which gives the predictions (a vector of length 1000 for each image in the batch), and the end_points dictionary, which stores all intermediate results. The other function in the file, inception_v3_parameters, simply creates another arg_scope that holds the default parameters for the Inception op itself.

The next important file in the model is inception_train.py, which contains the actual code for, well, training the Inception model. Proper training in TF requires doing the following things:

  1. Specifying a learning rate schedule
  2. Creating some system for generating training batches and testing batches
  3. Setting up an optimizer and running it
  4. Setting up some system to periodically display results and save model state

In the file, there is a ‘warning’ that the learning rate schedule is

…heavily dependent on the hardware architecture, batch size and any changes to the model architecture specification. Selecting a finely tuned learning rate schedule is an empirical process that requires some experimentation. Please see README.md more guidance and discussion.

The train function in that file is of particular note. It first selects the CPU as the computational backend for various book-keeping duties. The next step is defining a global_step variable, which simply represents the training iteration number. This is fairly standard procedure for TensorFlow models, as such a variable helps the computation graph keep track of when to update the learning rate and when to output saved model results.
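
The pattern looks roughly like this (a minimal sketch with made-up hyperparameters and a stand-in loss, not the repository’s exact code):

import tensorflow as tf

# Stand-in loss so the sketch is self-contained.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

global_step = tf.Variable(0, trainable=False, name='global_step')

# Decay the learning rate as global_step advances.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,       # initial rate (hypothetical)
    global_step=global_step,
    decay_steps=10000,       # decay interval (hypothetical)
    decay_rate=0.16,
    staircase=True)

opt = tf.train.GradientDescentOptimizer(learning_rate)
# Passing global_step makes the optimizer increment it on every update,
# which in turn advances the decay schedule.
train_op = opt.minimize(loss, global_step=global_step)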

The training here is done in an interesting way. It uses a special mechanism for distributed training: it splits the computation across GPUs explicitly, by executing a separate graph on each one, which it calls a ‘tower’. It doesn’t seem to use the Distributed TensorFlow mechanism, which seems simpler, and I’m not exactly sure why (maybe because the code was written before Distributed TF became available). Anyway, there are two ‘private’ functions: _tower_loss, which computes the total loss for each tower, and _average_gradient, which averages the per-tower gradients to return a global gradient. During each training cycle, the gradient is calculated for each tower, and then the gradients are averaged and applied to the weights/biases. Note that the averaging step represents a synchronization point between the towers. In one paper, they compare synchronous with asynchronous training (doing updates separately), and the synchronous training seems to converge faster.

The gradients are computed, for each tower, using the opt.compute_gradients function, averaged using _average_gradient, and then applied using the opt.apply_gradients function, where opt is the optimizer. This is another neat thing about TF: for simpler training procedures, you can take an entirely hands-off approach and let TF automatically compute and apply the gradients without you ever thinking about them. You can also take a more micromanaged approach, explicitly calculating gradients, doing some processing, and then applying them. Or you can take a bare-bones approach, not relying on any pre-written optimizers, and applying gradients however you want.
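
Here is a hedged sketch of that ‘micromanaged’ pattern, reduced to a single shared variable and two pretend towers (illustrative only, not the repository’s _average_gradient code):

import tensorflow as tf

def average_gradients(tower_grads):
    """tower_grads: one list of (gradient, variable) pairs per tower."""
    averaged = []
    for grads_and_var in zip(*tower_grads):
        grads = [g for g, _ in grads_and_var]
        var = grads_and_var[0][1]   # the variable is shared across towers
        averaged.append((tf.add_n(grads) / len(grads), var))
    return averaged

opt = tf.train.GradientDescentOptimizer(0.01)

# Two pretend towers computing losses on (what would normally be) different GPUs.
w = tf.Variable(1.0)
tower_losses = [tf.square(w - 2.0), tf.square(w - 4.0)]
tower_grads = [opt.compute_gradients(loss, var_list=[w]) for loss in tower_losses]

# Averaging is the synchronization point; the averaged gradient is then applied once.
apply_gradient_op = opt.apply_gradients(average_gradients(tower_grads))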

These operations, plus operations for saving/loading graphs, are then all bundled up into top-level ops, and then combined into a final op:

train_op = tf.group(apply_gradient_op, variables_averages_op,
                    batchnorm_updates_op)

This grouped op is then run at each training step.

Now let’s look at how testing and training data are handled. Data are wrapped in a class called Dataset, which is defined as an abstract base class in dataset.py. The Dataset class itself seems to mostly be a set of routines for simplifying access to TFRecords files. One thing I’m not sure about is why there seems to be a preference for TFRecords over more standard data formats like HDF5. At least TF does have support for reading HDF5.

It seems like the code only runs the loss on a single data set at a time (training, testing, or validation). You can select which one by feeding a command-line argument:

tf.app.flags.DEFINE_string('subset', 'train',
                           """Either 'train' or 'validation'.""")

You can also select whether to train or not from the command line. So this allows you to run multiple processes in parallel for the testing, training, and validation sets. This seems to be the preferred way to do testing/training in TF, rather than performing training and testing simultaneously in the script.

Summary

The code contains a lot of the now-standard coding patterns for TF:

  • Encapsulating everything in various modules defined by scopes, including op_scopes, name_scopes, arg_scopes, and var_scopes.
  • Specifying a learning rate schedule, done using global_step
  • Creating some system for generating training batches and testing batches – done external to the model, with actual evaluation in a separate process.
  • Setting up an optimizer and running it
  • Setting up some system to periodically display results and save model state, using the tf.train.Saver mechanism, which automates a lot of this.

Why TensorFlow is Free – And What You Can Do About It

Last year Google released TensorFlow, its in-house machine learning framework, for free. TensorFlow is a very powerful framework, and it’s already rapidly surpassing all other existing community-developed frameworks for ML. At this point it would be a rational projection that in a couple of years non-TensorFlow ML will be relegated to a niche role.

I’ve used TensorFlow and I like it a lot. It’s really a step up compared to what we had before.

A skeptical mind would ask why Google, a multi-billion-dollar publicly traded company, would release one of its most important assets for anyone to use. Could it be user lock-in, as Microsoft, for example, frequently tries to achieve with its products? Well, no, because TensorFlow is open source, and Google makes no money (at least not directly) off of people using it. If you’re familiar with ML, though, you probably know why Google released it for free: they know you can’t compete with Google by using it.

Why? Because the current state of ML research means that models and code really aren’t that important. Any programmer with even limited experience in ML could read ML papers and code up their own ML framework. Indeed, that’s what I did in my spare time just a couple of years ago. It really wasn’t that hard.

What is really important, instead, is compute and data. That is, having access to a large set of data to train your algorithms on, and having the sheer computing power necessary to run your models.

Compute and data are the lifeblood of machine learning, and modern ML has a voracious appetite for both. Long gone are the old days of ML, when advances were made by postgrads coming up with optimal, hand-tuned models and running code on their own CPUs, and everything was neat and tidy and small and cute. Nowadays ML is driven by enormous clusters of computers and by incredibly expensive, comprehensive collections of data, closely guarded by the tech companies that hold them, perhaps more closely guarded and deemed more valuable than any other assets those companies have.

If anything, the more ML researchers use TF, the better it is for Google, because it streamlines and lubricates the process of feeding back researcher-developed ideas into Google’s own production.

In the ancient world, gold and silver and grain and salt were the major currencies; they were the commodities that drove the world economy. In the 20th century, it was oil. Today, it is data. And much like the oil barons of old, today’s tech companies found themselves, partly by design and partly by accident, in possession of a commodity that suddenly and dramatically increased in importance, making them rich beyond their wildest dreams. And just like the oil barons, they are scrambling to consolidate their holdings and their capital and to force smaller players out of the market. Why else was Instagram, a web service for ‘retouching’ photos, bought for $1 billion?

So, in this kind of ecosystem, is there any place for the intrepid newcomer to come in and actually make a meaningful impact? Can someone with no access to capital (which now means data) and no access to labor (which now means computing power) actually compete?

There’s been a lot of negativity recently in the ML community that without access to these resources, it’s hopeless. For example, Neil Lawrence (a professor of Machine Learning at Sheffield) has said that we need to ‘democratize’ AI by coming up with ML methods that do not require the same amount of resources. That approach may or may not work, and there are some fundamental reasons it may not. But regardless, the bigger point he wants to make is that the small guys can’t compete with the big guys anymore. And to some extent this is true – if you try to do exactly what the big guys are doing. The obvious subtext here is to find things that the big companies are not doing.

Innovate

Today’s large tech companies take the crude data that gushes in, collect it, refine it into ML software and trained models, and sell it in the form of distilled products like advertising, web services, analytics, and so on. But the problem with these large companies is that they are slow to move, and innovation (risk-taking) is hard for them. This is the prime reason Google split its R&D sections off and formed Alphabet, but whether this can actually solve the problem remains to be seen. Even when innovation does happen (e.g. Google Glass, or the self-driving car), bringing it to market is slow, because it is hard for these companies to let go of their perceived stability and monopoly and commit their resources to risky projects. A beautiful historical parallel is Xerox and its Alto graphical computer, the inspiration for the Apple Macintosh (and probably all other major graphical user interfaces). Xerox itself struggled to make money off of the Alto, but a couple of now well-known young hackers took the idea and made a fortune.

Comma.ai is a new startup company that is aiming to sell self-driving car kits for $1000 that anyone can add to their own car. The CEO is a bit eccentric, and the company may not work out in the end, but I am 100% sure that some similar idea is going to make it to market soon.

Get Around the Problem

DeepMind itself, now owned by Google, is actually a great example of a newcomer company making a huge splash. In their case, they trained a reinforcement learner to play Atari games in a now-famous demo. They got around the problem of having access to data by using a simulated game environment where they could generate as much data as they wanted! With the problem of data solved, they just had to solve the compute problem, and that they did by using the resources of the University of Toronto. Another example is Cylance: a company that tries to detect malware using deep learning techniques; data is easy to come by here.

Other companies use things like web scraping and so on to obtain large troves of data from the internet. The downside to this is that processing raw data from the internet is usually more compute-intensive than curating your own data, so you might wind up trading off data for compute.

So in summary, the best advice to newcomers in the field would be to stop copying other people and look for new, original paths to take. It might be somewhat obvious advice, but there it is.

Differentiable Programming

Introduction

If you haven’t already, read this very thought-provoking article on colah’s blog about the connection between functional programming and neural networks. I would bet that we’re going to see ideas like this cropping up more and more. I intend to write about similar ideas too. The basic idea is that you can express many common kinds of neural networks in a very simple way as functional programs. In this blog post I’m going to talk about ways to actually turn that idea into reality. The article talks about various kinds of (now classical) neural networks and how they fit in to that picture. To set the tone of this article, let’s first refresh our memory on some kinds of very common neural nets:

  • Multi-Layer Perceptrons (MLPs). Possibly the simplest useful neural net model. MLPs are just sequential stacks of densely connected layers.
  • Convolutional Neural Nets (CNNs). Used widely for image recognition. CNNs are, like MLPs, sequential stacks of layers, but in this case they are convolutional layers, and they often also include max-pooling layers as well.
  • Recurrent Neural Nets (RNNs). Used for time series, audio, and text processing (anything with variable-length sequences of data). They work by feeding the output of the net back into its input. It is possible to simply feed an MLP output back into the input to create a simple RNN, but in practice the most popular variant uses LSTM cells.

These types of neural nets are very easy to lay out as a kind of graph structure to see the relationships between layers. And, in the old days, they used to have a fairly small number of layers. The method used to fit or train the nets was (and continues to be) gradient descent. The motivation behind choosing gradient descent was that we knew that as long as the models were simple, gradient descent was an effective learning strategy. Also, gradient descent makes use of the gradient of the function, which has the advantage that it provides faster convergence than algorithms that don’t use the gradient, but it has the disadvantage that you have to be able to calculate the gradient. With MLPs, though, there is a simple algorithm (backpropagation) that lets you calculate the gradient efficiently.
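
As a refresher on what backpropagation does, here is a self-contained numpy sketch of a two-layer MLP trained by gradient descent, with the gradients computed by hand via the chain rule (purely illustrative; any modern framework derives these for you automatically):

import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: learn y = sin(x) on [-pi, pi].
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
Y = np.sin(X)

# Two-layer MLP: x -> tanh(x W1 + b1) -> h W2 + b2
W1 = rng.normal(0, 0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(2000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - Y
    loss = np.mean(err ** 2)

    # Backward pass (backpropagation = repeated application of the chain rule)
    d_pred = 2 * err / len(X)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final mean squared error:", loss)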

About ten years ago, people started experimenting with very deep (many layers) graph structures. This includes nets like VGG-derived nets. With Google’s inception model and later models, we’re starting to see the emergence of networks with very complicated, non-sequential structures. Some of the models created in the past few years have such complicated network structures now that they make even the inception model seem simple and straightforward. The major realization/insight that has emerged from all of this work is that gradient descent still works, even when you scale up to thousands of layers. Of course, you have to be somewhat clever about how you do gradient descent, and you need to design your model in a way that avoids certain pitfalls, but if you follow some simple rules then you’re golden. This has sparked a lot of speculation and research into just how far you can push gradient descent and what the limits are. Especially since we now have automatic systems that can compute gradients from models without the researcher having to painstakingly compute the gradients by hand and hard-code them into the model.

In recent years another new class of models has emerged: Neural Turing machines (NTMs). These models attempt to mimic the fundamental functions of computers – central processor units, memory units, buses, etc. – entirely using neural functions. The important and significant thing about this is that you get general-purpose computation systems that are entirely differentiable. That is, you can input programs and they execute them, and you can take the derivative of the machine’s parameters with respect to your outputs. This means you can optimize the entire machine using gradient descent to obtain some desired output.

Some of these NTM-like architectures include more recent models like memory networks, memory networks with a soft attention mechanism, pointer networks, stack-augmented RNN, stack/queue-augmented LSTM nets, NTMs trained with reinforcement learning, and NPIs, as I discussed in my NPI post. People like Ilya Sutskever and Jurgen Schmidhuber have talked about how these sorts of methods are the future.

So now that we have neural architectures that act like general-purpose computers, we have almost come full circle. The blog article I linked at the beginning of this post offers a tentative way of expressing neural nets in a functional language, and NTM-derived models show certain classes of program-like behaviours being carried out by custom neural models. What seems to be needed to close the loop is a language in which you could describe arbitrary program-like behaviors functionally, and which would then be automatically translated or compiled into a neural system (perhaps an NTM or NPI, as appropriate), allowing some of the parameters or functions of a program to be learned entirely from data. This sort of programming language, were someone to implement it, could be called a Differentiable Programming Language (DPL). The special case where all primitives (except for very simple ones like addition) are neural, that is, have the following form:

y = f(x) = σ(Wx + b)

could be called Neural Programming Languages (NPLs).

One could argue that TensorFlow is already a low-level DPL. It has many of the language constructs that we expect in traditional programming but don’t usually expect from neural nets, such as while loops, conditional statements, and data structures like queues. And all of this is completely differentiable! Yet I would argue that TensorFlow is not the ideal DPL we desire, because implementing NTMs, NPIs, and other models in TensorFlow doesn’t seem ‘natural’. Also, TensorFlow lacks an innate memory mechanism, which all of those architectures require. In addition, there is no type system in TensorFlow.
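
To make ‘differentiable control flow’ concrete, here is a small graph-mode sketch showing gradients flowing through tf.cond and tf.while_loop (the values are made up):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[])

# Branch on a runtime value; the gradient follows whichever branch is taken.
y = tf.cond(x > 1.0, lambda: x * x, lambda: 3.0 * x)
dy_dx = tf.gradients(y, x)[0]

# A differentiable loop: square x five times, i.e. z = x**32.
def body(i, acc):
    return i + 1, acc * acc

_, z = tf.while_loop(lambda i, acc: i < 5, body, [0, x])
dz_dx = tf.gradients(z, x)[0]

with tf.Session() as sess:
    # At x = 0.5 the cond takes the 3*x branch, so dy/dx evaluates to 3.
    print(sess.run([dy_dx, dz_dx], feed_dict={x: 0.5}))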

Instead, we would like to use a more high-level approach to designing a DPL, and let the ‘compiler’ take care of the details for us.

High-Level Differentiable Programming Languages

Let’s give an example to make the idea more concrete. Let’s say our program is, in pseudocode:

function h(x)
  return f(g(x))
endfunction

Then this would be compiled to:

h(x) = σ(W2 σ(W1x + b1) + b2)

That is, a standard two-layer MLP. Simple enough. What about a more complicated example? Consider a conditional:

function f(x, a)
  if x > 1.0
    return a + 1
  else 
    return a
  endif
endfunction

It is possible to translate this to the following neural structure:

+(x, y) = σ(Wx + Wy + b)

f(x, a) = if(x, 1.0, +(a, 1.0), a)

where we have used recursive neural nets (aka TreeNets) and a differentiable if function.
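
One minimal way to realize such a differentiable if is to replace the hard branch with a sigmoid gate, so the condition blends the two branches smoothly. Here is a numpy sketch (the sharpness k is a made-up smoothing parameter, and this is only one of several possible constructions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_if(condition_value, threshold, then_value, else_value, k=10.0):
    """Smoothly interpolate between the two branches based on the condition."""
    gate = sigmoid(k * (condition_value - threshold))
    return gate * then_value + (1.0 - gate) * else_value

# f(x, a): approximately a + 1 when x > 1.0, and approximately a otherwise.
def f(x, a):
    return soft_if(x, 1.0, a + 1.0, a)

print(f(2.0, 5.0))   # close to 6
print(f(0.0, 5.0))   # close to 5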

Ok, now let’s make things a bit more formal and actually define a usable language. We have to make the distinction between functions defined in a neural way (that is, those functions we want to find or optimize) and functions given as combinations of other functions. This we will do in the next section.

Lambda Calculus

It’s good to keep the language simple at first, so that we only need to implement the most basic functions required to get us up and going. Lisp is a language that is highly minimalistic in every way: its interpreter is simple, its evaluation semantics are simple, and its ‘core’ set of functions is simple. You can describe the entire implementation of a Lisp interpreter/evaluator in just half a page of code (page 13 of this book). Other functional languages like Haskell are also very minimalistic and simple. Here, I’m going to use a subset of Lisp that doesn’t include quoted expressions. This greatly reduces the power of the language, and it’s probably not a true Lisp, but for our purposes it suffices. We’ll also include a rudimentary type system.

Our language constructs are:

  • A set of primitive functions, which carry out the basic neural σ(Wx + b) building block, with W and b learned from the data. Each primitive function has a simple type: f:T->S. Functions have to be applied to correct types.
  • A set of composite functions, which we specify by composing together primitive functions. For example, we could define a two-layer MLP as mlp(x) = f(g(x)), assuming f and g are primitive functions.
  • And a set of higher-level functions, such as map and fold, which take other functions as input, and perform operations like mapping primitive functions across the data, and so on.

Together, a collection of these functions defines our programs. A simple proof-of-concept compiler could translate a program into a TreeNet, but more sophisticated compilers that figure out and exploit structure would be more efficient. I suspect that someone, somewhere, is probably working on something like this right now! In fact, people have already started to carry out research on differentiable languages based not on lambda calculus per se but on stack machines, for instance a Differentiable Forth interpreter.
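
To make the three kinds of constructs a little more concrete, here is a toy numpy sketch: primitive functions carry their own (trainable) W and b, composites are just function composition, and map is an ordinary higher-order function. All names and shapes here are hypothetical, and no compiler or training loop is implemented:

import numpy as np

rng = np.random.default_rng(0)

def primitive(in_dim, out_dim):
    """A primitive function of type R^in_dim -> R^out_dim: sigma(Wx + b)."""
    W = rng.normal(0, 0.1, size=(out_dim, in_dim))
    b = np.zeros(out_dim)
    def fn(x):
        return np.tanh(W @ x + b)
    fn.params = (W, b)   # the parameters a compiler/optimizer would learn
    return fn

def compose(f, g):
    """A composite function: compose(f, g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def fmap(f, xs):
    """A higher-order function: apply f to every element of a sequence."""
    return [f(x) for x in xs]

g = primitive(4, 8)
f = primitive(8, 2)
mlp = compose(f, g)   # the two-layer MLP example from earlier
print([out.shape for out in fmap(mlp, [rng.normal(size=4) for _ in range(3)])])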

How is memory handled in a functional setting? We maintain persistent memory by threading some data structure through our functions. In the most pure functional languages, this is expressed in a simple way via monads. I’m going to caution here that I haven’t worked out all the details (maybe I’ll flesh it out in later posts) but the experience with pure functional languages shows that a very simple paradigm of merely composing functions can encapsulate very complex ideas, and describe many otherwise complicated programs in a simple way. That is the goal of producing a DPL.

Statistical Models

There’s also been a parallel development in the field of statistical modeling, called probabilistic programming. The story of statistical modeling has been one of continuous unification among disparate ideas. Classical statistical models, like mixture models and regression, first proposed in the 19th century, became unified as Bayesian networks (BNs) in the 1980s with the work of Judea Pearl, and these models were then unified with Markov Random Fields (MRFs) in the general theory of Probabilistic Graphical Models (PGMs), which started in the 1990s and was further expanded during the 2000s. Each of these represents a more and more general class of models. Bayesian nets are those models where the dependencies between variables can be expressed as a directed acyclic graph (DAG); in this respect the analogous neural models would be non-recurrent neural networks. MRFs are models where the dependencies also take on a fixed graph structure, but the graph is not required to be a DAG. The latest iteration of unification has been probabilistic programming, where even the graph structure itself does not need to be fixed and can vary depending on the data. Probabilistic programs are written pretty much as ordinary computer programs, and can be ‘run’ generatively, taking a set of input parameters to some output observations. The important distinction from classical programs, though, is that probabilistic programs can also be run in reverse, in a way, taking output observations to (distributions over) input parameters. So they can be used both for generative purposes and for inference and model fitting.
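
Here is a tiny sketch of that forward/reverse duality, using nothing but plain Python and (very inefficient) rejection sampling: the same generative model is run forward to produce observations, and ‘in reverse’ by conditioning on an observation to recover a distribution over the latent parameter.

import random

def model():
    """Forward (generative) direction: draw a coin weight, then 10 flips."""
    weight = random.random()                                # latent parameter
    heads = sum(random.random() < weight for _ in range(10))
    return weight, heads

# Reverse (inference) direction: condition on having observed 8 heads and
# look at the resulting distribution over the latent weight.
samples = [model() for _ in range(200000)]
posterior = [w for w, h in samples if h == 8]
print("posterior mean of the weight given 8/10 heads:",
      sum(posterior) / len(posterior))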

Some of the better-developed probabilistic programming languages (PPLs) include STAN, JAGS, and Venture. Church is a very simple, minimalistic PPL with Lisp-like syntax. This page has a substantial list of PPLs. One characteristic that most PPLs share is that they separate models from optimization algorithms. This separation is a key development, and one that will also be needed for a practical DPL or NPL.

Keep Me in the Hyperloop

There has been some criticism of the hyperloop concept, mainly revolving around the requirement of maintaining vacuum (or near-vacuum, specifically 1 mbar) over long distances. The criticism is on-point. It is in fact very hard to maintain low pressures over long distances. That is the reason that vacuum trains have not yet been implemented in any kind of large scale. The problems include:
  1. Thermal expansion of the tube. This means that expansion seals will need to be included at regular intervals along the tube, and the expansion seals will need to be airtight. An alternative is to bury the entire tube underground, however this is probably too expensive to be worth doing.
  2. Vibration. The pods traveling in the tube will cause vibration and the structure needs to be able to absorb this without risk of failing.
  3. Earthquakes. Self-explanatory.
  4. Thermal buckling of the tube. Temperature differences across the top and bottom of the tube will result in differential length change. The structure of the pods, as well as the expansion seals, will need to be able to deal with this.
  5. Malicious acts. The tube would be a fairly easy target for terrorists or troublemakers as even a rifle bullet would be more than enough to pierce the tube and compromise the vacuum.
  6. Any loss of vacuum has the potential to be disastrous, as a shock front of air would travel down the tube and impact any pods in its way in a fairly violent fashion.

It seems that any practical design would have to assume that vacuum will be lost occasionally, perhaps even frequently, and would have to deal with this in a graceful way and without loss of life.

One option that has been pointed out by others is to use the pods themselves to provide the vacuum. Is this feasible? If so, it would solve several problems at once: the problem of creating vacuum, the problem of maintaining vacuum, and the problem of vacuum loss.

The proposed idea is to put compressed air tanks inside each pod. As each pod moves through the tube, more and more of the air in the tube is collected. At the same time, the pods are designed to operate at any pressure between 1 and 1000 mbar. At 1000 mbar they travel slowly (well under the speed of sound), to avoid the large aerodynamic forces of moving at high speed. As the air pressure in front of the pods decreases, they move faster, and at 1 mbar they reach their maximum speed.

This covers the process of starting with an air-filled tube and proceeding to an evacuated one, but what about loss of vacuum? In this case, the main priority is to make the loss as gradual as possible, to protect against rapid deceleration of the pod, which could cause structural failure or injury to occupants. We can assume that at the point of vacuum compromise we have no control over the rate of pressure loss, and we also have to assume that rate would be large (a 1 atm shock front moving down the tube at sonic velocity). But we can control the rate of vacuum loss in the non-compromised sections. Under vacuum, the pods are moving rapidly, so by creating a series of very small pressure increases along the length of the tube, the pod can be made to experience a gradual pressure increase. Each section of the tube would have a pressure sensor, and if a section detected loss of vacuum, neighboring sections would open valves to the outside and let air in as well, in a controlled manner.

The number of pods needed to evacuate the tube depends on a lot of factors. I get a rough ballpark estimate of ~2500 trips, or 20-30 days, to evacuate the tube using this method.
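
For what it’s worth, here is a back-of-envelope sketch of the kind of estimate involved. Every input here (route length, tube diameter, how much air a pod can tank per trip) is my own assumed number, not a figure from the Hyperloop proposal; the point is only the shape of the calculation.

import math

tube_length_m = 560e3        # assumed LA-to-SF-scale route
tube_diameter_m = 2.2        # assumed inner diameter
air_density = 1.2            # kg/m^3 at ~1000 mbar and room temperature
air_per_trip_kg = 1000.0     # assumed mass of air one pod can carry away per trip

tube_volume = math.pi * (tube_diameter_m / 2) ** 2 * tube_length_m
initial_air_mass = air_density * tube_volume
trips = initial_air_mass / air_per_trip_kg

print(f"tube volume   ~ {tube_volume:.2e} m^3")
print(f"air to remove ~ {initial_air_mass:.2e} kg")
print(f"=> on the order of {trips:.0f} pod trips to pump the tube down")

With these particular assumptions the answer comes out in the low thousands of trips, the same ballpark as the figure above, but it is obviously very sensitive to the assumed per-trip tank capacity.
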
In short, more research is needed to see if this idea has merit and could work. If anyone has any links on studies relating to this concept, I’d love to hear about it in the comments.

What can physics tell us about understanding natural language?

Years ago, when I had just graduated with a degree in pure math, I was eager to apply the things I’d learned in class to the real world. So dabbling in physics seemed like the natural thing to do for someone with a math background. I read books on quantum mechanics, quantum field theory, and general relativity. I didn’t fully understand a lot of the material in them at the time. Later I went into engineering and then machine learning, and I forgot a lot of the physics I’d learned. But something I didn’t forget was physics’ somewhat unique jargon and its techniques for modeling the world. Historically, physics had to deal with the problem of modeling the behavior of large collections of atoms and molecules, and so it blended statistical techniques with the kind of mathematical modeling that Newton and his successors pursued.

This blog is about ML (on most days) and it’s worth mentioning at this point that a lot of people don’t realize that physics and machine learning actually have an interesting history together, going back decades. One of Geoffrey Hinton’s landmark pieces of work was his paper on Boltzmann machines. Boltzmann machines are neural nets inspired entirely by physical systems, capable of learning any kind of distribution. Note that feedforward neural nets like multilayer perceptrons are designed to learn any mapping from a set of inputs to a set of outputs. Boltzmann machines could do something different – they could learn the distribution over a set of inputs. That is, they can learn useful information about the input without being given an explicit mapping. In other words, they are capable of unsupervised learning.

While Boltzmann machines never became very popular (because they were very, very hard to train), later architectures like restricted Boltzmann machines (RBMs) were easier to train, and actually laid the groundwork for deep learning to build upon. Today’s deep learning architectures can trace their ancestry back to RBMs.

The Boltzmann machine is an example of a physics-inspired model helping out ML. Of course, a lot of physicists do sometimes get a bit, shall we say, over-enthusiastic, and try to apply insights from physics to areas where they may not be appropriate. This isn’t unique to physicists, though – a lot of fields feel the need to prove that their way of looking at things is the best way, or at least a really great way.

So when the news came that Henry W. Lin and the well-known physicist Max Tegmark had recently uploaded a paper to arXiv, Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language, on applying methods from physics to analyze common ML methods for natural language processing (NLP), I thought I’d give it a look.

The paper has a lot of ideas, which I’ll talk about later in this post, and it’s a bit difficult for me to disentangle them from already-known ideas in ML. For example, let’s talk about hidden Markov models (HMMs).

An HMM is a very simple model of sequences over time. The assumptions of the model are that there is some ‘hidden state’ s that we can’t observe directly, and that the hidden state at time t depends only on the state at time t – 1, not on the state at time t – 2 or any earlier states (this is called the Markov property). The model also produces an output or observation o at each step, which depends only on the state at that step and on nothing else. The observation is the only way we have of figuring out what goes on inside the model. At first you might think that systems that can be described using HMMs aren’t that common, but by cleverly choosing the right state representation you can represent a huge variety of systems with the HMM model. For example, think of a falling ball. You might think of representing the state as the position of the ball. Obviously the position of the ball doesn’t depend merely on the previous position – velocity and acceleration factor into it as well. But if you take the state to be position plus velocity and acceleration, then indeed the state at each time step can be fully inferred from the state at the previous step. And for a lot of problems, you can compress all relevant variables into the state vector like this. So HMMs are actually a lot more general than they might seem at first.

Still, most of the time in ML, when we talk about HMMs, we just restrict ourselves to talking about the very simple formulation where there is a (small) finite number of hidden states, and the transition probabilities are given explicitly. So the ball example is excluded (its hidden state is continuous).
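
For concreteness, here is a minimal sketch of that simple finite-state formulation: explicit transition and emission tables, sampled forward in time (all the probabilities below are made up for illustration).

import numpy as np

rng = np.random.default_rng(0)

states = ['rainy', 'sunny']
observations = ['walk', 'shop', 'clean']

transition = np.array([[0.7, 0.3],      # P(next state | current = rainy)
                       [0.4, 0.6]])     # P(next state | current = sunny)
emission = np.array([[0.1, 0.4, 0.5],   # P(observation | rainy)
                     [0.6, 0.3, 0.1]])  # P(observation | sunny)

def sample(T=10):
    s = rng.integers(len(states))   # initial hidden state
    seq = []
    for _ in range(T):
        # The observation depends only on the current hidden state...
        o = rng.choice(len(observations), p=emission[s])
        seq.append((states[s], observations[o]))
        # ...and the next state depends only on the current state (Markov property).
        s = rng.choice(len(states), p=transition[s])
    return seq

print(sample())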

The main insight in the paper is the contrast between two types of models: models where observations are generated by a Markov-like sequential process, and models where observations are generated by some kind of hierarchical grammar-like process. The main conclusion is that natural language has statistics that aren’t well-reproduced by HMMs, but very well-reproduced by hierarchical processes.

To say anything more firm about the paper, I need to read it in more depth and compare with previous work. For example, it’s already known that HMM memory decays exponentially. However, the way it’s proven in the paper (using a new quantity called rational mutual information) is somewhat unique.

The take-home message is that things like simple RNNs may not be adequate for modelling language, but more hierarchical or deep or other kinds of RNN models (or different types of models entirely) may be better suited. Which we intuitively already know.

As an aside, I also love the little bits of physics trivia sprinkled throughout the paper. Like this one:

Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to construct “shortcuts” between random variables that are separated by large amounts of time with short-range interactions; 1D models will be doomed to exponential decay. Hence the ubiquity of power laws explains the success of deep learning. In fact, this can be seen as the Bayesian net version of the important result in statistical physics that there are no phase transitions in 1D.

And this one:

There are close analogies between our deep recursive grammar and more conventional physical systems. For example, according to the emerging standard model of cosmology, there was an early period of cosmological inflation when density fluctuations [got] added on a fixed scale as space itself underwent repeated doublings, combining to produce an excellent approximation to a power-law correlation function. This inflationary process is simply a special case of our deep recursive model (generalized from 1 to 3 dimensions).