The holy grail of machine vision is to be able to look at a scene and tell precisely what objects are in the scene and where they are. By now virtually everyone in the solar system has heard about deep learning and the impressive strides it’s made toward this goal. And if we’re being fair, looking at the results from deep learning methods, it’s hard to argue that we haven’t made significant progress, compared to a few decades ago. But there’s still a lot that deep learning can’t do.
Deep learning methods usually start from a very large training data set (millions of images) that has been carefully and accurately labelled. Every object in the training set typically appears in multiple images and in various poses. This allows the net to learn what objects look like no matter how they are perceived.
But the real world is dynamic and ever-changing. We often encounter objects we’ve never seen before (new faces, new locations, etc.), and we may need to memorize their appearance. Usually we do not get much ‘training data’ for new objects. Yet we are typically able to remember someone’s face after just a few seconds of seeing them – this is called ‘one-shot’ learning. And we are able to recognize that face even when it’s seen under vastly different lighting conditions or poses.
Another issue is that real-world visual scenes aren’t nicely labelled. While children sometimes have parents to teach them the words for new things, most of our learning is unsupervised – as we learn about the world, we classify most objects into categories via similarities and differences between them and the objects we’ve seen before.
One-shot learning and unsupervised learning are the sorts of problems that machine learning is attacking in the current post-deep learning era. People have made a lot of progress, but we’re still far from where we’d like to be.
The main ingredient we’re missing is a model of the world. Every ML method requires some assumptions about what the world is like. For a simple example, in k-means clustering we assume that the world is composed of a set of clusters, where each cluster has small intra-cluster distances and large inter-cluster distances according to some metric. Further, we assume that the only relevant information is cluster membership, and that all other information is irrelevant noise. Even further, we assume that the number of clusters is relatively small (small enough to fit in our computers’ memories and to sample from efficiently) and that each cluster exists independently of all the others. These are very restrictive assumptions. The more restrictive the assumptions of an ML algorithm, the more efficiently it’s able to learn (fewer training examples), since the number of possible ‘worlds’ is smaller – but it’s only able to learn things that fall within its model. You can’t use raw k-means to perform real-world visual classification, because its modelling assumptions are violated: for image classification there is no clear ‘similarity’ metric, the clusters are huge in number and dimension, they are not neatly separated, and they do not exist independently of one another.
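To make those k-means assumptions concrete, here’s a minimal sketch (plain NumPy, with made-up synthetic data): when the world really is a few compact, well-separated clusters, Lloyd’s algorithm recovers them from unlabeled points alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs: exactly the kind of world k-means
# assumes (small intra-cluster distances, large inter-cluster distances).
a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
points = np.vstack([a, b])

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean metric).
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

labels, centroids = kmeans(points, k=2)
print(np.sort(centroids, axis=0).round(1))  # centroids land near (0, 0) and (5, 5)
```

Break any of the assumptions – overlapping clusters, no meaningful metric, clusters that depend on each other – and this simple procedure falls apart.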
So what are the assumptions of convolutional neural nets, of the type that deep learning uses for object recognition?
Well, they have the usual assumptions of neural nets – that visual scenes can be transformed via a series of neural layers into a compact representation. But they also have additional assumptions due to being convolutional – they assume translation invariance and, sometimes, scale invariance as well. These are very fair assumptions for natural images, and this is why convnets perform so well. But their assumptions aren’t nearly enough.
- They don’t ‘know’ anything about what objects look like in visual scenes – e.g., that objects can move and rotate in 3D, and that this causes their appearances to change in predictable ways*.
- They don’t know about the types of objects that exist in the world and how they relate to each other.
Because of these limitations and others, deep convnets need a lot of training data to work: we must effectively ‘teach’ the networks about these factors of the world by giving them enough training examples to ‘figure them out’ by themselves. This is very inefficient, and it prevents one-shot learning and unsupervised learning.
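The translation assumption above can be demonstrated directly: convolution commutes with shifting, so a filter’s response to a shifted input is just the shifted response. A toy 1D sketch (the signal and kernel are made up for illustration):

```python
import numpy as np

# A toy "image" row and a small edge-detector filter.
signal = np.array([0., 0., 1., 1., 1., 0., 0., 0., 0., 0.])
kernel = np.array([1., -1.])

def shift(x, n):
    """Shift right by n samples, padding with zeros (a toy translation)."""
    return np.concatenate([np.zeros(n), x[:-n]])

# Filtering a shifted input gives the shifted filter response
# (exactly, here, because the shifted-out tail of the signal is zero).
resp = np.convolve(signal, kernel, mode="full")
resp_shifted_input = np.convolve(shift(signal, 3), kernel, mode="full")

print(np.allclose(resp_shifted_input[3:], resp[:-3]))  # True
```

This is the structure convnets get for free by weight sharing – but notice it says nothing about 3D pose, object parts, or relationships between object classes.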
So how do we construct an object detection system that takes these things into account?
Object class hierarchies
In the early days of machine vision, it was popular to represent each type of object as having a set of sub-objects (‘parts’), which in turn might have their own sub-objects, and so on. Each sub-object had a position and orientation relative to its parent object (often probabilistic in nature: instead of a fixed 3D coordinate, a part’s position might be a normal distribution over coordinates). This formed, in effect, a ‘grammar’ over visual objects. See, for instance, the work by Fei-Fei Li’s group. This kind of model encodes something very important about how the world works: objects can be counted on to have fairly stable sets of sub-objects, and the differences in appearance of the same object in different scenes can usually be explained by the object simply appearing in a different pose.
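A minimal sketch of this part-based idea (the part names, offsets, and detections below are all invented for illustration): an object hypothesis is scored by how well detected parts agree with Gaussian distributions over their expected offsets from the object’s center.

```python
import numpy as np

# Hypothetical "face" model: each part has an expected 2D offset from the
# object center plus a Gaussian spread, forming a tiny pictorial structure.
PARTS = {
    "left_eye":  {"mean": np.array([-1.0,  1.0]), "std": 0.2},
    "right_eye": {"mean": np.array([ 1.0,  1.0]), "std": 0.2},
    "mouth":     {"mean": np.array([ 0.0, -1.0]), "std": 0.3},
}

def log_gauss(x, mean, std):
    """Log density of an isotropic 2D Gaussian (normalizer dropped)."""
    return -np.sum((x - mean) ** 2) / (2 * std ** 2)

def score(center, detections):
    """Geometric agreement between detected parts and the part model."""
    return sum(
        log_gauss(detections[name] - center, p["mean"], p["std"])
        for name, p in PARTS.items()
    )

# Detected part locations in a made-up image coordinate frame.
dets = {"left_eye":  np.array([4.1, 6.0]),
        "right_eye": np.array([6.0, 5.9]),
        "mouth":     np.array([5.0, 4.0])}

# A hypothesis centered at (5, 5) fits the 'grammar' far better than (0, 0).
print(score(np.array([5., 5.]), dets) > score(np.array([0., 0.]), dets))  # True
```

A real system of this kind would also score each part’s appearance and search over poses, but the core idea is just this: pose-stable relationships between parts explain away most appearance variation.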
But while this type of model can give interesting results for many object detection problems, it’s not good enough for real-world problems, because in the real world, we do not simply model each object as a set of explicit sub-parts. We use information learned from other objects to form representations that are far more efficient and capture more structured information about the world. For example, you know that all mammals have heads, bodies, limbs, etc. If you see a new mammal for the first time – say, a Saiga antelope (and no, that’s not photoshopped, it’s a real animal) – you’ll be able to tell that even though it looks like no mammal you’ve ever seen, it’s still probably a mammal (has fur, etc.), looks somewhat like a regular antelope or deer, and in turn belongs to the group of mammals that look roughly horse/deer/gazelle/goat-like. You’d build a mental model of the saiga antelope’s general shape. If you see the saiga antelope from another pose, you’d use that mental model to figure out it’s a saiga antelope. And you’d do all of this having just seen a single low-resolution picture of a saiga antelope.
What I’m trying to say is that we use information from other objects to build representations for new objects. Far from all of our object representations living separately and not interacting, we have a rich ‘internal world’ of different types of objects, object class hierarchies, and lateral relationships between objects.
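One way to sketch this sharing of information is hierarchical shrinkage: a class seen only once is estimated partly from its single example and partly from a superclass prior built out of related classes. Everything below – the class names, feature vectors, and the 0.5 weight – is invented for illustration.

```python
import numpy as np

# Hypothetical feature vectors for familiar horse/deer/gazelle-like
# mammals (imagine learned embeddings; the numbers are made up).
known = {
    "horse":   np.array([0.9, 0.8, 0.1]),
    "deer":    np.array([0.8, 0.9, 0.2]),
    "gazelle": np.array([0.7, 0.9, 0.1]),
}

# Superclass prior: the mean of the related classes.
prior_mean = np.mean(list(known.values()), axis=0)

def one_shot_estimate(example, prior_mean, weight=0.5):
    """Shrink a single-example class estimate toward the superclass prior.

    With one observation we lean heavily on the prior; with more examples
    the weight on the data would grow, as in hierarchical Bayes."""
    return weight * example + (1 - weight) * prior_mean

# One low-resolution sighting of a new animal (our 'saiga'): a noisy
# single example, regularized by everything we know about its relatives.
saiga_example = np.array([0.6, 1.0, 0.3])
estimate = one_shot_estimate(saiga_example, prior_mean)
print(estimate.round(2))
```

The point isn’t this particular formula – it’s that a new object’s representation is built on top of the representations of objects we already know, rather than from scratch.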
No one knows what our internal model of object classes looks like. It’s an important enough problem that anyone who figured this out might win the Nobel prize. It’s been conjectured that figuring it out would give insights into a good deal of what the human brain does in general, beyond vision. Because of this, figuring it out might be as hard as ‘cracking’ the problem of general artificial intelligence. And anyone who accomplished the related goal of coming up with a computer model of object classes capable of the same richness of representation and efficient learning that humans have would definitely become a celebrity in ML circles.
Let’s call such a hypothetical future model the GVM – the General Visual Model or Grand Visual Model (or Generative Visual Model, if that makes you feel better). We’ve made some progress towards achieving a GVM, but we’re still far from the goal. For example, there’s Picture, which lets you represent your visual scene in a very high-level language, with an algorithm that learns how to translate images into representations in your model. But Picture is still underspecified – it doesn’t specify a model for object representation, merely a language in which such a model could be expressed. As an analogy, you could say that Picture is to the eventual GVM what C is to the Linux operating system (assuming the GVM ends up using Picture-related ideas, which may or may not happen). One model that I really like is Brendan Lake’s handwriting model, which came out just last year. It’s a very nice example of the ideas we might well see in the eventual GVM – a rich representation that uses information-sharing to represent a huge number of objects very efficiently. It’s capable of things like one-shot learning and efficient unsupervised training, and it achieves results very similar to humans’ when trained on similar tasks. However, the model is too specific – it only works on handwriting. It’s also fairly complicated, uses a lot of optimizations specific to handwriting, and it’s not clear how to generalize it to other tasks. This isn’t good news – it suggests that the GVM itself is probably also very complicated and would require a lot of hard-coded, hand-tuned ‘magic’ in order to work. This stands in stark contrast to convnets, which are fairly simple and general.
Still, Lake’s work has generated a flurry of related work and it’s somewhat hard to keep up with a lot of the ideas that are being thrown out there. In the next posts, I’ll go through the past, present, and (possible) future of achieving GVM and try to come up with a set of ideas on how to proceed.