Visual Models – where are we and how did we get here?

Background

In the previous post I gave a general overview of the current state of machine vision. Our ideal machine vision model (the ‘Grand Visual Model’ or GVM) would be one that can recognize any object that it has previously seen – even if it’s only seen that object once. We’d also like our model to be able to look at a scene and, on its own, tell if there are any new objects and, if so, to record them in its memory. We’d like our system to be able to identify precise and subtle details about the objects it sees. That is, we don’t just want it to be able to say “There’s a person in this image,” but to say, “There’s John, standing one meter away, gazing to the left, with an angry expression on his face.”

This is a difficult and unsolved challenge. But people have been approaching the GVM problem from various angles, and the future seems exciting. It’s possible that we are very close to cracking GVM. The reason I’m writing this post is to bring some useful ideas to a broader audience – maybe you will be the person who writes the first GVM (and incidentally becomes a major celebrity, but come on, the thrill of scientific achievement outweighs fame and fortune! /s).

An outline of the GVM

By now, it’s fair to say that deep learning has solved the problem of “recognize an object in an image when provided with numerous labelled training examples” but, alas, this is still a ways away from where we want to be. Deep learning methods are very general; one way to help our learning algorithms learn about the world more efficiently is to hard-code into them a set of general assumptions about that world. Let’s try to list some general assumptions that seem to be true about the real world, at least when it comes to vision:

  1. Objects can be viewed from a variety of orientations and under a variety of lighting conditions, which cause the objects to appear different, but in predictable ways.
  2. Objects tend to be composed of sub-objects whose spatial position and orientation are fairly stable relative to the parent object.
  3. Objects can be grouped into categories and those categories can, in turn, be grouped into even larger categories. For example, ‘dog’ and ‘cat’ belong to the category ‘animal’, and ‘dining table’ and ‘work table’ belong to the category ‘table’. But objects can belong to many different categories, and objects that look similar can, depending on context, belong to completely different categories.
  4. Objects in the same category tend to share certain features – most animals have a head, a body, and limbs, for instance, and all tables have some kind of support (legs, etc.) and a flat top.

I’m sure there are more, but let’s stick with these for now and try to flesh out what they mean. Deep learning has pretty much solved (1) – given enough training data, we can recognize an object regardless of its pose or the lighting conditions. And if you have a machine vision system that can ‘imagine’ what objects look like, training data is no problem – you can imagine as much training data as you want and train your deep net. This is precisely what Picture aims to do.
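To make (1) a bit more concrete, here is a minimal sketch of the usual stopgap while we wait for systems that can truly ‘imagine’ new views: simulating pose and lighting variation with data augmentation. The choice of PyTorch/torchvision and the dataset path are my own assumptions for illustration, not anything from Picture or any other specific system.

```python
# A minimal sketch (my own illustration): simulating pose and lighting variation
# as augmented training data, using PyTorch/torchvision as an assumed framework.
import torch
from torchvision import transforms, datasets

# Random rotations/crops roughly approximate pose changes; ColorJitter
# roughly approximates lighting changes.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

# Hypothetical image folder; each augmented sample acts like an 'imagined'
# extra view of an object the model has already seen.
dataset = datasets.ImageFolder("path/to/images", transform=augment)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```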

Early attempts at solving (2) include, for example, constellation models. The more successful constellation models use probabilistic reasoning and are capable of impressive feats such as one-shot learning of object sub-part relationships entirely on their own. However, a major issue with constellation models, at least as they were often implemented, lies in choosing the features to detect. They used crude methods like the Kadir-Brady detector, and the features such methods extract are highly dependent on object pose. By combining constellation models with convnets, it might be possible to solve this problem: you’d have a system consisting of a convnet at the lower levels and a constellation model at the higher levels. However, a lot of details would have to be worked out – for example, how to effectively train such a model.
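To give a flavour of what such a hybrid might look like, here is a rough sketch of my own (not any published model): convnet part detections at the bottom, and a simple Gaussian ‘constellation’ score over the parts’ relative positions on top. All class and variable names here are made up for illustration.

```python
# A rough sketch (my own illustration, not a published model) of a constellation-style
# layer over convnet part detections: each part is scored by a Gaussian model of its
# position relative to an object anchor, plus its appearance score from the convnet.
import numpy as np

class GaussianPartModel:
    """Models the expected offset of one part relative to the object centre."""
    def __init__(self, mean_offset, cov):
        self.mean = np.asarray(mean_offset, dtype=float)       # expected (dx, dy)
        self.cov_inv = np.linalg.inv(np.asarray(cov, dtype=float))

    def log_score(self, offset):
        d = np.asarray(offset, dtype=float) - self.mean
        return -0.5 * d @ self.cov_inv @ d                      # unnormalised Gaussian log-density

def object_log_score(center, part_detections, part_models):
    """Sum appearance scores (e.g. convnet logits) and geometry scores for all parts.

    part_detections: dict part_name -> ((x, y) position, appearance_log_score)
    part_models:     dict part_name -> GaussianPartModel
    """
    total = 0.0
    for name, (pos, appearance) in part_detections.items():
        offset = np.asarray(pos, dtype=float) - np.asarray(center, dtype=float)
        total += appearance + part_models[name].log_score(offset)
    return total

# Toy usage: a 'face' with two parts whose offsets are roughly where we expect them.
models = {
    "left_eye": GaussianPartModel(mean_offset=(-10, -15), cov=[[9, 0], [0, 9]]),
    "mouth":    GaussianPartModel(mean_offset=(0, 20),    cov=[[16, 0], [0, 16]]),
}
detections = {
    "left_eye": ((38, 33), 2.1),   # (position, convnet appearance log-score)
    "mouth":    ((50, 71), 1.7),
}
print(object_log_score(center=(50, 50), part_detections=detections, part_models=models))
```

The appeal of this arrangement is that the convnet supplies pose-robust part detections, so the geometric model no longer depends on a brittle interest-point detector.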

There have been attempts at solving (3) and (4) as well, for example the work of Josef Sivic (2008). The idea that visual objects form complex, context-specific semantic categories reminds one of natural language, where words form sentences and topics and so on, and so people have used tools from natural language processing to attack the problem. Sivic, for example, uses hierarchical latent Dirichlet allocation. The approach produces interesting taxonomies, such as the one below (reproduced from the paper):

[Figure: the object taxonomy discovered by the model, reproduced from the paper]

You can see that, at the bottom level, it has put computer-monitor-like objects in one category, light switches in another, and cars in another. At the next level, it groups the light switches and monitors together into a fuzzy category of objects that share square-like outlines. And finally, it groups all the objects together into a universal super-category.

The way the grouping is done is actually very simple, and it’s based on an image similarity metric. The main problem with this sort of approach is that it places too much emphasis on appearance, which is ever-changing and dynamic. In the real world it’s unlikely that we’ll see a light switch or computer head-on and centered, unoccluded, in great lighting conditions. And even if we do, we can’t count on such canonical views when deciding what a light switch looks like in general. So we really do need a model that’s able to abstract away from raw appearance and take into account more permanent features of objects – such as a light switch having a particular 3D shape and several switch protrusions near the center.
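For a rough feel of the visual-words-plus-topic-model pipeline behind this kind of result, here is a sketch using scikit-learn. Note the approximations: the descriptors are random stand-ins for real local features (e.g. SIFT), and I use flat LDA because scikit-learn does not ship a hierarchical version, whereas the paper’s model is hierarchical.

```python
# A rough approximation (not the paper's code) of the visual-words + topic-model idea:
# quantise local descriptors into 'visual words' with k-means, then fit a (flat) LDA.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Stand-in for local descriptors extracted from each image:
# descriptors_per_image[i] has shape (n_features_i, 128), like SIFT descriptors.
descriptors_per_image = [rng.random((200, 128)) for _ in range(50)]

# 1. Build the visual vocabulary by clustering all descriptors.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))

# 2. Represent each image as a bag of visual words.
counts = np.zeros((len(descriptors_per_image), 100), dtype=int)
for i, desc in enumerate(descriptors_per_image):
    words = kmeans.predict(desc)
    counts[i] = np.bincount(words, minlength=100)

# 3. Discover object 'topics' without labels.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_mixtures = lda.fit_transform(counts)   # per-image topic proportions
print(topic_mixtures.shape)                  # (50, 10)
```

Everything downstream of the bag-of-words step treats the image purely as a histogram of appearances, which is exactly why the approach struggles to move beyond appearance.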

A path forward

You might already be thinking: What if we could combine these approaches? We would have a system with 3 or 4 levels of organization. At the bottom, we’d have a convnet that’s able to robustly and efficiently detect features and can be easily trained. In the middle, we’d have a constellation model (or something similar) that’s capable of integrating features together to obtain detailed 3D models of objects. And at the highest level, we’d have a language-like system for modeling real-world objects and their categories. These components would be trained together in an unsupervised fashion, with each level feeding its expectations down to the level below and each level feeding its predictions back up, so that the whole stack is trained continuously.
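Purely as a thought experiment, the skeleton below shows one way those three levels and the bottom-up/top-down loop could be wired together. Every class and method name in it is hypothetical – it is a sketch of the idea, not an implementation of any existing system.

```python
# A purely hypothetical skeleton of the three-level system described above;
# every name here is invented for illustration, not an existing library or model.
class ConvNetFeatures:
    """Bottom level: detects local features/parts in an image."""
    def detect_parts(self, image):
        ...  # would return part detections with positions and appearance scores

class ConstellationLayer:
    """Middle level: assembles parts into object hypotheses with 3D structure."""
    def assemble_objects(self, parts, expectations=None):
        ...  # would return candidate objects; 'expectations' are top-down priors

class SemanticLayer:
    """Top level: a language-like model of categories and context."""
    def interpret(self, objects):
        ...  # would return a scene description and per-object category beliefs

    def expectations(self, scene):
        ...  # would return top-down predictions fed back to the lower levels

def perceive(image, convnet, constellation, semantics, n_passes=3):
    """One way the levels could run jointly: alternate bottom-up recognition
    and top-down prediction for a few passes."""
    scene, priors = None, None
    for _ in range(n_passes):
        parts = convnet.detect_parts(image)
        objects = constellation.assemble_objects(parts, expectations=priors)
        scene = semantics.interpret(objects)
        priors = semantics.expectations(scene)   # feedback for the next pass
    return scene
```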

Is it possible to build a model like this? Perhaps. Would it work? Well, in my experience, most models don’t work :). But it would be interesting to try, and we might learn something.

In the next posts, I’m going to dive a bit deeper into the nuts and bolts of these various models and explore how they could be tied together.
