The official TensorFlow repository has a working implementation of the Inception v3 architecture. Inception v3 is the 2015 iteration of Google’s Inception architecture for image recognition. If you are familiar with deep learning then you most definitely know all about it. If you aren’t, but keep up with tech news, then you probably best know it as ‘that learning algorithm that trained itself to recognize pictures of cats.’ And if you still have no idea what I’m talking about (or you think I’m talking about that Leonardo DiCaprio movie), then congrats on climbing out of your cave, and welcome to the world of machine learning!
Inception is a really great architecture, the result of multiple cycles of trial and error. In my experience it frequently achieves the best image-recognition performance among the models I've tried.
The implementation of it was written by the same people who wrote TensorFlow, and so it seems to be well-written and makes use of a lot of TensorFlow tricks and techniques. I thought I’d study the code to see how they do things, and learn how to utilize TensorFlow better. In this blog post I’m sharing some of my notes.
First of all, the Inception code uses TF-Slim, which seems to be a kind of abstraction library over TensorFlow that makes writing convolutional nets easier and more compact. As far as I can tell, TF-Slim hasn't been used for any major projects aside from Inception. But it's a good fit for Inception, since the architecture is 'deep' and has many layers. I recommend looking at the README file on that page.
Let's dive into the code. slim/inception_model.py contains the code for the actual Inception model itself. The model makes heavy use of the various scoping mechanisms available in TensorFlow, and wraps the entire Inception model into a new TensorFlow op. First, it wraps the whole model in a scope named 'inception_v3'. It then uses various arg_scopes to set the default arguments for ops inside the model. TF's arg_scopes are a simple way of setting the default arguments for a lot of ops in a model at the same time, without having to repeatedly enter them each time an op is called. There are several nested arg_scopes, ostensibly one for each module in the model.
An interesting aspect of the model is that it's not constructed of repeated modules. That is, they didn't define an 'inception cell' and then repeatedly apply it to downscale the input. This is what I would have done (modularity is good: it prevents bugs from creeping in and makes the code easier to modify), and I'm curious to know whether there's some fundamental reason doing this wouldn't have been worth it, or whether they just decided to keep things conceptually simpler this way.
end_points in the model contains all intermediate tensors. So, for instance, end_points['conv0'] contains the output of the first convolution, which is then fed into another convolution, the result of which is saved as end_points['conv1'], and so on.
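The bookkeeping pattern is easy to sketch in plain Python, with strings standing in for tensors: each layer's output is both fed forward to the next layer and stashed under a readable name.

```python
# Sketch of the end_points pattern: every intermediate result is kept
# in a dict so later code (losses, summaries, aux heads) can grab it.
def fake_conv(x, name):
    return f"{name}({x})"   # stand-in for a real convolution op

end_points = {}
net = "images"
net = fake_conv(net, "conv0")
end_points['conv0'] = net   # output of the first convolution
net = fake_conv(net, "conv1")
end_points['conv1'] = net   # output of the second, fed from the first
```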
The rest of the model definition seems straightforward. Convolution, max-pooling, and dropout layers are repeatedly applied to the tensors, and the result is the logits variable, which gives the predictions (a vector of length 1000 for each image in the batch), and the end_points dictionary, which stores all intermediate results. The other function in the file, inception_v3_parameters, simply creates another arg_scope that holds the default parameters for the inception op itself.
The next important file in the model is inception_train.py, which contains the actual code for, well, training the Inception model. Proper training in TF requires doing the following things:
- Specifying a learning rate schedule
- Creating some system for generating training batches and testing batches
- Setting up an optimizer and running it
- Setting up some system to periodically display results and save model state
In the file, there is a 'warning' that the learning rate schedule is

> …heavily dependent on the hardware architecture, batch size and any changes to the model architecture specification. Selecting a finely tuned learning rate schedule is an empirical process that requires some experimentation. Please see README.md for more guidance and discussion.
The train function in that file is of particular note. It first selects the CPU as the computational backend to perform various book-keeping duties. The next step is defining a global_step variable, which simply represents the training iteration number. This is fairly standard procedure for TensorFlow models, as such a variable helps the computation graph keep track of when to update the learning rate and when to output saved model results.
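To sketch how a global_step drives the learning rate schedule, here is a pure-Python version of staircase exponential decay; the constants are made up for illustration, not taken from the Inception code.

```python
# Pure-Python sketch of a staircase exponential-decay schedule driven
# by global_step: the rate drops by decay_rate every decay_steps steps.
def exponential_decay(initial_lr, global_step, decay_steps, decay_rate):
    return initial_lr * decay_rate ** (global_step // decay_steps)

lr0 = exponential_decay(0.1, 0, 30, 0.16)    # still the initial rate
lr1 = exponential_decay(0.1, 60, 30, 0.16)   # 0.1 * 0.16**2 after two drops
```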
The training here is done in an interesting way. It uses a special mechanism for distributed training: it splits up the computation between GPUs explicitly, by executing a separate graph on each one, which it calls a 'tower'. It doesn't seem to use the Distributed TensorFlow mechanism, which seems simpler, and I'm not exactly sure why (maybe because the code was written before Distributed TF became available). Anyway, there are two 'private' functions: _tower_loss, which computes the total loss for each tower, and _average_gradient, which averages the per-tower gradients to return a global gradient. During each training cycle, the gradient is calculated for each tower, and then the gradients are averaged and applied to the weights/biases. Note that the averaging step represents a synchronization point between the towers. In one paper, synchronous training is compared with asynchronous training (doing updates separately), and synchronous training seems to converge faster.
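The job of the gradient-averaging step can be sketched in plain Python, with numbers standing in for gradient tensors (the real function works on lists of (gradient, variable) pairs, so this is a simplification):

```python
# Sketch of what averaging per-tower gradients does: tower_grads[i][j]
# is tower i's gradient for variable j; the result is one averaged
# gradient per variable, shared by all towers.
def average_gradients(tower_grads):
    num_towers = len(tower_grads)
    return [sum(grads) / num_towers for grads in zip(*tower_grads)]

tower_grads = [
    [1.0, 4.0],   # gradients from tower 0 (two variables)
    [3.0, 8.0],   # gradients from tower 1
]
avg = average_gradients(tower_grads)  # [2.0, 6.0]
```

Because every tower must finish its backward pass before the average can be taken, this line is exactly the synchronization point mentioned above.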
The gradients are computed, for each tower, using the opt.compute_gradients function, averaged using _average_gradient, and then applied using the opt.apply_gradients function, where opt is the optimizer. This is another neat thing about TF: for simpler training procedures, you can take an entirely hands-off approach and allow TF to automatically take care of computing the gradients and applying them, without ever having to think about the gradients yourself. You can also take a more micromanaged approach, explicitly calculating gradients, doing some processing, and then applying them. And you can take a bare-bones approach, not relying on any pre-written optimizers, and applying gradients however you want.
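The 'micromanaged' pattern can be sketched in plain Python: compute gradients explicitly, post-process them (I use clipping here purely as an illustration of the processing step), then apply them, instead of letting a one-shot minimize call do everything.

```python
# Sketch of the compute -> process -> apply pattern, with floats
# standing in for tensors and a toy loss of p**2 (gradient 2*p).
def compute_gradients(grad_fn, params):
    return [grad_fn(p) for p in params]

def clip(grads, limit):
    # The "processing" step between computing and applying gradients.
    return [max(-limit, min(limit, g)) for g in grads]

def apply_gradients(params, grads, lr):
    return [p - lr * g for p, g in zip(params, grads)]

params = [1.0, -2.0]
grads = compute_gradients(lambda p: 2 * p, params)  # [2.0, -4.0]
grads = clip(grads, 3.0)                            # [2.0, -3.0]
params = apply_gradients(params, grads, lr=0.1)
```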
These operations, plus operations for saving/loading graphs, are then all bundled up into top-level ops, and then combined into a final op:
train_op = tf.group(apply_gradient_op, variables_averages_op, batchnorm_updates_op)
This combined op is then run at each training step.
Now let's look at how testing and training data are handled. Data are wrapped in a class called Dataset, which is defined as an abstract base class in dataset.py. The Dataset class itself seems to mostly be a set of routines for simplifying access to TFRecords files. One thing I'm not sure about is why there seems to be a preference for TFRecords over more standard data formats like HDF5; TF does, at least, have support for reading HDF5.
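The abstract-base-class pattern here is roughly the following; the method names and the ImageNet numbers are illustrative, not the exact dataset.py interface.

```python
# Sketch of the Dataset abstraction: the base class fixes the questions
# every dataset must answer; subclasses fill in the specifics.
import abc

class Dataset(abc.ABC):
    def __init__(self, subset):
        assert subset in ('train', 'validation')
        self.subset = subset

    @abc.abstractmethod
    def num_classes(self):
        """Number of target classes."""

    @abc.abstractmethod
    def data_files(self):
        """Paths of the TFRecords shards for this subset."""

class ImagenetData(Dataset):
    def num_classes(self):
        return 1000

    def data_files(self):
        # Hypothetical single-shard naming, for illustration only.
        return [f"{self.subset}-00000-of-00001"]
```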
It seems like the code only runs the loss on a single data set at a time (training, testing, or validation). You can select which one by feeding a command-line argument:
tf.app.flags.DEFINE_string('subset', 'train', """Either 'train' or 'validation'.""")
You can also select from the command line whether to train or not. This allows you to run multiple processes in parallel for the training, testing, and validation sets, which seems to be the preferred way to do testing/training in TF, rather than performing training and testing simultaneously in one script.
The code contains a lot of the now-standard coding patterns for TF:
- Encapsulating everything in various modules defined by scopes
- Specifying a learning rate schedule
- Creating some system for generating training batches and testing batches – done external to the model, with actual evaluation in a separate process
- Setting up an optimizer and running it
- Setting up some system to periodically display results and save model state, using the tf.Saver mechanism, which automates a lot of this