Competition: Analyze This! AI Innovation Challenge

12 minute read


A few months ago, the twin cities meetup group Analyze This! hosted a competition on something rare in the area - deep learning. More specifically, the competition was on image classification of Yaroslav Bulatov’s notMNIST dataset (a few samples pictured in the header photo). This dataset is very similar to Yann LeCun’s MNIST dataset of handwritten digits, the “Iris dataset of deep learning”, but with less structure and more noise.

The script is at the bottom of the page, and here is a brief overview and explanation of what I did for this competition.


For this competition, I prototyped different neural network architectures in Keras due to the ability of quickly creating and modifying architectures without worrying about things like graphs, sessions, weights, or placeholder variables. Since the competition required submissions to be done in TensorFlow, I re-created the best performing architecture in TensorFlow after finishing testing in Keras.

Data Preparation


In most cases, you need to normalize your inputs before feeding them to a neural network. This allows it to train faster, reduces the chances of getting stuck in a local optima, and provides a few other nice mathematical properties.

You will notice that there is no scaling or normalizing in the script. This is because it was already performed on the data that we are reading in.

Data Augmentation

Since our samples vary greatly within class, I wanted to produce additional training samples to help our network further learn and generalize. I initially looked into generative adversarial networks (GANs) to generate new training samples. This works by pitting two networks, a generator to generate the images from random noise and a discriminator to detect the fake images, against each other in order to cause the generator to become good enough at creating fake images that it can fool the discriminator:

This is a very popular concept today, and I agree that it’s extremely interesting and creative. However, it is computationally expensive, and thus time consuming. Due to time constraints from working full time and often traveling for work, I had to cut a few corners that had the potential to improve the overall performance.

Favoring simplicity, I ended up using data augmentation, a technique that creates additional training samples by using one or more of the following techniques:

  • Rotating images
  • Shifting images
  • Zooming in
  • Stretching images (ex. horizontally or vertically)
  • Adding noise or altering pixel intensity of the channels
  • Elastic transformation

The idea is that these modified images will make the model more robust by both having additional training samples (reducing chances of underfitting) and due to having more noise and variation in the training set (reducing chances of overfitting).

For this competition, I used image rotation and shifting. Elastic deformation would’ve likely helped with performance, but I didn’t implement it due to time constraints.

Neural Network

Here’s the architecture of the neural network and a brief description of the components:

Conv(5,5) -> Conv(5,5) -> MaxPooling -> Dropout (50%) -> Conv(3,3) -> Conv(3,3) -> MaxPooling -> Dropout (50%) -> FC1024 Dropout (50%) -> FC1024 -> SoftMax

Note: Replace text with image

  • Convolutional Layers: These layers are a little more abstract than typical fully connected layers because they incorporate spatial aspects of the input data. More specifically, they scan over the input data (the image in this case) and create filters (also referred to as kernels or feature maps) that look for specific features that the model deems important - this is a kind of automatic feature engineering by finding representation within the data automatically. For images, these filters are usually things like colors, edges, corners, textures, or even more complex shapes. Below is a high level example of a convolutional layer scanning over an image and creating filters:
      • Credit to Microsoft for the gif
      • Note: In this image we are constructing filters for multiple channels, or colors, but this project used greyscale images. A more accurate representation would be one grey layer on the top of the gif.
    • Here is a more specific view of creating the filter by multiplying pixel intensities (the numbers under the input image) by trainable weights:

    • Below is an example of filters at different levels of the convolutional network that demonstrates how the learned features get more complex deeper in the network:
        • Credit to Stanford University for the image
    • These layers also all used batch normalization and ReLU activations
      • Batch Normalization: This helps the network learn faster and obtain higher accuracy by scaling the hidden units to a comparable range. This also allows our network to be more stable due to not having large discrepancies in values. Lastly, this helps us avoid the “internal covariate shift” problem where small changes to the network are amplified further on in the network, thus causing a change in the input distribution to internal layers of the network.
      • Rectified Linear Unit (ReLU) Activations: Activation functions are used in neural networks because they apply non-linear functions to the hidden layers, thus allowing the model to learn more complex functional mappings from data. Sigmoid and TanH activations were previously the most popular activation methods, but recent research has shown that ReLUs tend to outperform other activation methods. The first reason is because it can learn faster due to a reduced likelihood of having a vanishing gradient - this is where a the gradient becomes extremely small as the absolute value of x increases. The second reason is sparsity, which is a nice property that allows faster training.
  • Max Pooling: This allows us to train the network faster by reducing the number of parameters while still retaining a large amount of information. Below is an image of 2x2 max pooling over a 4x4 filter.
  • Dropout: Dropout is a regularization technique that randomly de-activates components of a neural network at random. It seems counter-intuitive at first, but it helps the neurons become more robust by forcing them to generalize more in order to obtain good performance when other neurons are randomly being de-activated. This is also an extreme form of bagging (bootstrap aggregating) because our neural network is effectively a different one at each mini-batch due to different neurons being active.
  • Fully Connected Layers: These are the traditional layers like those pictured in the dropout example above. These layers take the features learned from the convolutional layers and do the actual classification.
  • Softmax: This is the final piece of our neural network where we take the outputs for a given image and scale them into probabilities (with the sum of the probabilities for all images adding up to 1) in order to generate our prediction.

The general idea of the network is that the convolutional layers learn the features (ex. the edges/corners/shapes/color intensity differences) that the fully connected layers then use to classify the images. This is one of the key differences between deep learning and traditional machine learning - performing automatic engineering by learning representations or encoding the data. A traditional machine learning model, logistic regression for example, would generally just learn the importance of the color intensity for each given pixel, but not how important the pixels are together.

Because of this automatic feature engineering, transfer learning, a method where only the fully connected layers are re-trained in order to classify specific images, is extremely popular. While this is a very simplified explanation, this is how things ilke Microsoft’s Custom Vision Service work.

Early Stopping

One thing I wanted to mention that I didn’t add to my script that I should have was early stopping. Early stopping is a unique regularization technique that is very computationally efficient to implement. It prevents overfitting by stopping the training of our model when the validation accuracy begins to decrease relative to our training accuracy:

Credit to DL4J for the image

This is extremely efficient because it not only prevents unnecessary training that inadvertently harms the generalization, but it is more or less a self-learning parameter - we don’t have to re-train the model using different values because the optimal value is decided during training.

Final Thoughts

Overall, the competition was a great learning experience. I’ve played around with deep learning in the past, but this really forced me to learn how to correctly use different concepts to build an effective model. Here are a few other thoughts:

  • TensorFlow is an extremely popular deep learning framework, but it is not nearly as easy to use as higher level libraries like Keras
    • TensorFlow does have other neat tools like TensorBoard that I didn’t end up taking advantage of
    • Loading saved models and generating predictions is not trivial in TensorFlow
  • GPUs are essential for any kind of serious deep learning work
    • CUDA and cuDNN, the toolkits required to use an NVIDIA GPU with deep learning libraries, are not as difficult as they seem to set up
    • GPU memory matters - my 5 year old GPU has much less memory than GPUs today, and as a result my mini-batches had to be smaller (resulting in a slower training speed and reduced accuracy)
  • by Jeremy Howard and Rachel Thomas and Deep Learning by Yoshua Bengio, Ian Goodfellow, and Aaron Courville were fantastic learning resources
    • I plan on writing a post in the future with all of the “rules of thumb” from Deep Learning after I finish reading it cover to cover - stay tuned!


Here’s the unedited script used to train the winning model.

from __future__ import division, print_function, absolute_import

import sys
import os
import numpy as np
import tensorflow as tf
import pickle
from datetime import datetime
import time

print('OS: ', sys.platform)
print('Python: ', sys.version)
print('NumPy: ', np.__version__)
print('TensorFlow: ', tf.__version__)

# Checking TensorFlow processing devices
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
print([x for x in local_device_protos if x.device_type == 'GPU'])

# GPU memory management settings
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.2

# Importing the data
dir_path = os.path.dirname(os.path.realpath(__file__))
pickle_file = 'notMNIST.pickle'

with open(dir_path+'\\'+pickle_file, 'rb') as f:
    save = pickle.load(f, encoding='iso-8859-1')
    X_train = save['train_dataset']
    y_train = save['train_labels']
    X_validation = save['valid_dataset']
    y_validation = save['valid_labels']
    X_test = save['test_dataset']
    y_test = save['test_labels']
    del save  # hint to help gc free up memory
    print('\nNative data shapes:')
    print('Training set', X_train.shape, y_train.shape)
    print('Validation set', X_validation.shape, y_validation.shape)
    print('Test set', X_test.shape, y_test.shape, '\n')

image_size = 28
num_labels = 10
num_channels = 1  # grayscale

# Reformatting to unflattened images
def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size, image_size, num_channels)).astype(np.float32)
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels

X_train, y_train = reformat(X_train, y_train)
X_validation, y_validation = reformat(X_validation, y_validation)
X_test, y_test = reformat(X_test, y_test)

print('Reformatted data shapes:')
print('Training set', X_train.shape, y_train.shape)
print('Validation set', X_validation.shape, y_validation.shape)
print('Test set', X_test.shape, y_test.shape, '\n')

# Augment training data
def augment_training_data(images, labels):
    Generates augmented training data by rotating and shifting images
    Creates an additional 300,000 training samples
    Takes ~1.25 minutes with an i7/16gb machine
    from scipy import ndimage

    # Empty lists to fill
    expanded_images = []
    expanded_labels = []

    # Looping through samples, modifying them, and appending them to the empty lists
    j = 0   # counter
    for x, y in zip(images, labels):
        j = j + 1
        if j % 10000 == 0:
            print('Expanding data: %03d / %03d' % (j, np.size(images, 0)))

        # register original data

        # get a value for the background
        # zero is the expected value, but median() is used to estimate background's value
        bg_value = np.median(x)  # this is regarded as background's value
        image = np.reshape(x, (-1, 28))

        for i in range(4):
            # rotate the image with random degree
            angle = np.random.randint(-15, 15, 1)
            new_img = ndimage.rotate(
                image, angle, reshape=False, cval=bg_value)

            # shift the image with random distance
            shift = np.random.randint(-2, 2, 2)
            new_img_ = ndimage.shift(new_img, shift, cval=bg_value)

            # register new training data
            expanded_images.append(np.reshape(new_img_, (28, 28, 1)))

    return expanded_images, expanded_labels

augmented = augment_training_data(X_train, y_train)

# Appending to the end of the current X/y train
X_train_aug = np.append(X_train, augmented[0], axis=0)
y_train_aug = np.append(y_train, augmented[1], axis=0)

print('X_train shape:', X_train_aug.shape)
print('y_train shape:', y_train_aug.shape)
print(X_train_aug.shape[0], 'Train samples')
print(X_validation.shape[0], 'Validation samples')
print(X_test.shape[0], 'Test samples')

def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0])

# Training Parameters
learning_rate = 0.001
num_steps = y_train.shape[0] + 1  # 200,000 per epoch
batch_size = 128
epochs = 100
display_step = 250  # To print progress

# Network Parameters
num_input = 784  # Data input (image shape: 28x28)
num_classes = 10  # Total classes (10 characters)

graph = tf.Graph()

with graph.as_default():
    # Input data
    tf_X_train = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
    tf_y_train = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_X_validation = tf.constant(X_validation)
    tf_X_test = tf.constant(X_test)

    # Create some wrappers for simplicity
    def maxpool2d(x, k=2):
        Max Pooling wrapper
        return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

    def batch_norm(x):
        Batch Normalization wrapper
        return tf.contrib.layers.batch_norm(x, center=True, scale=True, fused=True,)

    def conv2d(data, outputs=32, kernel_size=(5, 5), stride=1, regularization=0.00005):
        Conv2D wrapper, with bias and relu activation
        layer = tf.contrib.layers.conv2d(inputs=data, 
        return layer

    # Conv(5,5) -> Conv(5,5) -> MaxPooling -> Conv(3,3) -> Conv(3,3) -> MaxPooling -> FC1024 -> FC1024 -> SoftMax
    def model(x):
        # Conv(5, 5)
        conv1 = conv2d(x)
        bnorm1 = batch_norm(conv1)

        # Conv(5, 5) -> Max Pooling
        conv2 = conv2d(bnorm1, outputs=64)
        bnorm2 = batch_norm(conv2)
        pool1 = maxpool2d(bnorm2, k=2)  # 14x14
        drop1 = tf.nn.dropout(pool1, keep_prob=0.5)

        # Conv(3, 3)
        conv3 = conv2d(drop1, outputs=64, kernel_size=(3, 3))
        bnorm3 = batch_norm(conv3)

        # Conv(3, 3) -> Max Pooling
        conv4 = conv2d(bnorm3, outputs=64, kernel_size=(3, 3))
        bnorm4 = batch_norm(conv4)
        pool2 = maxpool2d(bnorm4, k=2)  # 7x7
        drop2 = tf.nn.dropout(pool2, keep_prob=0.5)

        # FC1024
        # Reshape conv2 output to fit fully connected layer input
        flatten = tf.contrib.layers.flatten(drop2)
        fc1 = tf.contrib.layers.fully_connected(
        drop3 = tf.nn.dropout(fc1, keep_prob=0.5)

        # FC1024
        fc2 = tf.contrib.layers.fully_connected(

        # Output, class prediction
        out = tf.contrib.layers.fully_connected(
        return out

    # Construct model
    logits = model(tf_X_train)
    prediction = tf.nn.softmax(logits)

    # Define loss and optimizer
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_y_train, logits=logits))
    optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start training
with tf.Session(config=config, graph=graph) as session:

    # For tracking execution time and progress
    start_time = time.time()
    total_steps = 0

    for epoch in range(1, epochs+1):
        print('Beginning Epoch {0} -'.format(epoch))

        def next_batch(num, data, labels):
            Return a total of `num` random samples and labels. 
            Mimicks the mnist.train.next_batch() function
            idx = np.arange(0 , len(data))
            idx = idx[:num]
            data_shuffle = [data[i] for i in idx]
            labels_shuffle = [labels[i] for i in idx]

            return np.asarray(data_shuffle), np.asarray(labels_shuffle)

        for step in range(num_steps):
            batch_data, batch_labels = next_batch(batch_size, X_train_aug, y_train_aug)

            feed_dict = {tf_X_train: batch_data, tf_y_train: batch_labels}
            _, l, predictions =[optimizer, loss, train_prediction], feed_dict=feed_dict)

            if (step % 250 == 0) or (step == num_steps):
                # Calculating percentage of completion
                total_steps += step
                pct_epoch = (step / float(num_steps)) * 100
                pct_total = (total_steps / float(num_steps * (epochs+1))) * 100  # Fix this line

                # Printing progress
                print('Epoch %d Step %d (%.2f%% epoch, %.2f%% total)' % (epoch, step, pct_epoch, pct_total))
                print('Minibatch loss: %f' % l)
                print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
                print('Total execution time: %.2f minutes' % ((time.time() - start_time)/60.))
        # Save the model every 5th epoch
        if epoch % 5 == 0:
            # Saver object - saves model as 'tfTestModel_20epochs_Y-M-D_H-M-S'
            saver = tf.train.Saver()
            current_time ="%Y-%m-%d_%H-%M-%S")
  , dir_path+'\\models\\'+'tfTestModel'+'_'+str(epoch)+'epochs_'+str(current_time))
            print('Saving model at current stage')

    # Saver object - saves model as 'tfTestModel_20epochs_Y-M-D_H-M-S'
    saver = tf.train.Saver()
    current_time ="%Y-%m-%d_%H-%M-%S"), dir_path+'\\models\\'+'tfTestModel'+'_'+str(epoch)+'epochs_'+str(current_time))