Backpropagation and stochastic gradient descent

Backpropagation, or the backward propagation of errors, is the most commonly used supervised learning algorithm for adapting the connection weights.

Considering the error or the cost as a function of the weights W and b, a local minimum of the cost function can be approached with a gradient descent, which consists of changing the weights along the negative gradient of the cost C:

W ← W − λ * ∂C/∂W
b ← b − λ * ∂C/∂b

Here, λ is the learning rate, a positive constant defining the speed of the descent.
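
As a self-contained illustration of this update rule, separate from the Theano code below, here is one possible NumPy sketch with a toy quadratic cost (all names hypothetical):

import numpy as np

# Toy quadratic cost C(W) = ||W - W_target||^2; its gradient is 2 * (W - W_target)
W_target = np.array([1.0, -2.0])
W = np.zeros(2)
learning_rate = 0.13

for _ in range(100):
    g_W = 2 * (W - W_target)       # gradient of the cost with respect to W
    W = W - learning_rate * g_W    # step along the negative gradient

print(W)  # close to W_target: the cost has been minimized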

The following compiled function updates the variables after each feedforward run:

# Symbolic gradients of the cost with respect to the parameters
g_W = T.grad(cost=cost, wrt=W)
g_b = T.grad(cost=cost, wrt=b)

learning_rate = 0.13
index = T.lscalar()  # index of the mini-batch to train on

# Compiled function: one gradient descent step on the given mini-batch
train_model = theano.function(
    inputs=[index],
    outputs=[cost, error],
    updates=[(W, W - learning_rate * g_W),
             (b, b - learning_rate * g_b)],
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
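
For reference, here is a sketch of how the symbolic variables this function relies on might have been defined earlier; the shapes assume MNIST-style 28x28 images with 10 classes, so treat the exact definitions as assumptions rather than the chapter's definitive code:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')    # mini-batch of input images, one row per example
y = T.ivector('y')   # corresponding integer class labels

# Model parameters, stored in shared variables so they live on the GPU
W = theano.shared(np.zeros((28 * 28, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros((10,), dtype=theano.config.floatX), name='b')

# Class probabilities, negative log-likelihood cost, and misclassification error
model = T.nnet.softmax(T.dot(x, W) + b)
y_pred = T.argmax(model, axis=1)
cost = -T.mean(T.log(model)[T.arange(y.shape[0]), y])
error = T.mean(T.neq(y_pred, y))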

The input variable is the index of the mini-batch, since the whole dataset has been transferred to the GPU in one pass, stored in shared variables.
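
As a sketch of how such a one-time transfer is typically done (assuming train_set is a (data, labels) pair of NumPy arrays):

import numpy as np
import theano
import theano.tensor as T

# Copy the arrays once to the device; GPU shared variables hold floats,
# so integer labels are stored as floatX and cast back to int32 when used
train_set_x = theano.shared(np.asarray(train_set[0], dtype=theano.config.floatX), borrow=True)
train_set_y = T.cast(theano.shared(np.asarray(train_set[1], dtype=theano.config.floatX), borrow=True), 'int32')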

Training consists of presenting each mini-batch of samples to the model iteratively (iterations) and repeating the operation over the full dataset many times (epochs):

import numpy as np

n_epochs = 1000
print_every = 1000

n_train_batches = train_set[0].shape[0] // batch_size
n_iters = n_epochs * n_train_batches
train_loss = np.zeros(n_iters)
train_error = np.zeros(n_iters)

for epoch in range(n_epochs):
    for minibatch_index in range(n_train_batches):
        # Global iteration counter, used to index the history arrays
        iteration = minibatch_index + n_train_batches * epoch
        train_loss[iteration], train_error[iteration] = train_model(minibatch_index)
        # Print progress periodically
        if (epoch * train_set[0].shape[0] + minibatch_index) % print_every == 0:
            print('epoch {}, minibatch {}/{}, training error {:02.2f} %, training loss {}'.format(
                epoch,
                minibatch_index + 1,
                n_train_batches,
                train_error[iteration] * 100,
                train_loss[iteration]
            ))

This only reports the loss and error on one mini-batch, though. It would be good to also report the average over the whole dataset.
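
A simple way to compute that average, sketched here by reusing the variables defined above (the name test_model is hypothetical), is to compile the same graph without updates and average over all mini-batches:

# Evaluation function: same graph as train_model, but no parameter updates
test_model = theano.function(
    inputs=[index],
    outputs=[cost, error],
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

# Average the per-batch losses and errors over the full training set
losses, errors = zip(*[test_model(i) for i in range(n_train_batches)])
print('mean loss {}, mean error {:02.2f} %'.format(np.mean(losses), np.mean(errors) * 100))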

The error rate drops very quickly during the first iterations, then slows down.

Execution time on a laptop with a GeForce GTX 980M GPU is 67.3 seconds, while on an Intel i7 CPU it is 3 minutes and 7 seconds.

After a long while, the model converges to a 5.3 - 5.5% error rate. With a few more iterations, this could drop further, but could also lead to overfitting. Overfitting occurs when the model fits the training data well but does not achieve the same error rate on unseen data.

In this case, the model is too simple to overfit on this data.

A model that is too simple cannot learn very well. The principle of deep learning is to add more layers, that is, to increase the depth and build deeper networks, in order to gain better accuracy.

We'll see in the following section how to compute a better estimate of the model's accuracy and when to stop training.