Deep Learning with Theano
Christopher Bourez
Training
In order to get a good measure of how the model behaves on data that's unseen during training, the validation dataset is used to compute a validation loss and accuracy during training.
The validation dataset enables us to choose the best model, while the test dataset is only used at the end to get the final test accuracy/error of the model. The training, validation, and test datasets are disjoint, with no common examples. The validation dataset is usually 10 times smaller than the training dataset, to slow the training process as little as possible. The test dataset is usually around 10-20% of the training dataset. Both the training and validation datasets are part of the training program, since the first is used to learn and the second to select the best model on data unseen at training time.
The test dataset is completely outside the training process and is used to get the accuracy of the produced model, resulting from training and model selection.
If the model overfits the training set because it has been trained too many times on the same images, for example, then the validation and test sets will not suffer from this behavior and will provide a real estimation of the model's accuracy.
Usually, a validation function is compiled without the gradient updates of the model, to compute only the cost and error on the input batch.
Batches of data (x, y) are commonly transferred to the GPU at every iteration because the dataset is usually too big to fit in the GPU's memory. In this case, we could still use the shared-variable trick to place the whole validation dataset in the GPU's memory, but let's see how we would proceed if we had to transfer the batches to the GPU at each step, without that trick. We would use the more usual form:
validate_model = theano.function(
    inputs=[x, y],
    outputs=[cost, error]
)
It requires the transfer of batch inputs. Validation is computed not at every iteration, but every validation_interval iterations of the training for loop:
if iteration % validation_interval == 0:
    val_index = iteration // validation_interval
    valid_loss[val_index], valid_error[val_index] = np.mean([
        validate_model(
            valid_set[0][i * batch_size: (i + 1) * batch_size],
            numpy.asarray(valid_set[1][i * batch_size: (i + 1) * batch_size], dtype="int32")
        ) for i in range(n_valid_batches)
    ], axis=0)
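For comparison, here is a minimal sketch of the shared-variable trick mentioned above, assuming x, y, cost, and error are the symbolic variables and expressions already defined for the model, and that the whole validation set fits in GPU memory:

import numpy
import theano
import theano.tensor as T

# Alternative: keep the whole validation set in GPU memory as shared
# variables, so no host-to-device transfer happens at validation time.
valid_set_x = theano.shared(numpy.asarray(valid_set[0],
                            dtype=theano.config.floatX), borrow=True)
valid_set_y = theano.shared(numpy.asarray(valid_set[1], dtype='int32'),
                            borrow=True)

index = T.lscalar('index')  # validation minibatch index
validate_model = theano.function(
    inputs=[index],
    outputs=[cost, error],
    givens={
        x: valid_set_x[index * batch_size: (index + 1) * batch_size],
        y: valid_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

With this version, the loop above would call validate_model(i) instead of slicing and transferring the arrays itself.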
Let's look at the first, simple model:
epoch 0, minibatch 1/83, validation error 40.05 %, validation loss 2.16520105302
epoch 24, minibatch 9/83, validation error 8.16 %, validation loss 0.288349323906
epoch 36, minibatch 13/83, validation error 7.96 %, validation loss 0.278418215923
epoch 48, minibatch 17/83, validation error 7.73 %, validation loss 0.272948684171
epoch 60, minibatch 21/83, validation error 7.65 %, validation loss 0.269203903154
epoch 72, minibatch 25/83, validation error 7.59 %, validation loss 0.26624627877
epoch 84, minibatch 29/83, validation error 7.56 %, validation loss 0.264540277421
...
epoch 975, minibatch 76/83, validation error 7.10 %, validation loss 0.258190142922
epoch 987, minibatch 80/83, validation error 7.09 %, validation loss 0.258411859162
In a full training program, a validation interval corresponding to the number of batches per epoch, with an average validation score for the epoch, would make more sense.
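As a rough sketch, assuming train_set[0] is the array of training examples (giving 83 batches per epoch in the logs above):

# Validate once per epoch by tying the interval to the number of
# training batches per epoch.
n_train_batches = train_set[0].shape[0] // batch_size
validation_interval = n_train_batches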
To better estimate how the training performs, let's plot the training and validation loss. To make the descent in the early iterations visible, I'll stop the plot at 100 iterations; with 1,000 iterations plotted, the early iterations would not be discernible:
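A minimal matplotlib sketch for such a plot, assuming train_loss is an array with one value per training iteration and valid_loss is indexed as in the validation loop above:

import numpy
import matplotlib.pyplot as plt

n = 100  # plot only the first 100 iterations to keep the early descent visible
plt.plot(numpy.arange(n), train_loss[:n], label='training loss')
# valid_loss[k] was computed at iteration k * validation_interval
val_iters = numpy.arange(len(valid_loss)) * validation_interval
mask = val_iters < n
plt.plot(val_iters[mask], numpy.asarray(valid_loss)[mask], label='validation loss')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.legend()
plt.show()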
(Figure: training and validation loss for the first model, over the first 100 iterations.)
The training loss looks like a wide band because it oscillates between different values, each value corresponding to one batch. The batch might be too small to provide a stable loss value. The mean training loss over each epoch would provide a more stable value to compare with the validation loss, and make overfitting visible.
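For instance, a sketch of that epoch average, assuming n_epochs epochs of n_train_batches iterations each were recorded in train_loss:

import numpy

# Average the per-batch training losses over each epoch: a smoother
# curve to compare against the validation loss.
train_loss_epoch = numpy.mean(
    numpy.asarray(train_loss[:n_epochs * n_train_batches])
         .reshape(n_epochs, n_train_batches),
    axis=1)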
Also note that loss plots tell us how the network converges, but give no direct information on the error. So it is also very important to plot the training error and the validation error.
For the second model:
epoch 0, minibatch 1/83, validation error 41.25 %, validation loss 2.35665753484
epoch 24, minibatch 9/83, validation error 10.20 %, validation loss 0.438846310601
epoch 36, minibatch 13/83, validation error 9.40 %, validation loss 0.399769391865
epoch 48, minibatch 17/83, validation error 8.85 %, validation loss 0.379035864025
epoch 60, minibatch 21/83, validation error 8.57 %, validation loss 0.365624915808
epoch 72, minibatch 25/83, validation error 8.31 %, validation loss 0.355733696371
epoch 84, minibatch 29/83, validation error 8.25 %, validation loss 0.348027150147
epoch 96, minibatch 33/83, validation error 8.01 %, validation loss 0.34150374867
epoch 108, minibatch 37/83, validation error 7.91 %, validation loss 0.335878048092
...
epoch 975, minibatch 76/83, validation error 2.97 %, validation loss 0.167824191041
epoch 987, minibatch 80/83, validation error 2.96 %, validation loss 0.167092795949
Again, the training curves give better insights:
(Figure: training and validation loss curves for the second model.)

For the third model, the CNN, the validation results are:
epoch 0, minibatch 1/83, validation error 53.81 %, validation loss 2.29528842866
epoch 24, minibatch 9/83, validation error 1.55 %, validation loss 0.048202780541
epoch 36, minibatch 13/83, validation error 1.31 %, validation loss 0.0445762014715
epoch 48, minibatch 17/83, validation error 1.29 %, validation loss 0.0432346871821
epoch 60, minibatch 21/83, validation error 1.25 %, validation loss 0.0425786205451
epoch 72, minibatch 25/83, validation error 1.20 %, validation loss 0.0413943211024
epoch 84, minibatch 29/83, validation error 1.20 %, validation loss 0.0416557886347
epoch 96, minibatch 33/83, validation error 1.19 %, validation loss 0.0414686980075
...
epoch 975, minibatch 76/83, validation error 1.08 %, validation loss 0.0477593478863
epoch 987, minibatch 80/83, validation error 1.08 %, validation loss 0.0478142946085
(Figure: training and validation loss curves for the CNN model.)
Here we see a gap between the training and validation losses, due either to slight overfitting to the training data, or to a difference between the training and validation datasets.
The main causes of overfitting are as follows:
- Too small a dataset: Collect more data
- Too high a learning rate: The network is learning too quickly on earlier examples
- A lack of regularization: Add more dropout (see next section), or a penalty on the norm of the weights in the loss function (a sketch follows this list)
- Too small a model: Increase the number of filters/units in different layers
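As an illustration of the weight-norm penalty, here is a sketch of an L2 penalty added to the cost; the weights list and the coefficient value are assumptions for the example, not part of the original code:

# Hypothetical L2 penalty: add the squared norm of the weight matrices
# to the cost being minimized; the coefficient is illustrative.
l2_lambda = 1e-4
l2_penalty = sum((W ** 2).sum() for W in weights)
cost = cost + l2_lambda * l2_penalty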
Validation loss and error give a better estimate than the training loss and error, which are noisier; during training, they are also used to decide which model parameters are the best:
- Simple model: 6.96 % at epoch 518
- MLP model: 2.96 % at epoch 987
- CNN model: 1.06 % at epoch 722
These results also indicate that the models might not improve much with further training.
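A minimal sketch of how this selection can be done inside the training loop, assuming params is the list of the model's shared parameter variables and best_valid_error starts at infinity:

# Keep a copy of the parameters that achieve the lowest validation
# error seen so far.
if valid_error[val_index] < best_valid_error:
    best_valid_error = valid_error[val_index]
    best_params = [p.get_value(borrow=False) for p in params]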
Here's a comparison of the three models' validation losses:
(Figure: comparison of the validation losses of the three models.)
Note that the MLP is still improving and the training has not finished, while the CNN and simple networks have converged.
With the selected model, you can then compute the test loss and error on the test dataset to finalize the results.
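As a sketch, this mirrors the validation loop shown earlier, using the inputs=[x, y] version of validate_model; test_set and n_test_batches are the assumed analogues of the validation ones:

# Final evaluation: average loss and error over all test batches.
test_loss, test_error = np.mean([
    validate_model(
        test_set[0][i * batch_size: (i + 1) * batch_size],
        numpy.asarray(test_set[1][i * batch_size: (i + 1) * batch_size], dtype="int32")
    ) for i in range(n_test_batches)
], axis=0)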
The last important concept of machine learning is hyperparameter tuning. A hyperparameter is a parameter of the model that is not learned during training. Here are some examples:
- The learning rate
- The number of hidden neurons
- The batch size
For the learning rate, too slow a descent might prevent the search from finding a better minimum, while too fast a descent damages the final convergence. Finding the best initial learning rate is crucial. It is then common to decrease the learning rate after many iterations, for more precise fine-tuning of the model.
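For example, a simple step-decay schedule could look like this; the initial rate and decay values are illustrative, not from the original text:

# Step decay: halve the learning rate every decay_every epochs for
# finer fine-tuning late in training.
initial_learning_rate = 0.1
decay_every = 100
learning_rate = initial_learning_rate * (0.5 ** (epoch // decay_every))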
Hyperparameter selection requires rerunning the previous training many times, with different values of the hyperparameters; testing all combinations of hyperparameters can be done with a simple grid search, for example.
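A minimal grid-search sketch; train_and_validate is a hypothetical helper that runs one full training and returns the best validation error for the given settings, and the value lists are illustrative:

import itertools

# Try every combination of hyperparameter values and keep the best.
grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'n_hidden': [100, 500],
    'batch_size': [32, 600],
}
best = None
for lr, n_hidden, bs in itertools.product(
        grid['learning_rate'], grid['n_hidden'], grid['batch_size']):
    err = train_and_validate(learning_rate=lr, n_hidden=n_hidden, batch_size=bs)
    if best is None or err < best[0]:
        best = (err, lr, n_hidden, bs)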
Here is an exercise for the reader:
- Train the models with different hyperparameters and draw the training loss curves to see how hyperparameters influence the final loss.
- Visualize the content of the neurons of the first layer, once the model has been trained, to see what the features capture from the input image. For this task, compile a specific visualization function:
visualize_layer1 = theano.function(
    inputs=[x, y],
    outputs=conv1_out
)
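A possible usage sketch, assuming conv1_out is the symbolic output of the first convolutional layer, with shape (batch, filters, height, width):

import numpy
import matplotlib.pyplot as plt

# Run one batch through the compiled function and display the first
# feature map of the first example.
feature_maps = visualize_layer1(
    train_set[0][:batch_size],
    numpy.asarray(train_set[1][:batch_size], dtype='int32'))
plt.imshow(feature_maps[0, 0], cmap='gray')
plt.show()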