- Deep Learning with Theano
- Christopher Bourez
Multiple layer model
A multi-layer perceptron (MLP) is a feedforward net with multiple layers. A second linear layer, called the hidden layer, is added to the previous example.

Having two linear layers following each other is equivalent to having a single linear layer.
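To see why, here is a small NumPy sketch (the variable names are illustrative, not taken from the book's code) showing that stacking two linear layers computes the same function as a single linear layer with merged parameters:
import numpy as np

# two stacked linear layers: h = x.W1 + b1, then y = h.W2 + b2
x = np.random.randn(4, 3)
W1, b1 = np.random.randn(3, 5), np.random.randn(5)
W2, b2 = np.random.randn(5, 2), np.random.randn(2)
y_two_layers = (x.dot(W1) + b1).dot(W2) + b2

# the same mapping as one linear layer with merged weights and bias
W, b = W1.dot(W2), b1.dot(W2) + b2
y_one_layer = x.dot(W) + b
print(np.allclose(y_two_layers, y_one_layer))  # True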
With a non-linear function (also called a non-linearity or transfer function) between the two linear layers, the model no longer simplifies into a linear one, and can represent many more functions, capturing more complex patterns in the data.

Activation functions help the output saturate (ON-OFF behavior) and mimic the activations of biological neurons.
The Rectified Linear Unit (ReLU) is given as follows:
(x + T.abs_(x)) / 2.0

The Leaky Rectifier Linear Unit (Leaky ReLU) is given as follows:
( (1 + leak) * x + (1 - leak) * T.abs_(x) ) / 2.0

Here, leak is a parameter that defines the slope for negative values. In leaky rectifiers, this parameter is fixed. The activation named PReLU considers the leak parameter to be learned.
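As an illustration, the rectifier expressions above can be compiled with theano.function, and a PReLU is obtained by turning the leak into a shared variable so that it receives a gradient like any other parameter. This is only a sketch; the names relu, leaky_relu, and alpha are illustrative and not part of the book's code:
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
relu = theano.function([x], (x + T.abs_(x)) / 2.0)
leak = 0.01
leaky_relu = theano.function([x], ((1 + leak) * x + (1 - leak) * T.abs_(x)) / 2.0)

# PReLU: the leak (alpha) is a shared variable learned by gradient descent;
# it would simply be appended to the params list passed to T.grad
alpha = theano.shared(np.float32(0.25), name='alpha')
prelu_output = T.maximum(x, 0) + alpha * T.minimum(x, 0)

sample = np.array([[-2., 3.]], dtype=theano.config.floatX)
print(relu(sample))        # [[ 0.  3.]]
print(leaky_relu(sample))  # [[-0.02  3.]]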
More generally speaking, a piecewise linear activation can be learned by adding a linear layer followed by a maxout activation of n_pool units:
T.max([x[:, n::n_pool] for n in range(n_pool)], axis=0)
This will output n_pool
values or units for the underlying learned linearities:
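A minimal sketch of how this reduces the dimension (assuming the preceding linear layer outputs n_pool times more units than needed; the names h and f are illustrative):
import numpy as np
import theano
import theano.tensor as T

n_pool = 4
h = T.matrix('h')  # output of a linear layer with (n_units * n_pool) columns
maxout = T.max([h[:, n::n_pool] for n in range(n_pool)], axis=0)
f = theano.function([h], maxout)

sample = np.random.randn(2, 8).astype(theano.config.floatX)
print(f(sample).shape)  # (2, 2): each output is the max over n_pool learned linearities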
The Sigmoid function is given as:
T.nnet.sigmoid(x)

HardSigmoid function is given as:
T.clip(X + 0.5, 0., 1.)

HardTanh function is given as:
T.clip(X, -1., 1.)

The Tanh function is given as:
T.tanh(x)

This two-layer network model written in Python will be as follows:
batch_size = 600
n_in = 28 * 28
n_hidden = 500
n_out = 10

def shared_zeros(shape, dtype=theano.config.floatX, name='', n=None):
    shape = shape if n is None else (n,) + shape
    return theano.shared(np.zeros(shape, dtype=dtype), name=name)

def shared_glorot_uniform(shape, dtype=theano.config.floatX, name='', n=None):
    if isinstance(shape, int):
        high = np.sqrt(6. / shape)
    else:
        high = np.sqrt(6. / (np.sum(shape[:2]) * np.prod(shape[2:])))
    shape = shape if n is None else (n,) + shape
    return theano.shared(np.asarray(
        np.random.uniform(
            low=-high, high=high, size=shape),
        dtype=dtype), name=name)

# first layer: linear transform followed by a tanh non-linearity
W1 = shared_glorot_uniform((n_in, n_hidden), name='W1')
b1 = shared_zeros((n_hidden,), name='b1')
hidden_output = T.tanh(T.dot(x, W1) + b1)

# second layer: linear transform followed by a softmax over the 10 classes
W2 = shared_zeros((n_hidden, n_out), name='W2')
b2 = shared_zeros((n_out,), name='b2')
model = T.nnet.softmax(T.dot(hidden_output, W2) + b2)

params = [W1, b1, W2, b2]
In deep nets, if weights are initialized to zero with the shared_zeros method, the signal will not flow through the network correctly from end to end. If weights are initialized with values that are too large, most activation functions saturate after a few steps. So, we need to ensure that the values can be passed to the next layer during forward propagation, and that the gradients can be passed to the previous layer during back-propagation.
We also need to break the symmetry between neurons. If the weights of all neurons are zero (or if they are all equal), they will all evolve exactly in the same way, and the model will not learn much.
The researcher Xavier Glorot studied an algorithm to initialize weights in an optimal way. It consists of drawing the weights from a Gaussian or uniform distribution with zero mean and the following variance:

v = 2 / (n_in + n_out)
Here are the variables from the preceding formula:
- n_in is the number of inputs the layer receives during feedforward propagation
- n_out is the number of gradients the layer receives during back-propagation
In the case of a linear model, the shape parameter is a tuple, and n_in + n_out is simply numpy.sum(shape[:2]) (in this case, numpy.prod(shape[2:]) is 1).
The variance of a uniform distribution on [-a, a] is a**2 / 3, so the bound a can be computed as follows:

a = sqrt(3 * v) = sqrt(6 / (n_in + n_out))
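As a quick sanity check, the shared_glorot_uniform helper defined above can be compared against this target variance (a sketch; the exact empirical value varies with the random draw):
W = shared_glorot_uniform((n_in, n_hidden)).get_value()
print(W.var())                  # empirical variance of the sampled weights
print(2.0 / (n_in + n_hidden))  # target variance 2 / (n_in + n_out)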
The cost can be defined the same way as before, but the gradient descent needs to be adapted to deal with the list of parameters, [W1, b1, W2, b2]:
g_params = T.grad(cost=cost, wrt=params)
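As a reminder, the cost is the same mean negative log-likelihood as in the previous single-layer example; it would look like the following sketch, assuming y holds the integer class labels and model is the softmax output defined above:
cost = -T.mean(T.log(model)[T.arange(y.shape[0]), y])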
The training loop requires an updated training function:
learning_rate = 0.01
updates = [
    (param, param - learning_rate * gparam)
    for param, gparam in zip(params, g_params)
]

train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
In this case, the learning rate is global to the net, with all weights being updated at the same rate. The learning rate is set to 0.01 instead of 0.13. We'll speak about hyperparameter tuning in the training section.
The training loop remains unchanged. The full code is given in the 2-multi.py file.
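For completeness, here is a minimal sketch of such a loop; n_train_batches (the number of minibatches in the training set) and n_epochs are assumptions standing in for the values used in 2-multi.py:
n_epochs = 1000
for epoch in range(n_epochs):
    for minibatch_index in range(n_train_batches):
        avg_cost = train_model(minibatch_index)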
Execution time on the GPU is 5 minutes and 55 seconds, while on the CPU it is 51 minutes and 36 seconds.
After 1,000 iterations, the error has dropped to 2%, which is a lot better than the previous 5% error rate, but part of it might be due to overfitting. We'll compare the different models later.