Building Your First Neural Network
In this section, you will learn about the core representations and concepts of deep learning: forward propagation (the propagation of data through the network, multiplying the input values at each node by the weights of its connections), backpropagation (the calculation of the gradient of the loss function with respect to the weights), and gradient descent (the optimization algorithm that's used to find the minimum of the loss function).
We will not delve deeply into these concepts, as that isn't required for this book; however, this coverage will help anyone who wants to apply deep learning to a problem.
Then, we will move on to implementing neural networks using Keras. Also, we will stick to the simplest case, which is a neural network with a single hidden layer. You will learn how to define a model in Keras, choose the hyperparameters—the parameters of the model that are set before training the model—and then train your model. At the end of this section, you will have the opportunity to practice what you have learned by implementing a neural network in Keras so that you can perform classification on a dataset and observe how neural networks outperform simpler models such as logistic regression.
Logistic Regression to a Deep Neural Network
In Chapter 1, Introduction to Machine Learning with Keras, you learned about the logistic regression model, and then how to implement it as a sequential model using Keras in Chapter 2, Machine Learning versus Deep Learning. Technically speaking, logistic regression is equivalent to a very simple neural network consisting of just a single node/unit.
An overview of the logistic regression model with two-dimensional input can be seen in the following image. What you see in this image is called one node or unit in the deep learning world, which is represented by the green circle. As you may have noticed, there are some differences between logistic regression terminology and deep learning terminology. In logistic regression, we call the parameters of the model coefficients and intercepts. In deep learning models, the parameters are referred to as weights (w) and biases (b):
At each node/unit, the inputs are multiplied by some weights and then a bias term is added to the sum of these weighted inputs. This can be seen in the calculation above the node in the preceding image. The inputs are X1 and X2, the weights are W1 and W2, and the bias is b. Next, a nonlinear function (for example, a sigmoid function in the case of a logistic regression model) is applied to the sum of the weighted inputs and the bias to compute the final output of the node. In the calculation shown in the preceding image, this is σ. In deep learning, the nonlinear function is called the activation function and the output of the node is called the activation of that node.
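To make this concrete, the following minimal sketch (plain NumPy, with invented input and parameter values used purely for illustration) computes the output of a single node with two-dimensional input, exactly as described above:
import numpy as np
def sigmoid(z):
    # The sigmoid (logistic) activation function
    return 1 / (1 + np.exp(-z))
# Hypothetical input features and parameters for a single node
x = np.array([0.5, -1.2])   # inputs X1, X2
w = np.array([0.8, 0.3])    # weights W1, W2
b = 0.1                     # bias
# Weighted sum of the inputs plus the bias, then the activation
z = np.dot(x, w) + b
activation = sigmoid(z)
print(activation)           # the node's output, a value between 0 and 1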
It is possible to build a single-layer neural network by stacking logistic regression nodes/units on top of each other in a layer, as shown in the following image. Every value at the input layers, X1 and X2, is passed to all three nodes at the hidden layer:
It is also possible to build multi-layer neural networks by stacking multiple layers of processing nodes after one another, as shown in the following image. The following image shows a two-layer neural network with two-dimensional input:
The preceding two images show the most common way of representing a neural network. Every neural network consists of an input layer, an output layer, and one or many hidden layers. If there is only one hidden layer, the network is called a shallow neural network. On the other hand, neural networks with many hidden layers are called deep neural networks, and the process of training them is called deep learning.
Figure 3.2 shows a neural network with only one hidden layer, so this would be a shallow neural network, whereas the neural network in Figure 3.3 has two hidden layers, so it is a deep neural network. The input layers are generally on the left. In the case of Figure 3.3, these are features X1 and X2, and they are input into the first hidden layer, which has three nodes. The arrows represent the weight values that are applied to the input. At the second hidden layer, the result of the first hidden layer becomes the input to the second hidden layer. The arrows between the first and second hidden layers represent the weights. The output is generally the layer on the far right and, in the case of Figure 3.3, is represented by the layer labeled Y.
Note
In some resources, you may see that a network, such as the one shown in the preceding image, is referred to as a four-layer network. This is because the input and output layers are counted along with the hidden layers. However, the more common convention is to count only the hidden layers, so the network we mentioned previously will be referred to as a two-layer network.
In a deep learning setting, the number of nodes in the input layer is equal to the number of features of the input data, and the number of nodes in the output layer is equal to the number of dimensions of the output data. However, you need to select the number of nodes in the hidden layers, or the size of the hidden layers. If you choose a larger layer size, the model becomes more flexible and will be able to model more complex functions, but this increase in flexibility comes at the cost of needing more training data and more computation to train the model. The parameters that must be selected by the developer are called hyperparameters; they include the number of layers and the number of nodes in each layer, as well as the number of epochs to train for and the loss function to use.
In the next section, we will cover activation functions that are applied after each hidden layer.
Activation Functions
In addition to the size of the layer, you need to choose an activation function for each hidden layer that you add to the model, and also do the same for the output layer. We learned about the sigmoid activation function in the logistic regression model. However, there are more options for activation functions that you can choose from when building a neural network in Keras. For example, the sigmoid activation function is a good choice as the activation function on the output layer for binary classification tasks since the result of a sigmoid function is bounded between 0 and 1. Some commonly used activation functions for deep learning are sigmoid/logistic, tanh (hyperbolic tangent), and Rectified Linear Unit (ReLU).
The following image shows a sigmoid activation function:
The following image shows a tanh activation function:
The following image shows a ReLU activation function:
As you can see in Figures 3.4 and 3.5, the output of a sigmoid function is always between 0 and 1, and the output of tanh is always between -1 and 1. This makes tanh a better choice for hidden layers since it keeps the average of the outputs in each layer close to zero. In fact, sigmoid is generally only a good choice as the activation function of the output layer when building a binary classifier, since its output can be interpreted as the probability of a given input belonging to one class.
Therefore, tanh and ReLU are the most common choices of activation function for hidden layers. It turns out that the learning process is faster when using the ReLU activation function because it has a fixed derivative (or slope) for an input greater than 0, and a slope of 0 everywhere else.
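As a quick illustration, the following sketch (plain NumPy; the sample inputs are arbitrary) evaluates the three activation functions mentioned above so that you can compare their output ranges:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # output in (0, 1)
def tanh(z):
    return np.tanh(z)             # output in (-1, 1)
def relu(z):
    return np.maximum(0, z)       # 0 for negative inputs, identity otherwise
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))
print("relu:   ", relu(z))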
Note
You can read more about all the available choices for activation functions in Keras here: https://keras.io/activations/.
Forward Propagation for Making Predictions
Neural networks make a prediction about the output by performing forward propagation. Forward propagation entails the computations that are performed on the input in every layer of a neural network until the output layer is reached. It is best to understand forward propagation through an example.
Let's go through forward propagation equations one by one for a two-layer neural network (shown in the following image) where the input data is two-dimensional, and the output data is a one-dimensional binary class label. The activation functions for layer 1 and layer 2 will be tanh, and the activation function in the output layer is sigmoid.
The following image shows the weights and biases for each layer as matrices and vectors with the proper indexes. For each layer, the number of rows in the weight matrix is equal to the number of nodes in the previous layer, and the number of columns is equal to the number of nodes in that layer.
For example, W1 has two rows and three columns because the input to layer 1 is the input layer, X, which has two columns, and layer 1 has three nodes. Likewise, W2 has three rows and five columns because the input to layer 2 is layer 1, which has three nodes, and layer 2 has five nodes. The bias, however, is always a vector with a size equal to the number of nodes in that layer. The total number of parameters in a deep learning model is equal to the total number of elements in all the weight matrices and bias vectors:
An example of performing all the steps for forward propagation according to the neural network outlined in the preceding image is as follows.
Steps to perform forward propagation:
- X is the input to the network in the preceding image, so it is the input for the first hidden layer. First, the input matrix, X, is matrix-multiplied by the weight matrix for layer 1, W1, and the bias, b1, is added:
z1 = X*W1 + b1
- Next, the layer 1 output is computed by applying an activation function to z1, which is the output of the previous step:
a1 = tanh(z1)
- a1 is the output of layer 1 and is called the activation of layer 1. The output of layer 1 is, in fact, the input for layer 2. Next, the activation of layer 1 is matrix-multiplied by the weight matrix for layer 2, W2, and the bias, b2, is added:
z2 = a1 * W2 + b2
- The layer 2 output/activation is computed by applying an activation function to z2:
a2 = tanh(z2)
- The output of layer 2 is, in fact, the input for the next layer (the network output layer here). Following this, the activation of layer 2 is matrix-multiplied by the weight matrix for the output layer, W3, and the bias, b3, is added:
z3 = a2 * W3 + b3
- Finally, the network output, Y, is computed by applying the sigmoid activation function to z3:
Y = sigmoid(z3)
The total number of parameters in this model is equal to the sum of the number of elements in W1, W2, W3, b1, b2, and b3. Therefore, the number of parameters can be calculated by summing the number of elements in each of the weight matrices and bias vectors, which is equal to 6 + 15 + 5 + 3 + 5 + 1 = 35. These are the parameters that need to be learned in the process of deep learning.
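The following sketch reproduces these forward propagation steps in NumPy for the architecture described above (two inputs, hidden layers with three and five nodes, and one output node); the weights and biases here are randomly initialized purely for illustration:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
np.random.seed(42)
# Randomly initialized parameters, shaped as described in the text
W1, b1 = np.random.normal(size=(2, 3)), np.zeros(3)   # layer 1: 2 inputs -> 3 nodes
W2, b2 = np.random.normal(size=(3, 5)), np.zeros(5)   # layer 2: 3 nodes  -> 5 nodes
W3, b3 = np.random.normal(size=(5, 1)), np.zeros(1)   # output : 5 nodes  -> 1 node
X = np.random.normal(size=(4, 2))   # 4 example rows of two-dimensional input
# Forward propagation, step by step
z1 = X @ W1 + b1
a1 = np.tanh(z1)                    # layer 1 activation
z2 = a1 @ W2 + b2
a2 = np.tanh(z2)                    # layer 2 activation
z3 = a2 @ W3 + b3
Y = sigmoid(z3)                     # network output, one probability per example
# Total number of parameters: 6 + 15 + 5 + 3 + 5 + 1 = 35
n_params = sum(p.size for p in [W1, W2, W3, b1, b2, b3])
print(Y.shape, n_params)            # (4, 1) 35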
Now that we have learned about the forward propagation step, we have to evaluate our model and compare it to the real target values. To achieve that, we will use a loss function, which we will cover in the next section. Here, we will learn about some common loss functions that we can use for classification and regression tasks.
Loss Function
When learning the optimal parameters (weights and biases) of a model, we need to define a function to measure error. This function is called the loss function and it provides us with a measure of how different network-predicted outputs are from the real outputs in the dataset.
The loss function can be defined in several different ways, depending on the problem and the goal. For example, in the case of a classification problem, one common way to define loss is to compute the proportion of misclassified inputs in the dataset and use that as the probability of the model making an error. On the other hand, in the case of a regression problem, the loss function is usually defined by computing the distance between the predicted outputs and their corresponding real outputs, and then averaging over all the examples in the dataset.
Brief descriptions of some commonly used loss functions that are available in Keras are as follows:
- mean_squared_error is a loss function for regression problems that calculates (real output – predicted output)^2 for each example in the dataset and then returns their average.
- mean_absolute_error is a loss function for regression problems that calculates abs (real output – predicted output) for each example in the dataset and then returns their average.
- mean_absolute_percentage_error is a loss function for regression problems that calculates abs [(real output – predicted output) / real output] for each example in the dataset and then returns their average, multiplied by 100%.
- binary_crossentropy is a loss function for two-class/binary classification problems. In general, the cross-entropy loss is used for calculating the loss for models where the output is a probability number between 0 and 1.
- categorical_crossentropy is a loss function for multi-class (more than two classes) classification problems.
Note
You can read more about all the available choices for loss functions in Keras here: https://keras.io/losses/.
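To make the loss definitions listed above concrete, here is a minimal sketch (plain NumPy, with made-up target and prediction values) that computes the mean squared error, mean absolute error, and binary cross-entropy by hand; Keras' built-in losses follow the same formulas:
import numpy as np
y_true = np.array([1.0, 0.0, 1.0, 1.0])   # real outputs
y_pred = np.array([0.9, 0.2, 0.7, 0.4])   # predicted outputs
mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
# Binary cross-entropy: average of -[y*log(p) + (1-y)*log(1-p)]
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(mse, mae, bce)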
During the training process, we keep changing the model parameters until the minimum difference between the model-predicted outputs and the real outputs is reached. This is called an optimization process, and we will learn more about how it works in later sections. For neural networks, we use backpropagation to compute the derivatives of the loss function with respect to the weights.
Backpropagation for Computing Derivatives of Loss Function
Backpropagation is the process of performing the chain rule of calculus from the output layer to the input layer of a neural network in order to compute the derivatives of the loss function with respect to the model parameters in each layer. The derivative of a function is simply the slope of that function. We are interested in the slope of the loss function because it provides us with the direction in which model parameters need to change in order for the loss value to be minimized.
The chain rule of calculus states that if, for example, z is a function of y, and y is a function of x, then the derivative of z with respect to x can be reached by multiplying the derivative of z with respect to y by the derivative of y with respect to x. This can be written as follows:
dz/dx = dz/dy * dy/dx
In deep neural networks, the loss function is a function of predicted outputs. We can show this through the equation given here:
loss = L(y_predicted)
On the other hand, according to forward propagation equations, the output predicted by the model is a function of the model parameters—that is, the weights and biases in each layer. Therefore, according to the chain rule of calculus, the derivative of the loss with respect to the model parameters can be computed by multiplying the derivative of the loss with respect to the predicted output by the derivative of the predicted output with respect to the model parameters.
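As a small worked example of this chain of derivatives, the following sketch (a single sigmoid unit with a squared-error loss; the input, target, and parameter values are invented for illustration) computes the derivative of the loss with respect to a weight via the chain rule and checks it against a numerical estimate:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
x = np.array([0.5, -1.2])      # input
y = 1.0                        # real output
w = np.array([0.8, 0.3])       # weights
b = 0.1                        # bias
def loss(w, b):
    p = sigmoid(np.dot(x, w) + b)
    return (p - y) ** 2        # squared-error loss for this single example
# Chain rule: dL/dw = dL/dp * dp/dz * dz/dw
p = sigmoid(np.dot(x, w) + b)
dL_dp = 2 * (p - y)
dp_dz = p * (1 - p)            # derivative of the sigmoid
dz_dw = x
dL_dw = dL_dp * dp_dz * dz_dw
# Numerical check on the first weight
eps = 1e-6
numeric = (loss(w + np.array([eps, 0.0]), b) - loss(w, b)) / eps
print(dL_dw[0], numeric)       # the two values should agree closely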
In the next section, we will learn how the optimal weight parameters are modified when given the derivatives of the loss function with respect to the weights.
Gradient Descent for Learning Parameters
In this section, we will learn how a deep learning model learns its optimal parameters. Our goal is to update the weight parameters so that the loss function is minimized. This will be an iterative process in which we continue to update the weight parameters so that the loss function is at a minimum. This process is called learning parameters and it is done through the use of an optimization algorithm. One very common optimization algorithm that's used for learning parameters in machine learning is gradient descent. Let's see how gradient descent works.
If we plot the average of loss over all the examples in the dataset for all the possible values of the model parameters, it is usually a convex shape (such as the one shown in the following plot). In gradient descent, our goal is to find the minimum point (Pt) on the plot. The algorithm starts by initializing the model parameters with some random values (P1). Then, it computes the loss and the derivatives of the loss with respect to the parameters at that point. As we mentioned previously, the derivative of a function is, in fact, the slope of the function. After computing the slope at an initial point, we have the direction in which we need to update the parameters.
The hyperparameter, called the learning rate (alpha), determines how big a step the algorithm will take from the initial point. After selecting the proper alpha value, the algorithm updates the parameters from their initial values to the new values (shown as point P2 in the following plot). As shown in the following plot, P2 is closer to the target point, and if we keep moving in that direction, we will eventually get to the target point, Pt. The algorithm computes the slope of the function again at P2 and takes another step.
This process is repeated until the slope is equal to zero and therefore no direction for further movement is provided:
The pseudocode for the gradient descent algorithm is provided here:
Initialize all the weights (w) and biases (b) arbitrarily
Repeat Until converge {
Compute loss given w and b
Compute derivatives of loss with respect to w (dw), and with respect to b (db) using backpropagation
Update w to w – alpha * dw
Update b to b – alpha * db
}
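A minimal sketch of this loop is shown below, using a single sigmoid unit, a binary cross-entropy loss, and a tiny invented dataset; it follows the pseudocode above, with a fixed number of iterations standing in for the convergence check:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Toy binary classification data (two features, four examples)
X = np.array([[0.0, 0.5], [1.0, 1.5], [2.0, 0.2], [3.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
# Initialize the parameters arbitrarily
w, b = np.zeros(2), 0.0
alpha = 0.1                                 # learning rate
for step in range(1000):
    p = sigmoid(X @ w + b)                  # forward pass on the whole dataset
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
    dw = X.T @ (p - y) / len(y)             # derivative of the loss w.r.t. w
    db = np.mean(p - y)                     # derivative of the loss w.r.t. b
    w -= alpha * dw                         # parameter updates
    b -= alpha * db
print(loss, w, b)                           # the loss shrinks as training progresses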
To summarize, the following steps are repeated when training a deep neural network (after initializing the parameters to some random values):
- Use forward propagation and the current parameters to predict the outputs for the entire dataset.
- Use the predicted outputs to compute the loss over all the examples.
- Use backpropagation to compute the derivatives of the loss with respect to the weights and biases at each layer.
- Update the weights and biases using the derivative values and the learning rate.
What we discussed here was the standard gradient descent algorithm, which computes the loss and the derivatives using the entire dataset in order to update the parameters. There is another version of gradient descent called stochastic gradient descent (SGD), which computes the loss and the derivatives each time using a subset or a batch of data examples only; therefore, its learning process is faster than standard gradient descent.
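To illustrate the difference, the following sketch (again a single sigmoid unit and a tiny invented dataset) rewrites the gradient descent loop so that each parameter update uses only a mini-batch of examples rather than the whole dataset:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
X = np.array([[0.0, 0.5], [1.0, 1.5], [2.0, 0.2], [3.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b, alpha, batch_size = np.zeros(2), 0.0, 0.1, 2
for epoch in range(500):
    indices = np.random.permutation(len(y))        # shuffle the examples each epoch
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]  # a mini-batch of examples
        Xb, yb = X[batch], y[batch]
        p = sigmoid(Xb @ w + b)
        dw = Xb.T @ (p - yb) / len(yb)             # derivatives computed on the batch only
        db = np.mean(p - yb)
        w -= alpha * dw
        b -= alpha * db
print(w, b)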
Note
Another common choice is an optimization algorithm called Adam. Adam usually outperforms SGD when training deep learning models. As we've already learned, SGD uses a single hyperparameter (called a learning rate) to update the parameters. However, Adam improves this process by using a learning rate, a weighted average of gradients, and a weighted average of squared gradients to update the parameters at each iteration.
Usually, when building a neural network, you need to choose two hyperparameters (called the batch size and the number of epochs) for your optimization process. The batch_size argument determines the number of data examples to be included in each iteration of the optimization algorithm. Setting batch_size equal to the number of examples in the training data is equivalent to the standard version of gradient descent, which uses the entire dataset in each iteration. The epochs argument determines how many times the optimization algorithm passes through the entire training dataset before it stops.
For example, imagine we have a dataset of size n=400, and we choose batch_size=5 and epochs=20. In this case, the optimizer will have 400/5 = 80 iterations in one pass through the entire dataset. Since it is supposed to go through the entire dataset 20 times, it will have 80 * 20 = 1,600 iterations in total.
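A couple of lines of Python confirm this arithmetic (using the hypothetical values n=400, batch_size=5, and epochs=20 from the example above):
n, batch_size, epochs = 400, 5, 20
iterations_per_epoch = n // batch_size          # 400 / 5 = 80 parameter updates per pass
total_iterations = iterations_per_epoch * epochs
print(iterations_per_epoch, total_iterations)   # 80 1600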
Note
When building a model in Keras, you need to choose the type of optimizer to be used when training your model. There are some other options other than SGD and Adam available in Keras. You can read more about all the possible options for optimizers in Keras here: https://keras.io/optimizers/.
Note
All the activities and exercises in this chapter will be developed in a Jupyter notebook. Please download this book's GitHub repository, along with all the prepared templates, from https://packt.live/39pOUMT.
Exercise 3.01: Neural Network Implementation with Keras
In this exercise, you will learn the step-by-step process of implementing a neural network using Keras. Our simulated dataset represents various measurements of trees, such as height, the number of branches, the girth of the trunk at the base, and more, that are found in a forest. Our goal is to classify the records into either deciduous or coniferous type trees based on the measurements given. First, execute the following code block to load a simulated dataset of 10000 records that consist of two classes, representing the two tree species, where each data example has 10 feature values:
import numpy as np
import pandas as pd
X = pd.read_csv('../data/tree_class_feats.csv')
y = pd.read_csv('../data/tree_class_target.csv')
# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])
print("Possible Output Classes = ", np.unique(y))
Expected output:
Number of Examples in the Dataset = 10000
Number of Features for each example = 10
Possible Output Classes = [0 1]
Since each data example in this dataset can only belong to one of the two classes, this is a binary classification problem. Binary classification problems are very important and very common in real-life scenarios. For example, let's assume that the examples in this dataset represent the measurement results for 10000 trees from a forest. The goal is to build a model using this dataset to predict whether the species of each tree that's measured is a deciduous or coniferous species of tree. The 10 features for the trees can include predictors such as height, number of branches, and girth of the trunk at the base.
The output class 0 means that the tree is a coniferous species of tree, while the output class 1 means that the tree is a deciduous species of tree.
Now, let's go through the steps for building and training a Keras model to perform the classification:
- Set a seed in numpy and tensorflow and define your model as a Keras sequential model. Sequential models are, in fact, stacks of layers. After defining the model, we can add as many layers to it as desired:
from keras.models import Sequential
from tensorflow import random
np.random.seed(42)
random.set_seed(42)
model = Sequential()
- Add one hidden layer of size 10 with an activation function of type tanh to your model (remember that the input dimension is equal to 10). There are different types of layers available in Keras. For now, we will use only the simplest type of layer, called the Dense layer. A Dense layer is equivalent to the fully connected layers that we have seen in all the examples so far:
from keras.layers import Dense, Activation
model.add(Dense(10, activation='tanh', input_dim=10))
- Add another hidden layer, this time of size 5 and with an activation function of type tanh, to your model. Please note that the input dimension argument is only provided for the first layer since the input dimension for the next layers is known:
model.add(Dense(5, activation='tanh'))
- Add the output layer with the sigmoid activation function. Please note that the number of units in the output layer is equal to the output dimension:
model.add(Dense(1, activation='sigmoid'))
- Ensure that the loss function is binary cross-entropy and that the optimizer is SGD for training the model using the compile() method and print out a summary of the model to see its architecture:
model.compile(optimizer='sgd', loss='binary_crossentropy', \
metrics=['accuracy'])
model.summary()
The following image shows the output of the preceding code:
- Train your model for 100 epochs, with batch_size equal to 5, validation_split equal to 0.2, and shuffle set to False, using the fit() method. Remember that you need to pass the input data, X, and its corresponding outputs, y, to the fit() method to train the model. Also, keep in mind that training a network may take a long time, depending on the size of the dataset, the size of the network, the number of epochs, and the number of CPUs or GPUs available. Save the results to a variable named history:
history = model.fit(X, y, epochs=100, batch_size=5, \
verbose=1, validation_split=0.2, \
shuffle=False)
The verbose argument can take any of these three values: 0, 1, or 2. With verbose=0, no information is printed during training. verbose=1 prints a progress bar for each epoch that updates at every iteration, while verbose=2 prints only one summary line per epoch:
- Print the accuracy and loss of the model on the training and validation data as a function of the epoch:
import matplotlib.pyplot as plt
%matplotlib inline
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
The following image shows the output of the preceding code:
- Use your trained model to predict the output class for the first 10 input data examples (X.iloc[0:10,:]):
y_predicted = model.predict(X.iloc[0:10,:])
You can print the predicted classes using the following code block:
# print the predicted classes
print("Predicted probability for each of the "\
"examples belonging to class 1: "),
print(y_predicted)
print("Predicted class label for each of the examples: "),
print(np.round(y_predicted))
Expected output:
Predicted probability for each of the examples belonging to class 1:
[[0.00354007]
[0.8302744 ]
[0.00316998]
[0.95335543]
[0.99479216]
[0.00334176]
[0.43222323]
[0.00391936]
[0.00332899]
[0.99759173]]
Predicted class label for each of the examples:
[[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[1.]]
Here, we used the trained model to predict the output for the first 10 tree species in the dataset. As you can see, the second, fourth, fifth, and tenth trees were predicted to belong to class 1, which is deciduous.
Note
To access the source code for this specific section, please refer to https://packt.live/2YX3fxX.
You can also run this example online at https://packt.live/38pztVR.
Please note that you can extend these steps by adding more hidden layers to your network. In fact, you can add as many layers as you want to your model before adding the output layer, as sketched below. However, the input dimension argument is only provided for the first layer, since the input dimension for the subsequent layers is known.
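As an illustration (only a sketch; the sizes of the extra hidden layers here are arbitrary), a deeper variant of the model from this exercise could be defined as follows:
from keras.models import Sequential
from keras.layers import Dense
deeper_model = Sequential()
deeper_model.add(Dense(10, activation='tanh', input_dim=10))   # input_dim only on the first layer
deeper_model.add(Dense(8, activation='tanh'))    # additional hidden layers are stacked here
deeper_model.add(Dense(5, activation='tanh'))
deeper_model.add(Dense(1, activation='sigmoid')) # the output layer is still added last
deeper_model.compile(optimizer='sgd', loss='binary_crossentropy', \
metrics=['accuracy'])
deeper_model.summary()
Now that you have learned how to implement a neural network in Keras, you are ready to practice further by implementing a neural network that can perform classification in the following activity.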
Activity 3.01: Building a Single-Layer Neural Network for Performing Binary Classification
In this activity, we will use a Keras sequential model to build a binary classifier. The simulated dataset provided represents the testing results of the production of aircraft propellers. Our target variable will be the results of the manual inspection of the propellers, designated as either "pass" (represented as a value of 1) or "fail" (represented as a value of 0).
Our goal is to classify the testing results into either "pass" or "fail" classes to match the manual inspections. We will use models with different architectures and observe the visualization of the different models' performance. This will help you gain a better sense of how going from one processing unit to a layer of processing units changes the flexibility and performance of the model.
Assume that this dataset contains two features representing the results of two different tests performed on over 3,000 aircraft propellers (the two features are normalized to have a mean of zero). The output is whether the propeller passes the manual inspection, with 1 representing a pass and 0 representing a fail. The company would like to rely less on time-consuming, error-prone manual inspections of the aircraft propellers and shift resources toward developing automated tests that can assess the propellers faster. Therefore, the goal is to build a model that can predict whether an aircraft propeller will pass the manual inspection when given the results from the two tests. In this activity, you will first build a logistic regression model, then a single-layer neural network with three units, and finally a single-layer neural network with six units, to perform the classification. Follow these steps to complete this activity:
- Import the required packages:
# import required packages from Keras
from keras.models import Sequential
from keras.layers import Dense, Activation
import numpy as np
import pandas as pd
from tensorflow import random
from sklearn.model_selection import train_test_split
# import required packages for plotting
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import matplotlib.patches as mpatches
# import the function for plotting decision boundary
from utils import plot_decision_boundary
Note
You will need to download the utils.py file from the GitHub repository and save it into your activity folder in order for the utils import statement to work correctly. You can find the file here: https://packt.live/31EumPY.
- Set up a seed for a random number generator so that the results will be reproducible:
"""
define a seed for random number generator so the result will be reproducible
"""
seed = 1
Note
The triple-quotes ( """ ) shown in the code snippet above are used to denote the start and end points of a multi-line code comment. Comments are added into code to help explain specific bits of logic.
- Load the dataset using the read_csv function from the pandas library. Print the X and Y sizes and the number of examples in the training dataset using feats.shape, target.shape, and feats.shape[0]:
feats = pd.read_csv('outlier_feats.csv')
target = pd.read_csv('outlier_target.csv')
print("X size = ", feats.shape)
print("Y size = ", target.shape)
print("Number of examples = ", feats.shape[0])
- Plot the dataset using the following code:
plt.scatter(feats.iloc[:, 0], feats.iloc[:, 1], \
s=40, c=target.values.ravel(), cmap=plt.cm.Spectral)
- Implement a logistic regression model as a sequential model in Keras. Remember that the activation function for binary classification needs to be sigmoid.
- Train the model with optimizer='sgd', loss='binary_crossentropy', batch_size = 5, epochs = 100, and shuffle=False. Observe the loss values in each iteration by using verbose=1 and validation_split=0.2.
- Plot the decision boundary of the trained model using the following code:
plot_decision_boundary(lambda x: model.predict(x), \
X_train, y_train)
- Implement a single-layer neural network with three nodes in the hidden layer and the ReLU activation function for 200 epochs. It is important to remember that the activation function for the output layer still needs to be sigmoid since it is a binary classification problem. Choosing ReLU or having no activation function for the output layer will not produce outputs that can be interpreted as class labels. Train the model with verbose=1 and observe the loss in every iteration. After the model has been trained, plot the decision boundary and evaluate the loss and accuracy on the test dataset.
- Repeat step 8 for the hidden layer of size 6 and 400 epochs and compare the final loss value and the decision boundary plot.
- Repeat steps 8 and 9 using the tanh activation function for the hidden layer and compare the results with the models with relu activation. Which activation function do you think is a better choice for this problem?
Note
The solution for this activity can be found on page 362.
In this activity, you observed how stacking multiple processing units in a layer can create a much more powerful model than a single processing unit. This is the basic reason why neural networks are such powerful models. You also observed that increasing the number of units in the layer increases the flexibility of the model, meaning a non-linear separating decision boundary can be estimated more precisely.
However, a model with more processing units takes longer to learn the patterns, requires more epochs to be trained, and can overfit the training data. As such, neural networks are computationally expensive models. You also observed that using the tanh activation function results in a slower training process in comparison to using the ReLU activation function.
In this section, we created various models and trained them on our data. We observed that some models performed better than others by evaluating them on the data that they were trained on. In the next section, we will learn about some alternative methods for evaluating our models that provide an unbiased evaluation.