To train a neural network using backpropagation, we use an optimization technique called gradient descent. The objective function we will minimize is the negative log-likelihood, which can be computed in Numpy as follows:

def cost(T, Y):
    return -(T * np.log(Y)).sum()
Gradient descent requires us to travel in the direction opposite the gradient of the objective function with respect to the weights W and V until we reach a minimum. This is the same optimization approach used in logistic regression; backpropagation is simply the term used to describe gradient descent in a neural network. To ensure we find the minimum, we must take small steps. If the steps are too large, we may overshoot and end up on the other side of the "canyon".
Backpropagation is the process of updating the weights in a neural network in order to minimize the error function. This is done by adjusting each weight in the direction opposite the gradient of the error function with respect to that weight. This can be written mathematically as: weight = weight - learning_rate * gradient_of_J_wrt_weight, or in more formal terms, w = w - learning_rate * dJ/dw, where the learning rate is typically a very small number (although it can itself be tuned). To understand the process better, one can try to optimize a function with a known solution, such as a quadratic.
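As a sanity check, here is a minimal sketch of this update rule on a quadratic with a known solution. The function J(w) = (w - 3)^2 and the starting point are illustrative choices, not from the text; its minimum is at w = 3:

```python
def gradient_descent_quadratic():
    # Minimize J(w) = (w - 3)^2, whose gradient is dJ/dw = 2 * (w - 3)
    w = 0.0                 # arbitrary starting point
    learning_rate = 0.1
    for _ in range(100):
        grad = 2 * (w - 3)
        w = w - learning_rate * grad   # w = w - learning_rate * dJ/dw
    return w

print(gradient_descent_quadratic())  # approaches 3.0
```

With learning_rate = 0.1, the distance to the minimum shrinks by a factor of 0.8 each step; a much larger learning rate would make the iterates oscillate or diverge, which is the "other side of the canyon" problem described above.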
The term backpropagation comes from considering a single-node-per-layer representation of a neural network, such as:

o --W--> o --V--> o
x        z        y

where x is the input layer, z is the hidden layer, and y is the output layer. This helps to visualize how the gradients are calculated as the network gets deeper.
The error of a weight will always depend on the errors at the nodes to its immediate right (which themselves depend on the errors to their right, and so on). This graphical/recursive structure is what allows computing libraries like Theano and TensorFlow to automatically calculate gradients for you.
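To make that recursion concrete, here is a minimal numerical sketch (the values and the squared-error cost are illustrative assumptions, not from the text) of the chain x --W--> z --V--> y above. Notice that the gradient for W reuses the error already computed at the node to its right, and the result matches a finite-difference approximation:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# A one-unit-per-layer chain: x -W-> z -V-> y, with target t
x, W, V, t = 0.5, 0.8, -1.2, 1.0   # hypothetical values
z = sigmoid(W * x)
y = sigmoid(V * z)
J = 0.5 * (y - t) ** 2             # squared error, for simplicity

# Error at the output node:
delta_y = (y - t) * y * (1 - y)
# Gradient for V depends only on the error at the node to its right:
dJ_dV = delta_y * z
# Error at the hidden node is delta_y propagated backwards through V:
delta_z = delta_y * V * z * (1 - z)
# Gradient for W depends only on the error at the node to its right:
dJ_dW = delta_z * x

# Check dJ/dW against a finite-difference approximation:
eps = 1e-6
z_eps = sigmoid((W + eps) * x)
J_eps = 0.5 * (sigmoid(V * z_eps) - t) ** 2
print(dJ_dW, (J_eps - J) / eps)    # the two values agree closely
```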
To exercise the use of gradient descent, we can optimize the following functions:
Maximize J = log(x) + log(1-x), 0 < x < 1
Maximize J = sin(x), 0 < x < pi
Minimize J = 1 - x^2 - y^2, 0 <= x <= 1, 0 <= y <= 1, x + y = 1
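As a worked example, here is a sketch of the first exercise (the starting point and learning rate are my own choices). For J = log(x) + log(1 - x) we have dJ/dx = 1/x - 1/(1 - x), which is zero at x = 0.5; since we are maximizing, we step *along* the gradient (gradient ascent):

```python
def maximize_log_objective():
    # Maximize J(x) = log(x) + log(1 - x) on 0 < x < 1
    x = 0.1                 # arbitrary starting point inside (0, 1)
    learning_rate = 0.01
    for _ in range(1000):
        grad = 1 / x - 1 / (1 - x)    # dJ/dx
        x = x + learning_rate * grad  # ascent: add the gradient
    return x

print(maximize_log_objective())  # approaches 0.5
```

The same code with a minus sign in the update would perform gradient descent, which is what we use when the objective is a cost rather than a likelihood.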
Before we look into Theano and TensorFlow, we can set up a neural network in pure Numpy and Python. As in the previous examples, we create simple data that is easy to visualize and run our algorithm on it. The following code does that:
import numpy as np
import matplotlib.pyplot as plt

def forward(X, W1, b1, W2, b2):
    # sigmoid hidden layer
    Z = 1 / (1 + np.exp(-(X.dot(W1) + b1)))
    # softmax output layer
    A = Z.dot(W2) + b2
    expA = np.exp(A)
    Y = expA / expA.sum(axis=1, keepdims=True)
    return Y, Z

# functions to compute the classification rate and the
# derivatives with respect to W2, W1, b2, and b1
def classification_rate(Y, P):
    n_correct = 0
    n_total = 0
    for i in range(len(Y)):
        n_total += 1
        if Y[i] == P[i]:
            n_correct += 1
    return float(n_correct) / n_total

def derivative_w2(Z, T, Y):
    return Z.T.dot(T - Y)

def derivative_w1(X, Z, T, Y, W2):
    dZ = (T - Y).dot(W2.T) * Z * (1 - Z)
    return X.T.dot(dZ)

def derivative_b2(T, Y):
    return (T - Y).sum(axis=0)

def derivative_b1(T, Y, W2, Z):
    return ((T - Y).dot(W2.T) * Z * (1 - Z)).sum(axis=0)

def cost(T, Y):
    tot = T * np.log(Y)
    return tot.sum()

def main():
    # create the data: 3 Gaussian clouds, one per class
    Nclass = 500
    D = 2  # dimensionality of the input
    M = 3  # hidden layer size
    K = 3  # number of classes
    X = np.vstack([
        np.random.randn(Nclass, D) + np.array([0, -2]),
        np.random.randn(Nclass, D) + np.array([2, 2]),
        np.random.randn(Nclass, D) + np.array([-2, 2]),
    ])
    Y = np.array([0]*Nclass + [1]*Nclass + [2]*Nclass)
    N = len(Y)

    # one-hot encode the targets
    T = np.zeros((N, K))
    for i in range(N):
        T[i, Y[i]] = 1

    plt.scatter(X[:, 0], X[:, 1], c=Y, s=100, alpha=0.5)
    plt.show()

    # randomly initialize the weights
    W1 = np.random.randn(D, M)
    b1 = np.random.randn(M)
    W2 = np.random.randn(M, K)
    b2 = np.random.randn(K)

    learning_rate = 10e-7
    costs = []
    # iterate through the neural network 100,000 times
    for epoch in range(100000):
        output, hidden = forward(X, W1, b1, W2, b2)
        c = cost(T, output)
        P = np.argmax(output, axis=1)
        r = classification_rate(Y, P)
        # print cost and classification rate every 100 iterations
        if epoch % 100 == 0:
            print("cost:", c, "classification_rate:", r)
        costs.append(c)

        # use gradient ascent to update weights and biases;
        # compute all gradients first, since the W1 and b1
        # gradients depend on the current value of W2
        gW2 = derivative_w2(hidden, T, output)
        gb2 = derivative_b2(T, output)
        gW1 = derivative_w1(X, hidden, T, output, W2)
        gb1 = derivative_b1(T, output, W2, hidden)
        W2 += learning_rate * gW2
        b2 += learning_rate * gb2
        W1 += learning_rate * gW1
        b1 += learning_rate * gb1

    # plot the cost to track progress
    plt.plot(costs)
    plt.show()

if __name__ == '__main__':
    main()
The code above runs a neural network with one hidden layer. We begin by defining the forward function, which calculates the output of the network given the input, weights, and biases. Next, we define the cost function, which calculates the log-likelihood of the targets given the output. The classification rate is then calculated to measure how well the model is performing. After that, we define the derivatives of the weights and bias terms to be used in gradient ascent (ascending the log-likelihood is equivalent to descending the negative log-likelihood). Finally, we iterate through the network 100,000 times and plot the cost at each step to track our progress.
The code above is a complete example of using backpropagation to find good weights and biases in a neural network. Note that we have named the targets T and the output of the neural network Y. Backpropagation propagates the "error" backwards through the network, and in our code this shows up as the (T - Y) term appearing in every gradient.
We are now looping through a number of "epochs", which means we repeatedly show the neural network the same samples again and again. To put this into practice, we could use the code above on the MNIST dataset, or any other dataset we might choose. We would need to include the bias terms, either explicitly as above or by adding a column of 1s to the matrices X and Z.
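Here is a minimal sketch of the column-of-1s trick (the shapes and values are illustrative, not from the text): appending a constant-1 column to X and an extra row to W1 lets that last row play the role of the bias vector, so the explicit + b1 can be dropped.

```python
import numpy as np

np.random.seed(0)
N, D, M = 4, 2, 3
X = np.random.randn(N, D)
W1 = np.random.randn(D, M)
b1 = np.random.randn(M)

# explicit bias, as in the code above:
A_explicit = X.dot(W1) + b1

# column-of-1s trick: augment X with a 1s column and absorb b1 into W1
X_aug = np.hstack([X, np.ones((N, 1))])  # shape (N, D+1)
W1_aug = np.vstack([W1, b1])             # shape (D+1, M); last row acts as the bias
A_trick = X_aug.dot(W1_aug)

print(np.allclose(A_explicit, A_trick))  # True
```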
In addition to printing the cost, it would also be useful to print the classification rate or error rate. We can then observe whether a lower cost always guarantees a lower error rate.