
Getting Output from a Neural Network

To get output from a neural network, we need some data to work with. A great resource for this is Kaggle, which hosts the MNIST handwritten digit dataset. The data is represented as a matrix of inputs X and a set of labels or targets Y.

Each sample (pair of x and y) is represented as a vector of real numbers for x and a categorical variable (often just 0, 1, 2, etc.) for y. The matrix X has N rows, representing the number of samples, and D columns, representing the dimensionality of each input. For example, in the MNIST dataset, D = 784 = 28 x 28, because the original images are “flattened” into 1 x 784 vectors.
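As a concrete illustration (the variable names here are made up, not from the original), flattening a batch of 28 x 28 images into an N x 784 matrix with NumPy looks like this:

import numpy as np

# hypothetical stand-in for a batch of N = 100 grayscale 28 x 28 images
images = np.random.rand(100, 28, 28)

# flatten each image into a 1 x 784 row vector, giving an N x 784 matrix X
X = images.reshape(len(images), 28 * 28)
print(X.shape)  # (100, 784)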

If y is not a binary variable, we can transform Y into an indicator matrix (a matrix of 0s and 1s, i.e. one-hot encoding), where Y_indicator is an N x K matrix, with N = number of samples and K = number of classes in the output. For the MNIST dataset, K = 10. Here is an example of how we can do this in NumPy:

import numpy as np

def y2indicator(y):
    # convert a length-N vector of class labels into an N x 10 indicator matrix
    N = len(y)
    ind = np.zeros((N, 10))
    for i in range(N):
        ind[i, y[i]] = 1
    return ind
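For example, a small label vector (hypothetical values) gets converted like this:

y = np.array([3, 0, 9])
print(y2indicator(y))
# row 0 has its 1 in column 3, row 1 in column 0, row 2 in column 9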

An artificial neural network is a structure composed of layers, where each layer feeds into the next; there are no feedback connections, so the network is purely feedforward. To see how the output is calculated, consider a 1-hidden-layer neural network with two inputs and two hidden units.

The input is denoted by x, the hidden layer is denoted by z, and the output layer is denoted by y. To compute z1 and z2, we use the following formulas:

z1 = s(w11x1 + w12x2)
z2 = s(w21x1 + w22x2)

s() can be any nonlinear function. The most common choices are the sigmoid, the hyperbolic tangent, and the rectified linear unit (ReLU).

The sigmoid function is defined as:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

The hyperbolic tangent is available directly in NumPy:

np.tanh(x)

The rectified linear unit is defined as:

def relu(x):
    # this version operates on a single scalar value
    if x < 0:
        return 0
    else:
        return x

Alternatively, the rectified linear unit can be written in a form that also works elementwise on NumPy arrays:

def relu(x):
    return x * (x > 0)
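Another common NumPy idiom (not shown above, but equivalent) uses np.maximum, which handles scalars and arrays alike:

def relu(x):
    return np.maximum(x, 0)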

Once z1 and z2 are computed, y can be calculated as:

y = s'(v1z1 + v2z2)

where s'() can be a sigmoid or softmax.

The forward pass of a neural network using ReLU and softmax can be written as:

def forward(X, W, V):
    # hidden layer activations
    Z = relu(X.dot(W))
    # output layer class probabilities (softmax is defined below)
    Y = softmax(Z.dot(V))
    return Y

Binary classification 

In binary classification, the output of the model is a probability that the predicted label is equal to 1 given the input features. This probability is calculated using a logistic regression layer, which outputs a value between 0 and 1. The probability that the label is equal to 0 is then calculated as 1 minus the probability that it is equal to 1.
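As a small sketch (the data, weights, and sizes here are made up), the two class probabilities for a batch of inputs can be computed with the sigmoid defined above:

# hypothetical mini-batch: 5 samples with 3 features, plus random weights
x_batch = np.random.randn(5, 3)
w = np.random.randn(3)

p1 = sigmoid(x_batch.dot(w))     # P(y = 1 | x) for each sample
p0 = 1 - p1                      # P(y = 0 | x)
print(np.allclose(p0 + p1, 1))   # True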

Softmax 

When there are more than two labels (e.g. the digits 0-9 in the MNIST dataset), the softmax function is used. This is defined as exp(a[k]) / {exp(a[1]) + exp(a[2]) + … + exp(a[K])}, where lowercase k indexes one particular class and uppercase K is the total number of classes. Dividing by the sum ensures that the outputs for all labels add up to 1, so they can be interpreted as probabilities.
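A quick numerical check (with a made-up activation vector) shows that the outputs sum to 1:

a = np.array([1.0, 2.0, 3.0])   # hypothetical activations for K = 3 classes
expa = np.exp(a)
probs = expa / expa.sum()
print(probs)                    # roughly [0.09, 0.245, 0.665]
print(probs.sum())              # 1.0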

NOW IN CODE

Assuming you have already loaded your data into NumPy arrays, you can calculate the output Y. The formulas shown above compute the output for a single input sample, so in code we typically want to perform this calculation for many samples at once.

To do this, we will define two functions: sigmoid and softmax. The sigmoid function takes in a value a and returns 1 / (1 + e^(-a)). The softmax function takes in a vector a and, for each component k, returns e^(a[k]) / ∑ e^(a[j]).
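The sigmoid was already written out above; a matching vectorized softmax (the one assumed by the forward function earlier, operating row by row on a matrix of activations) might look like this:

def softmax(a):
    # exponentiate, then normalize each row so the K class probabilities sum to 1
    expA = np.exp(a)
    return expA / expA.sum(axis=1, keepdims=True)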

Next, we will load in the data and randomly initialize W and V (the weights for the inputs and the weights for the outputs, respectively). We will then calculate the output using the sigmoid and softmax functions.
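Here is a sketch of that step. The sizes are MNIST-like but the data is a random stand-in, and the hidden layer size M is an arbitrary choice:

# hypothetical sizes: N samples, D = 784 inputs, M hidden units, K = 10 classes
N, D, M, K = 1000, 784, 300, 10

# random stand-in for the real MNIST inputs, just to keep the sketch runnable
X = np.random.randn(N, D)

# randomly initialize the input-to-hidden and hidden-to-output weights
W = np.random.randn(D, M)
V = np.random.randn(M, K)

Z = sigmoid(X.dot(W))            # hidden layer activations
P_Y_given_X = softmax(Z.dot(V))  # N x K matrix of class probabilities
print(P_Y_given_X.shape)         # (1000, 10)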

Right now the outputs aren't very useful, since the weights were randomly initialized. What we would like to do is find the best W and V so that the predictions P(Y | X) are very close to the actual labels Y.

To obtain a prediction, we just need to round the output probability (for sigmoid), or take the argmax (for softmax).

To add a bias term, we just need an extra weight in each layer that is not multiplied by the input; it shifts the activation of that layer.

We can then rewrite the sigmoid and softmax functions to incorporate this bias term. The new functions will take in an extra parameter, b, which is the bias term.

def sigmoid(a, b):
    return 1 / (1 + np.exp(-a - b))

def softmax(a, b):
    expA = np.exp(a + b)
    return expA / expA.sum(axis=1, keepdims=True)
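The prediction functions below call forward(X, W, V, b), but a bias-aware forward isn't shown here, so this is one possible sketch. It assumes b is a tuple packing the hidden-layer and output-layer bias vectors (hypothetical names b1 and b2) and reuses the bias-aware sigmoid and softmax just defined:

def forward(X, W, V, b):
    # assumption: b = (b1, b2), the hidden and output bias vectors
    b1, b2 = b
    Z = sigmoid(X.dot(W), b1)       # hidden layer with bias
    return softmax(Z.dot(V), b2)    # output class probabilities with bias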

We can then modify the prediction function to take in the bias term.

sigmoid

def predict(X, W, V, b):
    return np.round(forward(X, W, V, b))

softmax

def predict(X, W, V, b):
    return np.argmax(forward(X, W, V, b), axis=1)
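Continuing the random-weight sketch from above (the labels Y and biases here are stand-ins, not real data), you can check how often these untrained predictions match:

# stand-in labels and zero biases, just to make the sketch runnable
Y = np.random.randint(10, size=N)
b = (np.zeros(M), np.zeros(K))

P = predict(X, W, V, b)                         # argmax version: one label per sample
print("classification rate:", np.mean(P == Y))  # roughly 0.1 with random weights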
