Neural Networks
Examining the building blocks of modern-day deep learning

Neural Networks are the foundation of all things Deep Learning, so it's natural to wonder how exactly they work. While they may seem like a black box to most, we find that the box isn't actually that deep once opened. With some simple, intuitive diagrams, let's examine how these fascinating mathematical models learn to play video games, classify animals, and drive cars.

The piece that all neural networks are built on is the neuron. It takes in some inputs and yields an output, very similar to how the neurons in our brain work.

Network with Activation Function

Think of each circle as a neuron and each line as a connection between the input and the output. In a computer, each circle and each line is simply represented by a number. The gray circle is a bias term, whose value is always 1; the weight on the line connected to it, however, is not necessarily 1.

The output is calculated by taking a weighted sum of all of the inputs and passing the result into the activation function; the lines are the weights. For now, let's assume the activation function is the sign function: it outputs 1 if the input is positive and -1 otherwise.

Neuron

It is through simple addition and multiplication that we calculate the output of each neuron.
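
To make that concrete, here is a minimal sketch of a single neuron in Python; the input values and weights below are made up purely for illustration:

```python
# A single neuron: weighted sum of the inputs plus the bias, passed through an activation.
def sign(x):
    return 1 if x > 0 else -1

def neuron_output(inputs, weights, bias_weight):
    # Each input times its weight, plus the bias term (whose input is always 1).
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + 1 * bias_weight
    return sign(weighted_sum)

# Input values and weights chosen arbitrarily for illustration.
print(neuron_output([0.5, -1.2, 3.0], [0.4, 0.1, -0.6], bias_weight=0.2))  # -1
```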

Below is a diagram of a fully connected neural network. It's simply a bunch of neurons arranged in layers, with the dotted circles representing the bias terms.

Network

We have four neurons in the input layer (red), two neurons in the hidden layer (blue), and four neurons in the output layer (yellow). The number of neurons per layer depends on the task at hand and may require tuning by you, the computer scientist. Each layer, besides the output layer, also has its own bias unit.

The input numbers will be given by whatever dataset we're using, but the value of each neuron in subsequent layers is the weighted sum of all the neurons in the previous layer, as mentioned above. Training a neural network is the process of mathematically working out what these weights should be.

At a fundamental level, a neural network just performs a sequence of multiplications and additions.

After calculating the weighted sum for each neuron in a layer, we pass the sum through the activation function; these aren't shown in the network above. The activation function's job is to constrain each neuron's output to a specific range so that no single neuron significantly overpowers the others. Activation functions also introduce non-linearity into the network, which is important for complex tasks: without them, a neural network would collapse into an everyday linear regression. All neurons in a layer share the same activation function, but it can vary between layers.

We explore activation functions more in-depth in the technical section of this post.
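
In the meantime, here is a quick sketch of a few common activation functions in Python (the sign function from earlier, plus the sigmoid and ReLU functions):

```python
import math

def sign(x):
    return 1 if x > 0 else -1        # the simple activation used above

def sigmoid(x):
    return 1 / (1 + math.exp(-x))    # squashes any input into the range (0, 1)

def relu(x):
    return max(0.0, x)               # keeps positive values, zeroes out negatives
```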

The value of the first blue neuron in the hidden layer is found by multiplying each of the four input neurons by the corresponding weight connecting it to the blue neuron, adding the bias weight (the dotted line from the red bias term), and passing the result into the activation function.

Network

The same process applies to the other neuron in the hidden layer and to the rest of the neurons in the output layer. We do not calculate anything for the bias neurons; their value is always 1.
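
One convenient way to picture an entire layer at once is to stack the weights into a matrix and the inputs into a vector, so the whole layer becomes a single matrix-vector product. A rough sketch with NumPy, using randomly chosen numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

inputs = np.array([0.2, 0.8, -0.5, 1.0])   # the four red input neurons
weights = rng.standard_normal((2, 4))       # one row of weights per blue hidden neuron
biases = rng.standard_normal(2)             # one bias weight per hidden neuron (the dotted lines)

# Weighted sum for both hidden neurons at once, then the sign activation.
hidden = np.sign(weights @ inputs + biases)
print(hidden)  # values depend on the random weights, e.g. [ 1. -1.]
```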

For our network to be useful, we need some data. Let’s say we’re tackling the hand-written digit classification task (the "Hello World" of neural nets) using the MNIST dataset. Inserted below are 64 sample inputs from the dataset, each an image of a hand-written digit.

MNIST Digits

Each digit is represented in the computer as a 28x28 grid of numbers. Black pixels are given a value of 0, while white ones are given a value of 1. Any number in between is a relative shade of gray.

We take each row of the image and stack it into one long list of 784 (28 times 28) numbers; this is the input to our neural network. In this case the input layer needs 784 neurons, one for each input, and the output layer needs 10 neurons, one for each digit. We can choose however many hidden layers we'd like, with however many neurons in each.
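
A quick sketch of that flattening step, with a random array standing in for a real MNIST digit:

```python
import numpy as np

# A random 28x28 array standing in for a real MNIST digit,
# with values between 0 (black) and 1 (white).
image = np.random.rand(28, 28)

# Stack the rows into one long vector of 784 numbers -- the network's input layer.
input_vector = image.reshape(784)
print(input_vector.shape)  # (784,)
```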

We then multiply through the network until we have the 10 values in our output layer; this is known as the forward pass. The neuron with the highest value is our “prediction.” The first neuron represents the digit 0, so if the 8th neuron has the highest value, the network’s prediction is the digit 7.
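
Putting the pieces together, a forward pass through a hypothetical 784 → 16 → 10 network (the hidden layer size and the ReLU activation here are arbitrary choices for illustration) might look like this:

```python
import numpy as np

rng = np.random.default_rng(42)

# Randomly initialized weights for a hypothetical 784 -> 16 -> 10 network.
W1, b1 = rng.standard_normal((16, 784)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((10, 16)), rng.standard_normal(10)

def forward(x):
    hidden = np.maximum(0, W1 @ x + b1)   # hidden layer with a ReLU activation
    return W2 @ hidden + b2               # ten output values, one per digit

x = rng.random(784)                        # a flattened stand-in for a digit image
prediction = int(np.argmax(forward(x)))    # index of the largest output neuron
print(prediction)                          # e.g. 7 would mean the 8th neuron was largest
```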

While the general mechanics of a neural network hopefully now seem much less mysterious, your next question is probably: how do we train these things? How does the network know what the values of the weights need to be in order to make accurate predictions?

Saving the calculus for the technical section, let’s examine the backpropagation algorithm. As mentioned earlier, each data point has an associated label, 0 through 9. We convert each label into a vector of ten entries: nine zeros and a single one in the position indicated by the original label. These are called "one-hot encoded" vectors. Again, the first index represents 0.

$$0=\begin{bmatrix} 1\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0 \end{bmatrix} \hspace{40pt} 7=\begin{bmatrix} 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 1\\ 0\\ 0 \end{bmatrix}$$
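
In code, building such a one-hot vector is straightforward; a minimal sketch:

```python
import numpy as np

def one_hot(label, num_classes=10):
    # A vector of ten zeros with a single 1 at the position given by the label.
    vector = np.zeros(num_classes)
    vector[label] = 1.0
    return vector

print(one_hot(7))  # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
```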

At the start, all of the weights in the network are initialized randomly, so the predictions won’t be accurate at all. After we forward propagate through the network, we have 10 numbers in our output layer, all of them pretty close to useless.

We then compare the output layer to the one-hot vector created from the data label and subtract the two to get an error vector, which essentially tells the network how incorrect each neuron in the output layer was. Using this error, we calculate (via gradient descent) how much to change each weight so that the output becomes a little more accurate. We then move to the previous layer, again calculating how inaccurate its neurons were and tweaking the network's parameters.
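
A very rough sketch of a single gradient-descent update for just the output layer, assuming a squared-error cost and an arbitrarily chosen learning rate (the full derivation, including the chain rule for earlier layers, is saved for the technical section):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 16 hidden-layer values feeding 10 output neurons (all numbers made up).
hidden = rng.random(16)
W2, b2 = rng.standard_normal((10, 16)), rng.standard_normal(10)
target = np.zeros(10)
target[7] = 1.0                              # one-hot vector for the label 7

output = W2 @ hidden + b2                    # the network's (currently poor) output
error = output - target                      # how wrong each output neuron was

# One gradient-descent step: nudge the output-layer weights to shrink the error slightly.
learning_rate = 0.01                         # step size, chosen arbitrarily
W2 -= learning_rate * np.outer(error, hidden)
b2 -= learning_rate * error
```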

After we repeat this for each data point, we’ll have completed one epoch of training. Most tasks require multiple epochs in order to yield a respectable accuracy.

We can track whether our neural network is actually learning by using a “cost function.” It is a function that takes in our predicted values and the actual values for our entire dataset and returns a value indicating how “incorrect” our predictions are relative to the ground truth. The lower the value, the better.
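
One simple choice of cost function is the mean squared error, which averages the squared differences between the predictions and the ground truth. A minimal sketch:

```python
import numpy as np

def mean_squared_error(predictions, targets):
    # Average squared difference between the network's outputs and the ground truth.
    return np.mean((predictions - targets) ** 2)

# Perfect predictions give a cost of 0; worse predictions give a larger number.
print(mean_squared_error(np.array([0.9, 0.1]), np.array([1.0, 0.0])))  # 0.01
```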

It's through this process of passing a data point forward, comparing the output with the ground truth, calculating the error, backpropagating that error through the network in reverse, and updating the weights that the network eventually starts yielding accurate predictions. It's the fundamental backbone of every system today that uses a neural network under the hood.

In the technical section, we’ll see how the backpropagation algorithm minimizes this function to obtain the weights that maximize our accuracy. There, we'll dive into the linear algebra and calculus behind neural networks.