What is Backpropagation?
We define a neural network with an input layer of 2 inputs, a hidden layer of 4 neurons, and an output layer of 1 neuron, using the sigmoid function as the activation function. Each hidden neuron computes the weighted sum (`a`) of its inputs and then applies an activation function such as ReLU (Rectified Linear Unit) to obtain its output (`o`). The output is passed to the next layer, where an activation function such as softmax converts the weighted outputs into probabilities for classification.
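As a rough sketch, the forward pass through this 2-4-1 sigmoid network might look like the following (the random initialization, the input values, and names like `forward` are illustrative assumptions, not the article's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    # squashes the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

# 2 inputs -> 4 hidden neurons -> 1 output neuron
W1 = rng.normal(size=(2, 4))   # input-to-hidden weights
b1 = np.zeros(4)               # hidden-layer biases
W2 = rng.normal(size=(4, 1))   # hidden-to-output weights
b2 = np.zeros(1)               # output bias

def forward(x):
    a1 = x @ W1 + b1       # weighted sum at the hidden layer
    o1 = sigmoid(a1)       # hidden activations
    a2 = o1 @ W2 + b2      # weighted sum at the output
    return sigmoid(a2)     # network output, a value in (0, 1)

y_hat = forward(np.array([0.5, -1.0]))
```

Because sigmoid maps any real number into (0, 1), the untrained network already produces a valid-looking output; it is just not yet a useful one.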
Backward Pass
But each mini-batch gives a pretty good approximation, and if there are 100 mini-batches, each step takes 1/100th of the time a full-batch step would. And after 100 steps, each piece of training data will have had its chance to influence the final result. Or rather, in principle it should; in practice, for computational efficiency, we'll use a little trick later so you don't need to hit every single example for every single step.
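The mini-batch bookkeeping described above can be sketched in a few lines (the dataset size of 1,000 examples and the split into 100 batches are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))     # 1,000 training examples, 2 features each

# shuffle once per epoch, then split into 100 mini-batches of 10 examples;
# each gradient step then costs roughly 1/100th of a full-batch step,
# and after 100 steps every example has influenced the parameters
perm = rng.permutation(len(X))
batches = np.array_split(X[perm], 100)
```

One gradient descent step is then computed per element of `batches` rather than per full pass over `X`.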
- If a picture is worth a thousand words, then surely over a dozen GIFs is worth a good deal more (or maybe you just never want to see another GIF again).
- Figure 2 indicates the notation for nodes and weights in the example network.
- Because the network is not yet well trained, the activations in that output layer are effectively random.
- An RNN is a neural network that incorporates feedback loops, which are internal connections from one neuron to itself or among multiple neurons in a cycle.
Defining Neural Network
To illustrate how backpropagation works, we start with the simplest possible neural network, one consisting of a single neuron. Though some machine learning literature assigns unique nuance to each term, they're generally interchangeable. An objective function is a broader term for any such evaluation function that we want to either minimize or maximize.
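For the single-neuron case, the entire chain of derivatives fits in a few lines. A minimal sketch, assuming a sigmoid activation and a squared-error objective (both assumptions for illustration, with made-up input and target values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# one neuron: y_hat = sigmoid(w*x + b)
w, b = 0.5, 0.1
x, y = 2.0, 1.0                  # one training example and its target

a = w * x + b                    # weighted sum
y_hat = sigmoid(a)               # prediction
loss = 0.5 * (y_hat - y) ** 2    # squared-error objective

# chain rule: dL/dw = dL/dy_hat * dy_hat/da * da/dw
dL_dyhat = y_hat - y
dyhat_da = y_hat * (1.0 - y_hat)     # derivative of sigmoid
dL_dw = dL_dyhat * dyhat_da * x      # da/dw = x
dL_db = dL_dyhat * dyhat_da          # da/db = 1
```

Each factor in `dL_dw` is one link of the chain rule; deeper networks just multiply more such links together.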
The Full Algorithm: Forward, Then Backward
The third way we can help increase this neuron's activation is by changing all the activations in the previous layer. Each output neuron has its own suggested nudge for that second-to-last layer, in proportion to the corresponding weights and to how much each neuron needs to change, so the desire of this digit-2 neuron is added together with the desires of all the other nine neurons. This process demonstrates how backpropagation iteratively updates weights by minimizing errors until the network accurately predicts the output.
Thinking about the gradient vector as a direction in a 13,002-dimensional space is, to put it lightly, beyond the scope of our imaginations. If a picture is worth a thousand words, then surely over a dozen GIFs is worth a good deal more (or maybe you just never want to see another GIF again). This article was actually a detour from my original project, building a single-shot object detector, which broke at several points and led me down this rabbit hole. Backpropagation identifies which pathways are more influential in the final answer and allows us to strengthen or weaken connections to arrive at a desired prediction. It is such a fundamental component of deep learning that it will invariably be implemented for you in the package of your choosing.
Activation functions
The process is repeated iteratively in a series of training epochs until the error rate stabilizes. Now that we have the gradients of the loss function with respect to each weight and bias parameter in the network, we can minimize the loss function—and thus optimize the model—by using gradient descent to update the model parameters. Short for “backward propagation of error”, backpropagation is an elegant method to calculate how changes to any of the weights or biases of a neural network will affect the accuracy of model predictions. It’s essential to the use of supervised learning, semi-supervised learning or self-supervised learning to train neural networks.
Each neuron is configured to perform a mathematical operation, called an “activation function”, on the sum of varyingly weighted inputs it receives from nodes in the previous layer. Activation functions introduce “nonlinearity”, enabling the model to capture complex patterns in input data and yield gradients that can be optimized. Using only linear activation functions essentially collapses the neural network into a linear regression model. Backpropagation, short for Backward Propagation of Errors, is a key algorithm used to train neural networks by minimizing the difference between predicted and actual outputs.
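To see why purely linear activations collapse the network into a linear model, note that composing two linear layers is the same as applying one merged weight matrix. A quick sketch (random weights, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(2, 4))   # "hidden layer" weights
W2 = rng.normal(size=(4, 1))   # "output layer" weights
x = rng.normal(size=2)

# two layers with identity (linear) activations...
deep = (x @ W1) @ W2
# ...are equivalent to a single layer with the merged matrix W1 @ W2
shallow = x @ (W1 @ W2)
```

`deep` and `shallow` are numerically identical, so the extra layer added no expressive power; a nonlinearity between the two matrix products is what breaks this equivalence.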
Example of Back Propagation in Machine Learning
The main difference here, compared to a typical neural network layout, is that I've explicitly broken hidden nodes into two separate functions: the weighted sum (z nodes) and the activations (a nodes). These are typically grouped under one node, but it's clearer, and required here, to show each function separately. Throughout, I assume we are dealing with one training example; in reality, you would average over all training examples in your training set. In this section, we'll explore how neural networks adjust their weights and biases to minimize error (or loss), ultimately improving their ability to make accurate predictions. In this text, we focus on the intuitive ideas behind this optimization problem rather than diving deeply into the methods and theory required to perform the optimization steps of backpropagation.
In other words, we’ll need to find the partial derivatives of Lc’s activation function. Starting from the final layer, a “backward pass” differentiates the loss function to compute how each individual parameter of the network contributes to the overall error for a single input. The ultimate goal of backpropagation and gradient descent is to calculate the weights and biases that will yield the best model predictions. Neurons corresponding to data features that significantly correlate with accurate predictions are given greater weights; other connections may be given weights approaching zero. They’re composed of many interconnected nodes (or neurons), arranged in layers.
Minimizing the loss function would entail making adjustments throughout the network that bring the output of Lc's activation function closer to 1. These inputs, combined with their respective weights, are passed to the hidden layers. For example, in a network with two hidden layers (h1 and h2), the output from h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted inputs. The tool used here to convey this visual information is manim, a math animation library created by Grant Sanderson from the 3Blue1Brown YouTube channel.
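As a sketch of what "bringing the output closer to 1" means in practice: repeatedly stepping one neuron's weight and bias down their gradients pushes its activation toward the target. This assumes a squared-error loss, a sigmoid activation, and made-up starting values, not the article's exact setup:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w, b, x, target = -1.0, 0.0, 1.0, 1.0
lr = 1.0                              # learning rate (illustrative)

before = sigmoid(w * x + b)           # initial output, far from 1
for _ in range(200):
    y_hat = sigmoid(w * x + b)
    # chain rule: d(loss)/dw and d(loss)/db for loss = 0.5*(y_hat - target)^2
    grad_w = (y_hat - target) * y_hat * (1 - y_hat) * x
    grad_b = (y_hat - target) * y_hat * (1 - y_hat)
    w -= lr * grad_w                  # gradient descent updates
    b -= lr * grad_b
after = sigmoid(w * x + b)            # output after training, close to 1
```

Each iteration nudges `w` and `b` in the direction that reduces the squared error, so the activation climbs steadily toward the target.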
- So, the backward signal sent by the \(L_2\) loss layer is a row vector of per-dimension errors between the prediction and the target.
- It works by propagating errors backward through the network, using the chain rule of calculus to compute gradients and then iteratively updating the weights and biases.
- In the simplest such model, one feedback loop connects a single neuron with itself.
- Backprop is an efficient way to find partial derivatives in computation graphs.
- With this simple example, we illustrated one forward and one backward pass.
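The \(L_2\) backward signal mentioned above can be made concrete: for the loss \(\tfrac{1}{2}\lVert \hat{y} - y \rVert^2\), the gradient with respect to the prediction is exactly the row vector of per-dimension errors. A sketch with made-up prediction and target values:

```python
import numpy as np

y_hat = np.array([[0.8, 0.1, 0.6]])   # prediction (row vector)
y     = np.array([[1.0, 0.0, 0.0]])   # target (row vector)

# L2 loss: half the sum of squared per-dimension differences
loss = 0.5 * np.sum((y_hat - y) ** 2)

# backward signal sent to the previous layer: d(loss)/d(y_hat),
# the row vector of per-dimension errors
grad = y_hat - y
```

Every layer earlier in the network receives a signal of this kind, transformed through its own local derivatives by the chain rule.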
Both issues are important to be aware of, and to address, when working with RNNs; otherwise, the accuracy of your model may be compromised. On a technical, mathematical level, the goal of backpropagation is to calculate the gradient of the loss function with respect to each of the individual parameters of the neural network. In simpler terms, backpropagation uses the chain rule to calculate the rate at which loss changes in response to any change to a specific weight (or bias) in the network. The intermediate layers between the input layer and output layer, called the network's hidden layers, are where most "learning" occurs.
In the backward pass, or backpropagation, the errors between the predicted and actual outputs are computed. The gradients are calculated using the derivative of the sigmoid function, and the weights and biases are updated accordingly. You then compute a gradient descent step from each mini-batch rather than from the entire set of training examples. This won't give you the actual gradient of the cost function, which depends on all the training data, so it's not the most efficient step downhill.
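Putting the backward pass together for the small 2-4-1 sigmoid network defined earlier, one full training step might be sketched as follows (a squared-error loss, a single training example, and all variable names are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # 2 inputs -> 4 hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # 4 hidden -> 1 output
x = np.array([[0.5, -1.0]])                     # one training example
y = np.array([[1.0]])                           # its target
lr = 0.5                                        # learning rate

def step(W1, b1, W2, b2):
    # forward pass: weighted sums, then sigmoid activations
    a1 = x @ W1 + b1; o1 = sigmoid(a1)
    a2 = o1 @ W2 + b2; o2 = sigmoid(a2)
    loss = 0.5 * np.sum((o2 - y) ** 2)

    # backward pass: chain rule, starting from the output layer
    d_a2 = (o2 - y) * o2 * (1 - o2)          # error at the output's weighted sum
    dW2 = o1.T @ d_a2
    db2 = d_a2.sum(axis=0)
    d_a1 = (d_a2 @ W2.T) * o1 * (1 - o1)     # error propagated to the hidden layer
    dW1 = x.T @ d_a1
    db1 = d_a1.sum(axis=0)

    # gradient descent updates
    return loss, W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

loss0, W1, b1, W2, b2 = step(W1, b1, W2, b2)
loss1, W1, b1, W2, b2 = step(W1, b1, W2, b2)
```

Each call to `step` runs one forward pass, one backward pass, and one parameter update, and the loss shrinks from step to step.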