Understanding Backpropagation in ML
Basic Terminologies related to Backpropagation
Function Composition :- It is an operation that takes two functions f and g and produces a function h such that h(x) = g(f(x)). In this operation, the function g is applied to the result of applying the function f to x.
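A minimal Python sketch of function composition, following the definition above. The helper names compose, double and add_one are made up for this illustration.

```python
def compose(f, g):
    """Return h such that h(x) = g(f(x)): f is applied first, then g."""
    def h(x):
        return g(f(x))
    return h


def double(x):
    return 2 * x


def add_one(x):
    return x + 1


# h(x) = add_one(double(x))
h = compose(double, add_one)
print(h(3))  # add_one(double(3)) = add_one(6) = 7
```

Note that order matters: compose(add_one, double)(3) gives 8, not 7.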
Multi Layered Perceptron (M.L.P) :- This is basically a graphical way of representing complex functions. With the multi-layered structure shown in the diagram above, we can build complex mathematical functions to solve our real-world problems, which gives our ML models enormous power.
Chain Rule :- This rule of differentiation helps us differentiate a function composition. It states that the derivative of f(g(x)) is f’(g(x))⋅g’(x).
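To see the chain rule at work, here is a small sketch that compares the chain-rule derivative with a numerical estimate. The functions f(u) = u² and g(x) = 3x + 1 are chosen only for illustration and do not come from the article.

```python
def f(u):
    return u ** 2


def f_prime(u):
    return 2 * u


def g(x):
    return 3 * x + 1


def g_prime(x):
    return 3


def chain_rule_derivative(x):
    # Derivative of f(g(x)) is f'(g(x)) * g'(x).
    return f_prime(g(x)) * g_prime(x)


def numerical_derivative(func, x, eps=1e-6):
    # Central difference approximation, used only as a sanity check.
    return (func(x + eps) - func(x - eps)) / (2 * eps)


x = 2.0
analytic = chain_rule_derivative(x)  # 2 * (3*2 + 1) * 3 = 42
numeric = numerical_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)
```

The two values should agree up to a small numerical error.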
Memoization :- It is an optimization technique where we store the results of expensive function calls and use the same results when the same inputs are encountered again.
Let's understand memoization using the Fibonacci series.
Fibonacci series :- Each term is the sum of the previous two terms, given first term = 0 and second term = 1. E.g., 0, 1, 1, 2, 3, 5, 8, ...
Therefore, in short, fib(n) = fib(n-1) + fib(n-2).
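Here is a small sketch of memoization on the Fibonacci series. It assumes 0-based indexing (fib(0) = 0, fib(1) = 1), and the call counters are added only to make the saving visible.

```python
def fib_plain(n, counter):
    """Plain recursion: the same sub-problems are solved again and again."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_plain(n - 1, counter) + fib_plain(n - 2, counter)


def fib_memo(n, memo, counter):
    """Memoized recursion: each fib(k) is computed once and then looked up."""
    counter[0] += 1
    if n in memo:
        return memo[n]  # reuse the stored result
    if n < 2:
        result = n
    else:
        result = fib_memo(n - 1, memo, counter) + fib_memo(n - 2, memo, counter)
    memo[n] = result  # store the result for later calls
    return result


plain_calls = [0]
memo_calls = [0]
print(fib_plain(20, plain_calls), "calls:", plain_calls[0])
print(fib_memo(20, {}, memo_calls), "calls:", memo_calls[0])
# Both return 6765, but the memoized version makes far fewer calls.
```

We pay a little extra memory for the memo dictionary and get a large reduction in the number of function calls. This is the same trade-off that backpropagation makes.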
Fully Connected Neural Network or Fully Connected M.L.P :- A neural network that has every possible connection between its neurons is called an F.C.N.N. Note that connections are always made from one layer to another layer and never within a layer.
Fully Connected Neural Network (F.C.N.N)
Notations used in understanding M.L.P
These functions f are activation functions, such as ReLU or Sigmoid, applied at each neuron.
Outputs in M.L.P
Weights in M.L.P
When connections are made from one layer to another layer, a weight is attached to each connection. The weights signify that some input features are more important than others, so a larger weight is attached to the more important connections.
Let me explain the notation of these weights, which is very important for understanding backpropagation, with an example.
Weight Notation Example
Weight Matrix Representation at Layer 1
Therefore, W¹ (the weight matrix at Layer 1) is a matrix of size 4×3, and similarly we can write the weight matrices for all the hidden layers.
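As a rough sketch, a 4×3 weight matrix can be stored as a list of 4 rows with 3 entries each. The indexing convention used here (W1[i][j] connects input i to neuron j of Layer 1), the sigmoid activation and all the numbers are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights):
    """Compute the outputs of one layer.

    weights[i][j] is the weight on the connection from input i
    to neuron j of the next layer (assumed convention).
    """
    n_out = len(weights[0])
    outputs = []
    for j in range(n_out):
        z = sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        outputs.append(sigmoid(z))
    return outputs


x = [1.0, 0.5, -1.0, 2.0]  # 4 input features (made-up values)
W1 = [                      # 4 x 3 weight matrix (made-up values)
    [0.1, -0.2, 0.3],
    [0.0, 0.4, -0.1],
    [0.5, 0.2, 0.0],
    [-0.3, 0.1, 0.2],
]
o1 = layer_forward(x, W1)
print(len(W1), len(W1[0]), len(o1))  # 4 3 3
```

Four inputs and three neurons give 4 × 3 = 12 weights in Layer 1, and the layer produces three outputs.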
Optimization Problem :- Our goal in ML is always to minimize the difference between our predicted output (ŷ) and the actual output (y). This difference is also called the Loss/Cost Function and is often denoted by “L”.
Loss = Predicted Output (ŷ) - Actual Output (y).
We can use mean squared error (M.S.E) or simply squared error. I will be using squared error for L, i.e.,
L = (Actual Output (yᵢ) - Predicted Output (ŷᵢ))².
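The squared error and the mean squared error can be sketched in a few lines of Python. The function names and sample values below are made up for this illustration.

```python
def squared_error(y_true, y_pred):
    """L = (actual output - predicted output)^2 for one data point."""
    return (y_true - y_pred) ** 2


def mean_squared_error(ys_true, ys_pred):
    """Average of the squared errors over all data points."""
    errors = [squared_error(y, y_hat) for y, y_hat in zip(ys_true, ys_pred)]
    return sum(errors) / len(errors)


print(squared_error(3.0, 2.5))                    # (3.0 - 2.5)^2 = 0.25
print(mean_squared_error([1.0, 2.0], [0.0, 4.0]))  # (1 + 4) / 2 = 2.5
```

Squaring makes the loss non-negative, so errors in opposite directions cannot cancel each other out.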
Thus, stating the above mathematically gives us an optimization problem, i.e., we need to find the weights (training parameters) that minimize our L.
We know from linear regression that
Gradient Descent :- To solve the above optimization problem we need an optimization algorithm called Gradient Descent. It is used for minimizing the cost function (L). It repeatedly updates the parameters of an ML model using the update rule so that the cost function decreases.
This “r” or “η” (eta) is called the step size or learning rate and is always positive.
Convergence in Update rule :-
When the weights no longer change from one update to the next, it's time to stop. Sometimes you may never converge exactly, so it is better to pick a threshold value delta and stop updating once the change in the weights falls below it.
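Here is a minimal sketch of gradient descent with a learning rate and a convergence threshold. It assumes the usual update rule w_new = w_old - η · dL/dw and uses a toy loss L(w) = (w - 3)², which is chosen only for illustration. The parameter names eta, delta and max_steps are also assumptions.

```python
def gradient_descent(grad, w_init, eta=0.1, delta=1e-8, max_steps=10_000):
    """Minimize a one-parameter loss given its derivative `grad`.

    eta is the learning rate (always positive); delta is the threshold on
    the change in w below which we consider the updates converged.
    """
    w = w_init
    for step in range(1, max_steps + 1):
        w_new = w - eta * grad(w)      # update rule
        if abs(w_new - w) < delta:     # convergence check
            return w_new, step
        w = w_new
    return w, max_steps                # gave up before converging


# Toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3); the minimum is at w = 3.
w_star, steps = gradient_descent(lambda w: 2 * (w - 3.0), w_init=0.0)
print(w_star, steps)
```

The max_steps cap is a safety net for cases where the updates never drop below the threshold, for example when the learning rate is too large.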
Steps to solve Optimization Problem
1) Define Loss function L which is
Let's ignore the regularizer term for simplicity.
Putting value of Predicted Output in Optimization Problem
Final Optimization Problem to be solved.
Here W stands for W¹, W² and W³ (the matrix representations of the weights), as shown in the previous diagram.
2) Initialization of the training parameters, i.e., the weights, is done using random initialization.
3) Gradient Descent now keeps changing these weights using the update rule.
4) Perform updates till convergence.
Therefore, the most important part is computing this derivative.
Therefore most important part is computing this derivative.
How to compute this derivative ?
Now let’s take another example to understand this better.
Example for Chain Rule.
Before going to more examples, let's understand some more advanced concepts in the chain rule.
Using the above 2 concepts, let's calculate ∂k/∂W¹₁₁,
We are storing 7 derivatives, so memoization requires slightly more memory, but there is a huge speed-up. Look: ∂L/∂O31 is needed by all 20 weights, but we compute it once, save the result, and reuse the saved result for the other 19 weights.
Therefore, we used the Chain Rule and applied Memoization as a trick, and the resulting algorithm is called Backpropagation.
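The idea can be sketched in plain Python for a small fully connected network. This is only an illustration under several assumptions: layer sizes 4, 3, 2, 1 (which gives 12 + 6 + 2 = 20 weights), sigmoid activations everywhere, no bias terms, the squared error loss L = (y - ŷ)², and made-up inputs. All function and variable names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights):
    """Return the outputs of every layer; activations[0] is the input."""
    activations = [x]
    for W in weights:
        prev = activations[-1]
        out = [sigmoid(sum(prev[i] * W[i][j] for i in range(len(prev))))
               for j in range(len(W[0]))]
        activations.append(out)
    return activations


def loss(x, y, weights):
    return (y - forward(x, weights)[-1][0]) ** 2


def backprop(x, y, weights):
    """Return dL/dW for every weight, in the same shapes as `weights`."""
    activations = forward(x, weights)
    # dL/d(final output): computed once and reused by every weight.
    dL_dout = [-2.0 * (y - activations[-1][0])]
    grads = [None] * len(weights)
    for layer in reversed(range(len(weights))):
        W = weights[layer]
        inp, out = activations[layer], activations[layer + 1]
        # Chain rule through the sigmoid: s'(z) = s(z) * (1 - s(z)).
        delta = [dL_dout[j] * out[j] * (1.0 - out[j]) for j in range(len(out))]
        grads[layer] = [[inp[i] * delta[j] for j in range(len(out))]
                        for i in range(len(inp))]
        # Memoization: pass the saved derivatives back to the previous layer
        # instead of recomputing them for each weight.
        dL_dout = [sum(W[i][j] * delta[j] for j in range(len(out)))
                   for i in range(len(inp))]
    return grads


random.seed(0)
sizes = [4, 3, 2, 1]
weights = [[[random.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]
           for n_in, n_out in zip(sizes, sizes[1:])]
x, y = [1.0, 0.5, -1.0, 2.0], 1.0
grads = backprop(x, y, weights)
print(sum(len(row) for g in grads for row in g))  # 20 weight gradients
```

Each layer's derivatives are computed once in the backward loop and then handed to the layer before it, so no derivative is recomputed for each individual weight.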
Training an M.L.P
“Training a neural network means finding the best weights on edges or connections using our training data.” The next question that comes to mind is: what are these best weights? The best weights are the weights that minimize our L (loss). The first time our training data is passed through our model is often called the “first epoch”. An epoch basically means passing our entire training data through our model once. After each epoch, we calculate our loss and see how close we have come to our actual output. This is called forward propagation.
After we have calculated our L, it's time to adjust our training parameters, that is, our weights, so that L is smaller next time. This is called “Backward Propagation”.
Therefore, Backpropagation calculates the error at the output and then distributes that error back through the layers of the network. This forward and backward process is repeated multiple times (over multiple epochs) so as to get the minimum L.
In the real world, a model is trained using multiple epochs, i.e., we pass our training data (D) through the neural network many times.