Monday, March 25, 2019


CNN contd..

> Now, the graph that I had in the last post did not explain everything in detail.

> It was a graph of the error function J(W) (y-axis) vs 1 parameter (or 1 weight) that is supposed to be multiplied with 1 feature.
> But in reality there are a large number of weights (in matrix form) that are multiplied with the features.
> This implies that the error function, or the cost function, is affected by each and every weight parameter used across all layers of our network.
> So, now how will I get to know how my error function changes once I change my weights? I can't really rely on a 2D graph..

> So, I need to be able to constantly change all of my weights such that the error decreases in every iteration.

> But by how much?

This is when I came across gradient descent, which says: use a 'gradient' to change the weights that you initially used to predict the final class.

Something like:

for each iteration:
    for each layer i:
        gradient_w11 = determine_gradient_for_weight_on_neuron1()
        gradient_w12 = determine_gradient_for_weight_on_neuron2()
        ... and so on

        new_weights_on_neuron1 = original_weights_on_neuron1_changed_by(gradient_w11)
        new_weights_on_neuron2 = original_weights_on_neuron2_changed_by(gradient_w12)
        ... and so on

    calculate_error_with_new_weights()  # check if this is decreasing!
    # (in case I forget: the error is the potato being predicted as a book!)
 
> Of course, I won't be calculating the gradients for each weight on each neuron individually - it's all done using matrices! (NumPy)
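To make that concrete, here's a minimal sketch of what a vectorised update could look like, assuming a toy one-layer linear model with a mean-squared error - the setup (X, y, W, the 0.1 step size) is entirely my own made-up example, not the actual network:

import numpy as np

# toy setup (assumed): 5 samples, 3 features, linear model y_hat = X @ W
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # features
y = rng.normal(size=(5, 1))    # targets
W = rng.normal(size=(3, 1))    # all the layer's weights as one matrix

for iteration in range(100):
    y_hat = X @ W                             # predictions
    error = np.mean((y_hat - y) ** 2)         # the cost J(W)
    grad_W = 2 * X.T @ (y_hat - y) / len(X)   # gradient w.r.t. every weight at once
    W = W - 0.1 * grad_W                      # one matrix op updates all the weights
    # (the 0.1 is a small step size - more on this 'learning rate' further down)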

> Now, how is the gradient computed?

> This is where the math that I was taught a couple of years ago comes into the picture.
It's about finding the derivatives of functions (differentiation). The derivative of a function, a.k.a. the slope of the tangent to its graph - according to my understanding - gives me the direction of the rate of growth of the function.
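A quick way to convince myself of that slope idea numerically - this is just a finite-difference sketch on a function I made up, f(x) = x^2:

def f(x):
    return x ** 2   # a made-up function to play with

def slope_at(x, h=1e-6):
    # finite-difference approximation of the derivative f'(x)
    return (f(x + h) - f(x)) / h

print(slope_at(3.0))    # ~6.0  -> positive: f grows if x increases here
print(slope_at(-3.0))   # ~-6.0 -> negative: f shrinks if x increases here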

> This is what I understood - if anyone comes up with a better/simpler understanding, please feel free to post in the comments.


This is what I found on Google Images. It's actually from the YouTube channel 3Blue1Brown, which is pretty amazing! (https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)

So, bottom line: if we assume the graph on the top LHS is our error function (probably with 2 weights vs error), and take the derivative of the error function (with respect to the weights), we get the direction of growth of the function..

But we want the exact opposite thing!
We want to descend that hill.
Hence the idea: reducing our weights by the obtained gradients enables us to descend the hill on the left => implying that we are trying to obtain the weights (for our features) for which our error (the graph on the left) is minimized.


I think intuitively it makes sense: the gradient tells me by how much the function grows when I increase my weight. If I decrease the weight by that gradient, the function comes down, right?
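Here's a tiny sketch of that intuition - repeatedly stepping a single weight against the gradient of a bowl-shaped toy error J(w) = (w - 2)^2 that I made up for this:

def J(w):
    return (w - 2) ** 2    # toy error; its minimum is at w = 2

def dJ_dw(w):
    return 2 * (w - 2)     # its derivative

w = 8.0
for step in range(5):
    w = w - 0.3 * dJ_dw(w)   # move against the direction of growth (0.3 is a step size I picked)
    print(w, J(w))           # w slides toward 2 and J keeps dropping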

Now, the derivative that I talked about can be with respect to any of the weights.
The derivative of the graph/error function (y = f(x)), in dy/dx form, tells me how the function y (error) changes (more like increases) when there is a tiny change in the value of x (weight).

> Along the same lines, I'd want to know how the y function changes for tiny changes in the weights w11, w12, w13, w14........ and so on..

> For this, there is the partial derivative, which says how the graph/error function/y changes for tiny changes in one particular variable (weight) while the others (the other weights) are treated as constants.

> Something like:

for a given number of iterations:
    for each layer i:
        gradient_w11 = dJ(W) / dw11   # error function J(W)
        gradient_w12 = dJ(W) / dw12   # strictly, the symbol should be the partial-derivative '∂', not 'd'
        ...
        ... and so on for all the weights

> Again, in the above step, everything is done with matrices (Enter NumPy!)
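To see what 'treat the other weights as constants' means in practice, here's a small numerical sketch - the two-weight cost J below is made up purely for this check:

import numpy as np

def J(w):
    # a made-up cost over two weights
    return w[0] ** 2 + 3 * w[1] ** 2

def partial(cost, w, i, h=1e-6):
    # nudge only weight i; every other weight stays fixed
    w_nudged = w.copy()
    w_nudged[i] += h
    return (cost(w_nudged) - cost(w)) / h

w = np.array([1.0, 2.0])
print(partial(J, w, 0))   # ~2.0  (analytically: 2 * w[0])
print(partial(J, w, 1))   # ~12.0 (analytically: 6 * w[1])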

> Ok, now that I have my gradients, as I said above, I just need to subtract them from the original weights to descend down the hill (similar to sliding down the bowl!)...
i.e.

original_w11 = original_w11 - gradient_w11
original_w12 = original_w12 - gradient_w12
... and so on (but using matrices in a single step!!)
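In NumPy that single step could look something like this (toy matrices of my own, just to show the shape of the operation):

import numpy as np

W = np.array([[0.5, -1.2],
              [0.3,  0.8]])         # toy weight matrix for a layer
grad_W = np.array([[0.10, -0.20],
                   [0.05,  0.00]])  # toy gradients, same shape
W = W - grad_W                      # every weight updated in one shot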

> Now, consider this example,
# the cost function for this example (the thing we're minimizing)
z = 2(w_1)^2 + 4(w_2)^3

∂z/∂w_1 = 2 * 2 * w_1 + 0          --- gradient_w1
∂z/∂w_2 = 0 + 4 * 3 * (w_2)^2      --- gradient_w2

Now, if my original weights for my features were w_1 = 4; w_2 = 3

z = 2 * 4^2 + 4 * 3^3 = 32 + 108 = 140

// re-assigning my original weights

for w_1:
w_1 = w_1 - gradient_w1
w_1 = 4 - (2 * 2 * 4) = 4 - 16 = -12

and for w_2:
w_2 = w_2 - gradient_w2
w_2 = 3 - (4 * 3 * 3^2) = 3 - 108 = -105

now my new z is
z = 2 * (-12)^2 + 4 * (-105)^3 = 288 - 4,630,500 = -4,630,212 !! that's a huge swing!
and the transition seems too damn high!! from 4, 3 we reached -12, -105 !! :-O

We might have actually missed the minimum value of the error function.
We might have jumped right across to the other side of the graph (of course, the graph does not exactly correspond to the weights' example I've shown above).
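Here's a quick sketch that reproduces the numbers above and shows the overshoot (same toy cost function as in the example):

def z(w1, w2):
    return 2 * w1 ** 2 + 4 * w2 ** 3

w1, w2 = 4.0, 3.0
grad_w1 = 2 * 2 * w1          # = 16
grad_w2 = 4 * 3 * w2 ** 2     # = 108

w1, w2 = w1 - grad_w1, w2 - grad_w2
print(w1, w2)       # -12.0, -105.0 -> a wild jump from (4, 3)
print(z(w1, w2))    # -4630212.0 -> flew way past any nearby minimum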





So we need to go down slowly!! In smaller steps!

So now, there is this term called the "learning rate" (the alpha symbol) that I can use to control my steps.






Something like:

w_1 = w_1 - (learning rate) * gradient_w1
w_2 = w_2  - (learning rate) * gradient_w2

Now, if I take my learning rate to be something like 0.01 and plug the values into the above equations, I should get a decently small step, resulting in new weights that are only a small step away from the old ones.
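Continuing the toy example from above with alpha = 0.01 (my own continuation, just to see the numbers):

alpha = 0.01
w1, w2 = 4.0, 3.0
grad_w1 = 2 * 2 * w1          # 16
grad_w2 = 4 * 3 * w2 ** 2     # 108

w1 = w1 - alpha * grad_w1     # 4 - 0.16 = 3.84
w2 = w2 - alpha * grad_w2     # 3 - 1.08 = 1.92
print(w1, w2)                 # a gentle step instead of a wild jump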


So that is good, and there is still hope of reaching the bottom, or the minima (the global minima), where I can be assured that my system will accurately (to some extent!) predict that the food item is a potato and not a book!

(FYI - the left image is, again, only with respect to 1 weight!! So it's weight vs. the error function)


So the challenge is to choose the best learning rate (alpha) so that the steps aren't so small that the system takes ages to reach the minima - i.e., as they say, the model takes a long time to converge!

We don't want that; nor do we want the system to shoot across to the other side of the graph, clearly missing the minima, just because we chose a large alpha (learning rate).
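A small sketch of both failure modes, using the same bowl-shaped toy error J(w) = (w - 2)^2 from earlier:

def dJ_dw(w):
    return 2 * (w - 2)    # gradient of the toy error J(w) = (w - 2)^2

for alpha in (0.001, 0.3, 1.5):
    w = 8.0
    for _ in range(20):
        w = w - alpha * dJ_dw(w)
    print(alpha, w)
# 0.001 -> barely moved from 8: too slow, will take ages to converge
# 0.3   -> ~2.0: reached the minimum
# 1.5   -> a huge number: shot past the minimum and diverged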





TakeAways:
> How gradient descent works - the derivative of a function gives the direction of the rate of growth of that function, and we want to go in the opposite direction.
> I achieve the above by taking partial derivatives of my cost function with respect to each of my weights on each neuron across all layers! Of course, I need to do a vectorised implementation of this. No loops!
> Of course, I cannot be all greedy and decrease the weights left and right! That leads to me missing the minima (the point where the error is minimum) and jumping off to the other side of the bowl..
I need to carefully choose something called a learning rate that I will use along with my gradient while descending the hill (while I am subtracting from the original weights).
> Some of the above facts might be redundant, but the point was to emphasize certain things, to make myself understand them better (even the pseudo code is for the same purpose).
> Hmm, I might have enough to try out a few things, like say (a rough sketch of this is at the end of the post):
Initialize my:
- input, output and hidden layers - the number of neurons on each layer
- weights for all neurons across all layers
- a learning rate
- the number of iterations for my gradient descent (how many times do I take the steps down the hill)
Then:
- compute the outputs for each neuron, across layers, and finally predict the output.
- Not sure if I can also implement gradient descent (back propagation - of course, I will have weights across multiple layers, which in turn makes the calculation of the partial derivatives in each layer difficult), as I still need to learn the partial derivative formula for that (the chain rule).

- Oh yea, I haven't discussed the bias, which I will be adding while multiplying the weights with the features. I'll probably bring in some activation function as an example (tanh, ReLU, sigmoid) to sort of visit bias usage, and also probably talk about whose partial derivatives I should be calculating - for which, again, I need to make use of an activation function.

- So as a part of the 'sub-idea' that I would want to work on, after covering the concepts of backprop and bias in NNs, I would want to implement the entire flow without any libraries - I don't care about the accuracy, but I would like to see the cost decreasing at every iteration and the weights getting updated. I'll obviously use stuff for reference, but I'll try writing it myself.
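For whenever I get to that, a rough skeleton of the initialization + forward pass could look like this - the layer sizes, the sigmoid activation and all the names are placeholders I'm assuming, not a final design:

import numpy as np

rng = np.random.default_rng(42)

# assumed sizes: 4 input features -> 5 hidden neurons -> 3 output classes
layer_sizes = [4, 5, 3]
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    # push one input through every layer
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)
    return a

print(forward(rng.normal(size=4)))   # 3 'class scores' from random weights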


End.