Monday, April 15, 2019

Begin.

It's been a while; I was caught up with exams and the onset of overwhelming yet exciting subjects of the new semester.

So a couple of things happened, actually:

> Started exploring the implementation of a basic neural network, with help from the web (without the optimization or even the validation part!). The main intention was to see the entire flow in action, hopefully in a graph.
> After some googling (there was ready-made code available):

Here is what I tried to implement -

> I decided on a basic neural net structure, straying a bit away from what I wanted to be able to predict in the first place.

> This is what I wanted to implement



A few things that I explored on the way-

> I used a dataset that was given as part of a course by my Uni (the Titanic survival dataset, which I think is available on Kaggle).
> I wanted to play with the dataset, hence explored a bit of Pandas - yet again, it is one powerful and beautiful utility! Used a Jupyter notebook, as it's just awesome.
> Realised Pandas represents data as dataframes - an awesome way of filtering and doing a basic EDA on the given data to get an overview of it.

> One important step, which I unfortunately had to realise the hard way, was having to thoroughly 'clean' my data.
> Here, I explored why I was supposed to clean the data and how I could do all of it in Python (a rough sketch follows this list):
              1. Remove NaN / non-numeric values.
              2. Convert categorical data to numbers (enumerations).
              3. The important one was normalisation - getting all of the features onto a standard scale. There could be a column A whose range is 1-5 and another column B whose range is 1000-60000. If they are used as-is, as pointed out by this person on YouTube, the weights assigned to those features might rely heavily on the raw numeric values alone and not on their influence on the result - as in, if A has the value 5 and B has 1000, B might be given the wrong weightage.
              4. Hence normalise using X = (val - mean) / standard deviation - the standard deviation describes the spread of my column.
              5. To squish the values of a column between 0 and 1 instead, I can use X = (val - min) / (max - min).
              6. Append the bias column (np.ones) to X.
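
Something like this rough Pandas/NumPy sketch - the column names here are the usual Kaggle Titanic ones and are just my assumption, the Uni's version may differ:

import numpy as np
import pandas as pd

# Rough sketch of the cleaning steps above; column names are assumed
# from the usual Kaggle Titanic dataset and may not match exactly.
df = pd.read_csv("titanic.csv")

# 1. Drop rows with NaN in the columns of interest
df = df.dropna(subset=["Pclass", "Sex", "Fare", "Embarked", "Age"])

# 2. Enumerate categorical columns as integer codes
df["Sex"] = df["Sex"].astype("category").cat.codes
df["Embarked"] = df["Embarked"].astype("category").cat.codes

# 3./4. Normalise each feature: (val - mean) / standard deviation
features = df[["Pclass", "Sex", "Fare", "Embarked"]].astype(float)
features = (features - features.mean()) / features.std()

# 5. The min-max alternative would be:
#    features = (features - features.min()) / (features.max() - features.min())

# 6. Append the bias column of ones to X
X = np.hstack([features.values, np.ones((len(features), 1))])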

> So I wanted to predict the age of a passenger given his/her features - something like, given his/her economic status (class), location....accompanied by .. price of ticket purchased... , try and guess their age.
After cleaning the data a bit, I decided on the label (age).

> Oh, I also needed to squish the Y (the labels) between 0 and 1, as my NN gives me values between 0 and 1! Totally forgot to do this till the very end, hence I kept getting a faulty error function. The error function had errors :D (a tiny sketch of the label scaling is below)
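
# Min-max squish of the label so it matches the network's 0-1 output range.
# Assumes the 'Age' column name from the sketch above.
y = df["Age"].values.astype(float).reshape(-1, 1)
y_min, y_max = y.min(), y.max()
y = (y - y_min) / (y_max - y_min)
# To read a prediction back in years: age = pred * (y_max - y_min) + y_min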


> Formed the backbone of the NN, which had 2 internal layers excluding the input and output layers.
> The weights into a layer followed this format (example for layer 1):
[
     [weight on neuron 1, weight on neuron 2, weight on neuron 3],  # for feature 1
     [weight on neuron 1, weight on neuron 2, weight on neuron 3],  # for feature 2
     [weight on neuron 1, weight on neuron 2, weight on neuron 3],  # for feature 3
     [0.1, 0.1, 0.1]                                                # bias weights
]

> I had to initialise the weight matrices (numpy.random.rand) and initialise the bias weights (0.1) as well.
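
A minimal sketch of how I could initialise them - the layer sizes here are made up, and the last row of each matrix holds the bias weights, matching the layout above:

import numpy as np

def init_weights(n_in, n_out, bias_init=0.1):
    # Random weights of shape (n_in + 1, n_out); the last row is the bias weights
    w = np.random.rand(n_in, n_out)
    bias_row = np.full((1, n_out), bias_init)
    return np.vstack([w, bias_row])

# Made-up architecture: 4 input features, two hidden layers of 3 neurons, 1 output
W1 = init_weights(4, 3)   # (5, 3)
W2 = init_weights(3, 3)   # (4, 3)
W3 = init_weights(3, 1)   # (4, 1)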

> Multiplication between several matrices was a pain! It was too hard to ensure that the shapes of the matrices were maintained - obviously I could not figure this out myself, went wrong at several places real badly, hence I referred to the web.
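
This is roughly how the shapes flow through the forward pass, assuming the made-up sizes from the sketch above and a sigmoid activation (just an assumption; any activation works the same way shape-wise):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias_column(a):
    # Append a column of ones so the bias row of the next weight matrix applies
    return np.hstack([a, np.ones((a.shape[0], 1))])

def forward(X, W1, W2, W3):
    # X already carries its bias column: shape (n, 5)
    a1 = sigmoid(X @ W1)                        # (n, 5) @ (5, 3) -> (n, 3)
    a2 = sigmoid(add_bias_column(a1) @ W2)      # (n, 4) @ (4, 3) -> (n, 3)
    y_hat = sigmoid(add_bias_column(a2) @ W3)   # (n, 4) @ (4, 1) -> (n, 1)
    return a1, a2, y_hat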

So the structure was something like:





> Also yes, for backward prop (climbing down the error hill), for the gradients of all 3 weight matrices of the 3 layers I unfortunately referred to the web for the ready-made formulas - but yes, I understood the partial derivative part and how they derived the gradients w.r.t. the different weight matrices using the chain rule.

So this is the cost function (difference between actual and predicted values) that I plotted using Matplotlib
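
(The plotting bit itself is tiny - something like this, assuming the per-iteration costs were collected in a list during training:)

import matplotlib.pyplot as plt

def plot_costs(costs):
    # costs: list of per-iteration cost values collected during training
    plt.plot(costs)
    plt.xlabel("iteration")
    plt.ylabel("cost")
    plt.title("Cost across iterations")
    plt.show()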



So yeah, the flow seems okay, as in -
> Yes, the cost function seems to be decreasing. So the gap between the actual and predicted values seems to be decreasing at every run.
> But one main thing that I am not doing here is the validation and optimization part.
> This expects me to address the train - cross-validate (model evaluation) - test (error) split.
> And also whether there is overfitting or underfitting in the model. Apparently there are techniques to prevent the model from overfitting (the selected weights for the features are highly inclined towards the training set and do not generalise well to new incoming data)
or
underfitting (the selected weights for the features are too generic and do not predict stuff well).

> Some techniques to rectify overfitting -
                   Have more data,
                   Regularization,
                   Dropout,
                   Change the network architecture.
> Techniques to rectify underfitting -
                   Have more layers,
                   More neurons in each layer,
                   Change the network architecture.

TakeAways:

> Need to decide on the neural net architecture first - layers, neurons, learning rate, num_iterations and stuff.
> Data cleaning - data munging - data wrangling - to clean non-numeric values, enumerate categories and normalize the data (features and labels).
> Initialise the weight matrices - add the initial bias weights (0.1s) as well!
> Add a bias column (column of 1s) to the input features. (bias - used to fit the model better)
> Pandas for data exploration - dataframes, effective filtering, selection and manipulation of the data.
> Understand how the feature matrices are represented, which when combined with the weight matrices result in activations that are passed on to the next layer.
> Ensure the shapes of the matrices are maintained across layers.
> Understand how the gradients for the different weights are calculated using back prop and hence the partial derivatives (the chain rule was confusing!).
> Getting the chain rule formulas right is fiddly - as in
             - the biases that were added in the beginning to the features and weight matrices had to be handled for each set of weights.
             - I had to transpose a couple of results to ensure the right shapes were maintained. Was confusing! Hence copied the formulas off the web.
> Satisfactory decrease in the cost function across iterations! The plot was nice to visualize :) but unfortunately I have handled nothing beyond that.


> Need to incorporate a train / cross-validate / test split for validation of the model.
> Also need to incorporate regularization and dropout to prevent overfitting (being too inclined to the training data).
> Explore the possibility of underfitting as well.
> There is also something called gradient checking - numerically double-checking that the gradients computed for gradient descent are right (a rough sketch is below).
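
As far as I understand, gradient checking is just a centred finite-difference estimate compared against the backprop gradient - a rough sketch (cost_fn and the comparison at the end are placeholders):

import numpy as np

def numerical_gradient(cost_fn, W, eps=1e-5):
    # Centred finite-difference estimate of d(cost)/dW, one entry at a time
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        orig = W[idx]
        W[idx] = orig + eps
        cost_plus = cost_fn(W)
        W[idx] = orig - eps
        cost_minus = cost_fn(W)
        W[idx] = orig
        grad[idx] = (cost_plus - cost_minus) / (2 * eps)
    return grad

# e.g. compare against the gradient from backprop:
# assert np.allclose(numerical_gradient(cost_fn, W1), grad_W1, atol=1e-4)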


> Offf, that was a lot to explore in parallel with other things. I think some of it was incomplete... but it's fine... I think I have a fair idea about the story.
> For the next steps, with this background, I think I'll dive into CNNs and learn on the go (optimizations can be done directly w.r.t. CNNs).
> Have also enrolled for Andrew Ng's Deep Learning course on Coursera. (One can audit this for free as well!)
> Have also audited the linear algebra course on Coursera - the math for ML - to sort of be able to appreciate the math better! Not sure if I can keep up.

Also, not sure if posting the code makes sense. I feel the satisfaction of seeing one's own code in action is awesome! So even a bit of effort to go out there and make an attempt to understand already-written code snippets (which they say is far more challenging than writing your own code) is worth it! :)

End.

Friday, April 5, 2019

CNN contd..

 Ok so far, I've understood -

> Forward propagation - where I calculate the outputs of a given layer using the outputs of the previous layer, by fusing them with the previously initialised weight matrices.
> Gradient descent - where gradients are calculated and subtracted from the weights assigned to the neurons in order to minimize my error.
> The update of the weights has to be done across ALL layers.

Consider this example:

The second layer's weights have the shape (2,3) - to map the 2 outputs of the previous layer to 3 neurons.
Hence layer 2's weights are given by a single matrix.

> layer 1's matrix : X = [x1 x2]... (1,2) shape

In reality X will be a 2D matrix with one row per sample, e.g.:
X = [
             [x1, x2],    # row 1 - in our case, maybe a row of pixel values of an image
             [x11, x22]   # row 2
      ]

W1 = [
               [w11 w12 w13],   # maps x1 to all 3 neurons
               [w21 w22 w23]    # maps x2 to all 3 neurons
         ]... (2,3) shape

and similarly for W2, W3 and the other weight matrices.
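
A quick NumPy check of those shapes (the numbers are made up; only the shapes matter):

import numpy as np

X = np.array([[0.2, 0.7],      # sample 1: features x1, x2
              [0.5, 0.1]])     # sample 2                        -> shape (2, 2)
W1 = np.random.rand(2, 3)      # maps the 2 features to 3 neurons -> shape (2, 3)

z2 = X @ W1                    # (2, 2) @ (2, 3) -> (2, 3): 3 activations per sample
print(z2.shape)                # (2, 3)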

> So, using the weight matrices in each layer, I calculate the activations of the next layer. (Matrix ops)
> Next I need to calculate the final error: J(W) = (actual value - predicted value) / total    # averaged over all samples

> J(W) depends on the final weights W3, which combine with activations that in turn depend on W2, which depend on W1 and on the input features (X)... I think intuitively it makes sense, as each layer's activations were calculated from the previous layer's activations and the weights into that layer.

Now, I need to rectify my error a.k.a change my weights

> Backpropagation - so I am backtracking, starting from the last layer towards the first layer, to see how each of the weight matrices influences the final error.
> I also realised in the last post that in order to get how a variable independently influences a function, I need to consider all other variables as constants and take the derivative of that function w.r.t. the variable of interest. This is more or less the definition of the partial derivative w.r.t. that variable.
> So I take the partial derivatives of the final error function with respect to the individual weight matrices and propagate them back to the respective layers, so that they can "tweak" their weights accordingly.

Something of this sort:
J(W) = (blah)(y - y^)            # y - actual value

y^   = act_function(z(4))        # act_function - the activation function
z(4) = a(3) * W(3)               # W(3) - the weights used to reach layer 4 from layer 3

a(3) = act_function(z(3))
z(3) = a(2) * W(2)

a(2) = act_function(z(2))
z(2) = a(1) * W(1)

a(1) = X                         # X - the input features


So I can say that -

change in J w.r.t. W2 = (change in J w.r.t. y^)
                                    * (change in y^ w.r.t. z(4))
                                    * (change in z(4) w.r.t. a(3))
                                    * (change in a(3) w.r.t. z(3))
                                    * (change in z(3) w.r.t. W2)!!

I need to compute this much, which (scaled by the learning rate) I then subtract from the existing W2. This is sort of equivalent to me "taking a step down" from that hill seen in previous posts (w.r.t. W2!)..

> I have tried not to include any formulas, to sort of retain continuity, as formulas intimidate me! (Although once I understood the story, it shouldn't be hard to understand them as well.)

> With a decent background in derivatives, I could follow the backprop formulas on the web.
(Just skimmed through the one given on the matrices.io website.)

https://matrices.io/deep-neural-network-from-scratch/

> I just had to know the derivative of hyperbolic tan (tanh), as they have used it as the activation function.

> So now I know the story of forward prop and backward prop - how backward prop uses partial derivatives to carry the error caused by the weights backwards.

Pseudocode could be like:

for given number of iterations:
     forward_propagation()     # predicts stuff
     backward_propagation()    # calculates the gradients of all layers in the backward direction
     update_weights()

  > Update the weight matrices (w = w - alpha * gradient_w)... alpha - the learning rate.
  > The shape of a weight matrix w and its gradient_w have to be equal, as it is an element-wise subtraction.
  > The element-wise subtraction shows how the weight of each feature is adjusted accordingly - something like how important a given feature is to a neuron of a layer.
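
To tie the loop together, here is a minimal runnable sketch under simplifying assumptions of mine - a single hidden layer, tanh in the hidden layer, a sigmoid output, mean squared error as the cost and no bias terms (biases are discussed separately below) - so it is not exactly the network from these posts:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=3, alpha=0.1, num_iterations=1000):
    # X: (n, d) features, y: (n, 1) labels squished between 0 and 1
    n, d = X.shape
    W1 = np.random.rand(d, hidden)
    W2 = np.random.rand(hidden, 1)
    costs = []
    for _ in range(num_iterations):
        # forward propagation
        a2 = np.tanh(X @ W1)                   # (n, d) @ (d, hidden) -> (n, hidden)
        y_hat = sigmoid(a2 @ W2)               # (n, hidden) @ (hidden, 1) -> (n, 1)
        costs.append(0.5 * np.mean((y_hat - y) ** 2))
        # backward propagation (chain rule)
        delta3 = (y_hat - y) / n * y_hat * (1 - y_hat)   # dJ/dz(3), shape (n, 1)
        grad_W2 = a2.T @ delta3                          # same shape as W2
        delta2 = (delta3 @ W2.T) * (1 - a2 ** 2)         # tanh' = 1 - tanh^2
        grad_W1 = X.T @ delta2                           # same shape as W1
        # update the weights
        W2 -= alpha * grad_W2
        W1 -= alpha * grad_W1
    return W1, W2, costs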
 



Bias:
> Had left this part aside for a while, as I thought understanding bias needed formulas.
> Now, the activation that each layer computes is given by:
 a(layer) = activation_function(z(layer))
 where,
 z(layer) = a(layer - 1) * weight(layer - 1)    # FMI: a(0) = feature set

> This activation is a nonlinear function that transforms the linear combination features * weights, so that otherwise non-observable patterns can be found, and (for activations like the sigmoid) it squashes the result between 0 and 1 so it can be read as a probability.

> Some of the functions that I came across were tanh (hyperbolic tan), ReLU (rectified linear unit) and Sigmoid - apparently there are lots of them. Which one to choose - not discussing that now - I'll probably refer to this when required (a small sketch of a few of them is below):

http://cs231n.github.io/neural-networks-1/     # refer to the commonly used activation functions
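
For reference, the standard definitions (and the derivatives that backprop needs) in NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2   # derivative of tanh, used by the matrices.io walkthrough

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)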

> Now, using the website https://desmos.com, I plotted the tanh function to see how it looks

a = tanh(x)
looks something like


   







Clearly the y values are between -1 and 1, and I see the graph making its sharp ascent around x = 0 and approaching 1 soon after.

But what if I want the graph to look something like this?

The above graph has the equation
a = tanh(x - 1.4)


where the graph has its sharp ascent at around x = 1.4... probably this predicts stuff better...

> Now, that -1.4 is a variable, and I do not know its "correct" value up front - it has to be determined based on the training set. I call this value the "bias", and it helps me construct my model.
> Hence I make the bias a part of the weight matrices, which I use in both forward and back propagation.
> The 'feature' corresponding to the bias will just be a row of 1s, with a corresponding row of bias weights in the weight matrices.
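
(The same Desmos experiment can be reproduced with Matplotlib, just to see how the bias shifts where the curve ascends:)

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 200)
plt.plot(x, np.tanh(x), label="tanh(x)")
plt.plot(x, np.tanh(x - 1.4), label="tanh(x - 1.4)")   # shifted by the 'bias'
plt.legend()
plt.show()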

> Now, because I am subtracting, say, a gradient w.r.t. W3 from the actual matrix W3, it's obvious that both of their dimensions need to be the same.

e.g.: if a(3) has the shape (5,2), a variable delta has the shape (5,1), gradient_W3 comes from multiplying a(3) and delta, and the actual W3 has the shape (2,1) - then I need to transpose the matrix a(3) to make it (2,5), so that
(2,5) . (5,1) = (2,1), which I can then use for an element-wise subtraction.
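
In NumPy terms, that shape juggling is just:

import numpy as np

a3 = np.random.rand(5, 2)      # activations of layer 3, shape (5, 2)
delta = np.random.rand(5, 1)   # error term, shape (5, 1)
grad_W3 = a3.T @ delta         # (2, 5) @ (5, 1) -> (2, 1), same shape as W3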

Handling bias in the gradients..
> Bias - to handle the gradient of the bias, consider the example:
 z(3) = a(2) * W(2)
W(2) has an additional row for the bias, but that structure is gone once z(3) is calculated. Hence, while backpropagating the error (computing gradient_W2 from the error at z(3)), the fact that W(3) carries an additional row for its bias needs to be catered for as well - the back-prop formulas take care of that.
> But the additional bias entries in the gradient for W(3) cannot be used when propagating the error back for W(2); they need to be removed.
> If the above is too difficult to understand, I can just stick to that website; it explains this a bit more clearly.

PS:
The formulas to calculate the gradients in backpropagation,
adjusting the matrices of activations (transposing) to match the weight matrices,
and adjusting matrices to include the bias corrections (and re-adjusting so that they are not propagated further)
are all compiled on that website.

It might get complicated. I believe the formulas can just be skimmed - one only needs a vague idea of what is happening. But it's a nice exercise for the brain to try and understand the formulas from that website.


TakeAways:
- Forward propagation - calculate the activations in each layer using the previous layer's activations - finally predict the value.
- Use the above to calculate the error.
- Back propagation: starting from the last nodes, calculate the gradient w.r.t. the immediately preceding weight matrix (partial derivative) and use the chain rule to propagate the error to previous layers (by taking partial derivatives w.r.t. the various weights).
- Can check desmos.com to realise why we need biases.
- Update the weight matrices using the gradients calculated in the previous step (don't forget the learning rate - it decreases the step size while descending the hill).
- Adjust the activations and biases while back propagating.
- General pseudocode is given above.

- Finally, I think I have the stuff to put together an end-to-end crude NN setup (without any error handling).
- Shall try and implement stuff in the next one! (For sure this time! :-| Wanted to have this post since I had left out backprop and bias.)


End.