Monday, March 25, 2019


CNN contd..

> Now, the graph that I had in the last post did not explain everything in detail.

> It was a graph of the error function J(W) (y-axis) vs 1 parameter (or 1 weight) that is supposed to be multiplied with 1 feature.
> But in reality there is a large number of weights (in matrix form) that are multiplied with the features.
> Implying that the error function, or the cost function, is affected by each and every weight parameter used across all layers of our network.
> So now, how will I get to know how my error function changes once I change my weights? I cannot really rely on a 2D graph.

> So, I need to be able to constantly change all of my weights such that the error decreases in every iteration.

> But by how much?

This is when I came across gradient descent, which says: use a 'gradient' to change the weights that you initially used to predict the final class.

Something like :

for each iteration:
    for each layer i:
        gradient_w11 = determine_gradient_for_weight_on_neuron1()
        gradient_w12 = determine_gradient_for_weight_on_neuron2()
        ... and so on

        new_weights_on_neuron1 = original_weights_on_neuron1_changed_by(gradient_w11)
        new_weights_on_neuron2 = original_weights_on_neuron2_changed_by(gradient_w12)
        ... and so on

        calculate_error_with_new_weights()  # check if this is decreasing!
        # (in case I forget: the error is the potato being predicted as a book!)
 
> Of course, I won't be calculating the gradients for each weight on each neuron individually - it's all done using matrices! (NumPy)

> Now, how is the gradient computed?

> This is where the math that I was taught a couple of years ago comes into the picture.
It's about finding the derivatives of functions (differentiation). The derivative of a graph, a.k.a. the slope of the tangent to the graph - according to my understanding - will give me the direction of the rate of growth of the function.

> This is what I understood - if anyone comes up with a better/simpler understanding, please feel free to post it in the comments.


This is what I found on Google Images. It's actually from the YouTube channel 3Blue1Brown, which is pretty amazing! (https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)

So, bottom line: if we assume the graph on the top LHS is our error function (probably with 2 weights vs error), and take the derivative of the error function (with respect to the weights), we get the direction of growth of the function.

But we want the exact opposite thing!
We want to descend that hill.
Hence they say that reducing our weights by the obtained gradients enables us to descend the hill on the left => implying that we are trying to obtain the weights (for our features) for which our error (the graph on the left) is minimized.


I think, intuitively, it makes sense: the gradient tells me by how much the function grows when I increase my weight. If I decrease the weight by that gradient, the function comes down, right?

Now, the derivative that I talked about can be with respect to any of the weights.
The derivative of the graph/error function (y = f(x)), in its dy/dx form, tells me how the function y (error) changes (more like increases) when there is a tiny change in the value of x (weight).

> On similar lines, I'd want to know how the function y changes for tiny changes in the weights w11, w12, w13, w14... and so on.

> For this, there is this partial derivative, which says how the graph/error function/y changes for tiny changes of a particular variable (weight) while others (other weights) are treated as constants.
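
To make the "others treated as constants" idea concrete, here is a tiny numerical sketch in Python (the function J here is completely made up, not our network's real cost - the point is just the nudging):

# Toy illustration: partial derivatives via tiny nudges (finite differences).
# J here is a made-up error function of two weights, NOT the real network cost.
def J(w1, w2):
    return (w1 - 3) ** 2 + (w2 + 1) ** 2

h = 1e-5
w1, w2 = 4.0, 3.0

# How does J change if I nudge only w1, keeping w2 constant?
dJ_dw1 = (J(w1 + h, w2) - J(w1, w2)) / h   # ~2.0, i.e. 2 * (w1 - 3)
# How does J change if I nudge only w2, keeping w1 constant?
dJ_dw2 = (J(w1, w2 + h) - J(w1, w2)) / h   # ~8.0, i.e. 2 * (w2 + 1)

print(dJ_dw1, dJ_dw2)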

>Something like:

for a given number of iterations:
    for each layer i:
        gradient_w11 = dJ(W) / dw11   # error function J(W)
        gradient_w12 = dJ(W) / dw12   # the 'd' here should really be the partial derivative symbol ∂
        ...
        ... and so on for all the weights

> Again, in the above step, everything is done with matrices (enter NumPy!)

> Ok, now that I have my gradients, as I said above, I just need to subtract these from the original weights to descend the hill (similar to sliding down the bowl!)...
i.e

original_w11 = original_w11 - gradient_w11
original_w12 = original_w12 - gradient_w12
... and so on (but using matrices in a single step!!)
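
A minimal NumPy sketch of what that single matrix step could look like for one layer (the shapes and the numbers are made up just to show the idea; in reality the gradient matrix comes from the partial derivatives above):

import numpy as np

# Made-up example: a layer with 2 input features and 4 neurons,
# so its weights form a 2x4 matrix (one column per neuron).
W = np.array([[0.1, 0.4, -0.2, 0.3],
              [0.5, -0.1, 0.2, 0.1]])

# Pretend these are the partial derivatives of J(W) w.r.t. each weight,
# arranged in the same 2x4 shape (in reality they come from backprop).
grad_W = np.array([[0.02, -0.01, 0.03, 0.00],
                   [-0.05, 0.04, 0.01, 0.02]])

# All the "w = w - gradient" updates for this layer in one vectorized step:
W = W - grad_W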

> Now, consider this example,
#prediction function
z = 2(w_1)^2 + 4(w_2)^3

partial_der(z)/ w.r.t (w_1) = 2 * 2 * w_1 + 0        --- gradient_w1
partial_der(z)/ w.r.t (w_2) =  0 + 4 * 3 * (w_2)^2 --- gradient_w2

Now if my original weights for my features were , w_1 = 4; w_2 = 3

z = 2 * 16 + 4 * 27 = 32 + 108 = 140

//re-assigning my original weights

for w_1
w_1 = w_1 - gradient_w1
w_1 = 4 - (2 * 2 * 4)  = -12

and for w_2
w_2 = w_2- gradient_w2
w_2 = 3 - (4 * 3 * 3^2) = 3 - 108 = -105

now my new prediction z is
z = 2 * (-12)^2 + 4 * (-105)^3 = 288 - 4630500 ... that's a lot!
but the transition seems too damn high!! from 4, 3 we reached -12, -105 !! :-O

we might have actually missed the minimum value of the error function,
something like this:
So we might have jumped right across to the other side of the graph (of course, the graph does not exactly correspond to the weights example I've shown above)





So we need to jump down slowly!! In smaller steps!

So now, there is this term called the "learning rate" (the alpha symbol) that I can use to control my steps.






Something like:

w_1 = w_1 - (learning rate) * gradient_w1
w_2 = w_2  - (learning rate) * gradient_w2

Now, if I take my learning rate to be like 0.01 or something, and plug the values into the above equations, I should get a decently small step, resulting in


So that is good, and there is still hope of reaching the bottom, or the minimum (the global minimum), where I can be assured that my system will accurately (to some extent!) predict that the food item is a potato and not a book!

(FYI - the left image is, again, only with respect to 1 weight!! So it's weight vs error function.)
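
Plugging the learning rate into the earlier worked example, just to double-check the arithmetic in plain Python:

# Re-checking the worked example: z = 2*(w_1)^2 + 4*(w_2)^3
w_1, w_2 = 4.0, 3.0

gradient_w1 = 2 * 2 * w_1           # dz/dw_1 = 4*w_1   = 16
gradient_w2 = 4 * 3 * w_2 ** 2      # dz/dw_2 = 12*w_2^2 = 108

# Without a learning rate: the huge jump from before
print(w_1 - gradient_w1, w_2 - gradient_w2)                    # -12.0  -105.0

# With a small learning rate: a much gentler step downhill
alpha = 0.01
print(w_1 - alpha * gradient_w1, w_2 - alpha * gradient_w2)    # 3.84   1.92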


So the challenge is to choose the best learning rate (alpha) so that the steps aren't so small that the system takes ages to reach the minimum - i.e. they say the model will take a long time to converge!

We don't want that, nor do we want the system to shoot across to the other side of the graph, clearly missing the minimum, just because we chose a large alpha (learning rate).





TakeAways:
> How gradient descent works - the derivatives of a function give the direction of the rate of growth of that function, and we want to go in the opposite direction.
> I achieve the above by taking partial derivatives of my cost function with respect to each of my weights, on each neuron, across all layers! Of course, I need to do a vectorised implementation of this. No loops!
> Of course, I cannot be all greedy and decrease the weights left and right! That leads to me missing the minimum (the point where the error is minimum) and jumping off to the other side of the bowl.
I need to carefully choose something called a learning rate that I will use along with my gradient while descending the hill (while subtracting from the original weights).
> Some of the above facts might be redundant, but the point was to emphasize certain things, to make myself understand them better (even the pseudo code is for the same purpose).
> Hmm, I might have enough to try out a few things, like say,
Initialize my:
 - input, output and hidden layers - number of neurons on each layer - weights for all neurons across all layers - a learning rate - the number of iterations for my gradient descent (how many times do I take the steps down the hill)
- compute outputs for each neuron, across layers, and finally predict the output.
- Not sure if I can also implement gradient descent (back propagation - of course I will have weights across multiple layers, which in turn makes the calculation of partial derivatives in each layer difficult), as I still need to know what the partial derivative formula is (chain rule).

- Oh yeah, I haven't discussed the bias which I will be adding while multiplying the weights with the features. I'll probably bring in some activation function as an example (tanh, ReLU, sigmoid) to sort of visit bias usage, and also probably talk about whose partial derivatives I should be calculating - for which, again, I need to make use of an activation function.

- So as a part of the 'sub-idea' that I want to work on, after covering the concepts of backprop and bias in NNs, I want to implement the entire flow without any libraries - I do not care about the accuracy, but I would like to see the cost decreasing at every iteration and the weights getting updated. I'll obviously use stuff for reference, but I'll try writing it myself.


End. 

Friday, March 22, 2019


CNN contd..
Begin.

> So, before I start understanding Conv Neural Nets in more detail, I thought I'd explore a bit more of the basic Neural Net.
> After a bit of Googling and YouTube, I came across this site that has one of the best, most comprehensive, detailed descriptions of a Deep Neural Net.
> It is simply mind blowing how they could accommodate all the necessary information. They have everything - best part - they have used an example to explain the flow.

https://www.matrices.io/deep-neural-network-from-scratch/

> Yes, it's a huge post, but I believe I can revisit it several times if needed; the way they have described it is beautiful.
> From going through about half of the page contents, I am able to have a fair understanding of things like -
 - Represent features as matrices
 - Feed the features into the neural network
 - How the NN fuses the features with weights. Matrix multiplications of weights.
Something like:
for each layer:
    for each neuron in a given layer:
        z = 0
        for each feature coming from the previous layer:
            z += feature * weight_on_this_neuron_for_that_feature
        res = some_sort_of_non_linear_function(z)  # e.g. ReLU, sigmoid, tanh... basically they squeeze res into a given range, which we can use for setting a threshold or cut-off when predicting valid results.
____
> Now I cannot afford to have the above loop, as it is computationally expensive for the computer.
> Hence the smart people have come up with vectorised computations (using libraries like NumPy, which make use of optimized underlying C functions).
> A line in the post below said: 'the vectorized NumPy call wins out by a factor of about 70 times:'

https://realpython.com/numpy-array-programming/

Here, they try to calculate the maximum profit when you are asked to make 1 purchase and 1 sale. They have 2 versions of the same - loops and NumPy - and NumPy seems super simplified (2 lines!).
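
From memory, the vectorized version is something along these lines (not copied verbatim from that post, so treat it as a sketch):

import numpy as np

prices = np.array([7.0, 1.0, 5.0, 3.0, 6.0, 4.0])   # made-up price series

# Best profit from one purchase followed by one sale:
# the running minimum is the cheapest price seen so far at each step,
# so (price - running minimum) is the best profit if we sold at that step.
cheapest_so_far = np.minimum.accumulate(prices)
max_profit = np.max(prices - cheapest_so_far)        # 5.0 here (buy at 1, sell at 6)
print(max_profit)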
____

So, continuing with NN,
- The non-linear function applied to a neuron's output is called the activation function; its result is the neuron's activation.
- Realised how the output of 1 layer's neurons is passed on to the next layer.
- In general, for each layer, the task of multiplying the previous layer's outputs (if the previous layer is the first layer, then the input is the given features) with the weights of all the neurons of this layer can be expressed as 1 step!

Some videos that I saw:

https://www.youtube.com/watch?v=aircAruvnKk

https://www.youtube.com/watch?v=2-Ol7ZB0MmU&t=1062s

There is a lot more stuff that I can visit; understanding has been a cumulative process.

Little beastly matrices for math!


From that website (matrices.io),
> I could make sense of what I have written on the LHS: x1 and x2 are the features (in my case, the pixel intensity of the food at the top-left and bottom-right corners) in the first layer.
> There are 4 neurons on the second layer, with their own weights expressed as a matrix.
> The first row corresponds to the weights for the feature x1 and its connections with the 4 neurons in that layer.
> The same goes for the second row of the weight matrix, which is w.r.t the second feature's weights for all 4 neurons of the second layer.
> On asking NumPy to multiply this, we get a variable Z, which we pass to σ, the activation function (ReLU, sigmoid, tanh...).
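
A small NumPy sketch of that exact picture - 2 features, 4 neurons on the second layer, and a weight matrix with one row per feature (the actual numbers are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 input features x1, x2 (say, two pixel intensities), as a 1x2 row vector.
X = np.array([[0.6, 0.2]])

# 4 neurons on the second layer -> a 2x4 weight matrix.
# Row 1: weights connecting feature x1 to each of the 4 neurons.
# Row 2: weights connecting feature x2 to each of the 4 neurons.
W = np.array([[0.1, -0.3, 0.5, 0.2],
              [0.4, 0.1, -0.2, 0.3]])

Z = X @ W          # 1x4: one weighted sum per neuron, all in a single step
A = sigmoid(Z)     # apply the activation σ element-wise
print(A)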

> Now we have values that we can probably use to predict the final class (food ingredient), but what about taking care of the correctness??






Error Optimization:


I cannot afford my system detecting the object to be a book when it's actually a potato (but what if it's a book that has a potato pic.. hmm.. for next time)!


> The website had summarized certain things I could actually relate to, things like how I compute my error (predicted value - actual value).

> It's the J(W) formula that I have written above - they call it the cost function (it tells how much error there is), which we need to minimize.
> The formula is actually intuitive: y is the actual output class (potato), y^ (book) is the calculated value, the differences are squared to remove the negative sign, and everything is summed up to get a final, accumulated, collective error.

> Now, that error function is a convex function, of a form like y = x^2 - a bowl-shaped graph - once again indicating that it actually has a minimum value (the bottom of the bowl).
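
To make that concrete, a tiny sketch of the "square the differences and sum them up" idea in NumPy (the exact J(W) on the matrices.io page may carry an extra 1/2 or averaging factor, and treating potato/book as 0/1 targets here is my own simplification):

import numpy as np

y     = np.array([1.0, 0.0])   # actual: it IS a potato (and not a book)
y_hat = np.array([0.3, 0.7])   # what the network currently predicts

# squared differences remove the negative sign, the sum collects the total error
J = np.sum((y_hat - y) ** 2)   # (0.3-1)^2 + (0.7-0)^2 = 0.49 + 0.49 = 0.98
print(J)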

> So this is my error function, and I am here right now,













But I want to reach here! Hence I badly want to slide down (gradient descent) this bowl to the bottom.

















TakeAways:


> So, a bit more on how I can represent the input features - and the weights on each neuron across layers w.r.t each input feature - using matrices. (Each row in the weight matrix represents one feature's connections with all the neurons in that layer.)

> The rough algorithmic steps involved.
> Some sort of mathematical description (image).
> How NumPy is all being boss when it comes to computations.
> It's a book! Not a potato?? No can do! - this is where I cannot afford to overlook error optimization - which I realised is a bowl, and I just need to slide to the bottom, and it's all "downhill" from that point :D

- In the next one I intend to write a bit more on gradient descent (which is the technique for sliding down the bowl) and back propagation, which is the technique our NN uses to work out how much each weight contributed to the error.

- Post this, probably try and implement a basic "sub-idea" NN to see stuff in action.
- And yes, I haven't mentioned the 'bias' thing that I have written about; shall visit that as well.
- Then visit Conv NN.
- Exams going on, might get delayed by a bit. End.

Wednesday, March 20, 2019

Convolutional Neural Network:

Ok so here is what I have understood so far. The story might be wrong, vague, in which case, do let me know.

> Of course, the way I have understood it might need a lot of refinement, but this is based on what I have accumulated over the past 2 days through the web.
> I shall try and keep it as simple and intuitive as possible (say, for a person who, like myself, doesn't have the necessary background).

> So, for starters, a Neural Network is a network of neurons spanning multiple layers.
A neuron, just like a brain's neuron, accepts several input signals (input features in our case) and spits out an output signal (output value).
> A simple neural network can look something like this:

 
This is exactly what I am trying to do, but for identifying food ingredients.

> Every neuron accepts the features (in the above case - image pixels), multiplies those features with certain weights and passes the output to the next layer.
> The above steps are done in a very cool way using matrix operations, with libraries like NumPy in Python, as that is way faster than having several loops.
> In the last layer (output), there are 2 neurons - each of which will have a probability value that gives information like - "how sure it is that the picture is a cat or a dog".

> Now, this probability is what we are trying to optimize for correctness.
> If we backtrack, the thing which is under our control is not the features, as that is our input and we cannot change that, but the weights - those weights were decided upon by our neural network.
> Hence, optimization of the neural network involves something called backpropagation (gradient descent) that "tweaks" or "tunes" its dials to "rectify" those weights, to cater to identifying the cat or dog picture correctly.

> So the network has to "learn" to do this. It learns when we give it certain "correct" results. Hence this is some sort of supervised learning, where we supervise or monitor whether the learning is happening properly, in contrast to unsupervised learning, where the neural net thinks it's smart enough to give accurate results without asking us for already-correct results.


Relate:
Now, for my requirement, in the end -
- I would like the computer to "see" the food items and try and identify the food item correctly.
- Hence my video feed is the input to the neural net. Video is nothing but a stream of static images, hence image is my input.
- So if I am able to do this first - when I throw an image at the neural net and it is able to identify the image - I think I might have gotten closer to the final result.
- But from what I know through Open CV, an image is a 2D array (not considering channels as of now). So there would be a huge number of neurons if I were to represent each pixel as a neuron, and that is way too costly.
- Hence the smart people have come up with something called a Convolutional Neural Net, where some sort of windowing is done, which I'll try and describe in the next post.




TakeAways:
- Neural Net has several layers of neurons.
- A neuron takes several input features and spits out an output - it does this by multiplying the input features with weights - and subjecting the result to some sort of function (activation function) to spit out a value.
- Based on the output I try and predict the result; hence I need to optimize the weights.
- Optimization is done through back propagation - which involves determining the amount of error (that is, the difference between the correct output and the predicted output) and "tweaking" my weights.
- But for an image, normal method of giving image pixels to neurons is costly, hence I refer to a convolutional neural net approach.

- I realised that by taking a top-down approach when we learn something, we appreciate things better. Yes, I know we might not have an in-depth understanding of the topics, but I think that's okay, as eventually we will get there if we pursue it long enough. But initially, if we have some sort of motivation, a real-world use case, when we explore something, we definitely will appreciate it better. (And this has to be the case when one starts his bachelor's course.)

Monday, March 18, 2019

Begin.

> So, understanding how the system interprets a live video stream using Open CV was a good exercise.
> Now I would want the system to smartly identify objects by itself from the feed. I should not be driving the identification step.
> Hence, with that, I came across the Convolutional Neural Net, which can be used to find patterns and identify images (and extend that to a live feed).
> Obviously there is a lot to know before I can proceed with the implementation. (Although in the end it might just be the part where I provide my training data and invoke a few function calls, like every time.)
> I would like to understand it from the ground up, to a decent extent.
> Shall visit the necessary things and create the next post.
> Of course, I need to keep the main goal at the back of my mind; once I feel I am equipped with the necessary things, I can move on.

End.

Thursday, March 14, 2019



Ideation contd.

Trying to implement :

Selectively ask the computer to track a given area from the real-time video stream.

And it seemed to be working -

> Select a custom area in the live feed by drawing a bounding box; the computer calculates the histogram of that area.
> That area's histogram is compared against the current live feed - the mean shift algorithm gives you a "track window" that specifies the "most relevant" area matching your bounding box.
> Draw a rectangle in the place of that "track window" for every frame of your live feed.



> The mean shift algorithm tries to detect a bigger concentration of white points.
Hence one disadvantage is that if the object enters the frame from some other position, the algorithm initially fails to detect it, as it fails to find the "new concentration" of white points.
> The mean shift algorithm also does not keep track of the size of the object of interest.
Hence I had to move to the CamShift algorithm.



Using CamShift, I could keep track of the size of the object, and its orientation as well:



1. As you can see, the green box is the ROI that I interactively selected from the live feed window (top second)

2. The black-and-white image is calculated in real time using the histogram of the ROI and the back projection function of Open CV... it is the mask - the white portion indicates the relevant portions of the frame, and black corresponds to the non-relevant ones.

3. I use this mask from step 2 (obtained from the back-projection method) and give it to the CamShift algorithm, which gives me the track window, its orientation and size; this in turn is passed to the ready-made Open CV function that calculates the relevant co-ordinates, and finally I draw the corresponding rectangle (the blue coloured box in the frame).
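
Roughly, the per-frame part of that flow in Open CV looks like this (a sketch following the standard meanshift/camshift tutorial pattern; it assumes cap, roi_hist and an initial track_window were already set up from the selected ROI):

import cv2
import numpy as np

# Assumes: cap (VideoCapture), roi_hist (hue histogram of the selected ROI),
# and track_window = (x, y, w, h) from the initial bounding box.
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back projection: the black-and-white "relevance" mask from step 2.
    dst = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)

    # CamShift: track window + size + orientation (unlike plain meanShift).
    rot_rect, track_window = cv2.CamShift(dst, track_window, term_crit)

    # Draw the rotated box (the blue box in the screenshots).
    pts = np.int64(cv2.boxPoints(rot_rect))
    cv2.polylines(frame, [pts], True, (255, 0, 0), 2)

    cv2.imshow("tracking", frame)
    if cv2.waitKey(30) & 0xFF == 27:   # Esc to quit
        break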


Takeaways :
A lot to take in, but bottom line, tracking a given Region Of Interest using its histograms - back projection - and finally mean shift OR cam shift algorithms to get the track window that is drawn on the live feed.

Trying to think about what I could do with the things explored so far,
1. I can probably try and revisit what I wanted to do in the first place - the system should be able to detect food items.
2. For food items, I need features - I need to revisit the algos that detect features (SIFT, ORB...).
3. Without incorporating any intelligence, a brute force approach could be, I select the ROI, the system matches the "features" of the ROI with the predefined set of features of various food items and based on most matches, gives the best result.
 Or I can wait for a few more video tutorials where I might encounter something new.

PS: Of course there were lighting issues; I think that is fine at the moment. As of now, I am trying to understand how the thing works in the background.

End.




Begin:

Identification contd.

> Tried using the histogram of a road, to extract only the road from a traffic image.
Steps were like:

> Get the HSV of both the template and original image.
> Get the histogram of the template road.
> Use the back projection method of open CV to extract only those parts that match with the histograms of the template.
> The next important step is to optimize the match, using kernel estimation and thresholding.

a. Kernel estimation, according to my understanding, estimates the density/intensity of a given pixel using a prescribed filter (an ellipse of a given size, a circle of a given size, etc.).
b. Thresholding tells the system to consider all the values below a threshold as black and those above it as white.

The above 2 filtering steps are applied to improve the mask that selects the road:













The final step is to merge (bitwise AND) the mask with the original image to get the image below on the left.
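
In Open CV terms, the whole road-extraction pipeline is roughly this (a sketch following the standard back-projection tutorial; the file names are placeholders):

import cv2
import numpy as np

road_template = cv2.imread("road_patch.jpg")    # placeholder file names
traffic_img   = cv2.imread("traffic_scene.jpg")

# 1. HSV of both the template and the original image
hsv_template = cv2.cvtColor(road_template, cv2.COLOR_BGR2HSV)
hsv_target   = cv2.cvtColor(traffic_img, cv2.COLOR_BGR2HSV)

# 2. Histogram of the template road (hue and saturation channels)
roi_hist = cv2.calcHist([hsv_template], [0, 1], None, [180, 256], [0, 180, 0, 256])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# 3. Back projection: where in the scene does the histogram match?
mask = cv2.calcBackProject([hsv_target], [0, 1], roi_hist, [0, 180, 0, 256], 1)

# 4a. "Kernel estimation": smear each pixel's score with an elliptical filter
disc = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
cv2.filter2D(mask, -1, disc, mask)

# 4b. Thresholding: below the cut-off -> black, above -> white
_, mask = cv2.threshold(mask, 50, 255, cv2.THRESH_BINARY)

# 5. Merge: bitwise AND the 3-channel mask with the original image
mask_3ch = cv2.merge((mask, mask, mask))
road_only = cv2.bitwise_and(traffic_img, mask_3ch)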




TakeAways:

1. Using histogram of the template ROI (region of interest) to filter out specific regions from the given image.
2. Best use case that I could think of is to track a given object in real time.
3. Next, I would want to be able to custom-select an area in real time, and start tracking the parts of the image that match the given bounding box.


To be contd..


Wednesday, March 13, 2019

Identification:

Begin.

Trying to get closer to what it takes to identify a person or an object.

> Realised the concept of how the computer can use histograms to sort of identify things.
> Plotted various histograms of the R, G and B channels of an image to see how the different colors are distributed.
> Tried plotting histograms for real-time video; turns out it's a bad idea :D
>
 image example - clearly the green component is dominant

> Next, to explore more of the mouse events, I want to be able to get a histogram of a custom selected area from a real-time stream.
> Using multiple flags (top_left, bottom_right - extracting the x and y ranges of the bounding box - slicing)
>

Pretty decent output. In the end, the result was:
a. Draw a bounding box over the live video stream (draw the box using mouse callback events by locating the position of the pointer).
b. Crop out the image inside the bounding box.
c. The cropped image is passed to a function that 'splits' the image into the 3 channels.
d. Flatten each channel and plot its histogram with 256 bins (intensities 0-255) for each of the colors.
As seen in the above pic, the histogram says the cropped image is dominated by the green color.
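
The per-channel histogram part of that could look roughly like this (a sketch, assuming 'cropped' is the image inside the bounding box):

import cv2
from matplotlib import pyplot as plt

# 'cropped' is assumed to be the BGR image inside the selected bounding box.
b, g, r = cv2.split(cropped)

# One histogram per colour channel, 256 bins covering intensities 0-255.
for channel, colour in zip((b, g, r), ("b", "g", "r")):
    plt.hist(channel.ravel(), bins=256, range=(0, 255), color=colour, alpha=0.5)

plt.show()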


TakeAways:
> A call back mouse event can be registered to track co-ords of the pointer and also various events.
> Use the above technique to draw a bounding box. Using this for selection in the image.
> Use plt.hist() function to plot the color distributions of the image.

End.

Thursday, March 7, 2019

Begin.

Sub - Idea implementation :

> After the Harry Potter Cloak project, I couldn't help but explore more of Open CV, which is when I encountered mouse events in Open CV.
> This is when I had my second idea in mind - to be able to accept mouse events and use them for perspective projection. Kinda like: the selected area has a perspective that goes away from the screen, which you "rectify" and project directly onto the screen. A pretty cool idea when you want to scan docs (as seen in the CamScanner app).

Steps were pretty clear and simple:
> Accept co-ords of mouse pointer using callback function, keep track of the points.
> Color those locations to sort of make it readable.
> Also, explored options where user could delete the points set (by pressing 'd' key, Basically you can have various events).
> Lastly, ask Open CV for the perspective projection matrix that transforms the selected area to the screen (the rectangular co-ords which you would have set in the beginning).
> And do this while the camera captures your live feed.
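
The core Open CV part of that is roughly the following (a sketch; clicked_points stands for whatever the mouse callback collected, and the output size is arbitrary):

import cv2
import numpy as np

# Assumes clicked_points holds the 4 corner points selected on the live frame,
# in the same order as dst_points (top-left, top-right, bottom-right, bottom-left).
src_points = np.float32(clicked_points)                             # from the mouse callback
dst_points = np.float32([[0, 0], [400, 0], [400, 600], [0, 600]])   # target rectangle

M = cv2.getPerspectiveTransform(src_points, dst_points)
flattened = cv2.warpPerspective(frame, M, (400, 600))               # "scanned" view of the selection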

The result:


TakeAway:
> Mouse events to keep track of pointer's co-ords.
> Several other events that you can listen to. (try dot/line trails while live feed, pretty fancy it was)
> Perspective projection of custom area as selected by the mouse.

End.

Wednesday, March 6, 2019

Begin.

Ideation:


Started exploring Open CV (Computer Vision). It is one amazing utility! So what better way to learn than implementing your ideas, right!

So one fun and interesting use case that I could think of is a scenario where the user shows the application what all food ingredients he/she has, and the application reverse engineers the items and presents recipes.


So I have started exploring Open CV through a youtube channel :

https://www.youtube.com/channel/UC5hHNks012Ca2o_MPLRUuJw

The guy is brilliant, keeps things simple and to the point.

So far, I was able to :
> do some basic stuff, play around with the camera of the system.
>Perspective projection was pretty cool
> selective masking of colors, displaying only the stuff I wanted
> Conceptually understood dilation (fill stuff in the masked region), erosion (remove stuff from the masked region), contour lines (draws boundaries around the objects of a given color - was pretty cool)
> On doing so, I came across something which I wanted to try out myself, using the knowledge I had acquired
> (realised that having a lot of sub-ideas on the way to implementing your bigger idea is a good way to learn better!)
> - to implement Harry Potter's Invisibility Cloak! :)
> Concept was simple:
        > Select your object of choice that you will be using as the "Cloak".
        > Record the background when the scene is empty (without you or the "Cloak" in the cam's field of view). Store it.
        > During the live feed, mask out the Cloak object and fill only those masked areas from the static background that you recorded beforehand.
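
A rough sketch of that per-frame logic (assuming the cloak colour range lower/upper and the stored background frame are already available):

import cv2
import numpy as np

# Assumes: frame (current camera frame), background (stored empty-scene frame),
# lower/upper (HSV range of the cloak colour).
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
cloak_mask = cv2.inRange(hsv, lower, upper)          # white where the cloak is
rest_mask = cv2.bitwise_not(cloak_mask)              # white everywhere else

# Keep the live frame everywhere except the cloak...
visible = cv2.bitwise_and(frame, frame, mask=rest_mask)
# ...and fill the cloak region with the pre-recorded background.
revealed = cv2.bitwise_and(background, background, mask=cloak_mask)

output = cv2.add(visible, revealed)                  # "invisible" cloak effect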

Looked something like this :

was excited! :)

> Went through a video where he uses the algorithms SIFT, SURF and ORB to detect features in an image.
Apparently the above feature-detection techniques are only good for image comparisons, it seems, and not so good for videos, as they take a good amount of computational time.

> Am about to learn mouse tracking, once that is done, I want to be able to :
>Take area input from user - project it onto the screen


Take aways:
1. Main Idea: Show Me Recipes
2. Sub Ideas: Harry Potter Cloak (done), Input area - project it to screen
3. There is a lot of stuff that I went through - I don't remember everything - the ones that I remember: Dilation, Erosion, Perspective Projection, Selective Masking (inRange), Contour Lines, SIFT, SURF, ORB...
4. Of course, I don't need to remember everything at all, but just try and focus on the things that would cater to my "idea". Hence it's important to have an idea in mind when you are learning something.


End.




Begin.

Learning Path:

Hello World,

The main purpose of this blog is to have a one-stop source of reference for all the things that I explore. Not sure if it will turn out to be useful, but hopefully it will give a head start to people who are as clueless as I was when I had to start off.

Main intentions :
1. I need to be able to organise the stuff that I learn (mainly technical stuff) in 1 place.
2. Any quick tips as to how to or not to do certain things.
PS:
This might not be a place where one could learn stuff, as in there might not be code snippets; instead it's the journey, sort of the path that I took to implement what I wanted to implement.
I have categorised the posts into day-wise to your right (in case of Mobile phones, click on "View Web version" of the blog).

End.