It has been a pretty hectic semester, hence I was not able to keep up with the writing. There are lectures, assignments, a student job.. lots to keep me active round the clock.
But the good news is that my lecture contents, and to some extent my student job as well, are sort of in sync with what I wanted to do here. They helped a lot in getting a clearer understanding of certain topics. And also there is an exam on the topic in 2 days :D
So far, I've got a decent picture of convolutions and how they are incorporated into a neural network, together with the intermediate blocks like ReLU (to bring in non-linearity) and Max-Pooling, and concepts like strides.. There is this awesome formula that gives me the dimensions of the filter outputs. Output width:
(input width + 2 * padding - filter width) / stride + 1
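To sanity-check the formula, here is a tiny Python sketch. The function name `conv_output_width` is just my own naming, and I am assuming the filter fits evenly into the (padded) input, so the code complains if the division leaves a remainder.

```python
def conv_output_width(input_width, filter_width, padding=0, stride=1):
    """Width of the activation map produced by sliding a filter over an input."""
    span = input_width + 2 * padding - filter_width
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not fit evenly with this stride/padding")
    return span // stride + 1


print(conv_output_width(35, 5))             # 31
print(conv_output_width(35, 5, padding=2))  # 35
print(conv_output_width(7, 3, stride=2))    # 3
```

The same formula applies to the height, with the input height and filter height plugged in instead.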
To sum it up, here is what I came up with, according to my understanding:
So the story goes:
> I have the image of a fruit bowl, and say I have a torch with a rectangular bulb, which represents the square 5 x 5 x 3 filter shown in the image above.
> I just slide the torch across the image. Each time I am in a new position, I convolve (take the dot product of) the torch with the corresponding area of the fruit bowl image where the torch is shining its light.
> The convolution operation is just multiplying the weights on the torch head with the corresponding image pixels in the area where the torch is shining its light, and adding up all the products.
> In the example picture above, the torch head is the 5 x 5 x 3 filter, and the area where the light shines on the image is the small red box. I convolve these 2 to get 1 value, which I store as a red blob in the second plane drawn in the figure.
> Now I move/slide the torch head, so the light is shining on the purple box. I convolve these 2 to get 1 value, represented as the purple blob in the second plane.
> Now I continue this and fill up that plane. This is called the activation plane (or activation map).
> The activation plane represents the result of sliding the torch head (filter) across the whole image.
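The torch story so far can be sketched in plain Python. This is only my toy version under some assumptions: the image and the filter are nested lists indexed as [row][column][channel], there is no padding, and the names `convolve`, `image` and `filt` are made up for illustration. Strictly speaking it computes a cross-correlation (the filter is not flipped), which as far as I know is what CNN libraries usually mean by convolution anyway.

```python
def convolve(image, filt, stride=1):
    """Slide filt over image (no padding) and return the 2-D activation plane."""
    img_h, img_w = len(image), len(image[0])
    f_h, f_w, depth = len(filt), len(filt[0]), len(filt[0][0])
    out_h = (img_h - f_h) // stride + 1
    out_w = (img_w - f_w) // stride + 1
    activation = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride  # where the torch is shining now
            total = 0.0
            for a in range(f_h):
                for b in range(f_w):
                    for c in range(depth):
                        # weight on the torch head times the pixel under it
                        total += image[top + a][left + b][c] * filt[a][b][c]
            activation[i][j] = total  # one "blob" in the activation plane
    return activation


# Toy data: a 7 x 7 x 3 image of ones and a 5 x 5 x 3 filter of ones.
image = [[[1.0] * 3 for _ in range(7)] for _ in range(7)]
filt = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]

activation = convolve(image, filt)
print(len(activation), len(activation[0]))  # 3 3
print(activation[0][0])                     # 75.0
```

Each entry is 75.0 because the torch covers 5 * 5 * 3 = 75 pixels, all equal to 1 and each multiplied by a weight of 1.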
> Intuitively the filter tries to "filter" stuff out. As in, it could be a filter for detecting say edges or curves.
> E.g. a filter could hold a pattern of weights that responds to the symbol 'U' and discards everything else.
> So the 4 x 4 image regions that contain a 'U' will produce a higher value in the corresponding neuron of the activation map than the other 4 x 4 regions.
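Here is a small sketch of that idea. The weights below are purely my own hypothetical 'U' filter (+1 where I expect ink, -1 where I expect none), not values taken from any real trained network, and the three 4 x 4 patches are made-up single-channel examples.

```python
# Hypothetical 4 x 4 'U' filter: +1 on the U shape, -1 everywhere else.
u_filter = [
    [1, -1, -1, 1],
    [1, -1, -1, 1],
    [1, -1, -1, 1],
    [1,  1,  1, 1],
]

u_patch = [      # a region that contains a 'U'
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
bar_patch = [    # a region with just a horizontal bar
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
full_patch = [[1] * 4 for _ in range(4)]  # a completely filled region


def response(patch, filt):
    """Dot product of one image patch with the filter weights."""
    return sum(
        p * w
        for patch_row, filt_row in zip(patch, filt)
        for p, w in zip(patch_row, filt_row)
    )


print(response(u_patch, u_filter))     # 10
print(response(bar_patch, u_filter))   # 0
print(response(full_patch, u_filter))  # 4
```

The 'U' region scores highest, which is exactly the "higher value in the activation map" from above. The negative weights are my design choice so that a fully filled region does not score as high as a real 'U'.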
> In the same way, there can be several filters within a single Convolution layer, which in turn results in that many activation maps.
> One can visualize the output of multiple filters as a box (the number of neurons drawn is not exact):
So here, there are 4 planes in the box => 4 activation maps => outputs of 4 filters that were convolved with the input image.
> Why is this done?
> So that I can detect e.g. blobs or polygons.. shapes.. solid shapes etc. in the next layer.
> Here the intuition can be a real human eye, which detects stuff layer by layer:
to identify an object, our brain first sees the edges (layer 1), then the shape (layer 2), then the texture (layer 3), and finally recognizes the object.
> The above box is essentially one more image/input, with dimensions 6 x 6 x 4.. just like the input image.
> Hence I can have additional filters in the next layers, by taking this box as my input and running my torch across this new box.
> This lets me detect higher level features like, say, solid shapes...
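To see how the box of one layer becomes the input of the next, here is a rough shape-tracking sketch. The layer sizes (4 filters of 5 x 5, then 8 filters of 3 x 3, stride 1, no padding) are numbers I made up for illustration, so the result will not match the 6 x 6 x 4 drawing, and `conv_layer_output_shape` is my own helper name.

```python
def conv_layer_output_shape(input_shape, num_filters, filter_size, stride=1, padding=0):
    """Return (height, width, depth) of the stacked activation maps."""
    height, width, _depth = input_shape  # every filter spans the full input depth
    out_h = (height + 2 * padding - filter_size) // stride + 1
    out_w = (width + 2 * padding - filter_size) // stride + 1
    # One activation map per filter, so the depth equals the number of filters.
    return (out_h, out_w, num_filters)


image_shape = (35, 35, 3)  # the fruit bowl image
box1 = conv_layer_output_shape(image_shape, num_filters=4, filter_size=5)
box2 = conv_layer_output_shape(box1, num_filters=8, filter_size=3)

print(box1)  # (31, 31, 4)
print(box2)  # (29, 29, 8)
```

Note that each filter in the second layer has to be 3 x 3 x 4 here, because the torch always has to be as deep as the box it is shining on.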
Ultimately the CNN structure might look something like:
((Layer - ReLU) * m - Max-Pool) * n - (Fully Connected Layer - ReLU) * o - SoftMax. (flicked from my lecture notes :D)
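To make the pattern a bit more concrete, this little sketch just expands it into a flat list of layer names. I am assuming that "Layer" in my notes means a convolution layer, and `cnn_layout` plus the values of m, n and o are made up for illustration.

```python
def cnn_layout(m, n, o):
    """Expand ((Conv - ReLU) * m - MaxPool) * n - (FC - ReLU) * o - SoftMax."""
    layers = []
    for _ in range(n):
        layers += ["Conv", "ReLU"] * m      # m conv + ReLU pairs
        layers.append("MaxPool")            # then one pooling step
    layers += ["FullyConnected", "ReLU"] * o
    layers.append("SoftMax")
    return layers


print(" - ".join(cnn_layout(m=2, n=2, o=1)))
# Conv - ReLU - Conv - ReLU - MaxPool - Conv - ReLU - Conv - ReLU - MaxPool - FullyConnected - ReLU - SoftMax
```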
> Here, the Fully Connected layer is like the normal NN, where all the 35 * 35 * 3 pixels of my input fruit bowl would be connected to each of a specified number of neurons. That is a super expensive operation, hence it is not used right from the first layers... The torch sliding reuses the same small filter at every position, which helps reduce the number of weights and computations.
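A quick back-of-the-envelope count shows the difference in the number of weights. The 100 neurons and the 4 filters are hypothetical numbers I picked, and biases are ignored to keep it simple.

```python
inputs = 35 * 35 * 3             # every pixel value of the fruit bowl image
fc_neurons = 100                 # hypothetical size of a fully connected layer
fc_weights = inputs * fc_neurons # one weight per (pixel, neuron) pair

num_filters = 4                  # hypothetical number of 5 x 5 x 3 filters
conv_weights = num_filters * 5 * 5 * 3  # the same weights are reused at every position

print(inputs)        # 3675
print(fc_weights)    # 367500
print(conv_weights)  # 300
```

So in this made-up setup the fully connected layer needs over a thousand times more weights than the convolution layer.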
ReLU, Max-Pool and SoftMax are pretty intuitive concepts, which I shall be covering in the next post.
Takeaways:
> Bigger picture of what a CNN is all about and how one can detect stuff with a CNN architecture.
> Bit deeper into the workings.
> Intuition on Convolving a filter across an image.
> Concept of filters.
> Interpretation of Filter outputs as neuron-contained activation maps.
> On a side note, I might consider using TensorFlow for directly building a basic CNN and training the model. Not sure when, probably after exams.