Monday, October 28, 2019

General stuff

At my student job,
  • I started by porting a neural net script from a research paper, implemented in the Caffe library, to TensorFlow. Things were pretty straightforward: I had to get the TensorFlow equivalents of the Caffe weights.
  • Then I constructed the architecture defined in the paper in tf and loaded the weights. This is when I explored sessions in TensorFlow, which let me save the weight-loaded variables and re-use them for the final predictions.
  • This was followed by a bit of image processing where I had to segment the image semantically; I followed basic approaches for this and used nearest neighbour to assign a semantic label to the pixels.
  • After this, I had to explore techniques for anomaly detection - some of them were OneClassSVM, and feeding the input to a ResNet and using its features (the second-to-last layer's output) as input to the OneClassSVM, as I mentioned in the previous posts.
  • As of now, I am working on reducing the representation of a given image/patch (dimensionality reduction - basically projecting the given data into a lower-dimensional space) and then reconstructing the image/patch from that lower space. After this, I compute the reconstruction error to pinpoint the images/patches that were relatively difficult to reconstruct (a rough sketch of this idea is right below).
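
A minimal sketch of that reconstruction-error idea, assuming PCA as the dimensionality-reduction step and flattened grayscale patches; the patch size and the data here are placeholders, not the actual setup from work.

import numpy as np
from sklearn.decomposition import PCA

# patches: each row is a flattened image patch (e.g. 32x32 grayscale -> 1024 values)
patches = np.random.rand(500, 1024)   # placeholder data

pca = PCA(n_components=50)            # project into a lower-dimensional space
codes = pca.fit_transform(patches)    # low-dimensional representation
reconstructed = pca.inverse_transform(codes)

# per-patch reconstruction error; high error = hard to reconstruct = anomaly candidate
errors = np.mean((patches - reconstructed) ** 2, axis=1)
suspects = np.argsort(errors)[-10:]   # the 10 patches with the largest error
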
Anyways, now I want to explore semi-structured analysis (textual stuff); after a bit of exposure to a few NLP techniques at Uni, I am exploring this area a bit more.

I am not sure if I'll work on this, but I have an idea that I'd want to implement.
Recently, I got a chance to explore OCR a bit using the tesseract library (https://github.com/tesseract-ocr/tesseract), which is used for extracting text from images. Some time later, in a video on YouTube, I came across GloVe - Global Vectors for Word Representation (https://nlp.stanford.edu/projects/glove/), a research project from Stanford.

- So text can be represented in several ways for performing analysis.
- One way is to use a Bag of Words - a vector recording the count of each word in the vocabulary (word order is discarded).
- Or I can just use an incidence/boolean matrix that indicates whether each word is present or not.
- But GloVe represents each word in a pretty epic way, by retaining its contextual meaning. Each word is represented as a vector of several dimensions, such that similar words have similar vectors (their cosine similarity is high).

So of course, there is a lot more that I need to explore about GloVe and its usage.
But I thought: what if I can combine OCR with GloVe embeddings of the extracted text, and then summarize/predict/analyse the given text (preferably using an RNN or something similar, since it keeps track of timesteps)? I am not sure about the last part, but as of now I would consider it progress if I can do the OCR part plus the GloVe word embeddings part. A rough sketch of that first part is below.
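
A minimal sketch of that OCR + embedding idea, under my assumptions: pytesseract as the Python wrapper around tesseract, and a pre-trained GloVe text file (e.g. glove.6B.50d.txt) downloaded from the Stanford page. The file names are placeholders.

import numpy as np
import pytesseract
from PIL import Image

# 1. OCR: extract raw text from an image
text = pytesseract.image_to_string(Image.open("scanned_page.png"))

# 2. Load GloVe vectors into a dict: word -> 50-dim vector
embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# 3. Embed the OCR'd words (skip words GloVe does not know)
tokens = [w.lower() for w in text.split()]
vectors = [embeddings[w] for w in tokens if w in embeddings]
# 'vectors' could now feed a summarizer / classifier / RNN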


Takeaways:
  • Some insights about my part-time work.
  • Initiated a new exploration :)

Sunday, October 27, 2019

Uni Learnings

So, I thought I'd dedicate a post to the stuff that I get to explore here at the University, and also include the topics/tasks that I am exploring at work - to get an idea of how these topics translate to real-world use cases.

SVD - Singular Value Decomposition -

  • From what I have understood, this lets me decompose a rectangular matrix into 3 parts - left and right orthogonal matrices and a diagonal matrix of singular values.
  • A use case for this was in the Information Retrieval area, where I had a sparse terms vs documents matrix - each document was represented by a vector of 1s and 0s, where a 1 implied the term is present in the document and 0 otherwise.
  • I can give this matrix to an SVD function, which spits out 3 very, VERY useful matrices -
    • terms vs topics
    • topics vs topics (the singular values)
    • topics vs documents
  • So the topics are like categories to which the text belongs (Sports, Food, etc.).
    • Mapping each of those terms to a set of categories, and in turn using those categories to decide the genre of the document, is pretty cool.
  • This can help me retrieve documents better for a given query - by considering the context. There are ready-made functions for performing SVD; I just have to structure the data properly before feeding it in (see the sketch after this list).
  • This is pretty cool, especially if one is building an in-house information retrieval system for an organisation.
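
A minimal sketch of this LSA-style use of a ready-made SVD function; the toy term-document matrix and the choice of k = 2 topics are made up for illustration.

import numpy as np

# terms vs documents incidence matrix (1 = term occurs in the document)
A = np.array([
    [1, 1, 0, 0],   # "football"
    [1, 0, 0, 0],   # "goal"
    [0, 0, 1, 1],   # "pasta"
    [0, 1, 1, 0],   # "recipe"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                # keep the top-k "topics"
terms_vs_topics = U[:, :k]           # terms vs topics
topic_strengths = np.diag(s[:k])     # topics vs topics (singular values)
topics_vs_docs = Vt[:k, :]           # topics vs documents

# documents projected into topic space; similar docs end up close together
doc_vectors = topic_strengths @ topics_vs_docs
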
Map Reduce paradigm - 
  • Framework for distributed data processing. 
  • A damn cool technique, iff I can frame my problem as an MR task.
  • In simple terms, there are Mapper nodes, a (Sort and Shuffle) phase, and Reducer nodes that do the following -
    • Mapper - emits data tuples grouped by a key.
    • Reducer - gets the key-grouped list of data tuples and performs some kind of aggregation (count, sum, max, avg...).
    • One use case I can think of: let's say I have several thousands of docs spread across nodes in a network, and I want to bring all the docs of each class to separate nodes, count the # of docs in each class, and perform some kind of textual analysis (relevance/priority). The Map task can emit (category_id, <docId, content>), and the reducer will receive (category_id, [<docId, content>, <docId, content>, ...]), which it can then aggregate for each category.
    • Internally, the framework handles the part where all docs with the same key are directed to exactly 1 reducer, and also sorted. It uses a hashing principle: a doc that belongs to category id '54' may be mapped to the reducer number given by 54 % #reducers (ensuring all docs of category 54 end up on the same reducer).
    • There are several such intricate details in the MR framework, and of course Hadoop was the main framework that supported this: I just need to write code for the Mapper and Reducer tasks and the rest is taken care of. This can be used if one is looking for distributed processing (of course there are others, like Apache Spark). A toy sketch of the idea is below.
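
A toy, in-memory sketch of the map/shuffle/reduce idea above (plain Python, not actual Hadoop code); the doc structure and category ids are made up for illustration.

from collections import defaultdict

docs = [
    {"doc_id": 1, "category_id": 54, "content": "cricket world cup ..."},
    {"doc_id": 2, "category_id": 12, "content": "pasta recipe ..."},
    {"doc_id": 3, "category_id": 54, "content": "football transfer news ..."},
]

# Map: emit (key, value) tuples, keyed by the category
def mapper(doc):
    yield doc["category_id"], (doc["doc_id"], doc["content"])

# Shuffle: group all values by key (the framework does this for us in Hadoop)
groups = defaultdict(list)
for doc in docs:
    for key, value in mapper(doc):
        groups[key].append(value)

# Reduce: aggregate per key, e.g. count the docs in each category
def reducer(key, values):
    return key, len(values)

results = [reducer(k, v) for k, v in groups.items()]
print(results)   # e.g. [(54, 2), (12, 1)]
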
Signal Processing from IMU:
  • If one is interested in making sense of the motion captured by devices like an IMU (which is built into phones), one can make use of the Science Journal app by Google.
  • It is a brilliant app that lets you capture acceleration (along any axis), ambient light in the room, sound, music and more. It leverages the IMU unit on the phone and lets you record stuff at will.
  • The best part: you can download the captured information as a CSV! And then it's just like any other data.
  • One can play with this data in several ways - 
    • Apply a clustering technique (K-Means, Hierarchical, DBSCAN...) to the data to "discover" similar segments in the signal (see the sketch after this list).
    • With the knowledge of the type of motion, use this data to capture similar motions in the future, like a familiar gesture or something.
    • Detect the amount of light in the room and act accordingly.
    • Detect the level of the sound signal in the room and act accordingly.
    • One can choose to integrate all this into an app that he/she is developing.
    • ML is just 1 step in the bigger project that can be developed.
    • Or one can go all out, combine all types of signals and determine the activity.
  • I can use evaluation techniques like constructing a confusion matrix (that captures False Positives, False Negatives and, in turn, Precision and Recall).
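
A minimal clustering sketch for such a CSV export; the file name and the accelerometer column names here are my assumptions, not the app's actual export format.

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("science_journal_export.csv")     # hypothetical exported file
features = df[["acc_x", "acc_y", "acc_z"]].values  # assumed column names

# "discover" 3 kinds of segments in the signal (e.g. rest / walk / shake)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
df["segment"] = kmeans.fit_predict(features)

print(df.groupby("segment").size())   # how many samples fell into each cluster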

PCA:
  • Very similar in spirit to SVD; here the decomposition is done on a square matrix (the correlation matrix) built from the data records vs features matrix.
  • If there are thousands of features in the given data and I do not know which ones to consider for analysis or predictions, I can rely on this technique, Principal Component Analysis.
  • It allows me to strip off components that are not that representative of my data in comparison to the stronger ones.
  • I need to construct the correlation matrix, which captures how every pair of columns/features is correlated (as in, whether an increase in one goes along with an increase in the other).
  • Now I need to perform an eigen decomposition on this correlation matrix, which spits out 2 parts -
    • Eigenvalues and eigenvectors (the eigenvectors are the principal components).
  • This works based on the rule A V = V Λ; A: the correlation matrix, Λ: diagonal matrix of eigenvalues, V: matrix of eigenvectors (as columns).
  • The vectors are arranged such that the first ones (associated with the largest eigenvalues) capture the largest variance, i.e. they explain the data to a larger extent relative to the ones that come later.
  • So we can strip off the later vectors/components that are not representative of the data, retain the top-k vectors, and project the data onto them to get a new representation reduced to a lower dimension.
  • This technique has been used in several areas, including genetics (to find relevant genes) and Eigenfaces - generalising a whole collection of human faces to capture unseen faces.
  • It can be used in several other areas too (including images, for learning from their features - pixels...). A small sketch is below.
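
A minimal sketch of PCA via an eigen decomposition of the correlation matrix; the toy data and the choice of k = 2 components are made up for illustration.

import numpy as np

X = np.random.rand(200, 10)                # 200 records x 10 features (placeholder)

# correlation matrix of the features (square: 10 x 10)
R = np.corrcoef(X, rowvar=False)

# eigen decomposition: R V = V Lambda (eigh, since R is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(R)

# sort components by descending eigenvalue (largest explained variance first)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2                                      # keep the top-k principal components
X_standardised = (X - X.mean(axis=0)) / X.std(axis=0)
X_reduced = X_standardised @ eigenvectors[:, :k]   # data projected into k dimensions

explained = eigenvalues[:k] / eigenvalues.sum()
print("variance explained by the top-k components:", explained)
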
MinHashing:
  • A technique that can be used for comparing documents whose raw boolean vectors (where each dimension corresponds to a term) are too large to compare directly.
  • For a raw comparison, metrics like Jaccard similarity, Jaccard distance, Euclidean, Manhattan, the L1/L2 norms and many more can be used.
  • Jaccard distance, although very handy (1 - |A ∩ B| / |A ∪ B|), is computationally expensive when performed over many very large sets.
  • MinHashing hashes such large column vectors of documents down to signatures with far fewer rows. The technique is pretty cool:
  • https://www.youtube.com/watch?v=96WOGPUgMfw&t=1080s
  • The gist is: it considers a permutation of the rows (which are the terms) and records, for each document, the index (row # in that permutation) of the first term that occurs in the document.
  • In the final signature, the number of rows = the number of permutations considered, and it has been proven that the probability that the MinHash values of 2 columns (docs) agree equals the Jaccard similarity of the 2 docs.
  • Overall, one can use this technique to estimate the similarity of very large boolean vectors by computing their MinHash signatures and measuring the fraction of signature rows on which they agree. A small sketch is below.
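
A toy MinHash sketch based on random row permutations; the two example "documents" are just sets of term indices made up for illustration.

import random

VOCAB_SIZE = 1000
NUM_PERMUTATIONS = 100

doc_a = {3, 17, 42, 99, 512}          # term indices present in doc A
doc_b = {3, 17, 42, 600, 512}         # term indices present in doc B

random.seed(0)
permutations = [random.sample(range(VOCAB_SIZE), VOCAB_SIZE)
                for _ in range(NUM_PERMUTATIONS)]

def minhash_signature(doc):
    # for each permutation, record the smallest permuted index among the doc's terms
    return [min(perm[t] for t in doc) for perm in permutations]

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)

# fraction of agreeing signature rows approximates the Jaccard similarity
estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERMUTATIONS
exact = len(doc_a & doc_b) / len(doc_a | doc_b)
print(estimate, exact)
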
TF * IDF:
  • Really cool scoring model for documents.
  • One can use this while building an in-house retrieval system, by ranking each doc based on its TF*IDF score for the terms of the input query.
  • E.g.: if I am searching for "Places I can visit in India", all the documents in the database can be ranked by a score accumulated over the terms of the query, and the docs are returned in descending order of that score.
  • TF - Term Frequency, IDF - Inverse Document Frequency. The intuition: a high count of a term within a document (term frequency) does not on its own make the term important; it also matters in how many documents overall the term appears (document frequency) - very common terms get down-weighted.
  • There are other, more sophisticated scoring models like Okapi BM25... A small sketch of the basic scheme is below.
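
A minimal TF*IDF ranking sketch using scikit-learn's TfidfVectorizer; the toy documents and the query are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Taj Mahal and other places to visit in India",
    "Best pasta recipes from Italy",
    "A travel guide to the beaches of Goa, India",
]
query = "Places I can visit in India"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)     # docs x terms, TF*IDF weighted
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]                 # docs in descending score order
for idx in ranking:
    print(round(scores[idx], 3), docs[idx])
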
Evaluation techniques:
  • I can evaluate my prediction system, information retrieval system etc... using several metrics.
  • I need to have the overall collection of docs at my disposal, knowing which are relevant and which are irrelevant for a given query.
  • I can use Precision, which captures the fraction of the retrieved documents that are relevant.
  • Recall - captures the fraction of the relevant documents that were retrieved. This is harder, as I need to know all possibly relevant documents for a given query, not just the ones retrieved by the system.
  • F-Measure - the harmonic mean of precision and recall, so it does not favour one over the other.
  • I can use the confusion matrix, which captures where my model is right and where it is wrong for each class.
  • One needs to strike a balance between Precision and Recall, as too high a Precision might come at the cost of a lower Recall and vice versa.
  • Precision, Recall, and hence the confusion matrix, need the True Positives, True Negatives, FPs and FNs to determine the metrics. A small sketch is below.
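
A minimal sketch of these metrics with scikit-learn; the ground-truth and predicted label arrays are made up for illustration.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground truth (1 = relevant / positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # what the system returned / predicted

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)

print("precision:", precision_score(y_true, y_pred))   # tp / (tp + fp)
print("recall:   ", recall_score(y_true, y_pred))      # tp / (tp + fn)
print("F1:       ", f1_score(y_true, y_pred))          # harmonic mean of the two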

Similarity between vectors:
  • One can use Cosine similarity between vectors - it forms the basis for user-based and item-based recommendation systems (a small sketch is below).
and many more..
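
A minimal cosine-similarity sketch; the two "user rating" vectors are made up for illustration.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

user_a = np.array([5, 3, 0, 1])   # ratings given by user A
user_b = np.array([4, 2, 0, 1])   # ratings given by user B
print(cosine_similarity(user_a, user_b))   # close to 1.0 => similar taste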

Takeaways:
  • Some good-to-know, widely used techniques, to get better clarity on how one can go about processing the data at his/her disposal.



Wednesday, October 23, 2019

It has been quite difficult to keep up with posts. But it shouldn't matter as I am doing this as a release valve to sort of journal my thoughts.

So here is what I got to explore with respect to the CNN part using Neural Nets.
Having a basic, decent background in convolutions/image processing from Uni, I referred to https://www.datacamp.com/community/tutorials/cnn-tensorflow-python.

Tensorflow:
So from recent podcasts and videos I realised TensorFlow has this static way of creating neural nets, where I create a computational graph - basically I define a set of nodes in a graph that depict the operations (convolutions/add/matrix multiply, etc.) on the incoming data, with the edges representing the flow of data. (Although I came across TensorFlow 2.0, where eager execution gives one a lot more control to intercept the data flow in the NN.) Once I define the network entirely, I make use of TensorFlow sessions to feed data into the defined network. It seems this approach helps a lot in distributing the training process. A tiny sketch of the graph + session pattern is below.
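
A tiny sketch of that "define the graph first, then run it in a session" pattern, assuming the TF 1.x-style API (via tf.compat.v1 so it also runs on newer installs); the shapes are arbitrary.

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# 1. Build the computational graph (nothing is computed yet)
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
W = tf.Variable(tf.random_normal([3, 2]), name="W")
b = tf.Variable(tf.zeros([2]), name="b")
y = tf.matmul(x, W) + b          # nodes = operations, edges = flowing tensors

# 2. Feed data through the graph inside a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]})
    print(out)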

But it was a lot to code from scratch, hence I had to refer to that blog, where I had ready-made code. I just had to understand it, which was only decently hard given my previous knowledge of image processing and CNNs. But it was hard and confusing to run it directly.

Data Cleaning:

Had to realise this the hard way.
--
I got a chance to explore this technique called One-Class SVM for a kind of binary classification at my student job. Again, this was for image anomaly detection. Overall, the task was pretty interesting; I explored various approaches that I could have potentially taken. Some of them were:
- Give Image train data to the OneClassSVM API of sklearn, although it was a bit restrictive and not available in all versions.
- Apply PCA to the image data to retain the features/pixel data that explained maximum variance (shall try and cover such handy concepts that revolve around data processing/ pre-processing in the next post)
- Use a ResNet to get the most relevant features from the image, then apply PCA, and then give those features to the One-Class SVM. This was pretty interesting, as per the link https://hackernoon.com/one-class-classification-for-images-with-deep-features-be890c43455d (a rough sketch of this pipeline is below).
- Lastly, there was the autoencoder approach, which I have never explored, but I'd like to someday, on an application that interests me (semi-structured data).
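
A rough sketch of that deep-features -> PCA -> One-Class SVM pipeline, under my assumptions: Keras' ResNet50 as the feature extractor and a batch of images already resized to 224x224x3. The data here is a placeholder, not the actual work data.

import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

images = np.random.rand(100, 224, 224, 3) * 255.0   # placeholder "normal" images

# 1. Deep features from the layer just before the classifier (global average pooling)
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")
features = extractor.predict(preprocess_input(images))   # shape: (100, 2048)

# 2. Reduce the feature dimensionality with PCA
pca = PCA(n_components=32)
reduced = pca.fit_transform(features)

# 3. Fit a One-Class SVM on the (normal-only) training data
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(reduced)

# +1 = looks normal, -1 = flagged as an anomaly
predictions = ocsvm.predict(reduced)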

But the sad part was, I NEVER really cared about how my data looked. I was in such a hurry to get the insights, I never gave a damn about the data. Unfortunately, it was largely crappy. It was skewed, un-curated, not representative enough, so I was asked to drop the whole task itself :( (time constraints)
--

Anyways, continuing with my exploration,
when I was about to drop the idea of using Tensorflow for this (which was pretty confusing for me - but was worth the experience), a friend asked me to explore Keras. It is too damn readable. 
It is a highly simplified way to define the network. This is when I realised I NEED to keep the larger picture in mind; there is no point taking a difficult path when I reach the same destination either way.


Built a basic NN using Keras with:
- 32 3x3 filters in the first layer (which took 1 channel from the input image - grayscale)
- 64 3x3 filters in the second layer (depth 32 from the previous layer)
- 128 3x3 filters in the 3rd layer (depth 64 from the previous layer - here I realised that as you go further into the feed-forward network, the depth of the conv volume increases rapidly, leading to a HUUUGE number of neurons and, in turn, weights to be computed!)
- Flattened the output (with previous depth 128!)
- A fully connected Dense layer with 256 neurons
- Finally, a fully connected layer with the number of neurons equal to the number of classes that I want (in my case, the number of fruits)
- Forgot to mention adding a MaxPool layer after each conv layer, which halves the spatial dimensions while retaining only the "strongest" features of the previous conv layer.
- So here, each filter corresponds to an activation plane that slides across the image to "filter" out the "relevant" parts. In other words, the relevant neurons in each activation plane trigger the neurons in the next layer. (A Keras sketch of this architecture is below.)
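
A sketch of the architecture described above in Keras; the input size (100x100 grayscale) and the number of fruit classes are my placeholder assumptions, not the exact values I used.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

NUM_CLASSES = 10   # number of fruit classes (assumption)

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(100, 100, 1)),
    MaxPooling2D((2, 2)),                 # halves the spatial dimensions
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation="relu"),
    Dropout(0.3),                              # randomly switches off some nodes
    Dense(NUM_CLASSES, activation="softmax"),  # probability per fruit class
])
model.summary()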

A few key observations:
- Used ReLU as the activation for each neuron; it zeroes out negative values and passes positive values through unchanged (there is also Leaky ReLU).
- Use softmax (which gives a probability for each class) as the activation function of the last layer; it can be used to determine how relevant each category is for the input image.
- Use (mini-)batch gradient descent, which speeds up convergence (reaching the bottom of the loss "bowl").
- Use Dropout, which randomly switches off some nodes during training so the network generalises better.
- Use Momentum (which combines the previously accumulated gradients - the velocity and friction of the ball in the bowl - with the current gradient - the acceleration of the ball, as in Andrew Ng's example), the Adam optimizer, the RMSProp optimizer... and many more that can optimise the cost function.
- Like my Prof taught in class, training almost any model can ultimately be written as:
objective = loss + (regularisation constant) * (regularizer)
We just need to optimise this objective function so that the loss (plus the penalty) is minimised, in order to find the weights and biases - via backpropagation (partial derivatives), or Lagrange multipliers / KKT conditions, which find the values of the weights across the layers that minimise the cost function.
- Use the softmax + cross-entropy loss if predicting categorical values.
- Can use root mean squared error, mean squared error, the L1/L2 norms, Manhattan distance, etc. as the loss function if doing regression, i.e. predicting values rather than categories. A lot to explore here. (A small compile/train sketch is below.)
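
Continuing the Keras sketch above, a hedged example of wiring up the optimizer and loss discussed here; x_train / y_train are placeholders I have not defined, so the fit call is left commented.

from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-3),   # could also be SGD with momentum, or RMSprop
    loss="categorical_crossentropy",      # softmax + cross-entropy for categories
    metrics=["accuracy"],
)

# mini-batch training: batch_size controls how many samples go into each gradient step
# model.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.1)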

But unfortunately, my PC is running out of juice and cannot keep the notebooks active. As of now, I am getting a decent accuracy of around 73%. I'll probably either integrate this with the OpenCV work that I explored in the beginning OR explore something else.


Takeaways:
  • Explore a lot more approaches/topics before sticking to one, as immediately narrowing down to a selection can restrict the exploration stage.
  • It's good to stray away from the main goal - iff you can afford to do so - on the way you may stumble upon many more exciting things.
  • Realised videos/podcasts/tech talks give you a whole new perspective on things.
    • Came across some really cool channels on YouTube - Lex Fridman, Strange Loop - which give a good perspective on translating ML work to industry (the Netflix one was really cool).
  • The wave of examinations over the past months prompted me to at least document the topics that I came across while studying. Some of the techniques and tools are really handy, and it seems they can be appreciated more if I document some of them here.
  • As a next step, I'll try running the model against new images of fruits to see if it works; after that, I'll see if it works on a live image captured from a web camera.
  • Also want to explore semi-structured analysis for textual stuff - NLP, TF*IDF, language models, etc.