So, I thought I'd dedicate a post to the stuff that I get to explore here at the University. I'll also include the topics/tasks that I am exploring at work, to get an idea of how these topics translate to real-world use cases.
SVD - Singular Value Decomposition -
- From what I have understood, this lets me decompose a rectangular matrix into 3 parts - left and right orthogonal matrices and a diagonal matrix of singular values.
- A use case for this was in the Information Retrieval area, where I had a sparse terms Vs documents matrix. Each document was represented by a vector of 1s and 0s, where a 1 implied the term is present in the document and 0 otherwise.
- I can give this matrix to the SVD function, which spits out 3 very VERY useful matrices -
- terms Vs topics
- topics Vs topics (the diagonal matrix, which holds the strength of each topic)
- topics Vs documents
- So the topics behave like categories to which the text belongs (Sports, Food etc), although SVD only finds them as unnamed latent dimensions.
- Mapping each of those terms to a set of categories, and in turn using those categories to decide the genre of the document, is pretty cool.
- This can help me retrieve documents better for a given query, by considering the context. There are ready-made functions for performing SVD; I just have to structure the data properly to feed it in.
- This is pretty cool, especially if one is building an in-house information retrieval system for an organisation.
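The decomposition described above can be sketched with NumPy's ready-made `np.linalg.svd`. This is only a minimal illustration: the four terms, the tiny 0/1 matrix and the choice of `k = 2` topics are made-up assumptions, not data from a real retrieval system.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents (1 = term present).
terms = ["goal", "match", "recipe", "oven"]
A = np.array([
    [1, 1, 0, 0],  # goal
    [1, 1, 0, 0],  # match
    [0, 0, 1, 1],  # recipe
    [0, 0, 1, 1],  # oven
], dtype=float)

# Compact SVD: U is terms Vs topics, s holds the singular values
# (the diagonal of the topics Vs topics matrix), Vt is topics Vs documents.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest topics (largest singular values).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each column is one document expressed in the k-dimensional topic space.
doc_topics = np.diag(s[:k]) @ Vt[:k, :]
```

Here the first two documents share one topic and the last two share another, so two topics are enough to rebuild the matrix. On real data, `k` is a tuning choice.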
Map Reduce paradigm -
- Framework for distributed data processing.
- Damn cool technique, provided I can frame my problem as an MR task.
- In simple terms, there are Mapper nodes, a Sort and Shuffle phase, and Reducer nodes:
- Mapper - Emits (key, value) data tuples; the Sort and Shuffle phase then groups them by key.
- Reducer - Gets the key-grouped list of data tuples and performs some kind of aggregation (count, sum, max, avg...).
- One use case that I can think of: let's say I have several 1000s of docs spread across nodes in a network, and I want to bring all the docs of each class to separate nodes, count the # of docs in each class and perform some kind of textual analysis (relevance/priority). The Map task can emit (category_id, <docId, content>), and each reducer will receive (category_id, [<docId, content>, <docId, content>, ...]), so it can aggregate all the tuples in that list that belong to a given category.
- Internally, the framework handles the part where all docs with the same key are directed to 1 reducer only, and are also sorted. It uses a hashing principle, something like: a doc that belongs to category id '54' may be mapped to the reducer number given by 54 % #reducers (ensuring all docs of category 54 will reside in the same reducer).
- There are several such intricate details in the MR framework and, of course, Hadoop was the main framework that supported this. I just need to write the code for the Mapper and Reducer tasks, and everything else is taken care of. This can be used if one is looking for distributed processing (of course, there are others like Apache Spark).
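To make the Mapper, Shuffle and Reducer flow above concrete, here is a single-process simulation in plain Python. It is not Hadoop code: the function names (`mapper`, `partition`, `reducer`, `run_job`) and the sample docs are hypothetical, and a real framework would spread the buckets across machines.

```python
from collections import defaultdict

def mapper(doc):
    # Emit (key, value): the category id is the key the framework groups on.
    yield doc["category_id"], (doc["doc_id"], doc["content"])

def partition(key, num_reducers):
    # Same idea as "54 % #reducers": every record with this key lands on one reducer.
    return key % num_reducers

def reducer(key, values):
    # Aggregate everything that arrived for one category; here, a simple count.
    return key, len(values)

def run_job(docs, num_reducers=3):
    # Shuffle: route each mapped pair to a reducer and group by key there.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for doc in docs:
        for key, value in mapper(doc):
            buckets[partition(key, num_reducers)][key].append(value)
    # Reduce: each reducer walks its keys in sorted order.
    results = {}
    for bucket in buckets:
        for key in sorted(bucket):
            category, count = reducer(key, bucket[key])
            results[category] = count
    return results

docs = [
    {"doc_id": "d1", "category_id": 54, "content": "cricket scores"},
    {"doc_id": "d2", "category_id": 7, "content": "pasta recipe"},
    {"doc_id": "d3", "category_id": 54, "content": "football transfer"},
]
counts = run_job(docs)
```

Swapping the body of `reducer` for a sum, max or some text analysis changes the job without touching the shuffle logic, which is the main appeal of the paradigm.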
Signal Processing from IMU:
- If one is interested in making sense of the motion that is captured by devices like an IMU (which is built into phones), one can make use of this app called Science Journal by Google.
- That is a brilliant app that lets you capture acceleration (any axis), ambient light in the room, sound, music and many more. It leverages the IMU and the other sensors on the phone and lets you record stuff at will.
- The best part: you can download the captured information as a CSV, and now it's just like any other data.
- One can play with this data in several ways -
- Apply a version of a clustering technique (K-Means, Hierarchical, DBSCAN...) to the data to "discover" similar segments in the signal.
- With the knowledge of the type of motion, use this data to capture similar motions in the future, like a familiar gesture or something.
- Detect the amount of light in the room and act accordingly.
- Detect the level of the sound signal in the room and act accordingly.
- One can choose to integrate all this into an app that one is developing.
- ML is just 1 step in the bigger project that can be developed.
- Or one can go all out, combine all types of signals and determine the activity.
- To evaluate, I can use a technique like constructing a confusion matrix (which captures False Positives and False Negatives, and in turn gives Precision and Recall).
PCA:
- Very similar to SVD. The data matrix (data records Vs Features) can be rectangular; the square matrix involved here is the features Vs features correlation matrix that is built from it.
- If there are 1000s of features in the given data and I do not know which ones to consider for analysis or predictions, I can rely on this technique, Principal Component Analysis.
- It allows me to replace the original columns with a smaller set of combined features (components), dropping the ones that are not that representative of my data in comparison to the stronger ones.
- I need to construct a matrix called the correlation matrix that captures how 2 columns/features are correlated (as in, if an increase in 1 leads to an increase in another)
- Now, I need to perform Eigen Decomposition on this correlation matrix, which spits out 2 parts -
- Eigenvalues and Eigenvectors (the eigenvectors are called the principal components)
- This works based on the rule AV = VΛ; A - the correlation matrix, Λ - Diagonal Matrix of eigenvalues, V - Matrix of Eigenvectors (one per column)
- The vectors are arranged such that the starting vectors (the ones associated with the largest eigenvalues) have the largest variance, i.e. they explain the data to a larger extent relative to the ones that come later.
- So we can strip off the vectors/components at the later stages, which are not representative of the data, and retain the top-k vectors. Projecting the data onto those top-k vectors gives our new data, reduced to a lower dimension.
- This technique has been used in several areas, including Genetics (to find the relevant genes) and Eigenfaces (to generalise from a whole collection of human faces and capture unseen faces).
- It can be used in several other areas too (including images, for learning features from their pixels).
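The PCA steps above (correlation matrix, eigen decomposition, keep the top-k components) can be sketched with NumPy as follows. The random data, the seed and `k = 2` are assumptions chosen only so that one feature is nearly a copy of another and is clearly redundant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 records x 3 features, where feature 1 nearly duplicates feature 0.
x0 = rng.normal(size=200)
X = np.column_stack([
    x0,
    2 * x0 + 0.01 * rng.normal(size=200),
    rng.normal(size=200),
])

# Standardise each column, then build the features Vs features correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = (Z.T @ Z) / len(Z)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so reorder to put the largest-variance components first.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top-k components and project the data onto them.
k = 2
X_reduced = Z @ eigvecs[:, :k]

# Fraction of the total variance retained by the k components.
explained = eigvals[:k].sum() / eigvals.sum()
```

Because two of the three features carry almost the same information, two components retain nearly all of the variance here. On real data one would inspect `explained` for several values of `k` before settling on one.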
MinHashing:
- A technique that can be used for comparing documents that are too large for their raw boolean vectors (where each dimension corresponds to a term) to be compared directly.
- To do a raw comparison, metrics like Jaccard similarity, Jaccard distance, Euclidean (L2 norm), Manhattan (L1 norm) and many more can be used.
- Jaccard similarity, |A ∩ B| / |A ∪ B| for term sets A and B (the distance is 1 minus that), is very handy but computationally expensive to perform for several large sets.
- MinHashing is a technique that hashes such large column vectors of documents to signatures with far fewer rows. The technique is pretty cool:
- https://www.youtube.com/watch?v=96WOGPUgMfw&t=1080s
- The gist is, it considers a permutation of the rows (which are the terms in docs) and captures the index (row # in that permutation) of the 1st term that occurs in the document.
- In the final signature, number of rows = number of considered permutations, and it has been proven that the probability that the MinHash values of 2 columns (or docs) agree under a random permutation is the same as the Jaccard similarity of the 2 docs.
- Overall, one can use this technique to find the similarity of very large boolean vectors, by computing the MinHash signatures of the vectors and determining the fraction of positions in which they are equal.
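The permutation idea above can be sketched in plain Python. This is the textbook explicit-permutation version, written for clarity rather than speed; the function names, the 20-term vocabulary and the 200 permutations are assumptions, and each document is assumed to be a non-empty set of terms drawn from the vocabulary.

```python
import random

def minhash_signature(doc_terms, vocabulary, num_perms=200, seed=42):
    # Fixed seed so every document is hashed with the same sequence of permutations.
    rng = random.Random(seed)
    signature = []
    for _ in range(num_perms):
        perm = vocabulary[:]
        rng.shuffle(perm)
        # Row index of the first term (in permuted order) that occurs in the doc.
        signature.append(next(i for i, term in enumerate(perm) if term in doc_terms))
    return signature

def jaccard(a, b):
    # Exact Jaccard similarity of two term sets.
    return len(a & b) / len(a | b)

def estimated_similarity(sig_a, sig_b):
    # Fraction of signature rows in which the two documents agree.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

vocabulary = [f"t{i}" for i in range(20)]
doc_a = set(vocabulary[:10])    # t0 .. t9
doc_b = set(vocabulary[5:15])   # t5 .. t14

sig_a = minhash_signature(doc_a, vocabulary)
sig_b = minhash_signature(doc_b, vocabulary)

exact = jaccard(doc_a, doc_b)
estimate = estimated_similarity(sig_a, sig_b)
```

The estimate gets closer to the exact value as `num_perms` grows. Practical implementations usually replace the explicit shuffles with cheap hash functions, since permuting a huge vocabulary many times is itself expensive.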
TF * IDF:
- Really cool scoring model for documents.
- One can use this while building an in-house retrieval system, by ranking each doc based on its TF*IDF score for each term of the input query.
- Eg: If I am searching for "Places I can visit in India", all the documents in the database can be ranked based on a score that is calculated for each term of the query, and the docs are returned in descending order of that score.
- TF - Term Frequency, IDF - Inverse Document Frequency. The intuition is: a high count of a term in a document (term freq) does not by itself mean it's important for that document; it also matters how many documents the term appears in (document freq). A term that appears in almost every document says little about any one of them.
- There are other, more complicated scoring models like Okapi BM25...
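A minimal sketch of the ranking described above, assuming the simplest variant of the score: raw term count multiplied by log(N / document frequency), summed over the query terms. The function name `tf_idf_scores`, the whitespace tokenisation and the three sample docs are all hypothetical, and real systems typically use smoothed or normalised variants.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many docs does each term appear at least once?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:  # skip query terms that occur in no document
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores

docs = [
    "places to visit in india",
    "india india cricket in india",
    "recipes to cook in winter",
]
scores = tf_idf_scores("visit india", docs)

# Doc indices in descending order of score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy collection the first doc outranks the second even though the second repeats "india" three times, because "visit" is rarer across the collection and so carries a larger IDF weight.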
Evaluation techniques:
- I can evaluate my prediction system, information retrieval system etc. using several metrics.
- I need to have the overall collection of docs at my disposal, and know which ones are relevant and which are irrelevant for a given query.
- I can use Precision, which captures the fraction of the retrieved documents that are relevant.
- Recall captures the fraction of relevant documents that are retrieved. This is harder, as I need to know all possibly relevant documents for a given query, not just the ones that are retrieved by the system.
- F-Measure (F1) is the harmonic mean of precision and recall, so it weighs the two equally.
- I can use a matrix called the confusion matrix, which lays out the counts of correct and incorrect predictions and so shows how accurate my model is.
- One needs to strike a balance between Precision and Recall, as too high a Precision might come at the cost of a lower Recall and vice versa.
- Precision, Recall and hence the confusion matrix need the counts of True Positives, True Negatives, FPs and FNs to determine the metrics.
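The metrics above follow directly from the four confusion-matrix counts. Here is a small sketch for the retrieval case, where the doc ids and the `evaluate` helper are hypothetical and the set of relevant docs is assumed to be known in advance.

```python
def evaluate(retrieved, relevant, collection):
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    tp = len(retrieved & relevant)               # relevant and retrieved
    fp = len(retrieved - relevant)               # retrieved but not relevant
    fn = len(relevant - retrieved)               # relevant but missed
    tn = len(collection - retrieved - relevant)  # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

collection = [f"d{i}" for i in range(1, 11)]  # d1 .. d10
relevant = ["d1", "d2", "d3", "d4"]
retrieved = ["d1", "d2", "d5"]

metrics = evaluate(retrieved, relevant, collection)
```

With these inputs the system retrieves 2 of the 4 relevant docs plus 1 irrelevant one, giving a precision of 2/3, a recall of 1/2 and an F1 of 4/7.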
The similarity between vectors:
- One can use Cosine similarity between vectors, which forms the basis for User-based and Item-based recommendation systems.
and many more..
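Cosine similarity, mentioned above, is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the three users and their ratings are made up, and returning 0.0 for an all-zero vector is a convention chosen here rather than a standard.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty rating vectors
    return dot / (norm_u * norm_v)

# Hypothetical ratings given by three users to the same three items.
alice = [5, 3, 0]
bob = [4, 2, 0]
carol = [0, 0, 5]

sim_alice_bob = cosine_similarity(alice, bob)
sim_alice_carol = cosine_similarity(alice, carol)
```

In a user-based recommender, Alice and Bob come out as near neighbours (similarity close to 1) while Alice and Carol share nothing (similarity 0). An item-based recommender applies the same function to the item columns instead.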
Takeaways:
- These are some good-to-know, widely used techniques that give better clarity on how one can go about processing the data at one's disposal.