Coding Period (Week 6 + Week 7)
Continuing from my previous blog, which covered the project report for the first half of the coding period, in this post I’ll cover the work I did in weeks 6 and 7 and how there was a sudden change in the plan of action.
I had completed my first evaluations and was ready for a heavy 2nd half ahead 💪.
Although I had successfully implemented the TF-IDF + K-Means model, the results weren’t that good: the distribution of items across clusters was very uneven, and simply re-ranking the items wouldn’t make sense.
So we had decided to use LDA to reduce the dimensionality of the dataset, with CountVectorizer replacing TF-IDF.
At the end of week 5, we were considering around 200 topics for LDA, were still undecided on the library (Gensim vs scikit-learn), and also had to integrate the CountVectorizer + LDA model with K-Means.
Week 6 was all about testing the LDA models (Gensim and scikit-learn) by changing the hyperparameters and trying to integrate them with the K-Means model.
The Gensim model had the following problems:
- Multi-threaded processing often froze my laptop and cmd.
- Repeated words across topics, even at lower topic counts.
- Poor recommendations
So we just stuck with scikit-learn for LDA and integrated it with K-Means. One more issue was the number of topics: our intuition suggested using around 100 to 200 topics, but the model produced repetitive topics even for 50–60. So we decided to test it mathematically.
The log-likelihood values for topic counts from 5 to 50 are as follows:
Since 10 topics had the highest score, we decided to move forward with it.
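The “test it mathematically” step amounts to fitting one LDA model per candidate topic count and comparing log-likelihood scores; here is a sketch using scikit-learn’s `score` method on a toy corpus (the real sweep ran on the full project dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "full adder circuit using half adders",
    "4 bit ripple carry adder",
    "traffic light controller state machine",
    "sr latch using nor gates",
    "flip flop with asynchronous reset",
    "8 to 1 multiplexer using logic gates",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit one model per topic count; score() gives an approximate
# log-likelihood, so higher is better
scores = {}
for n_topics in range(5, 55, 5):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(counts)
    scores[n_topics] = lda.score(counts)

best = max(scores, key=scores.get)
print(best, scores[best])
```

On our data, 10 topics came out on top; on the toy corpus above the winner is of course meaningless.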
I also made some minor changes to the cleaning module:
- Combined the name and description to create a single dictionary.
- Changed the name from “Untitled” to “ ”.
- Removed Unicode characters of the form “\r\n”, “\u1001”, etc.
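The cleaning changes above can be sketched roughly like this (`clean_project` and the dict keys `name`/`description` are hypothetical stand-ins for the real dump schema):

```python
import re

def clean_project(project):
    """Merge a project's name and description into one cleaned text field."""
    name = project.get("name") or ""
    # "Untitled" is an auto-generated placeholder, not a real name
    if name == "Untitled":
        name = " "
    desc = project.get("description") or ""
    text = f"{name} {desc}"
    # Drop escaped-unicode artifacts such as "\u1001" left over in raw dumps
    text = re.sub(r"\\u[0-9a-fA-F]{4}", " ", text)
    # Drop both literal "\r\n" escape sequences and real control characters
    text = text.replace("\\r\\n", " ").replace("\r", " ").replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()

print(clean_project({"name": "Untitled", "description": "full adder\r\ndemo"}))
```

The combined name + description strings are what get fed into CountVectorizer.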
The majority of my week went into integrating the LDA model with the K-Means model, and not only were the results impressive, but the distribution after K-Means clustering was also much better compared to the TF-IDF + K-Means model.
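The integration itself is straightforward: K-Means is fit on the dense document–topic matrix instead of the sparse TF-IDF matrix. A minimal sketch (toy documents again; 3 clusters is an arbitrary choice for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "full adder circuit using half adders",
    "4 bit ripple carry adder",
    "traffic light controller state machine",
    "sr latch using nor gates",
    "flip flop with asynchronous reset",
    "8 to 1 multiplexer using logic gates",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
topic_dist = LatentDirichletAllocation(
    n_components=10, random_state=42
).fit_transform(counts)

# Cluster in the 10-dimensional topic space rather than the sparse word space
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(topic_dist)
print(labels)
```

Clustering in topic space is what evened out the cluster sizes compared to clustering raw TF-IDF vectors.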
Note: You can check the code for the diagrams that I have used here.
The next step was the 2nd layer of the model, which initially was re-ranking based on stars and views, but this alone still wouldn’t be able to recommend similar circuits because each cluster still contained a large number of items.
We needed to incorporate a distance metric so that we could recommend similar circuits even when a cluster had 40k+ elements. This was a major problem, since manually calculating these distances with K-Means had a time complexity of O(N² log N), which was computationally heavy.
There were also a few more things that we had to take care of:
1. Noisy dumped elements (in the largest cluster)
2. Usage of tag names
3. Model storage
4. 2nd-layer implementation for old and new projects
5. Time taken for recommending
So we decided to replace K-Means clustering with K-D Trees, mainly because:
- The nearest-neighbor search of a K-D Tree has an average time complexity of O(log N), hence it can be used online.
- The cost of building the tree is O(N log N), giving faster training.
Using K-D Trees solved problems 1, 4, and 5, and this was the major work for weeks 8 and 9, which I’ll cover in the next blog.
The first layer was completed and the results were pretty good, but due to the problems mentioned above we had to remove K-Means from our model completely (I was super sad since I had spent a lot of time on the integration, but okay).
I hadn’t really worked with K-D Trees before, so I was also kinda nervous. More details in my next blog covering weeks 8 and 9.