GSoC’21 Coding Phase 1 @Circuitverse

Continuing from my previous blog, which covered the community bonding period of my GSoC journey, in this blog I am going to write about the first phase of the coding period, divided into weeks 1 to 5.

As I write this blog, we come to the end of week 5, and it's time for midterm evaluations for GSoC 2021. This was the proposed timeline for my project.

Community Bonding Period

Things I did in the community bonding period:

1. Setting up the development environment
2. Bonding with mentors and getting clarity on the project
3. Coding the data cleaning module

The 3rd task was officially a part of my week 1 and 2 work, but I had some spare time during the 20-day community bonding period and was able to complete it early. It took me a lot of time to get the SQL query right since it involved multiple attributes from 2 different tables. I can bet that I learned more SQL in these 3–4 days than in my entire semester at BITS.

SELECT projects.name, view, description,
       array_agg(tags.name) AS tag_names
FROM projects
LEFT JOIN taggings ON taggings.project_id = projects.id
LEFT JOIN tags ON tags.id = taggings.tag_id
WHERE projects.project_access_type = 'Public'
  AND projects.forked_project_id IS NULL
GROUP BY projects.id;

I was finally ready with the final SQL query (at least I thought I was), which was further improved by Satvik and Aboobacker, and the raw dataset (in JSON) was obtained and cleaned using Python.
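To give an idea of what that Python cleaning step looks like, here is a minimal sketch. The field names and cleaning rules are illustrative (the post doesn't spell out the exact steps), but the shape is typical: lowercase the text, strip HTML tags and punctuation, and drop unusable rows.

```python
import re

# Hypothetical input: rows exported from the SQL query as JSON.
raw = [
    {"name": "Full Adder", "description": "A 1-bit <b>full adder</b>",
     "tag_names": ["adder", None]},
    {"name": "  ", "description": None, "tag_names": [None]},
]

def clean_text(text):
    """Lowercase, strip HTML tags, punctuation and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text or "")
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())

def clean_rows(rows):
    cleaned = []
    for row in rows:
        name = clean_text(row.get("name"))
        if not name:  # drop projects with an empty name
            continue
        cleaned.append({
            "name": name,
            "description": clean_text(row.get("description")),
            "tags": [t.lower() for t in row.get("tag_names", []) if t],
        })
    return cleaned

print(clean_rows(raw))
```

Here the second row is dropped entirely (blank name) and the `None` tags coming from the LEFT JOINs are filtered out.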

The Coding Period begins!!

Since I had already coded the data cleaning module during the community bonding period itself, I had some extra time to build the best possible model for recommending projects. My mentor Biswesh and I decided to go with 3 different models based on:

1. Clustering Algorithms
2. Topic Modelling using Latent Dirichlet Allocation (LDA)
3. KD Trees and Locality-Sensitive Hashing

Week 1

The next step was to extract features from the cleaned dataset. This was done using TF-IDF, a technique that scores each word in a document by combining its term frequency with its inverse document frequency, thereby filtering out the important words. We thought of applying TF-IDF separately to the name and the description of the circuits and then combining the two vectors with certain weights (equal as of now).
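The weighted name-plus-description scheme can be sketched like this with scikit-learn (the toy strings and the equal 0.5/0.5 weights are placeholders; only the general approach is from the post):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the cleaned project names and descriptions.
names = ["full adder", "ripple counter", "four bit adder"]
descriptions = ["adds two bits", "counts clock pulses", "adds two numbers"]

# Separate TF-IDF vectorizers for names and descriptions,
# combined side by side with equal weights.
w_name, w_desc = 0.5, 0.5
name_vecs = TfidfVectorizer().fit_transform(names)
desc_vecs = TfidfVectorizer().fit_transform(descriptions)
features = hstack([w_name * name_vecs, w_desc * desc_vecs]).tocsr()

# One sparse row per project, name vocab + description vocab columns.
print(features.shape)
```

The result stays sparse, which matters a lot later in the story.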

Once we had an embedding of each project, it was time to build the first layer of the model! We thought of testing the following clustering algorithms:

1. Mini-Batch K-Means
2. Mean-shift
3. Agglomerative clustering
4. Affinity Propagation
5. Gaussian Mixture Models

Week 2

The steps after the TF-IDF extraction were as follows:

1. Use PCA or some other dimensionality reduction technique to reduce the features to 2 components, so we could visualize the clusters and get a faint idea of how the clustering works on the dataset.

2. Run the clustering algorithm on it.
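The plan in code, sketched on a small random dense sample (the real TF-IDF matrix is sparse, and densifying it is exactly what goes wrong next):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a small, densified sample of TF-IDF vectors.
X = rng.random((200, 50))

# Step 1: reduce to 2 components to eyeball the cluster structure.
X_2d = PCA(n_components=2).fit_transform(X)

# Step 2: run the clustering algorithm on the full features.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# X_2d can now be scatter-plotted, coloured by `labels`.
print(X_2d.shape, labels.shape)
```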

I had worked with all these algorithms while I was drafting a proposal with the sample dataset, so I was super confident with this part.

However, there was a problem that neither my mentor nor I had ever come across. In order to use PCA for dimensionality reduction, we had to convert the sparse matrix into a dense one, and for the first time I faced this issue:

numpy.core._exceptions.MemoryError: Unable to allocate 42.7 GiB for an array with shape (169370, 33833) and data type float64

The problem was that, since the dataset was so huge (around 180,000 elements), my poor 8 GB RAM didn't have enough memory to store such a huge dense matrix (the sparse matrix obtained after TF-IDF takes much less space). My mentor and I spent 1–2 days trying to resolve this issue (I also tried splitting the sparse matrix into smaller chunks and converting each to dense), but the error persisted. We then decided to stick with the Mini-Batch K-Means algorithm as our layer 1 clustering algorithm, since most of the other clustering algorithms don't work with sparse matrices directly.
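The 42.7 GiB figure checks out arithmetically, and the escape hatch is that scikit-learn's `MiniBatchKMeans` accepts the sparse matrix directly, so no densification is ever needed (toy sparse matrix below just for illustration):

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import MiniBatchKMeans

# The dense matrix from the error message really does need ~42.7 GiB:
gib = 169370 * 33833 * 8 / 2**30  # float64 = 8 bytes per entry
print(f"{gib:.1f} GiB")           # ≈ 42.7

# MiniBatchKMeans clusters the sparse TF-IDF matrix as-is.
X = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)
labels = MiniBatchKMeans(n_clusters=8, random_state=0,
                         n_init=3).fit_predict(X)
print(labels[:10])
```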

Week 3

I then implemented the K-Means algorithm on the dataset and got pretty good results. I also realized that I didn't have stars or views in the dataset, which were required to re-rank the results for layer 2. This meant editing the SQL query to include the number of stars and views, which were attributes of a different table 😔.

SELECT projects.id AS id, projects.name, description,
       array_agg(tags.name) AS tag_names, view,
       count(stars.id) AS star_count
FROM projects
LEFT JOIN taggings ON taggings.project_id = projects.id
LEFT JOIN tags ON tags.id = taggings.tag_id
LEFT JOIN stars ON projects.id = stars.project_id
WHERE projects.project_access_type = 'Public'
  AND projects.forked_project_id IS NULL
GROUP BY projects.id;

I also included the cluster number of each project in the dataset so that I could re-rank the clusters on the basis of stars and views (this is what I had suggested as the final model in my proposal). Hence, I was ready with the proposed model (without optimizing the algorithm) at the end of week 3!!
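A minimal sketch of that layer 2 re-ranking, assuming each project record carries its cluster number, star count, and view count (field names are illustrative; the exact ranking formula isn't stated in the post, so this simply sorts by stars and breaks ties by views):

```python
# Hypothetical records after clustering and the updated SQL query.
projects = [
    {"id": 1, "cluster": 0, "star_count": 5,  "view": 120},
    {"id": 2, "cluster": 0, "star_count": 12, "view": 40},
    {"id": 3, "cluster": 1, "star_count": 1,  "view": 300},
    {"id": 4, "cluster": 0, "star_count": 12, "view": 90},
]

def recommend(cluster_id, k=10):
    """Layer 2: rank a cluster's projects by stars, then views."""
    members = [p for p in projects if p["cluster"] == cluster_id]
    members.sort(key=lambda p: (p["star_count"], p["view"]), reverse=True)
    return [p["id"] for p in members[:k]]

print(recommend(0))  # → [4, 2, 1]: 12 stars beats 5; views break the tie
```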

Week 4

The first half of this week went into optimizing the K-Means algorithm. There were essentially 4 hyperparameters:

1. The number of clusters

2. The re-assignment ratio

3. Random State

4. Mini-Batch size

The optimal number of clusters (suggested mathematically using SSE) was around 40, which was too low for 180k items, hence we decided to keep it at 500. The next problem we faced (and are currently facing) is essentially the 80–20 problem: 80% of the items are present in 20% of the clusters, and there is a massive need for re-assignment. Hence, we thought of playing around with the hyperparameters (namely the re-assignment ratio, the mini-batch size, and the random state) to find similarities or trends in the items.
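The SSE-based suggestion comes from the classic elbow method: fit the model for a range of cluster counts and look for the bend in the sum-of-squared-errors curve, which scikit-learn exposes as `inertia_`. A sketch on toy sparse data (the real run used the full TF-IDF matrix and a much wider k range):

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import MiniBatchKMeans

# Toy sparse matrix standing in for the TF-IDF features.
X = sparse_random(2000, 100, density=0.05, format="csr", random_state=0)

# SSE (inertia_) for a few cluster counts; the "elbow" in this
# curve is what suggested ~40 clusters on the real data.
sse = {}
for k in (10, 20, 40, 80):
    model = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=3).fit(X)
    sse[k] = model.inertia_

print(sse)  # SSE shrinks as k grows; the elbow is where it flattens
```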

Since this model was almost done, we moved on to the second model: Topic Modelling using Latent Dirichlet Allocation (LDA). I implemented it using Gensim (which had some advantages over other libraries like Scikit-learn) and was getting good recommendations for a small number of topics (30–40), but a lot of repeated values were being shown for a higher number of topics (100 or so).

Week 5

The LDA model implemented using Gensim performed better when we combined the name and description into a single string (since a lot of descriptions are empty); we are testing this for our K-Means model as well. However, this model was still showing a lot of repeated topic names for topic counts above 100. So we decided to implement the LDA model using the Scikit-learn library, and that's still a work in progress (I am facing some issues; I will bother my mentor once I am done writing this blog).

I also coded the testing module, which would be crucial for comparing the results of the various models. It basically selects some random projects and gives the URLs of the top 10 recommendations from each project's cluster.
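Roughly, the testing module works like this (the dataset, cluster assignments, and URL scheme below are all placeholders for illustration):

```python
import random

# Hypothetical dataset: project id, cluster label, and a URL
# built from the id (the URL pattern here is made up).
projects = [
    {"id": i, "cluster": i % 5,
     "url": f"https://circuitverse.org/users/demo/projects/{i}"}
    for i in range(100)
]

def sample_recommendations(n_samples=3, k=10, seed=42):
    """Pick random projects and list top-k URLs from their cluster."""
    rng = random.Random(seed)
    report = {}
    for proj in rng.sample(projects, n_samples):
        same_cluster = [p["url"] for p in projects
                        if p["cluster"] == proj["cluster"]
                        and p["id"] != proj["id"]]
        report[proj["id"]] = same_cluster[:k]
    return report

print(sample_recommendations())
```

Eyeballing these URL lists across models is a quick, human-in-the-loop way to compare recommendation quality before any formal metric exists.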

Just when I thought things were going smoothly, I caught a really bad viral fever (thankfully it's not COVID) and wasn't able to work properly for the past 3–4 days, but I have now almost recovered and will resume my work from tomorrow.

Things that are left for the 2nd half

1. Coding up the LDA model and, if possible, the KD Trees model (depending on the priority and how the current results look)

2. Planning the 2nd layer of the model (currently it's just a simple re-ranking based on the number of views and stars)

3. Creating a model pipeline.

4. Making a UI and integrating the system with the website (if time permits).

Conclusion

So, this was a short summary of a journey full of learnings, unexpected problems, and looking things up on Stack Overflow 😂. Really excited for the second half of the coding period! Wish me luck for my evaluations.

The next post will be the final article of this amazing journey, covering the second half of the coding period. I will probably also write another blog on the lessons I learned from this experience, so follow for more such posts!

CS @ BITS, GSoC'21 @ CircuitVerse, ML enthusiast, YouTuber