COVID-19, Post 5

COVID-19 Literature Clustering

Interactive Bokeh Plot

Goal

Given the sheer volume of literature and the rapid spread of COVID-19, it is difficult for health professionals to keep up with new information on the virus. Can clustering similar research articles together simplify the search for related publications? How can the content of the clusters be characterized?

By using clustering for labelling in combination with dimensionality reduction for visualization, the collection of literature can be represented by a scatter plot. On this plot, publications on highly similar topics will share a label and will be plotted near each other. To find meaning in the clusters, topic modelling will be performed to extract the keywords of each cluster.

By using Bokeh, the plot will be interactive. Users will have the option of seeing the plot as a whole or filtering the data by cluster. If a narrower scope is required, the plot will also have a search function which will limit the output to only papers containing the search term. Hovering over points on the plot will give basic information like title, author, journal, and abstract. Clicking on a point will bring up a menu with a URL that can be used to access the full publication.
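
As a rough illustration, the snippet below sketches how such a plot can be wired together with Bokeh. The tiny placeholder DataFrame, its column names (title, authors, journal, abstract, url), and the output file name are illustrative assumptions, not the project's actual data or settings.

```python
# Minimal sketch of the interactive Bokeh plot; the DataFrame below is a placeholder.
import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, HoverTool, TapTool, OpenURL
from bokeh.transform import factor_cmap
from bokeh.palettes import Category10

# Placeholder data: in the real plot, x/y are t-SNE coordinates and cluster is the k-means label.
df = pd.DataFrame({
    "x": [0.1, 0.5, 2.3],
    "y": [1.2, 0.9, -0.4],
    "cluster": ["0", "0", "1"],
    "title": ["Paper A", "Paper B", "Paper C"],
    "authors": ["Smith et al.", "Lee et al.", "Garcia et al."],
    "journal": ["Journal X", "Journal Y", "Journal Z"],
    "abstract": ["...", "...", "..."],
    "url": ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
})
source = ColumnDataSource(df)

p = figure(title="COVID-19 Literature Clustering",
           tools="pan,wheel_zoom,box_zoom,reset,tap", width=900, height=700)

# Hovering over a point shows the paper's basic metadata.
p.add_tools(HoverTool(tooltips=[("Title", "@title"), ("Author(s)", "@authors"),
                                ("Journal", "@journal"), ("Abstract", "@abstract")]))

# Color points by k-means cluster label; clicking a point opens its URL.
p.scatter("x", "y", source=source, size=6, alpha=0.7,
          color=factor_cmap("cluster", palette=Category10[10],
                            factors=sorted(df["cluster"].unique())))
p.select(type=TapTool).callback = OpenURL(url="@url")

output_file("covid19_clusters.html")
show(p)
```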

This is a difficult time in which health care workers, sanitation staff, and many other essential personnel are out there keeping the world afloat. Sitting at home has given us time to try to help in our own way. We hope that our work will have some impact in the fight against COVID-19. It should be noted, however, that we are not epidemiologists, and it is not our place to gauge the importance of these papers. This tool was created to make it easier for trained professionals to sift through the many, many publications related to the virus and make their own determinations.

Approach

  • Use Natural Language Processing (NLP) to parse the text from the body of each document
  • Turn each document instance d_i into a feature vector X_i using Term Frequency-Inverse Document Frequency (TF-IDF)
  • Apply dimensionality reduction to each feature vector X_i using t-Distributed Stochastic Neighbor Embedding (t-SNE), producing a two-dimensional embedding Y_1 in which similar research articles lie close together
  • Use Principal Component Analysis (PCA) to project X down to an embedding Y_2 with enough dimensions to retain 95% of the variance while removing noise and outliers
  • Apply k-means clustering on Y_2 with k = 20, and use the resulting labels to color the clusters on Y_1
  • Apply Topic Modeling on X using Latent Dirichlet Allocation (LDA) to discover keywords from each cluster
  • Investigate the clusters visually on the plot, zooming in to specific articles as needed, and via classification using Stochastic Gradient Descent (SGD); a rough sketch of this pipeline is shown below
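
The steps above map fairly directly onto scikit-learn. The sketch below is a rough outline of the pipeline under that assumption; load_body_texts() is a hypothetical helper standing in for the NLP parsing step, and the parameter choices (vocabulary cap, perplexity, random seeds) are illustrative rather than the project's exact settings.

```python
# Rough sketch of the pipeline: TF-IDF -> PCA (95% variance) -> k-means (k=20),
# t-SNE for the 2-D plot, and per-cluster LDA for keywords. Parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

texts = load_body_texts()  # hypothetical helper: list of parsed body texts, one per paper

# 1. Each document d_i becomes a TF-IDF feature vector X_i.
X = TfidfVectorizer(max_features=2**12, stop_words="english").fit_transform(texts).toarray()

# 2. PCA keeps enough components to retain 95% of the variance (embedding Y_2).
Y2 = PCA(n_components=0.95, random_state=42).fit_transform(X)

# 3. k-means with k = 20 on Y_2; these labels color the final plot.
labels = KMeans(n_clusters=20, random_state=42).fit_predict(Y2)

# 4. t-SNE projects X down to two dimensions for the scatter plot (embedding Y_1).
Y1 = TSNE(n_components=2, perplexity=50, random_state=42).fit_transform(X)

# 5. LDA on each cluster's word counts surfaces the keywords of that cluster.
for k in range(20):
    cluster_texts = [t for t, label in zip(texts, labels) if label == k]
    cv = CountVectorizer(stop_words="english")
    counts = cv.fit_transform(cluster_texts)
    lda = LatentDirichletAllocation(n_components=1, random_state=42).fit(counts)
    top_idx = lda.components_[0].argsort()[::-1][:10]
    print(f"Cluster {k}:", ", ".join(cv.get_feature_names_out()[i] for i in top_idx))
```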

Conclusion

Overall, the goal was to cluster the published literature on COVID-19 and reduce the dimensionality of the dataset for visualization purposes. I created an interactive scatter plot of the papers in which material with similar themes is grouped together. This grouping enables professionals to quickly find material related to a central topic instead of having to search manually for each related work. The clustering was done by running k-means on a pre-processed, vectorized version of the literature's body text, splitting the data into clusters, while LDA handled topic modelling to identify the keywords prevalent in each cluster. Both the clusters and the keywords are found by unsupervised learning models, which can be useful in revealing patterns that humans may not have considered.

Through this process, there was no need to organize the papers manually; the grouping emerged from latent connections inherent in the data.

K-means (represented by colors) and t-SNE (represented by position on the plot) were able to independently find clusters, showing that relationships between papers can be identified and measured. This means that papers written on similar topics typically sit near one another on the plot and bear the same k-means label. However, due to the complexity of the dataset, k-means and t-SNE will sometimes arrive at different decisions. The topics of much of the literature are continuous and do not have a concrete decision boundary, so k-means and t-SNE can find different similarities by which to group the papers.

Since this is unsupervised learning, I have to remind readers that this is not an exact science. I had to examine the plot to confirm that clusters were actually being formed. Once I convinced myself that this was happening, I examined the titles and abstracts of some of the papers in different clusters. For the most part, similar research areas were clustered together. The last method I used for evaluation was classification: by training a classification model on the k-means labels and then testing it on a held-out subset of the data, I could see that the clustering was not completely arbitrary and performed well.
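
As a concrete illustration of that last check, the sketch below trains an SGD classifier on the k-means labels and evaluates it on a held-out split. The variables X and labels are assumed to come from the pipeline sketched under Approach; the split ratio and seed are illustrative.

```python
# Sanity check: if the k-means labels were arbitrary noise, a linear classifier
# trained on them should not generalize to held-out papers.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

clf = SGDClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```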

My manual inspection of the documents is limited, however, as I have not gone deep into assessing the meaning of the literature.


The code can be found here