1. COVID-19, Post 5

    COVID-19 Literature Clustering: Interactive Bokeh Plot

    Goal
    Given the large volume of literature and the rapid spread of COVID-19, it is difficult for health professionals to keep up with new information on the virus. Can clustering similar research articles together simplify the search for related publications? How can the content of the clusters be qualified? By using clustering for labelling in combination with dimensionality reduction for visualization, the collection of literature can be represented by a scatter plot. On this plot, publications on highly similar topics will share a label and will be plotted near each other. In order to find meaning in the clusters, topic modelling will be performed to find the keywords of each cluster. By using Bokeh, the plot will be interactive. Users will have the option of seeing the plot as a whole or filtering the data by cluster. If a narrower scope is required, the plot will also have a search function which limits the output to only papers containing the search term. Hovering over points on the plot will give basic information like title, author, journal, and abstract. Clicking on a point will bring up a menu with a URL that can be used to access the full publication.

    This is a difficult time in which health care workers, sanitation staff, and many other essential personnel are out there keeping the world afloat. Sitting at home has given us time to try to help in our own way. We hope that our work will have some impact in the fight against COVID-19. It should be noted, however, that we are not epidemiologists, and it is not our place to gauge the importance of these papers. This tool was created to make it easier for trained professionals to sift through the many publications related to the virus and reach their own determinations.

    Approach
    - Use Natural Language Processing (NLP) to parse the text from the body of each document
    - Turn each document instance d_i into a feature vector X_i using Term Frequency-Inverse Document Frequency (TF-IDF)
    - Apply dimensionality reduction to each feature vector X_i using t-Distributed Stochastic Neighbor Embedding (t-SNE) to place similar research articles near each other in the two-dimensional embedding Y_1
    - Use Principal Component Analysis (PCA) to project X down to the number of dimensions that keeps 0.95 of the variance while removing noise and outliers, giving embedding Y_2
    - Apply k-means clustering on Y_2, with k = 20, to label each cluster on Y_1
    - Apply topic modelling on X using Latent Dirichlet Allocation (LDA) to discover keywords for each cluster
    - Investigate the clusters visually on the plot, zooming in to specific articles as needed, and via classification using Stochastic Gradient Descent (SGD)
    (A code sketch of this pipeline is given after the conclusion below.)

    Conclusion
    Overall, the goal is to cluster the published literature on COVID-19 and reduce the dimensionality of the dataset for visualization purposes. I created an interactive scatter plot of the papers in which material of similar themes is grouped together. This grouping enables professionals to quickly find material related to a central topic instead of having to manually search for each related work. The clustering was done through k-means on a pre-processed, vectorized version of the literature's body text. K-means splits the data into clusters, while LDA handles topic modelling to identify keywords. This gives us the topics that are prevalent in each of the clusters.
Both the clusters and the keywords are found through unsupervised learning models, which can be useful in revealing patterns that humans may not have considered. Through this process, there was no need to organize papers manually; the grouping arises from latent connections inherent in the data. K-means (represented by colors) and t-SNE (represented by point positions) were able to independently find clusters, showing that relationships between papers can be identified and measured. This means that papers written on similar topics are typically near one another on the plot and bear the same k-means label. However, due to the complexity of the dataset, k-means and t-SNE will sometimes arrive at different decisions: the topics of much of the literature are continuous and do not have a concrete decision boundary, so k-means and t-SNE can find different similarities by which to group the papers.

Since this is unsupervised learning, I have to remind readers that this is not an exact science. I had to examine the plot to confirm that clusters were actually being formed. Once I convinced myself that this was happening, I examined the titles and abstracts of some of the papers in different clusters. For the most part, similar research areas were clustered together. The last method I used for evaluation was classification: by training a classification model on the k-means labels and then testing it on a held-out subset of the data, I could see that the clustering was not arbitrary and generalized well. My manual inspection of the documents is limited, however, as I have not gone deep into assessing the meaning of the literature. Code is found here …
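Below is a minimal sketch of the pipeline described in the Approach above, assuming scikit-learn and pandas. The k = 20, the 0.95 PCA variance target, and the SGD sanity check follow the post; the input file name, column name, vocabulary size, and the corpus-wide (rather than per-cluster) LDA fit are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Placeholder input: one row per paper with a 'body_text' column.
df = pd.read_csv("cord19_body_text.csv")

# Body text -> TF-IDF feature vectors X_i (vocabulary size is an assumption).
X = TfidfVectorizer(max_features=2**12, stop_words="english").fit_transform(df["body_text"])
X_dense = X.toarray()

# t-SNE on X gives the 2-D scatter-plot embedding Y_1.
Y1 = TSNE(n_components=2, random_state=42).fit_transform(X_dense)

# PCA keeping 0.95 of the variance denoises X, giving Y_2.
Y2 = PCA(n_components=0.95, random_state=42).fit_transform(X_dense)

# k-means with k = 20 on Y_2; these labels colour the points of Y_1.
labels = KMeans(n_clusters=20, random_state=42).fit_predict(Y2)

# LDA topic modelling to surface keywords (the post runs it per cluster;
# a single corpus-wide fit is shown here for brevity).
lda = LatentDirichletAllocation(n_components=20, random_state=42).fit(X)

# Sanity check via classification: an SGD classifier trained on the k-means
# labels should generalize to held-out papers if the clusters are not arbitrary.
X_tr, X_te, y_tr, y_te = train_test_split(Y2, labels, test_size=0.2, random_state=42)
clf = SGDClassifier(random_state=42).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```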

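The interactivity described in the Goal (hover tooltips with title and journal, click-to-open URL) can be sketched with Bokeh roughly as follows; the example data and column names here are placeholders rather than the actual CORD-19 fields.

```python
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool, TapTool, OpenURL

# Placeholder data: 2-D t-SNE coordinates plus metadata for each paper.
source = ColumnDataSource(data=dict(
    x=[0.1, 0.5, 0.9],
    y=[0.2, 0.8, 0.4],
    title=["Paper A", "Paper B", "Paper C"],
    journal=["Journal 1", "Journal 2", "Journal 3"],
    cluster=["0", "1", "1"],
    url=["https://example.org/a", "https://example.org/b", "https://example.org/c"],
))

p = figure(title="COVID-19 literature, t-SNE projection",
           tools="pan,wheel_zoom,reset,tap")
p.scatter("x", "y", source=source, size=8)

# Hovering over a point shows basic information about the paper.
p.add_tools(HoverTool(tooltips=[("title", "@title"),
                                ("journal", "@journal"),
                                ("cluster", "@cluster")]))

# Clicking a point opens the paper's URL.
p.select(type=TapTool).callback = OpenURL(url="@url")

show(p)
```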

  2. COVID-19, Post 4

    Update as of 21:04: Fixed the SEIR Model …


  3. COVID-19, Post 3, SEIR model

    SIR Model
    The SIR Model is the most famous epidemiologic model. It tracks the portion of the population in each of the following states:
    - Susceptible (S) - the individual has not contracted the disease, but can be infected through transmission from infected people.
    - Infected (I) - the individual has contracted the disease.
    - Recovered/Deceased (R) - the individual either survived, hence recovered, or has died, hence deceased.
    β is the parameter for the transmission rate, and γ is the recovery rate. On the right are the ordinary differential equations that summarize this as a short-term deterministic model. It ignores vital dynamics (i.e. new births in a growing population), so the population is closed. The graph should make sense: over time, the number Susceptible (S) decreases, the number Recovered/Deceased (R) increases, and the number Infected (I) rises to a peak and then drops back down. I solve the differential equations using a 4th-order Runge-Kutta method. I then decided to change my model to the superior SEIR Model.

    SEIR Model
    This adds an "Exposed" compartment, which is ideal for infections with a significant incubation period during which individuals have been infected but are not yet infectious themselves. α is the parameter for the average incubation, and μ is the death rate. This is an excellent SEIR-model calculator for COVID. So, to re-iterate:
    - S ==> Susceptible: number of susceptible individuals
    - E ==> Exposed: number of exposed individuals
    - I ==> Infectious: number of infectious individuals
    - R ==> Recovered or Removed: number of recovered (or immune) individuals
    We have S + E + I + R = N; this is only constant because of the (degenerate) assumption that birth and death rates are equal. N is the country population. There are also a few other variables:
    - R_0 & R_t ==> Reproduction number, defined for the state in which no other individuals are infected or immunized (naturally or through vaccination)
    - T_inf ==> Average duration of the infection; 1/T_inf can be treated as an individual experiencing one recovery per T_inf units of time
    - T_inc ==> Average incubation period; many papers and articles define it as 5.1 (reference, reference2)
    R_0 can decrease through intervention measures such as government isolation/lock-downs, vaccinations, etc. Here's an example of the SEIR model on Hubei without intervention, and now with intervention: a substantial difference, as you can see. In models with intervention, we can reduce R_t by using decay functions, like the Hill function, which halves R_t over time but never reaches zero. I find that the average incubation period (T_inc) is 5.2 and the average infectious period (T_inf) is 2.9. As for R_t, I find this number by fitting the real data to the SEIR model's curve; the same goes for the CFR, or Case Fatality Rate. Given that the Hill function halves R_0 over time, that is what I set as the decay over the total period that intervention is enforced; I think 80 days is a good average. So, once intervention begins, R0 = R0 * 0.5. TL;DW (too long, didn't write, especially considering the hour), so I'm just going to link my code here; it's relatively self-explanatory. Let's just throw some relevant forecasting curves up first: United States, New York, Global. My SEIR Model Code …
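A minimal sketch of the SEIR system integrated with a classic 4th-order Runge-Kutta step, using the post's T_inc = 5.2 and T_inf = 2.9 and its simple intervention rule of halving R_0 once intervention begins. The population size, initial exposed count, R_0, and intervention start day are illustrative placeholders rather than the fitted values, and the death rate μ / CFR split is omitted.

```python
import numpy as np

# SEIR right-hand side:
#   dS/dt = -beta*S*I/N,  dE/dt = beta*S*I/N - E/T_inc,
#   dI/dt =  E/T_inc - I/T_inf,  dR/dt = I/T_inf,  with beta = R_t / T_inf.
# (The death rate mu / CFR handling from the post is omitted here.)
def seir_derivs(state, t, N, R_t, T_inc, T_inf):
    S, E, I, R = state
    beta = R_t / T_inf
    dS = -beta * S * I / N
    dE = beta * S * I / N - E / T_inc
    dI = E / T_inc - I / T_inf
    dR = I / T_inf
    return np.array([dS, dE, dI, dR])

def rk4_step(f, state, t, dt, *args):
    # Classic 4th-order Runge-Kutta update.
    k1 = f(state, t, *args)
    k2 = f(state + dt / 2 * k1, t + dt / 2, *args)
    k3 = f(state + dt / 2 * k2, t + dt / 2, *args)
    k4 = f(state + dt * k3, t + dt, *args)
    return state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(N=59_000_000, E0=100, days=300, dt=1.0, R0=3.0,
             T_inc=5.2, T_inf=2.9, intervention_start=60):
    # N, E0, R0 and intervention_start are illustrative placeholders;
    # T_inc and T_inf follow the post.
    state = np.array([N - E0, float(E0), 0.0, 0.0])  # S, E, I, R
    history = [state]
    for step in range(int(days / dt)):
        t = step * dt
        # The post's simple intervention rule: halve R_0 once intervention begins.
        R_t = R0 * 0.5 if t >= intervention_start else R0
        state = rk4_step(seir_derivs, state, t, dt, N, R_t, T_inc, T_inf)
        history.append(state)
    return np.array(history)

if __name__ == "__main__":
    curves = simulate()
    print("peak infectious:", int(curves[:, 2].max()))
```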


  4. COVID-19, Post 2

    Background: sigmoid
    "In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 45,000 scholarly articles, including over 33,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up." The data is available here:

    Update on Sigmoid Fit Model
    Updated today with data from 4/10/2020. The model predicted 506.8708k total confirmed cases for the US for today (4/11/2020). As of 20:56 on 4/11/2020, the US has 532,879 cases, which is around where my model predicts the value to be between the 12th and 13th of this month.

    RNN Model
    Basically, an RNN is a neural network with a feedback loop added that connects a hidden layer to itself, maintained as a hidden state. Updated today with data from 4/10/2020. An RNN works with sequence data to make predictions using sequential memory. The example they use in the link above involves asking a simple question like "What time is it?" First, the word "What" enters the RNN and produces some output "01", but not before producing a hidden state. The next input, "time", produces an output "02", but it goes through a hidden state based on the one produced by the previous input combined with the one for this input. This continues sequentially all the way down to "?", each word (input) incorporating the previous hidden state (which is a combination of all the hidden states that came before it) to produce a final output. This final output is passed into a feed-forward layer to classify an intent. Here's a closer look at the final hidden state of the example RNN.

    As each successive step is processed, the RNN has trouble retaining information from the earlier steps. In the image above, you can see that the first two hidden states (from "What" and "time") are cut into small slivers by the time processing reaches the final step. This is a symptom of the Vanishing Gradient problem. Recall that training a neural network has three steps:
    - A forward pass is done and a prediction is made.
    - The prediction is compared to the ground truth using a loss function, which outputs an error value estimating how poorly the neural network is performing.
    - The error value is used for back propagation, which calculates the gradients for each node in the neural network.
    The 'gradient' is the value used to adjust the neural network's internal weights, enabling the network to learn; the bigger the gradient, the bigger the adjustments, and vice versa. With multiple layers and back propagation, each node in a layer calculates its gradient from the gradients in the layer after it, so if the gradient in a layer is small, the gradient in the layer before it will be even smaller. Gradients therefore shrink exponentially as they are propagated back, and the earlier layers fail to learn because their weights are barely adjusted by such small gradients. This is the Vanishing Gradient problem. In general, RNNs have bad short-term memory.
To combat this, the Long Short-Term Memory (LSTM) was created, as was the Gated Recurrent Unit (GRU). These are basically RNNs that can learn long-term dependencies using mechanisms called 'gates'. The gates are just different tensor operations that learn what information to add to or remove from the hidden state. I am not using an LSTM or GRU right now because a plain RNN trains faster and uses less computational resources. …
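A minimal sketch of a plain-RNN forecaster of the kind described above (the post opts for an RNN over an LSTM/GRU for speed), assuming PyTorch. The window length, hidden size, training settings, and the toy input series are illustrative assumptions; in practice the input would be the scaled daily case counts.

```python
import torch
import torch.nn as nn

class CaseRNN(nn.Module):
    """Plain RNN regressor: a window of past values -> the next value."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                    # x: (batch, window, 1)
        _, h_last = self.rnn(x)              # final hidden state: (1, batch, hidden)
        return self.head(h_last.squeeze(0))  # (batch, 1)

def make_windows(series, window=7):
    # Slice a 1-D series into (window -> next value) training pairs.
    xs, ys = [], []
    for i in range(len(series) - window):
        xs.append(series[i:i + window])
        ys.append(series[i + window])
    X = torch.tensor(xs, dtype=torch.float32).unsqueeze(-1)
    y = torch.tensor(ys, dtype=torch.float32).unsqueeze(-1)
    return X, y

if __name__ == "__main__":
    # Placeholder series; in practice this would be the scaled case counts.
    series = [float(i ** 1.5) for i in range(60)]
    X, y = make_windows(series)
    model, loss_fn = CaseRNN(), nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for epoch in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X), y)          # forward pass -> loss -> backprop
        loss.backward()
        opt.step()
    print("final training loss:", loss.item())
```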


  5. COVID-19, Post 1

    Forecasting COVID-19 using Logistic Regression: Sigmoid-Fitting …
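A minimal sketch of the sigmoid (logistic curve) fitting named in this post's title, assuming scipy.optimize.curve_fit. The example series, initial guesses, and forecast horizon are placeholders, not the US numbers quoted in the later update.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, L, k, t0):
    # Logistic curve: L = final size, k = growth rate, t0 = inflection day.
    return L / (1.0 + np.exp(-k * (t - t0)))

# Placeholder cumulative-case series (day index vs. total confirmed cases).
days = np.arange(30)
cases = 500_000 / (1 + np.exp(-0.25 * (days - 20))) + np.random.normal(0, 2_000, 30)

# Fit the three parameters; p0 is a rough initial guess to help convergence.
(L, k, t0), _ = curve_fit(sigmoid, days, cases,
                          p0=[cases.max() * 2, 0.3, 15], maxfev=10_000)

# Extrapolate a few days ahead with the fitted curve.
print("projected total at day 39:", int(sigmoid(39, L, k, t0)))
```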


  6. Cosmopolish Update

    Project $550 Million Dollar Thermometer: …


  7. Cosmopolish

    Project KillBlue: …


  8. CV

    Here it is. …


  9. Airplane Project with R; Some info about my build

    Airline Project …


  10. Bioinformatics pipeline for Linux

    A to Z. …