The purpose of this analysis is to make scientific information easier to access. Using information extracted from thousands of articles, specialists in epidemiology can locate references and relationships on topics important to their work. The analysis presented below is the result of applying the LDA (Latent Dirichlet Allocation) clustering method, an unsupervised learning algorithm.
The data, taken from kaggle.com, comprises more than 40 thousand articles about COVID-19. We used only those with complete texts, around 29 thousand. We treat each topic as a probability distribution over the words that occur in the articles, and each article as a combination of one or more topics.
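To make the idea of "an article as a combination of topics" concrete, here is a minimal sketch in plain Python. The topic names, the four-word vocabulary, and all probabilities are hypothetical toy values, not results from this analysis. It only illustrates how a document's word distribution follows from mixing topic-word distributions by the document's topic weights.

```python
# Hypothetical toy model: two topics over a four-word vocabulary.
# Each topic is a probability distribution over words (values sum to 1).
topic_word = {
    "transmission": {"virus": 0.4, "spread": 0.4, "vaccine": 0.1, "trial": 0.1},
    "treatment": {"virus": 0.2, "spread": 0.1, "vaccine": 0.4, "trial": 0.3},
}


def document_word_distribution(doc_topics, topic_word):
    """Mix the topic-word distributions using the document's topic weights."""
    mixed = {}
    for topic, weight in doc_topics.items():
        for word, prob in topic_word[topic].items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


# A hypothetical article that is 75% "transmission" and 25% "treatment".
doc_topics = {"transmission": 0.75, "treatment": 0.25}
word_probs = document_word_distribution(doc_topics, topic_word)

for word, prob in sorted(word_probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {prob:.3f}")
```

Fitting LDA runs this reasoning in reverse: it starts from the observed words and estimates both the topic-word distributions and each article's topic weights.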
The algorithm created 20 groups (topics) that characterize the articles. The first graph is a 2D dimensionality reduction that lets us visualize the documents; the color indicates the topic of each document. The second graph is interactive, and each bubble is a topic. Hovering the mouse over a bubble shows the most significant words in that topic. You can interact with the graph by selecting topics on the left and adjusting the relevance metric on the right.
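The relevance metric can be illustrated with a short sketch. It assumes the interactive graph uses the LDAvis-style definition, in which the relevance of a word to a topic is a weighted sum of the log of its probability within the topic and the log of its lift (that probability divided by the word's overall probability in the corpus). The words and probabilities below are hypothetical.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """LDAvis-style relevance of a word to a topic, for a weight lam in [0, 1]."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical values: word -> (p(word | topic), p(word) in the whole corpus).
words = {
    "virus": (0.20, 0.15),        # frequent in the topic, but frequent everywhere
    "ventilator": (0.05, 0.006),  # rarer, but concentrated in this topic
}


def rank(words, lam):
    """Order words from most to least relevant for a given weight."""
    return sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)


print(rank(words, 1.0))  # ranks purely by probability within the topic
print(rank(words, 0.0))  # ranks purely by lift
```

With the weight near 1, the ranking favors words that are frequent in the topic; with the weight near 0, it favors words that are distinctive to the topic even if they are rare.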
Document Similarity by Topic:
Each document has a degree of similarity to each topic. The figure shows the clusters of topics and the similarity between them.
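As a rough illustration, a document's degrees of similarity to the topics can be treated as a vector of weights. The sketch below uses three hypothetical topics rather than the 20 in this analysis, with invented weights. It picks the dominant topic of each document and compares documents by cosine similarity, which is one possible measure and not necessarily the one behind the figure.

```python
import math


def dominant_topic(weights):
    """Index of the topic with the largest weight for a document."""
    return max(range(len(weights)), key=lambda k: weights[k])


def cosine_similarity(a, b):
    """Cosine of the angle between two topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical topic weights for three documents (each row sums to 1).
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]
doc_c = [0.05, 0.15, 0.80]

print(dominant_topic(doc_a), dominant_topic(doc_c))
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))
```

Documents dominated by the same topic, such as `doc_a` and `doc_b` here, end up close to each other, which is what groups them together in a plot.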
Another way to observe the similarity between topics is an intertopic distance map. It shows where topics intersect, which happens because some topics treat the same words as important.
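The effect of shared words on topic distance can be sketched with the Jensen-Shannon divergence between topic-word distributions. This assumes an LDAvis-style map, where a divergence of this kind is computed between topics before they are projected to two dimensions. The three topics and their probabilities are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and zero for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions over the same four-word vocabulary.
topic_a = [0.50, 0.30, 0.10, 0.10]
topic_b = [0.40, 0.30, 0.20, 0.10]  # gives weight to the same words as topic_a
topic_c = [0.05, 0.05, 0.40, 0.50]  # gives weight to different words

print(js_divergence(topic_a, topic_b) < js_divergence(topic_a, topic_c))
```

Topics that put weight on the same words have a small divergence, so their bubbles sit close together or overlap on the map.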