Artificial intelligence allows computers to automatically simulate how human beings attribute characteristics, or concepts, to words. This lets us explore large amounts of data and perform analyses such as finding relationships between words, discovering analogies, and observing stereotypes over time.
The results of these analyses depend heavily on the source of information. Exploration makes it possible to observe information across thousands of texts with minimal intervention by the specialist. The analysis consists of representing words as vectors of multiple dimensions (50, 100 and 300) in order to capture the characteristics that define the concept of each word. This methodology is used in two types of applications:
Exploration of similarity, relations and stereotypes of words
AI model training for Natural Language Processing applications.
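To make the idea of exploring similarity and analogies concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the word list are invented for illustration only (the trained models use 50, 100 or 300 dimensions), and the helper names `cosine`, `most_similar` and `analogy` are hypothetical, not part of any library.

```python
import math

# Toy vectors, invented for illustration; real vectors are learned from text.
vectors = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
    "virus": [-0.9, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(target, exclude=(), topn=3):
    """Rank the vocabulary by cosine similarity to a target vector."""
    scored = [(word, cosine(target, vec))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' with the vector b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(target, exclude={a, b, c}, topn=1)[0][0]

print(most_similar(vectors["king"], exclude={"king"}, topn=1)[0][0])  # man
print(analogy("man", "king", "woman"))                                # queen
```

The same two operations, nearest neighbours and vector arithmetic, are what similarity and stereotype exploration rely on when applied to vectors learned from a real corpus.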
Trained Word2vec models, as well as the vector and metadata files, are available on the Embedding Projector platform to support the academic community. The models are provided as binary files for use with the gensim library.
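As a rough sketch of how such files might be used, the snippet below loads a binary model with gensim and writes the two tab-separated files that the Embedding Projector accepts (one vector per line, one label per line). It assumes the binaries are in the word2vec binary format and that gensim 4.x is installed; if the models were saved with gensim's own `save` method, `Word2Vec.load` would be needed instead. The file name and both helper names are hypothetical.

```python
def load_vectors(path):
    """Load word vectors, assuming the word2vec binary format."""
    # Imported here so the export helper below also works without gensim.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)

def write_projector_files(words, vectors,
                          vec_path="vectors.tsv", meta_path="metadata.tsv"):
    """Write one tab-separated vector per line and one word per line."""
    with open(vec_path, "w", encoding="utf-8") as vec_file, \
         open(meta_path, "w", encoding="utf-8") as meta_file:
        for word, vec in zip(words, vectors):
            vec_file.write("\t".join(str(x) for x in vec) + "\n")
            meta_file.write(f"{word}\n")

# Example usage (hypothetical file name, requires gensim 4.x):
# kv = load_vectors("covid_word2vec_100d.bin")
# print(kv.most_similar("coronavirus", topn=5))
# write_projector_files(kv.index_to_key, kv.vectors)
```

A single-column metadata file needs no header row, which keeps the export to two simple loops.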
Analysis of word representation in documents related to COVID-19, SARS-CoV-2
Let’s explore some word representation results and relate them to the results obtained!
Data collected: a set of documents from Kaggle and Elsevier. Over 50,000 scientific articles related to COVID-19, SARS-CoV-2, and other keywords associated with coronavirus were processed.
Methodology: After processing with machine learning techniques (Word2vec – BoW), each word is represented by a vector of multiple dimensions. For visualization, the vectors were reduced to a 2D space using the t-SNE technique. The results are shown in the figure below. Each point on the graph represents a word.
In this figure we can see different groups formed by similarity, and they can be explored. For example, the figure below shows several clusters from the image presented above.
The Embedding Projector, as defined, “… is a tool to provide the necessary measurements and visualizations during the machine learning workflow. It allows you to track metrics of experiments such as loss and accuracy, view the model graph, project embeddings into a lower dimensional space and much more”. We can continue analyzing other details of this data with this tool. Some interface features: