The collected newspaper headlines and tweets were analyzed with the text-mining framework provided by the tm package of R. Through the tm package, the collected headlines and tweets (the “corpus”) were transformed into a document-term matrix (DTM), i.e. a matrix in which each row corresponds to one document (a headline or a tweet) in the corpus and each column corresponds to one of the terms in the corpus. The construction of this matrix and its subsequent study through the functions implemented in the tm package formed the basis of the analyses discussed below.
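To make the structure of a document-term matrix concrete, here is a minimal Python sketch. It is not the R code used for the analysis: the toy texts, the whitespace tokenization, and the function name build_dtm are assumptions made only for illustration, and the cells hold raw term counts.

```python
from collections import Counter

# Toy corpus (hypothetical texts, not the collected headlines or tweets).
docs = [
    "chemical attack reported in syria",
    "russia denies chemical attack",
    "chemical truck stolen in belgium",
]

def build_dtm(documents):
    """Return (terms, matrix): one row per document, one column per term.

    Each cell holds the number of times the term occurs in the document.
    Tokenization is a plain whitespace split (an assumption for this sketch).
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Column order: all distinct terms in the corpus, sorted alphabetically.
    terms = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in terms])
    return terms, matrix

terms, dtm = build_dtm(docs)
for doc, row in zip(docs, dtm):
    print(row, "<-", doc)
```

Summing a column of this matrix gives the total frequency of a term across the corpus, which is the quantity the word cloud described below relies on.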
Word cloud. The most frequent words were identified by counting the number of occurrences of each word in the collected tweets and were subsequently plotted as a word cloud with the wordcloud package of R. Word clouds provide a visual representation of the most frequent words found in a text document by showing each word with a font size proportional to its frequency in the document. Words are usually also color-coded, with different colors associated with different frequencies.
How can I use this analysis? You can use this analysis to get an immediate visual impression of the words that dominated the newspaper stories or the Twitter conversation (see the example below).
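The counting and scaling behind a word cloud can be sketched in a few lines of Python. This is an illustration only, not the wordcloud package of R: the toy texts, the linear scaling rule, and the max_size parameter are assumptions.

```python
from collections import Counter

# Toy corpus (hypothetical texts, not the collected tweets).
docs = [
    "chemical attack reported in syria",
    "russia denies chemical attack",
    "chemical truck stolen in belgium",
]

def word_frequencies(documents):
    """Count the occurrences of each word across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts

def font_sizes(counts, max_size=48):
    """Scale font sizes linearly so the most frequent word gets max_size."""
    top = max(counts.values())
    return {word: max_size * n / top for word, n in counts.items()}

freqs = word_frequencies(docs)
sizes = font_sizes(freqs)
for word, n in freqs.most_common(3):
    print(word, n, sizes[word])
```

A real word cloud would typically also drop very common function words such as "in" before counting; that step is omitted here to keep the sketch short.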
Network of frequent words. The associations between pairs of frequent words were identified by calculating the percentage of collected newspaper headlines or tweets in which the two words appeared together. From the pairwise matrix of word associations, a network graph was subsequently derived and plotted with the igraph package of R. A strong association between two words means that they were often found together in the same headline or tweet, while a weak or absent association means the opposite. The graph shows a set of nodes, each representing a word, connected by edges that represent the associations between the words. The thickness of each edge is proportional to the strength of the association between the two words it connects. The most connected words are shown in green, while the least connected words are shown in red. A range of colors between these two extremes is used for words with an intermediate degree of connectivity.
How can I use this analysis? You can use this analysis to visually identify the multiple parallel events that gave rise to a peak of newspaper stories or tweeting activity (see the example below).
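The pairwise association measure described above can be sketched as follows. This Python snippet is an illustration, not the igraph-based code used for the plots: the toy texts, the word list, and the threshold parameter are assumptions. It computes the percentage of documents in which two words co-occur (the edge thickness) and the number of edges per word (the connectivity that would drive the node color).

```python
from itertools import combinations

# Toy corpus (hypothetical texts, not the collected headlines or tweets).
docs = [
    "chemical attack reported in syria",
    "russia denies chemical attack in syria",
    "chemical truck stolen in belgium",
    "belgium police search for stolen truck",
]
words = ["chemical", "syria", "attack", "belgium", "stolen"]

def cooccurrence_percentages(documents, vocabulary):
    """Percentage of documents in which each pair of words appears together."""
    token_sets = [set(doc.lower().split()) for doc in documents]
    result = {}
    for a, b in combinations(sorted(vocabulary), 2):
        together = sum(1 for toks in token_sets if a in toks and b in toks)
        result[(a, b)] = 100.0 * together / len(token_sets)
    return result

def degrees(pair_pct, threshold=0.0):
    """Number of edges per word, keeping only pairs above the threshold."""
    degree = {}
    for (a, b), pct in pair_pct.items():
        degree.setdefault(a, 0)
        degree.setdefault(b, 0)
        if pct > threshold:
            degree[a] += 1
            degree[b] += 1
    return degree

pct = cooccurrence_percentages(docs, words)
deg = degrees(pct)
for pair, value in sorted(pct.items()):
    if value > 0:
        print(pair, value)
print(deg)
```

In this toy corpus the words about Syria and the words about Belgium form two groups that are linked only through "chemical", which is the kind of cluster structure the network graph makes visible.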
Topic modeling. The collected newspaper headlines and tweets were automatically categorized into different topics through the Latent Dirichlet Allocation (LDA) method, as implemented in the “LDA” function of the topicmodels package of R. The topics were subsequently plotted as a bar graph, with the height of each bar proportional to the percentage of headlines or tweets fitting that topic. The most frequent words associated with each of the topics are also shown.
How can I use this analysis? You can use this analysis as another means to identify the multiple parallel events that gave rise to a peak of newspaper stories or tweeting activity and to find the most frequent words associated with each of them (see the example below).
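The bar heights can be illustrated with a short Python sketch. It does not fit an LDA model: the per-document topic probabilities below are made-up numbers standing in for the output of a fitted model, and treating a document as "fitting" its most probable topic is an assumption about how the percentages were derived.

```python
from collections import Counter

# Hypothetical per-document topic probabilities (one row per document,
# one column per topic), standing in for the output of a fitted LDA model.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
]

def topic_shares(doc_topic_probs):
    """Percentage of documents whose most probable topic is each topic."""
    n_topics = len(doc_topic_probs[0])
    # Assign every document to the topic with the highest probability.
    assigned = Counter(row.index(max(row)) for row in doc_topic_probs)
    total = len(doc_topic_probs)
    return [100.0 * assigned[k] / total for k in range(n_topics)]

shares = topic_shares(doc_topic)
for k, share in enumerate(shares, start=1):
    print(f"Topic {k}: {share:.0f}% of documents")
```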
Example. As an example, let’s take a look at the text analysis graphics for the tweets on chemical weapons collected on January 23, 2018. The word cloud provides a synoptic view of the most frequent words in the tweets. Besides the words chemical and weapons, the most prominent words appear to be Syria, Russia, Tillerson, and attack. Many other words can be seen in the word cloud. The network of frequent words clearly shows that three distinct conversations took place in parallel that day. Let’s now compare the three clusters visible in the network of frequent words with the topics automatically identified by the LDA algorithm: the first and third topics seem to refer to the first cluster, the fourth topic seems to refer to the third cluster, and the second topic seems to refer to both the first and second clusters. The most prominent words in each of the three clusters, as well as the words associated with the identified topics, can be used in Twitter and Google searches restricted to January 23, 2018. As shown below, the results shed light on the three stories that dominated the Twitter discourse on that day.
January 23-24, 2018
- An alleged attack with chlorine gas in Eastern Ghouta, Syria, is reported. Keywords: chemical, Syria, Russia, attacks, new. Searches: Twitter, Google.
- The International Partnership Against Impunity for the Use of Chemical Weapons is launched on January 23 in Paris. Keywords: chemical, weapons, use. Searches: Twitter, Google.
- A truck carrying over 30,000 liters of a chemical that can be used to make explosives is stolen in Belgium. Keywords: chemical, Belgium, stolen, bomb. Searches: Twitter, Google.