Analysing texts 2

We have seen in the previous session that words are linked together. But how are these words linked together? Wouldn’t it be nice if we could compare not just words but whole topics across documents in a collection? This is what topic modelling, an advanced technique, does. Topic modelling is popular in social and cultural analytics and summarises a collection of texts into a predefined number of topics. For an introduction, have a look at http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/.

Topic modelling is also popular because it requires only minimal text organisation: computers can learn topics by themselves. There are, however, known limitations when it comes to interpreting topic models. There is no guarantee that the automatically derived topics will correspond to what people would consider to be interesting topics or themes. They may be too specific or too general, identical to other topics, or framings of larger topics rather than genuinely distinct topics. Finally (and in common with other computational analysis techniques), the performance of topic modelling depends upon the quality and structure of the data. In our case the main issue will be that we have only a small corpus, which is not a lot of data. But as topic modelling is computationally quite expensive, we should not overdo things here and just concentrate on this small corpus. Please, load the topic modelling library with library(topicmodels).

library(topicmodels)

For topic modelling, we have to define the total number of topics in advance. Please, enter n<-10 to set 10 topics.

n<-10

The actual topic modelling function is called LDA (Latent Dirichlet Allocation). The method is quite complicated but well explained in the reference above. It tries to define our n topics based on the most important words in each of them. A more detailed explanation can be found at http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/. Now run LDA with topic_model <- LDA(dtm, n, alpha=0.1, method='Gibbs'). This is advanced text analysis and we can ignore the arguments again. It should take a few minutes.

topic_model <- LDA(dtm, n, alpha=0.1, method='Gibbs')
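To get a feeling for what the alpha argument controls, here is a minimal base R sketch. It is not the fitting procedure that LDA() runs. It only simulates the assumption behind LDA that every document mixes the topics in proportions drawn from a Dirichlet distribution governed by alpha. The helper draw_topic_mix and the number of simulated documents are made up for this illustration.

```r
set.seed(1)  # fixed seed so the simulation is repeatable

# Draw one document's topic proportions from a symmetric Dirichlet(alpha)
# by normalising independent gamma draws (a standard construction).
draw_topic_mix <- function(k, alpha) {
  g <- rgamma(k, shape = alpha, rate = 1)
  g / sum(g)
}

k <- 10        # number of topics, as in the lesson
n_docs <- 200  # hypothetical number of simulated documents

# One row per simulated document, one column per topic
mix_sparse <- t(replicate(n_docs, draw_topic_mix(k, alpha = 0.1)))
mix_flat   <- t(replicate(n_docs, draw_topic_mix(k, alpha = 10)))

# Average share of the single largest topic per document
mean(apply(mix_sparse, 1, max))
mean(apply(mix_flat, 1, max))
```

With a small alpha such as 0.1, most simulated documents are dominated by one or two topics, while a large alpha spreads each document almost evenly over all topics.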

We are now interested in the words/terms per topic and get those with terms(topic_model, 10). The parameter 10 indicates that we display 10 terms per topic.

terms(topic_model, 10)

##       Topic 1         Topic 2      Topic 3    Topic 4    Topic 5
##  [1,] "concentration" "jews"       "children" "vienna"   "jewish"
##  [2,] "now"           "jewish"     "years"    "dachau"   "november"
##  [3,] "buchenwald"    "german"     "wife"     "arrested" "synagogue"
##  [4,] "sachsenhausen" "germany"    "two"      "reporter" "also"
##  [5,] "camps"         "emigration" "old"      "jews"     "arrested"
##  [6,] "pol"           "will"       "taken"    "report"   "jews"
##  [7,] "death"         "can"        "died"     "herr"     "taken"
##  [8,] "june"          "gestapo"    "herr"     "jewish"   "fire"
##  [9,] "life"          "one"        "father"   "police"   "police"
## [10,] "camp"          "etc"        "november" "november" "berlin"
##       Topic 6  Topic 7    Topic 8   Topic 9     Topic 10
##  [1,] "men"    "will"     "one"     "camp"      "people"
##  [2,] "one"    "now"      "also"    "prisoners" "house"
##  [3,] "man"    "much"     "people"  "work"      "homes"
##  [4,] "oclock" "dear"     "time"    "one"       "destroyed"
##  [5,] "two"    "received" "however" "day"       "home"
##  [6,] "hours"  "holland"  "days"    "barracks"  "street"
##  [7,] "taken"  "letter"   "first"   "people"    "everything"
##  [8,] "people" "can"      "later"   "number"    "women"
##  [9,] "jews"   "know"     "course"  "prisoner"  "smashed"
## [10,] "three"  "tante"    "end"     "appell"    "men"
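A table like the one above is produced by ranking the terms within each topic by their weight. The following sketch shows the idea on a tiny, invented weight matrix. The numbers and the helper top_terms are assumptions for illustration and are not taken from the fitted model.

```r
# Hypothetical topic-term weights: rows are topics, columns are terms
weights <- matrix(
  c(0.40, 0.30, 0.20, 0.05, 0.05,
    0.05, 0.10, 0.15, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Topic 1", "Topic 2"),
                  c("camp", "prisoners", "work", "letter", "dear"))
)

# For every topic (row), sort the terms by weight and keep the n heaviest
top_terms <- function(w, n) {
  apply(w, 1, function(row) names(sort(row, decreasing = TRUE))[seq_len(n)])
}

top_terms(weights, 3)
```

The result has the same layout as the output of terms(): one column per topic and one row per rank.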

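Anything interesting to see here? The next step would be to label the topics in order to interpret them. Please, take a moment to do just that. We can also see what the topics per document are with topics(topic_model, 10). 10 is the maximum number of topics returned per document. In the output, each column is a document and each row a rank, so row [1,] holds the most likely topic for each document.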

topics(topic_model, 10)

##        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
##  [1,]  5  3  3  5 10 10  3  6  3 10 10  3  3  2  3 10  3  3 10  2  3  3  6
##  [2,]  2  5 10 10  4  8  6  9  5  3  6 10 10  7  5  5 10  1  3  3  8  6  8
##  [3,]  8 10  6  3  2  5  5  8  4  5  8  5  8  3 10  1  2 10  5  1 10  5  2
##       24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
##  [1,]  4  6  8  5  4  2  6 10 10  2  5  1  5  7  3  5  5  5  5  5  7  1  4
##  [2,]  5  7  5 10  3  7 10  6  3  3  2  4  3  5  1  2  2  9 10 10  2  4 10
##  [3,]  8  8  2  3 10  8  5  5  5  8  7  5  6  2  2  4  6  6  2  3  8  5  2
##       47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
##  [1,] 10 10  3  6  6 10  6  9  6  2  3 10  6  6  8  5  6  5 10  5 10  6  2
##  [2,]  5  5  5  8 10  5  9  2  3  5 10  3  9  9  5  2  8  2  5 10  3  3  8
##  [3,]  9  4  2  9  3  2  8  7  9  4  5  6  8  8  2  7  9  1  2  1  6  5 10
##       70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
##  [1,]  9  5  8  9 10 10 10  4  7 10 10  8  7  2  3  3 10  3  3  3 10  8 10
##  [2,]  6  4  5  6  6  6  6  5  8  8  8  7  3  7 10  2  2  7  5  5  6  3  5
##  [3,]  4  8 10  5  5  5  3  3  1  6  6  2  6 10  7  7  5  8 10  7  8  2  8
##       93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
##  [1,]  6  4  4  4  2  4  4   3   5  10   5  10  10   7   2   2   3  10   6
##  [2,]  8  5 10  2  5  2  5   6  10   5  10   6   5   8   4  10   4   3   8
##  [3,]  4 10  5 10  4 10  2   8   2   8   2   1   6   6   8   7   6   7   2
##       112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
##  [1,]   4   6   6   6   6   4   5   4   2   7   9   4  10  10  10  10   6
##  [2,]   8   4   4   9   9   5  10   5   5   3   6   5   5   8   6   5   4
##  [3,]   9   8   3   8   8   8   9   2   4   8   2   8   8   2   8   6   8
##       129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
##  [1,]   3   3  10   6   3   6   3   4   1   3   5   4   6   6   3   4   4
##  [2,]   4   4   5   9   5   2   4   3   7   7   2   6   9   9   7   3   6
##  [3,]   8   8   8   8   4   8   6   6   5   1   1   8   8   2   8   5   9
##       146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
##  [1,]   7   6  10   9   2   6   2   3   4   4  10   9   6   9   6   5   2
##  [2,]   2   3   6   6   5   4   8   1   8   6   5   6   9   6   2   4   8
##  [3,]   3   5   8   8   4  10  10   4  10  10   3   8   8   1   8   3   4
##       163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
##  [1,]   6  10   2   9   2   6   4   2   5   6   2   8   9   9   5   3  10
##  [2,]   8   2   9   6   9   8   2   8   4   9   8   6   4   6   3   1   5
##  [3,]   3   8   6   8   1   9   8  10   3   8   9   9   8   4  10   2   8
##       180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196
##  [1,]   6   2   6   6   4  10   3   4   4   3   5   3   3  10  10   6   7
##  [2,]   9   6   9   4   2   6   5   3   1  10  10   5   8   8   3   3   8
##  [3,]   8   8   8   8   5   8   4   1   7   6   2   8   5   6   2  10   9
##       197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213
##  [1,]   7   3   3   7   3   7   3   3   3   3   3   2   3   3   7   3   7
##  [2,]   2   8   7   2   8   3   5   7   8   2   7   7   8   5   3   7   3
##  [3,]   3   7   5   8   4   4   2   8   1   4   4   3   2   7   8   8   2
##       214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
##  [1,]   4   3   3   3   3   3   3   3   3   3   3   5   9   6   2   1   3
##  [2,]   2   8   7   7   5   8   5   7   8  10   7   6   6   9   8   9   1
##  [3,]   3   5   4   1  10   6  10   8   5   8   8   1   4   8  10   3   5
##       231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247
##  [1,]   6   9   6   8   6   2   7   3   2   3   6   6   9   5   5   6   9
##  [2,]   8   6   9   7   9   4   8   1   8   8   7   8   8   6  10   8   1
##  [3,]   9   8   8  10   8   8   3   2   7   1   3   3   6  10   8  10   6
##       248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264
##  [1,]   2   2   2   5   5   2   2   2   2   2   9  10   6   8   5   5   6
##  [2,]   8   8   1   2  10   4   1   6   5   4   8   5  10   6  10  10   8
##  [3,]  10   5   5   8   1   9   5   8   4   1   6   6   3   7   1   2   3
##       265
##  [1,]   5
##  [2,]   3
##  [3,]  10
##  [ reached getOption("max.print") -- omitted 7 rows ]
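A long matrix like this is easier to read once it is summarised. As a small sketch, the code below counts how often each topic is the most likely one, using only the first eight documents from row [1,] of the output above. On the full model, table(topics(topic_model, 1)) should give the same kind of count for all documents, though that call is not part of the original lesson.

```r
# Most likely topic of documents 1 to 8, copied from row [1,] of the output
primary_topic <- c(5, 3, 3, 5, 10, 10, 3, 6)

# How many documents have each topic as their first choice?
topic_counts <- table(primary_topic)
topic_counts

# The topic that dominates most of these documents
names(which.max(topic_counts))
```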

Let’s move on to something different. Have you heard about the Google Ngram Viewer, which plots frequencies of words using a yearly count in sources printed between 1500 and 2008 in Google’s book corpora? It has created quite some excitement in the digital methods world. Check out http://firstmonday.org/ojs/index.php/fm/article/view/5567/5535. There are many examples and some strong believers. You can try Google’s Ngram Viewer at https://books.google.com/ngrams.

Would it not be nice to use R to download all this data and then do more advanced work with it? There is a package for that, with which we can avoid complicated API details. Load the package with library(ngramr).

library(ngramr)

Let’s compare how the use of the words hacker and programmer has developed over time. Run ng <- ngram(c('hacker', 'programmer'), year_start = 1950). It uses the function ngram to connect to Google and download the data since 1950.

ng  <- ngram(c('hacker', 'programmer'), year_start = 1950)

Check it out with head(ng).

head(ng)

## Phrases: hacker, programmer
## Case-sentitive: TRUE
## Corpuses: eng_2012
## Smoothing: 3
##
##   Year Phrase Frequency    Corpus
## 1 1950 hacker 9.493039e-09 eng_2012
## 2 1951 hacker 1.168549e-08 eng_2012
## 3 1952 hacker 1.078450e-08 eng_2012
## 4 1953 hacker 1.010847e-08 eng_2012
## 5 1954 hacker 9.678727e-09 eng_2012
## 6 1955 hacker 9.315688e-09 eng_2012
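Because the result has the columns Year, Phrase and Frequency, ordinary data frame operations work on it. The sketch below rebuilds the six printed rows by hand so that it runs offline. It assumes that Frequency is a share of all words in a year, which is why the rescaled column is called PerBillion (a name invented here).

```r
# The six rows printed above, rebuilt as a plain data frame
ng_head <- data.frame(
  Year = 1950:1955,
  Phrase = "hacker",
  Frequency = c(9.493039e-09, 1.168549e-08, 1.078450e-08,
                1.010847e-08, 9.678727e-09, 9.315688e-09)
)

# Rescale the tiny shares to occurrences per billion words for readability
ng_head$PerBillion <- ng_head$Frequency * 1e9

# Year in which the phrase was relatively most frequent
peak_year <- ng_head$Year[which.max(ng_head$Frequency)]
peak_year
```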

Using ggplot, we can produce a nice graph with ng. Try ggplot(ng, aes(x=Year, y=Frequency, colour=Phrase)) + geom_line(). After the next session, we will understand better how, but it should also be self-explanatory.

ggplot(ng, aes(x=Year, y=Frequency, colour=Phrase)) + geom_line()

ngramr also has its own plotting function, ggram, which uses the ggplot library. Let us compare the popularity of monarchies and democracies with ggram(c('democracy','monarchy'), year_start = 1500, year_end = 2000, corpus = 'eng_gb_2012', ignore_case = TRUE, geom = 'area', geom_options = list(position = 'stack')) + labs(y = '').

ggram(c('democracy','monarchy'), year_start = 1500, year_end = 2000, corpus = 'eng_gb_2012', ignore_case = TRUE, geom = 'area', geom_options = list(position = 'stack')) + labs(y = '')

There is a lot going on here but democracy seems to be winning. Can you guess what the parameters are doing? You can also refer to the very good website of the package. Let’s move on to another example. If you have done a little bit of research on the Holocaust, you might know that there are on-going debates on whether to call it Shoah or Holocaust (https://en.wikipedia.org/wiki/Names_of_the_Holocaust). Rewrite the last expression with shoah instead of democracy and holocaust instead of monarchy.

ggram(c('shoah','holocaust'), year_start = 1500, year_end = 2000, corpus = 'eng_gb_2012', ignore_case = TRUE, geom = 'area', geom_options = list(position = 'stack')) + labs(y = '')

So, despite the efforts to popularise the Hebrew word Shoah, Holocaust is still far more commonly used. Frequency has proven to be a good indicator of the development and importance of ideas. But it is not the only text analysis tool that we can use to derive ideas from documents. Another commonly used text mining tool is information extraction (https://en.wikipedia.org/wiki/Information_extraction). With information extraction we can retrieve ideas and concepts directly from texts. R has various internal tools for information extraction. There is a section on it in your reference Humanities Data in R. We would like to concentrate on the online information extraction service OpenCalais, which is a web service provided by Thomson Reuters. Try it under http://www.opencalais.com/opencalais-demo/. Just copy and paste any text into the online form. Experiencing information extraction is the best way of learning about it.

As we describe in https://www.ehri-project.eu/information-extraction-noisy-texts-historical-research, information extraction is a highly useful research and archival tool. Semantically enriched library and archive federations have recently become an important part of EHRI’s work, as research users often have more demands on semantics than is generally provided by archival metadata. For instance, in archival finding aids place names are often only mentioned in free-form narrative text and not specifically indicated in controlled access points for places. Researchers would like to search for these locations, and place name extraction from the descriptions might support this. But this kind of information extraction is also difficult for historical documents, as places and people change their names or appear in different languages in these documents. This is why EHRI decided to invest in a dedicated NLP/gazetteer API for extracting information, called ehri-locations.ontotext.com. It is based on a highly curated analysis of locations, people and organisations of the Holocaust.

The EHRI service is a web service, so we need to access it with the httr package. Run library(httr).

library(httr)

We would like to send an example text file from the testimonies, which is stored on our local machine, to the EHRI service and receive back the annotated text. I have downloaded an example file from the collection into the data folder of your SWIRL installation. In the httr package you define an input file to be sent/uploaded with input_file <- upload_file(path_to_data('0304.txt')).

input_file <- upload_file(path_to_data('0304.txt'))

In this example we concentrate on extracting places. The EHRI API listens at the URL that we store with EHRI_URL <- 'http://ehri-persons.ontotext.com/extractor/extract'. Define it, please.

EHRI_URL <- 'http://ehri-persons.ontotext.com/extractor/extract'

Now, post your request to EHRI but remember that you have to be online while doing it. Type r <- POST(url=EHRI_URL, authenticate("ehri","B7JhK5"), add_headers("Content-Type"="text/plain", "outputFormat"="application/json", "Accept"="application/vnd.ontotext.ces+json"), body = input_file). httr’s POST function takes the API’s URL, the user name and password via authenticate, the request headers via add_headers and the input_file as the body of the request.

r <- POST(url=EHRI_URL, authenticate("ehri","B7JhK5"), add_headers("Content-Type"="text/plain", "outputFormat"="application/json", "Accept"="application/vnd.ontotext.ces+json"), body = input_file)

We retrieve the content of the response with httr’s content function. Enter result <- httr::content(r, "text").

result <- httr::content(r, "text")

## No encoding supplied: defaulting to UTF-8.

It is a little bit complicated to get the results out of the response text that is returned by content. That’s why I have created a support function for you that should make that easy. With extract_places_EHRI(result) you should see the places that are mentioned in the document you submitted. Try it to get back locations from the document.

extract_places_EHRI(result)

##                                        name         lon       lat
## 1                                    Munich   11.566667  48.13333
## 2                             Silver Spring  -77.019004  39.00424
## 3                                        RG  -85.185000  34.26000
## 4                                     USHMM  -77.033000  38.88700
## 5                                     USHMM  -77.033000  38.88700
## 6                                    France    2.000000  47.00000
## 7                                     USHMM  -77.033000  38.88700
## 8                                        RG  -85.185000  34.26000
## 9                                    Vienna   16.373064  48.20833
## 10                                       RG  -85.185000  34.26000
## 11                                   Vienna   16.373064  48.20833
## 12                                   Vienna   16.373064  48.20833
## 13                                   Vienna   16.373064  48.20833
## 14                                       RG  -85.185000  34.26000
## 15  United States Holocaust Memorial Museum  -77.033000  38.88700
## 16                                   Vienna   16.373064  48.20833
## 17                                   Vienna   16.373064  48.20833
## 18                                  America -100.000000  40.00000
## 19                                  Bavaria   11.431111  48.77750
## 20                         Hamburg, Germany   10.000000  53.55000
[...]
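Since many places appear several times, a frequency count is a quick way to see which locations the document revolves around. The sketch below types in the 20 names printed above so that it runs without the web service. The variable names are invented for this illustration.

```r
# Place names from the 20 rows printed above
place_names <- c("Munich", "Silver Spring", "RG", "USHMM", "USHMM", "France",
                 "USHMM", "RG", "Vienna", "RG", "Vienna", "Vienna", "Vienna",
                 "RG", "United States Holocaust Memorial Museum", "Vienna",
                 "Vienna", "America", "Bavaria", "Hamburg, Germany")

# Count the mentions and list the most frequent places first
place_counts <- sort(table(place_names), decreasing = TRUE)
head(place_counts, 3)
```

Vienna leads with six mentions. Entries such as RG and USHMM may come from archival reference lines rather than from the testimony itself, which is a reminder to check automatically extracted entities before interpreting them.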

Extracting entities can be useful in the automatic analysis of texts. We can use the entities to understand content better or even provide effective links between different texts. If we know that two documents are about the same place, for instance, it seems logical that there is a link between them based on the place. Because I have worked a lot with computational archives, where we want to link collections and documents, I have done a lot of work in information extraction, especially with heterogeneous historical material. I have also been involved with analysing testimonies and other oral histories. Here, a typical question is the kind of sentiment a memory expresses. Is it a positive or negative memory? If so, to what degree? These are the kinds of questions that automated sentiment analysis can answer, which we cover next.
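The idea of linking documents through a shared place can be sketched in a few lines of base R. The three documents and their place lists below are hypothetical and only stand in for real extraction results.

```r
# Hypothetical extraction results: one vector of places per document
doc_places <- list(
  doc_a = c("Vienna", "Dachau", "Munich"),
  doc_b = c("Vienna", "Buchenwald"),
  doc_c = c("Berlin", "Hamburg")
)

# Places that two documents have in common
shared_places <- function(x, y) intersect(doc_places[[x]], doc_places[[y]])

# Compare every pair of documents and count their shared places
doc_pairs <- combn(names(doc_places), 2)
links <- apply(doc_pairs, 2, function(p) length(shared_places(p[1], p[2])))
names(links) <- apply(doc_pairs, 2, paste, collapse = " - ")

# Keep only the pairs that are linked by at least one place
links[links > 0]
```

Here only doc_a and doc_b are linked, because both mention Vienna.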

Sentiment analysis aims to evaluate emotions and other highly subjective expressions in a text. Commonly this is applied to finding out whether tweets, movie reviews, etc. express a positive or negative sentiment. In fact, later in the module, when we discuss prediction models, we will look into an example based on movie reviews. Generally speaking, sentiment analysis is done using a so-called supervised approach, where a human annotates the documents of a collection with their sentiments and the computer tries to learn from the decisions of that human. We will use an unsupervised approach, where the computer learns by itself. Some people argue that sentiment analysis should never be done as unsupervised learning. Gary King, for instance, consistently warns us that this is one of the worst possible mistakes. Check out his website http://gking.harvard.edu/. He should be interesting for us.

Why do we use an unsupervised approach then? We choose this trade-off as training a computer programme requires extensive training collections and lots of manual work, which we cannot do here. Trade-offs like this are typical when working with digital methods. We will talk about them again in the prediction session of the module. Here, we use the dictionary-based sentiment analysis approach developed in https://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/. Check out their work! We slightly improved it using some of the linguistic statistics we introduced earlier. But the basic approach is still the same. It uses a lexicon of positive and negative words and then simply counts the number of times positive and negative words appear in a text. If there are more positive words than negative words, the text will get an overall positive sentiment, otherwise a negative one.
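The counting idea can be sketched as follows. This is not the sentiment_scores function prepared for this session. The tiny word lists, the example sentences and the scoring formula (positive minus negative hits, divided by all hits) are assumptions chosen to keep the illustration short.

```r
# Tiny hypothetical lexicons; real ones contain thousands of words
pos_words <- c("good", "happy", "safe", "kind")
neg_words <- c("arrested", "destroyed", "fear", "smashed")

toy_sentiment <- function(text) {
  # Lower-case the text and split it at anything that is not a letter
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[tokens != ""]
  n_pos <- sum(tokens %in% pos_words)
  n_neg <- sum(tokens %in% neg_words)
  # Texts without any lexicon word are treated as neutral
  if (n_pos + n_neg == 0) return(0)
  (n_pos - n_neg) / (n_pos + n_neg)
}

texts <- c("The neighbours were kind and we felt safe.",
           "Our home was destroyed and father was arrested.",
           "A kind man warned us, but the shop was smashed.")

toy_scores <- vapply(texts, toy_sentiment, numeric(1), USE.NAMES = FALSE)
toy_scores
```

The three sentences score 1, -1 and 0: the first contains only positive lexicon words, the second only negative ones, and in the third one positive and one negative word cancel each other out.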

I have prepared the corresponding sentiment function for you. Please, call it and assign the results to the documents with progrom_tes$sentiment_scores <- sentiment_scores(progrom_tes$Text). The input is the vector of all documents. With the call, we create a new column in progrom_tes with the scores. Ignore the warnings!

progrom_tes$sentiment_scores <- sentiment_scores(progrom_tes$Text)

## Warning in weighting(x): unreferenced term(s): a+ abound abounds abundance
## abundant accessable acclaim acclamation accolade accolades accommodative
## accomodative accomplishment accomplishments accurately achievable
## achievement achievible acumen adaptable adaptive adjustable admirably
## admiration admire admirer admiring admiringly adorable adore adored
## adorer adoring adoringly adroit adroitly adulate adulation adulatory
## advantageously advantages adventuresome adventurous advocate advocated
## advocates affability affable affably affectation affection affinity
## affirm affirmation affluence affordable affordably afordable agilely
## agility agreeableness agreeably all-around alluring alluringly altruistic
## altruistically amaze amazed amazes ambitious ambitiously ameliorate
## amenable amenity amiability amiabily amiable amicability amicable amicably
## amity amply amusingly angelic apotheosis appealing applaud appreciable
## appreciated appreciates appreciative appreciatively approve ardent ardently
## ardor articulate aspiration aspirations aspire assurance assurances
## assuredly assuring astonish astound astounded astoundingly astutely
## attentive attraction attractive attractively attune audible audibly
## auspicious autonomous aver avid avidly award awards awe awed awesome
## awesomely awesomeness awestruck awsome [...]

## Warning in weighting(x): unreferenced term(s): 2-faced 2-faces abnormal
## abolish abominable abominably abominate abomination abort aborted
## aborts abrade abrasive abrupt abscond absent-minded absentee absurdly
## absurdness abuses abysmal abysmally abyss accidental accost accursed
## accuse accuses accusing accusingly acerbate acerbic acerbically ache
## ached aches achey aching acrid acridly acridness acrimonious acrimoniously
## acrimony adamant adamantly addict addicted addicting addicts admonish
## admonisher admonishingly admonishment admonition adulterate adulterated
## adulteration adulterier adversarial adversary adverse adversity afflict
## affliction afflictive affront aggravate aggravating aggravation aggression
## aggressiveness aggressor aggrieve aggrieved aggrivation aghast agonize
## agonizing agonizingly aground ail aimless alarmed alarming alienate
## alienated alienation allegation allege allergic allergies allergy ambiguity
## ambivalence ambivalent ambush amiss anarchism anarchist anarchistic
## anarchy anemic angrily angriness animosity annihilate annoy annoyance
## annoyances annoying annoyingly annoys anomalous anomaly antagonism
## antagonist antagonistic antagonize anti- anti-american anti-israeli anti-
## occupation anti-proliferation anti-semites anti-social anti-us anti-
## white antipathy antiquated antithetical anxieties anxiousness apathetic
## apathetically apocalypse apocalyptic apologist apologists appal appall
## apprehension apprehensions apprehensive apprehensively arcane archaic
## arduous arduously argumentative arrogance arrogant arrogantly asinine
## asininely asinininity askance asperse aspersion aspersions assail assassin
## assassinate assault assult astray asunder atrophy audacious audaciously
## audaciousness audiciously austere authoritarian autocrat autocratic
## avalanche avarice avaricious avariciously avenge averse aversion aweful
## awfulness awkward awkwardness ax [...]

Let’s plot a histogram of the scores’ distribution with hist(progrom_tes$sentiment_scores, main = 'Sentiment Scores', border = 'black', col = 'skyblue', xlab = ''). What do you see? Why do you think a list of dictionary words might be dependent on language usage? Does the list of positive and negative words therefore change over time? What does this mean for our historical comparison?

hist(progrom_tes$sentiment_scores,main='Sentiment Scores', border='black', col='skyblue', xlab='')
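If you would rather read numbers than a picture, hist can also return the bin counts without drawing anything. The scores below are invented for illustration, and the fixed breaks are an assumption that fits scores between -1 and 1.

```r
# Hypothetical sentiment scores for seven documents
example_scores <- c(-0.8, -0.6, -0.35, -0.28, -0.1, 0.2, 0.4)

# Compute the histogram bins without plotting them
h <- hist(example_scores, breaks = seq(-1, 1, by = 0.5), plot = FALSE)
data.frame(from = head(h$breaks, -1), to = tail(h$breaks, -1), count = h$counts)

# Share of documents with a negative score
mean(example_scores < 0)
```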

These kinds of results would be better if we had a larger dataset and better language tools for a historical comparison. Online you can find a lot of sentiment analysis tools. Have you, for instance, heard about IBM’s Watson computer? In 2011, Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings. Watson won and received the first place prize of $1m. IBM has published some of the underlying technologies and you can use them using an API. Unfortunately, registration is complicated. So, I decided to just point you to the online test environment. Try out IBM’s sentiment analysis under https://alchemy-language-demo.mybluemix.net/. Why not copy and paste one of the SOTU speeches or testimonies and see what it does? You can use the URLs from http://stateoftheunion.onetwothree.net/texts/ or http://wienerlibrarycollections.co.uk/novemberpogrom/testimonies-and-reports/overview. Browse down and check out the details. Targeted sentiments, for instance, are very interesting.

Let’s check what we have learned now.

Do you know what entity extraction is all about?

Extracting entities such as raisins from cakes
Extracting things in texts such as names, places, etc.
Don’t know and don’t care
Finding relevant documents

Extracting things in texts such as names, places, etc.

Please, retrieve the two concepts of ‘war’ and ‘peace’ from ngram with a year_start=1900 and assign it to ng.

ng  <- ngram(c('war', 'peace'), year_start = 1900)

Run sentiment_scores(subset(sotu_all, (year == 2008) | (year == 2009))$speechtext) and explain the results. What does the expression do?

sentiment_scores(subset(sotu_all, (year == 2008) | (year == 2009))$speechtext)

## Warning in weighting(x): unreferenced term(s): a+ abound abounds abundance
## abundant accessable acclaim acclaimed acclamation accolade accolades
## accommodative accomodative accomplish accomplished accomplishment accurate
## accurately achievable achievements achievible acumen adaptable adaptive
## adequate adjustable admirable admirably admiration admire admirer admiring
## admiringly adorable adore adored adorer adoring adoringly adroit adroitly
## adulate adulation adulatory advantageous advantageously advantages
## adventuresome adventurous advocate advocated advocates affability affable
## affably affectation affection affectionate affinity affirm affirmation
## affirmative affluence affluent affordably afordable agile agilely
## agility agreeable agreeableness agreeably all-around alluring alluringly
## altruistic altruistically amaze amazed amazement amazes amazing amazingly
## ambitious ambitiously ameliorate amenable amenity amiability amiabily
## amiable amicability amicable amicably amity amply amuse amusing amusingly
## angel angelic apotheosis appeal appealing applaud appreciable appreciate
## appreciated appreciates appreciative appreciatively appropriate approval
## ardent ardently ardor articulate aspiration aspire assurance assurances
## assuredly astonish astonished astonishing astonishingly astonishment
## astound astounded astounding astoundingly astutely attentive attraction
## attractive attractively attune audible audibly auspicious authentic
## authoritative autonomous aver avid avidly award awarded awards awe awed
## awesome awesomely awesomeness awestruck awsome [...]

## Warning in weighting(x): unreferenced term(s): 2-faced 2-faces abnormal
## abolish abominable abominably abominate abomination abort aborted aborts
## abrade abrasive abrupt abruptly abscond absent-minded absentee absurd
## absurdity absurdly absurdness abused abuses abusive abysmal abysmally
## abyss accidental accost accursed accusation accusations accuse accuses
## accusing accusingly acerbate acerbic acerbically ache ached aches
## achey aching acrid acridly acridness acrimonious acrimoniously acrimony
## adamant adamantly addict addicted addicting addicts admonish admonisher
## admonishingly admonishment admonition adulterate adulterated adulteration
## adulterier adversarial adversary adverse adversity afflict affliction
## afflictive affront afraid aggravate aggravating aggravation aggression
## aggressiveness aggressor aggrieve aggrieved aggrivation aghast agonies
## agonize agonizing agonizingly agony aground ail ailing ailment aimless
## alarm alarmed alarmingly alienate alienated alienation allegation
## allegations allege allergic allergies allergy aloof altercation ambiguity
## ambiguous ambivalence ambivalent ambush amiss amputate anarchism anarchist
## anarchistic anarchy anemic angrily angriness angry anguish animosity
## annihilate annihilation annoy annoyance annoyances annoyed annoying
## annoyingly annoys anomalous anomaly antagonism antagonist antagonistic
## antagonize anti- anti-american anti-israeli anti-occupation anti-
## proliferation anti-semites anti-social anti-us anti-white antipathy
## antiquated antithetical anxieties anxiety anxious anxiously anxiousness
## apathetic apathetically apathy apocalypse apocalyptic apologist apologists
## appal appall appalled appalling appallingly apprehension apprehensions
## apprehensive apprehensively arbitrary arcane archaic arduous arduously
## argumentative arrogance arrogant arrogantly ashamed asinine asininely
## asinininity askance asperse aspersion aspersions assail assassin
## assassinate assault assult astray asunder atrocious atrocities atrocity
## atrophy audacious audaciously audaciousness audacity audiciously austere
## authoritarian autocrat autocratic avalanche avarice avaricious avariciously
## avenge averse aversion aweful awful awfully awfulness awkward awkwardness
## ax [...]


##          1          2
## -0.2810986 -0.3543721
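To see what the subset part of the exercise expression does, it helps to try it on a small stand-in. The data frame toy_sotu below is hypothetical and only borrows the column names year and speechtext from sotu_all.

```r
# Hypothetical stand-in for sotu_all with the same column names
toy_sotu <- data.frame(
  year = c(2007, 2008, 2009, 2010),
  speechtext = c("speech a", "speech b", "speech c", "speech d"),
  stringsAsFactors = FALSE
)

# Keep the rows from 2008 or 2009, then pull out the text column
picked <- subset(toy_sotu, (year == 2008) | (year == 2009))$speechtext
picked

# The same selection written with %in%
identical(picked, subset(toy_sotu, year %in% c(2008, 2009))$speechtext)
```

The expression therefore hands exactly two speeches to sentiment_scores, which is why two scores come back.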

Overall, our results would be better if we had a larger dataset and better language tools for a historical comparison. In your group, try out IBM’s sentiment analysis demo mentioned above with one of the SOTU speeches from http://stateoftheunion.onetwothree.net/texts/ and check out the details, such as the targeted sentiments. Or, explore personality insights at https://personality-insights-livedemo.mybluemix.net/.

That’s it for today. Again, today can be only the beginning. Text analysis is a big field and still evolving fast. Next time, we concentrate on improving our visualisation skills.