## [1] "Loading session data. If this is the first time you are working on this session, this might take a while. Please, do not disconnect from the Internet."
Today we work on analysing texts. We have already talked about texts a little bit when we were looking at how to analyse the content of tweets. In this session, we will finally find out how to do this ourselves. Text analysis is an advanced field in the world of computational analytics, and we can rely on a very long tradition of text analysis and well-established methods. Besides, it is fun and arguably the dominant form of social and cultural analytics – simply because we work so much with texts, documents, etc.
If you are interested in examples of how R can be used to work on contemporary questions of society, I recommend checking out the work of Ken Benoit http://www.kenbenoit.net/ from the LSE.
We will go back further in history and mainly work with the US State of the Union addresses. This allows us, first of all, to understand a lot about the past state of the political constitution of the USA. The State of the Union (SOTU) data is taken from http://stateoftheunion.onetwothree.net/index.shtml, which provides access to the corpus of all the State of the Union addresses from 1790 to 2015 at the time of writing. SOTU allows you to explore how specific words gain and lose prominence over time, and to link to information on the historical context of their use.
We did all the hard work of loading this data for you. It should be available to you as the data frame sotu_all. Try printing out its first 2 entries by using head with 2 as the second parameter.
head(sotu_all, 2)
## speechtext
## 17900108.html Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. In resuming your consultations for the general good you can not but derive encouragement from the reflection that the measures of the last session have been as satisfactory to your constituents as the novelty and difficulty of the work allowed you to hope. Still further to realize their expectations and to secure the blessings which a gracious Providence has placed within our reach will in the course of the present important session call for the cool and deliberate exertion of your patriotism, firmness, and wisdom. Among the many interesting objects which will engage your attention that of providing for the common defense will merit particular regard. To be prepared for war is one of the most effectual means of preserving peace [...].
## 17901208.html Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in being able to repeat my congratulations on the favorable prospects which continue to distinguish our public affairs. The abundant fruits of another year have blessed our country with plenty and with the means of a flourishing commerce. The progress of public credit is witnessed by a considerable rise of American stock abroad as well as at home, and the revenues allotted for this and other national purposes have been productive beyond the calculations by which they were regulated. This latter circumstance is the more pleasing, as it is not only a proof of the fertility of our resources, but as it assures us of a further increase of the national respectability and credit, and, let me add, as it bears an honorable testimony to the patriotism and integrity of the mercantile and marine part of our citizens. The punctuality of the former in discharging their engagements has been exemplary [...]. GO. WASHINGTON
## year date
## 17900108.html 1790 17900108
## 17901208.html 1790 17901208
You should see a lot of printed text and thus might have missed the structure of the data frame. Check it with str(sotu_all) and you will see three columns. The first one contains the full text, the second one the year and the third the exact date in the format YYYYMMDD.
str(sotu_all)
## 'data.frame': 230 obs. of 3 variables:
## $ speechtext: chr " Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity "| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in b"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In vain may we expect peace with the Indians on ou"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: It is some abatement of the satisfaction with whic"| __truncated__ ...
## $ year : chr "1790" "1790" "1791" "1792" ...
## $ date : chr "17900108" "17901208" "17911025" "17921106" ...
str should also show you a problem that is typical for imported data. All columns are of type character, while clearly the year column is numeric and the date column describes an actual date. This can lead to issues later on when we analyse the data. Let's fix that. First, we tell R that year is a column with numeric data by running sotu_all$year <- as.numeric(sotu_all$year).
sotu_all$year <- as.numeric(sotu_all$year)
sotu_all's date is clearly a date string in the format YYYYMMDD. R has the powerful as.Date function that can parse such character strings and transform them into R's date objects. The advantage is that we can then compare dates with each other, compute time differences, and so on. http://www.statmethods.net/input/dates.html explains how this works. We basically need to tell R where to find the year in the string with %Y for a 4-digit year, where to find the month with %m and the day with %d. Run sotu_all$date <- as.Date(sotu_all$date, '%Y%m%d').
sotu_all$date <- as.Date(sotu_all$date, '%Y%m%d')
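As a quick sanity check of the format string, here is a minimal sketch; the literal date string is just an example taken from the first speech above:
as.Date('17900108', '%Y%m%d')
## [1] "1790-01-08"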
Fighting with different formats of date and time is a key part of analysing culture and society. For some past events, for instance, we do not know exact dates or times and have to work with estimates. Sometimes, we only know the approximate time span of when, say, a Roman consul lived. This can cause a lot of issues. lubridate is an excellent R package to help with date and time calculations (https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html). We do not have time here to go through the details, but I recommend checking it out next time you struggle with dates and times. Running str(sotu_all) will confirm that all the columns are now of the correct type.
str(sotu_all)
## 'data.frame': 230 obs. of 3 variables:
## $ speechtext: chr " Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity "| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in b"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In vain may we expect peace with the Indians on ou"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: It is some abatement of the satisfaction with whic"| __truncated__ ...
## $ year : num 1790 1790 1791 1792 1793 ...
## $ date : Date, format: "1790-01-08" "1790-12-08" ...
With sotu_all, we have a collection of documents reflecting over 200 years in the history of the US in the words of each president. It is time to do something with it and text mine it. As said, text mining is a very established discipline. Therefore it is not surprising that there is a very strong text mining package in R. It is called tm and you will find it in many R introductions (see also https://cran.r-project.org/web/packages/tm/index.html). Please, load tm with library(tm).
library(tm)
Text mining is the process of analysing natural language text, often sourced from online content such as emails and social media (Twitter or Facebook). In this chapter, we are going to cover some of the most commonly used methods of the tm package and employ them to analyse the historical content of the State of the Union speeches. To this end, we first need to create a so-called corpus. A corpus is a term originating from the linguistics community. All you need to know about a corpus is that it is basically a collection of text documents. In R, you can create a corpus from many sources (online or local) and from many different formats such as plain text or PDF. Try getSources() to see which sources you can use.
getSources()
## [1] "DataframeSource" "DirSource" "URISource" "VectorSource"
## [5] "XMLSource" "ZipSource"
Now try to get the formats you can directly import into a tm corpus with getReaders().
getReaders()
## [1] "readDataframe" "readDOC"
## [3] "readPDF" "readPlain"
## [5] "readRCV1" "readRCV1asPlain"
## [7] "readReut21578XML" "readReut21578XMLasPlain"
## [9] "readTagged" "readXML"
There are many different formats! Luckily, we can ignore most of these formats here, as we have already loaded our data in a nice plain text format. If you are interested, please check online for the many ways the tm package is used. All we need for now is the vector source.
The first task we set ourselves is to understand historical policy differences. The last SOTU address of the Republican George W. Bush was in 2008, while the first one of the Democrat Barack Obama was in 2009. These were also the years of the height of a very severe financial crisis. So, we expect interesting changes in content. To identify these, we would like to produce a word cloud to compare the most frequent terms in both speeches in order to understand policy differences. Let us first create a smaller sample of the data containing only the speeches of 2008 and 2009. Do you remember how to do it with command subset? Take a moment to reflect on your subsetting skills and then type sotu_2008_2009 <- subset(sotu_all, (year == 2008) | (year == 2009)) to create a new sotu_2008_2009 data frame containing only the speeches of 2008 and 2009.
sotu_2008_2009 <- subset(sotu_all, (year == 2008) | (year == 2009))
Now we need to create a corpus. We use the vector source method. Each column in a data frame is of course a vector – as you might remember. We are only interested in the full texts and thus only need the column speechtext. Please, create the corpus with corp_small <- Corpus(VectorSource(sotu_2008_2009$speechtext)).
corp_small <- Corpus(VectorSource(sotu_2008_2009$speechtext))
We will investigate later the details of the corpus. For now we would like to concentrate on the well-defined steps to conduct a text analysis. It is quite simple really because of an insight by a linguist called Zipf (https://en.wikipedia.org/wiki/Zipf%27s_law). He stated that the more frequently a word appears in a document the more important it is for its content. So, if we talk about beautiful gardens, we will use the word garden often but also flowers, grass, etc. All we need to do then is to count all the words in a document and derive the word frequency to get an overview of what the document is about. We do this for all documents in a collection like SOTU and get a good comparison.
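To make this concrete, here is a minimal sketch in base R; the sentence is a made-up example, and we split on white spaces to get the words (more on that in a moment):
# split a toy sentence on white spaces and count how often each word occurs
words <- unlist(strsplit(tolower('the garden has flowers and the garden has grass'), ' '))
sort(table(words), decreasing = TRUE)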
But before we count words, we need to tell the computer what a word is. A computer does not know such trivial things. In order to keep it simple, we say that a word is anything that is separated by white spaces. This means we first need to clean the texts so that these white spaces become consistent, and the computer can easily recognise the words. Let’s identify words then by first removing all punctuation with corp_small <- tm_map(corp_small, removePunctuation).
corp_small <- tm_map(corp_small, removePunctuation)
The tm_map function provides a convenient way of running the clean-up transformations in order to eliminate all details that are not necessary for the analysis using the tm package. To see the list of available transformation methods, simply call the getTransformations() function. Remember our final target is to get to a list of important words per document.
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
Next, we standardise all words by lower-casing them with corp_small <- tm_map(corp_small, content_transformer(tolower)). This makes it easier to count them. Why?
corp_small <- tm_map(corp_small, content_transformer(tolower))
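If the answer is not obvious, here is a minimal made-up example of why mixed case would otherwise split our counts:
# without lower-casing, 'America' and 'america' count as two different terms
table(c('America', 'america', 'america'))
# after lower-casing they collapse into one term
table(tolower(c('America', 'america', 'america')))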
Let us also remove all numbers as they do not add to the content of the speeches. Type corp_small <- tm_map(corp_small, removeNumbers).
corp_small <- tm_map(corp_small, removeNumbers)
Let us also strip out any extra white space that we do not need with corp_small <- tm_map(corp_small, stripWhitespace). Such extra whitespace can easily confuse our word recognition. Extra whitespaces can actually be a real issue in historical documents, as they have often been OCR’ed with imperfect recognition. Look it up! There are many cases described on the web. The Bad Data Handbook - Cleaning Up The Data So You Can Get Back To Work by McCallum and published by O’Reilly is an excellent summary of this other side of big data. The bigger the data the more likely it is also somehow ‘bad’!
corp_small <- tm_map(corp_small, stripWhitespace)
When I said earlier that the most frequently used words carry the meaning of a document, I was not entirely telling the truth. There are so-called stopwords such as the, a, or, etc., which usually carry less meaning than the other expressions in the corpus. You will see what kind of words I mean by typing stopwords('english').
stopwords('english')
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
Do you agree that these words do not really carry meaning in English? Stopwords should be removed to concentrate on the most important words. Let us do that with the function removeWords and corp_small <- tm_map(corp_small, removeWords, stopwords('english')).
corp_small <- tm_map(corp_small, removeWords, stopwords('english'))
That is it for the time being in terms of document transformations. We are now confident that we can start counting words to extract meaning. We would like to create a so-called term-document matrix (tdm). You remember that a matrix is a 2-dimensional structure in R? The tdm simply contains the words of the corpus (also called terms) in the rows of the matrix and the (2) documents/speeches in the columns. Create it with tdm <- TermDocumentMatrix(corp_small).
tdm <- TermDocumentMatrix(corp_small)
Have a look at what you have done with tm’s inspect(tdm[1:5, 1:2]). Do you see the first 5 terms in the 2 documents? The numbers represent their frequencies.
inspect(tdm[1:5, 1:2])
## <<TermDocumentMatrix (terms: 5, documents: 2)>>
## Non-/sparse entries: 7/3
## Sparsity : 30%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1 2
## abandon 1 0
## ability 4 3
## abroad 2 1
## acceptable 1 0
## accepts 1 0
Let us next prepare a simple word cloud. Load the package with library(wordcloud).
library(wordcloud)
Next, tell the wordcloud package that you have a nice term document matrix ready for it with tdm_matrix <- as.matrix(tdm).
tdm_matrix <- as.matrix(tdm)
As always, we like things to be tidy and tell R that the matrix contains 2 speeches from 2008 and 2009. Name the columns with colnames(tdm_matrix) <- c('SOTU 2008', 'SOTU 2009'). colnames, by the way, is the column counterpart of the names function we have already met earlier. It does what its name says it does.
colnames(tdm_matrix) <- c('SOTU 2008', 'SOTU 2009')
Now create a word cloud of the terms the two speeches share by typing commonality.cloud(tdm_matrix, random.order=FALSE, scale=c(5, .5), colors=brewer.pal(4, 'Dark2'), max.words=400). Please ignore the warnings; they are mainly about certain terms that cannot be fitted into the various regions of the plot, and for learning purposes that does not matter.
commonality.cloud(tdm_matrix,random.order=FALSE, scale=c(5, .5),colors = brewer.pal(4, 'Dark2'), max.words=400)
According to the package description (https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf), commonality.cloud plots a cloud of words shared across documents. It takes a number of parameters that we can ignore here. We can clearly see that America is important in both 2008 and 2009, as well as the people, of course. More interesting is probably the comparison of the 2008 speech by Bush with the 2009 speech by Obama. wordcloud contains another function called comparison.cloud to compare documents. Run it with comparison.cloud(tdm_matrix, max.words=300, random.order=FALSE, colors=c('indianred3', 'lightsteelblue3'), title.size=2.5).
comparison.cloud(tdm_matrix, max.words=300,random.order=FALSE, colors = c('indianred3','lightsteelblue3'), title.size=2.5)
We can clearly see that Obama concentrated more on the economy while Bush’s favourite topic was the war on terror. We are quite happy with this insight but we also feel we can do better. Wouldn’t it be nice if we could compare not just words but whole topics across documents in a collection? This is what the advanced technique topic modelling does. Topic Modelling is a popular technique in social and cultural analytics that summarises a collection of texts into a predefined number of topics. Have a look at http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/.
Topic modelling is also popular, as it requires only minimal text organisation. Computers can learn topics by themselves. There are, however, known limitations of topic models with regard to the interpretation they help with. There is no guarantee that the automatically derived topics will correspond to what people would consider to be interesting topics/themes. They may be too specific or general, identical to other topics or they may be framings of larger topics, as opposed to genuinely distinct topics. Finally (and in common with other computational analysis techniques), the performance of topic modelling depends upon the quality and structure of the data. In our case the main issue will be that we only have 2 documents, which is not a lot of data. But as topic modelling is computationally quite expensive we should not overdo things here and just concentrate on this small corpus. Please, load the topic modelling library with library(topicmodels).
library(topicmodels)
Topic modelling is similar to clustering in that we have to define the total number of topics in advance. Please, enter n<-10 to set 10 topics.
n<-10
While the word clouds used a term-document matrix, topic models need the equivalent document-term matrix, where the documents are the rows and the words define the columns. Don't ask me why; it is how it is. Never argue with a computer, especially if you are using a function developed by somebody else. Please, create dtm <- DocumentTermMatrix(corp_small).
dtm <- DocumentTermMatrix(corp_small)
The actual topic modelling function is called LDA (Latent Dirichlet Allocation). It is really quite complicated but well explained in the reference above. It tries to define our n topics based on the most important words in each of them. A more detailed explanation can be found at http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/. Now run LDA with topic_model <- LDA(dtm, n, alpha=0.1, method='Gibbs'). This is advanced text analysis and we can ignore the arguments again. Even for just 2 documents, it should take a few seconds.
topic_model <- LDA(dtm, n, alpha=0.1, method='Gibbs')
We are now interested in the words/terms per topic and get those with terms(topic_model, 10). The parameter 10 indicates that we display 10 terms per topic.
terms(topic_model, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "year" "economy" "ability" "good" "made"
## [2,] "terrorists" "care" "health" "many" "come"
## [3,] "citizens" "every" "power" "trust" "goal"
## [4,] "last" "know" "met" "keep" "life"
## [5,] "past" "health" "prosperity" "law" "community"
## [6,] "funding" "jobs" "back" "qaeda" "enemy"
## [7,] "home" "budget" "moment" "military" "find"
## [8,] "fight" "plan" "private" "women" "housing"
## [9,] "freedom" "cost" "benefits" "families" "like"
## [10,] "power" "even" "defend" "iran" "parents"
## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
## [1,] "empower" "let" "will" "education" "united"
## [2,] "troops" "liberty" "people" "banks" "make"
## [3,] "iraqi" "seen" "can" "well" "chamber"
## [4,] "leaders" "call" "america" "just" "credit"
## [5,] "extremists" "come" "american" "plan" "effort"
## [6,] "peace" "coming" "new" "college" "war"
## [7,] "allow" "cut" "must" "save" "largest"
## [8,] "enemies" "growth" "congress" "businesses" "day"
## [9,] "fellow" "programs" "now" "buy" "money"
## [10,] "results" "includes" "nation" "address" "president"
Anything interesting to see here? Just like for the cluster analysis, the next step would be to label the topics in order to interpret them. Please, take a moment to do just that. We can also see which topics are assigned to each document with topics(topic_model, 10). 10 is the maximum number of topics returned. The result lists, for each document, its topics in order of importance.
topics(topic_model, 10)
## 1 2
## [1,] 8 8
## [2,] 1 2
## [3,] 6 9
## [4,] 4 10
## [5,] 7 3
## [6,] 5 7
## [7,] 3 5
## [8,] 2 4
## [9,] 10 1
## [10,] 9 6
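If you would like the underlying probabilities rather than just ranked topic numbers, the topicmodels package also provides a posterior() function. A minimal sketch, assuming the topic_model object created above:
# posterior() returns the estimated distributions:
# $topics is a documents-by-topics matrix, $terms a topics-by-terms matrix
post <- posterior(topic_model)
round(post$topics, 2)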
Not bad. Both word clouds and topic modelling deliver some interesting insights. We now feel confident to explore the whole corpus. In the end, we would like to establish some simple linguistic statistics such as the most frequent words/terms in a collection as well as word trends that tell us about the ups and downs of concepts during the history of policy-making in the USA. Check out http://stateoftheunion.onetwothree.net/sotuGraph/index.html for a visualisation to compare two concepts in the SOTU speeches. We can do that ourselves using R!
Let us start by creating a corpus of the whole collection. Run corp <- Corpus(VectorSource(sotu_all$speechtext)).
corp <- Corpus(VectorSource(sotu_all$speechtext))
This is now an 11 MB corpus and already quite something. Now we have to work through the same transformations again, starting with corp <- tm_map(corp, removePunctuation). You might notice that some of the steps take a little longer than before. Nothing too dramatic though, as the corpus is still fairly small. The full sequence of cleaning steps is spelled out below.
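Here is the whole cleaning sequence for the full corpus, using exactly the same transformations we applied to corp_small above:
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords('english'))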
We are ready to count words and create dtm <- DocumentTermMatrix(corp).
dtm <- DocumentTermMatrix(corp)
The European Holocaust Research Infrastructure (EHRI; http://ehri-project.eu) provides access to many textual collections. One of the partners in EHRI is the UK's Wiener Library, which has a unique collection of testimonies on the November 1938 pogroms in Germany (http://wienerlibrarycollections.co.uk/novemberpogrom/testimonies-and-reports/overview), later called Kristallnacht or the Night of Broken Glass (https://en.wikipedia.org/wiki/Kristallnacht). It was a pogrom against Jews throughout Nazi Germany on 9–10 November 1938, carried out by SA forces and German civilians. From their headquarters at the Jewish Central Information Office (JCIO), Alfred Wiener and his colleagues used their unrivalled network of contacts to collect over 350 eyewitness accounts describing the events of 9 and 10 November 1938 in cities, towns and villages throughout Germany and Austria.
We have again loaded the data for you. It should be available to you as the data frame progrom_tes. Try head(progrom_tes, 2).
head(progrom_tes, 2)
## File
## 2 0001.txt
## 3 0002.txt
## Text
## 2 (For information only, distribution) Delft, 22nd November 1938 On 13.11. I arrived Amsterdam Berlin aeroplane wife following trying give account memory circumstances I experienced recent months I received authentic reports. As impossible make notes events, despite everything must written certain reservations. The desire bring Jewish problem ultimate liquidation Germany evident since beginning 1938 found first decisive expression Vermgenserklrung [statement assets] beginning year. The ways means information demanded Jews capitalisation small even smaller incomes formed integral part, shows determination place high value Jewish assets possible. There followed removal taxation reliefs every kind Jews (allowances children, taxation Jewish welfare institutions etc.), removal public body status Jewish Gemeinden [communities], connexion compulsory collection Jewish taxes, became voluntary payment, longer possible. The introduction so-called "gelben Bnke" [yellow benches; benches marked yellow paint; Jews allowed use benches] small pinprick. The radical movement received impetus invasion Austria [...]. Herr Oppenheim, now (1945) New York
## 3 Amsterdam, 19th November 1938 The reporter aware, reports also mention, Herr Heidenheimer Herr Metzger yet traceable. Whether Herr Rechtsanwalt [lawyer] Fritz Josefsthal actually traceable able say. It may Berlin returned Nuremberg. It true Dr. Fritz Lorch died Frth. He Jewish hospital undergo hernia operation emigrated. He suffered embolism result shock caused events died. Furthermore Bamberg restaurateur Herz heart condition also died shock, 60 years old. The reporter learned good sources rather knowledge owner export business Nuremberg, Simon Loeb, 60 years old, suffered fractured skull pogroms. The reason unknown. It possible obtain confirmation rumour. Leo Klein, man independent means aged 70, former owner Langer & Mainz, wholesale textiles, reported suffered several fractures arms therefrom. Confirmation required. Rechtsanwalt Justin Goldstein, Nuremberg, mistreated battered stitches hospital. His wife suffered head injury nervous shock it. It noted generally nerves Jewish people reporter knows agitated even falling plate smallest sound can seriously upset them [...].
Again, let us start by creating a corpus of the whole collection. Run corp <- Corpus(VectorSource(progrom_tes$Text)).
corp <- Corpus(VectorSource(progrom_tes$Text))
This is now quite a large corpus with 265 documents. Now we have to work slowly through the transformations again. Start again with corp <- tm_map(corp, removePunctuation). You might notice that some of the steps take a little longer than before. Nothing too dramatic though, as the corpus is still quite manageable.
corp <- tm_map(corp, removePunctuation)
Next, we remove the numbers with corp <- tm_map(corp, removeNumbers).
corp <- tm_map(corp, removeNumbers)
And, we remove the extra whitespaces with corp <- tm_map(corp, stripWhitespace).
corp <- tm_map(corp, stripWhitespace)
Let’s transform all words to lower case with corp <- tm_map(corp, content_transformer(tolower)).
corp <- tm_map(corp, content_transformer(tolower))
Finally, eliminate stop words with corp <- tm_map(corp, removeWords, stopwords(‘english’)).
corp <- tm_map(corp, removeWords, stopwords('english'))
We are ready to count words and create dtm <- DocumentTermMatrix(corp).
dtm <- DocumentTermMatrix(corp)
Hopefully, you remember that the dtm has the documents in the rows and the words/terms in the columns. The cells contain the frequencies of the words in the document. So, the row sums contain the number of words in a document. Test it by running head(rowSums(as.matrix(dtm))). We first tell R again that dtm is a matrix and then use the function rowSums, which does what it says it does. It calculates the row sums.
head(rowSums(as.matrix(dtm)))
## 1 2 3 4 5 6
## 2103 205 202 237 548 1034
Interesting! If we ran this function on the whole dtm, we could filter out the shortest and longest documents using functions such as subset, min and max. Any idea how? It can be more complicated than it seems at first, and you should probably create a new vector of document lengths first; a sketch of one possible approach follows after the next command. But we move on, as we are more interested in the words/terms and their frequencies, which will tell us more about the contents of the various documents. Let's create a freq vector that contains the column sums, which are of course the frequencies of all words/terms in the whole corpus. Run freq <- colSums(as.matrix(dtm)).
freq <- colSums(as.matrix(dtm))
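Here is the promised sketch of finding the shortest and longest documents. doc_lengths is just an illustrative helper vector, and which.min()/which.max() are used instead of subset because they directly return the positions of the extremes:
# one total word count per document
doc_lengths <- rowSums(as.matrix(dtm))
doc_lengths[which.min(doc_lengths)]   # the shortest document and its length
doc_lengths[which.max(doc_lengths)]   # the longest document and its length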
What does length(freq) deliver? Try it and you will find that it returns the number of distinct words/terms in the corpus.
length(freq)
## [1] 12629
Let's next investigate the 10 most frequent terms. I hope you remember that order sorts a vector for us: it returns the indexes of the elements in sorted order. With head we can get the top items of a vector. So, let's create a most_freq_i vector with most_freq_i <- head(order(freq, decreasing = TRUE), 10), which returns the indexes of the 10 most frequent words in the collection. What does the decreasing parameter in order do? The answer should be in the name, but otherwise ask your friend the Internet.
most_freq_i <- head(order(freq, decreasing = TRUE), 10)
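If order and its decreasing parameter are still unclear, here is a tiny made-up example:
x <- c(30, 10, 20)
order(x)                    # 2 3 1: the index of the smallest value comes first
order(x, decreasing = TRUE) # 1 3 2: the index of the largest value comes first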
With most_freq_i, we can simply subset freq to return the 10 most frequent terms. Try freq[most_freq_i].
freq[most_freq_i]
## one people jewish jews also camp men taken
## 1002 828 675 638 556 556 516 420
## november now
## 397 392
Anything interesting to see here? Spend some time analysing the terms.
This was quite hard work and fairly advanced R stuff. There is a useful function in the tm package to find frequent terms. findFreqTerms(dtm, lowfreq=200) finds terms with the minimum term frequency of 200. Let’s run head(findFreqTerms(dtm, lowfreq=200), 10).
head(findFreqTerms(dtm, lowfreq=200), 10)
## [1] "already" "also" "arrested" "away" "camp" "can"
## [7] "children" "day" "days" "even"
Fairly interesting. But it gets more exciting with findAssocs, which finds term associations based on their correlations. Never heard of correlations? If you continue to use R, you will not escape them much longer. For now, it is enough to know that the correlations reported here are numbers between 0 and 1, where values close to 0 mean that two terms hardly ever appear together and values close to 1 mean that they almost always do. So let's run findAssocs(dtm, 'camp', corlimit=0.8) to find out which other terms the witnesses tend to use when they write about camps, keeping only terms correlated with 'camp' at 0.8 or above.
findAssocs(dtm, 'camp', corlimit=0.80)
## $camp
## achieved ssblockfhrer often trousers
## 0.93 0.91 0.90 0.90
## block one inmates hygiene
## 0.90 0.89 0.89 0.89
## ranks number work triangles
## 0.89 0.88 0.88 0.88
## life barracks learnt torture
## 0.87 0.87 0.87 0.87
## guards sleeping orderliness workshops
## 0.86 0.86 0.86 0.86
## long jacket betide toilets
## 0.85 0.85 0.85 0.84
## jackets regardless button germanys
## 0.84 0.84 0.84 0.84
## lift pens day despite
## 0.84 0.84 0.83 0.83
## kind least seemed physical
## 0.83 0.83 0.83 0.83
## freedom easily tone working
## 0.83 0.82 0.82 0.82
## punished political individual running
## 0.82 0.82 0.82 0.82
## comrades fever blockfhrer like
## 0.82 0.82 0.82 0.81
## forbidden always rows caps
## 0.81 0.81 0.81 0.81
## aufseher collective enormously stubendienst
## 0.81 0.81 0.81 0.81
## chose civilised deliverance denounced
## 0.81 0.81 0.81 0.81
## encourage eradicated extinguished felled
## 0.81 0.81 0.81 0.81
## instinct insubordination middleaged politischen
## 0.81 0.81 0.81 0.81
## processed procession proven sores
## 0.81 0.81 0.81 0.81
## suppressed terminology form others
## 0.81 0.81 0.80 0.80
## outside lagerkommandant undertaking
## 0.80 0.80 0.80
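To make the idea of term correlations a little more concrete, here is a minimal sketch with two made-up term-frequency vectors across four documents; cor() is base R's correlation function, and findAssocs works on the counts in the dtm in a similar spirit:
term_a <- c(5, 0, 3, 0)   # how often a term appears in four documents
term_b <- c(4, 0, 3, 1)   # a term that rises and falls with term_a
cor(term_a, term_b)       # close to 1: the two terms tend to occur together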
What about records of killing? Run findAssocs(dtm, 'kill', corlimit=0.9). Please note that we have raised corlimit to 0.9 here to keep the list of associated terms manageable; corlimit simply excludes terms whose correlation with 'kill' falls below the given threshold.
findAssocs(dtm, 'kill', corlimit=0.9)
## $kill
## acutely agile aisles
## 0.92 0.92 0.92
## alternatives bandaging bleeds
## 0.92 0.92 0.92
## brandishings bursts choose
## 0.92 0.92 0.92
## clicks collar collars
## 0.92 0.92 0.92
## compounded conflicts consoled
## 0.92 0.92 0.92
## contamination cruder crusts
## 0.92 0.92 0.92
## cursing depends dipped
## 0.92 0.92 0.92
## discs disorganised dissuade
## 0.92 0.92 0.92
## fetter frankfurtsd frantic
## 0.92 0.92 0.92
## fraudsters gag groggy
## 0.92 0.92 0.92
## hangs hits impaired
## 0.92 0.92 0.92
## indistinguishable interlaced keel
## 0.92 0.92 0.92
## keels knuckledusters krankenwrter
## 0.92 0.92 0.92
## lefthand limp limping
## 0.92 0.92 0.92
## louder lurched mouthfuls
## 0.92 0.92 0.92
## nagaika needless noisily
## 0.92 0.92 0.92
## planted plump pollute
## 0.92 0.92 0.92
## pounced pushes reinforcements
## 0.92 0.92 0.92
## remand removing restless
## 0.92 0.92 0.92
## ruse rushes russians
## 0.92 0.92 0.92
## sandbag sanittsbarack scalp
## 0.92 0.92 0.92
## schweinejuden shorthandled shuts
## 0.92 0.92 0.92
## sits slat smack
## 0.92 0.92 0.92
## somebodys splits sticky
## 0.92 0.92 0.92
## studded symptom teeter
## 0.92 0.92 0.92
## tenacity thatll touches
## 0.92 0.92 0.92
## unclear unsoldierly unteroffizier
## 0.92 0.92 0.92
## untersuchungsgefngnis urinated wachtmeister
## 0.92 0.92 0.92
## wakes wallops weights
## 0.92 0.92 0.92
## withstood
## 0.92
Some terms are more linked to ‘camp’ and ‘kill’ than others. Let’s check what we have learned next.
What does TDM stand for?
TermDocumentMatrix
In the content transformation workflow of the tm package for corp_small, how do I transform the content into lower cases?
tm_map(corp_small, content_transformer(tolower))
Find the word associations with ‘history’ using corlimit=0.5.
findAssocs(dtm, 'history', corlimit=0.5)
## $history
## affluent aligned manifested
## 0.80 0.80 0.80
## realise standards yards
## 0.80 0.80 0.80
## amt anticipate antonplatz
## 0.76 0.76 0.76
## astonishing barring basket
## 0.76 0.76 0.76
## bnke bricked bricklayer
## 0.76 0.76 0.76
## capitalisation compulsory conducting
## 0.76 0.76 0.76
## connexion counsellors critically
## 0.76 0.76 0.76
## daluege delft determination
## 0.76 0.76 0.76
## deterrent devastating dictated
## 0.76 0.76 0.76
## disapproval disclosed distances
## 0.76 0.76 0.76
## districtonly editorsinchief familienblatt
## 0.76 0.76 0.76
## fencing flush forceful
## 0.76 0.76 0.76
## forenames freight gelben
## 0.76 0.76 0.76
## geschichte glaubens grolmannstrasse
## 0.76 0.76 0.76
## growth hawkers henceforth
## 0.76 0.76 0.76
## herzog hessenassau hirschbergs
## 0.76 0.76 0.76
## historical hourly impetus
## 0.76 0.76 0.76
## incomes inconvenience innernministerium
## 0.76 0.76 0.76
## inscribing integral intensive
## 0.76 0.76 0.76
## iranische kndigungen kulturbundtheater
## 0.76 0.76 0.76
## langen legitimationskarte lehrhuser
## 0.76 0.76 0.76
## leibnitzstrasse licence loewenthal
## 0.76 0.76 0.76
## makkabi maxstrasse morgen
## 0.76 0.76 0.76
## neumeyer nonjewish oberstlandesgerichtsrat
## 0.76 0.76 0.76
## openair outline palstina
## 0.76 0.76 0.76
## partially pattern perception
## 0.76 0.76 0.76
## periodicals pes pinprick
## 0.76 0.76 0.76
## plots pogromlike polenaktion
## 0.76 0.76 0.76
## polishjewish polizeirevieren precede
## 0.76 0.76 0.76
## prelude principally producing
## 0.76 0.76 0.76
## raids railways rampaging
## 0.76 0.76 0.76
## reacted reassure reflection
## 0.76 0.76 0.76
## regierungsdirektor regions reichmannjungmann
## 0.76 0.76 0.76
## reliefs renounced reregistered
## 0.76 0.76 0.76
## reservations revival risen
## 0.76 0.76 0.76
## rosenhain rosenstock rothenburg
## 0.76 0.76 0.76
## sabres scrutiny searching
## 0.76 0.76 0.76
## seligenstadt slump stadtbahnen
## 0.76 0.76 0.76
## stadthausierschein stamping stlpchensee
## 0.76 0.76 0.76
## storehouses stroked sufficed
## 0.76 0.76 0.76
## switchboard syndicus tauber
## 0.76 0.76 0.76
## taxation traditional troubled
## 0.76 0.76 0.76
## typically uhlandstrasse ultimate
## 0.76 0.76 0.76
## untergrund varying verein
## 0.76 0.76 0.76
## verkehrsfallen victory waldi
## 0.76 0.76 0.76
## wickerwork wilmersdorferstrasse wrongly
## 0.76 0.76 0.76
## zeitschrift zionistisches mobilisation
## 0.76 0.76 0.73
[...]
Let’s move on to the second part of this session.