## [1] "Loading session data. If this is the first time you are working on this session, this might take a while. Please, do not disconnect from the Internet."
Today we work on analysing texts. We have already talked about texts a little when we looked at how to analyse the content of tweets. In this session, we will finally find out how to do this ourselves. Text analysis is an advanced field in computational analytics, and we can rely on a very long tradition of doing it and on well-established methods. Besides, it is fun and perhaps the dominant form of social and cultural analytics – simply because we work so much with texts, documents, etc.
If you are interested in examples of how R can be used to work on contemporary questions of society, I recommend checking out the work of Ken Benoit http://www.kenbenoit.net/ from the LSE.
We will go back further in history and mainly work with the US State of the Union addresses. This allows us, first of all, to understand a lot about the past state of the political constitution of the USA. The State of the Union (SOTU) data is taken from http://stateoftheunion.onetwothree.net/index.shtml and provides access to the corpus of all the State of the Union addresses from 1790 to 2015 at the time of writing. SOTU allows you to explore how specific words gain and lose prominence over time, and to link to information on the historical context for their use.
We did all the hard work of loading this data for you. It should be available to you as the data frame sotu_all. Try printing out its first 2 entries by using head with 2 as the second parameter.
head(sotu_all, 2)
## speechtext
## 17900108.html Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. In resuming your consultations for the general good you can not but derive encouragement from the reflection that the measures of the last session have been as satisfactory to your constituents as the novelty and difficulty of the work allowed you to hope. Still further to realize their expectations and to secure the blessings which a gracious Providence has placed within our reach will in the course of the present important session call for the cool and deliberate exertion of your patriotism, firmness, and wisdom. Among the many interesting objects which will engage your attention that of providing for the common defense will merit particular regard. To be prepared for war is one of the most effectual means of preserving peace [...].
## 17901208.html Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in being able to repeat my congratulations on the favorable prospects which continue to distinguish our public affairs. The abundant fruits of another year have blessed our country with plenty and with the means of a flourishing commerce. The progress of public credit is witnessed by a considerable rise of American stock abroad as well as at home, and the revenues allotted for this and other national purposes have been productive beyond the calculations by which they were regulated. This latter circumstance is the more pleasing, as it is not only a proof of the fertility of our resources, but as it assures us of a further increase of the national respectability and credit, and, let me add, as it bears an honorable testimony to the patriotism and integrity of the mercantile and marine part of our citizens. The punctuality of the former in discharging their engagements has been exemplary [...]. GO. WASHINGTON
## year date
## 17900108.html 1790 17900108
## 17901208.html 1790 17901208
You should see a lot of printed text and thus might have missed the structure of the data frame. Check it with str(sotu_all) and you will see three columns. The first one contains the full text, the second one the year and the third the exact date in the format YYYYMMDD.
str(sotu_all)
## 'data.frame': 230 obs. of 3 variables:
## $ speechtext: chr " Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity "| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in b"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In vain may we expect peace with the Indians on ou"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: It is some abatement of the satisfaction with whic"| __truncated__ ...
## $ year : chr "1790" "1790" "1791" "1792" ...
## $ date : chr "17900108" "17901208" "17911025" "17921106" ...
str should also show you a problem that is typical for imported data. All columns are of type character, while clearly the year column is numeric and the date column describes an actual date. This can lead to issues later on when we analyse the data. Let's fix that. First, we tell R that year is a column with numeric data by running sotu_all$year <- as.numeric(sotu_all$year).
sotu_all$year <- as.numeric(sotu_all$year)
sotu_all's date is clearly a date-string in the format YYYYMMDD. R has the powerful as.Date function that can parse such character strings and transform them into R's date objects. The advantage is that we can then compare dates with each other, add dates to other dates, etc. http://www.statmethods.net/input/dates.html explains how this works. We basically need to tell R where to find the year in the string with %Y for a 4-digit year, where to find the month with %m and the day with %d. Run sotu_all$date <- as.Date(sotu_all$date, '%Y%m%d').
sotu_all$date <- as.Date(sotu_all$date, '%Y%m%d')
Fighting with different formats of date and time is a key part of analysing culture and society. For some past events, for instance, we do not know exact dates or times and have to work with estimates. Sometimes, we only know the approximate time span of when, say, a Roman consul lived. This can cause a lot of issues. lubridate is an excellent R package to help with date and time calculations (https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html). We do not have time here to go through the details, but I recommend checking it out next time you struggle with dates and times. Running str(sotu_all) will confirm that all the columns are now of the correct type.
str(sotu_all)
## 'data.frame': 230 obs. of 3 variables:
## $ speechtext: chr " Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity "| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in b"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: In vain may we expect peace with the Indians on ou"| __truncated__ " Fellow-Citizens of the Senate and House of Representatives: It is some abatement of the satisfaction with whic"| __truncated__ ...
## $ year : num 1790 1790 1791 1792 1793 ...
## $ date : Date, format: "1790-01-08" "1790-12-08" ...
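As a brief aside on the lubridate package mentioned above: here is a minimal sketch (assuming lubridate is installed) of how its ymd() parser would handle the same YYYYMMDD strings. We have already converted the column with as.Date, so this is only for illustration.
# A sketch only: lubridate's ymd() parses "YYYYMMDD" strings directly
library(lubridate)
ymd("17900108")                  # returns the Date "1790-01-08"
ymd(c("17900108", "17901208"))   # works on whole vectors too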
With sotu_all, we have a collection of documents reflecting over 200 years of the history of the US in the words of each president. It is time to do something with it and text mine it. As mentioned, text mining is a well-established discipline, so it is not surprising that there is a very strong text mining package in R. It is called tm and you will find it in many R introductions (see also https://cran.r-project.org/web/packages/tm/index.html). Please, load tm with library(tm).
library(tm)
Text mining is the process of analysing natural language text, often sourced from online content such as emails and social media (Twitter or Facebook). In this chapter, we are going to cover some of the most commonly used methods of the tm package and employ them to analyse the historical content of the State of the Union speeches. To this end, we first need to create a so-called corpus. A corpus is a term originating from the linguistics community. All you need to know about a corpus is that it is basically a collection of text documents. In R, you can create a corpus from many sources (online or local) and from many different formats such as plain text or PDF. Try getSources() to see which sources you can use.
getSources()
## [1] "DataframeSource" "DirSource" "URISource" "VectorSource"
## [5] "XMLSource" "ZipSource"
Now try to get the formats you can directly import into a tm corpus with getReaders().
getReaders()
## [1] "readDataframe" "readDOC"
## [3] "readPDF" "readPlain"
## [5] "readRCV1" "readRCV1asPlain"
## [7] "readReut21578XML" "readReut21578XMLasPlain"
## [9] "readTagged" "readXML"
There are many different formats! Luckily, we can ignore most of these formats here, as we have already loaded our data in a nice plain text format. If you are interested, please check online for the many ways the tm package is used. All we need for now is the vector source.
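Before we apply the vector source to the speeches, here is a tiny sketch of what it does. The toy strings are made up for illustration; Corpus() is the constructor we will also use in a moment.
# A toy sketch: VectorSource treats every element of a character vector as one document
toy_docs <- c("This is the first toy document.", "And this is the second one.")
toy_corpus <- Corpus(VectorSource(toy_docs))
inspect(toy_corpus)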
The first task we set ourselves is to understand historical policy differences. The last SOTU address of the Republican George W. Bush was in 2008, while the first one of the Democrat Barack Obama was in 2009. These were also the years of the height of a very severe financial crisis. So, we expect interesting changes in content. To identify these, we would like to produce a word cloud comparing the most frequent terms in both speeches in order to understand policy differences. Let us first create a smaller sample of the data containing only the speeches of 2008 and 2009. Do you remember how to do this with the subset command? Take a moment to reflect on your subsetting skills and then type sotu_2008_2009 <- subset(sotu_all, (year == 2008) | (year == 2009)) to create a new sotu_2008_2009 data frame containing only the speeches of 2008 and 2009.
sotu_2008_2009 <- subset(sotu_all, (year == 2008) | (year == 2009))
Now we need to create a corpus. We use the vector source method. Each column in a data frame is of course a vector – as you might remember. We are only interested in the full texts and thus only need the column speechtext. Please, create the corpus with corp_small <- Corpus(VectorSource(sotu_2008_2009$speechtext)).
corp_small <- Corpus(VectorSource(sotu_2008_2009$speechtext))
We will investigate the details of the corpus later. For now we would like to concentrate on the well-defined steps of a text analysis. It is quite simple really, thanks to an insight by the linguist Zipf (https://en.wikipedia.org/wiki/Zipf%27s_law). He stated that the more frequently a word appears in a document, the more important it is for its content. So, if we talk about beautiful gardens, we will use the word garden often, but also flowers, grass, etc. All we need to do then is to count all the words in a document and derive the word frequencies to get an overview of what the document is about. We do this for all documents in a collection like SOTU and get a good comparison.
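As a toy illustration of this counting idea, here is a minimal sketch in plain base R (the sentence is made up and this is not yet the tm workflow):
# Split a toy sentence on white space and tabulate how often each word occurs
toy_text <- "the garden is full of flowers and the flowers smell wonderful"
toy_words <- unlist(strsplit(toy_text, " "))
sort(table(toy_words), decreasing = TRUE)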
But before we count words, we need to tell the computer what a word is. A computer does not know such trivial things. In order to keep it simple, we say that a word is anything that is separated by white spaces. This means we first need to clean the texts so that these white spaces become consistent, and the computer can easily recognise the words. Let’s identify words then by first removing all punctuation with corp_small <- tm_map(corp_small, removePunctuation).
corp_small <- tm_map(corp_small, removePunctuation)
The tm_map function provides a convenient way of running the clean-up transformations in order to eliminate all details that are not necessary for the analysis using the tm package. To see the list of available transformation methods, simply call the getTransformations() function. Remember our final target is to get to a list of important words per document.
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
Next we standardise all word spellings by lower-casing them with corp_small <- tm_map(corp_small, content_transformer(tolower)). This makes it easier to count them. Why?
corp_small <- tm_map(corp_small, content_transformer(tolower))
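One way to see why (a toy sketch, not part of our pipeline): without lower-casing, 'The' and 'the' would be counted as two different words.
# Counted separately without lower-casing ...
table(c("The", "union", "the", "Union"))
# ... and merged into single counts once everything is lower-cased
table(tolower(c("The", "union", "the", "Union")))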
Let us also remove all numbers as they do not add to the content of the speeches. Type corp_small <- tm_map(corp_small, removeNumbers).
corp_small <- tm_map(corp_small, removeNumbers)
Let us also strip out any extra white space that we do not need with corp_small <- tm_map(corp_small, stripWhitespace). Such extra white space can easily confuse our word recognition. It can be a real issue in historical documents in particular, as they have often been OCR'ed with imperfect recognition. Look it up! There are many cases described on the web. The Bad Data Handbook: Cleaning Up the Data So You Can Get Back to Work by McCallum, published by O'Reilly, is an excellent summary of this other side of big data. The bigger the data, the more likely it is also somehow 'bad'!
corp_small <- tm_map(corp_small, stripWhitespace)
When I said earlier that the most frequently used words carry the meaning of a document, I was not entirely telling the truth. There are so-called stopwords such as the, a, or, etc., which usually carry less meaning than the other expressions in the corpus. You will see what kind of words I mean by typing stopwords(‘english’).
stopwords('english')
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
Do you agree that these words do not really carry meaning in English? Stopwords should be removed to concentrate on the most important words. Let us do that with the function removeWords and corp_small <- tm_map(corp_small, removeWords, stopwords(‘english’)).
corp_small <- tm_map(corp_small, removeWords, stopwords('english'))
That is it for the time being in terms of document transformations. We are now confident that we can start counting words to extract meaning. We would like to create a so-called term-document matrix (tdm). You remember that a matrix is a 2-dimensional structure in R? The tdm will simply contain the terms (words) in the rows of the matrix and the (2) documents/speeches in the columns, with each cell holding how often a term occurs in a document. Create it with tdm <- TermDocumentMatrix(corp_small).
tdm <- TermDocumentMatrix(corp_small)
Have a look at what you have done with tm’s inspect(tdm[1:5, 1:2]). Do you see the first 5 terms in the 2 documents? The numbers represent their frequencies.
inspect(tdm[1:5, 1:2])
## <<TermDocumentMatrix (terms: 5, documents: 2)>>
## Non-/sparse entries: 7/3
## Sparsity : 30%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1 2
## abandon 1 0
## ability 4 3
## abroad 2 1
## acceptable 1 0
## accepts 1 0
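If you would like a quick peek at the frequent terms before plotting anything, tm's findFreqTerms() returns all terms that occur at least a given number of times. A small sketch (the threshold of 25 is an arbitrary choice):
# Terms that appear at least 25 times across the two speeches
findFreqTerms(tdm, lowfreq = 25)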
Let us next prepare a simple word cloud. Load the package with library(wordcloud).
library(wordcloud)
Next, convert the term-document matrix into a plain matrix that the wordcloud package can work with: tdm_matrix <- as.matrix(tdm).
tdm_matrix <- as.matrix(tdm)
As always, we like things to be tidy, so we tell R that the matrix contains the 2 speeches from 2008 and 2009. Name the columns with colnames(tdm_matrix) <- c('SOTU 2008', 'SOTU 2009'). colnames, by the way, is the matrix counterpart of the names function we have already met earlier. It does what its name says it does.
colnames(tdm_matrix) <- c('SOTU 2008', 'SOTU 2009')
Now create a word cloud of the terms the two speeches have in common by typing commonality.cloud(tdm_matrix, random.order = FALSE, scale = c(5, .5), colors = brewer.pal(4, 'Dark2'), max.words = 400). Ignore the warnings, please; they are mainly about terms that cannot be fitted into the various regions of the plot. For learning purposes, we can ignore them.
commonality.cloud(tdm_matrix, random.order = FALSE, scale = c(5, .5), colors = brewer.pal(4, 'Dark2'), max.words = 400)
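The commonality cloud shows the vocabulary the two speeches share. To highlight the differences instead, the same wordcloud package also provides comparison.cloud, which takes the same matrix. A minimal sketch (the colour choice is arbitrary):
# Plot the terms that are over-represented in one speech relative to the other
comparison.cloud(tdm_matrix, random.order = FALSE, scale = c(4, .5), colors = c('darkred', 'darkblue'), max.words = 200)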