Currently we are focusing much of our group research on the social media site Twitter.
We view Twitter data as an excellent opportunity to 1) investigate ongoing language variation and change prompted by changes in culture and technology, and 2) illustrate and improve a number of techniques for handling big data.
As well, Twitter data presents no conflict of interest with our other research (i.e. we all work in private industry).
Here is what we have available at present:
So, This Is Twitter?:
This is a very simple way to familiarize yourself with the language of Twitter.
Most often we are introduced to Twitter by "following" the tweets of selected individuals and/or organizations that we admire and respect.
However, this gives a poor sense of what's really flowing through the Twitter "firehose", as it's called.
On this page you get a simple, gritty, no-holds-barred, straight-from-the-stream sample of tweets.
Every 10 minutes the page is updated with whatever happens to be flowing through the pipes (no selection or bias on our part).
You'll be amazed at the variety and content, and, uh, well... just have a look.
Twitter Stratified Random Sample (SRS):
Perhaps the most difficult aspect of analyzing Twitter data is knowing the scope. While it is relatively easy to run simple searches, once you have the returned data there is no simple way to add context (i.e. my search returns N tweets, but out of how many total tweets I don't know). Consequently, one can't reliably establish rates and trends, which are currently the primary target of most researchers. It's the old problem of establishing existence without knowing extent.
As a remedy to this problem with rates (and a number of other related issues), we have begun to build a stratified random sample (SRS) of Twitter data (i.e. tweets and associated statistics). Sparing the technical details, our overall target is to build a set of month-based corpora, each containing at least one million English tweets (plus any other tweets we collect in the process). To do this, we randomly sample the tweet stream every 10 minutes and collect until we have 240 tweets that we believe are English (our own test). 240 x 6 x 24 x 30 = 1,036,800 tweets minimum (we have been running at about 3:1 non-English to English, so by the end of a month we've generally collected over 3 million tweets). In this manner, we have a time-stratified random sample of tweets of sufficient size that any trend found in the corpus should represent the full population with a very high degree of reliability.
We have been actively collecting samples since mid-November 2011.
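To make the schedule concrete, here is a minimal Python sketch of one collection interval. The stream_sample() generator and looks_english() test are hypothetical stand-ins for our collector and our in-house language check, not real Twitter API calls:

    TWEETS_PER_INTERVAL = 240        # English tweets per 10-minute slot
    INTERVALS_PER_DAY = 6 * 24       # one slot every 10 minutes
    DAYS_PER_MONTH = 30

    # 240 x 6 x 24 x 30 = 1,036,800 English tweets minimum per month
    MONTHLY_MINIMUM = TWEETS_PER_INTERVAL * INTERVALS_PER_DAY * DAYS_PER_MONTH

    def collect_interval(stream_sample, looks_english):
        """Collect one 10-minute stratum: sample the stream until we have
        240 tweets we believe are English; non-English tweets gathered
        along the way are kept as well."""
        english, other = [], []
        for tweet in stream_sample():
            if looks_english(tweet):
                english.append(tweet)
                if len(english) == TWEETS_PER_INTERVAL:
                    break
            else:
                other.append(tweet)
        return english, other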
Using the Twitter SRS, it is now a simple matter to establish rates of occurrence by date, by tweet, and/or by token. This gives us a very reliable means of establishing trends, both for linguistic data (see the Twitter Lexicon below) and for topical data such as product or brand tracking (see Twitter Trends below).
As well, because we have collected the tweet text itself, we are not limited to the simple searches provided by the Twitter API. We have full control of the text processing methods and are able to base our trend data on much more targeted searches using full regular expressions and specific tokenization methods.
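For illustration, here is a minimal sketch of the two rate calculations in Python, assuming the corpus has been loaded as a plain list of tweet strings. The function names and the whitespace tokenizer are ours, and cruder than the tokenization we actually use:

    import re

    def rate_per_tweet(tweets, pattern):
        """Fraction of tweets with at least one match for the pattern."""
        rx = re.compile(pattern, re.IGNORECASE)
        return sum(1 for t in tweets if rx.search(t)) / len(tweets)

    def rate_per_token(tweets, pattern):
        """Matches per token, using a simple whitespace tokenizer."""
        rx = re.compile(pattern, re.IGNORECASE)
        matches = sum(len(rx.findall(t)) for t in tweets)
        tokens = sum(len(t.split()) for t in tweets)
        return matches / tokens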
Several have asked about downloading the SRS data.
We now have several sub-samples that you can download.
Please see the Corpora page or contact us for more information.
Twitter Trend Tracking:
Once you have a well-designed SRS, accurate trending and tracking is a fairly straightforward process. Statistically speaking, any ratio, rate, trend, etc. found in the sample is expected to hold in the full population, and given the size of the sample we can also expect a high degree of reliability. Combine this with a logic-based search using multiple regular expressions (i.e. a "give me this and that but not those" kind of search), and you can unravel the sentiment of the Twitter universe (the twitterverse) in relation to a particular topic. This could be for tracking known topics, or for discovering emerging ideas.
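A rough sketch of that kind of logic-based search in Python; the patterns below are made up purely for illustration:

    import re

    # "give me this and that but not those": require every positive
    # pattern and exclude every negative one (illustrative patterns only)
    MUST_MATCH = [re.compile(r"\bromney\b", re.I), re.compile(r"\bdebate\b", re.I)]
    MUST_NOT = [re.compile(r"\bretweet\b", re.I)]

    def matches_topic(tweet):
        return (all(rx.search(tweet) for rx in MUST_MATCH)
                and not any(rx.search(tweet) for rx in MUST_NOT))

    def daily_trend(tweets_by_day):
        """Fraction of each day's tweets that match the topic."""
        return {day: sum(map(matches_topic, tweets)) / len(tweets)
                for day, tweets in tweets_by_day.items()}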
Here are some BETA versions of reports we have constructed for tracking known topics. Updates will happen regularly as we improve the method or change topics:
- Romney - How is Mitt Romney doing on Twitter? Steady, but now with a slight increase.
- Santorum - Dwindled down to pretty much nothing.
- Gingrich - I'm down a bit. I'm up! Nope, never mind.
- Obama - Obama is always steady, almost like it's orchestrated. A little surge now and then if he says something controversial.
- Santa Claus - This is a great chart. A classic lack of interest after December. Santa gets no respect.
- Tim Tebow - Tim Tebow is back in the news. Interesting chart. Not much going on after the Super Bowl, then he gets traded.
Have a look at Twit Trend, a linguistically principled trending method based on the Twitter SRS.
Twit Trend is unique in that 1) it is generated from our SRS of Twitter data, and therefore the trends have a high degree of statistical validity and reliability, and 2) it demonstrates both emerging and declining trends on a daily and weekly basis.
Use it to add linguistic-based trends to your applications and/or website.
You'll have to know what to do with data in the JSON format in order to use it, but we've done our best to make it simple and straightforward.
Updated every six hours.
Have a look.
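If you haven't worked with JSON feeds before, consuming one in Python looks roughly like this. The URL and field names below are placeholders, not the actual Twit Trend schema, so check the feed itself for the real layout:

    import json
    import urllib.request

    # Placeholder URL and field names, NOT the real Twit Trend schema
    FEED_URL = "http://example.com/twit-trend.json"

    with urllib.request.urlopen(FEED_URL) as resp:
        feed = json.loads(resp.read().decode("utf-8"))

    for item in feed.get("trends", []):
        print(item.get("term"), item.get("score"))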
Twitter Current English Lexicon:
Based on the Twitter Stratified Random Sample Corpus, we regularly extract the Twitter Current English Lexicon. Basically, we're 1) pulling all tweets from the last three months of corpus entries that have been marked as "English" by the collection process (we have to make that call because there is no reliable means provided by Twitter), 2) removing all #hash, @at, and http items, 3) breaking the tweets into tokens, 4) building descriptive and summary statistics for all token-based 1-grams and 2-grams, and 5) pushing the top 10,000 N-grams from each set into a database and text files for review. So, for every top 1-gram and 2-gram, you know how many times it occurred in the corpus, and in how many tweets (plus associated percentages).
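For the curious, here is a minimal Python sketch of steps 2 through 5; the strip pattern, whitespace tokenizer, and tiny sample list are simplified stand-ins for our actual processing:

    import re
    from collections import Counter

    # Tiny sample standing in for three months of English tweets
    english_tweets = [
        "i love this #nlp demo @friend",
        "i said i love it http://t.co/xyz",
    ]

    STRIP = re.compile(r"#\w+|@\w+|http\S+")  # step 2: drop #hash, @at, http items

    def ngram_stats(tweets, n):
        """Occurrence count and tweet count for every n-gram (steps 3 and 4)."""
        occurrences, tweet_counts = Counter(), Counter()
        for tweet in tweets:
            tokens = STRIP.sub(" ", tweet.lower()).split()  # step 3: tokenize
            grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            occurrences.update(grams)
            tweet_counts.update(set(grams))                 # count once per tweet
        return occurrences, tweet_counts

    # Step 5: report the top N-grams with counts and tweet percentages
    occurrences, tweet_counts = ngram_stats(english_tweets, 1)
    total = len(english_tweets)
    for gram, count in occurrences.most_common(10000):
        pct = 100.0 * tweet_counts[gram] / total
        print(gram, count, tweet_counts[gram], "%.1f%%" % pct)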
This is an interesting set of data, particularly when you compare it with a "regular" English corpus, something traditional like the Brown Corpus. Unlike most corpora, the top token (1-gram) for Twitter is "i" (as in me, myself, and I), there are a lot of intentional misspellings, and you find an undue amount of, shall we say, "callous" language (be forewarned). It's a brave new world if you're willing.
To use this data set, we recommend using the database version and KwicData, but you can also use the text version. Download the ZIP file you want, unzip it, then read the README file for more explanation about what's included.
Download the most current TEXT Version or DATABASE Version.