How to Use GOT3 to Extract Old Tweets in Python
Social media analysis is becoming an increasingly important part of the data scientist’s repertoire. Text is easily the most plentiful source of unstructured data available on the internet; one simply needs to look at the mind-boggling statistics about how much social media content is generated every single day to understand the sheer scale of the data being created. Twitter is one of the most popular social media platforms used to conduct text-based analysis, and this post will focus on how to leverage the power of Python’s open-source third-party libraries to extract all tweets relevant to the analysis required.
Twitter is one of the world’s most popular social media platforms
Twitter is the de facto social media platform used for text-based social network analysis, and for good reason. Having been created all the way back in 2006, Twitter has seen a total of 1.3 billion accounts created in its history and over 500 million tweets sent each day. That is a treasure trove of textual information available online that represents people’s real-time reactions to all manner of topics, from politics to sports to current affairs, movements and news. Also, unlike Instagram and Snapchat, Twitter is primarily a text-based platform with a hard character limit on each post (140 characters historically; 280 characters since 2017). That lends a degree of size predictability with respect to each tweet that is difficult to obtain from other social media platforms. For all those reasons and more, Twitter represents an excellent first step in extracting posts made by users all over the world on specific topics and performing techniques such as Sentiment Analysis to gain insights from this data.
Sentiment Analysis is one application to make use of the comprehensive textual data available on Twitter
With Python being the programming language of choice for most data scientists, libraries such as Python-Twitter and Tweepy are generally used as Python interfaces to the Twitter Search API in order to access the information on Twitter. However, this technique poses a problem. Twitter Search, and by extension its API, is not meant to be an exhaustive source of tweets. The standard Twitter Search API only returns matching tweets from roughly the past week. So in order to extract all historical tweets relevant to a set of search parameters for analysis, the official Twitter API needs to be bypassed and custom libraries that mimic the Twitter search engine need to be used. One of these, courtesy of Jefferson Henrique and Dmitry Mottl, is called GetOldTweets3 (GOT3). GOT3’s GitHub repository can be found here:
As with most open-source Python libraries, installation is straightforward. As always, it is recommended to create a directory with its own Python virtual environment (preferably Python 3.5+) for isolation and portability, like so:
cd Projects/Twitter
python3.6 -m venv env
source env/bin/activate
To install GOT3 on the Command Line:
python3.6 -m pip install GetOldTweets3
Once installed, this library’s commands can be relayed through either Command Line or inside Python itself. With respect to Command Line Utility, the results of the Tweet Search queries are stored inside a CSV file generated in the same directory (output_got.csv by default). For example:
-> Extracting the last 10 tweets by Donald Trump (@realDonaldTrump):
GetOldTweets3 --username "realDonaldTrump" --maxtweets 10
-> Extracting the last 5 tweets by Barack Obama (@barackobama) within a specific pair of dates (the "until" date is excluded):
GetOldTweets3 --username "barackobama" --maxtweets 5 --since 2019-06-01 --until 2019-07-01
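The CSV file written by the command-line utility can then be loaded back into Python for inspection. The following is a minimal sketch using only the standard library; it assumes the file is comma-delimited with a header row, and the helper name load_tweets_csv is hypothetical rather than part of GOT3. The column names are whatever the header row contains, so it is worth printing them before relying on any particular field.

```python
import csv
from pathlib import Path


def load_tweets_csv(path="output_got.csv"):
    """Read a CSV export into a list of dicts, one per tweet (hypothetical helper)."""
    # newline="" lets the csv module handle line breaks inside quoted tweet text
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    csv_path = Path("output_got.csv")
    if csv_path.exists():
        rows = load_tweets_csv(csv_path)
        print(f"Loaded {len(rows)} tweets")
        # Column names come from the file's header row, so inspect them first
        if rows:
            print(list(rows[0].keys()))
```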
But it is even more convenient to do this directly within Python itself, as extracting and storing tweet information inside Python variables lends itself well to the Regular Expressions & Data Cleaning operations that need to be carried out later to make these tweets useful for analysis. Each Tweet possesses several bits of information & metadata such as the tweet text, the username of the tweeter, the mentions & hashtags, the number of retweets and favorites, the date-time and the geolocation of the tweet. As described on the GitHub page, GOT3 has a class structure within Python that needs to be followed to extract tweets. Within Python:
import GetOldTweets3 as got
-> Extracting the last 10 tweets by Donald Trump (@realDonaldTrump) / printing them out
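# Build the search criteria: a single user, capped at 10 tweets
tweetCriteria = got.manager.TweetCriteria().setUsername('realDonaldTrump').setMaxTweets(10)
# Run the search and print the text of each tweet returned
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
for tweet in tweets:
    print(tweet.text + '\n')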
This gives an output that mirrors the 10 most recent tweets on Donald Trump’s timeline.
-> Extracting the last 5 tweets by Barack Obama (@barackobama) within a specific pair of dates (the "until" date is excluded) / printing them out
# Same pattern, now restricted to tweets from 2019-06-01 up to (not including) 2019-07-01
tweetCriteria = got.manager.TweetCriteria().setUsername('barackobama').setMaxTweets(5).setSince('2019-06-01').setUntil('2019-07-01')
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
for tweet in tweets:
    print(tweet.text + '\n')
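Rather than only printing the text, it is often handier to keep each tweet's metadata in a plain Python structure for later cleaning. The sketch below is one possible way to do that, not something GOT3 itself provides: the helper tweets_to_rows is hypothetical, and every attribute name other than text is an assumption based on the metadata listed earlier, so missing attributes simply come back as None. A stand-in object is used in place of real tweets so that the example runs without a network call.

```python
from types import SimpleNamespace

# Attribute names other than `text` are assumptions based on the metadata listed above
FIELDS = ("username", "date", "text", "retweets", "favorites", "mentions", "hashtags", "geo")


def tweets_to_rows(tweets, fields=FIELDS):
    """Convert tweet objects into plain dicts; missing attributes become None."""
    return [{name: getattr(tweet, name, None) for name in fields} for tweet in tweets]


# Stand-in object so the sketch runs without contacting Twitter
sample = [SimpleNamespace(username="barackobama", text="Hello world", retweets=3)]
rows = tweets_to_rows(sample)
print(rows[0]["text"])       # Hello world
print(rows[0]["favorites"])  # None
```

With real data, the list returned by getTweets would be passed in place of sample.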
And that is how GOT3 can be used to extract all the tweets that are relevant for further analysis. There are several other search parameters that can be used to customize the retrieved list of tweets, like setQuerySearch() to search for a specific query, or setTopTweets() to only return top tweets. Because of Twitter’s immense popularity over the last decade and a half, obtaining historical tweets this way allows operations to be performed on an immense amount of data, increasing the likelihood of extracting insights and useful information once the necessary pre-processing has been performed.
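As an illustration of the pre-processing step, here is a minimal regular-expression sketch for cleaning tweet text. The function clean_tweet and its rules (drop URLs and @mentions, keep the word behind each hashtag, collapse whitespace, lowercase) are assumptions chosen for the example rather than a prescribed pipeline, and real analyses will likely need rules tuned to the task at hand.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip URLs and @mentions, keep hashtag words, collapse whitespace, lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # "#Python" -> "Python"
    text = WHITESPACE_RE.sub(" ", text)
    return text.strip().lower()


print(clean_tweet("Loving #Python! Thanks @guido https://example.com/x"))
# loving python! thanks
```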