Scraping historical tweets without a Twitter Developer Account

Mihaela Grigore
8 min read · Apr 6, 2021

Image by Tumisu from Pixabay

The tool we will use:

  • snscrape

What you need:

  • Python 3.8

What you don’t need:

  • a Twitter Developer Account

For a research project on public discourse about the results of international large-scale assessments, I needed to scrape historical tweets, going all the way back to the beginning of Twitter. This is how I discovered snscrape, a wonderful tool that is easy to set up and use.

I didn’t find snscrape right away. Initially I was reading through the intricate details of the Twitter Developer Account: the application procedure, the different levels of access, the limits and so on. Luckily, a friend recommended snscrape, and suddenly the task of collecting tweets became extremely easy.

Snscrape is popular among social scientists for collecting tweets, at least as of 2021. Apparently, it bypasses several limitations of the Twitter API.
The nicest part is that you don’t need Twitter developer account credentials (as you do with Tweepy, for example).

Table of contents

  1. Installing snscrape
  2. How to use snscrape
  3. Calling snscrape CLI commands from Python Notebook
  4. Using snscrape Python wrapper
  5. Tweets meta-information gathered with snscrape
  6. Dataset manipulation: JSON, CSV and Pandas DataFrame
  7. Basic exploration of our collected dataset of tweets
  8. Bonus: Publishing your Jupyter Notebook on Medium
  9. What next? Sentiment analysis

We begin with the imports used throughout this tutorial.

import os
import subprocess

import json
import csv

import uuid

from IPython.display import display_javascript, display_html, display

import pandas as pd
import numpy as np

from datetime import datetime, date, time

1. Installing snscrape

Snscrape is available from its official GitHub project repository.

Snscrape has two versions:

  • the released version, which you can install by running pip3 install snscrape in a command line terminal
  • the development version, which is said to have richer functionality, so this is the one I’ll be using.

First, let’s check the current Python version, as the snscrape documentation mentions that it requires Python 3.8 or higher.

from platform import python_version
print(python_version())

Output

3.8.3

If you see a version older than 3.8, please upgrade your Python version before you continue this tutorial, otherwise you will not be able to install snscrape.
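The same check can also be done programmatically. Below is a minimal sketch, assuming that any version from 3.8 upwards is acceptable; the helper name python_is_supported is my own and is not part of snscrape. Comparing version tuples avoids the trap of comparing version strings, where '3.10' would sort before '3.8'.

```python
import sys

# Minimum version mentioned by the snscrape documentation (assumed to be a lower bound)
MIN_VERSION = (3, 8)

def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

if not python_is_supported():
    print("Python 3.8 or higher is needed; found", sys.version.split()[0])
```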

Next, we install the development version of snscrape and import its Twitter module.

pip install git+https://github.com/JustAnotherArchivist/snscrape.git

import snscrape.modules.twitter as sntwitter

2. How to use snscrape

There are three ways to use snscrape:

  • through its command line interface (CLI) in a terminal
  • by using Python to run the CLI commands, for example from a Jupyter notebook (if you don’t want to use the terminal to run commands)
  • through the official snscrape Python wrapper, which unfortunately is not well documented

Parameters you can use:

  • --jsonl : output the data in JSONL format
  • --max-results N : limit the number of tweets to collect
  • --with-entity : include the entity (e.g. user, channel) as the first output item (default: False)
  • --since DATETIME : only return results newer than DATETIME (default: None)
  • --progress : report progress on stderr (default: False)

# Run the snscrape help to see what options / parameters we can use
cmd = 'snscrape --help'

# This is similar to running os.system(cmd), which would show the command's output in the terminal
# window from which I started my Jupyter Notebook (which is what I used to develop this code).
# By using subprocess, I capture the command's output in a variable, whose content I can then print here.
output = subprocess.check_output(cmd, shell=True)

print(output.decode("utf-8"))

Output

usage: snscrape [-h] [--version] [-v] [--dump-locals] [--retry N] [-n N]
[-f FORMAT | --jsonl] [--with-entity] [--since DATETIME]
[--progress]
{telegram-channel,vkontakte-user,weibo-user,facebook-group,instagram-user,instagram-hashtag,instagram-location,reddit-user,reddit-subreddit,reddit-search,twitter-thread,twitter-search,facebook-user,facebook-community,twitter-user,twitter-hashtag,twitter-list-posts,twitter-profile}
...

positional arguments:
{telegram-channel,vkontakte-user,weibo-user,facebook-group,instagram-user,instagram-hashtag,instagram-location,reddit-user,reddit-subreddit,reddit-search,twitter-thread,twitter-search,facebook-user,facebook-community,twitter-user,twitter-hashtag,twitter-list-posts,twitter-profile}
The scraper you want to use

optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v, --verbose, --verbosity
Increase output verbosity (default: 0)
--dump-locals Dump local variables on serious log messages (warnings
or higher) (default: False)
--retry N, --retries N
When the connection fails or the server returns an
unexpected response, retry up to N times with an
exponential backoff (default: 3)
-n N, --max-results N
Only return the first N results (default: None)
-f FORMAT, --format FORMAT
Output format (default: None)
--jsonl Output JSONL (default: False)
--with-entity Include the entity (e.g. user, channel) as the first
output item (default: False)
--since DATETIME Only return results newer than DATETIME (default:
None)
--progress Report progress on stderr (default: False)

3. Calling snscrape CLI commands from Python Notebook

Notice I make use of a few snscrape parameters:

  • --max-results, to limit the search
  • --jsonl, to have my results saved directly into a JSON file
  • --progress, to report progress while scraping
  • --since yyyy-mm-dd, to collect tweets starting from this date
  • twitter-search, the scraper to use, followed by the actual text to search for.
    Notice I use ‘until:yyyy-mm-dd’ inside the search text. This is a workaround for the fact that snscrape does not support an --until DATETIME parameter, so I am using the until operator that is already built into Twitter search.
    For more search operators that you can pass on to snscrape as part of the text to search for, see the Twitter documentation on search operators.

json_filename = 'pisa2018-query-tweets.json'

# Using the os module to call CLI commands from Python
os.system(f'snscrape --max-results 5000 --jsonl --progress --since 2018-12-01 twitter-search "#pisa2018 lang:fr until:2019-12-31" > {json_filename}')
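As an alternative to pasting everything into one shell string, the command can be assembled as a list of arguments. This is only a sketch: build_snscrape_command is a hypothetical helper of mine, and it uses nothing beyond the flags shown in the help output above. A list avoids quoting problems when the search text contains spaces or a # sign.

```python
import shlex

def build_snscrape_command(query, max_results=None, since=None):
    """Assemble a twitter-search command as an argument list (hypothetical helper)."""
    cmd = ["snscrape", "--jsonl", "--progress"]
    if max_results is not None:
        cmd += ["--max-results", str(max_results)]
    if since is not None:
        cmd += ["--since", since]
    # Global options go before the scraper name, the search text comes last
    cmd += ["twitter-search", query]
    return cmd

cmd = build_snscrape_command(
    "#pisa2018 lang:fr until:2019-12-31", max_results=5000, since="2018-12-01"
)
print(shlex.join(cmd))  # shell-quoted form, handy for checking the command

# To actually run it (requires snscrape to be installed):
# import subprocess
# with open("pisa2018-query-tweets.json", "w") as out:
#     subprocess.run(cmd, stdout=out, check=True)
```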

4. Using snscrape Python wrapper

start = date(2016, 12, 5)
start = start.strftime('%Y-%m-%d')

stop = date(2016, 12, 14)
stop = stop.strftime('%Y-%m-%d')
keyword = input('Keyword to search for in Twitter archive:')
# At the prompt I typed: pisa2018
maxTweets = 1000

# We are going to write the data into a csv file
filename = keyword + start + '-' + stop + '.csv'

# I will use the following Twitter search operators:
# since - start date for Tweets collection
# until - stop date for Tweets collection
# -filter:links - not very clear from the Twitter search operators documentation
#   (https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/search-operators),
#   but it looks like this will exclude tweets with links from the search results
# -filter:replies - removes @reply tweets from search results
query = f'{keyword} since:{start} until:{stop} -filter:links -filter:replies'

# We write to the csv file by using csv writer; the with block closes the file for us
with open(filename, 'a', newline='', encoding='utf8') as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['id', 'date', 'tweet'])
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= maxTweets:
            break
        csvWriter.writerow([tweet.id, tweet.date, tweet.content])
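When the search text is glued together from several pieces, it is easy to lose a space between the keyword and the first operator. Here is a small sketch of a helper that joins the parts with single spaces; build_search_query is a name I made up for illustration, and the operators are the ones used in this section.

```python
from datetime import date

def build_search_query(keyword, start, stop, extra_operators=()):
    """Join a keyword, date bounds and extra operators with single spaces."""
    parts = [keyword, f"since:{start.isoformat()}", f"until:{stop.isoformat()}"]
    parts.extend(extra_operators)
    return " ".join(parts)

query = build_search_query(
    "pisa2018",
    date(2016, 12, 5),
    date(2016, 12, 14),
    ["-filter:links", "-filter:replies"],
)
print(query)
# pisa2018 since:2016-12-05 until:2016-12-14 -filter:links -filter:replies
```

The resulting string can be passed to the scraper in place of a hand-built one.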

5. Tweets meta-information gathered with snscrape

Let’s have a look at all the information that is available for every single tweet scraped using snscrape.

For this code I am using an example file that I made precisely for this purpose, which contains a single JSON object. If you want to use a JSON file created with the steps above, you need to make some changes before you can run json.loads on it, as explained in this Stack Overflow discussion.
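One possible way to handle such a file, assuming it holds one JSON object per line (as the --jsonl option suggests), is to parse it line by line. The sketch below uses a made-up two-record file, not real snscrape output, and read_jsonl is my own helper name.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Made-up sample standing in for a file produced with --jsonl
sample = '{"id": 1, "content": "first tweet"}\n{"id": 2, "content": "second tweet"}\n'
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

records = read_jsonl(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(len(records), records[0]["content"])
```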

The solution for pretty printing JSON data inside a Jupyter Notebook comes from this GitHub project.

Click on the + icons to expand the contents of that particular item.

filename = 'example.json'

with open(filename) as json_file:
    data = json.load(json_file)

class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict) or isinstance(json_data, list):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())

    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;font: 12px/18px monospace !important;"></div>'.format(self.uuid), raw=True)
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
            renderjson.set_show_to_level(2);
            document.getElementById('%s').appendChild(renderjson(%s))
        });
        """ % (self.uuid, self.json_str), raw=True)

RenderJSON([data])

6. Dataset manipulation: JSON, CSV and Pandas DataFrame

Converting JSON to Pandas DataFrame

The Pandas DataFrame is the data structure of choice in data science, so we read the JSON file into a DataFrame.

Then we save it as CSV, since CSV is the most common file type for small data science projects.

filename = 'pisa2018-query-tweets'
tweets_df = pd.read_json(filename +'.json', lines=True)
tweets_df.shape

Output

(327, 23)

Saving DataFrame to CSV

tweets_df.to_csv(filename +'.csv', index = False)

7. Basic exploration of our collected dataset of tweets

Basic introduction to tweets

Tweets are messages of up to 280 characters (hence the term ‘microblogging’). Just like on other social media platforms, you need to create an account before you can start participating in the tweetverse.

Tweets act as short status updates. Tweets appear on timelines. Timelines are collections of tweets sorted in a chronological order. On your account’s home page, you’re shown a timeline where tweets from people you follow will be displayed.

You can post your own brand new tweet, retweet an already existing tweet (which means you just share the exact same tweet) or quote an existing tweet (similar to retweeting, but you can add your own comment to it).

You can also reply to someone else’s tweets or ‘like’ them.

Tweets often contain entities, which are mentions of:

  • other users, which appear in the form of @other_user
  • places
  • urls
  • media that was attached to the tweet
  • hashtags, that look like #example_hashtag. Hashtags are just a way to apply a label on a tweet. If I’m tweeting something about results of PISA, the Programme for International Student Assessment, I will likely use #oecdpisa in my tweet, for example.
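To make the idea of entities concrete, here is a rough sketch that pulls hashtags and mentions out of a tweet’s text with regular expressions. The patterns are a simplification of my own and do not reproduce Twitter’s actual rules (for instance, they would also pick up the domain part of an e-mail address as a mention); the sample text is made up.

```python
import re

# Simplified patterns: a marker character followed by letters, digits or underscores
HASHTAG_PATTERN = re.compile(r"#\w+")
MENTION_PATTERN = re.compile(r"@\w+")

def extract_entities(text):
    """Return the hashtags and mentions found in `text`, in order of appearance."""
    return {
        "hashtags": HASHTAG_PATTERN.findall(text),
        "mentions": MENTION_PATTERN.findall(text),
    }

sample_text = "Reading the new #oecdpisa results with @other_user #example_hashtag"
entities = extract_entities(sample_text)
print(entities)
```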

Counting the number of Tweets we scraped

The following cell is overkill in this particular scenario, but imagine you just scraped 1 million tweets and you want to know how many you got. The cell below is a very efficient way to count in that case.

num = sum(1 for line in open(json_filename))
print(num)
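A slightly more careful variant of the same counting idea wraps the file in a with block, so that it is closed as soon as the count is done. This is a sketch run against a small temporary file made up for the demonstration; count_lines is my own helper name.

```python
import os
import tempfile

def count_lines(path):
    """Count lines lazily, so the whole file never has to sit in memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

# Made-up three-line file standing in for a JSONL file of tweets
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"id": 1}\n{"id": 2}\n{"id": 3}\n')

line_count = count_lines(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(line_count)
```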

Check tweets for a particular text

substring = 'justesse'

count = 0
f = open(json_filename, 'r')
for i, line in enumerate(f):
    if substring in line:
        count = count + 1
        obj = json.loads(line)
        print(f'Tweet number {count}: {obj["content"]}')
print(count)
f.close()
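Searching the raw line can also match text that sits outside the tweet itself, for example in a URL or a user name. A possible refinement, sketched here on made-up records, is to parse each line first and look only at the content field, ignoring case. The helper name find_in_content is hypothetical.

```python
import json

def find_in_content(lines, substring):
    """Yield parsed tweets whose `content` field contains `substring`, ignoring case."""
    needle = substring.casefold()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        # `content` may be missing or null, so fall back to an empty string
        if needle in (tweet.get("content") or "").casefold():
            yield tweet

# Made-up records; the second one mentions the word only in its URL
sample_lines = [
    '{"content": "Avec justesse, le rapport #pisa2018", "url": "https://example.com/1"}',
    '{"content": "Autre sujet", "url": "https://example.com/justesse"}',
    '{"content": "JUSTESSE et rigueur", "url": "https://example.com/3"}',
]
matches = list(find_in_content(sample_lines, "justesse"))
print(len(matches))  # 2
```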

The actual content of the tweet is available through tweets_df['content'] or tweets_df.content.

renderedContent seems to contain the same information as content.

tweets_df.iloc[0].content

Output

"Le périscolaire pour apprendre en s'amusant ... #pisa2018\n#UnPlanBpourLécole : https://t.co/cC28XiWfc7"

Links mentioned in the tweet are also listed separately in the outlinks column.

tweets_df.iloc[0].outlinks

Output

['https://www.amazon.fr/dp/1686530544']

We can gauge the popularity of a tweet through these features:

  • replyCount
  • retweetCount
  • likeCount
  • quoteCount
popularity_columns = ['replyCount', 'retweetCount', 'likeCount', 'quoteCount']
tweets_df.iloc[0][popularity_columns]

Output:

replyCount      0
retweetCount    0
likeCount       0
quoteCount      0
Name: 0, dtype: object

Find the most retweeted tweet in our dataset.

tweets_df.loc[tweets_df.retweetCount.idxmax(), ['content', 'retweetCount']]

Output:

content         #PISA2018\nLa France médiocrement classée dans...
retweetCount 161
Name: 103, dtype: object
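If you want more than the single most retweeted tweet, pandas can return the top few rows directly. The sketch below uses a tiny made-up DataFrame with the same column names as our dataset; the values are invented for illustration.

```python
import pandas as pd

# Made-up data using the column names from the scraped dataset
sample_df = pd.DataFrame(
    {
        "content": ["tweet a", "tweet b", "tweet c", "tweet d"],
        "retweetCount": [3, 161, 0, 42],
    }
)

# nlargest keeps the n rows with the highest values in the given column
top_tweets = sample_df.nlargest(2, "retweetCount")[["content", "retweetCount"]]
print(top_tweets)
```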

8. Bonus: Publishing your Jupyter Notebook on Medium

pip install jupyter_to_medium

import jupyter_to_medium as jtm
jtm.publish(
    'Notebook_name.ipynb',
    integration_token='paste_your_own_token_here',
    pub_name=None,
    title='Desired Medium article title',
    tags=['scraping with Python', 'Twitter archive'],
    publish_status='draft',
    notify_followers=False,
    license='all-rights-reserved',
    canonical_url=None,
    chrome_path=None,
    save_markdown=False,
    table_conversion='chrome'
)

And that’s about it for a quick intro to scraping tweets without the need to apply for a Twitter Developer Account and with no limitations for the maximum number of tweets we can get or for how far back in time we can go.

9. What next? Sentiment analysis

What should you do next with the tweets you just scraped? In my case, I was very interested in NLP for sentiment analysis of tweets. You may also try topic modelling using Latent Dirichlet Allocation (LDA).
