Introduction to Twitter Scraping and Analysis with Python
A collection of web scraping case studies, exercises, tutorials, and resources for the 2021 FSI summer course Humanistic Approaches to Media and Data.
Tools
We will program exclusively in Google Colab. To interact with the Twitter API, we will use tweepy, a Python wrapper.
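To give a sense of what working with tweepy can look like, here is a minimal sketch rather than the course's own code. It assumes tweepy 4.x is installed and that you have a bearer token from your developer account. The environment variable name `TWITTER_BEARER_TOKEN` and the helper names `make_client` and `recent_tweet_texts` are hypothetical, chosen only for illustration.

```python
import os


def make_client():
    """Create a tweepy client from a bearer token kept in an environment variable."""
    import tweepy  # imported here so the helper below can be read and tested on its own

    # TWITTER_BEARER_TOKEN is a hypothetical variable name; use whatever you set.
    return tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])


def recent_tweet_texts(client, query, limit=10):
    """Return the text of up to `limit` recent tweets matching `query`."""
    response = client.search_recent_tweets(query=query, max_results=limit)
    # response.data is None when nothing matches, so fall back to an empty list.
    return [tweet.text for tweet in (response.data or [])]
```

With a valid token in place, `recent_tweet_texts(make_client(), "digital humanities")` should return a list of strings. Keeping the token in an environment variable instead of the notebook itself makes it harder to share your credentials by accident.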
Prerequisites
The seminars assume that you already have an approved Twitter Developer Account. If you do not have one yet, apply for access to the Twitter API. A basic understanding of Python is also expected.
Coding Resources
- YouTube Python Tutorials
- O’Reilly Learn: A collection of every programming book you’ll ever need. A subscription is free with your netID and offers video lectures, introductory books, and answers to frequently asked programming questions.
- Free programming books: Tons of programming materials are free online. It doesn’t matter too much how you get started, as long as you just start!
Useful Links
- Twitter Documentation
- Getting Started with the Twitter API (v2)
- Tweepy Documentation
- StackOverflow
- Google Colaboratory
- YouTube Python Tutorials
- Practice coding Python online
- YouTube Pandas Tutorials
- Introduction to Cultural Analytics & Python
- Coursera Python Web Data course
Vocabulary
- Web scraping: Collecting data from the internet in an automated way
- Python: The language most often used for writing quick scripts for web scraping and data analysis. If you are using an API, there’s a good chance that there’s a wrapper in Python. Python has also been embraced by the data science community and has many libraries to support data cleaning and visualization.
- API: Application Programming Interface. In the context of web scraping, it is a system that website owners use to monitor and control how data leaves their platform.
- HTTP Methods: (e.g. POST, GET, PUT, DELETE) For web scraping we’re only interested in “GET requests”, which ask the website’s server for information. With that request, you include the type of information you need and, usually, an authorization token.
- Rate limiting: The speed limit placed on programmers that prevents them from making too many requests at once and overworking a site’s servers. This varies from site to site.
- JSON: JavaScript Object Notation. A lightweight data format used throughout the web. If you receive a response through an API, it will almost certainly be in this format.
- Wrapper: In web scraping, a wrapper is a library of code that translates the API into a language that you’re comfortable programming in.
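Several of the terms above can be seen together in a short sketch that uses only the Python standard library. It builds a GET request with an authorization token (without sending it), parses a JSON response body, and spaces out calls to stay under a rate limit. The endpoint URL, the token, the sample response, and the function names are all hypothetical placeholders, not part of any real API.

```python
import json
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint and token, for illustration only.
BASE_URL = "https://api.example.com/2/items/search"
TOKEN = "YOUR_TOKEN_HERE"


def build_get_request(query, token=TOKEN):
    """Build (but do not send) a GET request carrying a query and an auth token."""
    url = BASE_URL + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def parse_items(body):
    """Turn a JSON response body into a list of text strings."""
    payload = json.loads(body)
    return [item["text"] for item in payload.get("data", [])]


def throttled(calls, min_interval=1.0, sleep=time.sleep):
    """Run each zero-argument callable, pausing between calls to respect a rate limit."""
    results = []
    for i, call in enumerate(calls):
        if i > 0:
            sleep(min_interval)  # wait before every call except the first
        results.append(call())
    return results


request = build_get_request("digital humanities")
# A made-up response body in the shape many JSON APIs use.
sample_body = '{"data": [{"id": "1", "text": "hello"}, {"id": "2", "text": "world"}]}'
texts = parse_items(sample_body)
```

A wrapper such as tweepy does roughly this work for you: it assembles the request, attaches your token, and hands back Python objects instead of raw JSON.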
Thank you
Thank you to Kavita Kulkarni for her guidance in constructing these seminars. And thank you to Melanie Walsh for basically creating this course first 😄
Contact
My contact information is available on the Center for Digital Humanities website.