Introduction to working with APIs
An API ("Application Programming Interface", you don't need to remember that, no one does) is a collection of commands you can issue over the internet to collect data from websites. Collecting this data is usually done with code, but you can also see this data by just visiting URLs in your browser.
This URL https://api.github.com/users/princeton-cdh/repos will list information about all the code under the CDH's GitHub account. (If you're in Firefox, click on the "Raw Data" button to see things more clearly.) This is JSON data like any other, and you could write a quick script to list out all the codebases we've worked on.
The power of this is that as long as you follow this formula:
https://api.github.com/users/{USER}/repos
you can get the public codebases of any GitHub user. From that information we could then get a count of all the open assignments (GitHub calls them "issues") related to each project, so we know how much work is left to do. GitHub's API is complicated, but its documentation is worth a look if you're interested.
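Here's a minimal sketch in Python of what that quick script could look like, using the requests library. The open_issues_count field is one rough way to count outstanding work; note that the API returns up to 30 repositories per page by default.

```python
import requests  # pip install requests

# Fetch the CDH's public repositories; swap in any GitHub username
response = requests.get("https://api.github.com/users/princeton-cdh/repos")
repos = response.json()  # the JSON parses into a list of dictionaries

# Each repository dictionary includes a rough measure of outstanding work
for repo in repos:
    print(repo["name"], "-", repo["open_issues_count"], "open issues")
```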
If we didn't have an API, to accomplish something similar we'd have to scrape the CDH's profile page, which would be a lot harder. APIs store information in a clean, consistent way that makes it easy to collect (for better and for worse).
Wrappers
We just finished working with tweepy, a library that helps us work with the Twitter API in Python. This is an example of a wrapper. APIs are built to work with any language; all you need is an internet connection. But as we'll see, it can be clumsy to work directly with an API, so programmers built API wrappers like tweepy to make their code cleaner and easier to use.
If you're ever looking to work with an API (let's take Reddit as an example), it's worth googling "reddit api python wrapper", and you'll come across libraries that other people have used to connect to Reddit's API. For Python, I recommend praw, sketched below.
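To give a flavor of what a wrapper buys you, here's a rough sketch of connecting to Reddit with praw. The credentials are placeholders; you'd register an app with Reddit to get real ones.

```python
import praw  # pip install praw

# Placeholder credentials; register an app at reddit.com/prefs/apps
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="research-script by u/your_username",
)

# The wrapper handles the HTTP requests and JSON parsing behind the
# scenes and hands us friendly Python objects instead of raw data
for submission in reddit.subreddit("AskHistorians").hot(limit=5):
    print(submission.title, submission.score)
```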
Wrappers are simply supposed to make your life easier, but it's possible that you're going to work with APIs that don't have wrappers in the language you need. The New York Times API, for example, doesn't have a good Python wrapper. If you make one, publish it online, and you can help other programmers!
API Limitations
As we'll see, it's important to understand the limitations of an API as you embark on a research project. When companies like Spotify or Twitter open up their servers to connect with other computers, they want to impose rules to make sure no one spams their servers or scrapes their entire database too quickly. When considering how to pursue a research question, keep "rate limiting" and "data limiting" in mind.
Rate limiting
Rate limits are kind of like speed limits imposed by an API. If Twitter allowed anybody to scrape every single tweet ever, it would place immense strain on their infrastructure (and since data is their main commodity, they can sell this access for a hefty fee).
Rate limits are documented in Twitter's developer documentation. We have created one app, so in each 15-minute window we can execute 300 searches.
Since each search returns a maximum of 200 tweets, we can get 60,000 tweets (!) in 15 minutes. When just starting out, your research questions will rarely hit that limit, but if you need a lot of data for your analysis, your scrape may take days to execute if the API's rate limit is especially low. Smaller companies will probably have stricter limits on how much data you can collect from their sites. In the case of Facebook and TikTok, you can't scrape data at all.
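Conveniently, tweepy can handle Twitter's rate limits for you. A minimal sketch, with placeholder credentials (see "API Keys" below for how to keep real ones safe):

```python
import tweepy  # pip install tweepy

# Placeholder credentials; never hard-code real keys in a script
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

# wait_on_rate_limit tells tweepy to pause until the 15-minute window
# resets whenever Twitter reports we've used up our allowance
api = tweepy.API(auth, wait_on_rate_limit=True)
```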
Data Limiting
This data is highly valuable, and companies like Facebook have decided to cut developers off from accessing it. Spotify won't allow programmers to access song play counts (even though they're visible in its UI). Twitter limits you to a user's latest 3,200 tweets when scraping a timeline, and you can only get the latest 100 retweets on a tweet, which has led to problems in the past.
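You can see the timeline cap for yourself. Assuming api is an authenticated tweepy.API object like the one sketched above, paginating with a Cursor stops around 3,200 tweets no matter how prolific the account:

```python
import tweepy

# Assumes `api` is an authenticated tweepy.API object (see the rate
# limiting sketch above); each page returns up to 200 tweets
timeline = tweepy.Cursor(api.user_timeline, screen_name="Princeton", count=200)

tweets = [tweet for tweet in timeline.items()]
print(len(tweets))  # tops out around 3,200, however long the timeline
```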
When pursuing a research question, make sure that the resources you need to answer it will be fully included in the dataset you're collecting; otherwise you may be misinforming people.
API Keys
When you sign up to use an API like Twitter's, you're issued keys: credentials that identify your app to the service. If someone has your public and private key, they can log in to the API and make it appear that you are abusing the rate limit, or even post tweets as you (!). Never publish your private key anywhere.
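One common pattern is to keep keys out of your code entirely by reading them from environment variables, so the script itself contains no secrets and can be shared or published safely. The variable names here are just examples.

```python
import os

# Set these in your shell, not in the script, e.g.:
#   export TWITTER_API_KEY="..."
#   export TWITTER_API_SECRET="..."
API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET = os.environ["TWITTER_API_SECRET"]

# The script itself now holds no secrets and is safe to publish
```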
Legal
I AM NOT A LAWYER, THIS IS NOT LEGAL ADVICE
As long as you're using the API, you're not doing anything illegal, but web scraping in general (say, if I wrote a script to scrape all the email addresses from https://www.princeton.edu/) is a legal gray area. We won't go into how to scrape outside of APIs in these seminars, but consider that making thousands of requests to any website can cripple its infrastructure. This is called a Denial of Service (DoS) attack and is very much illegal. So follow best practices when collecting data outside the formal API process.
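The core of those practices is simple courtesy: pause between requests so your script never resembles an attack. A minimal sketch (the URLs are hypothetical):

```python
import time

import requests  # pip install requests

# Hypothetical pages to fetch, politely, one at a time
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url)
    # ... process response.text here ...
    time.sleep(1)  # wait a second so we never overwhelm the server
```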
Ethics
When working with Twitter data, most users know that Twitter is a public place and anyone can see their tweets, so there are fewer ethical considerations than when working with Reddit data, where people operate under an assumption of anonymity.
Still, even if you assume a tweet is public when you send it, how much control should you have over it after posting to a platform?
🤔 How would you feel if a tweet you wrote…
- …was included as an example in a book about internet culture?
- …was used to train a machine learning model to detect sarcasm?
- …included a picture that was used to train facial recognition software?
All of these are real-life examples. As you develop your research question, consider the ethical implications of collecting this data. If you scraped a user's timeline, was this person a public figure? If you scraped all tweets that used a certain hashtag, how many people are you collecting data from? Did any of these people expect their data to be collected by first-years at Princeton?
Popular APIs
The Twitter API
General structure
Twitter is exceptional among social media companies because it gives researchers access to almost its entire site. Twitter allows you to access the following databases:
- Tweets
- Users
- Direct Messages
- Lists
- Trends
- Media
- Places
🤔 Given these databases, what kind of questions could you ask?
What is a Twitter ID?
A Twitter ID is a unique identifier for each tweet. Having a unique ID is important for Twitter to store the thousands of tweets posted every second. If you have the ID of a tweet (perhaps from JSON data that you've scraped) and you want to see the original tweet in your browser, just follow this recipe:
https://twitter.com/{PROFILE}/status/{TWEET_ID}
# Example:
https://twitter.com/mattxiv/status/1368246126302945284
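In code, the recipe is just string formatting. In my experience Twitter redirects based on the ID alone, so the {PROFILE} part doesn't even have to match the real author (worth verifying yourself):

```python
# Build a browser URL from a tweet ID pulled out of scraped JSON data;
# the handle before /status/ is a placeholder, since Twitter redirects
# to the correct profile based on the ID
tweet_id = "1368246126302945284"
url = f"https://twitter.com/anyuser/status/{tweet_id}"
print(url)
```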
If you'd like, try using this tweet ID 1116487177364365313 to find the original tweet. Often, organizations like Documenting the Now will store tweet IDs only, and researchers can then "rehydrate" those IDs back into full tweets. Since there's a limited time window to collect tweets during momentous occasions, you're welcome to contribute to resources like these to preserve how people reacted online. Check out Documenting the Now's catalog.
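Documenting the Now also maintains twarc, a Python tool built for exactly this kind of work. A rough sketch of rehydration with its Python interface, using placeholder credentials (the exact interface varies by twarc version):

```python
from twarc import Twarc  # pip install twarc

# Placeholder Twitter credentials; twarc authenticates for you
t = Twarc("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# Turn a file of bare tweet IDs back into full JSON tweet objects
with open("tweet_ids.txt") as ids:
    for tweet in t.hydrate(ids):
        print(tweet["full_text"])
```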
Existing datasets
Before scraping your own datasets, consider that other people may have had questions similar to yours. Check out these sites before beginning your work to see if other researchers can save you time! People who create datasets want them to be put to good use, so they'll try to share them as widely as possible.
- Wikipedia has a lot of tabular data you can draw from.
- GitHub is mostly used for code, but researchers often post their datasets there as well.
- Documenting the Now, as mentioned above, maintains a huge catalog of tweet ID datasets.
- r/datasets contains a lot of datasets that people compiled for their research or for fun.