Data Cleaning
The most important (and least glamorous) part of working with data is creating a data pipeline. This pipeline implements a cleaning process that transforms the data you find in the wild into a dataset you can use to answer your research question. Data cleaning includes…
- Joining multiple datasets together.
- Updating values to adhere to a standard.
  - e.g. survey data may ask for class year and you'll get "2024", "rising sophomore", "senior", etc., and your goal would be to translate them all into four categories: 2022, 2023, 2024, 2025 (see the sketch after this list).
- Manually splitting up data into categories or sorting data.
- Removing unnecessary data.
- Changing file formats to fit technical requirements.
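As a minimal sketch of updating values to a standard, here is one way to handle the class-year example above. The raw responses and the mapping are hypothetical; with real survey data you would build the mapping by inspecting the unique values you actually received:

```python
# Hypothetical raw survey responses for class year
raw_years = ['2024', 'rising sophomore', 'Senior', '2023', '2025']

# Hand-built mapping from each observed variant to a standard category.
# Which year a label like 'senior' corresponds to depends on when the survey ran.
YEAR_MAP = {
    '2024': 2024,
    'rising sophomore': 2024,
    'senior': 2022,
    '2023': 2023,
    '2025': 2025,
}

# Normalize capitalization and whitespace before looking up each response
clean_years = [YEAR_MAP[y.strip().lower()] for y in raw_years]
print(clean_years)  # [2024, 2024, 2022, 2023, 2025]
```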
This is also called "data munging", and is often the longest part of the research process. When drawing data from an API, however, we can expect data to follow strict guidelines. We'll be focusing on removing unnecessary data and changing Twitter data from the JSON we receive into a CSV that we can open in Excel or Google Sheets. Shifting Twitter data from JSON to CSV can turn a gigabyte of raw data into a megabyte of useful data, a 1000-fold reduction.
Turn JSON into CSV
```python
# import relevant libraries
import json
import requests
import pandas as pd

# load JSON from website
r = requests.get('https://raw.githubusercontent.com/kmcelwee/fsi-web-scraping-seminar/main/data/dog_feelings-tweets.json')
all_tweets = json.loads(r.text)

# Cycle through the tweets in that JSON and collect the information we care about
csv_rows = []
for tweet in all_tweets:
    csv_rows.append({
        'timestamp': tweet['created_at'],
        'id': tweet['id'],
        'text': tweet['full_text'],
        'favorite_count': tweet['favorite_count'],
        'retweet_count': tweet['retweet_count'],
        'hashtags': ';'.join([h['text'] for h in tweet['entities']['hashtags']])
    })

# Turn that list of dictionaries into a dataframe and save as a CSV
df = pd.DataFrame(csv_rows)
df.to_csv('dog_feelings-tweets.csv', index=False)
```
Here is a script that takes the JSON gathered from @dog_feelings tweets and turns it into a CSV with the most basic features, like `timestamp` and `favorite_count`. A good way to build a CSV is to create a list of dictionaries, as shown above. This would look something like:
```python
[
    {
        'timestamp': 'Wed May 10 03:16:19 +0000 2017',
        'id': 862144241782444038,
        'text': "good. night. don't let. the bed bugs. bamboozle",
        'favorite_count': 1856,
        'retweet_count': 376,
        'hashtags': ''
    },
    {
        'timestamp': 'Wed May 10 00:44:59 +0000 2017',
        'id': 862106154519977984,
        'text': 'a thing. to remember: good things. are good. BUT. bad things. are not good. i think',
        'favorite_count': 1562,
        'retweet_count': 291,
        'hashtags': ''
    }
    ...
]
```
This would create the following data frame:
| timestamp | id | text | favorite_count | retweet_count | hashtags |
| --- | --- | --- | --- | --- | --- |
| Wed May 10 03:16:19 +0000 2017 | 862144241782444038 | good. night. don't let. the bed bugs. bamboozle | 1856 | 376 | |
| Wed May 10 00:44:59 +0000 2017 | 862106154519977984 | a thing. to remember: good things. are good. BUT. bad things. are not good. i think | 1562 | 291 | |
Notice that each dictionary has the same keys; these become the column names, and the value associated with each key is that column's value for the given row. Getting your head around this structure is fundamental, so if you don't fully understand it, don't just skim over this.
The code above can be readily reused, as long as you can properly find the path to the tweet information that you need.
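One quick way to find those paths is to print a single tweet and inspect its structure. Here's a minimal sketch; the `screen_name` example assumes the data follows the standard v1.1 tweet layout, where the author's handle sits under the `user` object:

```python
# Peek at one tweet to see how the JSON is structured
first = all_tweets[0]
print(json.dumps(first, indent=2)[:1000])  # the first 1000 characters are usually enough

# Once you know the path, pull the field the same way the loop above does,
# e.g. the author's handle (assuming the standard v1.1 layout)
print(first['user']['screen_name'])
```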
Data validation
Data validation is an important part of data cleaning, but it can get complicated. Fundamentally, you want to make sure that the assumptions you have about your dataset are actually true. The Twitter API is very consistent, so you shouldn't encounter too many problems, but consider using Python's `assert` statement to double-check your assumptions about how the data is structured.
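For example, here's a minimal sketch of the kind of checks you might run on the tweet list loaded above. The specific assumptions tested (every tweet has a `full_text` field, no tweet appears twice) are illustrative, not exhaustive:

```python
# Check that the data is shaped the way the rest of the script assumes
assert isinstance(all_tweets, list), 'expected a list of tweets'
for tweet in all_tweets:
    assert 'full_text' in tweet, f"tweet {tweet.get('id')} has no full_text"

# Check that no tweet appears twice
ids = [tweet['id'] for tweet in all_tweets]
assert len(ids) == len(set(ids)), 'duplicate tweet ids found'
```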
I've outlined some hurdles I encountered when working with Twitter data and how I checked for them, but chances are, you won't encounter most of them if you keep to small datasets.
Case Study: MTV Internship
You're an intern at MTV, and your boss wants to know whether musicians who are active on Twitter get paid more than those who aren't.
🤔 What are some questions we should ask before pursuing this project?
Some examples:
- How will we consider artists that aren't on Twitter?
- What list of musicians should we consider?
- How will I get financial data for all the artists?
- Are different genders equally likely to be on Twitter? And what are the gender pay disparities in the music industry?
- How do we define "active on Twitter"? One tweet a week? A month?
- We'll have to manually tie artists to their Twitter accounts. How long will that take?
- How might outliers distort our calculation? The entertainment industry follows a power law, meaning a small number of people make the majority of the money. If Beyonce, who doesn't tweet as much, commands 10x the money of Cardi B, who tweets a lot, how would that one data point skew our numbers?
Tips
- Data work is both unfulfilling and time-consuming. Have a clear research question in mind before pursuing any kind of data cleaning process.
- Manually sorting data takes up more time than you think! Always run the calculations and weigh whether the research question you want to answer is worth the time that you expect to invest.