- Introduction to Pandas đź
Introduction to Pandas đź
If you need to perform any kind of calculation with tabular data, use Pandas. Pandas is a library that helps programmers query, perform calculations on, and clean CSVs. If you get nothing else from these seminars, learning Pandas is probably the most useful investment of your time. Regardless of what you end up doing, youâre inevitably going to work with data in a table, and Pandas is a lot faster than Excel or Google Sheets.
Pandas has been to space. Itâs used on Wall Street and the Federal Reserve. Itâs used by digital humanitists and data journalists. Itâs powerful, and thankfully, if you know how to code, itâs pretty straightforward to learn!
What are data frames and series?
A âdata frameâ is a word youâll hear a lot. Itâs what Pandas (as well as other data manipulation software) calls the table object. Itâs also why the most common variable used to store a CSV is df
. In Pandas vernacular, âseriesâ just means âcolumnâ. A data frame contains multiple series objects.
Read and write a CSV
For all our examples weâll be using our CSV of @dog_feelings
tweets. In order to load a CSV you just need to provide a path to a CSV file, whether thatâs on your computer or on the internet.
url = 'https://raw.githubusercontent.com/kmcelwee/fsi-web-scraping-seminar/main/data/dog_feelings-tweets.csv'
df = pd.read_csv(url)
After changing your CSV, you may want to save it as a file, for that, just use the .to_csv()
method, where you just provide a filename where you want to save your data frame. (Just as a warning though, this will overwrite anything with the same name!)
df.to_csv('dog_feelings-tweets-v2.csv', index=False)
In general, just add index=False
. Itâs not worth going into the details here, but put succinctly, youâll get an extra column if you donât include that flag.
Selecting columns
To view all columns:
df.columns
To work with one column:
# Select the 'retweet_count' column
df['retweet_count']
To work with a subset of multiple columns:
# Select the 'retweet_count' and 'favorite_count' columns
df[['retweet_count', 'favorite_count']]
To create a new column, all you need to do is define it. For example, if we wanted to add favorites and retweets together for some reason, just add them together and set them equal to a new column:
df['retweets_plus_favorites'] = df['retweet_count'] + df['favorite_count']
There are many ways to apply functions to columns to manipulate them, but with this baseline knowledge, I trust you can Google appropriately. If you want to learn more, check out this YouTube lecture:
Selecting rows
In order select a subset of rows we often want to filter on a certain condition. For example, what if we wanted to only get tweets that had more than 1300 favorites, we could make whatâs called a âmaskâ. First we need to create an array of True and False (boolean) values:
df['favorite_count'] > 1300
The output of which would be:
0 True
1 True
2 True
3 True
4 True
...
1124 False
1125 False
1126 False
1127 False
1128 True
You can then apply this as mask to the data frame by using square brackets like so:
df[df['favorite_count'] > 1300]
This will select only the rows where df['favorite_count'] > 1300
is True.
Working with different datatypes
To see the different datatypes in our df
, type df.dtypes
. Youâll get the following result:
timestamp object
id int64
text object
favorite_count int64
retweet_count int64
hashtags object
Weâre oversimplifying here, but int64
means that the column contains an integer and the object
columns here are strings.
When we want to filter by date though (which we often want to do), we need to parse the timestamp column as a âdatetimeâ object. You do this with the pd.to_datetime
function. You can you can overwrite the timestamp
column like so:
df['timestamp'] = pd.to_datetime(df['timestamp'])
If you look at df.dtypes
again, youâll see that the data type of the timestamp
column is datetime64[ns, UTC]
. Now we can create new columns like âmonthâ or âday-of-weekâ using the .dt
accessor.
df['month'] = df['timestamp'].dt.month
df['day-of-week'] = df['timestamp'].dt.dayofweek
Most common functions
.shape
Often youâll want to get the number of rows or columns in a data frame. You can access this info using the .shape
attribute:
# return a list that gives the shape of the data frame
df.shape
# >>> (1129, 6)
# number of rows in data frame
df.shape[0]
# number of columns in a data frame
df.shape[1]
.head()
& .tail()
If you just want to see the first few or last few rows of a data frame, use the .head()
and .tail()
methods.
# Show the first 5 rows of a df
df.head()
# Show the first 10 rows of a df
df.head(10)
# Show the last 5 rows of a df
df.tail()
# Show the last 10 rows of a df
df.tail(10)
.sort_values()
Sort a dataframe by the given column.
# Sort df by the column 'favorite_count'
df.sort_values('favorite_count')
# Sort df by the column 'favorite_count' in ascending order
df.sort_values('favorite_count', ascending=False)
.count()
If called on a data frame, the count function gives the number of non-null values in all columns. If called on a series, the count function will return the number of non-null values in that column.
# How many non-null values are in each column?
df.count()
# How many non-null values are in the column 'hashtags'
df['hashtags'].count()
.mean()
, .median()
, .min()
, .max()
, and .quantile()
When provided numerical data, you will inevitably want to find various statistics on that distribution. Here are some of the most popular:
# Average number of retweets per tweet
df['retweet_count'].mean()
# Median number of retweets per tweet
df['retweet_count'].median()
# Minimum number of retweets per tweet
df['retweet_count'].min()
# Max number of retweets per tweet
df['retweet_count'].max()
# 25th quantile of retweets per tweet
# (meaning 25 percent of tweets have fewer retweets than this number)
df['retweet_count'].quantile(.25)
Question: Why might you prefer median instead of mean as a way to describe your data?
View Solution
Medians are less sensitive to outliers, meaning that one large or really small numbers won't dramatically change your calculation.
This is most common when talking about incomes. Especially in the US, economists will discuss "median household income." This is because the "mean household income" is significantly higher because economic inequality. Mean household income would be a poor reflection of the actual state of the economy.
For example, if you and three friends made $30k, $45k, $45k, and $50k/year, the mean would be $45k/year. If Jeff Bezos was added to the mix, the mean just rose to a few billion, not because you are any richer. The median, however, stayed the same at $45k.
.apply()
If youâve ever used Excel or Google Sheets, youâve probably wanted to take one column and apply some kind of function to it. In Pandas, the .apply()
function does that for you. If you have a series, and want to use each of those rows to create a new column (and you arenât doing simple arithmetic), then just feed a function into this method. This is best explained by example:
def sentence_count(tweet_text):
return len([sentence for sentence in tweet_text.split('.') if sentence != ''])
df['sentence_count'] = df['text'].apply(sentence_count)
Here we apply the new function sentence_count
to each value in the df['text']
column and assign it to a new column 'sentence_count'
.
Question: The sentence_count
function isnât entirely accurate. Can you think of why that might be?
View Solution
Sometimes the tweet separates sentences by newline instead of a period. Sometimes the tweet has enumerated points (e.g "1. ...", 2. ...", etc). When making these functions, you'll often have to continually test to make sure you are considering these "corner cases".
.str.contains()
This method returns true if the strings in your column contain a given substring. This is useful when filtering rows in our dataset. For example, if we wanted to know what tweets contain the word âgooooobâ:
df[df['text'].str.contains('gooooob')]
.groupby
.groupby()
isnât the most straightforward function, but it is very useful in many contexts. Many times youâll want to group categorical data together, and run a calculation for each of those groups. For example, if you wanted to know what the average number of favorites is per day of the week, hereâs what that would look like:
# Ensure that the "timestamp" column is appropriately parsed as a datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Create a "day-of-week" column
df['day-of-week'] = df['timestamp'].dt.dayofweek
# Groupby "day-of-week", select the "favorite_count" column and get the mean
df.groupby('day-of-week')['favorite_count'].mean()
The output should look like this:
day-of-week
0 65683.261538
1 77871.142857
2 70649.446927
3 74367.388235
4 48420.922619
5 60879.256198
6 70115.652893
Name: favorite_count, dtype: float64
groupby()
is a function that can be difficult to get a handle on, so I wonât expand too much further. I recommend the following video if you need to learn more in order to answer your research question:
Plotting with Pandas
Thankfully, plotting with Pandas (as long as youâre keeping things simple) is pretty straightforward. Just add .plot()
to whatever series youâre working with. The default is a line chart, but if you want a bar chart, just set kind='bar'
. To set a title, just set title='My awesome graph'
. For everything else, just google âHow to change X in a graph with Pandasâ. Pandas is built on Matplotlib so familiarizing yourself with that library may be helpful, but in general, you shouldnât need to do too much outside of just changing that plot function.
đ Exercises
If you donât know how to proceed, try googling the solution. Prepare an answer before clicking the âView Solutionâ button.
1. How many tweets are were sent in 2019?
2. What was the average number of retweets that a tweet would get in 2017? 2018? 2019? âŚetc
3. How many tweets contain the word âgooooobâ?
4. How many tweets are in all caps?
5. Whatâs the average ratio of favorites to retweets?