Skip to content
## Importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Python intensive, day 5

Pandas

Motivating Example

Today we are going to continue learning to apply python to data science by introducing another python library that is similar to numpy called pandas.

Consider: based on what we've learned the past several days, what are some limitations of numpy? Can you think of any tasks you might want to do or analysis you might like to perform that would be difficult with numpy? Does this give you a guess as to what pandas specializes in?

Answer: numpy is specialized primarily for numerical operations, e.g. matrix multiplication, vector math, etc., but is more limited when dealing with other data types such as string, python objects, etc. In contrast, pandas objects are able to handle mixed data easily! As you will often run into this type of data when doing bioinformatics, pandas can be very useful.

Before we dive into the syntax, let's take a look at an example real-world application of pandas for a task that you might commonly face in biology. We are going to use the "Palmer penguins" dataset, which is a collection of various biometric data for several different penguin species and is a commonly used example dataset. Let's take a quick look at what the data looks like.

In the Palmer penguins dataset, each row represents an individual penguin, and each column represent a different measurement or characteristic of the penguin, such as its body mass or island of origin. The data are organized in this way so that variables (things we may want to compare against each other) are the columns while observations (the individual penguins) are the rows. This is a common way to organize data in data science and is called tidy data. Tidy data formatting also makes it easy to use code to manipulate and analyze, which we will see in this lesson.

penguins = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-28/penguins.csv')
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007

Here is an example of a transformation that we will be able to do with pandas that would be difficult to do manually or with numpy. We can summarize the data by calculating the average body mass (in kg) of each penguin species, broken up by sex. Using a few lines of code we can go from our raw data to a table that looks like this:

species sex body_mass_kg
Adelie female 3.368836
Adelie male 4.043493
Chinstrap female 3.527206
Chinstrap male 3.938971
Gentoo female 4.679741
Gentoo male 5.484836

Now, let's get started learning how this is done!

Pandas Series

A Series is the simplest data structure in Pandas. They are one dimensional (1D) objects composed of a single data type of any variety (string, integers); you can basically think of them as a single column in a spreadsheet. They are similar to arrays in numpy, however unlike those other 1D structures Series also have label-based indexing, meaning each element in the object can be accessed by specifying its specific label. In that way, they are similar to dictionaries in python.

We can manually create a Series in several ways:

Using the pd.Series() function, we provide it the data we want to store as a list, and optionally we can give each row of the data a label using the index argument. If we don't give it the index argument, it will automatically assign a numerical index to each row starting from 0.

When we print the Series, it will display as a column with the index on the left and the data on the right. The type of data being held in the series will be displayed at the bottom of the output.

#Using the pd.Series method:
s0 = pd.Series([10, 20, 30, 40])

print(s0)
0    10
1    20
2    30
3    40
dtype: int64
#Using the pd.Series method:
s1 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s1)
a    10
b    20
c    30
d    40
dtype: int64

Another way to create a Series is to convert a (non-nested) dictionary into a Series. The keys of the dictionary will become the index labels while the values will become the data.

# Converting from dictionary to series
my_dictionary = {'first': 10, 'second': 20, 'third': 30}
s2 = pd.Series(my_dictionary)

print(s2)
first     10
second    20
third     30
dtype: int64

We can then access specific elements in the Series by referring to its index label enclosed in quotes and brackets. This is very similar to how a dictionary works!

print(s0[0])

print(s1["a"])

print(s2["second"])
10
10
20

Multi-indexed Series

Series objects may have multiple levels of indices. We call this multi-indexed. Using layers of indexing is a way of representing two-dimensional data within a one-dimensional Series object. Some people really like using multi-indexed Series. You can create a multi-indexed series by passing a list of lists to the index argument of the pd.Series() function. The first list will be the outermost level of the index, the second list will be the next level, and so on.

my_index = [["California", "California", "New York", "New York", "Texas", "Texas"], 
            [2001, 2002, 2001, 2002, 2001, 2002]]
my_values = [1.5, 1.7, 3.6, 4.2, 3.2, 4.5]

s3 = pd.Series(my_values, index=my_index)

print(s3)
California  2001    1.5
            2002    1.7
New York    2001    3.6
            2002    4.2
Texas       2001    3.2
            2002    4.5
dtype: float64

Retrieving an item from this data structure is similar to a nested dictionary, using successive [] notation. Or, you can passs it a tuple. You must pass the index labels in the order they were created (left to right)

print(s3["California"])

print("---")
print(s3["California"][2001])

print("---")
print(s3[("California", 2001)])

print("---")
# you can also use slicing to select multiple elements
print(s3["California":"New York"])
2001    1.5
2002    1.7
dtype: float64
---
1.5
---
1.5
---
California  2001    1.5
            2002    1.7
New York    2001    3.6
            2002    4.2
dtype: float64

In our work, we typically don't use multi-indexed Series. However, they are often the output of pandas functions, so it's good to know how to work with them. If you don't like the idea of multi-indexed Series, you can always convert them to a DataFrame using the reset_index() method.

s3.reset_index()
level_0 level_1 0
0 California 2001 1.5
1 California 2002 1.7
2 New York 2001 3.6
3 New York 2002 4.2
4 Texas 2001 3.2
5 Texas 2002 4.5

Pandas DataFrame

While Series is a "one-dimensional" data structure, DataFrames are two-dimensional. Where Series can only contain one type of data, the pandas DataFrame can have a combination of numerical and categorical data. Additionally, DataFrames allow you do have labels for your rows and columns.

DataFrames are essentially a collection of Series objects. You can also think of python DataFrames as spreadsheets from Excel or dataframes from R.

Let's manually create a simple dataframe in pandas to showcase their behavior. In the below code, we create a dictionary where the keys are the column names and the values are lists of data. We then pass this dictionary to the pd.DataFrame() function to create a DataFrame.

When we print the DataFrame, it will display as a table with the column names at the top and the data below. The index (in this case, automatically generated numerical index starting at 0) will be displayed on the left side of the table.

tournamentStats = {
    "wrestler": ["Terunofuji", "Ura", "Shodai", "Takanosho"],
    "wins": [13, 6, 10, 12],
    "rank": ["yokozuna", "maegashira2", "komusubi", "maegashira6"]
}

#Converting to a pandas DataFrame
sumo = pd.DataFrame(tournamentStats)

print(sumo)
     wrestler  wins         rank
0  Terunofuji    13     yokozuna
1         Ura     6  maegashira2
2      Shodai    10     komusubi
3   Takanosho    12  maegashira6

Pandas dataframes have many attributes, including shape, columns, index, dtypes. These are useful for understanding the structure of the dataframe.

print(sumo.shape)

print("---")
print(sumo.columns)

print("---")
print(sumo.index)

print("---")
print(sumo.dtypes)
(4, 3)
---
Index(['wrestler', 'wins', 'rank'], dtype='object')
---
RangeIndex(start=0, stop=4, step=1)
---
wrestler    object
wins         int64
rank        object
dtype: object

Pandas DataFrames also have the handy info() function that summarizes the contents of the dataframe, including counts of the non-null values of each column and the data type of each column.

sumo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   wrestler  4 non-null      object
 1   wins      4 non-null      int64 
 2   rank      4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes

Selecting data in a Pandas dataframe

As with series objects, pandas dataframes rows and columns are explicitly indexed, which means that every row and column has a label associated with it. You can think of the explicit indices as the being the names of the rows and the names of the columns.

Unfortunately, in pandas the syntax for subsetting rows v.s. columns is different and can get a little confusing, so let's go through several different use cases.

Selecting columns

We can always check the names of the columns in a Pandas dataframe byt using the built-in .columns method, which simply lists the column index:

sumo.columns
Index(['wrestler', 'wins'], dtype='object')

If we want to refer to a specific column, we can specify its index (enclosed in double quotes) inside of square brackets [] like so:

#Single column:
sumo["wrestler"]
0    Terunofuji
1           Ura
2        Shodai
3     Takanosho
Name: wrestler, dtype: object

If we want to refer to multiple columns, we need to pass the columns as a list by enclosing the column indices in square brackets, so you will end up with double brackets:

#Multiple columns (note the double []!):
sumo[["wrestler", "rank"]]
wrestler rank
0 Terunofuji yokozuna
1 Ura maegashira2
2 Shodai komusubi
3 Takanosho maegashira6

Selecting rows:

The syntax for selecting specific rows is slightly different. Let's first check the labels of the row index; to do this we use the .index method:

print(sumo.index)
RangeIndex(start=0, stop=4, step=1)

Here we can see that while the column index labels were strings, the row index labels are numerical values, in this case 0 thru 3. If we wanted to pull out the first row, we need to specify its index label (0) in combination with the .loc method (which is required for rows):

sumo.loc[0]

If we want to select multiple rows, like with columns we need to pass it as a list using the double brackets. If we want to specify a range of rows (i.e. from this row to that row), we don't use double brackets and instead use ::

print(sumo.loc[[0,1]])
print(sumo.loc[0:2])

Note that in this case the row index labels are numbers, but do not have to be numerical, and can have string labels similar to columns. Let's show how we could change the row index labels by taking the column with the wrestler's rank and setting it as the index label (note that the labels should be unique!):

sumo = sumo.set_index("rank")

print(sumo)
               wrestler  wins
rank                         
yokozuna     Terunofuji    13
maegashira2         Ura     6
komusubi         Shodai    10
maegashira6   Takanosho    12
sumo.loc["yokozuna"]
wrestler    Terunofuji
wins                13
Name: yokozuna, dtype: object

We also need to use .loc if we are referring to a specific row AND column, e.g.:

print(sumo.loc["komusubi", "wrestler"])
Shodai

If we want to purely use numerical indexing, we can use the .iloc() method. If you use .iloc(), you can index a DataFrame just as you would a numpy array.

# Select the first two rows and the first two columns

sumo.iloc[0:2, 0:2]
wrestler wins
0 Terunofuji 13
1 Ura 6

There are many ways to select subsets of a dataframe. The rows and columns of a dataframe can be referred to either by their integer position or by their indexed name. Typically, for columns, you'll use the indexed name and can just do [] with the name of the column. For rows, if you want to use the integer position, you will use .iloc[]. If you want to use the index name, you will use .loc[].

For reference, here's a handy table on the best ways to index into a dataframe:

Action Named index Integer Position
Select single column df['column_name'] df.iloc[:, column_position]
Select multiple columns df[['column_name1', 'column_name2']] df.iloc[:, [column_position1, column_position2]]
Select single row df.loc['row_name'] df.iloc[row_position]
Select multiple rows df.loc[['row_name1', 'row_name2']] df.iloc[[row_position1, row_position2]]

Exercise: we'll use the penguins dataset from our initial example. 1) Print the 'species' column 2) Print the first five columns and first five rows 3) Print the columns "species", "island", and "sex" and the first ten rows of the dataframe

penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
# Your code here

print(penguins['species'])

print("---")
print(penguins.iloc[0:5,0:5])

print("---")
print(penguins.loc[0:10, ['species', 'island', 'sex']])
0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
         ...    
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object
---
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm
0  Adelie  Torgersen            39.1           18.7              181.0
1  Adelie  Torgersen            39.5           17.4              186.0
2  Adelie  Torgersen            40.3           18.0              195.0
3  Adelie  Torgersen             NaN            NaN                NaN
4  Adelie  Torgersen            36.7           19.3              193.0
---
   species     island     sex
0   Adelie  Torgersen    male
1   Adelie  Torgersen  female
2   Adelie  Torgersen  female
3   Adelie  Torgersen     NaN
4   Adelie  Torgersen  female
5   Adelie  Torgersen    male
6   Adelie  Torgersen  female
7   Adelie  Torgersen    male
8   Adelie  Torgersen     NaN
9   Adelie  Torgersen     NaN
10  Adelie  Torgersen     NaN

Importing data to pandas

One of the most useful features of pandas DataFrames is its ability to easily perform complex data transformations. This makes it a powerful tool for cleaning, filtering, and summarizing tabular data. As shown above, we can manually create a DataFrame from scratch, but more commonly you will want to read in data from an external source, such as the output of a bioinformatic program, and do some manipulation of it. Let's read some data into a DataFrame to demonstrate.

Below you can see an example of how to read files into pandas using the pd.read_csv() function. The csv stands for 'comma-separated values', which means by defaults it will assume that our columns are separated by commas; if we wanted to change the delimiter (e.g. in the case of a tab-separated file), we can set the delimiter explicitly using the sep= argument.

penguins = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-28/penguins.csv", sep=',')

# The head() function from pandas prints only the first N lines of a dataframe (default: 10)
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007

When importing data into a DataFrame, pandas automatically detects what data type each column should be. For example, if the column contains only numbers, it will be imported as an floating point or integer data type. If the column contains strings or a mixture of strings and numbers, it will be imported as an "object" data type. Below are the different data types for the penguins column.

penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB

Looping through a dataframe

As a note, if we want to go through a dataframe line-by-line (i.e. row by row), because both the rows and columns are indexed it requires slightly more syntax than looping through other data structures (e.g. a dictionary or list). Specifically we need to use the .iterrows() method to make the data frame iterable. The .iterrows() method outputs each row as a Series object with a row index and the column:

for index, row in penguins.iterrows():
    print(f"Row index: {index}, {row['species']}, {row['island']}")

This can be slow for very large dataframes, but is useful if you need to perform actions on individual rows.

Modifying a dataframe

Now that we have our external data read into a DataFrame, we can begin to work our magic. If you have ever worked with tidyverse in the R language some of this might look familiar to you, as pandas serves a similar role and can do many of the same functions. Let's look at several useful common examples.

Filtering

When we discussed indexing, we looked at how we can select specific rows and columns in a dataframe, but often we will want to select rows based on a certain condition, e.g. in the dataframe that we just imported, only take the rows where the 'body_mass_g' column has a value greater than a given number. We can do this easily by specifying which column to filter on and a boolean statement, like so:

print(penguins[penguins['flipper_length_mm'] == 181.0])

#Saving as new data frame:
penguins_filtered = penguins[penguins['body_mass_g'] > 3300]

penguins_filtered.head()
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0       Adelie  Torgersen            39.1           18.7              181.0   
6       Adelie  Torgersen            38.9           17.8              181.0   
38      Adelie      Dream            37.6           19.3              181.0   
58      Adelie     Biscoe            36.5           16.6              181.0   
108     Adelie     Biscoe            38.1           17.0              181.0   
293  Chinstrap      Dream            58.0           17.8              181.0   
296  Chinstrap      Dream            42.4           17.3              181.0   

     body_mass_g     sex  year  
0         3750.0    male  2007  
6         3625.0  female  2007  
38        3300.0  female  2007  
58        2850.0  female  2008  
108       3175.0  female  2009  
293       3700.0  female  2007  
296       3600.0  female  2007  
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
6 Adelie Torgersen 38.9 17.8 181.0 3625.0 female 2007

We can get more advanced with our filtering logic by adding multiple conditions and the following logical operators:

  • &: "and"
  • |: "or"
  • ~: "not"

Note that these are different logical operators than we have previously learned in base python!

If using multiple conditions, be sure to enclose each of them in parentheses!

penguins_filtered = penguins[(penguins['body_mass_g'] > 3300) & (penguins['bill_length_mm'] > 38)]

penguins_filtered.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
6 Adelie Torgersen 38.9 17.8 181.0 3625.0 female 2007
7 Adelie Torgersen 39.2 19.6 195.0 4675.0 male 2007

Pandas also has a helper function called .isin() that is similar to the in operator in base python. It allows you to filter a dataframe based on whether a column value is in a list of values.

penguins[penguins['species'].isin(['Adelie', 'Gentoo'])]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
271 Gentoo Biscoe NaN NaN NaN NaN NaN 2009
272 Gentoo Biscoe 46.8 14.3 215.0 4850.0 female 2009
273 Gentoo Biscoe 50.4 15.7 222.0 5750.0 male 2009
274 Gentoo Biscoe 45.2 14.8 212.0 5200.0 female 2009
275 Gentoo Biscoe 49.9 16.1 213.0 5400.0 male 2009

276 rows × 8 columns

Exercise: filter the dataframe to only keep the birds observed in the year 2007 and with a bill_length_mm greater than 38mm

# Your code here

penguins[(penguins["year"] == 2007) & (penguins["bill_length_mm"] > 38)]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
6 Adelie Torgersen 38.9 17.8 181.0 3625.0 female 2007
... ... ... ... ... ... ... ... ...
297 Chinstrap Dream 48.5 17.5 191.0 3400.0 male 2007
298 Chinstrap Dream 43.2 16.6 187.0 2900.0 female 2007
299 Chinstrap Dream 50.6 19.4 193.0 3800.0 male 2007
300 Chinstrap Dream 46.7 17.9 195.0 3300.0 female 2007
301 Chinstrap Dream 52.0 19.0 197.0 4150.0 male 2007

89 rows × 8 columns

We can also filter based on strings, not just numbers! For this, you will want to use a string matching function from python, such as .str.contains() (which also a partial match), .str.startswith() (checks to see if a value starts with a given string), or others.

penguins_filtered = penguins[penguins['species'].str.contains('Adel', case=False)]

penguins_filtered
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
147 Adelie Dream 36.6 18.4 184.0 3475.0 female 2009
148 Adelie Dream 36.0 17.8 195.0 3450.0 female 2009
149 Adelie Dream 37.8 18.1 193.0 3750.0 male 2009
150 Adelie Dream 36.0 17.1 187.0 3700.0 female 2009
151 Adelie Dream 41.5 18.5 201.0 4000.0 male 2009

152 rows × 8 columns

In the above example, this code looks for the string 'Adel' (case-insensitive, as we specify case=False) and only takes rows that contain the string somewhere in the species column.

Missing data in DataFrames

Missing data in a DataFrame is represeted by NaN (Not a Number). Pandas handles missing data in some specific ways, which we will discuss in this section.

Missing numbers are propagated through the DataFrame when doing arithmetic operations. In the example below, when we try to sum the two columns together element-wise, any row where one of the columns has a missing value will result in the sum being NaN.

ser1 = pd.Series([np.nan, np.nan, 2, 3])
ser2 = pd.Series([1, 2, np.nan, 4])
ser1 + ser2
0    NaN
1    NaN
2    NaN
3    7.0
dtype: float64

When using descriptive statistics and computational methods like .sum(), .mean(), pandas will ignore missing values and treat them like zero.

print(ser1.sum())

print(ser1.mean())
5.0
2.5

This behavior can be changed by using the skipna argument, which is True by default. If you set skipna=False, pandas will treat missing values as NaN and will not ignore them, resulting in the whole operation returning NaN.

print(ser1.mean(skipna=False))
nan

One very useful function to know is how to get rid of rows with missing data in them, as including them can often cause errors in downstream analysis or skew your results. There is a convenient function built in to pandas that does this called .dropna(). By default, it will drop any row that has a missing value in any column. This may not always be what you want. You can specify which columns to look at using the subset arugment.

penguins_nona = penguins.dropna()

penguins_nona.info()
<class 'pandas.core.frame.DataFrame'>
Index: 333 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 23.4+ KB
penguins_nona_bill_len = penguins.dropna(subset=["bill_length_mm"])

penguins_nona_bill_len.info()
<class 'pandas.core.frame.DataFrame'>
Index: 342 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            342 non-null    object 
 1   island             342 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               342 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 24.0+ KB

Alternatively, you may want to fill in missing values with a specific value. You can do this using the .fillna() method.

Calculating new columns

Let's say we will want to create a new column in our dataframe by applying some function to existing columns. For instance, in the dataframe of penguins data that we have been using, we want a column that normalizes body_mass_g column by performing a z-transform. A z-transform is the value minus the mean of the column divided by the standard deviation of the column.

First, we put the name of the new column, body_mass_z in square brackets after our penguin DataFrame variable. Then we pull out the body_mass_g column and perform the calculation, using helper methods like .mean() and .std(), on the right side of the assignment operator. This syntax is similar to creating a new key in a dictionary.

# Z-transform the body mass column

penguins["body_mass_z"] = (penguins["body_mass_g"] - penguins["body_mass_g"].mean()) / penguins["body_mass_g"].std()

penguins.head(10)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year body_mass_z
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007 -0.563317
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007 -0.500969
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007 -1.186793
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007 NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007 -0.937403
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007 -0.688012
6 Adelie Torgersen 38.9 17.8 181.0 3625.0 female 2007 -0.719186
7 Adelie Torgersen 39.2 19.6 195.0 4675.0 male 2007 0.590115
8 Adelie Torgersen 34.1 18.1 193.0 3475.0 NaN 2007 -0.906229
9 Adelie Torgersen 42.0 20.2 190.0 4250.0 NaN 2007 0.060160

Exercise: We can use multiple columns in the calculation of the new column. Create a column that contains the volume of the penguin's beak by assuming it is a cylinder, with bill_length_mm as the height and bill_depth_mm as the diameter.

Hint: The volume of a cylinder is given by the formula $V = \pi r^2 h$, where $r$ is the radius (half the diameter) and $h$ is the height.

An illustration of a penguin's head with 'bill length' and 'bill depth'. The bill length is the distance from the tip of the bill to the base, while the bill depth is the distance from the top of the bill to the bottom.  Note: In the raw data, bill dimensions are recorded as 'culmen length' and 'culmen depth'. The culmen is the dorsal ridge atop the bill.

# Your code here

penguins["bill_volume"] = (penguins["bill_depth_mm"]/2)**2 * np.pi * penguins["bill_length_mm"]

penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year bill_volume
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007 10738.654055
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007 9392.592344
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007 10255.100899
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007 NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007 10736.693701

Summarizing your data

Another common task in data analysis is to calculate summary statistics of your data. Pandas as a number of helper methods like .mean(), .median(), .count(), unique(), etc. that can be used to describe your data. Pandas DataFrames has a handy .describe() method that will give you a summary of the data in each column. By default, it will calculate the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum of each numerical column. However, if you give it the include='all' or include='object' argument, it will also include the count, unique, top, and freq (of top) of each categorical column.

penguins.describe(include='all')
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year bill_volume
count 344 344 342.000000 342.000000 342.000000 342.000000 333 344.000000 342.000000
unique 3 3 NaN NaN NaN NaN 2 NaN NaN
top Adelie Biscoe NaN NaN NaN NaN male NaN NaN
freq 152 168 NaN NaN NaN NaN 168 NaN NaN
mean NaN NaN 43.921930 17.151170 200.915205 4201.754386 NaN 2008.029070 10216.041172
std NaN NaN 5.459584 1.974793 14.061714 801.954536 NaN 0.818356 2468.270185
min NaN NaN 32.100000 13.100000 172.000000 2700.000000 NaN 2007.000000 5782.155471
25% NaN NaN 39.225000 15.600000 190.000000 3550.000000 NaN 2007.000000 8433.676950
50% NaN NaN 44.450000 17.300000 197.000000 4050.000000 NaN 2008.000000 9954.426135
75% NaN NaN 48.500000 18.700000 213.000000 4750.000000 NaN 2009.000000 11537.641963
max NaN NaN 59.600000 21.500000 231.000000 6300.000000 NaN 2009.000000 18416.870649
penguins.describe(include='object')
species island sex
count 344 344 333
unique 3 3 2
top Adelie Biscoe male
freq 152 168 168

Grouping and transforming your data

What if we want to use one of the categorical variables in our data as a factor level to calculate our summaries? For example, we may want to separately get the mean flipper length of each species of penguin. In order to do this, we need to do the following things:

  1. Split our data into different groups (e.g. based the values in the species column)
  2. Apply some function (aggregation or transformation)to each group in our data (e.g. calculates the mean)
  3. Combine each group back together into an output object

We input the column we want to group by into the .groupby() method. This creates a grouped dataframe object that acts like the regular dataframe, but is split into groups based on the column we specified.

penguin_groups = penguins.groupby('species')
print(penguin_groups)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1907d8590>

We can see that on its own this is not especially useful, as grouping the DataFrame does not produce a new DataFrame (just this weird output message telling us that this is a DataFrameGroupBy object). In order to output a DataFrame, we need to pass the grouped DataFrame to some function that aggregates or transforms the data in each group.

We group our data, select the column we want to aggregate (in this case, flipper_length_mm) and apply the .mean() function to it:

#Note the square brackets around the column name
penguins.groupby('species')['flipper_length_mm'].mean()
species
Adelie       189.953642
Chinstrap    195.823529
Gentoo       217.186992
Name: flipper_length_mm, dtype: float64

When we apply the .mean() method to the grouped dataframe, it returns a Series object with the mean flipper length of each species. The exact details of whether pandas returns a groupby object as a Series or a DataFrame gets a little technical; for our purposes, just know that you can make sure the returned object is converted to a DataFrame (which is usually most convenient) by using the .reset_index function:

penguins.groupby('species')['flipper_length_mm'].mean().reset_index()
species flipper_length_mm
0 Adelie 189.953642
1 Chinstrap 195.823529
2 Gentoo 217.186992

We can group by multiple columns by passing a list of column names to the .groupby() method (and using reset_index() as before to have it output as a DataFrame):

penguins.groupby(['species', 'sex'])['flipper_length_mm'].mean().reset_index()
species sex flipper_length_mm
0 Adelie female 187.794521
1 Adelie male 192.410959
2 Chinstrap female 191.735294
3 Chinstrap male 199.911765
4 Gentoo female 212.706897
5 Gentoo male 221.540984

So far we have just been applying the .mean function to our groups, but we can use other functions as well! One very useful function to know when grouping is .size(), which will return the number of rows in each group.

penguins.groupby(['species', 'sex']).size()
species    sex   
Adelie     female    73
           male      73
Chinstrap  female    34
           male      34
Gentoo     female    58
           male      61
dtype: int64

Exercise: Use grouping to answer the following question about the penguins dataset: Which island has the most Adelie penguins?

Hint: Think about which order you should group by for the most readable output

# Your code here

penguins.groupby(['species', 'island']).size()
species    island   
Adelie     Biscoe        44
           Dream         56
           Torgersen     52
Chinstrap  Dream         68
Gentoo     Biscoe       124
dtype: int64

Now try your previous code with the order of the columns to group by switched. What changes?

penguins.groupby(['island', 'species']).size()
island     species  
Biscoe     Adelie        44
           Gentoo       124
Dream      Adelie        56
           Chinstrap     68
Torgersen  Adelie        52
dtype: int64

This shows us that grouping occurs hierarchically, meaning pandas groups data in the order that you specify in! In our case the result is the same (i.e. the counts are equal no matter which column you group on first), but one way is more readily readable for our question than the other.

There are niche situations where it might matter, e.g. your aggregation function depends on the order of values (e.g. if you are sorting your grouped data), or you are doing some non-commutative operation, but generally speaking the results will be the same.

Exercise: now that we have an understanding of pandas, let's go back to our initial example! Calculating the average body mass (in kg) of each penguin species, by sex. We are aiming to reproduce the table below:

Make sure you end up with a DataFrame and not a Series.

species sex body_mass_kg
Adelie female 3.368836
Adelie male 4.043493
Chinstrap female 3.527206
Chinstrap male 3.938971
Gentoo female 4.679741
Gentoo male 5.484836
penguins["body_mass_kg"] = penguins["body_mass_g"] / 1000

penguins.groupby(['species', 'sex'])["body_mass_kg"].mean().reset_index()
species sex body_mass_kg
0 Adelie female 3.368836
1 Adelie male 4.043493
2 Chinstrap female 3.527206
3 Chinstrap male 3.938971
4 Gentoo female 4.679741
5 Gentoo male 5.484836

Exercise: using the .max() function, find the largest bird on each island.

penguins.groupby('island')['body_mass_g'].max()
island
Biscoe       6300.0
Dream        4800.0
Torgersen    4700.0
Name: body_mass_g, dtype: float64

Seaborn

Plotting with Seaborn

In addition to knowing how to import and manipulate data, we often want to also visualize our data. There are many tools available that are specialized in plotting data, such a R and ggplot, Excel, etc., but often it can be helpful to do some quick visualization in python as part of a pipeline. The "classic" way to do this is using the matplotlib library, but for this workshop we are instead going to use seaborn, which is based on matplotlib but has (in our opinion) better syntax (being somewhat reminiscent of ggplot, a commonly-used R library), and is specially designed to integrate with pandas.

Plotting can get quite complicated, so we are going to stick to some more "cookie-cutter" implementation that is geared more towards exploratory analysis, rather than making publication-quality figures.

Just as before, it can be helpful to think about what your end goal looks like. Let's say I want to end up with a scatterplot that shows bird bill length relative to body weight, with the color of each point corresponding to bird species, and the shape of the point corresponding to bird sex. With that goal in mind, let's look at syntax.

Defining the Data

We are going to continue using our penguins data set. seaborn has numerous functions for drawing different plots, summarized in the figure below. There are three different broad "families" of seaborn plots, which are shown in the figure below:

The different seaborn plot categories and the types of plots within them: relplots are scatter plots and line plots; distplots are histograms, KDE plots, ecdf plots, and rug plots; catplots are bar plots, box plots, violin plots, strip plots, swarm plots, and point plots

  • relplot plots show relationships between variables
  • displot show distibutions
  • catplot plot categorical data

Each plot function within a family usually has similar syntax, as they represent data in similar ways. To create a plot, we call one of these functions and specify our (pandas) dataframe and which columns to encode in which axis. For example, if we wanted a histogram:

penguins = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-28/penguins.csv')
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
sns.histplot(data=penguins, x="flipper_length_mm")
<Axes: xlabel='flipper_length_mm', ylabel='Count'>

A histogram showing 'flipper length' in millimeters on the x-axis ranging from 170-230 and 'Count' on the y-axis ranging from 0 to 80. There is a sharp peak around 190mm, a drop-off at 205mm, and then a smaller peak around 215mm.

We can see that for this type of plot, we only need to encode a single column (for the x-axis), but other types of plots might require additional axes. For example, a boxplot needs both an x and y axis defined:

sns.boxplot(data=penguins,x="species",y="bill_length_mm")
<Axes: xlabel='species', ylabel='bill_length_mm'>

A categorical boxplot. On the x-axis are 3 bird species: Adelie, Gentoo, and Chinstrap. The y-axis is 'bill length' in millimeters ranging from 35-60. Adelie have the shortest bill length with a median around 39mm while the other two species have similar distributions of bill length with medians around 46-47mm.

Documentation for each plot type is on seaborn's website, and it lists all required and optional arguments for each plot function: seaborn

Changing plot aesthetics

We can do much more useful things than just setting the x and y axis, however! We will frequently want to group our data, e.g. by species, by sex, etc., and change the plot aesthtics to reflect these groups. To do this, we add an additional argument that specifies which column in our data frame that we want to group on. For example, to color the bars on our histogram based on species, we would use the hue argument:

sns.histplot(data=penguins, x="flipper_length_mm",hue="species")
<Axes: xlabel='flipper_length_mm', ylabel='Count'>

An overlapping histogram showing the distribution of 'flipper length' in millimeters for 3 species of penguins: Adelie, Gentoo, and Chinstrap. The x-axis ranges from 170-230mm and the y-axis is 'Count'. Because the bars overlap, the data is difficult to parse.

We can see we have changed the colors of the bars, but as they are overlapping it is difficult to read. If we dig into the documentation of the histplot function, we can find that there is also the multiple argument, which changes how overlapping bars behave...let's make them stack instead of overlap:

sns.histplot(data=penguins, x="flipper_length_mm",hue="species",multiple="stack")
<Axes: xlabel='flipper_length_mm', ylabel='Count'>

A stacked histogram showing the distribution of 'flipper length' in millimeters for 3 species of penguins: Adelie, Gentoo, and Chinstrap. The x-axis ranges from 170-230mm and the y-axis is 'Count'. The bars are stacked on top of each other. The distributions for the 3 species are roughly normal. The Adelie species has a peak around 190mm with a count of 80, the Gentoo species has a peak around 210mm with a count of 20, and the Chinstrap species has a peak around 220mm with a count of 40.

Exercise: check the documentation page for the scatterplot function, and see if you can figure out how to make a scatter plot that shows bird bill length relative to body weight, with the color of each point corresponding to bird species, and the shape of the point corresponding to bird sex.

sns.scatterplot(data=penguins, x='bill_length_mm', y='body_mass_g', hue='species', style='sex')
<Axes: xlabel='bill_length_mm', ylabel='body_mass_g'>

A scatter plot with 'bill length' in millimeters on the x-axis ranging from 35-60 and 'body mass' in grams on the y-axis ranging from 3000-6000. The points are colored by species: Adelie, Gentoo, and Chinstrap. dots represent males and x's represent females. The plot shows a clear separation between the species based on bill length and body mass and a separation between sexes within species. Adelie have the smallest bill length and body mass, centered around 40mm and 4000g, respectively. Chinstrap have longer bill lengths centered around 450mm, but a similar range of body mass to the Adelie, centered around 3800g. The Gentoo have a bill length between the other two species, centered around 47mm but higher body mass, centered around 5000g. In all species, females have both shorter bills and lower body masses.