{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "## Importing libraries\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", " \n", "## Motivating Example\n", "Today we are going to continue learning to apply Python to data science by introducing `pandas`, another Python library that is similar to `numpy`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">Consider: based on what we've learned over the past several days, what are some *limitations* of `numpy`? Can you think of any tasks you might want to do, or analyses you might like to perform, that would be difficult with `numpy`? Does this give you a guess as to what `pandas` specializes in?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Answer: `numpy` is specialized primarily for numerical operations, such as matrix multiplication and vector math, but is more limited when dealing with other data types such as strings or Python objects. In contrast, `pandas` objects are able to handle mixed data easily! As you will often run into this type of data when doing bioinformatics, `pandas` can be very useful.\n", "\n", "Before we dive into the syntax, let's take a look at an example real-world application of `pandas` for a task that you might commonly face in biology. We are going to use the \"Palmer penguins\" dataset, a commonly used example dataset that collects biometric measurements for several penguin species. Let's take a quick look at what the data looks like.\n", "\n", "In the Palmer penguins dataset, each row represents an individual penguin, and each column represents a different measurement or characteristic of the penguin, such as its body mass or island of origin. 
The data are organized in this way so that variables (things we may want to compare against each other) are the columns while observations (the individual penguins) are the rows. This is a common way to organize data in data science and is called **tidy data**. Tidy formatting also makes the data easy to manipulate and analyze with code, as we will see in this lesson. \n" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |