Python intensive, part 4

Introduction

What is Python and why do we need it/why are you taking this workshop?

Python is a general purpose programming language that is commonly used for data analysis
Programming languages are a way for humans to give the computer commands
Regardless of how you collect data, it needs to be analyzed and code is the best way (Don't use excel!)

Data analysis in Python

Parts 1-3 of this workshop covered basic programming concepts and syntax in the context of Python. Many of these skills are transferable to any programming language, though the syntax will differ.

Starting today, we'll be using these skills to learn about Python's capabilities for large-scale data analysis, mainly using the pandas library. Recall that a "library" is just a collection of programs and functions. If the library is installed in your environment and imported into your program, you'll be able to use those functions freely. pandas is a well-maintained library that facilitates data analysis with functions and data structures for reading and interacting with large files.

We'll start with some review, and then start learning about pandas.

Jupyter basics

Jupyter notebooks are text files that can be rendered as formatted text and run code given the proper setup (see Installation).

Text is split into cells. Double clicking a cell allows you to edit it.

For code cells, there is also an option to run the code. You can do this by pressing SHIFT+ENTER while having it selected, or press the Run button at the top of the cell (exact location depends on the editor you're using). Because of the way we set up the notebook the code cells will be running Python code.

For this workshop, we'll be asking you to follow along by running code cells and by doing coding exercises by writing or editing code in code cell.

IMPORTANT: Run the code cells below to import the libraries we'll be using during this workshop.

# Run this cell to import the libraries we'll be using
# If you don't have the kernel loaded or installed, it will not work

import pandas as pd
import os

And run the following to demonstrate how the code blocks run and display code:

# Run this cell to print a message to the screen
print("this is my code cell")

this is my code cell

Any variables that you assign in one cell will be available in other cells. But they will not be saved between sessions. If you close the notebook and re-open it, you will need to re-run the previous cells to get your variables back. Therefore, it's important to be aware of the state of your notebook and the order in which your cells were run.

my_string = "this is my code cell"

print(my_string)

this is my code cell

Jupyter notebooks can be exported to pdf or html, so that other people can view both the code and its output. It's a good format for handing in homeworks, for example, since you can show your work. In this notebook, there will be exercises with placeholders for the code that you will have to fill in. For these exercises, we encourage you to work with each other, use google, LLMs, and whatever other resources if you are stuck. It's not an exam, but just a way to get practice of the concepts. Afterwards, we will post the completed notebook on our website so you can have examples of solutions.

Python review

Let's begin by reviewing some of the terms we've covered in the past sessions that will come up again today.

Term	Definition
Object	The thing itself (an instance of a class)
Variable	The name we give the object (a pointer to the object)
Class	The blueprint for the object (defines the attributes and methods of object)
Method	A function that belongs to an object
Attribute	A property of an object
Function	A piece of code that takes an input and gives an output/does something
Argument	The objects that are passed to the function for it to operate on
Library	Collections of python functions/capabilities that can be installed and loaded on top of base Python

Object Methods

Everything in Python is an object, and depending on the type of object, they may have certain methods that can be called on them. For example, strings have a method called upper() that converts the string to uppercase.

my_string = "hello"
my_string.upper()

'HELLO'

In the above code, my_string is a string object, and we are calling the upper() method on it. The method is called by using the . operator. Methods are functions that can only be used on objects of certain classes. You will often see methods strung together, like below:

my_string = "hello"
# This first makes the first letter uppercase, then swaps the cases of each letter
my_string.capitalize().swapcase()

'hELLO'

Object Attributes

Attributes are properties of an object that can be accessed using the . operator.

We'll be learning about some more complex objects today, like Pandas Series, which also have attributes. For example, Pandas Series have an attribute called size that tells you the number of elements in the underlying data. The difference between an attribute and a method is that attributes are information about a given object, rather than a task being performed on the object. Practically, this means that, while both attributes and methods are called on an object with the dot operator ., attributes are accessed without parentheses, so you don't need to call them like functions.

# this makes a pandas Series (we'll learn more about these shortly!)
my_array = pd.Series(["a", "b", "c", "d", "e"])

# this gets the size of the Series
my_array.size

Base Python Data Structures

When we want to store multiple pieces of data, we use data structures, which are a more complex type of object. We will go over two fundamental data structures that exist in base python (i.e. that don't require additional libraries).

Lists

Lists are one the most flexible of data structures in Python. They are created with [] and can contain any type of data. Each element in a list is separated by a comma. Lists are ordered and can be indexed, sliced, and concatenated just like strings. When lists are all numerical, they can also support mathematical operations like max() and min(). Lists can also be nested using another [] within the list.

Lists are our first introduction to a mutable data structure, meaning you can change a list without having to create a new one. Indeed, list methods may modify your data in place and/or return a new object. If the method modifies the object in place, its return value will be None. Modifying in place means you don't have to assign the result of the method to a new variable, while returning a new object means you do have to assign it. For example, list.append(x) updates the list in place, while list.pop() both returns the last element and removes it from the list in place.

Below are some useful operations and methods for lists. For a full list of methods, you can use help() on the list or consult the docs page.

Operations and methods for lists

Operation/Method	Description
`+`	Concatenation
`*`	Repetition
`[]`, `[:]`	Indexing, slicing
`.append(x)`	Add `x` to the end of the list
`.extend([x, y, z])`	Add `[x, y, z]` to the end of the list
`.insert(i, x)`	Add `x` at index `i` of the list
`.pop(i)`	Remove and return the element at index `i`, defaults to last element if none given

Use cases for lists

Lists are a data structure that is always there in the background, being useful. We see them when creating simple ordered collections to iterate through, when we need to store a sequence of data to reference later, or when we need to collected a bunch of objects together. Think of lists as a small temporary transport for data. Lists are not good for large datasets (because it will be slow) or when you need to do a lot of mathematical operations (because it lacks functionality).

Exercise: Below is a list of numbers that strictly increases until a peak and then decreases. Find the maximum of the list and the index of the maximum value. You may do this with built-in functions or methods or by iterating through the list manually.

my_list = [1,2,3,4,5,6,5,4,3,2,1]

# your code here

Solution

my_list = [1,2,3,4,5,6,5,4,3,2,1]

# Using built-in methods
max_value = max(my_list)
max_index = my_list.index(max_value)
print("The peak is", max_value, "at index", max_index)

# By iterating through the list
max_index = 0
max_value = my_list[max_index]
while max_value < my_list[max_index + 1]:
    max_index += 1
    max_value = my_list[max_index]
print("The peak is", max_value, "at index", max_index)

The peak is 6 at index 5
The peak is 6 at index 5

Dictionaries

Dictionaries store key:value pairs. Keys must be immutable and are typically strings or numerical identifiers, while the values can be just about anything, including other dictionaries, lists, or individual values. You can create a dictionary with {} or with the dict() function. The two ways to create a dictionary are shown below:

my_dict = {'a': 1, 'b': 2, 'c': 3}
my_dict = dict(("a", 1), ("b", 2), ("c", 3))

In recent versions of Python (3.8+), dictionaries keys maintain the order in which they were added, which is useful to maintain consistency when looping over the keys. However, in previous versions of Python, dictionaries were unordered. You can't index or slice dictionaries (since dictionary elements aren't accessed by index). But you can retrieve items by their key, e.g. my_dict["a"] or my_dict.get("a"). Like lists, dictionaries are mutable, so you can add, remove, or update the key:value pairs in place. Other methods return "View objects" that allow you to see the items in the dictionary, but won't allow you to modify the dictionary, however these objects can usually be easily converted to lists with the list() function. Here are some useful methods for dictionaries:

Operations and methods for dictionaries

Operation/Method	Description
`[<key>]`	Retrieve value by key
`.keys()`	Returns a view object of the keys
`.values()`	Returns a view object of the values
`.items()`	Returns a view object of the key:value pairs
`.update(dict)`	Updates the dictionary with the key:value pairs from another dictionary

Use cases for dictionaries

Dictionaries are a data structure that is more specialized for information that can be organized in a key:value pair way. You may see dictionaries being used to store associations between a name/ID and some characteristics, or to store a set of parameters for a function, or to organize a hierarchical grouping of information. Dictionaries are optimized for fast access to the values by key and for flexible organization of the data. Although you can edit the values of a dictionary, they aren't good for mathematical operations or for ordered data.

Exercise: Using the below dictionary, do the following:

Print out the entries for each pet.

Add ["seeds", "fruit"] to the favorite foods of Polly the parrot

Print the age of Mittens the cat

pets = {
    "Buddy": {
        "name": "Buddy",
        "breed": "Bulldog",
        "age": 4,
        "vaccinated": True,
        "favorite_foods": ["chicken", "peanut butter"],
    },
    "Mittens": {
        "name": "Mittens",
        "breed": "Persian cat",
        "age": 2,
        "vaccinated": False,
        "favorite_foods": ["tuna", "chicken"],
        "owner": {
            "name": "Alice",
            "contact": "555-0123"
        }
    },
    "Polly": {
        "name": "Polly",
        "breed": "Parrot",
        "age": 10,
        "vaccinated": True,
        "words_learned": ["hello", "bye", "Polly wants a cracker"],
    }
}

# Your code here

# 1. Print out the entries for each pet.
# 2. Add ["seeds", "fruit"] to the favorite foods of Polly the parrot
# 3. Print the age of Mittens the cat

Solution

pets = {
    "Buddy": {
        "name": "Buddy",
        "breed": "Bulldog",
        "age": 4,
        "vaccinated": True,
        "favorite_foods": ["chicken", "peanut butter"],
    },
    "Mittens": {
        "name": "Mittens",
        "breed": "Persian cat",
        "age": 2,
        "vaccinated": False,
        "favorite_foods": ["tuna", "chicken"],
        "owner": {
            "name": "Alice",
            "contact": "555-0123"
        }
    },
    "Polly": {
        "name": "Polly",
        "breed": "Parrot",
        "age": 10,
        "vaccinated": True,
        "words_learned": ["hello", "bye", "Polly wants a cracker"],
    }
}

# 1. Print out the entries for each pet.
for key in pets:
    print(pets[key])

# 2. Add ["seeds", "fruit"] to the favorite foods of Polly the parrot
pets["Polly"]["favorite_foods"] = ["seeds", "fruit"]
print(pets["Polly"]["favorite_foods"])

# 3. Print the age of Mittens the cat
print(pets["Mittens"]["age"])

{'name': 'Buddy', 'breed': 'Bulldog', 'age': 4, 'vaccinated': True, 'favorite_foods': ['chicken', 'peanut butter']}
{'name': 'Mittens', 'breed': 'Persian cat', 'age': 2, 'vaccinated': False, 'favorite_foods': ['tuna', 'chicken'], 'owner': {'name': 'Alice', 'contact': '555-0123'}}
{'name': 'Polly', 'breed': 'Parrot', 'age': 10, 'vaccinated': True, 'words_learned': ['hello', 'bye', 'Polly wants a cracker']}
['seeds', 'fruit']
2

Importing libraries

Recall that we covered how to import libraries of functions previously. For instance, we can import the built-in math library and then use the functions it contains:

import math
print(math.log(100))

4.605170185988092

We need to type math. so Python knows where to look for the log() function.

We can also use an alias if we don't want to type math. every time we use a function from the library:

import math as m
print(m.log(100))

4.605170185988092

Reading and writing data

One thing we haven't touched on yet that is integral to data analysis is how to get that data into your program. This is usually done by reading files. Likewise, when you've done your analysis and want to save the results you'll have to get that information out of the program so it is saved by creating and writing to files. In this next section, we'll learn how to do this first in the native Python way, and then later on we'll see how we can use pandas functions to read and write data. We'll use an example of bird names and bird sightings from two CSV files and then write out the total number of sightings for each bird to a new CSV file.

Vocab

Here are some terms that will be useful to know for the next sections.

Term	Definition
File	A collection of data stored on a disk
Line	A string of characters that ends with a newline character
Newline character	A special character that indicates the end of a line, usually `\n`
Delimiter	A character that separates data fields in a line, usually a comma, tab, or space
Parsing	The process of extracting data from a file
Whitespace	Any character that represents a space, tab, or newline
Leading/trailing whitespace	Whitespace at the beginning or end of a string

Reading files with Python

Reading data line by line

A common way to read data in Python is line by line. This is useful when you have a file that is too large to fit into memory all at once. You can read the file line by line and process each line as you go. This is also useful when you need to parse a file that has a specific structure so that it can be read into a data structure like a numpy array or pandas dataframe. In this section, when we talk about files, we mean text files that contain data that exist on your local machine. This is different from the data structures we've been working with so far, which are in memory (in your python instance).

The syntax for opening a file is open(filename, mode), where mode can be r for reading, w for writing, a for appending. Reading mode means that you can only read the file but not change it (on the disk). Writing mode means you are creating a new file or overwriting an existing file. Appending mode means you are adding to an existing file.

When you open a file, you can read it line by line using a for loop. While there are several ways to open a file in Python, the most efficient is with the with keyword, which will automatically close the file once you've done what you need. Here's an example of how that might look with a for loop to print each line of a file:

with open('filename.txt', 'r') as file:
    for line in file:
        print(line)

The above code will print each line of the file to the console. Notice the use of the keyword as, similar to how we setup an alias when we import a library. Here it serves a similar purpose, giving us a name to refer to the file object we've opened. We can name it anything we want, but file was just an example. The line variable is how we refer to each line of the file as we iterate through it. Again, it's an arbitrary name.

This for loop is similar to how we iterate through a list. The only difference is that we're iterating through something that is being read from our local computer rather than an object in memory.

In the example, we are only printing the lines of the file, but just as with any for loop, you can do anything you want with each line, such as apply a function to it, split it up and store it in a data structure, or write different parts of it to a new file. So think of that print(line) line as a placeholder for whatever you want to do with the line.

Let's work through a more concrete example of how we might read and then parse through a file. Let's suppose we have a list of taxon ids and bird names that we want to read into a dictionary. The file looks like this:

Anas rubripes,American Black Duck,6924
Fulica americana,American Coot,473
Spinus tristis,American Goldfinch,145310
Falco sparverius,American Kestrel,4665

First, run this block to download the file to the Jupyter notebook environment..

# This line downloads the file locally to the same folder as your notebook
!wget https://raw.githubusercontent.com/harvardinformatics/python-intensive/refs/heads/main/data/bird_names.csv

--2025-09-11 19:56:55--  https://raw.githubusercontent.com/harvardinformatics/python-intensive/refs/heads/main/data/bird_names.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...

Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response...

200 OK
Length: 4383 (4.3K) [text/plain]
Saving to: ‘bird_names.csv’


bird_names.csv        0%[                    ]       0  --.-KB/s
bird_names.csv      100%[===================>]   4.28K  --.-KB/s    in 0s

2025-09-11 19:56:55 (89.3 MB/s) - ‘bird_names.csv’ saved [4383/4383]

In the code below we first read the file line by line, then strip the whitespace and split the line by a comma. Then, we will create a dictionary where the key is the taxon id and the value is the common name of the bird.

filename = 'bird_names.csv'
if not os.path.exists(filename):
    filename = 'data/bird_names.csv'

bird_names = dict()

with open(filename, 'r') as file:
    for line in file:
        line = line.strip().split(',')
        # split out line[2]
        bird_names[line[2]] = line[1]

print(bird_names)

{'6924': 'American Black Duck', '473': 'American Coot', '145310': 'American Goldfinch', '4665': 'American Kestrel', '12727': 'American Robin', '474210': 'American Tree Sparrow', '3936': 'American Woodcock', '16010': 'Ash-throated Flycatcher', '5305': 'Bald Eagle', '9346': 'Baltimore Oriole', '19893': 'Barred Owl', '2548': 'Belted Kingfisher', '144815': 'Black-capped Chickadee', '4981': 'Black-crowned Night Heron', '199916': 'Black-throated Blue Warbler', '8229': 'Blue Jay', '7458': 'Brown Creeper', '10373': 'Brown-headed Cowbird', '6993': 'Bufflehead', '7089': 'Canada Goose', '7513': 'Carolina Wren', '7428': 'Cedar Waxwing', '6571': 'Chimney Swift', '9135': 'Chipping Sparrow', '9602': 'Common Grackle', '4626': 'Common Loon', '7004': 'Common Merganser', '8010': 'Common Raven', '9721': 'Common Yellowthroat', '5112': "Cooper's Hawk", '10094': 'Dark-eyed Junco', '10676': 'Dickcissel', '120479': 'Domestic Greylag Goose', '236935': 'Domestic Mallard', '1454382': 'Double-crested Cormorant', '792988': 'Downy Woodpecker', '16782': 'Eastern Kingbird', '17008': 'Eastern Phoebe', '494355': 'Eastern Red-tailed Hawk', '515821': 'Eastern Song Sparrow', '319123': 'Eastern Wild Turkey', '14850': 'European Starling', '544795': 'European house sparrow', '122767': 'Feral Pigeon', '9156': 'Fox Sparrow', '117100': 'Golden-crowned Kinglet', '14995': 'Gray Catbird', '4368': 'Great Black-backed Gull', '4956': 'Great Blue Heron', '16028': 'Great Crested Flycatcher', '20044': 'Great Horned Owl', '7047': 'Greater Scaup', '5020': 'Green Heron', '6937': 'Green-winged Teal', '514057': 'Greylag × Canada Goose', '792990': 'Hairy Woodpecker', '12890': 'Hermit Thrush', '204533': 'Herring Gull', '7109': 'Hooded Merganser', '4209': 'Horned Grebe', '199840': 'House Finch', '13858': 'House Sparrow', '7562': 'House Wren', '4793': 'Killdeer', '10479': 'Lark Sparrow', '7054': 'Lesser Scaup', '6930': 'Mallard', '326092': 'Mallard × Muscovy Duck', '4672': 'Merlin', '3454': 'Mourning Dove', '6921': 'Mute Swan', '9083': 'Northern Cardinal', '18236': 'Northern Flicker', '14886': 'Northern Mockingbird', '555736': 'Northern Yellow-shafted Flicker', '979757': 'Orange-crowned Warbler', '116999': 'Osprey', '62550': 'Ovenbird', '4647': 'Peregrine Falcon', '17364': 'Philadelphia Vireo', '4246': 'Pied-billed Grebe', '18205': 'Red-bellied Woodpecker', '6996': 'Red-breasted Merganser', '14823': 'Red-breasted Nuthatch', '5212': 'Red-tailed Hawk', '9744': 'Red-winged Blackbird', '7056': 'Redhead', '4364': 'Ring-billed Gull', '7044': 'Ring-necked Duck', '3017': 'Rock Pigeon', '1289388': 'Ruby-crowned Kinglet', '6432': 'Ruby-throated Hummingbird', '850859': 'Ruddy Duck', '9100': 'Song Sparrow', '72458': 'Spotted Sandpiper', '11935': 'Tree Swallow', '13632': 'Tufted Titmouse', '17394': 'Warbling Vireo', '14801': 'White-breasted Nuthatch', '9176': 'White-crowned Sparrow', '9184': 'White-throated Sparrow', '906': 'Wild Turkey', '7107': 'Wood Duck', '145238': 'Yellow Warbler', '18463': 'Yellow-bellied Sapsucker'}

Discussion: Explain each line

Here are some handy functions when working with lines in files. These are all string methods, so you can use them on any string, including strings that are read from a file.

Useful functions for reading files by line

Function	Description
`.strip()`	Removes leading and trailing whitespace and newlines from a string
`.split()`	Splits a string into a list of strings based on a delimiter
`.join()`	Joins a list of strings into a single string with a delimiter
`line[:]`	Indexing and slicing works on strings too
`.replace(old, new)`	Replaces all instances of `old` with `new` in a string

Useful special characters

Special characters in files are often used as delimiters or to indicate the end of a line. The two most common special characters are:

Character	Description
`\n`	Newline character
`\t`	Tab character

Exercise: Copy the code above and modify it so that the dictionary keys are the taxon ids and the values are another dictionary, with keys 'scientific_name' and 'common_name' and values the appropriate entries for that bird species.

For example, a sample dictionary entry should look like this:
{6924: {'scientific_name': 'Anas rubripes', 'common_name': 'American Black Duck'}}

filename = 'bird_names.csv'
if not os.path.exists(filename):
    filename = 'data/bird_names.csv'

# Your code here

Solution

filename = 'bird_names.csv'
if not os.path.exists(filename):
    filename = 'data/bird_names.csv'

bird_names = dict()

with open(filename, 'r') as file:
    for line in file:
        line = line.strip().split(',')
        bird_names[line[2]] = {'scientific_name': line[0], 'common_name': line[1]}

print(bird_names)

{'6924': {'scientific_name': 'Anas rubripes', 'common_name': 'American Black Duck'}, '473': {'scientific_name': 'Fulica americana', 'common_name': 'American Coot'}, '145310': {'scientific_name': 'Spinus tristis', 'common_name': 'American Goldfinch'}, '4665': {'scientific_name': 'Falco sparverius', 'common_name': 'American Kestrel'}, '12727': {'scientific_name': 'Turdus migratorius', 'common_name': 'American Robin'}, '474210': {'scientific_name': 'Spizelloides arborea', 'common_name': 'American Tree Sparrow'}, '3936': {'scientific_name': 'Scolopax minor', 'common_name': 'American Woodcock'}, '16010': {'scientific_name': 'Myiarchus cinerascens', 'common_name': 'Ash-throated Flycatcher'}, '5305': {'scientific_name': 'Haliaeetus leucocephalus', 'common_name': 'Bald Eagle'}, '9346': {'scientific_name': 'Icterus galbula', 'common_name': 'Baltimore Oriole'}, '19893': {'scientific_name': 'Strix varia', 'common_name': 'Barred Owl'}, '2548': {'scientific_name': 'Megaceryle alcyon', 'common_name': 'Belted Kingfisher'}, '144815': {'scientific_name': 'Poecile atricapillus', 'common_name': 'Black-capped Chickadee'}, '4981': {'scientific_name': 'Nycticorax nycticorax', 'common_name': 'Black-crowned Night Heron'}, '199916': {'scientific_name': 'Setophaga caerulescens', 'common_name': 'Black-throated Blue Warbler'}, '8229': {'scientific_name': 'Cyanocitta cristata', 'common_name': 'Blue Jay'}, '7458': {'scientific_name': 'Certhia americana', 'common_name': 'Brown Creeper'}, '10373': {'scientific_name': 'Molothrus ater', 'common_name': 'Brown-headed Cowbird'}, '6993': {'scientific_name': 'Bucephala albeola', 'common_name': 'Bufflehead'}, '7089': {'scientific_name': 'Branta canadensis', 'common_name': 'Canada Goose'}, '7513': {'scientific_name': 'Thryothorus ludovicianus', 'common_name': 'Carolina Wren'}, '7428': {'scientific_name': 'Bombycilla cedrorum', 'common_name': 'Cedar Waxwing'}, '6571': {'scientific_name': 'Chaetura pelagica', 'common_name': 'Chimney Swift'}, '9135': {'scientific_name': 'Spizella passerina', 'common_name': 'Chipping Sparrow'}, '9602': {'scientific_name': 'Quiscalus quiscula', 'common_name': 'Common Grackle'}, '4626': {'scientific_name': 'Gavia immer', 'common_name': 'Common Loon'}, '7004': {'scientific_name': 'Mergus merganser', 'common_name': 'Common Merganser'}, '8010': {'scientific_name': 'Corvus corax', 'common_name': 'Common Raven'}, '9721': {'scientific_name': 'Geothlypis trichas', 'common_name': 'Common Yellowthroat'}, '5112': {'scientific_name': 'Accipiter cooperii', 'common_name': "Cooper's Hawk"}, '10094': {'scientific_name': 'Junco hyemalis', 'common_name': 'Dark-eyed Junco'}, '10676': {'scientific_name': 'Spiza americana', 'common_name': 'Dickcissel'}, '120479': {'scientific_name': 'Anser anser domesticus', 'common_name': 'Domestic Greylag Goose'}, '236935': {'scientific_name': 'Anas platyrhynchos domesticus', 'common_name': 'Domestic Mallard'}, '1454382': {'scientific_name': 'Nannopterum auritum', 'common_name': 'Double-crested Cormorant'}, '792988': {'scientific_name': 'Dryobates pubescens', 'common_name': 'Downy Woodpecker'}, '16782': {'scientific_name': 'Tyrannus tyrannus', 'common_name': 'Eastern Kingbird'}, '17008': {'scientific_name': 'Sayornis phoebe', 'common_name': 'Eastern Phoebe'}, '494355': {'scientific_name': 'Buteo jamaicensis borealis', 'common_name': 'Eastern Red-tailed Hawk'}, '515821': {'scientific_name': 'Melospiza melodia melodia', 'common_name': 'Eastern Song Sparrow'}, '319123': {'scientific_name': 'Meleagris gallopavo silvestris', 'common_name': 'Eastern Wild Turkey'}, '14850': {'scientific_name': 'Sturnus vulgaris', 'common_name': 'European Starling'}, '544795': {'scientific_name': 'Passer domesticus domesticus', 'common_name': 'European house sparrow'}, '122767': {'scientific_name': 'Columba livia domestica', 'common_name': 'Feral Pigeon'}, '9156': {'scientific_name': 'Passerella iliaca', 'common_name': 'Fox Sparrow'}, '117100': {'scientific_name': 'Regulus satrapa', 'common_name': 'Golden-crowned Kinglet'}, '14995': {'scientific_name': 'Dumetella carolinensis', 'common_name': 'Gray Catbird'}, '4368': {'scientific_name': 'Larus marinus', 'common_name': 'Great Black-backed Gull'}, '4956': {'scientific_name': 'Ardea herodias', 'common_name': 'Great Blue Heron'}, '16028': {'scientific_name': 'Myiarchus crinitus', 'common_name': 'Great Crested Flycatcher'}, '20044': {'scientific_name': 'Bubo virginianus', 'common_name': 'Great Horned Owl'}, '7047': {'scientific_name': 'Aythya marila', 'common_name': 'Greater Scaup'}, '5020': {'scientific_name': 'Butorides virescens', 'common_name': 'Green Heron'}, '6937': {'scientific_name': 'Anas crecca', 'common_name': 'Green-winged Teal'}, '514057': {'scientific_name': 'Anser anser × Branta canadensis', 'common_name': 'Greylag × Canada Goose'}, '792990': {'scientific_name': 'Dryobates villosus', 'common_name': 'Hairy Woodpecker'}, '12890': {'scientific_name': 'Catharus guttatus', 'common_name': 'Hermit Thrush'}, '204533': {'scientific_name': 'Larus argentatus', 'common_name': 'Herring Gull'}, '7109': {'scientific_name': 'Lophodytes cucullatus', 'common_name': 'Hooded Merganser'}, '4209': {'scientific_name': 'Podiceps auritus', 'common_name': 'Horned Grebe'}, '199840': {'scientific_name': 'Haemorhous mexicanus', 'common_name': 'House Finch'}, '13858': {'scientific_name': 'Passer domesticus', 'common_name': 'House Sparrow'}, '7562': {'scientific_name': 'Troglodytes aedon', 'common_name': 'House Wren'}, '4793': {'scientific_name': 'Charadrius vociferus', 'common_name': 'Killdeer'}, '10479': {'scientific_name': 'Chondestes grammacus', 'common_name': 'Lark Sparrow'}, '7054': {'scientific_name': 'Aythya affinis', 'common_name': 'Lesser Scaup'}, '6930': {'scientific_name': 'Anas platyrhynchos', 'common_name': 'Mallard'}, '326092': {'scientific_name': 'Anas platyrhynchos × cairina moschata', 'common_name': 'Mallard × Muscovy Duck'}, '4672': {'scientific_name': 'Falco columbarius', 'common_name': 'Merlin'}, '3454': {'scientific_name': 'Zenaida macroura', 'common_name': 'Mourning Dove'}, '6921': {'scientific_name': 'Cygnus olor', 'common_name': 'Mute Swan'}, '9083': {'scientific_name': 'Cardinalis cardinalis', 'common_name': 'Northern Cardinal'}, '18236': {'scientific_name': 'Colaptes auratus', 'common_name': 'Northern Flicker'}, '14886': {'scientific_name': 'Mimus polyglottos', 'common_name': 'Northern Mockingbird'}, '555736': {'scientific_name': 'Colaptes auratus luteus', 'common_name': 'Northern Yellow-shafted Flicker'}, '979757': {'scientific_name': 'Leiothlypis celata', 'common_name': 'Orange-crowned Warbler'}, '116999': {'scientific_name': 'Pandion haliaetus', 'common_name': 'Osprey'}, '62550': {'scientific_name': 'Seiurus aurocapilla', 'common_name': 'Ovenbird'}, '4647': {'scientific_name': 'Falco peregrinus', 'common_name': 'Peregrine Falcon'}, '17364': {'scientific_name': 'Vireo philadelphicus', 'common_name': 'Philadelphia Vireo'}, '4246': {'scientific_name': 'Podilymbus podiceps', 'common_name': 'Pied-billed Grebe'}, '18205': {'scientific_name': 'Melanerpes carolinus', 'common_name': 'Red-bellied Woodpecker'}, '6996': {'scientific_name': 'Mergus serrator', 'common_name': 'Red-breasted Merganser'}, '14823': {'scientific_name': 'Sitta canadensis', 'common_name': 'Red-breasted Nuthatch'}, '5212': {'scientific_name': 'Buteo jamaicensis', 'common_name': 'Red-tailed Hawk'}, '9744': {'scientific_name': 'Agelaius phoeniceus', 'common_name': 'Red-winged Blackbird'}, '7056': {'scientific_name': 'Aythya americana', 'common_name': 'Redhead'}, '4364': {'scientific_name': 'Larus delawarensis', 'common_name': 'Ring-billed Gull'}, '7044': {'scientific_name': 'Aythya collaris', 'common_name': 'Ring-necked Duck'}, '3017': {'scientific_name': 'Columba livia', 'common_name': 'Rock Pigeon'}, '1289388': {'scientific_name': 'Corthylio calendula', 'common_name': 'Ruby-crowned Kinglet'}, '6432': {'scientific_name': 'Archilochus colubris', 'common_name': 'Ruby-throated Hummingbird'}, '850859': {'scientific_name': 'Oxyura jamaicensis', 'common_name': 'Ruddy Duck'}, '9100': {'scientific_name': 'Melospiza melodia', 'common_name': 'Song Sparrow'}, '72458': {'scientific_name': 'Actitis macularius', 'common_name': 'Spotted Sandpiper'}, '11935': {'scientific_name': 'Tachycineta bicolor', 'common_name': 'Tree Swallow'}, '13632': {'scientific_name': 'Baeolophus bicolor', 'common_name': 'Tufted Titmouse'}, '17394': {'scientific_name': 'Vireo gilvus', 'common_name': 'Warbling Vireo'}, '14801': {'scientific_name': 'Sitta carolinensis', 'common_name': 'White-breasted Nuthatch'}, '9176': {'scientific_name': 'Zonotrichia leucophrys', 'common_name': 'White-crowned Sparrow'}, '9184': {'scientific_name': 'Zonotrichia albicollis', 'common_name': 'White-throated Sparrow'}, '906': {'scientific_name': 'Meleagris gallopavo', 'common_name': 'Wild Turkey'}, '7107': {'scientific_name': 'Aix sponsa', 'common_name': 'Wood Duck'}, '145238': {'scientific_name': 'Setophaga petechia', 'common_name': 'Yellow Warbler'}, '18463': {'scientific_name': 'Sphyrapicus varius', 'common_name': 'Yellow-bellied Sapsucker'}}

Exercise: Why did we use a dictionary to store the data in the previous exercise? Think about what features of a dictionary make it a good choice or what features of lists make them a bad choice.

Below is an excerpt from a file of iNaturalist observations of birds in Cambridge, MA from the year 2023. We will loop through the file and count the number of observations of each species. We will also use the previously created dictionary to get the species names.

id,time_observed_at,taxon_id
145591043,2023-01-01 17:33:31 UTC,14886
145610149,2023-01-01 20:55:00 UTC,7004
145610383,2023-01-01 21:13:00 UTC,6993
145611915,2023-01-01 21:12:00 UTC,13858

Run the code block below to download the file to the Jupyter notebook environment.

# This line downloads the file locally to the same folder as your notebook
!wget https://raw.githubusercontent.com/harvardinformatics/python-intensive/refs/heads/main/data/bird_observations.csv

--2025-09-11 19:56:55--  https://raw.githubusercontent.com/harvardinformatics/python-intensive/refs/heads/main/data/bird_observations.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.

HTTP request sent, awaiting response...

200 OK
Length: 50448 (49K) [text/plain]
Saving to: ‘bird_observations.csv’


bird_observations.c   0%[                    ]       0  --.-KB/s

bird_observations.c 100%[===================>]  49.27K  --.-KB/s    in 0.01s

2025-09-11 19:56:55 (3.41 MB/s) - ‘bird_observations.csv’ saved [50448/50448]

Exercise: Work with a neighbor or two to do the following exercise: Loop through the file and count the number of observations of each species. After all the observations have been counted, print all the species names and the number of observations. You will need to use the dictionary you created in the previous exercise to get the species names. It's up to you what kind of data structure (if any) you want to use to store the counts.

Write out pseudocode for what you will do for each line in your birdfile

Try to turn the pseudocode into python code. If there's something you want to do, but don't know the syntax or function, raise your hand and we can help you.

Find out how many European Starlings were observed as proof that your code works.

# Your code here
filename = 'bird_observations.csv'
if not os.path.exists(filename):
    filename = 'data/bird_observations.csv'

bird_observations = dict()

with open(filename, 'r') as birdfile:
    # skip the header
    next(birdfile)
    for line in birdfile:
        # your code here
print(bird_observations)

  Cell In[19], line 13
    print(bird_observations)
                            ^
IndentationError: expected an indented block after 'for' statement on line 11

Solution

filename = 'bird_observations.csv'
if not os.path.exists(filename):
    filename = 'data/bird_observations.csv'

bird_observations = dict()

with open(filename, 'r') as birdfile:
    # skip the header
    next(birdfile)
    for line in birdfile:
        # clean up the line and split into list
        observation = line.strip().split(',')
        # get the bird id
        id = observation[2]
        # get the bird name by looking up in the bird_names dictionary
        name = bird_names[id]['common_name']
        # if this is the first time we're seeing the bird, add it to our observations dict
        if name not in bird_observations:
            bird_observations[name] = 0
        # increment the count by 1
        bird_observations[name] += 1
print(bird_observations)

{'Northern Mockingbird': 23, 'Common Merganser': 4, 'Bufflehead': 9, 'House Sparrow': 69, 'European Starling': 51, 'Northern Cardinal': 28, 'Mourning Dove': 31, 'Blue Jay': 39, 'American Black Duck': 2, 'Domestic Mallard': 14, 'Mute Swan': 33, 'Green-winged Teal': 8, 'American Robin': 92, 'Mallard': 49, 'Great Blue Heron': 26, 'Red-tailed Hawk': 36, 'Canada Goose': 112, 'Downy Woodpecker': 24, 'Ring-necked Duck': 22, 'Wild Turkey': 82, 'Common Loon': 9, 'Horned Grebe': 4, 'Redhead': 8, 'Feral Pigeon': 31, 'Golden-crowned Kinglet': 7, 'Red-bellied Woodpecker': 15, 'Hooded Merganser': 18, 'Belted Kingfisher': 3, 'Red-winged Blackbird': 35, 'Black-capped Chickadee': 14, 'Ruddy Duck': 2, 'Bald Eagle': 2, 'Dark-eyed Junco': 9, 'Carolina Wren': 7, 'House Finch': 19, 'White-throated Sparrow': 5, 'Song Sparrow': 24, 'Yellow-bellied Sapsucker': 3, 'White-breasted Nuthatch': 10, 'Eastern Red-tailed Hawk': 5, 'Tufted Titmouse': 8, "Cooper's Hawk": 17, 'Domestic Greylag Goose': 14, 'Rock Pigeon': 9, 'American Coot': 1, 'Greylag × Canada Goose': 1, 'Eastern Wild Turkey': 1, 'Brown Creeper': 7, 'Hairy Woodpecker': 2, 'Northern Flicker': 6, 'Greater Scaup': 1, 'Red-breasted Merganser': 2, 'American Woodcock': 7, 'Red-breasted Nuthatch': 1, 'Great Horned Owl': 23, 'Peregrine Falcon': 5, 'American Goldfinch': 18, 'Barred Owl': 2, 'Black-crowned Night Heron': 2, 'Tree Swallow': 11, 'Common Grackle': 14, 'Hermit Thrush': 4, 'Northern Yellow-shafted Flicker': 1, 'Chipping Sparrow': 3, 'Killdeer': 2, 'Gray Catbird': 20, 'Double-crested Cormorant': 17, 'Yellow Warbler': 3, 'Warbling Vireo': 2, 'Baltimore Oriole': 7, 'Common Yellowthroat': 2, 'White-crowned Sparrow': 2, 'Black-throated Blue Warbler': 1, 'Ovenbird': 1, 'Brown-headed Cowbird': 4, 'House Wren': 1, 'Cedar Waxwing': 4, 'European house sparrow': 1, 'Herring Gull': 4, 'Eastern Kingbird': 7, 'Great Black-backed Gull': 1, 'Green Heron': 10, 'Great Crested Flycatcher': 1, 'Wood Duck': 6, 'American Kestrel': 1, 'Osprey': 1, 'Ruby-throated Hummingbird': 3, 'Spotted Sandpiper': 2, 'Chimney Swift': 1, 'Eastern Phoebe': 1, 'Lark Sparrow': 2, 'Ring-billed Gull': 1, 'Dickcissel': 1, 'Merlin': 1, 'Ash-throated Flycatcher': 6, 'Pied-billed Grebe': 5, 'Lesser Scaup': 2, 'Orange-crowned Warbler': 2, 'Eastern Song Sparrow': 1, 'Philadelphia Vireo': 1, 'Ruby-crowned Kinglet': 2, 'Mallard × Muscovy Duck': 1, 'Fox Sparrow': 1, 'American Tree Sparrow': 1, 'Common Raven': 1}

If you routinely find yourself reading delimited files, you might want to use the csv library. The csv library also has the ability to parse Excel files or read and write to/from dictionaries directly. For more information, here's the doc page . Here's what the above code would look like using the csv module:

import csv

filename = 'bird_observations.csv'
if not os.path.exists(filename):
    filename = 'data/bird_observations.csv'

bird_observations = dict()

with open(filename, 'r') as birdfile:
    # this line takes the place of us having to strip and split the lines
    reader = csv.reader(birdfile, delimiter=',')
    # skip the header
    next(reader)
    for row in reader:
        id = row[2]
        name = bird_names[id]['common_name']
        if name not in bird_observations:
            bird_observations[name] = 0
        bird_observations[name] += 1
print(bird_observations)

{'Northern Mockingbird': 23, 'Common Merganser': 4, 'Bufflehead': 9, 'House Sparrow': 69, 'European Starling': 51, 'Northern Cardinal': 28, 'Mourning Dove': 31, 'Blue Jay': 39, 'American Black Duck': 2, 'Domestic Mallard': 14, 'Mute Swan': 33, 'Green-winged Teal': 8, 'American Robin': 92, 'Mallard': 49, 'Great Blue Heron': 26, 'Red-tailed Hawk': 36, 'Canada Goose': 112, 'Downy Woodpecker': 24, 'Ring-necked Duck': 22, 'Wild Turkey': 82, 'Common Loon': 9, 'Horned Grebe': 4, 'Redhead': 8, 'Feral Pigeon': 31, 'Golden-crowned Kinglet': 7, 'Red-bellied Woodpecker': 15, 'Hooded Merganser': 18, 'Belted Kingfisher': 3, 'Red-winged Blackbird': 35, 'Black-capped Chickadee': 14, 'Ruddy Duck': 2, 'Bald Eagle': 2, 'Dark-eyed Junco': 9, 'Carolina Wren': 7, 'House Finch': 19, 'White-throated Sparrow': 5, 'Song Sparrow': 24, 'Yellow-bellied Sapsucker': 3, 'White-breasted Nuthatch': 10, 'Eastern Red-tailed Hawk': 5, 'Tufted Titmouse': 8, "Cooper's Hawk": 17, 'Domestic Greylag Goose': 14, 'Rock Pigeon': 9, 'American Coot': 1, 'Greylag × Canada Goose': 1, 'Eastern Wild Turkey': 1, 'Brown Creeper': 7, 'Hairy Woodpecker': 2, 'Northern Flicker': 6, 'Greater Scaup': 1, 'Red-breasted Merganser': 2, 'American Woodcock': 7, 'Red-breasted Nuthatch': 1, 'Great Horned Owl': 23, 'Peregrine Falcon': 5, 'American Goldfinch': 18, 'Barred Owl': 2, 'Black-crowned Night Heron': 2, 'Tree Swallow': 11, 'Common Grackle': 14, 'Hermit Thrush': 4, 'Northern Yellow-shafted Flicker': 1, 'Chipping Sparrow': 3, 'Killdeer': 2, 'Gray Catbird': 20, 'Double-crested Cormorant': 17, 'Yellow Warbler': 3, 'Warbling Vireo': 2, 'Baltimore Oriole': 7, 'Common Yellowthroat': 2, 'White-crowned Sparrow': 2, 'Black-throated Blue Warbler': 1, 'Ovenbird': 1, 'Brown-headed Cowbird': 4, 'House Wren': 1, 'Cedar Waxwing': 4, 'European house sparrow': 1, 'Herring Gull': 4, 'Eastern Kingbird': 7, 'Great Black-backed Gull': 1, 'Green Heron': 10, 'Great Crested Flycatcher': 1, 'Wood Duck': 6, 'American Kestrel': 1, 'Osprey': 1, 'Ruby-throated Hummingbird': 3, 'Spotted Sandpiper': 2, 'Chimney Swift': 1, 'Eastern Phoebe': 1, 'Lark Sparrow': 2, 'Ring-billed Gull': 1, 'Dickcissel': 1, 'Merlin': 1, 'Ash-throated Flycatcher': 6, 'Pied-billed Grebe': 5, 'Lesser Scaup': 2, 'Orange-crowned Warbler': 2, 'Eastern Song Sparrow': 1, 'Philadelphia Vireo': 1, 'Ruby-crowned Kinglet': 2, 'Mallard × Muscovy Duck': 1, 'Fox Sparrow': 1, 'American Tree Sparrow': 1, 'Common Raven': 1}

Writing data line by line

Writing data to a file is similar to reading data from a file. You can open a file in write mode and then write to it line by line using the print() method, but this time passing in the variable we've stored the opened file in (in our case the variable is unimaginatively named file). Here's an example of writing a list of strings to a file:

my_text = ['this is a test', 'this is another test', 'this is the final test']

with open('my_text.txt', 'w') as file:
    for line in my_text:
        print(line, file=file)

# reading it back
with open('my_text.txt', 'r') as file:
    for line in file:
        print(line)

this is a test

this is another test

this is the final test

BONUS Exercise: Use the csv module to write the species counts to a new file. The file should have two columns: the species name and the number of observations. The file should be comma-delimited. How this is written may depend on how you stored the species counts.

# your code here

Solution

import csv

with open('bird_counts.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerow(['bird', 'count'])
    for bird, count in bird_observations.items():
        writer.writerow([bird, count])

Pandas

In this section, we will learn about a python package called pandas that contains very helpful functions and data structures for flat data types like the tab or comma-delimited files you might normally read into Excel. In previous iterations of this class, we taught both pandas and numpy, which is another library that pandas uses under the hood. In this workshop, we will focus on pandas only, and do a deeper dive.

First, we'll learn about the basic data structures available in pandas, what their closest corollaries are in base Python, and what advantages they have over those corollaries.

`pandas` series

A Series is the simplest data structure in pandas. They are one dimensional (1D) objects composed of a single data type of any variety (string, integers); you can basically think of them as a single column in a spreadsheet. In that sense they are similar to lists in base Python (and arrays in numpy), however unlike those other 1D structures Series also have label-based indexing, meaning each element in the object can be accessed by specifying its specific label. In that way, they are also similar to dictionaries in base Python.

Initializing a series

We can manually create a Series in several ways.

Using the pd.Series() function, we provide the data we want to store as a series, and optionally we can give each row of the data a label using the index argument. If we don't give it the index argument, it will automatically assign a numerical index to each row starting from 0 (just like a Python list).

When we print the Series, it will display as a column with the index on the left and the data on the right. The type of data being held in the series will be displayed at the bottom of the output.

#Making a Series using the pd.Series method:
s0 = pd.Series([10, 20, 30, 40])

print(s0)

0    10
1    20
2    30
3    40
dtype: int64

#Making a Series using the pd.Series method:
s1 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s1)

a    10
b    20
c    30
d    40
dtype: int64

Another way to create a Series is to convert a (non-nested) dictionary into a Series. The keys of the dictionary will become the index labels while the values will become the data.

# Converting from dictionary to series
my_dictionary = {'first': 10, 'second': 20, 'third': 30}
s2 = pd.Series(my_dictionary)

print(s2)

first     10
second    20
third     30
dtype: int64

We can then access specific elements in the Series by referring to its index label enclosed in quotes and brackets. This is very similar to how a dictionary works!

print(s0[0])

print(s1["a"])

print(s2["second"])

10
10
20

So, Series can be thought of as a more versatile version of base Python lists and dictionaries. They are one dimensional. They default to numerical indexing for labeling starting with index 0 (like lists), but have the capability to use labels as indexes as well (like dictionaries).

Multi-indexed Series

Series objects may have multiple levels of indices. We call this multi-indexed. Using layers of indexing is a way of representing two-dimensional data within a one-dimensional Series object. Some people really like using multi-indexed Series. You can create a multi-indexed series by passing a list of lists to the index argument of the pd.Series() function. The first list will be the outermost level of the index, the second list will be the next level, and so on.

my_index = [["California", "California", "New York", "New York", "Texas", "Texas"], 
            [2001, 2002, 2001, 2002, 2001, 2002]]
my_values = [1.5, 1.7, 3.6, 4.2, 3.2, 4.5]

s3 = pd.Series(my_values, index=my_index)

print(s3)
print("...")
print(s3.index) # an index is an ordered set of tuples

California  2001    1.5
            2002    1.7
New York    2001    3.6
            2002    4.2
Texas       2001    3.2
            2002    4.5
dtype: float64
...
MultiIndex([('California', 2001),
            ('California', 2002),
            (  'New York', 2001),
            (  'New York', 2002),
            (     'Texas', 2001),
            (     'Texas', 2002)],
           )

Retrieving an item from this data structure is similar to a nested dictionary, using successive [] notation. Or, you can pass it a tuple. You must pass the index labels in the order they were created (left to right)

print("Just printing California")
print(s3["California"])
print("...")

print("Just printing California 2001")
print(s3["California"][2001])
print("...")

print("Using a tuple to get California 2001")
print(s3[("California", 2001)])
print("...")

print("you can also use slicing to select multiple elements")
print(s3["California":"New York"])
print("...")

print("You can use .index.isin to search for values that match your index")
print(s3[s3.index.isin(["Texas", "New York"], level=0)])
print("...")

print("You can select inner levels of multi-indexed series by specifying level=")
print(s3[s3.index.isin([2001], level=1)])

Just printing California
2001    1.5
2002    1.7
dtype: float64
...
Just printing California 2001
1.5
...
Using a tuple to get California 2001
1.5
...
you can also use slicing to select multiple elements
California  2001    1.5
            2002    1.7
New York    2001    3.6
            2002    4.2
dtype: float64
...
You can use .index.isin to search for values that match your index
New York  2001    3.6
          2002    4.2
Texas     2001    3.2
          2002    4.5
dtype: float64
...
You can select inner levels of multi-indexed series by specifying level=
California  2001    1.5
New York    2001    3.6
Texas       2001    3.2
dtype: float64

In our work, we typically don't use multi-indexed Series. However, they are often the output of pandas functions, so it's good to know how to work with them. If you don't like the idea of multi-indexed Series, you can always convert them to a DataFrame using the reset_index() method.

s3.reset_index()

      level_0  level_1    0
0  California     2001  1.5
1  California     2002  1.7
2    New York     2001  3.6
3    New York     2002  4.2
4       Texas     2001  3.2
5       Texas     2002  4.5

pandas Series are useful for representing a single column of data and have several operations that can be performed on them. These operations include sorting, slicing, mathematical transformations, filtering, and more.

Series objects have built-in methods like .mean() and .std() that can be used to calculate statistics on the data. Data in Series objects can also be filtered using boolean indexing, like series_name[series_name > 0] to get all the values greater than 0. Just like lists, you can perform mathematical operations on Series objects, like addition, subtraction, multiplication, and division.

Exercise: CODE BOWLING. Using the multi-indexed Series below, we have found the most populous state in 2001 and printed it. Expand the 1 liner to at least three lines of code so that it is more readable.

my_index = [["California", "California", "New York", "New York", "Texas", "Texas"], 
            [2001, 2002, 2001, 2002, 2001, 2002]]
my_values = [1.5, 1.7, 3.6, 4.2, 3.2, 4.5]

s3 = pd.Series(my_values, index=my_index)

# make the below more readable
print(s3[s3.index.isin([2001], level=1)].sort_values(ascending=False).index[0][0])

# your code here

New York

Solution

my_index = [["California", "California", "New York", "New York", "Texas", "Texas"], 
            [2001, 2002, 2001, 2002, 2001, 2002]]
my_values = [1.5, 1.7, 3.6, 4.2, 3.2, 4.5]

s3 = pd.Series(my_values, index=my_index)

# make the below more readable
print(s3[s3.index.isin([2001], level=1)].sort_values(ascending=False).index[0][0])

# Get all values for the year 2001 across all states
pop_2001 = s3[s3.index.isin([2001], level=1)]

# Sort the values in descending order to find the most populous state
pop_2001_sorted = pop_2001.sort_values(ascending=False)

# Get the index (state name) of the most populous state in 2001
most_populous_2001_state = pop_2001_sorted.index[0][0]
print(most_populous_2001_state)

New York
New York

`pandas` DataFrame

Another pandas data structure is the DataFrame, which is really convenient for its ability to easily perform complex data transformations. This makes it a powerful tool for cleaning, filtering, and summarizing tabular data.

While a Series is a "one-dimensional" data structure, DataFrames are two-dimensional. Where Series can only contain one type of data, DataFrames can have a combination of numerical and categorical data. Additionally, DataFrames allow you do have labels for your rows and columns.

DataFrames are essentially a collection of Series objects. You can also think of python DataFrames as spreadsheets from Excel or dataframes from R. Since DataFrames are made of multiple Series, the operations we can do on Series are also available on DataFrames

Initializing a DataFrame

Let's manually create a simple dataframe in pandas to showcase their behavior. In the below code, we create a dictionary where the keys are the column names and the values are lists of data. We then pass this dictionary to the pd.DataFrame() function to create a DataFrame.

When we print the DataFrame, it will display as a table with the column names at the top and the data below. The index (in this case, automatically generated numerical index starting at 0) will be displayed on the left side of the table.

tournamentStats = {
    "wrestler": ["Terunofuji", "Ura", "Shodai", "Takanosho"],
    "wins": [13, 6, 10, 12],
    "rank": ["yokozuna", "maegashira2", "komusubi", "maegashira6"]
}

#Converting to a pandas DataFrame
sumo = pd.DataFrame(tournamentStats)

print(sumo)

     wrestler  wins         rank
0  Terunofuji    13     yokozuna
1         Ura     6  maegashira2
2      Shodai    10     komusubi
3   Takanosho    12  maegashira6

This looks very similar to how we initialize base Python dictionaries.

Pandas dataframes have many attributes, including shape, columns, index, dtypes. These are useful for understanding the structure of the dataframe.

print(sumo.shape)

print("...")
print(sumo.columns)

print("...")
print(sumo.index)

print("...")
print(sumo.dtypes)

(4, 3)
...
Index(['wrestler', 'wins', 'rank'], dtype='object')
...
RangeIndex(start=0, stop=4, step=1)
...
wrestler    object
wins         int64
rank        object
dtype: object

Pandas DataFrames also have the handy info() function that summarizes the contents of the dataframe, including counts of the non-null values of each column and the data type of each column.

sumo.info()


RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   wrestler  4 non-null      object
 1   wins      4 non-null      int64
 2   rank      4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes

Importing data to `pandas`

Earlier, we leared how to read and write data from and to files in Python. Now, we'll learn how to get data into our program from files using pandas functions and data structures. pandas natively reads data from tab-delimited files into DataFrames, which is very convenient.

Below you can see an example of how to read files into pandas using the pd.read_csv() function. The csv stands for 'comma-separated values', which means by defaults it will assume that our columns are separated by commas; if we wanted to change the delimiter (e.g. in the case of a tab-separated file), we can set the delimiter explicitly using the sep= argument.

penguins = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-28/penguins.csv", sep=',')

# The head() function from pandas prints only the first N lines of a dataframe (default: 10)
penguins.head()

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0
1  Adelie  Torgersen            39.5           17.4              186.0
2  Adelie  Torgersen            40.3           18.0              195.0
3  Adelie  Torgersen             NaN            NaN                NaN
4  Adelie  Torgersen            36.7           19.3              193.0

   body_mass_g     sex  year
0       3750.0    male  2007
1       3800.0  female  2007
2       3250.0  female  2007
3          NaN     NaN  2007
4       3450.0  female  2007

When importing data into a DataFrame, pandas automatically detects what data type each column should be. For example, if the column contains only numbers, it will be imported as an floating point or integer data type. If the column contains strings or a mixture of strings and numbers, it will be imported as an "object" data type. Below are the different data types for the penguins column.

penguins.info()


RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            344 non-null    object
 1   island             344 non-null    object
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object
 7   year               344 non-null    int64
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB

Looping through a dataframe

As a note, if we want to go through a dataframe line-by-line (i.e. row by row), because both the rows and columns are indexed it requires slightly more syntax than looping through other data structures (e.g. a dictionary or list). Specifically we need to use the .iterrows() method to make the data frame iterable. The .iterrows() method outputs each row as a Series object with a row index and the column:

for index, row in penguins.iterrows():
    print(f"Row index: {index}, {row['species']}, {row['island']}")

Row index: 0, Adelie, Torgersen
Row index: 1, Adelie, Torgersen
Row index: 2, Adelie, Torgersen
Row index: 3, Adelie, Torgersen
Row index: 4, Adelie, Torgersen
Row index: 5, Adelie, Torgersen
Row index: 6, Adelie, Torgersen
Row index: 7, Adelie, Torgersen
Row index: 8, Adelie, Torgersen
Row index: 9, Adelie, Torgersen
Row index: 10, Adelie, Torgersen
Row index: 11, Adelie, Torgersen
Row index: 12, Adelie, Torgersen
Row index: 13, Adelie, Torgersen
Row index: 14, Adelie, Torgersen
Row index: 15, Adelie, Torgersen
Row index: 16, Adelie, Torgersen
Row index: 17, Adelie, Torgersen
Row index: 18, Adelie, Torgersen
Row index: 19, Adelie, Torgersen
Row index: 20, Adelie, Biscoe
Row index: 21, Adelie, Biscoe
Row index: 22, Adelie, Biscoe
Row index: 23, Adelie, Biscoe
Row index: 24, Adelie, Biscoe
Row index: 25, Adelie, Biscoe
Row index: 26, Adelie, Biscoe
Row index: 27, Adelie, Biscoe
Row index: 28, Adelie, Biscoe
Row index: 29, Adelie, Biscoe
Row index: 30, Adelie, Dream
Row index: 31, Adelie, Dream
Row index: 32, Adelie, Dream
Row index: 33, Adelie, Dream
Row index: 34, Adelie, Dream
Row index: 35, Adelie, Dream
Row index: 36, Adelie, Dream
Row index: 37, Adelie, Dream
Row index: 38, Adelie, Dream
Row index: 39, Adelie, Dream
Row index: 40, Adelie, Dream
Row index: 41, Adelie, Dream
Row index: 42, Adelie, Dream
Row index: 43, Adelie, Dream
Row index: 44, Adelie, Dream
Row index: 45, Adelie, Dream
Row index: 46, Adelie, Dream
Row index: 47, Adelie, Dream
Row index: 48, Adelie, Dream
Row index: 49, Adelie, Dream
Row index: 50, Adelie, Biscoe
Row index: 51, Adelie, Biscoe
Row index: 52, Adelie, Biscoe
Row index: 53, Adelie, Biscoe
Row index: 54, Adelie, Biscoe
Row index: 55, Adelie, Biscoe
Row index: 56, Adelie, Biscoe
Row index: 57, Adelie, Biscoe
Row index: 58, Adelie, Biscoe
Row index: 59, Adelie, Biscoe
Row index: 60, Adelie, Biscoe
Row index: 61, Adelie, Biscoe
Row index: 62, Adelie, Biscoe
Row index: 63, Adelie, Biscoe
Row index: 64, Adelie, Biscoe
Row index: 65, Adelie, Biscoe
Row index: 66, Adelie, Biscoe
Row index: 67, Adelie, Biscoe
Row index: 68, Adelie, Torgersen
Row index: 69, Adelie, Torgersen
Row index: 70, Adelie, Torgersen
Row index: 71, Adelie, Torgersen
Row index: 72, Adelie, Torgersen
Row index: 73, Adelie, Torgersen
Row index: 74, Adelie, Torgersen
Row index: 75, Adelie, Torgersen
Row index: 76, Adelie, Torgersen
Row index: 77, Adelie, Torgersen
Row index: 78, Adelie, Torgersen
Row index: 79, Adelie, Torgersen
Row index: 80, Adelie, Torgersen
Row index: 81, Adelie, Torgersen
Row index: 82, Adelie, Torgersen
Row index: 83, Adelie, Torgersen
Row index: 84, Adelie, Dream
Row index: 85, Adelie, Dream
Row index: 86, Adelie, Dream
Row index: 87, Adelie, Dream
Row index: 88, Adelie, Dream
Row index: 89, Adelie, Dream
Row index: 90, Adelie, Dream
Row index: 91, Adelie, Dream
Row index: 92, Adelie, Dream
Row index: 93, Adelie, Dream
Row index: 94, Adelie, Dream
Row index: 95, Adelie, Dream
Row index: 96, Adelie, Dream
Row index: 97, Adelie, Dream
Row index: 98, Adelie, Dream
Row index: 99, Adelie, Dream
Row index: 100, Adelie, Biscoe
Row index: 101, Adelie, Biscoe
Row index: 102, Adelie, Biscoe
Row index: 103, Adelie, Biscoe
Row index: 104, Adelie, Biscoe
Row index: 105, Adelie, Biscoe
Row index: 106, Adelie, Biscoe
Row index: 107, Adelie, Biscoe
Row index: 108, Adelie, Biscoe
Row index: 109, Adelie, Biscoe
Row index: 110, Adelie, Biscoe
Row index: 111, Adelie, Biscoe
Row index: 112, Adelie, Biscoe
Row index: 113, Adelie, Biscoe
Row index: 114, Adelie, Biscoe
Row index: 115, Adelie, Biscoe
Row index: 116, Adelie, Torgersen
Row index: 117, Adelie, Torgersen
Row index: 118, Adelie, Torgersen
Row index: 119, Adelie, Torgersen
Row index: 120, Adelie, Torgersen
Row index: 121, Adelie, Torgersen
Row index: 122, Adelie, Torgersen
Row index: 123, Adelie, Torgersen
Row index: 124, Adelie, Torgersen
Row index: 125, Adelie, Torgersen
Row index: 126, Adelie, Torgersen
Row index: 127, Adelie, Torgersen
Row index: 128, Adelie, Torgersen
Row index: 129, Adelie, Torgersen
Row index: 130, Adelie, Torgersen
Row index: 131, Adelie, Torgersen
Row index: 132, Adelie, Dream
Row index: 133, Adelie, Dream
Row index: 134, Adelie, Dream
Row index: 135, Adelie, Dream
Row index: 136, Adelie, Dream
Row index: 137, Adelie, Dream
Row index: 138, Adelie, Dream
Row index: 139, Adelie, Dream
Row index: 140, Adelie, Dream
Row index: 141, Adelie, Dream
Row index: 142, Adelie, Dream
Row index: 143, Adelie, Dream
Row index: 144, Adelie, Dream
Row index: 145, Adelie, Dream
Row index: 146, Adelie, Dream
Row index: 147, Adelie, Dream
Row index: 148, Adelie, Dream
Row index: 149, Adelie, Dream
Row index: 150, Adelie, Dream
Row index: 151, Adelie, Dream
Row index: 152, Gentoo, Biscoe
Row index: 153, Gentoo, Biscoe
Row index: 154, Gentoo, Biscoe
Row index: 155, Gentoo, Biscoe
Row index: 156, Gentoo, Biscoe
Row index: 157, Gentoo, Biscoe
Row index: 158, Gentoo, Biscoe
Row index: 159, Gentoo, Biscoe
Row index: 160, Gentoo, Biscoe
Row index: 161, Gentoo, Biscoe
Row index: 162, Gentoo, Biscoe
Row index: 163, Gentoo, Biscoe
Row index: 164, Gentoo, Biscoe
Row index: 165, Gentoo, Biscoe
Row index: 166, Gentoo, Biscoe
Row index: 167, Gentoo, Biscoe
Row index: 168, Gentoo, Biscoe
Row index: 169, Gentoo, Biscoe
Row index: 170, Gentoo, Biscoe
Row index: 171, Gentoo, Biscoe
Row index: 172, Gentoo, Biscoe
Row index: 173, Gentoo, Biscoe
Row index: 174, Gentoo, Biscoe
Row index: 175, Gentoo, Biscoe
Row index: 176, Gentoo, Biscoe
Row index: 177, Gentoo, Biscoe
Row index: 178, Gentoo, Biscoe
Row index: 179, Gentoo, Biscoe
Row index: 180, Gentoo, Biscoe
Row index: 181, Gentoo, Biscoe
Row index: 182, Gentoo, Biscoe
Row index: 183, Gentoo, Biscoe
Row index: 184, Gentoo, Biscoe
Row index: 185, Gentoo, Biscoe
Row index: 186, Gentoo, Biscoe
Row index: 187, Gentoo, Biscoe
Row index: 188, Gentoo, Biscoe
Row index: 189, Gentoo, Biscoe
Row index: 190, Gentoo, Biscoe
Row index: 191, Gentoo, Biscoe
Row index: 192, Gentoo, Biscoe
Row index: 193, Gentoo, Biscoe
Row index: 194, Gentoo, Biscoe
Row index: 195, Gentoo, Biscoe

Row index: 196, Gentoo, Biscoe
Row index: 197, Gentoo, Biscoe
Row index: 198, Gentoo, Biscoe
Row index: 199, Gentoo, Biscoe
Row index: 200, Gentoo, Biscoe
Row index: 201, Gentoo, Biscoe
Row index: 202, Gentoo, Biscoe
Row index: 203, Gentoo, Biscoe
Row index: 204, Gentoo, Biscoe
Row index: 205, Gentoo, Biscoe
Row index: 206, Gentoo, Biscoe
Row index: 207, Gentoo, Biscoe
Row index: 208, Gentoo, Biscoe
Row index: 209, Gentoo, Biscoe
Row index: 210, Gentoo, Biscoe
Row index: 211, Gentoo, Biscoe
Row index: 212, Gentoo, Biscoe
Row index: 213, Gentoo, Biscoe
Row index: 214, Gentoo, Biscoe
Row index: 215, Gentoo, Biscoe
Row index: 216, Gentoo, Biscoe
Row index: 217, Gentoo, Biscoe
Row index: 218, Gentoo, Biscoe
Row index: 219, Gentoo, Biscoe
Row index: 220, Gentoo, Biscoe
Row index: 221, Gentoo, Biscoe
Row index: 222, Gentoo, Biscoe
Row index: 223, Gentoo, Biscoe
Row index: 224, Gentoo, Biscoe
Row index: 225, Gentoo, Biscoe
Row index: 226, Gentoo, Biscoe
Row index: 227, Gentoo, Biscoe
Row index: 228, Gentoo, Biscoe
Row index: 229, Gentoo, Biscoe
Row index: 230, Gentoo, Biscoe
Row index: 231, Gentoo, Biscoe
Row index: 232, Gentoo, Biscoe
Row index: 233, Gentoo, Biscoe
Row index: 234, Gentoo, Biscoe
Row index: 235, Gentoo, Biscoe
Row index: 236, Gentoo, Biscoe
Row index: 237, Gentoo, Biscoe
Row index: 238, Gentoo, Biscoe
Row index: 239, Gentoo, Biscoe
Row index: 240, Gentoo, Biscoe
Row index: 241, Gentoo, Biscoe
Row index: 242, Gentoo, Biscoe
Row index: 243, Gentoo, Biscoe
Row index: 244, Gentoo, Biscoe
Row index: 245, Gentoo, Biscoe
Row index: 246, Gentoo, Biscoe
Row index: 247, Gentoo, Biscoe
Row index: 248, Gentoo, Biscoe
Row index: 249, Gentoo, Biscoe
Row index: 250, Gentoo, Biscoe
Row index: 251, Gentoo, Biscoe
Row index: 252, Gentoo, Biscoe
Row index: 253, Gentoo, Biscoe
Row index: 254, Gentoo, Biscoe
Row index: 255, Gentoo, Biscoe
Row index: 256, Gentoo, Biscoe
Row index: 257, Gentoo, Biscoe
Row index: 258, Gentoo, Biscoe
Row index: 259, Gentoo, Biscoe
Row index: 260, Gentoo, Biscoe
Row index: 261, Gentoo, Biscoe
Row index: 262, Gentoo, Biscoe
Row index: 263, Gentoo, Biscoe
Row index: 264, Gentoo, Biscoe
Row index: 265, Gentoo, Biscoe
Row index: 266, Gentoo, Biscoe
Row index: 267, Gentoo, Biscoe
Row index: 268, Gentoo, Biscoe
Row index: 269, Gentoo, Biscoe
Row index: 270, Gentoo, Biscoe
Row index: 271, Gentoo, Biscoe
Row index: 272, Gentoo, Biscoe
Row index: 273, Gentoo, Biscoe
Row index: 274, Gentoo, Biscoe
Row index: 275, Gentoo, Biscoe
Row index: 276, Chinstrap, Dream
Row index: 277, Chinstrap, Dream
Row index: 278, Chinstrap, Dream
Row index: 279, Chinstrap, Dream
Row index: 280, Chinstrap, Dream
Row index: 281, Chinstrap, Dream
Row index: 282, Chinstrap, Dream
Row index: 283, Chinstrap, Dream
Row index: 284, Chinstrap, Dream
Row index: 285, Chinstrap, Dream
Row index: 286, Chinstrap, Dream
Row index: 287, Chinstrap, Dream
Row index: 288, Chinstrap, Dream
Row index: 289, Chinstrap, Dream
Row index: 290, Chinstrap, Dream
Row index: 291, Chinstrap, Dream
Row index: 292, Chinstrap, Dream
Row index: 293, Chinstrap, Dream
Row index: 294, Chinstrap, Dream
Row index: 295, Chinstrap, Dream
Row index: 296, Chinstrap, Dream
Row index: 297, Chinstrap, Dream
Row index: 298, Chinstrap, Dream
Row index: 299, Chinstrap, Dream
Row index: 300, Chinstrap, Dream
Row index: 301, Chinstrap, Dream
Row index: 302, Chinstrap, Dream
Row index: 303, Chinstrap, Dream
Row index: 304, Chinstrap, Dream
Row index: 305, Chinstrap, Dream
Row index: 306, Chinstrap, Dream
Row index: 307, Chinstrap, Dream
Row index: 308, Chinstrap, Dream
Row index: 309, Chinstrap, Dream
Row index: 310, Chinstrap, Dream
Row index: 311, Chinstrap, Dream
Row index: 312, Chinstrap, Dream
Row index: 313, Chinstrap, Dream
Row index: 314, Chinstrap, Dream
Row index: 315, Chinstrap, Dream
Row index: 316, Chinstrap, Dream
Row index: 317, Chinstrap, Dream
Row index: 318, Chinstrap, Dream
Row index: 319, Chinstrap, Dream
Row index: 320, Chinstrap, Dream
Row index: 321, Chinstrap, Dream
Row index: 322, Chinstrap, Dream
Row index: 323, Chinstrap, Dream
Row index: 324, Chinstrap, Dream
Row index: 325, Chinstrap, Dream
Row index: 326, Chinstrap, Dream
Row index: 327, Chinstrap, Dream
Row index: 328, Chinstrap, Dream
Row index: 329, Chinstrap, Dream
Row index: 330, Chinstrap, Dream
Row index: 331, Chinstrap, Dream
Row index: 332, Chinstrap, Dream
Row index: 333, Chinstrap, Dream
Row index: 334, Chinstrap, Dream
Row index: 335, Chinstrap, Dream
Row index: 336, Chinstrap, Dream
Row index: 337, Chinstrap, Dream
Row index: 338, Chinstrap, Dream
Row index: 339, Chinstrap, Dream
Row index: 340, Chinstrap, Dream
Row index: 341, Chinstrap, Dream
Row index: 342, Chinstrap, Dream
Row index: 343, Chinstrap, Dream

This can be slow for very large dataframes, but is useful if you need to perform actions on individual rows.

Before we dive into working with real data, we're going to create some DataFrames from scratch to better understand how they work.

Selecting data in a `pandas` dataframe

As with series objects, pandas dataframes rows and columns are explicitly indexed, which means that every row and column has a label associated with it. You can think of the explicit indices as the being the names of the rows and the names of the columns.

Unfortunately, in pandas the syntax for subsetting rows v.s. columns is different and can get a little confusing, so let's go through several different use cases.

Selecting columns

We can always check the names of the columns in a pandas dataframe byt using the built-in .columns method, which simply lists the column index:

sumo.columns

Index(['wrestler', 'wins', 'rank'], dtype='object')

If we want to refer to a specific column, we can specify its index (enclosed in double quotes) inside of square brackets [] like so:

#Single column:
sumo["wrestler"]

0    Terunofuji
1           Ura
2        Shodai
3     Takanosho
Name: wrestler, dtype: object

If we want to refer to multiple columns, we need to pass the columns as a list by enclosing the column indices in square brackets, so you will end up with double brackets:

#Multiple columns (note the double []!):
sumo[["wrestler", "rank"]]

     wrestler         rank
0  Terunofuji     yokozuna
1         Ura  maegashira2
2      Shodai     komusubi
3   Takanosho  maegashira6

Selecting rows:

The syntax for selecting specific rows is slightly different. Let's first check the labels of the row index; to do this we use the .index method:

print(sumo.index)

RangeIndex(start=0, stop=4, step=1)

Here we can see that while the column index labels were strings, the row index labels are numerical values, in this case 0 thru 3. If we wanted to pull out the first row, we need to specify its index label (0) in combination with the .loc method (which is required for rows):

sumo.loc[0]

wrestler    Terunofuji
wins                13
rank          yokozuna
Name: 0, dtype: object

If we want to select multiple rows, like with columns we need to pass it as a list using the double brackets. If we want to specify a range of rows (i.e. from this row to that row), we don't use double brackets and instead use ::

print(sumo.loc[[0,1]])

     wrestler  wins         rank
0  Terunofuji    13     yokozuna
1         Ura     6  maegashira2

print(sumo.loc[0:2])

     wrestler  wins         rank
0  Terunofuji    13     yokozuna
1         Ura     6  maegashira2
2      Shodai    10     komusubi

Note that in this case the row index labels are numbers, but do not have to be numerical, and can have string labels similar to columns. Let's show how we could change the row index labels by taking the column with the wrestler's rank and setting it as the index label (note that the labels should be unique!):

sumo = sumo.set_index("rank")

print(sumo)

               wrestler  wins
rank
yokozuna     Terunofuji    13
maegashira2         Ura     6
komusubi         Shodai    10
maegashira6   Takanosho    12

sumo.loc["yokozuna"]

wrestler    Terunofuji
wins                13
Name: yokozuna, dtype: object

We also need to use .loc if we are referring to a specific row AND column, e.g.:

print(sumo.loc["komusubi", "wrestler"])

Shodai

If we want to purely use numerical indexing, we can use the .iloc() method. If you use .iloc(), you can index a DataFrame using just the coordinates of the cells (but remember to begin counting from 0).

# Select the first two rows and the first two columns

sumo.iloc[0:2, 0:2]

               wrestler  wins
rank
yokozuna     Terunofuji    13
maegashira2         Ura     6

There are many ways to select subsets of a dataframe. The rows and columns of a dataframe can be referred to either by their integer position or by their indexed name. Typically, for columns, you'll use the indexed name and can just do [] with the name of the column. For rows, if you want to use the integer position, you will use .iloc[]. If you want to use the index name, you will use .loc[].

For reference, here's a handy table on the best ways to index into a dataframe:

Action	Named index	Integer Position
Select single column	`df['column_name']`	`df.iloc[:, column_position]`
Select multiple columns	`df[['column_name1', 'column_name2']]`	`df.iloc[:, [column_position1, column_position2]]`
Select single row	`df.loc['row_name']`	`df.iloc[row_position]`
Select multiple rows	`df.loc[['row_name1', 'row_name2']]`	`df.iloc[[row_position1, row_position2]]`

Exercise: We'll use the penguins dataset from our initial example. 1) Print the 'species' column 2) Print the first five columns and first five rows 3) Print the columns "species", "island", and "sex" and the first ten rows of the dataframe

penguins.info()


RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            344 non-null    object
 1   island             344 non-null    object
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object
 7   year               344 non-null    int64
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB

# 1. Print the "species" column

# 2. Print the first 5 columns and first five rows

# 3. Print the columns "species", "island", and "sex" and the first 10 rows of the dataframe

Solution

# 1. Print the "species" column
print(penguins['species'])
print("...")

# 2. Print the first 5 columns and first five rows
print(penguins.iloc[0:5,0:5])
print("...")

# 3. Print the columns "species", "island", and "sex" and the first 10 rows of the dataframe
print(penguins.loc[0:10, ['species', 'island', 'sex']])

0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
         ...
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object
...
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm
0  Adelie  Torgersen            39.1           18.7              181.0
1  Adelie  Torgersen            39.5           17.4              186.0
2  Adelie  Torgersen            40.3           18.0              195.0
3  Adelie  Torgersen             NaN            NaN                NaN
4  Adelie  Torgersen            36.7           19.3              193.0
...
   species     island     sex
0   Adelie  Torgersen    male
1   Adelie  Torgersen  female
2   Adelie  Torgersen  female
3   Adelie  Torgersen     NaN
4   Adelie  Torgersen  female
5   Adelie  Torgersen    male
6   Adelie  Torgersen  female
7   Adelie  Torgersen    male
8   Adelie  Torgersen     NaN
9   Adelie  Torgersen     NaN
10  Adelie  Torgersen     NaN

Learning to read documentation

Before we dive deeper into learning about Pandas dataframes, we will learn how to read documentation. This is because libraries such as pandas have many features that we will not have the time to cover in detail. Instead, if you can read documentation efficiently, you can learn how to use these libraries on your own.

Programming effectively actually involves a lot of reading

Programming involves reading primarily documentation, but also code, search results, stackexchange queries, etc. These are just a few examples of what you'll read as you work on code. Reading the documentation of a package or library or software that you are using should probably be the first thing you do when you start using it. However, software docs pages are a much different sort of writing than we may be used to, if we're primarily used to reading journal articles, textbooks, and protocols. Knowing how and how much to read documentation is a skill that needs to be developed over time to suit your own needs. There's definitely no need to read every single page of documentation of a piece of software, especially for large libraries like pandas or seaborn.

There are a variety of ways software can be documented

You may be handed a single script from a colleague to perform some action and that script may have comments in the code detailing what it does or what certain lines do. Individual functions may have what is called a docstring, which is a string that occurs immediately after the function definition detailing how do use that function, inputs, and outputs. Another type of documentation is a docs page or API reference on a website for that software, such as the page for the seaborn's scatterplot function. Many software packages also have some introductory pages like vignettes or tutorials that guide you through the basics of the software. The Getting started tutorials of Pandas is a good example of this.

What documentation are we meant to read?

In general, documentation is meant to be a reference manual more than a textbook. A lot of documentation is really repetitive, because it has to exhaustively cover every single function, class, and use-case available to the user. I do not recommend reading documentation like a book or in any linear way. That's like learning a foreign language by reading the dictionary. For example, pandas has a variety of mathematical functions , but you are not required to look at the doc page of each of those. It is enough to know that it exists and when you do want to use a particular one, to check the page of that specific function. The most important parts of the documentation to read first are the tutorials/user guides, which introduce the basic functionality of the software with some example code. Often times, this code is exactly what you need to get started. If you get stuck, then it's time to read the docs pages for the specific commands you are using.

Anatomy of a docs page

Scientific articles typically have the same sections: Introduction, Methods, Results/Discussion. Similarly, docs pages for a function should all have some common components:

Function name and how to call it
- parameters in parentheses with any defaults showing
- positional parameters first, keyword parameters after asterisk
Description of function
Detailed parameters that can be passed to each function
- type of object that can be passed
- description of what the parameter does
Returns
- type of object(s) returned
- description of the object
Examples

Just the basics

If this is your first time encountering the function, glance at the function name and description and then go directly to the examples. This will help you understand if this function does what you think it does and give you a template to use it.

Exercise: You're looking for a way to replace NaN values with a specific dummy value. Which of the following functions is best suited for this task?

pandas.DataFrame.dropna() docs

pandas.DataFrame.fillna() docs

pandas.DataFrame.replace() docs

Solution

pandas.DataFrame.fillna() is the best suited for this task. This is a function created specifically to fill NaN values with a specified value. While you could use the .replace() function to replace NaN values, fillna() is more straightforward for this task. Meanwhile, dropna() only removes rows or columns with NaN values, which is not what we want.

Troubleshooting

Looking at a docs page is helpful for troubleshooting certain errors. In this next exercise, you will have to take a closer look at the parameters section of the docs page to find the solution.

Exercise: You have a dataframe where some numeric values are stored as strings, and you want to convert them to numeric values. Sometimes you also have non-numeric values stored in numeric columns, that you want to convert to NaN. However, when you try to use the pandas.to_numeric() function, you get an error. What could be the issue? How would you fix it? docs

my_series = pd.Series(["1", "2", "missing", "3"])

pd.to_numeric(my_series)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File pandas/_libs/lib.pyx:2407, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "missing"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[54], line 3
      1 my_series = pd.Series(["1", "2", "missing", "3"])
----> 3 pd.to_numeric(my_series)

File /opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/pandas/core/tools/numeric.py:235, in to_numeric(arg, errors, downcast, dtype_backend)
    233 coerce_numeric = errors not in ("ignore", "raise")
    234 try:
--> 235     values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]
    236         values,
    237         set(),
    238         coerce_numeric=coerce_numeric,
    239         convert_to_masked_nullable=dtype_backend is not lib.no_default
    240         or isinstance(values_dtype, StringDtype)
    241         and values_dtype.na_value is libmissing.NA,
    242     )
    243 except (ValueError, TypeError):
    244     if errors == "raise":

File pandas/_libs/lib.pyx:2449, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "missing" at position 2

Solution

my_series = pd.Series(["1", "2", "missing", "3"])

pd.to_numeric(my_series, errors="coerce")

0    1.0
1    2.0
2    NaN
3    3.0
dtype: float64

Exploring

If you have data you want to import into pandas, it is worth reading the User Guide section on IO tools to see how to import different formats. Here you might find that if you have a large dataset that you access frequently, you can save it in a format that is optimized for speed, such as parquet or pickle. If you want to minimize the size of the file, you can save it as a compressed pickle format.

Python intensive, part 4

Introduction

What is Python and why do we need it/why are you taking this workshop?

Data analysis in Python

Jupyter basics

Python review

Object Methods

Object Attributes

Base Python Data Structures

Lists

Dictionaries

Importing libraries

Reading and writing data

Vocab

Reading files with Python

Reading data line by line

Writing data line by line

Pandas

pandas series

Initializing a series

Multi-indexed Series

pandas DataFrame

Initializing a DataFrame

Importing data to pandas

Looping through a dataframe

Selecting data in a pandas dataframe

Selecting columns

Selecting rows:

Learning to read documentation

Anatomy of a docs page

`pandas` series

`pandas` DataFrame

Importing data to `pandas`

Selecting data in a `pandas` dataframe