Whirlwind Tour of Python¶
Welcome to day 1 of our workshop!
Getting started: Installation¶
You can run this notebook by uploading it to Google Colab. You can also run it on your own computer by installing Jupyter Notebook and either running jupyter in your browser or using an IDE like VSCode.
To run this using Google Colab, go to https://colab.research.google.com/ and upload this notebook. That's it!
To run this locally, you will need to install python, anaconda, and the necessary libraries.
- If you are on Windows, install WSL. Once WSL is installed, you'll have a Linux terminal available to you in Windows. You can open this terminal by typing "wsl" in the search bar and clicking the app that appears. You'll also find your Linux distribution as a mounted drive in your file explorer.
- Install mamba, a package manager, using the command line (Terminal for Mac or WSL for Windows).
  - For Mac, if you already have brew installed, install mamba using
    brew install miniforge
    and initialize it using
    conda init zsh
    Then restart your terminal. If you don't have homebrew (i.e. the brew command doesn't exist), install brew first using
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  - For Windows, download the Linux (x86_64) installer from the miniforge repository here and install with
    bash Miniforge3-Linux-x86_64.sh
- Create a new environment using mamba with
  mamba create -n pyworkshop numpy pandas matplotlib seaborn jupyter
  and activate it with
  conda activate pyworkshop
- You can now run the jupyter notebook by typing
  jupyter notebook
  in the terminal. This will open a browser window with the jupyter notebook interface. You can navigate to the folder where you saved this notebook and open it.
- Alternatively, install VSCode and the Python extension. Then open this notebook in VSCode and run it with the kernel that belongs to the pyworkshop environment. How to guide here
What is python and why do we need it/why are you taking this workshop?¶
- Python is a programming language that is commonly used for data analysis
- R is another commonly used programming language, probably second to python
- Python is more general purpose than R, which is specifically for data analysis
- Programming languages are a way for humans to give the computer commands
- Regardless of how you collect data, it needs to be analyzed and code is the best way (Don't use excel!)
Jupyter basics¶
Jupyter notebooks are made up of cells of either markdown or code. This cell is a markdown cell, which means it's plaintext that has added formatting. Double clicking a cell allows you to edit it. To run a cell, you can press SHIFT+ENTER while having it selected, or press the Run button at the top of the cell (exact location depends on the editor you're using). When you run this markdown cell, it will render the formatting. The below cell is a code cell, which means when you run it, the code within will be executed and any output will be printed below. Because of the way we set up the kernel (the engine that executes code in a notebook), the code blocks will be running python code. Run the cells below.
# Run this cell to import the libraries we'll be using
# If you don't have the kernel loaded or installed, it will not work
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Run this cell to print a message to the screen
print("this is my code cell")
this is my code cell
Any variables that you assign in one cell will be available in other cells. However, they are not saved between sessions. If you close the notebook and reopen it, you will need to rerun the previous cells to get them back. Therefore, it's important to be aware of the state of your notebook and the order in which your cells were run.
my_string = "this is my code cell"
print(my_string)
this is my code cell
Jupyter notebooks can be exported to pdf or html, so that other people can view both the code and its output. It's a good format for handing in homeworks, for example, since you can show your work. In this notebook, there will be exercises with placeholders for the code that you will have to fill in. For these exercises, we encourage you to work with each other, use google, LLMs, and whatever other resources if you are stuck. It's not an exam, but just a way to get practice of the concepts. Afterwards, we will post the completed notebook on our website so you can have examples of solutions.
Python basics¶
Strings¶
In programming, we store values and objects in variables. The data that we work with are values or objects and can be things like tables or single numbers or sentences, etc. Variables are the names we assign to those objects. Our first topic is going to be about the main data types you'll encounter: numbers, text, and booleans. Text in programming is called a string. Strings are encoded in quotes (either single or double).
In python we use = to assign a value/object to a variable. Run the cell below to assign the value "this is my string" to the variable my_string.
my_string = "this is my string"
print(my_string)
this is my string
If you need to include a quotation mark inside a string, you need to "escape" the quote with a \ backslash. If you want to add a new line in a string, you can use \n so that the string is printed on multiple lines. For large blocks of text, you can use triple quotes, """...""", and the line breaks you type inside them are kept, so you won't have to add \n yourself.
print("line one \n line two \nand a \"quotation\"")
line one line two and a "quotation"
print("""
first line
second line
""")
first line second line
Strings are indexed, meaning you can slice them and get sections of them. Python subscripting is always 0-indexed, meaning the first element is at index 0. As we'll see when we cover lists, you can use [] to specify portions of the string, such as a specific position, or use : to specify an interval. Using : is called slicing. When you slice, the beginning of the slice is included, but the end isn't. So [1:5] will get you the characters at indices 1 through 4, i.e. the 2nd through 5th characters. You can even skip characters during slicing by adding a second colon followed by a step size. So [::2] will get you every other character from the string. Indices can also be negative numbers, which count from the end of the object. For example, [-1] gets the last character of the string.
print(my_string[0:4]) # this will print "this"
print(my_string[0:10:2]) # this will print "ti sm", every other character from 0 to 10
print(my_string[-3:]) # this gets the last 3 characters
this ti sm ing
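To make the slicing rules concrete, here is a small sketch on a different string. The variable word and its value are made up for this illustration; the comments show which indices each slice picks out.

```python
# Hypothetical example string; its indices run from 0 to 7
word = "workshop"

first_char = word[0]         # "w"
middle = word[1:5]           # "orks": indices 1, 2, 3, 4 (index 5 is excluded)
every_other = word[::2]      # "wrso": indices 0, 2, 4, 6
last_three = word[-3:]       # "hop"
reversed_word = word[::-1]   # "pohskrow": a step of -1 walks backwards

print(first_char, middle, every_other, last_three, reversed_word)
```

A quick way to remember the end rule: the length of word[a:b] is b - a, so word[1:5] has 4 characters.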
Strings can be concatenated together using the +
operator and repeated using the *
operator.
my_string + " and another string" + 3 * " times 3"
'this is my string and another string times 3 times 3 times 3'
You will get an error if you try to combine string and numeric types.
%%script false --no-raise-error
my_string + 3
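If you do want to put a number inside a string, one option is to convert it to text first with the built-in str() function. A minimal sketch (the variable count is made up for this example):

```python
my_string = "this is my string"
count = 3  # hypothetical number we want to attach to the text

# str() turns the number into text so that + can concatenate it
combined = my_string + " number " + str(count)
print(combined)  # this is my string number 3
```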
Another important characteristic of strings is that they are immutable, meaning you can't change the value of a string. You need to make a new string instead.
%%script false --no-raise-error
my_string[1] = "T" # this will give an error because strings are immutable
Numerical values¶
Besides strings, we often work with numerical values. Numerical values in python come in a variety of flavors, including int for integer and float for floating-point decimal.
Run the code below to assign the int 5 to the variable x.
x = 5
We can then use x in place of 5 in arithmetic operations. Python supports basic arithmetic operations such as +, -, *, /, and ** for exponentiation.
print(x + 5)
print(x)
10 5
print(x ** 2)
print(x)
25 5
Notice that the value of x remained the same after each operation. If you want to modify the value of x you need to reassign it.
x = x + 5
print(x)
10
A common operator that combines assignment and arithmetic is +=. This adds the value on the right to the variable on the left and changes it to the new value. This type of operator is also available for subtraction -=, multiplication *=, and division /=.
x += 5
print(x)
x /= 3
print(x)
15 5.0
Some other helpful operations include:
- // divides and then rounds down (floor division)
- % is the modulo operator and gives the remainder of a division
- ** is the exponentiation operator
- == checks if two values are equal
- != checks if two values are not equal
- > and < check if one value is greater or less than another
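Here is a short sketch of these operators in action. The numbers are arbitrary; note in particular that floor division rounds down, which for negative numbers means away from zero.

```python
quotient = 7 // 2          # 3
remainder = 7 % 2          # 1
negative_floor = -7 // 2   # -4, because floor division rounds down
power = 2 ** 10            # 1024

# Comparison operators return booleans
print(7 == 7.0)   # True: an int and a float with the same value are equal
print(7 != 8)     # True
print(3 < 2)      # False
print(quotient, remainder, negative_floor, power)
```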
For other mathematical operators, you can import the math library and use the functions within it.
from math import pi, sqrt
print(pi)
print(sqrt(x))
3.141592653589793 2.23606797749979
Booleans¶
A final data type that we'll cover is the boolean. Booleans in python look like True
or False
(case sensitive). Notice these are words, but there are no quotes around them. That means python recognizes them as special values. Therefore, make sure that when you are naming variables, you don't use these words as variable names.
type(True)
bool
Strings, numbers, and booleans are examples of data types.
One of the most important concepts in programming is to always know what data type you are working with. This is because the operations you can perform on a variable depend on its data type and if you have the wrong data type, the program or function you're using may have unexpected behavior. Most likely, it will result in an error, but the error message may not be helpful. If you do get an error, the first thing to check is if the data you gave the function is formatted correctly and is the correct type.
If you're ever unsure of a variable's type, you can use the type()
function to check.
type(my_string)
str
type(x)
float
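As a small illustration of why the data type matters, the sketch below applies the same operator to a string and to a number. The variable names are made up for this example. Nothing raises an error here, which is exactly why this kind of mix-up can be hard to spot.

```python
text_five = "5"     # a string that happens to look like a number
number_five = 5     # an actual int

repeated = text_five * 2      # "55": * repeats a string
doubled = number_five * 2     # 10: * multiplies a number
print(repeated, doubled)
print(type(text_five), type(number_five))

# int() converts number-like text into an int so the arithmetic behaves as expected
converted = int(text_five) * 2
print(converted)  # 10
```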
Objects¶
Python is an object oriented language, and everything in python is an object. Objects have attributes and methods. Attributes are things that describe the object, and methods are things that the object can do. For example, a string object has the method upper(). Note that a string's length is not stored as an attribute; you get it with the built-in len() function. Below, you can see some examples of built-in methods for strings.
my_string = "this is my string"
print(my_string.upper()) # this makes everything uppercase
print(my_string.capitalize()) # This capitalizes the first letter
print(my_string.replace("this", "where")) # this replaces "this" with "where"
print(len(my_string)) # this gets the length of the string
print(my_string) # note that none of these methods changed the original string, rather it returned a new string
THIS IS MY STRING This is my string where is my string 17 this is my string
Numerical objects have attributes and methods as well. For example, the simple variable x that we created above has the method is_integer()
, which returns True if the number is an integer and False if it is not.
x.is_integer()
True
Question: What value did
x.is_integer()
return? What data type?
How do you know what methods and attributes an object has? You can use the dir() function to list them all. You will also see a bunch of internal methods, which are the ones with the double underscores. These "dunder" methods control how the object behaves when other functions and operators interact with it, such as what len() returns when it is called on the object. For our purposes, we don't have to worry about those.
print(dir(my_string))
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
One of a string's methods is .format()
, which allows you to insert variables into a string. Run the code block below and observe how I can use the .format()
method directly on a string object. I did not need to assign the string to a variable first.
print("The value of x is: {}".format(x))
The value of x is: 5.0
Another way you might see formatting is with an f-string. This is a string that has an f
before the quotes and then you can insert variables directly into the string by putting them in curly braces.
print(f"This is a formatted string with x equal to {x}")
This is a formatted string with x equal to 5.0
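Both .format() and f-strings also let you control how a number is displayed by adding a format specifier after a colon inside the curly braces. A minimal sketch, using a made-up radius value:

```python
from math import pi

radius = 2.5                 # hypothetical value for this example
area = pi * radius ** 2

print(f"Area: {area}")       # full precision
formatted = f"{area:.2f}"    # .2f means: fixed-point with 2 decimal places
print(f"Area rounded for display: {formatted}")   # 19.63

# The same specifier works with .format()
same_thing = "{:.2f}".format(area)
```

The specifier only changes how the number is printed; the value stored in area is unchanged.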
For a full list of python's built-in types and their methods, you can go to the python documentation and navigate to the type of the object you're working on.
Functions¶
We've mentioned functions several times now, but what exactly are they? Functions are pieces of code referred to by a name that take some input, called arguments, and do something with them, such as returning some other value. Most functions are identified by the pair of parentheses () after their name. The arguments, if any, are placed inside the parentheses. When encountering a new function, it's important to understand the inputs and outputs to use it properly. Functions that belong to an object are called methods.
You can get help on any function by running help(function_name)
.
help(my_string.upper)
Help on built-in function upper: upper() method of builtins.str instance Return a copy of the string converted to uppercase.
help(min)
Help on built-in function min in module builtins: min(...) min(iterable, *[, default=obj, key=func]) -> value min(arg1, arg2, *args, *[, key=func]) -> value With a single iterable argument, return its smallest item. The default keyword-only argument specifies an object to return if the provided iterable is empty. With two or more arguments, return the smallest argument.
Exercise: Fill in the code below to use the my_string variable to print "This is your string" instead of "This is my string". Remember, strings are immutable, but you can create new strings from old ones. Don't forget to try and see if any of the class methods can help you! There are many ways to solve this problem. See string methods
# How would you change the string from "This is my string" to "This is your string"?
# Your solution should use the my_string in some way
my_string = "This is my string"
# Your code here
# Solution 1: use indexing and concatenation
new_string = my_string[:8] + "your" + my_string[10:]
print(new_string)
# Solution 2: use replace
new_string = my_string.replace("my", "your")
print(new_string)
# Solution 3: use split and join (chaining two methods)
new_string = "your".join(my_string.split("my"))
print(new_string)
This is your string This is your string This is your string
Python has a small number of built-in functions that are always available. They are listed here in the python documentation. Some functions you may encounter in the future are abs()
, dir()
, help()
, len()
, max()
, min()
, print()
, range()
, round()
, sum()
, and type()
.
Writing functions¶
Functions are created by using the def
keyword, followed by the name of the function and arguments in parentheses and a colon ():
. The "body" of the function follows below and is where you put the code you want to execute. Importantly, python uses indents/whitespace to delineate sections of code. So whereas in other languages you might use braces or keywords to define the beginning and end of the function body, python indents the whole block instead. Another component of the function is the return
keyword, which is used to specify the output and also ends the function. Note, that you can include print
statements within your function that are separate from the object that is returned.
def area_triangle(base, height):
    """Returns the area of a triangle if you know the base and height"""
    print("This is printed when the function runs")
    return 0.5 * base * height
area_triangle(4,4)
This is printed when the function runs
8.0
You can assign the return value of a function to a variable to refer to it later. In that case, the return will not be printed to screen.
my_area = area_triangle(42, 41)
This is printed when the function runs
print(my_area)
861.0
Exercise: Write a function called num_words that takes a string and returns the number of words in the string. You can use the split() method to split the string into a list of words.
# Your code here
def num_words(sentence):
    return len(sentence.split())
# This should print 7
print(num_words("This is a sentence with 7 words"))
7
Look at the two versions of the function below. What is the difference?
def square(x):
    y = x * x
    return y
square(10)
100
def square_print(x):
    y = x * x
    print(y)
square_print(10)
100
The difference is that one returns a value and the other prints it to the console. Printing is not the same as returning. Printing sends output to "StdOut" (the screen), and that output cannot be saved to a variable later on. If you only print and don't return, None will be returned instead.
my_num = square_print(10)
print("The result of {} squared is {}.".format(10, my_num))
100 The result of 10 squared is None.
Once a return statement is executed, the function ends. A function can contain more than one return statement, but only the first one reached is executed, and no code after it runs in that call. It is, however, possible to return more than one value. You simply need to separate the values you want to return with a comma. See below for an example of a function that returns both the original input and the calculated result.
def square_return(x):
    y = x * x
    return x, y
# avoid naming a variable "input", which would shadow python's built-in input() function
original, squared = square_return(10)
print("The result of {} squared is {}.".format(original, squared))
The result of 10 squared is 100.
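Under the hood, a function that returns several comma-separated values hands back a single tuple, which you can either keep whole or unpack into separate variables. A small sketch with a made-up helper, min_and_max:

```python
def min_and_max(values):
    """Hypothetical helper: return the smallest and largest item."""
    return min(values), max(values)

# Keep the result whole: it is a tuple
result = min_and_max([3, 1, 2])
print(result)         # (1, 3)
print(type(result))   # <class 'tuple'>

# Or unpack it into one variable per returned value
smallest, largest = min_and_max([3, 1, 2])
print(smallest, largest)  # 1 3
```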
Conditionals¶
In this section, we'll learn how to use conditional statements to control the flow of our functions. Conditionals allow the function to make decisions based on the input or other results, or to perform the same operation on a group of inputs. The first conditional we'll explore is the if statement. The if statement will execute something if a condition is fulfilled. The syntax of the if statement is similar to that of functions. First, you start with the keyword if, then the condition to be checked (this should be a boolean evaluation), and then a colon. The code to be executed if the condition is true is then indented. If the condition isn't true, you can use the else keyword to execute a different set of code.
def is_even(x):
    if x % 2 == 0:
        return True
    else:
        return False
is_even(121)
False
If you have multiple conditions to test, you can use elif
(short for else if) to add conditions. You can then have the else
catch anything that didn't meet the previous conditions. The next exercise demonstrates how you can use conditionals as a rudimentary error check.
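As an illustration of chaining conditions, here is a minimal sketch with a made-up function, describe_temperature, and arbitrary cutoffs. Conditions are checked from top to bottom and only the first one that is true runs.

```python
def describe_temperature(celsius):
    """Hypothetical helper: label a temperature given in Celsius."""
    if celsius < 0:
        return "freezing"
    elif celsius < 15:    # only checked if celsius >= 0
        return "cold"
    elif celsius < 25:    # only checked if celsius >= 15
        return "mild"
    else:                 # everything that is 25 or above
        return "hot"

print(describe_temperature(-5))   # freezing
print(describe_temperature(20))   # mild
```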
Exercise: Write a function safe_divide that takes two numbers. Write an elif line to test if the second number is zero. If the second number is zero, print "can't divide by zero!" and return None. If it's not zero, return the result of dividing the first number by the second number.
def safe_divide(x, y): # keep this line
    if type(x) not in [int, float] or type(y) not in [int, float]: # keep this line
        return "Inputs must be numerical" # keep this line
    # Your code here
    elif y == 0:
        print("can't divide by zero!")
        return None
    else:
        return x / y
You can also use consecutive if
statements in your functions. This will test each condition in order, regardless of the outcome of the previous condition.
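To see how this differs from elif, here is a sketch with a made-up function, describe_number. Because each if stands on its own, more than one of them can run for the same input.

```python
def describe_number(n):
    """Hypothetical helper: every condition is checked, so several can apply."""
    description = ""
    if n > 0:
        description += "positive "
    if n % 2 == 0:
        description += "even "
    if n > 100:
        description += "large "
    return description.strip()   # strip() removes the trailing space

print(describe_number(4))     # positive even
print(describe_number(7))     # positive
print(describe_number(200))   # positive even large
```

Had the second and third conditions been written with elif, describe_number(4) would have stopped after "positive".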
Exercise: Write a function (temp_convert) that takes two arguments, a number temp and a string convert_to. If the string is "F", convert the number from Celsius to Fahrenheit. If the string is "C", convert the number from Fahrenheit to Celsius. If the string is anything else, print "Invalid input" and return None.
# Your code here
# Tip: you can give a function's arguments a default value, so that the argument
# can be left out when the function is called. In the below function, `convert_to` is set to "F" by default.
def temp_convert(temp, convert_to = "F"): # keep this line
    if convert_to == "F":
        return temp * 9 / 5 + 32
    if convert_to == "C":
        return (temp - 32) * 5 / 9
    else:
        print("Invalid input")
        return None
Loops¶
Loops are a way to repeat the same command multiple times. One common use for a loop is to create a function that accumulates, i.e. a counter that increases until some sequence is exhausted. The syntax of a loop is "for variable in iterable:". The iterable is a collection of items, and the variable takes the value of each item in turn, one per iteration of the loop. See the below function for an example:
# this function counts the total number of characters in a list of strings
# by adding each individual string's length to "c" in a for loop
def count_len(string_list):
    c = 0
    for string in string_list:
        c += len(string)
    return c
count_len(["this","is","my","list","of","strings"])
21
Another way to iterate is to create your own iterator. Let's say I want to run something for 10 steps, I can write:
for i in range(0,10):
    print(i)
0 1 2 3 4 5 6 7 8 9
If I don't care about i, python convention is to use an underscore as the variable name to signal that its value is not used.
for _ in range(0,5):
    print("hi")
hi hi hi hi hi
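The two ideas above, accumulating and range(), can be combined. The sketch below adds up the squares of 1 through 5 and also shows the optional third argument of range(), the step size. The numbers are arbitrary.

```python
# Accumulate the squares of 1, 2, 3, 4, 5 (the stop value 6 is excluded)
total = 0
for i in range(1, 6):
    total += i * i
print(total)  # 55

# range(start, stop, step): every second number from 0 up to, but not including, 10
evens = list(range(0, 10, 2))
print(evens)  # [0, 2, 4, 6, 8]
```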
We will cover loops in more detail in the next workshop.
Python Data Structures¶
Lists¶
Now that we've grasped data types, let's learn about data structures. Data structures are how collections of data are stored. We'll start with the most simple and flexible data structure, the list
. Python lists are 1-dimensional collections of objects. They can contain any type of object, and can contain a mixture of different types. Lists are created using the []
and are a flexible way to manually define an ordered series of things.
my_list = [1, 2, 3, 4, 5]
You can index into lists and slice them using the []
operator much like with strings.
print(my_list[2:]) # third item to the end
print(my_list[-2:]) # last two elements
[3, 4, 5] [4, 5]
You can check if objects are in the list, concatenate lists together, and other methods.
print(4 in my_list)
print(my_list + [3]) # this will not change my_list
my_list.append(100) # this will change my_list
print(my_list)
True [1, 2, 3, 4, 5, 3] [1, 2, 3, 4, 5, 100]
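Unlike strings, lists are mutable, so you can change, add, and remove items in place. A minimal sketch with a made-up list of colors:

```python
colors = ["red", "green", "blue"]   # hypothetical example list

colors[0] = "orange"      # item assignment works on lists (it fails on strings)
colors.append("purple")   # add an item to the end
print(colors)             # ['orange', 'green', 'blue', 'purple']
print(len(colors))        # 4

removed = colors.pop()    # remove and return the last item
print(removed)            # purple
print(colors)             # ['orange', 'green', 'blue']
```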
For the most part, we won't be doing anything fancy with lists beyond creating them and iterating through them with indices. Because they are a flexible data structure, they are not very efficient and many of the things we might want to do with a collection of items are better suited for other data types.
Exercise:
Write a function that takes a list of numbers and returns the sum of the numbers and also uses string interpolation to print the sum. For example, the input [1, 2, 3] should print "Your sum is 6" and return 6. Bonus: use conditionals to sum only the numbers, in case the list has strings or other objects in it.
# Your code here
# Solution 1 no checking
def sum_list(my_list):
    my_sum = sum(my_list)
    print("Your sum is {}".format(my_sum))
    return my_sum
# Solution 2
def sum_list(my_list):
    my_sum = 0
    for item in my_list:
        if type(item) == int or type(item) == float: # can also use isinstance(item, (int, float))
            my_sum += item
    print("Your sum is {}".format(my_sum))
    return my_sum
sum_list([1, 1, 1, 1])
Your sum is 4
4
sum_list([1, 1, 1, "a"])
Your sum is 3
3
Dictionaries¶
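Dictionaries are collections of key-value pairs. Values are looked up by their key rather than by their position (since python 3.7, dictionaries do remember the order in which items were inserted). They are created using {} or with the dict() function and are a way to store things for quick retrieval with keywords.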
Below is an example of a simple dictionary of names and grades.
grades = {"Alice": 100, "Bob": 90, "Charlie": 80}
We can make it more complex by nesting additional information, like a dictionary of dictionaries.
grades = {"Alice": {"math": 100, "english": 90}, "Bob": {"math": 90, "english": 80}}
Dictionaries can hold a variety of data types, including lists and other dictionaries.
pets = {
"Buddy": {
"name": "Buddy",
"breed": "Bulldog",
"age": 4,
"vaccinated": True,
"favorite_foods": ["chicken", "peanut butter"],
},
"Mittens": {
"name": "Mittens",
"breed": "Persian cat",
"age": 2,
"vaccinated": False,
"favorite_foods": ["tuna", "chicken"],
"owner": {
"name": "Alice",
"contact": "555-0123"
}
},
"Polly": {
"name": "Polly",
"breed": "Parrot",
"age": 10,
"vaccinated": True,
"words_learned": ["hello", "bye", "Polly wants a cracker"],
}
}
You can get the top level items using the keys
method or the function list()
.
list(pets)
['Buddy', 'Mittens', 'Polly']
pets.keys()
dict_keys(['Buddy', 'Mittens', 'Polly'])
You can retrieve specific entries by using []
and the name of the field in quotes.
pets["Mittens"]
{'name': 'Mittens', 'breed': 'Persian cat', 'age': 2, 'vaccinated': False, 'favorite_foods': ['tuna', 'chicken'], 'owner': {'name': 'Alice', 'contact': '555-0123'}}
Note that you can store multiple data types in a dictionary, although the keys are usually strings. Below, you can see that you can store lists within dictionary entries.
type(pets["Polly"]["words_learned"])
list
Dictionaries are useful for when you have hierarchical data you want to retrieve quickly and don't need to do heavy processing on. You will most likely encounter dictionaries when importing a JSON formatted file.
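A few everyday dictionary operations, sketched on a small made-up dictionary called scores: adding and updating entries, looking up a key that might be missing with .get(), and looping over key-value pairs with .items().

```python
scores = {"Alice": 100, "Bob": 90}   # hypothetical example data

scores["Charlie"] = 80   # a new key adds an entry
scores["Bob"] = 95       # an existing key is overwritten

# .get() returns a fallback instead of raising an error when the key is missing
missing = scores.get("Dana", "no grade")
print(missing)             # no grade
print("Alice" in scores)   # True: `in` checks the keys

# .items() gives key-value pairs to loop over
for name, score in scores.items():
    print(f"{name}: {score}")
```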
Exercise: Use a dictionary within a function to calculate the scrabble score of a given string. The function will take a string and, based on an internal dictionary, returns the score. Here are the scrabble letter values as a dictionary:
scrabble_letter_values = {
'A': 1, 'B': 2, 'C': 3, 'D': 1, 'E': 1, 'F': 4, 'G': 2, 'H': 4, 'I': 1,
'J': 8, 'K': 5, 'L': 1, 'M': 3, 'N': 1, 'O': 1, 'P': 3, 'Q': 10, 'R': 1,
'S': 1, 'T': 1, 'U': 1, 'V': 4, 'W': 4, 'X': 8, 'Y': 4, 'Z': 10
}
def scrabble_score(word):
    scrabble_letter_values = { # keep this dictionary line
        'A': 1, 'B': 2, 'C': 3, 'D': 1, 'E': 1, 'F': 4, 'G': 2, 'H': 4, 'I': 1,
        'J': 8, 'K': 5, 'L': 1, 'M': 3, 'N': 1, 'O': 1, 'P': 3, 'Q': 10, 'R': 1,
        'S': 1, 'T': 1, 'U': 1, 'V': 4, 'W': 4, 'X': 8, 'Y': 4, 'Z': 10
    }
    # Your code here
    # initialize score to 0
    # loop through each letter in the word (hint: might want to make the word uppercase)
    # use the dictionary to add the score of each letter to the score variable
    # return the score
    score = 0
    for letter in word.upper():
        score += scrabble_letter_values.get(letter, 0)
    return score
scrabble_score("hello")
8
Exercise: Create a function complement that takes a DNA string and returns the complement of the string. Within the function, you can use a dictionary to convert the bases. For a bonus, add a parameter reverse that, if true, returns the reverse complement instead.
# Your code here
def complement(DNA, reverse = False):
    dictionary = {'A':'T', 'T':'A', 'C':'G', 'G':'C'} # keep this line
    # make an empty string to store the complement
    # loop through each base in the DNA sequence
    # add complement of the base to the string (you can use + to concat strings)
    # BONUS: if reverse is True, return the reverse of the complement string
    # return the complement
    complement = ''
    for base in DNA:
        complement += dictionary[base]
    if reverse:
        return complement[::-1]
    else:
        return complement
complement("AATCTGAG", reverse = True)
'CTCAGATT'
Numpy arrays¶
One of the most common multi-dimensional data structures you'll encounter is the numpy array. The numpy array is a data structure that stores data in one or more dimensions. All the data in these arrays need to be of the same type. To create a numpy array, we must first import the numpy library (this needs to be installed in your environment if you haven't already). Then, within the np.array()
function, you can use list notation []
to create one-dimensional arrays, or use nested list notation [[]]
to create multi-dimensional arrays.
# An example of a numpy array
import numpy as np
my_array = np.array([1, 2, 3, 4, 5])
print(my_array)
[1 2 3 4 5]
# An example of a 2 dimensional numpy array
my_2d_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(my_2d_array)
[[1 2 3] [4 5 6] [7 8 9]]
You can check the shape of your array using the shape
attribute. Note that there is no ()
at the end.
print(my_array.shape) # Note a 1-d array has a single axis, so its shape is a 1-element tuple holding the number of elements
(5,)
print(my_2d_array.shape)
(3, 3)
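Besides shape, a few other attributes are handy for inspecting an array. The sketch below uses small made-up arrays to show ndim (number of dimensions), size (total number of elements), and dtype (the single data type shared by all elements), along with the reshape() method.

```python
import numpy as np

flat = np.arange(6)          # the integers 0 through 5 in one dimension
grid = flat.reshape(2, 3)    # same 6 values arranged as 2 rows by 3 columns

print(flat.shape, flat.ndim)   # (6,) 1
print(grid.shape, grid.ndim)   # (2, 3) 2
print(grid.size)               # 6

# All elements share one type: mixing ints and floats gives a float array
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)             # float64
```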
You can create numpy arrays programmatically using built-in functions.
# Creating a sequence using arange() and linspace()
my_array = np.arange(0, 100, 2) # every even number from 0 up to (but not including) 100
print(my_array)
my_array = np.linspace(0, 5, 10) # 10 evenly spaced numbers from 0 to 5
print(my_array)
[ 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98] [0. 0.55555556 1.11111111 1.66666667 2.22222222 2.77777778 3.33333333 3.88888889 4.44444444 5. ]
# Creating a 3x5 numpy array with zeros
my_array = np.zeros((3, 5))
print(my_array)
[[0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
# Creating a 3x5 numpy array with random integers
my_array = np.random.randint(0, 10, (3, 5))
print(my_array)
[[8 9 3 8 1] [6 6 8 3 3] [8 6 6 8 3]]
Navigating arrays (indexing)¶
You can access elements of an array using indexing. In python, indices start at 0. You can use negative indices to start from the end of the array, colons : to select a range of elements (slicing), and commas , to select elements from multiple dimensions.
# Example of indexing a numpy array
print(my_array[0, 0]) # first element
print(my_array[0, :]) # first row
print(my_array[:, 0]) # first column
print(my_array[0:2, 0:2]) # first two rows and columns
print(my_array[-1, -1]) # last element
8 [8 9 3 8 1] [8 6 8] [[8 9] [6 6]] 3
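Because the array above is random, your numbers will differ from the ones shown. Here is the same kind of indexing on a made-up array with predictable contents, so you can check each result by eye.

```python
import numpy as np

# A 3x5 array holding 0..14, filled row by row:
# row 0:  0  1  2  3  4
# row 1:  5  6  7  8  9
# row 2: 10 11 12 13 14
grid = np.arange(15).reshape(3, 5)

corner = grid[0, 0]            # 0, the first row and first column
second_row = grid[1, :]        # 5, 6, 7, 8, 9
last_column = grid[:, -1]      # 4, 9, 14
top_left = grid[0:2, 0:2]      # rows 0-1 and columns 0-1: [[0, 1], [5, 6]]
last_element = grid[-1, -1]    # 14

print(corner, last_element)
print(second_row)
print(last_column)
print(top_left)
```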
Numpy arrays have useful attributes and methods that you can use to manipulate the data. For example, you can use the sort()
method to sort the elements of the array.
my_array.sort(axis=1) # axis=0 sorts within each column, axis=1 sorts within each row
print(my_array) # note that sort() sorts the array in place and returns None
[[1 3 8 8 9] [3 3 6 6 8] [3 6 6 8 8]]
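If you want to keep the original order around, the function np.sort() returns a sorted copy instead of changing the array. A short sketch on a made-up array contrasts the two:

```python
import numpy as np

values = np.array([3, 1, 2])   # hypothetical example data

sorted_copy = np.sort(values)        # new sorted array
before_in_place = values.tolist()    # [3, 1, 2]: the original is untouched

result = values.sort()               # the method sorts in place...
print(result)                        # None  (...and returns nothing)
print(values)                        # [1 2 3]
```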
Exercise: Create a function
growth
that takes an initial population size, a growth rate, a number of growth cycles, and returns an array of the population size at each growth cycle. Assume an exponential growth rate.
# Your code here
# Pseudo code:
# define the function
# create an array of the right size
# set the initial population
# loop through the cycles
# calculate the new population and add it to the correct index in the array
# return the array
def growth(initial, rate, cycles):
    pop = np.zeros(cycles + 1)
    pop[0] = initial
    for i in range(1,cycles + 1):
        pop[i] = pop[i-1] * (1+rate)
    return pop
growth(1, 1, 10)
array([1.000e+00, 2.000e+00, 4.000e+00, 8.000e+00, 1.600e+01, 3.200e+01, 6.400e+01, 1.280e+02, 2.560e+02, 5.120e+02, 1.024e+03])
# test out your function
growth_array = growth(1, 1, 10)
plt.plot(growth_array)
[<matplotlib.lines.Line2D at 0x1387bca70>]
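As an aside, the same calculation can be written without a loop, since the population after k cycles is initial * (1 + rate) ** k. The sketch below is one possible alternative, not the workshop's solution; the name growth_vectorized is made up for this example.

```python
import numpy as np

def growth_vectorized(initial, rate, cycles):
    """Hypothetical loop-free version: population at cycles 0, 1, ..., cycles."""
    steps = np.arange(cycles + 1)           # 0, 1, 2, ..., cycles
    return initial * (1 + rate) ** steps    # ** is applied to every element

populations = growth_vectorized(1, 1, 10)
print(populations)
```

With a growth rate of 1 the population doubles every cycle, so the last value is 2 ** 10 = 1024.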
Array operations¶
One of the benefits of using numpy arrays is the fast operations you can do on them. Today we'll cover the basic mathematical operations and boolean operations. What we won't have time for are all the matrix manipulations that arrays are known for, such as product, inverse, etc.
You can do arithmetic operations on arrays and apply them element-wise.
a = np.array([[1, 2], [3, 4]])
print(a - 1)
[[0 1] [2 3]]
b = np.array([[10, 20],[30, 40]])
a * b
array([[ 10, 40], [ 90, 160]])
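Element-wise arithmetic is one of the main differences between arrays and lists. In the sketch below (with made-up values), the same operators repeat or concatenate a list but do math on an array.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array([1, 2, 3])

list_times_two = numbers_list * 2      # [1, 2, 3, 1, 2, 3]: the list is repeated
array_times_two = numbers_array * 2    # each element is doubled

list_plus_list = numbers_list + numbers_list       # concatenation, 6 items
array_plus_array = numbers_array + numbers_array   # element-wise sum, 3 items

print(list_times_two)
print(array_times_two)    # [2 4 6]
print(array_plus_array)   # [2 4 6]
```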
You can load tab separated or other types of delimited files using the np.loadtxt() function. In this case, I also used the skiprows, usecols, and max_rows parameters to finetune which rows and columns I wanted to load. If I were to not skip the first row, np.loadtxt() would raise an error, because np arrays require everything to be the same type and the string headers in the first row cannot be converted to numbers.
I've also imported the first column of the csv as a string and then used pandas to convert it to a new data type, called datetime. This is a special data type that allows operations on dates.
# Open csv in numpy, skip first row and column
weather = np.loadtxt("https://informatics.fas.harvard.edu/resources/Workshops/2024-Fall/Python/data/prcp.csv", delimiter=",", skiprows=1, usecols=range(1,5))
weather_header = np.loadtxt("https://informatics.fas.harvard.edu/resources/Workshops/2024-Fall/Python/data/prcp.csv", delimiter=",", max_rows=1, dtype=str, usecols=range(1,5))
weather_date = np.loadtxt("https://informatics.fas.harvard.edu/resources/Workshops/2024-Fall/Python/data/prcp.csv", delimiter=",", skiprows=1, usecols=0, dtype=str)
weather_date = pd.to_datetime(weather_date, format="%m/%d/%y")
# header rows are in metric (mm of precipitation, and C for temperature)
print(weather_header)
print(weather[0:10,:])
print(weather_date[0:10])
print(weather.shape)
['PRCP' 'TAVG' 'TMAX' 'TMIN']
[[ 1.  11.4 12.8  6.1]
 [ 0.   6.5 11.1  2.2]
 [12.2  4.9  7.2  1.1]
 [ 8.1  5.2  7.2  4.4]
 [ 2.3  5.6  7.2  1.7]
 [ 4.8  1.6  2.2  0.6]
 [ 0.   3.   6.7 -0.5]
 [ 0.  -0.2  3.3 -3.8]
 [ 0.   1.4  6.1 -2.1]
 [ 0.   2.7  5.6 -2.1]]
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08', '2023-01-09', '2023-01-10'], dtype='datetime64[ns]', freq=None)
(365, 4)
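If you want to experiment with these parameters without downloading anything, the sketch below runs the same three np.loadtxt() calls on a tiny in-memory file. The csv_text string is a two-row stand-in laid out like prcp.csv, and wrapping it in io.StringIO is just a convenience for this example.

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for prcp.csv: one header row, then one row per day
csv_text = (
    "DATE,PRCP,TAVG,TMAX,TMIN\n"
    "1/1/23,1.0,11.4,12.8,6.1\n"
    "1/2/23,0.0,6.5,11.1,2.2\n"
)

# Numeric columns only: skip the header row and the date column
values = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=range(1, 5))

# Header only: read a single row as strings
header = np.loadtxt(io.StringIO(csv_text), delimiter=",", max_rows=1, dtype=str, usecols=range(1, 5))

# Date column only: read as strings, then convert to datetime with pandas
dates = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=0, dtype=str)
dates = pd.to_datetime(dates, format="%m/%d/%y")

print(header)
print(values.shape)
print(dates)
```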
Exercise: Using the weather array (daily weather data for Boston from 2023), convert all the temperatures from Celsius to Fahrenheit. Store the result in a new array called fahrenheit. Then, find the highest maximum temperature. (Hint: you can use the function temp_convert() we wrote earlier!)
# Your code here
fahrenheit = weather[:,1:] * 9 / 5 + 32
max(fahrenheit[:,1])
# Other solution
fahrenheit = temp_convert(weather[:,1:], convert_to="F")
max(fahrenheit[:,1])
93.02
Another way to perform operations on arrays is to use logical conditions. This creates a boolean array of the same size.
my_array > np.mean(my_array) # this will tell us which elements in my_array are greater than the mean
array([[False, False, True, True, True], [False, False, True, True, True], [False, True, True, True, True]])
We can then use the boolean array to extract the actual numbers that meet the condition. The way this works is we index into the array using []
and by using the boolean array, only values that were True are selected. It's important that the boolean array you use has the same dimensions as the array you're indexing into.
big_numbers = my_array > np.mean(my_array)
my_array[big_numbers]
array([8, 8, 9, 6, 6, 8, 6, 6, 8, 8])
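Here is a small sketch of the same idea on a made-up array (the numbers are only for illustration). It also shows two variations that are useful for the exercises: a one-dimensional mask that selects whole rows, and combining two conditions with &.

```python
import numpy as np

# Made-up data: each row is a day, columns are average, max, and min temperature
temps = np.array([[11.4, 12.8, 6.1],
                  [6.5, 11.1, 2.2],
                  [4.9, 7.2, 1.1]])

# A mask with the same shape as the array returns the matching values as a 1D array
mask_2d = temps > 6.0
print(temps[mask_2d])

# A mask with one True/False per row keeps or drops whole rows
row_mask = temps[:, 1] > 10.0
print(temps[row_mask, :])

# Conditions can be combined with & (and) or | (or); the parentheses are required
both = (temps[:, 1] > 10.0) & (temps[:, 2] > 5.0)
print(temps[both, :])
```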
Exercise: Find out which days had above-average maximum temperatures from the weather object. Print only the rows of the data that meet this condition. Your output should have 176 rows.
# Your code here
hot_days = weather[:, 2] > np.mean(weather[:, 2])
weather[hot_days, :].shape
(176, 4)
Exercise: Using the weather object, on what day of the year did Boston have the greatest range of temperature (max - min)? What were the min and max temperatures on that day? Hint: use the function np.argmax().
# Your code here
weather[np.argmax(weather[:, 2] - weather[:, 3]), 2:4]
array([ 0.6, -22.1])
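np.argmax() returns the position of the largest value, not the value itself, so the same position can be reused to look up other arrays that are aligned with it, such as the dates. A minimal sketch on made-up numbers (four invented days, not the real dataset):

```python
import numpy as np
import pandas as pd

# Made-up max and min temperatures for four consecutive days
tmax = np.array([12.8, 11.1, 7.2, 0.6])
tmin = np.array([6.1, 2.2, 1.1, -22.1])
dates = pd.date_range("2023-01-01", periods=4)

daily_range = tmax - tmin          # element-wise difference
idx = np.argmax(daily_range)       # position of the largest range

# The same index works on every array that has one entry per day
print(dates[idx], tmax[idx], tmin[idx])
```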
Exercise: BONUS: Create a function that takes a range of days (start and end) and returns the average temperature for that range in Fahrenheit. Hint: you can use the temp_convert() function you wrote earlier.
# Your code here
def average_temp(start, end):
    return temp_convert(np.mean(weather[start:end, 1]))
# Test your code. This should be 39.578
average_temp(0, 10)
39.578
You may have noticed when we loaded the dataset that we had to load the headers separately, due to np arrays forcing all elements to have the same data type. When you have more complicated data, such as a mix of numerical and categorical data, numpy arrays are not as useful. We're going to briefly introduce pandas dataframes, but then move on to plotting.
Pandas DataFrames¶
This is a super short introduction to pandas dataframes. The main difference between numpy arrays and pandas dataframes is that dataframes can store different data types in the same object. Dataframes also have explicit row and column indices, and behave more like a table that you would see in Excel. Let's demonstrate by loading the weather data.
weather_df = pd.read_csv("https://informatics.fas.harvard.edu/resources/Workshops/2024-Fall/Python/data/prcp.csv")
weather_df.head()
 | DATE | PRCP | TAVG | TMAX | TMIN |
---|---|---|---|---|---|
0 | 1/1/23 | 1.0 | 11.4 | 12.8 | 6.1 |
1 | 1/2/23 | 0.0 | 6.5 | 11.1 | 2.2 |
2 | 1/3/23 | 12.2 | 4.9 | 7.2 | 1.1 |
3 | 1/4/23 | 8.1 | 5.2 | 7.2 | 4.4 |
4 | 1/5/23 | 2.3 | 5.6 | 7.2 | 1.7 |
We can get a quick overview of the data using the info()
method. This tells us the number of rows, the columns, and the dtypes of each column. There's also a field for Non-null count, which illustrates that dataframes can handle missing data, unlike numpy arrays.
weather_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   DATE    365 non-null    object
 1   PRCP    365 non-null    float64
 2   TAVG    365 non-null    float64
 3   TMAX    365 non-null    float64
 4   TMIN    365 non-null    float64
dtypes: float64(4), object(1)
memory usage: 14.4+ KB
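The weather data has no missing values, so the Non-Null Count column is not very exciting here. As a rough sketch of what it looks like when something is missing, here is a tiny made-up dataframe with one gap in the PRCP column:

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one missing precipitation value
df = pd.DataFrame({
    "DATE": ["1/1/23", "1/2/23", "1/3/23"],
    "PRCP": [1.0, np.nan, 12.2],
})

df.info()                         # PRCP reports 2 non-null values out of 3 rows
n_present = df["PRCP"].count()    # number of non-missing values
n_missing = df["PRCP"].isna().sum()
mean_prcp = df["PRCP"].mean()     # missing values are skipped by default

print(n_present, n_missing, mean_prcp)
```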
We can index into the dataframe using the []
operator and the column name.
weather_df["PRCP"][0:10]
0     1.0
1     0.0
2    12.2
3     8.1
4     2.3
5     4.8
6     0.0
7     0.0
8     0.0
9     0.0
Name: PRCP, dtype: float64
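A few more ways of indexing are sketched below on a small made-up dataframe. This is only a quick taste: selecting one column gives a Series, selecting a list of columns gives a smaller dataframe, and .loc selects by row and column labels.

```python
import pandas as pd

# Small made-up dataframe for illustration
df = pd.DataFrame({"PRCP": [1.0, 0.0, 12.2], "TAVG": [11.4, 6.5, 4.9]})

one_col = df["PRCP"]               # a single column (a pandas Series)
two_cols = df[["PRCP", "TAVG"]]    # a list of names gives back a DataFrame
first_two = df["PRCP"][0:2]        # slice the column like a list
by_label = df.loc[0:1, "PRCP"]     # .loc uses labels and includes the end label

print(type(one_col).__name__, type(two_cols).__name__)
print(first_two.tolist(), by_label.tolist())
```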
Dataframes also have a handy describe()
method that gives us a quick overview of the numerical data in the dataframe.
weather_df.describe()
 | PRCP | TAVG | TMAX | TMIN |
---|---|---|---|---|
count | 365.000000 | 365.000000 | 365.000000 | 365.000000 |
mean | 3.456164 | 12.183014 | 16.177808 | 8.427671 |
std | 9.337008 | 8.309695 | 8.875777 | 8.345570 |
min | 0.000000 | -17.800000 | -7.700000 | -23.200000 |
25% | 0.000000 | 5.000000 | 9.400000 | 1.100000 |
50% | 0.000000 | 11.700000 | 16.100000 | 8.300000 |
75% | 1.500000 | 19.800000 | 23.900000 | 15.600000 |
max | 78.000000 | 27.900000 | 33.900000 | 23.900000 |
DataFrames excel at data manipulation and transformation. For example, we can use the apply() method to apply a function to each of the selected columns.
# one-liner to convert all the temperature columns to Fahrenheit
weather_df.loc[:, "TAVG":"TMIN"].apply(lambda x: temp_convert(x, convert_to="F"))
 | TAVG | TMAX | TMIN |
---|---|---|---|
0 | 52.52 | 55.04 | 42.98 |
1 | 43.70 | 51.98 | 35.96 |
2 | 40.82 | 44.96 | 33.98 |
3 | 41.36 | 44.96 | 39.92 |
4 | 42.08 | 44.96 | 35.06 |
... | ... | ... | ... |
360 | 44.24 | 53.06 | 41.00 |
361 | 44.96 | 44.96 | 42.08 |
362 | 42.26 | 46.04 | 39.92 |
363 | 43.88 | 46.04 | 35.06 |
364 | 36.68 | 41.00 | 33.08 |
365 rows × 3 columns
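To see what apply() passes to the function, here is a small sketch on made-up data. The c_to_f function is a hypothetical stand-in for temp_convert(), and the column names it receives are recorded along the way.

```python
import numpy as np
import pandas as pd

# Made-up temperatures in Celsius
df = pd.DataFrame({"TAVG": [11.4, 6.5], "TMAX": [12.8, 11.1]})

def c_to_f(celsius):
    # hypothetical stand-in for the temp_convert() function from earlier
    return celsius * 9 / 5 + 32

seen = []

def convert_column(col):
    # col is a whole column (a pandas Series), not a single number
    seen.append(col.name)
    return c_to_f(col)

converted = df.apply(convert_column)
print(set(seen))
print(converted)
```

Because the arithmetic works element-wise on a whole column, the function does not need its own loop.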
But that's all the time we have for dataframes today. In the next workshop, we'll cover more advanced data manipulation and visualization with pandas.
Summary of different data structures we've talked about today¶
We talked about 4 different ways to organize data and it can seem confusing as to why these different data structures exist and when you should use each type. Here's a summary comparing them:
Feature/Question | List | Dictionary | Numpy Array | Pandas DataFrame |
---|---|---|---|---|
What is it? | An ordered collection of items | A (semi-)unordered collection of key-value pairs | A multi-dimensional array | A 2D labeled table of data |
When to use? | When you have a collection of items of mixed types you want to keep together | When you need to map keys to values | For performing mathematical operations on (large) numerical data | For data manipulation on tabular data of mixed types |
Example usage | Making small, temporary collections as part of a larger program | Storing data by name | Working with matrices, simple plotting | Reading in files, cleaning data, plotting |
Mixed types? | Yes | Keys no; values can be mixed | No | Yes (between columns) |
How to access the data | By index | By key | By index or boolean mask | Column name, index, advanced filtering |
How to create | [] | {} | np.array() | pd.DataFrame() |
Libraries required | None | None | numpy | pandas |
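To make the table concrete, here is a short sketch that creates one of each structure and pulls a value out of it. All the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# List: ordered, mixed types allowed, accessed by index
temps_list = [11.4, "warm", 6.5]

# Dictionary: values looked up by key
temps_dict = {"TAVG": 11.4, "TMAX": 12.8}

# Numpy array: one data type, supports math and boolean masks
temps_array = np.array([11.4, 12.8, 6.1])

# Pandas DataFrame: labeled columns that can have different types
temps_df = pd.DataFrame({"DATE": ["1/1/23", "1/2/23"], "TAVG": [11.4, 6.5]})

print(temps_list[0])                   # by index
print(temps_dict["TMAX"])              # by key
print(temps_array[temps_array > 10])   # by boolean mask
print(temps_df["TAVG"])                # by column name
```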
Plotting numpy arrays with matplotlib¶
Now that we have a basic understanding of how data are stored in python, it's time to understand how to go from data structures to visualization. The most common library for plotting in python is matplotlib. If you are familiar with ggplot2 from R, you will find that matplotlib is a more low-level plotting library and you will need to manually specify many of the details. However, it is possible to make complex plots with just matplotlib. Additionally, there is another library called seaborn, which is built on top of matplotlib and has a more user-friendly interface. For today, we'll just learn the basics of creating plots with matplotlib.
To begin with, we need to import the library and give it a nickname.
import matplotlib.pyplot as plt
The first thing we do is create the figure and axis objects. The figure object is the canvas that we will draw on and the axis object is the actual plot. We can then use the plot()
method of the axis object to create a line plot. The first argument is the x-axis and the second is the y-axis. If no x-axis is given, the plot will use the index of the array given.
fig, ax = plt.subplots()
ax.plot(weather_date, weather[:, 1])
[<matplotlib.lines.Line2D at 0x138833440>]
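To check the claim about the default x-axis, the sketch below plots a made-up array with and without explicit x values and then asks the resulting line objects which x values they used. The matplotlib.use("Agg") line is only there so the example also runs outside a notebook, where no window can be opened; inside a notebook you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

y = np.array([3.0, 1.0, 2.0])   # made-up values
x = np.array([10, 20, 30])

fig, ax = plt.subplots()
(line_default,) = ax.plot(y)        # no x given: positions 0, 1, 2 are used
(line_explicit,) = ax.plot(x, y)    # x given: first argument is x, second is y

print(line_default.get_xdata())
print(line_explicit.get_xdata())
```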
Then we can add labels for the axes and a title. Customizing the look of the plot is generally done by using methods of the axis object. These methods sometimes return objects such as the line object or the text object, which you can use to further customize the plot.
fig, ax = plt.subplots()
ax.plot(weather_date, weather[:, 1])
ax.set_xlabel("Date")
ax.set_ylabel("Temperature (C)")
ax.set_title("Weather in Boston 2023")
Text(0.5, 1.0, 'Weather in Boston 2023')
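The Text(...) output above is one of those returned objects. As a rough sketch of how they can be used, the example below saves the line and title objects from a plot of made-up data and then changes them afterwards. As before, the Agg backend line is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# ax.plot returns a list of line objects; unpack the single line
(line,) = ax.plot([0, 1, 2], [5.0, 3.0, 4.0])   # made-up data

# set_title returns the text object for the title
title = ax.set_title("Weather in Boston 2023")

# Both objects can be customized after they are created
line.set_linewidth(2)
line.set_color("red")
title.set_fontsize(14)

print(title.get_text(), line.get_linewidth())
```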
Let's add the min and max temperatures to the plot, in a different color so that we can distinguish between them.
fig, ax = plt.subplots()
# Add each line to the plot, changing the transparency and linestyle.
# Color will be automatically determined or can be set with the "color" argument
ax.plot(weather_date, weather[:, 1])
ax.plot(weather_date, weather[:, 2], alpha = 0.5, linestyle = ":")
ax.plot(weather_date, weather[:, 3], alpha = 0.5, linestyle = ":", color = "green")
# add a legend to the plot, passing a list of labels in the order the lines were added
# You can specify the location as well
ax.legend(["Average", "Max", "Min"], loc="lower center")
# Set the axis and title labels
ax.set_xlabel("Date")
ax.set_ylabel("Temperature (C)")
ax.set_title("Weather in Boston 2023")
Text(0.5, 1.0, 'Weather in Boston 2023')
Next, we will use the set
method to set multiple properties of the plot at once. We condensed the code to set the axis and title into one line by putting it in the ax.set()
method.
Another way to set properties is to use the plt.setp() function. This function takes an object from the plot (or a list of them) and the properties to set, given as keyword arguments. Here we extract the xticklabels from the axis object and use plt.setp() to set the rotation of the x-axis labels.
Finally, we set the format of the x axis to only display the month and day.
This code block illustrates that the axis object is a container for all the elements of the plot. You can access and modify most properties of the plot using this object. For more information, check out the documentation on the axis API here.
fig, ax = plt.subplots()
# Add each line to the plot, changing the transparency and linestyle.
# Color will be automatically determined or can be set with the "color" argument
ax.plot(weather_date, weather[:, 1])
ax.plot(weather_date, weather[:, 2], alpha=0.5, linestyle = ":")
ax.plot(weather_date, weather[:, 3], alpha=0.5, linestyle = ":", color = "green")
# add a legend to the plot, passing a list of labels in the order the lines were added
# You can specify the location as well
ax.legend(["Average", "Max", "Min"], loc="lower center")
# Set the axis and title labels, condensed into one line
ax.set(xlabel="Date", ylabel="Temperature (C)", title="Weather in Boston 2023")
# rotate the x-axis labels for better readability
xlabels = ax.get_xticklabels() # get the object that holds the labels
plt.setp(xlabels, rotation = 45) # Use plt.setp to set the rotation of the labels
# modify the x-axis to show only the month and day
import matplotlib.dates as mdates
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))
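plt.setp() is a shortcut for calling the same setter on every object in a list. The sketch below, on two empty made-up plots, rotates the tick labels once with plt.setp() and once with an explicit loop; both should end up with the same rotation. The Agg backend line is again only an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt

fig, (ax_a, ax_b) = plt.subplots(1, 2)

# Option 1: plt.setp applies rotation=45 to every label in the list
labels_a = ax_a.get_xticklabels()
plt.setp(labels_a, rotation=45)

# Option 2: loop over the labels and call the setter on each one
labels_b = ax_b.get_xticklabels()
for label in labels_b:
    label.set_rotation(45)

print([label.get_rotation() for label in labels_a])
print([label.get_rotation() for label in labels_b])
```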
In the next code block, we demonstrate how to create two plots on the same figure. In the plt.subplots() function, we specify that we want subplots arranged in a 2x1 grid, that is, two rows and one column. We also modify the size of the figure using the figsize parameter. Then, we add a title to the main figure using fig.suptitle(). You can think of the fig object as the canvas that holds all plots (axis objects), so to modify something about that canvas, we use a method of fig.
Because we created multiple axis objects, ax is no longer a single axis object but an array of axis objects, and we can use indexing to access them. So we use ax[0] to plot the temperature on the top and ax[1] to plot the precipitation on the bottom. If you want a figure with a grid of subplots, you can use the coordinates of each subplot to refer to it in the ax object.
# Create a figure with two subplots and make the figure size 10x8
fig, ax = plt.subplots(2, 1, figsize = (10, 8))
fig.suptitle("Weather in Boston 2023")
# Plot the temperature on top
ax[0].plot(weather_date, weather[:, 1])
ax[0].plot(weather_date, weather[:, 2], alpha = 0.5, linestyle = ":")
ax[0].plot(weather_date, weather[:, 3], alpha = 0.5, linestyle = ":")
ax[0].legend(["Average", "Max", "Min"], loc = "lower center")
# Set xticklabels to empty list to remove them
ax[0].set(ylabel="Temperature (C)", title = "Temperature", xticklabels = [])
# Plot the precipitation on the bottom
# Note the use of bar plots for precipitation
ax[1].bar(weather_date, weather[:, 0])
ax[1].set(ylabel = "Precipitation (mm)", xlabel = "Date", title = "Precipitation")
[Text(0, 0.5, 'Precipitation (mm)'), Text(0.5, 0, 'Date'), Text(0.5, 1.0, 'Precipitation')]
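For a grid with more than one row and more than one column, plt.subplots() gives back a 2D numpy array of axis objects, indexed by row and column. A minimal sketch with made-up data (the Agg backend line is an assumption so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# 2 rows x 2 columns of subplots
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
print(type(axes), axes.shape)

# Index with [row, column]; this is the top-right subplot
axes[0, 1].plot([1, 2, 3])
axes[0, 1].set_title("row 0, column 1")

# axes.flat lets you loop over every subplot regardless of the grid shape
for single_ax in axes.flat:
    single_ax.set_xlabel("x")
```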
You can also plot multiple y axes on the same plot. This is done by calling the twinx() method on the existing axis object, which creates a second axis object that shares the x axis but has its own y axis. We save this secondary axis object so that we can refer to it later when plotting on it and changing its labels.
fig, ax = plt.subplots()
ax.bar(weather_date, weather[:,0])
ax.set(xlabel="Date", ylabel = "Precipitation (mm)", title = "Weather in Boston 2023")
ax2 = ax.twinx()
ax2.plot(weather_date, weather[:,1], color = "red")
ax2.set_ylabel("Temperature (C)")
Text(0, 0.5, 'Temperature (C)')
Exercise: Write a function called moving_average that calculates a moving average of a numpy array. The function should take a 1D array and a window size and return an array of the same size with the moving average. Note that the first few values (depending on window size) of the output should be nan because there are not enough values to calculate the average.
# Your code here
def moving_average(data, window): # keep
    # Make placeholder for result
    # loop through the length of the data, starting with window size
    # calculate the average of that window and add it to the result
    # return the result
    result = np.full(len(data), np.nan)
    # the first window starts at index 0 and the last one starts at len(data) - window
    for i in range(len(data) - window + 1):
        start = i
        end = i + window
        #print(f"Calculating sliding window from position {start} to position {end}")
        #print(data[start:end])
        mean = np.mean(data[start:end])
        # store the mean at the position of the last element in the window
        result[i + window - 1] = mean
    return result
# Test your code
test_data = moving_average(np.array([0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5]), 5)
print(test_data[0:10]) # should print [nan nan nan nan 1. 2. 3. 4. 5. 5.]
[nan nan nan nan 1. 2. 3. 4. 5. 5.]
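For comparison, the same result can be computed without an explicit loop. The sketch below is one possible alternative, not the workshop solution: it uses np.convolve() with a window of equal weights, and the function name moving_average_convolve is made up for this example.

```python
import numpy as np

def moving_average_convolve(data, window):
    # hypothetical loop-free variant of the moving_average exercise
    result = np.full(len(data), np.nan)
    # equal weights that sum to 1, so each output is the mean of one window
    kernel = np.ones(window) / window
    # mode="valid" only returns positions where the window fits entirely,
    # which gives len(data) - window + 1 values
    result[window - 1:] = np.convolve(data, kernel, mode="valid")
    return result

test_data = moving_average_convolve(np.array([0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5]), 5)
print(test_data)
```

The first window - 1 entries stay nan, just as in the loop version.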
Exercise: Use the moving average function to plot the moving average of the temperature data (average temperature
weather[:,1]
) on top of the original temperature data. Use a window size of 7 days. Make the original data less opaque so it fades into the background. Add labels and legends to the plot.
# Your code here
fig, ax = plt.subplots()
ax.plot(weather_date, weather[:,1], alpha = 0.2)
ax.plot(weather_date, moving_average(weather[:,1], 7))
ax.set(xlabel = "Date", ylabel = "Temperature (C)", title = "Weather in Boston 2023")
ax.legend(["Temperature", "7-day moving average"], loc = "lower center")
<matplotlib.legend.Legend at 0x139b91670>
Exercise: Take the words in this exercise and run it through the scrabble score function you wrote earlier. Then, plot the scrabble score of each word against the length of the word. You will need to split the sentence into a list of words not including the punctuation.
# Your code here
string = "Take the words in this exercise and run it through the scrabble score function you wrote earlier. Then, plot the scrabble score of each word against the length of the word. You will need to split the sentence into a list of words not including the commas and periods."
# Replace the punctuation in the string (commas, periods) with spaces and split the string into a list of words
string_list = string.replace("."," ").replace(","," ").split()
# check that punctuation is removed
string_list
# initialize an array of two columns with the number of words as rows
string_scores = np.ones((len(string_list), 2))
# loop through each word in the list
for i in range(len(string_list)):
    # store the length of the word and the scrabble score in row i of the array
    string_scores[i] = [len(string_list[i]), scrabble_score(string_list[i])]
# plot the array.
fig, ax = plt.subplots() # keep
# use ax.scatter with the first column as x and the second column as y
ax.scatter(string_scores[:,0], string_scores[:,1])
# use ax.set() to set the labels and title
ax.set(xlabel = "Word length", ylabel = "Scrabble score", title = "Scrabble score of words in a sentence")
[Text(0.5, 0, 'Word length'), Text(0, 0.5, 'Scrabble score'), Text(0.5, 1.0, 'Scrabble score of words in a sentence')]
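The solution above only removes commas and periods. If your sentence might contain other punctuation, one possible approach (an alternative, not what the solution uses) is to replace every character in string.punctuation with a space before splitting. The sentence here is made up.

```python
import string as string_module  # renamed so it does not clash with a variable called string

sentence = "Then, plot the score. Done!"   # made-up example sentence

# Build a translation table that maps every punctuation character to a space
punctuation = string_module.punctuation
table = str.maketrans(punctuation, " " * len(punctuation))

# Apply the table, then split on whitespace
words = sentence.translate(table).split()
lengths = [len(word) for word in words]

print(words)
print(lengths)
```

Note that this also splits words at apostrophes and hyphens, which may or may not be what you want.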
Exercise: Write a function that wraps everything above so that a person can submit a string and get a plot of the scrabble scores of their words. Also, print the sentence as the title of the plot. The function does not need to return anything.
# Your code here
def plot_scrabble(string):
    # replace commas and periods with spaces, then split into a list of words
    string_list = string.replace("."," ").replace(","," ").split()
    # initialize an array of two columns with the number of words as rows
    string_scores = np.ones((len(string_list), 2))
    for i in range(len(string_list)):
        string_scores[i] = [len(string_list[i]), scrabble_score(string_list[i])]
    fig, ax = plt.subplots()
    ax.scatter(string_scores[:,0], string_scores[:,1])
    ax.set(xlabel = "Word length", ylabel = "Scrabble score", title = f"Scrabble score of words in '{string}'")
plot_scrabble("Test out this function with a sentence of your choice.")