## Getting started: Installation

This workshop is distributed as a **Jupyter notebook**. The recommended way to participate is to run the notebook interactively in Google Colab: go to https://colab.research.google.com/ and upload this notebook. That's it! If you will be using Google Colab, skip the instructions below.

<br>

---

**See above for the recommended way to participate in this workshop. Only follow these instructions if Google Colab isn't working**

If for some reason Google Colab isn't working, or you prefer to run this locally, you will need to install Python, Anaconda, and the necessary libraries. You will have to follow these steps to do so. Note that some steps are only meant for specific operating systems.

0. If you are on Windows, [install WSL](https://learn.microsoft.com/en-us/windows/wsl/install). Once WSL is installed, you'll have a Linux terminal available to you in Windows. You can open this terminal by typing "wsl" in the search bar and clicking the app that appears. You'll also find your Linux distribution as a mounted drive in your file explorer.

1. Install mamba, a package manager, using the command line (Terminal on Mac or WSL on Windows).

    1.1. For Mac, if you already have brew installed, install mamba using `brew install miniforge` and initialize it using `conda init zsh`. Then restart your terminal. If you don't have homebrew (i.e. the brew command doesn't exist), install brew first using `/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`
    
    1.2. For Windows, download the Linux (x86_64) installer from the miniforge repository [here](https://github.com/conda-forge/miniforge) and install with `bash Miniforge3-Linux-x86_64.sh`.

2. Create a new environment using mamba with `mamba create -n pyworkshop numpy pandas matplotlib seaborn jupyter` and activate it with `conda activate pyworkshop`.

3. You can now run the jupyter notebook by typing `jupyter notebook` in the terminal. This will open a browser window with the jupyter notebook interface. You can navigate to the folder where you saved this notebook and open it.

4. Alternatively, install [VSCode](https://code.visualstudio.com/) and the Python extension. Then open this notebook in VSCode and run it with the kernel that belongs to the pyworkshop environment. [How to guide here](https://code.visualstudio.com/docs/datascience/jupyter-notebooks)

---

## What is Python and why do we need it/why are you taking this workshop?

* Python is a programming language that is commonly used for data analysis
* R is another commonly used programming language for data analysis
* Python is more general purpose than R, which is designed primarily for data analysis
* Programming languages are a way for humans to give the computer commands
* Regardless of how you collect data, it needs to be analyzed and code is the best way (Don't use Excel!)

## Jupyter basics

Jupyter notebooks are text files that can be rendered as formatted text **and** run code given the proper setup (see Installation).

Text is split into **cells**. Double clicking a cell allows you to edit it.

For code cells, there is also an option to run the code. You can do this by pressing **SHIFT+ENTER** while the cell is selected, or by pressing the **Run** button at the top of the cell (the exact location depends on the editor you're using). Because of the way we set up the notebook, the code cells will run Python code.

For this workshop, we'll be asking you to follow along by running code cells and by doing coding exercises in which you write or edit code in code cells.

**IMPORTANT**: Run the code cells below to **import** the **libraries** we'll be using during this workshop.

In [3]:
# Run this cell to import the libraries we'll be using
# If you don't have the kernel loaded or installed, it will not work

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

And run the following to demonstrate how the code blocks run and display code:

In [None]:
# Run this cell to print a message to the screen
print("this is my code cell")

Any **variables** that you assign in one cell will be available in other cells. But they will not be saved between sessions. If you close the notebook and re-open it, you will need to re-run the previous cells to get your variables back. Therefore, it's important to be aware of the state of your notebook and the order in which your cells were run.

In [None]:
my_string = "this is my code cell"

In [None]:
print(my_string)

Jupyter notebooks can be exported to PDF or HTML, so that other people can view both the code and its output. It's a good format for handing in homework, for example, since you can show your work. In this notebook, there will be exercises with placeholders for the code that you will have to fill in. For these exercises, we encourage you to work with each other and to use Google, LLMs, and whatever other resources you like if you are stuck. It's not an exam, just a way to practice the concepts. Afterwards, we will post the completed notebook on our website so you can have examples of solutions.

## Python refresher

Let's begin by reviewing some of the terms we've covered in the past sessions that will come up again today.

| Term | Definition |
| --- | --- |
| Object | The thing itself (an instance of a class) |
| Variable | The name we give the object (a pointer to the object) |
| Class | The blueprint for the object (defines the attributes and methods of object) |
| Method | A function that belongs to an object |
| Attribute | A property of an object |
| Function | A piece of code that takes an input and gives an output/does something |
| Argument| The objects that are passed to the function for it to operate on |
| Library | Collections of python functions/capabilities that can be installed and loaded on top of base python |

### Object Methods

Everything in python is an **object**, and depending on the type of object, they may have certain **methods** that can be called on them. For example, strings have a method called `upper()` that converts the string to uppercase.

In [None]:
my_string = "hello"
my_string.upper()


In the above code, `my_string` is a string object, and we are calling the `upper()` method on it. The method is called by using the `.` operator. Methods are functions that can only be used on objects of certain classes. You will often see methods strung together, like below:

In [None]:
my_string = "hello"
# This first makes the first letter uppercase, then swaps the cases of each letter
my_string.capitalize().swapcase()

### Object Attributes

We'll be learning about some more complex objects today, like **numpy arrays**, which also have attributes. **Attributes** are properties of an object that can be accessed using the `.` operator. For example, numpy arrays have an attribute called `shape` that tells you the dimensions of the array. The difference between an attribute and a method is that attributes are information about a given object, rather than a task being performed on the object. Practically, this means that, while both attributes and methods are called on an object with the dot operator `.`, attributes are accessed without parentheses, so you don't need to call them like functions.

In [None]:
# this makes a np array (we'll learn more about these shortly!)
my_array = np.array([1, 2, 3, 4, 5])

# this gets the size (total number of elements) of the array
my_array.size

## Base Python Data Structures

### Lists

**Lists** are one of the most flexible data structures in Python. They are created with `[]` and can contain any type of data. Each **element** in a list is separated by a comma. Lists are ordered and can be indexed, sliced, and concatenated just like strings. When lists are all numerical, they also support functions like `max()` and `min()`. Lists can also be nested using another `[]` within the list.

Lists are our first introduction to a **mutable** data structure, meaning you can change a list without having to create a new one. Indeed, list methods may modify your data **in place** and/or **return** a new object. If a method only modifies the object in place, its return value will typically be `None`. Modifying in place means you don't have to assign the result of the method to a new variable, while returning a new object means you do have to assign it. For example, `list.append(x)` updates the list in place and returns `None`, while `list.pop()` both returns the last element and removes it from the list in place.

Below are some useful operations and methods for lists. For a full list of methods, you can use `help()` on the list or consult the [docs](https://docs.python.org/3/library/stdtypes.html#list) page.

**Operations and methods for lists**

| Operation/Method | Description |
| --- | --- |
| `+` | Concatenation |
| `*` | Repetition |
| `[]`, `[:]` | Indexing, slicing |
| `.append(x)` | Add `x` to the end of the list |
| `.extend([x, y, z])` | Add `[x, y, z]` to the end of the list |
| `.insert(i, x)` | Add `x` at index `i` of the list |
| `.pop(i)` | Remove and return the element at index `i`, defaults to last element if none given |
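
As a quick illustration of the table above, and of the difference between modifying a list in place and returning a value, here is a minimal sketch. The list name `birds` and its contents are made up for this example.

```python
# Hypothetical example list (names are illustrative only)
birds = ["duck", "coot"]

# append() modifies the list in place and returns None
result = birds.append("kestrel")
print(result)   # None
print(birds)    # ['duck', 'coot', 'kestrel']

# extend() adds each element of another list, also in place
birds.extend(["goldfinch", "starling"])

# insert() places an element at a given index, in place
birds.insert(0, "heron")

# pop() removes the last element in place AND returns it
last = birds.pop()
print(last)     # starling
print(birds)    # ['heron', 'duck', 'coot', 'kestrel', 'goldfinch']

# + and * return new lists and leave the originals unchanged
combined = birds + ["owl"]
repeated = ["owl"] * 2
print(len(combined), repeated)  # 6 ['owl', 'owl']
```

Note that writing `birds = birds.append("owl")` would replace the list with `None`, which is a common mistake with in-place methods.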

**Use cases for lists**

Lists are a data structure that is always there in the background, being useful. We see them when creating simple ordered collections to iterate through, when we need to store a sequence of data to reference later, or when we need to collect a bunch of objects together. Think of lists as a small temporary transport for data. Lists are not good for large datasets (because they will be slow) or when you need to do a lot of mathematical operations (because they lack the functionality).

### Dictionaries

**Dictionaries** store **key:value pairs**. Keys must be immutable and are typically strings or numerical identifiers, while the values can be just about anything, including other dictionaries, lists, or individual values. You can create a dictionary with `{}` or with the `dict()` function. The two ways to create a dictionary are shown below:

```python
my_dict = {'a': 1, 'b': 2, 'c': 3}
my_dict = dict([("a", 1), ("b", 2), ("c", 3)])  # dict() takes a single list of (key, value) pairs
```

In recent versions of Python (3.7+), dictionary keys maintain the order in which they were added, which is useful for consistency when looping over the keys. In earlier versions of Python, dictionaries were unordered. You can't index or slice dictionaries (since dictionary elements aren't accessed by position), but you can retrieve items by their key, e.g. `my_dict["a"]` or `my_dict.get("a")`. Like lists, dictionaries are mutable, so you can add, remove, or update the key:value pairs in place. Other methods return "view objects" that let you see the items in the dictionary but not modify it through them; these objects can easily be converted to lists with the `list()` function. Here are some useful methods for dictionaries:

**Operations and methods for dictionaries**

| Operation/Method | Description |
| --- | --- |
| `[<key>]` | Retrieve value by key |
| `.keys()` | Returns a view object of the keys |
| `.values()` | Returns a view object of the values |
| `.items()` | Returns a view object of the key:value pairs |
| `.update(dict)` | Updates the dictionary with the key:value pairs from another dictionary |
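
The operations in the table above can be sketched as follows. The dictionary `names` and the key `"0000"` are made up for this example; the other entries are borrowed from the bird file used later in this notebook.

```python
# Hypothetical dictionary mapping taxon ids (as strings) to common names
names = {"6924": "American Black Duck", "473": "American Coot"}

# Retrieve by key; .get() returns None (or a default) if the key is missing
print(names["473"])                  # American Coot
print(names.get("0000"))             # None
print(names.get("0000", "unknown"))  # unknown

# Add or update entries in place
names["4665"] = "American Kestrel"
names.update({"145310": "American Goldfinch"})

# View objects can be looped over or converted to lists
keys = list(names.keys())
print(keys)  # ['6924', '473', '4665', '145310']

for taxon_id, common_name in names.items():
    print(taxon_id, common_name)
```

Using `names["0000"]` instead of `names.get("0000")` would raise a `KeyError`, so `.get()` is handy when a key may be missing.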

**Use cases for dictionaries**

Dictionaries are a data structure that is more specialized for information that can be organized in a key:value pair way. You may see dictionaries being used to store associations between a name/ID and some characteristics, or to store a set of parameters for a function, or to organize a hierarchical grouping of information. Dictionaries are optimized for fast access to the values by key and for flexible organization of the data. Although you can edit the values of a dictionary, they aren't good for mathematical operations or for ordered data.

## Learning to read documentation



Today our first lesson will be about how to read documentation, because we are going to start using libraries such as **numpy** and **pandas**, which have many features that we do not have the time to cover in detail. Instead, if you can read documentation efficiently, you can learn how to use these libraries on your own.


**Programming effectively actually involves a lot of reading**

Programming involves reading primarily documentation, but also code, search results, stackexchange queries, etc. These are just a few examples of what you'll read as you work on code. Reading the documentation of a package or library or software that you are using should probably be the first thing you do when you start using it. However, software docs pages are a much different sort of writing than we may be used to, if we're primarily used to reading journal articles, textbooks, and protocols. Knowing how and how much to read documentation is a skill that needs to be developed over time to suit your own needs. There's definitely no need to read every single page of documentation of a piece of software, especially for large libraries like `numpy` or `matplotlib`.

**There are a variety of ways software can be documented**

You may be handed a single script from a colleague to perform some action and that script may have **comments** in the code detailing what it does or what certain lines do. Individual functions may have what is called a **docstring**, which is a string that occurs immediately after the function definition detailing how to use that function, its inputs, and its outputs. Another type of documentation is a docs page or **API reference** on a website for that software, such as the page for seaborn's [scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) function. Many software packages also have some introductory pages like **vignettes** or **tutorials** that guide you through the basics of the software. The [Getting started tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) of Pandas are a good example of this.
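
To make comments and docstrings concrete, here is a small sketch. The function `fahrenheit_to_celsius` is made up for this example, and the docstring layout (Parameters/Returns sections) is one common convention rather than a requirement.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius.

    Parameters
    ----------
    temp_f : float
        Temperature in degrees Fahrenheit.

    Returns
    -------
    float
        Temperature in degrees Celsius.
    """
    # Comments like this one explain individual lines to someone reading the code
    return (temp_f - 32) * 5 / 9


# help() prints the docstring; it is also stored in the __doc__ attribute
help(fahrenheit_to_celsius)
print(fahrenheit_to_celsius(212))  # 100.0
```

The same `help()` call works on functions from libraries, which is often the quickest way to check how a function should be used.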

**What documentation are we meant to read?**

In general, documentation is meant to be a reference manual more than a textbook. A lot of documentation is really repetitive, because it has to exhaustively cover every single function, class, and use-case available to the user. I do not recommend reading documentation like a book or in any linear way. That's like learning a foreign language by reading the dictionary. For example, `numpy` has a variety of [mathematical functions](https://numpy.org/doc/stable/reference/routines.math.html), but you are not required to look at the doc page of each of those. It is enough to know that it exists and when you do want to use a particular one, to check the page of that specific function. The most important parts of the documentation to read first are the tutorials/user guides, which introduce the basic functionality of the software with some example code. Often times, this code is exactly what you need to get started. If you get stuck, then it's time to read the docs pages for the specific commands you are using.

### Anatomy of a docs page

Scientific articles typically have the same sections: Introduction, Methods, Results/Discussion. Similarly, docs pages for a function should all have some common components:

* Function name and how to call it
    * parameters in parentheses with any defaults showing
    * positional parameters first, keyword parameters after asterisk
* Description of function
* Detailed parameters that can be passed to each function
    * type of object that can be passed
    * description of what the parameter does
* Returns
    * type of object(s) returned
    * description of the object
* Examples
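
The first item in the list above (how a function signature is written) can be illustrated with a short sketch. The function `describe_bird` and its parameters are hypothetical and exist only to show how positional parameters, defaults, and the asterisk appear in a signature.

```python
# Hypothetical function, written only to illustrate how to read a signature
def describe_bird(name, count=1, *, uppercase=False):
    """Return a short description such as '3 x American Coot'."""
    text = f"{count} x {name}"
    return text.upper() if uppercase else text


# 'name' and 'count' can be passed by position; 'count' has a default of 1
print(describe_bird("American Coot"))      # 1 x American Coot
print(describe_bird("American Coot", 3))   # 3 x American Coot

# Parameters after the bare * must be passed by keyword
print(describe_bird("American Coot", 3, uppercase=True))  # 3 X AMERICAN COOT

# Passing a keyword-only parameter by position raises a TypeError
try:
    describe_bird("American Coot", 3, True)
except TypeError as err:
    print("TypeError:", err)
```

On a docs page, this function would be listed as `describe_bird(name, count=1, *, uppercase=False)`, which tells you at a glance that only `name` is required.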

**Just the basics**

If this is your first time encountering the function, glance at the function name and description and then go directly to the examples. This will help you understand if this function does what you think it does and give you a template to use it.


>**Exercise:** Read the documentation page for plotting pie charts in matplotlib [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.pie.html). What is the minimal information you need to pass to the function to get a pie chart? What parameters are optional or already have defaults?


**Troubleshooting**

Looking at a docs page is helpful for troubleshooting certain errors.

>**Exercise:** Below is some code that is meant to plot a pie chart using some functions in the matplotlib library. However, it's not working. Can you figure out what's wrong?

In [None]:
labels = ['Python', 'Java', 'C++', 'Ruby']
sizes = [215, 130, 245, 210]

fig, ax = plt.subplots()

ax.pie(sizes, labs=labels, autopct='%1.1f%%', shadow=True, startangle=140)

# Aspect ratio to ensure the pie chart is circular.
ax.axis('equal')

ax.set_title('Programming Language Popularity')

**Exploring**

If you are trying to find a specific way to customize the pie chart, it is worth reading the entire list of parameters to see what options are available.

>**Discussion:** Read the pie plot function's [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.pie.html) page. How does the function give you control over the look of your pie wedges?

## Refresher: importing libraries

Recall that we covered how to **import libraries** of functions previously. For instance, we can import the built-in `math` library and then use the functions it contains:

In [None]:
import math
print(math.log(100))

We need to type `math.` so Python knows where to look for the `log()` function.

We can also use an **alias** if we don't want to type `math.` every time we use a function from the library:

In [None]:
import math as m
print(m.log(100))

## Reading and writing data in base python

In this next section, we'll read data of bird names and bird sightings from two CSV files and then write out the total number of sightings for each bird to a new CSV file.

### Reading data line by line

A common way to read data in Python is line by line. This is useful when you have a file that is too large to fit into memory all at once. You can read the file line by line and process each line as you go. This is also useful when you need to parse a file that has a specific structure so that it can be read into a data structure like a numpy array or pandas dataframe. In this section, when we talk about files, we mean text files that contain data that exist on your local machine. This is different from the data structures we've been working with so far, which are in memory (in your python instance).

The syntax for opening a file is `open(filename, mode)`, where mode can be `r` for reading, `w` for writing, `a` for appending. Reading mode means that you can only read the file but not change it (on the disk). Writing mode means you are creating a new file or overwriting an existing file. Appending mode means you are adding to an existing file.

When you open a file, you can read it line by line using a `for` loop. While there are several ways to open a file in Python, the recommended one is with the `with` keyword, which will automatically close the file once you've done what you need, even if an error occurs along the way. Here's an example of how that might look with a for loop to print each line of a file:

```python
with open('filename.txt', 'r') as file:
    for line in file:
        print(line)
```

The above code will print each line of the file to the console. Notice the use of the keyword `as`, similar to how we set up an **alias** when we import a library. Here it serves a similar purpose, giving us a name to refer to the file object we've opened. We can name it anything we want; `file` is just an example. The `line` variable is how we refer to each line of the file as we iterate through it. Again, it's an arbitrary name.

This for loop is similar to how we iterate through a list. The only difference is that we're iterating through something that is being read from our local computer rather than an object in memory.

In the example, we are only printing the lines of the file, but just as with any for loop, you can do anything you want with each line, such as apply a function to it, split it up and store it in a data structure, or write different parts of it to a new file. So think of that `print(line)` line as a placeholder for whatever you want to do with the line.
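
The three modes described above (`r`, `w`, and `a`) can be compared in a small sketch. The file name `modes_demo.txt` is made up, and the file is created in a temporary folder so that nothing in your working directory is touched.

```python
import os
import tempfile

# Work in a temporary folder so no real files are touched
# (the file name 'modes_demo.txt' is made up for this example)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_demo.txt")

# 'w' creates the file, or overwrites it if it already exists
with open(path, "w") as f:
    f.write("first line\n")

# 'a' adds to the end of the existing file
with open(path, "a") as f:
    f.write("second line\n")

# 'r' only reads; the file on disk is not changed
with open(path, "r") as f:
    contents = f.read()
print(contents)

# Opening with 'w' again discards the previous contents
with open(path, "w") as f:
    f.write("replaced\n")

with open(path, "r") as f:
    replaced = f.read()
print(replaced)
```

Because `w` silently overwrites existing files, it is worth double-checking the file name before opening anything in write mode.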

### Vocab

|Term|Definition|
|---|---|
|File|A collection of data stored on a disk|
|Line|A string of characters that ends with a newline character|
|Newline character|A special character that indicates the end of a line, usually `\n`|
|Delimiter|A character that separates data fields in a line, usually a comma, tab, or space|
|Parsing|The process of extracting data from a file|
|Whitespace|Any character that represents a space, tab, or newline|
|Leading/trailing whitespace|Whitespace at the beginning or end of a string|


Let's work through a more concrete example of how we might read and then parse through a file. Let's suppose we have a list of taxon ids and bird names that we want to read into a **dictionary**. The file looks like this:

```
Anas rubripes,American Black Duck,6924
Fulica americana,American Coot,473
Spinus tristis,American Goldfinch,145310
Falco sparverius,American Kestrel,4665
```

First, run this block to download the file to the Jupyter notebook environment.

In [None]:
# This line downloads the file locally to the same folder as your notebook
!wget https://informatics.fas.harvard.edu/workshops/python-intensive/data/bird_names.csv

Then, in the code below we first read the file line by line, then strip the whitespace and split the line by a comma. Then, we will create a dictionary where the key is the taxon id and the value is the common name of the bird.

In [None]:
filename = 'bird_names.csv'

bird_names = dict()

with open(filename, 'r') as file:
    for line in file:
        line = line.strip().split(',')
        bird_names[line[2]] = line[1]

print(bird_names)

>**Exercise**: Rerun the code above and remove the `strip()` method. What happens?

Here are some handy functions when working with lines in files. These are all string methods, so you can use them on any string, including strings that are read from a file.

**Useful functions for reading files by line**

| Function | Description |
| --- | --- |
| `.strip()` | Removes leading and trailing whitespace and newlines from a string |
| `.split()` | Splits a string into a list of strings based on a delimiter |
| `.join()` | Joins a list of strings into a single string with a delimiter |
| `line[:]` | Indexing and slicing works on strings too |
| `.replace(old, new)` | Replaces all instances of `old` with `new` in a string |
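
Here is a minimal sketch of the string methods in the table above, applied to a single made-up line. The leading spaces were added on purpose so that the effect of `strip()` is visible.

```python
# A made-up line, as it might come out of a comma-delimited file
line = "  Fulica americana,American Coot,473\n"

# strip() removes the leading spaces and the trailing newline
clean = line.strip()
print(repr(clean))   # 'Fulica americana,American Coot,473'

# split() turns the string into a list of fields
fields = clean.split(",")
print(fields)        # ['Fulica americana', 'American Coot', '473']

# join() does the reverse, here with a tab as the delimiter
tabbed = "\t".join(fields)
print(repr(tabbed))  # 'Fulica americana\tAmerican Coot\t473'

# Slicing and replace() work on any string
print(fields[0][:6])                # Fulica
print(fields[1].replace(" ", "_"))  # American_Coot
```

`repr()` is used when printing so that otherwise invisible characters such as `\t` and `\n` show up in the output.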

**Useful special characters**

Special characters in files are often used as delimiters or to indicate the end of a line. The two most common special characters are:

| Character | Description |
| --- | --- |
| `\n` | Newline character |
| `\t` | Tab character |

>**Exercise**: Copy the code above and modify it so that the dictionary keys are the taxon ids and the values are another dictionary, with keys 'scientific_name' and 'common_name' and values the appropriate entries for that bird species.
>
> For example, a sample dictionary entry should look like this:
> ```
> {'6924': {'scientific_name': 'Anas rubripes', 'common_name': 'American Black Duck'}}
> ```

In [None]:
filename = 'bird_names.csv'

# Your code here
bird_names = dict()

print(bird_names)

>**Exercise**: Why did we use a dictionary to store the data in the previous exercise? Think about what features of a dictionary make it a good choice or what features of lists or arrays make them a bad choice.

Below is an excerpt from a file of iNaturalist observations of birds in Cambridge, MA from the year 2023. We will loop through the file and count the number of observations of each species. We will also use the previously created dictionary to get the species names.

```csv
id,time_observed_at,taxon_id
145591043,2023-01-01 17:33:31 UTC,14886
145610149,2023-01-01 20:55:00 UTC,7004
145610383,2023-01-01 21:13:00 UTC,6993
145611915,2023-01-01 21:12:00 UTC,13858
```

Run the code block below to download the file to the Jupyter notebook environment.

In [None]:
# This line downloads the file locally to the same folder as your notebook
!wget https://informatics.fas.harvard.edu/workshops/python-intensive/data/bird_observations.csv

>**Exercise:** Work with a neighbor or two to do the following exercise:
> Loop through the file and count the number of observations of each species. After all the observations have been counted, print all the species names and the number of observations. You will need to use the dictionary you created in the previous exercise to get the species names. It's up to you what kind of data structure (if any) you want to use to store the counts.
>
> 1. Write out pseudocode for what you will do for each line in your birdfile
> 2. Try to turn the pseudocode into python code. If there's something you want to do, but don't know the syntax or function, raise your hand and we can help you.
> 3. Find out how many European Starlings were observed as proof that your code works.

In [None]:
filename = 'bird_observations.csv' #keep

bird_observations = dict()

with open(filename, 'r') as birdfile: #keep
    # skip the header
    next(birdfile) #keep
    for line in birdfile: #keep
        # your code here


print(bird_observations)

If you routinely find yourself reading delimited files, you might want to use the `csv` library. The `csv` library also understands the CSV format that Excel produces (it cannot read `.xlsx` workbooks) and can read and write rows to/from dictionaries directly. For more information, here's the [doc page](https://docs.python.org/3/library/csv.html). Here's what the above code would look like using the `csv` module:

In [None]:
import csv

filename = 'bird_observations.csv'

bird_observations = dict()

with open(filename, 'r') as birdfile:
    # this line takes the place of us having to strip and split the lines
    reader = csv.reader(birdfile, delimiter=',')
    # skip the header
    next(reader)
    for row in reader:
        taxon_id = row[2]  # named taxon_id to avoid shadowing the built-in id()
        name = bird_names[taxon_id]['common_name']
        if name not in bird_observations:
            bird_observations[name] = 0
        bird_observations[name] += 1
print(bird_observations)
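
The `csv` module's ability to read rows directly into dictionaries can be sketched with `csv.DictReader`. To keep the example self-contained, it reads the four-row excerpt shown earlier from an in-memory string (via `io.StringIO`) instead of the downloaded file, and it counts by taxon id rather than by species name.

```python
import csv
import io

# The excerpt shown earlier, held in memory so the sketch runs without the file
excerpt = """id,time_observed_at,taxon_id
145591043,2023-01-01 17:33:31 UTC,14886
145610149,2023-01-01 20:55:00 UTC,7004
145610383,2023-01-01 21:13:00 UTC,6993
145611915,2023-01-01 21:12:00 UTC,13858
"""

counts = dict()

# DictReader uses the header row as keys, so there is no header to skip
reader = csv.DictReader(io.StringIO(excerpt))
for row in reader:
    taxon_id = row["taxon_id"]  # access a column by name, not position
    counts[taxon_id] = counts.get(taxon_id, 0) + 1

print(counts)  # {'14886': 1, '7004': 1, '6993': 1, '13858': 1}
```

Accessing columns by name makes the code easier to read and keeps it working if the column order in the file changes.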

### Writing data by line

Writing data to a file is similar to reading data from a file. You can open a file in write mode and then write to it line by line using the `print()` function, this time passing the variable we've stored the opened file in (in our case the variable is unimaginatively named `file`) to its `file=` argument. Here's an example of writing a list of strings to a file:

In [None]:
my_text = ['this is a test', 'this is another test', 'this is the final test']

with open('my_text.txt', 'w') as file:
    for line in my_text:
        print(line, file=file)

# reading it back
with open('my_text.txt', 'r') as file:
    for line in file:
        print(line)

>**Exercise:** Use the `csv` module to write the species counts to a new file. The file should have two columns: the species name and the number of observations. The file should be comma-delimited. How this is written may depend on how you stored the species counts.

In [None]:
# your code here



## Numpy

Numpy is an open-source **library** for Python. Its main feature is the implementation of efficient **data structures** that can be used for large datasets, along with **functions** and **methods** for those data structures. This makes numpy well suited for scientific computing and data science.

Principal among these Numpy data structures is the **numpy array**.


### Importing Numpy

Since Numpy is an external library of functions, it needs to be **imported** before we can use it. Since Numpy is so widely used, it has become common practice to import it using the alias `np`.

> **Exercise**: Run the block of code below to import Numpy. You have to do this to be able to use any of the Numpy functions we'll talk about in the workshop.

In [17]:
# Run this block to import numpy with the alias np
import numpy as np

### Numpy arrays

**Numpy arrays** are a data structure that contains only one type of data, typically numerical, and are N-dimensional (any number of dimensions). You can create numpy arrays using the `np.array()` function (after running `import numpy as np`) or by converting other data structures to an array using `np.asarray()` or other helper functions. There are also many functions that can create an array with pre-filled numbers, such as `np.zeros()` and `np.arange()`. An array is defined by its `shape`, which describes the number of elements along each dimension; the dimensions are also known as **axes**. For a two-dimensional array, the first axis corresponds to the rows and the second to the columns, and so on for higher dimensional data.
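
As a minimal sketch of the ideas above (one data type per array, N dimensions, and `shape`), the following example builds a few small arrays. The variable names and values are arbitrary.

```python
import numpy as np

# A 2-dimensional array built from a nested list (values are arbitrary)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)  # (2, 3): 2 rows, 3 columns
print(grid.ndim)   # 2
print(grid.dtype)  # an integer type (the exact name depends on the platform)

# Mixing integers and floats still gives a single data type:
# every element is stored as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# np.asarray() converts an existing structure, such as a list, to an array
from_list = np.asarray([10, 20, 30])
print(from_list)    # [10 20 30]
```

The `dtype` attribute is how you check which single data type an array is using.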

Today we will be spending a lot of time with numpy arrays as they are one of the main data structures for working with numerical data. We will learn how to navigate them, read and write from them, and also how to perform mathematical operations on them. We will also use numpy to create some visualizations.

**Use cases for numpy arrays**

The use cases for numpy arrays are very broad. They are used in scientific computing, machine learning, data analysis, and more. They are optimized for fast mathematical operations and are very efficient in terms of memory usage. They are also very flexible in terms of the number of dimensions they can have, so you can store a lot of data in a single numpy array. They are not good for storing mixed data types or for storing data that is not numerical.

### Generating numpy arrays

Numpy has a variety of functions that can generate arrays for you. Some of the most common ones are `np.zeros()`, `np.ones()`, `np.arange()`, and `np.linspace()`.

>**Exercise:** Read the documentation page for [np.zeros()](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html). What is the minimal information you need to pass to the function to get an array of zeros? What parameters are optional or already have defaults?
>
> Once you have read the documentation, produce an array of zeros with 3 rows and 4 columns in the code block below.

In [None]:
# your code here



>**Exercise:** Read the documentation page for [np.arange()](https://numpy.org/doc/stable/reference/generated/numpy.arange.html).
>
> Once you have read the documentation, produce an array of the even numbers from 100 to 200 in the code block below.

In [None]:
# your code here



Did you notice anything unexpected in the output above? Look at the last value of the result: it's 198, not 200. One might expect that, because the function call says to create an array that ends at 200, that value would be included. It isn't, because `np.arange()`, like Python's built-in `range()` and slicing, treats the stop value as exclusive: the result goes up to, but does not include, the stop value. To get 200 included, you'd need to set the second argument to a value just past it, such as 201. If in a scientific application you need an array that ends at (and includes) a specific value, be sure your function call is set up properly.
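
The behaviour of the stop value can be seen on a smaller range. This sketch uses different numbers from the exercise and also shows `np.linspace()`, which takes a number of points rather than a step.

```python
import numpy as np

# The stop value is excluded, just as in range() and in slicing
print(list(range(0, 10, 2)))   # [0, 2, 4, 6, 8]
print(np.arange(0, 10, 2))     # [0 2 4 6 8]

# Moving the stop value past the last number you want includes it
with_end = np.arange(0, 11, 2)
print(with_end)                # [ 0  2  4  6  8 10]

# np.linspace() takes the number of points instead of a step,
# and includes the end value by default
evenly_spaced = np.linspace(0, 10, 6)
print(evenly_spaced)           # [ 0.  2.  4.  6.  8. 10.]
```

When the end point matters more than the step size, `np.linspace()` is often the safer choice.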

>**Exercise:** `size` and `shape` are attributes of numpy arrays that tell you how many elements are in the array and how many elements are in each dimension, respectively. You can access them with `array.size` and `array.shape` (where `array` is the name of your array). They can also be chained onto a function call such as `arange()`. For example, `np.arange(10).size` will return 10.
>
> `print()` the sizes and shapes of the two arrays you created above.
>
> What do you notice about the way the `size` and `shape` attributes are returned?

In [None]:
print(np.zeros([3,4]).size)
print(np.zeros([3,4]).shape)

print(np.arange(100,200,2).size)
print(np.arange(100,200,2).shape)

When you print the size of an array, you will get a single number of the total number of elements in the array. When you print the shape of an array, you will get a **tuple** of the number of elements in each dimension. However, if the array is one-dimensional, it will still return a tuple, but with a single element.

Recall that a **tuple** is a data structure that is similar to a list, but is immutable. This means that you cannot change the elements of a tuple after it is created. Tuples are created by using parentheses instead of square brackets. Tuples are often returned by functions that need to return multiple values.
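
The sketch below illustrates both points with a made-up array called `grid`: the shape tuple can be unpacked into separate variables, and trying to change a tuple raises an error.

```python
import numpy as np

grid = np.zeros((2, 5))

# shape is a tuple, so it can be indexed or unpacked like a list
n_rows, n_cols = grid.shape
print(n_rows, n_cols)        # 2 5

# a one-dimensional shape is a tuple with a single element (note the trailing comma)
print(np.arange(4).shape)    # (4,)

# tuples are immutable: assigning to an element raises a TypeError
dims = (2, 5)
try:
    dims[0] = 3
except TypeError as err:
    print("Cannot modify a tuple:", err)
```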

>**Exercise:** `reshape()` is a method that can be called on a numpy array to change its shape. Use it to change the array you created using `np.arange(100,200,2)` to have 10 rows and 5 columns. [reshape() documentation](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html)
>
> What are the restrictions on the shape you can pass to `reshape()`?

In [None]:
# your code here



### Difference between numpy arrays and lists

A one-dimensional numpy array is called a `vector` and it looks very much like a list, but with the additional restriction that **only one data type can be stored in it**. So why do we want to use numpy arrays instead of lists? Run the code blocks below to see the main reason numpy is used more for data analysis than lists. Annotate each line of code with a comment explaining what it does.

In [4]:
my_numpy_array = np.arange(1, 1000000)
my_list = list(range(1, 1000000))

In [None]:
%timeit np.sum(my_numpy_array)

In [None]:
%timeit sum(my_list)

In the above code, we used the "magic command" `%timeit` to time how long it takes to run a line of code. This command runs the code multiple times and gives you the average time it took to run. We used it to compare the time it takes to sum the elements of a list and the time it takes to sum the elements of a numpy array.

Note that the `%timeit` command is specific to Jupyter notebooks. If you are running this code in a script, you would have to use the `timeit` module to time your code.
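
As a rough sketch of what that could look like in a script, the example below uses `timeit.timeit()` from the standard library. The number of runs (`n_runs`) is an arbitrary choice for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

my_numpy_array = np.arange(1, 1000000)
my_list = list(range(1, 1000000))

# timeit.timeit() accepts a callable and runs it `number` times,
# returning the total elapsed time in seconds
n_runs = 10
numpy_seconds = timeit.timeit(lambda: np.sum(my_numpy_array), number=n_runs)
list_seconds = timeit.timeit(lambda: sum(my_list), number=n_runs)

print(f"np.sum on array: {numpy_seconds / n_runs:.6f} s per run")
print(f"sum on list:     {list_seconds / n_runs:.6f} s per run")
```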

We can see that using the numpy-specific `np.sum` function on a numpy array is much faster than using the base python `sum` function on a list. Each operation runs in microseconds instead of milliseconds.

In [None]:
import sys

print(f"Memory used by numpy array: {my_numpy_array.nbytes/1024/1024} MB")
list_size = (sys.getsizeof(my_list)+ sum(sys.getsizeof(item) for item in my_list)) / 1024 / 1024
print(f"Memory used by list: {list_size} MB")

In the above code, we printed out the memory usage of the numpy array and the list. This reveals that the numpy array also takes up less memory to store the same amount of information as a list.

In summary, numpy, which is made to do calculations, is an improvement over base python lists because it is faster and more memory efficient.

If you want some more analysis, you can check out this [stackoverflow](https://stackoverflow.com/questions/10922231/pythons-sum-vs-numpys-numpy-sum) thread.

## Numpy operations

### Broadcasting

You can perform mathematical operations on arrays and they'll propagate to each element. This is called **broadcasting**. For the cases covered in this workshop, the two objects you're operating on must either have the same shape or one of them must be a **scalar**, meaning a single number. (Numpy's full broadcasting rules also allow some other combinations of shapes.) Numpy also has functions that allow you to operate on the entire array, such as `np.sum()`, `np.mean()`, etc.

Consider the following operation on two numpy arrays:


In [None]:
predicted_array = np.array([1, 2, 3, 4, 5])
expected_array = np.array([2, 4, 1, 5, 5])

print(predicted_array - expected_array)

By using the subtraction operator `-`, which we learned is designed to subtract one single number from one other single number, we have actually performed the operation automatically on each pair of numbers in the arrays. The first number in `predicted_array` is `1` and the first number in `expected_array` is `2`, so the first number in the resulting output is `1 - 2` or `-1`. The second number in `predicted_array` is `2` and the second number in `expected_array` is `4`, so the second number in the resulting output is `2 - 4` or `-2`. And so on for the remaining numbers in the arrays.

What if we tried to do this with normal Python lists?

In [None]:
predicted_list = [1, 2, 3, 4, 5]
expected_list = [2, 4, 1, 5, 5]

print(predicted_list - expected_list)

This raises a `TypeError` stating that we can't use the subtraction operator `-` on lists. To achieve the same result as the numpy arrays, we'd have to loop over the lists by index, a much slower process that requires us to write more code.
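
For comparison, here is a sketch of what the list version could look like with an explicit loop. The second form uses `zip()`, which may not have been covered yet; it pairs up the elements of the two lists.

```python
predicted_list = [1, 2, 3, 4, 5]
expected_list = [2, 4, 1, 5, 5]

# Loop over the lists by index, subtracting one pair of numbers at a time
differences = []
for i in range(len(predicted_list)):
    differences.append(predicted_list[i] - expected_list[i])
print(differences)      # [-1, -2, 2, -1, 0]

# The same thing written as a list comprehension with zip()
differences_zip = [p - e for p, e in zip(predicted_list, expected_list)]
print(differences_zip)  # [-1, -2, 2, -1, 0]
```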

If provided a scalar, instead of a second array, the arithmetic operation will be performed on each element in the first array using the same number on each one:

In [None]:
single_number = 3
print(predicted_array - single_number)

Here, each number in `predicted_array` has `3` subtracted from it.

Again, this is something we'd have to use a loop to do with regular lists. This is a huge advantage of numpy arrays over native Python data structures like lists.

Below is a practical example that combines the array/array broadcasting and array/scalar broadcasting. Here, we calculate the mean squared error of two 1D arrays using the formula $\frac{1}{n}\sum_{i=1}^{n}(predicted_i - expected_i)^2$.

In [None]:
predicted = np.array([1, 2, 3, 4, 5])
expected = np.array([2, 4, 1, 5, 5])
mse = (1/len(predicted)) * np.sum(np.square(predicted - expected))
print(mse)

Breaking down the code, you can see what is produced at each step:

In [None]:
print(predicted - expected)
print(np.square(predicted - expected))
print(np.sum(np.square(predicted - expected)))
print(1/len(predicted))

And of course 0.2 * 10 is 2.0, the same result as the single expression above.
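
As an aside, the sum-then-divide step can also be written with `np.mean()`. This is just an alternative sketch of the same formula, not a different calculation.

```python
import numpy as np

predicted = np.array([1, 2, 3, 4, 5])
expected = np.array([2, 4, 1, 5, 5])

# np.mean() sums the squared differences and divides by n in one step
mse_with_mean = np.mean((predicted - expected) ** 2)
print(mse_with_mean)  # 2.0
```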

### Slicing and indexing numpy arrays

**Slicing** arrays is a powerful tool for **extracting subsets of data**. The syntax for slicing is `array[start:stop:step]`, similar to what we learned about slicing strings and lists. If you don't specify a start, it defaults to index 0 (the first element of the array), if you don't specify a stop, it defaults to the end of the array, and if you don't specify a step, it defaults to 1. When you specify a start and stop, the range is inclusive of the start and exclusive of the stop. Usually, you don't need to specify the step, so you can omit the second colon.

You can also use negative indices to count from the end of the array. When you use negative indices, inclusive of the start and exclusive of the stop still applies.

Here are some examples of slicing one-dimensional and two-dimensional arrays. See if you can predict the output before running the code. (Do these on the board)

In [None]:
# Examples of slicing one-dimensional arrays
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1[0:2])
print(arr1[1:3])
print(arr1[:])
print(arr1[2:])
print(arr1[-1:])
print(arr1[:-1])
print(arr1[-3:-1])
print(arr1[::2])
print(arr1[::-1])

#### Slicing multi-dimensional arrays

Multi-dimensional numpy arrays are sliced and indexed with the same syntax as single dimensional arrays (vectors), but you need to separate the dimensions with a comma. For example, `array[i, j]` will return the element at row `i` and column `j`. In two dimensions, the first axis is the rows and the second axis is the columns. So now the syntax is `array[row_start:row_stop:row_step, col_start:col_stop:col_step]`.
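
To make the syntax concrete, here is a short sketch on a small 3x4 array. The array `small` is made up for illustration and is separate from the array used in the exercises.

```python
import numpy as np

# A small 3x4 array holding the numbers 0 to 11
small = np.arange(12).reshape(3, 4)
print(small)

print(small[1, 2])      # single element at row 1, column 2 -> 6
print(small[1, :])      # all of row 1 -> [4 5 6 7]
print(small[:, 2])      # all of column 2 -> [ 2  6 10]
print(small[0:2, 1:3])  # rows 0-1 and columns 1-2
```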

Let's practice slicing the following 5x5 array, which holds the numbers 1 to 25:

In [None]:
arr = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
])
print(arr)

>**Exercise**: Perform the following operations on the `arr` array:
>
> 1. Extract the first row
> 2. Extract the last column
> 3. Extract the first three rows
> 4. Extract the central 3x3 square
> 5. Extract the last column using positive indexing, and then do it again using negative indexing
> 6. Extract every other column (and all rows)
> 7. Extract every other element of the first row. You should get [1, 3, 5]
> 8. Extract the last row and reverse it. You should get [25, 24, 23, 22, 21]

In [None]:
# 1. Extract the first row of the array
# Your code here


In [None]:
# 2. Extract the last column
# Your code here


In [None]:
# 3. Extract the first three rows
# Your code here


In [None]:
# 4. Extract the central 3x3 square
# Your code here


In [None]:
# 5. Extract the last column using positive indexing, then do it again using negative indexing
# Your code here


In [None]:
# 6. Extract every other column (and all rows)
# Your code here


In [None]:
# 7. Extract every other element of the first row. You should get [1, 3, 5]
# Your code here


In [None]:
# 8. Extract the last row and reverse it. You should get [25, 24, 23, 22, 21]
# Your code here


>**Bonus Exercises**: Here are some more challenging slicing exercises:
>
> B1. Extract every other column and every other row, starting with the number 1. You should get:
> ```
> [[ 1  3  5]
>  [11 13 15]
>  [21 23 25]]
> ```
> B2. Extract a checkerboard pattern starting from the number 2. You should get:
> ```
> [[ 2  4]
>  [12 14]
>  [22 24]]
> ```

In [None]:
# B1. Extract every other column and every other row, starting with the number 1.
# Your code here


In [None]:
# B2. Extract a checkerboard pattern starting from the number 2.
# Your code here


**IMPORTANT**: Slicing arrays creates a **view** of the original array, not a copy. A view shares its underlying data with the original array. That means if you take a slice of an array and modify it, the original array will also be modified.

In [None]:
arr = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
])

print("This is the original array")
print(arr)
print("This is my slice of the array")
# named arr_slice to avoid overwriting python's built-in slice
arr_slice = arr[1:4,1:4]
print(arr_slice)
print("I'm going to change the slice")
arr_slice[:,:] = 999
print(arr_slice)
print("This is the original array")
print(arr)


When you do a calculation on a slice of an array, the result will be a new array.

In [None]:
arr = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
])
print("This is the original array")
print(arr)
print("This is my slice of the array")
# named arr_slice to avoid overwriting python's built-in slice
arr_slice = arr[1:4,1:4]
print(arr_slice)
print("add 100 to each element of the slice")
arr_slice = arr_slice + 100
print(arr_slice)
print("This is the original array")
print(arr)
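
If you want a slice that is independent of the original array, one option is the `.copy()` method. The sketch below also uses `np.shares_memory()` to check whether two arrays share data. The variable names are made up for illustration.

```python
import numpy as np

arr = np.arange(1, 26).reshape(5, 5)

# A plain slice is a view: it shares memory with arr
view_of_arr = arr[1:4, 1:4]
print(np.shares_memory(arr, view_of_arr))   # True

# .copy() makes an independent array with its own memory
copy_of_arr = arr[1:4, 1:4].copy()
print(np.shares_memory(arr, copy_of_arr))   # False

# Changing the copy leaves the original untouched
copy_of_arr[:, :] = 999
print(arr[1, 1])   # 7
```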

### Using boolean masks to filter arrays

You can use boolean expressions to **filter** arrays. Here's an example:

In [None]:
# Getting the even numbers from the array
arr = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
])

arr[arr % 2 == 0]

Under the hood, numpy creates a boolean mask that is the same shape as the array. The mask is `True` where the condition is met and `False` where it is not. You can then use the mask to filter the array.

In [None]:
print(arr % 2 == 0)
print(arr)

When using boolean masks in this way, it's important to note that the original shape of the array is not preserved. So this is useful if you want to filter out elements and do something with that collection of elements, but not if you want to replace certain elements with others.
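
A quick sketch of the shape change, using a small made-up 2x3 array: the mask has the same shape as the array, but the filtered result is flat.

```python
import numpy as np

small = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

mask = small > 2
print(mask.shape)        # (2, 3), same shape as the array

filtered = small[mask]
print(filtered)          # [3 4 5 6]
print(filtered.shape)    # (4,), a flat one-dimensional array

# the filtered values can be passed straight to other numpy functions
print(np.sum(filtered))  # 18
```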

If you want to replace elements in an array based on a condition, you can use the `np.where()` function. This function takes three arguments: the condition, the value to use if the condition is `True`, and the value to use if the condition is `False`.

In the code block below, we will replace all of the odd numbers in the array with the number 0.

In [None]:
np.where(arr % 2 == 0, arr, 0)

>**Exercise**: Perform the following operations on the `arr` array using boolean masks:
>
> 1. Extract all values less than the mean
> 2. Trim the array by removing the maximum and minimum value
> 3. Replace the maximum and minimum values with the mean of the entire array

In [None]:
# Extract all values less than the mean
arr = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
])

# Your code here


In [None]:
# Trim the array by removing the maximum and minimum value. Use np.min() and np.max() to find the min and max values
# Hint: it might be easiest to explicitly create the boolean mask for this
# The operator & is the element-wise AND operator. It returns True wherever both masks are True
# Your code here


>Compare your solution above with a neighbor. Did they do it differently? Find at least one person who did it differently.

In [None]:
# Replace the maximum and minimum values with the mean of the entire array
# Hint: you can use the mask you created above and np.where()
# Your code here

## Numpy math functions and additional capabilities

The numpy library isn't just about arrays. It also has a lot of **mathematical functions** that can be applied to arrays. Some of these functions may seem like duplicates of base python functions, but they are optimized for operating on multi-dimensional arrays whereas the base python functions are not. Other functions are unique to numpy and offer more advanced mathematical capabilities.

You can find a list of all mathematical functions in the numpy library [here](https://numpy.org/doc/stable/reference/routines.math.html).

In [None]:
arr = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
])

In [None]:
# Getting the mean of each column (axis=0 averages down the rows)
np.mean(arr, axis=0)

In [None]:
# getting the mean of each row (axis=1 averages across the columns)
np.mean(arr, axis=1)

In [None]:
# getting the standard deviation of each column
np.std(arr, axis=0)

In [None]:
# getting histogram of the array
np.histogram(arr, bins=5, range=(1, 25))
# if you don't want to hardcode the range, you can use the min and max functions
np.histogram(arr, bins=5, range=(np.min(arr), np.max(arr)))
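
`np.histogram()` does not draw a plot. It returns the numbers a plot would be built from. The sketch below unpacks the two arrays it returns; the names `counts` and `bin_edges` are just illustrative.

```python
import numpy as np

arr = np.arange(1, 26).reshape(5, 5)

# np.histogram() returns two arrays: the counts per bin and the bin edges
counts, bin_edges = np.histogram(arr, bins=5, range=(np.min(arr), np.max(arr)))

print(counts)
print(bin_edges)

# there is always one more edge than there are bins
print(len(counts), len(bin_edges))  # 5 6
```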

## Reading and writing data with numpy

Typically, we're not going to be able to manually enter data into our code's data structures (*e.g.* `my_data = [2, 5, 2, 7, 4]`). This would be tedious and time-consuming, given the size of our datasets. This would also make our code useful only for a single dataset at a time, which defeats one of the main purposes of programming!

Instead, we will have data stored as plain text files on our computer or hosted somewhere on the web. These files may be formatted in different ways (*e.g.* CSV files, tab-delimited files, compressed genome sequence files). While there are native Python functions to read plain text files, it is up to us as programmers to write code to accommodate the multitude of formats.

Numpy has two main **functions** for loading in text-based data such as CSVs: `np.loadtxt()` and `np.genfromtxt()`. The main difference between the two is that `np.genfromtxt()` can handle missing data, while `np.loadtxt()` cannot. So, to play it safe, it's usually best to use `np.genfromtxt()`.
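
To illustrate the difference, here is a small sketch that reads a made-up two-line "file" with one empty field. `StringIO` stands in for a real file so the example is self-contained.

```python
from io import StringIO

import numpy as np

# Hypothetical two-line file with an empty field in the second row
text = "1;2;3\n4;;6"

# np.genfromtxt() fills the empty field with nan ("not a number")
with_missing = np.genfromtxt(StringIO(text), delimiter=";")
print(with_missing)

# np.loadtxt() is expected to raise an error on the same input
try:
    np.loadtxt(StringIO(text), delimiter=";")
except ValueError as err:
    print("np.loadtxt failed:", err)
```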

Remember that numpy arrays can only contain one type of data, so if your file contains headers or mixed data types, you'll need to use additional options to tell numpy how to handle them.
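
One such option is `names=True`, sketched below on a made-up three-column excerpt. It returns a *structured array*, which is beyond the scope of this workshop, so treat this as a glimpse of what is possible rather than the approach used in this notebook.

```python
from io import StringIO

import numpy as np

# Hypothetical three-column excerpt in the same format as the wine dataset
text = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5"

# names=True reads the column names from the first line
# instead of trying to convert them to numbers
table = np.genfromtxt(StringIO(text), delimiter=";", names=True)

print(table.dtype.names)   # spaces in the names are replaced with underscores
print(table["pH"])         # select a column by name
```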

We will now import a numerical dataset of red wine quality ratings and various chemical properties. It has a header row and is separated by semi-colons. Here's a preview of the data:

```
fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
```

In [None]:
# import using genfromtxt and skip the header
wines_array = np.genfromtxt('https://informatics.fas.harvard.edu/resources/Workshops/2024-Fall/Python/data/winequality-red.csv', delimiter=';', skip_header = 1)
# import the header separately
# we specify the dtype as str so that numpy doesn't try to convert the header to a number
wines_header = np.genfromtxt('https://informatics.fas.harvard.edu/resources/Workshops/2024-Fall/Python/data/winequality-red.csv', delimiter=';', max_rows=1, dtype=str)
print(wines_header)
print(wines_array)
print(wines_array.shape)

If all of that looks confusing, just remember that `np.genfromtxt` is a **function**, which is just another block of code somewhere on the computer. Here, we've imported this function from the numpy library, which we've called `np`. Functions take **arguments**, which we learn about from **reading documentation**. Functions **return** values.

In this case, the function takes arguments like (importantly) the path to the file to read and other information about how to read the file. The function then reads the file in the background and returns a numpy array, which we've just learned how to work with.

>**Exercise**: Use `max` or `np.max` to find the max quality rating in the dataset and extract the rows with that rating. Save it to a new array called `best_wines`.

In [None]:
# your code here


# this should return (18, 12)
print(best_wines.shape)

In [None]:
print(best_wines)

Now that we have 18 top wines, let's save them to a new file. We can use the `np.savetxt()` function to save it as a human-readable delimited file. Instead of using semi-colons, we'll use commas to separate the values. The next few lines (`with open()`) are just to print the contents of the file to the screen.

In [None]:
# Saving the array to a csv
np.savetxt("best_wines.csv", best_wines, delimiter=",")
# Reading it back
with open('best_wines.csv', 'r') as file:
    for line in file:
        print(line.strip())

What happened to the numbers? By default, `np.savetxt()` writes each value in scientific notation with 18 digits after the decimal point (`fmt='%.18e'`). Our original csv did not have that much precision. The extra digits appear because numpy stores the data as 64-bit floating point numbers and only shortens them when displaying an array. We can change the format in which the numbers are written using the `fmt` argument. Here's how you would save the data with 3 decimal places. You will also notice that we took the `wines_header` variable and joined it into a single string with commas. This is because `np.savetxt()` expects a single string for the header row.

In [None]:
np.savetxt("best_wines.csv", best_wines, delimiter=",", fmt='%.3f', header=','.join(wines_header))
with open('best_wines.csv', 'r') as file:
    for line in file:
        print(line.strip())

This way of saving the data prepends the header row with a `#` character to indicate that it is a comment. That way, if you were to load the data back in with numpy, it would ignore the header row automatically. If you don't want this behavior, you can set the `comments` argument to an empty string. There are a lot of formatting options if you need things to be precisely formatted in print, but we won't go into the details today.
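
Here is a small sketch of the `comments` argument in action. The 2x2 array, the column names `a,b`, and the temporary file are all made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical 2x2 array, just for illustration
values = np.array([[1.5, 2.0], [3.25, 4.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")

    # default: the header line starts with "# "
    np.savetxt(path, values, delimiter=",", fmt="%.2f", header="a,b")
    with open(path, "r") as file:
        default_first_line = file.readline().strip()

    # comments="" writes the header without the prefix
    np.savetxt(path, values, delimiter=",", fmt="%.2f", header="a,b", comments="")
    with open(path, "r") as file:
        plain_first_line = file.readline().strip()

print(default_first_line)  # # a,b
print(plain_first_line)    # a,b
```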

>**Exercise**: Now that we've got the data, let's do one exercise focused on manipulating the data within numpy arrays. Remember what we've learned about slicing arrays and broadcasting operations.
>
> Every chemical property on the wine dataset is measured in different units. Let's normalize all the data so that it is on the same scale. To do this, we will first subtract the mean (`np.mean`) of each column from each element in the column, then divide by the standard deviation of the column (`np.std`). The quality column (last column) is not a chemical property, so we will not normalize it. Save the normalized data + original quality column to a new array.

In [None]:
# Your code here

# make a reference to the wines array without the last column

# calculate the mean of each column

# calculate the standard deviation of each column

# calculate the z-scores and assign them to a new array, z_scores

# add back the quality column to the z_scores array (hint: use np.column_stack)

print(z_scores)


## Numpy is a foundational library

Numpy forms the basis of many other python libraries, and can be used as-is to analyze a surprising range of data.

Here is an example tutorial that uses numpy to perform image analysis.

The package skimage is used for image processing. Images are easily represented as numpy arrays, where each pixel is a value in the array. You can represent color images as 3D arrays, where the third dimension holds the color channels. Take a look at this [tutorial](https://scikit-image.org/docs/stable/user_guide/numpy_images.html) to get an introduction to image processing.
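
As a minimal sketch using only numpy (no skimage), here is a made-up 2x2 color image stored as a rows x columns x channels array. The pixel values are arbitrary.

```python
import numpy as np

# A hypothetical 2x2 color image: rows x columns x 3 color channels (red, green, blue)
image = np.zeros((2, 2, 3), dtype=np.uint8)

image[0, 0] = [255, 0, 0]   # top-left pixel is red
image[1, 1] = [0, 0, 255]   # bottom-right pixel is blue

print(image.shape)       # (2, 2, 3)
print(image[:, :, 0])    # the red channel is an ordinary 2D array
```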

Numpy is the basis of scientific computing and machine learning libraries like scipy, pandas, and sklearn, and inspires libraries like tensorflow. For a more in-depth tutorial on numpy, check out the [visual guide to numpy](https://betterprogramming.pub/numpy-illustrated-the-visual-guide-to-numpy-3b1d4976de1d).

# Wrap-up

1.   **Reading documentation** is an important skill when using new functions, especially from **imported libraries**.
2.   **Numpy** is an **external library** that implements efficient data structures for large computations called **arrays**. Arrays can be single dimensional (*e.g.* one row of data), in which case they are called **vectors**, but can also be **multi-dimensional**. Operations can be performed quickly on arrays with **broadcasting**.
3.   Arrays can be **filtered** with boolean masks.
4.   Data can be **read** directly from text files into Python data structures and Numpy arrays. Likewise, data can be **written** from Python to text files.

Next time, we will learn how to use pandas, a library that is built on top of numpy and is specifically designed for data analysis. We will learn how to read data into dataframes, how to manipulate data, and how to create visualizations.