Python intensive, part 6
Putting it all together: exploring storm data
For our final section, we are going to do some exploratory analysis of a real world dataset, coding in pairs or groups as we did on the previous days. There are several different "paths" you can choose to follow based on what you are interested in working on, so either come to a consensus within your group, or group up based on shared interest! First, let's download our data:
# This line downloads the file locally to the same folder as your notebook
!wget https://raw.githubusercontent.com/harvardinformatics/python-intensive/refs/heads/main/data/indiana_storms_full.csv
import os
# This line stores the local file path as a Python string variable
storms_file = 'indiana_storms_full.csv'
if not os.path.isfile(storms_file):
storms_file = 'data/indiana_storms_full.csv'
--2025-09-11 19:57:04-- https://raw.githubusercontent.com/harvardinformatics/python-intensive/refs/heads/main/data/indiana_storms_full.csv Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response...
200 OK Length: 952491 (930K) [text/plain] Saving to: ‘indiana_storms_full.csv’ indiana_storms_full 0%[ ] 0 --.-KB/s
indiana_storms_full 100%[===================>] 930.17K --.-KB/s in 0.05s 2025-09-11 19:57:05 (18.3 MB/s) - ‘indiana_storms_full.csv’ saved [952491/952491]
BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME \ 0 201501 8 1700 201501 9 1200 1 201501 5 2200 201501 6 830 2 201501 5 2200 201501 6 900 3 201501 5 2200 201501 6 900 4 201501 5 2215 201501 6 930 EPISODE_ID EVENT_ID STATE STATE_FIPS ... END_RANGE END_AZIMUTH \ 0 91570 548696 INDIANA 18 ... NaN NaN 1 92441 555468 INDIANA 18 ... NaN NaN 2 92441 555469 INDIANA 18 ... NaN NaN 3 92441 555470 INDIANA 18 ... NaN NaN 4 92441 555471 INDIANA 18 ... NaN NaN END_LOCATION BEGIN_LAT BEGIN_LON END_LAT END_LON \ 0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN EPISODE_NARRATIVE \ 0 A quick-moving Alberta Clipper brought a quick... 1 An Alberta Clipper brought a quick round of sn... 2 An Alberta Clipper brought a quick round of sn... 3 An Alberta Clipper brought a quick round of sn... 4 An Alberta Clipper brought a quick round of sn... EVENT_NARRATIVE DATA_SOURCE 0 Periods of snow during the late afternoon hour... CSV 1 Snowfall accumulation observations ranged from... CSV 2 Snowfall accumulation observations ranged from... CSV 3 Snowfall accumulation observations ranged from... CSV 4 Snowfall accumulation of 5.5 inches was report... CSV [5 rows x 51 columns]
RangeIndex: 1315 entries, 0 to 1314 Data columns (total 51 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BEGIN_YEARMONTH 1315 non-null int64 1 BEGIN_DAY 1315 non-null int64 2 BEGIN_TIME 1315 non-null int64 3 END_YEARMONTH 1315 non-null int64 4 END_DAY 1315 non-null int64 5 END_TIME 1315 non-null int64 6 EPISODE_ID 1315 non-null int64 7 EVENT_ID 1315 non-null int64 8 STATE 1315 non-null object 9 STATE_FIPS 1315 non-null int64 10 YEAR 1315 non-null int64 11 MONTH_NAME 1315 non-null object 12 EVENT_TYPE 1315 non-null object 13 CZ_TYPE 1315 non-null object 14 CZ_FIPS 1315 non-null int64 15 CZ_NAME 1315 non-null object 16 WFO 1315 non-null object 17 BEGIN_DATE_TIME 1315 non-null object 18 CZ_TIMEZONE 1315 non-null object 19 END_DATE_TIME 1315 non-null object 20 INJURIES_DIRECT 1315 non-null int64 21 INJURIES_INDIRECT 1315 non-null int64 22 DEATHS_DIRECT 1315 non-null int64 23 DEATHS_INDIRECT 1315 non-null int64 24 DAMAGE_PROPERTY 1011 non-null object 25 DAMAGE_CROPS 1009 non-null object 26 SOURCE 1315 non-null object 27 MAGNITUDE 577 non-null float64 28 MAGNITUDE_TYPE 390 non-null object 29 FLOOD_CAUSE 246 non-null object 30 CATEGORY 0 non-null float64 31 TOR_F_SCALE 24 non-null object 32 TOR_LENGTH 24 non-null float64 33 TOR_WIDTH 24 non-null float64 34 TOR_OTHER_WFO 3 non-null object 35 TOR_OTHER_CZ_STATE 3 non-null object 36 TOR_OTHER_CZ_FIPS 3 non-null float64 37 TOR_OTHER_CZ_NAME 3 non-null object 38 BEGIN_RANGE 871 non-null float64 39 BEGIN_AZIMUTH 871 non-null object 40 BEGIN_LOCATION 871 non-null object 41 END_RANGE 871 non-null float64 42 END_AZIMUTH 871 non-null object 43 END_LOCATION 871 non-null object 44 BEGIN_LAT 871 non-null float64 45 BEGIN_LON 871 non-null float64 46 END_LAT 871 non-null float64 47 END_LON 871 non-null float64 48 EPISODE_NARRATIVE 1315 non-null object 49 EVENT_NARRATIVE 1079 non-null object 50 DATA_SOURCE 1315 non-null object dtypes: float64(11), int64(15), object(25) memory usage: 524.1+ KB
Data Cleaning: subsetting columns
This is a file that contains data from the National Weather Service. It contains data on storm events in the state of Indiana in 2015. There is quite a lot of information in there with 50 columns in the dataset, so before we do any analysis let's do a bit of cleanup!
We are primarily interested in type of storm event "EVENT_TYPE", the county in which the event occurred "CZ_NAME", and the time of the event "BEGIN_DATE_TIME", "END_DATE_TIME", "CZ_TIMEZONE". Let's also throw in there "BEGIN_LOCATION", "END_LOCATION", "BEGIN_LAT", "BEGIN_LON", "END_LAT", and "END_LON" just in case. So the first thing we need to do is subset the data to only include these columns.
storms_df = storms_df[["EVENT_TYPE", "CZ_NAME", "BEGIN_DATE_TIME", "END_DATE_TIME", "CZ_TIMEZONE", "BEGIN_LOCATION", "END_LOCATION", "BEGIN_LAT", "BEGIN_LON", "END_LAT", "END_LON"]]
Now, let's check if there are any missing values in the columns we are interested in. If there are, we should remove the rows with missing values.
RangeIndex: 1315 entries, 0 to 1314 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 EVENT_TYPE 1315 non-null object 1 CZ_NAME 1315 non-null object 2 BEGIN_DATE_TIME 1315 non-null object 3 END_DATE_TIME 1315 non-null object 4 CZ_TIMEZONE 1315 non-null object 5 BEGIN_LOCATION 871 non-null object 6 END_LOCATION 871 non-null object 7 BEGIN_LAT 871 non-null float64 8 BEGIN_LON 871 non-null float64 9 END_LAT 871 non-null float64 10 END_LON 871 non-null float64 dtypes: float64(4), object(7) memory usage: 113.1+ KB
Only some storm events have a begin and end location, but at least there aren't any that only have a begin and not and end. Data looks good! The next thing we need to do is to convert the "BEGIN_DATE_TIME" and "END_DATE_TIME" columns to datetime objects. Additionally, these data are spread out over two time zones and we want to convert it to a single time zone for clarity. This will make it easier to work with these columns later on. For more information on the proper way to manage dates and times in Python, see the section in our Python Healthy Habits document "Working with dates and times".
# Convert all date time columns to datetime objects with correct timezone parsing
# I wrote a regex to extract just the number from the timezone column and then made it an ISO8601 compliant timezone string
# Pandas does not support multiple time zones in the same column, so we will convert all times to UTC and then convert to one of the local time zones
storms_df['BEGIN_DATE_TIME'] = pd.to_datetime(storms_df["BEGIN_DATE_TIME"]+"-0"+storms_df["CZ_TIMEZONE"].str.extract(r'([0-9]+)')[0]+"00", format="mixed", utc=True).dt.tz_convert("-05:00")
storms_df['END_DATE_TIME'] = pd.to_datetime(storms_df["END_DATE_TIME"]+"-0"+storms_df["CZ_TIMEZONE"].str.extract(r'([0-9]+)')[0]+"00", format="mixed", utc=True).dt.tz_convert("-05:00")
Here's a breakdown of the code I used to convert the "BEGIN_DATE_TIME" and "END_DATE_TIME" columns to datetime objects:
storms_df['BEGIN_DATE_TIME'] = pd.to_datetime(storms_df["BEGIN_DATE_TIME"]+"-0"+storms_df["CZ_TIMEZONE"].str.extract(r'([0-9]+)')[0]+"00", format="mixed", utc=True).dt.tz_convert("-05:00")
| Code | Explanation |
|---|---|
pd.to_datetime() |
This function converts a string or a list of strings to a datetime object. However, I first need to convert the string into something that is can parse. |
storms_df["BEGIN_DATE_TIME"] |
This column is alredy in a format that pandas can parse, but it is missing the time zone information. |
+"-0"+storms_df["CZ_TIMEZONE"].str.extract(r'([0-9]+)')[0]+"00" |
This adds the time zone offset to the string by concatenating -0 and the regex match for the time zone offset. Plus, it adds 00 to the end to represent the minutes. |
format="mixed" |
This tells pandas to try to parse the string in a mixed format, which means it will attempt to infer the format of the date and time. |
utc=True |
This tells pandas to treat the datetime object as UTC time, which is a standard time zone that is not affected by daylight savings time. |
.dt.tz_convert("-05:00") |
This converts the datetime object to the specified time zone, which is Eastern Standard Time (EST) in this case. |
Here's what the times look like now:
0 2015-01-08 17:00:00-05:00 1 2015-01-05 22:00:00-05:00 2 2015-01-05 22:00:00-05:00 3 2015-01-05 22:00:00-05:00 4 2015-01-05 22:15:00-05:00 Name: BEGIN_DATE_TIME, dtype: datetime64[ns, UTC-05:00]
RangeIndex: 1315 entries, 0 to 1314 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 EVENT_TYPE 1315 non-null object 1 CZ_NAME 1315 non-null object 2 BEGIN_DATE_TIME 1315 non-null datetime64[ns, UTC-05:00] 3 END_DATE_TIME 1315 non-null datetime64[ns, UTC-05:00] 4 CZ_TIMEZONE 1315 non-null object 5 BEGIN_LOCATION 871 non-null object 6 END_LOCATION 871 non-null object 7 BEGIN_LAT 871 non-null float64 8 BEGIN_LON 871 non-null float64 9 END_LAT 871 non-null float64 10 END_LON 871 non-null float64 dtypes: datetime64[ns, UTC-05:00](2), float64(4), object(5) memory usage: 113.1+ KB
Warm-up exercises
Here are some warm-up exercises to get started working with data that is a mix of categorical and datetime data. One helpful tip to working with datetime objects is to get the series of the datetime object and then use the .dt accessor to access the datetime properties. You can find all the properties of the datetime object here .
0 9 1 6 2 6 3 6 4 6 ... 1310 357 1311 357 1312 365 1313 357 1314 1 Name: END_DATE_TIME, Length: 1315, dtype: int32
0 January 1 January 2 January 3 January 4 January ... 1310 December 1311 December 1312 December 1313 December 1314 January Name: END_DATE_TIME, Length: 1315, dtype: object
The next three exercises are one-liners what will use the method .value_counts() to get the number of rows for each unique value in a column. This is a useful method to get a sense of the distribution of the data in a column. Below, we demonstrate by counting the number of each event type. (Compare this to the groupby().size() method we used last time)
EVENT_TYPE Thunderstorm Wind 383 Hail 187 Winter Weather 162 Heavy Snow 131 Flood 125 Flash Flood 121 Dense Fog 40 Winter Storm 33 Heavy Rain 27 Extreme Cold/Wind Chill 26 Tornado 24 Heat 15 Cold/Wind Chill 12 Excessive Heat 6 Frost/Freeze 6 Strong Wind 5 Ice Storm 3 Lightning 3 High Wind 2 Lake-Effect Snow 2 Funnel Cloud 1 Blizzard 1 Name: count, dtype: int64
Exercise: Are storms more common on weekends or weekdays? YES/NO. Display your answer by counting the number of events that happend on each day of the week. Don't worry about sorting the days of the week, just display the counts. Look at the datetime object documentation to see how to get the day of the week from the
.dtaccessor.
Solution
BEGIN_DATE_TIME Monday 275 Wednesday 208 Friday 205 Sunday 188 Saturday 166 Tuesday 154 Thursday 119 Name: count, dtype: int64
Exercise: Which events tend to span more than one county? Display the counts of each event type that spans more than one county. (Hint: Events which span more than one county will have non-null values for
BEGIN_LOCATIONandEND_LOCATION.)
Solution
EVENT_TYPE Thunderstorm Wind 383 Hail 187 Flood 125 Flash Flood 121 Heavy Rain 27 Tornado 24 Lightning 3 Funnel Cloud 1 Name: count, dtype: int64
Exercise: Which events tend to be multi-day events? Display the counts of each event type that has a duration of more than 1 day (24 hours).
You will need to use the
pd.Timedelta()function in your filtering criteria. Docs here and here
Solution
# Find duration of events
storms_df['EVENT_DURATION'] = storms_df["END_DATE_TIME"] - storms_df["BEGIN_DATE_TIME"]
# filter for events that last more than one day and count them
storms_df[storms_df['EVENT_DURATION'] > pd.Timedelta(days=1)]["EVENT_TYPE"].value_counts()
EVENT_TYPE Flood 48 Heavy Snow 24 Excessive Heat 6 Winter Storm 4 Winter Weather 3 Blizzard 1 Flash Flood 1 Extreme Cold/Wind Chill 1 Lake-Effect Snow 1 Name: count, dtype: int64
Exercise: Create two lists: a list of all county names and a list of all event types. How many counties and event types are there? You will be using the Series methods
.unique()and.tolist()here.
Solution
county_list = storms_df["CZ_NAME"].unique().tolist()
print(county_list)
print(f"There are {len(county_list)} counties in the dataset")
event_list = storms_df["EVENT_TYPE"].unique().tolist()
print(event_list)
print(f"There are {len(event_list)} event types in the dataset")
['NOBLE', 'MADISON', 'DELAWARE', 'HENRY', 'RANDOLPH', 'MARION', 'STEUBEN', 'LAGRANGE', 'WHITLEY', 'KOSCIUSKO', 'HANCOCK', 'FAYETTE', 'WAYNE', 'FRANKLIN', 'DEARBORN', 'MARSHALL', 'LA PORTE', 'ELKHART', 'WHITE', 'GIBSON', 'PIKE', 'POSEY', 'SPENCER', 'VANDERBURGH', 'WARRICK', 'CASS', 'PULASKI', 'MIAMI', 'FULTON', 'STARKE', 'ST. JOSEPH', 'ALLEN', 'WABASH', 'HUNTINGTON', 'WELLS', 'ADAMS', 'GRANT', 'BLACKFORD', 'JAY', 'PORTER', 'TIPPECANOE', 'RIPLEY', 'UNION', 'SWITZERLAND', 'CLINTON', 'HOWARD', 'TIPTON', 'BOONE', 'HAMILTON', 'DE KALB', 'ORANGE', 'CRAWFORD', 'BROWN', 'JOHNSON', 'PERRY', 'CLARK', 'HARRISON', 'MARTIN', 'WASHINGTON', 'DAVIESS', 'RUSH', 'DUBOIS', 'MONROE', 'KNOX', 'FLOYD', 'VERMILLION', 'HENDRICKS', 'MORGAN', 'JEFFERSON', 'SCOTT', 'JENNINGS', 'PARKE', 'OWEN', 'PUTNAM', 'WARREN', 'NEWTON', 'GREENE', 'CARROLL', 'SHELBY', 'MONTGOMERY', 'FOUNTAIN', 'JACKSON', 'Vanderburgh', 'Warrick', 'JASPER', 'SULLIVAN', 'DECATUR', 'VIGO', 'LAKE', 'LAWRENCE', 'BARTHOLOMEW', 'Spencer', 'Gibson', 'OHIO', 'Pike', 'Posey', 'BENTON', 'CLAY', 'Elkhart'] There are 99 counties in the dataset ['Winter Weather', 'Heavy Snow', 'Ice Storm', 'Thunderstorm Wind', 'Extreme Cold/Wind Chill', 'Flash Flood', 'Hail', 'Heavy Rain', 'Flood', 'Tornado', 'Lightning', 'Heat', 'Funnel Cloud', 'Excessive Heat', 'Dense Fog', 'Cold/Wind Chill', 'Winter Storm', 'Lake-Effect Snow', 'Blizzard', 'Frost/Freeze', 'Strong Wind', 'High Wind'] There are 22 event types in the dataset
Exercise: Function-writing warmup. Write a function called
get_event_countythat takes as input the storms dataframe object, a county name (as a string), and an event type (as a string). The function should return a dataframe that contains only that county and event type.This function is ideally a one-liner that filters the dataframe based on two criteria. Remember to enclose each condition in
()and use the&operator to combine them.
# test your code
# should return a dataframe with 7 rows
get_event_county(storms_df, "ELKHART", "Thunderstorm Wind")
Solution
def get_event_county(storms, county, event):
return storms[(storms["EVENT_TYPE"] == event) & (storms["CZ_NAME"] == county)]
get_event_county(storms_df, "ELKHART", "Thunderstorm Wind")
EVENT_TYPE CZ_NAME BEGIN_DATE_TIME \ 572 Thunderstorm Wind ELKHART 2015-07-18 17:55:00-05:00 574 Thunderstorm Wind ELKHART 2015-07-18 17:58:00-05:00 583 Thunderstorm Wind ELKHART 2015-07-18 18:15:00-05:00 591 Thunderstorm Wind ELKHART 2015-07-18 18:35:00-05:00 592 Thunderstorm Wind ELKHART 2015-07-13 20:47:00-05:00 608 Thunderstorm Wind ELKHART 2015-07-18 17:50:00-05:00 971 Thunderstorm Wind ELKHART 2015-08-02 23:25:00-05:00END_DATE_TIME CZ_TIMEZONE BEGIN_LOCATION \ 572 2015-07-18 17:58:00-05:00 EST-5 JAMESTOWN 574 2015-07-18 18:01:00-05:00 EST-5 ELKHART 583 2015-07-18 18:18:00-05:00 EST-5 BRISTOL 591 2015-07-18 18:38:00-05:00 EST-5 ELKHART MIDWAY ARPT 592 2015-07-13 20:50:00-05:00 EST-5 MILLERSBURG 608 2015-07-18 17:53:00-05:00 EST-5 ELKHART 971 2015-08-02 23:26:00-05:00 EST-5 GOSHEN
END_LOCATION BEGIN_LAT BEGIN_LON END_LAT END_LON \ 572 JAMESTOWN 41.6400 -86.0200 41.6400 -86.0200 574 ELKHART 41.7026 -85.9467 41.7026 -85.9467 583 BRISTOL 41.6900 -85.8200 41.6900 -85.8200 591 ELKHART MIDWAY ARPT 41.6100 -85.8700 41.6100 -85.8700 592 MILLERSBURG 41.5100 -85.7300 41.5100 -85.7300 608 ELKHART 41.6800 -85.9800 41.6800 -85.9800 971 GOSHEN 41.5800 -85.8300 41.5800 -85.8300
EVENT_DURATION 572 0 days 00:03:00 574 0 days 00:03:00 583 0 days 00:03:00 591 0 days 00:03:00 592 0 days 00:03:00 608 0 days 00:03:00 971 0 days 00:01:00
Now that we have imported our data, cleaned it up and begun to explore it, let's split off into groups! We have a few paths we can take:
- A guided path: We will guide you through writing functions that summarize and print the data in different ways. This will help you practice writing functions, filtering data, and practice a tiny bit of plotting.
- A few self-guided exploratory questions that you can approach however you want, but focus on different skills:
- Question 1: What is the most dangerous county in Indiana? We provide a separate table of county info so you can normalize the data by county area. You will need to learn how to merge two dataframes together.
- Question 2: What is the best time to visit Indiana if you want to take cool pictures of clouds? This will help you practice summarizing data and, depending on how you approach it, work with date time objects of various duration.
- Question 3: Create a plot that compares the duration in hours of each common event type. This will help you practice creating new columns and plotting data.
Path 1: function writing practice
- Write a function called
storm_by_countythat takes as input the storms object, a LIST of county names, and returns a dataframe of all of the events that occurred in a given county and the date they began. Include an optional argumentstorm_typethat by default isNoneand includes all event types, but if given a list, only includes those events.
# test your function
# should return 11 rows of data
storm_by_county(storms_df, ["ELKHART", "ST. JOSEPH"], ["Thunderstorm Wind"])
Solution
def storm_by_county(storms, county, storm_type=None):
if storm_type:
matches = storms[(storms['CZ_NAME'].isin(county)) & (storms['EVENT_TYPE'].isin(storm_type))]
else:
matches = storms[(storms['CZ_NAME'].isin(county))]
return matches
storm_by_county(storms_df, ["OHIO"])
print("...")
storm_by_county(storms_df, ["ELKHART", "ST. JOSEPH"], ["Thunderstorm Wind"])
...
EVENT_TYPE CZ_NAME BEGIN_DATE_TIME \ 281 Thunderstorm Wind ST. JOSEPH 2015-06-11 15:26:00-05:00 480 Thunderstorm Wind ST. JOSEPH 2015-06-08 16:30:00-05:00 571 Thunderstorm Wind ST. JOSEPH 2015-07-18 17:43:00-05:00 572 Thunderstorm Wind ELKHART 2015-07-18 17:55:00-05:00 574 Thunderstorm Wind ELKHART 2015-07-18 17:58:00-05:00 582 Thunderstorm Wind ST. JOSEPH 2015-07-18 18:04:00-05:00 583 Thunderstorm Wind ELKHART 2015-07-18 18:15:00-05:00 591 Thunderstorm Wind ELKHART 2015-07-18 18:35:00-05:00 592 Thunderstorm Wind ELKHART 2015-07-13 20:47:00-05:00 608 Thunderstorm Wind ELKHART 2015-07-18 17:50:00-05:00 971 Thunderstorm Wind ELKHART 2015-08-02 23:25:00-05:00END_DATE_TIME CZ_TIMEZONE BEGIN_LOCATION \ 281 2015-06-11 15:26:00-05:00 EST-5 ROSELAND 480 2015-06-08 16:30:00-05:00 EST-5 GRANGER 571 2015-07-18 17:46:00-05:00 EST-5 GILMER PARK 572 2015-07-18 17:58:00-05:00 EST-5 JAMESTOWN 574 2015-07-18 18:01:00-05:00 EST-5 ELKHART 582 2015-07-18 18:07:00-05:00 EST-5 GLENWOOD 583 2015-07-18 18:18:00-05:00 EST-5 BRISTOL 591 2015-07-18 18:38:00-05:00 EST-5 ELKHART MIDWAY ARPT 592 2015-07-13 20:50:00-05:00 EST-5 MILLERSBURG 608 2015-07-18 17:53:00-05:00 EST-5 ELKHART 971 2015-08-02 23:26:00-05:00 EST-5 GOSHEN
END_LOCATION BEGIN_LAT BEGIN_LON END_LAT END_LON \ 281 ROSELAND 41.7226 -86.2559 41.7226 -86.2559 480 GRANGER 41.7400 -86.1400 41.7400 -86.1400 571 GILMER PARK 41.6285 -86.2971 41.6285 -86.2971 572 JAMESTOWN 41.6400 -86.0200 41.6400 -86.0200 574 ELKHART 41.7026 -85.9467 41.7026 -85.9467 582 GLENWOOD 41.6700 -86.0800 41.6700 -86.0800 583 BRISTOL 41.6900 -85.8200 41.6900 -85.8200 591 ELKHART MIDWAY ARPT 41.6100 -85.8700 41.6100 -85.8700 592 MILLERSBURG 41.5100 -85.7300 41.5100 -85.7300 608 ELKHART 41.6800 -85.9800 41.6800 -85.9800 971 GOSHEN 41.5800 -85.8300 41.5800 -85.8300
EVENT_DURATION 281 0 days 00:00:00 480 0 days 00:00:00 571 0 days 00:03:00 572 0 days 00:03:00 574 0 days 00:03:00 582 0 days 00:03:00 583 0 days 00:03:00 591 0 days 00:03:00 592 0 days 00:03:00 608 0 days 00:03:00 971 0 days 00:01:00
- Create a function called
display_stormsthat takes a storms object, iterates through each row, and prints out each event as a sentence, including the event type, the county, and the date and time of the event. This exercise helps with iterating over data. An example output would be: "A thunderstorm occurred in Marion County on 2015-06-15 13:45:00". (as a bonus, you can try and format the date and time to be more human-readable)
HINT: You will need to use the .iterrows() method
- Test your function by:
- Printing all storm events in TIPPECANOE county
- Printing all "Flood" and "Flash Flood" events in the following counties: MARION, MONROE, SPENCER, VERMILLION
- Print all "Tornado" events in the whole state, i.e. all the counties.
## Test your storm_by_county function
# 3.1: Display all storm events in TIPPECANOE county.
print("--- 3.1 ---")
display_storms(storm_by_county(storms_df, ["TIPPECANOE"]))
# 3.2: Display all Flood and Flash Flood events in the following counties: MARION, MONROE, SPENCER, VERMILLION
print("--- 3.2 ---")
counties_to_check = ["MARION", "MONROE", "SPENCER", "VERMILLION"]
storm_types = ["Flood", "Flash Flood"]
display_storms(storm_by_county(storms_df, counties_to_check, storm_types))
# 3.3: Display all Tornado events that occurred in the state.
print("--- 3.3 ---")
county_list = storms_df["CZ_NAME"].unique()
display_storms(storm_by_county(storms_df, county_list, storm_type=["Tornado"]))
Solution
def display_storms(storms):
for index, event in storms.iterrows():
print("A", event["EVENT_TYPE"], "occurred on", event["BEGIN_DATE_TIME"], "in", event["CZ_NAME"], "county.")
# 3.1: Display all storm events in TIPPECANOE county.
print("--- 3.1 ---")
display_storms(storm_by_county(storms_df, ["TIPPECANOE"]))
# 3.2: Display all Flood and Flash Flood events in the following counties: MARION, MONROE, SPENCER, VERMILLION
print("--- 3.2 ---")
counties_to_check = ["MARION", "MONROE", "SPENCER", "VERMILLION"]
storm_types = ["Flood", "Flash Flood"]
display_storms(storm_by_county(storms_df, counties_to_check, storm_types))
# 3.3: Display all Tornado events that occurred in the state.
print("--- 3.3 ---")
county_list = storms_df["CZ_NAME"].unique()
display_storms(storm_by_county(storms_df, county_list, storm_type=["Tornado"]))
--- 3.1 --- A Heavy Snow occurred on 2015-01-05 21:00:00-05:00 in TIPPECANOE county. A Hail occurred on 2015-04-08 16:15:00-05:00 in TIPPECANOE county. A Hail occurred on 2015-04-08 16:25:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-06-12 16:37:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-06-12 16:40:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-06-21 00:15:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-06-07 19:15:00-05:00 in TIPPECANOE county. A Hail occurred on 2015-06-08 19:50:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-06-07 18:51:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-06-21 00:10:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-06-21 00:27:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-07-13 09:50:00-05:00 in TIPPECANOE county. A Tornado occurred on 2015-07-17 15:33:00-05:00 in TIPPECANOE county. A Flood occurred on 2015-07-17 15:30:00-05:00 in TIPPECANOE county. A Hail occurred on 2015-07-17 15:04:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-07-17 15:04:00-05:00 in TIPPECANOE county. A Tornado occurred on 2015-07-17 15:38:00-05:00 in TIPPECANOE county. A Hail occurred on 2015-09-18 21:34:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-09-18 21:42:00-05:00 in TIPPECANOE county. A Hail occurred on 2015-09-18 21:43:00-05:00 in TIPPECANOE county. A Thunderstorm Wind occurred on 2015-09-18 21:43:00-05:00 in TIPPECANOE county. --- 3.2 --- A Flash Flood occurred on 2015-06-22 05:20:00-05:00 in SPENCER county. A Flood occurred on 2015-04-08 16:36:00-05:00 in MARION county. A Flood occurred on 2015-06-08 02:16:00-05:00 in VERMILLION county. A Flash Flood occurred on 2015-06-25 08:00:00-05:00 in VERMILLION county. A Flood occurred on 2015-06-26 22:45:00-05:00 in MARION county. A Flood occurred on 2015-06-19 10:23:00-05:00 in MONROE county. A Flash Flood occurred on 2015-07-20 23:25:00-05:00 in MARION county. A Flash Flood occurred on 2015-07-21 00:14:00-05:00 in MARION county. A Flash Flood occurred on 2015-07-26 18:26:00-05:00 in MARION county. A Flash Flood occurred on 2015-07-07 14:20:00-05:00 in MARION county. A Flash Flood occurred on 2015-07-14 01:20:00-05:00 in MARION county. A Flood occurred on 2015-07-17 18:29:00-05:00 in MARION county. A Flash Flood occurred on 2015-07-20 23:20:00-05:00 in MARION county. A Flood occurred on 2015-12-27 11:30:00-05:00 in MARION county. A Flood occurred on 2015-12-30 22:00:00-05:00 in SPENCER county. --- 3.3 --- A Tornado occurred on 2015-04-07 15:27:00-05:00 in PERRY county. A Tornado occurred on 2015-05-30 19:47:00-05:00 in MARION county. A Tornado occurred on 2015-06-07 16:28:00-05:00 in JASPER county. A Tornado occurred on 2015-06-07 16:37:00-05:00 in JASPER county. A Tornado occurred on 2015-06-19 15:40:00-05:00 in PERRY county. A Tornado occurred on 2015-07-13 20:25:00-05:00 in PUTNAM county. A Tornado occurred on 2015-07-17 15:33:00-05:00 in TIPPECANOE county. A Tornado occurred on 2015-07-13 20:27:00-05:00 in PUTNAM county. A Tornado occurred on 2015-07-12 02:38:00-05:00 in JOHNSON county. A Tornado occurred on 2015-07-13 19:59:00-05:00 in BENTON county. A Tornado occurred on 2015-04-07 14:20:00-05:00 in VANDERBURGH county. A Tornado occurred on 2015-04-07 14:28:00-05:00 in WARRICK county. A Tornado occurred on 2015-04-07 14:47:00-05:00 in WARRICK county. A Tornado occurred on 2015-04-07 14:54:00-05:00 in WARRICK county. A Tornado occurred on 2015-04-07 14:56:00-05:00 in SPENCER county. A Tornado occurred on 2015-07-13 20:15:00-05:00 in WARREN county. A Tornado occurred on 2015-07-17 15:38:00-05:00 in TIPPECANOE county. A Tornado occurred on 2015-06-07 17:31:00-05:00 in STARKE county. A Tornado occurred on 2015-12-23 15:42:00-05:00 in JOHNSON county. A Tornado occurred on 2015-12-23 15:55:00-05:00 in HAMILTON county. A Tornado occurred on 2015-12-23 16:55:00-05:00 in WAYNE county. A Tornado occurred on 2015-12-23 16:39:00-05:00 in SPENCER county. A Tornado occurred on 2015-12-23 16:13:00-05:00 in DECATUR county. A Tornado occurred on 2015-12-23 16:14:00-05:00 in RUSH county.
- Make a histogram that displays the count of the events which contain "flood" in the following counties: ["MARION", "MONROE", "SPENCER", "VERMILLION", "TIPPECANOE"]. The x axis will be the different "flood" event types, and the color of the bar will be the county. The y axis will be the count of the events.
HINT: First, you will have to find a way to get all values of EVENT_TYPE that contain the word "flood" (case insensitive) and make that into a list. Then, you can pass that list and your county list to your storm_by_county() function. There are many ways to get the values that contain "flood". One way is to use str.contains(), but will involve chaining a few methods and doing some dataframe subsetting. Another way is to use a for loop and an if statement and may be more readable.
When you make a histogram using sns.histplot(), you can use the multiple='dodge' argument to make the bars side by side.
Solution
# import seaborn
import seaborn as sns
# str.contains method
storm_list = storms_df["EVENT_TYPE"][storms_df["EVENT_TYPE"].str.contains("flood", case=False)].unique().tolist()
# for loop method
storm_list = []
for event in storms_df["EVENT_TYPE"].dropna().unique():
if "flood" in event.lower():
storm_list.append(event)
# make a subset
storms_subset = storm_by_county(storms_df, county=["MARION", "MONROE", "SPENCER", "VERMILLION", "TIPPECANOE"], storm_type = storm_list)
# plot the subset
sns.histplot(storms_subset, x="EVENT_TYPE", hue="CZ_NAME", multiple="dodge", shrink=0.8)
- Write another function called
summarise_stormswhich takes as input the storms object you created, a list of counties, and an optional list ofstorm_typethat displays only certain storm types (aka the same arguments asstorm_by_county). Instead of returning every event, it returns a dataframe summarizing the number of occurences of each event in those counties. HINT You may want to use your previous functionstorm_by_countyto avoid repeating code!
- Test your function by:
- Printing the number of all events in PIKE county without printing information about each one
- Printing the total number of "Thunderstorm Wind" events in the following counties: ELKHART, LA PORTE, BOONE
## Test your modified storm_by_county function
# 5.1 Display the number of all events in PIKE county
print("--- 5.1 ---")
print(summarise_storms(storms_df, ["PIKE"]))
# 5.2 Display the number of Thunderstorm Wind events in the following counties: ELKHART, LA PORTE, BOONE
print("--- 5.2 ---")
print(summarise_storms(storms_df, ["ELKHART", "LA PORTE", "BOONE"], ["Thunderstorm Wind"]))
Solution
def summarise_storms(storms, county, storm_type=None):
df = storm_by_county(storms, county, storm_type).groupby(["CZ_NAME","EVENT_TYPE"]).size().reset_index()
return df
# 5.1 Display the number of all events in PIKE county
print("--- 5.1 ---")
print(summarise_storms(storms_df, ["PIKE"]))
# 5.2 Display the number of Thunderstorm Wind events in the following counties: ELKHART, LA PORTE, BOONE
print("--- 5.2 ---")
print(summarise_storms(storms_df, ["ELKHART", "LA PORTE", "BOONE"], ["Thunderstorm Wind"]))
--- 5.1 --- CZ_NAME EVENT_TYPE 0 0 PIKE Cold/Wind Chill 2 1 PIKE Dense Fog 5 2 PIKE Flood 6 3 PIKE Frost/Freeze 1 4 PIKE Heavy Snow 1 5 PIKE Winter Storm 2 6 PIKE Winter Weather 3 --- 5.2 --- CZ_NAME EVENT_TYPE 0 0 BOONE Thunderstorm Wind 7 1 ELKHART Thunderstorm Wind 7 2 LA PORTE Thunderstorm Wind 5
- Use your function to summarize the total counts of each weather event in each county.
Solution
all_weather = storms_df["EVENT_TYPE"].unique().tolist()
all_counties = storms_df["CZ_NAME"].unique().tolist()
summarise_storms(storms_df, all_counties, all_weather)
CZ_NAME EVENT_TYPE 0 0 ADAMS Extreme Cold/Wind Chill 1 1 ADAMS Hail 1 2 ADAMS Heavy Rain 1 3 ADAMS Heavy Snow 1 4 ADAMS Winter Weather 5 .. ... ... .. 473 WHITLEY Heavy Snow 2 474 WHITLEY Winter Weather 5 475 Warrick Dense Fog 1 476 Warrick Excessive Heat 1 477 Warrick Heat 3[478 rows x 3 columns]
Self-guided exploratory questions
Here are a few questions we've come up with that can be approached in multiple different ways.
Question 1: What is the most dangerous county in Indiana, assuming danger only comes from the storm events? You will need to account for the area of each county to answer this question. We've provided code to get the table from wikipedia that lists county information.
!pip install lxml
county_info = pd.read_html("https://en.wikipedia.org/wiki/List_of_counties_in_Indiana", storage_options={"User-Agent": "Mozilla/5.0"})
county_info = county_info[1]
Collecting lxml
Downloading lxml-6.0.1-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.metadata (3.8 kB) Downloading lxml-6.0.1-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (5.2 MB) [?25l [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/5.2 MB[0m [31m?[0m eta [36m-:--:--[0m
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m84.7 MB/s[0m [33m0:00:00[0m [?25h
Installing collected packages: lxml
Successfully installed lxml-6.0.1
You will need to do some data cleaning to get the county names to match up and to extract the area of each county. Then you will need to merge the two dataframes together.
County Area[3][12] 0 Adams County 339 sq mi (878 km2) 1 Allen County 657 sq mi (1,702 km2) 2 Bartholomew County 407 sq mi (1,054 km2) 3 Benton County 406 sq mi (1,052 km2) 4 Blackford County 165 sq mi (427 km2) .. ... ... 87 Washington County 514 sq mi (1,331 km2) 88 Wayne County 402 sq mi (1,041 km2) 89 Wells County 368 sq mi (953 km2) 90 White County 505 sq mi (1,308 km2) 91 Whitley County 336 sq mi (870 km2) [92 rows x 2 columns]
Solution
# cleaning county_info
county_info["County"] = county_info["County"].str.replace(" County", "").str.upper()
county_info["Area_sq_mi"] = pd.to_numeric(county_info["Area[3][12]"].str.extract(r'([0-9,]+)')[0])
# merge the two dataframes
storms_merged = pd.merge(storms_df, county_info, left_on="CZ_NAME", right_on="County")
Solution
# grouping by county and area
df = storms_merged.groupby(["CZ_NAME", "Area_sq_mi"])["EVENT_TYPE"].count().reset_index()
print(df.head())
# normalizing number of events by area
df["normalized"] = df["EVENT_TYPE"] / df["Area_sq_mi"]
# sorting by normalized value
df.sort_values("normalized", ascending=False)
CZ_NAME Area_sq_mi EVENT_TYPE 0 ADAMS 339 9 1 ALLEN 657 31 2 BARTHOLOMEW 407 7 3 BENTON 406 4 4 BLACKFORD 165 6
CZ_NAME Area_sq_mi EVENT_TYPE normalized 79 VANDERBURGH 233 34 0.145923 20 FLOYD 148 19 0.128378 46 MARION 396 41 0.103535 9 CLARK 373 37 0.099196 33 HUNTINGTON 383 37 0.096606 .. ... ... ... ... 10 CLAY 358 2 0.005587 26 GREENE 543 3 0.005525 7 CARROLL 372 2 0.005376 44 LAWRENCE 449 2 0.004454 74 SULLIVAN 447 1 0.002237[90 rows x 4 columns]
Question 2: What is the best time to visit Indiana if you want to take cool pictures of clouds? The main idea of this question is to summarize the data in such a way that you can tell which month or week or your choice of span of days has the highest concentration of events of interest.
Solution
# decide which weather events probably have cool clouds
cool_clouds = ["Thunderstorm wind", "Tornado", "Lightning", "Heavy Rain"]
# filter for cool clouds
df_clouds = storms_df[storms_df["EVENT_TYPE"].isin(cool_clouds)]
# group by month and number of events
df_clouds.groupby(df_clouds["BEGIN_DATE_TIME"].dt.month)["EVENT_TYPE"].count().reset_index()
BEGIN_DATE_TIME EVENT_TYPE 0 1 1 1 2 1 2 3 4 3 4 9 4 5 1 5 6 14 6 7 12 7 8 1 8 10 2 9 11 1 10 12 8
Question 3: Create a plot that compares the duration in hours of each common event type. Some events occur only infrequently, so your first step would be to filter those out. Then, you will need to decide what type of plot to make. Take a look at the seaborn gallery . If you are having trouble deciding how to represent your data, take a look at this infographic
Solution
# get event duration. We've already done this, but the code is reproduced before
storms_df["EVENT_DURATION"] = storms_df["END_DATE_TIME"] - storms_df["BEGIN_DATE_TIME"]
# filter out events which only occurred <10 times in the year
storms_filtered = storms_df.groupby("EVENT_TYPE").filter(lambda x: len(x) > 10)
# group by event type and calculate average duration
print(storms_filtered.groupby("EVENT_TYPE")["EVENT_DURATION"].mean())
storms_filtered.groupby("EVENT_TYPE")["EVENT_DURATION"].describe()
EVENT_TYPE Cold/Wind Chill 0 days 09:00:00 Dense Fog 0 days 07:27:00 Extreme Cold/Wind Chill 0 days 12:20:46.153846153 Flash Flood 0 days 02:13:29.256198347 Flood 4 days 01:25:38.880000 Hail 0 days 00:01:21.818181818 Heat 0 days 05:24:00 Heavy Rain 0 days 02:08:44.444444444 Heavy Snow 0 days 16:16:49.923664122 Thunderstorm Wind 0 days 00:00:41.984334203 Tornado 0 days 00:02:35 Winter Storm 0 days 16:39:05.454545454 Winter Weather 0 days 12:19:15.185185185 Name: EVENT_DURATION, dtype: timedelta64[ns]
count mean \ EVENT_TYPE Cold/Wind Chill 12 0 days 09:00:00 Dense Fog 40 0 days 07:27:00 Extreme Cold/Wind Chill 26 0 days 12:20:46.153846153 Flash Flood 121 0 days 02:13:29.256198347 Flood 125 4 days 01:25:38.880000 Hail 187 0 days 00:01:21.818181818 Heat 15 0 days 05:24:00 Heavy Rain 27 0 days 02:08:44.444444444 Heavy Snow 131 0 days 16:16:49.923664122 Thunderstorm Wind 383 0 days 00:00:41.984334203 Tornado 24 0 days 00:02:35 Winter Storm 33 0 days 16:39:05.454545454 Winter Weather 162 0 days 12:19:15.185185185std min \ EVENT_TYPE Cold/Wind Chill 0 days 02:05:20.154737286 0 days 07:00:00 Dense Fog 0 days 03:16:24.370352945 0 days 04:00:00 Extreme Cold/Wind Chill 0 days 04:00:33.883213306 0 days 03:00:00 Flash Flood 0 days 03:20:21.201559833 0 days 00:00:00 Flood 7 days 03:59:43.207680028 0 days 00:00:00 Hail 0 days 00:01:23.851565542 0 days 00:00:00 Heat 0 days 00:49:41.083216358 0 days 05:00:00 Heavy Rain 0 days 03:40:57.113426770 0 days 00:00:00 Heavy Snow 0 days 06:48:08.251044631 0 days 08:00:00 Thunderstorm Wind 0 days 00:01:27.395291215 0 days 00:00:00 Tornado 0 days 00:02:03.745882671 0 days 00:00:00 Winter Storm 0 days 05:00:36.440386566 0 days 11:00:00 Winter Weather 0 days 05:43:50.386658814 0 days 00:01:00
25% 50% 75% \ EVENT_TYPE Cold/Wind Chill 0 days 07:00:00 0 days 09:00:00 0 days 11:00:00 Dense Fog 0 days 05:00:00 0 days 05:00:00 0 days 10:00:00 Extreme Cold/Wind Chill 0 days 12:00:00 0 days 12:00:00 0 days 12:00:00 Flash Flood 0 days 00:00:00 0 days 01:51:00 0 days 03:30:00 Flood 0 days 01:30:00 0 days 02:18:00 3 days 19:00:00 Hail 0 days 00:00:00 0 days 00:02:00 0 days 00:02:00 Heat 0 days 05:00:00 0 days 05:00:00 0 days 05:00:00 Heavy Rain 0 days 00:00:00 0 days 00:00:00 0 days 03:00:00 Heavy Snow 0 days 11:37:30 0 days 15:00:00 0 days 15:00:00 Thunderstorm Wind 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 Tornado 0 days 00:01:00 0 days 00:02:00 0 days 00:03:15 Winter Storm 0 days 12:30:00 0 days 15:00:00 0 days 20:00:00 Winter Weather 0 days 07:15:00 0 days 11:00:00 0 days 17:00:00
max EVENT_TYPE Cold/Wind Chill 0 days 11:00:00 Dense Fog 0 days 13:00:00 Extreme Cold/Wind Chill 1 days 06:00:00 Flash Flood 1 days 03:35:00 Flood 29 days 09:00:00 Hail 0 days 00:10:00 Heat 0 days 07:00:00 Heavy Rain 0 days 12:15:00 Heavy Snow 1 days 10:00:00 Thunderstorm Wind 0 days 00:11:00 Tornado 0 days 00:08:00 Winter Storm 1 days 02:00:00 Winter Weather 1 days 06:00:00
Solution
# plot the distribution of event durations as a stripplot with points using seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
storms_no_flood = storms_filtered[~storms_filtered["EVENT_TYPE"].str.contains("Flood")]
# convert EVENT_DURATION to number of hours
storms_no_flood["EVENT_DURATION_HOURS"] = storms_no_flood["EVENT_DURATION"].dt.total_seconds() / 3600
fig, ax = plt.subplots()
ax = sns.stripplot(data=storms_no_flood, x="EVENT_DURATION_HOURS", y="EVENT_TYPE")
plt.show()
/tmp/ipykernel_2217/1940921173.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value insteadSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy storms_no_flood["EVENT_DURATION_HOURS"] = storms_no_flood["EVENT_DURATION"].dt.total_seconds() / 3600
Solution
# plot the distribution of event durations as a stripplot with points using seaborn
# separately plotting EVENT_TYPE = flood
fig, ax = plt.subplots()
storms_flood_only = storms_filtered[storms_filtered["EVENT_TYPE"].str.contains("Flood")]
# convert EVENT_DURATION to number of hours
storms_flood_only["EVENT_DURATION_HOURS"] = storms_flood_only["EVENT_DURATION"].dt.total_seconds() / 3600
ax = sns.stripplot(data=storms_flood_only, x="EVENT_DURATION_HOURS", y="EVENT_TYPE")
plt.show()
/tmp/ipykernel_2217/42002734.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value insteadSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy storms_flood_only["EVENT_DURATION_HOURS"] = storms_flood_only["EVENT_DURATION"].dt.total_seconds() / 3600