Working with Geospatial Datasets in Python
Geospatial plotting is a powerful technique for uncovering spatial patterns, identifying regional disparities, and communicating insights through interactive maps. It is an essential skill for data scientists working with location-based data.
This lesson introduces geospatial analysis and visualization using Python, focusing on the Chicago Divvy bike-sharing dataset as a case study.
This project was inspired by an MCDC practicum course taught in Winter 2022 by Dr. Arend Kuyper. In collaboration with faculty liaison Dr. Amanda Stathopoulos, the course partnered with the Chicago-based nonprofit organization Equiticity, represented by community partner Oboi Reed. Students worked in teams to deliver on Equiticity’s Request for Data Science Services through MCDC. Results from this and other MCDC student–community partnership projects are summarized on the MCDC Past Projects website. MCDC was funded by the U.S. National Science Foundation.
This lesson was developed by Dr. Lizhen Shi for MCDC pathway courses.
1 Guided Practice
1.1 Data Cleaning and Preparation
Real-world datasets often require cleaning before analysis. In this section, we prepare the Divvy bike-sharing dataset for geospatial plotting by addressing missing values, standardizing station coordinates, and removing invalid trips.
1.1.1 Import Libraries
Let’s start by importing the necessary Python libraries:
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
1.1.2 Load the Dataset
We’ll use trip data from May 2024, available on the Divvy website. This dataset contains records of individual bike trips, including geospatial coordinates.
Code
# Load the data
divvy_0524 = pd.read_csv('202405-divvy-tripdata.csv')
print(divvy_0524.shape)
divvy_0524.head()
(609493, 13)
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7D9F0CE9EC2A1297 | classic_bike | 2024-05-25 15:52:42 | 2024-05-25 16:11:50 | Streeter Dr & Grand Ave | 13022 | Clark St & Elm St | TA1307000039 | 41.892278 | -87.612043 | 41.902973 | -87.631280 | casual |
1 | 02EC47687411416F | classic_bike | 2024-05-14 15:11:51 | 2024-05-14 15:22:00 | Sheridan Rd & Greenleaf Ave | KA1504000159 | Sheridan Rd & Loyola Ave | RP-009 | 42.010587 | -87.662412 | 42.001044 | -87.661198 | casual |
2 | 101370FB2D3402BE | classic_bike | 2024-05-30 17:46:04 | 2024-05-30 18:09:16 | Streeter Dr & Grand Ave | 13022 | Wabash Ave & 9th St | TA1309000010 | 41.892278 | -87.612043 | 41.870769 | -87.625734 | member |
3 | E97E396331ED6913 | electric_bike | 2024-05-17 20:21:54 | 2024-05-17 20:40:32 | Streeter Dr & Grand Ave | 13022 | Sheffield Ave & Wellington Ave | TA1307000052 | 41.892270 | -87.611946 | 41.936253 | -87.652662 | member |
4 | 674EDE311C543165 | classic_bike | 2024-05-22 18:52:20 | 2024-05-22 18:59:04 | Larrabee St & Division St | KA1504000079 | Clark St & Elm St | TA1307000039 | 41.903486 | -87.643353 | 41.902973 | -87.631280 | casual |
The dataset has 609,493 rows and 13 columns, each representing a bike trip. Here’s a breakdown of the columns:
Field name | Data type | Description |
---|---|---|
ride_id | String | Unique ID for each ride |
rideable_type | String | Bike types |
started_at | Timestamp | Trip start day and time |
ended_at | Timestamp | Trip end day and time |
start_station_name | String | Trip start station |
start_station_id | String | ID of the start station |
end_station_name | String | Trip end station |
end_station_id | String | ID of the end station |
start_lat | Float | Latitude of the start station |
start_lng | Float | Longitude of the start station |
end_lat | Float | Latitude of the end station |
end_lng | Float | Longitude of the end station |
member_casual | String | Rider type |
The geospatial-related columns are:

- `start_lat` and `start_lng` represent the coordinates where a bike is picked up
- `end_lat` and `end_lng` represent the coordinates where a bike is returned
- `start_station_name` represents the trip start station
- `start_station_id` is the ID of the start station
- `end_station_name` represents the trip end station
- `end_station_id` is the ID of the end station
1.1.3 Handling Missing Values
Let’s check for missing values within the dataset.
Code
divvy_0524.isnull().sum()
ride_id 0
rideable_type 0
started_at 0
ended_at 0
start_station_name 109048
start_station_id 109048
end_station_name 112731
end_station_id 112731
start_lat 0
start_lng 0
end_lat 784
end_lng 784
member_casual 0
dtype: int64
There are 784 observations missing `end_lat` and `end_lng`. Let’s explore whether these observations have a station name listed so that we can use the station name to impute the missing coordinates.
Code
# among rows missing end_lat or end_lng, show those that do have an end_station_name
divvy_0524[(divvy_0524['end_station_name'].notnull()) & ((divvy_0524['end_lat'].isnull()) | (divvy_0524['end_lng'].isnull()))]
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
---|
The result is an empty DataFrame, indicating that rows missing `end_lat` and `end_lng` also lack `end_station_name` and `end_station_id`. Since imputation isn’t possible, we’ll drop these rows:
Code
# drop rows where end_station_name, end_lat, and end_lng are all null
divvy_0524 = divvy_0524[~((divvy_0524['end_station_name'].isnull()) & ((divvy_0524['end_lat'].isnull()) | (divvy_0524['end_lng'].isnull())))]

# recheck the null values
divvy_0524.isnull().sum()
ride_id 0
rideable_type 0
started_at 0
ended_at 0
start_station_name 109048
start_station_id 109048
end_station_name 111947
end_station_id 111947
start_lat 0
start_lng 0
end_lat 0
end_lng 0
member_casual 0
dtype: int64
We successfully dropped the instances where both `end_lat` and `end_lng` are missing. Next, let’s explore the mapping between station names and their coordinates for our missing-value imputation.
Code
locations = pd.concat([
    divvy_0524[['start_station_name', 'start_lat', 'start_lng']].drop_duplicates().rename(columns={'start_station_name': 'station_name', 'start_lat': 'lat', 'start_lng': 'lng'}),
    divvy_0524[['end_station_name', 'end_lat', 'end_lng']].drop_duplicates().rename(columns={'end_station_name': 'station_name', 'end_lat': 'lat', 'end_lng': 'lng'})
])
locations = locations.drop_duplicates()
print(locations.shape)
locations.head()
(195676, 3)
station_name | lat | lng | |
---|---|---|---|
0 | Streeter Dr & Grand Ave | 41.892278 | -87.612043 |
1 | Sheridan Rd & Greenleaf Ave | 42.010587 | -87.662412 |
3 | Streeter Dr & Grand Ave | 41.892270 | -87.611946 |
4 | Larrabee St & Division St | 41.903486 | -87.643353 |
5 | Sheridan Rd & Greenleaf Ave | 42.010571 | -87.662456 |
Code
# get the distinct station names
locations['station_name'].unique().size
1361
195,676 unique coordinate pairs correspond to 1,361 station names, an average of about 144 coordinates per station. Next, let’s find out which station has the most distinct coordinates.
Code
locations.groupby('station_name').size().sort_values(ascending=False).head(10)
station_name
Streeter Dr & Grand Ave 1831
Clinton St & Washington Blvd 1550
Michigan Ave & Oak St 1372
DuSable Lake Shore Dr & Monroe St 1307
Kingsbury St & Kinzie St 1304
Wells St & Concord Ln 1269
Clark St & Elm St 1264
Clinton St & Madison St 1223
Wells St & Elm St 1220
Sheffield Ave & Waveland Ave 1190
dtype: int64
1.1.4 Standardizing Station Coordinates
A Divvy station can have multiple recorded coordinates due to GPS variation. For geospatial plotting, we’ll use a single coordinate per station by calculating the median latitude and longitude:
Code
# Calculate median lat and lng for each station_name
median_locations = locations.groupby('station_name').agg({'lat': 'median', 'lng': 'median'}).reset_index()
median_locations.rename(columns={'lat': 'median_lat', 'lng': 'median_lng'}, inplace=True)
print(median_locations.shape)
median_locations.head()
(1360, 3)
station_name | median_lat | median_lng | |
---|---|---|---|
0 | 2112 W Peterson Ave | 41.991168 | -87.683617 |
1 | 21st St & Pulaski Rd | 41.853457 | -87.724529 |
2 | 63rd St Beach | 41.780958 | -87.576320 |
3 | 900 W Harrison St | 41.874755 | -87.649817 |
4 | Aberdeen St & Jackson Blvd | 41.877773 | -87.654831 |
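To see what this aggregation does, here is a toy example with made-up jittered GPS readings for a single station (not the real data):

```python
import pandas as pd

# Hypothetical jittered GPS readings for one station
readings = pd.DataFrame({
    'station_name': ['Streeter Dr & Grand Ave'] * 3,
    'lat': [41.892278, 41.892270, 41.892300],
    'lng': [-87.612043, -87.611946, -87.612100],
})

# Collapse to one representative coordinate per station
medians = readings.groupby('station_name').agg({'lat': 'median', 'lng': 'median'}).reset_index()
print(medians)
```

With an odd number of readings, the median is one of the actually observed coordinates, and unlike the mean it is robust to a single wild GPS outlier.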
If two coordinates are very close, they likely belong to the same Divvy station. For the remaining missing station names, let’s fill them in with the name of the nearest station based on coordinates.
Code
# define the nearest-neighbor method
from sklearn.neighbors import NearestNeighbors

def make_nn(df):
    # Ensure the DataFrame has the necessary columns
    if not all(col in df.columns for col in ['median_lat', 'median_lng']):
        raise ValueError("DataFrame must contain 'median_lat' and 'median_lng' columns")

    # Extract the coordinates from the DataFrame
    coordinates = df[['median_lat', 'median_lng']].values

    # Create and fit the NearestNeighbors model
    nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(coordinates)

    return nbrs

nbrs = make_nn(median_locations)

def find_nearest_station(nbrs, df, lat, lng):
    # Find the nearest neighbor to the given (lat, lng)
    distance, index = nbrs.kneighbors([[lat, lng]])

    # Get the nearest station's name
    nearest_station = df.iloc[index[0][0]]['station_name']

    return nearest_station
Code
# check whether the method can return the nearest station name
find_nearest_station(nbrs, median_locations, 41.892278, -87.612043)
'Streeter Dr & Grand Ave'
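One caveat: the ball tree above measures straight-line distance in raw degrees, which slightly distorts east-west distances at Chicago’s latitude. That is usually fine for picking the nearest of many closely spaced stations, but if true great-circle neighbors are wanted, scikit-learn’s `NearestNeighbors` also supports a haversine metric on coordinates given in radians. A sketch with two stations taken as an example (not part of the lesson’s pipeline):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# (lat, lng) pairs in degrees, converted to radians for the haversine metric
stations = np.radians([[41.892276, -87.612041],   # Streeter Dr & Grand Ave
                       [41.902838, -87.631606]])  # Clark St & Elm St
nn = NearestNeighbors(n_neighbors=1, metric='haversine').fit(stations)

# Query a point just next to the first station (also in radians)
dist_rad, idx = nn.kneighbors(np.radians([[41.8923, -87.6121]]))
nearest_index = int(idx[0][0])
dist_miles = dist_rad[0][0] * 3958.8  # haversine distances are in radians; scale by Earth's radius in miles
print(nearest_index, round(dist_miles, 4))
```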
Code
divvy_0524 = divvy_0524.reset_index(drop=True)
Code
# use the nearest-neighbor method to impute the missing values in `start_station_name` and `end_station_name`
from tqdm import tqdm

for i in tqdm(range(len(divvy_0524))):
    # impute start_station_name
    if pd.isnull(divvy_0524.loc[i, 'start_station_name']):
        divvy_0524.loc[i, 'start_station_name'] = find_nearest_station(nbrs, median_locations, divvy_0524.loc[i, 'start_lat'], divvy_0524.loc[i, 'start_lng'])
    # impute end_station_name
    if pd.isnull(divvy_0524.loc[i, 'end_station_name']):
        divvy_0524.loc[i, 'end_station_name'] = find_nearest_station(nbrs, median_locations, divvy_0524.loc[i, 'end_lat'], divvy_0524.loc[i, 'end_lng'])
100%|██████████| 608709/608709 [05:57<00:00, 1701.02it/s]
The `assert` statement in Python is a debugging aid that tests a condition. If the condition is True, the program continues to execute normally; if it is False, an `AssertionError` is raised and the program stops. `assert` is commonly used to check assumptions made by the program and to catch bugs early in development.
In this case, after the imputation, there should be no missing values in the `start_station_name` and `end_station_name` columns.
Code
# use assert to check the imputation results
assert divvy_0524['start_station_name'].isnull().sum() == 0
assert divvy_0524['end_station_name'].isnull().sum() == 0
Let’s merge `median_lat` and `median_lng` into the Divvy dataset for plotting:
Code
divvy_0524 = pd.merge(divvy_0524, median_locations, left_on='start_station_name', right_on='station_name', how='left')
divvy_0524 = divvy_0524.drop(columns=['station_name'])
divvy_0524.rename(columns={'median_lat': 'start_station_median_lat', 'median_lng': 'start_station_median_lng'}, inplace=True)

divvy_0524 = pd.merge(divvy_0524, median_locations, left_on='end_station_name', right_on='station_name', how='left')
divvy_0524 = divvy_0524.drop(columns=['station_name'])
divvy_0524.rename(columns={'median_lat': 'end_station_median_lat', 'median_lng': 'end_station_median_lng'}, inplace=True)
1.1.5 Data Cleaning: Trip Distance Exploration
Trip distance is a critical metric, even though it isn’t an original feature in the dataset. Fortunately, we have the start and end coordinates for each trip, allowing us to calculate distances based on them.
Analyzing trip distances in the Divvy dataset is essential for several reasons, including understanding user behavior, assessing system efficiency, and evaluating the overall performance of the bike-sharing service.
To calculate the trip distance using the start and end coordinates, we can employ the Haversine formula, which determines the great-circle distance between two points on the Earth’s surface based on their latitude and longitude. Alternatively, we can use the geopy library in Python, which provides a straightforward interface for geospatial calculations.
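For reference, writing the two points as $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ in radians, the Haversine formula is:

$$
a = \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
c = 2\,\operatorname{atan2}\left(\sqrt{a},\, \sqrt{1-a}\right),
\qquad
d = R\,c
$$

where $R$ is the Earth's radius (about 6371 km, or 3959 miles).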
Here is an implementation of the Haversine formula in Python:
Code
def haversine(lat1, lon1, lat2, lon2):
    # Convert degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

    # Compute differences
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    # Radius of Earth in kilometers and conversion to miles
    R_km = 6371.0
    R_mi = R_km * 0.621371

    # Compute distance
    distance = R_mi * c

    return distance

divvy_0524['distance'] = haversine(divvy_0524['start_lat'], divvy_0524['start_lng'], divvy_0524['end_lat'], divvy_0524['end_lng'])
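As a quick sanity check of the formula, here is a scalar version using only the standard library, applied to the coordinates of the first trip in the dataset (Streeter Dr & Grand Ave to Clark St & Elm St):

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Scalar version of the same formula, using the math module
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    a = math.sin((lat2 - lat1) / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return 6371.0 * 0.621371 * c  # Earth radius in km, converted to miles

d = haversine_miles(41.892278, -87.612043, 41.902973, -87.631280)
print(round(d, 2))  # about 1.23 miles
```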
Code
# plot the histogram to check the distribution
divvy_0524['distance'].hist(bins=100)
Code
# check the statistics of the distance
divvy_0524['distance'].describe()
count 608709.000000
mean 1.386598
std 1.267535
min 0.000000
25% 0.556175
50% 1.020599
75% 1.825678
max 20.840069
Name: distance, dtype: float64
Code
# bin the distance into 4 groups: 0-0.99, 1-1.99, 2-4.99, 5-max
divvy_0524['distance_bins'] = pd.cut(divvy_0524['distance'], bins=[0, 0.99, 1.99, 4.99, divvy_0524['distance'].max()], labels=['0-0.99', '1-1.99', '2-4.99', '5-max'])

# plot the distance_bins with a title
divvy_0524['distance_bins'].value_counts().plot(kind='bar', title='Distance Bins')
Code
# check the percentage of value counts of the distance_bins
divvy_0524['distance_bins'].value_counts(normalize=True)
distance_bins
0-0.99 0.452076
1-1.99 0.311665
2-4.99 0.213208
5-max 0.023052
Name: proportion, dtype: float64
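One subtlety worth noting: by default `pd.cut` excludes the left edge of the first bin, so trips with a distance of exactly 0 receive a `NaN` bin label. A minimal illustration:

```python
import pandas as pd

d = pd.Series([0.0, 0.5, 1.0, 5.0])
bins = pd.cut(d, bins=[0, 0.99, 1.99, 4.99, 21.0],
              labels=['0-0.99', '1-1.99', '2-4.99', '5-max'])
print(bins.isna().tolist())  # the 0.0 entry falls outside every bin
```

Passing `include_lowest=True` would pull the exact-zero trips into the first bin; here they are left out, and the zero-distance trips are examined separately next.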
Next, let’s determine the percentage of trips that start and end at the same station, indicating a distance of zero. These trips can fall into two categories:
- Valid trips: Trips where bikes are picked up and returned to the same station.
- Invalid trips: These are likely noise in the data, such as when someone unlocks a bike but then decides not to rent it. In such scenarios, the trip duration is also very short.
For the second case, we need to remove the noise before we proceed with our analysis.
Code
# show trips where the distance equals 0
divvy_0524[divvy_0524['distance'] == 0]
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | start_station_median_lat | start_station_median_lng | end_station_median_lat | end_station_median_lng | distance | distance_bins | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 95CA09985B75FA22 | classic_bike | 2024-05-27 18:14:19 | 2024-05-27 19:09:30 | McClurg Ct & Ohio St | TA1306000029 | McClurg Ct & Ohio St | TA1306000029 | 41.892592 | -87.617289 | 41.892592 | -87.617289 | member | 41.892658 | -87.617321 | 41.892658 | -87.617321 | 0.0 | NaN |
13 | A08FB7FA82FCD72E | classic_bike | 2024-05-09 17:38:03 | 2024-05-09 17:41:37 | Indiana Ave & 26th St | TA1307000005 | Indiana Ave & 26th St | TA1307000005 | 41.845687 | -87.622481 | 41.845687 | -87.622481 | member | 41.845708 | -87.622545 | 41.845708 | -87.622545 | 0.0 | NaN |
14 | 4E578174D37FBFE5 | classic_bike | 2024-05-09 17:42:22 | 2024-05-09 17:42:49 | Indiana Ave & 26th St | TA1307000005 | Indiana Ave & 26th St | TA1307000005 | 41.845687 | -87.622481 | 41.845687 | -87.622481 | member | 41.845708 | -87.622545 | 41.845708 | -87.622545 | 0.0 | NaN |
15 | 80903806220B461B | classic_bike | 2024-05-28 19:42:27 | 2024-05-28 20:22:45 | McClurg Ct & Ohio St | TA1306000029 | McClurg Ct & Ohio St | TA1306000029 | 41.892592 | -87.617289 | 41.892592 | -87.617289 | member | 41.892658 | -87.617321 | 41.892658 | -87.617321 | 0.0 | NaN |
16 | 855A39B703AEEF8D | classic_bike | 2024-05-31 12:02:35 | 2024-05-31 12:02:58 | Stetson Ave & South Water St | TA1308000029 | Stetson Ave & South Water St | TA1308000029 | 41.886835 | -87.622320 | 41.886835 | -87.622320 | member | 41.886770 | -87.622422 | 41.886770 | -87.622422 | 0.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
608122 | 8BFD4D2FF581C242 | electric_bike | 2024-05-21 19:36:07 | 2024-05-21 19:41:15 | University Ave & 57th St | NaN | University Ave & 57th St | NaN | 41.790000 | -87.600000 | 41.790000 | -87.600000 | casual | 41.791475 | -87.599920 | 41.791475 | -87.599920 | 0.0 | NaN |
608123 | 764B4FDBDFCBC442 | electric_bike | 2024-05-22 14:34:41 | 2024-05-22 14:56:53 | DuSable Lake Shore Dr & Belmont Ave | NaN | DuSable Lake Shore Dr & Belmont Ave | NaN | 41.940000 | -87.640000 | 41.940000 | -87.640000 | casual | 41.940740 | -87.639154 | 41.940740 | -87.639154 | 0.0 | NaN |
608134 | 5D055CB420A4FE10 | classic_bike | 2024-05-21 22:10:52 | 2024-05-21 23:11:42 | Avenue O & 134th St | 20214 | Avenue O & 134th St | 20214 | 41.651799 | -87.540696 | 41.651799 | -87.540696 | casual | 41.651881 | -87.539622 | 41.651881 | -87.539622 | 0.0 | NaN |
608140 | AF072EB6C20AD421 | electric_bike | 2024-05-21 19:02:16 | 2024-05-21 19:02:25 | Dearborn Pkwy & Delaware Pl | NaN | Dearborn Pkwy & Delaware Pl | NaN | 41.900000 | -87.630000 | 41.900000 | -87.630000 | casual | 41.898946 | -87.629925 | 41.898946 | -87.629925 | 0.0 | NaN |
608145 | 87452852EC07FE5F | electric_bike | 2024-05-22 18:01:00 | 2024-05-22 18:01:44 | DuSable Lake Shore Dr & Monroe St | NaN | DuSable Lake Shore Dr & Monroe St | NaN | 41.880000 | -87.620000 | 41.880000 | -87.620000 | casual | 41.881059 | -87.616777 | 41.881059 | -87.616777 | 0.0 | NaN |
36084 rows × 19 columns
Among these trips, we will identify those with a duration of one minute or less and a trip distance of 0. Let’s filter out these anomalies.
Code
# convert the start and end times to datetime
divvy_0524['started_at'] = pd.to_datetime(divvy_0524['started_at'])
divvy_0524['ended_at'] = pd.to_datetime(divvy_0524['ended_at'])

# create a new column: duration in minutes
divvy_0524['duration'] = (divvy_0524['ended_at'] - divvy_0524['started_at']).dt.total_seconds() / 60
Code
# extract the trips whose duration is less than or equal to 1 minute and distance is 0
divvy_0524[(divvy_0524['duration'] <= 1) & (divvy_0524['distance'] == 0)]
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | start_station_median_lat | start_station_median_lng | end_station_median_lat | end_station_median_lng | distance | distance_bins | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | 4E578174D37FBFE5 | classic_bike | 2024-05-09 17:42:22 | 2024-05-09 17:42:49 | Indiana Ave & 26th St | TA1307000005 | Indiana Ave & 26th St | TA1307000005 | 41.845687 | -87.622481 | 41.845687 | -87.622481 | member | 41.845708 | -87.622545 | 41.845708 | -87.622545 | 0.0 | NaN | 0.450000 |
16 | 855A39B703AEEF8D | classic_bike | 2024-05-31 12:02:35 | 2024-05-31 12:02:58 | Stetson Ave & South Water St | TA1308000029 | Stetson Ave & South Water St | TA1308000029 | 41.886835 | -87.622320 | 41.886835 | -87.622320 | member | 41.886770 | -87.622422 | 41.886770 | -87.622422 | 0.0 | NaN | 0.383333 |
262 | 9F76CEEAA6B34C0B | classic_bike | 2024-05-12 16:04:57 | 2024-05-12 16:05:15 | Streeter Dr & Grand Ave | 13022 | Streeter Dr & Grand Ave | 13022 | 41.892278 | -87.612043 | 41.892278 | -87.612043 | member | 41.892276 | -87.612041 | 41.892276 | -87.612041 | 0.0 | NaN | 0.300000 |
631 | 32B4023E9923173E | classic_bike | 2024-05-18 22:37:50 | 2024-05-18 22:38:16 | Ashland Ave & Blackhawk St | 13224 | Ashland Ave & Blackhawk St | 13224 | 41.907066 | -87.667252 | 41.907066 | -87.667252 | member | 41.907060 | -87.667226 | 41.907060 | -87.667226 | 0.0 | NaN | 0.433333 |
632 | 408786CFCC0E004F | classic_bike | 2024-05-23 13:39:18 | 2024-05-23 13:39:52 | Cornell Ave & Hyde Park Blvd | KA1503000007 | Cornell Ave & Hyde Park Blvd | KA1503000007 | 41.802406 | -87.586924 | 41.802406 | -87.586924 | member | 41.802421 | -87.586920 | 41.802421 | -87.586920 | 0.0 | NaN | 0.566667 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
608112 | 35F4266BF2D0AB22 | electric_bike | 2024-05-22 18:12:38 | 2024-05-22 18:12:40 | DuSable Lake Shore Dr & Monroe St | NaN | DuSable Lake Shore Dr & Monroe St | NaN | 41.880000 | -87.620000 | 41.880000 | -87.620000 | casual | 41.881059 | -87.616777 | 41.881059 | -87.616777 | 0.0 | NaN | 0.033333 |
608117 | DF9291196964FB93 | electric_bike | 2024-05-22 07:35:36 | 2024-05-22 07:35:48 | Green St & Madison St | NaN | Green St & Madison St | NaN | 41.880000 | -87.650000 | 41.880000 | -87.650000 | casual | 41.881861 | -87.649189 | 41.881861 | -87.649189 | 0.0 | NaN | 0.200000 |
608120 | 2D857C9FCF2434CF | electric_bike | 2024-05-21 19:40:39 | 2024-05-21 19:41:23 | Sedgwick St & Webster Ave | NaN | Sedgwick St & Webster Ave | NaN | 41.920000 | -87.640000 | 41.920000 | -87.640000 | casual | 41.922176 | -87.638982 | 41.922176 | -87.638982 | 0.0 | NaN | 0.733333 |
608140 | AF072EB6C20AD421 | electric_bike | 2024-05-21 19:02:16 | 2024-05-21 19:02:25 | Dearborn Pkwy & Delaware Pl | NaN | Dearborn Pkwy & Delaware Pl | NaN | 41.900000 | -87.630000 | 41.900000 | -87.630000 | casual | 41.898946 | -87.629925 | 41.898946 | -87.629925 | 0.0 | NaN | 0.150000 |
608145 | 87452852EC07FE5F | electric_bike | 2024-05-22 18:01:00 | 2024-05-22 18:01:44 | DuSable Lake Shore Dr & Monroe St | NaN | DuSable Lake Shore Dr & Monroe St | NaN | 41.880000 | -87.620000 | 41.880000 | -87.620000 | casual | 41.881059 | -87.616777 | 41.881059 | -87.616777 | 0.0 | NaN | 0.733333 |
10339 rows × 20 columns
Let’s remove them from the Divvy dataset:
Code
# drop the trips whose duration is less than or equal to 1 minute and distance is 0
divvy_0524 = divvy_0524[~((divvy_0524['duration'] <= 1) & (divvy_0524['distance'] == 0))]
divvy_0524.shape
(598370, 20)
Code
# check the percentage of valid trips that pick up and drop off at the same station
divvy_0524[divvy_0524['start_station_name'] == divvy_0524['end_station_name']].shape[0] / divvy_0524.shape[0]
0.06118956498487558
It turns out that only 6% of the trips start and end at the same station.
Next, let’s examine the overall distribution of trip distances again.
Code
# check the overall distance distribution after cleaning
divvy_0524['distance'].describe()
count 598370.000000
mean 1.410557
std 1.265152
min 0.000000
25% 0.582939
50% 1.032578
75% 1.849402
max 20.840069
Name: distance, dtype: float64
Code
# plot the distance distribution of trips that pick up and drop off at different stations
divvy_0524[divvy_0524['start_station_name'] != divvy_0524['end_station_name']]['distance'].hist(bins=100)
As can be seen from the histogram, the majority of trips are short, with distances of less than 5 miles.
Let’s save our cleaned data to a CSV file for the geospatial plotting ahead.
Code
divvy_0524.to_csv('202405-divvy-tripdata-cleaned.csv', index=False)
1.2 Geospatial Plotting
1.2.1 Python Packages for Geospatial Plotting
There are several widely used Python packages designed for geospatial plotting. In this lesson, we focus on two popular libraries:
- GeoPandas – Static Plotting
  - Extends the functionality of the pandas library to support spatial data types and geospatial operations.
  - Allows for easy reading, writing, and manipulation of spatial data, including shapefiles and GeoJSON formats.
  - Primarily used for data wrangling and creating basic static maps using `matplotlib`.
  - Integrates seamlessly with `matplotlib`, handling both tabular and vector-based geospatial data.
- Folium – Interactive Plotting
  - A powerful library for building interactive maps, built on top of the `Leaflet.js` framework.
  - Supports various map tiles and overlays, including GeoJSON layers for rendering vector data.
  - Ideal for creating web-based, zoomable, and clickable maps.
1.2.2 Static Plots with GeoPandas
A shapefile is a widely used format for storing the geometric location and attribute information of geographic features. Commonly used in GIS (Geographic Information Systems), shapefiles enable spatial data analysis and visualization.
1.2.2.1 📁 Components of a Shapefile
A shapefile actually consists of multiple files that share the same base name but have different extensions. Key components include:
- `.shp`: Stores the geometry (points, lines, or polygons)
- `.shx`: Index file for the geometry data
- `.dbf`: Stores associated attribute data in tabular form (similar to a spreadsheet)
- `.prj`: Contains coordinate system and projection information
1.2.2.2 🗺️ Geometry Types in Shapefiles
Shapefiles can represent different types of geographic features:
- Points: e.g., Divvy stations, landmarks
- Lines: e.g., roads, bike lanes
- Polygons: e.g., neighborhood boundaries, parks
Each geometric feature can be linked to attribute data, such as names, types, or demographic values.
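These geometry types correspond directly to objects in the `shapely` library, which GeoPandas builds on. A small sketch with made-up coordinates (note that shapely takes longitude first):

```python
from shapely.geometry import Point, LineString, Polygon

station = Point(-87.65, 41.90)                                # a point feature
bike_lane = LineString([(-87.66, 41.89), (-87.64, 41.91)])    # a line feature
community = Polygon([(-87.70, 41.85), (-87.60, 41.85),
                     (-87.60, 41.95), (-87.70, 41.95)])       # a polygon feature

print(community.contains(station))  # point-in-polygon test -> True
```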
The `.prj` file is especially important for ensuring proper alignment of spatial features, since it defines the coordinate reference system.
1.2.3 Loading a Shapefile
We will use the Chicago community area shapefile to map the locations where Divvy stations are situated. This provides the foundation for joining spatial data with community-level attributes.
📍 The Chicago Community Area shapefile can be downloaded from the Chicago Data Portal.
Code
import geopandas as gpd

# Create figure and axis
fig, ax = plt.subplots(figsize=(15, 10))

# Read and plot the GeoDataFrame
chicago = gpd.read_file('chicago_boundaries/geo_export_26bce2f2-c163-42a9-9329-9ca6e082c5e9.shp')
chicago.plot(column='community', ax=ax, legend=True, legend_kwds={'ncol': 2, 'bbox_to_anchor': (2, 1)})

# Add title (optional)
plt.title('Chicago Community Areas')

# Show the plot
plt.show()
1.2.4 Adding the Divvy Stations to the Plot
In this step, we’ll create a scatter plot to visualize the Divvy stations, where each point represents a station’s location.
We’ll use the median latitude and longitude of each station for positioning. Once the scatter plot is created, we’ll overlay it on top of the Chicago community area map to provide geographic context.
Code
# read the cleaned csv file into a pandas DataFrame
data = pd.read_csv('202405-divvy-tripdata-cleaned.csv')

# keep one row per station: name plus median coordinates
start_stations = data[['start_station_name', 'start_station_median_lat', 'start_station_median_lng']].drop_duplicates()

start_stations.head()
start_station_name | start_station_median_lat | start_station_median_lng | |
---|---|---|---|
0 | Streeter Dr & Grand Ave | 41.892276 | -87.612041 |
1 | Sheridan Rd & Greenleaf Ave | 42.010592 | -87.662428 |
4 | Larrabee St & Division St | 41.903535 | -87.643344 |
6 | Damen Ave & Wellington Ave | 41.935895 | -87.678456 |
7 | Aberdeen St & Monroe St | 41.880372 | -87.655610 |
Code
import geopandas as gpd

# Adding the stations to the plot
fig, ax = plt.subplots(figsize=(15, 10))

chicago = gpd.read_file('chicago_boundaries/geo_export_26bce2f2-c163-42a9-9329-9ca6e082c5e9.shp')
chicago.plot(column='community', ax=ax, legend=True, legend_kwds={'ncol': 2, 'bbox_to_anchor': (2, 1)})

plt.scatter(start_stations['start_station_median_lng'], start_stations['start_station_median_lat'], s=10, alpha=0.5, color='black', marker='o')

# Add title (optional)
plt.title('Chicago Community Areas')

# Show the plot
plt.show()
Let’s switch to a different Chicago boundary file, available through the `geodatasets` package:
Code
import geodatasets

chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

# Plot the data
fig, ax = plt.subplots(figsize=(15, 10))
chicago.boundary.plot(ax=ax)
plt.scatter(start_stations['start_station_median_lng'], start_stations['start_station_median_lat'], s=10, alpha=0.5, color='black', marker='o')
plt.title('Chicago Community Areas')
Text(0.5, 1.0, 'Chicago Community Areas')
There are 14 stations located outside the community map. According to the Divvy website, the system covers both the Chicago area and Evanston. Based on the map, it appears that these out-of-map stations are situated in Evanston.
Next, we will add the community names to the Divvy trip dataset and verify this information.
Code
from shapely.geometry import Point

def find_community(gdf, lat, long):
    """
    Given a GeoDataFrame with a geometry column of polygons and a community column,
    find the community a point (lat, long) belongs to.

    Parameters:
    - gdf (GeoDataFrame): The GeoDataFrame with polygons and a community column.
    - lat (float): The latitude of the point.
    - long (float): The longitude of the point.

    Returns:
    - str: The community the point belongs to, or None if no community is found.
    """
    point = Point(long, lat)  # Note that Point takes (longitude, latitude)

    for _, row in gdf.iterrows():
        if row['geometry'].contains(point):
            return row['community']
    return None
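The loop above checks every polygon for every point, which is fine for roughly 1,400 stations but slow at scale. A vectorized alternative is a spatial join via `gpd.sjoin`, sketched here on toy polygons rather than the Chicago data (newer GeoPandas versions take a `predicate` keyword; older ones used `op`):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two toy "community" polygons
communities = gpd.GeoDataFrame(
    {'community': ['A', 'B']},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
    crs='EPSG:4326')

# Two toy station points
stations = gpd.GeoDataFrame(
    {'name': ['s1', 's2']},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5)],
    crs='EPSG:4326')

# Attach each station to the polygon that contains it
joined = gpd.sjoin(stations, communities, how='left', predicate='within')
print(joined['community'].tolist())
```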
Code
start_stations['community'] = start_stations.apply(lambda row: find_community(chicago, row['start_station_median_lat'], row['start_station_median_lng']), axis=1)
start_stations.head()
start_station_name | start_station_median_lat | start_station_median_lng | community | |
---|---|---|---|---|
0 | Streeter Dr & Grand Ave | 41.892276 | -87.612041 | NEAR NORTH SIDE |
1 | Sheridan Rd & Greenleaf Ave | 42.010592 | -87.662428 | ROGERS PARK |
4 | Larrabee St & Division St | 41.903535 | -87.643344 | NEAR NORTH SIDE |
6 | Damen Ave & Wellington Ave | 41.935895 | -87.678456 | NORTH CENTER |
7 | Aberdeen St & Monroe St | 41.880372 | -87.655610 | NEAR WEST SIDE |
Code
start_stations.isnull().sum()
start_station_name 0
start_station_median_lat 0
start_station_median_lng 0
community 14
dtype: int64
Based on the Divvy data description and geospatial plotting, the 14 stations missing community information belong to Evanston. Let’s proceed with imputing the community data for Evanston.
Code
# fill the missing values in the column 'community' with 'Evanston'
start_stations['community'] = start_stations['community'].fillna('Evanston')

start_stations.isnull().sum()
start_station_name 0
start_station_median_lat 0
start_station_median_lng 0
community 0
dtype: int64
Code
# save the start_stations to a csv file
start_stations.to_csv('start_stations_with_community.csv', index=False)
Once we add the community names to the Divvy dataset, let’s explore which community has the most trips associated with it.
Code
# add the community to the divvy set
data = data.merge(start_stations[['start_station_name', 'community']], on='start_station_name', how='left')

data.head()
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | ... | end_lng | member_casual | start_station_median_lat | start_station_median_lng | end_station_median_lat | end_station_median_lng | distance | distance_bins | duration | community | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7D9F0CE9EC2A1297 | classic_bike | 2024-05-25 15:52:42 | 2024-05-25 16:11:50 | Streeter Dr & Grand Ave | 13022 | Clark St & Elm St | TA1307000039 | 41.892278 | -87.612043 | ... | -87.631280 | casual | 41.892276 | -87.612041 | 41.902838 | -87.631606 | 1.234844 | 1-1.99 | 19.133333 | NEAR NORTH SIDE |
1 | 02EC47687411416F | classic_bike | 2024-05-14 15:11:51 | 2024-05-14 15:22:00 | Sheridan Rd & Greenleaf Ave | KA1504000159 | Sheridan Rd & Loyola Ave | RP-009 | 42.010587 | -87.662412 | ... | -87.661198 | casual | 42.010592 | -87.662428 | 42.001143 | -87.661277 | 0.662281 | 0-0.99 | 10.150000 | ROGERS PARK |
2 | 101370FB2D3402BE | classic_bike | 2024-05-30 17:46:04 | 2024-05-30 18:09:16 | Streeter Dr & Grand Ave | 13022 | Wabash Ave & 9th St | TA1309000010 | 41.892278 | -87.612043 | ... | -87.625734 | member | 41.892276 | -87.612041 | 41.870736 | -87.625740 | 1.644567 | 1-1.99 | 23.200000 | NEAR NORTH SIDE |
3 | E97E396331ED6913 | electric_bike | 2024-05-17 20:21:54 | 2024-05-17 20:40:32 | Streeter Dr & Grand Ave | 13022 | Sheffield Ave & Wellington Ave | TA1307000052 | 41.892270 | -87.611946 | ... | -87.652662 | member | 41.892276 | -87.612041 | 41.936300 | -87.652654 | 3.690219 | 2-4.99 | 18.633333 | NEAR NORTH SIDE |
4 | 674EDE311C543165 | classic_bike | 2024-05-22 18:52:20 | 2024-05-22 18:59:04 | Larrabee St & Division St | KA1504000079 | Clark St & Elm St | TA1307000039 | 41.903486 | -87.643353 | ... | -87.631280 | casual | 41.903535 | -87.643344 | 41.902838 | -87.631606 | 0.621883 | 0-0.99 | 6.733333 | NEAR NORTH SIDE |
5 rows × 21 columns
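The left merge above keeps every trip row and fills in the community wherever the station matches. A toy sketch (with hypothetical station names, not real Divvy data) illustrates the behavior:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables being merged
trips = pd.DataFrame({'start_station_name': ['A', 'B', 'A', 'Z']})
stations = pd.DataFrame({'start_station_name': ['A', 'B'],
                         'community': ['LOOP', 'ROGERS PARK']})

merged = trips.merge(stations, on='start_station_name', how='left')

# All 4 trip rows survive; station 'Z' has no match, so its community is NaN
print(merged)
```

This is why the earlier `isnull().sum()` check mattered: any station missing a community would propagate NaN into every trip that starts there.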
Code
community_trip_counts = data['community'].value_counts().reset_index()
community_trip_counts.columns = ['community', 'trip_count']
community_trip_counts['log_trip_count'] = community_trip_counts['trip_count'].apply(lambda x: 0 if x == 0 else np.log(x))
community_trip_counts
community | trip_count | log_trip_count | |
---|---|---|---|
0 | NEAR NORTH SIDE | 109137 | 11.600359 |
1 | LINCOLN PARK | 77608 | 11.259426 |
2 | LOOP | 66632 | 11.106940 |
3 | LAKE VIEW | 64769 | 11.078582 |
4 | NEAR WEST SIDE | 61773 | 11.031222 |
... | ... | ... | ... |
72 | AVALON PARK | 31 | 3.433987 |
73 | EDISON PARK | 26 | 3.258097 |
74 | CALUMET HEIGHTS | 24 | 3.178054 |
75 | BURNSIDE | 8 | 2.079442 |
76 | RIVERDALE | 4 | 1.386294 |
77 rows × 3 columns
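The trip counts are heavily skewed (from over 100,000 down to 4), so the natural-log column compresses them into a narrow range that is easier to shade on a choropleth. The table values are simply ln(trip_count), which we can verify directly:

```python
import math

# ln of the largest and smallest trip counts from the table above
assert abs(math.log(109137) - 11.600359) < 1e-5  # NEAR NORTH SIDE
assert abs(math.log(4) - 1.386294) < 1e-5        # RIVERDALE
print("log_trip_count is the natural log of trip_count")
```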
Next, let’s explore which community has the most stations.
Code
community_station_counts = data[['start_station_name', 'community']].drop_duplicates()['community'].value_counts().reset_index()
community_station_counts.columns = ['community', 'station_count']
community_station_counts.head()
community | station_count | |
---|---|---|
0 | NEAR WEST SIDE | 64 |
1 | NEAR NORTH SIDE | 55 |
2 | WEST TOWN | 47 |
3 | AUSTIN | 42 |
4 | LOOP | 40 |
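Because each station appears on one row per trip, the count above first deduplicates station–community pairs so that every station is counted once. A toy example of the pattern (hypothetical stations):

```python
import pandas as pd

# Each station appears once per trip; deduplicate before counting stations
df = pd.DataFrame({'start_station_name': ['A', 'A', 'B', 'C', 'C'],
                   'community': ['LOOP', 'LOOP', 'LOOP', 'AUSTIN', 'AUSTIN']})

stations_per_community = (df[['start_station_name', 'community']]
                          .drop_duplicates()['community']
                          .value_counts())
print(stations_per_community)  # LOOP has 2 stations (A, B); AUSTIN has 1 (C)
```

Without `drop_duplicates()`, the result would count trips rather than stations.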
1.2.5 Interactive Plotting with Folium
In addition to static plotting, geopandas supports interactive geospatial visualizations through the folium library. The explore() method in geopandas provides a simple interface for creating interactive maps, closely mirroring the API of the standard plot() method used for static plotting. You can call explore() directly on a GeoSeries or GeoDataFrame to generate interactive maps powered by folium.
Key Features of explore()

Here are some key capabilities of the explore() method:

- Interactive Map Display
  - Calling explore() on a GeoDataFrame opens an interactive map directly within a Jupyter notebook.
  - Users can pan, zoom, and explore the spatial features (points, lines, or polygons) on the map.
- Layer Control
  - Geometries are automatically added as layers, each styled based on its type.
  - You can view and toggle different layers for enhanced exploration.
- Tooltips and Attribute Popups
  - Hovering over a feature reveals tooltip information, typically drawn from attribute columns in the data.
  - This makes it easy to inspect the properties of individual geographic features.
- Search and Filter
  - Built-in search and filter tools allow users to explore specific attribute values or subsets of data directly on the map.
- Customization Options
  - The explore() method supports various customization options (e.g., color, column selection, tooltip formatting), enabling you to tailor the map for your audience or use case.
This method is particularly useful for creating web-friendly, interactive maps with minimal code—ideal for exploratory data analysis, public dashboards, or teaching materials.
Code
import folium
Code
# use the geopandas explore default settings
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))
chicago.explore()
Code
# Customize the explore() settings
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

m = chicago.explore(
    column="POP2010",                  # make choropleth based on "POP2010" column
    scheme="naturalbreaks",            # use mapclassify's natural breaks scheme
    legend=True,                       # show legend
    k=10,                              # use 10 bins
    tooltip=False,                     # hide tooltip
    popup=["POP2010", "POP2000"],      # show popup (on-click)
    legend_kwds=dict(colorbar=False),  # do not use colorbar
    name="chicago",                    # name of the layer in the map
)
m
The explore() method returns a folium.Map object, which can be passed to subsequent explore() calls via the m argument (much as you pass ax to plot()). You can then use folium functionality directly on the resulting map. Next, let’s add the Divvy stations to the plot.
Code
type(m)
folium.folium.Map
1.2.5.1 Adding the Divvy Stations to the Interactive folium.Map
Code
# Helper function for building each station's HTML description
def row_to_html(row):
    row_df = pd.DataFrame(row).T
    row_df.columns = [col.capitalize() for col in row_df.columns]
    return row_df.to_html(index=False)
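Applied to a single row, the helper renders a one-row HTML table with capitalized column headers, which folium can display inside a popup. A quick check on a hypothetical row:

```python
import pandas as pd

def row_to_html(row):
    row_df = pd.DataFrame(row).T
    row_df.columns = [col.capitalize() for col in row_df.columns]
    return row_df.to_html(index=False)

row = pd.Series({'title': 'Streeter Dr & Grand Ave', 'count': 9282})
html = row_to_html(row)

# The result is an HTML <table> whose headers are 'Title' and 'Count'
assert html.startswith('<table')
assert 'Title' in html and 'Count' in html
```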
Code
data[['start_station_name', 'start_station_median_lat', 'start_station_median_lng']]
start_station_name | start_station_median_lat | start_station_median_lng | |
---|---|---|---|
0 | Streeter Dr & Grand Ave | 41.892276 | -87.612041 |
1 | Sheridan Rd & Greenleaf Ave | 42.010592 | -87.662428 |
2 | Streeter Dr & Grand Ave | 41.892276 | -87.612041 |
3 | Streeter Dr & Grand Ave | 41.892276 | -87.612041 |
4 | Larrabee St & Division St | 41.903535 | -87.643344 |
... | ... | ... | ... |
598365 | Clarendon Ave & Leland Ave | 41.967942 | -87.650045 |
598366 | Wabash Ave & Roosevelt Rd | 41.867179 | -87.625985 |
598367 | DuSable Lake Shore Dr & Belmont Ave | 41.940740 | -87.639154 |
598368 | Green St & Washington Blvd | 41.883201 | -87.648782 |
598369 | Morgan St & 31st St | 41.837776 | -87.651147 |
598370 rows × 3 columns
Code
# Extract the station name, latitude, and longitude, and count the trips from each station
grouped_df = data.groupby(['start_station_name', 'start_station_median_lat', 'start_station_median_lng'])['ride_id'].count().reset_index()
display(grouped_df.sort_values('ride_id', ascending=False).head())
grouped_df.rename(columns={'start_station_name': 'title', 'start_station_median_lat': 'latitude', 'start_station_median_lng': 'longitude', 'ride_id': 'count'}, inplace=True)
grouped_df['description'] = grouped_df.apply(lambda row: row_to_html(row), axis=1)
geometry = gpd.points_from_xy(grouped_df['longitude'], grouped_df['latitude'])
geo_df = gpd.GeoDataFrame(grouped_df, geometry=geometry)
# Optional: assign a Coordinate Reference System (CRS)
geo_df.crs = "EPSG:4326"  # WGS84 coordinate system
geo_df
start_station_name | start_station_median_lat | start_station_median_lng | ride_id | |
---|---|---|---|---|
1217 | Streeter Dr & Grand Ave | 41.892276 | -87.612041 | 9282 |
271 | DuSable Lake Shore Dr & Monroe St | 41.881059 | -87.616777 | 6577 |
1314 | Wilton Ave & Belmont Ave | 41.940116 | -87.652973 | 5845 |
1235 | University Ave & 57th St | 41.791475 | -87.599920 | 5196 |
478 | LaSalle St & Illinois St | 41.890853 | -87.631665 | 4937 |
title | latitude | longitude | count | description | geometry | |
---|---|---|---|---|---|---|
0 | 2112 W Peterson Ave | 41.991168 | -87.683617 | 119 | <table border="1" class="dataframe">\n <thead... | POINT (-87.68362 41.99117) |
1 | 21st St & Pulaski Rd | 41.853457 | -87.724529 | 19 | <table border="1" class="dataframe">\n <thead... | POINT (-87.72453 41.85346) |
2 | 63rd St Beach | 41.780958 | -87.576320 | 174 | <table border="1" class="dataframe">\n <thead... | POINT (-87.57632 41.78096) |
3 | 900 W Harrison St | 41.874755 | -87.649817 | 882 | <table border="1" class="dataframe">\n <thead... | POINT (-87.64982 41.87476) |
4 | Aberdeen St & Jackson Blvd | 41.877773 | -87.654831 | 1471 | <table border="1" class="dataframe">\n <thead... | POINT (-87.65483 41.87777) |
... | ... | ... | ... | ... | ... | ... |
1333 | Woodlawn Ave & 58th St | 41.789404 | -87.596487 | 987 | <table border="1" class="dataframe">\n <thead... | POINT (-87.59649 41.7894) |
1334 | Woodlawn Ave & 75th St | 41.759160 | -87.595751 | 10 | <table border="1" class="dataframe">\n <thead... | POINT (-87.59575 41.75916) |
1335 | Woodlawn Ave & Lake Park Ave | 41.814078 | -87.597017 | 218 | <table border="1" class="dataframe">\n <thead... | POINT (-87.59702 41.81408) |
1336 | Yates Blvd & 75th St | 41.758705 | -87.566409 | 32 | <table border="1" class="dataframe">\n <thead... | POINT (-87.56641 41.75871) |
1337 | Yates Blvd & 93rd St | 41.726178 | -87.566361 | 4 | <table border="1" class="dataframe">\n <thead... | POINT (-87.56636 41.72618) |
1338 rows × 6 columns
You can add a hover tooltip for each point on your folium map. The tooltip appears when you hover over a marker, providing additional information without requiring a click. Here’s how to modify the existing code to include hover tooltips:
Code
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

m = chicago.explore(
    column="POP2010",                  # make choropleth based on "POP2010" column
    scheme="naturalbreaks",            # use mapclassify's natural breaks scheme
    legend=True,                       # show legend
    k=10,                              # use 10 bins
    tooltip=False,                     # hide tooltip
    popup=["POP2010", "POP2000"],      # show popup (on-click)
    legend_kwds=dict(colorbar=False),  # do not use colorbar
    name="chicago",                    # name of the layer in the map
)

geo_df.explore(
    m=m,                                    # pass the map object
    color="red",                            # use red color on all points
    marker_kwds=dict(radius=5, fill=True),  # make marker radius 5px with fill
    tooltip="description",                  # show the "description" column in the tooltip
    tooltip_kwds=dict(labels=False),        # do not show column labels in the tooltip
    name="divstation",                      # name of the layer in the map
)
m
1.2.5.2 Assigning the station size based on its popularity
Next, we will bin the number of trips associated with each station into 10 bins and assign point sizes accordingly, so that a station’s marker size visually indicates its popularity. Hovering over a station marker will display the station name, and clicking it will show additional station information.

To do this, we use folium.CircleMarker, a versatile class for drawing customizable circle markers on interactive maps. It lets you represent geospatial data points visually and attach additional information through popups and tooltips; adjust its parameters and explore further customization options to suit your specific mapping needs.
Code
def add_bins(df):
    # Bin the 'count' column into 10 bins
    bins = np.linspace(df['count'].min(), df['count'].max(), 10)
    df['bin'] = np.digitize(df['count'], bins)

    # Assign colors to bins
    cmap = plt.get_cmap('viridis')  # you can choose any colormap here
    df['color'] = df['bin'].apply(lambda x: cmap(x / len(bins)))

    # Assign sizes to bins (size increases with bin number)
    df['size'] = df['bin'].apply(lambda x: 2 + x * 1)  # adjust the size calculation as needed
    return df
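A note on the binning: np.linspace produces 10 evenly spaced edges (so 9 interior intervals), and np.digitize returns the index of the interval each count falls into, with the maximum value landing in bin 10. Because the counts are so skewed, most stations cluster in bin 1, as seen in the table below. A small check with toy counts:

```python
import numpy as np

# Toy counts spanning the same range pattern as the station data
counts = np.array([4, 119, 1471, 9282])
bins = np.linspace(counts.min(), counts.max(), 10)  # 10 edges from 4 to 9282

idx = np.digitize(counts, bins)
print(idx)  # small counts cluster in bin 1; the maximum gets bin 10
```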
Code
geo_df = add_bins(geo_df)
geo_df
title | latitude | longitude | count | description | geometry | bin | color | size | |
---|---|---|---|---|---|---|---|---|---|
0 | 2112 W Peterson Ave | 41.991168 | -87.683617 | 119 | <table border="1" class="dataframe">\n <thead... | POINT (-87.68362 41.99117) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
1 | 21st St & Pulaski Rd | 41.853457 | -87.724529 | 19 | <table border="1" class="dataframe">\n <thead... | POINT (-87.72453 41.85346) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
2 | 63rd St Beach | 41.780958 | -87.576320 | 174 | <table border="1" class="dataframe">\n <thead... | POINT (-87.57632 41.78096) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
3 | 900 W Harrison St | 41.874755 | -87.649817 | 882 | <table border="1" class="dataframe">\n <thead... | POINT (-87.64982 41.87476) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
4 | Aberdeen St & Jackson Blvd | 41.877773 | -87.654831 | 1471 | <table border="1" class="dataframe">\n <thead... | POINT (-87.65483 41.87777) | 2 | (0.253935, 0.265254, 0.529983, 1.0) | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1333 | Woodlawn Ave & 58th St | 41.789404 | -87.596487 | 987 | <table border="1" class="dataframe">\n <thead... | POINT (-87.59649 41.7894) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
1334 | Woodlawn Ave & 75th St | 41.759160 | -87.595751 | 10 | <table border="1" class="dataframe">\n <thead... | POINT (-87.59575 41.75916) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
1335 | Woodlawn Ave & Lake Park Ave | 41.814078 | -87.597017 | 218 | <table border="1" class="dataframe">\n <thead... | POINT (-87.59702 41.81408) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
1336 | Yates Blvd & 75th St | 41.758705 | -87.566409 | 32 | <table border="1" class="dataframe">\n <thead... | POINT (-87.56641 41.75871) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
1337 | Yates Blvd & 93rd St | 41.726178 | -87.566361 | 4 | <table border="1" class="dataframe">\n <thead... | POINT (-87.56636 41.72618) | 1 | (0.282623, 0.140926, 0.457517, 1.0) | 3 |
1338 rows × 9 columns
Code
m = folium.Map(location=[geo_df.geometry.y.mean(), geo_df.geometry.x.mean()], zoom_start=13)

# Add a circle marker for each station, sized by its trip-count bin
for idx, row in geo_df.iterrows():
    folium.CircleMarker(
        location=[row.geometry.y, row.geometry.x],
        radius=row['size'],
        color='red',
        fill=True,
        fill_color='red',
        popup=row['description'],
        tooltip=row['title'],
    ).add_to(m)

m
1.2.5.3 Adding markers for the top 5 popular stations
Code
# Get the top 5 most popular start stations and their trip counts
data['start_station_name'].value_counts()[:5]
start_station_name
Streeter Dr & Grand Ave 9282
DuSable Lake Shore Dr & Monroe St 6577
Wilton Ave & Belmont Ave 5845
University Ave & 57th St 5196
LaSalle St & Illinois St 4937
Name: count, dtype: int64
Code
from ipywidgets import Layout
from folium import Map, Marker, Popup
from ipywidgets import HTML
Code
def show_locations(location_df):
    columns = ['latitude', 'longitude', 'title', 'description']
    assert all([col in location_df.columns for col in columns])
    max_lat = location_df['latitude'].max()
    min_lat = location_df['latitude'].min()
    max_lon = location_df['longitude'].max()
    min_lon = location_df['longitude'].min()
    center = [(max_lat + min_lat) / 2, (max_lon + min_lon) / 2]
    # Create the map, centered on the stations' bounding box
    m = folium.Map(location=center, zoom_start=14)
    # Add a marker for each station
    for i, row in location_df.iterrows():
        loc = (row['latitude'], row['longitude'])
        marker = Marker(location=loc, tooltip=row['title'], popup=row['description'])
        marker.add_to(m)
    return m

def row_to_html(row):
    row_df = pd.DataFrame(row).T
    row_df.columns = [col.capitalize() for col in row_df.columns]
    return row_df.to_html(index=False)
Code
# get the top 5 start stations
top_5 = data['start_station_name'].value_counts()[:5].index
top_5_df = data[data['start_station_name'].isin(top_5)].groupby(['start_station_name', 'start_station_median_lat', 'start_station_median_lng'])['ride_id'].count().reset_index()
display(top_5_df)
top_5_df.rename(columns={'start_station_name': 'title', 'start_station_median_lat': 'latitude', 'start_station_median_lng': 'longitude', 'ride_id': 'count'}, inplace=True)
top_5_df['description'] = top_5_df.apply(lambda row: row_to_html(row), axis=1)
show_locations(top_5_df)
start_station_name | start_station_median_lat | start_station_median_lng | ride_id | |
---|---|---|---|---|
0 | DuSable Lake Shore Dr & Monroe St | 41.881059 | -87.616777 | 6577 |
1 | LaSalle St & Illinois St | 41.890853 | -87.631665 | 4937 |
2 | Streeter Dr & Grand Ave | 41.892276 | -87.612041 | 9282 |
3 | University Ave & 57th St | 41.791475 | -87.599920 | 5196 |
4 | Wilton Ave & Belmont Ave | 41.940116 | -87.652973 | 5845 |
2 Independent Practice
Download the trip data from May 2023 to May 2024 from the Divvy website and conduct a geospatial analysis. Your tasks include:
- Imputing missing values where possible.
- Removing invalid trips.
- Calculating the percentage of trips where the bike was picked up and returned to the same station. Also, determine the percentage of these trips classified as commuting versus leisure.
- Identifying the top 5 stations with the highest number of trips and checking if these stations are among those with the most Divvy stations.
- Plotting the top 5 popular stations on a map, highlighting those with the highest number of associated trips.
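As a starting point for the same-station task, round trips can be flagged by comparing the start and end station columns. A sketch on toy data (column names follow the May 2024 schema; the station names are hypothetical):

```python
import pandas as pd

# Toy trips; in practice, load the downloaded monthly CSVs instead
trips = pd.DataFrame({
    'start_station_name': ['A', 'A', 'B', 'C'],
    'end_station_name':   ['A', 'B', 'B', 'A'],
})

# A round trip starts and ends at the same station
same_station = trips['start_station_name'] == trips['end_station_name']
pct_same = 100 * same_station.mean()
print(f"{pct_same:.1f}% of trips returned to the start station")
```

From there, trip duration and time of day can help separate commuting from leisure trips.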