Working with Geospatial datasets in Python

Geospatial plotting is a powerful technique for uncovering spatial patterns, identifying regional disparities, and communicating insights through interactive maps. It is an essential skill for data scientists working with location-based data.

This lesson introduces geospatial analysis and visualization using Python, focusing on the Chicago Divvy bike-sharing dataset as a case study.

This project was inspired by an MCDC practicum course taught in Winter 2022 by Dr. Arend Kuyper. In collaboration with faculty liaison Dr. Amanda Stathopoulos, the course partnered with the Chicago-based nonprofit organization Equiticity, represented by community partner Oboi Reed. Students worked in teams to deliver on Equiticity’s Request for Data Science Services through MCDC. Results from this and other MCDC student–community partnership projects are summarized on the MCDC Past Projects website. MCDC was funded by the U.S. National Science Foundation.

This lesson was developed by Dr. Lizhen Shi for MCDC pathway courses.

1 Guided Practice

1.1 Data Cleaning and Preparation

Real-world datasets often require cleaning before analysis. In this section, we prepare the Divvy bike-sharing dataset for geospatial plotting by addressing missing values, standardizing station coordinates, and removing invalid trips.

1.1.1 Import Libraries

Let’s start by importing the necessary Python libraries.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

1.1.2 Load the Dataset

We’ll use trip data from May 2024, available on the Divvy website. This dataset contains records of individual bike trips, including geospatial coordinates.

Code
# Load the data
divvy_0524 = pd.read_csv('202405-divvy-tripdata.csv')
print(divvy_0524.shape)
divvy_0524.head()
(609493, 13)
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 7D9F0CE9EC2A1297 classic_bike 2024-05-25 15:52:42 2024-05-25 16:11:50 Streeter Dr & Grand Ave 13022 Clark St & Elm St TA1307000039 41.892278 -87.612043 41.902973 -87.631280 casual
1 02EC47687411416F classic_bike 2024-05-14 15:11:51 2024-05-14 15:22:00 Sheridan Rd & Greenleaf Ave KA1504000159 Sheridan Rd & Loyola Ave RP-009 42.010587 -87.662412 42.001044 -87.661198 casual
2 101370FB2D3402BE classic_bike 2024-05-30 17:46:04 2024-05-30 18:09:16 Streeter Dr & Grand Ave 13022 Wabash Ave & 9th St TA1309000010 41.892278 -87.612043 41.870769 -87.625734 member
3 E97E396331ED6913 electric_bike 2024-05-17 20:21:54 2024-05-17 20:40:32 Streeter Dr & Grand Ave 13022 Sheffield Ave & Wellington Ave TA1307000052 41.892270 -87.611946 41.936253 -87.652662 member
4 674EDE311C543165 classic_bike 2024-05-22 18:52:20 2024-05-22 18:59:04 Larrabee St & Division St KA1504000079 Clark St & Elm St TA1307000039 41.903486 -87.643353 41.902973 -87.631280 casual

The dataset has 609,493 rows and 13 columns, each representing a bike trip. Here’s a breakdown of the columns:

Field name Data type Description
ride_id String Unique ID for each ride
rideable_type String Bike types
started_at Timestamp Trip start day and time
ended_at Timestamp Trip end day and time
start_station_name String Trip start station
start_station_id String ID of the start station
end_station_name String Trip end station
end_station_id String ID of the end station
start_lat Float Latitude of the start station
start_lng Float Longitude of the start station
end_lat Float Latitude of the end station
end_lng Float Longitude of the end station
member_casual String Rider type

The geospatial-related columns are:

  • start_lat and start_lng represent the coordinates where a bike is picked up.
  • end_lat and end_lng represent the coordinates where a bike is returned.
  • start_station_name is the name of the trip’s start station.
  • start_station_id is the ID of the start station.
  • end_station_name is the name of the trip’s end station.
  • end_station_id is the ID of the end station.

1.1.3 Handling Missing Values

Let’s check for missing values within the dataset.

Code
divvy_0524.isnull().sum()
ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    109048
start_station_id      109048
end_station_name      112731
end_station_id        112731
start_lat                  0
start_lng                  0
end_lat                  784
end_lng                  784
member_casual              0
dtype: int64

There are 784 observations missing end_lat and end_lng. Let’s check whether any of these rows have an end station name listed, which would let us impute the missing coordinates from the station name.

Code
# check whether rows missing end_lat or end_lng still have an end_station_name
divvy_0524[(divvy_0524['end_station_name'].notnull()) & ((divvy_0524['end_lat'].isnull()) | (divvy_0524['end_lng'].isnull()))]
 
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual

The result is an empty DataFrame, indicating that rows missing end_lat and end_lng also lack end_station_name and end_station_id. Since imputation isn’t possible, we’ll drop these rows:

Code
# drop rows missing end_station_name together with end_lat or end_lng
divvy_0524 = divvy_0524[~((divvy_0524['end_station_name'].isnull()) & ((divvy_0524['end_lat'].isnull()) | (divvy_0524['end_lng'].isnull())))]

# recheck the null values
divvy_0524.isnull().sum()
ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    109048
start_station_id      109048
end_station_name      111947
end_station_id        111947
start_lat                  0
start_lng                  0
end_lat                    0
end_lng                    0
member_casual              0
dtype: int64

We successfully dropped the instances where both end_lat and end_lng are missing. Next, let’s explore the mapping between station names and their coordinates for our missing value imputation.

Code
locations = pd.concat([divvy_0524[['start_station_name', 'start_lat', 'start_lng']].drop_duplicates().rename(columns={'start_station_name': 'station_name', 'start_lat': 'lat', 'start_lng': 'lng'}),
                        divvy_0524[['end_station_name', 'end_lat', 'end_lng']].drop_duplicates().rename(columns={'end_station_name': 'station_name', 'end_lat': 'lat', 'end_lng': 'lng'})])

locations = locations.drop_duplicates()
print(locations.shape)
locations.head()
(195676, 3)
station_name lat lng
0 Streeter Dr & Grand Ave 41.892278 -87.612043
1 Sheridan Rd & Greenleaf Ave 42.010587 -87.662412
3 Streeter Dr & Grand Ave 41.892270 -87.611946
4 Larrabee St & Division St 41.903486 -87.643353
5 Sheridan Rd & Greenleaf Ave 42.010571 -87.662456
Code
# count the distinct station names
locations['station_name'].unique().size
1361

The 195,676 unique coordinate pairs map to 1,361 station names, an average of roughly 144 coordinates per station. Next, let’s find out which station has the most coordinates.

Code
locations.groupby('station_name').size().sort_values(ascending=False).head(10)
station_name
Streeter Dr & Grand Ave              1831
Clinton St & Washington Blvd         1550
Michigan Ave & Oak St                1372
DuSable Lake Shore Dr & Monroe St    1307
Kingsbury St & Kinzie St             1304
Wells St & Concord Ln                1269
Clark St & Elm St                    1264
Clinton St & Madison St              1223
Wells St & Elm St                    1220
Sheffield Ave & Waveland Ave         1190
dtype: int64

1.1.4 Standardizing Station Coordinates

Divvy stations can have multiple coordinates for the same station due to GPS variations. For geospatial plotting, we’ll use a single coordinate per station by calculating the median latitude and longitude:

Code
# Calculate median lat and lng for each station_name
median_locations = locations.groupby('station_name').agg({'lat': 'median', 'lng': 'median'}).reset_index()
median_locations.rename(columns={'lat': 'median_lat', 'lng': 'median_lng'}, inplace=True)
print(median_locations.shape)
median_locations.head()
(1360, 3)
station_name median_lat median_lng
0 2112 W Peterson Ave 41.991168 -87.683617
1 21st St & Pulaski Rd 41.853457 -87.724529
2 63rd St Beach 41.780958 -87.576320
3 900 W Harrison St 41.874755 -87.649817
4 Aberdeen St & Jackson Blvd 41.877773 -87.654831

If two coordinates are very close, it’s likely that they belong to the same Divvy station. For the remaining missing values, let’s use the nearest neighbor’s station name based on coordinates to fill in the missing names.

Code
# define the nearest-neighbor helper

from sklearn.neighbors import NearestNeighbors

def make_nn(df):
    # Ensure the DataFrame has the necessary columns
    if not all(col in df.columns for col in ['median_lat', 'median_lng']):
        raise ValueError("DataFrame must contain 'median_lat' and 'median_lng' columns")
    
    # Extract the coordinates from the DataFrame
    coordinates = df[['median_lat', 'median_lng']].values
    
    # Create and fit the NearestNeighbors model
    nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(coordinates)
    
    return nbrs

nbrs = make_nn(median_locations)

def find_nearest_station(nbrs, df, lat, lng):
    
    # Find the nearest neighbor to the given (lat, lng)
    distance, index = nbrs.kneighbors([[lat, lng]])
    
    # Get the nearest station's information
    nearest_station = df.iloc[index[0][0]]['station_name']
    
    return nearest_station
Code
# check whether the method can return the nearest station name
find_nearest_station(nbrs, median_locations, 41.892278, -87.612043)
'Streeter Dr & Grand Ave'
Code
divvy_0524 = divvy_0524.reset_index(drop=True)
Code
# use the nearest neighbor method to impute the missing values in the `start_station_name` and `end_station_name` in the dataset
from tqdm import tqdm
# impute start_station_name
for i in tqdm(range(len(divvy_0524))):
    if pd.isnull(divvy_0524.loc[i, 'start_station_name']):
        divvy_0524.loc[i, 'start_station_name'] = find_nearest_station(nbrs, median_locations, divvy_0524.loc[i, 'start_lat'], divvy_0524.loc[i, 'start_lng'])
    # impute end_station_name
    if pd.isnull(divvy_0524.loc[i, 'end_station_name']):
        divvy_0524.loc[i, 'end_station_name'] = find_nearest_station(nbrs, median_locations, divvy_0524.loc[i, 'end_lat'], divvy_0524.loc[i, 'end_lng'])
100%|██████████| 608709/608709 [05:57<00:00, 1701.02it/s] 
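The row-by-row loop above takes about six minutes. As an aside, the same imputation can be done with a single vectorized kneighbors call. Here is a minimal sketch with toy stand-ins for median_locations and the trip table (the small DataFrames below are illustrative, not the real data):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for median_locations and the trip table (illustrative only)
stations = pd.DataFrame({
    'station_name': ['A', 'B'],
    'median_lat': [41.89, 42.01],
    'median_lng': [-87.61, -87.66],
})
trips = pd.DataFrame({
    'start_station_name': [None, 'B'],
    'start_lat': [41.891, 42.010],
    'start_lng': [-87.612, -87.660],
})

nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(
    stations[['median_lat', 'median_lng']].values)

# Query every missing row in a single call instead of looping
mask = trips['start_station_name'].isnull()
_, idx = nbrs.kneighbors(trips.loc[mask, ['start_lat', 'start_lng']].values)
trips.loc[mask, 'start_station_name'] = stations['station_name'].to_numpy()[idx.ravel()]
print(trips['start_station_name'].tolist())  # ['A', 'B']
```

Batching the queries this way avoids Python-level loop overhead and one kneighbors call per row.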

The assert statement in Python is used as a debugging aid that tests a condition. If the condition is True, the program continues to execute normally. If the condition is False, an AssertionError is raised, and the program stops executing. assert is commonly used to check assumptions made by the program and to catch bugs early in the development process.
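A minimal illustration of both outcomes:

```python
assert 1 + 1 == 2  # condition is True: execution continues silently

# condition is False: an AssertionError is raised with the optional message
try:
    assert 1 + 1 == 3, "arithmetic is broken"
    raised = False
except AssertionError as err:
    raised = True
    message = str(err)
print(raised, message)  # True arithmetic is broken
```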

In this case, after the imputation, there should be no missing values in the start_station_name and end_station_name columns.

Code
# use assert to check the imputation results
assert divvy_0524['start_station_name'].isnull().sum() == 0
assert divvy_0524['end_station_name'].isnull().sum() == 0

Let’s merge the median_lat and median_lng columns into the Divvy dataset for plotting.

Code
divvy_0524 = pd.merge(divvy_0524, median_locations, left_on='start_station_name', right_on='station_name', how='left')
divvy_0524 = divvy_0524.drop(columns=['station_name'])
divvy_0524.rename(columns={'median_lat': 'start_station_median_lat', 'median_lng': 'start_station_median_lng'}, inplace=True)

divvy_0524 = pd.merge(divvy_0524, median_locations, left_on='end_station_name', right_on='station_name', how='left')
divvy_0524 = divvy_0524.drop(columns=['station_name'])
divvy_0524.rename(columns={'median_lat': 'end_station_median_lat', 'median_lng': 'end_station_median_lng'}, inplace=True)

1.1.5 Data Cleaning: Trip Distance Exploration

Trip distance is a critical metric, even though it isn’t an original feature in the dataset. Fortunately, we have the start and end coordinates for each trip, allowing us to calculate distances based on them.

Analyzing trip distances in the Divvy dataset is essential for several reasons, including understanding user behavior, assessing system efficiency, and evaluating the overall performance of the bike-sharing service.

To calculate the trip distance using the start and end coordinates, we can employ the Haversine formula, which determines the great-circle distance between two points on the Earth’s surface based on their latitude and longitude. Alternatively, we can use the geopy library in Python, which provides a straightforward interface for geospatial calculations.
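With $\varphi$ denoting latitude, $\lambda$ longitude, and $R$ the Earth’s radius, the Haversine formula computes the great-circle distance $d$ as:

$$
a = \sin^2\!\left(\tfrac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\tfrac{\Delta\lambda}{2}\right)
$$

$$
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\,\sqrt{1-a}\right), \qquad d = R\,c
$$

The Python implementation that follows mirrors these terms (dlat is $\Delta\varphi$, dlon is $\Delta\lambda$).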

Here is an implementation of the Haversine formula in Python:

Code
def haversine(lat1, lon1, lat2, lon2):
    # Convert degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    
    # Compute differences
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    
    # Haversine formula
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    
    # Radius of Earth in kilometers and conversion to miles
    R_km = 6371.0
    R_mi = R_km * 0.621371
    
    # Compute distance
    distance = R_mi * c
    
    return distance

divvy_0524['distance'] = haversine(divvy_0524['start_lat'], divvy_0524['start_lng'], divvy_0524['end_lat'], divvy_0524['end_lng'])
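As a quick sanity check on the function, one degree of latitude corresponds to roughly 69 miles anywhere on Earth (the function is restated here so the snippet runs on its own):

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # Same formula as above, repeated so this snippet is self-contained
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return 6371.0 * 0.621371 * c  # Earth radius in km, converted to miles

# One degree of latitude should be roughly 69 miles
print(round(haversine(41.0, -87.6, 42.0, -87.6), 1))  # 69.1
```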
Code
# plot the histogram to check the distribution
divvy_0524['distance'].hist(bins=100)

Code
# check the statistics of the distance
divvy_0524['distance'].describe()
count    608709.000000
mean          1.386598
std           1.267535
min           0.000000
25%           0.556175
50%           1.020599
75%           1.825678
max          20.840069
Name: distance, dtype: float64
Code
# bin the distance into 4 groups: 0-0.99, 1-1.99, 2-4.99, 5-max
divvy_0524['distance_bins'] = pd.cut(divvy_0524['distance'], bins=[0, 0.99, 1.99, 4.99, divvy_0524['distance'].max()], labels=['0-0.99', '1-1.99', '2-4.99', '5-max'])


# plot the distance_bins with a title
divvy_0524['distance_bins'].value_counts().plot(kind='bar', title='Distance Bins')
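Note that pd.cut uses right-closed, left-open intervals by default, so a distance of exactly 0 falls outside the first bin and is labeled NaN; this is why zero-distance trips show NaN in distance_bins. A small illustration with made-up values:

```python
import pandas as pd

s = pd.Series([0.0, 0.5, 1.5, 6.0])
bins = pd.cut(s, bins=[0, 0.99, 1.99, 4.99, 21],
              labels=['0-0.99', '1-1.99', '2-4.99', '5-max'])
print(bins.tolist())  # [nan, '0-0.99', '1-1.99', '5-max']
```

Passing include_lowest=True to pd.cut would place 0.0 in the first bin instead.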

Code
# check the percentage of value counts of the distance_bins
divvy_0524['distance_bins'].value_counts(normalize=True)
distance_bins
0-0.99    0.452076
1-1.99    0.311665
2-4.99    0.213208
5-max     0.023052
Name: proportion, dtype: float64

Next, let’s determine the percentage of trips that start and end at the same station, indicating a distance of zero. These trips can fall into two categories:

  • Valid trips: Trips where bikes are picked up and returned to the same station.
  • Invalid trips: These are likely noise in the data, such as when someone unlocks a bike but then decides not to rent it. In such scenarios, the trip duration is also very short.

For the second case, we need to remove the noise before proceeding with our analysis.

Code
# show the distance equals to 0
divvy_0524[divvy_0524['distance'] == 0]
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual start_station_median_lat start_station_median_lng end_station_median_lat end_station_median_lng distance distance_bins
11 95CA09985B75FA22 classic_bike 2024-05-27 18:14:19 2024-05-27 19:09:30 McClurg Ct & Ohio St TA1306000029 McClurg Ct & Ohio St TA1306000029 41.892592 -87.617289 41.892592 -87.617289 member 41.892658 -87.617321 41.892658 -87.617321 0.0 NaN
13 A08FB7FA82FCD72E classic_bike 2024-05-09 17:38:03 2024-05-09 17:41:37 Indiana Ave & 26th St TA1307000005 Indiana Ave & 26th St TA1307000005 41.845687 -87.622481 41.845687 -87.622481 member 41.845708 -87.622545 41.845708 -87.622545 0.0 NaN
14 4E578174D37FBFE5 classic_bike 2024-05-09 17:42:22 2024-05-09 17:42:49 Indiana Ave & 26th St TA1307000005 Indiana Ave & 26th St TA1307000005 41.845687 -87.622481 41.845687 -87.622481 member 41.845708 -87.622545 41.845708 -87.622545 0.0 NaN
15 80903806220B461B classic_bike 2024-05-28 19:42:27 2024-05-28 20:22:45 McClurg Ct & Ohio St TA1306000029 McClurg Ct & Ohio St TA1306000029 41.892592 -87.617289 41.892592 -87.617289 member 41.892658 -87.617321 41.892658 -87.617321 0.0 NaN
16 855A39B703AEEF8D classic_bike 2024-05-31 12:02:35 2024-05-31 12:02:58 Stetson Ave & South Water St TA1308000029 Stetson Ave & South Water St TA1308000029 41.886835 -87.622320 41.886835 -87.622320 member 41.886770 -87.622422 41.886770 -87.622422 0.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
608122 8BFD4D2FF581C242 electric_bike 2024-05-21 19:36:07 2024-05-21 19:41:15 University Ave & 57th St NaN University Ave & 57th St NaN 41.790000 -87.600000 41.790000 -87.600000 casual 41.791475 -87.599920 41.791475 -87.599920 0.0 NaN
608123 764B4FDBDFCBC442 electric_bike 2024-05-22 14:34:41 2024-05-22 14:56:53 DuSable Lake Shore Dr & Belmont Ave NaN DuSable Lake Shore Dr & Belmont Ave NaN 41.940000 -87.640000 41.940000 -87.640000 casual 41.940740 -87.639154 41.940740 -87.639154 0.0 NaN
608134 5D055CB420A4FE10 classic_bike 2024-05-21 22:10:52 2024-05-21 23:11:42 Avenue O & 134th St 20214 Avenue O & 134th St 20214 41.651799 -87.540696 41.651799 -87.540696 casual 41.651881 -87.539622 41.651881 -87.539622 0.0 NaN
608140 AF072EB6C20AD421 electric_bike 2024-05-21 19:02:16 2024-05-21 19:02:25 Dearborn Pkwy & Delaware Pl NaN Dearborn Pkwy & Delaware Pl NaN 41.900000 -87.630000 41.900000 -87.630000 casual 41.898946 -87.629925 41.898946 -87.629925 0.0 NaN
608145 87452852EC07FE5F electric_bike 2024-05-22 18:01:00 2024-05-22 18:01:44 DuSable Lake Shore Dr & Monroe St NaN DuSable Lake Shore Dr & Monroe St NaN 41.880000 -87.620000 41.880000 -87.620000 casual 41.881059 -87.616777 41.881059 -87.616777 0.0 NaN

36084 rows × 19 columns

Among these trips, we will identify those with a duration of at most 1 minute and a trip distance of 0. Let’s filter out these anomalies.

Code
# set the start and end time to datetime
divvy_0524['started_at'] = pd.to_datetime(divvy_0524['started_at'])
divvy_0524['ended_at'] = pd.to_datetime(divvy_0524['ended_at'])

# create a new column duration in minutes
divvy_0524['duration'] = (divvy_0524['ended_at'] - divvy_0524['started_at']).dt.total_seconds() / 60
Code
# extract trips whose duration is at most 1 minute and whose distance is 0
divvy_0524[(divvy_0524['duration'] <= 1) & (divvy_0524['distance'] == 0)]
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual start_station_median_lat start_station_median_lng end_station_median_lat end_station_median_lng distance distance_bins duration
14 4E578174D37FBFE5 classic_bike 2024-05-09 17:42:22 2024-05-09 17:42:49 Indiana Ave & 26th St TA1307000005 Indiana Ave & 26th St TA1307000005 41.845687 -87.622481 41.845687 -87.622481 member 41.845708 -87.622545 41.845708 -87.622545 0.0 NaN 0.450000
16 855A39B703AEEF8D classic_bike 2024-05-31 12:02:35 2024-05-31 12:02:58 Stetson Ave & South Water St TA1308000029 Stetson Ave & South Water St TA1308000029 41.886835 -87.622320 41.886835 -87.622320 member 41.886770 -87.622422 41.886770 -87.622422 0.0 NaN 0.383333
262 9F76CEEAA6B34C0B classic_bike 2024-05-12 16:04:57 2024-05-12 16:05:15 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.892278 -87.612043 41.892278 -87.612043 member 41.892276 -87.612041 41.892276 -87.612041 0.0 NaN 0.300000
631 32B4023E9923173E classic_bike 2024-05-18 22:37:50 2024-05-18 22:38:16 Ashland Ave & Blackhawk St 13224 Ashland Ave & Blackhawk St 13224 41.907066 -87.667252 41.907066 -87.667252 member 41.907060 -87.667226 41.907060 -87.667226 0.0 NaN 0.433333
632 408786CFCC0E004F classic_bike 2024-05-23 13:39:18 2024-05-23 13:39:52 Cornell Ave & Hyde Park Blvd KA1503000007 Cornell Ave & Hyde Park Blvd KA1503000007 41.802406 -87.586924 41.802406 -87.586924 member 41.802421 -87.586920 41.802421 -87.586920 0.0 NaN 0.566667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
608112 35F4266BF2D0AB22 electric_bike 2024-05-22 18:12:38 2024-05-22 18:12:40 DuSable Lake Shore Dr & Monroe St NaN DuSable Lake Shore Dr & Monroe St NaN 41.880000 -87.620000 41.880000 -87.620000 casual 41.881059 -87.616777 41.881059 -87.616777 0.0 NaN 0.033333
608117 DF9291196964FB93 electric_bike 2024-05-22 07:35:36 2024-05-22 07:35:48 Green St & Madison St NaN Green St & Madison St NaN 41.880000 -87.650000 41.880000 -87.650000 casual 41.881861 -87.649189 41.881861 -87.649189 0.0 NaN 0.200000
608120 2D857C9FCF2434CF electric_bike 2024-05-21 19:40:39 2024-05-21 19:41:23 Sedgwick St & Webster Ave NaN Sedgwick St & Webster Ave NaN 41.920000 -87.640000 41.920000 -87.640000 casual 41.922176 -87.638982 41.922176 -87.638982 0.0 NaN 0.733333
608140 AF072EB6C20AD421 electric_bike 2024-05-21 19:02:16 2024-05-21 19:02:25 Dearborn Pkwy & Delaware Pl NaN Dearborn Pkwy & Delaware Pl NaN 41.900000 -87.630000 41.900000 -87.630000 casual 41.898946 -87.629925 41.898946 -87.629925 0.0 NaN 0.150000
608145 87452852EC07FE5F electric_bike 2024-05-22 18:01:00 2024-05-22 18:01:44 DuSable Lake Shore Dr & Monroe St NaN DuSable Lake Shore Dr & Monroe St NaN 41.880000 -87.620000 41.880000 -87.620000 casual 41.881059 -87.616777 41.881059 -87.616777 0.0 NaN 0.733333

10339 rows × 20 columns

Let’s remove them from the Divvy dataset.

Code
# drop trips whose duration is at most 1 minute and whose distance is 0
divvy_0524 = divvy_0524[~((divvy_0524['duration'] <= 1) & (divvy_0524['distance'] == 0))]
divvy_0524.shape
(598370, 20)
Code
# check the percentage of valid trips that pick up and drop off at the same station
divvy_0524[divvy_0524['start_station_name'] == divvy_0524['end_station_name']].shape[0] / divvy_0524.shape[0]
0.06118956498487558

It turns out that only about 6% of the trips start and end at the same station.

Next, let’s examine the overall distribution of trip distances again.

Code
# check the distance statistics after removing the invalid trips
divvy_0524['distance'].describe()
count    598370.000000
mean          1.410557
std           1.265152
min           0.000000
25%           0.582939
50%           1.032578
75%           1.849402
max          20.840069
Name: distance, dtype: float64
Code
# plot the distance distribution of trips that pick up and drop off at different stations
divvy_0524[divvy_0524['start_station_name'] != divvy_0524['end_station_name']]['distance'].hist(bins=100)

As the histogram shows, the majority of trips are short, with distances under 5 miles.

Let’s save our cleaned data to a CSV file for the geospatial plotting that follows.

Code
divvy_0524.to_csv('202405-divvy-tripdata-cleaned.csv', index=False)

1.2 Geospatial Plotting

1.2.1 Python Packages for Geospatial Plotting

There are several widely used Python packages designed for geospatial plotting. In this lesson, we focus on two popular libraries:

  1. GeoPandas: Static Plotting
    • Extends the functionality of the pandas library to support spatial data types and geospatial operations.
    • Allows for easy reading, writing, and manipulation of spatial data, including shapefiles and GeoJSON formats.
    • Primarily used for data wrangling and creating basic static maps using matplotlib.
    • Integrates seamlessly with matplotlib, handling both tabular and vector-based geospatial data.
  2. Folium: Interactive Plotting
    • A powerful library for building interactive maps, built on top of the Leaflet.js framework.
    • Supports various map tiles and overlays, including GeoJSON layers for rendering vector data.
    • Ideal for creating web-based, zoomable, and clickable maps.

1.2.2 Static Plots with GeoPandas

A shapefile is a widely used format for storing the geometric location and attribute information of geographic features. Commonly used in GIS (Geographic Information Systems), shapefiles enable spatial data analysis and visualization.

1.2.2.1 📁 Components of a Shapefile

A shapefile actually consists of multiple files that share the same base name but have different extensions. Key components include:

  • .shp: Stores the geometry (points, lines, or polygons)
  • .shx: Index file for the geometry data
  • .dbf: Stores associated attribute data in tabular form (similar to a spreadsheet)
  • .prj: Contains coordinate system and projection information

1.2.2.2 🗺️ Geometry Types in Shapefiles

Shapefiles can represent different types of geographic features:

  • Points: e.g., Divvy stations, landmarks
  • Lines: e.g., roads, bike lanes
  • Polygons: e.g., neighborhood boundaries, parks

Each geometric feature can be linked to attribute data, such as names, types, or demographic values.

The .prj file is especially important for ensuring proper alignment of spatial features by defining the coordinate reference system.

1.2.3 Loading a Shapefile

We will use the Chicago community area shapefile to map the locations where Divvy stations are situated. This provides the foundation for joining spatial data with community-level attributes.

📍 The Chicago Community Area shapefile can be downloaded from the Chicago Data Portal.

Code
import geopandas as gpd

# Create figure and axis
fig, ax = plt.subplots(figsize=(15, 10))

# Plot your GeoDataFrame
chicago = gpd.read_file('chicago_boundaries/geo_export_26bce2f2-c163-42a9-9329-9ca6e082c5e9.shp')
chicago.plot(column='community', ax=ax, legend=True, legend_kwds={'ncol': 2, 'bbox_to_anchor': (2, 1)})

# Add title (optional)
plt.title('Chicago Community Areas')

# Show the plot
plt.show()

1.2.4 Adding the divvy stations to the plot

In this step, we’ll create a scatter plot to visualize the Divvy stations, where each point represents a station’s location.
We’ll use the median latitude and longitude of each station for positioning. Once the scatter plot is created, we’ll overlay it on top of the Chicago community area map to provide geographic context.

Code
# read the cleaned trip data into a pandas DataFrame
data = pd.read_csv('202405-divvy-tripdata-cleaned.csv')

# keep one row per start station with its median coordinates

start_stations = data[[ 'start_station_name', 'start_station_median_lat', 'start_station_median_lng']].drop_duplicates()

start_stations.head()
start_station_name start_station_median_lat start_station_median_lng
0 Streeter Dr & Grand Ave 41.892276 -87.612041
1 Sheridan Rd & Greenleaf Ave 42.010592 -87.662428
4 Larrabee St & Division St 41.903535 -87.643344
6 Damen Ave & Wellington Ave 41.935895 -87.678456
7 Aberdeen St & Monroe St 41.880372 -87.655610
Code
import geopandas as gpd
# Adding the stations to the plot
fig, ax = plt.subplots(figsize=(15, 10))

chicago = gpd.read_file('chicago_boundaries/geo_export_26bce2f2-c163-42a9-9329-9ca6e082c5e9.shp')
chicago.plot(column='community', ax=ax, legend=True, legend_kwds={'ncol': 2, 'bbox_to_anchor': (2, 1)})

plt.scatter(start_stations['start_station_median_lng'], start_stations['start_station_median_lat'], s=10, alpha=0.5, color='black', marker='o')
# data[['longitude_start', 'latitude_start']].drop_duplicates().plot(ax=ax, color='red', markersize=10, marker='o')

# Add title (optional)
plt.title('Chicago Community Areas')

# Show the plot
plt.show()

Let’s switch to a different Chicago boundary dataset, this one loaded from the geodatasets package.

Code
import geodatasets

chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

# Plot the data
fig, ax = plt.subplots(figsize=(15, 10))
chicago.boundary.plot(ax=ax)
plt.scatter(start_stations['start_station_median_lng'], start_stations['start_station_median_lat'], s=10, alpha=0.5, color='black', marker='o')
plt.title('Chicago Community Areas')
Text(0.5, 1.0, 'Chicago Community Areas')

There are 14 stations located outside the community map. According to the Divvy website, the system covers both the Chicago area and Evanston. Based on the map, it appears that these out-of-map stations are situated in Evanston.

Next, we will add the community names to the Divvy trip dataset and verify this information.

Code
from shapely.geometry import Point

def find_community(gdf, lat, long):
    """
    Given a GeoDataFrame with a geometry column of polygons and a community column,
    find the community a point (lat, long) belongs to.

    Parameters:
    - gdf (GeoDataFrame): The GeoDataFrame with polygons and a community column.
    - lat (float): The latitude of the point.
    - long (float): The longitude of the point.

    Returns:
    - str: The community the point belongs to, or None if no community is found.
    """
    point = Point(long, lat)  # Note that Point takes (longitude, latitude)
    
    for _, row in gdf.iterrows():
        if row['geometry'].contains(point):
            return row['community']
    
    return None
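The point-in-polygon test at the heart of find_community can be illustrated with a toy example (the two square “communities” below are made up):

```python
from shapely.geometry import Point, Polygon

# Two made-up square "communities" on a unit grid
communities = {
    'WEST': Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    'EAST': Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
}

def lookup(lat, lng):
    point = Point(lng, lat)  # Point takes (x, y) = (longitude, latitude)
    for name, poly in communities.items():
        if poly.contains(point):
            return name
    return None

print(lookup(0.5, 0.5), lookup(0.5, 1.5), lookup(0.5, 9.0))
# WEST EAST None
```

For large tables, geopandas provides a spatial join (sjoin) that performs this lookup for all points at once rather than iterating row by row.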
Code
start_stations['community'] = start_stations.apply(lambda row: find_community(chicago, row['start_station_median_lat'], row['start_station_median_lng']), axis=1)
start_stations.head()
start_station_name start_station_median_lat start_station_median_lng community
0 Streeter Dr & Grand Ave 41.892276 -87.612041 NEAR NORTH SIDE
1 Sheridan Rd & Greenleaf Ave 42.010592 -87.662428 ROGERS PARK
4 Larrabee St & Division St 41.903535 -87.643344 NEAR NORTH SIDE
6 Damen Ave & Wellington Ave 41.935895 -87.678456 NORTH CENTER
7 Aberdeen St & Monroe St 41.880372 -87.655610 NEAR WEST SIDE
Code
start_stations.isnull().sum()
start_station_name           0
start_station_median_lat     0
start_station_median_lng     0
community                   14
dtype: int64

Based on the Divvy data description and geospatial plotting, the 14 stations missing community information belong to Evanston. Let’s proceed with imputing the community data for Evanston.

Code
# fill the missing values in the column 'community' with 'Evanston'
start_stations['community'] = start_stations['community'].fillna('Evanston')

start_stations.isnull().sum()
start_station_name          0
start_station_median_lat    0
start_station_median_lng    0
community                   0
dtype: int64
Code
# save the start_stations to a csv file

start_stations.to_csv('start_stations_with_community.csv', index=False)

Once we add the community names to the Divvy dataset, let’s explore which community has the most trips associated with it.

Code
# add the community to the divvy set
data = data.merge(start_stations[['start_station_name', 'community']], on='start_station_name', how='left')

data.head()
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng ... end_lng member_casual start_station_median_lat start_station_median_lng end_station_median_lat end_station_median_lng distance distance_bins duration community
0 7D9F0CE9EC2A1297 classic_bike 2024-05-25 15:52:42 2024-05-25 16:11:50 Streeter Dr & Grand Ave 13022 Clark St & Elm St TA1307000039 41.892278 -87.612043 ... -87.631280 casual 41.892276 -87.612041 41.902838 -87.631606 1.234844 1-1.99 19.133333 NEAR NORTH SIDE
1 02EC47687411416F classic_bike 2024-05-14 15:11:51 2024-05-14 15:22:00 Sheridan Rd & Greenleaf Ave KA1504000159 Sheridan Rd & Loyola Ave RP-009 42.010587 -87.662412 ... -87.661198 casual 42.010592 -87.662428 42.001143 -87.661277 0.662281 0-0.99 10.150000 ROGERS PARK
2 101370FB2D3402BE classic_bike 2024-05-30 17:46:04 2024-05-30 18:09:16 Streeter Dr & Grand Ave 13022 Wabash Ave & 9th St TA1309000010 41.892278 -87.612043 ... -87.625734 member 41.892276 -87.612041 41.870736 -87.625740 1.644567 1-1.99 23.200000 NEAR NORTH SIDE
3 E97E396331ED6913 electric_bike 2024-05-17 20:21:54 2024-05-17 20:40:32 Streeter Dr & Grand Ave 13022 Sheffield Ave & Wellington Ave TA1307000052 41.892270 -87.611946 ... -87.652662 member 41.892276 -87.612041 41.936300 -87.652654 3.690219 2-4.99 18.633333 NEAR NORTH SIDE
4 674EDE311C543165 classic_bike 2024-05-22 18:52:20 2024-05-22 18:59:04 Larrabee St & Division St KA1504000079 Clark St & Elm St TA1307000039 41.903486 -87.643353 ... -87.631280 casual 41.903535 -87.643344 41.902838 -87.631606 0.621883 0-0.99 6.733333 NEAR NORTH SIDE

5 rows × 21 columns

Code
community_trip_counts = data['community'].value_counts().reset_index()
community_trip_counts.columns = ['community', 'trip_count']
community_trip_counts['log_trip_count'] = community_trip_counts['trip_count'].apply(lambda x: 0 if x == 0 else np.log(x))
community_trip_counts
community trip_count log_trip_count
0 NEAR NORTH SIDE 109137 11.600359
1 LINCOLN PARK 77608 11.259426
2 LOOP 66632 11.106940
3 LAKE VIEW 64769 11.078582
4 NEAR WEST SIDE 61773 11.031222
... ... ... ...
72 AVALON PARK 31 3.433987
73 EDISON PARK 26 3.258097
74 CALUMET HEIGHTS 24 3.178054
75 BURNSIDE 8 2.079442
76 RIVERDALE 4 1.386294

77 rows × 3 columns
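The zero-guard in the log transform above can also be avoided with `np.log1p`, which maps 0 to 0 by construction (note that `log1p(x) = log(1 + x)`, so values differ slightly from plain `np.log` for small counts). A quick sketch on hypothetical trip counts:

```python
import numpy as np
import pandas as pd

# Hypothetical trip counts, including a community with zero trips
counts = pd.Series({'NEAR NORTH SIDE': 109137, 'BURNSIDE': 8, 'NOWHERE': 0})

# log1p(x) = log(1 + x), so zero counts need no special case
log_counts = np.log1p(counts)
```

For large counts the difference from `np.log` is negligible, which is why either transform works for visual comparison.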

Next, let’s explore which community has the most stations.

Code
community_station_counts = data[['start_station_name', 'community']].drop_duplicates()['community'].value_counts().reset_index()
community_station_counts.columns = ['community', 'station_count']
community_station_counts.head()
community station_count
0 NEAR WEST SIDE 64
1 NEAR NORTH SIDE 55
2 WEST TOWN 47
3 AUSTIN 42
4 LOOP 40
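The same per-community station count can be computed with a `groupby`/`nunique`, which reads a little more directly than deduplicating first. A toy sketch with hypothetical stations and communities:

```python
import pandas as pd

# Hypothetical trips: each row is one trip from a station in a community
df = pd.DataFrame({
    'start_station_name': ['A', 'A', 'B', 'C', 'C'],
    'community':          ['LOOP', 'LOOP', 'LOOP', 'AUSTIN', 'AUSTIN'],
})

# Count distinct stations per community
station_counts = df.groupby('community')['start_station_name'].nunique()
```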

1.2.5 Interactive Plotting with Folium

In addition to static plotting, geopandas supports interactive geospatial visualizations through the folium library.

The explore() method in geopandas provides a simple interface for creating interactive maps, closely mirroring the API of the standard plot() method used for static plotting. You can use explore() directly on a GeoSeries or GeoDataFrame to generate interactive maps powered by folium.

Key Features of explore()

Here are some key capabilities of the explore() method:

  1. Interactive Map Display
    • Calling explore() on a GeoDataFrame opens an interactive map directly within a Jupyter notebook.
    • Users can pan, zoom, and explore the spatial features (points, lines, or polygons) on the map.
  2. Layer Control
    • Geometries are automatically added as layers, each styled based on its type.
    • You can view and toggle different layers for enhanced exploration.
  3. Tooltips and Attribute Popups
    • Hovering over a feature reveals tooltip information, typically drawn from attribute columns in the data.
    • This makes it easy to inspect the properties of individual geographic features.
  4. Search and Filter
    • Built-in search and filter tools allow users to explore specific attribute values or subsets of data directly on the map.
  5. Customization Options
    • The explore() method supports various customization options (e.g., color, column selection, tooltip formatting), enabling you to tailor the map for your audience or use case.

This method is particularly useful for creating web-friendly, interactive maps with minimal code—ideal for exploratory data analysis, public dashboards, or teaching materials.

Code
import folium
Code
# use the geopandas explore default settings
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

chicago.explore()
Code
# Customize the explore settings
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

m = chicago.explore(
    column="POP2010",  # make choropleth based on "POP2010" column
    scheme="naturalbreaks",  # use mapclassify's natural breaks scheme
    legend=True,  # show legend
    k=10,  # use 10 bins
    tooltip=False,  # hide tooltip
    popup=["POP2010", "POP2000"],  # show popup (on-click)
    legend_kwds=dict(colorbar=False),  # do not use colorbar
    name="chicago",  # name of the layer in the map
)

m

The explore() method returns a folium.Map object, which can also be passed into another explore() call (much as you pass ax to plot()). You can then use folium functionality directly on the resulting map. Next, let’s add the Divvy stations to the map.

Code
type(m)
folium.folium.Map
1.2.5.1 Adding the Divvy Stations to the Interactive Folium Map
Code
# Helper function for adding the description to the station
def row_to_html(row):
    row_df = pd.DataFrame(row).T
    row_df.columns = [col.capitalize() for col in row_df.columns]
    return row_df.to_html(index=False)
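A quick standalone check of this helper (restated here for self-containment) on a hypothetical station row:

```python
import pandas as pd

def row_to_html(row):
    # Turn a Series (one row) into a one-row DataFrame and render it as HTML
    row_df = pd.DataFrame(row).T
    row_df.columns = [col.capitalize() for col in row_df.columns]
    return row_df.to_html(index=False)

row = pd.Series({'title': 'Streeter Dr & Grand Ave', 'count': 9282})
html = row_to_html(row)
# html is a one-row HTML <table> with capitalized headers 'Title' and 'Count'
```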
Code
data[['start_station_name', 'start_station_median_lat', 'start_station_median_lng']]
start_station_name start_station_median_lat start_station_median_lng
0 Streeter Dr & Grand Ave 41.892276 -87.612041
1 Sheridan Rd & Greenleaf Ave 42.010592 -87.662428
2 Streeter Dr & Grand Ave 41.892276 -87.612041
3 Streeter Dr & Grand Ave 41.892276 -87.612041
4 Larrabee St & Division St 41.903535 -87.643344
... ... ... ...
598365 Clarendon Ave & Leland Ave 41.967942 -87.650045
598366 Wabash Ave & Roosevelt Rd 41.867179 -87.625985
598367 DuSable Lake Shore Dr & Belmont Ave 41.940740 -87.639154
598368 Green St & Washington Blvd 41.883201 -87.648782
598369 Morgan St & 31st St 41.837776 -87.651147

598370 rows × 3 columns

Code
# Extracting the latitude, longitude, and station name for plotting, and also counting the number of trips from each station
grouped_df = data.groupby(['start_station_name', 'start_station_median_lat', 'start_station_median_lng'])['ride_id'].count().reset_index()
display(grouped_df.sort_values('ride_id', ascending=False).head())
grouped_df.rename(columns={'start_station_name':'title', 'start_station_median_lat':'latitude', 'start_station_median_lng':'longitude', 'ride_id':'count'}, inplace=True)
grouped_df['description'] = grouped_df.apply(lambda row: row_to_html(row), axis=1)
geometry = gpd.points_from_xy(grouped_df['longitude'], grouped_df['latitude'])
geo_df = gpd.GeoDataFrame(grouped_df, geometry=geometry)
# Optional: Assign Coordinate Reference System (CRS)
geo_df.crs = "EPSG:4326"  # WGS84 coordinate system
geo_df
start_station_name start_station_median_lat start_station_median_lng ride_id
1217 Streeter Dr & Grand Ave 41.892276 -87.612041 9282
271 DuSable Lake Shore Dr & Monroe St 41.881059 -87.616777 6577
1314 Wilton Ave & Belmont Ave 41.940116 -87.652973 5845
1235 University Ave & 57th St 41.791475 -87.599920 5196
478 LaSalle St & Illinois St 41.890853 -87.631665 4937
title latitude longitude count description geometry
0 2112 W Peterson Ave 41.991168 -87.683617 119 <table border="1" class="dataframe">\n <thead... POINT (-87.68362 41.99117)
1 21st St & Pulaski Rd 41.853457 -87.724529 19 <table border="1" class="dataframe">\n <thead... POINT (-87.72453 41.85346)
2 63rd St Beach 41.780958 -87.576320 174 <table border="1" class="dataframe">\n <thead... POINT (-87.57632 41.78096)
3 900 W Harrison St 41.874755 -87.649817 882 <table border="1" class="dataframe">\n <thead... POINT (-87.64982 41.87476)
4 Aberdeen St & Jackson Blvd 41.877773 -87.654831 1471 <table border="1" class="dataframe">\n <thead... POINT (-87.65483 41.87777)
... ... ... ... ... ... ...
1333 Woodlawn Ave & 58th St 41.789404 -87.596487 987 <table border="1" class="dataframe">\n <thead... POINT (-87.59649 41.7894)
1334 Woodlawn Ave & 75th St 41.759160 -87.595751 10 <table border="1" class="dataframe">\n <thead... POINT (-87.59575 41.75916)
1335 Woodlawn Ave & Lake Park Ave 41.814078 -87.597017 218 <table border="1" class="dataframe">\n <thead... POINT (-87.59702 41.81408)
1336 Yates Blvd & 75th St 41.758705 -87.566409 32 <table border="1" class="dataframe">\n <thead... POINT (-87.56641 41.75871)
1337 Yates Blvd & 93rd St 41.726178 -87.566361 4 <table border="1" class="dataframe">\n <thead... POINT (-87.56636 41.72618)

1338 rows × 6 columns

You can add a hover tooltip for each point on your Folium map. The tooltip appears when you hover over a marker, providing additional information without needing to click. Here’s how to modify the existing code to include hover tooltips:

Code
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

m = chicago.explore(
    column="POP2010",  # make choropleth based on "POP2010" column
    scheme="naturalbreaks",  # use mapclassify's natural breaks scheme
    legend=True,  # show legend
    k=10,  # use 10 bins
    tooltip=False,  # hide tooltip
    popup=["POP2010", "POP2000"],  # show popup (on-click)
    legend_kwds=dict(colorbar=False),  # do not use colorbar
    name="chicago",  # name of the layer in the map
)

geo_df.explore(
    m=m,  # pass the map object
    color="red",  # use red color on all points
    marker_kwds=dict(radius=5, fill=True),  # make marker radius 5px with fill
    tooltip="description",  # show the "description" column in the tooltip
    tooltip_kwds=dict(labels=False),  # do not show column label in the tooltip
    name="divstation",  # name of the layer in the map
)
 
m
1.2.5.2 Assigning station sizes based on popularity

Next, we will categorize the number of trips associated with each station into 10 bins and assign marker sizes accordingly, so that a station’s size visually indicates its popularity. Hovering over a station marker will display the station name, and clicking it will show additional station information.

To do this, we use folium.CircleMarker, a flexible folium class for drawing circle markers on interactive maps. It lets you represent geospatial data points visually and attach additional information through tooltips and popups, enhancing the interactivity of your map visualizations.

Code

def add_bins(df):
    # Create 10 evenly spaced edges over the 'count' range; np.digitize then
    # assigns each count a bin index from 1 to 10
    bins = np.linspace(df['count'].min(), df['count'].max(), 10)
    df['bin'] = np.digitize(df['count'], bins)

    # Assign a color to each bin (any matplotlib colormap works here)
    cmap = plt.get_cmap('viridis')
    df['color'] = df['bin'].apply(lambda x: cmap(x / len(bins)))

    # Assign marker sizes that grow with the bin number
    df['size'] = df['bin'].apply(lambda x: 2 + x)  # adjust the size calculation as needed
    return df
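The binning step can be verified on toy counts: with 10 edges from np.linspace, np.digitize yields indices 1 through 10, because the maximum value sits on the last edge and therefore lands past it. A small sketch with hypothetical trip counts:

```python
import numpy as np

# Hypothetical trip counts per station
counts = np.array([4, 500, 9282])

# 10 evenly spaced edges over the range of counts
edges = np.linspace(counts.min(), counts.max(), 10)

# With the default right=False, each index equals the number of edges <= the value:
# the minimum falls in bin 1, the maximum beyond the last edge in bin 10
bins = np.digitize(counts, edges)
# → array([ 1,  1, 10])
```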
Code
geo_df  = add_bins(geo_df)
geo_df
title latitude longitude count description geometry bin color size
0 2112 W Peterson Ave 41.991168 -87.683617 119 <table border="1" class="dataframe">\n <thead... POINT (-87.68362 41.99117) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1 21st St & Pulaski Rd 41.853457 -87.724529 19 <table border="1" class="dataframe">\n <thead... POINT (-87.72453 41.85346) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
2 63rd St Beach 41.780958 -87.576320 174 <table border="1" class="dataframe">\n <thead... POINT (-87.57632 41.78096) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
3 900 W Harrison St 41.874755 -87.649817 882 <table border="1" class="dataframe">\n <thead... POINT (-87.64982 41.87476) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
4 Aberdeen St & Jackson Blvd 41.877773 -87.654831 1471 <table border="1" class="dataframe">\n <thead... POINT (-87.65483 41.87777) 2 (0.253935, 0.265254, 0.529983, 1.0) 4
... ... ... ... ... ... ... ... ... ...
1333 Woodlawn Ave & 58th St 41.789404 -87.596487 987 <table border="1" class="dataframe">\n <thead... POINT (-87.59649 41.7894) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1334 Woodlawn Ave & 75th St 41.759160 -87.595751 10 <table border="1" class="dataframe">\n <thead... POINT (-87.59575 41.75916) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1335 Woodlawn Ave & Lake Park Ave 41.814078 -87.597017 218 <table border="1" class="dataframe">\n <thead... POINT (-87.59702 41.81408) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1336 Yates Blvd & 75th St 41.758705 -87.566409 32 <table border="1" class="dataframe">\n <thead... POINT (-87.56641 41.75871) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1337 Yates Blvd & 93rd St 41.726178 -87.566361 4 <table border="1" class="dataframe">\n <thead... POINT (-87.56636 41.72618) 1 (0.282623, 0.140926, 0.457517, 1.0) 3

1338 rows × 9 columns

Code


m = folium.Map(location=[geo_df.geometry.y.mean(), geo_df.geometry.x.mean()], zoom_start=13)
# Add a circle marker for each station, sized by its popularity bin
for idx, row in geo_df.iterrows():
    folium.CircleMarker(location=[row.geometry.y, row.geometry.x],
                        radius=row['size'],
                        color='red',
                        fill=True,
                        fill_color='red',
                        popup=row['description'],
                        tooltip = row['title']).add_to(m)

m

2 Independent Practice

Download the trip data from May 2023 to May 2024 from the Divvy website and conduct a geospatial analysis. Your tasks include:

  • Imputing missing values where possible.
  • Removing invalid trips.
  • Calculating the percentage of trips where the bike was picked up and returned to the same station. Also, determine the percentage of these trips classified as commuting versus leisure.
  • Identifying the top 5 stations with the highest number of trips and checking whether they are located in the communities with the most Divvy stations.
  • Plotting the top 5 popular stations on a map, highlighting those with the highest number of associated trips.

3 References