Working with Geospatial datasets in Python

Geospatial plotting is a powerful technique for uncovering spatial patterns, identifying regional disparities, and communicating insights through interactive maps. It is an essential skill for data scientists working with location-based data.

This lesson introduces geospatial analysis and visualization using Python, focusing on the Chicago Divvy bike-sharing dataset as a case study.

This project was inspired by an MCDC practicum course taught in Winter 2022 by Dr. Arend Kuyper. In collaboration with faculty liaison Dr. Amanda Stathopoulos, the course partnered with the Chicago-based nonprofit organization Equiticity, represented by community partner Oboi Reed. Students worked in teams to deliver on Equiticity’s Request for Data Science Services through MCDC. Results from this and other MCDC student–community partnership projects are summarized on the MCDC Past Projects website. MCDC was funded by the U.S. National Science Foundation.

This lesson was developed by Dr. Lizhen Shi for MCDC pathway courses.

1 Guided Practice

1.1 Data Cleaning and Preparation

Real-world datasets often require cleaning before analysis. In this section, we prepare the Divvy bike-sharing dataset for geospatial plotting by addressing missing values, standardizing station coordinates, and removing invalid trips.

1.1.1 Import Libraries

Let’s start by importing the necessary Python libraries.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

1.1.2 Load the Dataset

We’ll use trip data from May 2024, available on the Divvy website. This dataset contains records of individual bike trips, including geospatial coordinates.

Code
# Load the data
divvy_0524 = pd.read_csv('202405-divvy-tripdata.csv')
print(divvy_0524.shape)
divvy_0524.head()
(609493, 13)
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 7D9F0CE9EC2A1297 classic_bike 2024-05-25 15:52:42 2024-05-25 16:11:50 Streeter Dr & Grand Ave 13022 Clark St & Elm St TA1307000039 41.892278 -87.612043 41.902973 -87.631280 casual
1 02EC47687411416F classic_bike 2024-05-14 15:11:51 2024-05-14 15:22:00 Sheridan Rd & Greenleaf Ave KA1504000159 Sheridan Rd & Loyola Ave RP-009 42.010587 -87.662412 42.001044 -87.661198 casual
2 101370FB2D3402BE classic_bike 2024-05-30 17:46:04 2024-05-30 18:09:16 Streeter Dr & Grand Ave 13022 Wabash Ave & 9th St TA1309000010 41.892278 -87.612043 41.870769 -87.625734 member
3 E97E396331ED6913 electric_bike 2024-05-17 20:21:54 2024-05-17 20:40:32 Streeter Dr & Grand Ave 13022 Sheffield Ave & Wellington Ave TA1307000052 41.892270 -87.611946 41.936253 -87.652662 member
4 674EDE311C543165 classic_bike 2024-05-22 18:52:20 2024-05-22 18:59:04 Larrabee St & Division St KA1504000079 Clark St & Elm St TA1307000039 41.903486 -87.643353 41.902973 -87.631280 casual

The dataset has 609,493 rows and 13 columns, each representing a bike trip. Here’s a breakdown of the columns:

Field name Data type Description
ride_id String Unique ID for each ride
rideable_type String Bike types
started_at Timestamp Trip start day and time
ended_at Timestamp Trip end day and time
start_station_name String Trip start station
start_station_id String ID of the start station
end_station_name String Trip end station
end_station_id String ID of the end station
start_lat Float Latitude of the start station
start_lng Float Longitude of the start station
end_lat Float Latitude of the end station
end_lng Float Longitude of the end station
member_casual String Rider type

The geospatial-related columns are:

  • start_lat and start_lng represent the coordinates where a bike is picked up.
  • end_lat and end_lng represent the coordinates where a bike is returned.
  • start_station_name is the name of the trip’s start station.
  • start_station_id is the ID of the start station.
  • end_station_name is the name of the trip’s end station.
  • end_station_id is the ID of the end station.

1.1.3 Handling Missing Values

Let’s check for missing values within the dataset.

Code
divvy_0524.isnull().sum()
ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    109048
start_station_id      109048
end_station_name      112731
end_station_id        112731
start_lat                  0
start_lng                  0
end_lat                  784
end_lng                  784
member_casual              0
dtype: int64

There are 784 observations missing end_lat and end_lng. Let’s check whether any of these rows have an end station name listed, which would let us impute the missing coordinates from the station name.

Code
# check whether rows missing end_lat or end_lng still have an end_station_name
divvy_0524[(divvy_0524['end_station_name'].notnull()) & ((divvy_0524['end_lat'].isnull()) | (divvy_0524['end_lng'].isnull()))]
 
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual

The result is an empty DataFrame, indicating that rows missing end_lat and end_lng also lack end_station_name and end_station_id. Since imputation isn’t possible, we’ll drop these rows:

Code
# drop rows missing end_station_name together with end_lat or end_lng
divvy_0524 = divvy_0524[~((divvy_0524['end_station_name'].isnull()) & ((divvy_0524['end_lat'].isnull()) | (divvy_0524['end_lng'].isnull())))]

# recheck the null values
divvy_0524.isnull().sum()
ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    109048
start_station_id      109048
end_station_name      111947
end_station_id        111947
start_lat                  0
start_lng                  0
end_lat                    0
end_lng                    0
member_casual              0
dtype: int64

We successfully dropped the instances where both end_lat and end_lng are missing. Next, let’s explore the mapping between station names and their coordinates for our missing value imputation.

Code
locations = pd.concat([divvy_0524[['start_station_name', 'start_lat', 'start_lng']].drop_duplicates().rename(columns={'start_station_name': 'station_name', 'start_lat': 'lat', 'start_lng': 'lng'}),
                        divvy_0524[['end_station_name', 'end_lat', 'end_lng']].drop_duplicates().rename(columns={'end_station_name': 'station_name', 'end_lat': 'lat', 'end_lng': 'lng'})])

locations = locations.drop_duplicates()
print(locations.shape)
locations.head()
(195676, 3)
station_name lat lng
0 Streeter Dr & Grand Ave 41.892278 -87.612043
1 Sheridan Rd & Greenleaf Ave 42.010587 -87.662412
3 Streeter Dr & Grand Ave 41.892270 -87.611946
4 Larrabee St & Division St 41.903486 -87.643353
5 Sheridan Rd & Greenleaf Ave 42.010571 -87.662456
Code
# count the distinct station names
locations['station_name'].unique().size
1361

The 195,676 unique coordinate pairs map to 1,361 station names, an average of roughly 144 coordinates per station. Next, let’s find out which station has the most coordinates.

Code
locations.groupby('station_name').size().sort_values(ascending=False).head(10)
station_name
Streeter Dr & Grand Ave              1831
Clinton St & Washington Blvd         1550
Michigan Ave & Oak St                1372
DuSable Lake Shore Dr & Monroe St    1307
Kingsbury St & Kinzie St             1304
Wells St & Concord Ln                1269
Clark St & Elm St                    1264
Clinton St & Madison St              1223
Wells St & Elm St                    1220
Sheffield Ave & Waveland Ave         1190
dtype: int64

1.1.4 Standardizing Station Coordinates

Divvy stations can have multiple coordinates for the same station due to GPS variations. For geospatial plotting, we’ll use a single coordinate per station by calculating the median latitude and longitude:

Code
# Calculate median lat and lng for each station_name
median_locations = locations.groupby('station_name').agg({'lat': 'median', 'lng': 'median'}).reset_index()
median_locations.rename(columns={'lat': 'median_lat', 'lng': 'median_lng'}, inplace=True)
print(median_locations.shape)
median_locations.head()
(1360, 3)
station_name median_lat median_lng
0 2112 W Peterson Ave 41.991168 -87.683617
1 21st St & Pulaski Rd 41.853457 -87.724529
2 63rd St Beach 41.780958 -87.576320
3 900 W Harrison St 41.874755 -87.649817
4 Aberdeen St & Jackson Blvd 41.877773 -87.654831

If two coordinates are very close, it’s likely that they belong to the same Divvy station. For the remaining missing values, let’s use the nearest neighbor’s station name based on coordinates to fill in the missing names.

Code
# define the nearest-neighbor helper

from sklearn.neighbors import NearestNeighbors

def make_nn(df):
    # Ensure the DataFrame has the necessary columns
    if not all(col in df.columns for col in ['median_lat', 'median_lng']):
        raise ValueError("DataFrame must contain 'median_lat' and 'median_lng' columns")
    
    # Extract the coordinates from the DataFrame
    coordinates = df[['median_lat', 'median_lng']].values
    
    # Create and fit the NearestNeighbors model
    nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(coordinates)
    
    return nbrs

nbrs = make_nn(median_locations)

def find_nearest_station(nbrs, df, lat, lng):
    
    # Find the nearest neighbor to the given (lat, lng)
    distance, index = nbrs.kneighbors([[lat, lng]])
    
    # Get the nearest station's information
    nearest_station = df.iloc[index[0][0]]['station_name']
    
    return nearest_station
Code
# check whether the method can return the nearest station name
find_nearest_station(nbrs, median_locations, 41.892278, -87.612043)
'Streeter Dr & Grand Ave'
Code
divvy_0524 = divvy_0524.reset_index(drop=True)
Code
# use the nearest neighbor method to impute the missing values in the `start_station_name` and `end_station_name` in the dataset
from tqdm import tqdm
# impute start_station_name
for i in tqdm(range(len(divvy_0524))):
    if pd.isnull(divvy_0524.loc[i, 'start_station_name']):
        divvy_0524.loc[i, 'start_station_name'] = find_nearest_station(nbrs, median_locations, divvy_0524.loc[i, 'start_lat'], divvy_0524.loc[i, 'start_lng'])
    # impute end_station_name
    if pd.isnull(divvy_0524.loc[i, 'end_station_name']):
        divvy_0524.loc[i, 'end_station_name'] = find_nearest_station(nbrs, median_locations, divvy_0524.loc[i, 'end_lat'], divvy_0524.loc[i, 'end_lng'])
100%|██████████| 608709/608709 [05:57<00:00, 1701.02it/s] 
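The row-by-row loop above takes about six minutes. As an aside, the same imputation can be done with a single vectorized kneighbors call. Here is a minimal sketch with toy stand-ins for median_locations and the trip table (the small DataFrames below are illustrative, not the real data):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for median_locations and the trip table (illustrative only)
stations = pd.DataFrame({
    'station_name': ['A', 'B'],
    'median_lat': [41.89, 42.01],
    'median_lng': [-87.61, -87.66],
})
trips = pd.DataFrame({
    'start_station_name': [None, 'B'],
    'start_lat': [41.891, 42.010],
    'start_lng': [-87.612, -87.660],
})

nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(
    stations[['median_lat', 'median_lng']].values)

# Query every missing row in a single call instead of looping
mask = trips['start_station_name'].isnull()
_, idx = nbrs.kneighbors(trips.loc[mask, ['start_lat', 'start_lng']].values)
trips.loc[mask, 'start_station_name'] = stations['station_name'].to_numpy()[idx.ravel()]
print(trips['start_station_name'].tolist())  # ['A', 'B']
```

Batching the queries this way avoids Python-level loop overhead and one kneighbors call per row.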

The assert statement in Python is used as a debugging aid that tests a condition. If the condition is True, the program continues to execute normally. If the condition is False, an AssertionError is raised, and the program stops executing. assert is commonly used to check assumptions made by the program and to catch bugs early in the development process.
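A minimal illustration of both outcomes:

```python
assert 1 + 1 == 2  # condition is True: execution continues silently

# condition is False: an AssertionError is raised with the optional message
try:
    assert 1 + 1 == 3, "arithmetic is broken"
    raised = False
except AssertionError as err:
    raised = True
    message = str(err)
print(raised, message)  # True arithmetic is broken
```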

In this case, after the imputation, there should be no missing values in the start_station_name and end_station_name columns.

Code
# use assert to check the imputation results
assert divvy_0524['start_station_name'].isnull().sum() == 0
assert divvy_0524['end_station_name'].isnull().sum() == 0

Let’s merge the median_lat and median_lng columns into the Divvy dataset for plotting.

Code
divvy_0524 = pd.merge(divvy_0524, median_locations, left_on='start_station_name', right_on='station_name', how='left')
divvy_0524 = divvy_0524.drop(columns=['station_name'])
divvy_0524.rename(columns={'median_lat': 'start_station_median_lat', 'median_lng': 'start_station_median_lng'}, inplace=True)

divvy_0524 = pd.merge(divvy_0524, median_locations, left_on='end_station_name', right_on='station_name', how='left')
divvy_0524 = divvy_0524.drop(columns=['station_name'])
divvy_0524.rename(columns={'median_lat': 'end_station_median_lat', 'median_lng': 'end_station_median_lng'}, inplace=True)

1.1.5 Data Cleaning: Trip Distance Exploration

Trip distance is a critical metric, even though it isn’t an original feature in the dataset. Fortunately, we have the start and end coordinates for each trip, allowing us to calculate distances based on them.

Analyzing trip distances in the Divvy dataset is essential for several reasons, including understanding user behavior, assessing system efficiency, and evaluating the overall performance of the bike-sharing service.

To calculate the trip distance using the start and end coordinates, we can employ the Haversine formula, which determines the great-circle distance between two points on the Earth’s surface based on their latitude and longitude. Alternatively, we can use the geopy library in Python, which provides a straightforward interface for geospatial calculations.
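With $\varphi$ denoting latitude, $\lambda$ longitude, and $R$ the Earth’s radius, the Haversine formula computes the great-circle distance $d$ as:

$$
a = \sin^2\!\left(\tfrac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\tfrac{\Delta\lambda}{2}\right)
$$

$$
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\,\sqrt{1-a}\right), \qquad d = R\,c
$$

The Python implementation that follows mirrors these terms (dlat is $\Delta\varphi$, dlon is $\Delta\lambda$).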

Here is an implementation of the Haversine formula in Python:

Code
def haversine(lat1, lon1, lat2, lon2):
    # Convert degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    
    # Compute differences
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    
    # Haversine formula
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    
    # Radius of Earth in kilometers and conversion to miles
    R_km = 6371.0
    R_mi = R_km * 0.621371
    
    # Compute distance
    distance = R_mi * c
    
    return distance

divvy_0524['distance'] = haversine(divvy_0524['start_lat'], divvy_0524['start_lng'], divvy_0524['end_lat'], divvy_0524['end_lng'])
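As a quick sanity check on the function, one degree of latitude corresponds to roughly 69 miles anywhere on Earth (the function is restated here so the snippet runs on its own):

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # Same formula as above, repeated so this snippet is self-contained
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return 6371.0 * 0.621371 * c  # Earth radius in km, converted to miles

# One degree of latitude should be roughly 69 miles
print(round(haversine(41.0, -87.6, 42.0, -87.6), 1))  # 69.1
```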
Code
# plot the histogram to check the distribution
divvy_0524['distance'].hist(bins=100)

Code
# check the statistics of the distance
divvy_0524['distance'].describe()
count    608709.000000
mean          1.386598
std           1.267535
min           0.000000
25%           0.556175
50%           1.020599
75%           1.825678
max          20.840069
Name: distance, dtype: float64
Code
# bin the distance into 4 groups: 0-0.99, 1-1.99, 2-4.99, 5-max
divvy_0524['distance_bins'] = pd.cut(divvy_0524['distance'], bins=[0, 0.99, 1.99, 4.99, divvy_0524['distance'].max()], labels=['0-0.99', '1-1.99', '2-4.99', '5-max'])


# plot the distance_bins with a title
divvy_0524['distance_bins'].value_counts().plot(kind='bar', title='Distance Bins')
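Note that pd.cut uses right-closed, left-open intervals by default, so a distance of exactly 0 falls outside the first bin and is labeled NaN; this is why zero-distance trips show NaN in distance_bins. A small illustration with made-up values:

```python
import pandas as pd

s = pd.Series([0.0, 0.5, 1.5, 6.0])
bins = pd.cut(s, bins=[0, 0.99, 1.99, 4.99, 21],
              labels=['0-0.99', '1-1.99', '2-4.99', '5-max'])
print(bins.tolist())  # [nan, '0-0.99', '1-1.99', '5-max']
```

Passing include_lowest=True to pd.cut would place 0.0 in the first bin instead.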

Code
# check the percentage of value counts of the distance_bins
divvy_0524['distance_bins'].value_counts(normalize=True)
distance_bins
0-0.99    0.452076
1-1.99    0.311665
2-4.99    0.213208
5-max     0.023052
Name: proportion, dtype: float64

Next, let’s determine the percentage of trips that start and end at the same station, indicating a distance of zero. These trips can fall into two categories:

  • Valid trips: Trips where bikes are picked up and returned to the same station.
  • Invalid trips: These are likely noise in the data, such as when someone unlocks a bike but then decides not to rent it. In such scenarios, the trip duration is also very short.

For the second case, we need to remove the noise before proceeding with our analysis.

Code
# show the distance equals to 0
divvy_0524[divvy_0524['distance'] == 0]
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual start_station_median_lat start_station_median_lng end_station_median_lat end_station_median_lng distance distance_bins
11 95CA09985B75FA22 classic_bike 2024-05-27 18:14:19 2024-05-27 19:09:30 McClurg Ct & Ohio St TA1306000029 McClurg Ct & Ohio St TA1306000029 41.892592 -87.617289 41.892592 -87.617289 member 41.892658 -87.617321 41.892658 -87.617321 0.0 NaN
13 A08FB7FA82FCD72E classic_bike 2024-05-09 17:38:03 2024-05-09 17:41:37 Indiana Ave & 26th St TA1307000005 Indiana Ave & 26th St TA1307000005 41.845687 -87.622481 41.845687 -87.622481 member 41.845708 -87.622545 41.845708 -87.622545 0.0 NaN
14 4E578174D37FBFE5 classic_bike 2024-05-09 17:42:22 2024-05-09 17:42:49 Indiana Ave & 26th St TA1307000005 Indiana Ave & 26th St TA1307000005 41.845687 -87.622481 41.845687 -87.622481 member 41.845708 -87.622545 41.845708 -87.622545 0.0 NaN
15 80903806220B461B classic_bike 2024-05-28 19:42:27 2024-05-28 20:22:45 McClurg Ct & Ohio St TA1306000029 McClurg Ct & Ohio St TA1306000029 41.892592 -87.617289 41.892592 -87.617289 member 41.892658 -87.617321 41.892658 -87.617321 0.0 NaN
16 855A39B703AEEF8D classic_bike 2024-05-31 12:02:35 2024-05-31 12:02:58 Stetson Ave & South Water St TA1308000029 Stetson Ave & South Water St TA1308000029 41.886835 -87.622320 41.886835 -87.622320 member 41.886770 -87.622422 41.886770 -87.622422 0.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
608122 8BFD4D2FF581C242 electric_bike 2024-05-21 19:36:07 2024-05-21 19:41:15 University Ave & 57th St NaN University Ave & 57th St NaN 41.790000 -87.600000 41.790000 -87.600000 casual 41.791475 -87.599920 41.791475 -87.599920 0.0 NaN
608123 764B4FDBDFCBC442 electric_bike 2024-05-22 14:34:41 2024-05-22 14:56:53 DuSable Lake Shore Dr & Belmont Ave NaN DuSable Lake Shore Dr & Belmont Ave NaN 41.940000 -87.640000 41.940000 -87.640000 casual 41.940740 -87.639154 41.940740 -87.639154 0.0 NaN
608134 5D055CB420A4FE10 classic_bike 2024-05-21 22:10:52 2024-05-21 23:11:42 Avenue O & 134th St 20214 Avenue O & 134th St 20214 41.651799 -87.540696 41.651799 -87.540696 casual 41.651881 -87.539622 41.651881 -87.539622 0.0 NaN
608140 AF072EB6C20AD421 electric_bike 2024-05-21 19:02:16 2024-05-21 19:02:25 Dearborn Pkwy & Delaware Pl NaN Dearborn Pkwy & Delaware Pl NaN 41.900000 -87.630000 41.900000 -87.630000 casual 41.898946 -87.629925 41.898946 -87.629925 0.0 NaN
608145 87452852EC07FE5F electric_bike 2024-05-22 18:01:00 2024-05-22 18:01:44 DuSable Lake Shore Dr & Monroe St NaN DuSable Lake Shore Dr & Monroe St NaN 41.880000 -87.620000 41.880000 -87.620000 casual 41.881059 -87.616777 41.881059 -87.616777 0.0 NaN

36084 rows × 19 columns

Among these trips, we will identify those with a duration of at most 1 minute and a trip distance of 0. Let’s filter out these anomalies.

Code
# set the start and end time to datetime
divvy_0524['started_at'] = pd.to_datetime(divvy_0524['started_at'])
divvy_0524['ended_at'] = pd.to_datetime(divvy_0524['ended_at'])

# create a new column duration in minutes
divvy_0524['duration'] = (divvy_0524['ended_at'] - divvy_0524['started_at']).dt.total_seconds() / 60
Code
# extract trips whose duration is at most 1 minute and whose distance is 0
divvy_0524[(divvy_0524['duration'] <= 1) & (divvy_0524['distance'] == 0)]
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual start_station_median_lat start_station_median_lng end_station_median_lat end_station_median_lng distance distance_bins duration
14 4E578174D37FBFE5 classic_bike 2024-05-09 17:42:22 2024-05-09 17:42:49 Indiana Ave & 26th St TA1307000005 Indiana Ave & 26th St TA1307000005 41.845687 -87.622481 41.845687 -87.622481 member 41.845708 -87.622545 41.845708 -87.622545 0.0 NaN 0.450000
16 855A39B703AEEF8D classic_bike 2024-05-31 12:02:35 2024-05-31 12:02:58 Stetson Ave & South Water St TA1308000029 Stetson Ave & South Water St TA1308000029 41.886835 -87.622320 41.886835 -87.622320 member 41.886770 -87.622422 41.886770 -87.622422 0.0 NaN 0.383333
262 9F76CEEAA6B34C0B classic_bike 2024-05-12 16:04:57 2024-05-12 16:05:15 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.892278 -87.612043 41.892278 -87.612043 member 41.892276 -87.612041 41.892276 -87.612041 0.0 NaN 0.300000
631 32B4023E9923173E classic_bike 2024-05-18 22:37:50 2024-05-18 22:38:16 Ashland Ave & Blackhawk St 13224 Ashland Ave & Blackhawk St 13224 41.907066 -87.667252 41.907066 -87.667252 member 41.907060 -87.667226 41.907060 -87.667226 0.0 NaN 0.433333
632 408786CFCC0E004F classic_bike 2024-05-23 13:39:18 2024-05-23 13:39:52 Cornell Ave & Hyde Park Blvd KA1503000007 Cornell Ave & Hyde Park Blvd KA1503000007 41.802406 -87.586924 41.802406 -87.586924 member 41.802421 -87.586920 41.802421 -87.586920 0.0 NaN 0.566667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
608112 35F4266BF2D0AB22 electric_bike 2024-05-22 18:12:38 2024-05-22 18:12:40 DuSable Lake Shore Dr & Monroe St NaN DuSable Lake Shore Dr & Monroe St NaN 41.880000 -87.620000 41.880000 -87.620000 casual 41.881059 -87.616777 41.881059 -87.616777 0.0 NaN 0.033333
608117 DF9291196964FB93 electric_bike 2024-05-22 07:35:36 2024-05-22 07:35:48 Green St & Madison St NaN Green St & Madison St NaN 41.880000 -87.650000 41.880000 -87.650000 casual 41.881861 -87.649189 41.881861 -87.649189 0.0 NaN 0.200000
608120 2D857C9FCF2434CF electric_bike 2024-05-21 19:40:39 2024-05-21 19:41:23 Sedgwick St & Webster Ave NaN Sedgwick St & Webster Ave NaN 41.920000 -87.640000 41.920000 -87.640000 casual 41.922176 -87.638982 41.922176 -87.638982 0.0 NaN 0.733333
608140 AF072EB6C20AD421 electric_bike 2024-05-21 19:02:16 2024-05-21 19:02:25 Dearborn Pkwy & Delaware Pl NaN Dearborn Pkwy & Delaware Pl NaN 41.900000 -87.630000 41.900000 -87.630000 casual 41.898946 -87.629925 41.898946 -87.629925 0.0 NaN 0.150000
608145 87452852EC07FE5F electric_bike 2024-05-22 18:01:00 2024-05-22 18:01:44 DuSable Lake Shore Dr & Monroe St NaN DuSable Lake Shore Dr & Monroe St NaN 41.880000 -87.620000 41.880000 -87.620000 casual 41.881059 -87.616777 41.881059 -87.616777 0.0 NaN 0.733333

10339 rows × 20 columns

Let’s remove them from the Divvy dataset.

Code
# drop trips whose duration is at most 1 minute and whose distance is 0
divvy_0524 = divvy_0524[~((divvy_0524['duration'] <= 1) & (divvy_0524['distance'] == 0))]
divvy_0524.shape
(598370, 20)
Code
# check the percentage of valid trips that pick up and drop off at the same station
divvy_0524[divvy_0524['start_station_name'] == divvy_0524['end_station_name']].shape[0] / divvy_0524.shape[0]
0.06118956498487558

It turns out that only about 6% of the trips start and end at the same station.

Next, let’s examine the overall distribution of trip distances again.

Code
# check the distance statistics after removing the invalid trips
divvy_0524['distance'].describe()
count    598370.000000
mean          1.410557
std           1.265152
min           0.000000
25%           0.582939
50%           1.032578
75%           1.849402
max          20.840069
Name: distance, dtype: float64
Code
# plot the distance distribution of trips that pick up and drop off at different stations
divvy_0524[divvy_0524['start_station_name'] != divvy_0524['end_station_name']]['distance'].hist(bins=100)

As the histogram shows, the majority of trips are short, with distances under 5 miles.

Let’s save our cleaned data to a CSV file for the geospatial plotting that follows.

Code
divvy_0524.to_csv('202405-divvy-tripdata-cleaned.csv', index=False)

1.2 Geospatial Plotting

1.2.1 Python Packages for Geospatial Plotting

There are several widely used Python packages designed for geospatial plotting. In this lesson, we focus on two popular libraries:

  1. GeoPandas: Static Plotting
    • Extends the functionality of the pandas library to support spatial data types and geospatial operations.
    • Allows for easy reading, writing, and manipulation of spatial data, including shapefiles and GeoJSON formats.
    • Primarily used for data wrangling and creating basic static maps using matplotlib.
    • Integrates seamlessly with matplotlib, handling both tabular and vector-based geospatial data.
  2. Folium: Interactive Plotting
    • A powerful library for building interactive maps, built on top of the Leaflet.js framework.
    • Supports various map tiles and overlays, including GeoJSON layers for rendering vector data.
    • Ideal for creating web-based, zoomable, and clickable maps.

1.2.2 Static Plots with GeoPandas

A shapefile is a widely used format for storing the geometric location and attribute information of geographic features. Commonly used in GIS (Geographic Information Systems), shapefiles enable spatial data analysis and visualization.

1.2.2.1 📁 Components of a Shapefile

A shapefile actually consists of multiple files that share the same base name but have different extensions. Key components include:

  • .shp: Stores the geometry (points, lines, or polygons)
  • .shx: Index file for the geometry data
  • .dbf: Stores associated attribute data in tabular form (similar to a spreadsheet)
  • .prj: Contains coordinate system and projection information

1.2.2.2 🗺️ Geometry Types in Shapefiles

Shapefiles can represent different types of geographic features:

  • Points: e.g., Divvy stations, landmarks
  • Lines: e.g., roads, bike lanes
  • Polygons: e.g., neighborhood boundaries, parks

Each geometric feature can be linked to attribute data, such as names, types, or demographic values.

The .prj file is especially important for ensuring proper alignment of spatial features by defining the coordinate reference system.

1.2.3 Loading a Shapefile

We will use the Chicago community area shapefile to map the locations where Divvy stations are situated. This provides the foundation for joining spatial data with community-level attributes.

📍 The Chicago Community Area shapefile can be downloaded from the Chicago Data Portal.

Code
import geopandas as gpd

# Create figure and axis
fig, ax = plt.subplots(figsize=(15, 10))

# Plot your GeoDataFrame
chicago = gpd.read_file('chicago_boundaries/geo_export_26bce2f2-c163-42a9-9329-9ca6e082c5e9.shp')
chicago.plot(column='community', ax=ax, legend=True, legend_kwds={'ncol': 2, 'bbox_to_anchor': (2, 1)})

# Add title (optional)
plt.title('Chicago Community Areas')

# Show the plot
plt.show()

1.2.4 Adding the divvy stations to the plot

In this step, we’ll create a scatter plot to visualize the Divvy stations, where each point represents a station’s location.
We’ll use the median latitude and longitude of each station for positioning. Once the scatter plot is created, we’ll overlay it on top of the Chicago community area map to provide geographic context.

Code
# read the cleaned trip data into a pandas DataFrame
data = pd.read_csv('202405-divvy-tripdata-cleaned.csv')

# keep one row per start station with its median coordinates

start_stations = data[[ 'start_station_name', 'start_station_median_lat', 'start_station_median_lng']].drop_duplicates()

start_stations.head()
start_station_name start_station_median_lat start_station_median_lng
0 Streeter Dr & Grand Ave 41.892276 -87.612041
1 Sheridan Rd & Greenleaf Ave 42.010592 -87.662428
4 Larrabee St & Division St 41.903535 -87.643344
6 Damen Ave & Wellington Ave 41.935895 -87.678456
7 Aberdeen St & Monroe St 41.880372 -87.655610
Code
import geopandas as gpd
# Adding the stations to the plot
fig, ax = plt.subplots(figsize=(15, 10))

chicago = gpd.read_file('chicago_boundaries/geo_export_26bce2f2-c163-42a9-9329-9ca6e082c5e9.shp')
chicago.plot(column='community', ax=ax, legend=True, legend_kwds={'ncol': 2, 'bbox_to_anchor': (2, 1)})

plt.scatter(start_stations['start_station_median_lng'], start_stations['start_station_median_lat'], s=10, alpha=0.5, color='black', marker='o')
# data[['longitude_start', 'latitude_start']].drop_duplicates().plot(ax=ax, color='red', markersize=10, marker='o')

# Add title (optional)
plt.title('Chicago Community Areas')

# Show the plot
plt.show()

Let’s switch to a different Chicago boundary dataset, this one loaded from the geodatasets package.

Code
import geodatasets

chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

# Plot the data
fig, ax = plt.subplots(figsize=(15, 10))
chicago.boundary.plot(ax=ax)
plt.scatter(start_stations['start_station_median_lng'], start_stations['start_station_median_lat'], s=10, alpha=0.5, color='black', marker='o')
plt.title('Chicago Community Areas')
Text(0.5, 1.0, 'Chicago Community Areas')

There are 14 stations located outside the community map. According to the Divvy website, the system covers both the Chicago area and Evanston. Based on the map, it appears that these out-of-map stations are situated in Evanston.

Next, we will add the community names to the Divvy trip dataset and verify this information.

Code
from shapely.geometry import Point

def find_community(gdf, lat, long):
    """
    Given a GeoDataFrame with a geometry column of polygons and a community column,
    find the community a point (lat, long) belongs to.

    Parameters:
    - gdf (GeoDataFrame): The GeoDataFrame with polygons and a community column.
    - lat (float): The latitude of the point.
    - long (float): The longitude of the point.

    Returns:
    - str: The community the point belongs to, or None if no community is found.
    """
    point = Point(long, lat)  # Note that Point takes (longitude, latitude)
    
    for _, row in gdf.iterrows():
        if row['geometry'].contains(point):
            return row['community']
    
    return None
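The point-in-polygon test at the heart of find_community can be illustrated with a toy example (the two square “communities” below are made up):

```python
from shapely.geometry import Point, Polygon

# Two made-up square "communities" on a unit grid
communities = {
    'WEST': Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    'EAST': Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
}

def lookup(lat, lng):
    point = Point(lng, lat)  # Point takes (x, y) = (longitude, latitude)
    for name, poly in communities.items():
        if poly.contains(point):
            return name
    return None

print(lookup(0.5, 0.5), lookup(0.5, 1.5), lookup(0.5, 9.0))
# WEST EAST None
```

For large tables, geopandas provides a spatial join (sjoin) that performs this lookup for all points at once rather than iterating row by row.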
Code
start_stations['community'] = start_stations.apply(lambda row: find_community(chicago, row['start_station_median_lat'], row['start_station_median_lng']), axis=1)
start_stations.head()
start_station_name start_station_median_lat start_station_median_lng community
0 Streeter Dr & Grand Ave 41.892276 -87.612041 NEAR NORTH SIDE
1 Sheridan Rd & Greenleaf Ave 42.010592 -87.662428 ROGERS PARK
4 Larrabee St & Division St 41.903535 -87.643344 NEAR NORTH SIDE
6 Damen Ave & Wellington Ave 41.935895 -87.678456 NORTH CENTER
7 Aberdeen St & Monroe St 41.880372 -87.655610 NEAR WEST SIDE
Code
start_stations.isnull().sum()
start_station_name           0
start_station_median_lat     0
start_station_median_lng     0
community                   14
dtype: int64

Based on the Divvy data description and geospatial plotting, the 14 stations missing community information belong to Evanston. Let’s proceed with imputing the community data for Evanston.

Code
# fill the missing values in the column 'community' with 'Evanston'
start_stations['community'] = start_stations['community'].fillna('Evanston')

start_stations.isnull().sum()
start_station_name          0
start_station_median_lat    0
start_station_median_lng    0
community                   0
dtype: int64
Code
# save the start_stations to a csv file

start_stations.to_csv('start_stations_with_community.csv', index=False)

Once we add the community names to the Divvy dataset, let’s explore which community has the most trips associated with it.

Code
# add the community to the divvy set
data = data.merge(start_stations[['start_station_name', 'community']], on='start_station_name', how='left')

data.head()
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng ... end_lng member_casual start_station_median_lat start_station_median_lng end_station_median_lat end_station_median_lng distance distance_bins duration community
0 7D9F0CE9EC2A1297 classic_bike 2024-05-25 15:52:42 2024-05-25 16:11:50 Streeter Dr & Grand Ave 13022 Clark St & Elm St TA1307000039 41.892278 -87.612043 ... -87.631280 casual 41.892276 -87.612041 41.902838 -87.631606 1.234844 1-1.99 19.133333 NEAR NORTH SIDE
1 02EC47687411416F classic_bike 2024-05-14 15:11:51 2024-05-14 15:22:00 Sheridan Rd & Greenleaf Ave KA1504000159 Sheridan Rd & Loyola Ave RP-009 42.010587 -87.662412 ... -87.661198 casual 42.010592 -87.662428 42.001143 -87.661277 0.662281 0-0.99 10.150000 ROGERS PARK
2 101370FB2D3402BE classic_bike 2024-05-30 17:46:04 2024-05-30 18:09:16 Streeter Dr & Grand Ave 13022 Wabash Ave & 9th St TA1309000010 41.892278 -87.612043 ... -87.625734 member 41.892276 -87.612041 41.870736 -87.625740 1.644567 1-1.99 23.200000 NEAR NORTH SIDE
3 E97E396331ED6913 electric_bike 2024-05-17 20:21:54 2024-05-17 20:40:32 Streeter Dr & Grand Ave 13022 Sheffield Ave & Wellington Ave TA1307000052 41.892270 -87.611946 ... -87.652662 member 41.892276 -87.612041 41.936300 -87.652654 3.690219 2-4.99 18.633333 NEAR NORTH SIDE
4 674EDE311C543165 classic_bike 2024-05-22 18:52:20 2024-05-22 18:59:04 Larrabee St & Division St KA1504000079 Clark St & Elm St TA1307000039 41.903486 -87.643353 ... -87.631280 casual 41.903535 -87.643344 41.902838 -87.631606 0.621883 0-0.99 6.733333 NEAR NORTH SIDE

5 rows × 21 columns

Code
community_trip_counts = data['community'].value_counts().reset_index()
community_trip_counts.columns = ['community', 'trip_count']
community_trip_counts['log_trip_count'] = community_trip_counts['trip_count'].apply(lambda x: 0 if x == 0 else np.log(x))
community_trip_counts
community trip_count log_trip_count
0 NEAR NORTH SIDE 109137 11.600359
1 LINCOLN PARK 77608 11.259426
2 LOOP 66632 11.106940
3 LAKE VIEW 64769 11.078582
4 NEAR WEST SIDE 61773 11.031222
... ... ... ...
72 AVALON PARK 31 3.433987
73 EDISON PARK 26 3.258097
74 CALUMET HEIGHTS 24 3.178054
75 BURNSIDE 8 2.079442
76 RIVERDALE 4 1.386294

77 rows × 3 columns
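The zero-guard in the log transform above can also be avoided with `np.log1p`, which maps 0 to 0 by construction (note that `log1p(x) = log(1 + x)`, so values differ slightly from plain `np.log` for small counts). A quick sketch on hypothetical trip counts:

```python
import numpy as np
import pandas as pd

# Hypothetical trip counts, including a community with zero trips
counts = pd.Series({'NEAR NORTH SIDE': 109137, 'BURNSIDE': 8, 'NOWHERE': 0})

# log1p(x) = log(1 + x), so zero counts need no special case
log_counts = np.log1p(counts)
```

For large counts the difference from `np.log` is negligible, which is why either transform works for visual comparison.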

Next, let’s explore which community has the most stations.

Code
community_station_counts = data[['start_station_name', 'community']].drop_duplicates()['community'].value_counts().reset_index()
community_station_counts.columns = ['community', 'station_count']
community_station_counts.head()
community station_count
0 NEAR WEST SIDE 64
1 NEAR NORTH SIDE 55
2 WEST TOWN 47
3 AUSTIN 42
4 LOOP 40
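The same per-community station count can be computed with a `groupby`/`nunique`, which reads a little more directly than deduplicating first. A toy sketch with hypothetical stations and communities:

```python
import pandas as pd

# Hypothetical trips: each row is one trip from a station in a community
df = pd.DataFrame({
    'start_station_name': ['A', 'A', 'B', 'C', 'C'],
    'community':          ['LOOP', 'LOOP', 'LOOP', 'AUSTIN', 'AUSTIN'],
})

# Count distinct stations per community
station_counts = df.groupby('community')['start_station_name'].nunique()
```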

1.2.5 Interactive Plotting with Folium

In addition to static plotting, geopandas supports interactive geospatial visualizations through the folium library.

The explore() method in geopandas provides a simple interface for creating interactive maps, closely mirroring the API of the standard plot() method used for static plotting. You can use explore() directly on a GeoSeries or GeoDataFrame to generate interactive maps powered by folium.

Key Features of explore()

Here are some key capabilities of the explore() method:

  1. Interactive Map Display
    • Calling explore() on a GeoDataFrame opens an interactive map directly within a Jupyter notebook.
    • Users can pan, zoom, and explore the spatial features (points, lines, or polygons) on the map.
  2. Layer Control
    • Geometries are automatically added as layers, each styled based on its type.
    • You can view and toggle different layers for enhanced exploration.
  3. Tooltips and Attribute Popups
    • Hovering over a feature reveals tooltip information, typically drawn from attribute columns in the data.
    • This makes it easy to inspect the properties of individual geographic features.
  4. Search and Filter
    • Built-in search and filter tools allow users to explore specific attribute values or subsets of data directly on the map.
  5. Customization Options
    • The explore() method supports various customization options (e.g., color, column selection, tooltip formatting), enabling you to tailor the map for your audience or use case.

This method is particularly useful for creating web-friendly, interactive maps with minimal code—ideal for exploratory data analysis, public dashboards, or teaching materials.

Code
import folium
Code
# use the geopandas explore default settings
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

chicago.explore()
Code
# Customize the explore settings
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

m = chicago.explore(
    column="POP2010",  # make choropleth based on "POP2010" column
    scheme="naturalbreaks",  # use mapclassify's natural breaks scheme
    legend=True,  # show legend
    k=10,  # use 10 bins
    tooltip=False,  # hide tooltip
    popup=["POP2010", "POP2000"],  # show popup (on-click)
    legend_kwds=dict(colorbar=False),  # do not use colorbar
    name="chicago",  # name of the layer in the map
)

m

The explore() method returns a folium.Map object, which can also be passed into another explore() call (much as you pass ax to plot()). You can then use folium functionality directly on the resulting map. Next, let’s add the Divvy stations to the map.

Code
type(m)
folium.folium.Map
1.2.5.1 Adding the Divvy Stations to the Interactive Folium Map
Code
# Helper function for adding the description to the station
def row_to_html(row):
    row_df = pd.DataFrame(row).T
    row_df.columns = [col.capitalize() for col in row_df.columns]
    return row_df.to_html(index=False)
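A quick standalone check of this helper (restated here for self-containment) on a hypothetical station row:

```python
import pandas as pd

def row_to_html(row):
    # Turn a Series (one row) into a one-row DataFrame and render it as HTML
    row_df = pd.DataFrame(row).T
    row_df.columns = [col.capitalize() for col in row_df.columns]
    return row_df.to_html(index=False)

row = pd.Series({'title': 'Streeter Dr & Grand Ave', 'count': 9282})
html = row_to_html(row)
# html is a one-row HTML <table> with capitalized headers 'Title' and 'Count'
```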
Code
data[['start_station_name', 'start_station_median_lat', 'start_station_median_lng']]
start_station_name start_station_median_lat start_station_median_lng
0 Streeter Dr & Grand Ave 41.892276 -87.612041
1 Sheridan Rd & Greenleaf Ave 42.010592 -87.662428
2 Streeter Dr & Grand Ave 41.892276 -87.612041
3 Streeter Dr & Grand Ave 41.892276 -87.612041
4 Larrabee St & Division St 41.903535 -87.643344
... ... ... ...
598365 Clarendon Ave & Leland Ave 41.967942 -87.650045
598366 Wabash Ave & Roosevelt Rd 41.867179 -87.625985
598367 DuSable Lake Shore Dr & Belmont Ave 41.940740 -87.639154
598368 Green St & Washington Blvd 41.883201 -87.648782
598369 Morgan St & 31st St 41.837776 -87.651147

598370 rows × 3 columns

Code
# Extracting the latitude, longitude, and station name for plotting, and also counting the number of trips from each station
grouped_df = data.groupby(['start_station_name', 'start_station_median_lat', 'start_station_median_lng'])['ride_id'].count().reset_index()
display(grouped_df.sort_values('ride_id', ascending=False).head())
grouped_df.rename(columns={'start_station_name':'title', 'start_station_median_lat':'latitude', 'start_station_median_lng':'longitude', 'ride_id':'count'}, inplace=True)
grouped_df['description'] = grouped_df.apply(lambda row: row_to_html(row), axis=1)
geometry = gpd.points_from_xy(grouped_df['longitude'], grouped_df['latitude'])
geo_df = gpd.GeoDataFrame(grouped_df, geometry=geometry)
# Optional: Assign Coordinate Reference System (CRS)
geo_df.crs = "EPSG:4326"  # WGS84 coordinate system
geo_df
start_station_name start_station_median_lat start_station_median_lng ride_id
1217 Streeter Dr & Grand Ave 41.892276 -87.612041 9282
271 DuSable Lake Shore Dr & Monroe St 41.881059 -87.616777 6577
1314 Wilton Ave & Belmont Ave 41.940116 -87.652973 5845
1235 University Ave & 57th St 41.791475 -87.599920 5196
478 LaSalle St & Illinois St 41.890853 -87.631665 4937
title latitude longitude count description geometry
0 2112 W Peterson Ave 41.991168 -87.683617 119 <table border="1" class="dataframe">\n <thead... POINT (-87.68362 41.99117)
1 21st St & Pulaski Rd 41.853457 -87.724529 19 <table border="1" class="dataframe">\n <thead... POINT (-87.72453 41.85346)
2 63rd St Beach 41.780958 -87.576320 174 <table border="1" class="dataframe">\n <thead... POINT (-87.57632 41.78096)
3 900 W Harrison St 41.874755 -87.649817 882 <table border="1" class="dataframe">\n <thead... POINT (-87.64982 41.87476)
4 Aberdeen St & Jackson Blvd 41.877773 -87.654831 1471 <table border="1" class="dataframe">\n <thead... POINT (-87.65483 41.87777)
... ... ... ... ... ... ...
1333 Woodlawn Ave & 58th St 41.789404 -87.596487 987 <table border="1" class="dataframe">\n <thead... POINT (-87.59649 41.7894)
1334 Woodlawn Ave & 75th St 41.759160 -87.595751 10 <table border="1" class="dataframe">\n <thead... POINT (-87.59575 41.75916)
1335 Woodlawn Ave & Lake Park Ave 41.814078 -87.597017 218 <table border="1" class="dataframe">\n <thead... POINT (-87.59702 41.81408)
1336 Yates Blvd & 75th St 41.758705 -87.566409 32 <table border="1" class="dataframe">\n <thead... POINT (-87.56641 41.75871)
1337 Yates Blvd & 93rd St 41.726178 -87.566361 4 <table border="1" class="dataframe">\n <thead... POINT (-87.56636 41.72618)

1338 rows × 6 columns

You can add a hover tooltip for each point on your Folium map. The tooltip appears when you hover over a marker, providing additional information without needing to click. Here’s how to modify the existing code to include hover tooltips:

Code
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))

m = chicago.explore(
    column="POP2010",  # make choropleth based on "POP2010" column
    scheme="naturalbreaks",  # use mapclassify's natural breaks scheme
    legend=True,  # show legend
    k=10,  # use 10 bins
    tooltip=False,  # hide tooltip
    popup=["POP2010", "POP2000"],  # show popup (on-click)
    legend_kwds=dict(colorbar=False),  # do not use colorbar
    name="chicago",  # name of the layer in the map
)

geo_df.explore(
    m=m,  # pass the map object
    color="red",  # use red color on all points
    marker_kwds=dict(radius=5, fill=True),  # make marker radius 5px with fill
    tooltip="description",  # show the "description" column in the tooltip
    tooltip_kwds=dict(labels=False),  # do not show column label in the tooltip
    name="divstation",  # name of the layer in the map
)
 
m
1.2.5.2 Assigning station sizes based on popularity

Next, we will categorize the number of trips associated with each station into 10 bins and assign marker sizes accordingly, so that a station’s size visually indicates its popularity. Hovering over a station marker will display the station name, and clicking it will show additional station information.

To do this, we use folium.CircleMarker, a flexible folium class for drawing circle markers on interactive maps. It lets you represent geospatial data points visually and attach additional information through tooltips and popups, enhancing the interactivity of your map visualizations.

Code

def add_bins(df):
    # Create 10 evenly spaced edges over the 'count' range; np.digitize then
    # assigns each count a bin index from 1 to 10
    bins = np.linspace(df['count'].min(), df['count'].max(), 10)
    df['bin'] = np.digitize(df['count'], bins)

    # Assign a color to each bin (any matplotlib colormap works here)
    cmap = plt.get_cmap('viridis')
    df['color'] = df['bin'].apply(lambda x: cmap(x / len(bins)))

    # Assign marker sizes that grow with the bin number
    df['size'] = df['bin'].apply(lambda x: 2 + x)  # adjust the size calculation as needed
    return df
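The binning step can be verified on toy counts: with 10 edges from np.linspace, np.digitize yields indices 1 through 10, because the maximum value sits on the last edge and therefore lands past it. A small sketch with hypothetical trip counts:

```python
import numpy as np

# Hypothetical trip counts per station
counts = np.array([4, 500, 9282])

# 10 evenly spaced edges over the range of counts
edges = np.linspace(counts.min(), counts.max(), 10)

# With the default right=False, each index equals the number of edges <= the value:
# the minimum falls in bin 1, the maximum beyond the last edge in bin 10
bins = np.digitize(counts, edges)
# → array([ 1,  1, 10])
```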
Code
geo_df  = add_bins(geo_df)
geo_df
title latitude longitude count description geometry bin color size
0 2112 W Peterson Ave 41.991168 -87.683617 119 <table border="1" class="dataframe">\n <thead... POINT (-87.68362 41.99117) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1 21st St & Pulaski Rd 41.853457 -87.724529 19 <table border="1" class="dataframe">\n <thead... POINT (-87.72453 41.85346) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
2 63rd St Beach 41.780958 -87.576320 174 <table border="1" class="dataframe">\n <thead... POINT (-87.57632 41.78096) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
3 900 W Harrison St 41.874755 -87.649817 882 <table border="1" class="dataframe">\n <thead... POINT (-87.64982 41.87476) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
4 Aberdeen St & Jackson Blvd 41.877773 -87.654831 1471 <table border="1" class="dataframe">\n <thead... POINT (-87.65483 41.87777) 2 (0.253935, 0.265254, 0.529983, 1.0) 4
... ... ... ... ... ... ... ... ... ...
1333 Woodlawn Ave & 58th St 41.789404 -87.596487 987 <table border="1" class="dataframe">\n <thead... POINT (-87.59649 41.7894) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1334 Woodlawn Ave & 75th St 41.759160 -87.595751 10 <table border="1" class="dataframe">\n <thead... POINT (-87.59575 41.75916) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1335 Woodlawn Ave & Lake Park Ave 41.814078 -87.597017 218 <table border="1" class="dataframe">\n <thead... POINT (-87.59702 41.81408) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1336 Yates Blvd & 75th St 41.758705 -87.566409 32 <table border="1" class="dataframe">\n <thead... POINT (-87.56641 41.75871) 1 (0.282623, 0.140926, 0.457517, 1.0) 3
1337 Yates Blvd & 93rd St 41.726178 -87.566361 4 <table border="1" class="dataframe">\n <thead... POINT (-87.56636 41.72618) 1 (0.282623, 0.140926, 0.457517, 1.0) 3

1338 rows × 9 columns

Code


m = folium.Map(location=[geo_df.geometry.y.mean(), geo_df.geometry.x.mean()], zoom_start=13)
# Add a circle marker for each station, sized by its popularity bin
for idx, row in geo_df.iterrows():
    folium.CircleMarker(location=[row.geometry.y, row.geometry.x],
                        radius=row['size'],
                        color='red',
                        fill=True,
                        fill_color='red',
                        popup=row['description'],
                        tooltip = row['title']).add_to(m)

m

2 Independent Practice

Download the trip data from May 2023 to May 2024 from the Divvy website and conduct a geospatial analysis. Your tasks include:

  • Imputing missing values where possible.
  • Removing invalid trips.
  • Calculating the percentage of trips where the bike was picked up and returned to the same station. Also, determine the percentage of these trips classified as commuting versus leisure.
  • Identifying the top 5 stations with the highest number of trips and checking whether they are located in the communities with the most Divvy stations.
  • Plotting the top 5 popular stations on a map, highlighting those with the highest number of associated trips.

3 References