Bay Wheels is the public bicycle sharing system of the San Francisco Bay Area. Public sharing systems are now key to sustainable road transportation: they complement transit networks and help lower private car ownership. Their main goal is therefore to attract as many users as possible, but not every user is the intended user. If all customers kept the bikes for a long time, the service would quickly run out of available bikes and would be harder to maintain. Nor is the service meant to pull customers away from public transportation; it is meant to serve as a first/last-mile option. With this in mind, this investigation inspects how Bay Wheels performed in the last 6 months of 2017 in terms of the type of user it attracted.
In this investigation, I examined users' behavior to see whether certain types of users were associated with certain distances, times, specific stations, seasonality effects, etc. Trip duration, user type and trip length are the main variables.
The main questions guiding this investigation were:
This dataset contains observations from Bay Wheels, the regional public bicycle sharing system of the San Francisco Bay Area. It holds almost half a million entries with variables describing trips made during 2017. The information covers the type of user, the start and end stations, and the duration of each trip.
The variables in the dataset could help analyze the type of customer associated with longer or shorter trips. Do regular customers go on shorter trips, perhaps as part of a regular commute? Do occasional riders travel longer distances? These could be tourists who decide to get to know the city using the bike system.
# This code plots 1000 of the roughly half a million trips
import folium

# Latitude and longitude of San Francisco
latitude = 37.814104
longitude = -122.358961
# Create the base map
sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12, tiles='Stamen Toner')
# Display the map of San Francisco
sanfran_map
# Take the first 1000 trips
limit = 1000
trip_data5 = trip_data.iloc[0:limit, :]
trips = folium.map.FeatureGroup()
# Loop through the 1000 trips and add each end station to the feature group
for lat, lng in zip(trip_data5.end_station_latitude, trip_data5.end_station_longitude):
    trips.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5,  # size of the circle markers
            color='purple',
            fill=True,
            fill_color='gray',
            fill_opacity=0.6
        )
    )
# Add the trip markers to the map
sanfran_map.add_child(trips)
Initially, trip durations covered a wide range of values, from 61 seconds to as long as 24 hours. A few thousand trips lasted more than 5 hours, a very small number compared with half a million entries. These were treated as outliers, because the service's pricing encourages trips of 30 to 45 minutes by charging 2 dollars for each additional 15 minutes.
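The outlier rule described above can be sketched as a simple filter. This is a minimal illustration on toy data, not the notebook's actual cleaning step: the column name `duration_sec` and the variable names are assumptions, and only the 5-hour cutoff comes from the text.

```python
import pandas as pd

# Toy stand-in for the trip table; `duration_sec` is an assumed column name.
trips = pd.DataFrame({"duration_sec": [61, 540, 1800, 7200, 19000, 86400]})

# Convert to minutes so the cutoff is easier to read.
trips["duration_min"] = trips["duration_sec"] / 60

# Keep only trips of at most 5 hours (300 minutes), as described in the text.
MAX_MINUTES = 5 * 60
filtered = trips[trips["duration_min"] <= MAX_MINUTES]

# Report how much data the rule removes.
dropped_share = 1 - len(filtered) / len(trips)
print(f"Dropped {dropped_share:.1%} of trips as outliers")
```

Reporting the dropped share alongside the filter makes it easy to confirm that the cutoff removes only a small fraction of the real data.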
What is the mean trip duration across this dataset? Does it follow a normal distribution?
Dividing trips by duration into short, medium and long makes the inspection more focused.
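One possible way to build the three groups is `pandas.cut`. The sketch below uses toy data, and the bin edges (20 and 60 minutes) are assumptions read off the plot titles that follow rather than a definition given in the text.

```python
import pandas as pd

# Toy durations in minutes standing in for the real trip table.
trips = pd.DataFrame({"duration_min": [3.0, 12.5, 19.9, 20.0, 45.0, 60.0, 180.0]})

# Assumed edges: under 20 min, 20 to under 60 min, 60 min and above.
bins = [0, 20, 60, float("inf")]
labels = ["short", "medium", "long"]

# right=False makes each bin include its left edge and exclude its right edge.
trips["trip_class"] = pd.cut(trips["duration_min"], bins=bins, labels=labels, right=False)

short_trips = trips[trips["trip_class"] == "short"]
medium_trips = trips[trips["trip_class"] == "medium"]
long_trips = trips[trips["trip_class"] == "long"]

print(trips["trip_class"].value_counts())
```

Keeping the class as a column, instead of only three separate frames, also allows grouping or faceting by `trip_class` later.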
# first: zoom in on each duration group
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize = [15, 5])
# histogram on the left: short trips (under 20 minutes), 1-minute bins
plt.subplot(1, 3, 1)
binsize = 1
bin_edges = np.arange(0, short_trips['duration_min'].max()+1, binsize)
plt.hist(data = short_trips, x = 'duration_min', bins = bin_edges, color = 'black');
plt.xlabel('Duration (minutes)')
plt.ylabel('Number of trips');
plt.title('Trips that lasted less than 20 minutes', loc = 'center')
## line showing the mean
plt.axvline(short_trips.duration_min.mean(), color='red', linewidth = 2)
min_ylim, max_ylim = plt.ylim()
plt.text(short_trips.duration_min.mean()*1.5, max_ylim*0.5, 'Mean: {:.2f} min'.format(short_trips.duration_min.mean()))
# center histogram: medium trips (20 minutes to an hour), 1-minute bins
plt.subplot(1, 3, 2)
binsize2 = 1
bin_edges = np.arange(10, medium_trips['duration_min'].max()+5, binsize2)
plt.hist(data = medium_trips, x = 'duration_min', bins = bin_edges, color = 'black');
plt.xlabel('Duration (minutes)')
plt.title('Trips between 20 minutes and an hour', loc = 'center')
## line showing the mean
plt.axvline(medium_trips.duration_min.mean(), color='red', linewidth = 2)
min_ylim, max_ylim = plt.ylim()
plt.text(medium_trips.duration_min.mean()*1.1, max_ylim*0.5, 'Mean: {:.2f} min'.format(medium_trips.duration_min.mean()))
# histogram on the right: long trips (lasting hours), 20-minute bins
plt.subplot(1, 3, 3)
binsize2 = 20
bin_edges = np.arange(0, long_trips['duration_min'].max()+1, binsize2)
plt.hist(data = long_trips, x = 'duration_min', bins = bin_edges, color = 'black');
plt.xlabel('Duration (minutes)')
plt.title('Trips lasting hours', loc = 'center')
## line showing the mean
plt.axvline(long_trips.duration_min.mean(), color='red', linewidth = 2)
min_ylim, max_ylim = plt.ylim()
plt.text(long_trips.duration_min.mean()*1.5, max_ylim*0.5, 'Mean: {:.2f} min'.format(long_trips.duration_min.mean()));
# Re plot the data with the log transformation
log_binsize = 0.045
bins = 10 ** np.arange(0.05, np.log10(trip_data['duration_min'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = trip_data, x = 'duration_min', bins = bins, color = 'gray')
plt.xscale('log')
tick_locs = [1, 2, 4, 8, 16, 32, 64, 128, 256]
plt.xticks(tick_locs, tick_locs)
plt.title('Histogram, number of trips by duration in minutes ')
plt.xlabel('Duration (minutes)')
plt.ylabel('Number of trips');
# vertical lines showing where the mean and median are
plt.axvline(trip_data.duration_min.mean(), color='blue', linewidth = 1)
min_ylim, max_ylim = plt.ylim()
plt.text(trip_data.duration_min.mean()*1.2, max_ylim*0.5, 'Mean: {:.2f}'.format(trip_data.duration_min.mean()))
plt.axvline(trip_data.duration_min.median(), color='red', linewidth = 1)
min_ylim, max_ylim = plt.ylim()
plt.text(trip_data.duration_min.median()*0.13, max_ylim*0.5, 'Median: {:.2f}'.format(trip_data.duration_min.median()))
plt.show();
On a logarithmic scale, and with the excessively long observations dropped, the distribution looks much closer to normal. As seen before, the median is almost 10 minutes while the mean is 18 minutes; the mean is pulled upward by the larger values.
Most users go on trips that last less than 30 minutes, which is a good sign.
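To illustrate why the mean sits well above the median in a right-skewed variable, here is a small sketch on made-up durations (not values from the dataset). It also shows one way to compute the share of trips under 30 minutes.

```python
import numpy as np

# Made-up durations in minutes: mostly short trips plus one very long one.
durations = np.array([5, 7, 8, 9, 10, 11, 12, 15, 25, 240], dtype=float)

mean_min = durations.mean()
median_min = np.median(durations)

# Fraction of trips shorter than 30 minutes (mean of a boolean mask).
share_under_30 = (durations < 30).mean()

print(f"mean={mean_min:.1f} min, median={median_min:.1f} min")
print(f"share under 30 min: {share_under_30:.0%}")
```

A single 240-minute trip is enough to pull the mean far above the median, which is why the median is the more representative summary here.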
As with duration, trip lengths covered a wide range of values, from 0 to 68 km.
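The text does not show how `trip_length` was derived. One common approach, assumed here purely for illustration, is the haversine great-circle distance between the start and end station coordinates. The function name and the Earth radius constant below are not from the original analysis.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a standard approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# A trip that ends where it started has zero straight-line length.
same_station = haversine_km(37.78, -122.40, 37.78, -122.40)

# One degree of latitude along a meridian.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(same_station, round(one_degree, 2))
```

Under this assumption, a length of 0 km would correspond to a round trip that returns to the starting station, and every length would underestimate the distance actually ridden.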
# re plot the data with log transformation
plt.figure(figsize=[8, 5])
log_binsize = 0.045
bins = 10 ** np.arange(-1, np.log10(trip_data['trip_length'].max())+log_binsize, log_binsize)
plt.hist(data = trip_data, x = 'trip_length', bins = bins, color = 'gray')
plt.xscale('log')
tick_locs = [0.1, 0.3, 1, 3, 10, 30, 70]
plt.xticks(tick_locs, tick_locs)
plt.xlabel('Length (km)');
plt.ylabel('Number of trips');
plt.title('Histogram, number of trips by length in kilometers');
After dropping abnormal values, this variable also looks roughly normal on the logarithmic scale. Most users go on trips shorter than 10 kilometers.
Users are either Subscribers (members of the service) or Customers (casual users). Which user type is more common?
# plot for the type of user
import seaborn as sb

default_color = sb.color_palette()[3]
# order the bars from most to least frequent user type
order1 = trip_data.user_type.value_counts().index
sb.countplot(data = trip_data, x = 'user_type', color = default_color, order = order1);
plt.xlabel('User type');
plt.ylabel('Frequency');
plt.title('User types');
# add the relative frequency as text on each bar
total = trip_data.shape[0]
type_counts = trip_data['user_type'].value_counts()
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = type_counts[label.get_text()]
    perc_string = '{:0.2f}%'.format(100*count/total)
    # place the label 50000 trips below the top of the bar
    plt.text(loc, count-50000, perc_string, va = 'top', ha = 'center', color = 'white', weight = 'bold')
# clustered bar chart: trips per start station, split by user type
sb.countplot(data = Top_start_stations, y ='start_station_name', hue = 'user_type', palette = 'Reds');
plt.xlabel('Number of trips');
plt.ylabel("Start station's name");
plt.title('Number of trips by start stations');
# clustered bar chart: trips per end station, split by user type
sb.countplot(data = Top_end_stations, y ='end_station_name', hue = 'user_type', palette = 'Reds');
plt.xlabel('Number of trips');
plt.ylabel("End station's name");
plt.title('Number of trips by end stations');
The most popular stations are located near public transportation facilities, which is a good sign.
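The charts above rely on tables restricted to the busiest stations. How those tables were built is not shown, so the following is only a sketch of one plausible approach on toy data; `TOP_N` and the lower-case variable names are hypothetical.

```python
import pandas as pd

# Toy trips standing in for the real table.
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C", "D", "D", "D", "D"],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber", "Customer",
                  "Customer", "Subscriber", "Subscriber", "Subscriber", "Customer"],
})

TOP_N = 2  # hypothetical number of stations to keep

# value_counts sorts stations by trip count, most frequent first.
top_names = trips["start_station_name"].value_counts().head(TOP_N).index

# Keep every trip that starts at one of the top stations.
top_start_stations = trips[trips["start_station_name"].isin(top_names)]

# Trips per station and user type, the numbers a clustered bar chart displays.
counts = top_start_stations.groupby(["start_station_name", "user_type"]).size()
print(counts)
```

The same steps applied to `end_station_name` would give the corresponding table for end stations.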
Duration and length were both plotted on logarithmic scales, which revealed a positive linear relationship. A straight line on log-log axes indicates a power-law relationship: a proportional change in one variable is associated with a proportional change in the other.
# log-log plot
# one of them correlates with price.
plt.figure(figsize = [8, 5])
plt.scatter(data = trip_data3.sample(1000, random_state = 4), x = 'duration_min', y = 'trip_length', alpha = 1/5);
plt.xscale('log')
tick_locsx = [1, 2, 4, 8, 16, 32, 64, 128, 256]
plt.xticks(tick_locsx, tick_locsx)
plt.yscale('log')
tick_locs = [0.1, 0.3, 1, 3, 10, 20]
plt.yticks(tick_locs, tick_locs);
plt.xlabel('Duration (minutes)');
plt.ylabel("Trip's length (km)");
plt.title("Trip's length (log scale) vs. duration (log scale)");
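A linear pattern on log-log axes can be quantified by fitting a straight line to the log-transformed values. The sketch below uses synthetic data that follows a known power law exactly, so the coefficients are illustrative and say nothing about the real trips.

```python
import numpy as np

# Synthetic data obeying length = 0.2 * duration ** 0.8 exactly.
duration_min = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
trip_length = 0.2 * duration_min ** 0.8

# Fit log10(length) = slope * log10(duration) + intercept.
slope, intercept = np.polyfit(np.log10(duration_min), np.log10(trip_length), 1)

# The slope is the power-law exponent; 10 ** intercept is the scale factor.
scale = 10 ** intercept
print(f"exponent={slope:.3f}, scale={scale:.3f}")
```

On real data the points scatter around the line, so the fitted exponent is only a summary of the average trend.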
At first, plotting the type of user along with durations and lengths showed that subscribers were less scattered than customers, despite being the majority. Subscribers went on longer trips, whereas customers had a strong presence in trips that lasted around 30 minutes regardless of the length covered.
# plot two numerical plus one categorical
g = sb.FacetGrid(data = trip_data3.sample(1000, random_state = 4), hue = 'user_type', height= 4.5)
g.map(plt.scatter, 'duration_min', 'trip_length')
g.add_legend(title = 'User type');
plt.xscale('log')
tick_locsx = [1, 2, 4, 8, 16, 32, 64, 128, 256]
plt.xticks(tick_locsx, tick_locsx)
plt.yscale('log')
tick_locs = [0.1, 0.3, 1, 3, 10, 20]
plt.yticks(tick_locs, tick_locs);
plt.xlabel('Duration (minutes)');
plt.ylabel("Trip's length (km)");
plt.title("Length vs Duration (log scales) by type of user");
Plotting each type of user separately for short and medium trips confirmed that subscribers followed a more consistent distribution, whereas customers showed more variability. The differences were distinctive for each kind of trip.
# plot two numerical plus one categorical
g = sb.FacetGrid(data = short_trips.sample(1000, random_state = 4), hue = 'user_type', height= 4.5)
g.map(plt.scatter, 'duration_min', 'trip_length')
g.add_legend(title = 'User type');
plt.xlabel('Duration (minutes)');
plt.ylabel("Trip's length (km)");
plt.title("Length vs Duration by type of user - Shorter trips");
# plot two numerical plus one categorical
g = sb.FacetGrid(data = medium_trips.sample(1000, random_state = 4), hue = 'user_type', height= 4.5)
g.map(plt.scatter, 'duration_min', 'trip_length')
g.add_legend(title = 'User type');
plt.xlabel('Duration (minutes)');
plt.ylabel("Trip's length (km)");
plt.title("Length vs Duration by type of user - Medium trips");
Both user types are more scattered in medium trips.
# faceted scatter plots, one panel per user type
g = sb.FacetGrid(data = trip_data3.sample(1000, random_state = 4), col = 'user_type', height = 4.5)
g.map(plt.scatter, 'log_duration_min', 'log_length', color = 'red');
g.fig.suptitle('Duration vs Length by type of user', y = 1.08)
# title each panel from its own facet value, so the labels cannot be swapped
g.set_titles(col_template = 'Users = {col_name}')
g.set_axis_labels('Log duration (minutes)', 'Log length (km)');
This plot shows how different the distributions of subscribers and customers are.
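The visual difference between the two groups can also be summarized numerically. This is a minimal sketch on invented durations, assuming the same `user_type` and `duration_min` column names used in the plots; the numbers are not results from the dataset.

```python
import pandas as pd

# Invented durations: a tight group of subscribers, a more spread-out group of customers.
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 4,
    "duration_min": [8.0, 9.0, 10.0, 11.0, 12.0, 25.0, 30.0, 90.0],
})

# Median, mean and standard deviation of duration for each user type.
summary = trips.groupby("user_type")["duration_min"].agg(["median", "mean", "std"])
print(summary)
```

A larger standard deviation for one group is the numerical counterpart of the wider scatter seen in its panel.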