Introduction

This project investigates an Uber dataset of New York City pickups from January through June of 2015. The Taxi & Limousine Commission (TLC) released the data after receiving a Freedom of Information Law (FOIL) request from FiveThirtyEight on June 22, 2015. I pulled the data from the FiveThirtyEight GitHub repo (linked under Sources). Each observation represents a single pickup with the following features:

Variable	Definition
Dispatching_base_num	TLC base company code that dispatched the Uber
Pickup_date	Full date of the pickup in yyyy-mm-dd h:m:s format
Affiliated_base_num	TLC base company code of the Uber pickup
locationID	Pickup location ID of the Uber pickup
Borough	New York City Borough where the pickup took place
Zone	Neighborhood in the New York City Borough where the pickup took place
lat	Latitude of the pickup Zone
lon	Longitude of the pickup Zone
month	Month that the pickup took place
date	Date (1-31) that the pickup took place
day	Day (Monday-Sunday) that the pickup took place
hour	Hour that the pickup took place

I was most interested in how ridership varied over time and location. I wanted to know how it changed by hour, day, week and month. I also wanted to know which Boroughs and Zones were the most popular.

Univariate Plots Section

Dimensions, Structure and Summary

The Uber dataset consists of about 14.3 million observations of pickup data spread across 12 variables. Originally there were only 4 variables (dispatch base number, pickup date, affiliated base number, and location ID), but I joined that with taxi lookup data to get Zone and Borough names for each location ID. I also wanted to map each location to a set of coordinates, so I used Google’s API to get the latitude and longitude for each zone. The Google API limits users to 2,500 queries during each 24-hour period, so I had to be strategic when joining the lookup data to the ridership data.
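One plausible way to stay under a daily query limit is to geocode each distinct zone once in the lookup table (about 260 zones) and only then join the coordinates onto the pickups. The base-R sketch below illustrates that idea and is not the code behind this report. The data frames `pickups` and `zone_lookup` and the function `geocode_zone` are hypothetical stand-ins, and the coordinates are rough placeholders rather than API results.

```r
# Toy stand-ins for the real tables (rows are illustrative, not the real data).
pickups <- data.frame(
  Pickup_date = c("2015-05-17 09:47:00", "2015-05-17 09:47:00",
                  "2015-05-17 09:48:00", "2015-05-17 09:50:00"),
  locationID  = c(161, 79, 161, 999),   # 999 has no match in the lookup
  stringsAsFactors = FALSE
)
zone_lookup <- data.frame(
  locationID = c(161, 79),
  Borough    = c("Manhattan", "Manhattan"),
  Zone       = c("Midtown Center", "East Village"),
  stringsAsFactors = FALSE
)

# Hypothetical geocoder stub: a real version would call a geocoding API here.
# The coordinates are rough placeholders, not API results.
geocode_zone <- function(zone) {
  known <- list("Midtown Center" = c(lat = 40.75, lon = -73.98),
                "East Village"   = c(lat = 40.73, lon = -73.98))
  known[[zone]]
}

# Geocode each distinct zone once, so the number of calls equals the number of zones.
coords <- t(vapply(zone_lookup$Zone, geocode_zone, c(lat = 0, lon = 0)))
zone_lookup$lat <- coords[, "lat"]
zone_lookup$lon <- coords[, "lon"]

# A left join keeps pickups whose locationID is missing from the lookup (NA fields).
uber <- merge(pickups, zone_lookup, by = "locationID", all.x = TRUE)
```

Joining after geocoding means the number of API calls depends on the number of zones, not on the 14 million pickups.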

I parsed the month, date, day, and hour from each pickup timestamp using lubridate to get a more granular analysis of patterns over time. I also binned the pickup data by location and time for some of the histograms and boxplots. I had a dilemma when it came to the taxi lookup dataset because some of the location data points were missing. It was a small subset (about 6,000 observations) but I did not want to throw away the data. Instead I kept all observations for the time period analyses and excluded them from the location analyses.
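As a rough illustration of the timestamp parsing, here is a base-R sketch that derives the same four columns. The report itself used lubridate, so this is an equivalent under stated assumptions rather than the original code. The two timestamps are examples, and the data frame name `parts` is made up for the sketch.

```r
# Two example timestamps in the dataset's yyyy-mm-dd h:m:s format.
pickup <- as.POSIXct(c("2015-05-17 09:47:00", "2015-01-27 18:05:00"),
                     format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
lt <- as.POSIXlt(pickup)

day_levels   <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")
month_levels <- month.name[1:6]   # the dataset covers January through June

parts <- data.frame(
  # POSIXlt months are 0-based, so add 1 before indexing month.name.
  month = factor(month.name[lt$mon + 1], levels = month_levels, ordered = TRUE),
  date  = lt$mday,
  # POSIXlt counts weekdays from Sunday = 0; shift so Monday comes first.
  day   = factor(day_levels[(lt$wday + 6) %% 7 + 1],
                 levels = day_levels, ordered = TRUE),
  hour  = lt$hour
)
```

Using ordered factors for `day` and `month` keeps tables and plot axes in calendar order instead of alphabetical order.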

Since the dataset was so large, I ran most tests on a sample of 100,000 randomly selected observations before making the final plots on the population. This saved a lot of time since some plots took a couple minutes (each) to run.
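Drawing a random sample like this takes one line in base R. The sketch below assumes a data frame called `uber` (here a toy stand-in with a single `id` column) and an arbitrary seed so the sample is reproducible.

```r
set.seed(2015)                          # arbitrary seed, only for reproducibility
uber <- data.frame(id = seq_len(1e6))   # toy stand-in for the full data frame

n_sample <- 100000
# sample() draws row indices without replacement by default.
uber_sample <- uber[sample(nrow(uber), n_sample), , drop = FALSE]
```

Plots can then be prototyped on `uber_sample` and rerun on `uber` once the layout looks right.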

## [1] 14270479       12

## 'data.frame':    14270479 obs. of  12 variables:
##  $ Dispatching_base_num: Factor w/ 8 levels "B02512","B02598",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Pickup_date         : POSIXct, format: "2015-05-17 09:47:00" "2015-05-17 09:47:00" ...
##  $ Affiliated_base_num : Factor w/ 285 levels "","B00013","B00014",..: 188 188 188 247 188 188 188 239 188 239 ...
##  $ locationID          : Factor w/ 262 levels "1","2","3","4",..: 139 65 100 80 90 225 7 74 246 22 ...
##  $ Borough             : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 2 4 2 4 2 5 4 4 2 ...
##  $ Zone                : Factor w/ 260 levels "Allerton/Pelham Gardens",..: 137 62 97 77 87 224 5 71 246 20 ...
##  $ lat                 : num  40.8 40.7 40.8 40.7 40.7 ...
##  $ lon                 : num  -74 -74 -74 -73.9 -74 ...
##  $ hour                : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ day                 : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ month               : Ord.factor w/ 6 levels "January"<"February"<..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ date                : int  17 17 17 17 17 17 17 17 17 17 ...

##  Dispatching_base_num  Pickup_date                  Affiliated_base_num
##  B02764 :5753653      Min.   :2015-01-01 00:00:05   B02764 :4352321    
##  B02682 :3484530      1st Qu.:2015-02-21 03:00:16   B02682 :3448698    
##  B02617 :2068525      Median :2015-04-10 16:21:00   B02617 :1946933    
##  B02598 :1526660      Mean   :2015-04-07 15:04:13   B02598 :1287723    
##  B02765 :1152727      3rd Qu.:2015-05-23 03:53:00   B02765 :1038379    
##  B02512 : 255772      Max.   :2015-06-30 23:59:00   B02512 : 188112    
##  (Other):  28612                                    (Other):2008313    
##    locationID                Borough        
##  161    :  460732   Bronx        :  220146  
##  231    :  420356   Brooklyn     : 2322000  
##  234    :  419045   EWR          :     105  
##  79     :  407591   Manhattan    :10371060  
##  249    :  323989   Queens       : 1343945  
##  230    :  315919   Staten Island:    6959  
##  (Other):11922847   Unknown      :    6264  
##                         Zone               lat             lon        
##  Midtown Center           :  460732   Min.   :40.53   Min.   :-74.24  
##  TriBeCa/Civic Center     :  420356   1st Qu.:40.72   1st Qu.:-74.00  
##  Union Sq                 :  419045   Median :40.74   Median :-73.98  
##  East Village             :  407591   Mean   :40.74   Mean   :-73.97  
##  West Village             :  323989   3rd Qu.:40.76   3rd Qu.:-73.96  
##  Times Sq/Theatre District:  315919   Max.   :40.90   Max.   :-73.71  
##  (Other)                  :11922847                                   
##       hour              day               month              date     
##  Min.   : 0.00   Monday   :1694252   January :1953801   Min.   : 1.0  
##  1st Qu.: 9.00   Tuesday  :1872902   February:2263620   1st Qu.: 8.0  
##  Median :16.00   Wednesday:1893811   March   :2259773   Median :16.0  
##  Mean   :14.09   Thursday :2159598   April   :2280837   Mean   :15.9  
##  3rd Qu.:20.00   Friday   :2282571   May     :2695553   3rd Qu.:23.0  
##  Max.   :23.00   Saturday :2414563   June    :2816895   Max.   :31.0  
##                  Sunday   :1952782

Total Pickups by Hour

## 
##       0       1       2       3       4       5       6       7       8 
##  602178  394510  260603  183655  173038  193523  288533  443543  583348 
##       9      10      11      12      13      14      15      16      17 
##  593437  520092  516716  533021  537909  584463  649414  737170  863990 
##      18      19      20      21      22      23 
##  987093 1007464  948574  930462  922954  814789

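The graph above shows the total number of pickups per hour. There are 3 notable features: demand is still high at midnight (0), there is a morning peak at 8-9am and a much larger evening peak at 7pm. Any lunchtime bump is slight, with totals rising only from about 517,000 at 11am to about 538,000 at 1pm.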

Pickups per Hour

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     159    1892    3182    3293    4478   10650

The above graphs show the distribution of pickups per hour. The shape is bimodal with a major peak around 3,200 pickups and a minor peak around 1,000 pickups. I wasn’t sure about the minor peak, so I used a log scale for the x-axis and found that there really seems to be one large peak at about 3,200.
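To make the binning concrete, here is a minimal base-R sketch that counts pickups per calendar hour and adds a log10 column of the kind a log-scaled axis would display. The five timestamps are made up for illustration, and `hour_bin` and `per_hour` are names chosen for this sketch.

```r
# Five made-up pickup times spanning three calendar hours.
pickup <- as.POSIXct(c("2015-05-16 23:05:00", "2015-05-16 23:40:00",
                       "2015-05-16 23:59:00", "2015-05-17 00:10:00",
                       "2015-05-17 04:30:00"), tz = "UTC")

# One bin per calendar hour: label each timestamp with the start of its hour.
hour_bin <- format(pickup, "%Y-%m-%d %H:00")
per_hour <- as.data.frame(table(hour_bin), responseName = "pickups")

# log10 spreads out the small counts that a linear axis squeezes together.
per_hour$log10_pickups <- log10(per_hour$pickups)
```

Each row of `per_hour` is one observation in the pickups-per-hour distribution. Hours with no pickups simply do not appear, which is harmless here because the quietest hour in the real data still had 159 pickups.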

Total Pickups by Day

## 
##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##   1694252   1872902   1893811   2159598   2282571   2414563   1952782

Pickups per day followed an expected trend, with demand rising from Monday through Friday, peaking on Saturday and then falling back on Sunday.

Pickups per Day

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25950   68330   78340   78840   90730  136200

The number of pickups per day was roughly normally distributed, centered near 78,000 (median 78,340, mean 78,840). There were a couple of outliers at both ends, at around 26,000 and above 120,000.

Pickups by Month

## 
##  January February    March    April      May     June 
##  1953801  2263620  2259773  2280837  2695553  2816895

Demand over this 6-month span grew from a little under 2 million rides in January to about 2.8 million in June, an increase of roughly 44%.

Pickups by Borough

Oh wow, it looks like Manhattan is killing it in Uber pickups, but we can’t tell how many pickups Staten Island or the Bronx had. Let’s change this to a log scale.

## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##        220146       2322000      10371060       1343945          6959

## 
##         Bronx      Brooklyn           EWR     Manhattan        Queens 
##        220146       2322000           105      10371060       1343945 
## Staten Island       Unknown 
##          6959          6264

Ah, much better. Manhattan had the most pickups by far at around 10,000,000 total, while Staten Island had the fewest (under 10,000 total). Below the graph are the tables of pickups in the filtered and full Uber datasets. I filtered the dataset for the location plots to exclude pickups with missing location data and pickups originating from New Jersey (EWR).
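A filter along these lines could be written as follows. This is a sketch on a toy data frame, not the original code. The name `uber_known` mirrors the object that appears in the console output later in this report, and `excluded` is a name chosen for the sketch.

```r
# Toy stand-in with the two borough labels that get excluded.
uber <- data.frame(
  Borough = factor(c("Manhattan", "EWR", "Unknown", "Brooklyn", "Manhattan"))
)

excluded <- c("EWR", "Unknown")
# Keep rows whose borough is not excluded, then drop the now-empty factor levels
# so they do not show up as zero-count categories in tables and plots.
uber_known <- droplevels(uber[!uber$Borough %in% excluded, , drop = FALSE])

table(uber_known$Borough)
```

Without `droplevels()`, EWR and Unknown would still be listed with a count of 0.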

Univariate Analysis

The most diverse distributions came from the location data. Staten Island had only a few thousand pickups while the Bronx had hundreds of thousands and Manhattan had over 10 million. This was interesting because, as I saw later, these patterns did not change over the course of the week. Manhattan was the busiest location during the week and on the weekends.

Even more unusual was the fact that Manhattan (pop. 1.6 million) is not the most populous Borough in New York. Brooklyn (2.6 million) and Queens (2.3 million) have considerably larger populations. By contrast, the population density of Manhattan is twice that of Brooklyn and triple that of Queens. This could be worth investigating for Uber as they expand into more markets. It would be interesting to look at ridership data for other densely populated cities to look for correlations.
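To put numbers on that comparison, the borough totals from the table above can be divided by the rounded populations quoted in this paragraph. This is back-of-the-envelope arithmetic only: pickups are counted where the ride started, not where the rider lives.

```r
# Borough pickup totals from the table above, January through June 2015.
pickups    <- c(Manhattan = 10371060, Brooklyn = 2322000, Queens = 1343945)
# Rounded populations as quoted in the text.
population <- c(Manhattan = 1.6e6, Brooklyn = 2.6e6, Queens = 2.3e6)

per_resident <- round(pickups / population, 2)
per_resident
# Manhattan 6.48, Brooklyn 0.89, Queens 0.58
```

Over the six months, Manhattan saw roughly 7 times as many pickups per resident as Brooklyn and about 11 times as many as Queens, a much bigger gap than the density difference alone.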

I used the lubridate, dplyr and tidyr packages to reshape the data. I parsed the Pickup_date variable with lubridate to create the month, date, day and hour columns. Then I grouped and summarised observations to bin the rides by time period and location. One day I hope to shake Hadley Wickham’s hand for what he has done for the data science community. He is truly a prolific developer and I feel a certain level of comfort when I see his name in a package’s documentation.

Bivariate Plots

Pickups by Hour 2D Density Plot

The 2D density plot allows us to see at a glance the distribution of rides per hour for all days. I used the viridis package to color the points. The black regions show regions of lower density while the green and yellow regions show higher density. We can start to see at least two demand curves emerge. One starts lower at midnight and has two distinct peaks while the other starts a bit higher and seems to have one prominent peak. Curiously, both curves seem to intersect at around 4am.

Hourly Pickups

This shows the distribution of pickups throughout the week. It’s easier to see some of the more nuanced patterns that emerge between the days. For instance, the 8am peak is only prevalent Monday - Friday, most likely due to morning rush hour traffic. The evening climax shifts slightly from 6 pm to 7 pm throughout the week and its duration lengthens as passengers edge closer to the weekend. Instead of retiring promptly after work, demand extends through midnight as the night life heats up.

This shift in rider behavior doesn’t begin abruptly on Friday or even Thursday night as I would have expected, but begins as early as Tuesday as workers loosen their ties in the evening. Interestingly, the 8am peak remains unchanged throughout the week at a steady 100,000. People are not rushing to work, but they are rushing to the bar in larger numbers as the week presses on.

Saturday and Sunday behavior removes the 8am peak for obvious reasons and instead, demand grows steadily from 6am until midnight. The growth on Saturday extends through this entire time period while Sunday demand tapers off around 6pm as riders prepare for the coming work week.

Daily Pickups

## uber.by_day$day: Monday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43680   56860   64040   65160   74240   93500 
## -------------------------------------------------------- 
## uber.by_day$day: Tuesday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25950   65120   72500   72030   80110   97590 
## -------------------------------------------------------- 
## uber.by_day$day: Wednesday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60380   68240   73740   75750   84940   93710 
## -------------------------------------------------------- 
## uber.by_day$day: Thursday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   58040   74510   83330   83060   93250  102000 
## -------------------------------------------------------- 
## uber.by_day$day: Friday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40290   81860   88890   87790   99640  103100 
## -------------------------------------------------------- 
## uber.by_day$day: Saturday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   59840   81740   93360   92870  101700  136200 
## -------------------------------------------------------- 
## uber.by_day$day: Sunday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   41940   65180   75090   75110   88830   96950

Measured by range (maximum minus minimum), Saturday, Tuesday and Friday had the largest variation in pickups per day while Wednesday had the smallest.
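The spread can be checked directly from the summaries above. The sketch below copies the minimum, quartile and maximum values from that output into a data frame (the name `day_summary` is made up) and computes both the range and the interquartile range (IQR).

```r
# Min, 1st Qu., 3rd Qu. and Max. copied from the per-day summaries above.
day_summary <- data.frame(
  day = c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday"),
  min = c(43680, 25950, 60380,  58040,  40290,  59840, 41940),
  q1  = c(56860, 65120, 68240,  74510,  81860,  81740, 65180),
  q3  = c(74240, 80110, 84940,  93250,  99640, 101700, 88830),
  max = c(93500, 97590, 93710, 102000, 103100, 136200, 96950),
  stringsAsFactors = FALSE
)

day_summary$range <- day_summary$max - day_summary$min
day_summary$iqr   <- day_summary$q3 - day_summary$q1

# Days ordered from most to least variable under each measure.
by_range <- day_summary$day[order(day_summary$range, decreasing = TRUE)]
by_iqr   <- day_summary$day[order(day_summary$iqr, decreasing = TRUE)]
```

The two measures disagree. By range the order starts Saturday, Tuesday, Friday and ends with Wednesday, but by IQR Sunday is the most variable day (23,650) and Tuesday the least (14,990). Tuesday's and Friday's wide ranges come from unusually low minimums, so the range is sensitive to the extreme days.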

Monthly Pickups Colored by Day

This graph gave me some of the most interesting information about overall rider behavior. It shows both long-term increases in demand as well as more cyclical patterns during the week. Demand started small in January with most days averaging around 40,000 - 70,000 pickups. February saw an uptick in the number of 70,000+ days and June consistently reached over 100,000 rides each weekend. It is unclear what exactly caused the increase in demand, but it would be interesting to examine year-over-year Uber data to see how rider behavior changes from winter to spring.

Weekly patterns were also very interesting. Each week saw the same (relative) pattern. Monday was usually the least popular day, with demand rising steadily before climaxing on Saturday. Interestingly, Sunday was a more popular day than Monday, which could have been partly due to the extra runoff when Saturday night turned into Sunday early morning.

Understanding this weekly cycle piqued my interest in weeks that did not follow the pattern. The week of May 17-24 was particularly interesting because the climax was on a Thursday and demand actually dropped on Friday and Saturday. I tried looking up historical data from the area during that time period but was unable to find anything to explain the strange shift in rider behavior.

This also helped identify outliers. Of particular note were Saturday, May 16 and Saturday, June 27. Both days saw demand surge well past 100,000 pickups, to roughly 121,600 and 136,200 respectively. It seemed worth investigating to understand why ridership increased so much and possibly how to replicate those results. The section below dives into May and June in more depth.

May and June In-Depth

I wanted to understand how demand changed throughout May and June. In particular, I was interested in whether higher demand days had peaks that were fundamentally different or whether they resulted from the same demand curves. May 16th started the same as any other Saturday, but instead of demand plateauing around 7pm, it continued to rise through 11pm. This spike in demand was most likely due to the festivals taking place in the area that weekend. I was able to find a notice from the NYC Police Department announcing street closures that weekend due to a number of local festivals and a marathon taking place.

The same patterns emerged on June 27th. It started the same as any other Saturday but demand spiked in the afternoon, topping 10,000 rides for a single hour at 11pm. I was able to find another notice from the NYC Police Department that seemed to explain the huge surge in pickups. Another mixture of local festivals and a marathon was responsible for the increase. And again, demand on Sunday, Monday and Tuesday afterwards was relatively unchanged. I’d be very curious to see whether demand was affected on Friday, July 3rd and Saturday, July 4th. I feel conflicted because demand should probably drop on those days but with July 4th being a major holiday I would expect a good-sized surge in ridership. Unfortunately, there was no available data for the month of July.

Seeing these patterns led me to believe there may be at least two distinct populations of riders. There was the Monday-Friday crowd that regularly used the service to get to work but there was also a second, wilder population that liked to party when night fell and on the weekends. This second population was noticeably absent on May 22 and May 23, probably because they were spent from the weekend before. Demand did not suddenly drop off on the Sunday-Thursday of the following week, but it did on the following Friday and Saturday. The expected peaks from 6pm - 11pm simply did not show up on May 22 or May 23.

May and June Outliers

I wanted to take a more in-depth look at how ridership changed on the two biggest Saturdays in the dataset (May 16th and June 27th). I was also interested in how it changed on the lowest Saturday (May 23rd). All 3 days had above-average turnout in the early morning, but it’s what happened later in the day that had the biggest impact on the numbers. While demand dropped below the average on May 23rd after 10am, it was consistently above normal throughout the day on both May 16th and June 27. This makes sense since demand increases throughout the day after 10am on Saturdays in general.

Location Boxplot

## uber_known.by_loc$Borough: Bronx
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     156    3056    4327    5242    6485   15840 
## -------------------------------------------------------- 
## uber_known.by_loc$Borough: Brooklyn
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     290    8409   15730   38070   54610  217900 
## -------------------------------------------------------- 
## uber_known.by_loc$Borough: Manhattan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      12   50280  133600  150300  251500  460700 
## -------------------------------------------------------- 
## uber_known.by_loc$Borough: Queens
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       8    3103    5571   19610   12370  289200 
## -------------------------------------------------------- 
## uber_known.by_loc$Borough: Staten Island
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    38.5   316.5   348.0   619.0   925.0

There was a lot of variation in the location data. I binned the data by Zone to create the boxplot. Overall, Staten Island had very few pickups per zone, with most falling in the low 100’s. Manhattan had the most pickups with some zones nearing 500,000. What made some of this data unusual was the distribution of pickups within each Borough. For instance, Manhattan is such a busy area I expected huge numbers for all of its zones, but some had as few as 12 pickups. This makes me wonder how Uber chose to aggregate their location data. Maybe there are some neighborhoods or remote areas of Manhattan that don’t get much foot traffic at all. I used a log scale for the y axis because the data was so dispersed.

Bivariate Analysis

The May and June faceted histograms helped me understand demand changes over time. Although it was not a particularly eventful day, May 25th caught my attention because it didn’t look like any other Monday in the graph. The usual 8am peak was curiously absent and although it was the least busy day in May and June, it started higher than all of the other Mondays. It turns out that was Memorial Day. Again, not a particularly interesting day but still helpful for extrapolating information about other Monday holidays.

Ridership varied dramatically from week to week over the 6 month period but a few patterns remained constant. Demand was consistently greatest on Friday and Saturday of each week with few exceptions. There were even a few super Saturdays that rose above others. I think that Uber and their drivers could really benefit from taking a closer look at some of this data. For instance, the two largest peaks both happened to coincide with large festivals and marathons. Uber as an organization could benefit from forming city partnerships to promote more activities as they seem to lead to surges in demand.

Uber drivers could also benefit from some of this data. As a former driver, I understand the struggle to create a feasible work schedule. Because drivers are independent contractors, they set their own hours, so they are entirely responsible for how much they can earn each day. This can also lead to burnout, however, because it’s often difficult to predict demand without experience.

Multivariate Plots Section

Map Scatter Plot

Most rides originated from Manhattan, with a few outliers in Brooklyn and Queens. I pulled the latitude and longitude data using the Google Maps API and merged them with the Uber pickup data to generate the map. I used ggmap to create the plot and grabbed the background from Stamen. I chose the black and white (toner) version to help the data points stand out more.

Heatmap by month and date

## uber.by_day$month: January
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25950   56960   63380   63030   73190   94450 
## -------------------------------------------------------- 
## uber.by_day$month: February
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   63310   72250   79560   80840   89840  102800 
## -------------------------------------------------------- 
## uber.by_day$month: March
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   53920   63910   73950   72900   79080   95270 
## -------------------------------------------------------- 
## uber.by_day$month: April
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   53560   68240   75960   76030   82440  106300 
## -------------------------------------------------------- 
## uber.by_day$month: May
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   59750   78390   87360   86950   95110  121600 
## -------------------------------------------------------- 
## uber.by_day$month: June
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   74430   87770   93340   93900  100900  136200

This heatmap highlights overall changes in ridership over time. It is easy to see the busiest (June 27th) and slowest (Jan 27th) days during the 6-month time period. I used dplyr’s group_by and summarise functions to bin the ride data by date, then used geom_tile() to create the heatmap. I played around with scale_fill_gradientn() to get an appropriate-looking color palette. I had to reverse the scaling because originally the higher values were blue and the lower ones were red.
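The aggregation behind a heatmap like this can be sketched as follows. The report used dplyr's group_by and summarise, while this sketch uses base R's `aggregate()` as a stand-in so it runs without extra packages. The plotting step is only built if ggplot2 is installed. The toy rows, the three-color palette and the names `by_date`, `pal` and `heat` are assumptions for illustration.

```r
# Toy stand-in: one row per pickup with its month and day of month.
uber <- data.frame(
  month = factor(c("January", "January", "January", "June", "June"),
                 levels = c("January", "June"), ordered = TRUE),
  date  = c(27, 27, 28, 27, 27)
)
uber$pickups <- 1L

# Base-R counterpart of grouping by month and date and counting rows:
# one output row per (month, date) pair, which becomes one heatmap tile.
by_date <- aggregate(pickups ~ month + date, data = uber, FUN = sum)

# Colors listed from low to high; wrap in rev() if the scale comes out backwards.
pal <- c("blue", "white", "red")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  heat <- ggplot2::ggplot(by_date,
                          ggplot2::aes(x = date, y = month, fill = pickups)) +
    ggplot2::geom_tile() +
    ggplot2::scale_fill_gradientn(colours = pal)
}
```

Because `scale_fill_gradientn()` maps the first color to the lowest value, reversing the vector is enough to flip which end of the scale is red.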

Pickups per hour by Day Scatterplots

## uber.by_hour$day: Monday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     217    1638    2808    2720    3688    6808 
## -------------------------------------------------------- 
## uber.by_hour$day: Tuesday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     159    1723    3048    3040    4284    8394 
## -------------------------------------------------------- 
## uber.by_hour$day: Wednesday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     231    1893    3134    3156    4408    8075 
## -------------------------------------------------------- 
## uber.by_hour$day: Thursday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     323    1920    3299    3461    4939    7980 
## -------------------------------------------------------- 
## uber.by_hour$day: Friday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     366    2118    3416    3658    5037    8741 
## -------------------------------------------------------- 
## uber.by_hour$day: Saturday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     548    2088    3690    3869    5462   10650 
## -------------------------------------------------------- 
## uber.by_hour$day: Sunday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     565    1915    3181    3134    4148    8980

The above graphs show the distribution of rides per hour for all dates in the 6-month time period. The top graph is colored by day, which highlights how demand rose along a gradient from Sunday (yellow) through Saturday (dark red). I threw in a smoothing curve to show the (mean) average demand for all days. The bottom graph helps distinguish the two types of demand curves for Monday - Friday vs Saturday and Sunday. Finally, the table shows the distribution of rides per hour broken down by day. Saturday had the largest range of pickups per hour while Monday had the smallest.

Weekday vs Weekend Pickups in Each Borough

As discussed in the previous section, there seems to be more than one demand curve for each day. I decided to try to tease the patterns apart. What I found pointed to not two, but 3 distinct demand curves. Each density plot is broken down by borough. Most boroughs followed roughly the same trend, although Staten Island was consistently unique. This may have been due to the lower number of overall rides (under 10,000 total vs 100,000-10,000,000 for other boroughs).

The Monday-Friday plot shows 3 peaks of increasing size starting at midnight. Although the second (8am) peak was pretty evenly distributed across the boroughs, the final peak showed a large amount of variance. It seems Manhattan passengers went out a lot in the evening while Bronx passengers seemed to be more like homebodies, with a less pronounced evening surge.

The weekend told a different story for the 5 boroughs. Saturday consisted of two peaks of increasing size in the early morning and in the evening. Demand seemed to drop sharply after the first peak and into the wee hours of the morning before climbing slowly throughout the day. Sunday was almost the inverse of Saturday. The major peak occurred in the early morning while the minor peak occurred in the afternoon / evening.

Multivariate Analysis

These graphs complement each other well to paint a clear picture of rider habits. The map scatterplot highlights the most popular locations while the colored pickups by hour scatterplots show how rider behavior changes over time. I think Uber should share more of this type of data with drivers. It takes a lot of the guesswork out of driving and would eliminate a lot of regret over missed opportunities. Having a better understanding of rider behavior would help drivers plan a more concrete schedule.

I was surprised that February was so busy. Even though it is the shortest month, more people used the service each day on average than in March or April. It makes sense that Valentine’s Day was so busy, but I wonder what caused the uptick the following weekend (Feb 20th and 21st). I would have expected that after the busy Valentine’s Day weekend, demand would drop the following weekend. I tried to run a Google Trends query for the timeframe and region, but the results were inconclusive.

Final Plots and Summary

New York City Map Scatterplot

Description

I chose this plot because it gives an easy-to-understand visualization of the distribution of pickups by location. We can see from the map that Manhattan was by far the most popular region, followed by remote parts of Brooklyn and Queens. The dark red circle in the lower right corner is JFK Airport and the one at the top of Queens is LaGuardia. I coded each dot to represent the number of pickups using both size and color because it was more helpful than either attribute alone. I chose to hide the size guide because it seemed redundant. I also transformed the color and size to a log scale because the range of pickups was massive (between 2 and 460,732 pickups). Finally, I played around with the bias in colorRampPalette to stretch the range of colors for larger numbers. This helped highlight how the number of pickups seems to diminish the further you are from the Manhattan epicenter.
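As a rough sketch of the log transform and the palette bias, the code below maps a handful of zone totals onto a 100-color ramp. The five counts are taken from the borough summaries earlier in this report (so most are rounded), while the color names, the bias value of 2 and the variable names are arbitrary choices for illustration.

```r
# Zone totals spanning the reported range, from the borough summaries above.
pickups <- c(2, 925, 15840, 217900, 460732)

# On a log scale the jump from 2 to 925 is as visible as the jump
# from 925 to the hundreds of thousands.
log_pickups <- log10(pickups)

# Rescale the logged counts to 1..100 so each zone indexes into a 100-color ramp.
idx <- 1 + round(99 * (log_pickups - min(log_pickups)) / diff(range(log_pickups)))

# bias is passed through to colorRamp(); values above 1 space the colors
# more widely at the high end. The value 2 is arbitrary.
ramp <- colorRampPalette(c("yellow", "red", "darkred"), bias = 2)(100)
zone_colour <- ramp[idx]
```

Changing `bias` and replotting is a quick way to see how much of the ramp is spent on the busiest zones.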

Monthly Pickups Colored by Day

Description

This plot gave me the most ideas when performing my exploratory data analysis. It captured a lot of the subtle and not-so subtle patterns along weekly and monthly cycles. I colored the bars by day because it highlighted how demand rose throughout the week and edited the color palette so that all bars were clearly visible. Finally, I changed the labels from scientific notation to decimal notation to make them more human-readable.
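One way to get decimal labels is a small formatting function like the one below. It is a base-R sketch with a made-up name, `label_plain`, and not necessarily what this plot used. Functions of this shape can be supplied as the `labels` argument of a ggplot2 continuous scale, and `scales::comma` is a common ready-made alternative.

```r
# Format numbers with thousands separators and no scientific notation.
label_plain <- function(x) {
  format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
}

label_plain(c(50000, 100000, 136200))
# "50,000" "100,000" "136,200"
```

The `trim = TRUE` argument stops `format()` from padding the shorter labels with leading spaces.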

Pickups Per Hour with Daily Median Lines

Description

The Pickups by Hour with Daily Median graph was interesting because it showed key differences between weekdays and weekends. There are at least two distinct demand curves at play here. Saturday and Sunday ridership decreased from midnight to 6am, then steadily rose throughout the rest of the day. Demand on these days also started higher at midnight because people usually go out more on Friday and Saturday nights. Monday through Friday demand decreased from midnight to 3am, then abruptly rose as people started waking up to get to work. It was interesting to see how tight and consistent the 8am rush was, as opposed to the 6pm rush when some passengers went home while others headed to the bar.

Reflection

This was an eye-opening project for me. As a former Uber driver I was curious to see how demand changed over time and varied by location. Being in the right place at the right time can be the difference between a $2000 week and a $200 one. It was fun testing my intuitions against the data even when I was wrong. I had no idea that a Thursday or even a Wednesday could be nearly as profitable as a Friday under the right conditions. I was also surprised to learn that demand spiked so much around 6pm and that there really wasn’t much of a lunch rush Monday-Friday.

People need data to make more informed choices. When LinkedIn started sharing even simple data (who’s looked at your profile and profile tips), their popularity skyrocketed. This seems like a great opportunity to empower more drivers. Helping drivers perform their jobs more effectively could be a win for all sides. If more drivers were more aware of the cyclical patterns in their area they would be better equipped to maximize their time and profit. Drivers would be more satisfied, leading to lower turnover and passengers would be better served by more experienced (and perhaps friendlier) drivers.

There is a lot of data Uber tracks that was not present in this dataset. This includes but is not limited to:

Pickup requests that went unanswered
The amount charged per pickup
Surge information for each pickup
The number of times drivers ignored requests
Personal information about passengers and drivers
Passenger and driver rating information for each pickup

Each of the above holds valuable information about driver and passenger behavior. Analyzing the number of pickup requests that went unanswered could help improve Uber’s overall efficiency. For example, there may be time periods when there should be more drivers on the road or the opposite, when an area is supersaturated and drivers are less likely to find an available passenger. This, in conjunction with the surge information for each pickup could help unpack how Uber calculates their surge multiplier. Further analysis could help improve overall satisfaction for both passengers and drivers.

Pricing data could shed a lot of light on both passengers and drivers. Looking at the amount charged per pickup as well as any available surge information could reveal hidden trends. Do customers avoid Uber past a certain surge multiplier? Are they affected at all by the surge? Do drivers flock to surge areas or are they mostly stationary? Pricing data opens up many possibilities for exploration. We could look at whether it affects trip duration, the number of subsequent trips, or even the differences in price sensitivity between different locations. In addition, it would be interesting to compare passenger ratings for surge trips vs non-surge trips. With enough variables, we could even predict future pickup demand. I’m curious how close a machine learning algorithm could get to predicting surges or lulls. There is a lot we are missing out on here.

The biggest struggle was trying to capture as much information from each plot as efficiently as possible. There were many graphs I had to throw away because they really didn’t contribute much to the report. I also came across a few roadblocks while using ggmap. For instance, I originally tried to make a 2D histogram using stat_density2d() but was unable to because there were too many observations. All in all, I was very happy to go through this and I feel like the tools developed here can be applied to a wide variety of data exploration projects.

Sources

Uber: https://github.com/fivethirtyeight/uber-tlc-foil-response
ggmap: D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Exploratory Data Analysis of Uber Pickups

Courtney Ferguson Lee

May 30, 2017
