This project investigates an Uber dataset of New York City pickups from January through June of 2015. The Taxi & Limousine Commission (TLC) released the data after receiving a Freedom of Information Law (FOIL) request from FiveThirtyEight on June 22, 2015. I pulled the data from the FiveThirtyEight GitHub repo. Each observation represents a single pickup with the following features:
Variable | Definition |
---|---|
Dispatching_base_num | TLC base company code that dispatched the Uber |
Pickup_date | Full date of the pickup in yyyy-mm-dd h:m:s format |
Affiliated_base_num | TLC base company code of the Uber pickup |
locationID | Pickup location ID of the Uber pickup |
Borough | New York City Borough where the pickup took place |
Zone | Neighborhood in the New York City Borough where the pickup took place |
lat | Latitude of the pickup Zone |
lon | Longitude of the pickup Zone |
month | Month in which the pickup took place |
date | Day of the month (1-31) on which the pickup took place |
day | Day of the week (Monday-Sunday) on which the pickup took place |
hour | Hour of the day (0-23) in which the pickup took place |
I was most interested in how ridership varied over time and location. I wanted to know how it changed by hour, week, day and month. I also wanted to know what the most popular Boroughs and Zones were.
The Uber dataset consists of 14 million observations of pickup data spread across 12 variables. Originally there were only 4 variables (dispatch base number, pickup date, affiliated base number, and location ID), but I joined that with taxi lookup data to get Zone and Borough names for each location ID. I also wanted to map each location to a set of coordinates, so I used Google’s API to get the latitude and longitude for each zone. The Google API limits users to 2,500 queries per 24-hour period, so I had to be strategic when joining the lookup data to the ridership data.
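One way to stay under a query limit like this is to geocode each distinct zone once and then join the coordinates back onto the rides. The sketch below is only an illustration of that idea, not the code used for this report: the toy tables are made up, and `geocode_zone()` is a hypothetical placeholder that returns dummy coordinates instead of calling any real API.

```r
library(dplyr)

# Toy stand-ins for the real tables; all values are made up for illustration.
rides <- data.frame(
  locationID = c(161, 231, 161, 79),
  Pickup_date = c("2015-05-17 09:47:00", "2015-05-17 09:50:00",
                  "2015-05-18 18:02:00", "2015-05-19 23:15:00"),
  stringsAsFactors = FALSE
)
zone_lookup <- data.frame(
  locationID = c(79, 161, 231),
  Borough = "Manhattan",
  Zone = c("East Village", "Midtown Center", "TriBeCa/Civic Center"),
  stringsAsFactors = FALSE
)

# Hypothetical geocoder: a placeholder for whatever API call returns
# coordinates. It returns fixed dummy values so the sketch runs offline.
geocode_zone <- function(zone) {
  data.frame(Zone = zone, lat = 40.7, lon = -74.0, stringsAsFactors = FALSE)
}

# Geocode each distinct zone once (a few hundred queries at most)
# instead of once per ride (millions of queries).
zone_coords <- bind_rows(lapply(unique(zone_lookup$Zone), geocode_zone))

# Attach zone names first, then coordinates, to every ride.
uber <- rides %>%
  left_join(zone_lookup, by = "locationID") %>%
  left_join(zone_coords, by = "Zone")
```

With roughly 260 zones, the whole lookup fits comfortably inside a single day's quota.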
I parsed the month, date, day, and hour from each pickup timestamp using lubridate to get a more granular analysis of patterns over time. I also binned the pickup data by location and time for some of the histograms and boxplots. I had a dilemma when it came to the taxi lookup dataset because some of the location data points were missing. It was a small subset (about 6,000 observations) but I did not want to throw away the data. Instead I kept all observations for the time period analyses and excluded the ones with missing locations from the location analyses.
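A minimal sketch of that parsing step with lubridate, using two made-up timestamps. The column names follow the variable table; parsing in the default UTC time zone and starting the week on Monday are assumptions made for the example.

```r
library(lubridate)

# Two illustrative timestamps in the same format as Pickup_date.
pickups <- data.frame(
  Pickup_date = c("2015-05-17 09:47:00", "2015-01-27 23:05:10"),
  stringsAsFactors = FALSE
)

# Parse the text into date-times (UTC by default).
pickups$Pickup_date <- ymd_hms(pickups$Pickup_date)

# Derive the four time columns from the parsed timestamp.
pickups$month <- month(pickups$Pickup_date, label = TRUE, abbr = FALSE)
pickups$date  <- mday(pickups$Pickup_date)   # day of the month, 1-31
pickups$day   <- wday(pickups$Pickup_date, label = TRUE, abbr = FALSE,
                      week_start = 1)        # ordered factor, Monday first
pickups$hour  <- hour(pickups$Pickup_date)   # 0-23
```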
Since the dataset was so large, I ran most tests on a sample of 100,000 randomly selected observations before making the final plots on the population. This saved a lot of time since some plots took a couple minutes (each) to run.
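Drawing a random sample like this takes one line in base R. The sketch below uses a made-up stand-in data frame, and the seed is arbitrary and only there to make the draw reproducible.

```r
set.seed(2015)  # arbitrary seed so the sample can be reproduced

# Stand-in for the full data frame; the real one has about 14 million rows.
uber <- data.frame(
  id = seq_len(1e6),
  hour = sample(0:23, 1e6, replace = TRUE)
)

# Draw 100,000 row indices without replacement and keep those rows.
uber_sample <- uber[sample(nrow(uber), 1e5), ]
```

Plots can then be prototyped on `uber_sample` and rerun on the full data once they look right.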
## [1] 14270479 12
## 'data.frame': 14270479 obs. of 12 variables:
## $ Dispatching_base_num: Factor w/ 8 levels "B02512","B02598",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Pickup_date : POSIXct, format: "2015-05-17 09:47:00" "2015-05-17 09:47:00" ...
## $ Affiliated_base_num : Factor w/ 285 levels "","B00013","B00014",..: 188 188 188 247 188 188 188 239 188 239 ...
## $ locationID : Factor w/ 262 levels "1","2","3","4",..: 139 65 100 80 90 225 7 74 246 22 ...
## $ Borough : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 2 4 2 4 2 5 4 4 2 ...
## $ Zone : Factor w/ 260 levels "Allerton/Pelham Gardens",..: 137 62 97 77 87 224 5 71 246 20 ...
## $ lat : num 40.8 40.7 40.8 40.7 40.7 ...
## $ lon : num -74 -74 -74 -73.9 -74 ...
## $ hour : int 9 9 9 9 9 9 9 9 9 9 ...
## $ day : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 7 7 7 7 7 7 7 7 7 7 ...
## $ month : Ord.factor w/ 6 levels "January"<"February"<..: 5 5 5 5 5 5 5 5 5 5 ...
## $ date : int 17 17 17 17 17 17 17 17 17 17 ...
## Dispatching_base_num Pickup_date Affiliated_base_num
## B02764 :5753653 Min. :2015-01-01 00:00:05 B02764 :4352321
## B02682 :3484530 1st Qu.:2015-02-21 03:00:16 B02682 :3448698
## B02617 :2068525 Median :2015-04-10 16:21:00 B02617 :1946933
## B02598 :1526660 Mean :2015-04-07 15:04:13 B02598 :1287723
## B02765 :1152727 3rd Qu.:2015-05-23 03:53:00 B02765 :1038379
## B02512 : 255772 Max. :2015-06-30 23:59:00 B02512 : 188112
## (Other): 28612 (Other):2008313
## locationID Borough
## 161 : 460732 Bronx : 220146
## 231 : 420356 Brooklyn : 2322000
## 234 : 419045 EWR : 105
## 79 : 407591 Manhattan :10371060
## 249 : 323989 Queens : 1343945
## 230 : 315919 Staten Island: 6959
## (Other):11922847 Unknown : 6264
## Zone lat lon
## Midtown Center : 460732 Min. :40.53 Min. :-74.24
## TriBeCa/Civic Center : 420356 1st Qu.:40.72 1st Qu.:-74.00
## Union Sq : 419045 Median :40.74 Median :-73.98
## East Village : 407591 Mean :40.74 Mean :-73.97
## West Village : 323989 3rd Qu.:40.76 3rd Qu.:-73.96
## Times Sq/Theatre District: 315919 Max. :40.90 Max. :-73.71
## (Other) :11922847
## hour day month date
## Min. : 0.00 Monday :1694252 January :1953801 Min. : 1.0
## 1st Qu.: 9.00 Tuesday :1872902 February:2263620 1st Qu.: 8.0
## Median :16.00 Wednesday:1893811 March :2259773 Median :16.0
## Mean :14.09 Thursday :2159598 April :2280837 Mean :15.9
## 3rd Qu.:20.00 Friday :2282571 May :2695553 3rd Qu.:23.0
## Max. :23.00 Saturday :2414563 June :2816895 Max. :31.0
## Sunday :1952782
##
## 0 1 2 3 4 5 6 7 8
## 602178 394510 260603 183655 173038 193523 288533 443543 583348
## 9 10 11 12 13 14 15 16 17
## 593437 520092 516716 533021 537909 584463 649414 737170 863990
## 18 19 20 21 22 23
## 987093 1007464 948574 930462 922954 814789
The graph above shows the total number of pickups per hour. There are three notable peaks: around midnight (0), during the morning rush (8-9am), and at 7pm. There is also a small bump around noon for lunch.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 159 1892 3182 3293 4478 10650
The above graphs show the distribution of pickups per hour. The shape looks bimodal, with a major peak around 3,200 pickups and a minor peak around 1,000 pickups. I wasn’t sure about the minor peak, so I switched the x-axis to a log scale and found that there really seems to be just one large peak near 3,200.
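Switching a histogram to a log-scaled x-axis takes one extra layer in ggplot2. This is a sketch on simulated counts, not the real hourly data; the data frame `by_hour`, its column `n`, and the bin count are assumptions for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated right-skewed counts standing in for pickups per hour.
by_hour <- data.frame(n = round(rlnorm(1000, meanlog = 8, sdlog = 0.6)))

# Histogram on the original (linear) scale.
p_linear <- ggplot(by_hour, aes(x = n)) +
  geom_histogram(bins = 50)

# Same histogram with a log10 x-axis; bins are computed on the log scale.
p_log <- p_linear + scale_x_log10()
```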
##
## Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## 1694252 1872902 1893811 2159598 2282571 2414563 1952782
Pickups per day followed an expected trend, with demand rising from Monday through Friday, peaking on Saturday, and then falling back on Sunday.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25950 68330 78340 78840 90730 136200
The number of pickups per day was roughly normally distributed, centered near 78,000. There was a low outlier near 26,000 and a couple of high outliers above roughly 120,000.
##
## January February March April May June
## 1953801 2263620 2259773 2280837 2695553 2816895
Demand over this 6-month span went from a little under 2 million rides in January to nearly 3 million in June.
Oh wow, it looks like Manhattan is killing it in Uber pickups, but we can’t tell how many pickups Staten Island or the Bronx had. Let’s change this to a log scale.
##
## Bronx Brooklyn Manhattan Queens Staten Island
## 220146 2322000 10371060 1343945 6959
##
## Bronx Brooklyn EWR Manhattan Queens
## 220146 2322000 105 10371060 1343945
## Staten Island Unknown
## 6959 6264
Ah, much better. Manhattan had the most pickups by far at around 10,000,000 total, while Staten Island had the fewest (under 10,000 total). Below the graph are the tables of pickups in the full and filtered Uber datasets. I filtered the dataset for the location plots to exclude pickups with missing location data and pickups originating from New Jersey (EWR).
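A filter like this can be sketched in a few lines with dplyr. The toy data frame is made up; `uber_known` follows the name that appears in the printed summaries, and dropping the unused factor levels afterwards is an assumption about what keeps the tables and plot legends clean.

```r
library(dplyr)

# Toy data with the same Borough labels as the full dataset.
uber <- data.frame(
  Borough = factor(c("Manhattan", "Brooklyn", "EWR",
                     "Unknown", "Queens", "Manhattan"))
)

# Keep only pickups with a known New York City borough,
# then drop the now-empty "EWR" and "Unknown" factor levels.
uber_known <- uber %>%
  filter(!Borough %in% c("EWR", "Unknown")) %>%
  droplevels()

table(uber_known$Borough)
```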
The most diverse distributions came from the location data. Some boroughs only had a few hundred pickups while others were in the tens of thousands and even hundreds of thousands. This was interesting because, as I saw later, these patterns did not change over the course of the week. Manhattan was the busiest location during the week and on the weekends.
Even more unusual was the fact that Manhattan (pop 1.6 million) is not the most populous Borough in New York. Brooklyn (2.6 million) and Queens (2.3 million) have considerably larger populations. By contrast, the population density of Manhattan is twice that of Brooklyn and triple that of Queens. This could be worth investigating for Uber as they expand into more markets. It would be interesting to look at ridership data for other densely populated cities to look for correlations.
I used the lubridate, dplyr and tidyr packages to reshape the data. I parsed the Pickup_date variable with lubridate to create the month, date, day and hour columns. Then I grouped and summarised observations to bin the rides by time period and location. One day I hope to shake Hadley Wickham’s hand for what he has done for the data science community. He is truly a prolific developer and I feel a certain level of comfort when I see his name in a package’s documentation.
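The grouping step might look something like the sketch below. The table names `uber.by_day` and `uber.by_hour` match the ones in the printed summaries, but the toy rows and the count column name `n` are assumptions for illustration.

```r
library(dplyr)

# Toy rides with the time columns already parsed.
uber <- data.frame(
  month = c("May", "May", "May", "June", "June"),
  date  = c(16, 16, 17, 27, 27),
  hour  = c(22, 23, 9, 23, 23),
  stringsAsFactors = FALSE
)

# One row per calendar day with the number of pickups on that day.
uber.by_day <- uber %>%
  group_by(month, date) %>%
  summarise(n = n(), .groups = "drop")

# One row per calendar day and hour; count() is shorthand for
# group_by() followed by summarise(n = n()).
uber.by_hour <- uber %>%
  count(month, date, hour, name = "n")
```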
The 2D density plot allows us to see at a glance the distribution of rides per hour for all days. I used the viridis package to color the points. The black areas show regions of lower density while the green and yellow areas show higher density. We can start to see at least two demand curves emerge. One starts lower at midnight and has two distinct peaks while the other starts a bit higher and seems to have one prominent peak. Curiously, both curves seem to intersect at around 4am.
This shows the distribution of pickups throughout the week. It’s easier to see some of the more nuanced patterns that emerge between the days. For instance, the 8am peak is only prevalent Monday - Friday, most likely due to morning rush hour traffic. The evening climax shifts slightly from 6 pm to 7 pm throughout the week and its duration lengthens as passengers edge closer to the weekend. Instead of dropping off promptly after work, demand extends through midnight as the night life heats up.
This shift in rider behavior doesn’t begin abruptly on Friday or even Thursday night as I would expect, but begins as early as Tuesday as workers loosen their ties in the evening. Interestingly, the 8am peak remains unchanged throughout the week at a steady 100,000. People are not rushing to work, but they are rushing to the bar in larger numbers as the week presses on.
Saturday and Sunday behavior removes the 8am peak for obvious reasons and instead, demand grows steadily from 6am until midnight. The growth on Saturday extends through this entire time period while Sunday demand tapers off around 6pm as riders prepare for the coming work week.
## uber.by_day$day: Monday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43680 56860 64040 65160 74240 93500
## --------------------------------------------------------
## uber.by_day$day: Tuesday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25950 65120 72500 72030 80110 97590
## --------------------------------------------------------
## uber.by_day$day: Wednesday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60380 68240 73740 75750 84940 93710
## --------------------------------------------------------
## uber.by_day$day: Thursday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58040 74510 83330 83060 93250 102000
## --------------------------------------------------------
## uber.by_day$day: Friday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40290 81860 88890 87790 99640 103100
## --------------------------------------------------------
## uber.by_day$day: Saturday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59840 81740 93360 92870 101700 136200
## --------------------------------------------------------
## uber.by_day$day: Sunday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41940 65180 75090 75110 88830 96950
Tuesday, Friday and Saturday had the largest range of pickups per day (maximum minus minimum) while Wednesday had the smallest.
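The printed summaries above have the layout that base R's `by()` produces, and the spread per day can be checked the same way with `tapply()`. In this sketch the data frame holds only three illustrative values per day (the minimum, median and maximum from the table above for Monday and Wednesday), and the column name `n` is an assumption.

```r
# Illustrative daily totals: min, median and max for two of the days.
uber.by_day <- data.frame(
  day = factor(rep(c("Monday", "Wednesday"), each = 3),
               levels = c("Monday", "Wednesday")),
  n = c(43680, 64040, 93500,
        60380, 73740, 93710)
)

# Five-number summary (plus mean) of daily pickups for each day of the week.
by(uber.by_day$n, uber.by_day$day, summary)

# Range (max minus min) per day, a simple measure of spread.
day_range <- tapply(uber.by_day$n, uber.by_day$day,
                    function(x) diff(range(x)))
day_range
```

Here Monday's range works out to 49,820 pickups and Wednesday's to 33,330, consistent with Wednesday being the least variable day.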
This graph gave me some of the most interesting information about overall rider behavior. It shows both long-term increases in demand as well as more cyclical patterns during the week. Demand started small in January with most days averaging around 40,000 - 70,000 pickups. February saw an uptick in the number of 70,000+ days and June consistently reached over 100,000 rides each weekend. It is unclear what exactly caused the increase in demand, but it would be interesting to examine year-over-year Uber data to see how rider behavior changes from winter to spring.
Weekly patterns were also very interesting. Each week saw the same (relative) pattern. Monday was usually the least popular day, with demand rising steadily before climaxing on Saturday. Interestingly, Sunday was a more popular day than Monday, which could have been partly due to the extra runoff when Saturday night turned into Sunday early morning.
Understanding this weekly cycle piqued my interest in weeks that did not follow the pattern. The week of May 17-24 was particularly interesting because the climax was on a Thursday and demand actually dropped on Friday and Saturday. I tried looking up historical data from the area during that time period but was unable to find anything to explain the strange shift in rider behavior.
This also helped identify outliers. Of particular note were Saturday, May 16 and Saturday, June 27. Both days saw demand surge past 120,000 pickups. It seemed worth investigating to understand why ridership increased so much and possibly how to replicate those results. The section below dives into May and June in more depth.
I wanted to understand how demand changed throughout May and June. In particular, I was interested in whether higher demand days had peaks that were fundamentally different or whether they resulted from the same demand curves. May 16th started the same as any other Saturday, but instead of demand plateauing around 7pm, it continued to rise through 11 pm. This spike in demand was most likely due to the festivals taking place in the area that weekend. I was able to find a notice from the NYC Police Department announcing street closures that weekend due to a number of local festivals and a marathon taking place.
The same patterns emerged on June 27th. It started the same as any other Saturday but demand spiked in the afternoon, topping 10,000 rides for a single hour at 11pm. I was able to find another notice from the NYC Police Department that seemed to explain the huge surge in pickups. Another mixture of local festivals and a marathon was responsible for the increase. And again, demand on Sunday, Monday and Tuesday afterwards was relatively unchanged. I’d be very curious to see whether demand was affected on Friday, July 3rd and Saturday, July 4th. I feel conflicted because demand should probably drop on those days but with July 4th being a major holiday I would expect a good-sized surge in ridership. Unfortunately, there was no available data for the month of July.
Seeing these patterns led me to believe there may be at least two distinct populations of riders. There was the Monday-Friday crowd that regularly used the service to get to work but there was also a second, wilder population that liked to party when night fell and on the weekends. This second population was noticeably absent on May 22 and May 23, probably because they were spent from the weekend before. Demand did not suddenly drop off on the Sunday-Thursday of the following week, but it did on the following Friday and Saturday. The expected peaks from 6pm - 11pm simply did not show up on May 22 or May 23.
I wanted to take a more in-depth look at how ridership changed on the two biggest Saturdays in the dataset (May 16th and June 27th). I was also interested in how it changed on the lowest Saturday (May 23rd). All 3 days had above-average turnout in the early morning, but it’s what happened later in the day that had the biggest impact on the numbers. While demand dropped below the average on May 23rd after 10am, it was consistently above normal throughout the day on both May 16th and June 27th. This makes sense since demand increases throughout the day after 10am on Saturdays in general.
## uber_known.by_loc$Borough: Bronx
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 156 3056 4327 5242 6485 15840
## --------------------------------------------------------
## uber_known.by_loc$Borough: Brooklyn
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 8409 15730 38070 54610 217900
## --------------------------------------------------------
## uber_known.by_loc$Borough: Manhattan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12 50280 133600 150300 251500 460700
## --------------------------------------------------------
## uber_known.by_loc$Borough: Queens
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 3103 5571 19610 12370 289200
## --------------------------------------------------------
## uber_known.by_loc$Borough: Staten Island
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 38.5 316.5 348.0 619.0 925.0
There was a lot of variation in the location data. I binned the data by Zone to create the boxplot. Overall, Staten Island had very few pickups per zone, with most falling in the low hundreds. Manhattan had the most pickups, with some zones nearing 500,000. What made some of this data unusual was the distribution of pickups within each Borough. For instance, Manhattan is such a busy area that I expected huge numbers for all of its zones, but some had as few as 12 pickups. This makes me wonder how Uber chose to aggregate their location data. Maybe there are some neighborhoods or remote areas of Manhattan that don’t get much foot traffic at all. I used a log scale for the y axis because the data was so dispersed.
The May and June faceted histograms helped me understand demand changes over time. Although it was not a particularly eventful day, May 25th caught my attention because it didn’t look like any other Monday in the graph. The usual 8am peak was curiously absent and although it was the least busy day in May and June, it started higher than all of the other Mondays. It turns out that was Memorial Day. Again, not a particularly interesting day but still helpful for extrapolating information about other Monday holidays.
Ridership varied dramatically from week to week over the 6 month period but a few patterns remained constant. Demand was consistently greatest on Friday and Saturday of each week with few exceptions. There were even a few super Saturdays that rose above others. I think that Uber and their drivers could really benefit from taking a closer look at some of this data. For instance, the two largest peaks both happened to coincide with large festivals and marathons. Uber as an organization could benefit from forming city partnerships to promote more activities as they seem to lead to surges in demand.
Uber drivers could also benefit from some of this data. As a former driver, I understand the struggle to create a feasible work schedule. Because drivers are independent contractors, they set their own hours, so they are entirely responsible for how much they can earn each day. This can also lead to burnout, however, because it’s often difficult to predict demand without experience.
Most rides originated from Manhattan, with a few outliers in Brooklyn and Queens. I pulled the latitude and longitude data using the Google Maps API and merged them with the Uber pickup data to generate the map. I used ggmap to create the plot and grabbed the background from Stamen. I chose the black and white (toner) version to help the data points stand out more.
## uber.by_day$month: January
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25950 56960 63380 63030 73190 94450
## --------------------------------------------------------
## uber.by_day$month: February
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63310 72250 79560 80840 89840 102800
## --------------------------------------------------------
## uber.by_day$month: March
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 53920 63910 73950 72900 79080 95270
## --------------------------------------------------------
## uber.by_day$month: April
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 53560 68240 75960 76030 82440 106300
## --------------------------------------------------------
## uber.by_day$month: May
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59750 78390 87360 86950 95110 121600
## --------------------------------------------------------
## uber.by_day$month: June
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 74430 87770 93340 93900 100900 136200
This heatmap highlights overall changes in ridership over time. It is easy to see the busiest (June 27th) and slowest (Jan 27th) days during the 6-month time period. I used dplyr’s group_by and summarise functions to bin the ride data by date, then used geom_tile() to create the heatmap. I played around with scale_fill_gradientn() to get an appropriate-looking color palette. I had to reverse the scaling because originally the higher values were blue and the lower ones were red.
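A calendar-style heatmap along these lines can be sketched as follows. The daily counts are simulated, and the three-color palette is an assumption; the point is only to show `geom_tile()` together with a reversed color vector in `scale_fill_gradientn()`.

```r
library(ggplot2)

set.seed(42)
# One row per calendar day from January through June 2015, with simulated counts.
days <- seq(as.Date("2015-01-01"), as.Date("2015-06-30"), by = "day")
by_date <- data.frame(
  month = factor(month.name[as.integer(format(days, "%m"))],
                 levels = month.name[1:6]),
  date  = as.integer(format(days, "%d")),
  n     = round(runif(length(days), min = 25000, max = 136000))
)

# Assumed palette; rev() flips it so low counts are blue and high counts are red.
palette_colors <- c("red", "yellow", "blue")

heat <- ggplot(by_date, aes(x = date, y = month, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_gradientn(colours = rev(palette_colors))
```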
## uber.by_hour$day: Monday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217 1638 2808 2720 3688 6808
## --------------------------------------------------------
## uber.by_hour$day: Tuesday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 159 1723 3048 3040 4284 8394
## --------------------------------------------------------
## uber.by_hour$day: Wednesday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 231 1893 3134 3156 4408 8075
## --------------------------------------------------------
## uber.by_hour$day: Thursday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 323 1920 3299 3461 4939 7980
## --------------------------------------------------------
## uber.by_hour$day: Friday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 366 2118 3416 3658 5037 8741
## --------------------------------------------------------
## uber.by_hour$day: Saturday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 548 2088 3690 3869 5462 10650
## --------------------------------------------------------
## uber.by_hour$day: Sunday
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 565 1915 3181 3134 4148 8980
The above graphs show the distribution of rides per hour for all dates in the 6-month time period. The top graph is colored by day, which highlights how demand rose along a gradient from Sunday (yellow) through Saturday (dark red). I threw in a smoothing curve to show the (mean) average demand for all days. The bottom graph helps distinguish the two types of demand curves for Monday - Friday vs Saturday and Sunday. Finally, the table shows the distribution of rides per hour broken down by day. Saturday had the largest range of pickups per hour while Monday had the smallest.
As discussed in the previous section, there seems to be more than one demand curve for each day. I decided to try to tease the patterns apart. What I found pointed to not two, but three distinct demand curves. Each density plot is broken down by borough. Most boroughs followed roughly the same trend, although Staten Island was consistently unique. This may have been due to the lower number of overall rides (under 10,000 total vs 100,000-10,000,000 for other boroughs).
The Monday-Friday plot shows 3 peaks of increasing size starting at midnight. Although the second (8am) peak was pretty evenly distributed across the boroughs, the final peak showed a large amount of variance. It seems Manhattan passengers went out a lot in the evening while Bronx passengers seemed to be more like homebodies, with a less pronounced evening surge.
The weekend told a different story for the 5 boroughs. Saturday consisted of two peaks of increasing size in the early morning and in the evening. Demand seemed to drop sharply after the first peak and into the wee hours of the morning before climbing slowly throughout the day. Sunday was almost the inverse of Saturday. The major peak occurred in the early morning while the minor peak occurred in the afternoon / evening.
These graphs complement each other well to paint a clear picture of rider habits. The map scatterplot highlights the most popular locations while the colored pickups by hour scatterplots show how rider behavior changes over time. I think Uber should share more of this type of data with drivers. It takes a lot of the guesswork out of driving and would eliminate a lot of regret over missed opportunities. Having a better understanding of rider behavior would help drivers plan a more concrete schedule.
I was surprised that February was so busy. Even though it is the shortest month, more people used the service each day on average than in March or April. It makes sense that Valentine’s Day was so busy, but I wonder what caused the uptick the following weekend (Feb 20th and 21st). I would have expected that after the busy Valentine’s Day weekend, demand would drop the following weekend. I tried to run a Google Trends query for the timeframe and region, but the results were inconclusive.
I chose this plot because it gives an easy to understand visualization of the distribution of pickups by location. We can see from the map that Manhattan was by far the most popular region, followed by remote parts of Brooklyn and Queens. The dark red circle in the lower right corner is JFK Airport and the one at the top of Queens is LaGuardia. I coded each dot to represent the number of pickups using both size and color because it was more helpful than either attribute alone. I chose to hide the size guide because it seemed redundant. I also transformed the color and size to a log scale because the range of pickups was massive (between 2 and 460,732 pickups). Finally, I played around with the bias in colorRampPalette to stretch the range of colors for larger numbers. This helped highlight how the number of pickups seems to diminish the further you are from the Manhattan epicenter.
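The combination of a log transform and a biased color ramp can be sketched in base R. The pickup counts below are a handful of zone-level values taken from the summaries earlier in this report; the three palette colors and the bias of 2 are assumptions chosen for illustration.

```r
# A few zone totals spanning the full range seen in the data.
pickups <- c(2, 925, 15840, 217900, 460732)

# 100-step palette; bias > 1 spaces the colors more widely at the high end.
pal <- colorRampPalette(c("yellow", "orange", "darkred"), bias = 2)(100)

# Bin the log10 counts into 100 equal-width intervals (codes 1 to 100),
# so the color reflects orders of magnitude rather than raw counts.
idx <- cut(log10(pickups), breaks = 100, labels = FALSE)

# One color per zone, ready to pass to a plotting function.
zone_colors <- pal[idx]
```

Without the log transform, nearly every zone would land in the lowest few color bins because a handful of Manhattan zones dominate the raw range.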
This plot gave me the most ideas when performing my exploratory data analysis. It captured a lot of the subtle and not-so subtle patterns along weekly and monthly cycles. I colored the bars by day because it highlighted how demand rose throughout the week and edited the color palette so that all bars were clearly visible. Finally, I changed the labels from scientific notation to decimal notation to make them more human-readable.
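One simple way to get decimal labels instead of scientific notation is base R's `format()`, which can also be passed to a ggplot2 scale as a labelling function. This is a sketch using the monthly totals for January, February and June from the table earlier in this report; the helper name `decimal_labels` is made up for the example.

```r
# Monthly totals for January, February and June.
counts <- c(1953801, 2263620, 2816895)

# Hypothetical helper: plain decimal notation with thousands separators.
decimal_labels <- function(x) {
  format(x, scientific = FALSE, big.mark = ",")
}

decimal_labels(counts)

# In a ggplot2 bar chart, the same function could be used as
#   scale_y_continuous(labels = decimal_labels)
```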
The Pickups by Hour with Daily Median graph was interesting because it showed key differences between weekdays and weekends. There are at least two distinct demand curves at play here. Saturday and Sunday ridership decreased from midnight to 6am, then steadily rose throughout the rest of the day. Demand on these days also started higher at midnight because people usually go out more on Friday and Saturday nights. Monday through Friday demand decreased from midnight to 3am, then abruptly rose as people started waking up to get to work. It was interesting to see how tight and consistent the 8am rush was, as opposed to the 6pm rush when some passengers went home while others headed to the bar.
This was an eye-opening project for me. As a former Uber driver I was curious to see how demand changed over time and varied by location. Being in the right place at the right time can be the difference between a $2000 week and a $200 one. It was fun testing my intuitions against the data even when I was wrong. I had no idea that a Thursday or even a Wednesday could be nearly as profitable as a Friday under the right conditions. I was also surprised to learn that demand spiked so much around 6pm and that there really wasn’t much of a lunch rush Monday-Friday.
People need data to make more informed choices. When LinkedIn started sharing even simple data (who’s looked at your profile and profile tips), their popularity skyrocketed. This seems like a great opportunity to empower more drivers. Helping drivers perform their jobs more effectively could be a win for all sides. If more drivers were more aware of the cyclical patterns in their area they would be better equipped to maximize their time and profit. Drivers would be more satisfied, leading to lower turnover and passengers would be better served by more experienced (and perhaps friendlier) drivers.
There is a lot of data Uber tracks that was not present in this dataset. This includes but is not limited to:
Each of the above holds valuable information about driver and passenger behavior. Analyzing the number of pickup requests that went unanswered could help improve Uber’s overall efficiency. For example, there may be time periods when there should be more drivers on the road or the opposite, when an area is supersaturated and drivers are less likely to find an available passenger. This, in conjunction with the surge information for each pickup could help unpack how Uber calculates their surge multiplier. Further analysis could help improve overall satisfaction for both passengers and drivers.
Pricing data could shed a lot of light on both passengers and drivers. Looking at the amount charged per pickup as well as any available surge information could reveal hidden trends. Do customers avoid Uber past a certain surge multiplier? Are they affected at all by the surge? Do drivers flock to surge areas or are they mostly stationary? Pricing data opens up many possibilities for exploration. We could look at whether it affects trip duration, the number of subsequent trips, or even the differences in price sensitivity between different locations. In addition, it would be interesting to compare passenger ratings for surge trips vs non-surge trips. With enough variables, we could even predict future pickup demand. I’m curious how close a machine learning algorithm could get to predicting surges or lulls. There is a lot we are missing out on here.
The biggest struggle was trying to capture as much information from each plot as efficiently as possible. There were many graphs I had to throw away because they really didn’t contribute much to the report. I also came across a few roadblocks while using ggmap. For instance, I originally tried to make a 2D histogram using stat_density2d() but was unable to because there were too many observations. All in all, I was very happy to go through this and I feel like the tools developed here can be applied to a wide variety of data exploration projects.