Open Data

Our group works with many collaborators on various datasets. Open datasets are published here.

Urbana 10 car ring road test

This dataset contains trajectory and fuel consumption data from a series of 10-vehicle ring road tests conducted as an extension of the Sugiyama et al. (2008) experiments. The dataset contains video data, the trajectories extracted from the camera, the smoothed trajectories, and the OBD-II logs from each equipped vehicle, which include fuel consumption information. Download the data here:

2010-2013 New York City Traffic Estimates

This dataset contains hourly average traffic speeds on road segments throughout New York City. The estimates cover four years and were derived from approximately 700 million taxi trips recorded during that period; the trip data is also available for download below. The readme file included in the data download gives a complete description of the traffic estimates and the corresponding OpenStreetMap road network. Download the data here:

2010-2013 New York City Taxi Data

This dataset was obtained through a Freedom of Information Law (FOIL) request from the New York City Taxi & Limousine Commission (NYCT&L). It covers four years of taxi operations in New York City and includes 697,622,444 trips. Thanks to a generous hosting policy by the University of Illinois at Urbana-Champaign, we are able to make this large dataset publicly available.

You are free to use the data as you wish. We only ask that you consider citing the following works if you plan to publish results based on the dataset:

Brian Donovan and Daniel B. Work. “Using coarse GPS data to quantify city-scale transportation system resilience to extreme events.” Presented at the Transportation Research Board 94th Annual Meeting, January 2015. preprint, source code.

Brian Donovan and Daniel B. Work. “New York City Taxi Trip Data (2010-2013).” Version 1.0. University of Illinois at Urbana-Champaign. Dataset, 2014.

Download the data here:

The data is stored in CSV format, organized by year and month. In each file, each row represents a single taxi trip. As there are several entries per second for four years, the raw trip data takes up about 116GB in text CSV format. The data has been compressed (zip) to reduce download time.
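As a rough check on these figures, the arithmetic below combines the trip count quoted above with the 116GB size. It is only a sketch: it assumes the four years are 2010 through 2013 (1,461 days) and that "GB" means 10^9 bytes, neither of which is stated explicitly on this page.

```python
# Back-of-the-envelope figures for the raw trip data.
TRIPS = 697_622_444       # trip count quoted for the taxi dataset
RAW_BYTES = 116 * 10**9   # assumption: "116GB" means 116 * 10^9 bytes
DAYS = 3 * 365 + 366      # assumption: 2010-2013, with 2012 a leap year

seconds = DAYS * 24 * 60 * 60
trips_per_second = TRIPS / seconds   # average pickup rate over the period
bytes_per_trip = RAW_BYTES / TRIPS   # average size of one CSV row

print(f"{trips_per_second:.1f} trips per second on average")
print(f"{bytes_per_trip:.0f} bytes per CSV row on average")
```

Under these assumptions the average works out to between five and six trips per second, which is consistent with "several entries per second".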

The data is organized as follows:

  • medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID. See also medallions.
  • hack_license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID. See also hack license.
  • vendor_id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT), implemented as part of the Technology Passenger Enhancements Project.
  • rate_code: taximeter rate, see NYCT&L description.
  • store_and_fwd_flag: unknown attribute.
  • pickup_datetime: start time of the trip, mm-dd-yyyy hh24:mm:ss EDT.
  • dropoff_datetime: end time of the trip, mm-dd-yyyy hh24:mm:ss EDT.
  • passenger_count: number of passengers on the trip, default value is one.
  • trip_time_in_secs: trip time measured by the taximeter in seconds.
  • trip_distance: trip distance measured by the taximeter in miles.
  • pickup_longitude and pickup_latitude: GPS coordinates at the start of the trip.
  • dropoff_longitude and dropoff_latitude: GPS coordinates at the end of the trip.
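To illustrate how a trip file might be read, here is a minimal sketch using only the Python standard library. It assumes each file starts with a header row whose column names are underscore-separated versions of the fields listed above (for example pickup_datetime), and that timestamps follow the mm-dd-yyyy hh24:mm:ss layout described in the list; check both against the downloaded files. The sample record is made up for illustration.

```python
import csv
import io
from datetime import datetime

# Assumption: timestamps follow the "mm-dd-yyyy hh24:mm:ss" layout described above.
TIME_FORMAT = "%m-%d-%Y %H:%M:%S"


def read_trips(text_stream, time_format=TIME_FORMAT):
    """Yield one dict per trip, converting text fields to Python types."""
    # skipinitialspace tolerates headers written as "a, b, c".
    reader = csv.DictReader(text_stream, skipinitialspace=True)
    for row in reader:
        yield {
            "medallion": row["medallion"],
            "hack_license": row["hack_license"],
            "pickup_datetime": datetime.strptime(row["pickup_datetime"], time_format),
            "dropoff_datetime": datetime.strptime(row["dropoff_datetime"], time_format),
            "passenger_count": int(row["passenger_count"]),
            "trip_time_in_secs": float(row["trip_time_in_secs"]),
            "trip_distance": float(row["trip_distance"]),
            "pickup": (float(row["pickup_longitude"]), float(row["pickup_latitude"])),
            "dropoff": (float(row["dropoff_longitude"]), float(row["dropoff_latitude"])),
        }


# Made-up record for illustration only; it is not taken from the dataset.
SAMPLE = io.StringIO(
    "medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,"
    "pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,"
    "trip_distance,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude\n"
    "A1,B2,VTS,1,,01-15-2012 08:30:00,01-15-2012 08:40:00,1,600,1.2,"
    "-73.99,40.75,-73.98,40.76\n"
)

trips = list(read_trips(SAMPLE))
print(trips[0]["pickup_datetime"], trips[0]["trip_distance"])
```

Because read_trips is a generator, it can be pointed at an open file and consumed row by row, which matters for files of this size.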

The medallion and hack license IDs are reassigned each year, so drivers and vehicles can only be tracked within a single year. This is necessary to keep the data pseudo-anonymous, because de-anonymized data from 2013 can be reconstructed from existing published datasets; see the note on anonymity below.

Please note that the dataset contains a large number of errors. For example, there are several trips where the reported meter distance is significantly shorter than the straight-line distance between pickup and dropoff, which is geometrically impossible. For some periods, the field trip_time_in_secs is reported in seconds; in others it is reported in minutes (see the first record above). The trip time can generally be computed safely by subtracting pickup_datetime from dropoff_datetime. Additionally, many trips report GPS coordinates of (0,0), or cover impossible distances, times, or velocities. Trips with these obvious errors should be discarded in any analysis. In our preliminary investigations, these errors account for roughly 7.5% of all trips. More details about these errors are available in the above article and the corresponding open source code. Currently, only the raw data (no error filtering) is available for download via this site.
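The checks described above could be sketched roughly as follows. This is not the filtering code from the article; the thresholds max_mph and slack are illustrative assumptions, the straight-line distance is approximated with the haversine formula, and each trip is assumed to be a dict holding datetime objects and (longitude, latitude) tuples under hypothetical key names.

```python
import math
from datetime import datetime

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def straight_line_miles(lon1, lat1, lon2, lat2):
    """Great-circle (haversine) distance between two GPS points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_plausible(trip, max_mph=100.0, slack=0.95):
    """Return False for trips showing the kinds of errors described above.

    max_mph and slack are illustrative thresholds, not the authors' values.
    """
    lon1, lat1 = trip["pickup"]
    lon2, lat2 = trip["dropoff"]
    # GPS coordinates of (0, 0) indicate a missing position fix.
    if (lon1, lat1) == (0.0, 0.0) or (lon2, lat2) == (0.0, 0.0):
        return False
    # Derive the duration from the timestamps rather than trip_time_in_secs.
    seconds = (trip["dropoff_datetime"] - trip["pickup_datetime"]).total_seconds()
    if seconds <= 0:
        return False
    # The metered distance cannot be much shorter than the straight line.
    if trip["trip_distance"] < slack * straight_line_miles(lon1, lat1, lon2, lat2):
        return False
    # Reject physically impossible average speeds.
    return trip["trip_distance"] / (seconds / 3600.0) <= max_mph


# Made-up trips for illustration only.
good = {
    "pickup": (-73.99, 40.75),
    "dropoff": (-73.98, 40.76),
    "pickup_datetime": datetime(2012, 1, 15, 8, 30),
    "dropoff_datetime": datetime(2012, 1, 15, 8, 40),
    "trip_distance": 1.2,
}
no_gps = dict(good, pickup=(0.0, 0.0))
print(is_plausible(good), is_plausible(no_gps))
```

The slack factor leaves a small margin for GPS noise so that short, nearly straight trips are not rejected; how strict to be is a judgment call for each analysis.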

Fare data is also available from 2010-2014. The fare data takes about 75GB in raw text CSV format, and is also zipped to reduce download times. The files are also organized by year and month, and contain the following attributes:

  • medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID. See also medallions.
  • hack_license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID. See also hack license.
  • vendor_id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT), implemented as part of the Technology Passenger Enhancements Project.
  • pickup_datetime: start time of the trip, mm-dd-yyyy hh24:mm:ss EDT.
  • payment_type: cash or credit card.
  • fare_amount: the meter fare, in USD; it should include the Newark surcharge.
  • surcharge: extra fees, such as rush hour and overnight surcharges, in USD.
  • mta_tax: Metropolitan commuter transportation mobility tax, in USD.
  • tip_amount: tip amount, in USD.
  • tolls_amount: total price paid for tolls, summed across all tolls for the trip, in USD.
  • total_amount: all charges that are presented to the passenger at time of fare payment (includes tip for non-cash trips), in USD.

Again, note the medallion and hack licenses change each year.
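Since the trip and fare files share the medallion, hack license, and pickup time fields, one plausible way to link them is to join on those three values. The sketch below assumes, without confirmation from this page, that the same pseudo-IDs appear in both files for a given year and that the three fields together identify a trip; the key names are hypothetical, and records without exactly one match are set aside rather than guessed at.

```python
from collections import defaultdict

# Assumed join key; the page does not state that it is unique per trip.
KEY_FIELDS = ("medallion", "hack_license", "pickup_datetime")


def join_trips_and_fares(trips, fares, key_fields=KEY_FIELDS):
    """Pair each trip with the fare record sharing its key, if exactly one does."""
    fares_by_key = defaultdict(list)
    for fare in fares:
        fares_by_key[tuple(fare[f] for f in key_fields)].append(fare)

    joined, unmatched = [], []
    for trip in trips:
        candidates = fares_by_key.get(tuple(trip[f] for f in key_fields), [])
        if len(candidates) == 1:
            # Trip fields win if both records carry the same column.
            joined.append({**candidates[0], **trip})
        else:
            # No fare, or an ambiguous match: leave the decision to the caller.
            unmatched.append(trip)
    return joined, unmatched


# Made-up records for illustration only.
trips = [
    {"medallion": "A1", "hack_license": "B2",
     "pickup_datetime": "01-15-2012 08:30:00", "trip_distance": 1.2},
    {"medallion": "A9", "hack_license": "B9",
     "pickup_datetime": "01-15-2012 09:00:00", "trip_distance": 3.4},
]
fares = [
    {"medallion": "A1", "hack_license": "B2",
     "pickup_datetime": "01-15-2012 08:30:00", "total_amount": "9.50"},
]
joined, unmatched = join_trips_and_fares(trips, fares)
print(len(joined), len(unmatched))
```

Both files should come from the same year and month, because the IDs change each year.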

A note on anonymity. The published datasets on this site have been pseudo-anonymized to obscure personally identifiable information. It is well known that location data is notoriously difficult to anonymize; see, for example, the works of Marco Gruteser or John Krumm. Moreover, a subset of the raw dataset obtained via a FOIL request (published by Chris Whong) has already been de-anonymized by Vijay Pandurangan. Because the true IDs can still be recovered with a FOIL request and by following the techniques described in the above links, we only aim to make recovering the true IDs slightly more work than writing a new FOIL request to NYCT&L.

How we pseudo-anonymized the datasets. The medallion and hack licenses were pseudo-anonymized by assigning a randomly generated medallion and hack license, instead of using the hashed medallion and hack licenses provided by NYCT&L. Each year contains a new set of medallions and hack licenses. This means it is possible to track a driver through all of 2010, but it is NOT possible to track the same driver in 2011, for example. We are not able to provide a single medallion or hack license ID that is consistent across the complete dataset because the 2013 data has already been de-anonymized, and doing so would trivially compromise the remaining data. Finally, the dataset may still be vulnerable to statistical or other attacks to recover the IDs, and thus we do not claim it is anonymous.
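The idea of assigning fresh random IDs each year can be sketched as follows. This is an illustration of the general approach only, not the procedure actually used to prepare the published files; the record layout, the "year" key, and the token length are all assumptions.

```python
import secrets


def pseudonymize(records, id_field, n_bytes=8):
    """Replace id_field with random tokens, using a separate mapping per year."""
    mapping = {}   # (year, original ID) -> random token
    used = set()   # tokens already handed out, to avoid collisions
    result = []
    for rec in records:
        key = (rec["year"], rec[id_field])
        if key not in mapping:
            token = secrets.token_hex(n_bytes)
            while token in used:
                token = secrets.token_hex(n_bytes)
            used.add(token)
            mapping[key] = token
        # Copy the record so the input is left untouched.
        result.append({**rec, id_field: mapping[key]})
    return result


# Made-up records for illustration only.
records = [
    {"year": 2010, "medallion": "M1"},
    {"year": 2010, "medallion": "M1"},
    {"year": 2011, "medallion": "M1"},
]
out = pseudonymize(records, "medallion")
print(out[0]["medallion"] == out[1]["medallion"])  # same vehicle, same year
print(out[0]["medallion"] == out[2]["medallion"])  # same vehicle, different year
```

Because the tokens are drawn at random rather than computed from the original ID, the mapping cannot be reversed by hashing candidate IDs; it must be discarded after use for the separation between years to hold.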

We ultimately decided to publish this dataset in an effort to make our own research reproducible, and to aid other researchers interested in taxi operations.

The NYCT&L Commission does not restrict publishing the data, as determined from personal communication with the Commission. “The data was disclosed pursuant to the NYS Freedom of Information Law, therefore there is no licensing restriction on your publication of the data.”

Moreover, the University of Illinois at Urbana-Champaign Institutional Review Board reviewed our request to publish this dataset. “Since you received this information via the Freedom of Information Law, and will be analyzing trip and fare data, you are not considered interacting or intervening with human subjects, therefore, it has been determined that this project as described does not meet the definition of human subjects research as defined in 45CFR46(d)(f) or at 21CFR56.102(c)(e) and determined publication does not constitute human subjects research.”

2008 Mobile Century experiment GPS trajectory data

Mobile Century was an experiment run at UC Berkeley to test the potential to use GPS data to estimate traffic conditions. The dataset contains 8 hours of GPS trajectory data from 100 vehicles on a ~10 mile stretch of I-880 in California, as well as inductive loop detector data from PeMS, and travel times recorded by license plate recognition. The dataset remains one of the most comprehensive public GPS datasets for traffic monitoring research.

The key reference paper for the dataset is:

J.-C. Herrera, D. Work, J. Ban, R. Herring, Q. Jacobson, and A. Bayen. “Evaluation of traffic data obtained via GPS-enabled Mobile Phones: the Mobile Century experiment.” Transportation Research Part C, 18(3), pp. 568–583, 2010. DOI: 10.1016/j.trc.2009.10.006. Download: preprint, manuscript. Most Cited Transportation Research Part C: Emerging Technologies Article Since 2008 (June 2013).

Download the data here: