The Importance of Versioning Data

John Mount demonstrates an important concept:

Our business goal is to build a model relating attendance to popcorn sales, which we will apply to future data in order to predict future popcorn sales. This allows us to plan staffing and purchasing, and also to predict snack bar revenue.

In the above example data, all dates in August of 2024 are “in the past” (available as training and test/validation data) and all dates in September of 2024 are “in the future” (dates we want to make predictions for). The movie attendance service we are subscribing to supplies

  • past schedules
  • past (recorded) attendance
  • future schedules, and
  • (estimated) future attendance.

John’s example scenario covers the problem of future estimates interfering with model quality. Another important scenario is when the past changes. As one example, digital marketing providers (think Google, Bing, Amazon, etc.) will provide you impression and click data pretty quickly, and each day they close the books on a prior day’s data at some regular time. For some of these providers, that prior day is yesterday: on Tuesday, provider X closes the books on Monday’s data and promises that it won’t change after that. Other providers might keep revising data over the course of the next 10 days. This means that the data you’re using for model training might change out from under you, and you might never know unless you keep track of the actual data you used at the time of training.
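One way to keep track of the data you trained on is to store a copy of each pull along with a content fingerprint, then compare later pulls against it. The following is a minimal sketch using only the Python standard library. The function names (`fingerprint`, `snapshot`, `changed_rows`), the row layout, and the sample numbers are all hypothetical and made up for illustration; they do not come from any provider's API or from John's post.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Return a SHA-256 hex digest of the rows, independent of row order."""
    # Serialize each row with sorted keys, then sort the rows themselves,
    # so the same data always produces the same digest.
    canonical = sorted(json.dumps(r, sort_keys=True) for r in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def snapshot(rows, pulled_at=None):
    """Bundle a copy of the rows with the pull time and a fingerprint."""
    pulled_at = pulled_at or datetime.now(timezone.utc)
    return {
        "pulled_at": pulled_at.isoformat(),
        "sha256": fingerprint(rows),
        "rows": json.loads(json.dumps(rows)),  # deep copy of the data
    }


def changed_rows(old_snapshot, new_rows, key="date"):
    """Return the key values whose rows differ from the stored snapshot."""
    old = {r[key]: r for r in old_snapshot["rows"]}
    new = {r[key]: r for r in new_rows}
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Hypothetical data: what the provider reported when the model was trained.
first_pull = [
    {"date": "2024-08-05", "impressions": 1000, "clicks": 40},
    {"date": "2024-08-06", "impressions": 1200, "clicks": 55},
]
training_snapshot = snapshot(first_pull)

# Hypothetical data: the same dates pulled again after the provider revised them.
later_pull = [
    {"date": "2024-08-05", "impressions": 1000, "clicks": 40},
    {"date": "2024-08-06", "impressions": 1250, "clicks": 57},
]

print(fingerprint(later_pull) == training_snapshot["sha256"])  # False
print(changed_rows(training_snapshot, later_pull))  # ['2024-08-06']
```

In practice you would persist the snapshot (or at least its fingerprint and pull time) next to the trained model, so that a later mismatch tells you the past has changed since training.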