Traditional data warehouses are built for Business Intelligence analytics, CEO Dashboards, and other types of business reporting prepared for “human consumption.” That often implies that data in these warehouses is not ready for “machine consumption,” including machine learning (ML) models. For example, it is mostly sufficient for humans to know the date of a particular event, while machines usually require the exact timestamp with hours, minutes, seconds, and possibly even milliseconds.
So if data scientists train their ML models on these nice and clear datasets from data warehouses, they often run into numerous unexpected issues when pushing their models into production. To avoid issues arising from the online-offline inconsistency of data, the ML infrastructure team at Airbnb created Zipline, a data management framework for traveling “in time and space.”
When would you need Zipline?
There are different types of ML projects depending on what you are trying to predict – some models predict unstructured entities (e.g., image classification), others predict structured entities (e.g., family-friendly or solo-traveler apartment), and yet others predict events (e.g., network traffic).
If your model runs batch-only, you probably don’t need Zipline. However, if it serves traffic in real time, you may find Zipline very helpful. It matters most for event-driven machine learning.
We can go through a specific example to get a better understanding of when traditional data warehouses are not suitable for predicting events. Let’s say you want to predict the likelihood that a user will make a booking when viewing the webpage for a specific house or apartment. One of the possible features can be the total number of bookings from the previous 7 days. This type of feature is very dynamic: when we change the time point of the prediction even by a few hours, the feature value can also change, which can lead to a different prediction.
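To make this concrete, here is a minimal Python sketch of a 7-day booking count evaluated at two prediction times on the same day. The booking timestamps and the `bookings_in_window` helper are invented for illustration and are not part of Zipline.

```python
from datetime import datetime, timedelta

# Hypothetical booking timestamps for one listing (illustrative data only).
BOOKINGS = [
    datetime(2019, 2, 28, 9, 0),
    datetime(2019, 3, 3, 14, 30),
    datetime(2019, 3, 7, 18, 0),
    datetime(2019, 3, 8, 11, 15),
]


def bookings_in_window(events, as_of, window=timedelta(days=7)):
    """Count events in the half-open interval [as_of - window, as_of)."""
    start = as_of - window
    return sum(1 for ts in events if start <= ts < as_of)


# Same feature, same day, two prediction times four hours apart.
morning = bookings_in_window(BOOKINGS, datetime(2019, 3, 8, 8, 0))
afternoon = bookings_in_window(BOOKINGS, datetime(2019, 3, 8, 12, 0))
print(morning, afternoon)  # prints "2 3"
```

Moving the prediction time by four hours changes the feature from 2 to 3, because the 11:15 booking only enters the window for the later time.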
Traditional warehouses usually operate with daily totals and can’t give you the interim data. Daily totals often work quite well, but in some cases, they can cause big problems. For example, if you take end-of-day data, you can accidentally include the thing you’re trying to predict in one of the features (i.e., the label leakage problem). The evaluation will show that the corresponding feature is very good at predicting a specific event, but then in production, it will not work that well. Alternatively, if you take end-of-day data from the previous day, you can lose some really relevant features (e.g., number of clicks within the last five minutes).
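Both failure modes can be reproduced on a toy event log. The sketch below is a simplified illustration with made-up events and a hypothetical `count` helper; it assumes the prediction is made at 10:00 and the booking being predicted happens at 10:20.

```python
from datetime import datetime, timedelta

# Hypothetical event log for one user on one day (illustrative data only).
EVENTS = [
    ("click", datetime(2019, 3, 8, 9, 56)),
    ("click", datetime(2019, 3, 8, 9, 58)),
    ("booking", datetime(2019, 3, 8, 10, 20)),  # the outcome we want to predict
]
PREDICTION_TIME = datetime(2019, 3, 8, 10, 0)  # moment the listing page is viewed


def count(kind, start, end):
    """Count events of one kind with start <= timestamp < end."""
    return sum(1 for k, ts in EVENTS if k == kind and start <= ts < end)


day_start = datetime(2019, 3, 8)
day_end = day_start + timedelta(days=1)

# End-of-day total: includes the 10:20 booking, i.e. the label itself.
bookings_end_of_day = count("booking", day_start, day_end)
# Point-in-time value: only what was actually known at 10:00.
bookings_as_of = count("booking", day_start, PREDICTION_TIME)

# Previous day's snapshot: cannot see the clicks just before the page view.
clicks_prev_day = count("click", day_start - timedelta(days=1), day_start)
clicks_last_5_min = count(
    "click", PREDICTION_TIME - timedelta(minutes=5), PREDICTION_TIME
)

print(bookings_end_of_day, bookings_as_of)  # prints "1 0"
print(clicks_prev_day, clicks_last_5_min)   # prints "0 2"
```

The end-of-day feature leaks the label (1 instead of 0), while the previous-day snapshot misses the two recent clicks entirely.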
So, if you use machine learning to predict specific events, and your data scientists spend most of their time generating training data yet still end up with models that perform well on test data but not in production, Zipline is likely to help you.
How does it work?
The goal of Zipline is to ensure online-offline consistency by providing ML models with exactly the same data for training and scoring. To this end, Zipline lets its users define features in a way that supports point-in-time correct computations.
A data warehouse at Airbnb stores only raw data and no features. Features are computed only after a user asks for certain values to be calculated for certain clients at a specific time point. Zipline creates training data through the following steps:
- A user requests raw data from the warehouse using primary keys and timestamps.
- A user specifies what kind of features they want to create from this raw data (e.g., the average booking value for the last year, or the total number of all bookings for the last 30 days).
- Zipline returns the requested feature vector, computed as of the requested timestamp.
For example, you can come to Zipline and say: “Hey, for user ‘123’ and timestamp ‘yesterday at noon,’ please give me the ‘total number of bookings for the last 30 days.’”
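The request flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `WindowedCount`, `build_training_rows`, and the event tuples are hypothetical names and data, not Zipline’s actual API, and a real system would read from the warehouse rather than from an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class WindowedCount:
    """Hypothetical feature definition: event count over a trailing window."""
    name: str
    event_type: str
    window: timedelta

    def compute(self, events, key, as_of):
        # Only events in [as_of - window, as_of) are visible to the feature.
        start = as_of - self.window
        return sum(
            1
            for user_id, kind, ts in events
            if user_id == key and kind == self.event_type and start <= ts < as_of
        )


# Hypothetical raw events: (user_id, event_type, timestamp).
RAW_EVENTS = [
    ("123", "booking", datetime(2019, 1, 15, 10, 0)),
    ("123", "booking", datetime(2019, 2, 20, 16, 0)),
    ("123", "booking", datetime(2019, 3, 7, 13, 0)),
    ("456", "booking", datetime(2019, 3, 1, 9, 0)),
]


def build_training_rows(features, events, requests):
    """Compute every feature for each (primary key, timestamp) request."""
    rows = []
    for key, as_of in requests:
        row = {"user_id": key, "as_of": as_of}
        for feature in features:
            row[feature.name] = feature.compute(events, key, as_of)
        rows.append(row)
    return rows


bookings_30d = WindowedCount("bookings_30d", "booking", timedelta(days=30))
rows = build_training_rows(
    [bookings_30d],
    RAW_EVENTS,
    [("123", datetime(2019, 3, 7, 12, 0)), ("123", datetime(2019, 3, 8, 12, 0))],
)
print([row["bookings_30d"] for row in rows])  # prints "[1, 2]"
```

The same user gets a different value for each requested timestamp: the booking made at 13:00 on March 7 is invisible to the request for noon that day and visible to the request one day later.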
In the production workflow, a scoring request likewise contains only primary keys, not feature vectors. Since the timestamp in production is always “now,” the user only needs to give the system the model name, user IDs, and listing IDs. Zipline then calculates all features required by that model for the specified users and listings, and these features are used to make a prediction.
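One way to picture this is a single feature code path that takes an explicit timestamp, with the online entry point simply filling in the current time. The sketch below is an assumption-laden illustration: the feature functions, the `MODEL_FEATURES` registry, and the model name are all hypothetical and do not reflect how Zipline is implemented.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw events: (user_id, listing_id, booking timestamp in UTC).
EVENTS = [
    ("u1", "l9", datetime(2019, 3, 1, 9, 0, tzinfo=timezone.utc)),
    ("u1", "l4", datetime(2019, 3, 6, 17, 0, tzinfo=timezone.utc)),
    ("u2", "l9", datetime(2019, 3, 7, 8, 0, tzinfo=timezone.utc)),
]


def user_bookings_30d(events, user_id, listing_id, as_of):
    """Bookings made by the user in the 30 days before as_of."""
    start = as_of - timedelta(days=30)
    return sum(1 for uid, _, ts in events if uid == user_id and start <= ts < as_of)


def listing_bookings_7d(events, user_id, listing_id, as_of):
    """Bookings of the listing in the 7 days before as_of."""
    start = as_of - timedelta(days=7)
    return sum(1 for _, lid, ts in events if lid == listing_id and start <= ts < as_of)


# Hypothetical registry mapping a model name to the features it needs.
MODEL_FEATURES = {"booking_likelihood": [user_bookings_30d, listing_bookings_7d]}


def feature_vector(model_name, events, user_id, listing_id, as_of):
    """Single code path shared by training backfills and live scoring."""
    return [f(events, user_id, listing_id, as_of) for f in MODEL_FEATURES[model_name]]


def features_for_scoring(model_name, events, user_id, listing_id, now=None):
    """Online entry point: the caller passes only keys; the timestamp is 'now'."""
    now = now or datetime.now(timezone.utc)
    return feature_vector(model_name, events, user_id, listing_id, now)


# Pin "now" so the online result can be compared with an offline backfill.
request_time = datetime(2019, 3, 8, 12, 0, tzinfo=timezone.utc)
online = features_for_scoring("booking_likelihood", EVENTS, "u1", "l9", now=request_time)
offline = feature_vector("booking_likelihood", EVENTS, "u1", "l9", request_time)
print(online, offline)  # prints "[2, 1] [2, 1]"
```

Because both paths call the same function with the same timestamp, the vector the model sees at scoring time matches the one a training backfill would produce for that moment.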
Such a setup ensures that features are the same in all environments and models in production perform as expected after evaluation on a test set.
What are the advantages of Zipline?
The Zipline data management framework has a number of features that boost the effectiveness of data scientists when preparing data for their ML models:
- Online-offline consistency. The framework ensures feature consistency in different environments.
- Data quality and monitoring. Zipline has built-in features that ensure data quality, as well as a user interface that supports the monitoring of data quality.
- Searchability and shareability. The framework includes a feature library where users can search for previously used features and share features between different teams.
- Integration with end-to-end workflow. Zipline is integrated with the rest of the machine learning workflow.
Airbnb’s ML infrastructure team has stated that it plans to open-source Zipline by the end of 2019.
If you feel that your ML projects could benefit from the Zipline data management framework, or you are simply interested in this solution, check out the video that this article is based on.