Getting enough labeled training data is a major bottleneck for many machine learning (ML) projects. You can build sophisticated models, but they are of little value if domain experts need to spend years labeling the relevant dataset. This is especially true in areas that demand highly skilled labelers, such as medical applications of machine learning.
In 2016, an advisor at Stanford asked his graduate student to work out a solution to this problem, assuming that this “should probably take an afternoon”. That’s how the Snorkel project started. Snorkel is a system for rapid training data creation with weak supervision. With this tool, you can create labeled training datasets quickly and efficiently through so-called labeling functions. Labeling functions are rules or patterns, written by domain experts, that assign labels automatically instead of by hand. Snorkel allows you to replace crowdsourcing of manual data labeling with crowdsourcing of writing labeling functions.
How does it work?
Let’s go through a specific example to see how data labeling is automated with Snorkel. Suppose we need a labeled training dataset for identifying tumors in MRI images. In particular, we need a dataset of MRI images with “yes/no” labels referring to the presence of a tumor on a given image. And what do we have instead? We have an unlabeled dataset of MRI images and accompanying reports prepared by doctors.
The traditional approach would be to hire several radiologists who would sit for months labeling this dataset. The alternative approach enabled by the Snorkel framework is to have a radiology expert spend a week or two writing labeling functions that use the text reports to label the images. In this particular example, described by Alex Ratner in the TWiML&AI podcast, the experts suggested around 20 labeling functions that drew on different information in the text reports accompanying MRI images, such as the presence of particular words in the report or the number of times the word “normal” appears in the report.
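To make this concrete, here is a minimal sketch of two such labeling functions in plain Python. It does not use Snorkel's API; the function names, the keyword list, the threshold of two, and the sample reports are all invented for illustration. Each function either votes for a label or abstains.

```python
import re

# Label values are an assumption for this sketch; -1 stands for "abstain".
ABSTAIN, NO_TUMOR, TUMOR = -1, 0, 1

def lf_mentions_mass(report: str) -> int:
    """Vote TUMOR if the report contains a word that often signals a finding."""
    keywords = ("mass", "lesion", "tumor")  # hypothetical keyword list
    text = report.lower()
    return TUMOR if any(re.search(rf"\b{k}\b", text) for k in keywords) else ABSTAIN

def lf_repeated_normal(report: str) -> int:
    """Vote NO_TUMOR if the word "normal" appears at least twice."""
    count = len(re.findall(r"\bnormal\b", report.lower()))
    return NO_TUMOR if count >= 2 else ABSTAIN

# Made-up report snippets, one per image.
reports = [
    "Enhancing mass in the left temporal lobe.",
    "Normal ventricles. Normal white matter signal.",
    "Study limited by motion artifact.",
]
votes = [(lf_mentions_mass(r), lf_repeated_normal(r)) for r in reports]
print(votes)  # [(1, -1), (-1, 0), (-1, -1)]
```

Rules like these are deliberately simple and noisy. The keyword rule, for instance, would also fire on "no mass seen", which is why many overlapping functions are written and then reconciled.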
The idea is to create many labeling functions and then check how they correlate. You don’t want two labeling functions that are inherently the same. Similarly, you don’t want unreliable labeling functions to define labels. The analogy from manual labeling would be that you only want to consider labels provided by reliable crowd workers. In the case of automatic labeling, the Snorkel system takes care of this by weighting the labeling functions accordingly.
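As a rough illustration of what "checking how they correlate" can look like, the sketch below computes coverage and pairwise agreement from a small table of votes. The votes are made up, and the helper names `coverage` and `agreement` are hypothetical. This is only a diagnostic and is not the method Snorkel itself uses to weight labeling functions.

```python
from itertools import combinations

ABSTAIN = -1

# Rows are data points, columns are labeling functions (made-up votes).
L = [
    [1, 1, -1],
    [0, 0, 1],
    [-1, -1, 0],
    [1, 1, 1],
]

def coverage(matrix, j):
    """Fraction of data points on which labeling function j does not abstain."""
    return sum(row[j] != ABSTAIN for row in matrix) / len(matrix)

def agreement(matrix, i, j):
    """Agreement rate of functions i and j where both vote; None if they never overlap."""
    both = [(row[i], row[j]) for row in matrix if ABSTAIN not in (row[i], row[j])]
    if not both:
        return None
    return sum(a == b for a, b in both) / len(both)

n_lfs = len(L[0])
for i, j in combinations(range(n_lfs), 2):
    print(f"LF{i} vs LF{j}: agreement = {agreement(L, i, j)}")
print([coverage(L, j) for j in range(n_lfs)])
```

Here the first two functions agree on every point where both vote, which suggests they may be encoding the same rule, while the third agrees with them only half the time.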
In a nutshell, Snorkel applies multiple labeling functions to unlabeled data to produce a matrix of noisy labels, which it then combines into a single probabilistic label for each data point.
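The sketch below shows the shape of that pipeline in plain Python: build a matrix of votes, then collapse each row into one soft label. The combination step is a simple weighted vote with hand-picked weights, used here as a stand-in; Snorkel estimates how much to trust each function from the data rather than taking fixed weights. All function names, keywords, weights, and reports are assumptions made for the example.

```python
ABSTAIN, NO_TUMOR, TUMOR = -1, 0, 1

def apply_lfs(lfs, data):
    """Build the label matrix: one row per data point, one column per function."""
    return [[lf(x) for lf in lfs] for x in data]

def weighted_vote(row, weights):
    """Estimate P(TUMOR) from one row of votes; None if every function abstains."""
    total = sum(w for v, w in zip(row, weights) if v != ABSTAIN)
    if total == 0:
        return None
    return sum(w for v, w in zip(row, weights) if v == TUMOR) / total

# Hypothetical labeling functions over report text.
lfs = [
    lambda r: TUMOR if "mass" in r.lower() else ABSTAIN,
    lambda r: NO_TUMOR if "unremarkable" in r.lower() else ABSTAIN,
    lambda r: TUMOR if "follow-up" in r.lower() else ABSTAIN,
]
weights = [3.0, 2.0, 1.0]  # hand-picked for illustration, not learned

reports = [
    "Mass in the right frontal lobe; follow-up advised.",
    "Unremarkable study; routine follow-up.",
    "Motion artifact limits evaluation.",
]
label_matrix = apply_lfs(lfs, reports)
soft_labels = [weighted_vote(row, weights) for row in label_matrix]
print(label_matrix)  # [[1, -1, 1], [-1, 0, 1], [-1, -1, -1]]
print(soft_labels)
```

The second report gets a low tumor probability because the more trusted "unremarkable" rule outweighs the weaker "follow-up" rule. The third gets no label at all, since every function abstained.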
What are the advantages of Snorkel?
The proposed approach to data labeling has a number of important benefits:
- Interpretability. Moving from manual labeling to programmatic labeling helps to improve the interpretability of ML algorithms. Rules and patterns embedded in the labeling functions improve our understanding of the difference between positive and negative data samples.
- Re-usability. If your training dataset is labeled manually and, in the process of model development, you discover that you need a slightly different kind of labels, you need to restart the data labeling from scratch. Alternatively, if you label your data programmatically, you can just quickly change the labeling functions to fit the new models. Moreover, labeling functions can be repurposed and re-applied when new problems come in.
- Domain expertise. Snorkel leverages the knowledge of subject matter experts who, in addition to writing labeling functions, also provide explanations for the rules embedded in the labeling functions. For example, an expert can specify that a certain rule applies only when a certain condition holds, and this can be reflected in code, e.g., “if the condition holds, output a label; otherwise, abstain”.
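The “if the condition holds, output a label; otherwise, abstain” pattern might look like the following sketch. The helper `make_conditional_lf` and the contrast/enhancement rule are hypothetical, invented only to show the structure, and are not taken from Snorkel or from clinical guidance. Building rules from a small factory also hints at the re-usability point: the same helper can be repurposed with a different condition or label.

```python
from typing import Callable

ABSTAIN, NO_TUMOR, TUMOR = -1, 0, 1

def make_conditional_lf(condition: Callable[[str], bool],
                        rule: Callable[[str], bool],
                        label: int,
                        rationale: str) -> Callable[[str], int]:
    """Build a labeling function that votes only when `condition` holds."""
    def lf(report: str) -> int:
        if not condition(report):
            return ABSTAIN  # rule is out of scope for this report
        return label if rule(report) else ABSTAIN
    lf.rationale = rationale  # keep the expert's explanation with the rule
    return lf

# Hypothetical expert rule: only trust "enhancement" in contrast studies.
lf_enhancement = make_conditional_lf(
    condition=lambda r: "with contrast" in r.lower(),
    rule=lambda r: "enhancement" in r.lower(),
    label=TUMOR,
    rationale="Enhancement is only meaningful when contrast was administered.",
)

print(lf_enhancement("MRI with contrast: focal enhancement noted."))     # 1
print(lf_enhancement("MRI without contrast: no enhancement assessed."))  # -1
```

Storing the rationale alongside the function keeps the expert's reasoning attached to the rule, so it can be reviewed when the rule is reused or revised.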
The only drawback is that labels generated through labeling functions are almost, but not quite, as good as manually applied labels. Considering all the benefits of this approach, that is a trade-off most ML teams are ready to accept.
Who can leverage this data labeling framework?
In the first academic paper on Snorkel, presented at NeurIPS 2016, the tool was described as an algorithm. Now the creators of Snorkel see it as part of ML infrastructure, a system that can be leveraged by industry and academia for speeding up and improving the data labeling process. In the new version of Snorkel, its creators go beyond data labeling and introduce two new operations: transforming or augmenting data, and slicing or partitioning data.
Snorkel is open-source, and over the last few years, it has been deployed by a number of tech leaders, including Google, Intel, and IBM. Snorkel has also been successfully applied in medicine (e.g. Stanford, VA), government, and science.