How To Crowdsource Labeled Datasets Quickly With Open-Source Tools Like Snorkel

September 25, 2019 by Kate Koidan

Getting enough labeled training data is a major bottleneck for many machine learning (ML) projects. You can build sophisticated models, but they will be of little value if domain experts need to spend years labeling the relevant dataset. The problem is especially acute where labeling requires deep expertise, as in medical applications of machine learning.

In 2016, an advisor at Stanford asked his graduate student to work out a solution to this problem, assuming that it “should probably take an afternoon”. That’s how the Snorkel project started. Snorkel is a system for rapid training data creation with weak supervision. With this tool, you can create labeled training datasets quickly through labeling functions: rules or patterns written by domain experts that assign labels automatically, replacing manual labeling. In effect, Snorkel lets you replace crowdsourcing of manual labels with crowdsourcing of labeling functions.

Snorkel data labeling — Ratner et al., 2017

How does it work?

Let’s go through a specific example to see how data labeling is automated with Snorkel. Suppose we need a labeled training dataset for identifying tumors in MRI images. In particular, we need a dataset of MRI images with “yes/no” labels referring to the presence of a tumor on a given image. And what do we have instead? We have an unlabeled dataset of MRI images and accompanying reports prepared by doctors.

The traditional approach would be to hire several radiologists who would spend months labeling this dataset. The alternative enabled by the Snorkel framework is to have a radiology expert spend a week or two writing labeling functions that use the text reports to label the images. In this particular example, described by Alex Ratner in the TWiML&AI podcast, the experts suggested around 20 labeling functions that drew on different information in the reports accompanying the MRI images, such as the presence of particular words or the number of times the word “normal” appears.

Snorkel data labeling — Example of a labeling function
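To make the idea concrete, here is a minimal sketch of two such rules in plain Python. It does not use Snorkel's own API; the label values, the word list, and the threshold of two occurrences of “normal” are all assumptions made for illustration, not the rules the radiology experts actually wrote.

```python
import re

# Illustrative label values; ABSTAIN means "this rule has no opinion".
TUMOR, NO_TUMOR, ABSTAIN = 1, 0, -1

def lf_mentions_mass(report: str) -> int:
    """Vote TUMOR if the report uses a suspicious term (hypothetical word list)."""
    suspicious = {"mass", "lesion", "tumor"}
    words = set(re.findall(r"[a-z]+", report.lower()))
    return TUMOR if words & suspicious else ABSTAIN

def lf_normal_count(report: str) -> int:
    """Vote NO_TUMOR if "normal" appears at least twice (hypothetical threshold)."""
    count = len(re.findall(r"\bnormal\b", report.lower()))
    return NO_TUMOR if count >= 2 else ABSTAIN

report = "Ventricles are normal. Normal midline structures. No acute findings."
votes = [lf(report) for lf in (lf_mentions_mass, lf_normal_count)]
print(votes)  # [-1, 0]: the first rule abstains, the second votes NO_TUMOR
```

Rules like these are deliberately simple and noisy. For instance, a report saying “no mass seen” would still trigger the first rule, which is why many rules are combined rather than trusted individually.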

The idea is to create many labeling functions and then check how they correlate. You don’t want two labeling functions that are essentially the same rule to count as two independent votes. Similarly, you don’t want unreliable labeling functions to define labels. The analogy from manual labeling is that you only want to consider labels provided by reliable crowd workers. In the case of automatic labeling, the Snorkel system takes care of this by weighting the labeling functions accordingly.

In a nutshell, Snorkel applies multiple labeling functions to unlabeled data, collects their outputs in a matrix of noisy labels, and combines those votes into a single probabilistic label for each example.
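The sketch below walks through that pipeline on a tiny, made-up label matrix. It is a simplified stand-in and not Snorkel's actual label model: here each labeling function is weighted by how often it agrees with a plain majority vote, which is an assumption chosen only to keep the example short. The matrix values and function names are hypothetical.

```python
from itertools import combinations

ABSTAIN = -1

# Hypothetical label matrix: one row per MRI report, one column per labeling
# function; 1 = tumor, 0 = no tumor, -1 = abstain.
L = [
    [1, 1, -1],
    [0, 0, 0],
    [1, 1, 0],
    [-1, -1, 0],
    [0, 0, 1],
]
n_lfs = len(L[0])

def agreement(i, j):
    """Fraction of rows where LFs i and j both vote and give the same label."""
    both = [(row[i], row[j]) for row in L if ABSTAIN not in (row[i], row[j])]
    return sum(a == b for a, b in both) / len(both) if both else 0.0

def majority(row):
    """Unweighted majority vote over non-abstaining LFs (ties give ABSTAIN)."""
    pos, neg = row.count(1), row.count(0)
    return ABSTAIN if pos == neg else int(pos > neg)

def weight(j):
    """Crude reliability proxy: how often LF j agrees with the majority vote."""
    pairs = [(row[j], majority(row)) for row in L]
    pairs = [(v, m) for v, m in pairs if ABSTAIN not in (v, m)]
    return sum(v == m for v, m in pairs) / len(pairs) if pairs else 0.0

def prob_tumor(row, weights):
    """Weighted share of votes for the tumor label; 0.5 if no LF voted."""
    pos = sum(w for v, w in zip(row, weights) if v == 1)
    total = sum(w for v, w in zip(row, weights) if v != ABSTAIN)
    return pos / total if total else 0.5

pairwise = {(i, j): agreement(i, j) for i, j in combinations(range(n_lfs), 2)}
weights = [weight(j) for j in range(n_lfs)]
probs = [round(prob_tumor(row, weights), 2) for row in L]

print(pairwise[(0, 1)])  # 1.0: columns 0 and 1 always agree, so they are redundant
print(weights)           # [1.0, 1.0, 0.5]
print(probs)             # [1.0, 0.0, 0.8, 0.0, 0.2]
```

Note what happens with the two identical columns: they dominate the majority vote and therefore their own weights. That is exactly the kind of double counting the correlation check is meant to catch, and a reason a real system needs something more careful than this proxy.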

What are the advantages of Snorkel?

The proposed approach to data labeling has a number of important benefits:

Interpretability. Moving from manual labeling to programmatic labeling helps to improve the interpretability of ML algorithms. Rules and patterns embedded in the labeling functions improve our understanding of the difference between positive and negative data samples.
Re-usability. If your training dataset is labeled manually and, during model development, you discover that you need slightly different labels, you have to restart the labeling from scratch. If you label your data programmatically, you can quickly change the labeling functions to fit the new models. Moreover, labeling functions can be repurposed and re-applied when new problems come in.
Domain expertise. Snorkel leverages the knowledge of subject matter experts who, in addition to writing labeling functions, also provide explanations for the rules embedded in the labeling functions. For example, an expert can specify that a certain rule applies only when a certain condition holds, and this can be reflected in code, e.g., “if the condition holds, output a label; otherwise, abstain”.

Snorkel data sampling — Example of a labeling function
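As a rough illustration of re-usability and of the “if the condition holds, output a label; otherwise, abstain” pattern, the sketch below builds rules from a small factory function. Everything here is hypothetical: `make_keyword_lf`, the keyword lists, the label values, and the “contrast” condition are invented for the example and are not part of Snorkel.

```python
import re
from typing import Callable, Iterable

ABSTAIN = -1

def make_keyword_lf(
    keywords: Iterable[str],
    label: int,
    applies_if: Callable[[str], bool] = lambda text: True,
) -> Callable[[str], int]:
    """Build a rule: if `applies_if` holds and a keyword is present,
    output `label`; otherwise abstain."""
    keyword_set = {k.lower() for k in keywords}

    def lf(text: str) -> int:
        if not applies_if(text):
            return ABSTAIN
        words = set(re.findall(r"[a-z]+", text.lower()))
        return label if words & keyword_set else ABSTAIN

    return lf

# First task: flag possible tumors (label value is illustrative).
TUMOR = 1
lf_tumor = make_keyword_lf(["mass", "lesion"], TUMOR)

# The task changes later: only contrast studies should be flagged for review.
# The same factory is reused with a new label and a guard condition.
NEEDS_REVIEW = 2
lf_review = make_keyword_lf(
    ["mass", "lesion"],
    NEEDS_REVIEW,
    applies_if=lambda text: "contrast" in text.lower(),
)

report = "Enhancing lesion seen after contrast administration."
print(lf_tumor(report), lf_review(report), lf_review("Small lesion noted."))
# 1 2 -1
```

Changing the label definition meant editing a few lines and re-running the rules over the same reports, rather than sending the dataset back to annotators.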

The only drawback is that labels generated through labeling functions are not quite as good as manually applied labels. Considering all the benefits of this approach, that is a trade-off most ML teams are usually ready to accept.

Who can leverage this data labeling framework?

In the first academic paper on Snorkel, presented at NeurIPS 2016, the tool was described as an algorithm. Now the creators of Snorkel see it as part of ML infrastructure, a system that can be leveraged by industry and academia for speeding up and improving the data labeling process. In the new version of Snorkel, its creators go beyond data labeling and introduce two new operations: transforming or augmenting data, and slicing or partitioning data.

Snorkel data labeling — New version of Snorkel
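The two newer operations can be pictured in the same way as labeling functions: small user-written functions applied over the data. The sketch below is plain Python rather than Snorkel's API, and the synonym table, function names, and the six-word threshold are assumptions made for illustration.

```python
import random
from typing import Dict, List

# Hypothetical synonym table used only for this illustration.
SYNONYMS: Dict[str, List[str]] = {"mass": ["lesion"], "normal": ["unremarkable"]}

def tf_swap_synonym(text: str, rng: random.Random) -> str:
    """Transformation: return a copy with one known word replaced by a synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return text
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

def sf_short_report(text: str) -> bool:
    """Slicing: flag reports under six words (hypothetical threshold)."""
    return len(text.split()) < 6

reports = [
    "Large mass in left lobe",
    "Study is normal with no acute intracranial findings",
]
rng = random.Random(0)

# Augmentation: each report gets a label-preserving variant.
augmented = [tf_swap_synonym(r, rng) for r in reports]

# Slicing: pick out a subset of the data to monitor separately.
short_slice = [r for r in reports if sf_short_report(r)]

print(augmented)
print(short_slice)
```

A transformation function produces extra training examples that should keep the original label, while a slicing function marks a subset of examples whose model performance you want to track on its own.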

Snorkel is open-source, and over the last few years, it has been deployed by a number of tech leaders, including Google, Intel, and IBM. Snorkel has also been successfully applied in medicine (e.g. Stanford, VA), government, and science.

About Kate Koidan

Kate is Editor at TOPBOTS. She likes to follow the latest research breakthroughs in Artificial Intelligence, but she is also a fan of real-world AI applications. Kate loves the moments when she can enjoy a cup of coffee while an AI-powered robot entertains her small kids.
