Why Does Training Data Matter?
Machine Learning has made significant strides in the last decade. This can be attributed to parallel improvements in processing power and new breakthroughs in Deep Learning research. Another key reason is the abundance of data that has been accumulated. Analysts estimate humankind sits atop 44 zettabytes of information today. OpenAI's headline-grabbing GPT-2 model was trained on 40GB of internet text. These algorithms have advanced at a phenomenal rate and their appetite for training data has kept pace.
Methods of feeding data into algorithms can take multiple forms. Unsupervised learning takes large amounts of data and identifies its own patterns in order to make predictions for similar situations. Unsupervised learning has been applied to large, unstructured datasets such as stock market behavior or Netflix show recommendations. This article will focus on supervised learning, in which humans apply their own set of labels to data in order to better understand and classify other data. Supervised learning requires less data and can be more accurate, but does require labeling to be applied. The dataset along with its associated label is referred to as ground truth. We will cover common supervised learning use cases below.
Additionally, data itself can be classified under at least 4 overarching formats – text, audio, images, and video. While there are interesting applications for all types of data, we will home in on text data to discuss a field called Natural Language Processing (NLP).
Common Use Cases for NLP
There is a broad spectrum of use cases for supervised learning. One common use case is to understand the core meaning of a sentence or text corpus by identifying and extracting key entities. This sub-branch is commonly referred to as Named Entity Recognition or Named Entity Extraction.
In the above example, Big Bird can be identified as a character, while the porch might be labeled as a location. With enough examples, a model may be able to start recognizing other patterns, such as Elmo sits on the porch, or Cookie Monster stands on the street. Extrapolating beyond this toy example, companies around the world are able to use this methodology to read a doctor’s notes and understand what medical procedures were performed; an algorithm can read a business contract and understand the parties involved and how much money changed hands.
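To make the idea of entity labels concrete, here is a minimal sketch of how ground truth for this kind of task might be stored: each label is a character span plus a label name. The sentence, the `Span` class, and the `find_span` helper are all assumptions made for illustration, not a standard format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def find_span(text: str, phrase: str, label: str) -> Span:
    """Wrap the first occurrence of `phrase` in a labeled Span."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


# Hypothetical sentence and label names, chosen only for illustration.
text = "Big Bird sits on the porch"
annotations = [
    find_span(text, "Big Bird", "CHARACTER"),
    find_span(text, "the porch", "LOCATION"),
]

for span in annotations:
    print(text[span.start:span.end], "->", span.label)
```

Storing offsets rather than the matched words keeps each label tied to an exact position, which matters when the same word appears more than once in a document.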
Another popular area for NLP is sentiment analysis. This allows algorithms to understand the tone of a sentence.
We can train a binary classifier to determine whether a sentence is positive or negative. More advanced classifiers can be trained beyond the binary on a full spectrum, differentiating between phenomenal, good, and mediocre. Sentiment analysis has been applied to data as varied as product reviews on shopping sites, social media posts about a political candidate, and customer experience surveys. Generalizing sentiment analysis further, a field called document labeling allows us to categorize entire documents – a user sending a support email about login issues can be classified separately from an email about product availability, allowing a business to route the requests to the appropriate department.
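As a rough illustration of how labeled sentences become a binary classifier, the sketch below trains a tiny Naive Bayes model with add-one smoothing. Naive Bayes is my choice for brevity, not a method the article prescribes, and the four training sentences and the `train` and `predict` helpers are made up for the example. A real project would need far more labeled data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled examples (the ground truth).
labeled = [
    ("great product works perfectly", "positive"),
    ("love it excellent quality", "positive"),
    ("terrible broke after one day", "negative"),
    ("awful quality do not buy", "negative"),
]
model = train(labeled)
print(predict(model, "excellent product"))
print(predict(model, "terrible quality"))
```

The same structure extends to document labeling: replace the two sentiment labels with department names and the classifier routes emails instead of scoring tone.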
Other, more advanced tasks in NLP include dependency parsing and syntax trees, which allow us to break down the structure of a sentence in order to better deal with ambiguities in human language.
Interpretation 1: Ernie is on the phone with his friend and says hello
Interpretation 2: Ernie sees his friend on the phone and says hello
Finally, it is possible to blend the tasks above, highlighting individual words as the reason for a document label.
What Do I Actually Need to Label?
While many of the toy examples above may seem clear and obvious, labeling is not always so straightforward. One needs to start with 2 key ingredients: data and a label set.
Some companies may have to begin by finding appropriate data sources. Many academics have scraped sites like Wikipedia, Twitter, and Reddit to find real-world examples. Open dataset repositories such as Kaggle, Project Gutenberg, and Stanford’s DeepDive may be good places to start.
Thanks to the period of Big Data and advances in cloud computing, many companies already have large amounts of data. Oftentimes this data will be referred to as unstructured data, or raw data. However, before it is ready to be labeled this data often needs to be processed and cleaned. For example, when presenting data to your labeler, how would you like to determine where one sentence begins, and another ends? How are semicolons treated? Make sure you don’t accidentally treat the ‘.’ at the end of Mrs. as an end of sentence delimiter. Data may also be missing or misspelled. In certain industries like healthcare and financial institutions, it is important or even legally required to remove personally identifiable information (PII) before it is ready to be presented to labelers.
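The cleaning questions above can be sketched in a few lines. The example below splits text into sentences while treating titles such as "Mrs." as non-terminal, keeps semicolons inside a sentence, and masks email addresses as one simple form of PII removal. The abbreviation list, the email pattern, and both function names are assumptions for illustration. Production pipelines typically need much broader rules.

```python
import re

# Assumed, non-exhaustive list of abbreviations that end in a period.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "etc."}

# Simplified email pattern; real PII detection covers many more cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_sentences(text):
    """Split on ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without end punctuation
        sentences.append(" ".join(current))
    return sentences


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[EMAIL]", text)


raw = "Mrs. Smith emailed jane@example.com today. She asked for a refund; we agreed."
cleaned = [redact_emails(sentence) for sentence in split_sentences(raw)]
for sentence in cleaned:
    print(sentence)
```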
Once you have identified your training data, the next big decision is in determining how you’d like to label that data. The labels to be applied can lead to completely different algorithms. One team browsing a dataset of receipts may want to focus on the prices of individual items over time and use this to predict future prices. Another may be focused on identifying the store, date, and timestamp and understanding purchase patterns.
Practitioners will refer to the taxonomy of a label set. What level of granularity is required for this task? Is it enough to understand that a customer is sending in a customer complaint and route the email to the customer support team? Or would you like to specifically understand which product the customer is complaining about? Or even more specifically, whether they are asking for an exchange/refund, complaining of a defect, an issue in shipping, etc.? Note that as you increase the taxonomy granularity, you will require more data for the algorithm to adequately train on each individual label.
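One way to see the cost of a finer taxonomy is to count how many examples each label receives at different levels of detail. The sketch below assumes labels are written as hypothetical coarse-to-fine paths such as `complaint/refund`. The data and the `counts_at_depth` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical labels, written as paths from coarse to fine.
labels = [
    "complaint/refund", "complaint/refund", "complaint/defect",
    "complaint/shipping", "inquiry/availability", "inquiry/availability",
    "inquiry/login", "complaint/defect",
]


def counts_at_depth(labels, depth):
    """Count examples per label after truncating each path to `depth` levels."""
    return Counter("/".join(label.split("/")[:depth]) for label in labels)


coarse = counts_at_depth(labels, 1)
fine = counts_at_depth(labels, 2)

print("coarse:", dict(coarse))
print("fine:", dict(fine))
print("fewest examples per label:", min(coarse.values()), "vs", min(fine.values()))
```

With the same eight examples, the two coarse labels each have at least three examples, while some fine-grained labels have only one. That gap is why a more granular taxonomy calls for more data.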
Let’s Get Labeling
Okay – we’ve established the raison d’être for labeled data. How do we actually start?
Many data scientists and students begin by labeling the data themselves. This has the advantage of staying close to the ground on the labeled data. You may label 100 examples and decide if you need to refine your taxonomy, add or remove labels. Data quality is also fully within your control.
In order to scale to the large number of labels that are often required for training algorithms and to save time, companies may choose to hire a professional service. The choice in labeling service can make a big difference in the quality of your training data, the amount of time required and the amount of money you need to spend.
Crowd-sourced labeling services
Amazon Mechanical Turk was established in 2005 as a way to outsource simple tasks to a distributed “crowd” of humans around the world. Since the ascent of AI, we have also seen a rise in companies specializing in crowd-sourced services for data labeling. Some of the top companies include Appen, Playment, Samasource, and iMerit. For a fee, these companies will take your data and set up a labeling task on their platforms. Labelers around the world who are registered with their service can label your data.
The advantages of using these companies include elastic scalability and efficiency. Due to the number of labelers on their platform, they can frequently finish labeling your data more quickly than any other option. They will also bring expertise to the job, advising you on how to validate data quality or suggesting how to spot check the quality of work to ensure it is up to your standards.
Disadvantages include higher price, higher variance in data quality and the potential for data leaks. The companies will often charge a sizable margin on the data labeling services and require a threshold on the number of labels applied. Fully crowd-sourced solutions can also suffer from labelers who game the system and create fake accounts. We have seen data leaks publicly embarrass companies such as Facebook, Amazon, and Apple as the data may fall into the hands of strangers around the world.
Bringing labeling in-house
In response to the challenges above, some companies choose to hire labelers in-house. This offers greater control over both access to the data and the quality of the output. However, this choice does come with its own disadvantages. Sometimes models need to be trained in time to meet a business deadline. It is possible to outsource 500,000 labels in 2 weeks to a professional labeling service, but such capacity is difficult to build out internally. In-house teams require significantly more planning and may force compromises in project timelines. Additionally, building out operational services requires a new set of skills that don’t always coincide with the company’s expertise.
So what should I do?
The decision to outsource or to build in-house will depend on each individual situation. I would start by answering the following questions:
- Is subject matter expertise required for this labeling? Some types of data cannot be handled by laypersons. A legal document may require someone with a law degree to properly understand the technical lingo. A certain level of linguistics expertise may also be required. Despite considering myself fairly fluent in the English language, I personally had to think twice before labeling a past participle verb.
- What are the risks (or legal requirements) for data privacy? If you are considering working with an external party, talk to them about the level of privacy they can adhere to. Is HIPAA compliance required for your data? Do you need labelers working with your data to be working on air-gapped computers? What tradeoffs are you willing to make on this front?
- What is my threshold for data quality? What are the repercussions if my algorithm makes a mistake? Does an email get routed to the incorrect department? Or is there a life on the line? The more critical the data quality, the more you may want to bring this in-house so you can train your own labelers to the level of accuracy required for your line of work.
- Will this be a core part of my business in the long-term? If training your own AI is a core part of your company identity, it may be helpful to make the investment and learn how to set up your own labeling workforce. It will likely save you money and improve operational efficiency in the long run.
Many companies also choose to do a hybrid combination of both – using an in-house labeling workforce for recurring or mission-critical jobs, while supplementing sudden bursts of data needs with an outsourced solution.
I’ve interviewed 100+ data science teams around the world to better understand best practices in the industry. Below are 3 of the most common observations:
- Labeling redundancy – humans are fallible and may make mistakes after labeling at the end of a long day. Additionally, there are subjective biases in each judgment. A common practice is to have 2+ labelers label the same data. For some projects, a majority consensus is sufficient for determining ground truth. For others, nothing short of unanimity and a discussion around each disagreement is acceptable.
- Setting up comprehensive guidelines – one of the most common points of failure in the industry is a lack of specificity when setting up the project. As one example, a client I work with needed to remove “inappropriate content” from their live chat. Certain words are easy to weed out, such as extreme racial slurs and death threats. However, where is the line drawn? How should sarcasm or jokes be treated? These edge cases need to be well-defined by the product and engineering team in order to avoid surprises when the labeled work is complete.
- Iteration – if 500,000 documents need to be labeled, start with a small subset first. Review the first batch of data carefully and make sure it conforms to your expectations. As with many aspects of our industry, rarely is a project set up perfectly on the first try and an iterative approach will save significant time and money in the long run.
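The redundancy practice above can be sketched as a small consolidation step: accept a label when enough labelers agree, and set aside everything else for discussion. The function names, the vote data, and the strict-majority rule are assumptions for illustration. Teams choose their own thresholds.

```python
from collections import Counter


def resolve(votes, require_unanimous=False):
    """Return the agreed label, or None if the votes do not meet the bar."""
    label, count = Counter(votes).most_common(1)[0]
    # Strict majority by default; every vote must match when unanimity is required.
    needed = len(votes) if require_unanimous else len(votes) // 2 + 1
    return label if count >= needed else None


def consolidate(item_votes, require_unanimous=False):
    """Split items into accepted ground truth and items needing discussion."""
    ground_truth, needs_discussion = {}, []
    for item_id, votes in item_votes.items():
        label = resolve(votes, require_unanimous)
        if label is None:
            needs_discussion.append(item_id)
        else:
            ground_truth[item_id] = label
    return ground_truth, needs_discussion


# Hypothetical votes from 2+ labelers per document.
item_votes = {
    "doc1": ["positive", "positive", "negative"],
    "doc2": ["negative", "negative", "negative"],
    "doc3": ["positive", "negative"],
}

print(consolidate(item_votes))
print(consolidate(item_votes, require_unanimous=True))
```

Running the same step on a small first batch is also a cheap way to apply the iteration advice: a long discussion list usually signals that the guidelines need more detail before the full dataset is labeled.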
Now that you’ve got your data, your label set and your labelers, how exactly is the sausage made? The young ML industry is still quite varied in its approach.
The most common starting point is an Excel/Google spreadsheet. This interface is serviceable, ubiquitously understood and has a gentle learning curve. It handles common labeling tasks such as part-of-speech and named entity recognition labeling. The main disadvantage of the spreadsheet is that its interface was not created for this task. Furthermore, it can be error-prone. Typos are easy to make and columns of cells are not the most intuitive way to read a text document. Some types of labeling such as dependency parsing are simply not viable using spreadsheets. Most importantly, this approach does not scale: as your needs grow, you will require more advanced interfaces and workforce management solutions.
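If you do start with a spreadsheet, a short validation script can catch the typos mentioned above before they reach your model. This sketch assumes the sheet has been exported as CSV with `text` and `label` columns. The column names, the allowed label set, and the `find_invalid_labels` helper are hypothetical.

```python
import csv
import io

# Hypothetical label set agreed on for the project.
ALLOWED_LABELS = {"CHARACTER", "LOCATION"}


def find_invalid_labels(csv_text, allowed):
    """Return (row_number, label) for every label outside the allowed set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # Row 1 of the sheet is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        label = row["label"].strip()
        if label not in allowed:
            problems.append((row_number, label))
    return problems


# Hypothetical spreadsheet export with one typo and one casing error.
export = (
    "text,label\n"
    "Big Bird,CHARACTER\n"
    "the porch,LOCATON\n"
    "the street,location\n"
)
print(find_invalid_labels(export, ALLOWED_LABELS))
```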
A standard for more advanced NLP companies is to turn to the open-source community. Tools such as brat and WebAnno are popular labeling tools. These were built with labeling in mind, offering a wide array of customizations. They can be freely set up and hosted and handle more advanced NLP tasks such as dependency labeling. The downsides are that the learning curve is higher and some level of training and adjustment is required. Direct customer support can be limited. These tools are also in various levels of maintenance as they rely on the open-source community for improvements and bug fixes.
Others still choose to build their own tools in-house. This has the benefit of full integration with your own stack. However, building in-house tools requires the investment of engineering time to not only set up the initial tool but also ongoing support and maintenance.
Commercial tools are also available. These include Prodigy, LightTag, TagTog and Datasaur.ai (disclaimer: I am the founder/CEO of Datasaur). These companies offer labeling tools at various price points. Similar to the open-source tools they offer customizability and handle advanced NLP tasks. Other features to consider include team management workflows for your labeling team, labeler performance reports, data permissioning, on-prem capabilities, and semi-automated labeling. Semi-automated labeling is a relatively recent development that allows your labelers to have a head start when labeling. Instead of labeling everything from scratch, a model can be plugged in to label common English terms.
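To give a feel for semi-automated labeling, the sketch below proposes labels from a small lookup table so a labeler only has to confirm or correct them. A plain dictionary stands in here for the plugged-in model, and the `GAZETTEER` entries and `prelabel` function are invented for illustration. It does not represent how any of the tools named above work internally.

```python
import re

# Hypothetical lookup table standing in for a pre-trained model.
GAZETTEER = {
    "Big Bird": "CHARACTER",
    "Elmo": "CHARACTER",
    "porch": "LOCATION",
}


def prelabel(text, gazetteer):
    """Suggest (start, end, label) spans for known phrases found in `text`."""
    suggestions = []
    for phrase, label in gazetteer.items():
        # Word boundaries avoid matching inside longer words.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text):
            suggestions.append((match.start(), match.end(), label))
    return sorted(suggestions)


print(prelabel("Elmo sits on the porch", GAZETTEER))
```

The labeler then reviews the suggested spans instead of starting from a blank document, which is where the head start comes from.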
As in many situations, choosing the right tool for the job can make a significant difference in the final output. Considerations should include the intuitiveness of the interface for your particular task. What types of labeling jobs does the tool specialize in? Is there sufficient customizability for your project’s unique needs? Will you be able to organize and prioritize labeling projects from a single interface? What level of support is offered when questions or issues arise? What is your budget allocation? Identify your primary pain points to find the right solution for your job.
ML is a “garbage in, garbage out” technology. The effectiveness of the resulting model is directly tied to the input data; data labeling is therefore a critical step in training ML algorithms. Indeed, increasing the quantity and quality of training data can be the most efficient way to improve an algorithm. And with ML’s growing popularity the labeling task is here to stay. As you approach setting up or revisiting your own labeling process, review the following:
- Data source
- How will you collect the data?
- How will you clean it?
- Label set
- In order to train your model, what types of labels will you need to feed in?
- What level of granularity in taxonomy is required for your model to make the correct predictions?
- Can you start with a more simple model first, then refine it later?
- Labeling service
- Will you go with an external or internal workforce? Should you use a hybrid approach?
- Are subject-matter experts required?
- Are there any compliance or regulatory requirements to be met?
- Labeling tool
- What type of interface is needed?
- Is semi-automated labeling applicable to your project?
- What level of security and data permissioning is required?
- How do you intend to manage your workforce? Should that be included in the software?
There are many options available and the industry is still figuring out its standards. But by answering the questions above you should be able to narrow down your choices quickly. Best of luck!