Data is a human invention. Humans define the phenomenon that they want to measure, design systems to collect data about it, clean and pre-process it before analysis, and finally choose how to interpret the results. Even with the same dataset, two people can draw vastly different conclusions. This is because data alone is not “ground truth,” which machine learning experts define as observable, provable, and objective data that reflects reality. If data was inferred from other information, relies on subjective judgment, was not collected in a rigorous and careful manner, or is of questionable authenticity, then it is not ground truth.
How you choose to conceptualize a phenomenon, determine what to measure, and decide how to take measurements will all impact the data that you collect. Your ability to solve a problem with artificial intelligence depends heavily on how you frame your problem and also whether you can establish ground truth without ambiguity. Ground truth is used as a benchmark to assess the performance of algorithms. If your gold standard is wrong, then your results will not only be wrong but also potentially harmful to your business.
Unless you were directly involved with defining and monitoring your original data collection goals, instruments, and strategy, you are likely missing critical knowledge that may result in incorrect processing, interpretation, and use of that data.
Common Mistakes With Data
What people call “data” can be carefully curated measurements selected purely to support an agenda, haphazard collections of random information with no correspondence to reality, or information that looks reasonable but resulted from unconsciously biased collection efforts. Here’s a crash course on statistical errors that every executive should be familiar with.
Failing to pin down the reason for collecting data means that you’ll miss the opportunity to articulate assumptions and to determine what to collect. The result is that you’ll likely collect the wrong data or incomplete data. A common trend in big data is for enterprises to gather heaps of information without any understanding of why they need it and how they want to use it. Gathering huge but messy volumes of data will only impede your future analytics, since you’ll have to wade through much more junk to find what you actually want.
Let’s say you want to know how much your customers spent on your services last quarter. Seems like an easy task, right? Unfortunately, even a simple goal like this will require defining a number of assumptions before you can get the information that you want.
First, how are you defining “customer”? Depending on your goals, you might not want to lump everyone into one bucket. You may want to segment customers by their purchasing behavior in order to adjust your marketing efforts or product features accordingly. If that’s the case, then you’ll need to be sure that you’re including useful information about the customer, such as demographic information or spending history.
There are also tactical considerations, such as how you define quarters. Will you use fiscal quarters or calendar quarters? Many organizations’ fiscal years do not correspond with calendar years. Fiscal years also differ internationally, with Australia’s fiscal year starting on July 1st and India’s fiscal year starting on April 1st. You will also need to develop a strategy to account for returns or exchanges. What if a customer bought your product in one quarter but returned it in another? What if they filed a quality complaint against you and received a refund? Do you net these in the last quarter or this one?
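Even the fiscal-quarter question above is enough to change your numbers. As a minimal sketch, the function below maps a purchase date to a fiscal quarter given the month in which the fiscal year begins (the function name and the sample date are illustrative, not from any particular system):

```python
from datetime import date

def fiscal_quarter(d: date, fy_start_month: int) -> int:
    """Return the fiscal quarter (1-4) of date d for a fiscal year
    beginning in fy_start_month (e.g., 7 for Australia, 4 for India,
    1 for a calendar-year fiscal year)."""
    # Months elapsed since the start of the fiscal year (0-11)
    months_into_fy = (d.month - fy_start_month) % 12
    return months_into_fy // 3 + 1

# The same purchase lands in different quarters under different conventions:
purchase = date(2021, 8, 15)
fiscal_quarter(purchase, 1)  # calendar quarters -> quarter 3
fiscal_quarter(purchase, 7)  # Australian fiscal year -> quarter 1
fiscal_quarter(purchase, 4)  # Indian fiscal year -> quarter 2
```

The point is not the arithmetic but the assumption: until you pin down which convention “last quarter” refers to, the same transaction can be counted in three different reporting periods.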
As you can see, definitions are not so simple. You will need to discuss your expectations and set appropriate parameters in order to collect the information that you actually want.
Once you’ve identified the type of data that you wish to collect, you’ll need to design a mechanism to capture it. Mistakes here can result in capturing incorrect or accidentally biased data. For example, if you want to test whether product A is more compelling than product B, yet you always display product A first on your website, then users may not see or purchase product B as frequently, leading you to the wrong conclusion.
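One common remedy for the position bias described above is to randomize the display order per visitor. The sketch below assumes a simple two-product page; the product names and function are hypothetical:

```python
import random

def randomized_order(products, rng=random):
    """Shuffle the display order per visitor so that no product
    systematically benefits from being shown first."""
    order = list(products)
    rng.shuffle(order)
    return order

# Each visitor sees the products in a random order, so position
# bias averages out across the experiment:
for _ in range(3):
    print(randomized_order(["product_a", "product_b"]))
```

With randomization, any remaining difference in purchase rates is more plausibly attributable to the products themselves rather than to where they appeared on the page.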
Measurement errors occur when the software or hardware that you use to capture data goes awry, either failing to capture usable data or producing spurious data. For example, information about user behavior on your mobile app may be lost if the user experiences connectivity issues and the usage logs are not synchronized with your servers. Similarly, if you are using hardware sensors like a microphone, your audio recordings may capture background noise or interference from other electrical signals.
As you can see from our simple attempt to calculate customer sales earlier, many errors can occur even before you look at your data. Many enterprises own data that is decades old, where the original team capable of explaining the data decisions is long gone. Their assumptions and issues are likely not documented and will be up to you to deduce, which can be a daunting task.
You and your team may make assumptions that differ from the original ones made during data collection and achieve wildly different results. Common errors include missing a particular filter that may have been used on the data, such as the removal of outliers; using different accounting standards, as in the case with financial reporting; and simply making methodological mistakes.
Coverage error describes what happens with survey data when there is insufficient opportunity for all targeted respondents to participate. For example, if you are collecting data on the elderly but only offer a website survey, then you’ll probably miss out on many respondents.
In the case of digital products, your marketing teams may be interested in projecting how all mobile smartphone users might behave with a prospective product. However, if you only offer an iOS app but not an Android app, the iOS user data will give you limited insight into how Android users may behave.
Sampling errors occur when you analyze data from a smaller sample that is not representative of your target population. This is unavoidable when data only exists for some groups within a population. The conclusions that you draw from the unrepresentative sample will probably not apply to the whole. If you only ask your friends for opinions about your products and then assume your user population will feel similarly, this is a classic sampling error.
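A quick simulation makes the friends example concrete. The satisfaction scores below are entirely made up for illustration: the broader population is lukewarm on the product, while your friends skew enthusiastic.

```python
import random

rng = random.Random(42)

# Synthetic population of 10,000 satisfaction scores (1-5),
# centered around "lukewarm" -- a made-up distribution.
population = [rng.choice([1, 2, 3, 3, 3, 4, 5]) for _ in range(10_000)]

# A small, unrepresentative sample: your friends, who skew positive.
friends = [rng.choice([4, 4, 5, 5, 5]) for _ in range(20)]

pop_mean = sum(population) / len(population)
friend_mean = sum(friends) / len(friends)
# friend_mean substantially overstates pop_mean
```

The friends' average lands well above the population average, so any conclusion drawn from the friend sample, such as “users love this product,” would not generalize.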
Inference errors are made by statistical or machine learning models when they make incorrect predictions from the available ground truth. Two types of inference errors can occur: false negatives and false positives. False positives occur when you incorrectly predict that an item belongs in a category when it does not, such as saying that a patient has cancer when they are healthy. False negatives occur when an item is in a category, but you predict that it is not, such as when a patient with cancer is predicted to be cancer-free.
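Counting these two error types is straightforward once you have ground-truth labels to compare against. The sketch below uses made-up patient labels (True = has the condition) purely for illustration:

```python
def inference_errors(ground_truth, predictions):
    """Count false positives and false negatives by comparing
    predicted labels against ground-truth labels."""
    false_positives = sum(
        1 for truth, pred in zip(ground_truth, predictions)
        if pred and not truth)
    false_negatives = sum(
        1 for truth, pred in zip(ground_truth, predictions)
        if truth and not pred)
    return false_positives, false_negatives

# Four patients: the model wrongly flags the second as having the
# condition (false positive) and misses the fourth (false negative).
truth = [True, False, False, True]
preds = [True, True, False, False]
inference_errors(truth, preds)  # -> (1, 1)
```

Note that this tally is only as trustworthy as the ground-truth labels themselves: if the labels are noisy, the counts are too.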
Assuming you have a clean record of ground truth, calculating inference errors will help you to assess the performance of your machine learning models. However, the reality is that many real world datasets are noisy and may be mislabeled, which means that you may not have clarity on the exact inference errors that your AI system is making.
Reality can be elusive, and you cannot always establish ground truth with ease. In many cases, such as with digital products, you can capture tons of data about what users did on your platform but not their motivation for those actions. You may know that a user clicked on an advertisement, but you don’t know how annoyed they may have been with it.
In addition to many known types of errors, there are unknown unknowns about the universe that leave a gap between your representation of reality, in the form of data, and reality itself.