The use of data in driving business decisions is a competitive imperative in today's business world, improving how companies market to, sell to, and service their customers. Yet IBM found that 1 in 3 business leaders do not trust the information they use to make decisions. When business leaders don't believe their data, they are unlikely to support an effort to collect more of it, let alone use it. How can you improve executives' trust in the information they use? Start by looking at the veracity of the data.
While data can be described by many different qualities, in the era of Big Data, three qualities - Volume, Velocity and Variety - have dominated the conversation about data. Some people have introduced other qualities under the umbrella of Big Data as the fourth, fifth and sixth V of Big Data (e.g., Value, Veracity, Viability). Seth Grimes, however, correctly points out that these three new Wanna-Vs are misleading descriptors of Big Data as they tell you nothing about the "bigness" of your data. They are still important qualities to consider, however, whether your data are large or small, static or moving, structured or unstructured.
The veracity of your data is about the accuracy and truthfulness of the data and the analytic outcomes of those data. The veracity of your data is adversely impacted by different types of error that are introduced into the generation, collection and analysis of those data. The more errors that are introduced into the processing of the data, the less trustworthy the data become.
Ensuring Veracity of your Data
Earlier this year, Kate Crawford addressed this notion of data veracity in an excellent piece in Harvard Business Review titled "The Hidden Biases in Big Data." Disputing the notion that if you have enough data, the numbers speak for themselves, she states correctly that humans give data their voice; people draw inferences from the data and give data their meaning. Unfortunately, people introduce bias, intentional and unintentional, that weakens the quality of the data.
Improving the veracity of your data requires minimizing the occurrence of different sources of errors. These sources are related to: sampling method, capitalizing on chance, missing data, research bias and poor measurement. Before making decisions using data, executives first need to answer the following questions.
1. What is (are) your hypothesis(es)?
Despite the popular notion that Big Data is about simply finding correlations among variables rather than identifying why these relationships exist, I believe that, to be of real, long-term value to business, Big Data needs to be about understanding the causal links among the variables. Hypothesis testing helps shed light on identifying the reasons why variables are related to each other and the underlying processes that drive the observed relationships. Hypothesis testing helps improve analytical models through trial and error to identify the causal variables and helps you generalize your findings across different situations.
With a plethora of variables and data sets at their disposal, businesses can test literally thousands of relationships quickly. The probability of finding at least one statistically significant relationship among metrics increases greatly as the number of relationships examined grows. Often, due simply to chance, a statistically significant relationship between two variables is found when, in reality, there is no underlying reason why they should be related. Using these spurious findings to support your existing beliefs is a good recipe for making sub-optimal decisions.
What can you do? Have a hypothesis(es) and test it (them).
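To see how easily chance alone produces "significant" findings, here is a minimal simulation sketch in Python. It is an illustration, not an analysis of real data: the function names, the sample size of 30, and the 1,000 tests are assumptions chosen for the example. The cutoff of 0.361 is the approximate two-tailed critical value of the correlation coefficient at p < .05 for 30 observations.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(num_tests=1000, sample_size=30, critical_r=0.361, seed=0):
    """Count how many pairs of unrelated random variables exceed the cutoff.

    Every pair is generated independently, so any "significant"
    correlation found here is due to chance alone.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_tests):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson_r(xs, ys)) > critical_r:
            hits += 1
    return hits


spurious = count_spurious()
print(f"{spurious} of 1000 unrelated pairs look 'significant' at p < .05")
```

You should expect roughly 5% of the unrelated pairs to cross the threshold. If you go looking through thousands of relationships without a hypothesis, those chance results are the ones most likely to catch your eye.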
2. What are your biases?
People tend to seek out / remember / interpret results that support their existing beliefs and ignore or discount results that do not support their beliefs. Referred to as confirmation bias, this cognitive short-cut can often result in wrong conclusions about your data.
What can you do? Specifically look for data to refute your beliefs. If you believe product quality is more important than service quality in predicting customer loyalty, be sure to collect evidence about the relative impact of service quality (compared to product quality).
Also, don't rely on your memory. When making decisions based on any kind of data, cite the specific reports/studies in which those data appear. Referencing your information source can help other people verify the information and help them understand your decision and how you arrived at it. If they arrive at a different conclusion than you, understand the source of the difference (data quality? different metrics? different analysis?).
Use inferential statistics to separate real, systematic, meaningful variance in the data from random noise. When you present results in a graph, place a verbal description of the interpretation next to it. A clear description leaves little room for misinterpreting the graph. Finally, let multiple people interpret the information contained in customer reports. People from different perspectives (e.g., IT vs. Marketing) might provide highly different (and revealing) interpretations of the same data.
3. What is the sample size?
We rarely (never) have access to the entire population of things which interest us. Instead, we rely on measuring a sample of that population to make conclusions about the entire population. For example, we collect customer satisfaction ratings from a portion of our customers (sample) to understand the satisfaction of the entire customer base (population).
When you use samples to understand populations, you need to understand sampling error. Sampling error reflects the difference between the sample of data and the population from which that sample was drawn. Because the sample is only a subset of the population, any estimate based on it includes some error.
What can you do? Use inferential statistics to help you understand if the observations you see in your sample likely reflect what you would see in the population.
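One common way to express sampling error is a confidence interval around a sample mean. The sketch below is a minimal illustration, assuming a simple random sample and a normal approximation (with small samples, a t-distribution would be the more careful choice). The function name and the satisfaction ratings are made up for the example.

```python
import math
import statistics


def mean_confidence_interval(ratings, confidence=0.95):
    """Return (low, high) bounds for the population mean.

    Uses the normal approximation: mean +/- z * standard error.
    """
    n = len(ratings)
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(n)
    # z-value that leaves (1 - confidence) / 2 in each tail
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_err
    return mean - margin, mean + margin


# Hypothetical 0-10 satisfaction ratings from a sample of customers
sample_ratings = [8, 9, 7, 10, 6, 8, 9, 7, 8, 9, 5, 8, 7, 9, 10, 6, 8, 7, 9, 8]
low, high = mean_confidence_interval(sample_ratings)
print(f"Sample mean: {statistics.mean(sample_ratings):.2f}")
print(f"95% interval for the population mean: {low:.2f} to {high:.2f}")
```

The width of the interval tells you how much the sample mean could plausibly differ from the population mean. A larger sample narrows the interval; a small one should make you cautious about small differences between groups or time periods.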
4. What is the data source?
Even when we have large data sets, where sampling error seems to be minimized, we need to know the source of the data; data don't occur in a vacuum. They can be intentionally generated/collected to solve a problem. For example, an analysis of the location of thousands of tweets during Hurricane Sandy shows that more of the tweets about the storm originated from downtown Manhattan than from New Jersey. Relying on simply counting the number of tweets, you might believe that the brunt of the storm hit downtown Manhattan. In reality, Sandy hit New Jersey, but, because of power outages in New Jersey that were due to the storm, people were simply unable to use Twitter from New Jersey.
Additionally, it is estimated that only 18% of US adult web users use Twitter, the largest segment being between 18 and 29 years old. Also, in 2012, only 8% of shoppers used their mobile device in-store to tweet about their experience. Tweets, in the context of business, represent a small, perhaps biased set of data.
What can you do? Scrutinize the data source to help determine if the data are appropriate for the question you are trying to answer. Consider using different sources of data (e.g., metrics) to test your hypotheses. Multiple lines of converging evidence can be more convincing than a single line of evidence.
5. How good are your customer measures?
As businesses are trying to get value from extremely large, quickly expanding, diverse data sets, the notion of veracity is especially important when our data reflect "softer" entities like "satisfaction with the customer experience," "customer loyalty" and "sentiment." We spend time developing metrics and algorithms of the customer experience.
We measure these constructs with survey questions, sometimes from proprietary instruments that were developed by consultants. I have written about the quality of surveys and how companies need to be concerned about the reliability and validity of these instruments.
What can you do? Ask for evidence that the instrument is measuring something useful. If needed, acquire expertise in survey development/evaluation. Don't rely only on the consultants' reassurance that the instrument measures what they say it is measuring.
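One piece of evidence you can ask for is an estimate of the instrument's reliability. As an illustration only, the sketch below computes Cronbach's alpha, a widely used index of internal consistency for multi-item survey scales. It is one reliability estimate among several and does not address validity. The function name and the response data are hypothetical.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a multi-item survey scale.

    responses: one row per respondent, each row a list of item scores.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])                      # number of survey items
    items = list(zip(*responses))              # regroup scores by item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers from six customers to a three-item loyalty scale (0-10)
loyalty_responses = [
    [9, 8, 9],
    [7, 7, 6],
    [10, 9, 10],
    [4, 5, 3],
    [8, 8, 7],
    [6, 5, 6],
]
print(f"Cronbach's alpha: {cronbach_alpha(loyalty_responses):.2f}")
```

Values closer to 1 indicate that the items move together, which is what you would expect if they measure the same underlying construct. A low value is a signal to question whether the items belong on the same scale.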
The quality of business decisions rests on the quality of business data (and the predictive models using them). While you might derive the slickest analytic model, when that model is based on data that are unreliable and invalid, that model's performance in the real world (e.g., how well it predicts reality) suffers. As my good friend and business partner from Canada, Stephen King, says about data-driven decisions, "Garbage in. Garbage ooot."