An examination of data scientist skills reveals an often overlooked skill necessary to uncover insights from data: The Scientific Method
Data scientists are a hot commodity in today's data-abundant world. Business leaders are relying on data scientists to improve how they acquire data, determine its value, analyze it and build algorithms for the ultimate purpose of improving how they do business. While the job title of "data scientist" was coined only in 2008, by D.J. Patil and Jeff Hammerbacher, Thomas H. Davenport and Patil were calling it the "sexiest job of the 21st century" by 2012. But what makes for a good data scientist?
In this post, I take a look at several industry experts' opinions about the skills, abilities and temperament needed to be a good data scientist. Specifically, I reviewed 11 articles that included lists of various data scientist skills, from Dataiku.com, Smart Data Collective, InformationWeek, Data Science Central, Teradata, Silicon Angle, Gigaom, Forrester, Wired, TDWI, and Dataversity. From each article, I extracted statements (96 in all) that reflected a skill, ability or temperament and grouped them into a smaller number of categories. I let the content of the statements drive the generation of the categories. Some categories had a fairly specific, narrow meaning (e.g., NLP) and others had a broader meaning (e.g., computer science).
While the major, popular buckets of data scientist skills emerged (e.g., quantitative, computer engineering, business acumen, communication), an additional one emerged that I call Scientific Method. First, let's look at the details of each skill or category:
1. Quantitative
Businesses need quantitative skills if they are to extract insights from their data. Quantitative skills include statistics, mathematics and predictive modeling skills. Statistical skills can help businesses summarize their large data sets into smaller pieces of meaningful information. Predictive modeling skills help businesses create algorithms, both automatically and manually, to improve business processes. As a whole, these quantitative skills allow businesses to apply mathematical rigor to their large, quickly expanding data sets to help make sense of them.
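To make "summarizing a large data set into smaller pieces of meaningful information" concrete, here is a minimal sketch using only Python's standard library. The ratings are hypothetical numbers made up for illustration, and the `summarize` helper is my own name, not part of any library.

```python
import statistics

def summarize(values):
    """Reduce a list of numbers to a few descriptive statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical customer satisfaction ratings (0-10), invented for this example.
ratings = [7, 8, 6, 9, 10, 4, 8, 7]

summary = summarize(ratings)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Four numbers now stand in for the whole list. A real analysis would pick summaries that fit the question being asked, which is where the judgment of the data scientist comes in.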
2. Computer Engineering
Another key skill is related to computer engineering. With the advent of new ways to store, analyze and retrieve data, the ideal data scientist needs skills in programming languages, distributed computing systems and open-source tools.
In addition to analyzing structured data, businesses are now trying to uncover insights from unstructured data from such sources as social media, emails, community message boards and even open-ended comments in surveys. Skills in natural language processing help data scientists transform these unstructured data into structured data to allow for quantitative analysis. Machine learning skills help businesses identify generalized patterns in the data (training data) that allow for classification of future data (target data). This pattern recognition helps drive recommendation engines that present customers with information that is relevant to them. Finally, data management skills help data scientists develop and integrate different data systems so businesses can utilize all their data in an integrated fashion.
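As a rough illustration of turning unstructured text into structured data, the sketch below converts free-text comments into rows of word counts (a simple "bag of words"). It is only one of many possible approaches. The comments, the vocabulary, and the helper names `tokenize` and `to_term_counts` are all assumptions made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def to_term_counts(comments, vocabulary):
    """Turn each free-text comment into a row of counts, one column per term."""
    rows = []
    for comment in comments:
        counts = Counter(tokenize(comment))
        rows.append([counts[term] for term in vocabulary])
    return rows

# Hypothetical open-ended survey comments, invented for this example.
comments = [
    "Support was slow, very slow, but friendly.",
    "Friendly staff and fast shipping!",
]
vocabulary = ["slow", "fast", "friendly"]

table = to_term_counts(comments, vocabulary)
for row in table:
    print(dict(zip(vocabulary, row)))
```

Once each comment is a row of numbers, the quantitative tools described earlier (summaries, predictive models, classifiers) can be applied to it like any other structured data.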
3. Business Acumen
Quantitative and computer skills don't occur in a vacuum. Data scientists, to be successful, need a good understanding of the business, including its people, products and services, and how they all work together. This knowledge of how business works helps data scientists direct their energies to data that are the most valuable to the business.
4. Communication
Data scientists need to have good communication skills. This skill is closely linked to business acumen, as data scientists need to be able to translate complex quantitative, computational findings into terms that business executives, managers, and frontline employees can understand. Data scientists often need to use visualization tools to help translate quantitative findings into images that are easily consumable by the masses. With good communication skills and the use of tools to visualize the data, data scientists are able to provide the insights that business leaders need to operationalize changes to their current business processes.
5. Scientific Method
The final group of skills reflects the need to approach problems using critical thinking, creativity and open-mindedness. I grouped this final set of skills under the label of "scientific method." Formally defined, the scientific method is a body of techniques for objectively investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge. The scientific method includes the collection of empirical evidence, subject to specific principles of reasoning. Specifically, the scientific method follows these general steps: 1) formulate a question or problem statement; 2) generate a hypothesis; 3) test the hypothesis through experimentation (when we can't conduct true experiments, data are obtained through observations and measurements); and 4) analyze the data to draw conclusions.
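The four steps above can be sketched in code. The example below is a minimal illustration, not a prescribed method: it uses a permutation test (my choice of technique) on hypothetical order values from a made-up checkout-page experiment. The data, the scenario, and the function names are all assumptions.

```python
import random

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Step 1 (question): does the new checkout page change average order value?
# Step 2 (hypothesis): orders on the new page differ from orders on the old one.
# Step 3 (experiment): hypothetical order values in whole dollars, invented here.
control = [20, 22, 19, 24, 21, 23, 20, 22]
treatment = [27, 29, 26, 30, 28, 31, 27, 29]

# Step 4 (analysis): how often would chance alone produce a gap this large?
p_value = permutation_test(control, treatment)
print(f"observed difference: {mean(treatment) - mean(control):.3f}")
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value says the gap is unlikely to be a fluke of who landed in which group. It does not say the hypothesis is true in every setting, which is why the question and the design of the experiment matter as much as the arithmetic.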
These steps are not meant to imply that science is only a series of activities. Instead of thinking about science as an area of knowledge, it is better to conceptualize it as a way to understand how the world really works. As Carl Sagan said, "Science is a way of thinking much more than it is a body of knowledge." The scientific method not only requires adherence to rules, it also requires creativity and imagination in order to find new possibilities, address problems in different ways and apply findings from one setting to another. By separating signal from noise, data scientists are truly engaged in an exercise in uncovering reality.
I believe that the scientific method plays a critical role in understanding any data, irrespective of their size or speed or variety. Despite the idea that Big Data will kill the need for theory and the scientific method, the human element is necessarily involved in the generation, collection and interpretation of data. As Kate Crawford points out in a thoughtful article, "The Hidden Biases in Big Data," data do not speak for themselves; humans give data their voice; people draw inferences from the data and give data their meaning. Unfortunately, people introduce bias, intentional and unintentional, that weakens the quality of the data.
Additionally, I have highlighted a few ways that the scientific method can help improve the veracity (validity) of data. To be of real, long-term value to business, Big Data needs to be about understanding the causal links among the variables. Hypothesis testing helps identify why variables are related to each other and which underlying processes drive the observed relationships. It also helps improve analytical models through trial and error to identify the causal variables, and helps you generalize your findings across different situations.
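To illustrate why a relationship in the data is not the same as a causal link, here is a small simulation under stated assumptions. A hidden variable `z` drives both `x` and `y`, so they correlate in observational data even though neither causes the other. When `x` is instead assigned at random, as in an experiment, the correlation disappears. All variables are simulated, and the `pearson` helper is my own.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000

# Hidden driver that the analyst does not observe (simulated).
z = [rng.gauss(0, 1) for _ in range(n)]

# Observational data: z influences both x and y; x has no effect on y.
x_obs = [zi + rng.gauss(0, 1) for zi in z]
y_obs = [zi + rng.gauss(0, 1) for zi in z]

# Experimental data: x is assigned at random, so z can no longer influence it.
x_exp = [rng.gauss(0, 1) for _ in range(n)]
y_exp = [zi + rng.gauss(0, 1) for zi in z]

r_obs = pearson(x_obs, y_obs)
r_exp = pearson(x_exp, y_exp)
print(f"observational correlation: {r_obs:.2f}")
print(f"experimental correlation:  {r_exp:.2f}")
```

A model built on the observational correlation alone would suggest that changing `x` moves `y`. Testing that hypothesis with random assignment shows it does not, which is the kind of insight that keeps a business from acting on a spurious relationship.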
Data scientists help businesses extract value from their data by finding insights. To solve business problems, data scientists need a variety of skills, including quantitative, computer engineering, business acumen and communication skills. The current review, however, uncovered an overlooked skill needed by data scientists: the scientific method.
Even though finding a single person who possesses all of these data scientist skills is akin to finding a unicorn, companies need to understand all the data scientist requirements as they look to build data science teams to address their data analytic needs. Data scientists will need grounding in research methodology, including the different kinds of research methods they can employ (e.g., observational, experimental, quasi-experimental) as well as the threats to different kinds of validity (e.g., statistical conclusion, internal, construct and external).
The goal of the scientific method is to solve problems. If businesses want to solve their problems, they need to put the science in data science. If they don't, all they have are data.