I keep seeing “data scientist” under people’s names on LinkedIn, in job listings, in ads for classes and tutorials, and so on. I always think, what are the other things we do with data? Have a data salad? Make data-fiber bulletproof vests? Seal a cut with a 50/50 mix of data and oatmeal?
So I looked into it a little bit.
The first thing I realized was that our educators probably did not clarify the meaning of terms like science, statistics, mathematics and data. Many people think of “science” as “the study of something.” That’s not what it means. Normally I would say, the meaning of words evolves; it depends on how people use them.
But scientists, mathematicians and statisticians cling tightly to definitions so that they don’t evolve. If the many definitions in math and science were to become fuzzy or split, these people would not be able to communicate about complex things.
It will be helpful, first, to clarify some definitions. It may be useful simply to know them before everyone goes back to talking about “data science” the way they talk about “food science.”
Science is the global or individual enterprise involving the scientific method. Its aim is to challenge the veracity of a particular model, belief, hypothesis or assumption. It fundamentally involves a model, a prediction derived from that model, and an experiment to test the prediction. Without data there is no science, although the model+prediction and the eventual experiment are sometimes widely separated in time.
Math and science aren’t the same thing. Mathematics is a systematic form of reasoning, born of logic and augmented with explicit notation. Its goal is an ever-increasing series of abstractions. Mathematics is a way of reasoning beyond what our intuition can handle. Paul Dirac predicted antimatter when he found negative-energy solutions to his relativistic equation for the electron. Freeman Dyson showed that apparently different techniques from Feynman and Schwinger were in fact mathematically equivalent. Gell-Mann used group theory to postulate the existence of quarks. Einstein used math to go beyond anything remotely intuitive. Tragically, the majority of neuroscientists scarcely use any math. We’ll find one day that the character of the brain is far less intuitive than any space-time or quantum theory.
Statistics is a branch of applied mathematics. As in all mathematics it is made of one or more theories, where a theory is a set of theorems derived from axioms. In the case of statistics, the axioms come from probability theory, but many of the probability theorems derive from measure theory. Measure theory, in turn, derives from set theory and real analysis. My point is only that theoretical statistics (statistical inference) is bona fide mathematics, but it is applied mathematics, where the application is dealing with stochastic (uncertain) measurements. Any analysis of measured values with transformations that don’t account for probability distributions will produce results that are always wrong to some degree, but more important, one never knows how wrong they are.
Data, in a scientific context, means observations. Equivalently, data are measurements. Importantly, all measurements on the Newtonian scale are stochastic, so they must be dealt with using statistical models. These are just mathematical models that include predictions for the errors.
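As a rough illustration of a model that includes a prediction for the error, here is a minimal Python sketch. The true value, the noise level and the Gaussian noise model are hypothetical choices made only for this example; they do not describe any real instrument.

```python
import math
import random
import statistics

def simulate_measurements(true_value, noise_sd, n, seed=0):
    """Return n noisy readings of a hypothetical quantity (Gaussian noise assumed)."""
    rng = random.Random(seed)
    return [true_value + rng.gauss(0.0, noise_sd) for _ in range(n)]

def estimate_with_error(readings):
    """Return (mean, standard error of the mean) for a list of readings."""
    mean = statistics.fmean(readings)
    sd = statistics.stdev(readings)  # sample standard deviation
    return mean, sd / math.sqrt(len(readings))

# Hypothetical experiment: 100 readings of a quantity whose true value is 10.0.
readings = simulate_measurements(true_value=10.0, noise_sd=0.5, n=100)
mean, sem = estimate_with_error(readings)
print(f"estimate = {mean:.3f} +/- {sem:.3f}")
```

The point is only that the result is reported as a value together with its uncertainty. With a noise standard deviation of 0.5 and 100 readings, the standard error should come out near 0.5 / sqrt(100) = 0.05.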
What is Data Science?
The definitions above are not meant to invalidate the concept of data science. That will come naturally, below.
According to Wikipedia,
Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, operations research, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, database, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high performance computing.
Methods that scale to big data are of particular interest in data science, although the discipline is not generally considered to be restricted to such big data…
This seems like everything. What is the criterion for making the list? On the one hand, we have techniques with no question or hypothesis, e.g. visualization, data mining, data warehousing. On the other hand we have inferential statistics, statistical learning and pattern recognition, all of which require structure in the data to accommodate some experimental design. These are two very different calculations under the same heading.
Maybe what characterizes data science is techniques that handle big data. But this isn’t true. “Statistics,” meaning inferential statistics, has no need to work at a large scale. Nor do visualization, pattern recognition, etc. These things are done all the time with small data sets.
Even if large-scale data were the aim, this comes with problems that aren’t related to scale. Huge data sets are appropriate for machine learning, particularly deep learning, but the data has to be organized and structured so as to feed the appropriate inputs into the first layer over and over. The larger the set, the harder this becomes.
None of this denies that some subcomponents of data science may be extremely useful: deep learning, for example, or estimating high-dimensional distributions. But if this is the goal, why not call it what it is? What is the benefit of smearing everything under a meaningless umbrella?
Maybe data science is concerned mainly with data mining, warehousing, unsupervised learning. Here we have analysis with no motivating question or hypothesis. Imagine instead that we have a hypothesis-driven experiment, and we predict that a certain pair of vectors will be correlated. If they are, we accept this result, because the prediction was falsifiable. Nothing in the experimental design worked against a negative outcome.
Now assume we’re “digging,” as in data mining. Assume we’re tearing through the data, correlating everything with everything, looking for all possible associations. Say we find several correlations. These must be disregarded, because they were never falsifiable. Why? Because, had we made a hypothesis in each case, it might have predicted correlation or no correlation. We’ll never know. So, if negative results are undefined, then so are positive results. We’re left with no conclusions.
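The "correlating everything with everything" scenario can be sketched in a few lines of Python. This is a simulation under stated assumptions, not an analysis of real data: every vector is independent Gaussian noise, so no pair is truly related, and the p-value uses the Fisher z-transform, which is a standard large-sample approximation rather than an exact test. The function names, the number of vectors (40) and the number of observations (30) are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z-transform (an approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def count_spurious_hits(n_vars=40, n_obs=30, alpha=0.05, seed=0):
    """Correlate every pair of pure-noise vectors; count 'significant' pairs."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    hits, n_pairs = 0, 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            n_pairs += 1
            r = pearson_r(data[i], data[j])
            if approx_p_value(r, n_obs) < alpha:
                hits += 1
    return hits, n_pairs

hits, n_pairs = count_spurious_hits()
print(f"{hits} of {n_pairs} pairs look 'significant', yet none is real")
```

With 40 vectors there are 780 pairs, so at a 0.05 threshold roughly 5% of them, on the order of a few dozen, should pass by chance alone. Every one of those "findings" is noise, and nothing in the procedure can tell them apart from a real association.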
This is the danger of diving into an analysis without understanding the math.
Data Science is an Advertising Campaign
What we know about data science so far is it includes every conceivable analysis; it appears to exclude nothing; it has no cogent explanation, certainly no reason for being, and it’s wildly popular. This is the sign of the business crooks about to take advantage of many young engineers who should have studied more math.
When someone’s description of something is a list of every single part, with no logic, no adjectives, no flow or reasoning, it probably isn’t really anything. Again from Wikipedia:
Here the author shows extraordinary skill, saying absolutely nothing in just less than the space it took me to explain the scientific method. Either that, or the author believes that human beings have only just now discovered that we can learn things by studying data.
The Berkeley School of Information says Data Scientist is the #1 best job in America. They provide no explanation whatsoever as to what data science is. Why should they? Everyone’s hooked.
Interesting that the list includes statistics, statistical learning, and probability models. To name these as different things suggests the author is trying to come up with every possible thing to put in the list. Moreover, the list includes “unstructured” methods, meaning you dig around looking for something interesting; e.g. data mining, data warehousing, in some cases machine learning. But the list also includes structured methods that require a hypothesis/model and either hypothesis tests or parameter estimation.
These are very different things, and it’s bothersome that everything has been lumped under one name. I’ve been doing statistical modeling, data collection, and various analyses, including compartmental analysis, classification, information theory, Monte Carlo methods, bootstraps, dynamical systems, signal processing, and so on, for 30 years, and I don’t think “data science” means anything at all.
Data science, being a meaningless term, may in turn be used by an engineer to mean practically anything. No doubt very large data sets are now being used for deep network training, estimating high-dimensional distributions, bootstrapping or subsampling thousands of times without redundancy, and other productive tasks. My comments about data science don’t apply to everything that might be called data science. They apply to every enterprise where the investigator thinks knowledge can be gained simply by applying tests to the data without any particular question or model in mind.
Thinking at a slightly deeper level, observations of any Newtonian system are stochastic; measurements don’t yield numbers but rather probability distributions. [In quantum systems, observables are Hermitian operators; it’s a little more complicated.] Any analysis of stochastic measurements yields a probability distribution at the end. If assumptions about this distribution are violated along the way, or if any incorrect transformation is applied, the final distribution will be wrong, and the final conclusion will be biased. Poking around in the data, looking for patterns or clusters or any desired behavior, makes it almost impossible to apply a statistical model. Whenever a result is reported, and the probability space is not described, or the assumptions are not specified, or the effect of transformations on the distributions is not discussed (e.g. any linear transformation of a Gaussian makes a Gaussian), there is no good reason to believe the result.
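The remark about transformations can be checked with a small simulation. The sketch below assumes a standard Gaussian input and compares one linear transformation (3x + 2) with one nonlinear transformation (x squared); both are arbitrary choices for illustration. Skewness serves as a crude symmetry check, since a Gaussian has skewness zero; it is not a full test of normality.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: third central moment over variance to the power 3/2."""
    m = statistics.fmean(xs)
    m2 = statistics.fmean([(x - m) ** 2 for x in xs])
    m3 = statistics.fmean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # assumed N(0, 1) measurements

linear = [3.0 * v + 2.0 for v in x]   # linear map: stays Gaussian, N(2, 9)
squared = [v * v for v in x]          # nonlinear map: no longer Gaussian

print(f"linear:  mean={statistics.fmean(linear):.2f}, "
      f"var={statistics.pvariance(linear):.2f}, skew={skewness(linear):.2f}")
print(f"squared: skew={skewness(squared):.2f}")
```

The linear case should show a mean near 2, a variance near 9 and a skewness near 0, as expected for a Gaussian. The squared values follow a chi-square distribution with one degree of freedom, whose theoretical skewness is sqrt(8), about 2.83, so any downstream step that assumes a Gaussian would be applied to the wrong distribution.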
At this moment one has an acute urge to wave off this nitpicking, and indeed it’s been waved off in speech research for decades.
Data science is not defined in a way that seems reasonable or even useful. It’s a sales technique, a poster, and people are flocking. The breadth and novelty of the term, along with the aggressive advertising, are pulling in job candidates who are unaware that the term has no meaning and that it includes every possible analysis. They will not be rewarded, because they won’t even know what they’re doing. Every so often engineers do this: they claim they can farm through large datasets and do wondrous things.
How? I’ve seen results like this published. They’re incoherent and largely useless.
But to learn something, anything, from data, nothing has come close, so far, to the scientific method put forth by Galileo. And if the discoveries of Einstein, Dirac, Schrödinger, Heisenberg, Poincaré, de Broglie, Feynman, Gell-Mann, Bell and others seem uninteresting or tedious, go spend 5 years studying “data science.” But think very carefully about it first.