Semphonic.blogs.com

Data Science vs. Big Data

2013-10-06

May the Best Hype Win

Two of the hottest topics at this year’s X Change Conference
were, unsurprisingly, “Data Science” and “Big Data”. If you walk through
airports, read Time Magazine or even watch Fox News (check out the old Semphonic World Headquarters) you’ll get plenty of hype
around both. But the folks at X Change are on the front lines of this stuff –
and that sort of hype doesn’t cut much ice.

I’ve argued that although there is plenty of
misleading hype around the term big data, it nevertheless captures something
different, real and important. And I’ve even worked to explain exactly what
that is (I’ll be presenting a newer version of that argument at IBM’s IOD Conference
in early November).

But what about data science? Just how real is data science
and what exactly does it mean?

I didn't know this, but one of the things I learned in the discussion at X Change is that the term was
originally coined by a statistician who argued that statisticians were (and should
be re-named) data scientists since they spent most of their time manipulating
and experimenting with data. Given the
relative demand (and pay scales) for those selling themselves as “data
scientists” vs. “statisticians” that’s fairly amusing. It turns out that
statisticians just needed a good marketing campaign to double or triple their salaries!

Origins aside, it’s not that easy to unpack what people mean when they talk
about data science. But the emergent (and best) definition I heard at X Change
is that a data scientist is someone who can work at every stage of an analysis
and tackles problems that involve data manipulation, advanced statistical
analysis (particularly analysis that requires custom computational or algorithmic techniques),
and interpretive and expository skills. In the Huddles I was in, we ended up
calling this a “Full Stack” analyst.

On this definition, I probably come reasonably close to being a data
scientist. As someone with a software development background and real chops in
C++ and C# (not to mention toy stuff like SQL), there’s pretty much no data
manipulation I can’t do. I’m not the world’s deepest statistician though, and this
would probably be my downfall in the ranks of pure data scientists. Still, I have a pretty strong history in computational and algorithmic analytics. I was
coding and using Self-Organizing Maps (SOMs) back in the ’90s, I’ve created my share
of true algorithmic analytic methods, I’ve done my time in SAS, and I’ve
written (and used) software that incorporated a wide variety of advanced
statistical techniques and visualization tools (in my days doing real-time technical
trading analytics I programmed and used everything from Black-Scholes models to
Simulated Annealing). I think I can handle the interpretive and expository
stuff pretty well too.

Big whoopee, right? Being full stack is great, but how important is it really?

Back when I was programming Black-Scholes models, I had some
pretty smart folks explaining the models (and corrections to those models) that
they wanted me to program. They didn’t need to know C++. I didn’t need to be a
stats genius. It still worked pretty well. If you're doing data science via a team, you're still doing data science.

I’ve no doubt that having the full stack package in a single
person reduces the cycle time on projects that involve computational analytics and data manipulation. But insisting on the full stack in
a single person can dramatically extend the time it takes to actually fill a
position and can have a similar impact on cost. I’ve read plenty of data
science job postings that could have changed the job title to “Superman” without materially impacting the odds of finding a plausible candidate.

What’s more, it isn’t clear to me that the value of a data
scientist is equally distributed along all these dimensions. Frankly, I don’t
think people pay my considerable rack rate because I’m full stack.
Few of my current clients benefit from
my skill as a C++ programmer (though I'm not saying that knowledge doesn't sometimes come in
handy on higher-level tasks).

This also makes me wonder about the true value of most people
who can plausibly claim to be full stack. Going back to that original definition, who's the
group most likely to be full stack? Statisticians. Most professional
statisticians may not have my programming chops, but there are many who are
quite skilled in data manipulation and algorithmic analysis. Being the real
McCoy, they are going to cream me in depth of statistical knowledge. But how
useful are most of these people when it comes to actually performing
interesting and useful analytics?

I’ll let your enterprise’s experience with statisticians
answer that question.

I can't resist adding that the one type of academic
background I would never hire in our practice was…statistician. Programmers,
Economists, Mathematicians, Psychologists, Bio-Med – I’ve found folks in all these
disciplines who combined an ability to do analytics with a penchant for solving
real-world business problems.

Why no statisticians? I think it's because a good data scientist will think
of statistics not as their discipline, but as a tool for their discipline.

So I’m deeply suspicious of data science. In a “hype-off”
between data science and big data, I think data science wins by a lot. There’s
a lot less there, there.

Having gone that far, I feel compelled to add a little nuance.

In my particular field – digital analytics – analysts have
traditionally been far, far short of full stack. Because of the SaaS model and
the lack of sophistication in digital analytics tools, it’s fair to say that
most digital analysts have neither data manipulation skills, nor statistical
analysis skills, nor computational analytics skills. The stack, far from being full,
can look a bit threadbare.

That’s a legitimate problem in a world where digital analytics
data is now widely available outside Web analytics tools. I don’t think it’s
necessary for a great analyst to be full stack. I do think a great analyst ought to have at least one of those additional skills.

What’s more, I think that in digital analytics (and big data
in general), computational analysis will be somewhat more important than it is in
many other disciplines. My reasons for that are tightly bound to my arguments for
why big data is different from traditional data and why statistical analysis
methods often fail when it comes to digital analytic problems.

Plus, when it comes to computational analytic
methods, it can be hard to build a team that works. It's much harder for a programmer to build complex models in code than to do ETL for a statistician. You need the right
combination of communication skills in both directions, and that might prove to be nearly
as elusive as getting the skills in one person.

Back when I was programming
trading systems or doing credit card analytics, if I wanted to use Neural Nets
or SOMs, I had to program them. And to program them, I had to understand at some level how
they worked. These days, those tools are available out of the box. But for much
of what I think is going to work in digital analytics big data, there won’t be
out-of-the-box tools. Even something as simple as the Topographic analysis I’ve
written about requires custom coding.

So it's possible that the whole data scientist thing might really do some good. With the economic rewards being heaped on those who can do
computational analytics, there’s bound to be significant growth in people who are
skilled at it.

I just hope they aren’t all statisticians.