2013-07-03

O’Reilly just surveyed 250 data scientists about their skills and came up with four types of data scientist--and one surprising finding on the side.

For a profession whose entire raison d'être is quantitative analysis, the role of the data scientist has been surprisingly hard to pin down. Now a new e-book from O'Reilly, Analyzing the Analyzers, has surveyed 250 data scientists on how they see themselves and their work. The authors then applied the tools of their trade--in this case, a non-negative matrix factorization algorithm used to cluster the responses--revealing four archetypes of the data scientist. The e-book also found that most data scientists, no matter which group they fell into, "rarely work with terabyte or larger data."
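
To make the method concrete, here is a minimal sketch of what NMF-based clustering of survey responses can look like in Python with scikit-learn. The skills matrix, dimensions, and parameters below are invented for illustration; this is not the e-book's actual analysis.

```python
# A rough sketch of clustering respondents into archetypes with non-negative
# matrix factorization. Data and parameters are made up for illustration.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Rows: 250 respondents; columns: self-rated skills (e.g., ML, stats, business).
skills = rng.random((250, 22))

nmf = NMF(n_components=4, init="nndsvd", max_iter=500)
W = nmf.fit_transform(skills)   # each respondent's loading on the 4 archetypes
H = nmf.components_             # how strongly each skill defines each archetype

archetype = W.argmax(axis=1)    # assign each respondent to a dominant archetype
print(np.bincount(archetype))   # respondents per archetype
```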

We think that terms like “data scientist,” “analytics,” and “big data” are the result of what one might call a “buzzword meat grinder.” We’ll use the survey results to identify a new, more precise vocabulary for talking about their work, based on how data scientists describe themselves and their skills.

So who are these new data scientists? A Data Businessperson is focused on how data insights can affect a company’s bottom line, or “translates P-values into profits.” This group seems very similar to the old-school Data Analyst, whose skills have sometimes been unjustly discounted in the pursuit of the more fashionable Data Scientist. In fact, only about a quarter of this group described themselves as Data Scientists. Nearly a quarter are female, a much higher proportion than in the other types, and they are the most likely to have an MBA, to have managed other employees, or to have started a business.

The Data Creatives have the broadest range of skills in the survey. They can code and have contributed to open source projects; three quarters have academic experience, and Creatives are even more likely than Data Businesspeople to have done contract work (80%) or to have started a business (40%). Creatives closely identify with the self-description “artist.” Psychologists, economists, and political scientists popped up surprisingly often among Data Researchers and Data Creatives.

Data Developers build data infrastructure or code up Machine Learning algorithms. This group is the most likely to code on a daily basis and have Hadoop, SVM or Scala on their CV. About half have Computer Science or Computer Engineering degrees.

Data Researchers seem closest to “scientists” in the sense that their work is more open ended and most are lapsed academics. Nearly 75% of Data Researchers had published in peer-reviewed journals, and over half had a PhD. Statistics is their top skill but they were least likely to have started a business, and only half had managed an employee.

Although we hate to disappoint the majority of the tech press, who seem to conflate the terms “Big Data” and “Data Science,” most of the Data Scientists surveyed don’t actually work with Big Data at all. The figure below shows how often respondents worked with data of kilobyte, megabyte, gigabyte, terabyte, and petabyte scale. Data Developers were most likely to work with petabytes of data, but even among developers this was rare.

Keep reading: Who's Afraid of Data Science? What Your Company Should Know About Data

Previous Updates

How Machine Learning Helps People Make Babies

June 25, 2013

How far will you go to have a baby? That's the question facing the one in six couples suffering from infertility in the United States, fewer than 3% of whom undergo IVF. A single round of IVF can cost up to $15,000, and the success rate for women over 40 is often less than 12% per round, making the process both financially and physically taxing. According to the CEO of Univfy, Mylene Yao, the couples with the highest likelihood of success are often not the ones who receive treatment. "A lot of women are not aware of what IVF can do for them and are getting to IVF rather late," says Yao. "On the other hand, a lot of women may be doing treatments which have lower chances of success."

Yao is an Ob/Gyn and researcher in reproductive medicine who teamed up with a colleague at Stanford, professor of statistics Wing H. Wong, to create a model that could predict the probability that a live birth will result from a single round of IVF. That model is now used in an online personalized predictive test that uses clinical data readily available to patients.

The main factor currently used to predict IVF success is the age of the woman undergoing treatment. "Every country has a national registry that lists the IVF success rate by age group," Yao explains. "What we have shown over and over in our research papers is that that method vastly underestimates the possibility of success. It's a population average. It's not specific to any individual woman and is useful only at the level of national policy." Many European countries, whose health services fund IVF for certain couples, use such age charts to determine the maximum age of the women who can receive treatment. Yao argues that, instead, European governments should fund couples with the highest likelihood of success.

“People are mixing up two ideas. Everyone knows that aging will compromise your ability to conceive, but the ovaries of each woman age at a different pace. Unless they take into consideration factors like BMI, partner's sperm count, blood tests, reproductive history, that age is not useful. In our prediction model, for patients who have never had IVF, age accounts for 60% of the prediction; 40% of the prediction comes from other sources of information. A 33-year-old woman can be erroneously led to believe that she has great chances, whereas her IVF prospects may be very limited and if she waits further, it could compromise her chance to have a family. A 40-year-old might not even see a doctor because she thinks there is no chance.” In a 2013 research paper Univfy showed that 86% of cases analyzed did not have the results predicted by age alone. Nearly 60% had a higher probability of live birth based on an analysis of the patients’ complete reproductive profiles.

Univfy's predictive model was built from data on 13,076 first IVF treatment cycles from three different clinics and used input parameters such as the woman's body mass index, smoking status, previous miscarriages or live births, clinical diagnoses including polycystic ovarian syndrome or disease, and her male partner's age and sperm count. “If a patient says 'I have one baby,' that's worth as much as what the blood tests show,” says Yao.

Prediction of the probability of a live birth from these parameters is a regression problem, where a continuous value is predicted from the values of the input parameters. A machine-learning algorithm called stochastic gradient boosting was used to build a boosted tree model predicting the probability of a live birth. A boosting algorithm builds a series of models, in this case up to 70,000 decision trees, each of which is essentially a series of if-then statements based on the values of the input parameters. Each new tree is built from a random sample of the full data set and is fit to the prediction errors of the trees before it, so that it improves on their predictions. The resulting model determines the relative importance of each input parameter. It turned out that while the age of the patient was still the most significant factor, at more than 60% weighting, other single parameters like sperm count (9.6%) and body mass index (9.5%) were also significant.
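
For readers who want to see what this looks like in practice, here is a minimal sketch of boosted-tree probability prediction with scikit-learn. It illustrates the general technique only: the feature set, data, and parameters are invented, and this is not Univfy's model or code.

```python
# A minimal sketch of stochastic gradient boosting for probability prediction.
# Features, data, and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical input parameters for first-cycle IVF patients.
X = np.column_stack([
    rng.uniform(25, 44, n),      # woman's age
    rng.uniform(17, 38, n),      # body mass index
    rng.uniform(1, 120, n),      # partner's sperm count (millions/mL)
    rng.integers(0, 2, n),       # previous live birth (0/1)
])
y = rng.integers(0, 2, n)        # live-birth outcome (toy labels)

# Stochastic gradient boosting: each tree is fit to the errors of the
# ensemble so far, on a random subsample of the training data.
model = GradientBoostingClassifier(
    n_estimators=500, subsample=0.5, max_depth=3, learning_rate=0.01
).fit(X, y)

print(model.predict_proba(X[:1])[0, 1])  # predicted probability of a live birth
print(model.feature_importances_)        # relative importance of each input
```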

Another Univfy model used by IVF clinics predicts the probability of multiple births. Some 30% of women receiving IVF in 2008 in the U.S. gave birth to twins, since clinics often use multiple embryos to increase the chances of success. "Multiple births are associated with complications for the newborn and the mother," says Yao. "So for health reasons, clinics and governments want to have as few multiple births as possible. It's a difficult decision whether to put in one or two embryos." Univfy's results showed that even when only two embryos were transferred, patients' risks of twins ranged from 12% to 55%. "The clinic can make a protocol that when the probability of multiple births is above a certain rate, then they will transfer only one embryo, and also identify patients who should really have two embryos. Currently there's a lot of guesswork."

When the G8 meet in Northern Ireland next week, transparency will be on the agenda. But how do these governments themselves rate?

Open data campaigners the Open Knowledge Foundation just published a preview of an Open Data Census which assessed how open the critical datasets in the G8 countries really are. Open data doesn’t just mean making datasets available to the public, but also distributing them in a format which is easy to process, available in bulk, and regularly updated. When the regional government of Piemonte, Italy, was hit by an expenses scandal last year, it published the previous year’s expense claim data as a set of spreadsheets embedded in a PDF--a typical example of less-than-accessible “open data.”

More than 30 volunteer contributors from around the world (the foundation says they include lawyers, researchers, policy experts, journalists, data wranglers, and civic hackers) assessed the openness of data sets in 10 core categories: Election Results, Company Register, National Map, Government Budgets, Government Spending, Legislation, National Statistical Office Data, Postcode/ZIP database, Public Transport Timetables, and Environmental Data on major sources of pollutants.

The U.S. topped the list for openness according to the overall score summed across all 10 categories of data, indicating that the executive order “making open and machine readable the new default for government information” announced by President Barack Obama in May this year has had some effect. The U.K. was next, followed by France, Japan, Canada, Germany, and Italy. The Russian Federation limped in last, failing to publish any of the information considered by the census as open data.

Postcode data, which is required for almost all location-based applications and services, is not easily accessible in any G8 country except Germany. “In the U.K., there's quite a big fight over postcodes since Royal Mail sells the postcodes and makes millions of pounds a year,” said Open Knowledge Foundation founder Rufus Pollock. Data on government spending was also patchy in France, Germany, and Italy. Many G8 countries scored low on company registry data, a notable point when the G8’s transparency discussions will address tax evasion and offshore companies. “Government processes aren't always built for the digital age,” said Pollock. “I heard an incredible story about 501(c)(3) registration information in the U.S. where they get all this machine-readable data in and the first thing they do is convert it to PDF, which then humans transcribe.”

The data was assessed in line with open data standards such as OpenDefinition.org and the 8 Principles of Open Government Data. Each category of data was given marks out of six depending on how many of the following criteria were met: openly licensed, publicly available, in digital format, machine readable (in a structured file format), up to date, and available in bulk. The assessment wasn’t entirely quantitative. “We strive for reasonably 'yes or no' questions but there are subtleties,” says Pollock. “With transport and timetables there's rail, bus, and other public transport. What happens if your bus and tram timetables are available but not train? Or is a certain format acceptable as machine readable?”
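
As a toy illustration of that scoring scheme (my own sketch, not the foundation's actual methodology or code), a dataset's mark out of six is simply the count of criteria it meets:

```python
# Illustrative scoring: one mark per criterion met, out of a possible six.
CRITERIA = [
    "openly licensed", "publicly available", "in digital format",
    "machine readable", "up to date", "available in bulk",
]

def openness_score(dataset):
    """Count how many of the six openness criteria a dataset satisfies."""
    return sum(1 for criterion in CRITERIA if dataset.get(criterion, False))

# Hypothetical assessment of a single dataset in one category.
postcode_db = {"publicly available": True, "in digital format": True}
print(openness_score(postcode_db))  # 2 out of 6
```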

The preview does not show how many data sets were assessed in each category, but more information will be included in the full results, covering 50 countries, which will be released later this year. For further information on the methodology of the census, see the Open Knowledge Foundation’s blog post.

Olly Downs runs the Data Science team at Globys, a company which takes large-scale dynamic data from mobile operators and uses it to contextualize their marketing in order to improve customer experience, increase revenue, and maximize retention. Downs is no data novice: He was the Principal Scientist at traffic prediction startup INRIX, which is planning an IPO this year, and Globys is his seventh early-stage company. Co.Labs talked to him about how to maximize the ROI of a data science team.

How does Globys use its Data Science team?

The Telco space has always been Big Data. Any single one of our mobile operator customers produces data at a rate greater than Facebook. Globys is unique in terms of the scale with which we have embraced the data science role and its impact on the structure of the company and the core proposition of the business. Often data scientists in an organization are pseudo-consultants answering one-off questions. Our data science team is devoted to developing the science behind the technology of our core product.

You trained in hard sciences (Physics at Cambridge and Applied Mathematics at Princeton). Is Data Science really a science?

How we work at Globys is that we develop technologies via a series of experiments. Those experiments are designed to be extremely robust, as they would be in the hard sciences world, but based on data which is only available commercially, and they are designed to answer a business question rather than a fundamental research question. The methodology we use to determine if a technology is useful has the same core elements of a scientific process.

What has come along with the data science field is this cloudburst of new technologies. The science has become mixed in with mastering the technology which allows you to do the science. It's not that surprising. The web was invented by Tim Berners-Lee at CERN to exchange scientific data in particle physics. Out in the applied world, the work tends to be a mixture of answering questions and finding the right questions to ask. It's very easy for a data science team to slip into being a pseudo-engineering team on an engineering schedule. It's very important to have a proportion of your time allocated to exploratory work so you have the ability to go down some rabbit holes.

How can a company integrate a data scientist into their business?

With Big Data, the awareness in the enterprise is high and the motivation to do Big Data initiatives is high, but the cultural ability to absorb the business value, and the strategic shift that might bring, is hard to achieve. My experience is if the data scientist is not viewed as a peer-level stakeholder who can have an impact on the leadership and the culture of the business, then most likely they won't be successful.

I remember working on a project on a “Save” program where anyone who called to cancel their service got two months free. The director who initiated that program had gotten a significant promotion based on its success. It wasn't a data-driven initiative. The anecdotes were much more positive than the data. What I found, after some data analysis, was that the program was retaining customers for an average of 1.2 months after the customer had been given the two months free. Every save call the center was taking, once you included the cost of the agent talking to the customer, was actually losing them $13 per call. We came up with a model-based solution which allowed the business to test who they should allow to cancel and who they should talk to. That changed the ROI on the program to plus-$22 per call taken. That stakeholder then made it from general manager to VP, and ultimately was very happy, but it took a while to make the shift to seeing that data was ultimately going to improve the outcome.

Can't you fool yourself with data as well as with anecdotes?

As a data scientist, it's hard to come to a piece of analysis without a hypothesis, a Bayesian prior (probability), a model in your mind of how you think things probably are. That makes it difficult to interrogate the data in the purest way, and even if you do, you are manipulating a subset of attributes or individuals who represent the problem that you have. Being aware of the limitations of the analysis is important. A real problem with communicating the work you have done is that while scientists are very good at explaining the caveats, the people listening are not interested in caveats but in conclusions. I remember doing an all-day experiment when I was at Cambridge to measure the gravitational constant to four decimal places of precision. I measured to that precision, but the value I got for the constant was off by more than a factor of a thousand. You can fool yourself into thinking you have measured something with a very high level of accuracy and yet the thing you were measuring turned out to be the wrong thing.

How do you measure the ROI (Return on Investment) of a data science team?

The measure of success is getting initiatives to completion, addressing a finding about the business in a measurable way. At Globys our business is around getting paid for the extra retention or revenue that we achieve for our customers. Recently, we have been leveraging the idea of every customer as a sequence of events--every purchase, every call, every SMS message, every data session, every top-up purchase--which allows us to take Machine Learning approaches (Dynamic State-Space Modeling) which otherwise do not apply to this problem domain of retaining customers. This approach outperforms the current state of the art in churn prediction modeling by about 200%. When you run an experiment to retain customers, the proportion of customers you are messaging is biased in favor of those with a problem. We already have an optimized price elasticity and offer targeting capability, so you improve your campaign by a similar factor. The normal improvement you would achieve in churn retention is in the 5% range. We are achieving improvements in the 15-20% range.

What's the biggest gap you see in data science teams?

Since our customers have been Big Data businesses for a long time, they will typically have analysts, and many of those teams are unsuccessful because the communications skill set is missing. They may have the development capability, the statistical and modeling capability, but have been very weak at communicating with the other elements of the business. What we are seeing now is some roles being hired which bridge between the data science capability and the business functions like marketing and finance. It's a product manager for the analytics team.

This story tracks the cult of Big Data: The hype and the reality. It’s everything you ever wanted to know about data science but were afraid to ask. Read on to learn why we’re covering this story, or skip ahead to read previous updates.

Take lots of data and analyze it: That’s what data scientists do and it’s yielding all sorts of conclusions that weren’t previously attainable. We can discover how our cities are run, disasters are tackled, workers are hired, crimes are committed, or even how Cupid's arrows find their mark. Conclusions derived from data are affecting our lives and are likely to shape much of the future.

Previous Updates

Meteorologist Steven Bennett used to predict the weather for hedge funds. Now his startup EarthRisk forecasts extreme cold and heat events up to four weeks ahead, much further into the future than traditional forecasts, for energy companies and utilities. The company has compiled 30 years of temperature data for nine regions and discovered patterns which predict extreme heat or cold. If the temperature falls in the hottest or coldest 25% of the historical temperature distribution, EarthRisk defines this as an extreme event and the company's energy customers can trade power or set their pricing based on its predictions. The company's next challenge is researching how to extend hurricane forecasts from the current 48 hours to up to 10 days.

How is your approach to weather forecasting different from traditional techniques?

Meteorology has traditionally been pursued along two lines. One line has a modeling focus and has been pursued by large government or quasi-government agencies. It puts the Earth into a computer-based simulation and that simulation predicts the weather. That pursuit has been ongoing since the 1950s. It requires supercomputers, it requires a lot of resources (the National Oceanic and Atmospheric Administration in the U.S. has spent billions of dollars on its simulation) and a tremendous amount of data to input to the model. The second line of forecasting is the observational approach. Farmers were predicting the weather long before there were professional meteorologists and the way they did it was through observation. They would observe that if the wind blew from a particular direction, it's fair weather for several days. We take the observational approach, the database which was in the farmer's head, but we quantify all the observations strictly in a statistical computer model rather than a dynamic model of the type the government uses. We quantify, we catalog, and we build statistical models around these observations. We have created a catalog of thousands of weather patterns which have been observed since the 1940s and how those patterns tend to link to extreme weather events one to four weeks after the pattern is observed.

Which approach is more accurate?

The model-based approach will result in a more accurate forecast, but because of chaos in the system it breaks down 1-2 weeks into the forecast. For a computer simulation to be perfect we would need to observe every air parcel on the Earth to use as input to the model. In fact, there are huge swathes of the planet, e.g., over the Pacific Ocean, where we don't have any weather observations at all except from satellites. So in the long range our forecasts are more accurate, but not in the short range.

What data analysis techniques do you use?

We are using Machine Learning to link weather patterns together, to say that when these kinds of weather patterns occur historically, they lead to these sorts of events. Our operational system uses a Genetic Algorithm for combining the patterns in a simple way and determining which patterns are the most important. We use Naive Bayes to make the forecast. We forecast, for example, that there is a 60% chance that there will be an extreme cold event in the northwestern United States three weeks from today. If the temperature is a quarter of a degree less than that cold event threshold, then it's not a hit. We are in the process of researching a neural network, which we believe will give us a richer set of outputs. With the neural network we believe that instead of giving the percentage chance of crossing some threshold, we will be able to develop a full distribution of temperature output, e.g., that it will be 1 degree colder than normal.
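
Here is a rough sketch of what a pattern-to-event Naive Bayes forecast of this kind could look like. The weather patterns, labels, and output are invented; this illustrates the general idea rather than EarthRisk's system.

```python
# Illustrative Naive Bayes forecast: historical weather patterns -> probability
# of an extreme cold event a few weeks later. All data here is made up.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)

# Each row: whether each of 5 hypothetical weather patterns was observed that day.
patterns_observed = rng.integers(0, 2, size=(500, 5))
# Label: did an extreme cold event follow 3 weeks later? (toy history)
extreme_cold_followed = rng.integers(0, 2, size=500)

model = BernoulliNB().fit(patterns_observed, extreme_cold_followed)

today = np.array([[1, 0, 1, 1, 0]])          # patterns observed today
p = model.predict_proba(today)[0, 1]
print(f"{p:.0%} chance of an extreme cold event in 3 weeks")
```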

How do you run these simulations?

We update the data every day. We have a MATLAB-based modeling infrastructure. When we do our heavy processing, we use hundreds of cores in the Amazon cloud. We do those big runs a couple of dozen times a year.

How do you measure forecast accuracy?

Since we forecast extreme events, we use a few different metrics. If we forecast an extreme event and it occurs, then that's a hit. If we forecast an extreme event and it does not occur, that's a false alarm. Those alone can be misleading. If I have made one forecast and that forecast was correct, then I have a hit rate of 100% and a false alarm rate of 0%. But if there were 100 events and I only forecast one of them and missed the other 99, that's not useful. The detection rate is the proportion of events that actually occur which we had forecast. We try to get a high hit rate and detection rate, but in a long-range forecast a high detection rate is very, very difficult. Our detection rate tends to be around 30% in a 3-week forecast. Our hit rate stays roughly the same at one week, two weeks, and three weeks. In traditional weather forecasting the accuracy gets dramatically lower the further out you get.
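
The distinction between hit rate and detection rate is easy to see in code. The sketch below uses the definitions described above; it is an illustration, not EarthRisk's evaluation code.

```python
# Toy forecast metrics: hit rate, false alarm rate, and detection rate.
def forecast_metrics(forecasted, occurred):
    """forecasted, occurred: lists of booleans, one entry per potential event."""
    hits = sum(f and o for f, o in zip(forecasted, occurred))
    false_alarms = sum(f and not o for f, o in zip(forecasted, occurred))
    n_forecasts = max(sum(forecasted), 1)
    n_events = max(sum(occurred), 1)
    hit_rate = hits / n_forecasts           # of our forecasts, how many verified
    false_alarm_rate = false_alarms / n_forecasts
    detection_rate = hits / n_events        # of real events, how many we called
    return hit_rate, false_alarm_rate, detection_rate

# One correct forecast out of one issued looks perfect (100% hit rate) ...
print(forecast_metrics([True], [True]))
# ... but if 99 other events went unforecast, the detection rate exposes it.
print(forecast_metrics([True] + [False] * 99, [True] * 100))
```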

Why do you want to forecast hurricanes further ahead?

The primary application for longer lead forecasts for hurricane landfall would be in the business community rather than for public safety. For public safety you need to make sure that you give people enough time to evacuate but also have the most accurate forecast. That lead time is typically two to three days right now. If people evacuate and the storm does not do damage in that area, or never hits that area, people won't listen the next time the forecast is issued. Businesses understand probability so you can present a risk assessment to a corporation which has a large footprint in a particular geography. They may have to change their operations significantly in advance of a hurricane so if it's even 30% or 40% probability then they need extra lead time.

What data can you look at to provide an advance forecast?

We are investigating whether building a catalog of synoptic (large scale) weather patterns like the North Atlantic oscillation will work for predicting hurricanes, especially hurricane tracks--so where a hurricane will move. We have quantified hundreds of weather patterns which are of the same amplitude, hundreds of miles across. For heat and cold risks we develop an index of extreme temperature. For hurricanes the primary input is an index of historic hurricane activity rather than temperature. Then you would use Machine Learning to link the weather patterns to the hurricane activity. All of this is a hypothesis right now. It's not tested yet.

What’s the next problem you want to tackle?

We worked with a consortium of energy companies to develop this product. It was specifically developed for their use. Right now the problems we are trying to solve are weather related, but that's not where we see ourselves in two or five years. The weather data we have is only an input to a much bigger business problem, and that problem will vary from industry to industry. What we are really interested in is helping our customers solve their business problems. In New York City there's a juice bar called Jamba Juice. Jamba Juice knows that if the temperature gets higher than 95 degrees on a summer afternoon, they need extra staff since more people will buy smoothies. They have quantified the staff increase required (but they schedule their staff one week in advance and they only get a forecast one day in advance). They use a software package with weather as an input. We believe that many businesses are right on the cusp of implementing that kind of intelligence. That's where we expect our business to grow.

A roomful of confused-looking journalists is trying to visualize a Twitter network. Their teacher is School of Data “data wrangler” Michael Bauer, whose organization teaches journalists and non-profits basic data skills. At the recent International Journalism Festival, Bauer showed journalists how to analyze Twitter networks using OpenRefine, Gephi, and the Twitter API.

Bauer's route into teaching hacks how to hack data was a circuitous one. He studied medicine and did postdoctoral research on the cardiovascular system, where he discovered his flair for data. Disillusioned with health care, Bauer dropped out to become an activist and hacker and eventually found his way to the School of Data. I asked him about the potential and pitfalls of data analysis for everyone.

Why do you teach data analysis skills to “amateurs”?

We often talk about how the digitization of society allows us to increase participation, but actually it creates new kinds of elites who are able to participate. It opens up the existing elites so you don't have to be an expensive lobbyist or be born in the right family to be involved, but you have to be part of this digital elite which has access to these tools and knows how to use them effectively. It's the same thing with data. If you want to use data effectively to communicate stories or issues, you need to understand the tools. How can we help amateurs to use these tools? Because these are powerful tools.

If you teach basic data skills, is there a danger that people will use them naively?

There is a sort of professional elitism which raises the fear that people might misuse the information. You see this very often if you talk to national bureaus of statistics, for example, who say “We don't give out our data since it might be misused.” When the Open Data movement started in the U.K. there was a clause in the agreement to use government data which said that you were not allowed to do anything with it which might criticize the government. When we train people to work with data, we also have to train them how to draw the right conclusions, how to integrate the results. To turn data into information you have to put it into context. So we break it down to the simplest level. What does it mean when you talk about the mean? What does it mean if you talk about average income? Or does it even make sense to talk about the average in this context?

Are there common pitfalls you teach people to avoid?

We frequently talk about correlation-causation. We have this problem in scientific disciplines as well. In Freakonomics, economist Steven D. Levitt talks about how crime rates go down when more police are employed, but what people didn't look at was that this all happened in times of economic growth. We see this in medical science too. There was this idea that because women have estrogen they are protected from heart attacks so you should give estrogen to women after menopause. This was all based on retrospective correlation studies. In the 1990s someone finally did a placebo controlled randomized trial and they discovered that hormone replacement therapy doesn't help at all. In fact it harms the people receiving it by increasing the risk of heart attacks.

How do you avoid this pitfall?

If you don't know and understand the assumptions that your experiment is making, you may end up with something completely wrong. If you leave certain factors out of your model and look at one specific thing, that's the only specific thing you can say something about. There was this wonderful example that came out about how wives of rich men have more orgasms. A university in China got hold of the data for their statistics class and found that the original analysis hadn't used the education of the women as a parameter. It turns out that women who are more educated have more orgasms. It had nothing to do with the men.

What are the limitations of using a single form of data?

That's one of the dangers of looking at Twitter data. This is the danger of saying that Twitter is democratizing because everyone has a voice, but not everyone has a voice. Only a small percentage of the population use this service and a way smaller proportion are talking a lot. A lot of them are just reading or retweeting. So we only see a tiny snapshot of what is going on. You don't get a representative population. You get a skew in your population. There was an interesting study on Twitter and politics in Austria which showed that a lot of people on there are professionals and they are there to engage. So it's not a political forum. It's a medium for politicians and people who are around politics to talk about what they are doing.

Any final advice?

Integrate multiple data sources, check your facts, and understand your assumptions.

Charts can help us understand the aggregate, but they can also be deeply misleading. Here's how to stop lying with charts without even realizing you were. While it’s counterintuitive, charts can actually obscure our understanding of data--a trick Steve Jobs has exploited on stage at least once. Of course, you don’t have to be a cunning CEO to misuse charts; in fact, if you have ever used one at all, you probably did so incorrectly, according to visualization architect and interactive news developer Gregor Aisch. Aisch gave a series of workshops at the International Journalism Festival in Italy, which I attended last weekend, including one on basic data visualization guidelines.

“I would distinguish between misuse by accident and on purpose,” Aisch says. “Misuse on purpose is rare. In the famous 2008 Apple keynote from Steve Jobs, he showed the market share of different smartphone vendors in a 3D pie chart. The Apple slice of the smartphone market, which was one of the smallest, was in front so it appeared bigger.”

Aisch explained in his presentation that 3-D pie charts should be avoided at all costs since the perspective distorts the data. What is displayed in front is perceived as more important than what is shown in the background. That 19.5% of market share belonging to Apple takes up 31.5% of the entire area of the pie chart and the angles are also distorted. The data looks completely different when presented in a different order as shown below.

In fact, the humble pie chart turns out to be an unexpected minefield:

“Use pie charts with care, and only to show part of whole relationships. Two is the ideal number of slices, but never show more than five. Don’t use pie charts if you want to compare values. Use bar charts instead.”

For example, Aisch advises that you don’t use pie charts to compare sales from different years, but do use them to show sales per product line in the current year. You should also ensure that you don't leave out data on part of the whole:

“Use line charts to show time series data. That’s simply the best way to show how a variable changes over time. Avoid stacked area charts; they are easily misinterpreted.”

The “I hate stacked area charts” post cited in Aisch’s talk explains why:

“Orange started out dominating the market, but Blue expanded rapidly and took over. To the unwary, it looks like Green lost a bit of market share. Not nearly as much as Orange, of course, but the green swath certainly gets thinner as we move to the right end of the chart.”

In fact the underlying data shows that Green’s market share has been increasing, not decreasing. The chart plots the market share vertically, but human beings perceive the thickness of a stream at right angles to its general direction.

Technology companies aren't the only offenders in chart misuse. “Half of the examples in the presentation are from news organizations. Fox News is famous for this,” Aisch explains. “The emergence of interactive online maps has made map misuse very popular, e.g., election results in the United States, where big states with small populations are marked red. If you build a map with Google Maps there isn't really a way to get around this problem. But other tools aren't there yet in terms of user interface, and you need special skills to use them.”

Google Maps also uses the Mercator projection, a method of projecting the sphere of the Earth onto a flat surface, which distorts the size of areas closer to the polar regions so, for example, Greenland looks as large as Africa.

The solution to these problems, according to Aisch, is to build visualization best practices directly into the tool, as he does in his own open source visualization tool Datawrapper. “In Datawrapper we set meaningful defaults but also allow you to switch between different rule systems. There's an example for labeling a line chart. There is some advice that Edward Tufte gave in one of his books and different advice from Dona Wong, so you can switch between them. We also look at the data, so if you visualize a data set which has many rows, then the line chart will display in a different way than if there were just 3 rows.”

The rush to "simplify" big data is the source of a lot of reductive thinking about its utility. Data science practitioners have recently been lamenting how the data gold rush is leading to naive practitioners deriving misleading or even downright dangerous conclusions from data.

The Register recently mentioned two trends that may reduce the role of the professional data scientist before the hype has even reached its peak. The first is the embedding of Big Data tech in applications. The other is increased training for existing employees who can benefit from data tools.

"Organizations already have people who know their own data better than mystical data scientists. Learning Hadoop is easier than learning the company’s business."

This trend has already taken hold in data visualization, where tools like infogr.am are making it easy for anyone to make a decent-looking infographic from a small data set. But this is exactly the type of thing that has some data scientists worried. Cathy O'Neil (aka Mathbabe) has the following to say in a recent post:

"It’s tempting to bypass professional data scientists altogether and try to replace them with software. I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well."

K-nearest neighbors is a method for classifying objects--let's say visitors to your website--by measuring how similar they are to other visitors based on their attributes. A new visitor is assigned a class, e.g., "high spenders," based on the classes of its k nearest neighbors, the previous visitors most similar to them. But while the algorithm is simple, selecting the correct settings and knowing that you need to scale feature values (or verifying that you don't have many redundant features) may be less obvious.
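
O'Neil's point is easy to demonstrate. In the hedged sketch below (invented data and feature names), the same k-NN classifier is run with and without feature scaling; without it, the distance calculation is dominated by whichever feature happens to have the largest numeric range.

```python
# Illustrative k-NN pitfall: unscaled features with very different ranges.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Columns: [annual page views, purchases last month] -- hypothetical features.
X = np.column_stack([rng.uniform(0, 10_000, 200), rng.uniform(0, 10, 200)])
y = (X[:, 1] > 5).astype(int)   # toy rule: "high spender" depends on purchases

new_visitor = [[5_000, 9]]

naive = KNeighborsClassifier(n_neighbors=5).fit(X, y)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=5)).fit(X, y)

print(naive.predict(new_visitor))   # distances dominated by page views
print(scaled.predict(new_visitor))  # both features weighted comparably
```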

You would not necessarily think about this problem if you were just pressing a big button on a dashboard called “k-NN me!”

Here are four problems that typically arise from a lack of scientific rigor in data projects. Anthony Chong, head of optimization at Adaptly, warns us to look out for "science" with no scientific integrity.

"Through phony measurement and poor understandings of statistics, we risk creating an industry defined by dubious conclusions and myriad false alarms.... What distinguishes science from conjecture is the scientific method that accompanies it."

Given the extent to which conclusions derived from data will shape our future lives, this is an important issue. Chong gives us four problems that typically arise from a lack of scientific rigor in data projects, but are rarely acknowledged.

Results not transferable

Experiments not repeatable

Not inferring causation: Chong insists that the only way to infer causation is randomized testing. It can't be done from observational data or by using machine learning tools, which predict correlations with no causal structure.

Poor and statistically insignificant recommendations.

Even when properly rigorous, analysis often leads to nothing at all. From Jim Manzi's 2012 book, Uncontrolled: The Surprising Payoff of Trial-and-Error for Business:

"Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10 percent of these leading to business changes.”

Understanding data isn't about your academic abilities—it's about experience. Beau Cronin has some words of encouragement for engineers who specialize in storage and machine learning. Despite all the backend-as-service companies sprouting up, it seems there will always be a place for someone who truly understands the underlying architecture. Via his post at O'Reilly Radar:

I find the database analogy useful here: Developers with only a foggy notion of database implementation routinely benefit from the expertise of the programmers who do understand these systems—i.e., the “professionals.” How? Well, decades of experience—and lots of trial and error—have yielded good abstractions in this area.... For ML (machine learning) to have a similarly broad impact, I think the tools need to follow a similar path.

Want to climb the mountain? Start learning about data science here. If you know next to nothing about Big Data tools, HP's Dr. Satwant Kaur's 10 Big Data technologies is a good place to start. It contains short descriptions of Big Data infrastructure basics from databases to machine learning tools.

This slide show explains one of the most common technologies in the Big Data world, MapReduce, using fruit, while Emcien CEO Radhika Subramanian tells you why not every problem is suitable for its most popular implementation, Hadoop.

"Rather than break the data into pieces and store-n-query, organizations need the ability to detect patterns and gain insights from their data. Hadoop destroys the naturally occurring patterns and connections because its functionality is based on breaking up data. The problem is that most organizations don’t know that their data can be represented as a graph nor the possibilities that come with leveraging connections within the data."

Efraim Moscovich's Big Data for conventional programmers goes into much more detail on many of the top 10, including code snippets and pros and cons. He also gives a nice summary of the Big Data problem from a developer's point of view.

We have lots of resources (thousands of cheap PCs), but they are very hard to utilize.
We have clusters with more than 10k cores, but it is hard to program 10k concurrent threads.
We have thousands of storage devices, but some may break daily.
We have petabytes of storage, but deployment and management is a big headache.
We have petabytes of data, but analyzing it is difficult.
We have a lot of programming skills, but using them for Big Data processing is not simple.

Infochimps has also created a nice overview of data tools (which features in TechCrunch's top five open-source projects article) and what they are used for.

Finally, GigaOm's programmer's guide to Big Data tools covers an entirely different set of tools, weighted towards application analytics and abstraction APIs for data infrastructure like Hadoop.

We're updating this story as news rolls in. Check back soon for more updates.

Big Data: It's a big story with big developments. Starting with this article, we'll branch into more features and add expert interviews, expanding our coverage and understanding of this complex subject. Add your influence to the direction of this story by tweeting the author.

[Image: Flickr user Cory Doctorow]
