In our What Can a Data Scientist Do for You? article, you read about the kinds of big data projects that can help you learn more about your data. Now we’ll look at the tools data scientists use to mine that data: applying statistical techniques like clustering and linear modeling, then turning the results into a story through visualization and reporting.

You don’t need to know how to use these yourself, but having a sense of the differences between these tools will help you gauge what tools might be best for your business and what skills to look for in a data scientist.

Analytical Tools

R: The Most Popular Language for Data Science

Once the data scientist has completed the often time-consuming process of “cleaning” and preparing the data for analysis, R is a popular software package for actually doing the math and visualizing the results. An open-source statistical modeling language, R has traditionally been popular in the academic community, which means that lots of data scientists will be familiar with it.

R has thousands of extension packages that let statisticians tackle specialized tasks, including text analysis, speech analysis, and genomic science. The center of a thriving open-source ecosystem, R has become increasingly popular as programmers have created add-on packages for the big datasets and parallel processing techniques that have come to dominate statistical modeling today. A few examples:

parallel: This package helps R take advantage of parallel processing, both on multicore Windows machines and on clusters of POSIX (OS X, Linux, UNIX) machines.

snow: This package divvies up R calculations across a cluster of computers, which is useful for computationally intensive work like simulations or machine learning.

RHadoop and RHIPE: These packages let programmers interface with Hadoop from R. That’s particularly important for MapReduce: dividing a computing problem among the nodes of a cluster and then recombining, or “reducing,” the partial results into a single answer.
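To make the MapReduce idea concrete, here is a minimal word-count sketch in plain Python rather than R or Hadoop code (the documents are invented): the map step emits key/value pairs, a shuffle step groups them by key, and the reduce step collapses each group into one result.

    from collections import defaultdict

    documents = ["big data tools", "data science tools", "big data"]

    # Map: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the pairs by key (the word).
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: collapse each group into a single count per word.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}

    print(word_counts)  # {'big': 2, 'data': 3, 'tools': 2, 'science': 1}

Hadoop runs the same three steps, but with the map and reduce work spread across many machines.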

R is used in industries like finance, health care, marketing, pharmaceutical development, and more. Industry leaders like Bank of America, Bing, Facebook, and Foursquare use R to analyze their data, make marketing campaigns more effective, and power their reporting.

Java & the Java Virtual Machine

Organizations that want to write custom analytics tools from scratch increasingly turn to the venerable Java language, as well as other languages that run on the Java Virtual Machine (JVM). Java’s object-oriented syntax descends from C++, and because Java runs on a platform-agnostic virtual machine, programs can be compiled once and run anywhere.

The upside of the JVM over a language compiled to run directly on the processor is reduced development time. That simpler development process has been a draw for data analytics, making JVM-based data mining tools very popular. Hadoop, the popular open-source software for distributed big data storage and analysis, is itself written in Java.

Java has rich open-source libraries for data mining, including Mahout and Weka, and the JVM provides robust memory management and exception handling. Other programming languages that can be used with the JVM include:

Scala: This programming language is roughly as efficient as Java because it runs on the JVM. It has become increasingly popular in data mining because it lets developers combine object-oriented programming (OOP) with functional programming. Users of Scala include The Guardian, LinkedIn, Foursquare, Novell, Siemens, and Twitter; Spark, the data processing engine developed at UC Berkeley’s AMPLab, is itself written in Scala.

Clojure: A dialect of the 1980s-era artificial intelligence language LISP, Clojure is a primarily (though not 100%) functional language that also runs on the JVM. Its data structures are immutable, and it was designed for running concurrent processes. That matters because concurrent threads in conventional object-oriented code will sometimes attempt to write to the same variable simultaneously; keeping data structures immutable avoids the problem. Clojure has access to Java libraries and the same development efficiencies as Java, and it can use its LISP macro facility to integrate with Hadoop and SQL. Users of Clojure include Netflix, Zendesk, Citibank, WalMart Labs, and Spotify.
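The hazard that immutability sidesteps is easy to demonstrate. Here is a minimal sketch, written in Python purely for illustration, of two threads doing an unsynchronized read-modify-write on one shared, mutable variable; the sleep(0) call simply encourages the threads to interleave.

    import threading
    import time

    counter = 0  # shared, mutable state

    def increment_many(n):
        global counter
        for _ in range(n):
            tmp = counter      # read the shared value
            time.sleep(0)      # yield, inviting the other thread to interleave
            counter = tmp + 1  # write back, clobbering updates made in between

    threads = [threading.Thread(target=increment_many, args=(10_000,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Two threads x 10,000 increments should give 20,000, but the interleaved
    # read-modify-write typically loses many updates.
    print(counter)

With immutable data, as in Clojure, a thread can never clobber a structure another thread is reading; updates produce new values instead.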

Python: A High-Level Programming Language with Excellent Data Libraries

Python is a high-level language, meaning it automates much of the low-level housekeeping to make code easier to write. Python has robust libraries that support statistical modeling (SciPy and NumPy), data mining (Orange and Pattern), and visualization (Matplotlib). Scikit-learn, a library of machine learning techniques that’s very useful to data scientists, has attracted developers from Spotify, OKCupid, and Evernote, though it can take time to master.
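As a small taste of scikit-learn, the sketch below clusters a toy two-dimensional dataset with k-means; the points are invented purely for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy 2-D data: two loose groups of points.
    points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                       [8.0, 8.0], [8.5, 7.7], [7.9, 8.3]])

    # Fit k-means with two clusters; random_state makes the run reproducible.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

    print(model.labels_)           # which cluster each point was assigned to
    print(model.cluster_centers_)  # the coordinates of the two centroids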

Excel: Powerful Data Analytics on a Smaller Scale

Excel can actually accomplish a lot of sophisticated analysis, and it’s easy to use and widely available. While it’s not suited to truly massive, unstructured datasets (say, 30 million healthcare records distributed via Hadoop across dozens of servers), it is surprisingly powerful for small-scale data analytics projects, including clustering, optimization, and predictive modeling using supervised machine learning or forecasting techniques.
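To make “forecasting” concrete: fitting a trend line to a short sales history and extending it one period ahead is exactly the kind of small-scale prediction Excel handles with its trendline and FORECAST features. The sketch below does the same least-squares fit in Python; the monthly figures are invented.

    import numpy as np

    # Twelve months of (invented) sales figures.
    months = np.arange(1, 13)
    sales = np.array([110, 115, 121, 130, 128, 140,
                      145, 150, 158, 162, 170, 175])

    # Fit a straight-line trend by least squares, like Excel's linear trendline.
    slope, intercept = np.polyfit(months, sales, deg=1)

    # Forecast month 13 by extending the trend.
    print(round(slope * 13 + intercept, 1))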

SAS (Statistical Analysis System): Data Mining Software Suite

Used for advanced analytics, data management, and social media analytics, SAS is a robust suite that’s popular for business intelligence analysis of large and unstructured datasets. In 2015, SAS topped the Gartner Magic Quadrant for advanced analytics platforms in “ability to execute,” thanks to the breadth and quality of its predictive modeling and data mining techniques. SAS offers a well-regarded visualization tool and integrates with open-source tools like R, Hadoop, and Python. It also puts significant effort into keeping its tools backwards compatible, an important feature when working with older historical datasets.

Why is backwards compatibility so important? Say a company’s sales records were prepared for use with SAS in 1998; with backwards compatibility, they can still be read today. In large organizations, employee turnover puts a premium on continuity of tools, so when a data scientist retires, you don’t lose access to work they built in older software that no one new to the position knows how to use.

SAS can be costly, has a licensing structure some customers find annoyingly complicated, and comes with a steep learning curve. Even so, it remains a very popular option, with more than 65,000 customers.

IBM: SPSS Modeler and SPSS Statistics

The Forrester Wave ranks IBM’s platform as the top offering in the advanced analytics category for the breadth of its tools, which handle every stage of big data modeling: loading, “cleaning,” and preparing data, then predictive modeling with statistical or machine learning techniques.

SPSS Modeler and SPSS Statistics were acquired by IBM in 2009 and have a loyal following among statisticians. These tools integrate with Hadoop to support file system computing on big datasets. The Social Media Analytics product helps data scientists harvest data from Twitter, Facebook, and other platforms to perform customer sentiment analysis. That said, Gartner reports that IBM’s advanced analytics platform has lower-than-average customer satisfaction ratings, largely due to weak customer support, inadequate documentation, and a challenging installation process.

Other makers of highly rated commercial tools for advanced data analytics include SAP, KNIME, RapidMiner, Oracle, and Alteryx.

SQL vs. NoSQL Databases: Tackling the “Messiness” of Big Data

Another important distinction in the world of data is SQL databases vs. NoSQL databases, each of which suits different types of datasets. Here’s a quick look at what makes them different in the context of data analysis.

The traditional “relational” database was designed for an era in which data was far more expensive to collect and to store—and much more carefully organized. Structured Query Language (SQL) has been the means by which programmers transfer data to and from those neatly categorized rows and columns. (Read up on the basics of databases in our Guide to Database Technology.)
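Here is the relational model in miniature, using Python’s built-in sqlite3 module as a stand-in for a full relational DBMS; the table and rows are invented.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("East", 120.0), ("West", 95.5), ("East", 80.0)])

    # SQL pulls answers out of the neatly organized rows and columns.
    for region, total in conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(region, total)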

However, modern data is not so easily categorized. In 2011, Abhishek Mehta, CEO of big data company Tresata, estimated that only 5% of the world’s information was structured data; the rest consisted of articles, photos, videos, social media posts, machine-to-machine communications, product inventories, and technical documents. So data scientists turned to a different approach to data storage called “NoSQL.”

If the traditional relational database was a carefully curated boutique with a small collection, the NoSQL database is more like a chaotic big box warehouse, with many different kinds of data scattered all over the place. Storing data in bulk in this ad hoc way requires more processing power and storage capacity than if the data were better organized—but that’s why Hadoop has become so popular as a way to divvy up storage and processing tasks on clusters of inexpensive commodity hardware.

Which database is right for you? Once you clearly outline the problem you seek to solve, figure out your objective, and determine the size of the dataset to be handled, you’ll have a much better idea of which one could be best for your needs.

Database management systems you might consider include:

MySQL: Open-Source RDBMS

Acquired by Oracle in 2010 through its purchase of Sun Microsystems, MySQL is a widely used RDBMS (relational database management system) and one part of the LAMP software stack. This free, open-source database management system powers web applications like WordPress and Drupal and sites like Facebook, Twitter, and YouTube.

MongoDB

The most popular NoSQL database system on the market is the open-source MongoDB, which has been used by MetLife, The Weather Channel, Bosch, and Expedia. MongoDB has well-regarded customer service, and the tool is particularly popular with startups.
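Here is MongoDB’s document model in miniature, using the pymongo driver; it assumes a MongoDB server is running locally, and the records are invented. Notice that documents in the same collection don’t need to share the same fields.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    reviews = client["demo"]["reviews"]

    # No fixed schema: these two documents have different shapes.
    reviews.insert_many([
        {"product": "widget", "rating": 4, "text": "Works well."},
        {"product": "widget", "rating": 5, "tags": ["fast", "cheap"]},
    ])

    for doc in reviews.find({"product": "widget"}):
        print(doc)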

One of the fastest-growing big data projects used alongside MongoDB is Apache Spark, a distributed computing framework from the Apache Software Foundation that’s designed to operationalize real-time analytics. Paired with MongoDB, Spark lets organizations put real-time analytics reporting to use.
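Here is a minimal PySpark sketch of the kind of aggregation Spark distributes across a cluster; it runs locally, the click events are invented, and the separate MongoDB Spark connector you’d use to read from MongoDB is omitted.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

    # Invented click events; in production these might stream in from MongoDB.
    events = spark.createDataFrame(
        [("home", 1), ("checkout", 1), ("home", 1)], ["page", "clicks"])

    # Spark spreads this aggregation across however many cores or nodes it has.
    events.groupBy("page").sum("clicks").show()

    spark.stop()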

Other commonly used open-source NoSQL databases include HBase and Cassandra. (MariaDB, often mentioned alongside them, is actually an open-source relational fork of MySQL.)

Oracle

Oracle has nearly 50% of the traditional relational database market, with products such as Oracle Database and Oracle TimesTen. The database behemoth has also entered the market for unstructured data storage with Oracle NoSQL, and through MySQL it offers an open-source SQL database alongside its proprietary products. While popular and considered top-notch by many, Oracle’s offerings are expensive.

SQL Server DBMS: Enterprise-Level Database Management

Microsoft SQL Server is a competitive enterprise-level database management system that supports SQL and NoSQL architectures, in-memory computing, the cloud, and analytics on transactions. Existing customers are generally impressed with its performance. However, Gartner reports that SQL Server has a poor image with developers, which has made the product less popular, and customers gave its pricing model the lowest score of any product tested.

Other strong performers in the market include SAP, IBM, EnterpriseDB, InterSystems, and MarkLogic.

If your organization has room in the budget, the full-service option from Oracle is the way to go for reliability and customer service. But if your back-end developer can hack their way through an open-source solution without racking up a pile of expensive support calls, an open-source option like MariaDB or MongoDB will help you cut costs.

Hadoop: File System Computing

What is “file system computing”? It’s a way to store and analyze truly massive datasets. Picture, for example, 2 billion data points from sensors on an auto assembly line, stored on a cluster of servers with each server connected to multiple drives. A dataset that large is too big to extract from the drives to some central place for analysis, so software like Hadoop was created.

Hadoop is an open-source software tool specially designed to help data scientists manage the unwieldiness of big data. It eliminates the need to extract data from the storage devices altogether, bringing the analytics to the data so it can be processed in place. It has increasingly become the industry standard for file system computing projects involving big data, with prominent users including Facebook, Yahoo, and The New York Times.

There are many other platforms that do file system computing, such as SciDB, but Hadoop has risen to the top thanks to an ecosystem of tools that extend its core MapReduce engine, like Hive, Pig, and Spark. Even software giants like Microsoft and IBM have built their own Hadoop tools rather than reinventing the wheel. Learn more about Hadoop’s key features in our Guide to Hadoop.
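One of the simplest ways to see “bringing the analytics to the data” is Hadoop Streaming, which lets the mapper and reducer be ordinary scripts that read standard input. Here is a minimal word-count pair in Python; the file names are illustrative, and the job would be submitted with Hadoop’s streaming jar.

    #!/usr/bin/env python3
    # mapper.py -- Hadoop Streaming feeds input lines on stdin; emit "word<TAB>1".
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- Streaming delivers mapper output sorted by key, so identical
    # words arrive together; sum the 1s for each run of the same word.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

Hadoop runs many copies of these scripts next to the data blocks themselves, so only the small per-word counts travel over the network.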

A/B Testing

Big data analysis involves finding correlations in unexpected places in your data. The technique known as “A/B testing,” in which users are shown different versions (“A” versus “B”) of a website design to determine which attracts more clicks, has taken over the tech world. Big data is very good at telling you “what works.”

Consider this example from an Optimizely customer, which shows how A/B testing can overrule older “best practices” and yield better results. A retailer testing its homepage layout had previously tucked news-related items into one corner of the homepage grid, a carry-over from newspaper and magazine best practices. But by testing different versions of the design against one another on the Optimizely platform, the company found users responded better to news-related items placed front and center. From then on, the company used that area of the homepage for whatever it wanted prioritized. This is how A/B testing works: it finds a correlation between a layout and how users respond to it, then lets you adjust for the best results.
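Under the hood, calling a winner is a statistics question: did version B’s click rate beat version A’s by more than chance would explain? Here is a minimal check using SciPy’s chi-square test on a 2x2 table of invented counts.

    from scipy.stats import chi2_contingency

    # Rows: version A, version B. Columns: clicked, did not click.
    table = [[120, 880],   # A: 120 clicks out of 1,000 visitors (12%)
             [160, 840]]   # B: 160 clicks out of 1,000 visitors (16%)

    chi2, p_value, _, _ = chi2_contingency(table)
    print(f"p = {p_value:.4f}")  # a small p-value suggests B's lift isn't chance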

A/B testing terms to know:

Multivariate testing lets you test more than two versions of a page against one another. For example, if you’re testing two versions of a drop-down menu, three textual descriptions of a product offer, and two large photos on a page, you’ll have 2 × 3 × 2 = 12 different versions of the page to test, not just an “A” and a “B” (see the sketch after this list).

Multi-platform functionality helps ensure your design or layout performs well on all kinds of devices, from PCs to smartphones.
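That combinatorial growth is easy to see in code; here itertools.product enumerates the 2 × 3 × 2 = 12 page variants from the example above (the variant names are placeholders).

    from itertools import product

    menus = ["menu_1", "menu_2"]
    descriptions = ["desc_1", "desc_2", "desc_3"]
    photos = ["photo_1", "photo_2"]

    variants = list(product(menus, descriptions, photos))
    print(len(variants))  # 12 distinct page versions to test
    print(variants[0])    # ('menu_1', 'desc_1', 'photo_1')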

Optimizely

Optimizely has a simple, intuitive user interface that the company aims at organizations of all sizes, and the easier the tool is to use, the more likely employees are to experiment and discover unknown correlations. It is the most commonly used A/B testing platform among the 10,000 largest websites, though both the Forrester Wave and TrustRadius found it less effective than some competitors at more complicated forms of A/B testing. Clients include Microsoft, NBC, The Guardian, Disney, and Starbucks.

Maxymiser

TrustRadius’ top-rated A/B testing tool among large “enterprise” customers (along with Monetate), Maxymiser requires web managers to insert a single line of code into a page before it begins analyzing for basic tests. More advanced tests with multiple variables require coding and may call for help from Maxymiser customer support. Because the tool is powerful, web managers should budget some training time to get things running smoothly. Clients include Ramada, Alaska Airlines, Virgin Media, and HSBC.

Adobe Target (formerly Test & Target)

This powerful A/B testing tool integrates well with the other components of the Adobe Analytics suite. With Target, web developers can build more elaborate segmentation and custom JavaScript-based experiments; its particularly nimble interface simplifies the cumbersome process of multivariate testing and can assign different statistical weights to variations. Clients include Marriott, AOL, and Redbox.

Other A/B and multivariate testing platforms include: Monetate, A/B Tasty, Qubit, Visual Website Optimizer, and Unbounce.



Sources Consulted

Clojure website, http://ift.tt/1i1ZzyM, accessed April 21, 2015.

Jared Dean, Big Data, Data Mining and Machine Learning: Value Creation for Business Leaders and Practitioners (New York: Wiley, 2014), pp. 43-54.

Donald Feinberg, Merv Adrian, Nick Heudecker, “Magic Quadrant for Operational Database Systems,” http://ift.tt/1Rtzne9, October 16, 2014.

John W. Foreman, Data Smart: Using Data Science to Transform Information Into Insight (New York: Wiley, 2013), pp. 1-28.

Gareth Herschel, Alexander Linden, Lisa Kart, “Magic Quadrant for Advanced Analytics Platforms,” (Gartner: 2015), http://ift.tt/1DMTAKe.

Jonothon Morgan, “Don’t Be Scared of Functional Programming,” Smashing Magazine, July 2, 2014.

“Expert to Expert: Brian Beckman and Rich Hickey - Inside Clojure,” http://ift.tt/RxUhBW, accessed April 21, 2015.

Abhishek Mehta, “Big Data: Powering the Next Industrial Revolution,” Tableau Software white paper, 2011, http://ift.tt/22pKQa1.

Scikit-Learn website, http://ift.tt/O5hS6h, accessed April 21, 2015.

Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think (New York: Eamon Dolan/Houghton Mifflin Harcourt, 2013), pp. 45-46.

Joe Stanhope, “The Forrester Wave: Online Testing Platforms, Q1 2013,” http://ift.tt/22pKPD5, February 7, 2013.

Ashlee Vance, “Open Source as a Model for Business Is Elusive,” New York Times, November 29, 2009.


