2013-09-23

To allow our readers to compare Hadoop distribution vendors side by side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions came from Catalyst attendees and from Gartner analysts. I have published the vendor responses in these blog posts:

Cloudera, August 21, 2013
MapR, August 26, 2013
Pivotal, August 30, 2013

In addition to the Catalyst panelists, IBM also volunteered to answer the questions. Today, I publish responses from Paul Zikopoulos, Vice President, Worldwide Technical Sales, IBM Information Management Software.

1. How specifically are you addressing variety, not merely volume and velocity?

While Hadoop itself can store a wide variety of data, the challenge is really the ability to derive value out of this data and integrate it within the enterprise; connecting the dots, if you will. After all, big data without analytics is…well…just a bunch of data. Quite simply, lots of people are talking Hadoop, but we want to ensure people are talking analytics on Hadoop. To facilitate analytics on data at rest in Hadoop, IBM has added a suite of analytic tools to its Hadoop distribution, InfoSphere BigInsights, and we believe this to be one of our significant differentiators when compared to the distributions from other vendors.

I'll talk about working with text from a variety perspective in a moment, but a fun one to talk about is the IBM Multimedia Analysis and Retrieval System (IMARS). It's one of the world's largest image classification systems, and Hadoop powers it. Consider a set of pictures with no metadata: you search for pictures that include "Wintersports," and the system returns such pictures. It does this by creating training sets that drive feature extraction and analysis. You can check it out for yourself at http://mp7.watson.ibm.com/imars/. We've also got acoustic modules that our partners are building around our big data platform. If you consider our rich partner ecosystem, our Text Analytics Toolkit, and the work we're doing with IMARS, I think IBM is leading the way when it comes to putting a set of arms around the variety component of big data.

Pre-built customizable application logic

With more than two dozen pre-built Hadoop Apps, InfoSphere BigInsights helps firms quickly benefit from their big data platform. Apps include web crawling, data import/export, data sampling, social media data collection and analysis, machine data processing and analysis, ad hoc queries, and more. All of these Apps ship with full access to their source code, serving as a launching pad to customize, extend, or develop your own. What's more, we empower the delivery of these Apps to the enterprise; quite simply, if you're familiar with how you discover and invoke Apps on an iPad, then you've got the notion here. We essentially allow you to set up your own "App Store" where users can invoke Apps and manage their run-time, thereby flattening deployment costs and driving up the democratization effect. And it goes another level yet: while your core Hadoop experts and programmers can create these discrete Apps, much like SOA, we enable power users to build their own logic by orchestrating these Apps into full-blown applications. For example, if you wanted to grab data from a service such as BoardReader, merge it with some relational data, then perform a transformation on that data and run an R script on the result, it's as easy as dragging these Apps from the Apps Palette and connecting them.
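In BigInsights that orchestration happens visually on the Apps Palette. Purely as a conceptual sketch of the same idea, and not the BigInsights API, here is how such a chain of discrete steps might look if wired together in plain Python; all function, file, and column names below are hypothetical:

```python
# Conceptual sketch only: BigInsights composes such pipelines visually in the
# Apps Palette. This plain-Python chain just illustrates wiring discrete
# "apps" (fetch, merge, transform/analyze) into one application.
import csv
import json

def fetch_social_posts(path):
    # Stand-in for a social data collection App (a BoardReader-style feed dump).
    with open(path) as f:
        return [json.loads(line) for line in f]

def load_relational_rows(path):
    # Stand-in for a relational data import App.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def merge(posts, rows, key="product_id"):
    # Stand-in for a join/transform App: enrich each post with matching row data.
    by_key = {r[key]: r for r in rows}
    return [{**p, **by_key.get(p.get(key), {})} for p in posts]

def analyze(records):
    # Stand-in for a downstream analytics App (an R script in the example above).
    return {"record_count": len(records)}

if __name__ == "__main__":
    merged = merge(fetch_social_posts("posts.jsonl"),
                   load_relational_rows("products.csv"))
    print(analyze(merged))
```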

There's also a set of solution accelerators: extensive toolkits with dozens of pre-built software artifacts that can be customized and used together to quickly build tailor-made solutions for some of the more common kinds of big data analysis. There is a solution accelerator for machine data (handling data ingest, indexing and search, sessionization of log data, and statistical analysis), one for social data (handling ingestion from social sources like Twitter, analyzing historical messages to build social profiles, and providing a framework for in-motion analysis based on historical profiles), and one for telco-based CDR data.

Spreadsheet-style analysis tool

What's the most popular BI tool in the world? Likely some form of spreadsheet software. So, to keep business analysts and non-programmers in their comfort zone, BigInsights includes a Web-based, spreadsheet-like discovery and visualization facility called BigSheets. BigSheets lets users visually combine and explore various types of data to identify "hidden" insights without having to understand the complexities of parallel computing. Under the covers, it generates Pig jobs. BigSheets also lets you run jobs on a sample of the data, which is really important: in the big data world you could be dealing with a lot of data, so why chew up those resources before you're sure the job you've created works as intended?
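BigSheets handles that sampling for you under the covers. Purely as an illustration of the sample-first workflow (not of BigSheets internals), a single-pass reservoir sample is one common way to pull a small, uniform subset of a large file for testing; the file name and sample size below are made up:

```python
# Illustration only: BigSheets does its own sampling under the covers.
# Reservoir sampling picks k records uniformly from a file of unknown size
# in one pass, so job logic can be validated cheaply before a full run.
import random

def reservoir_sample(path, k=1000, seed=42):
    rng = random.Random(seed)
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample

if __name__ == "__main__":
    for line in reservoir_sample("clickstream.log", k=10):
        print(line.rstrip())
```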

Advanced Text Analytics Toolkit

BigInsights helps customers analyze large volumes of documents and messages with its built-in text-processing engine and library of context-sensitive extractors. This includes a complete end-to-end development environment that plugs into the Eclipse IDE and offers a perspective for building text extraction programs that run on data stored in Hadoop. In the same manner that declarative SQL transformed the ability to work with relational data, IBM has introduced the Annotation Query Language (AQL) for text extraction, which is similar in look and feel to SQL. Using AQL and the accompanying IDE, power users can build all kinds of text extraction processes for whatever task is at hand (social media analysis, classification of error messages, blog analysis, and so on). Finally, accompanying this is an optimizer that compiles the AQL into optimized execution code that runs on the Hadoop cluster. This optimizer was built from the ground up to process and optimize text extraction, which requires a different set of optimization techniques than is typical. In the same manner that BigInsights ships a number of Apps, the Text Analytics Toolkit includes a number of pre-built extractors that let you pull things such as "Person/Phone", "URL", and "City" (among many others) from text, which could be a social media feed, a financial document, call record logs, log files…you name it. The magic behind these pre-built extractors is that they are really compiled rules, hundreds of them, distilled from thousands of client engagements. Quite simply, when it comes to text extraction, we think our platform is going to let you build your applications fifty percent faster, make them run up to ten times faster than some of the alternatives we've seen, and, most of all, provide more right answers. After all, everyone talks about just how fast an answer was returned, and that's important, but when it comes to text extraction, how often the answer is right is just as important. This platform is going to be "more right".
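AQL itself is declarative and is compiled by the toolkit's optimizer. Purely as a rough illustration of what a simple rule-based extractor does (not IBM's shipped rules, and not AQL syntax), here is a toy Python sketch that pulls phone numbers and URLs out of free text with regular expressions:

```python
# Toy illustration of rule-based extraction. The BigInsights Text Analytics
# Toolkit expresses such rules declaratively in AQL and compiles them; this
# sketch hard-codes two deliberately simplified regular expressions.
import re

PHONE = re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")
URL = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract(text):
    return {
        "phones": PHONE.findall(text),
        "urls": URL.findall(text),
    }

if __name__ == "__main__":
    sample = "Call Jane at 914-555-0100 or see http://example.com/support for details."
    print(extract(sample))
```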

Indexing and Search facility

Discovery and exploration often involve search, and as Hadoop is well suited to storing large varieties and volumes of data, a robust search facility is needed. Included with the BigInsights product is InfoSphere Data Explorer (technology that came into the IBM big data portfolio through the Vivisimo acquisition). Search is all the rage these days, with announcements by other Hadoop vendors in this space. BigInsights includes Data Explorer for searching Hadoop data; however, it can be extended to search all data assets. So, unlike some of the announcements we've heard, I think we're setting a higher bar. What's more, the indexing technology behind Data Explorer is positional-based as opposed to vector-based, and that provides a lot of the differentiating benefits that are needed in a big data world. Finally, Data Explorer understands security policies. If you're only granted access to a portion of a document and it comes up in your search, you still only have access to the portion that was defined on the source systems. There's so much more to this exploration tool, such as automated topical clusters, a portal-like development environment, and more. Let's just say we have some serious leadership in this space.

Integration tools

Data integration is a huge issue today, and it only gets more complicated with the variety of big data that needs to be integrated into the enterprise. Mandatory requirements for big data integration on diverse data sources include: 1) the ability to extract data from any source, and to cleanse and transform the data before moving it into the Hadoop environment; 2) the ability to run the same data integration processes outside the Hadoop environment and within it, wherever most appropriate (or perhaps to run them on data that resides in HDFS without using MapReduce); 3) the ability to deliver unlimited data integration scalability both outside of the Hadoop environment and within it, wherever most appropriate; and 4) the ability to overcome some of Hadoop's limitations for big data integration. Relying solely on hand coding within the Hadoop environment is not a realistic or viable big data integration solution.

Gartner had some comments about how relying solely on Hadoop as an ETL engine leads to more complexity and cost, and we agree. We think Hadoop has to be a first-class citizen in such an environment, and it is with our Information Server product set. Information Server is not just for ETL. It is a powerful parallel application framework that can be used to build and deploy broad classes of big data applications. For example, it includes a design canvas, metadata management, visual debuggers, the ability to share reusable components, and more. So Information Server can automatically generate MapReduce code, but it's also smart enough to know if a different execution environment might better serve the transformation logic. We like to think of the work you do in Information Server as "Design the Flow Once – Run and Scale Anywhere".

I recently worked with a pharma client that ran blindly down the "Hadoop for all ETL" path. One of the transformation flows had more than 2,000 lines of code and took 30 days to write. What's more, it had no documentation and was difficult to reuse and maintain. The exact same logic flow was implemented in Information Server in just 2 days; it was built graphically and was self-documenting. Performance improved, and the flow was more reusable and maintainable.

Think about it: long before "ETL solely via Hadoop" it was "ETL via [programming language of the year]." I saw clients do the same thing 10 years ago with Perl scripts, and they ended up in the same place as clients that rely solely on Hadoop for ETL: higher costs and complexity. Hand coding was replaced by commercial data integration tools for these very reasons, and we think most customers do not want to go backwards in time by adopting the high costs, risks, and limitations of hand coding for big data integration. We think our Information Server offering is the only commercial data integration platform that meets all of the requirements for big data integration outlined above.

2.  How do you address security concerns?

To secure data stored in Hadoop, the BigInsights security architecture uses both private and public networks to ensure that sensitive data, such as authentication information and source data, is secured behind a firewall. It uses reverse proxies, LDAP and PAM authentication options, options for on-disk encryption (including hardware-, file-system-, and application-based implementations), and built-in role-based authorization. User roles are assigned distinct privileges so that critical Hadoop components cannot be altered by users without access.

But there is more. After all, if you are storing data in an RDBMS or in HDFS, you're still storing data, yet so many don't stop to think about it. We have a rich portfolio of data security services that work with Hadoop (or soon will): for example, the test data management, data masking, and archiving services in our Optim family. In addition, InfoSphere Guardium provides enterprise-level security auditing capabilities and database activity monitoring (DAM) for every mainstream relational database I can think of; well, don't you need a DAM solution for data stored in Hadoop to support today's compliance mandates in the same manner as traditional structured data sources? Guardium provides a number of pre-built reports that help you understand who is running what job, whether they are authorized to run it, and more. If you have ever tried to work with Hadoop's "chatty" protocol, you know how tedious this would be to do on your own, if at all. Guardium makes it a snap.

3.  How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?

While Hadoop itself is moving toward better support of multi-tenancy with Hadoop 2.0 and YARN, IBM already offers some unique capabilities in this area. Specifically, BigInsights customers can deploy Platform Symphony, which extends capabilities for multi-tenancy. Business application owners frequently have particular SLAs that they must achieve, and for this reason many organizations run a dedicated cluster infrastructure for each application, because that is the only way they can be assured of having resources when they need them. Platform Symphony solves this problem with a production-proven multi-tenant architecture. It allows ownership to be expressed on the grid, ensuring that each tenant is guaranteed a contracted SLA, while also allowing resources to be shared dynamically so that idle resources are fully utilized to the benefit of all.

For more complex use cases involving issues like multi-tenancy, IBM offers an alternative file system to HDFS: GPFS-FPO (General Parallel File System: File Placement Optimizer). One compelling GPFS capability for Hadoop clusters that support different data sets and workloads is the concept of storage volumes, which lets you dedicate specific hardware in your cluster to specific data sets. In short, you could store colder data that doesn't require heavy analytics on less expensive hardware. I like to call this "Blue Suited" Hadoop. There's no need to change your Hadoop applications if you use GPFS-FPO, and there are a host of other benefits it offers, such as snapshots, large block I/O, removal of the need for a dedicated NameNode, and more. I want to stress that the choice is up to you. You can use BigInsights with all the open source components, and it's "round-trippable" in that you can go back to open source Hadoop whenever you like, or you can use some of these embrace-and-extend capabilities that harden Hadoop for the enterprise. I think what separates us from some of the other approaches is that we are giving you the choice of file system, and the alternative we offer has a long-standing reputation in the high performance computing (HPC) arena.

4.   What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?

Disaster Recovery

One focus area will be a shift from customers simply looking for High Availability to a requirement for Disaster Recovery. Today, DR in Hadoop is primarily limited to snapshots of HBase and HDFS datasets, plus distributed copy (distcp). Customers are starting to ask for automated replication to a second site and automated cluster failover for true disaster recovery. For customers who require disaster recovery today, GPFS-FPO (IBM's alternative to HDFS) includes flexible replication and recovery capabilities.
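In the meantime, cross-site replication is typically scripted around the distributed copy (distcp) tool mentioned above. A minimal sketch of that approach, assuming placeholder NameNode addresses and paths, and leaving out the scheduling and monitoring a real deployment needs:

```python
# Minimal DR-style replication sketch: mirror selected HDFS directories to a
# second cluster using the stock "hadoop distcp" tool. Cluster addresses and
# paths are placeholders; real deployments wrap this with scheduling,
# monitoring, and failure handling.
import subprocess

SOURCE_NN = "hdfs://prod-nn:8020"
TARGET_NN = "hdfs://dr-nn:8020"
PATHS = ["/warehouse/landing", "/warehouse/curated"]

def replicate(path):
    cmd = [
        "hadoop", "distcp",
        "-update",  # copy only files that are missing or changed on the target
        SOURCE_NN + path,
        TARGET_NN + path,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    for p in PATHS:
        replicate(p)
```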

Data Governance and Security

As businesses begin to depend on Hadoop for more of their analytics, data governance issues become increasingly critical. Regulatory compliance for many industries and jurisdictions demands strict data access controls, the ability to audit data reads, and the ability to track data lineage exactly, just to name a few criteria. IBM is a leader in data governance and has a rich portfolio to enable organizations to exert tighter control over their data. This portfolio has been significantly extended to factor in big data, effectively taming Hadoop.

GPFS can also enhance security by providing access control lists (ACLs) with file- and disk-level access control, so that data can be isolated to privileged users, applications, or even specific physical nodes in highly secured physical environments. For example, if you have a set of data that can't be deleted because of some regulatory compliance requirement, you can create that policy in GPFS. You can also create immutability policies, and more.

Business-Friendly Analytics

Hadoop has, for most of its history, been the domain of parallel computing experts. But as it becomes an increasingly popular platform for large-scale data storage and processing, Hadoop needs to be accessible to users with a wide variety of skill sets. BigInsights is responding to this need by providing BigSheets (the spreadsheet-style analytics tool mentioned in question 1) and an App Store-style framework to enable business users to run analytics functions, to mix, match, and chain apps together, and to visualize their results.

5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?

One example of allowing data to be accessed directly in Hadoop is the industry shift to improve SQL access to data stored in Hadoop; it's ironic, isn't it? The biggest movement in the NoSQL space is…SQL! Despite all of the media coverage of Hadoop's ability to manage unstructured data, there is pent-up demand to use Hadoop as a lower cost platform to store and query traditional structured data. IBM is taking a unique approach here with the introduction of Big SQL, which is included in BigInsights 2.1. Big SQL extends the value of data stored within Hive or HBase by making it immediately queryable with a richer SQL interface than that provided by HiveQL. Specifically, Big SQL provides full ANSI SQL-92 support for sub-queries, more SQL-99 OLAP aggregation functions and common table expressions, SQL:2003 windowed aggregate functions, and more. Together with the supplied ODBC and JDBC drivers, this means that a broader selection of end-user query tools can generate standard SQL and directly query Hadoop. Similarly, Big SQL provides optimized SQL access to HBase with support for secondary indexes and predicate pushdown. We think Big SQL is the "closest to the pin" of all the SQL solutions out there for Hadoop, and there are a number of them. Of course, Hive is shipped in BigInsights, so the work being done there is by default part of BigInsights. We have some big surprises coming in this space, so stay close to it. But the underlying theme here is that Hadoop needs SQL interfaces to help democratize it, and all vendors are shooting for a hole in one here.
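Because Big SQL exposes standard ODBC and JDBC drivers, any generic SQL client can be pointed at it. As a hedged illustration only, here is how a windowed aggregate might be run from Python over ODBC; the DSN, table, and column names are placeholders, not IBM-documented objects:

```python
# Illustration of querying a Hadoop-backed table through a standard ODBC
# driver, as Big SQL (and similar SQL-on-Hadoop engines) allow. The DSN,
# table, and column names below are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect("DSN=bigsql_dev", autocommit=True)
cursor = conn.cursor()

# SQL:2003-style windowed aggregate: rank customers by spend within a region.
cursor.execute("""
    SELECT region,
           customer_id,
           SUM(amount) AS total_spend,
           RANK() OVER (PARTITION BY region ORDER BY SUM(amount) DESC) AS spend_rank
    FROM sales
    GROUP BY region, customer_id
""")

for region, customer_id, total_spend, spend_rank in cursor.fetchmany(20):
    print(region, customer_id, total_spend, spend_rank)

conn.close()
```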

6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?

Given that BigInsights includes analytic tools in addition to the open source Apache Hadoop components, many of our customers are doing analytics in Hadoop. One example is Constant Contact Inc., which is using BigInsights to analyze 35 billion annual emails to guide customers on the best dates and times to send emails for maximum response. This has improved the performance of their customers' email campaigns by 15 to 25%.

A health bureau in Asia is using BigInsights to develop a centralized medical imaging diagnostics solution. An estimated 80 percent of healthcare data is medical imaging data, in particular radiology imaging. The medical imaging diagnostics platform is expected to significantly improve patient healthcare by allowing physicians to exploit the experience of other physicians in treating similar cases, and inferring the prognosis and the outcome of treatments. It will also allow physicians to see consensus opinions as well as differing alternatives, helping reduce the uncertainty associated with diagnosis. In the long run, these capabilities will lower diagnostic errors and improve the quality of care.

Another example is a global media firm concerned about piracy of their digital content. The firm monitors social media sites (like Twitter, Facebook, and so on) to detect unauthorized streaming of their content, quantify the annual revenue loss due to piracy, and analyze trends. For them, the technical challenges involved processing a wide variety of unstructured and semi-structured data.

The company selected BigInsights for its text analytics and scalability. The firm is relying on IBM's services and technical expertise to help it implement its aggressive application requirements.

Vestas models weather to optimize the placement of turbines, maximizing the generation and longevity of their turbines. They've been able to take a month out of the preparation and placement work with their models, and to reduce turbine placement identification from weeks to hours, using their 1,400+ node Hadoop cluster. If you were to take the wind history of the world that they store and capture it as HD TV, you'd be sitting down and watching your television for 70 years to get through all the data they have: over 2.5 PB, with 6 PB more to come. How impactful is their work? In 2012 they were honored by Computerworld's Honors Program in its Search for New Heroes for "the innovative application of IT to reduce waste, conserve energy, and the creation of new product to help solve global environment problems". IBM is really proud to be a part of that.

Having said that, a substantial number of customers are using BigInsights as a landing zone or staging area for a data warehouse, meaning it is being deployed as a data integration or ETL platform. In this regard our customers are seeing additional value in our Information Server offering combined with BigInsights. Not all data integration jobs are well suited for MapReduce, and Information Server has the added advantage of being able to move and transform data from anywhere in the enterprise, and across the internet, to the Hadoop environment. Information Server allows them to build data integration flows quickly and easily, and to deploy the same job both within and external to the Hadoop environment, wherever most appropriate. It delivers the data governance and operational and administrative management required for enterprise-class big data integration.

7.  What does it take to get a CIO to sign off on a Hadoop deployment?

We strive to establish a partnership with the customer and the CIO to reduce risk and ensure success in this new and exciting opportunity for their company. Because the industry is still in the early adopter phase, IBM is focused on helping CIOs understand the concrete financial and competitive advantages that Hadoop can bring to the enterprise. Sometimes this involves some form of pilot project, but as deployments increase across different industries, customers are starting to understand the value right away. As an example, we are starting to see more large deals for enterprise licenses of BigInsights.

8.  What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?

For customers that want optimal "out of the box" performance without having to worry about tuning parameters, we have announced PureData System for Hadoop, our appliance offering that ships with BigInsights pre-integrated.

For customers who want better performance than Hadoop provides natively, we offer Adaptive MapReduce, which can optionally replace the Hadoop scheduler with Platform Symphony. IBM has completed significant big data benchmarking with BigInsights and Platform Symphony, including the SWIM benchmark (Statistical Workload Injector for MapReduce), a benchmark representing a real-world big data workload developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems on real industry workloads. Platform Symphony Advanced Edition accelerated the SWIM/Facebook workload traces by approximately 6 times.

9. How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?

Our current release 2.1 of BigInsights ships with Hadoop 1.1.1 and was released in mid-2013, when Hadoop 2 was still in alpha state. Once Hadoop 2 is out of alpha/beta state and production ready, we will include it in a near-future release of BigInsights. IBM strives to provide the most proven, stable versions of Apache Hadoop components.

10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?

It depends on the use case. Earlier we discussed Constant Contact Inc., whose use of BigInsights is driving revenue enhancement through text analytics.

An example of cost reduction is a large automotive manufacturer that has deployed BigInsights as the landing zone for their EDW environment. In traditional terms, we might think of the landing zone as the atomic-level detail of the data warehouse, or the system of record. It also serves as the ETL platform, enhanced with Information Server for data integration across the enterprise. The cost savings come from replicating only an active subset of the data to the EDW, thereby reducing the size and cost of the EDW platform. So 100 percent of the data is stored in Hadoop, and about 40 percent gets copied to the warehouse. Over time, Hadoop also serves as the historical archive for EDW data, which remains query-able.

As Hadoop becomes more widely deployed in the enterprise, regardless of the use case, the ROI increases.

11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?

The deployments we are seeing are predominantly being done within IT, although with an objective of better serving their customers. Once deployed, tools such as BigSheets make it easier for end users by providing analytics capability without the need to write MapReduce code.

12. How important will SQL become as mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?

SQL access to Hadoop is already important, with a lot of the demand coming from customers who are looking to reduce the costs of their data warehouse platform. IBM introduced Big SQL earlier this year, and this is an area where we will continue to invest. It affects the broader Hadoop ecosystem by turning Hadoop into a first-class citizen of a modern enterprise data architecture and by offloading some of the work traditionally done by an EDW. EDW costs can be reduced by using Hadoop to store the atomic-level detail, perform data transformations (rather than doing them in-database), and serve as a query-able archive for older data. IBM can provide a reference architecture and assist customers looking to incorporate Hadoop into their existing data warehouse environment in a cost-effective manner.

13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?

Yes, it will extend the use cases for Hadoop. In fact, we have already seen the start of that with the addition of SQL access to Hadoop and the move toward interactive SQL queries. We expect to see a broader adoption of Hadoop as an EDW landing zone and associated workloads.

14. What are some of the key implementation partners you have used?

[Answer skipped.]

15. What factors most affect YOUR Hadoop deployments (eg SSDs; memory size.. and so on)? What are the barriers and opportunities to scale?

IBM is in a unique position of being able to leverage our vast portfolio of hardware offerings for optimal Hadoop deployments. We are now shipping PureData System for Hadoop, which takes all of the guesswork out of choosing the components and configuring a cluster for Hadoop. We are leveraging what we have learned through delivering the Netezza appliance, and the success of our PureData Systems, to simplify Hadoop deployments and improve time to value.

For customers who want more flexibility, we offer Hadoop hardware reference architectures that can start small and grow big, with options for those who are more cost conscious and for those who want the ultimate in performance. These reference architectures are very prescriptive, defining the exact hardware components needed to build the cluster; they are developed in the IBM Labs by our senior architects based on our own testing and customer experience.

Follow Svetlana on Twitter @Sve_Sic