To allow our readers to compare Hadoop distribution vendors side-by-side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions came from Catalyst attendees and from Gartner analysts. I have published the vendor responses in these blog posts:
Cloudera, August 21, 2013
MapR, August 26, 2013
Pivotal, August 30, 2013
In addition to the Catalyst panelists, IBM also volunteered to answer the questions. Today, I publish responses from Paul Zikopoulos, Vice President, Worldwide Technical Sales, IBM Information Management Software.
1. How specifically are you addressing variety, not merely volume and velocity?
While Hadoop itself can store a wide variety of data, the challenge is really an
organization's ability to derive value from this data and integrate it within the
enterprise – connecting the dots, if you will. After all, big data without analytics
is…well…just a bunch of data. Quite simply, lots of people are talking Hadoop – but
we want to ensure people are talking analytics on Hadoop. To facilitate analytics on
data at rest in Hadoop, IBM has added a suite of analytic tools to its Hadoop
distribution, InfoSphere BigInsights, and we believe this is one of its significant
differentiators compared to other vendors' distributions.
I’ll talk about working with text from a variety perspective in a moment, but a fun
one to talk about is the IBM Multimedia Analysis and Retrieval System (IMARS).
It’s one of the world’s largest image classification systems – and Hadoop powers it.
Consider a set of pictures with no metadata: you search for those that include
“Wintersports,” and the system returns matching pictures. It does this by creating
training sets that perform feature extraction and analysis. You can check it out for
yourself at: http://mp7.watson.ibm.com/imars/. We’ve also got acoustic modules that our
partners are building around our big data platform. If you consider our rich partner
ecosystem, our Text Analytics Toolkit, and the work we’re doing with IMARS, I think IBM
is leading the way when it comes to putting a set of arms around the variety component
of big data.
Pre-built customizable application logic
With more than two dozen pre-built Hadoop Apps, InfoSphere BigInsights helps firms
quickly benefit from their big data platform. Apps include web crawling, data
import/export, data sampling, social media data collection and analysis, machine data
processing and analysis, ad hoc queries, and more. All of these Apps are shipped with full
access to their source code, serving as a launching pad to customize, extend, or develop
your own. What’s more, we empower the delivery of these Apps to the enterprise; quite
simply, if you’re familiar with how you invoke and discover Apps on an iPad, then
you’ve got the notion here. We essentially allow you to set up your own ‘App Store’
where users can invoke Apps and manage their runtime, thereby lowering deployment
costs and driving the democratization effect. And it goes another level yet:
while your core Hadoop experts and programmers create these discrete Apps, much
like SOA, we enable power users to build their own logic by orchestrating these
Apps into full-blown applications. For example, if you wanted to grab data from a service
such as BoardReader and merge that with some relational data, then perform a
transformation on that data and run an R script on that – it’s as easy as dragging these Apps from the Apps Palette and connecting them.
There’s also a set of solution accelerators: extensive toolkits with dozens of pre-built
software artifacts that can be customized and used together to quickly build tailor-made
solutions for some of the more common kinds of big data analysis. There is a
solution accelerator for machine data (handling data ingest, indexing and search,
sessionization of log data, and statistical analysis), one for social data (handling ingestion
from social sources like Twitter, analyzing historical messages to build social profiles,
and providing a framework for in-motion analysis based on those profiles), and one for
telco call detail record (CDR) data.
Spreadsheet-style analysis tool
What’s the most popular BI tool in the world? Likely some form of spreadsheet software.
So to keep business analysts and non-programmers in their comfort zone, BigInsights
includes a Web-based spreadsheet-like discovery and visualization facility called
BigSheets. BigSheets lets users visually combine and explore various types of data to
identify “hidden” insights without having to understand the complexities of parallel
computing. Under the covers, it generates Pig jobs. BigSheets also lets you run jobs on
a sample of the data, which is really important: in the big data world you could
be dealing with a lot of data, so why chew up cluster resources until you’re sure the job
you’ve created works as intended?
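To make that sample-first pattern concrete, here is a minimal Python sketch – not BigSheets code and not the Pig it generates; the record format and function names are invented for illustration – of why validating a sheet's logic on a small sample before a full run saves cluster resources:

```python
# Illustrative only: the "test on a sample first" pattern that BigSheets automates.
# BigSheets itself generates Pig jobs behind the spreadsheet UI; run_sheet_logic
# and the toy records below are hypothetical stand-ins.
import random

def run_sheet_logic(records):
    # Stand-in for whatever analysis a sheet expresses, e.g. filter + project.
    return [r.split()[1] for r in records if "ERROR" in r]

def take_sample(records, fraction=0.01, seed=7):
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# Toy stand-in for a large data set.
records = [f"{'ERROR' if i % 50 == 0 else 'INFO'} host{i % 9} event {i}"
           for i in range(100_000)]

preview = run_sheet_logic(take_sample(records))   # cheap validation run
print(f"sample run produced {len(preview)} rows")
# Only once the sample output looks right would you pay for the full run.
full = run_sheet_logic(records)
print(f"full run produced {len(full)} rows")
```

BigSheets performs the equivalent behind its spreadsheet UI, so the analyst never has to write or understand this kind of code.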
Advanced Text Analytics Toolkit
BigInsights helps customers analyze large volumes of documents and messages with its
built-in text-processing engine and library of context-sensitive extractors. This includes a
complete end-to-end development environment that plugs into the Eclipse IDE and offers
a perspective to build text extraction programs that run on data stored in Hadoop. In the
same manner that SQL declarative languages transformed the ability to work with
relational data, IBM has introduced the Annotation Query Language (AQL) for text
extraction, which is similar in look and feel to SQL. Using AQL and the accompanying
IDE, power users can build all kinds of text extraction processes for whatever the task is
at hand (social media analysis, classification of error messages, blog analysis, and so on).
Finally, accompanying this is an optimizer that compiles the AQL into optimized
execution code that runs on the Hadoop cluster. This optimizer was built from the
ground up for text extraction, which requires a different set of optimization
techniques than typical query processing. In the same manner as BigInsights ships a
number of Apps, the Text Analytics Toolkit includes a number of pre-built extractors
that let you pull things such as “Person/Phone”, “URL”, and “City” (among many
others) from text – a social media feed, a financial document, call record logs, log
files…you name it. The magic behind these pre-built extractors is that they are really
compiled rules – hundreds of them – from thousands of client engagements. Quite
simply, when it comes to text extraction, we think our platform is going to let you build
your applications fifty percent faster, make them run up to ten times faster than some of
the alternatives we’ve seen, and most of all, provide more right answers. After all,
everyone talks about how fast an answer was returned, and that’s important, but
when it comes to text extraction, how often the answer is right is just as important. This
platform is going to be “more right”.
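As a rough illustration of what a rule-based extractor does – this is plain Python with two deliberately simplistic patterns, not AQL and not IBM's compiled-rule extractors – consider pulling phone numbers and URLs out of raw text:

```python
# Illustrative only: a toy rule-based extractor, standing in for the far richer
# compiled-rule extractors (Person/Phone, URL, City, ...) shipped with the
# Text Analytics Toolkit. Real extractors combine hundreds of rules; these two
# regexes are deliberately simplistic.
import re

PHONE = re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")
URL = re.compile(r"https?://[^\s\"')]+")

def extract(text):
    """Return the phone numbers and URLs found in a chunk of raw text."""
    return {"phones": PHONE.findall(text), "urls": URL.findall(text)}

sample = ("Call Jane Doe at 914-555-0199 about the outage; "
          "details are posted at http://status.example.com/incident/42")
print(extract(sample))
# {'phones': ['914-555-0199'], 'urls': ['http://status.example.com/incident/42']}
```

The shipped extractors layer hundreds of such rules, plus contextual logic, on top of this basic idea – which is what the claim of being "more right" than a naive pattern match rests on.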
Indexing and Search facility
Discovery and exploration often involve search, and as Hadoop is well suited to store
large varieties and volumes of data, a robust search facility is needed. Included with the
BigInsights product is InfoSphere Data Explorer (technology that came into the IBM big
data portfolio through the Vivisimo acquisition). Search is all the rage these days, with
recent announcements by other Hadoop vendors in this space. BigInsights includes Data
Explorer for searching Hadoop data; however, it can be extended to search all data
assets. So unlike some of the announcements we’ve heard, I think we’re setting a higher
bar. What’s more, the indexing technology behind Data Explorer is positional-based as
opposed to vector-based – and that provides a lot of the differentiating benefits that are
needed in a big data world. Finally, Data Explorer understands security policies. If you’re
only granted access to a portion of a document and it comes up in your search, you still
only have access to the portion that was defined on the source systems. There’s so much
more to this exploration tool – such as automated topical clusters, a portal-like
development environment, and more. Let’s just say we have some serious leadership in
this space.
Integration tools
Data integration is a huge issue today and only gets more complicated with the
variety of big data that needs to be integrated into the enterprise. Mandatory
requirements for big data integration on diverse data sources include: 1) the ability to
extract data from any source, and cleanse and transform the data before moving it into
the Hadoop environment; 2) the ability to run the same data integration processes
outside the Hadoop environment and within the Hadoop environment, wherever most
appropriate (or perhaps run it on data that resides in HDFS without using MapReduce);
3) the ability to deliver unlimited data integration scalability both outside of the
Hadoop environment and within the Hadoop environment, wherever most
appropriate; and 4) the ability to overcome some of Hadoop’s limitations for big data
integration. Relying solely on hand coding within the Hadoop environment is not a
realistic or viable big data integration solution.
Gartner had some comments about how solely relying on Hadoop as an ETL engine
leads to more complexity and costs, and we agree. We think Hadoop has to be a
first-class citizen in such an environment, and it is with our Information Server product
set. Information Server is not just for ETL. It is a powerful parallel application
framework that can be used to build and deploy broad classes of big data applications.
For example, it includes a design canvas, metadata management, visual debuggers, the
ability to share reusable components, and more. So Information Server can
automatically generate MapReduce code, but it’s also smart enough to know if a
different execution environment might better serve the transformation logic. We like
to think of the work you do in Information Server as “Design the Flow Once – Run
and Scale Anywhere”.
I recently worked with a pharma client that ran blindly down the “Hadoop for
all ETL” path. One of the transformation flows had more than 2,000 lines of code; it
took 30 days to write. What’s more, it had no documentation and was difficult to reuse
and maintain. The exact same logic flow in Information Server was implemented
in just 2 days; it was built graphically and was self-documenting. Performance was
improved, and it was more reusable and maintainable.
Think about it – long before “ETL solely via Hadoop” it was “ETL via [programming
language of the year].” I saw clients do the same thing 10 years ago
with Perl scripts, and they ended up in the same place that clients using Hadoop
solely for ETL end up: higher costs and complexity. Hand coding was
replaced by commercial data integration tools for these very reasons, and we
think most customers do not want to go backwards in time by adopting the high costs,
risks, and limitations of hand coding for big data integration. We think our
Information Server offering is the only commercial data integration platform that
meets all of the requirements for big data integration outlined above.
2. How do you address security concerns?
To secure data stored in Hadoop, the BigInsights security architecture uses both private
and public networks to ensure that sensitive data, such as authentication information
and source data, is secured behind a firewall. It uses reverse proxies, LDAP and PAM
authentication options, options for on-disk encryption (including hardware, file
system, and application based implementations), and built-in role based authorization.
User roles are assigned distinct privileges so that critical Hadoop components cannot
be altered by users without access.
But there is more. After all, if you are storing data in an RDBMS or HDFS, you’re
still storing data. Yet so many don’t stop to think about it. We have a rich portfolio of
data security services that work with Hadoop (or soon will) – for example, the test data
management, data masking, and archiving services in our Optim family. In addition,
InfoSphere Guardium provides enterprise-level security auditing capabilities and
database activity monitoring (DAM) for every mainstream relational database I can
think of; well, don’t you need a DAM solution for data stored in Hadoop to support
the many current compliance mandates, in the same manner as for traditional structured
data sources? Guardium provides a number of pre-built reports that help you
understand who is running what job, whether they are authorized to run it, and more. If you
have ever tried to work with Hadoop’s ‘chatty’ protocol, you know how tedious this would
be to do on your own, if at all. Guardium makes it a snap.
3. How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?
While Hadoop itself is moving toward better support of multi-tenancy with Hadoop
2.0 and YARN, IBM already offers some unique capabilities in this area.
Specifically, BigInsights customers can deploy Platform Symphony, which extends
capabilities for multi-tenancy. Business application owners frequently have particular
SLAs that they must achieve. For this reason, many organizations run dedicated
cluster infrastructure for each application, because this is the only way they can be
assured of having resources when they need them. Platform Symphony solves this
problem with a production-proven multi-tenant architecture. It allows ownership to be
expressed on the grid, ensuring that each tenant is guaranteed a contracted SLA,
while also allowing resources to be shared dynamically so that idle resources are fully
utilized to the benefit of all.
For more complex use cases involving issues like multi-tenancy, IBM offers an
alternative file system to HDFS: GPFS-FPO (General Parallel File System: File
Placement Optimizer). One compelling GPFS capability for enabling Hadoop clusters
to support different data sets and workloads is the concept of storage volumes.
This enables you to dedicate specific hardware in your cluster to specific data sets. In
short, you could store colder data not requiring heavy analytics on less expensive
hardware. I like to call this “Blue Suited” Hadoop. There’s no need to change your
Hadoop applications if you use GPFS-FPO, and it offers a host of other benefits
such as snapshots, large-block I/O, removing the need for a dedicated NameNode,
and more. I want to stress that the choice is up to you. You can use BigInsights with all
the open source components – it’s “round-trippable” in that you can go back to open
source Hadoop whenever you like – or you can use some of these embrace-and-extend
capabilities that harden Hadoop for the enterprise. I think what separates us from
some of the other approaches is we are giving you the choice of the file system, and
the alternative we offer has a long-standing reputation in the high-performance
computing (HPC) arena.
4. What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?
Disaster Recovery
One focus area will be a shift from customers simply looking for High Availability to
a requirement for Disaster Recovery. Today, DR in Hadoop is primarily limited to
snapshots of HBase and HDFS datasets, plus distributed copy (distcp). Customers are starting
to ask for automated replication to a second site and automated cluster failover for
true disaster recovery. For customers who require disaster recovery today, GPFS-FPO
(IBM’s alternative to HDFS) includes flexible replication and recovery capabilities.
Data Governance and Security
As businesses begin to depend on Hadoop for more of their analytics, data
governance issues become increasingly critical. Regulatory compliance in many
industries and jurisdictions demands strict data access controls, the ability
to audit data reads, and the ability to track data lineage exactly – just to name a few
criteria. IBM is a leader in data governance, and has a rich portfolio to enable
organizations to exert tighter control over their data. This portfolio has been
significantly extended to factor in big data, effectively taming Hadoop.
GPFS can also enhance security by providing access control lists (ACLs) with file- and
disk-level access control, so that data can be isolated to privileged users, applications,
or even specific physical nodes in highly secured physical
environments. For example, if you have a set of data that can’t be deleted because of
a regulatory mandate, you can create that policy in GPFS. You can also create
immutability policies, and more.
Business-Friendly Analytics
Hadoop has for most of its history been the domain of parallel computing experts. But
as it’s becoming an increasingly popular platform for large-scale data storage and
processing, Hadoop needs to be accessible to users with a wide variety of skill sets.
BigInsights is responding to this need by providing BigSheets (the spreadsheet-based
analytics tool we mentioned in question 1), and an App-store style framework to
enable business users to run analytics functions, mix, match, and chain apps together,
and visualize their results.
5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?
One example of allowing data to be accessed directly in Hadoop is the industry shift
to improve SQL access to data stored in Hadoop; it’s ironic, isn’t it? The biggest
movement in the NoSQL space is … SQL! Despite all of the media coverage of
Hadoop’s ability to manage unstructured data, there is pent-up demand to use Hadoop
as a lower cost platform to store and query traditional structured data. IBM is taking a
unique approach here with the introduction of Big SQL, which is included in
BigInsights 2.1. Big SQL extends the value of data stored within Hive or HBase by
making it immediately query-able with a richer SQL interface than that provided by
HiveQL. Specifically, Big SQL provides full ANSI SQL-92 support for subqueries,
more SQL-99 OLAP aggregation functions and common table expressions, SQL:2003
windowed aggregate functions, and more. Together with the supplied ODBC and
JDBC drivers, this means that a broader selection of end-user query tools can
generate standard SQL and directly query Hadoop. Similarly, Big SQL provides
optimized SQL access to HBase with support for secondary indexes and predicate
pushdown. We think Big SQL is the ‘closest to the pin’ of all the SQL solutions out
there for Hadoop, and there are a number of them. Of course, Hive is shipped in
BigInsights, so the work that’s being done there is by default part of BigInsights. We
have some big surprises coming in this space, so stay close to it. But the underlying
theme here is that Hadoop needs SQL interfaces to help democratize it and all
vendors are shooting for a hole in one here.
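As a sketch of what this enables, here is a hypothetical Python client using an ODBC connection (via the open source pyodbc module) to run a windowed aggregate against a table in Hadoop; the DSN name, credentials, table, and columns are all invented for illustration, and any ODBC- or JDBC-capable tool could issue the same standard SQL:

```python
# A minimal sketch, not an official IBM sample: querying a Hive table through an
# ODBC driver from Python. The DSN ("BIGSQL"), table (web_sales), and columns are
# hypothetical; the point is that a standard SQL:2003 windowed aggregate can run
# directly against data in Hadoop from an ordinary ODBC-capable client.
import pyodbc

conn = pyodbc.connect("DSN=BIGSQL;UID=analyst;PWD=secret")  # pre-configured DSN
sql = """
    SELECT region,
           sale_date,
           SUM(amount) OVER (PARTITION BY region
                             ORDER BY sale_date
                             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS amt_7day
    FROM web_sales
    WHERE sale_date >= '2013-01-01'
"""
cursor = conn.cursor()
for region, sale_date, amt_7day in cursor.execute(sql):
    # Each row carries a 7-day rolling sum computed where the data lives.
    print(region, sale_date, amt_7day)
conn.close()
```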
6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?
Given that BigInsights includes analytic tools in addition to the open source Apache
Hadoop components, many of our customers are doing analytics in Hadoop. An
example of a customer doing analytics is Constant Contact Inc., which is using
BigInsights to analyze 35 billion annual emails to guide customers on the best dates and
times to send emails for maximum response. This has increased the
performance of their customers’ email campaigns by 15 to 25%.
A health bureau in Asia is using BigInsights to develop a centralized medical imaging
diagnostics solution. An estimated 80 percent of healthcare data is medical imaging
data, in particular radiology imaging. The medical imaging diagnostics platform is
expected to significantly improve patient healthcare by allowing physicians to exploit
the experience of other physicians in treating similar cases, and inferring the
prognosis and the outcome of treatments. It will also allow physicians to see
consensus opinions as well as differing alternatives, helping reduce the uncertainty
associated with diagnosis. In the long run, these capabilities will lower diagnostic
errors and improve the quality of care.
Another example is a global media firm concerned about piracy of their digital
content. The firm monitors social media sites (like Twitter, Facebook, and so on) to
detect unauthorized streaming of their content, quantify the annual revenue loss due
to piracy, and analyze trends. For them, the technical challenges involved processing
a wide variety of unstructured and semi-structured data.
The company selected BigInsights for its text analytics and scalability. The firm is
relying on IBM’s services and technical expertise to help it implement its aggressive
application requirements.
Vestas models weather to optimize the placement of turbines, maximizing the generation
and longevity of their turbines. They’ve been able to take a month out of preparation
and placement work with their models and reduce turbine placement identification
from weeks to hours using their 1,400+ node Hadoop cluster. If you were to take the
world wind history they store and capture it as HD TV, you’d be sitting down
and watching your television for 70 years to get through all the data they have – over
2.5 PB, growing to 6 PB. How impactful is their work? In 2012, they were
honored by Computerworld’s Honors Program in its Search for New Heroes for “the
innovative application of IT to reduce waste, conserve energy, and the creation of
new product to help solve global environment problems”. IBM is really proud to be a
part of that.
Having said that, a substantial number of customers are using BigInsights as a landing
zone or staging area for a data warehouse, meaning it is being deployed as a data
integration or ETL platform. In this regard our customers are seeing additional value
in our Information Server offering combined with BigInsights. Not all data
integration jobs are well suited for MapReduce, and Information Server has the added
advantage of being able to move and transform data from anywhere in the enterprise
and across the internet to the Hadoop environment. Information Server allows them to
build data integration flows quickly and easily, and deploy the same job both within
and external to the Hadoop environment, wherever most appropriate. It delivers the
data governance and operational and administrative management required for
enterprise-class big data integration.
7. What does it take to get a CIO to sign off on a Hadoop deployment?
We strive to establish a partnership with the customer and CIO to reduce risk and
ensure success in this new and exciting opportunity for their company. Because the
industry is still in the early adopter phase, IBM is focused on helping CIOs
understand the concrete financial and competitive advantages that Hadoop can bring
to the enterprise. Sometimes this involves some form of pilot project – but as
deployments increase across different industries, customers are starting to understand
the value right away. As an example, we are starting to see more large deals for
enterprise licenses of BigInsights.
8. What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?
For customers that want optimal “out of the box” performance without having to
worry about tuning parameters, we have announced PureData System for Hadoop, our
appliance offering that ships with BigInsights pre-integrated.
For customers who want better performance than Hadoop provides natively, we offer
Adaptive MapReduce, which optionally can replace the Hadoop scheduler with
Platform Symphony. IBM has completed significant big data benchmarking
with BigInsights and Platform Symphony. These benchmarks include the
SWIM benchmark (Statistical Workload Injector for MapReduce), which
represents real-world big data workloads and was developed by the University of
California at Berkeley in close cooperation with Facebook. This test provides rigorous
measurements of the performance of MapReduce systems using real industry
workloads. Platform Symphony Advanced Edition accelerated SWIM/Facebook
workload traces by approximately 6 times.
9. How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?
Our current release 2.1 of BigInsights ships with Hadoop V 1.1.1, and was released in
mid-2013, when Hadoop 2 was still in alpha state. Once Hadoop 2 is out of alpha/beta
state, and production ready, we will include it in a near-future release of BigInsights.
IBM strives to provide the most proven, stable versions of Apache Hadoop
components.
10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?
It depends on the use case. Earlier we discussed Constant Contact Inc., whose use of
BigInsights is driving revenue enhancement through text analytics.
An example of cost reduction is a large automotive manufacturer that has deployed
BigInsights as the landing zone for their EDW environment. In traditional terms, we
might think of the landing zone as the atomic level detail of the data warehouse, or
the system of record. It also serves as the ETL platform, enhanced with Information
Server for data integration across the enterprise. The cost savings result from
replicating only an active subset of the data to the EDW, thereby reducing the size and cost
of the EDW platform. So 100 percent of the data is stored on Hadoop, and about
40% gets copied to the warehouse. Over time, Hadoop also serves as the historical
archive for EDW data and remains query-able.
As Hadoop becomes more widely deployed in the enterprise, regardless of the use case, the ROI increases.
11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?
The deployments we are seeing are predominantly being done within IT, although
with an objective of better serving their customers. Once deployed, tools such as
BigSheets make it easier for end users by providing analytics capability without the
need to write MapReduce code.
12. How important will SQL become as a mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?
SQL access to Hadoop is already important, with a lot of the demand coming from
customers who are looking to reduce the costs of their data warehouse platform. IBM
introduced Big SQL earlier this year and this is an area where we will continue to
invest. It affects the broader Hadoop ecosystem by turning Hadoop into a first class
citizen of a modern enterprise data architecture and offloading some of the work
traditionally done by an EDW. EDW costs can be reduced by using Hadoop to store
the atomic-level detail, perform data transformations (rather than in-database), and
serve as a query-able archive for older data. IBM can provide a reference architecture
and assist customers looking to incorporate Hadoop into their existing data warehouse
environment in a cost effective manner.
13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?
Yes, it will extend the use cases for Hadoop. In fact, we have already seen the start of
that with the addition of SQL access to Hadoop and the move toward interactive SQL
queries. We expect to see a broader adoption of Hadoop as an EDW landing zone and
associated workloads.
14. What are some of the key implementation partners you have used?
[Answer skipped.]
15. What factors most affect YOUR Hadoop deployments (e.g., SSDs, memory size, and so on)? What are the barriers and opportunities to scale?
IBM is in a unique position of being able to leverage our vast portfolio of hardware
offerings for optimal Hadoop deployments. We are now shipping PureData System
for Hadoop, which takes all of the guesswork out of choosing the components and
configuring a cluster for Hadoop. We are leveraging what we have learned through
delivering the Netezza appliance, and the success of our PureData Systems to
simplify Hadoop deployments and improve time to value.
For customers who want more flexibility, we offer Hadoop hardware reference
architectures that can start small and grow big with options for those who are more
cost conscious, or for those who want the ultimate in performance. These reference
architectures are very prescriptive, defining the exact hardware components needed
to build the cluster; they are developed in IBM labs by our senior architects based on
our own testing and customer experience.
Follow Svetlana on Twitter @Sve_Sic