2013-08-26

To allow our readers to compare Hadoop distribution vendors side-by-side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions came from Catalyst attendees and from Gartner analysts.

Today, I am publishing responses from Jack Norris, Chief Marketing Officer at MapR.


1. How specifically are you addressing variety, not merely volume and velocity?

MapR has invested heavily in innovations to provide a unified data platform that can be used for a variety of data sources and workloads, from clickstream analysis to real-time applications that leverage sensor data.

The MapR platform for Hadoop also integrates a growing set of functions including MapReduce, file-based applications, interactive SQL, NoSQL databases, search and discovery, and real-time stream processing. With MapR, data does not need to be moved to specialized silos for processing; it can be processed in place.

This full range of applications and data sources benefits from MapR’s enterprise-grade platform and unified architecture for files and tables. The MapR platform provides high availability, data protection and disaster recovery to support mission-critical applications.

MapReduce:

MapR provides world-record performance for MapReduce operations on Hadoop. MapR holds the MinuteSort world record, sorting 1.5 TB of data in one minute; the previous Hadoop record was less than 600 GB. With an advanced architecture built in C/C++ that harnesses distributed metadata and an optimized shuffle process, MapR delivers consistently high performance.

File-Based Applications:

MapR is a 100% POSIX-compliant system that fully supports random read-write operations. Because MapR supports industry-standard NFS, users can mount a MapR cluster and execute any file-based application, written in any language, directly on the data residing in the cluster. All standard enterprise tools, including browsers, UNIX utilities, spreadsheets, and scripts, can access the cluster directly without any modifications.
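As an illustration, here is a minimal sketch of file-based access over such an NFS mount. The mount point (/mapr/my.cluster.com) and file paths are hypothetical placeholders, and any language with ordinary file I/O would work equally well:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class NfsClusterAccess {
    public static void main(String[] args) throws IOException {
        // Hypothetical mount point: the cluster is exposed as an
        // ordinary POSIX file system via NFS, so no Hadoop API is needed.
        String base = "/mapr/my.cluster.com/user/analyst";

        // Append a record with standard-library file I/O.
        try (FileWriter out = new FileWriter(base + "/events.log", true)) {
            out.write("2013-08-26,click,homepage\n");
        }

        // Read it back with the same standard calls.
        try (BufferedReader in = new BufferedReader(
                new FileReader(base + "/events.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```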

SQL:

There are a number of applications that support SQL access against data contained in MapR, including Hive, Hadapt and others. MapR is also spearheading the development of Apache Drill, which brings ANSI SQL capabilities to Hadoop. Apache Drill, inspired by Google’s Dremel project, delivers low-latency interactive query capability for large-scale distributed datasets. Apache Drill supports nested/hierarchical data structures and schema discovery, and is capable of working with NoSQL stores, Hadoop, and traditional RDBMSs. With ANSI SQL compatibility, Drill supports all of the standard tools that the enterprise uses to build and implement SQL queries.
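For example, an analyst’s tooling can reach Hive-managed data on the cluster through the standard HiveServer2 JDBC driver. A minimal sketch, in which the host, credentials, and clickstream table are hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver; host, port, user, and table
        // are placeholders for a real cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://cluster-node:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits " +
                 "FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```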

Database:

MapR has removed the trade-offs organizations face when looking to deploy a NoSQL solution. Specifically, MapR delivers ease of use, dependability and performance advantages for HBase applications. MapR provides scale, strong consistency, reliability and continuous low latency with an architecture that does not require compactions or background consistency checks. From a performance standpoint, MapR delivers over a million operations per second from just a 10-node cluster.
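To make this concrete, the sketch below uses the standard HBase client API of the era, which such applications are written against; the table path and column names are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table; MapR tables can also be addressed by a
        // file-system-style path.
        HTable table = new HTable(conf, "/user/analyst/metrics");
        try {
            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("sensor-42"));
            put.add(Bytes.toBytes("readings"), Bytes.toBytes("temp"),
                    Bytes.toBytes("21.5"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("sensor-42")));
            byte[] value = result.getValue(Bytes.toBytes("readings"),
                                           Bytes.toBytes("temp"));
            System.out.println("temp = " + Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}
```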

Search:

MapR is the first Hadoop distribution to integrate enterprise-grade search. On a single platform, customers can now perform predictive analytics, full search and discovery, and advanced database operations. The MapR enterprise-grade search capability works directly on Hadoop data but can also index and search standard files without any conversion or transformation. All search content and results are protected with enterprise-grade high availability and data protection, including snapshots and mirrors that enable a full restore of search capabilities.

By integrating the search technology of the industry leader, LucidWorks, MapR and its customers benefit from the added value that LucidWorks Search delivers in the areas of security, connectivity and user management for Apache Lucene/Solr.

Stream Processing:

MapR provides a dramatically simplified architecture for real-time stream computation engines such as Storm. Streaming data feeds can be written directly to the MapR platform for Hadoop for long-term storage and MapReduce processing. Because data streams can be written directly to the MapR cluster, administrators can eliminate queuing systems such as Kafka or Kestrel and implement publish-subscribe patterns within the data platform. Storm can then ‘tail’ a file to which it wishes to subscribe, and as soon as new data hits the file system, it is injected into the Storm topology. This allows for strong Storm/Hadoop interoperability and a unification and simplification of technologies on one platform.
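A minimal sketch of that ‘tail a file’ pattern, written against the standard Storm spout API; the file path is a hypothetical placeholder, and production code would also need offset checkpointing and error handling:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Spout that follows a file on the NFS-mounted cluster and emits each
// new line into the Storm topology as it appears.
public class FileTailSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private RandomAccessFile file;

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            // Producers write to this file via NFS; we subscribe by tailing it.
            file = new RandomAccessFile("/mapr/my.cluster.com/streams/events.log", "r");
            file.seek(file.length()); // start at the current end of file
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = file.readLine();
            if (line != null) {
                // New data has hit the file system: inject it into the topology.
                collector.emit(new Values(line));
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
```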

2.  How do you address security concerns?

MapR is pushing the envelope on Hadoop security. MapR integrates with Linux security (PAM) and works with any user directory: Active Directory, LDAP, NIS. This enables access control and selective administrative access for portions of the cluster. Only MapR supports logical volumes, so individual directories or groups of directories can be grouped into volumes. Not only access control but also data protection policies, quotas, and other management settings can be applied per volume.

MapR has also delivered strong wire-level authentication and encryption, currently in beta. This powerful capability includes both Kerberos and non-Kerberos options. MapR security features include fine-grained access control, with full POSIX permissions on files and directories; ACLs on tables, column families, columns, and cells; ACLs on MapReduce jobs and queues; and administrative ACLs on clusters and volumes.

3.  How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?

MapR includes advanced monitoring, management, isolation and security for Hadoop that leverage and build on volume support within the MapR Distribution. MapR clusters provide powerful features to logically partition a physical cluster to provide separate administration, data placement, job execution and network access. These capabilities enable organizations to meet the needs of multiple users, groups and applications within the same cluster.

MapR enables users to specify exactly which nodes will run each job, to take advantage of different performance profiles or limit jobs to specific physical locations. Powerful wildcard syntax lets users define groups of nodes and assign jobs to any combination of groups. Administrators can leverage MapReduce queues and ACLs to restrict specific users and groups to a subset of the nodes in the cluster. Administrators can also control data placement, enabling them to isolate specific datasets to a subset of the nodes.

4.   What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?

1) YARN – will help expand the use of non-MapReduce jobs on the Hadoop platform

2) HA (High Availability) – There has been some great work done, and more is planned, but there are still many issues to be addressed. These include the ability to automatically recover from multiple failures, and the automatic restarting of failed jobs so that tasks pick up where they left off.

3) Snapshots – This is a step in the right direction, but the current snapshots are akin to a copy function that is not coordinated across the cluster. In other words, there is no point-in-time consistency across the cluster. Many applications will not work properly, and any file that is open will not include its latest data in the snapshot.

5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?

The upcoming NFS support does not provide POSIX compliance and will not work with most applications. HDFS is an append-only file system that does not support random writes, while the NFS protocol can only be fully supported by a file system that does.

The gateway has to temporarily save all the data to its local disk (/tmp/.hdfs-nfs) prior to writing it to HDFS. This is needed because HDFS doesn’t support random writes, while the NFS client on many operating systems reorders write operations even when the application is writing sequentially. Sequential writes can therefore arrive at the NFS gateway out of order, and the gateway directory is used to stage out-of-order writes before writing to HDFS. This impacts performance and the ability to effectively support multiple users and read/write workloads.
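The staging logic can be illustrated with a small sketch (not the actual gateway code): writes keyed by file offset are buffered until the next contiguous byte range is available, because an append-only store can only accept bytes in order:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative reorder buffer: NFS WRITE requests may arrive out of
// order, but an append-only store only accepts the next contiguous bytes.
public class WriteReorderBuffer {
    private final TreeMap<Long, byte[]> pending = new TreeMap<>();
    private long nextOffset = 0;

    public interface Appender {
        void append(byte[] data);
    }

    // Called for each write request, in whatever order it arrives.
    public void write(long offset, byte[] data, Appender appender) {
        pending.put(offset, data);
        // Flush the longest contiguous prefix to the append-only store.
        Map.Entry<Long, byte[]> head;
        while ((head = pending.firstEntry()) != null
                && head.getKey() == nextOffset) {
            appender.append(head.getValue());
            nextOffset += head.getValue().length;
            pending.remove(head.getKey());
        }
        // Anything still in 'pending' sits in the staging area until
        // the missing earlier bytes arrive -- the cost described above.
    }
}
```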

6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?

Hadoop is used today for a wide variety of applications, and most of these involve analytics; some use cases, such as data warehouse offload, use Hadoop to perform ETL, moving processing and long-term data storage onto Hadoop. We have customers today that have transformed their business operations with our distribution for Hadoop. Some examples include redefining fraud detection, using new data sources to create new products and services, improving revenue optimization through deeper segmentation and recommendation engines, and integrating online and on-premise purchase paths for better customer targeting.

The biggest challenge to Big Data adoption is the need for organizations to abandon some long held assumptions about data analytics. These assumptions include:

The cost of analytics is exponential

Project success requires the definition of the questions you need answered ahead of time

Specialized analytic clusters are required for in-depth analysis

Complex models are the key to better analysis

Big Data, and specifically Hadoop, scales linearly by leveraging commodity servers in a cluster that can grow to thousands of nodes. If you need additional capacity, simply add more servers.

Hadoop supports a variety of data sources and doesn’t require an administrator to define or enforce a schema. Users have broad flexibility in the analytic and programming techniques they can employ, and they can change the level of granularity of their analysis. Unlike in a data warehouse environment, none of these analytic changes results in downtime to modify the system or redefine tables.

One of the interesting insights that Google has shared is that simple algorithms on large data sets outperform more complex models on smaller data sets. In other words, rather than investing time in complex model development, expand your data set first.

All of these changes make it much easier and faster to benefit from Hadoop than from other approaches, so the message to organizations is to deploy Hadoop to process and store data, and the analytics will follow.

7.  What does it take to get a CIO to sign off on a Hadoop deployment?

Given the advantages of Hadoop, it’s fairly easy to get sign-off on the initial deployment. The growth from there is driven by the proven benefits that the platform provides.

8.  What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?

This is an area that MapR has focused on, with underlying architecture improvements that deliver much higher performance (2-5 times) on the same hardware.

9. How much of Hadoop 2.0 do you support, given that it is still in alpha at Apache?

MapR fully participates in the open source community and includes open source packages as part of our distribution. That said, we test and harden packages and wait until they are production-ready before including them in our distribution. Hadoop 2.0 is not yet at the stage where we would release it as part of our distribution to customers.

10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?

We see tremendous payback on Hadoop deployments. Because MapR provides all of the HA, data protection, and standard file access delivered by enterprise storage environments at a much lower cost, we’re driving ROI on investments that pay back within the year. One of our large telecom clients generated an eight-figure ROI with their MapR deployment.

11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?

There’s a difference between test and lab development and production. We see a lot of experimentation outside of IT, often using cloud deployments, but on-premise production deployments are directed by IT. The experience of IT is helpful when organizations are trying to assess data protection requirements, SLAs, and integration procedures.

12. How important will SQL become as mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?

SQL is important, but the impact of Hadoop is in the next generation of analytics. SQL makes it easier for business analysts to interact with data contained in a Hadoop cluster, but SQL is only one piece: integration with search, machine learning, and HBase are important aspects, and the use cases and applications that drive business results go far beyond SQL. The ultimate query language that works across structured and unstructured data is search. In a sense, it is an unstructured query that is simple for a broad set of users to understand. A search index paired with machine learning constructs is a powerful tool for driving business results.

13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?

Absolutely, but YARN does not address the fundamental limitation of Hadoop, which is the data storage layer (HDFS). Even with YARN opening up the types of jobs that can be run on Hadoop, it does nothing to address the batch load requirements of the write-once storage layer. At MapR we are excited about running YARN with our next-generation file system, which provides POSIX random read/write support.

14. What are some of the key implementation partners you have used?

We’ve used integrators that are focused on data as well as large system integrators.

15. What factors most affect YOUR Hadoop deployments (eg SSDs; memory size.. and so on)? What are the barriers and opportunities to scale?

MapR has made significant innovations to reduce the footprint and memory requirements for Hadoop processing. We take advantage of greater disk density per node, and many of our customers have 16 3TB drives on each node. We also take advantage of next-generation flash and SSD storage, with performance in comparisons up to 7X faster than other distributions.

Jack Norris, Chief Marketing Officer at MapR

Jack has over 20 years of enterprise software marketing experience. He has demonstrated success from defining new markets for small companies to increasing sales of new products for large public companies. Jack’s broad experience includes launching and establishing analytic, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity (now EMC), Brio Technology, SQRIBE, and Bain and Company. Jack earned an MBA from UCLA Anderson and a BA in Economics with honors and distinction from Stanford University.

 

Follow Svetlana on Twitter @Sve_Sic
