2016-02-15

Hadoop is now enterprise software.

There, I said it. I know lots of readers in the IT space still look at Hadoop as an interloper, or worse, part of the rogue IT problem. But better than 50% of the enterprises we spoke with are running Hadoop somewhere within the organization. A small percentage are running Mongo, Cassandra or Riak in parallel with Hadoop, for specific projects. Discussions on what ‘big data’ is, if it is a viable technology, or even if open source can be considered ‘enterprise software’ are long past. What began as proof of concept projects have matured into critical application services. And with that change, IT teams are now tasked with getting a handle on Hadoop security, to which they response with questions like “How do I secure Hadoop?” and “How do I map existing data governance policies to NoSQL databases?”

Security vendors will tell you both attacks on corporate IT systems and data breaches are prevalent, so with gobs of data under management, Hadoop provides a tempting target for ‘Hackers’. All of which is true, but as of today, there really have not been major data breaches where Hadoop play a part of the story. As such this sort of ‘FUD’ carries little weight with IT operations. But make no mistake, security is a requirement! As sensitive information, customer data, medical histories, intellectual property and just about every type of data used in enterprise computing is now commonly used in Hadoop clusters, the ‘C’ word (i.e.: Compliance) has become part of their daily vocabulary. One of the big changes we’ve seen in the last couple of years with Hadoop becoming business critical infrastructure, and another – directly cause by the first – is IT is being tasked with bringing existing clusters in line with enterprise compliance requirements.

This is somewhat challenging as a fresh install of Hadoop suffers all the same weak points traditional IT systems have, so it takes work to get security set up and the reports being created. For clusters that are already up and running, no need to choose technologies and a deployment roadmap that does not upset ongoing operations. On top of that, there is the additional challenge that the in-house tools you use to secure things like SAP, or the SIEM infrastructure you use for compliance reporting, may not be suitable when it comes to NoSQL.

Building security into the cluster

The number of security solutions that are compatible – if not outright built for – Hadoop is the biggest change since 2012. All of the major security pillars – authentication, authorization, encryption, key management and configuration management – are covered and the tools are viable. Most of the advancement have come from the firms that provide enterprise distributions of Hadoop. They have built, and in many cases contributed back to the open source community, security tools that accomplish the basics of cluster security. When you look at the threat-response models introduced in the previous two posts, every compensating security control is now available. Better still, they have done a lot of the integration legwork for services like Kerberos, taking a lot of the pain out of deployments.

Here are some of the components and functions that were not available – or not viable – in 2012.

LDAP/AD Integration – Technically AD and LDAP integration were available in 2012, but these services have both been advanced, and are easier to integrate than before. In fact, this area has received the most attention, and integration is as simple as a setup wizard with some of the commercial platforms. The benefits are obvious, as firms can leverage existing access and authorization schemes, and defer user and role management to external sources.

Apache Ranger – Ranger is one of the more interesting technologies to come available, and it closes the biggest gap: Module security policies and configuration management. It provides a tool for cluster administrators to set policies for different modules like Hive, Kafka, HBase or Yarn. What’s more, those policies are in context to the module, so it sets policies for files and directories when in HDSF, SQL policies when in Hive, and so on. This helps with data governance and compliance as administrators set how a cluster should be used, or how data is to be accessed, in ways that simple role based access controls cannot.

Apache Knox – You can think of Knox in it’s simplest form as a Hadoop firewall. More correctly, it is an API gateway. It handles HTTP and REST-ful requests, enforcing authentication and usage policies of inbound requests, and blocking everything else. Knox can be used as a virtual moat’ around a cluster, or used with network segmentation to further reduce network attack surface.

Apache Atlas – Atlas is a proposed open source governence framework for Hadoop. It allows for annotation of files and tables, set relationships between data sets, and even import meta-data from other sources. These features are helpful for reporting, data discovery and for controlling access. Atlas is new and we expect it to see significant maturing in coming years, but for now it offers some valuable tools for basic data governance and reporting.

Apache Ambari – Ambari is a facility for provisioning and managing Hadoop clusters. It helps admins set configurations and propagate changes to the entire cluster. During our interviews we we only spoke to two firms using this capability, but we received positive feedback by both. Additionally we spoke with a handful of companies who had written their own configuration and launch scripts, with pre-deployment validation checks, usually for cloud and virtual machine deployments. This later approach was more time consuming to create, but offered greater capabilities, with each function orchestrated within IT operational processes (e.g.: continuous deployment, failure recovery, DevOps). For most, Ambari’s ability to get you up and running quickly and provide consistent cluster management is a big win and a suitable choice.

Monitoring – Hive, PIQL, Impala, Spark SQL and similar modules offer SQL or pseudo-SQL syntax. This means that the activity monitoring, dynamic masking, redaction and tokenization technologies originally developed for relational platforms can be leveraged by Hadoop. The result is we can both alert and block on misuse, or provide fine-grained authorization (i.e.: beyond role based access) by altering queries or query result sets based upon users metadata. And, as these technologies are examining queries, they offer an application-centric view of events that is not always capture in log files.

Your first step in addressing these compliance concerns is mapping your existing governance requirements to a Hadoop cluster, then deciding on suitable technologies to meet data and IT security requirements. Next you will need to deploy technologies that provide security and reporting functions, and setting up the policies to enforce usage or detect misuse. Since 2012, many technologies have come available which can address common threats without killing scalability and performance, so there is no need to reinvent the wheel. But you will need to assemble these technologies into complete program, so there is work to be done. Let’s sketch out some over-arching strategies, then provide a basic roadmap to get there.

Security Models

Walled Garden



The most common approach today is a ‘walled garden’ security model. You can think of this as the ‘moat’ model used for mainframe security for many years; place the entire cluster onto its own network, and tightly control logical access through firewalls or API gateways, and access controls for user or app authentication. In practice with this model there is virtually no security within the NoSQL cluster itself, as data and database security is dependent upon the outer ‘protective shell’ of the network and applications that surrounds the database. The advantage is simplicity; any firm can implement this model with existing tools and skills without performance or functional degradation to the database. On the downside, security is fragile; once a failure of the firewall or application occurs, the system is exposed. This model also does not provide means to prevent credentialed users from misusing the system or viewing/modifying data stored in the cluster. For organizations not particularly worried about security, this is a simple, cost effective approach.

Cluster Security

Unlike relational databases which function like a black-box, Hadoop exposes it’s skeleton to the network. Inter-node communication, replication and other cluster functions occur between many machines, through different types of services. Securing a Hadoop cluster is more akin to securing an entire data center than a traditional database. That said, for the best possible protection, building security into cluster operations is critical. This approach leverages security tools built in – or third-party products integrated into – the NoSQL cluster. Security in this case is systemic and built to be part of the base cluster function.



Tools may include SSL/TLS for secure communication, Kerberos for node authentication, transparent encryption for data at rest security, and identity and authorization (e.g.: groups, roles) management just to name a few. This approach is more difficult as there are a lot more moving parts and areas where some skill are required. The setup of several security functions targeted at specific risks to the database infrastructure takes time. And, as third-party security tools are often required, typically more expensive. However, it does secure a cluster from attackers, rogue admins, and the occasional witless application programmer. It’s the most effective, and most comprehensive, approach to Hadoop security.

Data Centric Security

Big data systems typically share data from dozens of sources. As firms do not always know where their data is going, or what security controls are in place when it is stored, they’ve taken steps to protect the data regardless of where it is used. This model is called data-centric security because the controls are part of the data, or in some cases, the presentation layer of a database.



The three basic tools that support data-centric security are tokenization, masking and data element encryption. You can think of tokenization just like a subway or arcade token; it has no cash value but can be used to ride the train or play a game. In this case a data token is provided in lieu of sensitive data – it’s commonly used in credit card processing systems to substitute credit card numbers. The token has no intrinsic value other than as a reference to the original value in some other (e.g.: more secure) database. Masking is another very common tool used to protect data elements while retaining the aggregate value of a data set. For example, firms may substitute an individual’s social security number with a random number, or change their name randomly from a phone book, or replace a date value with a random date within some range. In this way the original – sensitive – data value is removed entirely from query results, but the value of the data set is preserved for analysis. Finally, data elements can be encrypted and passed without fear of compromise; only legitimate users with the right encryption keys can view the original value.

The data-centric security model provides a great deal of security when the systems that process data cannot be fully trusted. And many enterprises, given the immaturity of the technology, do not fully trust big data clusters to protect information. But a data-centric security model requires careful planning and tool selection, as it’s more about information lifecycle management. You define the controls over what data can be moved, and what protection must be applied before it is moved. Short of deleting sensitive data, this is the best model when you must populate a big data cluster for analysis work but cannot guarantee security.

These models are very helpful for conceptualizing how you want to approach cluster security. And they are really helpful when trying to get a handle on resource allocation; what approach is your IT team comfortable managing and what tools do you have the funds to acquire? That said, the reality is that firms no longer wholly adhere to any single model. They use a combination of two. For some firms we interviewed they used application gateways to validate requests, and they use IAM and transparent encryption to provide administrative Segregation of Duties on the back end. In another case, the highly multi-tenent nature of the cluster meant they relied heavily on TLS security for session privacy, and implemented dynamic controls (e.g.: masking, tokenization and redaction) for fine grained controls over data.

In our next post we will close this series with a set of succinct technical recommendations which help for all use cases.

- Adrian Lane
(0) Comments
Subscribe to our daily email digest

Show more