2014-11-21

We recently hosted “Innovation Without Compromise: The Challenges of Securing Big Data,” a roundtable discussion with IDC, Visa, and Cloudera on the changing world of big data security. During the live discussion we received more questions from the audience than we had time to answer.

To continue the discussion on security in Hadoop, Eddie Garcia (Chief Security Architect, Cloudera), Carl Olofson (Research Vice President, IDC), and Michael Versace (Global Research Director, IDC) have provided their answers to the additional audience questions.

1. Are data management activities (e.g. data governance, data lineage, data quality, and/or master data management) a necessary or required element for properly securing data?

Carl Olofson –

Yes, absolutely; and they are listed in the right order. As soon as you take data into a Hadoop cluster, you are its custodian and must know what it is and what needs to be secured. If the data is being used for any detailed business purpose, its lineage (where it came from) is important, as is general data quality enforcement. Data quality is critical if the resulting calculations are to be believed. Master data management applies if the data relates to other data managed in the environment, and disambiguation, reconciliation, or harmonization with such data is necessary.

Michael Versace -

These activities are fundamental IT activities – a necessity for reliable, serviceable, and available IT systems and business operations. Data quality and lineage are business, operational, and/or technical outcomes or goals. Data governance and master data management are techniques used to achieve these outcomes. As soon as you take data into a Hadoop cluster, for example, you are its custodian if in fact you’ve declared this data as a corporate asset. Declaration is the first step in data management. Once you declare data as a company asset, you must know what it is and how it needs to be secured and protected (part of data classification, performed after data is declared an asset). If the data is being used for any detailed business purpose, data quality and lineage (integrity, authenticity, and source system) must be enabled. Data quality is critical if the resulting calculations are to be believed. Master data management applies if the data relates to other data managed in the environment, and disambiguation, reconciliation, or harmonization with such data is necessary.

2. Since Hadoop is open source, doesn’t that make it more exposed to security threats?

Carl Olofson -

The first line of defense for Hadoop is at the operating system level, since access to the cluster and its nodes’ files (HDFS is normally deployed on internal or direct-attached storage) is controlled by OS security. Getting past that means all the data is vulnerable to anyone who can write MapReduce code or simply read a sequential file. The fact that Hadoop is open source neither increases nor diminishes that vulnerability, but Hadoop distributors can provide additional security measures, including encryption and access control based on standard protocols, that lessen the risk. That said, Hadoop data is no more vulnerable than data kept on any other file system.

3. With so many new projects appearing in Hadoop, how can security keep up?

Carl Olofson -

Good question. It is important to have general security policies in place, including definitions of users, groups, and privileges, and a team for administering those definitions. It is also important to have tools that let you apply those policies to Hadoop clusters without a lot of duplicated effort. Finally, a methodology that emphasizes the analysis and classification of incoming data for security purposes is key, so that each such effort is not a “one-off.”

4. How does Cloudera’s solution help organizations meet data privacy requirements, particularly as countries change their regulations?

Eddie Garcia -

That is a two-part question – there is a technology aspect and a non-technology aspect. From a technology perspective, Cloudera enables organizations to meet the security requirements of confidentiality, integrity, and availability. This means only users who should have access to data may access it, and data cannot be tampered with or modified – all while providing high availability and scalability. These are the basic building blocks that allow organizations to store data and meet data privacy requirements.

However, it is up to an organization to define and abide by its internal processes to ensure privacy, which can go beyond just protecting the data and into what constitutes acceptable use of the data. This is the non-technology aspect. For example, an organization may store customer data as part of its business, but an employee may be using this data for outside research or sharing it with partners, which falls outside the original purpose for which it was collected.

5. What is a typical security design and implementation?

Eddie Garcia -

A typical Hadoop security design starts with locking down the operating system and Java. The next step is to enable Kerberos and tie it to a centralized user identity system such as Active Directory or LDAP to provide authentication for Hadoop. With a now-hardened Hadoop cluster, you can then set up more granular permissions to data on a need-to-know basis. Tools like Apache Sentry and Cloudera Navigator provide additional security layers for unified authorization and comprehensive audits. More details can be found in the blog series, “Bigger Data, Smaller Problems: Taking Hadoop Security to the Next Level.”
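
As a rough illustration of the Kerberos step, here is a minimal Java sketch of a client authenticating to a Kerberized cluster through Hadoop’s UserGroupInformation API before touching HDFS. The principal name, keytab path, and HDFS path are hypothetical placeholders, and the cluster is assumed to already be configured for Kerberos.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHdfsClient {
    public static void main(String[] args) throws Exception {
        // Cluster configuration; hadoop.security.authentication must be "kerberos".
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate with a service principal and keytab (hypothetical values).
        UserGroupInformation.loginUserFromKeytab(
                "etl-svc@EXAMPLE.COM", "/etc/security/keytabs/etl-svc.keytab");

        // All subsequent HDFS calls run as the authenticated principal.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/data/secure")));
    }
}
```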

6. Do you suggest using a firewall to secure the Hadoop perimeter?

Eddie Garcia -

Yes, we suggest using not just one firewall but several. On each individual server, the built-in Linux firewall should be enabled to allow traffic only to the appropriate ports and services; all other ports and services should be disabled. iptables is built into most Linux distributions and provides this capability.

Additionally, it is best to run the cluster itself in its own sub-network, accepting traffic only from pre-defined, user-facing edge nodes.

7. How does data modeling relate to security?

Carl Olofson -

Data modeling is a means of better understanding the data and applying security to it. If the model includes some concept of inheritance (“is-a” relationships, as opposed to “has-a” relationships), security is easier to manage because it can be applied at the higher, abstract layer.
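
As an illustrative sketch only (not a specific Cloudera or Hadoop feature), the toy Java example below shows the idea: an access rule attached to an abstract type is inherited by every concrete subtype, so security is defined once at the higher layer. The class and group names are hypothetical.

```java
import java.util.Map;
import java.util.Set;

// "Is-a" relationships let one rule at the abstract layer cover all subtypes.
abstract class FinancialRecord {}            // abstract layer: secure once here
class Invoice extends FinancialRecord {}     // is-a FinancialRecord
class Payment extends FinancialRecord {}     // is-a FinancialRecord

class TypeBasedAccessPolicy {
    // A single rule keyed on the abstract type.
    private final Map<Class<?>, Set<String>> allowedGroups =
            Map.of(FinancialRecord.class, Set.of("finance-analysts"));

    boolean canRead(Object rec, String group) {
        // Walk up the type hierarchy until a rule is found.
        for (Class<?> c = rec.getClass(); c != null; c = c.getSuperclass()) {
            Set<String> groups = allowedGroups.get(c);
            if (groups != null) {
                return groups.contains(group);
            }
        }
        return false; // deny by default
    }
}
```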

8. How do you secure data without schemas or structures?

Eddie Garcia -

To secure data, you start with the basics of permissions and access control. This ensures that users who don’t have access to the data cannot view it. To further protect data, you must also encrypt it at rest. Together, permissions and at-rest encryption protect all of your data, whether or not it has a schema.

When you need to enable access to portions of data that aren’t defined by a schema or structure, there are a couple of methods you can use. One method is tokenizing or masking the data: using partner software, you can hide portions of the data from select users. Another method is to run jobs that filter out the sensitive portions and create redacted versions that can be shared with users. This method is best used when there is little concern about data duplication within your Hadoop environment.
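
As a minimal sketch of the redaction approach, the Java snippet below masks values matching a card-number-like pattern before a shareable copy is written. The pattern and masking format are hypothetical, and in practice this logic would typically run inside a batch job (for example, MapReduce or Spark) over the raw files rather than in memory.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Redactor {
    // Hypothetical pattern: anything shaped like a 16-digit card number.
    private static final Pattern CARD =
            Pattern.compile("\\b\\d{4}(?:[ -]?\\d{4}){3}\\b");

    // Replace sensitive matches with a fixed mask.
    public static String redactLine(String line) {
        return CARD.matcher(line).replaceAll("XXXX-XXXX-XXXX-XXXX");
    }

    // Produce a redacted copy of unstructured text, line by line.
    public static List<String> redact(List<String> rawLines) {
        return rawLines.stream().map(Redactor::redactLine).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(redactLine("order 42 paid with 4111 1111 1111 1111 on 2014-11-01"));
        // -> order 42 paid with XXXX-XXXX-XXXX-XXXX on 2014-11-01
    }
}
```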

9. What are some differences in security when hosting data on-behalf of a customer, such as with multi-tenancy?

Carl Olofson -

The data must be secured, so that even your own people can’t see it; it belongs to the customer. If you are called upon to develop applications for the customer, you need to generate test data or mask sensitive data before using it for such a purpose. In a multi-tenancy scenario, tenants may not see each other’s data or even know of each other’s existence. This is not normally a problem since a Hadoop cluster is deployed on dedicated resources. If you are using a virtualized approach, it must still operate in such a way that the tenant’s view of the cluster is as a single, stand-alone cluster that is not part of anything else.

10. How does Cloudera address fine-grained access at the HDFS level, while still addressing data sharing and data privacy requirements without splitting files?

Eddie Garcia -

HDFS supports POSIX-style permissions for fine-grained access at the HDFS level. Permissions are assigned to owner, group, and others, and for each of these user categories permissions can be set for read, write, and execute.

For example, a directory may be set with owner read/write and group read/write permissions, containing files that are owner read/write and group read-only.
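
Assuming the standard Hadoop Java API, a sketch of setting such permissions programmatically might look like the following; the paths are hypothetical, and the same effect can be achieved from the command line with hdfs dfs -chmod.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Directory: owner and group get full access, others get nothing (mode 770).
        Path dir = new Path("/data/finance");            // hypothetical directory
        fs.mkdirs(dir, new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.NONE));

        // File inside it: owner read/write, group read-only, others none (mode 640).
        Path file = new Path(dir, "q3-results.csv");     // hypothetical file
        fs.create(file).close();
        fs.setPermission(file, new FsPermission(
                FsAction.READ_WRITE, FsAction.READ, FsAction.NONE));
    }
}
```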

One current limitation is fine-grained permissions within a structured data file in HDFS, for example protecting individual name-value pairs inside a JSON or XML file. While this is not a common use case, we are working on technology to enable it in the future.

11. When will there be access control at the HDFS level?

Eddie Garcia –

Access control exists today for files and directories in HDFS, and access control for subsets of data within a file is on the roadmap.

12. How does Cloudera address auditing?

Eddie Garcia -

Cloudera Navigator is an integrated part of Cloudera’s enterprise data hub and is the only end-to-end data governance tool for Hadoop. Navigator provides comprehensive auditing, data lineage, and data discovery for data governance. All log files from the Hadoop cluster – including Hive, Impala, and Flume – are automatically collected and available in Navigator for simple search and audit.

13. Is data encryption available in Cloudera’s enterprise data hub?

Eddie Garcia –

Yes, Cloudera’s enterprise data hub offering includes Cloudera Navigator Encrypt and Key Trustee, which provide at-rest encryption and enterprise-grade key management. Navigator Encrypt can be used on HDFS data nodes as well as on other files or directories where sensitive data may reside, including local temporary logs or spill files. Navigator Key Trustee can not only protect encryption keys but also other sensitive Hadoop security assets such as SSL certificates.

14. After Kerberos is enabled, are there any issues with client access to Impala and Hive? What about with third-party clients like Informatica and SAS?

Eddie Garcia –

Impala and Hive both work with Kerberos, as do many third-party clients. Cloudera has over 1,300 partners, and we offer certification programs to make sure their solutions integrate well with Cloudera’s platform, including Kerberos-enabled deployments. For more information on our partners and to see which have certified solutions, visit the Partner Page.
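
As one hedged example of what a Kerberos-aware client looks like, the Java snippet below opens a JDBC connection to a Kerberized HiveServer2 by including the service principal in the connection URL. The host, database, and principal are hypothetical, and the client is assumed to already hold a valid Kerberos ticket (for example, via kinit) with the Hive JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class KerberizedHiveClient {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (needed for older, pre-JDBC-4 drivers).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // The "principal" parameter tells the driver to authenticate with Kerberos.
        String url = "jdbc:hive2://hs2.example.com:10000/default;"
                   + "principal=hive/_HOST@EXAMPLE.COM";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```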

We hope you found these answers helpful. For more information on Cloudera’s comprehensive, compliance-ready security and governance solution, visit cloudera.com/security.
