This blog was penned by Dr. Geoffrey Malafsky, Founder of the PSIKORS Institute and Phasic Systems, Inc.
We have completed our TechLab series with Cloudera. Its objective was to explore the ability of Hadoop in general, and Cloudera’s distribution in particular, to meet the growing need for rapid, secure, adaptive merging and correction of core corporate data. I call this Corporate Small Data, which is:
“Structured data that is the fuel of an organization’s main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence. This is Small Data relative to the much ballyhooed Big Data of the Terabyte range.”[1]
Corporate Small Data does not include the predominant Big Data examples, which are almost all stochastic use cases. These can succeed even if there is error in the source data and uncertainty in the results, since the business objective is getting trends or making general associations. In stark contrast are deterministic use cases, where the ramifications for wrong results are severely negative, such as executive decision making, accounting, risk management, regulatory compliance, and security.
I introduced the concept of Data Normalization, which is drawn from science and hence is a Data Science approach to solving the corporate data challenge. In this sense, Data Normalization is not the same as normalization in data modeling (e.g. third normal form), but attacks the larger problem since it “combines subject matter knowledge, governance, business rules, and raw data. … Indeed, the types of challenges in corporate Small Data are solved with an order of magnitude less expense, time, and organizational friction and with much higher complexity in many scientific and engineering fields.”[2] Data Normalization is similar to normalizing statistics, where the objective is to directly compare values by eliminating errors, biases, and scale differences. The term is also used in financial regulation, where it comes with an ever-changing set of policies and rules[3]. The notion appears in real science studies as well, where normalization constrains calculations and thereby makes them solvable (for those interested, I refer to normalizing the wavefunction, which is at the heart of quantum mechanics). Doing this successfully requires a more complete, evolvable, and integrated knowledge of all aspects of a challenge, instead of a segmented approach where individual parts are handed off without real knowledge exchange across groups.
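To make the statistical sense of normalization concrete, here is a minimal sketch in Java of min-max rescaling, one common way to remove scale differences so values can be compared directly. This is only a toy illustration of the statistical idea, not the PSIKORS method or a full corporate Data Normalization process; the class name and figures are invented for the example.

```java
import java.util.Arrays;

/** Toy illustration of statistical normalization: rescale values to [0, 1]
 *  so that series measured on different scales can be compared directly.
 *  This is a sketch of the statistical idea only, not a corporate
 *  Data Normalization process. */
public class MinMaxNormalize {

    /** Returns a copy of the input rescaled to the range [0, 1]. */
    static double[] normalize(double[] values) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // If all values are equal, map them to 0 to avoid division by zero.
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical example: the same quantity reported on two different scales.
        double[] revenueInDollars = {1_200_000, 950_000, 1_750_000};
        double[] revenueInThousands = {1200, 950, 1750};

        // After normalization the two series are directly comparable.
        System.out.println(Arrays.toString(normalize(revenueInDollars)));
        System.out.println(Arrays.toString(normalize(revenueInThousands)));
    }
}
```

Real Data Normalization, as described in this series, layers subject matter knowledge, governance, and business rules on top of mechanical steps like this one.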
Normalizing data needs adjustable and powerful computing tools. “A critical success factor is enabling iterative, variable, and transparent results tuned to the personal and organizational work tempo of analysts, managers, and business product delivery. In almost all mission-critical activities, the specific requirements of the business on the data environment are neither static nor known at a level of detail sufficient to supply traditional tools and methods. This leads to the accursed business-technical organizational chasm. Into this fray comes a little noticed aspect of Hadoop: easily customizable parallel computing. We all know Hadoop provides parallel processing of distributed data. That is why it was built. But, I am talking about real computations (not trivial things like Word Count) that a real CFO needs to have done, done right, done now, and then redone tomorrow. This is where Data Normalization is most beneficial.”[4]
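To make “easily customizable parallel computing” concrete, below is a minimal MapReduce sketch in Java of the kind of non-trivial aggregate a CFO might actually ask for: total transaction amount per cost center, computed in parallel over CSV files stored in HDFS. The class names, field layout, and paths are assumptions for illustration; a production job would add validation, adjudication rules, and security.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hypothetical sketch: total transaction amount per cost center.
 *  Assumes CSV lines of the form: transactionId,costCenter,amount */
public class CostCenterTotals {

    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 3) {
                return; // skip malformed lines; a real job would log or quarantine them
            }
            try {
                double amount = Double.parseDouble(fields[2].trim());
                context.write(new Text(fields[1].trim()), new DoubleWritable(amount));
            } catch (NumberFormatException e) {
                // skip lines whose amount field is not numeric
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            context.write(key, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cost-center-totals");
        job.setJarByClass(CostCenterTotals.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, such a job can be submitted to YARN (or, as discussed below, through HUE’s Job Designer and Oozie) and rerun tomorrow against corrected data without changing the surrounding environment.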
The ultimate goal of data management in general, and Data Normalization specifically, is to enable meaningful analysis that can then be used for human decision making, secure and trusted software applications, and meeting legal and regulatory compliance. Data for data’s sake is not the goal. It is the adjective “meaningful” that should be embraced. Making it meaningful brings in another important verb: adjudicate, which means “to make an official decision about who is right in a dispute.”[5] All of the different uses, definitions, assumed meanings, and commonly unknown semantics of real data make this imperative. Simple metadata management, governance, and especially semantic technology cannot make progress if they are divorced from a combined view of design (i.e. architecture) and actual data values by a combined group of people. Doubters exist; refuters yell; tools are bought; and still no progress is made.
To reiterate from earlier in this blog series, the first and most important question is: what is a meaningful analysis, and how does it differ from other analyses? A meaningful analytic product is one that presents information in a concise and easy-to-understand manner, where the underlying data has gone through an iterative series of investigation, normalization, review, and, most importantly, adjudication. Perfection is not required, obviously, but it is sought.
Cloudera’s Hadoop distribution provides a complete management and execution environment for parallel processing and very large data storage. This seems perfectly suited to enabling the iterative, constantly updated, multi-method analysis needed to explore, understand, and digest data into meaningful products, all within the daily tempo of modern organizations and analysts.
Let us review what we have done and the results from the TechLab. In the spirit of a real educational laboratory, I will establish the grading metrics and then assess what I learned working with Cloudera’s Hadoop distribution. The metrics are grouped into the following:
Environment, configuration, and documentation (ECD)
Hadoop operations (HO)
Each item below is graded with five fields: the group it belongs to, the item assessed, a description of the success criteria, my comment from the lab work, and the grade.
Group: ECD
Item: Flexibility, stability, security
Description: The required working environment has multiple options commonly used in corporations that are known to be stable and secure.
Comment: Runs on Linux with a few common distributions and versions, but is limited to those exact versions. This is due to the pace of change in Linux, but it does limit the ability to deploy Hadoop. Linux is proven stable and secure for web servers. Our testing showed Linux versions with significant hardware conflicts that required special fixes and, in some cases, made over-the-counter components unusable.
Grade: B
Group: ECD
Item: Installation, configuration, documentation
Description: Software components can be installed either automatically or with easy manual steps, with sufficient documentation to guide specific steps and overcome known and likely technical problems.
Comment: General Hadoop distributions have good basic documentation for their individual packages, but there are many of them. Cloudera adds a coordinating management system that both installs the components and continuously manages them, and it extends the documentation beyond the basic Apache project sites with training, forums, and updated help manuals. Some documentation is slightly out of date and required finding information outside Cloudera.
Grade: A-
Group: HO
Item: HDFS
Description: HDFS can be configured, and reconfigured, with ease. Monitoring of real-time actions and status is provided and yields clear, understandable information. Tools enable getting, putting, and managing files in a manner similar to major operating systems like Windows and Linux.
Comment: The default Hadoop management web sites are provided in the base Hadoop installation. Cloudera Manager adds a richer coordination and management capability for all components, including a more centralized way to monitor HDFS configuration and status and to get right to the actual configuration files on all nodes. The HUE web site enables integrated file browsing, opening, uploading, downloading, and folder management, along with functions for other parts of the Hadoop environment.
Grade: A
Group: HO
Item: YARN
Description: YARN operations can be configured, managed, started, and monitored with ease and clarity. Custom Java jobs can be run without being limited to MapReduce functions.
Comment: Processing custom Java code in YARN is still a more complicated task since it requires using and understanding Hadoop-specific libraries and functions. The YARN ResourceManager architecture is a better and more flexible framework than Hadoop 1, but it still requires expertise found outside most corporate environments. Cloudera’s HUE web site provides a Job Designer that offers a simpler way to submit custom Java, as well as other job types, into Hadoop using the Oozie workflow component, with real-time monitoring and management of previously run jobs. However, the documentation of this specific capability is weak, and the syntax for referencing files within the HUE scope relative to HDFS and local disk is neither obvious nor described well. Once set up, the HUE functions are a great aid beyond base Hadoop.
Grade: A-
Group: HO
Item: Data queries
Description: Data stored in Hadoop data stores can be queried as efficiently as traditional data systems using standard, or very close to standard, commands and APIs.
Comment: Detailed timing tests were done with the same data in Hive, Impala, and an external RDBMS. Impala was much faster than the others, and was 20x faster than the RDBMS. This test did not tune the configuration and hardware to maximize the performance of each data store, but it did provide the same level of small-cluster resources to each. An interesting result was that the fastest speed came from plain text files rather than specialized compressed formats like the newest one, Parquet. Importantly, compared to a traditional RDBMS, it was very easy to change among file formats and partitions and to add externally modified data files into the data store. (A sketch of querying Impala from Java appears after this table.)
Grade: A+
Group: HO
Item: Component integration
Description: Hadoop components work seamlessly together and can be monitored and adjusted continuously for status, performance, and system architecture design.
Comment: Basic Hadoop has a good, well-engineered ecosystem of components, and Apache deserves kudos for shepherding this complex, multi-faceted software engineering environment. However, there are still enough potential points of conflict, and especially of version incompatibility, to make it a large investment in time and personnel to make it work in a real corporate environment. Cloudera Manager and the packaged configuration excel at providing a nicely integrated set of components, including the ability to add and remove them without much trouble. Given the complex nature of a system of 10-20 components, it is inevitable that there will be conflicts and failures at some point. Cloudera Manager makes it much easier to investigate and solve these challenges in a reasonably short time.
Grade: A
Group: HO
Item: Other components and adaptability
Description: There are many components in the Hadoop ecosystem, with more continually being added as Apache incubator projects. These should work as part of the overall integrated environment without disrupting existing operations, and should match the real-time monitoring and management of the primary components, HDFS and YARN. The components should have APIs for common programming methods for basic functions, with documentation that enables rapid development.
Comment: Cloudera Manager includes all the deployed components and tracks updateable versions. It allows deploying the current versions to new nodes as well as adding new components. All components can be monitored and even traced to their actual configuration and log files. Some of the many settings are either not obvious or not well described, but the default values work well, which removes the need to understand every setting before having an operational environment. The base Hadoop projects provide APIs for basic functions; in particular, HDFS offers a REST API. While Cloudera’s distribution does not add new functionality to these APIs, it reuses them for new components, such as using the Hive JDBC classes to connect to Impala from Java (see the sketch after this table). However, some of the key functions and library references required by this JDBC driver were incorrectly stated in the documentation, which I had to sort out by reading the JDBC driver source code and finding answers in other Hadoop distribution forums.
Grade: A
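To complement the Data queries and API comments above, here is a minimal sketch of querying Impala from Java by reusing the Hive JDBC driver, the approach noted in the last comment. The host, port, database, table, and auth=noSasl setting are assumptions for an unsecured test cluster; secured clusters need Kerberos or LDAP connection settings instead, and the exact URL form should be checked against your distribution’s documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Hypothetical sketch: query Impala through the HiveServer2 JDBC driver.
 *  Connection details (host, port 21050, auth=noSasl, table name) are
 *  assumptions for an unsecured test cluster and will vary in practice. */
public class ImpalaQuerySketch {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver class reused for Impala connections.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String url = "jdbc:hive2://impala-host.example.com:21050/default;auth=noSasl";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT cost_center, SUM(amount) AS total "
                   + "FROM transactions GROUP BY cost_center")) {
            while (rs.next()) {
                System.out.printf("%s\t%.2f%n", rs.getString(1), rs.getDouble(2));
            }
        }
    }
}
```

Getting the Hive JDBC jar and its dependent libraries onto the classpath is exactly where the documentation gaps mentioned above showed up, so expect some trial and error there.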
Finally, I will assess the overarching question of whether Hadoop is ready for core corporate data uses. Again, it is important to define the scope of this assessment. It does not include typical Big Data use cases where very large quantities of data are sifted and stored looking for patterns; that is a statistical use case and is out of scope. We concentrate on mission-critical, widely used, business-dependent data management and processing. This assessment is necessarily subjective. I encourage you to review and question my assumptions and results, as well as perform your own experiments.
Item: Ability to be a corporate data store for documents and other unstructured, non-database information. Grade: A+
Item: Ability to be a corporate data store for mission-critical structured data with security and high transactional performance. Grade: A-
Item: Ability to be a Data Normalization center with low cycle time and integrated business-technical functionality. Grade: A+
Item: Ability to integrate with existing applications and business activity workflows. Grade: A-
Item: Confidence in performance, reliability, and data processing. Grade: B
Conclusion
Cloudera’s Hadoop distribution is definitely ready for corporate use. It is a powerful, low-cost technology opportunity to help solve long-standing challenges in mission-critical corporate data. It is not as finely tuned to current corporate data use cases as traditional systems, but that is not a negative, since corporate use cases are changing, and need to change, rapidly to overcome serious limitations in meaningful use (or, more accurately, the lack thereof). The lower score in confidence reflects a need for published use cases in trusted information channels that can convince corporate management, more than any inherent flaw in the software environment itself. Great progress is being made in this regard, and hopefully TechLab has helped provide meaningful details to aid understanding and implementation.
My short conclusion is that Cloudera’s distribution is a strong asset for starting a solid Data Normalization project now, and it can be an excellent total storage center without much difficulty. While solid in core corporate data capability, it should be put into a prototype development system now and given parallel copies of corporate data stores to experiment with, so the organization can become confident that it can process data and integrate with applications with minimal effort. This will enable a transition to production with confidence in 12-18 months.
But do not simply install a Hadoop cluster and expect it to solve organizationally bound challenges like governance, data correctness, or even whether the CFO trusts your reports. If done smartly, Hadoop will open avenues of new organizational success in concert with its technological power.
Please remember to hang up your lab coats before you leave.
Good night.
Learn more: TechLab with Cloudera: Exploring Hadoop for Corporate Small Data >>
Dr. Geoffrey Malafsky is Founder of the PSIKORS Institute and Phasic Systems, Inc. A former nanotechnology scientist for the US Naval Research Center, he developed the PSIKORS Model for data normalization. He currently works on normalizing data for one of the largest IT infrastructures in the world.
[1] Hadoop for Small Data, Geoffrey Malafsky, Blog #1 for TechLab with Cloudera, http://vision.cloudera.com/hadoop-for-small-data/
[2] Normalizing Corporate Small Data With Hadoop and Data Science, Geoffrey Malafsky, Blog #2 for TechLab with Cloudera, http://vision.cloudera.com/normalizing-corporate-small-data-with-hadoop-and-data-science/
[3] See for example, How will the Federal Reserve normalize monetary policy when it becomes appropriate to do so?, http://www.federalreserve.gov/faqs/how-will-fed-normalize-monetary-policy-when-it-becomes-appropriate-to-do-so.htm
[4] Need for Speed: Parallelizing Corporate Data, Geoffrey Malafsky, Blog #3 for TechLab with Cloudera, http://vision.cloudera.com/need-for-speed-parallelizing-corporate-data/
[5] http://www.merriam-webster.com/dictionary/adjudicate