2013-01-01

The Drive for In-Memory Computing

Memory-based databases and caching products have been available for over a decade. So far, however, they have occupied a fairly small niche in the data management market.

There have been multiple advances in the industry, in both hardware and software architecture, that make memory-based computing more relevant today than in the past, as outlined in the diagram below.

In a nutshell, the availability of new classes of hardware with 64-bit CPUs means that a single device can now support 2 TB of memory. In addition, advances in software architecture toward distributed and cloud-based solutions make it easier to utilize these new hardware capabilities.



In-Memory Computing

In many ways, In-Memory Computing is a close relative of the in-memory database. Like many databases, it is designed to provide the data management capabilities expected of traditional databases, such as queries and transactions, with the difference that the data is managed in RAM rather than on disk, and thus comes with potentially 1,000x better performance and latency, according to various benchmarks.

The main differences between traditional in-memory databases and In-Memory Computing are that In-Memory Computing is:

1. Designed for distributed and elastic environments

2. Designed for in-memory data processing

Executing the code where the data is:

The fact that we can store our data in the same address space as our application is the biggest gain.

Unlike with disk and even flash devices, we can access our data by reference and thus perform complex data manipulation without any serialization/deserialization overhead. With modern languages such as Java, JavaScript, JRuby, and Scala, it is also significantly easier to pass complex logic over the wire and execute it on a remote device.

In-Memory Computing relies heavily on that capability, and exposes a new class of complex data processing capabilities that fit well with the distributed nature of the data, through real-time map/reduce and stream-based processing as core elements of its architecture.
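To make this more concrete, here is a minimal, self-contained Java sketch of the idea of shipping logic to the partition that owns the data, rather than moving the data to the caller. The Partition and GridTask names are illustrative assumptions, not a real product API; actual data grids expose much richer variants of this pattern.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch only: Partition and GridTask are made-up names,
// not the API of any specific in-memory data grid.
public class ColocatedExecutionSketch {

    // Stand-in for one partition of an in-memory data grid.
    static class Partition {
        final Map<String, Object> store = new ConcurrentHashMap<>();

        // The task runs where the data lives: it receives a direct
        // reference to the stored object, with no serialization round trip.
        <R> R execute(String key, Function<Object, R> task) {
            return task.apply(store.get(key));
        }
    }

    // Tasks are Serializable so that, in a real distributed deployment,
    // the logic itself can be shipped over the wire to a remote partition.
    interface GridTask<R> extends Function<Object, R>, Serializable {}

    public static void main(String[] args) {
        Partition partition = new Partition();
        partition.store.put("order-42", new double[] {19.99, 5.00, 3.50});

        // Compute the order total on the partition that owns "order-42".
        GridTask<Double> total = value -> {
            double sum = 0;
            for (double line : (double[]) value) sum += line;
            return sum;
        };
        System.out.println("total = " + partition.execute("order-42", total));
    }
}
```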

The Big Data Context

According to our recent survey, more than 70 percent of respondents said their business requires real-time processing of Big Data -- either in large volumes, at high velocity, or both.

Interestingly enough, another survey by Ventana Research indicated that one of the biggest technical challenges in Big Data is the lack of real-time capabilities (67%). The report also indicated that many organizations are planning to use in-memory databases (40%) as part of their Big Data stack, placing in-memory databases as a second choice, ahead of specialized DBMSs (33%) and Hadoop (32%). One conclusion this survey leads to is that organizations see data warehouse appliances and in-memory databases among their first choices for dealing with the lack of real-time capabilities.

 



No One-Size-Fits-All Solution

While in-memory databases fit well into the planned Big Data stack, it is clear that there's no one-size-fits-all solution. The Big Data stack is going to be based on a blend of various technologies, each covering a different aspect of the Big Data challenge, from batch to real-time, from vertical to horizontal solutions, and so on.

The question is: How do we integrate them all, without adding even more complexity to an already complex system?

In this post I will focus specifically on one of the approaches that we used for combining In-Memory Computing with other Big Data solutions, such as Hadoop and Cassandra.

Putting In-Memory Computing Together with a NoSQL DB

One of the main motivations to integrate in-memory-based solutions with a NoSQL DB is to reduce the cost per GB of data.

Putting our entire data set purely in memory can be too costly, especially for data that we're not going to access frequently.

There are various approaches to doing this -- the approach we found most useful is a two-tier approach.

With the two-tier approach, the In-Memory Computing system runs separately from the NoSQL database, which acts as the long-term storage.
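As a rough illustration of the two-tier read/write path, here is a hedged Java sketch. NoSqlStore is an assumed stand-in for a real driver (for example, a Cassandra client), not an actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the two-tier idea: serve reads from memory when
// possible and fall back to the NoSQL store on a miss. NoSqlStore is an
// assumed facade, not a real driver API.
public class TwoTierStore {

    interface NoSqlStore {                  // assumed long-term storage facade
        String read(String key);
        void write(String key, String value);
    }

    private final Map<String, String> memoryTier = new ConcurrentHashMap<>();
    private final NoSqlStore diskTier;

    public TwoTierStore(NoSqlStore diskTier) {
        this.diskTier = diskTier;
    }

    // Reads hit RAM first; a miss is loaded from the NoSQL tier and cached.
    public String get(String key) {
        return memoryTier.computeIfAbsent(key, diskTier::read);
    }

    // Writes land in memory immediately; propagating them reliably to the
    // NoSQL tier is exactly the synchronization challenge discussed next.
    public void put(String key, String value) {
        memoryTier.put(key, value);
        diskTier.write(key, value);         // simplistic write-through
    }

    public static void main(String[] args) {
        Map<String, String> fakeDb = new ConcurrentHashMap<>();
        NoSqlStore cassandraLike = new NoSqlStore() {    // in-memory stand-in
            public String read(String key) { return fakeDb.get(key); }
            public void write(String key, String value) { fakeDb.put(key, value); }
        };
        TwoTierStore store = new TwoTierStore(cassandraLike);
        store.put("user-1", "Ada");
        System.out.println(store.get("user-1"));         // served from RAM
    }
}
```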



The Challenge

The main challenge with this approach is the complexity associated with synchronizing two separate data systems -- specifically, how to ensure that data written into the front-end In-Memory Computing engine gets reliably populated into the NoSQL database, and vice versa.

The Solution

To deal with this challenge we used an approach similar to the one we had used before with RDBMSs: an implicit plug-in that gets called whenever new data is written and populates it into the underlying database. The plug-in also deals with pre-loading the data when the system starts. In the RDBMS world we used frameworks like Hibernate to handle the implicit mapping of the data between the in-memory front end and the underlying database.
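To give a feel for what such a plug-in contract might look like, here is a minimal Java sketch. The interface and method names are illustrative assumptions, not the actual hooks of any particular product.

```java
import java.util.List;
import java.util.Map;

// Hypothetical plug-in contract: the in-memory tier calls these hooks on
// writes and at startup. Names are illustrative, not a real product API.
public interface SyncPlugin {

    // Called by the in-memory tier whenever a batch of objects is written,
    // so the plug-in can persist the change to the underlying NoSQL database.
    void onWrite(List<Map<String, Object>> changedDocuments);

    // Called once at startup to pre-load the data set from the
    // NoSQL database back into memory.
    Iterable<Map<String, Object>> initialLoad();
}
```

A Cassandra-backed implementation would, for instance, translate each changed document into a batch mutation in onWrite and into a scan of the relevant column family in initialLoad.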

Working with Dynamic Data Structures (a.k.a. the Document Model)

When we tried to apply the same approach to NoSQL databases, we could no longer rely on Hibernate as the default framework for mapping the data between the two data systems, as NoSQL databases like Cassandra tend to be fundamentally different from traditional RDBMSs. The main difference is the use of dynamic data structures, a.k.a. the Document Model.

To deal with dynamic data structures we added the following hooks (sketched in code after this list):

Introducing new documents and objects: Users can choose to write or load data in various forms -- documents for unstructured data, or objects (POJOs) for structured or semi-structured data.

Introducing and loading new metadata: To map the data to and from the NoSQL database, we also added the ability to introduce new metadata and load an object's metadata before the actual data is loaded.

Introducing new indexes: In NoSQL databases you cannot effectively access data that is not indexed. For that purpose we included the ability to introduce indexes on the fly.
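The following self-contained Java sketch illustrates the three hooks above. DocumentGrid, registerType, write, and addIndex are made-up names used for illustration only, not a real product API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the three hooks; not a real product API.
public class DynamicSchemaSketch {

    static class DocumentGrid {
        final Map<String, List<Map<String, Object>>> docs = new HashMap<>();
        final Map<String, Set<String>> indexes = new HashMap<>();

        // Hook 1: introduce new metadata before the data itself is loaded.
        void registerType(String typeName) {
            docs.putIfAbsent(typeName, new ArrayList<>());
            indexes.putIfAbsent(typeName, new HashSet<>());
        }

        // Hook 2: write a schema-less document; fields are not fixed up front.
        void write(String typeName, Map<String, Object> document) {
            docs.get(typeName).add(document);
        }

        // Hook 3: add an index on the fly so queries on that field stay fast.
        void addIndex(String typeName, String field) {
            indexes.get(typeName).add(field);
        }
    }

    public static void main(String[] args) {
        DocumentGrid grid = new DocumentGrid();
        grid.registerType("User");              // metadata first

        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "Ada");
        doc.put("country", "UK");
        grid.write("User", doc);                // then the document itself

        grid.addIndex("User", "country");       // index introduced on the fly
    }
}
```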

You can see how this works specifically with Cassandra in this post -- From The Memory Grid to Cassandra.

The Benefit -- Best of Both Worlds

The main benefit of this two-tier approach is that it allows us to take the best of the in-memory and file-based approaches without adding too much complexity. The two tiers behave and work as one data system from the end user's perspective; the bits and pieces of how the two systems get synchronized are hidden from the user.

Furthermore, the two-tier approach opens up a new degree of flexibility in how we design our Big Data system.

New Degree of Flexibility

If we look at the entire data flow -- from the point at which a user interacts with our system, where we expect low latency and a high degree of consistency, to our analytics systems, where we record and analyze those actions, latency is less of an issue, and we can relax some of the consistency constraints -- we can see that each stage in our data processing has different consistency, latency, and performance requirements.

With the two-tier approach there is more flexibility in dealing with those different requirements while keeping everything working as if it were one big system from a usability perspective. Here are a few examples of how this setup can work:

Consistent data flow from real-time to batch: The integration enables us to handle real-time data processing at in-memory speed and handle longer-term data processing through the underlying database.

Performance and latency: The In-Memory Computing system can handle event processing before the data gets into the database. Another approach is to keep the last day (or days) of data in memory and the rest in the NoSQL database.

Mixed consistency model: The In-Memory Computing system is often built for strong consistency, whereas NoSQL databases often work best with eventual consistency. Usually, the consistency requirements are more relevant at the front end of the system and become less relevant as the data gets older. The combined approach enables us to set up our front end for strong consistency and our back end for eventual consistency.

Deterministic behavior: In many cases, we must ensure that a given data set can be served with consistent performance. Many databases use an LRU-based cache to optimize data access. The limitation of this approach is that the speed at which we can access our data becomes non-deterministic, since we often do not control which data is served through the database cache; in some cases we get a fast response time on a cache hit, while in others the same operation can take ten times longer on a cache miss. By splitting our in-memory data from our file-based storage, we get explicit control over which data is served at in-memory speed and which is not, thus ensuring consistent access to that data.

Faster ETL: By front-ending our Big Data storage with In-Memory Computing, we can also speed up the time it takes to pre-process and load data into our long-term data system. In this context, we can push the filtering, validation, compression, and other aspects of our data processing into memory before the data goes into our long-term databases (see the sketch after this list).
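As a rough illustration of the "Faster ETL" point above, here is a hedged Java sketch that filters and validates events in memory before handing them to the long-term store. writeToLongTermStore is a placeholder for illustration, not a real driver call.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: clean the event stream in memory before it reaches
// the long-term store. writeToLongTermStore stands in for a real NoSQL write.
public class InMemoryEtlSketch {

    record Event(String userId, long timestampMillis) {}

    // Filtering and validation happen at in-memory speed, so only clean
    // records pay the cost of the long-term write path.
    static List<Event> preProcess(List<Event> rawEvents) {
        return rawEvents.stream()
                .filter(e -> e.userId() != null)          // basic validation
                .filter(e -> e.timestampMillis() > 0)     // drop corrupt records
                .collect(Collectors.toList());
    }

    static void writeToLongTermStore(List<Event> events) {
        // Placeholder: a real system would batch-write to Cassandra here.
        System.out.println("persisting " + events.size() + " clean events");
    }

    public static void main(String[] args) {
        List<Event> raw = List.of(
                new Event("u1", 1357000000000L),
                new Event(null, 1357000001000L),          // invalid: no user
                new Event("u2", -1));                     // invalid: bad time
        writeToLongTermStore(preProcess(raw));
    }
}
```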

Final Words

Big Data systems are complex beasts, and it is clear that the one-size-fits-all approach doesn't work.

On the other hand, having too many data systems increases the complexity of managing our Big Data system almost exponentially, and our ability to ensure consistent behavior, data integrity, and reliable synchronization across the various systems becomes an almost unmanageable task if done manually.

Adding real-time capabilities to our Big Data system is a classic area where the kind of integration described in this article is needed. The integration between in-memory and file-based approaches as two separate tiers also introduces additional flexibility in how we can handle often contradictory requirements such as consistency, scale, latency, and cost. Instead of settling for a least common denominator, we can optimize each tier for the area it fits best.

References:

From The Memory Grid to Cassandra

GigaSpaces Big Data Survey

The Challenges of Big Data Survey (Ventana Research) -- http://www.ventanaresearch.com/uploadedFiles/Content/Landing_Pages/Ventana_Research_Big_Data_Benchmark_Research_Presentation.pdf

In-Memory Computing (Wikipedia)


 
