2016-09-28

Integrating disparate data silos is one of the essential functions of an enterprise system.  Perhaps you have CRM data in Salesforce, and customer managed profile information in an external customer portal but the data needs to be synchronized.  In the past there were two primary methods for integrating data silos:  The first generation of enterprise integration is Extract, Transform, and Load (ETL);  A second generation, which has been common for around a decade, is the Enterprise Service Bus (ESB) architecture.

We now need to do more than just periodically copy data from one system to another. We need data pipelines that can ingest terabytes of data, process it in a multitude of ways, and store it in a variety of systems. Our apps need to be smarter by weaving together disparate systems and providing intelligence from those aggregations.

Apache Kafka has emerged as a next generation event streaming system to connect our distributed systems through fault tolerant and scalable event-driven architectures. This enables new types of intelligent and engagement applications, especially those that are powered by the new Salesforce Einstein technologies which brings AI to everyone. Today Heroku has announced the availability of Apache Kafka as a managed service, making it easy for you to begin building a new generation of smarter applications!

Integration with ETL and ESBs

Before looking at the benefits of Apache Kafka, let’s look closer at ETL and ESBs.

ETL is simple because of a very defined wiring between a data producer and consumer.  For instance, to manage an ETL process I will have a program / script that pulls a data set from one system, does some field mapping, and maybe some calculations, then inserts or updates rows in another system.  These kinds of processes are “batch” processes that run on a specified interval (e.g. daily).  With ETL we have to program for the possibility that the process can fail before completing a batch, and write manual logic so that new records are not duplicated, and calculations are correct.

ETL processes that rely on vertical scaling, like traditional SQL-based systems, are often challenging to scale horizontally.  So if one million rows need to be processed there isn’t a clear and easy way to divide (i.e. shard) the work into chunks and distribute those chunks across multiple processing nodes.  So ETL is simple but definitely lacks many of the things we would need to do it right.

ESBs decouple the wiring between the data producer and consumer thus providing more flexibility in how systems are glued together.  With an ESB, changes can be immediately propagated across systems and many subscribers can easily receive change notifications, and do their own transformations.

The primary downside of an ESB for integration is that there are no guarantees around delivery of a message.  If a consumer is down, then it will miss messages.  Also ESBs do not have any ordering guarantees or easy way to shard consumption.

Fault Tolerant & Scalable Integrations

While maintaining the consumer / producer programming model, Kafka overcomes many of the downsides of ETL and ESBs with three main features:

Messages are immutable and durable for a specified amount of time so delivery can be guaranteed and distributed

Clustering provides high availability and partitioning / sharding

Messages are ordered within the context of a partition

With these three features, Kafka provides a resilient way to build integrations that can tolerate failure and scale massively.

Most Kafka systems ingest data from many sources including user interactions (app & web), telemetry data, or data change events (i.e. change data capture).  Once ingested any system can subscribe to the events on a named topic.  For example, you may have a mobile app where every time a user views an item, that interaction is sent to Kafka.  Then you might have a consumer which stores the raw events in a big data store.  A second consumer might do some analysis on the data like how many items a user looks at per hour, which is then stored in another system like Salesforce.



Salesforce already has a way to get data events out (Streaming API) and changes in (REST or SOAP API) so it is pretty straightforward to fit it into an architecture with Kafka.  Check out the “Streaming Salesforce Events to Heroku Kafka” blog post to get started with connecting the Salesforce Streaming API to Heroku Kafka.

Getting Started with Apache Kafka on Heroku

Kafka is a great fit for handling massive amounts of data because it is naturally distributed.  This facilitates handling user interactions that then need to be analyzed by another system, possibly for calculating aggregates, machine learning, etc.  Let’s walk through how to do this in a Node.js application on Heroku.

We will be using the Dreamhouse sample application as the foundation.  To add sending user interactions to Kafka we will start with the Dreamhouse Web App and add the Kafka pieces.

Step 1) Provision Heroku Kafka

First we will need to provision a Kafka cluster by adding the Apache Kafka on Heroku Addon to the app on Heroku.  A new topic will need to be created in Kafka.  You can use the Heroku CLI and Kafka Plugin to do that by running:

heroku kafka:topics:create -a HEROKU_APP_NAME interactions

Step 2) Add Kafka Client Library

Then our Node app needs a new dependency on the no-kafka library (in the package.json file):

"no-kafka": "git://github.com/crcastle/kafka.git#3.0"

Step 3) Setup Kafka Authentication

Since our connection to Kafka will be encrypted we need a little helper script that will set up the necessary certs.  Put the following env.sh in the root directory of your project: https://github.com/dreamhouseapp/dreamhouse-web-app/blob/kafka/env.sh

Now you need a new file named Procfile that uses the env.sh to setup the environment before starting the app:

web: ./env.sh && npm start

Step 4) Connect to Kafka

Everything is now ready to use Kafka in the app.  The server.js is the main app source.  First, import the no-kafka module:

Since we will be producing messages to Kafka we now need to setup a Producer:

Step 5) Publish Interaction Events to Kafka

With the Kafka producer configured we can now send messages using the producer.send() method.  To intercept all of the page requests to the server we can add a function to the Express web framework main controller:

That is it!  We now are sending user interactions to Kafka where any system can then consume the data for whatever storage or processing is needed.  Once the changes are redeployed to Heroku, you should be up and running!  If you want to see all of the messages on the interactions topic you can run the following command (Note: This only works in the Heroku Common Runtime):

heroku kafka:topics:tail interactions -a HEROKU_APP_NAME

There are all sorts of different consumers we could now hook to this topic.  For instance, we could stream these raw interaction events into a big data store for later analysis or machine learning.  We could also real-time stream processing on the data to calculate aggregates and store those into a database so that dashboards or other visualization tools could have instant analytics.  This could also be used for plain data integration as a real-time, fault-tolerant, and horizontally scalable ETL pipeline.

To learn more about Heroku Kafka check out the release announcement!

Show more