Making it Simple
In the world of Big Data, the increasingly common task of pipelining data from disparate sources into an OLAP-friendly store can quickly get complex and messy, especially as the number of sources grows.
This article is for teams who are not already moving terabytes of data and are therefore unfamiliar with the tooling required. We have been through the trenches on a recent project and wanted to share how we sifted through the myriad new and established tooling options and developed a simple, low-cost approach, built on relatively new tools, that the client could easily operate on their own after handover.
The Project
This project was driven by the need for consolidated business analytics: real-time BI for internal operations, C-level transparency, and slivers of the data extended to customers as part of their reporting on the services they use. The data had to be extensible with custom visualizations and dashboards built from the same source of truth. The client realized it was time to own the data from the various SaaS and PaaS sources it relies on and combine it with data from its own B2C app, providing a single-pane-of-glass view of everything from operational to user metrics. Like Buffer, who cautions peers against depending too heavily on third parties for storing analytics data, they realized the right way to do this is to bring the data in-house, under their own control.
Step 1 – Map the data topology
With any data pipeline implementation like this, there are basic questions to answer:
Where is the data coming from?
Where is the data going?
How will it be used?
How does the data get from A to B, then C to B, then D to B, and so on?
For this project, we needed to get data from multiple sources into Amazon Redshift.
Why Redshift?
It quickly became apparent that Big Data solutions like Hadoop, Spark, Kudu, Druid, and Cassandra were overpowered and too complex for the main goal here: delivering analytics data to the business as soon as possible. For this project, Redshift proved to be a great fit: it speaks SQL, it's scalable, it's affordable, it's very easy to use, and it still leaves plenty of room to grow.
Project Requirements
Data needed to be streamed from the following sources into Redshift:
PostgreSQL RDS (in a VPC)
Google Analytics
AppsFlyer
HubSpot
SendGrid
JIRA
The data would need to be transformed and sanitized.
To make matters trickier, the data architecture needed to be flexible enough to swap out the PostgreSQL data source for a new Riak data source in the near future with limited refactoring.
The architecture needed to be flexible enough to handle ongoing development on all the data sources.
The overall solution needed to transform the source data from the PostgreSQL (and, later, Riak) data models into a shape more friendly to a column store, i.e. a star schema (a minimal sketch of such a reshaping follows these requirements).
All of this data consolidation was to drive internal and customer-facing analytic visualizations, delivered by Chartio, a PaaS BI tool with embedding capabilities.
The implementation and knowledge transfer of the solution needed to be delivered in less than three weeks.
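To make the transformation requirement concrete, here is a minimal sketch, in plain Python, of how a normalized relational row might be reshaped into star-schema-friendly events. The table and field names are purely illustrative, not the client's actual schema.

```python
# Minimal sketch (illustrative names only): split a normalized 'order'
# row into one fact event and one dimension event for a column store.

def to_star_schema(order_row):
    fact = {
        "table": "fact_orders",
        "order_id": order_row["id"],
        "customer_id": order_row["customer_id"],
        "order_date": order_row["created_at"][:10],  # date key for the fact table
        "total_amount": float(order_row["total"]),   # coerce to numeric
    }
    customer_dim = {
        "table": "dim_customer",
        "customer_id": order_row["customer_id"],
        "customer_name": order_row["customer_name"],
        "customer_segment": order_row.get("segment", "unknown"),
    }
    return [fact, customer_dim]


if __name__ == "__main__":
    row = {
        "id": 42,
        "customer_id": 7,
        "customer_name": "Acme Corp",
        "segment": "enterprise",
        "created_at": "2016-05-01T12:34:56Z",
        "total": "199.99",
    }
    for event in to_star_schema(row):
        print(event)
```

The point is simply that one incoming row can yield several outgoing events, each destined for its own fact or dimension table.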
Data Pipeline – Build vs Buy
Obviously, we could have home-rolled this pipeline using technologies like Kinesis, Kafka, or RabbitMQ (the last of which was already in use by the client). However, these were too complex for the need and the time frame, given that each entailed writing custom data-sourcing code, standing up infrastructure to support it, and building custom Redshift batching logic. Even a comprehensive pipeline solution like DataVirtuality would have been more than was needed and carried a steeper implementation curve.
There had to be a simpler way to approach this. Given the project requirements, we narrowed it down to five needs for a pipeline solution:
Must integrate with the desired sources
Must support SSH tunnelling for AWS VPCs
Must be able to transform data on the fly
Must lend itself to very rapid iteration and development
Must be secure — as this is customer data we’re talking about here
We researched our options, and found three solution candidates that seemed capable of doing most of the heavy lifting: Textur, FiveTran, and Alooma.
Textur
Textur was initially the most exciting solution, and a client favorite, in part due to its price point. They offered integrations with over 25 inputs; simple source configuration for connections, intervals, batching, and collection; table-level replication; and, most importantly, data-agnostic pipelining. Unfortunately, at the time we had to reject Textur due to poor SSH tunnelling (an issue they have since confirmed as resolved) and limited transformation support. Given some time for this technology to mature, I think it will become an excellent solution.
FiveTran
Of the three, FiveTran had the most input integrations, which was appealing. It also offered secure connections, one-hour sync intervals, data-agnostic pipelines, and rapid integration. Unfortunately, during testing we had to reject this option due to issues with SSH tunnelling (since resolved) and insufficient transform capabilities: they use a more direct schema mapping. I wouldn't discount this technology, though, as there was also a lot to love.
Alooma
While Alooma's list of specific input integrations was initially smaller, they offered an elegant, straightforward design that proved to be the best fit:
Full control of input configuration
Secure connections and support for SSH tunnelling
Collection intervals of mere minutes
Automatic event to table mappings in Redshift
For unsupported inputs, generic REST, Java, and Python SDK integrations
The real deal-closer, though, was a deployable transform engine, written in Python, for manipulating events as they pass through the stream. This proved exceptionally powerful, allowing us to coerce data types, sanitize data, support one-to-many event explosion, and transform data schemas. Plus, Alooma can store events in Amazon S3 and replay them as needed, which let us iterate rapidly as we refined our transformations.
As an example, consider transforming incoming events from Postgres and Google Analytics.
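The sketch below shows the general shape such a transform could take. The transform(event) entry point follows the pattern Alooma documents for its Python code engine, but the 'source' tag, the field names, and the coercions are illustrative assumptions rather than the client's actual logic.

```python
# Hedged sketch of an Alooma-style transform; field names are assumptions.
from datetime import datetime


def transform(event):
    source = event.get("source")  # assumed field identifying the input

    if source == "postgresql":
        # Coerce types: epoch seconds to ISO-8601, numeric strings to floats.
        if "created_at" in event:
            event["created_at"] = datetime.utcfromtimestamp(
                event["created_at"]).isoformat()
        if "total" in event:
            event["total"] = float(event["total"])
        # Sanitize: drop fields that should never reach the warehouse.
        event.pop("password_hash", None)

    elif source == "google_analytics":
        # Flatten nested GA dimensions into top-level columns.
        for key, value in event.pop("dimensions", {}).items():
            event["ga_%s" % key] = value

    # One-to-many "event explosion" can be expressed by returning a list
    # of events instead of a single dict, where the engine supports it.
    return event
```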
After the transformations are applied, events are mapped to configurable tables within Redshift. The mapper maps event fields to columns automatically with type inference, and those mappings can also be overridden manually.
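As a rough illustration of what type inference means here (a hypothetical helper, not Alooma's actual mapper), a sample event's Python types can be translated into Redshift column types, with manual overrides layered on top:

```python
# Illustrative type inference: guess a Redshift column type per field.

def infer_redshift_type(value):
    if isinstance(value, bool):      # check bool before int: bool is an int subclass
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE PRECISION"
    return "VARCHAR(65535)"          # fall back to a wide text column


def infer_mapping(sample_event, overrides=None):
    """Map event fields to column types, allowing manual overrides."""
    mapping = {field: infer_redshift_type(value)
               for field, value in sample_event.items()}
    mapping.update(overrides or {})
    return mapping


print(infer_mapping(
    {"order_id": 42, "total": 199.99, "created_at": "2016-05-01"},
    overrides={"created_at": "TIMESTAMP"},  # manual override of the inferred type
))
```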
The resulting configuration is a clearly defined plumbing system from each input source to the Redshift output.
Because of these killer features, among others, we were able to spend less time building the pipeline and more time developing custom queries for analytics dashboards and visualizations in Chartio.
Alooma was the clear choice for this project.
Future-proofing New Data Sources with Alooma
As noted, flexibility for ongoing development was key, particularly looking ahead to replacing the PostgreSQL data source with Riak. With Alooma, this proved to be very streamlined. The generic SDK input allowed Riak events to be pushed up from RabbitMQ, a system already in place for our client, and could be configured alongside PostgreSQL until we made the switch. Likewise, custom transformations and automatic event field-to-column mapping allowed for near-seamless extensibility as columns were added to or removed from the data sources.
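A minimal sketch of that bridge is below, assuming the pika RabbitMQ client on one side and an Alooma Python SDK client exposing a report(event) method on the other; the SDK import and constructor, the queue name, and the token are placeholders to be checked against the SDK documentation.

```python
# Hedged sketch: forward Riak change events from RabbitMQ into Alooma.
import json

import pika            # RabbitMQ client
import alooma_pysdk    # assumed package name for the Alooma Python SDK

sdk = alooma_pysdk.PythonSDK("YOUR-ALOOMA-TOKEN")  # assumed constructor


def on_message(channel, method, properties, body):
    """Push each Riak event from the queue into the Alooma pipeline."""
    event = json.loads(body)
    event["source"] = "riak"  # tag the event so transforms can route it
    sdk.report(event)         # assumed SDK method


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.basic_consume(queue="riak_events",
                      on_message_callback=on_message,
                      auto_ack=True)
channel.start_consuming()
```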
Even for systems not bound by our specific set of requirements, this kind of flexibility is invaluable since, as your data grows and evolves, the purpose served by your pipeline system can, and probably will, change with it.
Alooma Support
Alooma's support team was great to work with and felt like an extension of our project team. They gave us a dedicated Slack channel that allowed us to iterate rapidly, which was crucial given our demanding timeline. They also granted us a very generous trial period to confirm the product fit our needs before buying, and were very helpful both in answering questions and in working with us on custom configurations.
Takeaways
In the end, it's helpful to consider what the outcome would have been had we hand-rolled this entire infrastructure using a traditional Big Data pipeline design, for example a Kafka-and-Hadoop setup.
Could the same goals have been accomplished? Yes
Would the infrastructure be as flexible for future changes? Yes
Would such an implementation be possible in the same timeframe? Unlikely
Could the same ease of implementation and onboarding be achieved? Doubtful
Unless you're already fairly well versed in the Kafka-to-Hadoop design (or some similar infrastructure), or your data is already huge, a home-rolled solution is all but guaranteed to significantly increase the design, development, implementation, and maintenance burden. It's good to know there are tools like Alooma that give you Big Data capabilities off the shelf even when your data is not that big, reducing complexity and making it simple to focus on the real goal: analyzing your data.