Insideanalysis.com

An Analytics Platform for the Internet of Things

2014-12-01

In his interview with Bloor Group CEO Eric Kavanagh on November 4, 2014, ParStream CTO and Co-founder Mike Hummel explained that his company’s product is the first analytics platform for the Internet of Things (IoT) space. What makes ParStream unique is that, instead of simply collecting IoT data, it focuses on providing meaningful analysis.

Hummel discussed how the sheer scale of IoT data presents a variety of challenges. For example, a single gas turbine delivers 1.8 billion data elements per hour. To deal with these challenges, Hummel and his team made the decision to build ParStream from scratch. By taking the approach of building a fully distributed columnar database with high availability features, ParStream is able to offer a solution that’s 100% focused on real-time analytics on mass data.

Because the ParStream team realized early on that the value of IoT data rapidly degrades over time, their platform is heavily focused on deriving immediate value from data. Additionally, they want to make visual analysis as useful as possible. That’s why in addition to providing broad integration compatibility through standard SQL support, ParStream recently announced a formal partnership with Datawatch. The goal of this partnership is to provide maximum visualization performance with minimal network loads.

Eric Kavanagh: Ladies and gentlemen, hello and welcome back once again to Inside Analysis. My name is Eric Kavanagh. I will be your host for another conversation with a thought leader in the whole field of information management and big data and the Internet of Things. That’s going to be our discussion today. We’re talking to Mike Hummel, CTO and Co-founder of ParStream. Mike, welcome to the show.

Mike Hummel: Hello. Thank you very much.

Eric Kavanagh: Sure thing. So ParStream just had a pretty big launch recently around an analytic platform for the Internet of Things. Can you talk about what you launched and why it’s important?

Mike Hummel: Yes, absolutely. ParStream is the first analytics platform for the IoT space. Now, of course there are several platforms out there that focus on the IoT market but ParStream is the first one that focused on analyzing the huge amount of data that is collected in the IoT space.

Eric Kavanagh: We talk about Internet of Things — all these devices that are going to have sensors on them. This stuff has been around for a while in the manufacturing world, which is kind of a leader in the space. But now with the Internet of Things, you’re talking about literally hundreds of millions and really billions of devices that are connected to the Internet. I presume you’re talking about things like cell phones but also all kinds of other devices with sensors, right? Can you give us a take on which things in the Internet of Things you are focused on?

Mike Hummel: That’s absolutely right. This IoT arena was previously called M2M, machine-to-machine communication. That actually means you collect a huge amount of data, typically in the industrial environment, and try to make sense of the data, use it to reduce cost and increase sales, and increase the ability of your production and of your product. We focused on enterprise applications in the IoT space, and the big difference is that nowadays sensors are so much cheaper and data processing is so much cheaper.

You literally have millions and millions of sensors everywhere that are delivering data at a rate we have not seen in the past. Just to give one example, gas turbines deliver 1.8 billion data elements per hour. Every gas turbine. In the past it was simply not possible to make sense of this data, neither in real time nor in post-processing. So the data was actually not used and it didn’t make sense to put so many sensors on them. And that is the big change: you get way more data. The data is coming in faster and the data is created in a distributed way. So you don’t create all the data at one location. With your example of the mobile devices, that’s obvious. The data is created everywhere. But in the industrial context, too, data is created at every manufacturing site, in the telcos at every cell tower, out there at every waste pump in the field.

So, data created in a distributed way is a specific challenge. And we are facing these challenges with ParStream. We have presented the ParStream IoT platform to solve these challenges.

Eric Kavanagh: This is really interesting stuff because what you’re talking about is an order of magnitude in terms of the change in data volumes. In fact, that’s probably a couple orders of magnitude or even three and that’s a huge difference. So, if I understand it correctly, what you have done is designed a solution that is purpose-built for being able to ingest these large amounts of data but also to build applications around analytics on the data. So it’s really multiple parts here, right? There’s taking the data in, there’s identifying patterns, there’s the whole process of a sort of work bench, I’m guessing, where analysts can build analytical applications that are specifically designed to focus on this big data coming from the Internet of Things, right?

Mike Hummel: Absolutely right. In this whole big data era, it is all about making it possible to analyze big data. Now the focus is on “How can I derive immediate value from the data?” Because we have realized, specifically in the IoT space, that the value of data degrades rapidly over time. This means the faster we use the data we measure and we collect from the sensor, the more value we can derive from it. That’s simply understood if you look at the huge machinery or at a production facility. If you see that something is going wrong, you can do something now. If you wait until the next day when you have analyzed the data in your data warehouse, you have produced a lot of crap.

Eric Kavanagh: That’s exactly right. You know I’m a big fan of clichés, and the cliché that comes to mind in this context is the old one that says, “A stitch in time saves nine.” Meaning, if you wait too long, you’re going to have to do a lot more work to solve a particular problem. What’s so exciting about doing analysis on Internet of Things data is that you can, as you suggest, solve problems as they start or, theoretically, if you’re using predictive analytics, prevent them even before they happen. But I’m guessing that what we’re really going to see is a whole new wave of applications that are designed to identify patterns of machines interacting with each other and either causing problems or creating opportunities. It certainly has to be huge in the cloud infrastructure space. I have to think companies like Amazon, Microsoft or Rackspace, just to name a few of the big ones, would love a solution like this. Because in that heavy duty, high-cost environment of cloud infrastructure, downtime is really, really expensive. That’s one of the big drivers for this kind of solution. You can really save not just a lot of money but a lot of headaches and you can solve really big problems much faster by using this kind of technology, right?

Mike Hummel: So true. We have worked in this environment you’ve just described. If you just imagine you monitor every message that goes from any server to any other server in a data center and you want to store the data over time and derive conclusions from it about where you have network issues, this is a real challenge. That was the reason why we engineered a new database technology that is built from scratch. And around this database we built the IoT platform. The biggest challenge is actually that you have to be able to continuously import massive amounts of data, analyze it in real time, but also analyze it together with historical data to see if you have seen this pattern before and deliver better insight. That is the real challenge: huge intake, complex queries and working on loads of historic data at the same time.

Eric Kavanagh: Yes, that’s fantastic stuff. I’m looking at some of the information on your website and it talks about how, in addition to the ParStream database, which of course is at the core of all this, you’ve got some additional functionality around time-series analytics. It’s my understanding that time-series analytics is so important. It’s important to have a solution that allows you to iterate very quickly, because the question of where to place those markers in the time series is a really big question. Is it ten seconds? Is it two minutes? Is it an hour? You’ll see different patterns evolving depending upon where you placed those markers, right?

Mike Hummel: Oh yes, absolutely. And basically, almost all the data you work with, and specifically in IoT, is time-series data. It’s streaming in and you have to work with it. But the biggest challenge we have found is that if you work in advanced analytics with time-series data, you actually want to pick out, for example, dropped calls for a certain incident out of the time-series data and throw them into advanced analytics algorithms like machine learning. But you only want to take out the dropped calls with a little bit of the data before that issue happened and after that.

Because then the machine learning algorithms deliver the best insight if you give them homogeneous issues. But picking such data out of the huge amount you have collected is a big challenge for, say, a Hadoop-based system, because you want to have 1,000 issues and always a little bit before and afterward. And ParStream is specifically good at doing such selective querying of the data because we have, and we will talk about the database in a second, a wide range of encoded bitmap indexes that help you to do this.
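The kind of selective extraction described here can be illustrated with a minimal Python sketch. It is not ParStream code; the function name `windows_around_events`, the record layout and the sample call data are assumptions made only for illustration. For each incident it returns the incident plus a little data before and after it.

```python
from bisect import bisect_left, bisect_right

def windows_around_events(records, is_event, before, after):
    """Return, for each event, the records within [t - before, t + after].

    records: list of (timestamp, payload) tuples sorted by timestamp.
    is_event: predicate on the payload that marks incidents of interest.
    """
    timestamps = [t for t, _ in records]
    result = []
    for t, payload in records:
        if is_event(payload):
            # Binary search keeps each window lookup logarithmic.
            lo = bisect_left(timestamps, t - before)
            hi = bisect_right(timestamps, t + after)
            result.append(records[lo:hi])
    return result

# Hypothetical call records: (seconds since start, status).
calls = [(0, "ok"), (5, "ok"), (9, "ok"), (10, "dropped"), (12, "ok"),
         (30, "ok"), (58, "ok"), (60, "dropped"), (61, "ok"), (90, "ok")]

# Each dropped call with 5 seconds of context before and 2 seconds after.
slices = windows_around_events(calls, lambda s: s == "dropped", before=5, after=2)
```

Each element of `slices` is a small, homogeneous sample that could be handed to a machine learning algorithm, instead of the full stream.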

Eric Kavanagh: Yeah. That’s interesting. You talked about what makes it special. Could you get into some detail on what you just described there?

Mike Hummel: Yes. We have built a fully distributed columnar database with high availability features, and it’s a software-only solution that runs on any infrastructure on Linux distributions. But what makes it really special is a thing we call HPCI (High Performance Compressed Index). During the import, we actually build highly compressed bitmap indexes that are stored in addition to the data. And we can actually analyze data much, much faster by doing the filtering and the aggregations, and of course ORDER BY and LIMIT operations, on the bitmap index itself, so there is no need to scan through the data.

You can do all the analytics, especially combinatorial analytics, on the compressed bitmap index without the need for decompression. And that massively reduces the data volume you have to pump through the memory channel, i.e., massive reduction in infrastructure cost. The typical reduction factor is 15 to 20 compared to Hadoop solutions.
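The general idea of answering a filter on a bitmap index rather than on the rows can be sketched in a few lines of Python. This is a textbook-style illustration, not HPCI: the bitmaps here are plain uncompressed Python integers, and the class `BitmapIndex`, the helpers `count` and `rows`, and the sample columns are all hypothetical.

```python
from collections import defaultdict

class BitmapIndex:
    """One bitmap per distinct column value; bit i is set if row i has that value."""

    def __init__(self, values):
        self.bitmaps = defaultdict(int)
        for row, value in enumerate(values):
            self.bitmaps[value] |= 1 << row

    def equals(self, value):
        # Unknown values map to an empty bitmap.
        return self.bitmaps.get(value, 0)

def count(bitmap):
    """Number of matching rows (a COUNT(*) answered from the bitmap alone)."""
    return bin(bitmap).count("1")

def rows(bitmap):
    """Row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Hypothetical sensor table with two columns.
site = ["A", "B", "A", "C", "A", "B"]
status = ["ok", "fault", "fault", "ok", "fault", "ok"]

site_idx, status_idx = BitmapIndex(site), BitmapIndex(status)

# WHERE site = 'A' AND status = 'fault', answered with one bitwise AND.
hits = site_idx.equals("A") & status_idx.equals("fault")
```

Here `rows(hits)` yields the matching row numbers and `count(hits)` the aggregate, and neither touches the column data itself. A production index would additionally keep the bitmaps compressed, which this sketch leaves out.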

Eric Kavanagh: Wow. That’s a big deal. When you magnify that out by all the different devices and all the kinds of data that you’re trying to take in, it’s a real game-changer because it alters what is possible and that allows you to ask many more questions and different questions and iterate faster. To me that has always been one of the keys to quality analytics. You want to have an environment where the business analyst has a great deal of freedom and latitude to play around with the data. Because as good as any tool is, you still need people to curate what it’s doing and those people are the analysts. If those people have the right tools they could do amazing things, but it all takes time. I mean, even though you’re able to figure out very quickly what’s happening, still you need some person there paying attention to the patterns who’s going to make a decision and say, “Aha! We need to change something now.” Even if you automate something, still a person has to be there to make the decision to automate it. So it gets me very excited because you’re really enabling the business analyst to do their job, right?

Mike Hummel: Absolutely. Advanced analytics is solely there to deliver hints. A human being then has to look at these hints to better understand if you just have a correlation or if it’s the cause for the issue or for certain behavior. For that, you need to support interactive analytics on mass data to deliver let’s say, one or two second query response time even if you work on billions of data records. We have shown that, in many cases, on stage or at our customers in production, we can deliver on this without any pre-aggregation and we drill down in every dimension. That is the beauty of having a super fast database core engine in a virtual platform.

Eric Kavanagh: That’s a really good point. Drill down and drill sideways and really explore because it’s that exploratory process that yields the kind of analysis, which can lead to game-changing decisions that the organization makes. That’s great stuff. Let me throw this one at you: one of my favorite mantras with respect to the Internet of Things is this line I like to throw out sometimes that “Machines don’t lie.” That’s one of my favorite parts about working with machine-generated data is that a machine is not going to choose to give you false information or make up something if it feels insecure, which people can sometimes do. I think that’s a really solid aspect of machine-generated data. Now granted, you have to watch for new patterns and there are new devices that come out and have new signatures and so forth. But I’m guessing that a company like yours, if you stay focused on these types of use cases, you’re going to be building up an incredible library of awareness of all these different types of machines. And then the interaction between them, that’s going to be where some of the real kernels of value come, right?

Mike Hummel: Yes, yes. And the difference between the old style BI and data warehouse guys and people working in IoT couldn’t be bigger. I mean, in the data warehouse world you created ETL processes that removed all the outliers and said, “Oh, that’s bad data–throw it away.” In the IoT space, you specifically look at these outliers and say, “Oh! Something happened there. Let me try to understand it.”

Eric Kavanagh: Let’s talk a little bit more about the database and where it came from. Can you talk about what first went into the kernel of your database system and how you architect it for this whole Internet of Things solution? I mean you talked about the HPCI. What else can you tell us about how the ParStream database works? I love the fact that it’s distributed because that’s a very specific challenge to run a distributed database. Can you give us some ideas as to what makes it special in that environment?

Mike Hummel: Ah, yes. We built ParStream entirely from scratch. It keeps us with a 100% focus on real-time analytics on mass data. We took a lot of stuff away that makes other databases slow.

ParStream, it’s a non-transactional database. We focus on doing analytics of mass data that doesn’t change anymore after you’ve imported it, like log data, transaction data, and sensor data. So we could build a lock-less architecture that is able to import data without slowing down query processing, and query processing is not slowing down imports. So you can really do both things at the same time. You can even separate it out to run on different hardware. You can run the import on some servers and the querying on other servers. This is the real beauty because it allows us to work specifically in the IoT field with huge amounts of incoming data and doing analytics on this new and historical data together. That is one major thing we did.

The second one was that we really processed the data where it starts. So we worked with decentralized storage on the server. We only map parts of the data into memory. So we are not an in-memory database. We actually persist all the data, but memory-map only fractions of the index into memory, which massively increases the cache hit rate. What we also did was look at this indexing technology, and I think that was the most important part. Because when we started we had a look at MonetDB, which is a super fast in-memory columnar database, but for our application it was 100 times too slow.

We understood why this is the case. All columnar databases are actually clogging up the memory channel between memory and CPU, and we massively worked on reducing the data volume that has to be transferred from the memory into the CPU. The solution was to work on compressed bitmap indexes. That is the key. We have patents on that and we have super fast algorithms we have engineered over many years. That is our key differentiator from an architectural point of view, and it results in super fast queries at a very, very low total cost of ownership.

Eric Kavanagh: You know, you’re reminding me of something else. And maybe you can shed some light on this. Some of the other folks in our industry I have talked to, they will discuss issues like using R, the statistical library programming language, and a lot of those algorithms tend to be very core bound. It sounds to me like you have done some pretty intelligent architecture work around solving that kind of problem. Does that make sense?

Mike Hummel: That makes a lot of sense. I mean R is basically sitting on top of a database pulling all the data out of the database that is relevant and then caching it in memory. Now for typical applications with other databases, it is a really, really bad pattern because for example, you suck in 50 million rows into memory, and that takes a long time to get it out of the database.

Even if you do not run R inside of the ParStream database, for example because you have your own purpose-built application sitting on top of the database that you would like to continue to use, you can do this with ParStream. Because if you are selective about what data you want to throw into R, say you only want to analyze dropped calls or bad quality in your production process, you don’t have to pull a hundred million rows out. You can run selective queries, pull out what is of interest to you and then process it in your application. But of course you can also run algorithms within ParStream. We provide UDX, as we call it, a user-defined API, which allows you to run algorithms within the query processing that are defined by the end users. You can write them in C++ code and actually move them into the database as a shared object. That’s a very powerful way to include your own algorithms like geo-distributed analytics or a geo-clustering transformation that you can directly implement within the database.
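The "selective query, then process in your application" pattern can be sketched with Python's built-in `sqlite3` module standing in for the database. The stand-in is only an assumption for illustration; the table `calls`, its columns and the sample rows are hypothetical and say nothing about ParStream's own schema or API.

```python
import sqlite3
from statistics import mean

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (cell_id TEXT, status TEXT, signal_dbm REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [("c1", "ok", -70.0), ("c1", "dropped", -110.0), ("c2", "ok", -65.0),
     ("c2", "dropped", -105.0), ("c3", "ok", -80.0)],
)

# Selective query: only the rows of interest leave the database.
dropped = conn.execute(
    "SELECT cell_id, signal_dbm FROM calls WHERE status = ? ORDER BY cell_id",
    ("dropped",),
).fetchall()

# Downstream analysis runs on the small result set, not on the whole table.
avg_signal = mean(signal for _, signal in dropped)
```

The analysis step sees two rows instead of five; at the scale discussed in the interview, the same filter is what would avoid moving a hundred million rows into the application.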

Eric Kavanagh: That’s great stuff. Let’s just talk about a couple more of these key issues because I think it really is in the nexus of all these bits of functionality that the magic of a solution like this comes out. You talk about alerts and actions to monitor various data streams and create and qualify alerts. Well, that’s what you need if you want real-time results. You want to be able to set up filters that are monitoring various data streams, like we keep mentioning dropped calls, well that’s a really big issue and it’s a very annoying part of any kind of customer experience with a phone when the call gets dropped. So when you notice the calls dropped number ticking up, there are several things going on. One, there’s some kind of infrastructure issue. Maybe there’s a big concert in downtown Austin, for example. And all these kids are hitting their cell phones at the same time, so you could probably alert some infrastructure people to make some adjustments to re-route some traffic. You could also hit the marketing department and say, “Hey! All these people are getting dropped calls. Let’s maybe send them a coupon for $5 or something.” Right?

Mike Hummel: Yes. I mean alert and action is an integral part of every IoT solution. Most companies use CEP engines. We are not competing with CEP engines. CEP engines are really good if you need milliseconds or microsecond response times on events. But in the case that you described, and we see a hell of a lot of them, where you have one or two seconds or even longer to analyze the situation with the rule and maybe compare the current situation with the history and say, “Oh! The dropped calls are more than 50% over the average of a Monday morning between 9 and 11 o’clock.” That is a far more valuable insight than just say, at the moment I have more dropped calls than the limit one of the engineers set. You can bring current data in relation to historical data and that is our selling point on the alert and action.
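A rule of the form "more than 50% over the historical average" can be written down as a short Python sketch. The function name `exceeds_baseline` and the sample counts are assumptions for illustration, not part of any ParStream rule syntax.

```python
from statistics import mean

def exceeds_baseline(current, history, threshold=0.5):
    """True if `current` is more than `threshold` (default 50%) above the historical mean."""
    baseline = mean(history)
    return current > baseline * (1 + threshold)

# Hypothetical dropped-call counts for previous Monday mornings, 9 to 11 o'clock.
monday_mornings = [40, 44, 38, 42]  # mean is 41, so the alert level is 61.5

alert = exceeds_baseline(65, monday_mornings)     # 65 is above 61.5
no_alert = exceeds_baseline(55, monday_mornings)  # 55 is below 61.5
```

Unlike a fixed limit set once by an engineer, the threshold here moves with the history of the same time slot, which is the comparison between current and historical data described above.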

Eric Kavanagh: None of these patterns make sense unless you have some context around it and that historical data is hugely useful in terms of being able to make those comparisons. It’s like an analog, right? You’re able to compare and contrast what was happening with what is happening, and that’s when you can notice some significant changes and that’s actually a great segue to what will be my last question here. You talked about Visualization Connect, which allows integration with Datawatch, a visualization tool for IoT analytics, which in many ways is the last mile of analysis. In many ways I call it the first mile, because that whole visual experience is what allows you to explore the data, mixing and matching data sets and looking for those patterns, looking for those outliers. That visualization part is really important, right?

Mike Hummel: Absolutely. Yes. ParStream works with a number of visualization products very well. Because we support standard SQL as a query language, it works with basically every tool. But we specifically picked Datawatch for this announcement because they are somewhat unique. Datawatch has a so-called streaming interface, which means from the database, we can trigger Datawatch to update or refresh a diagram based on the data from the database.

So this means, if we see that new data is coming in for a certain area, we can trigger a refresh. Instead of a front-end visualization product having to poll the database every few seconds, we ensure that the most recent version is actually on the screen. And that reduces the network load, which reduces the load on the database and delivers a far better result. Therefore, we picked Datawatch to make this announcement together.
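The difference between polling and a database-triggered refresh can be shown with a small Python simulation. The classes `Chart` and `Store` are hypothetical stand-ins and do not represent the Datawatch streaming interface or any ParStream API; the sketch only counts how many refreshes each approach causes.

```python
class Chart:
    """Stand-in for a front-end visualization that redraws on refresh()."""

    def __init__(self):
        self.refreshes = 0
        self.latest = None

    def refresh(self, value):
        self.refreshes += 1
        self.latest = value

class Store:
    """Stand-in for the database; notifies subscribers when new data arrives."""

    def __init__(self):
        self.value = None
        self.subscribers = []

    def insert(self, value):
        self.value = value
        for callback in self.subscribers:
            callback(value)

# Ten time steps; new data arrives only at steps 2 and 7.
arrivals = {2: "reading-1", 7: "reading-2"}

# Polling: the chart asks the store at every step, whether or not data changed.
polled_store, polled_chart = Store(), Chart()
for step in range(10):
    if step in arrivals:
        polled_store.insert(arrivals[step])
    polled_chart.refresh(polled_store.value)

# Push: the store triggers a refresh only when data actually arrives.
pushed_store, pushed_chart = Store(), Chart()
pushed_store.subscribers.append(pushed_chart.refresh)
for step in range(10):
    if step in arrivals:
        pushed_store.insert(arrivals[step])
```

Both charts end up showing the same latest reading, but the polled chart made ten round trips while the pushed chart made two, which is the reduction in network and database load the answer refers to.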

Eric Kavanagh: That is fantastic stuff. Folks, we’ve been talking to Mike Hummel, CTO and co-founder of ParStream. Hop online to ParStream.com. I have to say this is very compelling stuff. So, congratulations Mike. I think you guys are on to something pretty hot here. Good job.

Mike Hummel: Thank you very much for this opportunity. Thank you.

Eric Kavanagh: You bet. We’ll talk to ParStream again early next year folks. So, thanks again to you Mike for your time today and like I say this is great stuff. Folks, you’ve been listening to Inside Analysis. Take care. Bye-bye.