Introduction
This (long!) blog is based on my forthcoming book: Data Science for Internet of Things.
It is also the basis for the course I teach, the Data Science for Internet of Things Course. I welcome your comments. Please email me at ajit.jaokar at futuretext.com. Email me also for a pdf version if you are interested in joining the course.
Here, we start off with the question: At which points could you apply analytics to the IoT ecosystem and what are the implications? We then extend this to a broader question: Could we formulate a methodology to solve Data Science for IoT problems? I have illustrated my thinking through a number of companies/examples. I personally work with an Open Source strategy (based on R, Spark and Python) but the methodology applies to any implementation. We are currently working with a range of implementations including AWS, Azure, GE Predix, Nvidia etc. Thus, the discussion is vendor agnostic.
I also mention some trends I am following, such as Apache NiFi.
The Internet of Things and the flow of Data
As we move towards a world of 50 billion connected devices, Data Science for IoT (IoT analytics) helps to create new services and business models. IoT analytics is the application of data science models to IoT datasets. The flow of data starts with the deployment of sensors. Sensors detect events or changes in quantities. They provide a corresponding output in the form of a signal. Historically, sensors have been used in domains such as manufacturing. Now their deployment is becoming pervasive through ordinary objects like wearables. Sensors are also being deployed through new devices like Robots and Self driving cars. This widespread deployment of sensors has led to the Internet of Things.
Features of a typical wireless sensor node are described in this paper (wireless embedded sensor architecture). Typically, data arising from sensors is in time series format and is often geotagged. This means there are two forms of analytics for IoT: Time series and Spatial analytics. Time series analytics typically leads to insights like Anomaly detection, so classifiers are commonly used in IoT analytics to detect anomalies. But by looking at historical trends, streaming, and combining data from multiple events (sensor fusion), we can get new insights. And more use cases for IoT keep emerging, such as Augmented reality (think – Pokemon Go + IoT).
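To make anomaly detection on a time series concrete, here is a minimal sketch. It does not use a trained classifier; instead it swaps in a simple rolling z-score test, which flags a reading that sits far from the mean of the previous few readings. The function name, the window size, the threshold and the sample temperatures are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate strongly from the preceding window."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # A perfectly flat window has sigma == 0, so no z-score can be computed
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        recent.append(value)
    return anomalies


# Hypothetical temperature readings with one obvious spike
temperatures = [20.1, 20.3, 19.9, 20.2, 20.0, 20.1, 35.0, 20.2, 20.1]
print(rolling_zscore_anomalies(temperatures))  # [6]
```

Note that the spike itself enters the window afterwards and inflates the spread, so a production system would probably exclude flagged readings from the baseline.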
Meanwhile, sensors themselves continue to evolve. Sensors have shrunk due to technologies like MEMS. Also, their communications protocols have improved through new technologies like LoRa. These protocols lead to new forms of communication for IoT such as Device to Device; Device to Server; or Server to Server. Thus, whichever way we look at it, IoT devices create a large amount of Data. Typically, the goal of IoT analytics is to analyse the data as close to the event as possible. We see this requirement in many ‘Smart city’ type applications such as Transportation, Energy grids, Utilities like Water, Street lighting, Parking etc.
IoT data transformation techniques
Once data is captured through the sensor, there are a few analytics techniques that can be applied to the Data. Some of these are unique to IoT. For instance, not all data may be sent to the Cloud/Lake. Considering the volume of Data, some may be discarded at source or summarized at the Edge. We could perform temporal or spatial analysis. Data could also be aggregated, and aggregate analytics could be applied to those aggregates at the ‘Edge’. For example, if you want to detect failure of a component, you could look for spikes in values for that component over a recent span (thereby potentially predicting failure). Also, you could correlate data across multiple IoT streams. Typically, in stream processing, we are trying to find out what is happening now (as opposed to what happened in the past). Hence, the response should be near real-time. Sensor data could also be ‘cleaned’ at the Edge: missing values could be filled in (imputing values), readings from several sensors could be combined to infer an event (Complex event processing), data could be normalized across sensors, time and devices, and we could handle different data formats, multiple communication protocols and thresholds.
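Three of the transformations above (imputing missing values, summarizing at the Edge, and looking for spikes) can be sketched in a few lines. This is an illustrative sketch only: forward fill is just one of several imputation strategies, and the function names, block size, limit and sample data are assumptions, not part of any particular product.

```python
from statistics import mean


def impute_forward(readings):
    """Replace None with the last seen value (forward fill); leading gaps stay None."""
    filled, last = [], None
    for value in readings:
        if value is None:
            value = last
        filled.append(value)
        last = value
    return filled


def summarize(readings, size):
    """Collapse each block of `size` readings into min/mean/max, skipping unfilled gaps."""
    summaries = []
    for start in range(0, len(readings), size):
        block = [v for v in readings[start:start + size] if v is not None]
        if block:
            summaries.append({"min": min(block), "mean": mean(block), "max": max(block)})
    return summaries


def spike_blocks(summaries, limit):
    """Indices of summary blocks whose maximum exceeds a fixed limit."""
    return [i for i, s in enumerate(summaries) if s["max"] > limit]


# Hypothetical vibration readings with two dropped samples and one spike
raw = [1.0, None, 1.2, 1.1, None, 1.0, 4.8, 1.1]
clean = impute_forward(raw)
blocks = summarize(clean, size=4)
print(clean)                            # [1.0, 1.0, 1.2, 1.1, 1.1, 1.0, 4.8, 1.1]
print(spike_blocks(blocks, limit=3.0))  # [1]
```

Only the two summary records, rather than the eight raw readings, would need to leave the Edge in this example.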
Applying IoT Analytics to the Flow of Data
Overview
Here, we address the possible locations and types of analytics that could be applied to IoT datasets.
Some initial thoughts:
IoT data arises from sensors and ultimately resides in the Cloud.
We use the concept of a ‘Data Lake’ to refer to a repository of Data
We consider four possible avenues for IoT analytics: ‘Analytics at the Edge’, ‘Streaming Analytics’, NoSQL databases and ‘IoT analytics at the Data Lake’
For Streaming analytics, we could build an offline model and apply it to a stream
If we consider cameras as sensors, Deep learning techniques could be applied to Image and video datasets (for example CNNs)
Even when IoT data volumes are high, not all scenarios need Data to be distributed. It is very much possible to run analytics on a single node, with a non-distributed architecture, using Python or R.
Feedback mechanisms are a key part of IoT analytics. Feedback is part of multiple IoT analytics modalities, e.g. Edge, Streaming etc.
CEP (Complex event processing) can be applied to multiple points as we see in the diagram
We now describe various analytics techniques which could apply to IoT datasets
Complex event processing
Complex Event Processing (CEP) can be used at multiple points for IoT analytics (e.g. Edge, Stream, Cloud etc.).
In general, Event processing is a method of tracking and analyzing streams of data and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.
In CEP, the data is in motion. In contrast, a traditional Query (e.g. against an RDBMS) acts on Static Data. Thus, CEP is mainly about Stream processing, but the algorithms underlying CEP can also be applied to historical data.
CEP relies on a number of techniques for events, including pattern detection, abstraction, filtering, aggregation and transformation. CEP algorithms model event hierarchies and detect relationships (such as causality, membership or timing) between events. They create an abstraction of event-driven processes. Thus, typically, CEP engines act as event correlation engines where they analyze a mass of events, pinpoint the most significant ones, and trigger actions.
Most CEP solutions and concepts can be classified into two main categories: Aggregation-oriented CEP and Detection-oriented CEP. An aggregation-oriented CEP solution is focused on executing on-line algorithms as a response to event data entering the system – for example, to continuously calculate an average based on data in the inbound events. Detection-oriented CEP is focused on detecting combinations of events called event patterns or situations – for example, looking for a specific sequence of events. For IoT, CEP techniques are concerned with deriving a higher order value / abstraction from discrete sensor readings.
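The two categories can be contrasted with a small sketch. The first part keeps a running average that is updated as each event arrives (aggregation-oriented). The second looks for an ordered sequence of event types inside a time limit (detection-oriented). This is not how any specific CEP engine works internally; the class, the function and the event names are hypothetical.

```python
class RunningAverage:
    """Aggregation-oriented: update an average as each event arrives."""

    def __init__(self):
        self.count = 0
        self.average = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean, so past events do not need to be stored
        self.average += (value - self.average) / self.count
        return self.average


def detect_sequence(events, pattern, within):
    """Detection-oriented: find `pattern` (event types, in order) within `within` seconds.

    `events` is a list of (timestamp, event_type) sorted by timestamp.
    Returns one (start_timestamp, end_timestamp) pair per match.
    """
    matches = []
    for i, (start_ts, kind) in enumerate(events):
        if kind != pattern[0]:
            continue
        step, end_ts = 1, start_ts
        for ts, next_kind in events[i + 1:]:
            if step == len(pattern) or ts - start_ts > within:
                break
            if next_kind == pattern[step]:
                step, end_ts = step + 1, ts
        if step == len(pattern):
            matches.append((start_ts, end_ts))
    return matches


avg = RunningAverage()
print([avg.update(v) for v in (10, 20, 30)])  # [10.0, 15.0, 20.0]

# Hypothetical event stream: (seconds, event type)
events = [(0, "temp_high"), (5, "vibration"), (12, "pressure_drop"),
          (40, "temp_high"), (90, "pressure_drop")]
print(detect_sequence(events, ["temp_high", "pressure_drop"], within=30))  # [(0, 12)]
```

The second "temp_high" at 40 seconds is not matched because the following "pressure_drop" arrives 50 seconds later, outside the 30 second limit.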
CEP uses techniques like Bayesian networks, neural networks, Dempster-Shafer methods, Kalman filters etc. Some more background at Developing a complex event processing architecture for IoT
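Of the techniques listed, the Kalman filter is the easiest to show in a few lines. The sketch below is a one-dimensional filter that assumes the underlying quantity is roughly constant and that each sensor reading is that quantity plus noise. The variance values and the sample readings are assumptions picked for illustration, not tuned parameters.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=0.5,
              initial_estimate=0.0, initial_var=1.0):
    """Smooth noisy scalar readings with a one-dimensional Kalman filter."""
    estimate, variance = initial_estimate, initial_var
    smoothed = []
    for z in measurements:
        # Predict: the state is assumed constant, so only the uncertainty grows
        variance += process_var
        # Update: blend prediction and measurement according to their uncertainties
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1 - gain)
        smoothed.append(estimate)
    return smoothed


# Hypothetical noisy readings around a true value of 5.0
noisy = [5.2, 4.7, 5.1, 4.9, 5.3, 4.8]
for raw, est in zip(noisy, kalman_1d(noisy, initial_estimate=5.0)):
    print(f"measured {raw:.1f} -> estimate {est:.2f}")
```

A larger measurement_var makes the filter trust each new reading less, which gives a smoother but slower-moving estimate.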
Streaming analytics
Real-time systems differ in the way they perform analytics. Specifically, Real-time systems perform analytics on short time windows for Data Streams. Hence, the scope of Real Time analytics is a ‘window’, which typically comprises the last few time slots. Making Predictions on Real Time Data streams involves building an Offline model and applying it to a stream. Models incorporate one or more machine learning algorithms which are trained using the training Data. Models are first built offline based on historical data (for use cases such as Spam, Credit card fraud etc). Once built, the model can be validated against a real time system to find deviations in the real time stream data. Deviations beyond a certain threshold are tagged as anomalies.
IoT ecosystems can create many logs depending on the status of IoT devices. By collecting these logs for a period of time and analyzing the sequence of event patterns, a model to predict a fault can be built, including the probability of failure for the sequence. This model to predict failure is then applied to the stream (online). A technique like the Hidden Markov Model can be used for detecting failure patterns based on the observed sequence. Complex Event Processing can be used to combine events over a time frame (e.g. in the last minute) and correlate patterns to detect the failure pattern.
Typically, streaming systems could be implemented in Kafka and Spark.
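The offline-then-online pattern can be sketched without any streaming framework. In this deliberately simple version the "model" is only the mean and standard deviation of historical data, which stands in for whatever machine learning model would be trained in practice. The stream is then scored over a sliding window. All names, the window length, the threshold and the data are assumptions.

```python
from collections import deque
from statistics import mean, stdev


def fit_offline(history):
    """Build the offline 'model': here just the mean and spread of historical data."""
    # Assumes the history is not constant, otherwise std would be 0
    return {"mean": mean(history), "std": stdev(history)}


def score_stream(stream, model, window=3, threshold=3.0):
    """Yield (index, window_mean, is_anomaly) for each full window over the stream."""
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        recent.append(value)
        if len(recent) < window:
            continue
        window_mean = mean(recent)
        deviation = abs(window_mean - model["mean"]) / model["std"]
        yield i, window_mean, deviation > threshold


# Hypothetical historical readings and a live stream containing a sustained shift
history = [50, 52, 49, 51, 50, 48, 51, 49]
stream = [50, 51, 49, 50, 70, 72, 71, 50]

model = fit_offline(history)
flagged = [i for i, _, is_anomaly in score_stream(stream, model) if is_anomaly]
print(flagged)  # [4, 5, 6, 7]
```

Index 7 is still flagged even though the reading is back to normal, because the window still contains the two previous high values. That lag is the price of smoothing over a window.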
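As an illustration of the Hidden Markov Model idea, the sketch below runs the forward algorithm over a sequence of log events and returns the probability that the device is currently in a failing state. The two hidden states, the three log event types and every probability are invented for the example. In practice these parameters would be estimated from the collected logs.

```python
def forward_posterior(observations, states, start_p, trans_p, emit_p):
    """Forward algorithm: P(current hidden state | observations so far)."""
    # Start distribution weighted by how well each state explains the first observation
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
        # Normalise at each step to avoid underflow on long sequences
        total = sum(alpha.values())
        alpha = {s: a / total for s, a in alpha.items()}
    total = sum(alpha.values())
    return {s: a / total for s, a in alpha.items()}


# Hypothetical model parameters (assumptions, not learned from real logs)
states = ("healthy", "failing")
start_p = {"healthy": 0.95, "failing": 0.05}
trans_p = {
    "healthy": {"healthy": 0.90, "failing": 0.10},
    "failing": {"healthy": 0.05, "failing": 0.95},
}
emit_p = {
    "healthy": {"ok": 0.90, "warn": 0.09, "error": 0.01},
    "failing": {"ok": 0.30, "warn": 0.40, "error": 0.30},
}

for log in (["ok", "ok", "ok"], ["ok", "warn", "error", "error"]):
    p = forward_posterior(log, states, start_p, trans_p, emit_p)
    print(log, "-> P(failing) =", round(p["failing"], 3))
```

Applied online, the same function would be called on the most recent slice of the log, and an alert raised once P(failing) crosses a chosen threshold.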
Some interesting links on streaming I am tracking:
Newer versions of Kafka designed for IoT use cases
Data Science Central: stream processing and streaming analytics how it works
IoT 101 everything you need to know to start your IoT project – Part One
IoT 101 everything you need to know to start your IoT project – Part Two
Edge Processing
Many vendors like Cisco and Intel are proponents of Edge Processing (also called Edge computing). The main idea behind Edge Computing is to push processing away from the core and towards the Edge of the network. For IoT, that means pushing processing towards the sensors or a gateway. This enables data to be initially processed at the Edge device, possibly allowing smaller datasets to be sent to the core. Devices at the Edge may not be continuously connected to the network. Hence, these devices may need a copy of the master data/reference data for processing in an offline format. Edge devices may also provide other capabilities, such as:
• Apply rules and workflow against that data
• Take action as needed
• Filter and cleanse the data
• Store local data for local use
• Enhance security
• Provide governance admin controls
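Several of the capabilities in the list above (filtering, applying a rule, taking action, storing data locally while offline) can be sketched as a small gateway class. This is a toy illustration, not a real gateway API: the class name, the valid range, the alert rule and the send callback are all hypothetical.

```python
from collections import deque


class EdgeGateway:
    """Toy edge node: cleanse readings, apply a local rule, buffer while offline."""

    def __init__(self, valid_range, alert_above, buffer_size=1000):
        self.valid_range = valid_range
        self.alert_above = alert_above
        # Local store; the oldest readings are dropped once the buffer is full
        self.buffer = deque(maxlen=buffer_size)
        self.alerts = []

    def ingest(self, reading):
        low, high = self.valid_range
        # Filter and cleanse: drop missing or physically implausible values
        if reading is None or not (low <= reading <= high):
            return False
        # Apply a rule and take action locally, without a round trip to the core
        if reading > self.alert_above:
            self.alerts.append(reading)
        self.buffer.append(reading)
        return True

    def flush(self, send):
        """When connectivity returns, send one summary instead of every reading."""
        if not self.buffer:
            return None
        summary = {
            "count": len(self.buffer),
            "min": min(self.buffer),
            "max": max(self.buffer),
            "mean": sum(self.buffer) / len(self.buffer),
        }
        send(summary)
        self.buffer.clear()
        return summary


gateway = EdgeGateway(valid_range=(-40.0, 125.0), alert_above=80.0)
for reading in [21.5, None, 22.0, 999.0, 85.0]:
    gateway.ingest(reading)

sent = []                   # stands in for an upload to the core
gateway.flush(sent.append)
print(gateway.alerts)       # [85.0]
print(sent[0]["count"])     # 3
```

Five raw readings arrive, two are discarded at the Edge, one triggers a local alert, and a single summary record is sent to the core.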
IoT analytics techniques applied at the Data Lake
Data Lakes
The concept of a Data Lake is similar to that of a Data warehouse or a Data Mart. In this context, we see a Data Lake as a repository for data from different IoT sources. A Data Lake is typically driven by the Hadoop platform. This means Data in a Data Lake is preserved in its raw format. Unlike a Data Warehouse, Data in a Data Lake is not pre-categorised. From an analytics perspective, Data Lakes are relevant in the following ways:
We could monitor the stream of data arriving in the lake for specific events or could correlate different streams. Both of these tasks use Complex event processing (CEP). CEP could also apply to Data when it is stored in the lake to extract broad, historical perspectives.
Similarly, Deep learning and other techniques could be applied to IoT datasets in the Data Lake when the Data is ‘at rest’. We describe these below.
ETL (Extract Transform and Load)
Companies like Pentaho are applying ETL techniques to IoT data
Deep learning
Some deep learning techniques could apply to IoT datasets. If you consider images and video as sensor data, then we could apply various convolutional neural network techniques to this data.
It gets more interesting when we consider RNNs (Recurrent Neural Networks) and Reinforcement learning. For example – Reinforcement learning and time series – Brandon Rohrer How to turn your house robot into a robot – Answering the challenge – a new reinforcement learning robot
Over time, we will see far more complex options – for example, for Self driving cars and the use of Recurrent neural networks (Mobileye)
Some more interesting links for Deep Learning and IoT:
http://stats.stackexchange.com/questions/8000/proper-way-of-using-recurrent-neural-network-for-time-series-analysis
Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference – https://clgiles.ist.psu.edu/papers/MLJ-finance.pdf
https://www.quora.com/Can-recurrent-neural-networks-with-LSTM-be-used-for-time-series-prediction
Time series forecasting with recurrent neural networks http://www.neural-forecasting-competition.com/downloads/NN3/methods/24-NN3_Mahmoud_Abunasr_competition_complete.pdf
and an article by Sibhanjan Das and me on Deep Learning and IoT with H2O @kdnuggets
Optimization
System-level and process-level optimization for IoT is another complex area where we are doing work. Some links for this:
http://blog.tsia.com/blog/how-iot-process-optimization-can-improve-customer-outcomes
http://www.intel.co.uk/content/dam/www/public/us/en/documents/white-papers/industrial-optimizing-manufacturing-with-iot-paper.pdf
Visualization
Visualization is necessary for analytics in general and IoT analytics is no exception
Here are some links
iotdataviz and also
Nick Bearman’s work on R and Spatial analytics
NoSQL databases
NoSQL databases today offer a great way to implement IoT analytics. For instance,
Apache Cassandra for IoT
MongoDB and IoT tutorial
Other IoT analytic techniques
In this section, I list some IoT technologies where we could implement analytics
Spatial analytics for IoT – e.g. from ESRI
Implementation of IoT analytics on DSPs
Augmented reality – e.g. in conjunction with Pokemon Go
GPUs – there is some extensive work done by Nvidia in this space, especially in relation to Deep learning
I am also following Apache NiFi with great interest, with features like Bidirectional flow of data, perimeter of control changes and data provenance
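For the spatial analytics item in the list above, a very small example of what analytics on geotagged readings can look like is a geofence: keep only the readings that fall within a given distance of a point. The sketch uses the haversine great-circle formula with a mean Earth radius of 6371 km. The sensor ids and coordinates are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, adequate for a sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def readings_within(readings, centre, radius_km):
    """Keep the geotagged readings that lie inside a circular geofence."""
    return [r for r in readings
            if haversine_km(r["lat"], r["lon"], centre[0], centre[1]) <= radius_km]


# Hypothetical geotagged sensor readings
readings = [
    {"id": "a", "lat": 51.753, "lon": -1.257, "value": 17.2},
    {"id": "b", "lat": 51.900, "lon": -1.000, "value": 16.8},
]
nearby = readings_within(readings, centre=(51.752, -1.258), radius_km=1.0)
print([r["id"] for r in nearby])  # ['a']
```

Dedicated spatial tools offer far more (indexing, polygons, map projections), but the same filter-by-location step sits underneath many of those use cases.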
A Methodology to solve Data Science for IoT problems
We started off with the question: At which points could you apply analytics to the IoT ecosystem and what are the implications? But behind this work is a broader question: Could we formulate a methodology to solve Data Science for IoT problems? I am exploring this question as part of my teaching both online and at Oxford University along with Jean-Jacques Bernard.
Here is more on our thinking:
CRISP-DM is a Data mining process methodology used in analytics. More on CRISP-DM HERE and HERE (pdf documents).
From a business perspective (top down), we can extend CRISP-DM to incorporate the understanding of the IoT domain, i.e. add domain specific features. This includes understanding the business impact, handling high volumes of IoT data, understanding the nature of Data coming from various IoT devices etc.
From an implementation perspective (bottom up), once we have an understanding of the Data and the business processes, for each IoT vertical we first find the analytics (what is being measured, optimized etc). We then find the data needed for those analytics. Then we provide examples of that implementation using code. Extending CRISP-DM to an implementation methodology, we could have Process (workflow), templates, code, use cases, Data etc.
For implementation in R, we are looking to initially use Open source R and Spark and the h2o.ai API
Make the methodology practical by considering high volume IoT data problems, Project management methodologies for IoT, IoT analytics best practices etc
Most importantly, we will be Open and Open Sourced
There are some parallels between this thinking and the Big Data Business maturity model index, and also with Systems thinking – my favourite systems thinking text is An Introduction to General Systems Thinking – Gerald M. Weinberg
When used in teaching for my course, it has parallels to the Concept-context pedagogy (pdf), where the concepts are tied to the practice in terms of Projects which take center stage
Conclusion
We started off with the question: At which points could you apply analytics to the IoT ecosystem and what are the implications? We then extended this to a broader question: Could we formulate a methodology to solve Data Science for IoT problems? The above is comprehensive but not absolute. For example, you can implement deep learning algorithms on mobile devices (Qualcomm Snapdragon machine learning development kit for mobile devices). So, even as I write it, I can think of exceptions!
This article is part of my forthcoming book on Data Science for IoT and also the courses I teach.
I welcome your comments. Please email me at ajit.jaokar at futuretext.com. Email me also for a pdf version if you are interested. If you want to be a part of my course, please see the testimonials at Data Science for Internet of Things Course.