In Part I, I covered the use cases, scale, ingestion and storage. In Part II, I will cover the real-time streaming of data to longer-term storage for analytics, as well as the configuration of Power BI for real-time Q&A analytics.
Once the data arrives in an Event Hub, it won’t stay for long. Event Hubs have a configurable retention period of 1-7 days, so you need to “read” the data out fairly quickly. To do this, you need a service that can scale to millions of records per second and offer the flexibility to transform the data along the way with lightweight aggregation. For this, Azure Stream Analytics is the perfect solution.
The Stream Analytics service has built-in integration with Event Hubs and processes ingested events in real time, comparing multiple real-time streams. It can also compare real-time streams with historical values and models, which enables the detection of anomalies and the ability to trigger an alert when a specific error or condition appears in the stream.
How it works
Stream Analytics works on the old IPO pattern: Input, Process, Output. First, you configure your Input. In my case: the Event Hub where my data is coming in. Next, you configure an Output. The options here are SQL database, Blob storage, Event Hub, Power BI or Table storage. With this flexibility, you can flow data into a relational database, into files, or even into a data warehouse or Hadoop (HortonWorks HDInsight). Lots of options, but for my example I chose Azure SQL because it offered me the most flexibility for running SQL queries and “seeing” the data. Hadoop would likely be a stronger big data option. It is worth noting that you can configure multiple Inputs and multiple Outputs here.
In between Input and Output is Process. The Process configuration is nothing more than a Stream Analytics Query Language statement, similar to an ETL pattern if you are familiar with that. Stream Analytics Query Language is a subset of T-SQL, so it looks a lot like a database SQL query. Here is mine, to give you an idea. You can see how this allows you to perform some level of aggregation (SUM, AVG, etc.) as the data flows through.
The CASE statement you see handles the fact that I had multiple versions of my IoT data-generating application, some of which returned NULL values.
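For readers following along, here is a sketch of what a query in this shape can look like. The input/output aliases, field names, window size and the CASE logic are all my illustrative assumptions, not the original query:

```sql
-- Hypothetical Stream Analytics query: reads from an Event Hub input,
-- aggregates per device over a 10-second tumbling window, substitutes
-- 0 for NULL readings via CASE, and writes to a SQL database output.
SELECT
    DeviceId,
    System.Timestamp AS WindowEnd,
    AVG(CASE WHEN CPU IS NULL THEN 0 ELSE CPU END) AS AvgCpu,
    SUM(BytesSent) AS TotalBytesSent
INTO
    [sql-output]          -- Output alias configured in the portal
FROM
    [eventhub-input]      -- Input alias configured in the portal
    TIMESTAMP BY EventTime
GROUP BY
    DeviceId,
    TumblingWindow(second, 10)
```

`TumblingWindow` divides the stream into fixed, non-overlapping time buckets, so each device produces exactly one aggregated row every 10 seconds.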
Long-term Data Storage
As mentioned, Event Hubs have a configurable retention of 1-7 days, so you need to read the data off into a longer-term storage solution. If using Stream Analytics, you have the option of configuring a number of Outputs, as mentioned above. For my example, I chose Azure SQL Database.
Azure SQL Database
There are only a few easy requirements to set up in order for data to flow through Stream Analytics into a SQL database. Because the SQL database is already in Azure, I did not need to share any “connection string” or password with the Stream Analytics service; it’s all UI-driven, drop-down configuration. I just needed to configure my SQL database server to allow other Azure services to connect through its firewall.
Other than that, I only needed to ensure that my database table schema matched the schema in my Stream Analytics query. Everything else went smoothly.
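To make the schema-matching point concrete, here is a hypothetical Azure SQL table whose columns line up with the fields a Stream Analytics query might emit. The table name, column names and types are my illustrative assumptions:

```sql
-- Hypothetical destination table: each column matches, by name and
-- compatible type, a field produced by the Stream Analytics SELECT.
CREATE TABLE dbo.DeviceMetrics (
    DeviceId       NVARCHAR(64) NOT NULL,  -- device identifier
    WindowEnd      DATETIME2    NOT NULL,  -- end of the aggregation window
    AvgCpu         FLOAT        NOT NULL,  -- windowed average CPU
    TotalBytesSent BIGINT       NOT NULL   -- windowed sum of bytes sent
);
```

If a column name or type in the table does not match the query output, rows will fail to write, so it is worth double-checking this mapping before starting the job.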
After moving data to a longer-term storage medium, any number of scenarios could occur. I’ll list a few here to spark some ideas:
- move data into Hadoop (HDInsight) for further calculation/processing
- move data back on-premises for long-term storage (destroy cloud copy)
- leave transactional data in cloud (low-cost), aggregate subsets of the data and move that data…
- etc. etc.
Data does not need to live in the cloud forever.
Think of IoT as enabling the rapid, massive scale needed to collect millions of data points per second, something that would be difficult and costly ($ and people) to do on-premises. However, once you have the data, can aggregate and compute on it, and eventually act upon it, it’s up to you what you do with it next. Bringing it back on-prem can be a very strategic decision that most cost-effectively utilizes your on-prem Hadoop or storage environments.
I knew when I started that I wanted to use Power BI as the analytics and visualization tool. While you can configure Power BI as an Output directly from Stream Analytics, I added the extra step of writing the data to a SQL database for demo purposes. I wanted to test scalability and additional aggregation steps, and frankly, just to see it work.
In Power BI, I created two dashboards. For simplicity, let’s call them “Real-Time Dashboard” and “Data Investigation Dashboard”. The real-time dashboard shows me data refreshed in near real time (seconds), and the data investigation dashboard enables Power BI Q&A: natural-language ad-hoc queries to gain insights into correlations in the data.
Real-Time Dashboard
Power BI has the ability to connect directly to an Azure SQL database. Because there is no additional “ingestion” step for Power BI, it can execute queries directly against the SQL database and offer me a real-time dashboard that is easily configured via drag & drop.
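As a sketch of the kind of query a dashboard tile might run directly against the database (the table and column names are hypothetical, carried over from nothing in particular in my actual setup):

```sql
-- Hypothetical tile query: average CPU per aggregation window over
-- the last 15 minutes, newest first, capped at 100 rows.
SELECT TOP (100)
    WindowEnd,
    AVG(AvgCpu) AS AvgCpu
FROM dbo.DeviceMetrics
WHERE WindowEnd >= DATEADD(MINUTE, -15, SYSUTCDATETIME())
GROUP BY WindowEnd
ORDER BY WindowEnd DESC;
```

Because queries like this run on demand, the dashboard reflects whatever Stream Analytics has most recently written, with no separate import step.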
Data Investigation Dashboard
Power BI can also accept an Excel Power Pivot workbook or Power BI Designer file as a way to connect to the data. For this dashboard, I used Excel Power Query to load my Azure SQL database connection into a Power Pivot workbook (aka the Data Model). From there, I could add things like synonyms. For example, my data field for CPU was “CPU”, but I added a synonym of “processor” for natural-language Q&A. I also added a few new calculated columns for date/time formatting, etc.
It is important to note that in Power Query, I specified that NO DATA is actually loaded into Excel. Therefore, you will not run into row or file size limitations, or anything like that. Data is refreshed in Power BI about every 15 minutes (manually or on a schedule).
Once the file was created, I stored it on my OneDrive for Business and created my second dashboard in Power BI. Keeping it in ODB allows me to edit it without having to re-upload it to Power BI. It also allows me to easily share dashboards with colleagues.
I now have the ability to execute queries like these to build out my dashboard:
“show average processor over time as line chart” OR
“show average RAM by computer submitted today”