With periodic batch-processing analytics unable to provide the up-to-the-second information required for making instantaneous decisions on the factory floor, in retail stores or in process control, enterprises are now rapidly adopting real-time streaming analytics. The traditional cloud-based storage model is now giving way to something new: in-memory analytics processing of big data streams.
One company specializing in this new area is Striim, based in Palo Alto, Calif. Striim does real-time data collection, in-memory processing of data, in-memory analytics and delivery of data to different targets on premises or in the cloud. Company co-founder and CTO Steve Wilkes recently gave an exclusive interview to SearchCloudApplications.
How does in-memory processing of data for real-time streaming analytics, versus first writing to storage, change the way we need to think about applications?
Steve Wilkes: It turns out we are producing way too much data to store and analyze afterwards. You have to think about getting information from the data to start with -- what is important about this data? It could be that you are condensing down lots of individual data points to salient information, or it could be that you're trying to find a bigger trend. When you think about real-time processing, you have to decide: What is it that I need to know right now, and what is it about my data that could make it volatile or perishable? There are certain types of data that are perishable, that lose their value almost immediately.
We do have programming languages that are well-suited for number crunching -- R, Python and SQL among them. What is your recommendation when it comes to in-memory streaming analytics?
Wilkes: We do all the processing in a SQL-like language. Anyone familiar with SQL -- almost every developer, data scientist, business analyst and even some executives -- can get value out of the data by writing real-time, in-memory queries against continuous streams of [incoming] data. When you think about a stream of data, it's like a particle beam from an accelerator. It's hard to see a single particle, because they all move so fast. But if you put things into separate windows -- say, the last minute's worth of data or last hundred events -- then you can do familiar aggregate functions on those. And you can join them -- 'If I see something in this stream, was there something that occurred in this other data source within 30 seconds of it?'
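To make the window and join ideas concrete, here is a minimal Python sketch. It is not Striim's SQL-like language; the function names `last_n_average` and `join_within`, the event layout and the sample data are all assumptions made for illustration. The first function runs an aggregate over the last N events, and the second pairs events from two sources that occur within a given number of seconds of each other.

```python
from collections import deque


def last_n_average(values, n):
    """Yield the average of the most recent n values after each new event."""
    window = deque(maxlen=n)  # the oldest value drops out automatically
    for value in values:
        window.append(value)
        yield sum(window) / len(window)


def join_within(left, right, max_gap_seconds):
    """Pair each left event with right events at most max_gap_seconds away.

    Both inputs are lists of (timestamp_seconds, payload) tuples.
    This nested loop is for clarity only; a real engine would index by time.
    """
    matches = []
    for left_ts, left_payload in left:
        for right_ts, right_payload in right:
            if abs(left_ts - right_ts) <= max_gap_seconds:
                matches.append((left_payload, right_payload))
    return matches


# Hypothetical sample data.
readings = [10, 20, 30, 40]
averages = list(last_n_average(readings, n=2))

alerts = [(100, "login_failed")]
changes = [(90, "password_reset"), (200, "profile_update")]
pairs = join_within(alerts, changes, max_gap_seconds=30)
```

With this sample data, only the password reset falls within 30 seconds of the failed login, so it is the only pair returned.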
This conjures up the scenario of monitoring a steady stream of data from an in-flight jet engine that's functioning essentially as an internet of things (IoT) edge device. There's more data than you could possibly need, but you need some immediately if an anomaly occurs.
Wilkes: Whether it's an airliner, car or oil rig, there are two types of data that you are streaming into the cloud or your data center. One is data that you may want to collect and aggregate for later statistical analysis. But there are also things -- emergencies -- that you need to react to immediately. By moving some of the analytics to the edge, you can react immediately.
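The two kinds of data described here can be sketched as a single pass at the edge that raises alerts at once and keeps only a compact summary for later upload. This is a rough Python illustration under assumed conditions: the function name `process_at_edge`, the fixed threshold and the temperature readings are hypothetical, and a real anomaly check would likely be more sophisticated than one limit.

```python
def process_at_edge(readings, limit):
    """Split readings into immediate alerts and a summary for later analysis."""
    alerts = []
    count = 0
    total = 0.0
    peak = None
    for value in readings:
        if value > limit:
            alerts.append(value)  # emergency: react now, at the edge
        # Everything also feeds a small aggregate that can be shipped later.
        count += 1
        total += value
        peak = value if peak is None else max(peak, value)
    summary = {
        "count": count,
        "mean": total / count if count else None,
        "max": peak,
    }
    return alerts, summary


# Hypothetical engine temperature readings with one spike.
alerts, summary = process_at_edge([610, 615, 900, 620], limit=800)
```

Only the spike is surfaced immediately; the rest of the stream is reduced to three numbers before it leaves the device.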
Some data can be batched, such as processing a day's retail sales transactions. But other data -- equities trades for one -- must be handled in true real time. The problem is that most stream processing is not fully real-time but does microbatching at the millisecond level.
Wilkes: There are situations in which you need to apply microbatching technology, because you may be dealing with the last minute's worth of data to calculate a moving average. That is an in-memory window. You can apply microbatch-driven queries to a true event-driven microarchitecture, so you should start with that and apply microbatching as you need to.
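A moving average over the last minute of data can be maintained one event at a time, which is one way to read the point about starting from an event-driven design. The Python sketch below is an assumption-laden illustration, not Striim's implementation: the class name `MovingAverage`, the use of timestamps in seconds and the sample events are all hypothetical.

```python
from collections import deque


class MovingAverage:
    """Average of the values seen in the last span_seconds, updated per event."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record one event and return the current in-window average."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have aged out of the window.
        while self.events[0][0] <= timestamp - self.span:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


# Hypothetical events: (timestamp in seconds, value).
avg = MovingAverage(span_seconds=60)
results = [avg.add(t, v) for t, v in [(0, 10.0), (30, 20.0), (70, 30.0)]]
```

Each call does a small, bounded amount of work, so the result is available as soon as the event arrives rather than at the end of a batch interval.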
You speak about the intersection of IoT, enterprise and cloud. What are the dynamics of that?
Wilkes: The cloud side is enterprises using scalable, expandable, elastic architectures. You need to get data into the cloud, and we would recommend doing that in a real-time, streaming fashion. There are also security aspects, especially if you think of the cloud as an extension to your data center. The IoT side can be cloud-based, especially consumer IoT, but with the majority of IoT, you are talking about devices that may be on premises -- on a factory floor, in retail or in a hospital. You want to do some of the processing on premises, but you may also want to move large data sets into the cloud for machine learning or deep analytics. For us, as a streaming analytics platform, these intersections are key, because they imply you are moving stuff around.
When you think about IoT infrastructure, you have to think about what processing do we do where. You can do some processing in the devices themselves, but there's only so much you can do. Then, you can think about edge gateways and the processing you can do there. These miniservers can be quite powerful. They scale well, and you can have a lot of them. Then, you have on-premises requirements, maybe to handle data that you're not allowed to move into the cloud -- personally identifiable or [medical] patient information. But you can anonymize patient data and push that into the cloud to do machine learning.
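One possible shape for the on-premises step is to strip direct identifiers and replace the patient ID with a keyed hash before a record leaves the site. The Python sketch below is illustrative only. The field names, the `pseudonymize` function and the key are hypothetical, and a keyed hash is strictly pseudonymization, so it may not satisfy a given regulation's definition of anonymized data on its own.

```python
import hashlib
import hmac

# Hypothetical names of fields that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def pseudonymize(record, secret_key):
    """Return a copy of record that is safer to send off premises.

    Direct identifiers are dropped and patient_id is replaced with a keyed
    hash, so records for the same patient can still be grouped in the cloud.
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    digest = hmac.new(
        secret_key,
        str(record["patient_id"]).encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    cleaned["patient_ref"] = digest
    return cleaned


# Hypothetical record; the key would stay on premises.
record = {"patient_id": 1234, "name": "Jane Doe", "heart_rate": 72}
safe = pseudonymize(record, secret_key=b"on-premises-secret")
```

Because the key never leaves the site, the cloud side can train on the measurements without being able to map `patient_ref` back to a patient ID.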
You can discover trends for which you were never looking in the first place.
Wilkes: That's the real beauty of unsupervised machine learning. By using statistical and neural net approaches and other algorithms, machine learning can spot things that a human couldn't, because you'd have to correlate too many different things.
What is changing for mobile and cloud application developers?
Wilkes: It's becoming easier. One thing that was preventing people from thinking about real-time analytics was the notion that doing things in real time was going to be more expensive, that it was a big infrastructure investment. That has changed. The reduced cost of memory has allowed people to do more in memory as opposed to cheap storage where you throw data on disk and process it afterwards. Platforms like ours can help with the plumbing, scalability, security and base functionality for doing analytics.
Developers can pull together integrations that weren't possible before.
Wilkes: If you think of IoT as just another source of data, you can now combine that in memory, in real time, with other information assets. There's a difference between streaming data from a device and doing analysis on that and combining that same data with other asset databases. Now, you have much more information and can make better predictions.
With memory costs continuing to drop and processing growing more powerful, where will real-time streaming analytics be a year from now?
Wilkes: People can think about moving more existing batch processing into the streaming world. Think of end-of-day processing that could take several hours [from disk] -- if you can move that into memory with streaming analytics, the instant the business day ends, you can have your output. And you can do this throughout the day to see current sales, web traffic and what people are looking at in real time. That allows you to react as things are happening rather than after the fact, which you would do with batch.