In 2003, Nagui Halim built and led a large interdisciplinary research team that created a new architecture for high-speed, adaptive stream processing and analytics. To manage and analyze massive volumes of continuous data streams, Halim developed the foundational concepts and designed the architecture of a new computing platform, which came to be called System S and was launched in 2009. Named an IBM Fellow in 2011 for those achievements, Halim is today the director of the IBM Streams business unit. He spoke exclusively with SearchCloudApplications.
How far have we come from System S?
Nagui Halim: We've come a tremendous way. In 2003, it was a whiteboard discussion about fundamentals. My vision was to create a technology that would have durability, and last for many decades and be for streaming what database products are to repositories and persistence. I wanted something that would be able to address every conceivable streaming problem that could be defined.
How does streaming analytics differ from complex event processing?
Halim: Complex event processing is a programming model that says if you're going to program something, you're going to do it in this way. Streaming is a superset that deals with a lot of other issues -- scaling of the problem, system issues, reliability, elasticity of resources, changing problem sets and actual machine learning or open analysis on the data that comes in. We do speech or video processing that goes way beyond what a complex event processing system would do.
In thinking about streaming analytics from the perspective of developers writing the applications that leverage data, what must they know in terms of real-time data ingestion?
Halim: Streaming problems are different from classic analytics problems in that the data is constantly moving. Because different mathematical techniques are required, IBM developed a special-purpose language for streams, called SPL (Streams Processing Language). The idea was to express directly in a new programming paradigm what you're doing with windowing and aggregation, the dynamics of adding streams and removing them, and connecting to new parts of the application. The platform could take the language constructs or the programs that people wrote and do a lot of optimization under the covers related to streaming. This way, we could get into low-latency application performance characteristics and also look at high availability.
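The windowed aggregation Halim mentions can be sketched in plain Python. This is a minimal illustration of a count-based sliding-window operator, not actual SPL or the IBM Streams API; the class name and window policy are invented for the example.

```python
from collections import deque

class SlidingWindowAverage:
    """Illustrative streaming operator: keeps the last N tuples and
    emits a running average each time a new tuple arrives."""

    def __init__(self, window_size):
        # A bounded deque evicts the oldest tuple automatically,
        # mimicking a count-based sliding-window eviction policy.
        self.window = deque(maxlen=window_size)

    def process(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

# Feed a stream of prices through the operator one tuple at a time.
op = SlidingWindowAverage(window_size=3)
averages = [op.process(p) for p in [10, 20, 30, 40]]
# windows seen: [10], [10,20], [10,20,30], [20,30,40]
print(averages)  # [10.0, 15.0, 20.0, 30.0]
```

In a real streams language, declaring the window and aggregation at the language level (rather than hand-coding the deque) is what lets the runtime optimize placement, parallelism and failover under the covers.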
How has IBM addressed the proprietary nature of SPL in an age of open source?
Halim: SPL gave us flexibility in terms of what we could do under the covers, but the problem with being a proprietary language is that we didn't succeed in interesting the general community in developing alternatives. We did add the ability to program the Streams platform with Java, and we're working on additional language support. That's relatively new.
What does SPL provide for developers who want reusable components or libraries for specific streaming scenarios?
Halim: To be effective in creating applications, you need what we've loosely called toolkits, the availability of very sophisticated analytical packages that can run as part of an application. When you are handed a system or a platform, be it Java or SPL, and you're asked to do a call center application, you don't want to start off by having to write a speech-processing program. Instead, you want to have a speech-processing component to the overall application. Over the last half-decade, we developed a number of these SPL toolkits, including financial services and geospatial.
Latency is a huge concern. Apache Spark uses micro-batching and doesn't stand up to scrutiny for true streaming analytics. Don't you need a technology like Apache Flink, or the new IBM Quarks for the Internet of Things?
Halim: Flink is a true streaming platform. Spark is batching, but with small batches. You couldn't use Spark for high-speed financial trading, because you'd have to wait for quantities of information to arrive before you could do any processing. It's not clear to me why you would choose Spark now that we have ways of doing true streaming.
Applications also need to handle different types of streams simultaneously.
Halim: You might want to take structured digital streams and combine the analysis with unstructured things. For example, a stock trader will look at the newswire feeds concurrently to see if there are news items that could affect trading or stock prices, or have a bearing on the fundamentals of a particular company. These things that combine structured data analysis with unstructured video or speech can give you new insight. One principle of streaming analytics is that the more context you can have from a variety of different sources, the better your results will be.
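The stock-trader scenario is essentially a streaming correlation: enriching each structured trade with the unstructured context seen so far. The sketch below is an invented illustration (the event tuples and function name are not from any IBM product), showing one time-ordered stream mixing news and trade events.

```python
def correlate(events):
    """events is a time-ordered mix of ('news', ticker, headline) and
    ('trade', ticker, price) tuples. Each trade is emitted together
    with the headlines already seen for its ticker."""
    headlines = {}
    out = []
    for ev in events:
        if ev[0] == "news":
            _, ticker, headline = ev
            headlines.setdefault(ticker, []).append(headline)
        else:
            _, ticker, price = ev
            out.append((ticker, price, list(headlines.get(ticker, []))))
    return out

stream = [
    ("news", "ACME", "ACME recalls flagship product"),
    ("trade", "ACME", 101.5),
    ("trade", "OTHR", 42.0),
]
print(correlate(stream))
# [('ACME', 101.5, ['ACME recalls flagship product']), ('OTHR', 42.0, [])]
```

In practice, the unstructured side would first pass through text, speech or video analytics to extract the ticker and sentiment; the join itself is the easy part.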
What is the most difficult issue for a company considering streaming analytics?
Halim: Business processes may need to change to adjust to the availability of real-time conclusions. If you're running a semiconductor manufacturing plant, and you weren't used to getting information about a particular set of tools that were out of alignment, and all of a sudden, you have that, you need to figure out how the organization is going to respond to that information.
What about DevOps considerations?
Halim: These applications tend to run continuously, so there has to be good DevOps to make sure all systems around the core streaming system are working properly and have good recovery capabilities built-in. Having this continuous operation across the whole enterprise is a stressful thing, but it's necessary once you move into the continuous 24/7 domain.