Amazon made a big splash in December with the release of Kinesis. Although it has many of the limitations of a relatively new offering, we can make some observations.
Amazon Kinesis is a managed, scalable, cloud-based service that allows real-time processing of large data streams. A major benefit of Kinesis is the speed with which it can be provisioned and scaled -- it is no exaggeration to say that the service can begin in seconds. The flexibility inherent in Kinesis gives it the ability to absorb data from anything that can call a Web service.
Kinesis is designed for real-time data; all data is purged after 24 hours. Applications built with the Kinesis Client Library keep track of their position in a stream using DynamoDB, the NoSQL database from Amazon Web Services (AWS). Kinesis currently provides four methods to process this rapidly collected data: a Kinesis API application, a Kinesis Client Library application, an Elastic MapReduce (EMR) Connector and the Kinesis Storm Spout.
What is Kinesis good for?
Amazon Kinesis would be helpful for any organization that takes advantage of real-time or near real-time access to large stores of data. Examples include:
- Collecting log data from an application every couple of minutes, listing recent errors by number and identifying queries that are running slow, which allows administrators to see issues before they affect application performance.
- Tracking customers' clickstreams within an application, by customer or in aggregate, in real time to see how functionality is being used.
- Having an available data stream that provides flexibility for future requirements.
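The log-monitoring example in the list above can be sketched in a few lines of Python. This is only an illustration of the kind of processing a Kinesis application might do on a batch of records: the record layout (`level`, `message`, `query`, `duration_ms`) and the 500 ms threshold are assumptions made for the example, not anything Kinesis defines.

```python
from collections import Counter

# Hypothetical record format: each log record is a dict with a "level" and
# a "message"; query records also carry a "query" and a "duration_ms".
SLOW_QUERY_MS = 500  # assumed threshold, tune per application


def summarize(records, slow_ms=SLOW_QUERY_MS):
    """Return (error counts, slow queries) for one batch of log records."""
    errors = Counter(
        r["message"] for r in records if r.get("level") == "ERROR"
    )
    slow = [
        (r["query"], r["duration_ms"])
        for r in records
        if "query" in r and r.get("duration_ms", 0) > slow_ms
    ]
    # Most frequent errors first, slowest queries first.
    return errors.most_common(), sorted(slow, key=lambda q: q[1], reverse=True)


batch = [
    {"level": "ERROR", "message": "timeout"},
    {"level": "INFO", "message": "ok", "query": "SELECT a", "duration_ms": 40},
    {"level": "ERROR", "message": "timeout"},
    {"level": "ERROR", "message": "bad input"},
    {"level": "INFO", "message": "ok", "query": "SELECT b", "duration_ms": 900},
]
top_errors, slow_queries = summarize(batch)
```

Run every couple of minutes against the newest records, a summary like this gives administrators the error counts and slow queries described above.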
Divide and conquer
With Kinesis, the data collection process is separate from data processing. Systems that input data via a Web service call are referred to as "producers." These producers push data into "streams." Many Kinesis applications can consume data from one stream without affecting one another. This separation of collection from processing makes it easy to adapt reporting and actions as information needs change.
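To make the separation concrete, here is a toy in-memory model of a stream with one producer loop and two independent applications. It is not the Kinesis API; the `Stream` and `Application` classes and their methods are invented for this sketch. The point it illustrates is that each application keeps its own read position, so one reader's progress has no effect on another's.

```python
class Stream:
    """Toy in-memory stand-in for a stream: an append-only list of records."""

    def __init__(self):
        self._records = []

    def put(self, record):
        # Producers only append; they know nothing about consumers.
        self._records.append(record)

    def read(self, position, limit=10):
        # Reading never removes data, so readers cannot disturb each other.
        return self._records[position:position + limit]


class Application:
    """A consumer that tracks its own position in the stream."""

    def __init__(self, name, stream, handler):
        self.name = name
        self.stream = stream
        self.handler = handler
        self.position = 0
        self.results = []

    def poll(self, limit=10):
        batch = self.stream.read(self.position, limit)
        self.results.extend(self.handler(r) for r in batch)
        self.position += len(batch)
        return len(batch)


stream = Stream()
for click in ["home", "search", "cart", "checkout"]:
    stream.put(click)

audit = Application("audit", stream, lambda r: r.upper())
sizes = Application("sizes", stream, len)

audit.poll(limit=2)   # audit has processed two records so far
sizes.poll()          # sizes reads all four, unaffected by audit
```

Adding a third application later means writing a new handler and starting it at whatever position it needs; the producer code does not change.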
However, all of this functionality comes at a cost. Some plugins, such as the Log4J Appender, reduce the need for code changes on the producer side, but, overall, Kinesis does not have many. There is a similar shortage of plugins for processing. Only the Amazon EMR connector and the Apache Storm spout are currently available; otherwise, custom code will be necessary, whether offered by Amazon or developed privately.
Goin' to the library
AWS recommends building applications with the Kinesis Client Library rather than calling the API directly. The library -- currently available in Java -- provides patterns to write a fault-tolerant, distributed application faster than other approaches. Amazon also recommends that the application be deployed within an Elastic Compute Cloud (EC2) auto-scaling group; however, app creators can host it anywhere they choose -- even on-premises.
A data stream's capacity is measured in "shards." Each shard is metered to 1 MB per second of writes and 2 MB per second of reads. Scalability is handled by increasing the number of shards per stream; however, Kinesis has no means of automatically changing the number of shards, so administrators must either gauge the volume of data that producers will send accurately or adjust the number of shards manually. Further, any Kinesis application must be designed to handle changes to the shards. This is another reason why Amazon recommends the Client Library over direct API calls for custom applications.
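Because shard counts are set by hand, a rough sizing calculation is worth doing up front. The sketch below is one way to do it, not an official formula. The per-shard limits are parameters rather than fixed values; the defaults used here (1 MB per second of writes, 2 MB per second of reads) are the figures from AWS's published shard limits and should be checked against current documentation. The function name `shards_needed` is made up for this example.

```python
import math

# Assumed per-shard limits, in MB per second. Pass different values if
# the limits that apply to your stream differ.
WRITE_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_s, read_mb_s,
                  write_limit=WRITE_MB_PER_SHARD,
                  read_limit=READ_MB_PER_SHARD):
    """Smallest shard count that covers both write and read throughput."""
    if write_mb_s < 0 or read_mb_s < 0:
        raise ValueError("throughput must be non-negative")
    by_write = math.ceil(write_mb_s / write_limit)
    by_read = math.ceil(read_mb_s / read_limit)
    # A stream always has at least one shard.
    return max(1, by_write, by_read)


# Producers write 3.5 MB/s; two applications each read the full stream,
# so total reads are 2 * 3.5 = 7 MB/s.
needed = shards_needed(write_mb_s=3.5, read_mb_s=7.0)
```

Note that reads scale with the number of applications attached to the stream, so adding a consumer can require more shards even when the volume of incoming data is unchanged.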
Real-time or near real-time analytics of big data is a fairly new field, but there are two pieces to Kinesis where alternatives do exist: the producer side (collection and persistence of data) and the processing side (Kinesis applications). On the producer side, there are Apache Kafka and database platforms such as Redis. On the processing side are Apache Storm, which Kinesis can use for processing data via the Kinesis Storm Spout, and Apache Spark. Another option would be offerings from vendors such as Cloudera.
Amazon has taken some of its core technology and produced a bare-bones offering in Kinesis. It can excel for organizations that require custom, real-time data reporting and analysis. Although it is relatively easy to attach applications as producers to a stream, doing so has nontrivial implications, such as sizing shards by hand and writing custom processing code. If past experience is any indication of what will happen with Kinesis, then over time, connectors will become available to spare most people many integration challenges.
About the author:
Casey Benko is the president of BLT Global Ventures LLC. BLT works with companies to leverage cloud services like AWS, Salesforce and Zuora.