In this post, I'll briefly discuss the evolution of data stream processing frameworks and their current state. A famous paper by Stonebraker et al. a few years ago established some guidelines for truly streaming, fault-tolerant computational frameworks. The mind map drawn in this post can be used as a rough checklist to select or discard a streaming framework based on your application requirements.
Similarities
- Most of them achieve cluster-mode execution via YARN or Mesos and use ZooKeeper for cluster state management (for master-slave high availability).
- Many of them prepare a directed acyclic graph (DAG) of the computation, which is presented to the cluster's master node; the master then decides how to perform the computation in the most efficient and parallel manner (see the sketch after this list).
- Apache Kafka is commonly employed as a fault-tolerant event source (though these frameworks provide connectors for many other data sources too).
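To make the DAG idea concrete, here is a minimal sketch using Apache Storm's topology builder, where spouts and bolts form the nodes of the graph. SentenceSpout, SplitSentenceBolt and WordCountBolt are hypothetical user classes; only the builder and submitter calls are Storm's own API (package names as of Storm 1.x).

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class DagSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Nodes of the DAG: a source (spout) and two processing stages (bolts).
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("splitter", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");                   // edge: sentences -> splitter
        builder.setBolt("counter", new WordCountBolt(), 4)
               .fieldsGrouping("splitter", new Fields("word")); // edge: splitter -> counter

        // The assembled DAG is handed to the cluster; the master node (Nimbus,
        // in Storm's case) schedules its parallel tasks across the workers.
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```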
The Differences...
One of the expectations in Stonebraker's paper from a stream processing framework is an "exactly-once delivery guarantee": on system failure, the system ensures that already processed events are not re-processed unless they are explicitly replayed from the source.
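To make the guarantee concrete, here is a framework-agnostic sketch of what an application must do for itself when the framework does not provide exactly-once semantics: remember which event IDs have already been applied and skip re-deliveries. All names here are illustrative, and a real system would persist this bookkeeping transactionally rather than in memory.

```java
import java.util.HashSet;
import java.util.Set;

public class ExactlyOnceSketch {
    private final Set<String> processedIds = new HashSet<>();
    private long runningTotal = 0;

    /** Applies the event at most once, even if the source re-delivers it after a failure. */
    public void onEvent(String eventId, long amount) {
        if (!processedIds.add(eventId)) {
            return; // duplicate delivery during replay/recovery: skip it
        }
        runningTotal += amount; // mutable state updated exactly once per event
    }

    public long total() {
        return runningTotal;
    }
}
```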
As we can see, only a few frameworks today guarantee that. Apache Storm does not provide that guarantee, so if exactly-once processing is your application's requirement, Storm will not be a good choice: it leaves it to developers to correctly handle mutable state during recovery from errors. Storm Trident overcomes that limitation, but Trident is micro-batching applied over a streaming approach, and its mostly inflexible micro-batch size can cause latency overheads when there is some back-pressure. Note that Twitter has replaced Storm with Heron due to Storm's many limitations.
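For illustration, a rough Trident sketch adapted from Storm's well-known word-count example: Trident groups tuples into small batches internally and updates state once per batch, transactionally, which is where its exactly-once behaviour comes from. FixedSentenceSpout and Split are hypothetical user classes, and MemoryMapState is an in-memory state intended for testing only.

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class TridentSketch {
    public static void buildTopology() {
        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", new FixedSentenceSpout())
                // Trident transparently chops the incoming stream into small batches.
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // State is updated once per batch, which yields exactly-once
                // semantics without the developer managing mutable state.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
    }
}
```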
Apache Spark offers an efficient approach to batch processing, applying micro-batching over the data set in a parallel manner, but it suffers from higher latency overheads in comparison to pure streaming frameworks (like Storm or Trident). This limitation is addressed in Spark's streaming module (aka Spark Streaming), which is micro-batching applied over continuously streaming data. The Spark Streaming approach, however, also suffers from the fixed micro-batch size issue, which adds to latency when there is back-pressure.
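The fixed micro-batch size is easy to see in code: the batch interval is set once when the streaming context is created and applies to every stream in the job. A minimal sketch using Spark's Java API (hostname and port are placeholders):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("sketch").setMaster("local[2]");

        // Every DStream in this context is chopped into fixed 1-second
        // micro-batches; the interval cannot adapt to back-pressure at runtime.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        lines.count().print(); // per-batch record counts

        jssc.start();
        jssc.awaitTermination();
    }
}
```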
Another requirement of a streaming framework is to provide operator-level state management, which is needed to correctly reinstate the previous valid state of the computational graph after error recovery.
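A framework-agnostic sketch of what operator-level state management amounts to: the runtime periodically asks each operator for a snapshot of its state and, on recovery, hands the last valid snapshot back instead of restarting the graph from scratch. All names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// A counting operator whose state can be snapshotted and restored by the runtime.
public class StatefulCountOperator {
    private Map<String, Long> counts = new HashMap<>();

    public void process(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Called by the runtime at a checkpoint: copy out the current state. */
    public Map<String, Long> snapshotState() {
        return new HashMap<>(counts);
    }

    /** Called by the runtime on recovery: replace state with the last snapshot. */
    public void restoreState(Map<String, Long> snapshot) {
        counts = new HashMap<>(snapshot);
    }
}
```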
Apache Samza is a good choice when it comes to unique-ID-based event partitioning, and it supports both continuous stream processing and micro-batching. It does not, however, always guarantee exactly-once delivery when a streaming source like Kafka is in use: operator state management in that case is not transactional, because Kafka writes are non-transactional.
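For a flavour of Samza's low-level API, here is a sketch of a task that keys its state by a unique event ID in a local key-value store (backed by a Kafka changelog for recovery). The store name "seen-ids" is hypothetical and would be declared in the job configuration, and exact signatures may vary across Samza versions.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class DedupeTask implements StreamTask, InitableTask {
    private KeyValueStore<String, String> seen;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        // Local state store; its changelog lives in a Kafka topic for recovery.
        seen = (KeyValueStore<String, String>) context.getStore("seen-ids");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String id = (String) envelope.getKey();
        if (seen.get(id) == null) {
            seen.put(id, "1");
            // ... process the message; because the state write and the Kafka
            // write are not one transaction, a crash in between can still
            // cause a replayed duplicate, hence no exactly-once guarantee.
        }
    }
}
```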
The most recent entries in the data streaming and analytics space are Apache Flink and Google Cloud Dataflow, which, among other goodies, provide an exactly-once guarantee as well as robust operator state management. In addition, they make the developer's life easier by providing higher-level API abstractions to express business logic for computation and windowing.
Flink deserves a special mention for providing separate APIs for continuous streaming computations (the Flink DataStream API) and for batching requirements (the Flink DataSet API).
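A brief sketch written against the Flink 1.x Java DataStream API: keying, windowing and aggregation are expressed directly as chained operators, which is the higher-level abstraction mentioned above. The batch-oriented DataSet API looks much the same, with ExecutionEnvironment in place of StreamExecutionEnvironment.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FlinkStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           // anonymous class instead of a lambda so Flink can infer tuple types
           .map(new MapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public Tuple2<String, Integer> map(String word) {
                   return Tuple2.of(word, 1);
               }
           })
           .keyBy(0)                     // partition the stream by word
           .timeWindow(Time.seconds(5))  // windowing expressed directly in the API
           .sum(1)                       // per-window counts
           .print();

        env.execute("windowed-word-count");
    }
}
```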