In this post, I'll briefly discuss the evolution of data stream processing frameworks and their current state. A famous paper by Stonebraker et al. a few years ago established some guidelines for truly streaming, fault-tolerant computational frameworks. The mind map drawn in this post can be used as a rough checklist to select or discard a streaming framework based on your application requirements.
Similarities
- Most of them achieve cluster-mode execution via YARN or Mesos and use ZooKeeper for cluster state management (for master-slave high availability).
- Many of them prepare a directed acyclic graph (DAG) of the computation, which is presented to the cluster's master node; the master then decides how to perform the computation in the most efficient and parallel manner (see the sketch after this list).
- Apache Kafka is commonly employed as a fault-tolerant event source (though these frameworks provide connectors for many other data sources too).
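To make the DAG idea concrete, here is a minimal sketch using Apache Storm's topology builder, where spouts and bolts form the nodes of the graph. SentenceSpout, SplitSentenceBolt and WordCountBolt are hypothetical user classes; only the builder and submitter calls are Storm's own API (package names as of Storm 1.x).

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class DagSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Nodes of the DAG: a source (spout) and two processing stages (bolts).
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("splitter", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");                   // edge: sentences -> splitter
        builder.setBolt("counter", new WordCountBolt(), 4)
               .fieldsGrouping("splitter", new Fields("word")); // edge: splitter -> counter

        // The assembled DAG is handed to the cluster; the master node (Nimbus,
        // in Storm's case) schedules its parallel tasks across the workers.
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```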
The Differences...
One of the expectations in Stonebraker's paper from a stream processing framework is an "exactly-once delivery guarantee": on system failure, the system ensures that already processed events are not re-processed unless they are explicitly replayed from the source.
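To make the guarantee concrete, here is a framework-agnostic sketch of what an application must do for itself when the framework does not provide exactly-once semantics: remember which event IDs have already been applied and skip re-deliveries. All names here are illustrative, and a real system would persist this bookkeeping transactionally rather than in memory.

```java
import java.util.HashSet;
import java.util.Set;

public class ExactlyOnceSketch {
    private final Set<String> processedIds = new HashSet<>();
    private long runningTotal = 0;

    /** Applies the event at most once, even if the source re-delivers it after a failure. */
    public void onEvent(String eventId, long amount) {
        if (!processedIds.add(eventId)) {
            return; // duplicate delivery during replay/recovery: skip it
        }
        runningTotal += amount; // mutable state updated exactly once per event
    }

    public long total() {
        return runningTotal;
    }
}
```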
As we can see, only a few frameworks today guarantee that. Apache Storm does not provide that guarantee, so if exactly-once processing is your application's requirement, Storm will not be a good choice: it leaves it to developers to correctly handle mutable state during recovery from errors. Storm Trident overcomes that limitation, but Trident is micro-batching applied over a streaming approach, and its mostly inflexible micro-batch size can cause latency overheads when there is some back-pressure. Note that Twitter has replaced Storm with Heron due to Storm's many limitations.
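For illustration, a rough Trident sketch adapted from Storm's well-known word-count example: Trident groups tuples into small batches internally and updates state once per batch, transactionally, which is where its exactly-once behaviour comes from. FixedSentenceSpout and Split are hypothetical user classes, and MemoryMapState is an in-memory state intended for testing only.

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class TridentSketch {
    public static void buildTopology() {
        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", new FixedSentenceSpout())
                // Trident transparently chops the incoming stream into small batches.
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // State is updated once per batch, which yields exactly-once
                // semantics without the developer managing mutable state.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
    }
}
```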
Apache Spark offers an efficient approach to batch processing, applying micro-batching over the data set in a parallel manner, but it suffers from higher latency overheads in comparison to pure streaming frameworks (like Storm or Trident). This limitation is addressed in Spark's streaming module (aka Spark Streaming), which is micro-batching applied over continuously streaming data. The Spark Streaming approach, however, also suffers from the fixed micro-batch size issue, which adds to latency when there is back-pressure.
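The fixed micro-batch size is easy to see in code: the batch interval is set once when the streaming context is created and applies to every stream in the job. A minimal sketch using Spark's Java API (hostname and port are placeholders):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("sketch").setMaster("local[2]");

        // Every DStream in this context is chopped into fixed 1-second
        // micro-batches; the interval cannot adapt to back-pressure at runtime.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        lines.count().print(); // per-batch record counts

        jssc.start();
        jssc.awaitTermination();
    }
}
```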
Another requirement of a streaming framework is to provide operator-level state management, which is needed to correctly reinstate the previous valid state of the computational graph after error recovery.
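A framework-agnostic sketch of what operator-level state management amounts to: the runtime periodically asks each operator for a snapshot of its state and, on recovery, hands the last valid snapshot back instead of restarting the graph from scratch. All names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// A counting operator whose state can be snapshotted and restored by the runtime.
public class StatefulCountOperator {
    private Map<String, Long> counts = new HashMap<>();

    public void process(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Called by the runtime at a checkpoint: copy out the current state. */
    public Map<String, Long> snapshotState() {
        return new HashMap<>(counts);
    }

    /** Called by the runtime on recovery: replace state with the last snapshot. */
    public void restoreState(Map<String, Long> snapshot) {
        counts = new HashMap<>(snapshot);
    }
}
```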
Apache Samza is a good choice when it comes to unique-ID-based event partitioning, and it supports both continuous stream processing and micro-batching. It does not, however, always guarantee exactly-once delivery when a streaming source like Kafka is in use: operator state management in that case is not transactional, because Kafka writes are non-transactional.
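For a flavour of Samza's low-level API, here is a sketch of a task that keys its state by a unique event ID in a local key-value store (backed by a Kafka changelog for recovery). The store name "seen-ids" is hypothetical and would be declared in the job configuration, and exact signatures may vary across Samza versions.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class DedupeTask implements StreamTask, InitableTask {
    private KeyValueStore<String, String> seen;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        // Local state store; its changelog lives in a Kafka topic for recovery.
        seen = (KeyValueStore<String, String>) context.getStore("seen-ids");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String id = (String) envelope.getKey();
        if (seen.get(id) == null) {
            seen.put(id, "1");
            // ... process the message; because the state write and the Kafka
            // write are not one transaction, a crash in between can still
            // cause a replayed duplicate, hence no exactly-once guarantee.
        }
    }
}
```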
The most recent entries in the data streaming and analytics space are Apache Flink and Google Cloud Dataflow, which, among other goodies, provide an exactly-once guarantee as well as robust operator state management. In addition, they make the developer's life easier by providing higher-level API abstractions to express business logic for computation and windowing.
Flink deserves a special mention for providing separate APIs for continuous streaming computations (the Flink DataStream API) and for batching requirements (the Flink DataSet API).
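A brief sketch written against the Flink 1.x Java DataStream API: keying, windowing and aggregation are expressed directly as chained operators, which is the higher-level abstraction mentioned above. The batch-oriented DataSet API looks much the same, with ExecutionEnvironment in place of StreamExecutionEnvironment.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FlinkStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           // anonymous class instead of a lambda so Flink can infer tuple types
           .map(new MapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public Tuple2<String, Integer> map(String word) {
                   return Tuple2.of(word, 1);
               }
           })
           .keyBy(0)                     // partition the stream by word
           .timeWindow(Time.seconds(5))  // windowing expressed directly in the API
           .sum(1)                       // per-window counts
           .print();

        env.execute("windowed-word-count");
    }
}
```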