
Mindmap for data stream analysis frameworks



In this post, I'll briefly discuss the evolution of data stream processing frameworks and their current state. A well-known paper by Stonebraker et al. ("The 8 Requirements of Real-Time Stream Processing", 2005) established guidelines for truly streaming, fault-tolerant computational frameworks. The mind map drawn in this post can be used as a rough checklist to select or discard a streaming framework based on your application's requirements.

Similarities


While there are differences in how these frameworks process events, they share some commonalities:

  • Most of them achieve cluster-mode execution via YARN or Mesos and use ZooKeeper for cluster state management (for master-slave high availability).
  • Many of them prepare a directed acyclic graph (DAG) of the computation, which is presented to the cluster's master node; the master decides how to execute it in the most efficient and parallel manner (see the topology sketch after this list).
  • Apache Kafka appears as the commonly employed fault-tolerant event source, though these frameworks provide connectors for many other data sources too.
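
To make the DAG idea concrete, here is a sketch of the classic Storm word-count topology in Java. SentenceSpout, SplitSentenceBolt and WordCountBolt are hypothetical spout/bolt classes assumed to be defined elsewhere; each setSpout/setBolt call adds a node to the graph and each grouping adds an edge:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Nodes of the DAG, with a parallelism hint for each
        builder.setSpout("sentences", new SentenceSpout(), 2);   // assumed spout class
        builder.setBolt("split", new SplitSentenceBolt(), 4)     // assumed bolt class
               .shuffleGrouping("sentences");                    // edge: sentences -> split
        builder.setBolt("count", new WordCountBolt(), 4)         // assumed bolt class
               .fieldsGrouping("split", new Fields("word"));     // edge: split -> count, partitioned by word

        Config conf = new Config();
        conf.setNumWorkers(3);
        // The serialized DAG is submitted to the cluster master (Nimbus),
        // which schedules the parallel instances across worker nodes.
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}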


 

The Differences...




One of the expectations in Stonebraker's paper from a stream processing framework is an "exactly-once delivery guarantee": after a system failure, the framework must ensure that already-processed events are not re-processed unless they are explicitly replayed from the source.
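
To make the guarantee concrete, here is a minimal, framework-agnostic Java sketch (not any real framework's API) of how a consumer can turn replayed deliveries into no-ops by recording the last applied offset atomically with its state:

import java.util.HashMap;
import java.util.Map;

// Illustration only: exactly-once *effect* is achieved by recording the last
// applied event offset together with the state, so a replayed event is
// detected and skipped rather than applied twice.
public class ExactlyOnceSketch {
    private final Map<String, Long> counts = new HashMap<>();
    private long lastAppliedOffset = -1;   // persisted together with counts in a real system

    public synchronized void onEvent(long offset, String word) {
        if (offset <= lastAppliedOffset) {
            return;   // duplicate delivered during recovery; already applied
        }
        counts.merge(word, 1L, Long::sum);
        lastAppliedOffset = offset;   // state + offset must be stored in one transaction
    }
}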

As we can see, only a few frameworks today guarantee that. Apache Storm does not, so if exactly-once processing is one of your application's requirements, Storm is not a good choice; it leaves it to developers to correctly handle mutable state when recovering from errors. Storm Trident overcomes that limitation, but Trident is micro-batching applied over streaming, and its mostly inflexible micro-batch size can cause latency overheads when there is back-pressure. Note that Twitter has replaced Storm with Heron due to Storm's limitations.
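
For a flavor of Trident's model, here is a sketch along the lines of the word-count example in the official Trident tutorial. FixedBatchSpout and MemoryMapState are Storm's testing utilities, so treat this as illustrative rather than production code:

import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentWordCount {

    // Tokenizer function, as in the Trident tutorial
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology build() {
        // Test spout that emits sentences in small batches (the "micro batches")
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // persistentAggregate updates the count state transactionally
                // per micro batch, which is what restores exactly-once semantics
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology.build();
    }
}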

Apache Spark offers an efficient approach for batch processing, applied over the data set in a parallel manner, but it suffers from higher latency overheads in comparison to pure streaming frameworks like Storm.

This limitation of Spark is addressed by its streaming module (aka Spark Streaming), which applies micro-batching over continuously streaming data. Spark Streaming, too, suffers from the fixed micro-batch size issue, which adds to latency when there is back-pressure.
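
The fixed batch interval is baked in when the streaming context is created, as this minimal Spark Streaming sketch in Java shows (host, port and the one-second interval are arbitrary choices for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FixedBatchExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("FixedBatchExample");
        // The batch interval is fixed at context creation time (1 second here);
        // every incoming event waits for the current batch to close, which is
        // the latency floor discussed above.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<Long> counts = lines.count();   // one count per 1-second micro batch
        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}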

Another requirement for a streaming framework is operator-level state management, which is needed to correctly reinstate the previous valid state of the computational graph after error recovery.
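
To illustrate what that requirement asks of each node in the graph, here is a hypothetical, framework-neutral interface sketch (not any real framework's API):

// The framework periodically snapshots each operator's state and, after a
// failure, restores the last consistent snapshot before replaying events.
public interface StatefulOperator<IN, OUT> {
    OUT process(IN event);
    byte[] snapshotState();              // called by the framework at checkpoint time
    void restoreState(byte[] snapshot);  // called once during recovery
}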

Apache Samza is a good choice when it comes to unique-ID-based event partitioning, and it supports continuous stream processing as well as micro-batching. It does not, however, always guarantee exactly-once delivery when a streaming source like Kafka is in use: operator state management is then not transactional, because Kafka writes are non-transactional.
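
A minimal sketch of Samza's low-level StreamTask API illustrates the model; the output topic name here is an assumption for illustration:

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Samza routes each Kafka partition to exactly one task instance, so events
// sharing a partition key are processed in order by the same task.
public class PartitionedTask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "enriched-events"); // assumed topic

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        Object key = envelope.getKey();
        Object event = envelope.getMessage();
        // Output goes back to Kafka; since these writes are not transactional,
        // a crash between processing and checkpointing can replay events,
        // hence at-least-once rather than exactly-once delivery.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, key, event));
    }
}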

The most recent entries in the data streaming and analytics space are Apache Flink and Google Cloud Dataflow, which, among other goodies, provide the exactly-once guarantee as well as robust operator state management. In addition, they make the developer's life easier by providing higher-level API abstractions to express business logic for computation and windowing.
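
In Flink, for instance, the exactly-once guarantee is driven by periodic checkpoints of operator state. Here is a minimal sketch of enabling it (the five-second interval and socket source are arbitrary illustrative choices):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a consistent snapshot of all operator state every 5 seconds;
        // on failure Flink restores the last snapshot and replays from there,
        // giving exactly-once state semantics.
        env.enableCheckpointing(5000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        env.socketTextStream("localhost", 9999)
           .map(String::toUpperCase)
           .print();

        env.execute("checkpointed-job");
    }
}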

Flink deserves a special mention for providing separate APIs for continuous streaming computations (the Flink DataStream API) and for batching requirements (the Flink DataSet API).
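
A brief sketch of the two APIs side by side (sources and operations chosen only for illustration):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoApis {
    public static void main(String[] args) throws Exception {
        // Batch: DataSet API over a bounded collection
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> words = batchEnv.fromElements("stream", "batch", "stream");
        System.out.println(words.count());   // triggers batch execution

        // Streaming: DataStream API over an unbounded source
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = streamEnv.socketTextStream("localhost", 9999);
        lines.print();
        streamEnv.execute("streaming-job");
    }
}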

Twitter Heron, as mentioned earlier, is the new replacement for Storm at Twitter and provides all the expected benefits, though it is not open source.
