
Mindmap for data stream analysis frameworks



In this post, I'll briefly discuss the evolution timeline of data stream processing frameworks and their current state. A famous paper by Stonebraker et al. a few years ago established some guidelines for truly streaming, fault-tolerant computational frameworks. The mind map drawn in this post can be used as a rough checklist to select or discard a streaming framework based on your application's requirements.

Similarities


While there are differences in how streaming frameworks process events, they share some commonalities:

  • Most of them achieve cluster-mode execution via YARN or Mesos and use ZooKeeper for cluster state management (for master-slave high availability).
  • Many of them prepare a directed acyclic graph (DAG) of the computation, which is presented to the master node in the cluster; the master decides how to perform the computation in the most efficient and parallel manner (see the sketch after this list).
  • Apache Kafka commonly appears as a fault-tolerant event source (though these frameworks provide connectors for many other data sources too).
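
To make the DAG idea concrete, here's a rough sketch using Flink's DataStream API (any of these frameworks would do; the class name, the socket source and the host/port are just picked for illustration). Each chained transformation becomes a vertex in the job graph that the client hands to the master (Flink calls it the JobManager) for parallel scheduling:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class WordCountDag {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)
               .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                   @Override
                   public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                       for (String word : line.toLowerCase().split("\\W+")) {
                           if (!word.isEmpty()) {
                               out.collect(Tuple2.of(word, 1));
                           }
                       }
                   }
               })
               .keyBy(t -> t.f0)  // hash-partition the stream by word across parallel tasks
               .sum(1)            // per-key running count
               .print();

            // Nothing has run yet: execute() builds the DAG from the transformations
            // above and submits it to the master for parallel scheduling.
            env.execute("word-count-dag");
        }
    }

The interesting part is that the program only declares the graph; how it is split into parallel tasks and placed on the cluster is the master's job.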


 

The Differences...




One of the expectations Stonebraker's paper sets for a stream processing framework is an "exactly-once delivery guarantee", which means that on system failure the system will make sure already-processed events are not re-processed unless they are explicitly replayed from the source.
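
To see why this is hard, here's a toy, framework-free sketch (my own illustration, not from the paper): a consumer that turns at-least-once redelivery into effectively-once processing by de-duplicating on a unique event ID. Real frameworks have to persist this bookkeeping transactionally together with the state it protects, which is exactly the difficult part:

    import java.util.HashSet;
    import java.util.Set;

    // Toy illustration: effectively-once processing on top of at-least-once delivery.
    public class DedupingConsumer {

        private final Set<String> seenIds = new HashSet<>();  // real systems persist this
        private long runningTotal = 0;

        // Returns true if the event was applied, false if it was a replayed duplicate.
        public boolean process(String eventId, long amount) {
            if (!seenIds.add(eventId)) {
                return false;  // already processed; safe to ignore the redelivery
            }
            runningTotal += amount;
            return true;
        }

        public long total() {
            return runningTotal;
        }
    }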

As we can see, only a few frameworks today guarantee that. Apache Storm does not provide that guarantee, so if exactly-once processing is your application's requirement, Storm will not be a good choice: Storm leaves it to developers to correctly handle mutable state when recovering from an error. Storm Trident overcomes that limitation, but Trident is micro-batching applied over a stream, and its mostly inflexible micro-batch size can cause latency overheads when there is back-pressure. Note that Storm has been replaced by Heron at Twitter due to its many limitations.
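
To illustrate what "leaves it to developers" means, here's a hypothetical Storm bolt (package names as in Storm 2.x). Its in-memory counter is lost on a worker crash, and replayed tuples get counted again, so correct recovery would require the developer to add persistence and de-duplication themselves:

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class CountingBolt extends BaseRichBolt {

        private OutputCollector collector;
        private long count;  // mutable state: gone after a worker crash

        @Override
        public void prepare(Map<String, Object> topoConf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            count++;               // at-least-once: a replayed tuple is counted twice
            collector.ack(tuple);  // a crash before this ack makes the spout replay the tuple
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no downstream stream declared for this sketch
        }
    }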

Apache Spark offers an efficient approach for batch processing by applying micro-batching over the data set in a parallel manner, but it suffers from higher latency overheads in comparison to pure streaming frameworks (like Storm).

This limitation of Spark is addressed in Spark's streaming module (aka Spark Streaming), which is micro-batching applied over continuously streaming data. The Spark Streaming approach, though, also suffers from the fixed micro-batch size issue, which adds latency when there is back-pressure.
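
Here's a minimal sketch of that with the classic DStream API; note that the batch interval (one second here, an arbitrary choice) is fixed when the context is created, so every record waits for its batch boundary no matter what:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class MicroBatchDemo {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("micro-batch-demo");

            // The batch interval is baked in at context creation time.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            JavaDStream<Long> counts = lines.count();  // one count per one-second micro batch
            counts.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }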

Another requirement of a streaming framework is operator-level state management, which is needed to correctly reinstate the previous valid state of the computational graph after error recovery.
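
Jumping ahead a little to Flink (covered below) for a concrete example of operator-level state: keyed state declared like this is snapshotted in checkpoints and restored after a failure, so the operator resumes from its last consistent value. A sketch with illustrative names:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Hypothetical operator: keeps a per-key running sum in Flink-managed state.
    public class RunningSum extends RichFlatMapFunction<Long, Long> {

        private transient ValueState<Long> sum;  // checkpointed by the framework

        @Override
        public void open(Configuration parameters) {
            sum = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("running-sum", Long.class));
        }

        @Override
        public void flatMap(Long value, Collector<Long> out) throws Exception {
            Long current = sum.value();  // null on the very first event for a key
            long next = (current == null ? 0L : current) + value;
            sum.update(next);            // restored from the last checkpoint on recovery
            out.collect(next);
        }
    }

It would be applied after a keyBy(...), e.g. stream.keyBy(...).flatMap(new RunningSum()); the developer never writes recovery code, the framework reinstates the state.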

Apache Samza is a good choice when it comes to unique-ID-based event partitioning, and it supports continuous stream processing as well as micro-batching. It does not always guarantee exactly-once delivery when a streaming source like Kafka is in use, though: operator state management in that case is not transactional, because Kafka writes are non-transactional.
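
For reference, this is roughly what Samza's low-level API looks like (the system and topic names are made up for the example). The comment marks the gap described above:

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class CountTask implements StreamTask {

        private static final SystemStream OUTPUT = new SystemStream("kafka", "counts");
        private long count = 0;

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            count++;
            // The output write below and the input-offset checkpoint are separate,
            // non-transactional steps; a crash between them replays input events,
            // so downstream may see duplicates: at-least-once, not exactly-once.
            collector.send(new OutgoingMessageEnvelope(OUTPUT, Long.toString(count)));
        }
    }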

The most recent entries in the data streaming and analytics space are Apache Flink and Google Cloud Dataflow, which among other goodies also provide the exactly-once guarantee as well as robust operator state management. In addition, they make the developer's life easier by providing higher-level API abstractions to express business logic for computation and windowing.
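
As a taste of those higher-level abstractions, here's a sketch of a tumbling processing-time window in Flink's DataStream API (the window size, host and port are arbitrary choices). One line declares the windowing strategy; triggers, window state and fault tolerance are handled behind it:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WindowedCounts {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // One word per line from a socket, mapped to (word, 1) pairs.
            DataStream<Tuple2<String, Integer>> events = env
                    .socketTextStream("localhost", 9999)
                    .map(word -> Tuple2.of(word.trim(), 1))
                    .returns(Types.TUPLE(Types.STRING, Types.INT));

            events.keyBy(t -> t.f0)
                  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                  .sum(1)   // emits one count per word per 10-second window
                  .print();

            env.execute("windowed-counts");
        }
    }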

Flink deserves a special mention for providing separate APIs for continuous streaming computations (the Flink DataStream API) and for batching requirements (the Flink DataSet API).
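
For contrast with the streaming examples above, the same kind of aggregation in the DataSet (batch) API looks like this (a tiny made-up data set stands in for a real source):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class BatchCounts {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Tuple2<String, Integer>> words =
                    env.fromElements(Tuple2.of("to", 1), Tuple2.of("be", 1), Tuple2.of("to", 1));

            // Same shape as the streaming job, but over a bounded data set:
            // the job runs to completion and emits one final count per word.
            words.groupBy(0)
                 .sum(1)
                 .print();  // print() triggers execution in the DataSet API
        }
    }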

Twitter Heron, as mentioned earlier, is the new replacement for Storm at Twitter and provides all the expected benefits, though it's not open source.
