Real-time analytics can keep you up to date on what’s happening right now, such as how many people are currently reading your new blog post and whether someone just liked your latest Facebook status. For most use cases, real time is a nice-to-have feature that won’t provide any crucial insights. Sometimes, however, real time is a must.
Let’s say that you run a big ad agency. Real-time analytics can tell you whether your latest online ad campaign, the one your client paid tons of money for, is actually working. If it isn’t, you can make changes immediately, before more of the budget is spent. Another use case is providing real-time analytics in your own app: it looks good, and your users may require it.
There are quite a few real-time platforms out there. A lot of them are newcomers, and the differences between them aren’t clear at all. The least we can do is present the options for you to choose from, so here are five real-time streaming platforms for Big Data.
Flink is an open-source streaming platform capable of running near-real-time, fault-tolerant processing pipelines that scale to millions of events per second. Flink supports both batch and stream processing.
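To make the batch-versus-stream distinction a bit more concrete, here is a minimal sketch in plain Python. It does not use Flink’s API; the function name `running_total` and the sample inputs are made up for illustration. It only shows the general idea that the same processing logic can serve a bounded input (a batch) and an unbounded one (a stream).

```python
import itertools
from typing import Iterable, Iterator


def running_total(values: Iterable[int]) -> Iterator[int]:
    """Hypothetical operator: emit the running sum after each input value."""
    total = 0
    for value in values:
        total += value
        yield total  # results are emitted one at a time, not at the end


# Batch: the input is bounded, so we can collect every result.
batch_result = list(running_total([3, 1, 4]))

# Stream: the input never ends (itertools.count is an infinite source),
# so we only look at the first five results as they arrive.
unbounded_source = itertools.count(1)  # 1, 2, 3, ...
stream_result = list(itertools.islice(running_total(unbounded_source), 5))

print(batch_result)
print(stream_result)
```

The operator itself never needs to know whether its input ends, which is roughly why one engine can cover both styles of processing.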
Spark is an open-source data-processing framework that is really hot at the moment. Because Spark processes data in memory across a cluster and isn’t tied to Hadoop’s two-stage MapReduce paradigm, it can be very fast. Spark can run in standalone mode or on top of Hadoop YARN, where it can read data directly from HDFS. In addition to its in-memory processing, graph processing, and machine learning, Spark can also handle streaming. Companies like Yahoo, Intel, Baidu, Trend Micro, and Groupon are already using it.
Storm is a distributed real-time computation system that claims to do for streaming what Hadoop did for batch processing. It can be used for real-time analytics, machine learning, continuous computation, and more. The cool thing is that it was designed to be used with any programming language. It can run on top of Hadoop YARN and can be used with Flume to store data on HDFS. Storm is already used by the likes of WebMD, Yelp, and Spotify.
Samza is a distributed stream-processing framework that is built on Apache Kafka and YARN. It provides a simple callback-based API that’s similar to MapReduce, and it handles state snapshots and fault tolerance in a durable, scalable way.
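If “callback-based API” sounds abstract, the following plain-Python sketch may help. It is not Samza’s API (Samza itself is JVM-based); the class `PageViewCounterTask`, its `process`, `snapshot`, and `restore` methods, and the `run` helper are all hypothetical names chosen for illustration. The idea is that the framework calls your function once per incoming message, and the task’s state can be snapshotted so a replacement task can pick up where the old one left off.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


class PageViewCounterTask:
    """Hypothetical stream task that counts page views per URL."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, message: dict, emit: Callable[[Tuple[str, int]], None]) -> None:
        # Callback invoked once per message, much like map() is called per record.
        url = message["url"]
        self.counts[url] += 1
        emit((url, self.counts[url]))

    def snapshot(self) -> dict:
        # A copy of the state that a framework could persist for recovery.
        return dict(self.counts)

    def restore(self, state: dict) -> None:
        # Rebuild the in-memory state from a previously saved snapshot.
        self.counts = Counter(state)


def run(task: PageViewCounterTask, messages: Iterable[dict]) -> List[Tuple[str, int]]:
    """Stand-in for the framework: feed each message to the task's callback."""
    out: List[Tuple[str, int]] = []
    for message in messages:
        task.process(message, out.append)
    return out


events = [{"url": "/blog"}, {"url": "/home"}, {"url": "/blog"}]
task = PageViewCounterTask()
results = run(task, events)

# Simulate a restart: a fresh task restores the snapshot and keeps counting.
recovered = PageViewCounterTask()
recovered.restore(task.snapshot())
after_restart = run(recovered, [{"url": "/blog"}])

print(results)
print(after_restart)
```

In a real deployment the snapshot would be written to durable storage rather than handed over in memory; the sketch skips that part to keep the callback pattern visible.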
Kinesis is Amazon’s service for real-time processing of streaming data in the cloud. It’s deeply integrated with other Amazon services, such as S3, Redshift, and DynamoDB, via connectors, for a complete Big Data architecture. Kinesis also includes the Kinesis Client Library (KCL), which allows you to build applications and use stream data for dashboards, alerts, or even dynamic pricing.
The big firms don’t just sit and twiddle their thumbs while Big Data keeps growing. IBM InfoSphere Streams, Microsoft StreamInsight, and Informatica Vibe Data Stream are just a few of the commercial enterprise-grade solutions that are available for real-time processing. Chances are that if your company already uses, for example, Microsoft-based products, you’ll stay with your vendor and add StreamInsight to your roster. Otherwise, you’ll probably go for one of the open-source solutions.