Readers of our blog should know by now that Apache Hadoop is great for offline batch processing of Big Data. But what about online streaming data? What if you’re running a ticker for the stock exchange or a real-time analytics dashboard? You might think that collecting streaming data is only relevant for big enterprises, but you don’t have to be the New York Stock Exchange to collect real-time data. Before you jump into the stream, here are four tips to get you started.
Streaming is not for Everything
Counting Facebook likes is a perfect fit for real-time processing, but plenty of operations can’t be done online, such as join, sort, or top, since they generally need to see the whole data set rather than one event at a time. Streaming data is also great for quick, surface-level analysis. If you need deeper analysis like pattern recognition or trend prediction, you’ll have to head back to your data warehouse for good old batch processing.
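To make the distinction concrete, here is a minimal Python sketch. It is only an illustration: the function names and the sample post IDs are made up for this example. The running count can answer after every event, while the ranking has to wait until all events are in.

```python
from collections import Counter

def running_like_counts(events):
    """Online: yield the updated like count right after each event arrives."""
    counts = Counter()
    for post_id in events:
        counts[post_id] += 1
        yield post_id, counts[post_id]

def ranked_by_likes(events):
    """Batch: a full ranking needs every event before it can be sorted."""
    counts = Counter(events)
    # Sort by count (descending), then by post ID to keep ties stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample data: one entry per "like" event.
events = ["post-a", "post-b", "post-a", "post-c", "post-a", "post-b"]

updates = list(running_like_counts(events))  # one answer per event
ranking = ranked_by_likes(events)            # one answer at the very end
```

The counter only ever holds one number per post, so it keeps up with an endless stream. The ranking is only correct for the events seen so far, which is why it belongs in the batch layer.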
Try Existing Tools
Like Facebook, plenty of companies develop in-house tools to process streaming data. These tools require a lot of development, maintenance, and testing, so before going back to the coding board, research existing tools. Storm, billed as "doing for real-time processing what Hadoop did for batch processing", is one popular choice. Log collectors like Flume, Scribe, Fluentd, and Logstash could also do the trick and require fewer changes to your application. There’s also Kinesis, Amazon’s cloud service for real-time processing of streaming data. Give it a try.
Storage Not Included
Streaming solutions like Storm don’t store data. You’ll need to integrate with a database to keep the raw data and the aggregations for later analysis. Storm can run on top of Hadoop YARN and be used with Flume to store data on HDFS, or it can be integrated with Redis and other databases.
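As a rough sketch of what "bring your own storage" means, the Python below aggregates events in memory and periodically flushes the totals to a database. Everything here is an assumption made for illustration: SQLite (from the standard library) stands in for whatever store you pick, and the table, function names, and batch size are hypothetical.

```python
import sqlite3
from collections import Counter

def flush_counts(conn, counts):
    """Add the in-memory counts to the stored totals, then clear them."""
    with conn:  # commits on success, rolls back on error
        for post_id, n in counts.items():
            conn.execute(
                "INSERT OR IGNORE INTO likes (post_id, total) VALUES (?, 0)",
                (post_id,),
            )
            conn.execute(
                "UPDATE likes SET total = total + ? WHERE post_id = ?",
                (n, post_id),
            )
    counts.clear()

def process_stream(conn, events, batch_size=3):
    """Count events as they arrive and persist every `batch_size` events."""
    counts = Counter()
    for i, post_id in enumerate(events, start=1):
        counts[post_id] += 1
        if i % batch_size == 0:
            flush_counts(conn, counts)
    if counts:  # persist whatever is left when the stream ends
        flush_counts(conn, counts)

# In-memory SQLite database as a stand-in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE likes (post_id TEXT PRIMARY KEY, total INTEGER NOT NULL)")

process_stream(conn, ["post-a", "post-b", "post-a", "post-c", "post-a"])
totals = dict(conn.execute("SELECT post_id, total FROM likes"))
```

Flushing in small batches keeps database writes manageable, at the cost of losing the unflushed counts if the process dies. That trade-off is yours to tune.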
Maintenance is a b...
Distributed real-time systems like Storm need to be up and running at all times, so they require hardcore administration and maintenance isn't easy. You should definitely consider a managed cloud service like Kinesis to save some headaches, or just go ahead and put your sysadmins on call.
Streaming data can give you immediate insights. Before you start collecting it, verify that your data processing can work online, research existing tools, prepare to store the data, and have personnel to maintain the system and keep it online at all times. Enjoy the stream.